DWCA Module
===========
Reading a DwCA file
-------------------
A Darwin Core Archive (DwCA) file is a compressed ``zip`` or ``tar.gz`` file containing the component files, arranged in a star schema, together with the descriptor files. The full Darwin Core documentation is in the `text guide `_.

Assume the archive is a ``DwCArchive.zip`` file containing:

- ``meta.xml``: Mandatory descriptor file.
- ``eml.xml``: Metadata file.
- ``taxon.txt``: Core component.
- ``identifier.txt``: Identifier extension component.
- ``speciesprofile.txt``: Species profile extension component.
- ``reference.txt``: Reference extension component.
All of these files must be specified in the ``meta.xml`` file as described in the `text guide, section metafile content `_.
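
The requirement above can be checked without this package. The following is a minimal sketch using only the standard library: it reads ``meta.xml``, collects the ``<location>`` of every ``<core>`` and ``<extension>`` element, and reports declared files that are missing from the archive. The helper names ``declared_files`` and ``missing_files`` and the trimmed metafile are hypothetical, written only for this illustration; the metadata file referenced by the ``metadata`` attribute is not checked here.

.. code-block:: python

    import io
    import xml.etree.ElementTree as ET
    import zipfile

    # XML namespace of the Darwin Core text guide metafile.
    NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


    def declared_files(archive):
        """Map each file location declared in meta.xml to its rowType."""
        root = ET.fromstring(archive.read("meta.xml"))
        components = root.findall("dwc:core", NS) + root.findall("dwc:extension", NS)
        declared = {}
        for component in components:
            for location in component.findall("dwc:files/dwc:location", NS):
                declared[location.text.strip()] = component.get("rowType")
        return declared


    def missing_files(archive):
        """Return declared component files that are absent from the archive."""
        present = set(archive.namelist())
        return sorted(name for name in declared_files(archive) if name not in present)


    # Hypothetical, heavily trimmed metafile used only for this demonstration.
    META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
      <core rowType="http://rs.tdwg.org/dwc/terms/Taxon">
        <files><location>taxon.txt</location></files>
        <id index="0"/>
      </core>
      <extension rowType="http://rs.gbif.org/terms/1.0/Identifier">
        <files><location>identifier.txt</location></files>
        <coreid index="0"/>
      </extension>
    </archive>
    """

    # Build a small archive in memory that deliberately omits identifier.txt.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", META_XML)
        archive.writestr("taxon.txt", "taxonID\nurn:lsid:example.org:taxname:0\n")

    with zipfile.ZipFile(buffer) as archive:
        declared = declared_files(archive)
        missing = missing_files(archive)

    print(declared)
    print(missing)

With a real archive, passing ``zipfile.ZipFile("DwCArchive.zip")`` to ``missing_files`` should return an empty list when every declared component is present.
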
To read the archive:

.. code-block:: python

    from dwca import DarwinCoreArchive

    darwin_core = DarwinCoreArchive.from_archive("DwCArchive.zip")

To inspect its core component:

.. code-block:: python

    print(darwin_core.core)

.. code-block:: python

    Core:
    class: Taxon
    filename: taxon.txt
    content: 163461 entries

The extension components are stored in a Python ``list`` and can be accessed in the same way:

.. code-block:: python

    print(len(darwin_core.extension))

.. code-block:: python

    3

Check the first one:

.. code-block:: python

    print(darwin_core.extension[0])

.. code-block:: python

    Extension:
    class: SpeciesProfile
    filename: speciesprofile.txt
    content: 153622 entries

You can work with this data as an array of Python objects, as ``numpy`` arrays, or as ``pandas`` DataFrames:

.. code-block:: python

    darwin_core.core.data

.. code-block:: python

    [, , ...]

.. code-block:: python

    darwin_core.core.data.pandas

.. code-block:: python

                                        taxonID  ...  institutionCode
    0            urn:lsid:example.org:taxname:0  ...             DCAE
    1            urn:lsid:example.org:taxname:1  ...             DCAE
    2            urn:lsid:example.org:taxname:2  ...             DCAE
    3            urn:lsid:example.org:taxname:3  ...             DCAE
    4            urn:lsid:example.org:taxname:4  ...             DCAE
    ...                                     ...  ...              ...
    163455  urn:lsid:example.org:taxname:292941  ...             DCAE
    163456  urn:lsid:example.org:taxname:292942  ...             DCAE
    163457  urn:lsid:example.org:taxname:292944  ...             DCAE
    163458  urn:lsid:example.org:taxname:292945  ...             DCAE
    163459  urn:lsid:example.org:taxname:292946  ...             DCAE

    [163460 rows x 47 columns]
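
Because the archive follows a star schema, each extension row points back to a core row through a shared identifier. The following is a minimal sketch of joining the two with ``pandas``, assuming both tables expose the key as a ``taxonID`` column. The two small tables and their values are made up for illustration, and real extensions may name the key column differently (``meta.xml`` declares which column holds it).

.. code-block:: python

    import pandas as pd

    # Hypothetical miniature core table (one row per taxon).
    core = pd.DataFrame({
        "taxonID": ["t:0", "t:1", "t:2"],
        "scientificName": ["Puma concolor", "Panthera onca", "Lynx rufus"],
    })

    # Hypothetical extension table (zero or more rows per taxon).
    identifiers = pd.DataFrame({
        "taxonID": ["t:0", "t:0", "t:2"],
        "identifier": ["id-a", "id-b", "id-c"],
    })

    # A left join keeps every core row, including taxa with no extension rows.
    joined = core.merge(identifiers, on="taxonID", how="left")

    # Count the extension rows attached to each core record (missing values are not counted).
    counts = joined.groupby("taxonID")["identifier"].count()

    print(joined)
    print(counts)

A left join is used so that core records without any extension rows are kept, with missing values in the extension columns.
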
Writing a DwCA archive
----------------------
To generate a new Darwin Core Archive file, use the same class as the starting point:

.. code-block:: python

    from dwca import DarwinCoreArchive
    from eml.resources import EMLResource
    from eml.types import ResponsibleParty, IndividualName

    # Set the location where the metadata file will be written
    darwin_core = DarwinCoreArchive(metadata="eml.xml")

The `guidelines `_ recommend adding a metadata file in a standardized format, such as EML (Ecological Metadata Language), FGDC (Federal Geographic Data Committee), or ISO 19115.

This package implements EML support for the metadata (see the `next section <#eml-module>`_), which can be initialized and edited like this:

.. code-block:: python

    darwin_core.metadata.initialize_resource(
        "Example for Darwin Core Archive",
        ResponsibleParty(
            individual_name=IndividualName(
                _id="1",
                last_name="Doe",
                first_name="John",
                salutation="Mr."
            )
        ),
        contact=[ResponsibleParty(_id="1", referencing=True)]
    )

    # Add core data
    darwin_core.set_core("taxon.txt")

    # Add an extension
    darwin_core.add_extension("identifier.txt")

    # Write the archive
    with open("example.zip", "wb") as example_file:
        darwin_core.to_file(example_file)

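
To give a rough idea of what such metadata looks like on disk, the sketch below builds a very small EML document by hand with the standard library. It is not the output of this package. The EML 2.2.0 namespace, the ``packageId`` and ``system`` values, and the way the ``id`` and ``<references>`` elements are laid out are assumptions chosen for illustration, based on how EML lets one party reference another by identifier.

.. code-block:: python

    import xml.etree.ElementTree as ET

    # Assumed EML version; the package may target a different one.
    EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0"
    ET.register_namespace("eml", EML_NS)

    # Root element with placeholder (hypothetical) package identifiers.
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": "example-package", "system": "example-system"},
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = "Example for Darwin Core Archive"

    # The creator carries an id so that other parties can refer to it.
    creator = ET.SubElement(dataset, "creator", {"id": "1"})
    name = ET.SubElement(creator, "individualName")
    ET.SubElement(name, "salutation").text = "Mr."
    ET.SubElement(name, "givenName").text = "John"
    ET.SubElement(name, "surName").text = "Doe"

    # The contact reuses the creator by reference instead of repeating it.
    contact = ET.SubElement(dataset, "contact")
    ET.SubElement(contact, "references").text = "1"

    eml_text = ET.tostring(root, encoding="unicode")
    print(eml_text)

The point of the reference is to avoid duplicating the same person's details for the creator and the contact roles.
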
There are other ways to add data; see the full documentation for more information.