Materialization library
This library can materialize datasets from Auctus. You can use it to materialize search results on your side without relying on the server. Auctus also uses it internally to materialize datasets: the /download endpoint downloads the dataset using this library, then sends it to you.
Installing datamart-materialize
You can install it directly from the Python Package Index using pip:
pip install datamart-materialize
API
This library is organized around pluggable materializers, writers, and converters, which can be registered through Python’s entrypoint mechanism.
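To check which plugins are registered in a given environment, the standard library's importlib.metadata can list entry points by group. The following is a minimal sketch, not part of datamart-materialize: the helper name list_plugins is hypothetical, and the group name in the usage comment is an assumption that should be checked against the package's own metadata.

```python
from importlib.metadata import entry_points


def list_plugins(group):
    """Return the sorted names of all entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Hypothetical usage; the group name below is an assumption:
# print(list_plugins("datamart_materialize"))
```

A group with nothing registered yields an empty list, so the helper is safe to call before any plugin is installed.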
If download() is given a dataset whose materializer is not recognized or not installed, the library can use the server to do a “proxy” materialization: the server performs the materialization from the original source and sends the result for the library to write. Materializing a dataset from a simple ID rather than from materialization information also requires contacting a server.
- datamart_materialize.download(dataset, destination, proxy, format='csv', format_options=None, size_limit=None, http=None)
Materialize a dataset on disk.
- Parameters
dataset – Dataset description from search index.
destination – Path where the dataset will be written.
proxy – URL of a Datamart server to use as a proxy if we can’t materialize locally. If None, KeyError will be raised if this materializer is unavailable.
format – Output format.
format_options – Dictionary of options for the writer or None.
size_limit – Maximum size of the dataset to download, in bytes. If the limit is reached, DatasetTooBig will be raised.
http – A requests.sessions.Session to use to download files.
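A call to download() might be wrapped as in the sketch below. It uses only the parameters listed above. The wrapper name materialize_csv and its injectable download argument are hypothetical, added so the sketch can be exercised without the package installed or a network connection.

```python
def materialize_csv(dataset, destination, proxy=None, size_limit=None, download=None):
    """Write `dataset` to `destination` as CSV using datamart_materialize.download().

    `download` can be replaced by a stand-in callable, e.g. in tests.
    """
    if download is None:
        # Imported lazily so this module loads even without the package installed.
        import datamart_materialize
        download = datamart_materialize.download
    return download(
        dataset,
        destination,
        proxy,  # None means no server fallback (KeyError if no local materializer)
        format="csv",
        size_limit=size_limit,
    )


# Hypothetical usage; `result` stands for one dataset description from a
# search, and the server URL is a placeholder:
# materialize_csv(result, "dataset.csv", proxy="https://datamart.example.org/api/v1")
```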
Materializers
A materializer is an object that takes materialization information for a dataset (a JSON dictionary such as the one provided by Auctus under the materialize key) and materializes it as a CSV file, for example by simply downloading it, by converting a different file to CSV, or by making multiple API calls to obtain all the rows.
Some datasets provided by Auctus contain a key materialize.direct_url. In that case no materializer is needed and the CSV is downloaded directly.
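The choice between a direct download, a local materializer, and the proxy can be pictured with the sketch below. It illustrates the rules described on this page and is not the library's actual code: the function choose_strategy, the identifier key, and the installed set are all assumptions made for the example.

```python
def choose_strategy(materialize, installed, proxy=None):
    """Return the route that would be taken for one `materialize` dictionary.

    `installed` is the set of materializer identifiers available locally.
    """
    if "direct_url" in materialize:
        return "direct"  # plain CSV download, no materializer needed
    identifier = materialize.get("identifier")  # assumed key name
    if identifier in installed:
        return "local"  # a registered materializer handles the dataset
    if proxy is not None:
        return "proxy"  # the server materializes and sends the data
    # No usable materializer and no proxy to fall back on.
    raise KeyError(identifier)
```

For instance, a dictionary holding only a direct_url maps to "direct", while an unknown identifier with no proxy raises KeyError, matching the behaviour documented for proxy=None.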
- class datamart_materialize.noaa.NoaaMaterializer
Only a single materializer is included with datamart-materialize, for noaa data. Downloading from the NOAA API requires numerous API calls that are slow and rate-limited; the JSON results can then be converted to a CSV. Use of the NOAA API requires a token that can be obtained from NOAA’s Climate Data Online: Web Services Documentation and should be set as the environment variable NOAA_TOKEN for the materializer to work.
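Because NOAA downloads are slow, it can help to verify that the token is set before starting one. The check below is a minimal sketch: the helper name require_noaa_token is hypothetical, and only the NOAA_TOKEN variable name comes from this page.

```python
import os


def require_noaa_token(environ=None):
    """Return the NOAA API token, or raise if NOAA_TOKEN is unset or empty."""
    environ = os.environ if environ is None else environ
    token = environ.get("NOAA_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "Set the NOAA_TOKEN environment variable before materializing NOAA datasets"
        )
    return token
```

Passing a plain dictionary as environ makes the check easy to exercise without touching the real environment.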
Writers
- class datamart_materialize.CsvWriter(destination, format_options=None)
Writer for the csv format. Writes a CSV file at the provided path.
- class datamart_materialize.PandasWriter(destination, format_options=None)
Writer for the pandas format. Buffers a CSV file in memory, and returns a pandas.DataFrame object at the end.
- class datamart_materialize.d3m.D3mWriter(destination, format_options=None)
Writer for the d3m dataset format, following MIT-LL’s schema: https://gitlab.com/datadrivendiscovery/data-supply
The key version can be passed in format_options to select the version of the schema to generate, between 3.2.0 and 4.0.0.
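The format_options dictionary for the d3m writer could be built with a small helper such as the one sketched below. The helper d3m_format_options is hypothetical, and treating the two versions named above as the only accepted values is an assumption made for the example.

```python
# Schema versions named in the D3mWriter description above.
D3M_VERSIONS = ("3.2.0", "4.0.0")


def d3m_format_options(version):
    """Build a `format_options` dictionary selecting a D3M schema version."""
    if version not in D3M_VERSIONS:
        raise ValueError(
            "unsupported D3M schema version %r, expected one of %s"
            % (version, ", ".join(D3M_VERSIONS))
        )
    return {"version": version}


# Hypothetical usage with download(); `result` and the paths are placeholders:
# datamart_materialize.download(
#     result, "out_dir", proxy=None,
#     format="d3m", format_options=d3m_format_options("4.0.0"),
# )
```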