Materialization library
This library can materialize datasets from Auctus. You can use it to materialize search results directly on your side without relying on the server. It is also used internally by Auctus to materialize datasets (the /download endpoint downloads the dataset using this library then sends it to you).
Installing datamart-materialize
You can get it directly from the Python Package Index using pip:
pip install datamart-materialize
API
This library is organized around pluggable materializers, writers, and converters, which can be registered through Python’s entry point mechanism.
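As a rough illustration of the entry point mechanism, the sketch below uses only the standard library to list the plugins registered under a given entry point group. The group name passed at the end is an assumption made for illustration; the package's own metadata is the authority on the group names it actually uses.

```python
from importlib.metadata import entry_points


def list_plugins(group):
    """Return the sorted names of the entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: selectable entry points
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9: mapping of group name to entry points
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Assumed group name, for illustration only
print(list_plugins("datamart_materialize"))
```

An empty list simply means that nothing is registered under that group in the current environment.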
If download() is given a dataset whose materializer is not recognized or not installed, the library can ask the server to do a “proxy” materialization: the server performs the materialization from the original source and sends the result back for the library to write. Materializing a dataset from a simple ID rather than from materialization information also requires contacting a server.
- datamart_materialize.download(dataset, destination, proxy, format='csv', format_options=None, size_limit=None, http=None)
Materialize a dataset on disk.
- Parameters
dataset – Dataset description from search index.
destination – Path where the dataset will be written.
proxy – URL of a Datamart server to use as a proxy if we can’t materialize locally. If None, KeyError will be raised if this materializer is unavailable.
format – Output format.
format_options – Dictionary of options for the writer or None.
size_limit – Maximum size of the dataset to download, in bytes. If the limit is reached, DatasetTooBig will be raised.
http – A requests.sessions.Session to use to download files.
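A minimal sketch of calling download() with the parameters listed above. The wrapper name materialize_csv is hypothetical, and the dataset ID and server URL in the trailing comment are placeholders, not real values.

```python
def materialize_csv(dataset, destination, proxy=None, size_limit=None):
    """Write `dataset` to `destination` as CSV using download()."""
    # Imported lazily so this helper can be defined without the package installed
    import datamart_materialize

    return datamart_materialize.download(
        dataset,
        destination,
        proxy,
        format="csv",
        size_limit=size_limit,
    )


# Example call (placeholder ID and proxy URL):
# materialize_csv(
#     "example.dataset.id",
#     "dataset.csv",
#     proxy="https://datamart.example.org/api/v1",
#     size_limit=100_000_000,  # stop after roughly 100 MB
# )
```

With proxy left as None, the call relies entirely on locally installed materializers, so per the parameter description a KeyError is to be expected when none matches.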
Materializers
A materializer is an object that can take materialization information for a dataset (a JSON dictionary such as the one provided by Auctus under the materialize key) and can materialize it as a CSV file, for example by simply downloading it, by converting a different file to CSV, or possibly by doing multiple API calls to obtain all the rows.
Some datasets provided by Auctus contain a key materialize.direct_url, in which case no materializer is needed and the CSV is downloaded directly.
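The check for a direct URL can be sketched as follows. The two dataset descriptions are made up for illustration, and the identifier key in the second one is an assumption about what other materialization information might look like.

```python
def direct_url(metadata):
    """Return materialize.direct_url from a dataset description, or None."""
    materialize = metadata.get("materialize") or {}
    return materialize.get("direct_url")


# Hypothetical dataset descriptions, for illustration only
with_url = {"materialize": {"direct_url": "https://example.org/data.csv"}}
without_url = {"materialize": {"identifier": "datamart.noaa"}}

print(direct_url(with_url))     # https://example.org/data.csv
print(direct_url(without_url))  # None
```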
- class datamart_materialize.noaa.NoaaMaterializer
Only a single materializer is included with datamart-materialize, for noaa data. Downloading from the NOAA API requires numerous API calls that are slow and rate-limited; the JSON results can then be converted to a CSV. Use of the NOAA API requires a token that can be obtained from NOAA’s Climate Data Online: Web Services Documentation and should be set as the environment variable NOAA_TOKEN for the materializer to work.
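Since a missing token only shows up once the materializer runs, it can help to check for it up front. This is a small sketch of such a check; the helper noaa_token is hypothetical and not part of the library.

```python
import os


def noaa_token(environ=None):
    """Return the NOAA API token, raising if NOAA_TOKEN is unset or empty."""
    environ = os.environ if environ is None else environ
    token = environ.get("NOAA_TOKEN")
    if not token:
        raise RuntimeError("Set the NOAA_TOKEN environment variable first")
    return token
```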
Writers
- class datamart_materialize.CsvWriter(destination, format_options=None)
Writer for the csv format. Writes a CSV file at the provided path.
- class datamart_materialize.PandasWriter(destination, format_options=None)
Writer for the pandas format. Buffers a CSV file in memory, and returns a pandas.DataFrame object at the end.
- class datamart_materialize.d3m.D3mWriter(destination, format_options=None)
Writer for the d3m dataset format, following MIT-LL’s schema: https://gitlab.com/datadrivendiscovery/data-supply
The key version can be passed in format_options to select the version of the schema to generate, between 3.2.0 and 4.0.0.
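A sketch of building the format_options dictionary for this writer. The helper d3m_format_options is hypothetical, and treating the two versions named above as the only accepted values is an assumption of this sketch rather than documented behavior.

```python
# The two schema versions named in the text; rejecting anything else
# is an assumption of this sketch.
D3M_VERSIONS = ("3.2.0", "4.0.0")


def d3m_format_options(version):
    """Build a format_options dictionary selecting a D3M schema version."""
    if version not in D3M_VERSIONS:
        raise ValueError(
            "version must be one of %s, got %r"
            % (", ".join(D3M_VERSIONS), version)
        )
    return {"version": version}


print(d3m_format_options("4.0.0"))  # {'version': '4.0.0'}
```

The resulting dictionary would be passed as format_options to download(), together with format='d3m'.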