Materialization library

This library can materialize datasets from Auctus. You can use it to materialize search results directly on your side without relying on the server. It is also used internally by Auctus to materialize datasets (the /download endpoint downloads the dataset using this library then sends it to you).

Installing datamart-materialize

You can get it directly from the Python Package Index using pip:

pip install datamart-materialize

API

This library is organized around pluggable materializers, writers, and converters, which can be registered through Python’s entry point mechanism.
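As an illustration of the entry point mechanism, the sketch below lists the plugin names registered under a given group, using only the standard library. The group names in the example loop are assumptions about how datamart-materialize registers its plugins; this page does not state them, so check the package metadata for the real names.

```python
from importlib.metadata import entry_points


def list_plugins(group):
    """Return the sorted names of the entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10+: entry points are selected by keyword
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9: entry_points() returns a dict keyed by group
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


if __name__ == '__main__':
    # Assumed group names, for illustration only
    for group in ('datamart_materialize',
                  'datamart_materialize.writer',
                  'datamart_materialize.converter'):
        print(group, list_plugins(group))
```

An empty list for a group simply means that no installed package registers plugins under that name.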

If download() is given a dataset whose materializer is not recognized or not installed, the library can use the server to perform a “proxy” materialization, i.e. the server materializes the dataset from the original source and sends it back for the library to write. Materializing a dataset from a simple ID rather than from materialization information also requires contacting a server.

datamart_materialize.download(dataset, destination, proxy, format='csv', format_options=None, size_limit=None, http=None)

Materialize a dataset on disk.

Parameters
  • dataset – Dataset description from search index.

  • destination – Path where the dataset will be written.

  • proxy – URL of a Datamart server to use as a proxy if the dataset can’t be materialized locally. If None, a KeyError is raised when the required materializer is unavailable.

  • format – Output format.

  • format_options – Dictionary of options for the writer or None.

  • size_limit – Maximum size of the dataset to download, in bytes. If the limit is reached, DatasetTooBig will be raised.

  • http – A requests.sessions.Session to use to download files.
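The parameters above can be put together as in the following minimal sketch. The wrapper name materialize_csv is hypothetical; only download() and its parameters come from the signature documented here. The import is done inside the function so the helper can be defined even where the package is not installed.

```python
def materialize_csv(dataset, destination, proxy=None, size_limit=None):
    """Write `dataset` to `destination` as CSV and return download()'s result.

    `dataset` is a dataset description from the search index (or an ID,
    in which case a server is needed, so `proxy` must be set).
    """
    # Lazy import: requires `pip install datamart-materialize` at call time
    import datamart_materialize

    return datamart_materialize.download(
        dataset,
        destination,
        proxy,                  # None disables proxy materialization
        format='csv',
        size_limit=size_limit,  # in bytes; None means no limit is requested
    )
```

A call might look like materialize_csv(metadata, 'dataset.csv', proxy=server_url, size_limit=100000000), where metadata and server_url are placeholders for a search result and the URL of the Datamart server you use.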

Materializers

A materializer is an object that can take materialization information for a dataset (a JSON dictionary such as the one provided by Auctus under the materialize key) and can materialize it as a CSV file, for example by simply downloading it, by converting a different file to CSV, or possibly by doing multiple API calls to obtain all the rows.

Some datasets provided by Auctus contain a key materialize.direct_url, in which case no materializer is needed and the CSV is downloaded directly.
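To see whether a given dataset description falls into this case, you can look for the key yourself. The sketch below assumes the description is a plain dictionary with a nested materialize dictionary; the two sample descriptions are hypothetical and only illustrate the shape.

```python
def get_direct_url(metadata):
    """Return materialize.direct_url from a dataset description, or None."""
    materialize = metadata.get('materialize') or {}
    return materialize.get('direct_url')


# Hypothetical dataset descriptions, for illustration only
with_url = {'materialize': {'direct_url': 'https://example.org/data.csv'}}
without_url = {'materialize': {'identifier': 'hypothetical.source'}}

print(get_direct_url(with_url))     # https://example.org/data.csv
print(get_direct_url(without_url))  # None
```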

class datamart_materialize.noaa.NoaaMaterializer

datamart-materialize includes only a single materializer, for NOAA data. Downloading from the NOAA API requires numerous API calls that are slow and rate-limited; the JSON results can then be converted to a CSV. Use of the NOAA API requires a token that can be obtained from NOAA’s Climate Data Online: Web Services Documentation and should be set as the environment variable NOAA_TOKEN for the materializer to work.
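Because the token is read from the environment, it can help to check for it before starting a slow materialization. The helper below is a hypothetical pre-flight check and is not part of the library; only the variable name NOAA_TOKEN comes from the text above.

```python
import os


def require_noaa_token(environ=None):
    """Return the NOAA API token, or raise if NOAA_TOKEN is unset or empty."""
    if environ is None:
        environ = os.environ
    token = environ.get('NOAA_TOKEN')
    if not token:
        raise RuntimeError(
            'NOAA_TOKEN is not set; request a token from NOAA and export it '
            'before materializing NOAA datasets'
        )
    return token
```

Calling require_noaa_token() with no argument checks the real process environment; passing a dictionary makes the check easy to exercise in tests.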

Writers

class datamart_materialize.CsvWriter(destination, format_options=None)

Writer for the csv format. Writes a CSV file at the provided path.

class datamart_materialize.PandasWriter(destination, format_options=None)

Writer for the pandas format. Buffers a CSV file in memory, and returns a pandas.DataFrame object at the end.
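A minimal sketch of selecting this writer through download() follows. It assumes that download() hands back the writer's pandas.DataFrame, and that None is acceptable as the destination for an in-memory format; neither is stated on this page, so treat both as assumptions. The wrapper name materialize_dataframe is hypothetical.

```python
def materialize_dataframe(dataset, proxy=None, destination=None):
    """Materialize `dataset` in memory using the 'pandas' output format."""
    # Lazy import: requires datamart-materialize (and pandas) at call time
    import datamart_materialize

    # Assumption: the DataFrame built by the writer is the return value
    return datamart_materialize.download(
        dataset,
        destination,  # assumed to be unused by an in-memory writer
        proxy,
        format='pandas',
    )
```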

class datamart_materialize.d3m.D3mWriter(destination, format_options=None)

Writer for the d3m dataset format, following MIT-LL’s schema.

https://gitlab.com/datadrivendiscovery/data-supply

The key version can be passed in format_options to select the version of the schema to generate, either 3.2.0 or 4.0.0.
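The format_options dictionary for this writer can be built as in the sketch below. The helper name is hypothetical, and it assumes that the two versions named above are the only accepted values and that they are passed as strings.

```python
# Assumption: the two versions named in the text are the only accepted values
SUPPORTED_D3M_VERSIONS = ('3.2.0', '4.0.0')


def d3m_format_options(version):
    """Build a format_options dictionary selecting a D3M schema version."""
    if version not in SUPPORTED_D3M_VERSIONS:
        raise ValueError('Unsupported D3M schema version: %r' % (version,))
    return {'version': version}


print(d3m_format_options('4.0.0'))  # {'version': '4.0.0'}
```

The result would then be passed as download(dataset, destination, proxy, format='d3m', format_options=d3m_format_options('4.0.0')).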

Converters

class datamart_materialize.excel.ExcelConverter(writer)

Adapter converting Excel files to CSV.