Profiling library

This library can be used to profile datasets standalone: you can profile datasets on your side and send the resulting metadata to Auctus for search, instead of uploading the whole dataset. It is also used internally by Auctus to process search-by-example queries (when a file is sent to the /search endpoint) and to add datasets to the index (to be queried against later).

Installing datamart-profiler

You can get it directly from the Python Package Index using pip:

pip install datamart-profiler

API

The datamart_profiler.process_dataset() function is the entry point for the library. It returns a dict following Auctus's JSON result schema (a usage sketch follows the parameter list below).

datamart_profiler.core.process_dataset(data, dataset_id=None, metadata=None, lazo_client=None, nominatim=None, geo_data=None, search=False, include_sample=False, coverage=True, plots=False, indexes=True, load_max_size=None, **kwargs)

Compute all metafeatures from a dataset.

Parameters
  • data – path to dataset, or file object, or DataFrame

  • dataset_id – id of the dataset

  • metadata – The metadata provided by the discovery plugin (might be very limited).

  • lazo_client – client for the Lazo Index Server

  • nominatim – URL of the Nominatim server

  • geo_data – True or a datamart_geo.GeoData instance to use to resolve named administrative territorial entities

  • search – True if this method is being called during the search operation (and not for indexing).

  • include_sample – Set to True to include a few random rows to the result. Useful to present to a user.

  • coverage – Whether to compute data ranges

  • plots – Whether to compute plots

  • indexes – Whether to include indexes. If True (the default) and the input is a DataFrame with index(es) different from the default range, they will appear in the result with the columns.

  • load_max_size – Target size of the data to be analyzed. The data will be randomly sampled if it is bigger. Defaults to MAX_SIZE, currently 5 MB. This is different from the sample data included in the result.

Returns

JSON structure (dict)
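
As a minimal sketch of calling this function: the file name below is a placeholder, and the 'columns', 'name', and 'structural_type' keys are assumed from Auctus's result schema rather than guaranteed here.

import datamart_profiler

# Profile a local CSV file ('dataset.csv' is a hypothetical path)
metadata = datamart_profiler.process_dataset(
    'dataset.csv',
    include_sample=True,  # include a few random rows in the result
    coverage=True,        # compute data ranges
)

# The result is a plain dict; we assume a 'columns' list per the schema
for column in metadata.get('columns', []):
    print(column['name'], column.get('structural_type'))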

datamart_profiler.temporal.parse_date(string)

Parse a full date from a string.

This will accept dates with low precision, but reject strings that only mention a time or a partial date. For example, "June 6 11:00" returns None (it could be any year), but "June 2020" parses into 2020-06-01 00:00:00 UTC.
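
A quick sketch of the behavior described above:

from datamart_profiler.temporal import parse_date

# A full date with low precision parses to the start of the period, in UTC
parse_date("June 2020")     # -> 2020-06-01 00:00:00 UTC

# A time without a year is rejected, since it could be any year
parse_date("June 6 11:00")  # -> None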

datamart_profiler.core.count_rows_to_skip(file)

Count the non-data rows at the top of the file, such as titles and other preamble.
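
As a sketch, assuming the function returns the number of leading rows to skip, it can be combined with pandas to load only the actual data ('dataset.csv' is a hypothetical file):

import pandas
from datamart_profiler.core import count_rows_to_skip

with open('dataset.csv') as fp:
    # Assumed to return an integer count of non-data rows at the top
    skip = count_rows_to_skip(fp)
    fp.seek(0)  # rewind in case the counting consumed part of the file
    df = pandas.read_csv(fp, skiprows=skip)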

Command-line usage

You can also use datamart-profiler from the command-line like so:

$ python -m datamart_profiler <file.csv>

It will output the extracted metadata as JSON.
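
For example, assuming the JSON is written to standard output, you can redirect it to a file to send to Auctus later:

$ python -m datamart_profiler dataset.csv > dataset.metadata.json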