Profiling library

This library can be used to profile datasets standalone: you can profile datasets on your side and send the resulting metadata to Auctus for search, instead of uploading the whole dataset. It is also used internally by Auctus to process search-by-example queries (when a file is sent to the /search endpoint) and to add datasets to the index (to be queried against later).

Installing datamart-profiler

You can get it directly from the Python Package Index using pip:

pip install datamart-profiler

API

The datamart_profiler.process_dataset() function is the entry point for the library. It returns a dict following Auctus's JSON result schema (a usage sketch follows the parameter list below).

datamart_profiler.core.process_dataset(data, dataset_id=None, metadata=None, lazo_client=None, nominatim=None, geo_data=None, search=False, include_sample=False, coverage=True, plots=False, indexes=True, load_max_size=None, **kwargs)

Compute all metafeatures from a dataset.

Parameters
  • data – path to dataset, or file object, or DataFrame

  • dataset_id – id of the dataset

  • metadata – The metadata provided by the discovery plugin (might be very limited).

  • lazo_client – client for the Lazo Index Server

  • nominatim – URL of the Nominatim server

  • geo_data – True or a datamart_geo.GeoData instance to use to resolve named administrative territorial entities

  • search – True if this method is being called during the search operation (and not for indexing).

  • include_sample – Set to True to include a few random rows to the result. Useful to present to a user.

  • coverage – Whether to compute data ranges

  • plots – Whether to compute plots

  • indexes – Whether to include indexes. If True (the default) and the input is a DataFrame with index(es) different from the default range, they will appear in the result with the columns.

  • load_max_size – Target size of the data to be analyzed. The data will be randomly sampled if it is bigger. Defaults to MAX_SIZE, currently 5 MB. This is different from the sample data included in the result.

Returns

JSON structure (dict)
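
As a minimal sketch of calling this function: the file name below is a placeholder, and the 'columns', 'name', and 'structural_type' keys are assumed from Auctus's result schema rather than guaranteed here.

import datamart_profiler

# Profile a local CSV file ('dataset.csv' is a hypothetical path)
metadata = datamart_profiler.process_dataset(
    'dataset.csv',
    include_sample=True,  # include a few random rows in the result
    coverage=True,        # compute data ranges
)

# The result is a plain dict; we assume a 'columns' list per the schema
for column in metadata.get('columns', []):
    print(column['name'], column.get('structural_type'))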

datamart_profiler.temporal.parse_date(string)

Parse a full date from a string.

This will accept dates with low precision, but reject strings that only mention a time or a partial date. For example, "June 6 11:00" returns None (it could be any year), but "June 2020" parses into 2020-06-01 00:00:00 UTC.
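
A quick sketch of the behavior described above:

from datamart_profiler.temporal import parse_date

# A full date with low precision parses to the start of the period, in UTC
parse_date("June 2020")     # -> 2020-06-01 00:00:00 UTC

# A time without a year is rejected, since it could be any year
parse_date("June 6 11:00")  # -> None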

datamart_profiler.core.count_rows_to_skip(file)

Count the non-data rows at the top of the file, such as titles and other preamble.
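
As a sketch, assuming the function returns the number of leading rows to skip, it can be combined with pandas to load only the actual data ('dataset.csv' is a hypothetical file):

import pandas
from datamart_profiler.core import count_rows_to_skip

with open('dataset.csv') as fp:
    # Assumed to return an integer count of non-data rows at the top
    skip = count_rows_to_skip(fp)
    fp.seek(0)  # rewind in case the counting consumed part of the file
    df = pandas.read_csv(fp, skiprows=skip)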

Command-line usage

You can also use datamart-profiler from the command-line like so:

$ python -m datamart_profiler <file.csv>

It will output the extracted metadata as JSON.
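
For example, assuming the JSON is written to standard output, you can redirect it to a file to send to Auctus later:

$ python -m datamart_profiler dataset.csv > dataset.metadata.json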