Profiling library

This library can profile datasets standalone. You can use it to profile a dataset locally and send only the resulting profile to Datamart for search, instead of uploading the whole dataset. Datamart also uses it internally to process search-by-example queries (when a file is sent to the /search endpoint) and to add datasets to the index, so that they can be queried later.

Installing datamart-profiler

You can install it from the Python Package Index using pip:

pip install datamart-profiler

API

The datamart_profiler.process_dataset() function is the entry point for the library. It returns a dict following Datamart’s JSON result schema.

datamart_profiler.process_dataset(data, dataset_id=None, metadata=None, lazo_client=None, search=False, include_sample=False, coverage=True, plots=False, load_max_size=None, **kwargs)

Compute all metafeatures from a dataset.

Parameters
  • data – path to dataset, or file object, or DataFrame

  • dataset_id – id of the dataset

  • metadata – The metadata provided by the discovery plugin (might be very limited).

  • lazo_client – client for the Lazo Index Server

  • search – True if this method is being called during the search operation (and not for indexing).

  • include_sample – Set to True to include a few random rows in the result. Useful for presenting to a user.

  • coverage – Whether to compute data ranges (using k-means)

  • plots – Whether to compute plots

  • load_max_size – Target size of the data to be analyzed. The data will be randomly sampled if it is bigger. Defaults to MAX_SIZE, currently 50 MB. This is different from the sample data included in the result.

Returns

JSON structure (dict)
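
The call described above can be sketched as follows. This is a minimal illustration, not the library's recommended usage: the helper names profile_csv and write_example_csv, the profiler argument, and the sample data are all hypothetical. Only process_dataset() and its keyword arguments come from the signature documented here. The sketch also assumes that the returned dict can be passed directly to json.dumps(), which the JSON result schema suggests but which may not hold for every value.

```python
import csv
import json
import os
import tempfile


def profile_csv(path, profiler=None, **options):
    """Profile the CSV file at `path` and return the result as a JSON string.

    `profiler` is a hypothetical hook (not part of the library). It defaults
    to datamart_profiler.process_dataset and exists only so that this helper
    can be tried without the package installed.
    """
    if profiler is None:
        # Lazy import: requires `pip install datamart-profiler`
        from datamart_profiler import process_dataset as profiler
    # `options` is forwarded unchanged, e.g. include_sample=True, plots=False
    metadata = profiler(path, **options)
    # Assumption: the returned dict is directly JSON-serializable
    return json.dumps(metadata)


def write_example_csv(directory=None):
    """Write a tiny made-up dataset and return its path."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "example.csv")
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["city", "population"])
        writer.writerow(["Springfield", 30000])
        writer.writerow(["Shelbyville", 25000])
    return path
```

With the package installed, profile_csv(write_example_csv(), include_sample=True, plots=False) would profile the file and return the serialized profile. Passing a path keeps the sketch independent of pandas; a file object or DataFrame can be given as data instead, as listed under Parameters.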