Internals

Architecture

Datamart is a cloud-native application divided into multiple components (containers) that can be scaled independently.

Figure (_images/architecture.png): Overall architecture of Datamart

At the core of Datamart is an Elasticsearch cluster which stores sketches of the datasets, obtained by downloading and profiling them.

The system is designed to be extensible. Plugins called “Discoverers” can be added to support additional data sources.

The components communicate via the AMQP message-queueing protocol, through a RabbitMQ server. This allows discoverers to queue datasets to be downloaded and profiled as profilers become available, while keeping waiting clients updated about relevant new discoveries.

Figure (_images/amqp.png): AMQP queues and exchanges used by Datamart
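The discoverer base class described below hides these details, but to make the message flow concrete, here is a minimal sketch of announcing a dataset with the aio_pika client. The exchange name, message shape, and connection URL are illustrative assumptions, not Datamart's actual wire format:

    import asyncio
    import json

    import aio_pika


    async def announce(amqp_url, dataset):
        # Connect to the RabbitMQ server that ties the components together
        connection = await aio_pika.connect_robust(amqp_url)
        async with connection:
            channel = await connection.channel()
            # 'datasets' is an illustrative exchange name; the real queue
            # and exchange topology is the one pictured in the figure above
            exchange = await channel.declare_exchange(
                'datasets', aio_pika.ExchangeType.FANOUT,
            )
            await exchange.publish(
                aio_pika.Message(body=json.dumps(dataset).encode()),
                routing_key='',
            )


    asyncio.run(announce(
        'amqp://guest:guest@localhost/',
        {'id': 'example.dataset1', 'materialize': {}},
    ))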

Discoverers

A discoverer is responsible for finding data. It runs as its own container, and can either:

  • announce all datasets to AMQP when they appear in the source, so they can be profiled in advance of user queries, or

  • react to user queries, using them to perform a search in the source, and announce the datasets found on demand.

Either way, datamart_core provides a base Python class that can easily be extended, rather than re-implementing the AMQP setup from scratch.
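For example, a minimal ahead-of-time discoverer might look like the following sketch. The list_source_datasets() helper and the 'direct_url' materialization key are illustrative assumptions, not part of datamart_core; record_dataset() is documented below:

    from datamart_core.discovery import Discoverer


    class ExampleDiscoverer(Discoverer):
        """Announces every dataset in a source ahead of user queries."""

        def main_loop(self):
            # list_source_datasets() stands in for whatever API call or
            # crawl enumerates the source; it is not part of datamart_core
            for record in list_source_datasets():
                self.record_dataset(
                    # How to obtain this dataset again later; the exact
                    # keys depend on the corresponding materializer
                    materialize={'direct_url': record['url']},
                    # Metadata known up front; profiling fills in the rest
                    metadata={'name': record['title']},
                    dataset_id=record['id'],
                )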

class datamart_core.discovery.Discoverer(identifier, concurrent=4)

Base class for discoverer plugins.

main_loop()

Optional entrypoint of the discoverer.

Not necessary if the discoverer is only reacting to queries.

handle_query(query, publisher)

Optional hook to react to user queries and find datasets on-demand.

Not necessary if the discoverer is only discovering datasets ahead-of-time.

publisher(materialize, metadata, dataset_id=None)

Callable object used to publish datasets found for that query. They will be profiled if necessary, recorded in the index, and considered for inclusion in the results of the user’s query.

Parameters
  • materialize (dict) – Materialization information for that dataset

  • metadata (dict) – Metadata for that dataset, that might be augmented with profiled information

  • dataset_id (str) – Dataset id. If unspecified, a UUID4 will be generated for it.
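As a sketch, an on-demand discoverer could forward each search hit through the publisher like this. The search_source() helper and the query's 'keywords' field are assumptions for illustration:

    from datamart_core.discovery import Discoverer


    class OnDemandDiscoverer(Discoverer):
        def handle_query(self, query, publisher):
            # search_source() stands in for the source's own search API;
            # reading 'keywords' is an assumption about the query's shape
            for record in search_source(query.get('keywords', [])):
                publisher(
                    materialize={'direct_url': record['url']},
                    metadata={'name': record['title']},
                    dataset_id=record['id'],
                )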

record_dataset(materialize, metadata, dataset_id=None)

Publish a found dataset.

The dataset will be profiled if necessary and recorded in the index.

write_to_shared_storage(dataset_id)

Write a file to persistent storage.

This is useful when there is no way to materialize the dataset again in the future and a copy needs to be stored for later reference. Materialization won’t occur for datasets that are already in shared storage.
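A hedged sketch of how a scraper-style discoverer might use this, assuming the return value acts as a context manager yielding a writable directory (check the datamart_core source for the actual contract):

    import os

    from datamart_core.discovery import Discoverer


    class ScrapingDiscoverer(Discoverer):
        def store_snapshot(self, dataset_id, csv_bytes):
            # Assumption: write_to_shared_storage() acts as a context
            # manager yielding a writable directory
            with self.write_to_shared_storage(dataset_id) as tmp_dir:
                with open(os.path.join(tmp_dir, 'main.csv'), 'wb') as fp:
                    fp.write(csv_bytes)
            # Once the data is in shared storage, no materialization will
            # occur; the empty dict here is an illustrative assumption
            self.record_dataset(
                {}, {'name': dataset_id}, dataset_id=dataset_id,
            )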

delete_dataset(*, full_id=None, dataset_id=None)

Delete a dataset that is no longer present in the source.

class datamart_core.discovery.AsyncDiscoverer(identifier, concurrent=4)

Async variant of Discoverer, in which main_loop() and handle_query() are coroutines.
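A sketch of the async variant, assuming a hypothetical fetch_catalog() coroutine for the source's catalog API:

    import asyncio

    from datamart_core.discovery import AsyncDiscoverer


    class PollingAsyncDiscoverer(AsyncDiscoverer):
        """Periodically re-scans a source without blocking the event loop."""

        async def main_loop(self):
            while True:
                # fetch_catalog() stands in for an async HTTP call to the
                # source; it is not part of datamart_core
                for record in await fetch_catalog():
                    # Called as in the synchronous base class; if the async
                    # variant exposes record_dataset() as a coroutine,
                    # await it instead
                    self.record_dataset(
                        materialize={'direct_url': record['url']},
                        metadata={'name': record['title']},
                        dataset_id=record['id'],
                    )
                await asyncio.sleep(3600)  # re-check the source hourly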