Bubbles is a Python framework for data processing and data quality measurement. The basic concepts are abstract data objects, operations, and dynamic operation dispatch.

Goals:

  • understandability of the process
  • auditability of the data being processed (frequent use of metadata)
  • usability
  • versatility

Bubbles is performance-agnostic at the low level of physical data implementation. Performance depends on the underlying data technology and on the appropriate use of operations.

When might you consider using Bubbles?

  • data integration
  • data cleansing
  • data monitoring
  • data auditing
  • exploring unknown datasets
  • heterogeneous data environments – different data technologies

Data Objects

A data object is an abstraction of a dataset. A data object might have multiple representations, such as an SQL statement or a Python iterator. It does not have to be backed by physically existing data.

The principle behind data objects is to keep data in its natural "habitat" and not fetch it unless necessary.
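
The idea of one dataset with several representations can be illustrated with a small toy class. This is only a sketch and not the Bubbles implementation: the class SQLTableObject and its methods representations(), sql_statement() and rows() are names assumed here for illustration, and SQLite stands in for any SQL backend.

```python
# Hypothetical sketch: SQLTableObject and its method names are assumptions
# made for illustration, not the Bubbles API.
import sqlite3


class SQLTableObject:
    """A toy data object backed by a table in a SQLite database."""

    def __init__(self, connection, table):
        self.connection = connection
        self.table = table

    def representations(self):
        # Listed from the most natural representation to the least.
        return ["sql", "rows"]

    def sql_statement(self):
        # Metadata-only representation: no data is fetched here.
        return f"SELECT * FROM {self.table}"

    def rows(self):
        # Data leaves the database only when a consumer asks for rows.
        return iter(self.connection.execute(self.sql_statement()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (category TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3)],
)

obj = SQLTableObject(conn, "items")
statement = obj.sql_statement()   # a string; the table is not read
fetched = list(obj.rows())        # the table is read only here
```

In this sketch the SQL representation is just text, so it can be passed around and composed without touching the data; only rows() actually fetches anything.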

Operations

Data objects are processed using operations, which try to operate on metadata where possible. For example, if the data can be represented as SQL, then SQL composition is used. This makes the best use of the underlying technology whenever possible.
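
A rough sketch of how such dispatch might look follows. It is not how Bubbles implements it: the classes RowsObject and SQLObject, the registry DISTINCT_IMPLEMENTATIONS and the function distinct() are all hypothetical. The point is only that the same operation composes a statement for an SQL-backed object and iterates for a row-backed one.

```python
# Hypothetical sketch of representation-based dispatch; every name below is
# an assumption made for illustration, not the Bubbles API.


class RowsObject:
    """Toy data object that holds rows in memory."""

    def __init__(self, rows, fields):
        self._rows = list(rows)
        self.fields = fields

    def representations(self):
        return ["rows"]

    def rows(self):
        return iter(self._rows)


class SQLObject:
    """Toy data object that is nothing more than an SQL statement."""

    def __init__(self, statement, fields):
        self.statement = statement
        self.fields = fields

    def representations(self):
        return ["sql"]


def distinct_sql(obj, key):
    # Compose a new statement around the old one; no data is fetched.
    statement = f"SELECT DISTINCT {key} FROM ({obj.statement}) AS s"
    return SQLObject(statement, [key])


def distinct_rows(obj, key):
    # Fall back to iterating over the rows, keeping first-seen order.
    index = obj.fields.index(key)
    seen = set()
    result = []
    for row in obj.rows():
        value = row[index]
        if value not in seen:
            seen.add(value)
            result.append((value,))
    return RowsObject(result, [key])


# One implementation of the operation per representation.
DISTINCT_IMPLEMENTATIONS = {"sql": distinct_sql, "rows": distinct_rows}


def distinct(obj, key):
    # Use the first representation that has a matching implementation.
    for representation in obj.representations():
        implementation = DISTINCT_IMPLEMENTATIONS.get(representation)
        if implementation is not None:
            return implementation(obj, key)
    raise TypeError("no implementation of distinct for this object")


fields = ["category", "amount"]
sql_result = distinct(SQLObject("SELECT * FROM items", fields), "category")
rows_result = distinct(
    RowsObject([("a", 1), ("b", 2), ("a", 3)], fields), "category"
)
```

Here sql_result holds only a composed statement, while rows_result holds the computed rows; the caller uses the same distinct() call in both cases.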

Pipeline

A data processing pipeline can be described as:

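from bubbles import Pipeline

# Build the pipeline graph; nothing is executed yet.
p = Pipeline()

p.source_object("csv_source", "data.csv")  # read from a CSV file
p.distinct("category")                     # keep distinct "category" values
p.pretty_print()                           # print the result as a text table

# Execute the graph.
p.run()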

The steps are written as Python function calls, but the calls only construct a data processing pipeline graph. The graph is then executed by run(). Different graph execution policies and contexts might be used.
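
The split between building and running can be sketched with a toy class. This is an assumption-laden illustration, not the Bubbles Pipeline: ToyPipeline, its source() and distinct() methods and the plain list standing in for the graph are all hypothetical.

```python
# Hypothetical sketch of deferred execution; ToyPipeline is not the Bubbles
# Pipeline, and a plain list stands in for the pipeline graph.


class ToyPipeline:
    def __init__(self):
        self.nodes = []  # recorded steps; nothing has been executed

    def source(self, values):
        # Record a node that will produce the values when the graph runs.
        self.nodes.append(lambda _previous: list(values))
        return self

    def distinct(self):
        # Record a node that removes duplicates, keeping first-seen order.
        self.nodes.append(lambda previous: list(dict.fromkeys(previous)))
        return self

    def run(self):
        # Only now are the recorded nodes executed, one after another.
        data = None
        for node in self.nodes:
            data = node(data)
        return data


p = ToyPipeline()
p.source(["a", "b", "a"])
p.distinct()

recorded = len(p.nodes)  # two nodes recorded, no data processed so far
result = p.run()
```

Because the method calls only record nodes, a different run() could in principle execute the same recorded steps under another policy, which is the flexibility described above.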