Bubbles is a Python framework for data processing and data quality measurement. Its basic concepts are abstract data objects, operations, and dynamic operation dispatch.
Goals:
Bubbles is performance agnostic at the low level of physical data implementation. Performance is assured by the data technology and appropriate use of operations.
When might you consider using Bubbles?
A data object is an abstraction of a dataset. Data objects might have multiple representations, such as an SQL statement or a Python iterator. A data object does not have to be backed by physically existing data.
The principle of a data object is to keep the data in its natural "habitat" and not fetch it unless necessary.
Data objects are processed using operations which try to operate on metadata if possible. For example, if the data can be represented as SQL, then SQL composition is used. This allows the best use of underlying technology whenever it is possible.
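The idea of dispatching an operation on a data object's representation can be illustrated with a minimal sketch. This is not the Bubbles API: the names `SQLObject`, `RowsObject`, `DISPATCH` and `distinct` are hypothetical and exist only to show how the same operation might compose SQL for one object and iterate over rows for another.

```python
from dataclasses import dataclass


@dataclass
class SQLObject:
    """Hypothetical data object described only by an SQL statement."""
    statement: str

    def representations(self):
        return ["sql"]


@dataclass
class RowsObject:
    """Hypothetical data object backed by a list of row dictionaries."""
    data: list

    def representations(self):
        return ["rows"]


def distinct_sql(obj, column):
    # Compose a new statement around the old one; no rows are fetched.
    return SQLObject(f"SELECT DISTINCT {column} FROM ({obj.statement}) AS s")


def distinct_rows(obj, column):
    # Fall back to iterating over the rows, keeping first occurrences.
    seen = set()
    out = []
    for row in obj.data:
        key = row[column]
        if key not in seen:
            seen.add(key)
            out.append({column: key})
    return RowsObject(out)


# Hypothetical dispatch table: representation name -> implementation.
DISPATCH = {"sql": distinct_sql, "rows": distinct_rows}


def distinct(obj, column):
    """Pick the implementation matching the object's representation."""
    for rep in obj.representations():
        if rep in DISPATCH:
            return DISPATCH[rep](obj, column)
    raise TypeError("no implementation for this data object")
```

In this sketch, calling `distinct` on an `SQLObject` returns another `SQLObject` with a composed statement, so the data stays in the database until something actually needs the rows.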
A data processing pipeline can be described as:

    from bubbles import Pipeline

    p = Pipeline()
    p.source_object("csv_source", "data.csv")
    p.distinct("category")
    p.pretty_print()
    p.run()
The steps are described as Python function calls; however, they are only used to construct a data processing pipeline graph. The graph is then executed by run(). Different graph execution policies and contexts might be used.
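The separation between building the graph and executing it can be sketched as follows. This is a simplified illustration, not how Bubbles is implemented: `SketchPipeline` is a hypothetical class, and its "graph" is just a linear list of steps.

```python
class SketchPipeline:
    """Hypothetical stand-in: method calls record steps, run() executes them."""

    def __init__(self):
        # A linear "graph" of (name, function) pairs.
        self.steps = []

    def source(self, rows):
        # Record the source; nothing is read at this point.
        self.steps.append(("source", lambda _previous: list(rows)))
        return self

    def distinct(self, column):
        def op(data):
            # Keep the first occurrence of each value in the column.
            seen = set()
            out = []
            for row in data:
                value = row[column]
                if value not in seen:
                    seen.add(value)
                    out.append(value)
            return out

        self.steps.append(("distinct", op))
        return self

    def run(self):
        # Execute the recorded steps in order, feeding each result forward.
        data = None
        for _name, op in self.steps:
            data = op(data)
        return data


p = SketchPipeline()
p.source([{"category": "a"}, {"category": "b"}, {"category": "a"}])
p.distinct("category")
result = p.run()
```

Because the calls only record steps, a different `run()` strategy could execute the same recorded graph in another way without changing how the pipeline is described.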