Backends

Backends connect users to DSI Core middleware and allow DSI middleware data structures to read and write to persistent external storage.

Backends are modular to support user contribution, and users are encouraged to offer custom backend abstract classes and backend implementations. A contributed backend abstract class may extend another backend to inherit the properties of the parent.

In order to be compatible with DSI core middleware, backends need to interface with Python built-in data structures and with the Python collections library.

Note that any contributed backends or extensions must include unit tests in backends/tests to demonstrate new Backend capability. We can not accept pull requests that are not tested.

Figure depicting the current backend class hierarchy.

Figure depicts the current DSI backend class hierarchy.

SQLite

class dsi.backends.sqlite.Sqlite(filename)

SQLite Filesystem Backend to which a user can ingest/process data, generate a Jupyter notebook, and find occurences of a search term

__init__(filename)

Initializes a SQLite backend with a user inputted filename, and creates other internal variables

close()

Closes the SQLite database’s connection.

display(table_name, num_rows=25, display_cols=None)

Returns all data from a specified table in this SQLite backend.

table_namestr

Name of the table to display.

num_rowsint, optional, default=25

Maximum number of rows to print. If the table contains fewer rows, only those are shown.

display_colslist of str, optional

List of specific column names to display from the table.

If None (default), all columns are displayed.

find(query_object)

Searches for all instances of query_object in the SQLite database at the table, column, and cell levels. Includes partial matches as well.

query_objectint, float, or str

The value to search for across all tables in the backend.

returnlist or tuple

A list of ValueObjects representing matches. If no matches are found, returns a tuple of an empty ValueObject and an error message.

  • Note: ValueObjects may vary in structure depending on whether the match occurred at the table, column, or cell level.

  • Refer to find_table(), find_column(), and find_cell() for the specific structure of each ValueObject type.

find_cell(query_object, row=False)

Finds all cells in the database that match or partially match the given query_object.

query_objectint, float, or str

The value to search for at the cell level, across all tables in the backend.

row: bool, optional, default=False

If True, value in the returned ValueObject will be the entire row where a cell matched. If False, value in the returned ValueObject will only be the matching cell value.

return : List of ValueObjects if there is a match.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list of column names.

    • If row=True: list of all column names in the table

    • If row=False: list with one element - the matched column name

  • value:

    • If row=True: full row of values

    • If row=False: value of the matched cell

  • row_num: row index of the match

  • type:

    • If row=True: ‘row’

    • If row=False: ‘cell’

find_column(query_object, range=False)

Finds all columns whose names match or partially match the given query_object.

query_objectstr

The string to search for in column names.

rangebool, optional, default=False

If True, value in the returned ValueObject will be the [min, max] of the matching numerical column. If False, value in the returned ValueObject will be the full list of column data.

return : List of ValueObjects if there is a match.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list containing one element - the matching column name

  • value:

    • If range=True: [min, max]

    • If range=False: list of column data

  • row_num: None

  • type:

    • If range=True: ‘range’

    • If range=False: ‘column’

find_relation(column_name, relation)

Finds all rows in the first table of the database that satisfy the relation applied to the given column.

column_namestr

The name of the column to apply the relation to.

relationstr

The operator and value to apply to the column. Ex: >4, <4, =4, >=4, <=4, ==4, !=4, (4,5)

returnlist of ValueObjects

One ValueObject per matching row in that first table.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list of all columns in the table

  • value: full row of values

  • row_num: row index of the match

  • type: ‘relation’

find_table(query_object)

Finds all tables whose names match or partially match the given query_object.

query_objectstr

The string to search for in table names.

returnlist of ValueObjects

One ValueObject per matching table.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list of all columns in the table

  • value: table data as list of rows (each row is a list)

  • row_num: None

  • type: ‘table’

get_table(table_name, dict_return=False)

Retrieves all data from a specified table without requiring knowledge of SQL.

This method is a simplified alternative to query_artifacts() for users who are only familiar with Python.

table_namestr

Name of the table in the SQLite backend.

dict_returnbool, optional, default=False

If True, returns the result as an OrderedDict. If False, returns the result as a pandas DataFrame.

returnpandas.DataFrame or OrderedDict or tuple
  • If query is valid and dict_return is False: returns a DataFrame.

  • If query is valid and dict_return is True: returns an OrderedDict.

  • If query is invalid: returns a tuple (ErrorType, “error message”). Ex: (ValueError, “this is an error”)

get_table_names(query)

Extracts all table names from a SQL query. Helper function for query_artifacts() that users do not need to call

querystr

A SQL query string, typically passed into query_artifacts().

return: list of str

List of table names referenced in the query.

ingest_artifacts(collection, isVerbose=False)

Primary function to ingest a collection of tables into the defined SQLite database.

Creates the auto generated runTable if the corresponding flag was set to True when initializing a Core.Terminal Also creates a dsi_units table if any units are associated with the ingested data values.

Can only be called if a SQLite database is loaded as a BACK-WRITE backend. (See core.py for distinction between BACK-READ and BACK-WRITE.)

collectionOrderedDict

A nested OrderedDict representing multiple tables and their associated data. Each top-level key is a table name, and its value is an OrderedDict of column names and corresponding data lists.

isVerbosebool, optional, default=False

If True, prints all SQL insert statements during the ingest process for debugging or inspection purposes.

return: None on successful ingestion. If an error occurs, returns a tuple in the format of: (ErrorType, error message). Ex: (ValueError, “this is an error”)

ingest_table_helper(types, foreign_query=None, isVerbose=False)

Internal use only. Do not call

Helper function to create SQLite table based on a passed in schema.

typesDataType
A DataType-derived object that defines:
  • the table name as a string,

  • table properties as a dictionary mapping column names to data,

  • associated units for each column.

foreign_querystr, optional, default=None

A valid SQL string specifying foreign key constraints to apply to the table.

isVerbosebool, optional, default=False

If True, prints the CREATE TABLE statements for debugging or inspection.

list()

Return a list of all tables and their dimensions from this SQLite backend

notebook(interactive=False)

Generates a Jupyter notebook displaying all the data in the SQLite database.

If multiple tables exist, each is displayed as a separate DataFrame.

If database has table relations, it is stored as a separate dataframe. If database has a units table, each table’s units are stored in its corresponding dataframe attrs variable

interactive: default is False. When set to True, creates an interactive Jupyter notebook, otherwise creates an HTML file.

return: None

num_tables()

Prints number of tables in this backend

overwrite_table(table_name, collection)

Overwrites specified table(s) in this SQLite backend using the provided Pandas DataFrame(s).

If a relational schema has been previously loaded into the backend, it will be reapplied to the table. Note: This function permanently deletes the existing table and its data, before inserting the new data.

table_namestr or list
  • If str, name of the table to overwrite in the backend.

  • If list, list of all tables to overwrite in the backend

collectionpandas.DataFrame or list of Pandas.DataFrames
  • If one item, a DataFrame containing the updated data will be written to the table.

  • If a list, all DataFrames with updated data will be written to their own table

process_artifacts(only_units_relations=False)

Reads data from the SQLite database into a nested OrderedDict. Keys are table names, and values are OrderedDicts containing table data.

If the database contains PK/FK relationships, they are stored in a special dsi_relations table.

only_units_relationsbool, default=False

USERS SHOULD IGNORE THIS FLAG. Used internally by sqlite.py.

returnOrderedDict

A nested OrderedDict containing all data from the SQLite database.

put_artifacts_t(collection, tableName='TABLENAME', isVerbose=False)

DSI 1.0 FUNCTIONALITY - DEPRECATING SOON, DO NOT USE

Primary class for insertion of collection of Artifacts metadata into a defined schema, with a table passthrough

collection: A Python Collection of an Artifact derived class that has multiple regular structures of a defined schema, filled with rows to insert.

tableName: A passthrough to define a table and set the name of a table

return: none

query_artifacts(query, isVerbose=False, dict_return=False)

Executes a SQL query on the SQLite backend and returns the result in the specified format dependent on dict_return

querystr

Must be a SELECT or PRAGMA SQL query. Aggregate functions like COUNT are allowed. If dict_return is True, the query must target a single table and cannot include joins.

isVerbosebool, optional, default=False

If True, prints the SQL SELECT statements being executed.

dict_returnbool, optional, default=False

If True, returns the result as an OrderedDict. If False, returns the result as a pandas DataFrame.

returnpandas.DataFrame or OrderedDict or tuple
  • If query is valid and dict_return is False: returns a DataFrame.

  • If query is valid and dict_return is True: returns an OrderedDict.

  • If query is invalid: returns a tuple (ErrorType, “error message”). Ex: (ValueError, “this is an error”)

sql_type(input_list)

Internal use only. Do not call

Evaluates a list and returns the predicted compatible SQLite Type

input_listlist

A list of values to analyze for type compatibility.

return: str

A string representing the inferred SQLite data type for the input list.

summary(table_name=None)

Returns numerical metadata from tables in the first activated backend.

table_namestr, optional

If specified, only the numerical metadata for that table will be returned as a Pandas DataFrame.

If None (default), metadata for all available tables is returned as a list of Pandas DataFrames.

summary_helper(table_name)

Internal use only. Do not call

Generates and returns summary metadata for a specific table in the SQLite backend.

class dsi.backends.sqlite.ValueObject

Data Structure used when returning search results from find, find_table, find_column, find_cell, or find_relation

  • t_name: table name

  • c_name: column name as a list. The length of the list varies based on the find function. Read the description of each one to understand the differences

  • row_num: row number. Useful when finding a value in find_cell, find_relation, or find (includes results from find_cell)

  • type: type of match for this specific ValueObject. {table, column, range, cell, row, relation}

DuckDB

class dsi.backends.duckdb.DuckDB(filename)

DuckDB Filesystem Backend to which a user can ingest/process data, generate a Jupyter notebook, and find occurences of a search term

__init__(filename)

Initializes a DuckDB backend with a user inputted filename, and creates other internal variables

check_table_relations(tables, relation_dict)

Internal use only. Do not call.

Checks if a user-loaded schema has circular dependencies.

If no circular dependencies are found, returns a list of tables ordered from least dependent to most dependent, suitable for staged ingestion into the DuckDB backend.

Note: This method is intended for internal use only. DSI users should not call this directly.

tableslist of str

List of table names to ingest into the DuckDB backend.

relation_dictOrderedDict

An OrderedDict describing table relationships. Structured as the dsi_relations object with primary and foreign keys.

return: tuple of (has_cycle, ordered_tables)
  • has_cycle (bool): True if a circular dependency is detected.

  • ordered_tables (list or None): Ordered list of tables if no cycle is found; None if a circular dependency exists.

close()

Closes the DuckDB database’s connection.

return: None

display(table_name, num_rows=25, display_cols=None)

Returns all data from a specified table in this DuckDB backend.

table_namestr

Name of the table to display.

num_rowsint, optional, default=25

Maximum number of rows to print. If the table contains fewer rows, only those are shown.

display_colslist of str, optional

List of specific column names to display from the table.

If None (default), all columns are displayed.

find(query_object)

Searches for all instances of query_object in the DuckDB database at the table, column, and cell levels. Includes partial matches as well.

query_objectint, float, or str

The value to search for across all tables in the backend.

returnlist or tuple

A list of ValueObjects representing matches. If no matches are found, returns a tuple of an empty ValueObject and an error message.

  • Note: ValueObjects may vary in structure depending on whether the match occurred at the table, column, or cell level.

  • Refer to find_table(), find_column(), and find_cell() for the specific structure of each ValueObject type.

find_cell(query_object, row=False)

Finds all cells in the database that match or partially match the given query_object.

query_objectint, float, or str

The value to search for at the cell level, across all tables in the backend.

row: bool, optional, default=False

If True, value in the returned ValueObject will be the entire row where a cell matched. If False, value in the returned ValueObject will only be the matching cell value.

return : List of ValueObjects if there is a match.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list of column names.

    • If row=True: list of all column names in the table

    • If row=False: list with one element - the matched column name

  • value:

    • If row=True: full row of values

    • If row=False: value of the matched cell

  • row_num: row index of the match

  • type:

    • If row=True: ‘row’

    • If row=False: ‘cell’

find_column(query_object, range=False)

Finds all columns whose names match or partially match the given query_object.

query_objectstr

The string to search for in column names.

rangebool, optional, default=False

If True, value in the returned ValueObject will be the [min, max] of the matching numerical column. If False, value in the returned ValueObject will be the full list of column data.

return : List of ValueObjects if there is a match.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list containing one element - the matching column name

  • value:

    • If range=True: [min, max]

    • If range=False: list of column data

  • row_num: None

  • type:

    • If range=True: ‘range’

    • If range=False: ‘column’

find_relation(column_name, relation)

Finds all rows in the first table of the database that satisfy the relation applied to the given column.

column_namestr

The name of the column to apply the relation to.

relationstr

The operator and value to apply to the column. Ex: >4, <4, =4, >=4, <=4, ==4, !=4, (4,5)

returnlist of ValueObjects

One ValueObject per matching row in that first table.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list of all columns in the table

  • value: full row of values

  • row_num: row index of the match

  • type: ‘relation’

find_table(query_object)

Finds all tables whose names match or partially match the given query_object.

query_objectstr

The string to search for in table names.

returnlist of ValueObjects

One ValueObject per matching table.

ValueObject Structure:
  • t_name: table name (str)

  • c_name: list of all columns in the table

  • value: table data as list of rows (each row is a list)

  • row_num: None

  • type: ‘table’

get_table(table_name, dict_return=False)

Retrieves all data from a specified table without requiring knowledge of SQL.

This method is a simplified alternative to query_artifacts() for users who are only familiar with Python.

table_namestr

Name of the table in the DuckDB backend.

dict_returnbool, optional, default=False

If True, returns the result as an OrderedDict. If False, returns the result as a pandas DataFrame.

returnpandas.DataFrame or OrderedDict or tuple
  • If query is valid and dict_return is False: returns a DataFrame.

  • If query is valid and dict_return is True: returns an OrderedDict.

  • If query is invalid: returns a tuple (ErrorType, “error message”). Ex: (ValueError, “this is an error”)

get_table_names(query)

Extracts all table names from a SQL query. Helper function for query_artifacts() that users do not need to call

querystr

A SQL query string, typically passed into query_artifacts().

return: list of str

List of table names referenced in the query.

ingest_artifacts(collection, isVerbose=False)

Primary function to ingest a collection of tables into the defined DuckDB database.

Creates the auto generated runTable if the corresponding flag was set to True when initializing a Core.Terminal Also creates a dsi_units table if any units are associated with the ingested data values.

Cannot ingest data if it has a complex schema with circular dependencies, ex: A->B->C->A

Can only be called if a DuckDB database is loaded as a BACK-WRITE backend. (See core.py for distinction between BACK-READ and BACK-WRITE.)

collectionOrderedDict

A nested OrderedDict representing multiple tables and their associated data. Each top-level key is a table name, and its value is an OrderedDict of column names and corresponding data lists.

isVerbosebool, optional, default=False

If True, prints all SQL insert statements during the ingest process for debugging or inspection purposes.

return: None on successful ingestion. If an error occurs, returns a tuple in the format of: (ErrorType, error message). Ex: (ValueError, “this is an error”)

ingest_table_helper(types, foreign_query=None, isVerbose=False)

Internal use only. Do not call

Helper function to create DuckDB table based on a passed in schema.

typesDataType
A DataType-derived object that defines:
  • the table name as a string,

  • table properties as a dictionary mapping column names to data,

  • associated units for each column.

foreign_querystr, optional, default=None

A valid SQL string specifying foreign key constraints to apply to the table.

isVerbosebool, optional, default=False

If True, prints the CREATE TABLE statements for debugging or inspection.

list()

Return a list of all tables and their dimensions from this DuckDB backend

num_tables()

Prints number of tables in this backend

overwrite_table(table_name, collection)

Overwrites specified table(s) in this DuckDB backend using the provided Pandas DataFrame(s).

If a relational schema has been previously loaded into the backend, it will be reapplied to the table. Cannot accept any schemas with circular dependencies.

Note: This function permanently deletes the existing table and its data, before inserting the new data.

table_namestr or list
  • If str, name of the table to overwrite in the backend.

  • If list, list of all tables to overwrite in the backend

collectionpandas.DataFrame or list of Pandas.DataFrames
  • If one item, a DataFrame containing the updated data will be written to the table.

  • If a list, all DataFrames with updated data will be written to their own table

process_artifacts()

Reads data from the DuckDB database into a nested OrderedDict. Keys are table names, and values are OrderedDicts containing table data.

If the database contains PK/FK relationships, they are stored in a special dsi_relations table.

returnOrderedDict

A nested OrderedDict containing all data from the DuckDB database.

query_artifacts(query, isVerbose=False, dict_return=False)

Executes a SQL query on the DuckDB backend and returns the result in the specified format dependent on dict_return

querystr

Must be a SELECT or PRAGMA SQL query. Aggregate functions like COUNT are allowed. If dict_return is True, the query must target a single table and cannot include joins.

isVerbosebool, optional, default=False

If True, prints the SQL SELECT statements being executed.

dict_returnbool, optional, default=False

If True, returns the result as an OrderedDict. If False, returns the result as a pandas DataFrame.

returnpandas.DataFrame or OrderedDict or tuple
  • If query is valid and dict_return is False: returns a DataFrame.

  • If query is valid and dict_return is True: returns an OrderedDict.

  • If query is invalid: returns a tuple (ErrorType, “error message”). Ex: (ValueError, “this is an error”)

sql_type(input_list)

Internal use only. Do not call

Evaluates a list and returns the predicted compatible DuckDB Type

input_listlist

A list of values to analyze for type compatibility.

return: str

A string representing the inferred DuckDB data type for the input list.

summary(table_name=None)

Returns numerical metadata from tables in the first activated backend.

table_namestr, optional

If specified, only the numerical metadata for that table will be returned as a Pandas DataFrame.

If None (default), metadata for all available tables is returned as a list of Pandas DataFrames.

summary_helper(table_name)

Internal use only. Do not call

Generates and returns summary metadata for a specific table in the DuckDB backend.

class dsi.backends.duckdb.ValueObject

Data Structure used when returning search results from find, find_table, find_column, find_cell, or find_relation

  • t_name: table name

  • c_name: column name as a list. The length of the list varies based on the find function. Read the description of each one to understand the differences

  • row_num: row number. Useful when finding a value in find_cell, find_relation, or find (includes results from find_cell)

  • type: type of match for this specific ValueObject. {table, column, range, cell, row, relation}

SQLAlchemy

GUFI

class dsi.backends.gufi.Gufi(prefix, index, dbfile, table, column, verbose=False)

GUFI Datastore

__init__(prefix, index, dbfile, table, column, verbose=False)

prefix: prefix to GUFI commands

index: directory with GUFI indexes

dbfile: sqlite db file from DSI

table: table name from the DSI db we want to join on

column: column name from the DSI db to join on

verbose: print debugging statements or not

query_artifacts(query)

Retrieves GUFI’s metadata joined with a dsi database query: an sql query into the dsi_entries table

Parquet

class dsi.backends.parquet.Parquet(filename, **kwargs)

Support for a Parquet back-end.

Parquet is a convenient format when metadata are larger than SQLite supports.

__init__(filename, **kwargs)
static get_cmd_output(cmd: list) str

Runs a given command and returns the stdout if successful.

If stderr is not empty, an exception is raised with the stderr text.

ingest_artifacts(collection)

Ingest artifacts into file at filename path.

notebook(collection, interactive=False)

Generate Jupyter notebook of Parquet data from filename.

query_artifacts()

Query Parquet data from filename.