Backends
Backends connect users to DSI Core middleware and allow DSI middleware data structures to read and write to persistent external storage.
Backends are modular to support user contribution, and users are encouraged to offer custom backend abstract classes and backend implementations.
A contributed backend abstract class may extend another backend to inherit the properties of the parent.
In order to be compatible with DSI core middleware, backends need to interface with Python built-in data structures and with the Python collections
library.
Note that any contributed backends or extensions must include unit tests in backends/tests
to demonstrate new Backend capability.
We can not accept pull requests that are not tested.

Figure depicts the current DSI backend class hierarchy.
SQLite
- class dsi.backends.sqlite.Sqlite(filename)
SQLite Filesystem Backend to which a user can ingest/process data, generate a Jupyter notebook, and find all occurences of a search term
- __init__(filename)
Initializes a SQLite backend with a user inputted filename, and creates other internal variables
- check_type(input_list)
Users should not use this function. Only used by internal sqlite functions
Evaluates a list and returns the predicted compatible SQLite Type
input_list: list of values to evaluate
return: string description of the list’s SQLite data type
- close()
Closes the SQLite database’s connection.
return: None
- find(query_object)
Function that finds all instances of a query_object in a SQLite database. This includes any partial hits if query_object is part of a table/col/cell
query_object: Object to find in this database. Can be of any type (string, float, int).
return: List of ValueObjects if there is a match. Else returns tuple of empty ValueObject() and an error message.
Note: Return list can have ValueObjects with different structure due to table/column/cell matches having different value variables
Refer to other find functions (table, column and cell) to clearly understand each one’s ValueObject structure
- find_cell(query_object, row=False)
Function that finds all cells that match the query_object. This includes any partial hits if the query_object is part of a cell value
query_object: Object to find in all cells. Can be of any type (string, float, int).
row: default is False. Set to True, if want to return whole row where there is a match between a cell and query_object
return: List of ValueObjects if there is a match.
Structure of ValueObjects for this function:
t_name: string of table name
c_name: list of column names.
row = True, list is all columns in this table
row = False, list is one item – column of cell that matched query_object
value:
row = True, list of whole row where a cell matches query_object
row = False, value of the cell that matches query_object
row_num: row number of the cell that matched
type:
row = True, ‘row’
row = False, ‘cell’
- find_column(query_object, range=False)
Function that finds all columns whose name matches the query_object. This includes any partial hits if the query_object is part of a column name
query_object: Object to find in all column names. HAS TO BE A STRING
range: default is False. Set to True, if want to return min/max of a numerical column whose name matches the query_object, not column data.
return: List of ValueObjects if there is a match.
Structure of ValueObjects for this function:
t_name: string of table name
c_name: list of one, which is the name of the matching column
value:
range = True, [min, max] of the column
range = False, column data as a list
row_num: None
type:
range = True, ‘range’
range = False, ‘column’
- find_table(query_object)
Function that finds all tables whose name matches the query_object. This includes any partial hits if the query_object is part of a table name
query_object: Object to find in all table names. HAS TO BE A STRING
return: List of ValueObjects if there is a match.
- Structure of ValueObjects for this function:
t_name: string of table name
c_name: list of all columns in matching table
value: table’s data as a list of lists (each row is a list)
row_num: None
type: ‘table’
- ingest_artifacts(collection, isVerbose=False)
Primary function to ingest a collection of tables into the defined SQLite database.
Creates the auto generated runTable if flag set to True when setting up a Core.Terminal workflow Creates dsi_units table if there are units for ingested data values.
Can only be called if a SQLite database is loaded as a BACK-WRITE backend (check
core.py
for distinction)collection: A Python Collection of several tables and their data structured as a nested Ordered Dictionary.
isVerbose: default is False. Flag to print all insert table SQLite statements
return: None when stable ingesting. When errors occur, returns a tuple of (ErrorType, error message). Ex: (ValueError, “this is an error”)
- ingest_table_helper(types, foreign_query=None, isVerbose=False)
Users do not interact with this function and should ignore it. Called within ingest_artifacts()
Helper function to create SQLite table based on a passed in schema.
types: DataType derived class that defines the string name, properties (dictionary of table names and table data), and units for each column in the schema.
foreign_query: defaut is None. It is a SQLite string detailing the foreign keys in this table
isVerbose: default is False. Flag to print all create table SQLite statements
return: none
- notebook(interactive=False)
Generates a Jupyter notebook displaying all the data in the specified SQLite database.
To account for multiple tables, the database is stored as a list of dataframes, where each table is a dataframe.
If database has table relations, it is stored as a separate dataframe. If database has a units table, each table’s units are stored in its corresponding dataframe attrs variable
interactive: default is False. When set to True, creates an interactive Jupyter notebook, otherwise creates an HTML file.
return: None
- process_artifacts(only_units_relations=False, isVerbose=False)
Reads in data from the SQLite database into a nested Ordered Dictionary, where keys are table names and values are Ordered Dictionary of table data. If there are PK/FK relations in a database it is stored in a table called dsi_relations.
Can only be called if a loaded SQLite database is a BACK-READ backend in a Core.Terminal workflow (check core.py for distinction)
only_units_relations: default is False. USERS SHOULD IGNORE THIS FLAG. Used by an internal sqlite.py function.
isVerbose: default is False. When set to True, prints all SQLite queries to select data and store in abstraction
return: Nested Ordered Dictionary of all data from the SQLite database
- process_units_helper()
Users do not interact with this function and should ignore it. Called within process_artifacts()
Helper function to create the SQLite database’s units table as an Ordered Dictionary. Only called if dsi_units table exists in the database.
return: units table as an Ordered Dictionary
- put_artifacts_t(collection, tableName='TABLENAME', isVerbose=False)
DSI 1.0 FUNCTIONALITY - DEPRECATING SOON, DO NOT USE
Primary class for insertion of collection of Artifacts metadata into a defined schema, with a table passthrough
collection: A Python Collection of an Artifact derived class that has multiple regular structures of a defined schema, filled with rows to insert.
tableName: A passthrough to define a table and set the name of a table
return: none
- query_artifacts(query, isVerbose=False, dict_return=False)
Function that returns data from a SQLite database based on a specified SQL query. Data returned varies based on the dict_return flag explained below.
query: Must be a SELECT or PRAGMA query. If dict_return is True, then this can only be a simple query on one table, NO JOINS. Query CAN create new aggregate columns such as COUNT to include in the result regardless of dict_return.
isVerbose: default is False. Flag to print all Select table SQLite statements
dict_return: default is False. When set to True, return type is an Ordered Dict of data from the table specified in query.
return:
When query is of correct format and dict_return = False, return a list of database rows
When query is of correct format and dict_return = True, return an Ordered Dictionary of data for the table specified in query
When query is incorrect, return a tuple of (ErrorType, error message). Ex: (ValueError, “this is an error”)
- class dsi.backends.sqlite.ValueObject
Data Structure used when returning search results from
find
,find_table
,find_column
, orfind_cell
t_name: table name
c_name: column name as a list. The length of the list varies based on the find function. Read the description of each one to understand the differences
row_num: row number. Is only important when finding a value in
find_cell
orfind
(which includes results fromfind_cell
)type: type of match for this specific ValueObject. {table, column, range, cell, row}
SQLAlchemy
GUFI
- class dsi.backends.gufi.Gufi(prefix, index, dbfile, table, column, verbose=False)
GUFI Datastore
- __init__(prefix, index, dbfile, table, column, verbose=False)
prefix: prefix to GUFI commands
index: directory with GUFI indexes
dbfile: sqlite db file from DSI
table: table name from the DSI db we want to join on
column: column name from the DSI db to join on
verbose: print debugging statements or not
- query_artifacts(query)
Retrieves GUFI’s metadata joined with a dsi database query: an sql query into the dsi_entries table
Parquet
- class dsi.backends.parquet.Parquet(filename, **kwargs)
Support for a Parquet back-end.
Parquet is a convenient format when metadata are larger than SQLite supports.
- __init__(filename, **kwargs)
- static get_cmd_output(cmd: list) str
Runs a given command and returns the stdout if successful.
If stderr is not empty, an exception is raised with the stderr text.
- ingest_artifacts(collection)
Ingest artifacts into file at filename path.
- notebook(collection, interactive=False)
Generate Jupyter notebook of Parquet data from filename.
- query_artifacts()
Query Parquet data from filename.