Backends

Backends connect users to DSI Core middleware and allow DSI middleware data structures to read and write to persistent external storage.

Backends are modular to support user contribution, and users are encouraged to contribute custom backend abstract classes and backend implementations. A contributed backend abstract class may extend another backend to inherit the properties of the parent. To be compatible with DSI Core middleware, backends must interface with Python built-in data structures and with the Python collections library.
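
As a rough illustration of this modular design, the sketch below defines an abstract backend and one subclass that extends it. The class names ToyBackend and InMemoryBackend are hypothetical and are not part of DSI; only the method names ingest_artifacts, query_artifacts, and close mirror the ones documented on this page.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class ToyBackend(ABC):
    """Hypothetical abstract backend; not the actual DSI base class."""

    @abstractmethod
    def ingest_artifacts(self, collection):
        """Write a nested OrderedDict of tables to storage."""

    @abstractmethod
    def query_artifacts(self, query):
        """Return stored data that matches the query."""

    def close(self):
        """Release resources; subclasses may override."""
        return None


class InMemoryBackend(ToyBackend):
    """Hypothetical implementation that keeps tables in a dictionary."""

    def __init__(self):
        self.tables = OrderedDict()

    def ingest_artifacts(self, collection):
        # Each table maps column names to lists of values.
        for table_name, columns in collection.items():
            self.tables[table_name] = OrderedDict(columns)

    def query_artifacts(self, query):
        # In this toy example, a query is just a table name.
        return self.tables.get(query)


backend = InMemoryBackend()
backend.ingest_artifacts(OrderedDict(physics=OrderedDict(n=[1, 2], p=[0.5, 0.7])))
print(backend.query_artifacts("physics"))
```

The subclass only exchanges built-in lists and collections.OrderedDict objects, which is the kind of interface the paragraph above asks of a contributed backend.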

Note that any contributed backends or extensions must include unit tests in backends/tests to demonstrate the new backend capability. We cannot accept pull requests that are not tested.


Figure depicts the current DSI backend class hierarchy.

SQLite

class dsi.backends.sqlite.Sqlite(filename)

SQLite filesystem backend to which a user can ingest/process data, generate a Jupyter notebook, and find all occurrences of a search term.

__init__(filename)

Initializes a SQLite backend with a user-supplied filename and creates other internal variables.

check_type(input_list)

Users should not use this function. It is only used by internal sqlite functions.

Evaluates a list and returns the predicted compatible SQLite Type

input_list: list of values to evaluate

return: string description of the list’s SQLite data type
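
To make the idea of type prediction concrete, here is a minimal sketch of how a list of Python values might be mapped to a SQLite type name. The function name guess_sqlite_type and the returned strings (INTEGER, REAL, TEXT) are assumptions for illustration; the strings that check_type actually returns may differ.

```python
def guess_sqlite_type(input_list):
    """Return a SQLite type name for a list of Python values (illustrative only)."""
    values = [v for v in input_list if v is not None]
    # bool is a subclass of int, so it is excluded explicitly.
    numeric = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if values and len(numeric) == len(values):
        if all(isinstance(v, int) for v in values):
            return "INTEGER"
        return "REAL"
    # Anything else, including an empty list, falls back to text.
    return "TEXT"


print(guess_sqlite_type([1, 2, 3]))
print(guess_sqlite_type([1, 2.5]))
print(guess_sqlite_type(["a", 1]))
```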

close()

Closes the SQLite database’s connection.

return: None

find(query_object)

Function that finds all instances of a query_object in a SQLite database. This includes any partial hits if query_object is part of a table/col/cell

query_object: Object to find in this database. Can be of any type (string, float, int).

return: List of ValueObjects if there is a match. Otherwise, returns a tuple of an empty ValueObject() and an error message.

  • Note: Return list can have ValueObjects with different structure due to table/column/cell matches having different value variables

  • Refer to other find functions (table, column and cell) to clearly understand each one’s ValueObject structure
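
Because find has two return shapes (a list on a match, a tuple otherwise), calling code may want to normalize the result before using it. The helper below is a hypothetical convenience, not part of DSI, and assumes only the two shapes described above.

```python
def unpack_find_result(result):
    """Return (matches, error_message) for either return shape of find."""
    if isinstance(result, tuple):
        # No match: (empty ValueObject, error message).
        _empty, message = result
        return [], message
    return list(result), None


# Stand-in values; a real call would pass the result of find(query_object).
print(unpack_find_result(["match-1", "match-2"]))
print(unpack_find_result((object(), "no matches found")))
```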

find_cell(query_object, row=False)

Function that finds all cells that match the query_object. This includes any partial hits if the query_object is part of a cell value

query_object: Object to find in all cells. Can be of any type (string, float, int).

row: default is False. Set to True to return the whole row in which a cell matches query_object.

return: List of ValueObjects if there is a match.

Structure of ValueObjects for this function:

  • t_name: string of table name

  • c_name: list of column names.

    • row = True, list is all columns in this table

    • row = False, list is one item – column of cell that matched query_object

  • value:

    • row = True, list of whole row where a cell matches query_object

    • row = False, value of the cell that matches query_object

  • row_num: row number of the cell that matched

  • type:

    • row = True, ‘row’

    • row = False, ‘cell’
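
The structure above can be sketched with the standard sqlite3 module. This is not the DSI implementation: find_cell_sketch is a hypothetical function, plain dictionaries stand in for ValueObjects, and 1-based row numbering is an assumption.

```python
import sqlite3


def find_cell_sketch(con, query_object, row=False):
    """Search every cell of every table for a partial match (illustrative only)."""
    needle = str(query_object)
    matches = []
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cursor = con.execute(f'SELECT * FROM "{table}"')
        columns = [d[0] for d in cursor.description]
        # Row numbers start at 1 here; that choice is an assumption.
        for row_num, record in enumerate(cursor, start=1):
            for column, cell in zip(columns, record):
                if needle in str(cell):
                    matches.append({
                        "t_name": table,
                        "c_name": columns if row else [column],
                        "value": list(record) if row else cell,
                        "row_num": row_num,
                        "type": "row" if row else "cell",
                    })
    return matches


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (name TEXT, temp REAL)")
con.executemany("INSERT INTO runs VALUES (?, ?)", [("alpha", 300.5), ("beta", 12.0)])
print(find_cell_sketch(con, "alph"))
print(find_cell_sketch(con, 12, row=True))
```

Comparing string forms is what lets a numeric query such as 12 match the stored value 12.0 as a partial hit.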

find_column(query_object, range=False)

Function that finds all columns whose name matches the query_object. This includes any partial hits if the query_object is part of a column name

query_object: Object to find in all column names. HAS TO BE A STRING

range: default is False. Set to True to return the [min, max] of a numerical column whose name matches query_object, instead of the column data.

return: List of ValueObjects if there is a match.

Structure of ValueObjects for this function:

  • t_name: string of table name

  • c_name: list of one, which is the name of the matching column

  • value:

    • range = True, [min, max] of the column

    • range = False, column data as a list

  • row_num: None

  • type:

    • range = True, ‘range’

    • range = False, ‘column’
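
A comparable sketch for column-name matching is shown below, again using only sqlite3. The function find_column_sketch is hypothetical, dictionaries stand in for ValueObjects, and unlike the documented behavior this sketch does not check that a column is numerical before computing its range.

```python
import sqlite3


def find_column_sketch(con, query_object, range=False):
    """Match column names by substring; return data or [min, max] (illustrative only)."""
    matches = []
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        for info in con.execute(f'PRAGMA table_info("{table}")').fetchall():
            column = info[1]
            if query_object not in column:
                continue
            if range:
                low, high = con.execute(
                    f'SELECT MIN("{column}"), MAX("{column}") FROM "{table}"').fetchone()
                value = [low, high]
            else:
                value = [r[0] for r in con.execute(
                    f'SELECT "{column}" FROM "{table}"')]
            matches.append({
                "t_name": table,
                "c_name": [column],
                "value": value,
                "row_num": None,
                "type": "range" if range else "column",
            })
    return matches


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (name TEXT, temp REAL)")
con.executemany("INSERT INTO runs VALUES (?, ?)", [("alpha", 300.5), ("beta", 12.0)])
print(find_column_sketch(con, "tem"))
print(find_column_sketch(con, "tem", range=True))
```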

find_table(query_object)

Function that finds all tables whose name matches the query_object. This includes any partial hits if the query_object is part of a table name

query_object: Object to find in all table names. HAS TO BE A STRING

return: List of ValueObjects if there is a match.

Structure of ValueObjects for this function:

  • t_name: string of table name

  • c_name: list of all columns in matching table

  • value: table’s data as a list of lists (each row is a list)

  • row_num: None

  • type: ‘table’

ingest_artifacts(collection, isVerbose=False)

Primary function to ingest a collection of tables into the defined SQLite database.

Creates the auto-generated runTable if that flag was set to True when setting up a Core.Terminal workflow. Creates the dsi_units table if there are units for the ingested data values.

Can only be called if a SQLite database is loaded as a BACK-WRITE backend (check core.py for distinction)

collection: A Python Collection of several tables and their data structured as a nested Ordered Dictionary.

isVerbose: default is False. Flag to print all insert table SQLite statements

return: None when ingestion succeeds. When an error occurs, returns a tuple of (ErrorType, error message). Ex: (ValueError, “this is an error”)
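
The ingest flow and its return convention can be sketched as follows. This is a simplified stand-in, not the DSI code: ingest_sketch is a hypothetical function, it assumes each table is an Ordered Dictionary mapping column names to equal-length lists, and it skips runTable, dsi_units, and column typing entirely.

```python
import sqlite3
from collections import OrderedDict


def ingest_sketch(con, collection, isVerbose=False):
    """Create one table per key and insert its rows (illustrative only)."""
    try:
        for table, columns in collection.items():
            names = list(columns)
            col_sql = ", ".join(f'"{n}"' for n in names)
            placeholders = ", ".join("?" for _ in names)
            create = f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})'
            insert = f'INSERT INTO "{table}" VALUES ({placeholders})'
            if isVerbose:
                print(insert)
            con.execute(create)
            # zip(*...) turns the column lists into row tuples.
            con.executemany(insert, list(zip(*columns.values())))
        con.commit()
    except sqlite3.Error as err:
        con.rollback()
        # Errors are returned rather than raised, as described above.
        return (type(err), str(err))
    return None


data = OrderedDict(runs=OrderedDict(name=["alpha", "beta"], temp=[300.5, 12.0]))
con = sqlite3.connect(":memory:")
result = ingest_sketch(con, data)
print(result)
print(con.execute("SELECT * FROM runs").fetchall())
```

Returning the error as a tuple means the caller has to check the result explicitly, for example with isinstance(result, tuple).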

ingest_table_helper(types, foreign_query=None, isVerbose=False)

Users do not interact with this function and should ignore it. Called within ingest_artifacts()

Helper function to create SQLite table based on a passed in schema.

types: DataType derived class that defines the string name, properties (dictionary of table names and table data), and units for each column in the schema.

foreign_query: default is None. A SQLite string detailing the foreign keys in this table.

isVerbose: default is False. Flag to print all create table SQLite statements

return: None

notebook(interactive=False)

Generates a Jupyter notebook displaying all the data in the specified SQLite database.

To account for multiple tables, the database is stored as a list of dataframes, where each table is a dataframe.

If the database has table relations, they are stored as a separate dataframe. If the database has a units table, each table’s units are stored in the attrs variable of its corresponding dataframe.

interactive: default is False. When set to True, creates an interactive Jupyter notebook, otherwise creates an HTML file.

return: None

process_artifacts(only_units_relations=False, isVerbose=False)

Reads in data from the SQLite database into a nested Ordered Dictionary, where keys are table names and values are Ordered Dictionaries of table data. If the database has PK/FK relations, they are stored in a table called dsi_relations.

Can only be called if a loaded SQLite database is a BACK-READ backend in a Core.Terminal workflow (check core.py for distinction)

only_units_relations: default is False. USERS SHOULD IGNORE THIS FLAG. Used by an internal sqlite.py function.

isVerbose: default is False. When set to True, prints all SQLite queries to select data and store in abstraction

return: Nested Ordered Dictionary of all data from the SQLite database
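
The shape of the returned structure can be sketched like this. The function process_sketch is hypothetical, and the per-table layout (column name mapped to a list of values) is an assumption about how the table data is organized; units and dsi_relations handling are omitted.

```python
import sqlite3
from collections import OrderedDict


def process_sketch(con):
    """Read every table into a nested OrderedDict of column lists (illustrative only)."""
    collection = OrderedDict()
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cursor = con.execute(f'SELECT * FROM "{table}"')
        # One empty list per column, keyed by column name.
        columns = OrderedDict((d[0], []) for d in cursor.description)
        for record in cursor:
            for name, cell in zip(columns, record):
                columns[name].append(cell)
        collection[table] = columns
    return collection


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (name TEXT, temp REAL)")
con.executemany("INSERT INTO runs VALUES (?, ?)", [("alpha", 300.5), ("beta", 12.0)])
print(process_sketch(con))
```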

process_units_helper()

Users do not interact with this function and should ignore it. Called within process_artifacts()

Helper function to create the SQLite database’s units table as an Ordered Dictionary. Only called if dsi_units table exists in the database.

return: units table as an Ordered Dictionary

put_artifacts_t(collection, tableName='TABLENAME', isVerbose=False)

DSI 1.0 FUNCTIONALITY - DEPRECATING SOON, DO NOT USE

Primary function for insertion of a collection of Artifacts metadata into a defined schema, with a table passthrough

collection: A Python Collection of an Artifact derived class that has multiple regular structures of a defined schema, filled with rows to insert.

tableName: A passthrough to define a table and set the name of a table

return: None

query_artifacts(query, isVerbose=False, dict_return=False)

Function that returns data from a SQLite database based on a specified SQL query. Data returned varies based on the dict_return flag explained below.

query: Must be a SELECT or PRAGMA query. If dict_return is True, then this can only be a simple query on one table, NO JOINS. Query CAN create new aggregate columns such as COUNT to include in the result regardless of dict_return.

isVerbose: default is False. Flag to print all Select table SQLite statements

dict_return: default is False. When set to True, return type is an Ordered Dict of data from the table specified in query.

return:

  • When query is of correct format and dict_return = False, return a list of database rows

  • When query is of correct format and dict_return = True, return an Ordered Dictionary of data for the table specified in query

  • When query is incorrect, return a tuple of (ErrorType, error message). Ex: (ValueError, “this is an error”)
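
The three return cases can be sketched with sqlite3 as follows. The function query_sketch is a hypothetical stand-in: the prefix check, the error types, and the column-to-list layout of the dictionary are assumptions, not the DSI implementation.

```python
import sqlite3
from collections import OrderedDict


def query_sketch(con, query, dict_return=False):
    """Run a SELECT or PRAGMA query and shape the result (illustrative only)."""
    if not query.lstrip().upper().startswith(("SELECT", "PRAGMA")):
        return (ValueError, "query must be a SELECT or PRAGMA statement")
    try:
        cursor = con.execute(query)
        rows = cursor.fetchall()
    except sqlite3.Error as err:
        return (type(err), str(err))
    if not dict_return:
        # Default case: a list of database rows.
        return rows
    # dict_return case: one list of values per result column.
    names = [d[0] for d in cursor.description]
    return OrderedDict((n, [r[i] for r in rows]) for i, n in enumerate(names))


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (name TEXT, temp REAL)")
con.executemany("INSERT INTO runs VALUES (?, ?)", [("alpha", 300.5), ("beta", 12.0)])
print(query_sketch(con, "SELECT name FROM runs"))
print(query_sketch(con, "SELECT COUNT(*) AS n FROM runs", dict_return=True))
print(query_sketch(con, "DROP TABLE runs"))
```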

class dsi.backends.sqlite.ValueObject

Data Structure used when returning search results from find, find_table, find_column, or find_cell

  • t_name: table name

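  • c_name: column name as a list. The length of the list varies based on the find function. Read the description of each one to understand the differences

  • value: the matched data. Its structure varies based on the find function (table data, column data, [min, max] range, row, or single cell value)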

  • row_num: row number. Is only important when finding a value in find_cell or find (which includes results from find_cell)

  • type: type of match for this specific ValueObject. {table, column, range, cell, row}
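
For readers who want a concrete picture of such a result object, the dataclass below mirrors the fields used by the find functions on this page. ValueObjectSketch is a hypothetical stand-in; the field defaults and type hints are assumptions and may not match the real ValueObject class.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ValueObjectSketch:
    """Hypothetical stand-in for ValueObject; defaults are assumptions."""
    t_name: str = ""
    c_name: List[str] = field(default_factory=list)
    value: Any = None
    row_num: Optional[int] = None
    type: str = ""


# A cell match, shaped like the find_cell description with row = False.
hit = ValueObjectSketch(t_name="runs", c_name=["temp"], value=12.0,
                        row_num=2, type="cell")
print(hit)
```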

SQLAlchemy

GUFI

class dsi.backends.gufi.Gufi(prefix, index, dbfile, table, column, verbose=False)

GUFI Datastore

__init__(prefix, index, dbfile, table, column, verbose=False)

prefix: prefix to GUFI commands

index: directory with GUFI indexes

dbfile: sqlite db file from DSI

table: table name from the DSI db we want to join on

column: column name from the DSI db to join on

verbose: print debugging statements or not

query_artifacts(query)

Retrieves GUFI’s metadata joined with a DSI database.

query: an SQL query into the dsi_entries table

Parquet

class dsi.backends.parquet.Parquet(filename, **kwargs)

Support for a Parquet back-end.

Parquet is a convenient format when metadata are larger than SQLite supports.

__init__(filename, **kwargs)

static get_cmd_output(cmd: list) → str

Runs a given command and returns the stdout if successful.

If stderr is not empty, an exception is raised with the stderr text.
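
The behavior described here can be sketched with the standard subprocess module. The function get_cmd_output_sketch is hypothetical, and the choice of RuntimeError is an assumption, since the page does not say which exception type is raised.

```python
import subprocess
import sys


def get_cmd_output_sketch(cmd: list) -> str:
    """Run cmd and return stdout; raise if anything was written to stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.stderr:
        # The exception type is an assumption for this sketch.
        raise RuntimeError(proc.stderr)
    return proc.stdout


# Use the current interpreter so the example does not depend on other tools.
print(get_cmd_output_sketch([sys.executable, "-c", "print('ok')"]))
```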

ingest_artifacts(collection)

Ingest artifacts into file at filename path.

notebook(collection, interactive=False)

Generate Jupyter notebook of Parquet data from filename.

query_artifacts()

Query Parquet data from filename.