Backends

Backends connect users to DSI Core middleware and allow DSI middleware data structures to read from and write to persistent external storage. Backends are modular to support user contribution: contributors are encouraged to offer custom backend abstract classes and backend implementations, and a contributed backend abstract class may extend another backend to inherit the properties of its parent. To be compatible with DSI core middleware, backends should create an interface to Python built-in data structures or data structures from the Python collections library. Backend extensions will be accepted only if they extend backends/tests to demonstrate the new Backend capability; we cannot accept pull requests that are not tested.

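As a sketch of what a minimal contributed backend can look like, the class below stages artifacts in an OrderedDict. The base class name Filesystem and the exact method set are assumptions for illustration; a real contribution should mirror an existing backend such as Sqlite and include unit tests in backends/tests.

    from collections import OrderedDict

    # Hypothetical base class name; a real contribution should extend
    # an existing DSI backend abstract class.
    class Filesystem:
        def __init__(self, filename):
            self.filename = filename

    class InMemory(Filesystem):
        """Illustrative backend that stages artifacts in an OrderedDict
        before flushing them to persistent storage."""

        def __init__(self, filename):
            super().__init__(filename)
            # DSI-compatible backends exchange data through Python
            # built-ins or collections types.
            self.store = OrderedDict()

        def put_artifacts(self, collection, isVerbose=False):
            # Merge a collection of column -> values mappings into the store.
            for key, values in collection.items():
                self.store.setdefault(key, []).extend(values)

        def get_artifacts(self):
            return self.store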

[Figure: the current DSI backend class hierarchy.]

class dsi.backends.sqlite.Artifact

Primary Artifact class that holds the database schema in memory. An Artifact is a generic construct that defines the schema for metadata stored in SQL tables.

class dsi.backends.sqlite.Sqlite(filename)

Primary storage class; inherits from the SQL class.

check_type(text)

Tests input text and returns a predicted compatible SQL type.

text: text string

return: string description of a SQL data type
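
A minimal sketch of this kind of typecast-based prediction is shown below. The function name and the exact SQL type strings returned are assumptions for illustration, not the library's implementation.

    def predict_sql_type(text):
        # Try progressively stricter casts; fall back to a string type.
        try:
            int(text)
            return "INT"
        except ValueError:
            pass
        try:
            float(text)
            return "FLOAT"
        except ValueError:
            return "VARCHAR"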

export_csv(rquery, tname, fname, isVerbose=False)

Function that outputs a csv file from the result of an already-executed query, rather than executing a query itself.

rquery: return of an already called query output

tname: name of the table for (all) columns to export

fname: target filename (including path) that will output the return query as a csv file

return: none

export_csv_query(query, fname, isVerbose=False)

Function that executes a given initial query and outputs its result as a csv file.

query: raw SQL query to be executed on current table

fname: target filename (including path) that will output the return query as a csv file

return: none
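
A usage sketch of the two export paths, using sqlquery (documented below); the database file, table name, and output paths are assumptions for illustration:

    from dsi.backends.sqlite import Sqlite

    store = Sqlite("data.db")  # hypothetical database file

    # Path 1: execute first, then export the returned result.
    result = store.sqlquery("SELECT * FROM dsi_entries;")
    store.export_csv(result, "dsi_entries", "/tmp/entries.csv")

    # Path 2: let the backend execute and export in one call.
    store.export_csv_query("SELECT * FROM dsi_entries;", "/tmp/entries.csv")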

get_artifact_list(isVerbose=False)

Function that returns a list of all of the Artifact names (represented as sql tables)

return: list of Artifact names
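
For example (database file and table names are illustrative):

    from dsi.backends.sqlite import Sqlite

    store = Sqlite("data.db")          # hypothetical database file
    names = store.get_artifact_list()  # e.g. ["dsi_entries", ...]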

put_artifact_type(types, isVerbose=False)

Primary function for defining the metadata Artifact schema.

types: DataType derived class that defines the string name, properties (named SQL type), and units for each column in the schema.

return: none

put_artifacts(collection, isVerbose=False)

Primary function for inserting a collection of Artifact metadata into a defined schema.

collection: A Python Collection of an Artifact derived class that has multiple regular structures of a defined schema, filled with rows to insert.

return: none
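
A usage sketch, assuming the collection is an OrderedDict mapping column names to equal-length lists of row values; the column names and data are illustrative:

    from collections import OrderedDict
    from dsi.backends.sqlite import Sqlite

    store = Sqlite("data.db")  # hypothetical database file
    collection = OrderedDict([
        ("filename", ["run1.out", "run2.out"]),
        ("size_bytes", [1024, 2048]),
    ])
    store.put_artifacts(collection, isVerbose=True)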

put_artifacts_csv(fname, tname, isVerbose=False)

Function for insertion of Artifact metadata into a defined schema by using a CSV file, where the first row of the CSV contains the column names of the schema. Any rows thereafter contain data to be inserted. Data types are automatically assigned based on typecasting and default to a string type if none can be found.

fname: filepath to the .csv file to be read and inserted into the database

tname: String name of the table to be inserted

return: none
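
For example (the CSV path and table name are assumptions):

    from dsi.backends.sqlite import Sqlite

    store = Sqlite("data.db")            # hypothetical database file
    store.put_artifacts_csv("runs.csv",  # first row holds column names
                            "dsi_entries")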

put_artifacts_lgcy(artifacts, isVerbose=False)

Legacy function for insertion of artifact metadata into a defined schema

artifacts: data_type derived class that has a regular structure of a defined schema, filled with rows to insert.

return: none

put_artifacts_only(artifacts, isVerbose=False)

Function for insertion of Artifact metadata into a defined schema as a Tuple

artifacts: DataType derived class that has a regular structure of a defined schema, filled with rows to insert.

return: none

put_artifacts_t(collection, tableName='TABLENAME', isVerbose=False)

Primary function for inserting a collection of Artifact metadata into a defined schema, with a table-name passthrough.

collection: A Python Collection of an Artifact derived class that has multiple regular structures of a defined schema, filled with rows to insert.

tableName: passthrough that defines the target table and sets its name

return: none

query_fctime(operator, ctime, isVerbose=False)

Function that queries file creation times within the filesystem metadata store

operator: operator input GT, LT, EQ as a modifier for a creation time search

ctime: creation time in POSIX format, see the utils dateToPosix conversion function

return: query list of filenames matching the creation time criteria with modifiers

query_fname(name, isVerbose=False)

Function that queries filenames within the filesystem metadata store

name: string name of a subsection of a filename to be searched

return: query list of filenames matching name string

query_fsize(operator, size, isVerbose=False)

Function that queries ranges of file sizes within the filesystem metadata store

operator: operator input GT, LT, EQ as a modifier for a filesize search

size: size in bytes

return: query list of filenames matching filesize criteria with modifiers
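
A usage sketch of the filesystem-metadata queries; the database file, name substring, and thresholds are illustrative:

    from dsi.backends.sqlite import Sqlite

    store = Sqlite("data.db")  # hypothetical database file

    # Files whose names contain "run".
    matches = store.query_fname("run")

    # Files larger than 1 MiB.
    large = store.query_fsize("GT", 1048576)

    # Files created before a POSIX timestamp (see the utils
    # dateToPosix conversion function).
    older = store.query_fctime("LT", 1700000000)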

sqlquery(query, isVerbose=False)

Function that provides a direct sql query passthrough to the database.

query: raw SQL query to be executed on current table

return: list containing the raw results of the original query

class dsi.backends.gufi.Gufi(prefix, index, dbfile, table, column, verbose=False)

GUFI Datastore

get_artifacts(query)

Retrieves GUFI’s metadata joined with a DSI database.

query: an sql query into the dsi_entries table

isVerbose = False

prefix: prefix to GUFI commands

index: directory with GUFI indexes

dbfile: sqlite db file from DSI

table: table name from the DSI db we want to join on

column: column name from the DSI db to join on
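
A construction sketch; all paths and names below are illustrative assumptions:

    from dsi.backends.gufi import Gufi

    gufi = Gufi(prefix="/usr/local/bin",      # location of GUFI commands
                index="/path/to/gufi_index",  # directory with GUFI indexes
                dbfile="data.db",             # DSI sqlite database file
                table="dsi_entries",          # DSI table to join on
                column="filename")            # DSI column to join on
    results = gufi.get_artifacts("SELECT * FROM dsi_entries;")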

class dsi.backends.parquet.Parquet(filename, **kwargs)

Support for a Parquet back-end.

Parquet is a convenient format when metadata are larger than SQLite supports.

get_artifacts()

Get Parquet data from filename.

static get_cmd_output(cmd: list) → str

Runs a given command and returns the stdout if successful.

If stderr is not empty, an exception is raised with the stderr text.
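
A minimal sketch of what such a wrapper can look like, using subprocess; this is an illustration of the described behavior, not the library's implementation:

    import subprocess

    def get_cmd_output(cmd: list) -> str:
        # Run the command, capturing stdout and stderr separately.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.stderr:
            # Surface any stderr text as an exception.
            raise Exception(proc.stderr)
        return proc.stdout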

inspect_artifacts(collection, interactive=False)

Populate a Jupyter notebook with tools required to look at Parquet data.

put_artifacts(collection)

Put artifacts into file at filename path.
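
A round-trip sketch; the file path and collection shape are assumptions for illustration:

    from collections import OrderedDict
    from dsi.backends.parquet import Parquet

    store = Parquet("data.parquet")  # hypothetical output file
    collection = OrderedDict([
        ("filename", ["run1.out", "run2.out"]),
        ("size_bytes", [1024, 2048]),
    ])
    store.put_artifacts(collection)
    artifacts = store.get_artifacts()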