Core
The DSI Core middleware defines the Terminal and Sync concepts. An instantiated Terminal is the human/machine DSI interface that connects Readers/Writers and DSI backends. An instantiated Sync supports data movement between local and remote locations and captures metadata documentation.
Core: Terminal
The Terminal class is a structure through which users can interact with Readers/Writers and Backends as “module” objects. Each reader/writer/backend can be “loaded” to be ready for use and users can interact with backends by ingesting, querying, processing, or finding data, as well as generating an interactive notebook of the data.
All relevant functions are listed below. The Examples section shows different workflows that use the Terminal class.
- Notes for users:
  - All loaded DSI Writers must be followed by a call to Terminal.transload to execute them. Readers are executed automatically upon loading.
  - Terminal.load_module: to group related tables of data from a DSI Reader under the same name, use the target_table_prefix input to specify a shared prefix. When accessing data from these tables, remember that their names include the specified prefix. Ex: collection1__math, collection1__english
  - Terminal.artifact_handler: the ‘notebook’ interaction_type stores data from the first loaded backend, NOT the existing DSI abstraction, in a new notebook file.
  - Terminal.artifact_handler: review this function's description below to see which backends are targeted by each interaction_type.
  - Terminal find functions only access the first loaded backend.
  - Terminal.unload_module: removes the last loaded backend of the specified mod_name. Ex: if there are 2 loaded Sqlite backends, the second is unloaded.
  - Terminal handles errors from any loaded DSI/user-written modules (Readers/Writers/backends). If writing an external Reader/Writer/backend, return any intentionally caught errors as a tuple (error type, error message). Ex: (ValueError, “this is an error”)
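The error-tuple convention in the last note can be sketched as follows. This is a minimal illustration rather than DSI code: `read_numbers` and `run_module` are hypothetical names, and the way the caller re-raises the error is an assumption about how a host could surface such a tuple.

```python
# Hypothetical external reader step: a caught error is returned as a
# (error type, error message) tuple instead of being raised.
def read_numbers(text):
    try:
        return [float(token) for token in text.split(",")]
    except ValueError:
        return (ValueError, f"could not parse numbers from {text!r}")


# Hypothetical caller: detects the tuple convention and raises the error.
def run_module(text):
    result = read_numbers(text)
    is_error = (
        isinstance(result, tuple)
        and len(result) == 2
        and isinstance(result[0], type)
        and issubclass(result[0], BaseException)
    )
    if is_error:
        error_type, message = result
        raise error_type(message)
    return result
```

Returning the tuple keeps the decision about how to report the failure with the caller, which is the role the notes assign to the Terminal.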
- class dsi.core.Terminal(debug=0, backup_db=False, runTable=False)
An instantiated Terminal is the DSI human/machine interface.
Terminals are a home for Plugins and an interface for Backends. Backends may be back-reads or back-writes. Plugins may be writers or readers. See documentation for more information.
- __init__(debug=0, backup_db=False, runTable=False)
Initialization function to configure optional DSI core parameters.
Optional flags
- debug : int, default=0
{0: off, 1: user debug log, 2: user + developer debug log}
When set to 1 or 2, debug info is written to a local debug.log text file with various benchmarks.
- backup_db : bool, default=False
If True, creates a backup of the current backend database before committing any new changes.
- runTable : bool, default=False
If True, a ‘runTable’ is created, and timestamped each time new data/metadata is ingested. Recommended for in-situ use-cases.
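The idea behind backup_db, copying the database file before new changes are committed, can be sketched with the standard library. This is an assumption-laden illustration and not how DSI implements the flag: `backup_then_write`, the `.backup` suffix, and the `runs` table are all hypothetical.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def backup_then_write(db_path, values):
    """Copy db_path to '<name>.backup' if it exists, then insert values."""
    db_path = Path(db_path)
    backup_path = db_path.with_name(db_path.name + ".backup")  # hypothetical naming
    if db_path.exists():
        shutil.copy2(db_path, backup_path)  # snapshot taken before the new commit
    con = sqlite3.connect(db_path)
    with con:  # commits the inserts on success, rolls back on error
        con.execute("CREATE TABLE IF NOT EXISTS runs (value INTEGER)")
        con.executemany("INSERT INTO runs VALUES (?)", [(v,) for v in values])
    con.close()
    return backup_path


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "data.db"
    backup_then_write(db, [1, 2])        # first write: nothing to back up yet
    backup = backup_then_write(db, [3])  # second write: backup holds the old state
    con = sqlite3.connect(backup)
    backup_rows = con.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
    con.close()
```

After the second call the backup file still holds the two rows from the first write, while the live database holds three.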
- add_external_python_module(mod_type, mod_name, mod_path)
Adds an external, meaning not from the DSI repo, Python module to the module_collection. Afterwards, load_module can be used to load a DSI module from the added Python module.
Note: mod_type is needed because each Python module only implements plugins or backends.
View Example 9: External Readers/Writers to see how to use this function.
- artifact_handler(interaction_type, query=None, **kwargs)
Interact with loaded DSI backends by ingesting or retrieving data from them.
- interaction_type : str
Specifies the type of action to perform. Accepted values:
‘ingest’ or ‘put’: ingests active DSI abstraction into all loaded BACK-WRITE backends (BACK-READ backends ignored)
if backup_db flag = True in Core instance, a backup is created prior to ingesting data
‘query’ or ‘get’: retrieves data from first loaded backend based on a specified ‘query’
‘notebook’ or ‘inspect’: generates an interactive Python notebook with all data from first loaded backend
‘process’ or ‘read’: overwrites current DSI abstraction with all data from first loaded BACK-READ backend
- query : str, optional
Required only when interaction_type is ‘query’ or ‘get’; it is passed to the backend’s query_artifact() method.
- kwargs
Additional keyword arguments passed to underlying backend functions. View relevant functions in the DSI backend file to understand other arguments to pass in.
- return: only when interaction_type = ‘query’ or ‘get’
By default, returns the query result as a pandas DataFrame. If specified, returns it as an OrderedDict.
A DSI Core Terminal may load zero or more Backends with storage functionality.
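The accepted interaction_type values and the backends they target, as listed above, can be restated as a lookup table. This is only a reading aid: `ACTIONS` and `resolve` are hypothetical names, not part of DSI.

```python
# Each documented action, its accepted spellings, and the backend it targets.
ACTIONS = {
    "ingest": {"aliases": ("ingest", "put"),
               "target": "all loaded back-write backends"},
    "query": {"aliases": ("query", "get"),
              "target": "first loaded backend"},
    "notebook": {"aliases": ("notebook", "inspect"),
                 "target": "first loaded backend"},
    "process": {"aliases": ("process", "read"),
                "target": "first loaded back-read backend"},
}

# Reverse index: accepted spelling -> action name.
ALIAS_TO_ACTION = {
    alias: action
    for action, info in ACTIONS.items()
    for alias in info["aliases"]
}


def resolve(interaction_type):
    """Return (action, targeted backend) for an accepted interaction_type."""
    if interaction_type not in ALIAS_TO_ACTION:
        raise ValueError(f"unknown interaction_type: {interaction_type!r}")
    action = ALIAS_TO_ACTION[interaction_type]
    return action, ACTIONS[action]["target"]
```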
- close()
Immediately closes all active modules: backends, plugin writers, plugin readers
Clears out the current DSI abstraction
NOTE - This step cannot be undone.
- display(table_name, num_rows=25, display_cols=None)
Prints data from a specified table in the first loaded backend.
- table_name : str
Name of the table to display.
- num_rows : int, optional, default=25
Number of rows to print. If the table contains fewer rows, only those are shown.
- display_cols : list of str, optional
List of specific column names to display from the table.
If None (default), all columns are displayed.
- find(query_object)
Find all instances of query_object across all tables, columns, and cells in the first loaded backend.
- query_object : any
The object to search for in the backend. Can be of any type, including str, float, or int.
- return : list
A list of backend-specific result objects, each representing a match for query_object. The structure of each object depends on the backend implementation.
Refer to the first loaded backend’s documentation to understand the structure of the objects in this list
- find_cell(query_object, row=False)
Find all cells that match the query_object in the first loaded backend.
- query_object : any
The object to search for in the backend. Can be of any type, including str, float, or int.
- row : bool, default=False
If True, includes the entire row of data for each matching cell in the return value.
If False, includes only the value of the matching cell.
- return : list
A list of backend-specific result objects, each representing a match for query_object. The structure of each object depends on the backend implementation.
Refer to the first loaded backend’s documentation to understand the structure of the objects in this list
- find_column(query_object, range=False)
Find all columns whose name matches query_object in the first loaded backend.
- query_object : str
The object to search for in the backend. Must be a str.
- range : bool, optional, default=False
If True, the data range of every numerical column that matches query_object is included in the return value.
If False, the data of each column that matches query_object is included in the return value.
- return : list
A list of backend-specific result objects, each representing a match for query_object. The structure of each object depends on the backend implementation.
Refer to the first loaded backend’s documentation to understand the structure of the objects in this list
- find_helper(query_object, return_object, start, find_type)
Users should not call this. Used internally.
Helper function to print/log information for all core find functions: find(), find_table(), find_column(), find_cell()
- find_relation(query_object)
Finds all rows in the first table of the first loaded backend that satisfy a column-level condition. query_object must include a column, operator, and value to define a valid relational condition.
- query_object : str
A relational expression combining column, operator, and value. Ex: “age > 4”, “age < 4”, “age >= 4”, “age <= 4”, “age = 4”, “age == 4”, “age != 4”, “age (4, 8)”.
- return : list
A list of backend-specific result objects, each representing a row that satisfies the relation. The structure of each object depends on the backend implementation.
Refer to the first loaded backend’s documentation to understand the structure of the objects in this list
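To make the shape of a valid query_object concrete, here is a minimal sketch that splits the documented example strings into column, operator, and value. It is not DSI's parser: `parse_relation` is a hypothetical helper, and labelling the “age (4, 8)” form as a range between two bounds is an assumption.

```python
import re

# Two-character operators come first so ">=" is not read as ">" then "=".
_COMPARISON = re.compile(r"^\s*(\w+)\s*(>=|<=|==|!=|>|<|=)\s*(.+?)\s*$")
# Form such as "age (4, 8)"; assumed here to mean a range with two bounds.
_RANGE = re.compile(r"^\s*(\w+)\s*\(\s*(.+?)\s*,\s*(.+?)\s*\)\s*$")


def parse_relation(expr):
    """Split a relation string into (column, operator, value)."""
    match = _RANGE.match(expr)
    if match:
        column, low, high = match.groups()
        return column, "range", (low, high)
    match = _COMPARISON.match(expr)
    if match:
        return match.groups()
    raise ValueError(f"not a relational expression: {expr!r}")
```

Values are left as strings; converting them to numbers would depend on the column type in the backend.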
- find_table(query_object)
Find all tables whose name matches query_object in the first loaded backend.
- query_object : str
The object to search for in the backend. Must be a str.
- return : list
A list of backend-specific result objects, each representing a match for query_object. The structure of each object depends on the backend implementation.
Refer to the first loaded backend’s documentation to understand the structure of the objects in this list
- get_current_abstraction(table_name=None)
Returns the current DSI abstraction as a nested OrderedDict.
- The abstraction is organized such that:
The outer OrderedDict has table names as keys.
Each value is an inner OrderedDict representing a table, where keys are column names and values are lists of column data.
- table_name : str, optional, default=None
If specified, returns only the OrderedDict corresponding to that table.
If None (default), returns the full nested OrderedDict containing all tables.
- return : OrderedDict
If table_name is None: returns a nested OrderedDict of all tables.
If table_name is provided: returns a single OrderedDict for that table.
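The nested layout described above can be built by hand to see how tables, columns, and rows relate. The table names, column names, and values below are made-up sample data, not output from DSI.

```python
from collections import OrderedDict

# Outer keys are table names; each inner OrderedDict maps a column name
# to the list of values in that column.
abstraction = OrderedDict()
abstraction["input"] = OrderedDict(sim_id=[1, 2], end_step=[10, 20])
abstraction["output"] = OrderedDict(sim_id=[1, 2], energy=[0.5, 0.7])

table_names = list(abstraction)             # tables, in insertion order
input_columns = list(abstraction["input"])  # columns of one table

# Data is stored by column, so a row is rebuilt by indexing every column.
first_row = {column: values[0] for column, values in abstraction["input"].items()}
```

Because every column is a list, all columns of a table are expected to have the same length.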
- get_table(table_name, dict_return=False)
Returns all data from a specified table in the first loaded backend.
- table_name : str
Name of the table to retrieve data from.
- dict_return : bool, optional, default=False
If True, returns the data as an OrderedDict. If False (default), returns the data as a pandas DataFrame.
- get_table_names(query)
Extracts and returns all table names referenced in a given query.
- query : str
A query string written in a database language (typically SQL).
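As a rough picture of what extracting table names from a query involves, the sketch below scans a SQL string for identifiers that follow FROM or JOIN. It is a deliberately naive approach and not necessarily how get_table_names works: `table_names_in` is a hypothetical helper, and it misses comma-separated table lists and quoted names.

```python
import re

# An identifier that directly follows FROM or JOIN (case-insensitive).
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def table_names_in(query):
    """Return referenced table names in order of first appearance."""
    names = []
    for name in _TABLE_REF.findall(query):
        if name not in names:
            names.append(name)
    return names
```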
- list()
Prints a list of all tables and their dimensions in the first loaded backend
- list_available_modules(mod_type)
List available DSI modules of an arbitrary module type.
This method is useful for Core Terminal setup. Plugin and Backend type DSI modules are supported, but this getter can be extended to support any new DSI module types which are added.
Note: self.VALID_MODULES refers to _DSI_ Modules; however, DSI Modules are classes, hence the naming idiosyncrasies below.
- list_loaded_modules()
List DSI modules which have already been loaded.
These Plugins and Backends are active or ready to execute a post-processing task.
- load_module(mod_type, mod_name, mod_function, **kwargs)
Load a DSI module from the available Plugin and Backend module collection.
DSI modules that are not explicitly listed by list_available_modules may also be loaded. This flexibility ensures that advanced users can access higher level abstractions.
We expect most users will work with module implementations rather than templates, but all high level class abstractions are accessible with this method.
If a loaded module has mod_type=’plugin’ & mod_function=’reader’, it is automatically activated and then unloaded as well. Therefore, a user does not have to activate it separately with transload() (only used by plugin writers) or call unload_module()
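The target_table_prefix naming described in the notes for users can be pictured with a small sketch. It assumes, based only on the documented names collection1__math and collection1__english, that prefix and table name are joined by a double underscore; `add_prefix` and `split_prefix` are hypothetical helpers, not DSI functions.

```python
SEPARATOR = "__"  # assumption inferred from names such as collection1__math


def add_prefix(prefix, table_names):
    """Return the table names as they would appear with a shared prefix."""
    return [f"{prefix}{SEPARATOR}{name}" for name in table_names]


def split_prefix(full_name):
    """Return (prefix, table) for a prefixed name, or (None, name) otherwise."""
    prefix, separator, table = full_name.partition(SEPARATOR)
    if not separator:
        return None, full_name
    return prefix, table
```

This is why later calls that take a table name need the full prefixed name rather than the name used inside the Reader.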
- num_tables()
Prints number of tables in the first loaded backend
- overwrite_table(table_name, collection, backup=False)
Overwrites specified table(s) in the first loaded backend with the provided Pandas DataFrame(s).
If a relational schema has been previously loaded into the backend, it will be reapplied to the table. Note: This function permanently deletes the existing table and its data, before inserting the new data.
- table_name : str or list
If str, name of the table to overwrite in the backend.
If list, list of all tables to overwrite in the backend.
- collection : pandas.DataFrame or list of pandas.DataFrames
If one item, a DataFrame containing the updated data will be written to the table.
If a list, all DataFrames with updated data will be written to their own table.
- backup : bool, optional, default=False
If True, creates a backup file for the DSI backend before updating its data.
If False (default), only updates the data.
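The delete-then-insert behavior noted for overwrite_table can be illustrated directly against SQLite with the standard library. This is a sketch of the general pattern under stated assumptions, not DSI's implementation: `overwrite_table_sketch` is a hypothetical function, it takes plain rows instead of a DataFrame, and it does not reapply any relational schema.

```python
import sqlite3


def _quote(identifier):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def overwrite_table_sketch(con, table_name, columns, rows):
    """Drop table_name if it exists, recreate it, and insert rows."""
    table = _quote(table_name)
    column_list = ", ".join(_quote(column) for column in columns)
    placeholders = ", ".join("?" for _ in columns)
    with con:  # commits on success, rolls back the inserts on error
        con.execute(f"DROP TABLE IF EXISTS {table}")  # old data is gone for good
        con.execute(f"CREATE TABLE {table} ({column_list})")
        con.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


# Demonstration on an in-memory database.
con = sqlite3.connect(":memory:")
overwrite_table_sketch(con, "input", ["sim_id", "end_step"], [(1, 10), (2, 20)])
overwrite_table_sketch(con, "input", ["sim_id", "end_step", "new_column"], [(1, 35, 200)])
rows_after = con.execute('SELECT * FROM "input"').fetchall()
con.close()
```

The second call replaces both the rows and the column set, which is why the documentation warns that the existing table and its data are permanently deleted.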
- summary(table_name=None, collection=False)
Returns/Prints numerical metadata from tables in the first loaded backend.
- table_name : str, optional
If specified, only the numerical metadata for that table will be returned/printed.
If None (default), metadata for all available tables is returned/printed.
- collection : bool, optional, default=False
If True, returns either a list of DataFrames (table_name = None), or a single DataFrame of metadata
If False (default), prints metadata from all tables (table_name = None), or just a single table
- transload(**kwargs)
Transloading signals to the DSI Core Terminal that Plugin set up is complete.
Activates all loaded plugin writers by generating their various output files such as an ER Diagram or an image of a table plot
All loaded plugin writers will be unloaded after activation, so there is no need to separately call unload_module() for them
- unload_module(mod_type, mod_name, mod_function)
Unloads a specific DSI module from the active_modules collection.
Primarily used when unloading backends, as plugin readers and writers are automatically unloaded elsewhere.
- update_abstraction(table_name, table_data)
Updates the DSI abstraction by creating/overwriting the specified table_name with the input table_data.
- table_name : str
Name of the table to update/create.
- table_data : OrderedDict or pandas DataFrame
The new data to store in the table. If it is an OrderedDict:
Keys are column names.
Values are lists representing column data.
Core: Sync
The DSI Core middleware also defines data management functionality in Sync.
The purpose of Sync is to provide file metadata documentation and data movement capabilities when moving data to/from local and remote locations. The purpose of data documentation is to capture and archive metadata (i.e. the location of local file structures, their access permissions, file sizes, and creation/access/modification dates) and track their movement to the remote location for future access.
The primary functions, Copy, Move, and Get, serve as mechanisms to copy data, move data, or retrieve data from remote locations by creating a DSI database in the process, or retrieving an existing DSI database that contains the location(s) of the target data.
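The kind of file metadata listed above (location, permissions, size, and timestamps) can be collected with the standard library, as in this minimal sketch. It does not show what Sync stores: `file_metadata` and its dictionary keys are hypothetical.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def _iso(timestamp):
    """Format a POSIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def file_metadata(path):
    """Collect location, permissions, size, and timestamps for one file."""
    info = os.stat(path)
    return {
        "location": os.path.abspath(path),
        "permissions": stat.filemode(info.st_mode),  # e.g. '-rw-r--r--'
        "size_bytes": info.st_size,
        "accessed": _iso(info.st_atime),
        "modified": _iso(info.st_mtime),
        # st_ctime is creation time on Windows, metadata-change time on Unix.
        "created_or_changed": _iso(info.st_ctime),
    }


# Demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w") as handle:
        handle.write("hello")
    metadata = file_metadata(sample)
```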
- class dsi.core.Sync(project_name='test')
A class defined to assist in data management activities for DSI
Sync is where data movement functions such as copy (to remote location) and sync (local filesystem with remote) exist.
- __init__(project_name='test')
- copy(tool='copy', isVerbose=False)
Helper function to perform the data copy over using a preferred API
- dircrawl(filepath)
Crawls the root ‘filepath’ directory and returns files
filepath: source filepath to be crawled
return: returns crawled file-list
- get(project_name='Project')
Helper function that searches remote location based on project name, and retrieves DSI database
- populate(local_loc, remote_loc, isVerbose=False)
Helper function to gather filesystem information, local and remote locations to create a filesystem entry in a new or existing database
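The directory crawl that dircrawl describes, walking a root path and returning the files found, can be sketched with os.walk. This is an illustration of the technique only: `crawl` is a hypothetical function, and returning sorted paths relative to the root is an assumption.

```python
import os
import tempfile


def crawl(filepath):
    """Return every file under filepath as a path relative to filepath."""
    found = []
    for root, _dirs, files in os.walk(filepath):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, filepath))
    return sorted(found)


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, relative), "w") as handle:
            handle.write("data")
    crawled = crawl(tmp)
```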
Examples
Examples below display various ways users can incorporate DSI into their data science workflows.
They are located in examples/developer/ and must be run from that directory.
Most of them either load or refer to data from examples/clover3d/.
Example 1: Intro use case
Baseline use of DSI to list Modules
# examples/developer/1.baseline.py
from dsi.core import Terminal
#optional parameters
# debug = 0, 1, 2 (creates log file to analyze runtime)
# backup_db = True/False (creates backup database prior to ingesting data)
# runTable = True/False (creates a runTable to organize rows in all tables by ingest number)
base_terminal = Terminal(debug = 0, backup_db = False, runTable = False)
print(base_terminal.list_available_modules('plugin'))
# ['GitInfo', 'Hostname', 'SystemKernel', 'Bueno', 'Csv']
print(base_terminal.list_available_modules('backend'))
# ['Gufi', 'Sqlite', 'Parquet']
print(base_terminal.list_loaded_modules())
# {'writer': [],
# 'reader': [],
# 'back-read': [],
# 'back-write': []}
Example 2: Ingest data
Loading a Cloverleaf reader and ingesting that data into a Sqlite DSI backend
# examples/developer/2.ingest.py
from dsi.core import Terminal
terminal_ingest = Terminal()
terminal_ingest.load_module('plugin', 'Cloverleaf', 'reader', folder_path="../clover3d/")
terminal_ingest.load_module('backend','Sqlite','back-write', filename='data.db')
terminal_ingest.artifact_handler(interaction_type='ingest')
Example 3: Complex schema with data
Ingesting data from a Cloverleaf Reader to a DSI backend along with a complex schema stored in a JSON file. Read Cloverleaf (Complex Schemas) to better understand how to structure this schema JSON file for the Schema Reader
# examples/developer/3.schema.py
from dsi.core import Terminal
terminal = Terminal()
terminal.load_module('plugin', 'Schema', 'reader', filename="../test/example_schema.json")
terminal.load_module('plugin', 'Cloverleaf', 'reader', folder_path="../clover3d/")
terminal.load_module('backend','Sqlite','back-write', filename='schema_data.db')
terminal.artifact_handler(interaction_type='ingest')
terminal.load_module('plugin', 'ER_Diagram', 'writer', filename = 'er_diagram.png')
terminal.transload()
Example 4: Visualize data
Printing various data and metadata from a DSI backend - number of tables, list of tables, actual table data, and summary of table statistics
# examples/developer/4.visualize.py
from dsi.core import Terminal
terminal_visualize = Terminal()
# Run 3.schema.py so schema_data.db is not empty
terminal_visualize.load_module('backend','Sqlite','back-write', filename='schema_data.db')
terminal_visualize.num_tables()
terminal_visualize.list()
# prints all data from 'input'
terminal_visualize.display("input")
# optional input to specify number of rows from 'input' to print
terminal_visualize.display("input", 2)
# optional input to specify which columns to print
terminal_visualize.display("input", 2, ["sim_id", "state1_density", "state2_density", "initial_timestep", "end_step"])
# prints numerical stats for every table in a backend
terminal_visualize.summary()
# prints numerical stats for only 'input'
terminal_visualize.summary("input")
# prints numerical stats for only 'input' and prints first 5 rows of the actual table
terminal_visualize.summary("input", 5)
terminal_visualize.close()
Example 5: Process and Write data
Processing data from a Sqlite DSI backend and using Writers to generate an ER diagram, table plot, and CSV file.
# examples/developer/5.process.py
from dsi.core import Terminal
terminal_process = Terminal()
# Run 3.schema.py so schema_data.db is not empty
terminal_process.load_module('backend','Sqlite','back-read', filename='schema_data.db')
terminal_process.artifact_handler(interaction_type="process")
terminal_process.load_module('plugin', 'ER_Diagram', 'writer', filename = 'er_diagram.png')
terminal_process.load_module('plugin', "Table_Plot", "writer", table_name = "output", filename = "output_plot.png")
terminal_process.load_module('plugin', "Csv_Writer", "writer", table_name = "input", filename = "input.csv")
# After loading a plugin WRITER, need to call transload() to execute it
terminal_process.transload()
Example 6: Find data
Finding data from a Sqlite DSI backend - tables, columns, cells, or all matches. If matching data is found, it is returned as a list of ValueObject(). Refer to each backend’s ValueObject() description in Backends, as its structure varies by backend.
# examples/developer/6.find.py
from dsi.core import Terminal
terminal_find = Terminal()
# Run 3.schema.py so schema_data.db is not empty
terminal_find.load_module('backend','Sqlite','back-write', filename='schema_data.db')
## TABLE match - return matching table data
data = terminal_find.find_table("input")
for val in data:
    print(val.t_name, val.c_name, val.value, val.row_num, val.type)
## COLUMN match - return matching column data
data = terminal_find.find_column("density")
for val in data:
    print(val.t_name, val.c_name, val.value, val.row_num, val.type)
## RANGE match (range = True) - return [min, max] of matching cols
data = terminal_find.find_column("density", range = True)
for val in data:
    print(val.t_name, val.c_name, val.value, val.row_num, val.type)
## CELL match - return the cells which match the search term
data = terminal_find.find_cell("rectangle")
for val in data:
    print(val.t_name, val.c_name, val.value, val.row_num, val.type)
## ROW match (row = True) - return the rows where cells match the search term
data = terminal_find.find_cell("rectangle", row = True)
for val in data:
    print(val.t_name, val.c_name, val.value, val.row_num, val.type)
## ALL match - find all instances where the search term is found: table, column, cell
data = terminal_find.find("Jun")
for val in data:
    print(val.t_name, val.c_name, val.value, val.row_num, val.type)
Example 7: Query data
Querying data from a Sqlite DSI backend.
Users can either use artifact_handler('query', SQL_query) to retrieve the data selected by a query, or get_table() to retrieve all data from a specified table.
# examples/developer/7.query.py
from dsi.core import Terminal
terminal_query = Terminal()
# Run 3.schema.py so schema_data.db is not empty
terminal_query.load_module('backend','Sqlite','back-write', filename='schema_data.db')
data = terminal_query.artifact_handler(interaction_type='query', query = "SELECT * FROM input;")
print(data)
# Use get_table() to retrieve all data from a table
data = terminal_query.get_table(table_name="input")
print(data)
Example 8: Overwrite data
Overwriting a table using modified data from get_table()
# examples/developer/8.overwrite.py
from dsi.core import Terminal
terminal_overwrite = Terminal()
# Run 3.schema.py so schema_data.db is not empty
terminal_overwrite.load_module('backend','Sqlite','back-write', filename='schema_data.db')
data = terminal_overwrite.get_table(table_name="input") # data in form of Pandas DataFrame
data["new_column"] = 200 # creating new column
data["end_step"] = 35 # editing existing column
terminal_overwrite.overwrite_table(table_name = "input", collection = data)
terminal_overwrite.display("input", num_rows=5, display_cols= ["sim_id", "state1_density", "state2_density", "end_step", "new_column"])
Example 9: External Readers/Writers
Temporarily adding an external data reader to DSI, allowing DSI to interact with the associated data across all actions.
In this instance, the data is ingested into a backend and viewed using display()
# examples/developer/9.external_plugin.py
from dsi.core import Terminal
term = Terminal()
# Second input is name of the python file where the Reader/Writer is stored
# Third input is full filepath to the python file
term.add_external_python_module('plugin', 'text_file_reader', 'text_file_reader.py')
print(term.list_available_modules('plugin')) # includes TextFile at end of list
# Second input must be the exact spelling of the class name in the external file
term.load_module('plugin', 'TextFile', 'reader', filenames = "../test/test.txt")
term.load_module('backend','Sqlite','back-write', filename='text_data.db')
term.artifact_handler(interaction_type='ingest')
term.display("text_file") #name of the table created in the TextFile external reader
text_file_reader:
from collections import OrderedDict
from pandas import DataFrame, read_csv, concat
from dsi.plugins.file_reader import FileReader

class TextFile(FileReader):
    """
    External Plugin to read in an individual or a set of text files.
    Assuming all text files have data for same table
    """
    def __init__(self, filenames, **kwargs):
        """
        `filenames`: one text file or a list of text files to be ingested
        """
        super().__init__(filenames, **kwargs)
        if isinstance(filenames, str):
            self.text_files = [filenames]
        else:
            self.text_files = filenames
        self.text_file_data = OrderedDict()

    def add_rows(self) -> None:
        """
        Parses text file data and creates an ordered dict whose keys are table names and values are an ordered dict for each table.
        """
        total_df = DataFrame()
        for filename in self.text_files:
            temp_df = read_csv(filename)
            total_df = concat([total_df, temp_df], axis=0, ignore_index=True)
        self.text_file_data["text_file"] = OrderedDict(total_df.to_dict(orient='list'))
        self.set_schema_2(self.text_file_data)
Example 10: Generate notebook
Generating a Python notebook (typically a Jupyter notebook) from a Sqlite DSI backend to view data interactively.
# examples/developer/10.notebook.py
from dsi.core import Terminal
terminal_notebook = Terminal()
#read data
terminal_notebook.load_module('plugin', 'Schema', 'reader', filename="../test/example_schema.json")
terminal_notebook.load_module('plugin', 'Cloverleaf', 'reader', folder_path="../clover3d/")
#ingest data to Sqlite backend
terminal_notebook.load_module('backend','Sqlite','back-write', filename='jupyter_data.db')
terminal_notebook.artifact_handler(interaction_type='ingest')
#generate Jupyter notebook
terminal_notebook.artifact_handler(interaction_type="notebook")