Python API
Users can interact with DSI modules through the DSI class, which provides an interface for Readers, Writers, and Backends.
The class is documented below and implemented in dsi/dsi.py. Example workflows using these functions can be seen in the following section: User Examples
Dsi: DSI
The DSI class is a user-level class that encapsulates the Terminal and Sync classes from DSI Core. DSI interacts with several functions within Terminal and Sync without requiring the user to differentiate them. The functionality has been simplified to improve user experience and reduce complexity.
When creating an instance of DSI(), users can optionally specify the type of backend and the filename to use.
If neither is provided, a temporary backend is automatically created, allowing users to interact with their data.
Read the __init__ documentation below for more details on the supported backend types.
Users should use read() to load data into DSI and write() to export data from DSI into supported external formats.
Their respective list functions, list_readers() and list_writers(), print all valid readers/writers that can be used.
The primary backend interactions are find(), query(), and get_table(), where users can print a search result or retrieve the result as a collection of data.
If users modify these collections, they can call update() to apply the changes to the active backend. Users must NOT edit any columns beginning with `dsi_`. Read update() below to better understand its behavior.
Users can also view various data/metadata of an active backend with list(), num_tables(), display(), and summary().
- Notes for users:
When using a complex schema, users must call schema() prior to read() to store the relations with the associated data.
If the input to update() is a modified output from query(), the existing table will be overwritten. Ensure data is secure or set the backup flag in update() to create a backup database.
Read the DSI Data Cards section to learn which data card standards are supported and where to find templates compatible with DSI.
- class dsi.dsi.DSI(filename='.temp.db', backend_name='Sqlite')
A user-facing interface for DSI’s Core middleware.
The DSI Class abstracts Core.Terminal for managing metadata and Core.Sync for data management and movement.
- __init__(filename='.temp.db', backend_name='Sqlite')
Initializes DSI by activating a backend for data operations; default is a Sqlite backend for temporary data analysis. If users specify filename, data is saved to a permanent backend file. After initialization, users can call read(), find(), update(), query(), write(), or any backend printing operation.
- filename: str, optional
If not specified, a temporary, hidden backend file is created for users to analyze their data. If specified and backend file already exists, it is activated for a user to explore its data. If specified and backend file does not exist, a file with this name is created.
- Accepted file extensions:
If backend_name = “Sqlite” → .db, .sqlite, .sqlite3
If backend_name = “DuckDB” → .duckdb, .db
- backend_name: str, optional
Name of the backend to activate. Must be either “Sqlite” or “DuckDB”. Default is “Sqlite”.
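The extension rules above can be checked before constructing DSI(). The following is a minimal sketch, assuming a hypothetical helper named check_backend_filename that is not part of DSI; the mapping is copied from the list above, and the comparison is case-sensitive by assumption.

```python
from pathlib import Path

# Accepted extensions per backend, copied from the documentation above.
ACCEPTED_EXTENSIONS = {
    "Sqlite": {".db", ".sqlite", ".sqlite3"},
    "DuckDB": {".duckdb", ".db"},
}


def check_backend_filename(filename, backend_name="Sqlite"):
    """Hypothetical helper (not a DSI function): True if the extension fits the backend."""
    if backend_name not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"backend_name must be one of {sorted(ACCEPTED_EXTENSIONS)}")
    # Path.suffix returns the final extension, e.g. ".db" for "data.db".
    return Path(filename).suffix in ACCEPTED_EXTENSIONS[backend_name]


print(check_backend_filename("data.db"))                # True
print(check_backend_filename("data.duckdb", "Sqlite"))  # False
```

Whether DSI itself rejects or renames a mismatched filename is not stated here, so a check like this is only a convenience before calling DSI(filename, backend_name).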
- close()
Closes the connection to the active backend and clears all loaded DSI modules.
- display(table_name, num_rows=25, display_cols=None)
Prints data from a specified table in the active backend.
- table_name: str
Name of the table to display.
- num_rows: int, optional, default=25
Maximum number of rows to print. If the table contains fewer rows, only those are shown.
- display_cols: list of str, optional
List of specific column names to display from the table.
If None (default), all columns are displayed.
- find(query, collection=False, update=False)
Finds all rows in the table where a column-level condition (e.g., “age > 4”) is satisfied.
- query: str
A column-level condition that must be in the format of a [column name] [operator] [value]. The value can be a string or number. Valid operators as example queries:
age > 4
age < 4
age >= 4
age <= 4
age = 4
age == 4
age ~ 4 –> column age contains the number 4
age ~~ 4 –> column age contains the number 4
age != 4
age (4, 8) –> all values in ‘age’ between 4 and 8 (inclusive)
- collection: bool, optional, default False.
If True, returns a pandas DataFrame representing a subset of table rows that satisfy the query.
If False (default), prints the result.
- update: bool, optional, default False.
If True, includes ‘dsi_table_name’ and ‘dsi_row_index’ columns required for dsi.update().
If False (default), return object does not include these columns.
return : If there are no matches found, then nothing is returned or printed
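To make the [column name] [operator] [value] format concrete, here is a small sketch that splits a condition string into its three parts. The helper split_condition and the "between" label for the range form are hypothetical and are not DSI's own parser; the sketch also assumes column names contain only letters, digits, and underscores.

```python
import re

# Operators from the list above; longer ones come first so ">=" is not read as ">".
CONDITION = re.compile(r"(\w+)\s*(>=|<=|==|!=|~~|>|<|=|~)\s*(.+)")
# Range form such as "age (4, 8)".
RANGE = re.compile(r"(\w+)\s*\(\s*([^,]+?)\s*,\s*([^,]+?)\s*\)")


def split_condition(condition):
    """Hypothetical helper: return (column, operator, value) for a find() condition."""
    text = condition.strip()
    range_match = RANGE.fullmatch(text)
    if range_match:
        column, low, high = range_match.groups()
        return column, "between", (low, high)
    match = CONDITION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a [column] [operator] [value] condition: {condition!r}")
    return match.groups()


print(split_condition("age >= 4"))    # ('age', '>=', '4')
print(split_condition("age (4, 8)"))  # ('age', 'between', ('4', '8'))
```

Values are kept as strings here; how DSI converts them to numbers is not covered by this sketch.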
- get_table(table_name, collection=False, update=False)
Retrieves all data from a specified table without requiring knowledge of the active backend’s query language.
This method offers a simplified alternative to query() for retrieving a full table's data without using SQL.
- table_name: str
Name of the table from which all data will be retrieved.
- collection: bool, optional, default False.
If True, returns the result as a pandas DataFrame.
If False (default), prints the result.
- update: bool, optional, default False.
If True, includes a ‘dsi_table_name’ column required for dsi.update().
If False (default), return object does not include this column.
return: If table_name does not exist in the backend, then nothing is returned or printed
- list(collection=False)
Gets the names and dimensions (rows x columns) of all tables in the active backend.
- collection: bool, optional, default False.
If True, returns a Python list of all the table names.
If False (default), prints each table’s name and dimensions to the console.
- list_backends()
Prints a list of valid backends that can be used in the backend_name argument of DSI().
- list_readers()
Prints a list of valid readers that can be used in the reader_name argument in read().
- list_writers()
Prints a list of valid writers that can be used in the writer_name argument in write().
- num_tables()
Prints the number of tables in the active backend.
- query(statement, collection=False, update=False)
Executes a SQL query on the active backend.
- statement: str
A SQL query to execute. Only SELECT and PRAGMA statements are allowed.
- collection: bool, optional, default False.
If True, returns the result as a pandas DataFrame.
If False (default), prints the result.
- update: bool, optional, default False.
If True, includes a ‘dsi_table_name’ column required for dsi.update().
If False (default), return object does not include this column.
return: If the statement is incorrectly formatted, then nothing is returned or printed
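Since only SELECT and PRAGMA statements are accepted, a caller may want to screen a statement before passing it to query(). The sketch below uses Python's built-in sqlite3 module as a stand-in database; is_read_only is a hypothetical pre-check and does not reproduce DSI's own validation. The table and column names are borrowed from the Cloverleaf examples in this document, and the row values are invented for illustration.

```python
import sqlite3

ALLOWED_PREFIXES = ("SELECT", "PRAGMA")


def is_read_only(statement):
    """Hypothetical pre-check: True if the statement starts with SELECT or PRAGMA."""
    return statement.lstrip().upper().startswith(ALLOWED_PREFIXES)


# Stand-in database; with DSI the statement would be passed to query() instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (sim_id INTEGER, state2_density REAL)")
conn.executemany("INSERT INTO input VALUES (?, ?)", [(1, 4.5), (2, 7.0)])

statement = "SELECT sim_id FROM input WHERE state2_density > 5.0"
rows = conn.execute(statement).fetchall() if is_read_only(statement) else []
print(rows)  # [(2,)]
conn.close()
```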
- read(filenames, reader_name, table_name=None)
Loads data into DSI using the Reader specified by reader_name.
- filenames: str or list of str or data object
Either file path(s) to the data file(s) or an in-memory data object.
- The expected input type depends on the selected reader_name:
“Collection” → Ordered Dictionary of table(s)
“CSV” → .csv
“YAML1” → .yaml or .yml
“TOML1” → .toml
“JSON” → .json
“Ensemble” → .csv
“Cloverleaf” → /path/to/data/directory/
“Bueno” → .data
“DublinCoreDatacard” → .xml
“SchemaOrgDatacard” → .json
“GoogleDatacard” → .yaml or .yml
“Oceans11Datacard” → .yaml or .yml
- reader_name: str
Name of the DSI Reader to use for loading the data.
If using a DSI-supported Reader, this should be one of the reader_names from list_readers().
If using a custom Reader, provide the relative file path to the Python script with the Reader. For guidance on creating a DSI-compatible Reader, view Custom DSI Reader.
- table_name: str, optional
Name to assign to the loaded table.
Required when using the Collection reader to load an Ordered Dictionary representing only one table.
Recommended when the input file contains a single table for the CSV, JSON, or Ensemble reader.
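For the “Collection” reader, the input is an Ordered Dictionary of table(s). The sketch below shows one plausible layout, inferred from the external Reader example later in this document, where each table name maps to an ordered dict of column names and value lists. Treat the layout as an assumption, and the sample rows and the column_lengths_match helper as purely illustrative.

```python
from collections import OrderedDict

# Each key is a table name; each value maps column names to lists of values.
# Sample data is made up for illustration.
tables = OrderedDict()
tables["people"] = OrderedDict(name=["Ada", "Grace"], age=[36, 45])
tables["pets"] = OrderedDict(owner=["Ada"], species=["cat"])


def column_lengths_match(table):
    """Hypothetical sanity check: every column in a table has the same number of rows."""
    return len({len(values) for values in table.values()}) <= 1


for table_name, table in tables.items():
    print(table_name, list(table), column_lengths_match(table))
```

An object like tables would then be passed as the first argument of read() with reader_name set to "Collection".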
- schema(filename)
Loads a relational database schema into DSI from a specified filename.
- filename: str
Path to a JSON file describing the structure of a relational database. The schema should follow the format described in Cloverleaf (Complex Schemas).
Must be called before reading in any data files associated with the schema.
- search(query, collection=False)
Finds all rows across all tables in the active backend where query can be found.
- query: int, float, or str
The value to search for in all rows across all tables.
- collection: bool, optional, default False.
If True, returns a list of pandas DataFrames representing a subset of tables where query is found.
If False (default), prints the matches to the console.
- summary(table_name=None, collection=False)
Prints numerical metadata and (optionally) sample data from tables in the active backend.
- table_name: str, optional
If specified, only the numerical metadata for that table will be printed.
If None (default), metadata for all available tables is printed.
- collection: bool, optional, default False.
If True, and table_name specified, returns a Pandas DataFrame of the summary of that table.
If True, and table_name not specified, returns a list of Pandas DataFrames of the summary of all tables.
If False (default), prints the summary to the console.
- update(collection, backup=False)
Updates data in one or more tables in the active backend using the provided input. Intended to be used after modifying the output of find(), search(), query(), or get_table().
- collection: pandas.DataFrame
The data used to update a table. DataFrame must include unchanged `dsi_` columns from find(), search(), query() or get_table() to successfully update.
If a query() DataFrame is the input, the corresponding table in the backend will be completely overwritten.
- backup: bool, optional, default False.
If True, creates a backup file for the DSI backend before updating its data.
If False (default), only updates the data.
NOTE: Columns from the original table cannot be deleted during update. Only row edits or column additions are allowed.
NOTE: If update() affects a user-defined primary key column, row order may change upon reinsertion.
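The column rules for update() (no deleted columns, `dsi_` columns kept, additions allowed) can be checked on column names alone before calling it. This is a minimal sketch with a hypothetical helper, compare_columns, that is not part of DSI; the column names are borrowed from the examples in this document.

```python
def compare_columns(original_columns, edited_columns):
    """Hypothetical pre-check: list columns removed from and added to the original."""
    removed = [c for c in original_columns if c not in edited_columns]
    added = [c for c in edited_columns if c not in original_columns]
    return removed, added


original = ["dsi_table_name", "dsi_row_index", "sim_id", "max_timestep"]

# Adding a column is allowed, so nothing is reported as removed.
print(compare_columns(original, original + ["new_col"]))
# ([], ['new_col'])

# Dropping the dsi_ columns would make the DataFrame unusable for update().
print(compare_columns(original, ["sim_id", "max_timestep"])[0])
# ['dsi_table_name', 'dsi_row_index']
```

With pandas, the two arguments could be list(original_df.columns) and list(edited_df.columns). The sketch does not check that values in the `dsi_` columns are unchanged.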
- write(filename, writer_name, table_name=None)
Exports data from the active backend using the specified writer_name.
- filename: str
Name of the output file to write.
- Expected file extensions based on writer_name:
“ER_Diagram” → .png, .pdf, .jpg, .jpeg
“Table_Plot” → .png, .jpg, .jpeg
“Csv” → .csv
- writer_name: str
Name of the DSI Writer to export data.
If using a DSI-supported Writer, this should be one of the writer_names from list_writers().
If using a custom Writer, provide the relative file path to the Python script with the Writer. For guidance on creating a DSI-compatible Writer, view Custom DSI Writer.
- table_name: str, optional
Required when using “Table_Plot” or “Csv” to specify which table to export.
DSI Data Cards
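DSI is expanding its support of several dataset metadata standards. Currently supported standards, named here after their DSI readers, include Dublin Core (DublinCoreDatacard), Schema.org (SchemaOrgDatacard), Google (GoogleDatacard), and Oceans11 (Oceans11Datacard).
Template file structures can be found and copied in examples/test/.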
To be compatible with DSI, a user’s data card must contain all the fields in its corresponding template. However, if certain metadata is not available for a dataset, the values of those fields may be left empty.
The supported datacards can be read into DSI by creating an instance of DSI() and calling:
read("file/path/to/datacard.XML", 'DublinCoreDatacard')
read("file/path/to/datacard.JSON", 'SchemaOrgDatacard')
read("file/path/to/datacard.YAML", 'GoogleDatacard')
read("file/path/to/datacard.YAML", 'Oceans11Datacard')
Examples of each data card standard for the Wildfire dataset can be found in examples/wildfire/
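The compatibility rule for data cards (every template field present, values allowed to be empty) can be sketched as a simple key comparison. The helper missing_fields is hypothetical and not a DSI function, the field names are invented for illustration, and real templates in examples/test/ may be nested rather than flat.

```python
def missing_fields(card, template):
    """Hypothetical check: template fields that are absent from a data card."""
    return [field for field in template if field not in card]


# Illustrative field names only; they are not taken from a real DSI template.
template = {"title": "", "creator": "", "date": ""}
card = {"title": "Wildfire", "creator": ""}  # empty values are fine, absent keys are not

print(missing_fields(card, template))  # ['date']
```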
User Examples
Examples below display various ways users can incorporate DSI into their data science workflows.
They must be executed from their directory in examples/user/.
To run them successfully, please unzip clover3d.zip, located in examples/clover3d/, and install the dependencies listed in requirements.extras.txt.
Example 1: Intro use case
Baseline use of DSI to list all valid Readers, Writers, and Backends, and descriptions of each.
# examples/user/1.baseline.py
from dsi.dsi import DSI
baseline_dsi = DSI()
# Lists available backends, readers, and writers in this dsi installation
baseline_dsi.list_backends()
baseline_dsi.list_readers()
baseline_dsi.list_writers()
Example 2: Read data
Reading Cloverleaf data into a DSI backend, and displaying some of that data
# examples/user/2.read.py
from dsi.dsi import DSI
read_dsi = DSI("data.db") # Target a backend, defaults to SQLite if not defined
#dsi.read(path, reader)
read_dsi.read("../clover3d/", 'Cloverleaf') # Read data into memory
#dsi.display(table_name)
read_dsi.display("input") # Print the specific table's data from the Cloverleaf data
read_dsi.close() # cleans DSI memory of all DSI modules - readers/writers/backends
Example 3: Visualize data
Printing various data and metadata from a DSI backend - number of tables, list of tables, actual table data, and summary of table statistics
# examples/user/3.visualize.py
from dsi.dsi import DSI
visual_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:
visual_dsi.num_tables()
visual_dsi.list()
#dsi.display(table_name, num_rows, display_cols)
# prints all data from 'input'
visual_dsi.display("input")
# optional input to specify number of rows from 'input' to print
visual_dsi.display("input", 2)
# optional input to specify which columns to print
visual_dsi.display("input", 2, ["sim_id", "state1_density", "state2_density", "initial_timestep", "end_step"])
#dsi.summary(table_name, collection)
# prints numerical stats for every table in a backend
visual_dsi.summary()
# prints numerical stats for only 'input'
visual_dsi.summary("input")
visual_dsi.close()
Example 4: Find data
Finding data from an active DSI backend that matches an input object.
If using search(), the input can be a string or number.
If using find(), the input must be a string in the form of a condition - [column] [operator] [value].
By default, all matches are printed. If True is passed as an additional argument, the matches are returned instead: find() returns a DataFrame and search() returns a list of DataFrames.
# examples/user/4.find.py
from dsi.dsi import DSI
find_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:
#dsi.search(value)
find_dsi.search("Jun 2025") # searches for the value 'Jun 2025' in all tables
find_df = find_dsi.search("Jun 2025", True) # Returns a list of DataFrames, one per table with a match
#dsi.find(condition, True)
find_dsi.find("state2_density > 5.0") # Finds all rows of one table that match the condition
find_df = find_dsi.find("state2_density > 5.0", True) # Returns matching rows as a DataFrame
find_dsi.close()
Example 5: Update data
Updating data from the edited output of find(). Users must NOT modify metadata columns starting with `dsi_` even when adding new rows.
The input can be the output of either find(), query(), or get_table().
# examples/user/5.update.py
from dsi.dsi import DSI
update_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:
#dsi.find(condition, collection, update)
find_df = update_dsi.find("state2_density > 5.0", True, True) # Returns matching rows as a DataFrame with the dsi_ columns needed by update()
update_dsi.display(find_df["dsi_table_name"][0], 5) # display table before update
find_df["new_col"] = 50 # add new column to this DataFrame
find_df["max_timestep"] = 100 # update existing column
#dsi.update(collection, backup)
update_dsi.update(find_df, True) # update the table in the backend
update_dsi.display(find_df["dsi_table_name"][0], 5) # display table after update
update_dsi.close()
Example 6: Query data
Querying data from an active DSI backend.
Users can either use query() to view specific data with a SQL statement, or get_table() to view all data from a specified table.
By default, all matches are printed. If True is passed as an additional argument, the matching rows are returned as a DataFrame instead.
# examples/user/6.query.py
from dsi.dsi import DSI
query_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:
#dsi.query(sql_statement)
query_dsi.query("SELECT * FROM input")
#dsi.get_table(table_name)
query_dsi.get_table("input") # alternative to query() if want all data
query_dsi.close()
Example 7: Complex schema with data
Loading a complex JSON file with schema(), reading the associated Cloverleaf data with read(), and writing an ER Diagram to display the data relations.
Read Cloverleaf (Complex Schemas) to learn how to structure a DSI-compatible input file for schema().
# examples/user/7.schema.py
from dsi.dsi import DSI
schema_dsi = DSI("schema_data.db")
# dsi.schema(filename)
schema_dsi.schema("../clover3d/schema.json") # must execute before reading Cloverleaf data
#dsi.read(path, reader)
schema_dsi.read("../clover3d/", 'Cloverleaf')
#dsi.write(filename, writer)
schema_dsi.write("clover_er_diagram.png", "ER_Diagram")
#dsi.display(table_name, num_rows, display_cols)
schema_dsi.display("simulation")
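# display_cols is the third parameter, so pass the column list by keyword
schema_dsi.display("input", display_cols=["sim_id", "state1_density", "state2_density", "initial_timestep", "end_step"])
schema_dsi.display("output", display_cols=["sim_id", "step", "wall_clock", "average_time_per_cell"])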
schema_dsi.display("viz_files")
schema_dsi.close()
Example 8: Write data
Writing data from a DSI backend as an Entity Relationship diagram, table plot, and CSV.
# examples/user/8.write.py
from dsi.dsi import DSI
write_dsi = DSI("schema_data.db") # Assuming schema_data.db has data from 7.schema.py:
#dsi.write(filename, writer, table)
write_dsi.write("er_diagram.png", "ER_Diagram")
write_dsi.write("input_table_plot.png", "Table_Plot", "input")
write_dsi.write("input.csv", "Csv_Writer", "input")
write_dsi.close()
Example 9: Load an external Reader
Loading an external DSI-compatible Reader and its associated data into DSI to interact with and/or visualize the data. For more information on creating an external Reader/Writer, view Custom DSI Reader and Custom DSI Writer.
# examples/user/9.external_reader.py
from dsi.dsi import DSI
external_dsi = DSI("external_data.db")
#dsi.read(filename, path/to/custom/dsi/reader.py)
external_dsi.read("../test/test.txt", "../test/text_file_reader.py")
#dsi.display(table_name)
external_dsi.display("people")
external_dsi.close()
text_file_reader:
from collections import OrderedDict
from pandas import DataFrame, read_csv, concat
from dsi.plugins.file_reader import FileReader

class TextFile(FileReader):
    """
    External Plugin to read in an individual or a set of text files.
    Assuming all text files have data for same table
    """
    def __init__(self, filenames, **kwargs):
        """
        `filenames`: one text file or a list of text files to be ingested
        """
        super().__init__(filenames, **kwargs)
        if isinstance(filenames, str):
            self.text_files = [filenames]
        else:
            self.text_files = filenames
        self.text_file_data = OrderedDict()

    def add_rows(self) -> None:
        """
        Parses text file data and creates an ordered dict whose keys are table names and values are an ordered dict for each table.
        """
        total_df = DataFrame()
        for filename in self.text_files:
            temp_df = read_csv(filename)
            total_df = concat([total_df, temp_df], axis=0, ignore_index=True)
        self.text_file_data["people"] = OrderedDict(total_df.to_dict(orient='list'))
        self.set_schema_2(self.text_file_data)