Python API

Users can interact with DSI modules using the DSI class, which provides an interface for Readers, Writers, and Backends. The class is documented below and defined in dsi/dsi.py. Example workflows using these functions can be found in the User Examples section.

Dsi: DSI

The DSI class is a user-level class that encapsulates the Terminal and Sync classes from DSI Core. DSI interacts with several functions within Terminal and Sync without requiring the user to differentiate them. The functionality has been simplified to improve user experience and reduce complexity.

When creating an instance of DSI(), users can optionally specify the type of backend and the filename to use. If neither is provided, a temporary backend is automatically created, allowing users to interact with their data. Read the __init__ documentation below for more details on the supported backend types.

Users should use read() to load data into DSI and write() to export data from DSI into supported external formats. Their respective list functions print all valid readers/writers that can be used.

The primary backend interactions are find(), query(), and get_table(), with which users can print a search result or retrieve the result as a collection of data.

If users modify these collections, they can call update() to apply the changes to the active backend. Users must NOT edit any columns beginning with `dsi_`. Read update() below to better understand its behavior.

Users can also view various data/metadata of an active backend with list(), num_tables(), display(), and summary().

Notes for users:

When using a complex schema, users must call schema() before read() so that the relations are stored with the associated data.
If the input to update() is a modified output from query(), the existing table will be overwritten. Back up the data beforehand, or set the backup flag in update() to create a backup database.
Read the DSI Data Cards section to learn which data card standards are supported and where to find templates compatible with DSI.

class dsi.dsi.DSI(filename='.temp.db', backend_name='Sqlite')

A user-facing interface for DSI’s Core middleware.

The DSI Class abstracts Core.Terminal for managing metadata and Core.Sync for data management and movement.

__init__(filename='.temp.db', backend_name='Sqlite')

Initializes DSI by activating a backend for data operations; the default is a Sqlite backend for temporary data analysis. If users specify filename, data is saved to a permanent backend file. Users can then call read(), find(), update(), query(), write(), or any backend printing operation.

filename: str, optional

If not specified, a temporary, hidden backend file is created for users to analyze their data. If specified and backend file already exists, it is activated for a user to explore its data. If specified and backend file does not exist, a file with this name is created.

Accepted file extensions:

If backend_name = “Sqlite” → .db, .sqlite, .sqlite3
If backend_name = “DuckDB” → .duckdb, .db

backend_name: str, optional

Name of the backend to activate. Must be either “Sqlite” or “DuckDB”. Default is “Sqlite”.
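The extension rules above can be checked before constructing DSI(). The following is a minimal sketch and not DSI's own validation: `ACCEPTED_EXTENSIONS` and `check_backend_filename` are hypothetical names introduced for illustration, the mapping is transcribed from the list above, and extensions are compared exactly as written (whether DSI treats them case-insensitively is not stated here).

```python
from pathlib import Path

# Hypothetical lookup table, transcribed from the accepted-extension list above.
ACCEPTED_EXTENSIONS = {
    "Sqlite": {".db", ".sqlite", ".sqlite3"},
    "DuckDB": {".duckdb", ".db"},
}


def check_backend_filename(filename, backend_name="Sqlite"):
    """Return True if filename ends in an extension listed for backend_name."""
    if backend_name not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"backend_name must be one of {sorted(ACCEPTED_EXTENSIONS)}")
    return Path(filename).suffix in ACCEPTED_EXTENSIONS[backend_name]


print(check_backend_filename("data.db"))                 # .db is listed for Sqlite
print(check_backend_filename("data.sqlite3", "DuckDB"))  # .sqlite3 is not listed for DuckDB
```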

close(): Closes the connection to the active backend and clears all loaded DSI modules.

display(table_name, num_rows=25, display_cols=None)

Prints data from a specified table in the active backend.

table_name: str

Name of the table to display.

num_rows: int, optional, default=25

Maximum number of rows to print. If the table contains fewer rows, only those are shown.

display_cols: list of str, optional

List of specific column names to display from the table.

If None (default), all columns are displayed.

find(query, collection=False, update=False)

Finds all rows in the table where a column-level condition (e.g., “age > 4”) is satisfied.

query: str

A column-level condition that must be in the format of a [column name] [operator] [value]. The value can be a string or number. Valid operators as example queries:

age > 4
age < 4
age >= 4
age <= 4
age = 4
age == 4
age ~ 4 –> column age contains the number 4
age ~~ 4 –> column age contains the number 4
age != 4
age (4, 8) –> all values in ‘age’ between 4 and 8 (inclusive)

collection: bool, optional, default False.

If True, returns a pandas DataFrame representing a subset of table rows that satisfy the query.

If False (default), prints the result.

update: bool, optional, default False.

If True, includes ‘dsi_table_name’ and ‘dsi_row_index’ columns required for dsi.update().

If False (default), return object does not include these columns.

return : If there are no matches found, then nothing is returned or printed
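To make the condition grammar concrete, the sketch below parses the example conditions listed above and runs them against a small in-memory SQLite table. It is an illustration only and makes no claim about how find() is implemented: `condition_to_sql` and `find_rows` are hypothetical helpers, the `people` table is made up, and mapping `~` / `~~` to a SQL LIKE substring match is an assumption.

```python
import re
import sqlite3

# Hypothetical helpers: NOT part of DSI. They only illustrate the
# "[column name] [operator] [value]" grammar described above.
_COMPARE = re.compile(r"^\s*(\w+)\s*(>=|<=|==|!=|~~|>|<|=|~)\s*(.+?)\s*$")
_RANGE = re.compile(r"^\s*(\w+)\s*\(\s*(.+?)\s*,\s*(.+?)\s*\)\s*$")


def _to_value(text):
    """Convert a token to int or float when possible, else strip quotes."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text.strip("'\"")


def condition_to_sql(condition):
    """Return (where_clause, params) for a find()-style condition string."""
    match = _RANGE.match(condition)
    if match:  # e.g. "age (4, 8)" -> inclusive range
        col, low, high = match.groups()
        return f'"{col}" BETWEEN ? AND ?', [_to_value(low), _to_value(high)]
    match = _COMPARE.match(condition)
    if not match:
        raise ValueError(f"unsupported condition: {condition!r}")
    col, op, raw = match.groups()
    if op in ("~", "~~"):  # assumed here to mean "contains"
        return f'"{col}" LIKE ?', [f"%{_to_value(raw)}%"]
    if op == "==":
        op = "="
    return f'"{col}" {op} ?', [_to_value(raw)]


# Made-up demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", 3), ("Bob", 4), ("Cy", 8), ("Di", 14)],
)


def find_rows(condition):
    """Run one condition against the demo table and return matching names."""
    where, params = condition_to_sql(condition)
    sql = f"SELECT name FROM people WHERE {where} ORDER BY name"
    return [name for (name,) in conn.execute(sql, params)]


for cond in ("age > 4", "age (4, 8)", "age ~ 4", "age != 4"):
    print(cond, "->", find_rows(cond))
```

Values are passed as bound parameters rather than pasted into the SQL text, so only the column name is interpolated into the statement.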

get_table(table_name, collection=False, update=False)

Retrieves all data from a specified table without requiring knowledge of the active backend’s query language.

This method offers a simplified alternative to query() for retrieving all of a table's data without using SQL.

table_name: str

Name of the table from which all data will be retrieved.

collection: bool, optional, default False.

If True, returns the result as a pandas DataFrame.

If False (default), prints the result.

update: bool, optional, default False.

If True, includes a ‘dsi_table_name’ column required for dsi.update().

If False (default), return object does not include this column.

return: If table_name does not exist in the backend, then nothing is returned or printed

list(collection=False)

Gets the names and dimensions (rows x columns) of all tables in the active backend.

collection: bool, optional, default False.

If True, returns a Python list of all the table names

If False (default), prints each table’s name and dimensions to the console.

list_backends(): Prints a list of valid backends that can be used in the backend_name argument in backend()

list_readers(): Prints a list of valid readers that can be used in the reader_name argument in read()

list_writers(): Prints a list of valid writers that can be used in the writer_name argument in write()

num_tables(): Prints the number of tables in the active backend.

query(statement, collection=False, update=False)

Executes a SQL query on the active backend.

statement: str

A SQL query to execute. Only SELECT and PRAGMA statements are allowed.

collection: bool, optional, default False.

If True, returns the result as a pandas DataFrame.

If False (default), prints the result.

update: bool, optional, default False.

If True, includes a ‘dsi_table_name’ column required for dsi.update().

If False (default), return object does not include this column.

return: If the statement is incorrectly formatted, then nothing is returned or printed
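Because only SELECT and PRAGMA statements are accepted, it can be useful to screen a statement before passing it to query(). The sketch below is one possible pre-check; `is_read_only_statement` is a hypothetical helper and says nothing about how DSI itself validates statements. It only inspects the leading keyword, so a statement that starts with anything else (for example a `WITH` clause) is rejected.

```python
import re

# Hypothetical pre-check, not DSI code: accept statements whose first
# keyword is SELECT or PRAGMA, ignoring case and leading whitespace.
_ALLOWED = re.compile(r"\s*(select|pragma)\b", re.IGNORECASE)


def is_read_only_statement(statement):
    """Return True if the statement begins with SELECT or PRAGMA."""
    return _ALLOWED.match(statement) is not None


print(is_read_only_statement("SELECT * FROM input"))
print(is_read_only_statement("DROP TABLE input"))
```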

read(filenames, reader_name, table_name=None)

Loads data into DSI using the specified parameter reader_name

filenames: str or list of str or data object

Either file path(s) to the data file(s) or an in-memory data object.

The expected input type depends on the selected reader_name:

“Collection” → Ordered Dictionary of table(s)
“CSV” → .csv
“YAML1” → .yaml or .yml
“TOML1” → .toml
“JSON” → .json
“Ensemble” → .csv
“Cloverleaf” → /path/to/data/directory/
“Bueno” → .data
“DublinCoreDatacard” → .xml
“SchemaOrgDatacard” → .json
“GoogleDatacard” → .yaml or .yml
“Oceans11Datacard” → .yaml or .yml

reader_name: str

Name of the DSI Reader to use for loading the data.

If using a DSI-supported Reader, this should be one of the reader_names from list_readers().

If using a custom Reader, provide the relative file path to the Python script with the Reader. For guidance on creating a DSI-compatible Reader, view Custom DSI Reader.

table_name: str, optional

Name to assign to the loaded table.

Required when using the Collection reader to load an Ordered Dictionary representing only one table.

Recommended when the input file contains a single table for the CSV, JSON, or Ensemble reader.
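For the Collection reader, the in-memory object is an ordered dictionary of tables. The sketch below assumes the layout used by the example TextFile reader at the end of this page, where each table name maps to an ordered dictionary of column name to list of values. The table and column names are made up, `column_lengths_match` is a hypothetical sanity check, and the commented read() calls assume an existing DSI instance named `my_dsi`.

```python
from collections import OrderedDict

# One table: column name -> list of values, one entry per row.
people = OrderedDict([
    ("name", ["Ann", "Bob"]),
    ("age", [31, 45]),
])

# Several tables: table name -> table.
collection = OrderedDict([
    ("people", people),
    ("pets", OrderedDict([("owner", ["Ann"]), ("species", ["cat"])])),
])


def column_lengths_match(table):
    """Check that every column in a table holds the same number of rows."""
    return len({len(values) for values in table.values()}) <= 1


print(all(column_lengths_match(table) for table in collection.values()))

# With a DSI instance (hypothetical variable my_dsi), the calls would be:
# my_dsi.read(collection, "Collection")                   # several tables
# my_dsi.read(people, "Collection", table_name="people")  # one table needs table_name
```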

schema(filename)

Loads a relational database schema into DSI from a specified filename

filename: str
Path to a JSON file describing the structure of a relational database. The schema should follow the format described in Cloverleaf (Complex Schemas)

Must be called before reading in any data files associated with the schema

search(query, collection=False)

Finds all rows across all tables in the active backend where query can be found.

query: int, float, or str

The value to search for in all rows across all tables.

collection: bool, optional, default False.

If True, returns a list of pandas DataFrames representing a subset of tables where query is found.

If False (default), prints the matches to the console.
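The idea behind a search across every table can be sketched directly against SQLite. This is an illustration under stated assumptions, not DSI's implementation: `search_all_tables` is a hypothetical function, the demo tables are made up, and matching is assumed to be a substring match on each value's text form (the description above does not say whether DSI matches whole values or substrings).

```python
import sqlite3


def search_all_tables(conn, value):
    """Return {table_name: [matching rows]} for rows containing value in any column."""
    matches = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    pattern = f"%{value}%"
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        where = " OR ".join(f'CAST("{col}" AS TEXT) LIKE ?' for col in columns)
        rows = conn.execute(
            f'SELECT * FROM "{table}" WHERE {where}', [pattern] * len(columns)
        ).fetchall()
        if rows:
            matches[table] = rows
    return matches


# Made-up demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (sim_id INTEGER, date TEXT)")
conn.execute("CREATE TABLE notes (sim_id INTEGER, text TEXT)")
conn.executemany("INSERT INTO runs VALUES (?, ?)", [(1, "12 Jun 2025"), (2, "03 Jul 2025")])
conn.executemany("INSERT INTO notes VALUES (?, ?)", [(1, "rerun in Jun 2025"), (2, "ok")])

print(search_all_tables(conn, "Jun 2025"))
```

Note that `%` and `_` inside the searched value act as LIKE wildcards in this sketch.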

summary(table_name=None, collection=False)

Prints numerical metadata and (optionally) sample data from tables in the active backend.

table_name: str, optional

If specified, only the numerical metadata for that table will be printed.

If None (default), metadata for all available tables is printed.

collection: bool, optional, default False.

If True, and table_name specified, returns a Pandas DataFrame of the summary of that table.

If True, and table_name not specified, returns a list of Pandas DataFrames of the summary of all tables.

If False (default), prints the summary to the console.

update(collection, backup=False)

Updates data in one or more tables in the active backend using the provided input. Intended to be used after modifying the output of find(), search(), query(), or get_table().

collection: pandas.DataFrame

The data used to update a table. DataFrame must include unchanged `dsi_` columns from find(), search(), query() or get_table() to successfully update.

If a query() DataFrame is the input, the corresponding table in the backend will be completely overwritten.

backup: bool, optional, default False.

If True, creates a backup file for the DSI backend before updating its data.

If False (default), only updates the data.

NOTE: Columns from the original table cannot be deleted during update. Only row edits or column additions are allowed.
NOTE: If update() affects a user-defined primary key column, row order may change upon reinsertion.
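The rules above (no deleted columns, untouched `dsi_` columns) can be checked before calling update(). The sketch below is a hypothetical pre-check, not part of DSI: `check_update_input` works on plain dictionaries of column name to list of values, which is the shape produced by pandas' `DataFrame.to_dict(orient="list")`, and it assumes any new rows are appended after the original ones. The column names and values are made up.

```python
def check_update_input(original, edited):
    """Return a list of problems that would violate the update() rules.

    Both arguments map column name -> list of values.
    """
    problems = []
    dropped = [col for col in original if col not in edited]
    if dropped:
        problems.append(f"columns removed: {dropped}")
    for col, values in original.items():
        if not col.startswith("dsi_") or col not in edited:
            continue
        # Compare only the original rows so that appended rows are tolerated.
        if edited[col][: len(values)] != values:
            problems.append(f"metadata column changed: {col}")
    return problems


original = {
    "dsi_table_name": ["input", "input"],
    "dsi_row_index": [1, 2],
    "max_timestep": [10, 20],
}

# Allowed: edit an existing column and add a new one.
edited = dict(original, max_timestep=[100, 100], new_col=[50, 50])
print(check_update_input(original, edited))
```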

write(filename, writer_name, table_name=None)

Exports data from the active backend using the specified writer_name.

filename: str

Name of the output file to write.

Expected file extensions based on writer_name:

“ER_Diagram” → .png, .pdf, .jpg, .jpeg
“Table_Plot” → .png, .jpg, .jpeg
“Csv” → .csv

writer_name: str

Name of the DSI Writer to export data.

If using a DSI-supported Writer, this should be one of the writer_names from list_writers().

If using a custom Writer, provide the relative file path to the Python script with the Writer. For guidance on creating a DSI-compatible Writer, view Custom DSI Writer.

table_name: str, optional

Required when using “Table_Plot” or “Csv” to specify which table to export.

DSI Data Cards

DSI is expanding its support of several dataset metadata standards. Currently supported standards include:

Dublin Core

Schema.org’s Dataset object

Google Data Cards Playbook

Oceans11 DSI Data Server

Template file structures can be found and copied in examples/test/.

To be compatible with DSI, a user’s data card must contain all the fields in its corresponding template. However, if certain metadata is not available for a dataset, the values of those fields may be left empty.
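The "all fields present, values may be empty" rule can be checked once a template and a data card have been loaded into dictionaries (for example with a JSON or YAML parser). The sketch below is illustrative only: `missing_fields` is a hypothetical helper, and the field names are placeholders rather than the fields of any actual DSI template.

```python
def missing_fields(template, card, prefix=""):
    """Return dotted paths of template fields that are absent from card.

    Empty values in card are accepted; only missing keys are reported.
    """
    missing = []
    for key, value in template.items():
        path = f"{prefix}{key}"
        if key not in card:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(card[key], dict):
            missing.extend(missing_fields(value, card[key], prefix=f"{path}."))
    return missing


# Placeholder field names, not taken from a DSI template.
template = {"title": "", "creator": "", "coverage": {"start": "", "end": ""}}
card = {"title": "Wildfire", "creator": "", "coverage": {"start": "2020"}}

print(missing_fields(template, card))
```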

The supported datacards can be read into DSI by creating an instance of DSI() and calling:

read("file/path/to/datacard.XML", 'DublinCoreDatacard')

read("file/path/to/datacard.JSON", 'SchemaOrgDatacard')

read("file/path/to/datacard.YAML", 'GoogleDatacard')

read("file/path/to/datacard.YAML", 'Oceans11Datacard')

Examples of each data card standard for the Wildfire dataset can be found in examples/wildfire/

User Examples

Examples below display various ways users can incorporate DSI into their data science workflows. They must be executed from their directory in examples/user/

To run them successfully, please unzip clover3d.zip located in examples/clover3d/, and install the packages listed in requirements.extras.txt.

Example 1: Intro use case

Baseline use of DSI to list all valid Readers, Writers, and Backends, and descriptions of each.

# examples/user/1.baseline.py
from dsi.dsi import DSI

baseline_dsi = DSI()

# Lists available backends, readers, and writers in this dsi installation
baseline_dsi.list_backends()
baseline_dsi.list_readers()
baseline_dsi.list_writers()

Example 2: Read data

Reading Cloverleaf data into a DSI backend, and displaying some of that data

# examples/user/2.read.py
from dsi.dsi import DSI

read_dsi = DSI("data.db") # Target a backend, defaults to SQLite if not defined

#dsi.read(path, reader)
read_dsi.read("../clover3d/", 'Cloverleaf') # Read data into memory

#dsi.display(table_name)
read_dsi.display("input") # Print the specific table's data from the Cloverleaf data

read_dsi.close() # cleans DSI memory of all DSI modules - readers/writers/backends

Example 3: Visualize data

Printing various data and metadata from a DSI backend - number of tables, list of tables, actual table data, and summary of table statistics

# examples/user/3.visualize.py
from dsi.dsi import DSI

visual_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:

visual_dsi.num_tables()
visual_dsi.list()

#dsi.display(table_name, num_rows, display_cols)
# prints all data from 'input'
visual_dsi.display("input")

# optional input to specify number of rows from 'input' to print
visual_dsi.display("input", 2)

# optional input to specify which columns to print
visual_dsi.display("input", 2, ["sim_id", "state1_density", "state2_density", "initial_timestep", "end_step"])


#dsi.summary(table_name, collection)
# prints numerical stats for every table in a backend
visual_dsi.summary()

# prints numerical stats for only 'input'
visual_dsi.summary("input")

visual_dsi.close()

Example 4: Find data

Finding data from an active DSI backend that matches an input object.

If using search(), the input can be a string or number. If using find(), the input must be a string in the form of a condition - [column] [operator] [value].

By default, all matches are printed. If True is passed as an additional argument, the matching rows are returned as a DataFrame instead.

# examples/user/4.find.py
from dsi.dsi import DSI

find_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:

#dsi.search(value)
find_dsi.search("Jun 2025") # searches for the value 'Jun 2025' in all tables

search_dfs = find_dsi.search("Jun 2025", True) # Returns a list of DataFrames, one for each table with a match

#dsi.find(condition, True)
find_dsi.find("state2_density > 5.0") # Finds all rows of one table that match the condition

find_df = find_dsi.find("state2_density > 5.0", True) # Returns matching rows as a DataFrame

find_dsi.close()

Example 5: Update data

Updating data from the edited output of find(). Users must NOT modify metadata columns starting with `dsi_` even when adding new rows.

The input can be the output of either find(), query(), or get_table().

# examples/user/5.update.py
from dsi.dsi import DSI

update_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:

#dsi.find(condition, collection, update)
find_df = update_dsi.find("state2_density > 5.0", True, True) # Returns matching rows as a DataFrame that includes the dsi_ columns required by update()

update_dsi.display(find_df["dsi_table_name"][0], 5) # display table before update

find_df["new_col"] = 50   # add new column to this DataFrame
find_df["max_timestep"] = 100 # update existing column

#dsi.update(collection, backup)
update_dsi.update(find_df, True) # update the table in the backend

update_dsi.display(find_df["dsi_table_name"][0], 5) # display table after update

update_dsi.close()

Example 6: Query data

Querying data from an active DSI backend. Users can either use query() to view specific data with a SQL statement, or get_table() to view all data from a specified table.

By default, all matches are printed. If True is passed as an additional argument, the matching rows are returned as a DataFrame instead.

# examples/user/6.query.py
from dsi.dsi import DSI

query_dsi = DSI("data.db") # Assuming data.db has data from 2.read.py:

#dsi.query(sql_statement)
query_dsi.query("SELECT * FROM input")

#dsi.get_table(table_name)
query_dsi.get_table("input") # alternative to query() if want all data

query_dsi.close()

Example 7: Complex schema with data

Loading a complex JSON file with schema(), the associated Cloverleaf data with read(), and an ER Diagram to display the data relations.

Read Cloverleaf (Complex Schemas) to learn how to structure a DSI-compatible input file for schema()

# examples/user/7.schema.py
from dsi.dsi import DSI

schema_dsi = DSI("schema_data.db")

# dsi.schema(filename)
schema_dsi.schema("../clover3d/schema.json") # must execute before reading Cloverleaf data

#dsi.read(path, reader)
schema_dsi.read("../clover3d/", 'Cloverleaf')

#dsi.write(filename, writer)
schema_dsi.write("clover_er_diagram.png", "ER_Diagram")

#dsi.display(table_name, num_rows, display_cols)
schema_dsi.display("simulation")
schema_dsi.display("input", display_cols=["sim_id", "state1_density", "state2_density", "initial_timestep", "end_step"])
schema_dsi.display("output", display_cols=["sim_id", "step", "wall_clock", "average_time_per_cell"])
schema_dsi.display("viz_files")

schema_dsi.close()

Example 8: Write data

Writing data from a DSI backend as an Entity Relationship diagram, table plot, and CSV.

# examples/user/8.write.py
from dsi.dsi import DSI

write_dsi = DSI("schema_data.db") # Assuming schema_data.db has data from 7.schema.py:

#dsi.write(filename, writer, table)
write_dsi.write("er_diagram.png", "ER_Diagram")

write_dsi.write("input_table_plot.png", "Table_Plot", "input")

write_dsi.write("input.csv", "Csv", "input")

write_dsi.close()

Example 9: Load an external Reader

Loading an external DSI-compatible Reader and its associated data into DSI to interact with and/or visualize the data. For more information on creating an external Reader/Writer, view Custom DSI Reader and Custom DSI Writer.

# examples/user/9.external_reader.py
from dsi.dsi import DSI

external_dsi = DSI("external_data.db")

#dsi.read(filename, path/to/custom/dsi/reader.py)
external_dsi.read("../test/test.txt", "../test/text_file_reader.py")

#dsi.display(table_name)
external_dsi.display("people")

external_dsi.close()

text_file_reader:

from collections import OrderedDict
from pandas import DataFrame, read_csv, concat

from dsi.plugins.file_reader import FileReader

class TextFile(FileReader):
    """
    External Plugin to read in an individual or a set of text files.
    Assuming all text files have data for same table
    """
    def __init__(self, filenames, **kwargs):
        """
        `filenames`: one text file or a list of text files to be ingested
        """
        super().__init__(filenames, **kwargs)
        if isinstance(filenames, str):
            self.text_files = [filenames]
        else:
            self.text_files = filenames
        self.text_file_data = OrderedDict()

    def add_rows(self) -> None:
        """
        Parses text file data and creates an ordered dict whose keys are table names and values are an ordered dict for each table.
        """
        total_df = DataFrame()
        for filename in self.text_files:
            temp_df = read_csv(filename)
            total_df = concat([total_df, temp_df], axis=0, ignore_index=True)

        self.text_file_data["people"] = OrderedDict(total_df.to_dict(orient='list'))

        self.set_schema_2(self.text_file_data)