Python API
Users can interact with DSI modules through the DSI class, which provides an interface for Readers, Writers, and Backends.
The class is documented below and implemented in dsi/dsi.py. Example workflows using these functions can be found in the User Examples section.
Dsi: DSI
The DSI class is a user-level class that encapsulates the Terminal and Sync classes from DSI Core. DSI interacts with several functions within Terminal and Sync without requiring the user to differentiate them. The functionality has also been simplified to improve user experience and reduce complexity.
Users should call read() to load data from external data files into DSI. list_readers() prints all valid readers and a short description of each one.
Users should call write() to export data from DSI into external formats. list_writers() prints all valid writers and a short description of each one.
Users should call backend() to activate either a Sqlite or DuckDB backend. list_backends() prints the valid backends and the differences between them.
ingest(), query(), and process() are considered backend interactions and require an active backend to work. Therefore, backend() must be called before them.
findt(), findc(), and find() also require an active backend, as they locate and print where an input search term matches tables, columns, or datapoints, respectively.
list(), num_tables(), display(), and summary() all print various information from an active backend. Differences are explained below.
- Notes for users:
Must call read() prior to ingest() so that there is actual data to ingest into a backend.
If there is no data in DSI memory, i.e. read() was never called, process() MUST be called on an active backend so that data can be exported with write().
Refer to the DSI Data Cards section to learn which datacard files are read into DSI and how.
Inputs to the datacard readers (Oceans11Datacard, DublinCoreDatacard, SchemaOrgDatacard) must all follow the formats found in dsi/examples/data/
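The ordering rules in these notes can be sketched as a small helper. It is written against any object that exposes read(), backend(), ingest(), and close() with the signatures documented below. The helper name load_into_backend and the CallRecorder stand-in are hypothetical; they exist only so the sketch runs without DSI installed.

```python
class CallRecorder:
    """Hypothetical stand-in for a DSI instance; records calls instead of doing work."""

    def __init__(self):
        self.calls = []

    def read(self, filenames, reader_name, table_name=None):
        self.calls.append(("read", filenames, reader_name))

    def backend(self, filename, backend_name="Sqlite"):
        self.calls.append(("backend", filename, backend_name))

    def ingest(self):
        self.calls.append(("ingest",))

    def close(self):
        self.calls.append(("close",))


def load_into_backend(dsi_obj, data_files, reader_name, db_file, backend_name="Sqlite"):
    """Call read() for every file, then backend(), then ingest(), then close()."""
    if not data_files:
        # Without a prior read() there is nothing for ingest() to store.
        raise ValueError("at least one data file is required before ingest()")
    for path in data_files:
        dsi_obj.read(path, reader_name)
    dsi_obj.backend(db_file, backend_name)  # ingest() needs an active backend
    dsi_obj.ingest()
    dsi_obj.close()


recorder = CallRecorder()
load_into_backend(recorder, ["student_test1.yml", "student_test2.yml"], "YAML1", "data.db")
print([call[0] for call in recorder.calls])
```

With DSI installed, passing DSI() in place of CallRecorder() should drive the real class through the same sequence of calls.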
- class dsi.dsi.DSI
A user-facing abstraction for DSI’s Core middleware interface.
The DSI Class abstracts Core.Terminal for managing metadata and Core.Sync for data management and movement.
- backend(filename, backend_name='Sqlite')
Activates a backend; the default is Sqlite unless specified. Users can now call the ingest(), query(), or process() functions.
- filename: name of the backend file
if backend_name = “Sqlite” —> file extension can be .db, .sqlite, .sqlite3
if backend_name = “DuckDB” —> file extension can be .duckdb, .db
backend_name: either ‘Sqlite’ or ‘DuckDB’. Default is Sqlite.
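As an illustration of the extension rules above, a filename can be checked against the chosen backend before backend() is called. This is a minimal sketch, not part of DSI: the helper backend_file_ok is hypothetical, and the case-insensitive comparison is an assumption about how strictly DSI treats extensions.

```python
from pathlib import Path

# Extensions copied from the backend() description above.
BACKEND_EXTENSIONS = {
    "Sqlite": {".db", ".sqlite", ".sqlite3"},
    "DuckDB": {".duckdb", ".db"},
}


def backend_file_ok(filename, backend_name="Sqlite"):
    """Return True if filename has an extension listed for backend_name."""
    allowed = BACKEND_EXTENSIONS.get(backend_name)
    if allowed is None:
        raise ValueError(f"unknown backend_name: {backend_name!r}")
    # Assumption: extensions are compared case-insensitively.
    return Path(filename).suffix.lower() in allowed


print(backend_file_ok("data.db"))
print(backend_file_ok("data.duckdb", "Sqlite"))
print(backend_file_ok("data.duckdb", "DuckDB"))
```

Note that .db is listed for both backends, so the extension alone does not identify which backend created a file.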
- close()
Closes the connection and finalizes the changes to the backend
- display(table_name, num_rows=25, display_cols=None)
Prints data of a specified table from the first loaded backend.
table_name: table whose data is printed
num_rows: Optional numerical parameter limiting how many rows are printed. Default is 25.
display_cols: Optional parameter specifying which columns in table_name to display. Must be a Python list object
- find(query, row=False)
Finds all individual datapoints that match query input in the first loaded backend
row: Default is False. If False, then printed value is the actual cell that matches query. If True, then printed value is whole row of data where a cell matches query
- findc(query, range=False)
Finds all columns that match query input in the first loaded backend.
range: Default is False. If False, then the printed value is data of each matching column. If True, then the printed value is the min/max of each matching column
- findt(query)
Finds all tables that match query input in the first loaded backend
- ingest()
Ingests the data loaded by all previous read() calls into the backends activated with backend().
- list()
Prints a list of all tables and their dimensions in the first loaded backend
- list_backends()
Prints a list of valid backends that can be specified in the ‘backend_name’ argument in backend()
- list_readers()
Prints a list of valid readers that can be specified in the ‘reader_name’ argument in read()
- list_writers()
Prints a list of valid writers that can be specified in the ‘writer_name’ argument in write()
- nb()
Generates a Python notebook and stores data from the first activated backend
- num_tables()
Prints number of tables in the first loaded backend
- process()
Reads data from first activated backend into DSI memory.
- query(statement)
Queries data from first activated backend based on specified statement. Prints data as a dataframe
statement: query to run on a backend. statement can only be a SELECT or PRAGMA query.
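Because query() accepts only SELECT or PRAGMA statements, a caller may want to screen a statement first. The sketch below checks only the leading keyword. How DSI itself validates statements is not shown here, and the helper looks_like_read_only is hypothetical.

```python
import re

# Matches statements whose first keyword is SELECT or PRAGMA, in any letter case.
_ALLOWED = re.compile(r"\s*(SELECT|PRAGMA)\b", re.IGNORECASE)


def looks_like_read_only(statement):
    """Return True if the statement starts with SELECT or PRAGMA."""
    return _ALLOWED.match(statement) is not None


for stmt in ("SELECT * FROM math", "pragma table_info(math)", "DROP TABLE math"):
    print(stmt, "->", looks_like_read_only(stmt))
```

This is a keyword check, not a SQL parser, so it says nothing about whether the rest of the statement is valid.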
- read(filenames, reader_name, table_name=None)
Runs a reader to load data into DSI.
filenames: name(s) of the data file(s) to load into DSI
if reader_name = “Oceans11Datacard” —> file extension can be .yaml, .yml
if reader_name = “DublinCoreDatacard” —> file extension can be .xml
if reader_name = “SchemaOrgDatacard” —> file extension can be .json
if reader_name = “Schema” —> file extension can be .json
if reader_name = “Bueno” —> file extension can be .data
if reader_name = “Csv” —> file extension can be .csv
if reader_name = “YAML1” —> file extension can be .yaml, .yml
if reader_name = “TOML1” —> file extension can be .toml
if reader_name = “Wildfire” —> file extension can be .csv
if reader_name = “JSON” —> file extension can be .json
reader_name: name of the DSI reader to use. Call list_readers() to see a list of valid readers
table_name: optional, default None. If filenames stores only one table of data, users can specify a name for that table.
Csv, JSON, and Wildfire are the only readers that accept this input.
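The extension list above shows that several readers share an extension, which is why reader_name has to be given explicitly. The following sketch inverts the list to show which readers are candidates for a given file. The helper readers_for is hypothetical, and the case-insensitive match is an assumption.

```python
from pathlib import Path

# Extensions copied from the read() description above.
READER_EXTENSIONS = {
    "Oceans11Datacard": {".yaml", ".yml"},
    "DublinCoreDatacard": {".xml"},
    "SchemaOrgDatacard": {".json"},
    "Schema": {".json"},
    "Bueno": {".data"},
    "Csv": {".csv"},
    "YAML1": {".yaml", ".yml"},
    "TOML1": {".toml"},
    "Wildfire": {".csv"},
    "JSON": {".json"},
}


def readers_for(filename):
    """Return the reader names whose listed extensions include the file's extension."""
    suffix = Path(filename).suffix.lower()  # assumption: case-insensitive match
    return [name for name, exts in READER_EXTENSIONS.items() if suffix in exts]


print(readers_for("results.json"))
print(readers_for("runs.csv"))
```

A .json file, for example, could belong to the SchemaOrgDatacard, Schema, or JSON reader, so the choice depends on the file's contents rather than its name.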
- summary(table_name=None, num_rows=0)
Prints data and numerical metadata of tables from the first loaded backend. Output varies depending on parameters
table_name: default is None. When specified, only that table’s numerical metadata is printed. Otherwise, every table’s numerical metadata is printed.
num_rows: default is 0. When specified, data from the first N rows of a table is printed. Otherwise, only the total number of rows of a table is printed. The tables whose data is printed depend on the table_name parameter.
- write(filename, writer_name, table_name=None)
Runs a writer to export data from DSI. If data to export is in a backend, first call process() before write().
filename: output file name
if writer_name = “ER_Diagram” —> file extension can be .png, .pdf, .jpg, .jpeg
if writer_name = “Table_Plot” —> file extension can be .png, .jpg, .jpeg
if writer_name = “Csv_Writer” —> file extension can only be .csv
writer_name: name of the DSI writer to use. Call list_writers() to see a list of valid writers
table_name: optional if writer_name = “ER_Diagram”. Required for Table_Plot and Csv_Writer to export correct table
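The write() rules combine an extension constraint with a conditional requirement on table_name. A minimal sketch of checking both before calling write() follows. The helper check_write_args is hypothetical and does not reflect how DSI reports such errors.

```python
from pathlib import Path

# Extensions copied from the write() description above.
WRITER_EXTENSIONS = {
    "ER_Diagram": {".png", ".pdf", ".jpg", ".jpeg"},
    "Table_Plot": {".png", ".jpg", ".jpeg"},
    "Csv_Writer": {".csv"},
}
# Writers that need table_name to export the correct table.
NEEDS_TABLE_NAME = {"Table_Plot", "Csv_Writer"}


def check_write_args(filename, writer_name, table_name=None):
    """Return a list of problems; an empty list means the arguments fit the rules above."""
    if writer_name not in WRITER_EXTENSIONS:
        return [f"unknown writer_name {writer_name!r}"]
    problems = []
    suffix = Path(filename).suffix.lower()  # assumption: case-insensitive match
    if suffix not in WRITER_EXTENSIONS[writer_name]:
        problems.append(f"{writer_name} does not list the extension {suffix!r}")
    if writer_name in NEEDS_TABLE_NAME and table_name is None:
        problems.append(f"{writer_name} requires table_name")
    return problems


print(check_write_args("er_diagram.png", "ER_Diagram"))
print(check_write_args("math.pdf", "Table_Plot"))
```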
DSI Data Cards
DSI is expanding its support of dataset metadata standards. The currently supported standards correspond to the datacard readers DublinCoreDatacard, SchemaOrgDatacard, and Oceans11Datacard.
Template file structures can be found in, and copied from, dsi/examples/data/.
The fields in a user’s data card must exactly match its respective template to be compatible with DSI.
However, fields can be empty if a user does not have particular information about that dataset.
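One way to check a data card against its template before reading it is to compare the field names of both, ignoring the values. The sketch below does this for nested dictionaries, such as those produced by json.load. The field names are invented for illustration, and it assumes that "exactly match" refers to the set of fields rather than their order. The real templates are the files in dsi/examples/data/.

```python
def field_paths(card, prefix=""):
    """Return the set of dotted key paths in a nested dict."""
    paths = set()
    for key, value in card.items():
        path = f"{prefix}{key}"
        paths.add(path)
        if isinstance(value, dict):
            paths |= field_paths(value, prefix=f"{path}.")
    return paths


def compare_to_template(template, card):
    """Return (missing, extra) field paths of card relative to template."""
    expected = field_paths(template)
    found = field_paths(card)
    return sorted(expected - found), sorted(found - expected)


# Hypothetical template and user card; these are not DSI's real field names.
template = {"title": "", "creator": "", "coverage": {"start": "", "end": ""}}
user_card = {"title": "Wildfire runs", "creator": "", "coverage": {"start": "2020"}, "notes": "x"}

missing, extra = compare_to_template(template, user_card)
print("missing:", missing)
print("extra:", extra)
```

Here the empty creator field is accepted, while the absent coverage.end field and the additional notes field are both reported.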
The supported datacards can be read into DSI by creating an instance of DSI() and calling:
read(filenames="file/path/to/datacard.XML", reader_name='DublinCoreDatacard')
read(filenames="file/path/to/datacard.JSON", reader_name='SchemaOrgDatacard')
read(filenames="file/path/to/datacard.YAML", reader_name='Oceans11Datacard')
Completed examples of each metadata standard for the Wildfire dataset can also be found in dsi/examples/wildfire/
User Examples
Examples below display various ways users can incorporate DSI into their data science workflows.
They can be found and run in examples/user/
Example 1: Intro use case
Baseline use of DSI to list all valid Readers, Writers, and Backends, and descriptions of each.
# examples/user/1.baseline.py
from dsi.dsi import DSI
baseline_dsi = DSI()
# Lists available backends, readers, and writers in this dsi installation
baseline_dsi.list_backends()
baseline_dsi.list_readers()
baseline_dsi.list_writers()
Example 2: Ingest data
Loading data from a Reader, ingesting it into a backend and displaying some of that data
# examples/user/2.ingest.py
from dsi.dsi import DSI
ingest_dsi = DSI()
#dsi.read(filename, reader)
ingest_dsi.read("../data/student_test1.yml", 'YAML1') # Read data into memory
ingest_dsi.read("../data/student_test2.yml", 'YAML1')
#dsi.backend(filename, backend)
ingest_dsi.backend("data.db") # Target a backend, defaults to SQLite if not defined
ingest_dsi.ingest() # need to call backend() before ingest()
ingest_dsi.summary() # Print the overall summary of the ingest
#dsi.display(table_name)
ingest_dsi.display("math") # Print the specific table name in student_test1.yml
ingest_dsi.close() # cleans DSI memory of all DSI modules - readers/writers/backends
Example 3: Find data
Finding tables, columns, and datapoints in an active backend that match a search term
# examples/user/3.find.py
from dsi.dsi import DSI
# ASSUMING DATABASE HAS DATA FROM 2.ingest.py:
find_dsi = DSI()
#dsi.backend(filename, backend)
find_dsi.backend("data.db")
#dsi.findt(value), dsi.findc(value), dsi.find(value)
find_dsi.findt("a") # table search for "a"; backend() must be loaded first
find_dsi.findc("c") # column search for "c"; backend() must be loaded first
find_dsi.find(5.9) # searches all cells for the value 5.9; backend() must be loaded first
find_dsi.close()
Example 4: Process data
Processing (reading) data from a backend and loading DSI writers to generate an Entity Relationship diagram, plot a table’s data, and export a table to a CSV file
# examples/user/4.process.py
from dsi.dsi import DSI
# ASSUMING DATABASE HAS DATA FROM 2.ingest.py:
process_dsi = DSI()
#dsi.backend(filename, backend)
process_dsi.backend("data.db")
process_dsi.process() # need to call backend() before process() to be able to process data
#dsi.write(filename, writer, table)
process_dsi.write("er_diagram.png", "ER_Diagram")
process_dsi.write("math_table_plot.png", "Table_Plot", "math")
process_dsi.write("math.csv", "Csv_Writer", "math")
process_dsi.close()
Example 5: Query data
Querying data from a backend
# examples/user/5.query.py
from dsi.dsi import DSI
query_dsi = DSI()
#dsi.read(filename, reader)
query_dsi.read("../data/student_test1.yml", 'YAML1')
query_dsi.read("../data/student_test2.yml", 'YAML1')
#dsi.backend(filename, backend)
query_dsi.backend("data.db") # Target a backend, defaults to SQLite if not defined
query_dsi.ingest() # need to call backend() before ingest()
#dsi.query(sql_statement)
query_dsi.query("SELECT * FROM math")
query_dsi.close()
# ---------
# IF DATABASE ALREADY HAS DATA THEN:
query_dsi2 = DSI()
#dsi.backend(filename, backend)
query_dsi2.backend("data.db")
#dsi.query(sql_statement)
query_dsi2.query("SELECT * FROM math") # still need to call backend() before query()
query_dsi2.close()
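For readers unfamiliar with the two statement kinds that query() accepts, the sketch below shows what each returns at the SQLite level. It uses only Python's standard sqlite3 module on an in-memory database and does not call DSI. The columns of the math table are made up, and PRAGMA behaviour may differ on a DuckDB backend.

```python
import sqlite3

# In-memory database with a hypothetical 'math' table (columns are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE math (a INTEGER, b REAL)")
conn.executemany("INSERT INTO math VALUES (?, ?)", [(1, 2.5), (2, 5.9)])

# SELECT returns rows of data.
rows = conn.execute("SELECT * FROM math").fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [info[1] for info in conn.execute("PRAGMA table_info(math)")]

print(rows)
print(columns)
conn.close()
```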
Example 6: Visualizing a database
Printing different data and metadata from a database - number of tables, dimensions of tables, actual data in tables, and statistics from each table
# examples/user/6.visualize.py
from dsi.dsi import DSI
# ASSUMING DATABASE HAS DATA FROM 2.ingest.py:
visual_dsi = DSI()
#dsi.backend(filename, backend)
visual_dsi.backend("data.db")
visual_dsi.num_tables() # need to call backend() before to get number of tables
visual_dsi.list() # need to call backend() before to list all tables and their dimensions
#dsi.display(table_name, num_rows, column_names)
visual_dsi.display("math") # need to call backend() before to print all data from 'math'
visual_dsi.display("math", 2) # optional input to specify number of rows from 'math' to print
visual_dsi.display("math", 2, ['a', 'c', 'e']) # another optional input to specify which columns to print
#dsi.summary(table_name, num_rows)
visual_dsi.summary() # need to call backend() before to print numerical stats from every table in a backend
visual_dsi.summary("math") # prints numerical stats for only 'math'
visual_dsi.summary("math", 5) # prints numerical stats for only 'math' and prints first 5 rows of the actual table
visual_dsi.close()
Example 7: Ingest complex schema with data
Using the Schema Reader to load a complex JSON schema, loading the relevant data, and viewing the difference between databases with and without a schema.
Read Complex Schemas in DSI to understand how to structure this schema JSON file for the Schema Reader.
# examples/user/7.schema.py
from dsi.dsi import DSI
schema_dsi = DSI()
# loads a complex schema into DSI to apply to a database
#dsi.read(filename, reader)
schema_dsi.read("../data/example_schema.json", "Schema") # view comments in dsi/data/example_schema.json to learn how to structure it
schema_dsi.read("../data/student_test1.yml", 'YAML1')
#dsi.write(filename, writer)
schema_dsi.write("schema_er_diagram.png", "ER_Diagram")
schema_dsi.close()
# DSI without a complex Schema
basic_dsi = DSI()
#dsi.read(filename, reader)
basic_dsi.read("../data/student_test1.yml", 'YAML1')
#dsi.write(filename, writer)
basic_dsi.write("normal_er_diagram.png", "ER_Diagram") # schema_er_diagram.png will be different due to complex schema
basic_dsi.close()