DSI Readers/Writers
Readers/Writers connect data-producing applications to DSI core functionalities. A Reader deals with existing data files or input streams; a Writer generates new data, such as exported files or images.
Readers/Writers are modular to support user contribution, and contributors are encouraged to offer custom Reader/Writer abstract classes and implementations.
A contributed Reader/Writer abstract class may either extend another Reader/Writer to inherit the properties of the parent, or be a completely new structure.
To be compatible with DSI, Readers should store data in data structures sourced from the Python collections library (OrderedDict).
Similarly, Writers should accept data structures from Python collections (OrderedDict) to export data or generate an image.
Note that any contributed Reader/Writer or extension should include unit tests in plugins/tests to demonstrate the new capability.
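As a sketch of this contract, a minimal custom reader might look like the following. The class name `MyReader` and its data are hypothetical; only the OrderedDict layout and the add_rows pattern follow the interface described in this section, and the surrounding DSI wiring is elided.

```python
from collections import OrderedDict

# Hypothetical minimal reader sketch. Data is stored column-wise in an
# OrderedDict, which is the structure DSI expects from Readers.
class MyReader:
    def __init__(self, filenames):
        # accept one filename or a list of filenames
        self.filenames = filenames if isinstance(filenames, list) else [filenames]
        self.output = OrderedDict()

    def add_rows(self):
        # columns map to lists of values; the table is keyed by a table name
        table = OrderedDict([("a", [1, 2]), ("b", [3.0, 4.0])])
        self.output["MyReader"] = table
        # a real reader would finish with something like:
        #   self.set_schema_2(self.output)
        return self.output
```

A real reader would parse `self.filenames` instead of hard-coding values, and hand the resulting OrderedDict to the StructuredMetadata machinery described below.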

The figure below depicts a prominent portion of the current DSI Readers/Writers class hierarchy.
- class dsi.plugins.plugin.Plugin(path)
Plugin abstract class for DSI core product.
A Plugin connects a data reader or writer to a compatible middleware data structure.
- abstractmethod __init__(path)
Initialize Plugin setup.
Read a Plugin file. Return a Plugin object.
- abstractmethod add_to_output(path)
Abstract method to add parsed data to the Plugin's output.
Metadata Processing
Note for users: the StructuredMetadata class is used to assign data from a file_reader to the DSI abstraction in core.
If the data in a user-written reader is structured as an OrderedDict, you only need to call set_schema_2() at the end of the reader's add_rows().
- class dsi.plugins.metadata.StructuredMetadata(**kwargs)
plugin superclass that provides handy methods for structured data
- __init__(**kwargs)
Initializes StructuredMetadata class with an output collector (Ordered Dictionary)
- add_to_output(row: list, tableName=None) None
Adds a row of data to the output_collector and guarantees good structure. Useful in a plugin's add_rows method.
Do not use this with set_schema_2().
row: a list representing one row of data
tableName: default None. Specified name of the table to ingest the row into.
- schema_is_set() bool
Helper method to check whether the schema has been set.
Do not use this with set_schema_2().
- set_schema(table_data: list, validation_model=None) None
Initializes columns in the output_collector and table_cnt. Useful in a plugin's pack_header method.
Do not use this with set_schema_2().
table_data:
for ingested data with multiple tables, table_data is a list of tuples, each structured as (table name, list of column names)
for data without multiple tables, table_data is just a list of column names
- set_schema_2(collection, validation_model=None) None
Faster way to update the DSI abstraction, as long as the collection input is structured as an OrderedDict.
Do not use this with set_schema(), add_to_output(), or schema_is_set().
collection: data passed in from a plugin as an OrderedDict.
If it contains only one table of data, that table is nested in another OrderedDict whose key is the plugin class name.
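For illustration, a single-table collection passed to set_schema_2() might be shaped as below; the plugin name "MyPlugin" and the column data are hypothetical.

```python
from collections import OrderedDict

# Illustrative shape of the `collection` argument to set_schema_2():
# a single table is nested in an outer OrderedDict keyed by the
# plugin class name (here the hypothetical "MyPlugin").
table = OrderedDict([
    ("temperature", [21.5, 22.0, 21.8]),
    ("pressure",    [101.3, 101.1, 101.4]),
])
collection = OrderedDict([("MyPlugin", table)])
```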
File Readers
- Note for users:
Assume the names of data structures from all data sources are consistent/stable. Ex: table/column names MUST be consistent; the number of columns in a table CAN vary.
DSI Readers can handle data files with mismatched numbers of columns. Ex: in file1, table1 has columns a, b, c; in file2, table1 has columns a, b, d.
If only reading in one table at a time, users can utilize Python pandas to stack multiple dataframes vertically (ex: CSV reader).
If a file contains multiple tables, users must pad the tables with null values (ex: YAML1, which has example code at the bottom of add_rows() to implement this).
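The null-padding approach mentioned for multi-table files can be sketched as follows, assuming each table is an OrderedDict mapping column names to column-value lists (the helper name pad_tables is hypothetical, not DSI's implementation):

```python
from collections import OrderedDict

# Pad every column in every table with None so all columns in a table
# have equal length, as described for multi-table ingestion.
def pad_tables(tables):
    for table in tables.values():
        longest = max((len(col) for col in table.values()), default=0)
        for col in table.values():
            col.extend([None] * (longest - len(col)))
    return tables

tables = OrderedDict([
    ("table1", OrderedDict([("a", [1, 2, 3]), ("b", [4])])),
])
pad_tables(tables)
# table1["b"] is now [4, None, None]
```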
- class dsi.plugins.file_reader.Bueno(filenames, **kwargs)
A DSI Reader that captures performance data from Bueno (github.com/lanl/bueno)
Bueno outputs performance data as key-value pairs in a file. Keys and values are delimited by a colon (:); key-value pairs are delimited by newlines (\n).
- __init__(filenames, **kwargs) None
filenames: one Bueno file or a list of Bueno files to be ingested
- add_rows() None
Parses Bueno data and adds a list containing 1 or more rows.
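Based on the delimiters stated above, parsing Bueno-style text can be sketched as follows; parse_bueno_text is a hypothetical helper, not DSI's implementation:

```python
from collections import OrderedDict

# Split a Bueno-style file body into key-value pairs:
# keys and values are separated by ":", pairs by newlines.
def parse_bueno_text(text):
    pairs = OrderedDict()
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            pairs[key.strip()] = value.strip()
    return pairs

sample = "runtime:12.5\nnodes:4\n"
```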
- class dsi.plugins.file_reader.Csv(filenames, table_name=None, **kwargs)
A DSI Reader that reads in CSV data
- __init__(filenames, table_name=None, **kwargs)
Initializes CSV Reader with user specified filenames and optional table_name.
filenames: Required input. A list of CSV files, or a single CSV file, to store in DSI. If a list, data in all files must be for the same table
table_name: default None. User can specify table name when loading CSV file. Otherwise DSI uses table_name = “Csv”
- add_rows() None
Adds a list containing one or more rows of the CSV along with file_info to output.
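A rough sketch of how CSV rows map onto the column-wise OrderedDict layout DSI expects; the helper csv_to_columns is hypothetical, and the real reader additionally records file_info:

```python
import csv
import io
from collections import OrderedDict

# Read CSV text into an OrderedDict of column-name -> list of values.
def csv_to_columns(text):
    reader = csv.DictReader(io.StringIO(text))
    columns = OrderedDict((name, []) for name in reader.fieldnames)
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns

sample = "a,b\n1,x\n2,y\n"
```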
- class dsi.plugins.file_reader.DublinCoreDatacard(filenames, target_table_prefix=None, **kwargs)
- __init__(filenames, target_table_prefix=None, **kwargs)
filenames: file names of the Dublin Core data card xml files to be ingested
target_table_prefix: prefix to be added to every table name in the primary and foreign key list
- add_rows() None
- class dsi.plugins.file_reader.JSON(filenames, table_name=None, **kwargs)
A DSI Reader that captures generic JSON data. Assumes all key/value pairs are basic data formats, i.e., string, int, or float, not a nested dictionary or deeper structure
The JSON data’s keys are used as columns and values are rows
- __init__(filenames, table_name=None, **kwargs) None
Initializes generic JSON reader with user-specified filenames
filenames: Required input. A list of JSON files, or a single JSON file, to store in DSI. If a list, data in all files must be for the same table
table_name: default None. User can specify table name when loading JSON file. Otherwise DSI uses table_name = “JSON”
- add_rows() None
Parses JSON data and stores data into a table as an Ordered Dictionary.
- class dsi.plugins.file_reader.MetadataReader1(filenames, target_table_prefix=None, **kwargs)
DSI Reader that reads in an individual or a set of JSON metadata files
- __init__(filenames, target_table_prefix=None, **kwargs)
filenames: one metadata json file or a list of metadata json files to be ingested
target_table_prefix: prefix to be added to every table created to differentiate between other metadata file sources
- add_rows() None
Parses metadata json files and creates an ordered dict whose keys are file names and values are an ordered dict of that file’s data
- class dsi.plugins.file_reader.Oceans11Datacard(filenames, target_table_prefix=None, **kwargs)
- __init__(filenames, target_table_prefix=None, **kwargs)
filenames: file names of the oceans11 data card yaml files to be ingested
target_table_prefix: prefix to be added to every table name in the primary and foreign key list
- add_rows() None
- class dsi.plugins.file_reader.Schema(filename, target_table_prefix=None, **kwargs)
DSI Reader that parses the schema of a data source ingested in the same workflow.
Schema file input should be a JSON file that stores primary and foreign keys for all tables in the data source. Stores all relations in the global dsi_relations table used when creating backends/writers
Users ingesting complex schemas should load this reader. View Complex Schemas in DSI to see how a schema file should be structured
- __init__(filename, target_table_prefix=None, **kwargs)
filename: file name of the json file to be ingested
target_table_prefix: prefix to be added to every table name in the primary and foreign key list
- add_rows() None
Generates a dsi_relations OrderedDict to be added to the internal DSI abstraction.
The OrderedDict has 2 keys, primary key and foreign key, whose values are lists of PK and FK tuples associating tables and columns
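An illustrative shape of that dsi_relations OrderedDict is shown below; the exact key spellings and the table/column names are assumptions for the example:

```python
from collections import OrderedDict

# Hypothetical dsi_relations structure: each (table, column) tuple in
# the primary-key list corresponds positionally to a (table, column)
# tuple in the foreign-key list.
dsi_relations = OrderedDict([
    ("primary_key", [("student", "id")]),
    ("foreign_key", [("student__math", "student_id")]),
])
```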
- class dsi.plugins.file_reader.SchemaOrgDatacard(filenames, target_table_prefix=None, **kwargs)
- __init__(filenames, target_table_prefix=None, **kwargs)
filenames: file names of the Schema.org data card json files to be ingested
target_table_prefix: prefix to be added to every table name in the primary and foreign key list
- add_rows() None
- class dsi.plugins.file_reader.TOML1(filenames, target_table_prefix=None, **kwargs)
DSI Reader that reads in an individual or a set of TOML files
Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary
- __init__(filenames, target_table_prefix=None, **kwargs)
filenames: one toml file or a list of toml files to be ingested
target_table_prefix: prefix to be added to every table created to differentiate between other toml sources
- add_rows() None
Parses TOML data and creates an ordered dict whose keys are table names and values are an ordered dict for each table.
- class dsi.plugins.file_reader.Wildfire(filenames, table_name=None, sim_table=True, **kwargs)
DSI Reader that ingests Wildfire data stored as a CSV
Can be used for other cases if the data is post-processed and the reader is run only once. Can create a manual simulation table
- __init__(filenames, table_name=None, sim_table=True, **kwargs)
Initializes Wildfire Reader with user specified parameters.
filenames: Required input – Wildfire data files
table_name: default None. User can specify table name when loading the wildfire file.
sim_table: default True. Set to False if you DO NOT want to create a manual simulation table where each row of the Wildfire file is a separate sim.
When True, this also creates a new column in the wildfire data associating each row with a corresponding row/simulation in sim_table
- add_rows() None
Creates Ordered Dictionary for the wildfire data.
If sim_table = True, a sim_table OrderedDict is also created, and both are nested within a larger OrderedDict.
- class dsi.plugins.file_reader.YAML1(filenames, target_table_prefix=None, yamlSpace=' ', **kwargs)
DSI Reader that reads in an individual or a set of YAML files
Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary
- __init__(filenames, target_table_prefix=None, yamlSpace=' ', **kwargs)
filenames: one yaml file or a list of yaml files to be ingested
target_table_prefix: prefix to be added to every table created to differentiate between other yaml sources
yamlSpace: indentation used in the ingested yaml files. Default is 2 spaces, but it can be changed to match the indentation of the input
- add_rows() None
Parses YAML data to create a nested OrderedDict where each table name is a key, and its value is another OrderedDict with columns as keys and column data as values
- check_type(text)
Internal helper function, not used by DSI users.
Tests the input text and returns a predicted compatible SQL type.
text: text string
return: the input cast to an int or float if possible; otherwise the original string
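The casting logic described can be sketched as follows; infer_type is a hypothetical stand-in and may differ from YAML1.check_type in its details:

```python
# Try to cast text to int, then float; fall back to the original string.
def infer_type(text):
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text
```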
File Writers
- Note for users:
DSI runTable is only included in File Writers if data was previously ingested into a backend in a Core.Terminal workflow where runTable was set to True. In Example 4: Process data, runTable is included in a generated ER Diagram since it uses ingested data from Example 2: Ingest data where runTable = True
- class dsi.plugins.file_writer.Csv_Writer(table_name, filename, export_cols=None, **kwargs)
DSI Writer to output queries as CSV data
- __init__(table_name, filename, export_cols=None, **kwargs)
table_name: name of table to be exported to a csv
filename: name of the CSV file that will be generated
export_cols: default None. When specified, this must be a list of column names to keep in output csv file
Ex: all columns are [a, b, c, d, e]. export_cols = [a, c, e]
- get_rows(collection) None
Function that generates the output CSV file.
collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts
return: None. Only returns if error. Message is sent back to core to print along with error type. Ex: (ValueError, “error message”)
- class dsi.plugins.file_writer.ER_Diagram(filename, target_table_prefix=None, **kwargs)
DSI Writer that generates an ER Diagram from the current data in the DSI abstraction
- __init__(filename, target_table_prefix=None, **kwargs)
filename: file name of the ER Diagram to be generated
target_table_prefix: if generating diagram for only a select set of tables, can specify prefix to search for all alike tables
Ex: If prefix = “student”, only “student__address”, “student__math”, “student__physics” tables are included
- get_rows(collection) None
Function that generates the ER Diagram.
collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts
return: None. Only returns if error. Message is sent back to core to print along with error type. Ex: (ValueError, “error message”)
- class dsi.plugins.file_writer.Table_Plot(table_name, filename, display_cols=None, **kwargs)
DSI Writer that plots all numeric column data for a specified table
- __init__(table_name, filename, display_cols=None, **kwargs)
table_name: name of table to be plotted
filename: name of output file the plot will be stored in
display_cols: default None. When specified, must be a list of column names, whose data is NUMERICAL
- get_rows(collection) None
Function that generates the table plot image file.
collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts
return: None. Only returns if error. Message is sent back to core to print along with error type. Ex: (ValueError, “error message”)
Environment Plugins
- class dsi.plugins.env.Environment
Environment Plugins inspect the calling process’ context.
Environments assume a POSIX-compliant filesystem and always collect UID/GID information.
- __init__()
Initializes StructuredMetadata class with an output collector (Ordered Dictionary)
- class dsi.plugins.env.GitInfo(git_repo_path='./')
A Plugin to capture Git information.
Adds the current git remote and git commit to metadata.
- __init__(git_repo_path='./') None
Initializes the git repo in the given directory and access to git commands
- add_rows() None
Adds a row to the output with POSIX info, git remote, and git commit
- pack_header() None
Set schema with POSIX and Git columns
- class dsi.plugins.env.Hostname(**kwargs)
An example Environment implementation.
This plugin collects the hostname of the machine, and couples this with the POSIX information gathered by the Environment base class.
- __init__(**kwargs) None
Initializes StructuredMetadata class with an output collector (Ordered Dictionary)
- add_rows() None
Parses environment provenance data and adds the row.
- pack_header() None
Set schema with keys of prov_info.
- class dsi.plugins.env.SystemKernel
Plugin for reading environment provenance data.
An environment provenance plugin which does the following:
System Kernel Version
Kernel compile-time config
Kernel boot config
Kernel runtime config
Kernel modules and module config
Container information, if containerized
- __init__() None
Initialize SystemKernel with initial provenance info.
- add_rows() None
Parses environment provenance data and adds the row.
- static get_cmd_output(cmd: list, ignore_stderr=False) str
Runs a given command and returns the stdout if successful.
If stderr is not empty, an exception is raised with the stderr text.
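The described behavior can be sketched with subprocess; run_cmd is a hypothetical stand-in, not DSI's implementation:

```python
import subprocess

# Run a command, raise if anything was written to stderr (unless
# ignored), and return the stripped stdout -- mirroring the behavior
# described for get_cmd_output.
def run_cmd(cmd, ignore_stderr=False):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.stderr and not ignore_stderr:
        raise RuntimeError(proc.stderr)
    return proc.stdout.strip()
```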
- get_kernel_bt_config() dict
Kernel boot-time configuration is collected by looking at /proc/cmdline.
The output of this command is one string of boot-time parameters. This string is returned in a dict.
- get_kernel_ct_config() dict
Kernel compile-time configuration is collected by looking at /boot/config-(kernel version) and removing comments and empty lines.
The output of said command is newline-delimited option=value pairs.
- get_kernel_mod_config() dict
Kernel module configuration is collected with the “lsmod” and “modinfo” commands.
Each module and modinfo are stored as a key-value pair in the returned dict.
- get_kernel_rt_config() dict
Kernel run-time configuration is collected with the “sysctl -a” command.
The output of this command consists of lines of two kinds: option = value (note the spaces), and sysctl: permission denied … The option = value pairs are added to the output dict.
- get_kernel_version() dict
Kernel version is obtained by the “uname -r” command, returns it in a dict.
- get_prov_info() str
Collect and return the different categories of provenance info.
- pack_header() None
Set schema with keys of prov_info.
Optional Plugin Type Enforcement
Plugins take data in an arbitrary format and transform it into metadata which is queryable in DSI. Plugins may enforce types, but they are not required to. Plugin type enforcement can be static, like the Hostname default plugin, or dynamic, like the Bueno default plugin.
A collection of pydantic models for Plugin schema validation
- class dsi.plugins.plugin_models.EnvironmentModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int])
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class dsi.plugins.plugin_models.GitInfoModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], git_remote: str, git_commit: str)
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'git_commit': FieldInfo(annotation=str, required=True), 'git_remote': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class dsi.plugins.plugin_models.HostnameModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], hostname: str)
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'hostname': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class dsi.plugins.plugin_models.SystemKernelModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], kernel_info: str)
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'kernel_info': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- dsi.plugins.plugin_models.create_dynamic_model(name: str, col_names: list[str], col_types: list[type], base=None) BaseModel
Creates a pydantic model at runtime with given name, column names and types, and an optional base model to extend.
This is useful for when column names are not known until they are retrieved at runtime.
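pydantic's create_model is the standard way to build a model at runtime when column names are only known at runtime; a hedged sketch of a similar helper (make_model is hypothetical, and create_dynamic_model above may differ) might look like:

```python
from pydantic import create_model

# Build a pydantic model at runtime from parallel lists of column
# names and types, optionally extending a base model.
def make_model(name, col_names, col_types, base=None):
    fields = {n: (t, ...) for n, t in zip(col_names, col_types)}
    return create_model(name, __base__=base, **fields)

RowModel = make_model("RowModel", ["a", "b"], [int, str])
row = RowModel(a=1, b="x")
```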