Plugins
Plugins connect data-producing applications to DSI core functionalities. Plugins provide reader or writer functions. A Plugin reader function deals with existing data files or input streams. A Plugin writer deals with generating new data. Plugins are modular to support user contribution.
Plugin contributors are encouraged to offer custom Plugin abstract classes and Plugin implementations.
A contributed Plugin abstract class may extend another Plugin to inherit the properties of its parent.
To be compatible with DSI core, Plugins should produce data in Python built-in data structures or data structures sourced from the Python collections library.
Note that any contributed Plugin or extension should include unit tests in plugins/tests to demonstrate the new Plugin capability.

The figure below depicts a prominent portion of the current DSI plugin class hierarchy.
- class dsi.plugins.plugin.Plugin(path)
Plugin abstract class for DSI core product.
A Plugin connects a data reader or writer to a compatible middleware data structure.
- abstractmethod __init__(path)
Initialize Plugin setup.
Read a Plugin file. Return a Plugin object.
- abstractmethod add_to_output(path)
Add collected data to the Plugin's output. Subclasses implement this to pass data into the middleware data structure.
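The abstract interface above can be sketched in plain Python. The base class and the `TextReader` subclass below are illustrative only, not DSI's shipped code:

```python
from abc import ABC, abstractmethod


class Plugin(ABC):
    """Sketch of a Plugin abstract class: readers/writers implement both methods."""

    @abstractmethod
    def __init__(self, path):
        self.path = path

    @abstractmethod
    def add_to_output(self, data):
        ...


class TextReader(Plugin):
    """Hypothetical reader that collects rows from a text source."""

    def __init__(self, path):
        self.path = path
        self.output = []

    def add_to_output(self, data):
        self.output.append(data)


reader = TextReader("run_results.txt")
reader.add_to_output(["row1_value"])
print(reader.output)  # [['row1_value']]
```

Because both methods are abstract, `Plugin` itself cannot be instantiated; only concrete subclasses like `TextReader` can.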
Metadata Processing
Note for users: the StructuredMetadata class assigns data from a file_reader to the DSI abstraction in core. If data in a user-written reader is structured as an OrderedDict, you only need to call set_schema_2() at the bottom of add_rows().
- class dsi.plugins.metadata.StructuredMetadata(**kwargs)
Plugin superclass that provides convenience methods for structured data
- __init__(**kwargs)
Initializes a StructuredDataPlugin with an output collector
- add_to_output(row: list, tableName=None) None
Adds a row of data to the output_collector and enforces consistent structure. Useful in a plugin's add_rows method.
DO NOT USE THIS WITH SET_SCHEMA_2()
row: a list of data representing one row
tableName: default None. Specified name of the table to ingest the row into.
- schema_is_set() bool
Helper method to see if the schema has been set
DO NOT USE THIS WITH SET_SCHEMA_2()
- set_schema(table_data: list, validation_model=None) None
Initializes columns in the output_collector and table_cnt. Useful in a plugin’s pack_header method.
DO NOT USE THIS WITH SET_SCHEMA_2()
table_data:
for ingested data with multiple tables, table_data is list of tuples where each tuple is structured as (table name, column name list)
for data without multiple tables, table_data is just a list of column names
- set_schema_2(collection, validation_model=None) None
Faster version (in time and space) of updating the output_collector by directly assigning 'collection' to it, provided collection is an OrderedDict
DO NOT USE THIS WITH SET_SCHEMA(), ADD_TO_OUTPUT(), OR SCHEMA_IS_SET()
collection: data passed in from a plugin as an OrderedDict.
If the collection holds only one table of data, it is nested in another OrderedDict whose table name is the plugin class name
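The nested OrderedDict shape that set_schema_2() expects can be sketched as follows. The plugin class name `MyReader` and the column names are hypothetical:

```python
from collections import OrderedDict

# Column-oriented table data: keys are column names, values are column data.
table = OrderedDict([("temperature", [20.5, 21.0]), ("pressure", [1.1, 1.2])])

# A single table is nested under the plugin class name (here, "MyReader").
collection = OrderedDict([("MyReader", table)])

print(list(collection.keys()))       # ['MyReader']
print(list(collection["MyReader"]))  # ['temperature', 'pressure']
```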
File Readers
- Note for users:
Assume names of data structures from all data sources are consistent/stable, e.g. table/column names are consistent. The number of columns in a table CAN vary.
Plugin readers in the DSI repo can/should handle data files with mismatched numbers of columns. Ex: in file1, table1 has columns a, b, c; in file2, table1 has columns a, b, d.
If only reading in one table, users can utilize pandas to stack multiple dataframes vertically (as the CSV reader does).
If ingesting multiple tables at a time, users must pad tables with null data (YAML1 uses this approach and has example code at the bottom of add_rows() to implement it).
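The padding approach for mismatched columns can be sketched without DSI: when the same table arrives from two files with different column sets, missing columns are filled with None so every column keeps equal length. This helper is illustrative, not the YAML1 implementation:

```python
from collections import OrderedDict


def pad_and_merge(existing, incoming):
    """Merge two column-oriented tables, padding missing columns with None."""
    n_existing = len(next(iter(existing.values()), []))
    n_incoming = len(next(iter(incoming.values()), []))
    merged = OrderedDict()
    for col in list(existing) + [c for c in incoming if c not in existing]:
        old = existing.get(col, [None] * n_existing)
        new = incoming.get(col, [None] * n_incoming)
        merged[col] = old + new
    return merged


# file1: table1 has columns a, b, c; file2: table1 has columns a, b, d
file1 = OrderedDict([("a", [1]), ("b", [2]), ("c", [3])])
file2 = OrderedDict([("a", [4]), ("b", [5]), ("d", [6])])
merged = pad_and_merge(file1, file2)
print(merged)
# OrderedDict([('a', [1, 4]), ('b', [2, 5]), ('c', [3, None]), ('d', [None, 6])])
```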
- class dsi.plugins.file_reader.Bueno(filenames, **kwargs)
A Structured Data Plugin to capture performance data from Bueno (github.com/lanl/bueno)
Bueno outputs performance data as key-value pairs in a file. Keys and values are delimited by ":". Key-value pairs are delimited by "\n".
- __init__(filenames, **kwargs) None
filenames: one Bueno file or a list of Bueno files to be ingested
- add_rows() None
Parses Bueno data and adds a list containing 1 or more rows.
- class dsi.plugins.file_reader.Csv(filenames, table_name=None, **kwargs)
A Structured Data Plugin to ingest CSV data
- __init__(filenames, table_name=None, **kwargs)
Initializes CSV Reader with user specified filenames and optional table_name.
filenames: Required input. A list of CSV files, or just one CSV file, to store in DSI. If a list, data in all files must belong to the same table
table_name: default None. User can specify table name when loading CSV file. Otherwise DSI uses table_name = “Csv”
- add_rows() None
Adds a list containing one or more rows of the CSV along with file_info to output.
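A simplified sketch of the column-oriented structure a CSV-style add_rows produces, using only the standard library (this is not the Csv plugin's actual code, and the inline string stands in for a file on disk):

```python
import csv
import io
from collections import OrderedDict

# Inline CSV text stands in for a CSV file passed via `filenames`.
text = "name,score\nalice,90\nbob,85\n"
rows = csv.DictReader(io.StringIO(text))

# Build a column-oriented OrderedDict, like a DSI reader's output table.
table = OrderedDict()
for row in rows:
    for col, val in row.items():
        table.setdefault(col, []).append(val)

print(table)  # OrderedDict([('name', ['alice', 'bob']), ('score', ['90', '85'])])
```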
- class dsi.plugins.file_reader.JSON(filenames, **kwargs)
A Structured Data Plugin to capture JSON data
The JSON data’s keys are used as columns and values are rows
- __init__(filenames, **kwargs) None
Initializes generic JSON reader with user-specified filenames
- add_rows() None
Parses JSON data and adds a list containing 1 or more rows.
- pack_header() None
Set schema with POSIX and JSON data.
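The keys-as-columns, values-as-rows convention described above can be sketched with the standard library (the field names are hypothetical):

```python
import json
from collections import OrderedDict

# JSON keys become columns; the values become one row of data.
text = '{"experiment": "run_42", "steps": 100, "dt": 0.5}'
data = json.loads(text, object_pairs_hook=OrderedDict)

columns = list(data.keys())
row = list(data.values())
print(columns)  # ['experiment', 'steps', 'dt']
print(row)      # ['run_42', 100, 0.5]
```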
- class dsi.plugins.file_reader.MetadataReader1(filenames, target_table_prefix=None, **kwargs)
Structured Data Plugin to read in an individual or a set of JSON metadata files
- __init__(filenames, target_table_prefix=None, **kwargs)
filenames: one metadata json file or a list of metadata json files to be ingested
target_table_prefix: prefix to be added to every table created to differentiate between other metadata file sources
- add_rows() None
Parses metadata JSON files and creates an OrderedDict whose keys are file names and whose values are an OrderedDict of each file's data
- class dsi.plugins.file_reader.Schema(filename, target_table_prefix=None, **kwargs)
Structured Data Plugin to parse schema of a data source that will be ingested in same workflow.
Schema file input should be a JSON file that stores primary and foreign keys for all tables in the data source. Stores all relations in global dsi_relations table used for creating backends/writers
- __init__(filename, target_table_prefix=None, **kwargs)
filename: file name of the json file to be ingested
target_table_prefix: prefix to be added to every table name in the primary and foreign key list
- add_rows() None
Generates the dsi_relations OrderedDict to be added to the internal DSI abstraction.
The Ordered Dict has 2 keys, primary key and foreign key, with their values a list of PK and FK tuples associating tables and columns
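The dsi_relations structure described above might look like the following, where the foreign-key tuple at index i references the primary-key tuple at the same index (table and column names here are hypothetical):

```python
from collections import OrderedDict

# Two keys: lists of (table, column) tuples for primary and foreign keys.
dsi_relations = OrderedDict([
    ("primary_key", [("student", "id"), ("course", "id")]),
    ("foreign_key", [("enrollment", "student_id"), ("enrollment", "course_id")]),
])

# Each FK/PK pair at the same index associates two tables.
for pk, fk in zip(dsi_relations["primary_key"], dsi_relations["foreign_key"]):
    print(f"{fk[0]}.{fk[1]} -> {pk[0]}.{pk[1]}")
# enrollment.student_id -> student.id
# enrollment.course_id -> course.id
```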
- class dsi.plugins.file_reader.TOML1(filenames, target_table_prefix=None, **kwargs)
Structured Data Plugin to read in an individual or a set of TOML files
Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary
- __init__(filenames, target_table_prefix=None, **kwargs)
filenames: one toml file or a list of toml files to be ingested
target_table_prefix: prefix to be added to every table created to differentiate between other toml sources
- add_rows() None
Parses TOML data and creates an ordered dict whose keys are table names and values are an ordered dict for each table.
- class dsi.plugins.file_reader.Wildfire(filenames, table_name=None, sim_table=False, **kwargs)
A Structured Data Plugin to ingest Wildfire data stored as a CSV
Can be used for other cases if the data is post-processed and the reader is run only once. Can create a manual simulation table.
- __init__(filenames, table_name=None, sim_table=False, **kwargs)
Initializes Wildfire Reader with user specified parameters.
filenames: Required input – Wildfire data files
table_name: default None. User can specify table name when loading the wildfire file.
sim_table: default False. Set to True to create a manual simulation table where each row of the Wildfire file is a separate simulation.
Also creates a new column in the wildfire data for each row to associate it with a corresponding row/simulation in sim_table.
- add_rows() None
Creates Ordered Dictionary for the wildfire data.
If sim_table = True, a sim_table Ordered Dict also created, and both are nested within a larger Ordered Dict.
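The sim_table association can be sketched as follows; the table, column, and id names are illustrative, not the plugin's actual naming:

```python
from collections import OrderedDict

# Wildfire-style data: each row is treated as a separate simulation.
wildfire = OrderedDict([("fuel", [0.2, 0.4]), ("wind", [5.0, 7.5])])
n_rows = len(wildfire["fuel"])

# sim_table: one entry per simulation; the data table gains a matching id column.
sim_table = OrderedDict([("sim_id", list(range(n_rows)))])
wildfire["sim_id"] = list(range(n_rows))

# Both tables are nested within a larger OrderedDict.
collection = OrderedDict([("wildfire", wildfire), ("sim_table", sim_table)])
print(collection["wildfire"]["sim_id"])  # [0, 1]
```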
- class dsi.plugins.file_reader.YAML1(filenames, target_table_prefix=None, yamlSpace=' ', **kwargs)
Structured Data Plugin to read in an individual or a set of YAML files
Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary
- __init__(filenames, target_table_prefix=None, yamlSpace=' ', **kwargs)
filenames: one yaml file or a list of yaml files to be ingested
target_table_prefix: prefix to be added to every table created to differentiate between other yaml sources
yamlSpace: indent used in ingested yaml files - default 2 spaces but can change to the indentation used in input
- add_rows() None
Parses YAML data to create a nested OrderedDict where each table is a key and its data is another OrderedDict whose keys are columns and values are column data
- check_type(text)
Internal helper function that tests input text and returns a predicted compatible SQL type
text: text string
return: the value as an int or float if conversion succeeds, otherwise the original string
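The type prediction described for check_type can be sketched as a try-int-then-float fallback. This is an illustrative re-implementation, not the plugin's exact code:

```python
def check_type(text):
    """Return text coerced to int or float when possible, else the original string."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


print(check_type("42"))     # 42
print(check_type("3.14"))   # 3.14
print(check_type("hello"))  # hello
```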
File Writers
- Note for users:
If the runTable flag is True in the Terminal instantiation, the run table is only included in the ER Diagram writer if data is processed from a backend. See Example 4: Process data for an instance of a runTable included in an ER diagram.
- class dsi.plugins.file_writer.Csv_Writer(table_name, filename, export_cols=None, **kwargs)
A Plugin to output queries as CSV data
- __init__(table_name, filename, export_cols=None, **kwargs)
table_name: name of table to be exported to a csv
filename: name of the CSV file that will be generated
export_cols: default None. When specified, this must be a list of column names to keep in output csv file
Ex: all columns are [a, b, c, d, e]. export_cols = [a, c, e]
- get_rows(collection) None
Function called in core.py that generates the output CSV file.
collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts
return: None. Only returns if error. Message is sent back to core to print along with error type. Ex: (ValueError, “error message”)
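The export_cols filtering over the nested collection can be sketched with the standard library. The helper name and table data are hypothetical, not Csv_Writer's implementation:

```python
import csv
import io
from collections import OrderedDict


def write_table_csv(collection, table_name, out, export_cols=None):
    """Write one table from a nested column-oriented OrderedDict as CSV."""
    table = collection[table_name]
    cols = [c for c in table if export_cols is None or c in export_cols]
    writer = csv.writer(out)
    writer.writerow(cols)
    # Columns are parallel lists; zip them back into rows.
    for row in zip(*(table[c] for c in cols)):
        writer.writerow(row)


collection = OrderedDict([
    ("results", OrderedDict([("a", [1, 2]), ("b", [3, 4]), ("c", [5, 6])])),
])
buf = io.StringIO()
write_table_csv(collection, "results", buf, export_cols=["a", "c"])
print(buf.getvalue())  # header "a,c", then rows "1,5" and "2,6"
```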
- class dsi.plugins.file_writer.ER_Diagram(filename, target_table_prefix=None, **kwargs)
Plugin that generates an ER Diagram from the current data in the DSI abstraction
- __init__(filename, target_table_prefix=None, **kwargs)
filename: file name of the ER Diagram to be generated
target_table_prefix: if generating diagram for only a select set of tables, can specify prefix to search for all alike tables
Ex: prefix = “student” so only “student__address”, “student__math”, “student__physics” tables are displayed here
- get_rows(collection) None
Function called in core.py that generates the ER Diagram.
collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts
return: None. Only returns if error. Message is sent back to core to print along with error type. Ex: (ValueError, “error message”)
- class dsi.plugins.file_writer.Table_Plot(table_name, filename, display_cols=None, **kwargs)
Plugin that plots all numeric column data for a specified table
- __init__(table_name, filename, display_cols=None, **kwargs)
table_name: name of table to be plotted
filename: name of output file the plot will be stored in
display_cols: default None. When specified, must be a list of column names, whose data is NUMERICAL, to plot
- get_rows(collection) None
Function called in core.py that generates the table plot image file.
collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts
return: None. Only returns if error. Message is sent back to core to print along with error type. Ex: (ValueError, “error message”)
Environment Plugins
- class dsi.plugins.env.Environment
Environment Plugins inspect the calling process’ context.
Environments assume a POSIX-compliant filesystem and always collect UID/GID information.
- __init__()
Initializes a StructuredDataPlugin with an output collector
- class dsi.plugins.env.GitInfo(git_repo_path='./')
A Plugin to capture Git information.
Adds the current git remote and git commit to metadata.
- __init__(git_repo_path='./') None
Initializes the git repo in the given directory and access to git commands
- add_rows() None
Adds a row to the output with POSIX info, git remote, and git commit
- pack_header() None
Set schema with POSIX and Git columns
- class dsi.plugins.env.Hostname(**kwargs)
An example Environment implementation.
This plugin collects the hostname of the machine, and couples this with the POSIX information gathered by the Environment base class.
- __init__(**kwargs) None
Initializes a StructuredDataPlugin with an output collector
- add_rows() None
Parses environment provenance data and adds the row.
- pack_header() None
Set schema with keys of prov_info.
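The kind of information Hostname couples together can be gathered with the standard library on a POSIX system. This is a sketch of the idea, not the plugin's code, and the "moniker" lookup via the USER environment variable is an assumption:

```python
import os
import socket
from collections import OrderedDict

# POSIX identity info plus the machine hostname, as one row of provenance data.
prov_info = OrderedDict([
    ("uid", os.getuid()),
    ("effective_gid", os.getegid()),
    ("moniker", os.environ.get("USER", "unknown")),  # assumed lookup, may differ
    ("hostname", socket.gethostname()),
])
print(prov_info)
```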
- class dsi.plugins.env.SystemKernel
Plugin for reading environment provenance data.
An environment provenance plugin which does the following:
System Kernel Version
Kernel compile-time config
Kernel boot config
Kernel runtime config
Kernel modules and module config
Container information, if containerized
- __init__() None
Initialize SystemKernel with initial provenance info.
- add_rows() None
Parses environment provenance data and adds the row.
- static get_cmd_output(cmd: list, ignore_stderr=False) str
Runs a given command and returns the stdout if successful.
If stderr is not empty, an exception is raised with the stderr text.
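The described behavior, returning stdout but raising when stderr is non-empty, can be sketched with subprocess (an illustrative version, not the plugin's exact implementation):

```python
import subprocess


def get_cmd_output(cmd, ignore_stderr=False):
    """Run cmd; return its stdout, raising if stderr is non-empty (unless ignored)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.stderr and not ignore_stderr:
        raise RuntimeError(proc.stderr)
    return proc.stdout.strip()


print(get_cmd_output(["echo", "hello"]))  # hello
```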
- get_kernel_bt_config() dict
Kernel boot-time configuration is collected by reading /proc/cmdline.
The contents of that file are one string of boot-time parameters. This string is returned in a dict.
- get_kernel_ct_config() dict
Kernel compile-time configuration is collected by reading /boot/config-(kernel version) and removing comments and empty lines.
The contents of that file are newline-delimited option=value pairs.
- get_kernel_mod_config() dict
Kernel module configuration is collected with the “lsmod” and “modinfo” commands.
Each module and modinfo are stored as a key-value pair in the returned dict.
- get_kernel_rt_config() dict
Kernel run-time configuration is collected with the “sysctl -a” command.
The output of this command is lines consisting of two possibilities: option = value (note the spaces), and sysctl: permission denied … The option = value pairs are added to the output dict.
- get_kernel_version() dict
Kernel version is obtained with the “uname -r” command and returned in a dict.
- get_prov_info() str
Collect and return the different categories of provenance info.
- pack_header() None
Set schema with keys of prov_info.
Optional Plugin Type Enforcement
Plugins take data in an arbitrary format and transform it into metadata which is queryable in DSI. Plugins may enforce types, but they are not required to do so. Plugin type enforcement can be static, like the Hostname default plugin, or dynamic, like the Bueno default plugin.
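The static/dynamic distinction can be sketched without pydantic: a static check fixes its column types up front, while a dynamic check derives them from data seen at runtime. The schemas and validate helper below are illustrative only, not DSI's validation code:

```python
# Static enforcement: the expected column types are known ahead of time
# (like a Hostname-style plugin with a fixed schema).
STATIC_SCHEMA = {"hostname": str, "uid": int}


def validate(row, schema):
    """Raise TypeError if any value in row does not match its schema type."""
    for col, expected in schema.items():
        if not isinstance(row[col], expected):
            raise TypeError(f"{col} should be {expected.__name__}")
    return True


# Dynamic enforcement: infer the schema from the first row seen at runtime
# (like a Bueno-style plugin whose columns are not known in advance).
first_row = {"metric": "runtime", "value": 12.5}
dynamic_schema = {col: type(val) for col, val in first_row.items()}

print(validate({"hostname": "node01", "uid": 1000}, STATIC_SCHEMA))  # True
print(validate({"metric": "flops", "value": 3.2}, dynamic_schema))   # True
```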
A collection of pydantic models for Plugin schema validation
- class dsi.plugins.plugin_models.EnvironmentModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int])
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class dsi.plugins.plugin_models.GitInfoModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], git_remote: str, git_commit: str)
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'git_commit': FieldInfo(annotation=str, required=True), 'git_remote': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class dsi.plugins.plugin_models.HostnameModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], hostname: str)
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'hostname': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- class dsi.plugins.plugin_models.SystemKernelModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], kernel_info: str)
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'kernel_info': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- dsi.plugins.plugin_models.create_dynamic_model(name: str, col_names: list[str], col_types: list[type], base=None) BaseModel
Creates a pydantic model at runtime with given name, column names and types, and an optional base model to extend.
This is useful for when column names are not known until they are retrieved at runtime.