Plugins

Plugins connect data-producing applications to DSI core functionalities. A Plugin provides reader or writer functions: a Plugin reader ingests existing data files or input streams, while a Plugin writer generates new data. Plugins are modular to support user contribution.

Plugin contributors are encouraged to offer custom Plugin abstract classes and Plugin implementations. A contributed Plugin abstract class may extend another plugin to inherit the properties of the parent. In order to be compatible with DSI core, Plugins should produce data in Python built-in data structures or data structures sourced from the Python collections library.

Note that any contributed Plugin or extension should include unit tests in plugins/tests to demonstrate the new Plugin's capability.

Figure: prominent portion of the current DSI Plugin class hierarchy.

class dsi.plugins.plugin.Plugin(path)

Plugin abstract class for DSI core product.

A Plugin connects a data reader or writer to a compatible middleware data structure.

abstractmethod __init__(path)

Initialize Plugin setup.

Read a Plugin file. Return a Plugin object.

abstractmethod add_to_output(path)

Add the Plugin's data to its output.

Implemented by subclasses, e.g. StructuredMetadata, which appends rows to an output collector.

Metadata Processing

Note for users: the StructuredMetadata class is used to assign data from a file_reader to the DSI abstraction in core. If the data in a user-written reader is structured as an OrderedDict, you only need to call set_schema_2() at the bottom of add_rows().
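As a sketch of that pattern, a user-written reader can assemble its data into the nested OrderedDict shape that set_schema_2() expects. The helper and the table/column names below are illustrative, not part of DSI:

```python
from collections import OrderedDict

def build_collection(table_name, columns):
    """Assemble the nested shape set_schema_2() expects:
    {table_name: {column_name: [value, ...], ...}}"""
    table = OrderedDict((col, list(vals)) for col, vals in columns.items())
    return OrderedDict({table_name: table})

# Inside a reader's add_rows(), this would end with:
#   self.set_schema_2(build_collection("sim_run", data))
collection = build_collection("sim_run", {"step": [0, 1], "energy": [3.2, 2.9]})
```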

class dsi.plugins.metadata.StructuredMetadata(**kwargs)

Plugin superclass that provides handy methods for structured data.

__init__(**kwargs)

Initializes a StructuredDataPlugin with an output collector

add_to_output(row: list, tableName=None) None

Adds a row of data to the output_collector and guarantees good structure. Useful in a plugin’s add_rows method.

DO NOT USE THIS WITH SET_SCHEMA_2()

row: list of row of data

tableName: default None. Specified name of table to ingest row into.

schema_is_set() bool

Helper method to see if the schema has been set

DO NOT USE THIS WITH SET_SCHEMA_2()

set_schema(table_data: list, validation_model=None) None

Initializes columns in the output_collector and table_cnt. Useful in a plugin’s pack_header method.

DO NOT USE THIS WITH SET_SCHEMA_2()

table_data:

  • for ingested data with multiple tables, table_data is list of tuples where each tuple is structured as (table name, column name list)

  • for data without multiple tables, table_data is just a list of column names

set_schema_2(collection, validation_model=None) None

Faster version (in time and space) of updating the output_collector: directly assigns 'collection' to it, provided collection is an OrderedDict.

DO NOT USE THIS WITH SET_SCHEMA(), ADD_TO_OUTPUT(), OR SCHEMA_IS_SET()

collection: data passed in from a plugin as an Ordered Dict.

  • If there is only one table of data, it is nested in another OrderedDict with the table name set to the plugin class name

File Readers

Note for users:
  • Assume the structure of data from all data sources is consistent/stable. Ex: table/column names are consistent. The number of columns in a table CAN vary.

  • Plugin readers in DSI repo can/should handle data files with mismatched number of columns. Ex: file1: table1 has columns a, b, c. file2: table1 has columns a, b, d

    • if only reading in one table, users can utilize Python pandas to stack multiple dataframes vertically (CSV reader)

    • if ingesting multiple tables at a time, users must pad tables with null data (YAML1 uses this and has example code at bottom of add_row() to implement this)
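The null-padding approach in the second bullet can be sketched as follows. This is a simplified, hypothetical merge helper, not the actual YAML1 code:

```python
from collections import OrderedDict

def merge_row(table, new_row):
    """Merge one file's row (a column -> value dict) into a table stored as
    column -> list-of-values, padding missing columns with None on both sides."""
    n_existing = len(next(iter(table.values()), []))  # rows stored so far
    for col in new_row:
        if col not in table:
            table[col] = [None] * n_existing          # back-fill a new column
    for col in table:
        table[col].append(new_row.get(col))           # None if absent in new_row
    return table

table = OrderedDict()
merge_row(table, {"a": 1, "b": 2, "c": 3})   # file1: table1 has columns a, b, c
merge_row(table, {"a": 4, "b": 5, "d": 6})   # file2: table1 has columns a, b, d
```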

class dsi.plugins.file_reader.Bueno(filenames, **kwargs)

A Structured Data Plugin to capture performance data from Bueno (github.com/lanl/bueno)

Bueno outputs performance data as key-value pairs in a file. Keys and values are delimited by :. Key-value pairs are delimited by \n.

__init__(filenames, **kwargs) None

filenames: one Bueno file or a list of Bueno files to be ingested

add_rows() None

Parses Bueno data and adds a list containing 1 or more rows.
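The key-value format described above can be parsed with a few lines of Python. This is an illustrative sketch, not the actual Bueno reader:

```python
def parse_bueno(text):
    """Parse newline-delimited key:value pairs into a dict."""
    data = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:                       # skip lines without a delimiter
            data[key.strip()] = value.strip()
    return data

perf = parse_bueno("runtime:12.5\nnodes:4")
```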

class dsi.plugins.file_reader.Csv(filenames, table_name=None, **kwargs)

A Structured Data Plugin to ingest CSV data

__init__(filenames, table_name=None, **kwargs)

Initializes CSV Reader with user specified filenames and optional table_name.

filenames: Required input. List of CSV files, or just one CSV file, to store in DSI. If a list, data in all files must belong to the same table

table_name: default None. User can specify table name when loading CSV file. Otherwise DSI uses table_name = “Csv”

add_rows() None

Adds a list containing one or more rows of the CSV along with file_info to output.
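As a standalone illustration of the shape this reader produces, CSV text can be pivoted into a column-oriented OrderedDict with the stdlib csv module. This is a sketch, not the actual Csv plugin:

```python
import csv
import io
from collections import OrderedDict

def csv_to_table(text):
    """Pivot CSV text into a column-name -> list-of-values OrderedDict."""
    reader = csv.DictReader(io.StringIO(text))
    table = OrderedDict((name, []) for name in reader.fieldnames)
    for row in reader:
        for name in reader.fieldnames:
            table[name].append(row[name])
    return table

table = csv_to_table("a,b\n1,2\n3,4")
```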

class dsi.plugins.file_reader.JSON(filenames, **kwargs)

A Structured Data Plugin to capture JSON data

The JSON data’s keys are used as columns and values are rows

__init__(filenames, **kwargs) None

Initializes generic JSON reader with user-specified filenames

add_rows() None

Parses JSON data and adds a list containing 1 or more rows.

pack_header() None

Set schema with POSIX and JSON data.

class dsi.plugins.file_reader.MetadataReader1(filenames, target_table_prefix=None, **kwargs)

Structured Data Plugin to read in an individual or a set of JSON metadata files

__init__(filenames, target_table_prefix=None, **kwargs)

filenames: one metadata json file or a list of metadata json files to be ingested

target_table_prefix: prefix to be added to every table created to differentiate between other metadata file sources

add_rows() None

Parses metadata json files and creates an ordered dict whose keys are file names and values are an ordered dict of that file’s data

class dsi.plugins.file_reader.Schema(filename, target_table_prefix=None, **kwargs)

Structured Data Plugin to parse schema of a data source that will be ingested in same workflow.

Schema file input should be a JSON file that stores primary and foreign keys for all tables in the data source. Stores all relations in the global dsi_relations table, which is used when creating backends/writers

__init__(filename, target_table_prefix=None, **kwargs)

filename: file name of the json file to be ingested

target_table_prefix: prefix to be added to every table name in the primary and foreign key list

add_rows() None

Generates the dsi_relations OrderedDict to be added to the internal DSI abstraction.

The Ordered Dict has 2 keys, primary key and foreign key, with their values a list of PK and FK tuples associating tables and columns
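For illustration, the dsi_relations structure described above might look like the following. The key names and tables here are hypothetical, modeled on the description rather than actual output:

```python
from collections import OrderedDict

# Hypothetical shape: two keys, each holding (table, column) tuples that
# associate primary keys with the foreign keys referencing them
dsi_relations = OrderedDict([
    ("primary_key", [("student", "id"), ("course", "id")]),
    ("foreign_key", [("enrollment", "student_id"), ("enrollment", "course_id")]),
])
```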

class dsi.plugins.file_reader.TOML1(filenames, target_table_prefix=None, **kwargs)

Structured Data Plugin to read in an individual or a set of TOML files

Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary

__init__(filenames, target_table_prefix=None, **kwargs)

filenames: one toml file or a list of toml files to be ingested

target_table_prefix: prefix to be added to every table created to differentiate between other toml sources

add_rows() None

Parses TOML data and creates an ordered dict whose keys are table names and values are an ordered dict for each table.

class dsi.plugins.file_reader.Wildfire(filenames, table_name=None, sim_table=False, **kwargs)

A Structured Data Plugin to ingest Wildfire data stored as a CSV

Can be used for other cases if the data is post-processed and run only once. Can create a manual simulation table

__init__(filenames, table_name=None, sim_table=False, **kwargs)

Initializes Wildfire Reader with user specified parameters.

filenames: Required input – Wildfire data files

table_name: default None. User can specify table name when loading the wildfire file.

sim_table: default False. Set to True if creating manual simulation table where each row of Wildfire file is a separate sim

  • also creates new column in wildfire data for each row to associate to a corresponding row/simulation in sim_table

add_rows() None

Creates Ordered Dictionary for the wildfire data.

If sim_table = True, a sim_table OrderedDict is also created, and both are nested within a larger OrderedDict.

class dsi.plugins.file_reader.YAML1(filenames, target_table_prefix=None, yamlSpace='  ', **kwargs)

Structured Data Plugin to read in an individual or a set of YAML files

Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary

__init__(filenames, target_table_prefix=None, yamlSpace='  ', **kwargs)

filenames: one yaml file or a list of yaml files to be ingested

target_table_prefix: prefix to be added to every table created to differentiate between other yaml sources

yamlSpace: indentation used in the ingested YAML files. Default is 2 spaces but can be changed to match the input files

add_rows() None

Parses YAML data to create a nested OrderedDict where each table name is a key and its data is another OrderedDict with columns as keys and column data as values

check_type(text)

Internal helper function that tests input text and returns a predicted compatible SQL Type

text: text string

return: string returned as int, float or still a string
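A minimal sketch of such a type-prediction helper (illustrative only, not the actual implementation):

```python
def predict_type(text):
    """Return text converted to int or float when possible, else unchanged."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text

values = [predict_type(t) for t in ("12", "3.5", "hello")]
```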

File Writers

Note for users:
  • If the runTable flag is True in Terminal instantiation, the run table is only included in the ER Diagram writer if data is processed from a backend. View Example 4: Process data to see an instance of a runTable included in an ER diagram

class dsi.plugins.file_writer.Csv_Writer(table_name, filename, export_cols=None, **kwargs)

A Plugin to output queries as CSV data

__init__(table_name, filename, export_cols=None, **kwargs)

table_name: name of table to be exported to a csv

filename: name of the CSV file that will be generated

export_cols: default None. When specified, this must be a list of column names to keep in output csv file

  • Ex: all columns are [a, b, c, d, e]. export_cols = [a, c, e]
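The export_cols behavior amounts to selecting a subset of columns, as in this hypothetical helper:

```python
def filter_columns(table, export_cols=None):
    """Keep only export_cols (when given) from a column -> values mapping."""
    if export_cols is None:
        return dict(table)
    return {col: table[col] for col in export_cols if col in table}

full = {"a": [1], "b": [2], "c": [3], "d": [4], "e": [5]}
subset = filter_columns(full, export_cols=["a", "c", "e"])
```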

get_rows(collection) None

Function called in core.py that generates the output CSV file.

collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts

return: None on success. Only returns a value on error: an error type and message sent back to core to print. Ex: (ValueError, “error message”)

class dsi.plugins.file_writer.ER_Diagram(filename, target_table_prefix=None, **kwargs)

Plugin that generates an ER Diagram from the current data in the DSI abstraction

__init__(filename, target_table_prefix=None, **kwargs)

filename: file name of the ER Diagram to be generated

target_table_prefix: if generating diagram for only a select set of tables, can specify prefix to search for all alike tables

  • Ex: prefix = “student” so only “student__address”, “student__math”, “student__physics” tables are displayed here

get_rows(collection) None

Function called in core.py that generates the ER Diagram.

collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts

return: None on success. Only returns a value on error: an error type and message sent back to core to print. Ex: (ValueError, “error message”)

class dsi.plugins.file_writer.Table_Plot(table_name, filename, display_cols=None, **kwargs)

Plugin that plots all numeric column data for a specified table

__init__(table_name, filename, display_cols=None, **kwargs)

table_name: name of table to be plotted

filename: name of output file the plot will be stored in

display_cols: default None. When specified, must be a list of column names, whose data is NUMERICAL, to plot

get_rows(collection) None

Function called in core.py that generates the table plot image file.

collection: representation of internal DSI abstraction. It is a nested Ordered Dict, with table names as keys, and table data as Ordered Dicts

return: None on success. Only returns a value on error: an error type and message sent back to core to print. Ex: (ValueError, “error message”)

Environment Plugins

class dsi.plugins.env.Environment

Environment Plugins inspect the calling process’ context.

Environments assume a POSIX-compliant filesystem and always collect UID/GID information.

__init__()

Initializes a StructuredDataPlugin with an output collector

class dsi.plugins.env.GitInfo(git_repo_path='./')

A Plugin to capture Git information.

Adds the current git remote and git commit to metadata.

__init__(git_repo_path='./') None

Initializes the git repo in the given directory and access to git commands

add_rows() None

Adds a row to the output with POSIX info, git remote, and git commit

pack_header() None

Set schema with POSIX and Git columns

class dsi.plugins.env.Hostname(**kwargs)

An example Environment implementation.

This plugin collects the hostname of the machine, and couples this with the POSIX information gathered by the Environment base class.

__init__(**kwargs) None

Initializes a StructuredDataPlugin with an output collector

add_rows() None

Parses environment provenance data and adds the row.

pack_header() None

Set schema with keys of prov_info.

class dsi.plugins.env.SystemKernel

Plugin for reading environment provenance data.

An environment provenance plugin which does the following:

  1. System Kernel Version

  2. Kernel compile-time config

  3. Kernel boot config

  4. Kernel runtime config

  5. Kernel modules and module config

  6. Container information, if containerized

__init__() None

Initialize SystemKernel with initial provenance info.

add_rows() None

Parses environment provenance data and adds the row.

static get_cmd_output(cmd: list, ignore_stderr=False) str

Runs a given command and returns the stdout if successful.

If stderr is not empty, an exception is raised with the stderr text.
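A sketch of that behavior with the stdlib subprocess module (an assumed implementation, not the actual method):

```python
import subprocess

def run_cmd(cmd, ignore_stderr=False):
    """Run cmd (an argv list); return stdout, raising if stderr is non-empty."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.stderr and not ignore_stderr:
        raise RuntimeError(proc.stderr)
    return proc.stdout.strip()
```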

get_kernel_bt_config() dict

Kernel boot-time configuration is collected by looking at /proc/cmdline.

The contents of this file form one string of boot-time parameters. This string is returned in a dict.
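The read can be sketched as below; the path is parameterized so the sketch is testable off-Linux, and the dict key is an assumption:

```python
def read_boot_config(path="/proc/cmdline"):
    """Return the kernel boot-parameter string wrapped in a dict."""
    with open(path) as f:
        return {"cmdline": f.read().strip()}
```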

get_kernel_ct_config() dict

Kernel compile-time configuration is collected by looking at /boot/config-(kernel version) and removing comments and empty lines.

The contents of said file are newline-delimited option=value pairs.

get_kernel_mod_config() dict

Kernel module configuration is collected with the “lsmod” and “modinfo” commands.

Each module and modinfo are stored as a key-value pair in the returned dict.

get_kernel_rt_config() dict

Kernel run-time configuration is collected with the “sysctl -a” command.

The output of this command is lines consisting of two possibilities: option = value (note the spaces), and sysctl: permission denied … The option = value pairs are added to the output dict.

get_kernel_version() dict

Kernel version is obtained by the “uname -r” command, returns it in a dict.

get_prov_info() str

Collect and return the different categories of provenance info.

pack_header() None

Set schema with keys of prov_info.

Optional Plugin Type Enforcement

Plugins take data in an arbitrary format and transform it into metadata which is queryable in DSI. Plugins may enforce types, but they are not required to. Plugin type enforcement can be static, like the Hostname default plugin, or dynamic, like the Bueno default plugin.

A collection of pydantic models for Plugin schema validation

class dsi.plugins.plugin_models.EnvironmentModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int])
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class dsi.plugins.plugin_models.GitInfoModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], git_remote: str, git_commit: str)
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'git_commit': FieldInfo(annotation=str, required=True), 'git_remote': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class dsi.plugins.plugin_models.HostnameModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], hostname: str)
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'hostname': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class dsi.plugins.plugin_models.SystemKernelModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], kernel_info: str)
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'kernel_info': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

dsi.plugins.plugin_models.create_dynamic_model(name: str, col_names: list[str], col_types: list[type], base=None) BaseModel

Creates a pydantic model at runtime with given name, column names and types, and an optional base model to extend.

This is useful for when column names are not known until they are retrieved at runtime.
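The same runtime-model idea can be sketched with the stdlib dataclasses module; DSI's helper builds a pydantic BaseModel instead, which adds validation. The function name here is illustrative:

```python
from dataclasses import make_dataclass

def make_runtime_model(name, col_names, col_types):
    """Build a class at runtime from parallel lists of column names and types
    (a validation-free stdlib analogue of create_dynamic_model)."""
    return make_dataclass(name, list(zip(col_names, col_types)))

Row = make_runtime_model("Row", ["a", "b"], [int, str])
row = Row(1, "x")
```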