DSI Readers/Writers

Readers/Writers connect data-producing applications to DSI core functionality. A Reader handles existing data files or input streams; a Writer generates new data formats.

Readers/Writers are modular to support user contribution, and contributors are encouraged to offer custom Reader/Writer abstract classes and implementations. A contributed Reader/Writer abstract class may either extend another Reader/Writer, inheriting the properties of its parent, or be a completely new structure.

To be compatible with DSI, Readers should store data in data structures from the Python collections library (OrderedDict). Similarly, Writers should accept OrderedDict data structures from which to export data or generate an image.
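
The expected OrderedDict layout can be sketched as follows; the table and column names here are hypothetical, invented only for illustration:

```python
from collections import OrderedDict

# Sketch of the structure DSI expects from a Reader: an OrderedDict of
# tables, where each table is itself an OrderedDict mapping column names
# to equal-length lists of column values.
def build_collection():
    table = OrderedDict()
    table["temperature"] = [300.1, 301.5]
    table["pressure"] = [1.0, 1.2]

    collection = OrderedDict()
    collection["sim_output"] = table  # table name -> column dict
    return collection
```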

Any contributed Readers/Writers, or extension of one, should include unit tests in plugins/tests to demonstrate the new capability.

Figure: prominent portion of the current DSI Readers/Writers class hierarchy.

class dsi.plugins.plugin.Plugin(path)

Plugin abstract class for DSI core product.

A Plugin connects a data reader or writer to a compatible middleware data structure.

abstractmethod __init__(path)

Initialize Plugin setup.

Read a Plugin file. Return a Plugin object.

abstractmethod add_to_output(path)

Add data read by the Plugin to its output.

Metadata Processing

Note for users: the StructuredMetadata class is used to assign data from a file_reader to the DSI abstraction in core. If data in a user-written reader is structured as an OrderedDict, the reader only needs to call set_schema_2() at the end of its add_rows().

class dsi.plugins.metadata.StructuredMetadata(**kwargs)

plugin superclass that provides handy methods for structured data

__init__(**kwargs)

Initializes StructuredMetadata class with an output collector (Ordered Dictionary)

add_to_output(row: list, tableName=None) None

Old function to be deprecated soon. Do not use.

Adds a row of data to the output_collector with enforced structure.

This method is typically used within a plugin’s add_rows() method to incrementally build table output in a consistent format.

row : list

A single row of data to be added. Must match the expected structure for the target table.

tableName : str, optional

Name of the table to which the row should be added. If None, the function identifies which plugin called it and assigns the tableName for that data.

schema_is_set() bool

Old function to be deprecated soon. Do not use.

Helper method to see if the schema has been set

set_schema(table_data: list, validation_model=None) None

Old function to be deprecated soon. Do not use.

Initializes column structure in the output_collector and table_cnt.

This method is typically used within a plugin pack_header() method to define the expected schema before rows are added.

table_data : list

Defines the table structure to be used.

  • For multiple-table data:

    A list of tuples, each structured as (table_name, list_of_column_names)

  • For single-table data:

    A simple list of column names.

set_schema_2(collection, validation_model=None) None

Use this if data in a Reader’s ``add_rows()`` is structured as an OrderedDict()

Faster update of the DSI abstraction when input data is structured as an OrderedDict.

This method is optimized for plugins where add_rows() passes data as an OrderedDict, and avoids the incremental row ingest via set_schema() and add_to_output().

collection : OrderedDict

The plugin's data structure (with data) passed from the plugin.

  • If collection only contains one table, the data will be wrapped in another OrderedDict, where the plugin’s class name is the table name key.
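
A minimal sketch of this usage pattern, with a stub standing in for StructuredMetadata so the example is self-contained (the real base class supplies set_schema_2(); column names are invented):

```python
from collections import OrderedDict

class StructuredMetadataStub:
    """Stand-in for dsi.plugins.metadata.StructuredMetadata so this sketch
    runs on its own; the real class provides set_schema_2()."""
    def __init__(self):
        self.output_collector = OrderedDict()

    def set_schema_2(self, collection, validation_model=None):
        # Per the docs above: a single-table collection is wrapped in an
        # outer OrderedDict keyed by the plugin's class name.
        self.output_collector[type(self).__name__] = collection

class MyReader(StructuredMetadataStub):
    def add_rows(self):
        data = OrderedDict()
        data["run_id"] = [1, 2, 3]
        data["energy"] = [0.5, 0.7, 0.6]
        self.set_schema_2(data)  # single call at the end of add_rows()

reader = MyReader()
reader.add_rows()
```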

File Readers

Note for users:
  • DSI assumes data structures from all data sources are consistent/stable. Ex: table/column names MUST be consistent. The number of columns in a table CAN vary.

  • DSI Readers can handle data files with mismatched number of columns. Ex: file1: table1 has columns a, b, c. file2: table1 has columns a, b, d

    • if only reading in one table at a time, users can use pandas to stack multiple dataframes vertically (ex: CSV reader)

    • if multiple tables in a file, users must pad tables with null values (ex: YAML1, which has example code at bottom of add_rows() to implement this)
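
The null-padding approach for the multi-table case can be sketched in pure Python; the table contents are invented, and YAML1's actual helper code may differ:

```python
from collections import OrderedDict

# Two files define table1 with columns a, b, c and a, b, d respectively.
t1 = OrderedDict([("a", [1]), ("b", [2]), ("c", [3])])
t2 = OrderedDict([("a", [4]), ("b", [5]), ("d", [6])])

def pad_and_merge(tables):
    # Collect the union of column names, preserving first-seen order.
    all_cols = []
    for t in tables:
        for col in t:
            if col not in all_cols:
                all_cols.append(col)
    merged = OrderedDict((col, []) for col in all_cols)
    for t in tables:
        n_rows = len(next(iter(t.values()))) if t else 0
        for col in all_cols:
            # Missing columns are padded with None for this file's rows.
            merged[col].extend(t.get(col, [None] * n_rows))
    return merged

merged = pad_and_merge([t1, t2])
# merged["c"] == [3, None], merged["d"] == [None, 6]
```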

class dsi.plugins.file_reader.Bueno(filenames, **kwargs)

A DSI Reader that captures performance data from Bueno (github.com/lanl/bueno)

Bueno outputs performance data as key-value pairs in a file. Keys and values are delimited by :. Key-value pairs are delimited by \n.

__init__(filenames, **kwargs) None

filenames: one Bueno file or a list of Bueno files to be ingested

add_rows() None

Parses Bueno data and adds a list containing 1 or more rows.
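
The key:value format described above could be parsed along these lines; this is illustrative only, not DSI's implementation, and the sample data is invented:

```python
from collections import OrderedDict

def parse_bueno_text(text):
    # One key:value pair per line; key and value delimited by ":".
    row = OrderedDict()
    for line in text.strip().split("\n"):
        key, _, value = line.partition(":")
        row[key.strip()] = value.strip()
    return row

row = parse_bueno_text("app:laghos\nruntime:12.4\nnodes:8")
```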

class dsi.plugins.file_reader.Cloverleaf(folder_path, **kwargs)

DSI Reader that stores input and output Cloverleaf data from a directory for each simulation run

__init__(folder_path, **kwargs)
folder_path : str

Filepath to the directory where the Cloverleaf data is stored. The directory should have a subfolder for each simulation run, each containing input and output data

add_rows() None

Flattens data from each simulation’s input file as a row in the input table. Flattens data from each simulation’s output file as a row in the output table. Creates a simulation table which stores each simulation’s number and execution datetime.

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

class dsi.plugins.file_reader.Csv(filenames, table_name=None, **kwargs)

A DSI Reader that reads in CSV data

__init__(filenames, table_name=None, **kwargs)

Initializes the CSV Reader with user specified filenames and optional table_name.

filenames : str or list of str

Required. One or more CSV file paths to be loaded into DSI. If multiple files are provided, all data must correspond to the same table.

table_name : str, optional

Optional name to assign to the loaded table. If not provided, DSI will default to using “Csv” as the table name.

add_rows() None

Adds a list containing one or more rows of the CSV along with file_info to output.

class dsi.plugins.file_reader.DublinCoreDatacard(filenames, **kwargs)

DSI Reader that stores a dataset’s data card as a row in the dublin_core_datacard table. Input datacard should follow template in examples/test/template_dc_dublin_core.xml

__init__(filenames, **kwargs)
filenames : str or list of str

File name(s) of XML data card files to ingest. Each file must adhere to the Dublin Core metadata standard.

add_rows() None

Flattens data in the input data card as a row in the dublin_core_datacard table

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

class dsi.plugins.file_reader.Ensemble(filenames, table_name=None, sim_table=True, **kwargs)

DSI Reader that loads ensemble simulation data stored in a CSV file.

Designed to handle simulation outputs where each row represents an individual simulation run. Specifically for single-run use cases when data is post-processed.

Automatically generates a simulation metadata table to accompany the data.

__init__(filenames, table_name=None, sim_table=True, **kwargs)

Initializes Ensemble Reader with user specified parameters.

filenames : str or list of str

Required input. One or more Ensemble data files in CSV format. All files must correspond to the same table.

table_name : str, optional

Optional name to assign to the table when loading the Ensemble data. If not provided, the default table name, ‘Ensemble’, will be used.

sim_table : bool, default=True

If True, creates a simulation metadata table (sim_table) where each row in the input data table represents a separate simulation run.

  • Adds a new column to the input data to associate each row with its corresponding entry in the simulation table.

If False, skips creation of the simulation table.

add_rows() None

Creates an OrderedDict representation of the Ensemble data to load into DSI.

When sim_table = True, a sim_table OrderedDict is created alongside the Ensemble data table OrderedDict. Both tables are nested within a larger OrderedDict.

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core
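
A sketch of the resulting structure when sim_table=True; the column names here are invented, and DSI's actual linkage column name may differ:

```python
from collections import OrderedDict

# Each row of the data table represents one simulation run.
data = OrderedDict([("energy", [0.5, 0.7]), ("steps", [100, 200])])
# A new column associates each row with its simulation-table entry.
data["sim_id"] = list(range(1, len(data["energy"]) + 1))

sim_table = OrderedDict()
sim_table["sim_id"] = data["sim_id"]

# Both tables nested within a larger OrderedDict, as described above.
collection = OrderedDict([("Ensemble", data), ("sim_table", sim_table)])
```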

class dsi.plugins.file_reader.GoogleDatacard(filenames, **kwargs)

DSI Reader that stores a dataset’s data card as a row in the google_datacard table. Input datacard should follow template in examples/test/template_dc_google.yml

__init__(filenames, **kwargs)
filenames : str or list of str

File name(s) of YAML data card files to ingest. Each file must adhere to the Google Data Cards Playbook metadata standard.

add_rows() None

Flattens data in the input data card as a row in the google_datacard table

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

class dsi.plugins.file_reader.JSON(filenames, table_name=None, **kwargs)

A DSI Reader for ingesting generic JSON data with flat key-value pairs.

This reader assumes that all values are primitive data types (e.g., str, int, float), and that there are no nested dictionaries, arrays, or complex structures.

The keys in the JSON object are treated as column names, and their corresponding values are interpreted as rows.

__init__(filenames, table_name=None, **kwargs) None

Initializes the generic JSON reader with user-specified filenames

filenames : str or list of str

Required input. One or more JSON file paths to be loaded into DSI. If multiple files are provided, all data must correspond to the same table

table_name : str, optional

Name to assign to the loaded table. If not provided, DSI defaults to using “JSON” as the table name.

add_rows() None

Parses JSON data and stores data into a table as an Ordered Dictionary.
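
The flat key-value shape this reader expects, and its mapping to columns, can be sketched as follows (the JSON body is invented; this is not DSI's implementation):

```python
import json
from collections import OrderedDict

# A flat key-value JSON body: primitive values only, no nesting.
text = '{"app": "demo", "steps": 100, "dt": 0.01}'

# Keys become column names; each value becomes a one-element column list.
record = json.loads(text, object_pairs_hook=OrderedDict)
table = OrderedDict((k, [v]) for k, v in record.items())
```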

class dsi.plugins.file_reader.MetadataReader1(filenames, target_table_prefix=None, **kwargs)

DSI Reader that reads in an individual or a set of JSON metadata files

__init__(filenames, target_table_prefix=None, **kwargs)

filenames: one metadata json file or a list of metadata json files to be ingested

target_table_prefix: prefix to be added to every table created to differentiate between other metadata file sources

add_rows() None

Parses metadata json files and creates an ordered dict whose keys are file names and values are an ordered dict of that file’s data

class dsi.plugins.file_reader.Oceans11Datacard(filenames, **kwargs)

DSI Reader that stores a dataset’s data card as a row in the oceans11_datacard table. Input datacard should follow template in examples/test/template_dc_oceans11.yml

__init__(filenames, **kwargs)
filenames : str or list of str

File name(s) of YAML data card files to ingest. Each file must adhere to the Oceans 11 LANL Data Server metadata standard.

add_rows() None

Flattens data in the input data card as a row in the oceans11_datacard table

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

class dsi.plugins.file_reader.Schema(filename, target_table_prefix=None, **kwargs)

DSI Reader for parsing and ingesting a relational schema alongside its associated data.

Schema file input should be a JSON file that defines primary and foreign key relationships between tables in a data source. Parsed relationships are stored in the global dsi_relations table, which is used for creating backends and used by writers.

This reader is essential when working with complex, multi-table data structures. See Cloverleaf (Complex Schemas) to learn how a schema file should be structured

__init__(filename, target_table_prefix=None, **kwargs)

Initializes the Schema reader with the specified schema file.

filename: str

Path to the JSON file containing the schema to be ingested. This file should define primary and foreign key relationships between tables.

target_table_prefix : str, optional

A prefix to prepend to every table name in the primary and foreign key lists. Useful for avoiding naming conflicts in shared environments.

add_rows() None

Generates a dsi_relations OrderedDict to be added to the internal DSI abstraction.

The OrderedDict has 2 keys, primary key and foreign key, whose values are lists of PK and FK tuples associating tables and columns
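
The shape of that OrderedDict can be sketched as follows; the table and column names are hypothetical:

```python
from collections import OrderedDict

# The foreign-key tuple at index i references the primary-key tuple at
# the same index, pairing (table, column) across the two lists.
dsi_relations = OrderedDict()
dsi_relations["primary_key"] = [("simulation", "sim_id")]
dsi_relations["foreign_key"] = [("output", "sim_id")]
```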

class dsi.plugins.file_reader.SchemaOrgDatacard(filenames, **kwargs)

DSI Reader that stores a dataset’s data card as a row in the schema_org_datacard table. Input datacard should follow template in examples/test/template_dc_schema_org.json

__init__(filenames, **kwargs)
filenames : str or list of str

File name(s) of JSON data card files to ingest. Each file must adhere to the Schema.org metadata standard.

add_rows() None

Flattens data in the input data card as a row in the schema_org_datacard table

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

class dsi.plugins.file_reader.TOML1(filenames, target_table_prefix=None, **kwargs)

DSI Reader that reads in an individual or a set of TOML files

Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary

__init__(filenames, target_table_prefix=None, **kwargs)
filenames : str or list of str

One TOML file or a list of TOML files to be loaded into DSI.

target_table_prefix: str, optional

A prefix to be added to each table name created from the TOML data. Useful for distinguishing between tables from other data sources.

add_rows() None

Parses TOML data and constructs a nested OrderedDict to load into DSI.

The resulting structure has:

  • Top-level keys as table names.

  • Each value is an OrderedDict where:

    • Keys are column names.

    • Values are lists representing column data.

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core
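
The TOML-to-OrderedDict mapping described above can be sketched as follows; the file contents are invented, and on Python 3.11+ the parsing itself could use the standard-library tomllib module:

```python
from collections import OrderedDict

# For a TOML file such as:
#
#   [run]
#   nodes = 8
#   walltime = 3600.0
#
# the reader's output structure has the table name as the top-level key
# and column values stored as lists.
collection = OrderedDict()
collection["run"] = OrderedDict([("nodes", [8]), ("walltime", [3600.0])])
```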

class dsi.plugins.file_reader.YAML1(filenames, target_table_prefix=None, yamlSpace='  ', **kwargs)

DSI Reader that reads in an individual or a set of YAML files

Table names are the keys for the main ordered dictionary and column names are the keys for each table’s nested ordered dictionary

__init__(filenames, target_table_prefix=None, yamlSpace='  ', **kwargs)

Initializes the YAML1 reader with the specified YAML file(s)

filenames : str or list of str

One YAML file or a list of YAML files to be loaded into DSI.

target_table_prefix: str, optional

A prefix to be added to each table name created from the YAML data. Useful for distinguishing between tables from other data sources.

yamlSpace : str, default='  '

The indentation used in the input YAML files. Defaults to two spaces, but can be customized to match the formatting in certain files.

add_rows() None

Parses YAML data and constructs a nested OrderedDict to load into DSI.

The resulting structure has:

  • Top-level keys as table names.

  • Each value is an OrderedDict where:

    • Keys are column names.

    • Values are lists representing column data.

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

File Writers

Note for users:
  • DSI’s runTable is only included in File Writers if data was previously ingested into a backend in a Core.Terminal workflow where runTable was set to True.

class dsi.plugins.file_writer.Csv_Writer(table_name, filename, export_cols=None, **kwargs)

DSI Writer to output queries as CSV data

__init__(table_name, filename, export_cols=None, **kwargs)

Initializes the CSV Writer with the specified inputs

table_name : str

Name of the table to export from the DSI backend.

filename : str

Name of the CSV file to be generated.

export_cols : list of str, optional, default is None.

A list of column names to include in the exported CSV file.

If None, all columns from the table will be included.

  • Ex: if a table has columns [a, b, c, d, e] and export_cols = [a, c, e], only those three are written to the CSV

get_rows(collection) None

Exports data from the given DSI data collection to a CSV file.

collection : OrderedDict
The internal DSI abstraction. This is a nested OrderedDict where:
  • Top-level keys are table names.

  • Each value is another OrderedDict representing the table’s data (with column names as keys and lists of values).

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core
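
How one table of the nested collection maps onto CSV rows can be sketched with the standard library; this is illustrative only, not Csv_Writer's implementation, and the data is invented:

```python
import csv
import io
from collections import OrderedDict

collection = OrderedDict()
collection["results"] = OrderedDict([("a", [1, 2]), ("b", [3, 4])])

def table_to_csv(table):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table.keys())            # header: column names
    writer.writerows(zip(*table.values()))   # column lists -> data rows
    return buf.getvalue()

csv_text = table_to_csv(collection["results"])
```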

class dsi.plugins.file_writer.ER_Diagram(filename, target_table_prefix=None, **kwargs)

DSI Writer that generates an ER Diagram from the current data in the DSI abstraction

__init__(filename, target_table_prefix=None, **kwargs)

Initializes the ER Diagram writer

filename : str

File name for the generated ER diagram. Supported formats are .png, .pdf, .jpg, or .jpeg.

target_table_prefix : str, optional

If provided, filters the ER Diagram to only include tables whose names begin with this prefix.

  • Ex: If prefix = “student”, only “student__address”, “student__math”, “student__physics” tables are included

get_rows(collection) None

Generates the ER Diagram from the given DSI data collection.

collection : OrderedDict
The internal DSI abstraction. This is a nested OrderedDict where:
  • Top-level keys are table names.

  • Each value is another OrderedDict representing the table’s data (with column names as keys and lists of values).

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

class dsi.plugins.file_writer.Table_Plot(table_name, filename, display_cols=None, **kwargs)

DSI Writer that plots all numeric column data for a specified table

__init__(table_name, filename, display_cols=None, **kwargs)

Initializes the Table Plot writer with specified inputs

table_name : str

Name of the table to plot

filename : str

Name of the output file where the generated plot will be saved.

display_cols: list of str, optional, default is None.

A list of column names to include in the plot. All included columns must contain numerical data.

If None (default), all numerical columns in the specified table will be plotted.

get_rows(collection) None

Generates a plot of a specified table from the given DSI data collection.

collection : OrderedDict
The internal DSI abstraction. This is a nested OrderedDict where:
  • Top-level keys are table names.

  • Each value is another OrderedDict representing the table’s data (with column names as keys and lists of values).

return: None.

If an error occurs, a tuple in the format - (ErrorType, “error message”) - is returned to and printed by the core

Environment Plugins

class dsi.plugins.env.Environment

Environment Plugins inspect the calling process’ context.

Environments assume a POSIX-compliant filesystem and always collect UID/GID information.

__init__()

Initializes StructuredMetadata class with an output collector (Ordered Dictionary)

class dsi.plugins.env.GitInfo(git_repo_path='./')

A Plugin to capture Git information.

Adds the current git remote and git commit to metadata.

__init__(git_repo_path='./') None

Initializes the git repo in the given directory and access to git commands

add_rows() None

Adds a row to the output with POSIX info, git remote, and git commit

pack_header() None

Set schema with POSIX and Git columns

class dsi.plugins.env.Hostname(**kwargs)

An example Environment implementation.

This plugin collects the hostname of the machine, and couples this with the POSIX information gathered by the Environment base class.

__init__(**kwargs) None

Initializes StructuredMetadata class with an output collector (Ordered Dictionary)

add_rows() None

Parses environment provenance data and adds the row.

pack_header() None

Set schema with keys of prov_info.

class dsi.plugins.env.SystemKernel

Plugin for reading environment provenance data.

An environment provenance plugin which collects the following:

  1. System Kernel Version

  2. Kernel compile-time config

  3. Kernel boot config

  4. Kernel runtime config

  5. Kernel modules and module config

  6. Container information, if containerized

__init__() None

Initialize SystemKernel with initial provenance info.

add_rows() None

Parses environment provenance data and adds the row.

static get_cmd_output(cmd: list, ignore_stderr=False) str

Runs a given command and returns the stdout if successful.

If stderr is not empty, an exception is raised with the stderr text.
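
The behavior described above can be sketched with the standard subprocess module (this is not DSI's source; the sample command is an assumption):

```python
import subprocess

def get_cmd_output(cmd, ignore_stderr=False):
    # Run the command, capturing stdout and stderr as text.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # Raise if stderr is non-empty and not explicitly ignored.
    if proc.stderr and not ignore_stderr:
        raise RuntimeError(proc.stderr)
    return proc.stdout.strip()

out = get_cmd_output(["echo", "hello"])
```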

get_kernel_bt_config() dict

Kernel boot-time configuration is collected by looking at /proc/cmdline.

The output of this command is one string of boot-time parameters. This string is returned in a dict.

get_kernel_ct_config() dict

Kernel compile-time configuration is collected by looking at /boot/config-(kernel version) and removing comments and empty lines.

The output of said command is newline-delimited option=value pairs.

get_kernel_mod_config() dict

Kernel module configuration is collected with the “lsmod” and “modinfo” commands.

Each module and modinfo are stored as a key-value pair in the returned dict.

get_kernel_rt_config() dict

Kernel run-time configuration is collected with the “sysctl -a” command.

The output of this command is lines consisting of two possibilities: option = value (note the spaces), and sysctl: permission denied … The option = value pairs are added to the output dict.
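
Parsing that mixed output can be sketched as follows; the sample lines are invented, and this is not DSI's implementation:

```python
def parse_sysctl(text):
    # Keep "option = value" pairs (note the spaces); skip permission-denied
    # and other non-matching lines.
    config = {}
    for line in text.splitlines():
        if " = " in line:
            option, value = line.split(" = ", 1)
            config[option] = value
    return config

cfg = parse_sysctl("kernel.ostype = Linux\nsysctl: permission denied on key 'x'")
```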

get_kernel_version() dict

Kernel version is obtained by the “uname -r” command, returns it in a dict.

get_prov_info() str

Collect and return the different categories of provenance info.

pack_header() None

Set schema with keys of prov_info.

Optional Plugin Type Enforcement

Plugins take data in an arbitrary format and transform it into metadata that is queryable in DSI. Plugins may enforce types, but they are not required to do so.

Plugin type enforcement can be static, like the Hostname default plugin. Plugin type enforcement can also be dynamic, like the Bueno default plugin.

A collection of pydantic models for Plugin schema validation

class dsi.plugins.plugin_models.EnvironmentModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int])
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class dsi.plugins.plugin_models.GitInfoModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], git_remote: str, git_commit: str)
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'git_commit': FieldInfo(annotation=str, required=True), 'git_remote': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class dsi.plugins.plugin_models.HostnameModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], hostname: str)
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'hostname': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class dsi.plugins.plugin_models.SystemKernelModel(*, uid: int, effective_gid: int, moniker: str, gid_list: list[int], kernel_info: str)
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'effective_gid': FieldInfo(annotation=int, required=True), 'gid_list': FieldInfo(annotation=list[int], required=True), 'kernel_info': FieldInfo(annotation=str, required=True), 'moniker': FieldInfo(annotation=str, required=True), 'uid': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

dsi.plugins.plugin_models.create_dynamic_model(name: str, col_names: list[str], col_types: list[type], base=None) BaseModel

Creates a pydantic model at runtime with given name, column names and types, and an optional base model to extend.

This is useful for when column names are not known until they are retrieved at runtime.
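
Runtime model creation in this spirit can be sketched with pydantic's create_model; the column names and types here are invented for illustration:

```python
from pydantic import create_model

# Column names/types arrive at runtime, e.g. from a parsed header.
col_names = ["hostname", "uid"]
col_types = [str, int]

# pydantic's create_model takes field definitions as (type, default)
# tuples; Ellipsis marks a field as required.
fields = {name: (typ, ...) for name, typ in zip(col_names, col_types)}
RowModel = create_model("RowModel", **fields)

row = RowModel(hostname="node01", uid=1000)
```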