Common Workflow Language (CWL)

BEE workflows are written in the Common Workflow Language (CWL). CWL allows workflows to be written in a simple YAML format that can represent “steps” (or tasks) of a workflow as well as how those steps depend on each other. Each step has a base command to be used along with inputs and outputs. Special options are used to indicate how those inputs and outputs can be used to form the actual command to be run on a system. In addition, requirements and hints are special options that determine what a particular step requires and the environment in which the step can run. Dependencies are expressed naturally through the inputs and outputs that flow between the tasks.
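
As a rough illustration of how steps and their dependencies are expressed, here is a minimal two-step workflow sketch written against CWL v1.0. The step names, commands, and file names are hypothetical, and the sketch has not been checked against the subset of CWL that BEE supports. The second step depends on the first only because its input is wired to the first step's output.

```yaml
# Hypothetical workflow: search a file for a pattern, then count the matches.
cwlVersion: v1.0
class: Workflow

inputs:
  pattern: string
  infile: File

outputs:
  line_count:
    type: File
    outputSource: count/counted

steps:
  search:
    run:
      class: CommandLineTool
      baseCommand: grep            # base command for this step
      stdout: matches.txt          # capture standard output in this file
      inputs:
        pattern:
          type: string
          inputBinding:
            position: 1            # first argument after the base command
        infile:
          type: File
          inputBinding:
            position: 2            # second argument
      outputs:
        matches:
          type: stdout
    in:
      pattern: pattern
      infile: infile
    out: [matches]

  count:
    run:
      class: CommandLineTool
      baseCommand: [wc, -l]
      stdout: count.txt
      inputs:
        matches:
          type: File
          inputBinding:
            position: 1
      outputs:
        counted:
          type: stdout
    in:
      matches: search/matches      # dependency: output of the search step
    out: [counted]
```

With these bindings the first step would run roughly as grep followed by the pattern and the input file, and the second step cannot start until matches.txt exists.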

In the past, each workflow orchestration system used its own workflow format, shaped by its needs at the time. A workflow written for one system was therefore time-consuming to port to another. This is one of the main reasons the CWL standard was created, and today a number of workflow systems, including BEE, support it. By using this standard we hope to make it easier for users to write their workflows and to extend them to different systems.

BEE currently supports a subset of the CWL standard. On this page we’ll try to list any differences between the CWL standard and what BEE supports, as well as extensions that you might want to use.

For general information about writing workflows please take a look at the Common Workflow Language User Guide. If you have more questions or are interested in particular details of CWL, then you might want to take a look at the specification.

BEE-Specific Requirements/Hints

Requirements and hints are both used to help BEE configure the environment for each step. When a requirement is encountered in a workflow the workflow system must be able to fulfill it, or raise an error indicating a failure. On the other hand if a hint is not supported, then it can safely be ignored by the implementation. Since the standard only includes a limited number of requirements, not all of which are useful for an HPC setting, we’ve added some extensions that are prefixed with beeflow:. Please specify beeflow extensions as hints in the steps of your CWL workflow specification.
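
The fragment below is a sketch of where hints are placed on a step. The step name, tool file, and input/output names are hypothetical; the hint itself reuses the DockerRequirement values shown later on this page.

```yaml
steps:
  build_and_run:                   # hypothetical step name
    run: tool.cwl                  # hypothetical tool description file
    in:
      input_file: input_file       # hypothetical workflow input
    out: [result]                  # hypothetical step output
    hints:                         # beeflow extensions go here, per step
      DockerRequirement:
        dockerFile: "dockerfile-name"
        beeflow:containerName: "some-container"
```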

DockerRequirement

A DockerRequirement (a CWL standard specification) is used to run a step with a container. BEE does not support Docker, but it does support Charliecloud and also has limited support for Singularity.

An example DockerRequirement in BEE is shown below:

DockerRequirement:
    dockerFile: "dockerfile-name"
    beeflow:containerName: "some-container"

This example includes two suboptions: dockerFile, which specifies the name of a dockerfile, and the extension beeflow:containerName, which gives the name of the container to build. Below are the DockerRequirement suboptions, whether BEE supports each of them, and how the supported ones affect running a step with a container.

Suboption Name	Usage/Meaning/Requirements
`dockerPull`	`dockerPull: "container-image"` Pull from a container repository.
`dockerLoad`	Not supported
`dockerFile`	`dockerFile: "dockerfile-name"` Builds a container using the dockerfile. Requires: `beeflow:containerName`
`dockerImport`	Not supported
`dockerImageId`	Not supported
`dockerOutputDirectory`	Not supported
`beeflow:copyContainer`	`beeflow:copyContainer: "path-to-container-image"` Copies image to `container_archive` (specified in bee.conf). Uses copy.
`beeflow:useContainer`	`beeflow:useContainer: "path-to-container-image"` Executes using the specified image.
`beeflow:containerName`	`beeflow:containerName: "containerName"` Specifies the container name. Used in conjunction with dockerFile.
`beeflow:forceType`	`beeflow:forceType: "forceType"` Charliecloud specific option that corresponds to `ch-image`’s `--force` argument.
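
To make the table more concrete, the sketch below shows three alternative ways a step might obtain its container, one per step. The step names, tool files, image name, and image paths are all hypothetical placeholders; only the suboption names come from the table above.

```yaml
steps:
  pulled:                          # hypothetical step name
    run: tool-a.cwl                # hypothetical tool description file
    in: {}
    out: []
    hints:
      DockerRequirement:
        # Pull the image from a container repository.
        dockerPull: "container-image"

  copied:                          # hypothetical step name
    run: tool-b.cwl
    in: {}
    out: []
    hints:
      DockerRequirement:
        # Copy an existing image into container_archive (set in bee.conf).
        beeflow:copyContainer: "path-to-container-image"

  in_place:                        # hypothetical step name
    run: tool-c.cwl
    in: {}
    out: []
    hints:
      DockerRequirement:
        # Execute using the image at the given path.
        beeflow:useContainer: "path-to-container-image"
```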

beeflow:MPIRequirement

BEE also includes a special requirement for running MPI jobs. Note that CWL has also experimented with a cwltool-specific hint (see their paper). That experimental extension includes basic support for running MPI jobs, but doesn’t include all of the options that are important for the HPC systems that BEE is designed to run on. For instance, many jobs require MPI runtime information that needs to be passed to the underlying scheduler (such as the MPI type that is passed to Slurm with the --mpi={version} option).

BEE’s beeflow:MPIRequirement attempts to be as configurable as possible for running HPC jobs with MPI. An example beeflow:MPIRequirement in BEE is shown below:

beeflow:MPIRequirement:
  nodes: 10
  ntasks: 32

The values for nodes and ntasks are then passed to the template and can be used to request the required resources from the underlying scheduler on submission.
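
As a sketch, an MPI step that also runs in a container could combine both hints as shown below. The step name, tool file, and image path are hypothetical, and the comments comparing nodes and ntasks to Slurm options are an assumption about how the values are typically used, not a statement of what BEE's template does.

```yaml
steps:
  mpi_simulation:                  # hypothetical step name
    run: simulation.cwl            # hypothetical tool description file
    in: {}
    out: []
    hints:
      beeflow:MPIRequirement:
        nodes: 10                  # assumed analogous to Slurm's --nodes
        ntasks: 32                 # assumed analogous to Slurm's --ntasks
      DockerRequirement:
        beeflow:useContainer: "path-to-container-image"
```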

beeflow:CheckpointRequirement

BEE is designed to manage workflows that include long-running scientific simulations, which require checkpointing and restarting. We implemented the beeflow:CheckpointRequirement for this purpose. If a step in a workflow includes this requirement and the task stops, for example because the job reaches its time limit, a subtask will run to continue the simulation using the specified checkpoint file.

An example beeflow:CheckpointRequirement in BEE is shown below:

beeflow:CheckpointRequirement:
     enabled: true
     file_path: checkpoint_output
     container_path: checkpoint_output
     file_regex: backup[0-9]*.crx
     restart_parameters: -R
     num_tries: 3

In the above example, file_path is the location of the checkpoint files, and file_regex is the regular expression that matches the possible checkpoint filenames. On a restart, the value of restart_parameters is added to the run command, followed by the path to the latest checkpoint file. num_tries specifies the maximum number of times the task will be restarted.
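
The sketch below places the same hint on a step and annotates what a restart might look like. The step name, tool file, command name, and checkpoint filenames are hypothetical, and the exact form of the restart command is an assumption based on the description above.

```yaml
steps:
  simulate:                        # hypothetical step name
    run: simulate.cwl              # hypothetical tool description file
    in: {}
    out: []
    hints:
      beeflow:CheckpointRequirement:
        enabled: true
        file_path: checkpoint_output
        container_path: checkpoint_output
        file_regex: backup[0-9]*.crx   # would match e.g. backup0001.crx
        restart_parameters: -R
        num_tries: 3                   # restart at most three times

# Assumed illustration: if the step's command were "simulate" and the newest
# matching file were backup0002.crx, the restarted command would resemble
#   simulate -R checkpoint_output/backup0002.crx
```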

beeflow:SlurmRequirement

This requirement is designed for specifying additional information that will be passed to the Slurm scheduler during job submission. Each of the options can also be set in the job section of the configuration file bee.conf, where it applies to all workflows. Setting an option under beeflow:SlurmRequirement in the CWL file overrides the corresponding setting in bee.conf. The currently supported options are:

account - account name to run the job with (often used for charging).
partition - partition to launch job on.
qos - quality of service to use.
reservation - reservation to use to launch job.
timeLimit - time limit for the job, in a time format accepted by Slurm (for example hours:minutes:seconds).

An example is shown below:

beeflow:SlurmRequirement:
  timeLimit: 00:00:10
  account: account12345
  partition: partition-a
  qos: long
  reservation: reservation-a

beeflow:ScriptRequirement

Some tasks may require small additional commands for setup or teardown, such as loading modules, setting up checkpointing files, or cleaning up after a run. The script requirement enables this by adding shell scripts that run before and after a task: the pre_script is run before the task and the post_script is run after it. The scripts must be located within the workflow directory. The desired shell interpreter must be specified both in the beeflow:ScriptRequirement section of the CWL file and in the shebang line of each script; if it is missing from either place, or if the two specify different interpreters, an error is returned. The default shell is /bin/bash. Currently, we only support running scripts outside of a container. We are considering adding container support in the future.

ScriptRequirement currently supports the following options:

enabled - Enables pre/post script support.
pre_script - Path to the pre_script relative to the workflow directory.
post_script - Path to the post_script relative to the workflow directory.
shell - Desired shell interpreter. Must match the shell interpreter given in the shebang line of the pre/post scripts.

An example beeflow:ScriptRequirement is shown below:

beeflow:ScriptRequirement:
  enabled: true
  pre_script: before.sh
  post_script: after.sh
  shell: /bin/bash