Common Workflow Language (CWL)

BEE workflows are written in the Common Workflow Language (CWL). CWL allows workflows to be written in a simple YAML format that can represent “steps” (or tasks) of a workflow as well as how those steps depend on each other. Each step has a base command to be used along with inputs and outputs. Special options are used to indicate how those inputs and outputs can be used to form the actual command to be run on a system. In addition, requirements and hints are special options that determine what a particular step requires and the environment in which the step can run. Dependencies are expressed naturally through the inputs and outputs that flow between the tasks.

In the past, each workflow orchestration system used its own workflow format, based on its needs at the time. As a result, a workflow written for one system was time-consuming to port to another. This is one of the main reasons the CWL standard was created, and today a number of workflow systems, BEE among them, support it. By using this standard we hope to make it easier for users to both write their workflows and extend them to different systems.

BEE currently supports a subset of the CWL standard. On this page we’ll try to list any differences between the CWL standard and what BEE supports, as well as extensions that you might want to use.

For general information about writing workflows please take a look at the Common Workflow Language User Guide. If you have more questions or are interested in particular details of CWL, then you might want to take a look at the specification.

BEE-Specific Requirements/Hints

Requirements and hints are both used to help BEE configure the environment for each step. When a requirement is encountered in a workflow, the workflow system must be able to fulfill it or raise an error indicating a failure. A hint that is not supported, on the other hand, can safely be ignored by the implementation. Since the standard only includes a limited number of requirements, not all of which are useful in an HPC setting, we’ve added some extensions that are prefixed with beeflow:. Please specify beeflow extensions as hints in the steps of your CWL workflow specification.
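To make the placement concrete, here is a minimal sketch of a workflow file with one step that carries a beeflow extension as a hint. The step name, the input and output names, and the tool file simulate.cwl are placeholders invented for illustration; only the hints block reflects the options described on this page.

```yaml
cwlVersion: v1.0
class: Workflow

inputs:
  input_file: File            # placeholder workflow input

outputs:
  result:
    type: File
    outputSource: simulate/output_file

steps:
  simulate:                   # placeholder step name
    run: simulate.cwl         # placeholder tool description
    in:
      input_file: input_file
    out: [output_file]
    # beeflow extensions go under the step's hints, not requirements.
    hints:
      beeflow:MPIRequirement:
        nodes: 2
        ntasks: 8
```

Because the extension is given as a hint, an implementation that does not recognize it can ignore it rather than fail.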

DockerRequirement

A DockerRequirement (a CWL standard specification) is used to run a step with a container. BEE does not support Docker, but it does support Charliecloud and also has limited support for Singularity.

An example DockerRequirement in BEE is shown below:

DockerRequirement:
    dockerFile: "dockerfile-name"
    beeflow:containerName: "some-container"

This example includes two suboptions: dockerFile, which specifies the name of a dockerfile, and the extension beeflow:containerName, which gives the name of the container to build. Below are the DockerRequirement suboptions, whether BEE supports them, and how they can affect running a step with a container.

Suboption Name           Usage/Meaning/Requirements
-----------------------  ----------------------------------------------------
dockerPull               dockerPull: "container-image"
                         Pull from a container repository.
dockerLoad               Not supported
dockerFile               dockerFile: "dockerfile-name"
                         Builds a container using the dockerfile.
                         Requires: beeflow:containerName
dockerImport             Not supported
dockerImageId            Not supported
dockerOutputDirectory    Not supported
beeflow:copyContainer    beeflow:copyContainer: "path-to-container-image"
                         Copies image to container_archive (specified in
                         bee.conf). Uses copy.
beeflow:useContainer     beeflow:useContainer: "path-to-container-image"
                         Executes using the specified image.
beeflow:containerName    beeflow:containerName: "containerName"
                         Specifies the container name. Used in conjunction
                         with dockerFile.
beeflow:forceType        beeflow:forceType: "forceType"
                         Charliecloud specific option that corresponds to
                         ch-image’s --force argument.
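As an illustration of the suboptions listed above, the following sketch shows two steps that obtain their container in different ways. The step names, tool files, and the empty in/out entries are placeholders, and the image values are the same generic stand-ins used in the listing, not values from a real BEE installation.

```yaml
steps:
  preprocess:                 # placeholder step
    run: preprocess.cwl
    in: {}
    out: []
    hints:
      DockerRequirement:
        # Pull the image from a container repository.
        dockerPull: "container-image"

  simulate:                   # placeholder step
    run: simulate.cwl
    in: {}
    out: []
    hints:
      DockerRequirement:
        # Execute the step using the image at the given path.
        beeflow:useContainer: "path-to-container-image"
```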

beeflow:MPIRequirement

BEE also includes a special requirement for running MPI jobs. Note that CWL has also experimented with a cwltool-specific hint (see their paper). That experimental extension includes basic support for running MPI jobs, but doesn’t include all of the options that are important for the HPC systems that BEE is designed to run on. For instance, many jobs require some MPI runtime information to be passed to the underlying scheduler (such as the MPI type, which is passed to Slurm with the --mpi={version} option).

BEE’s beeflow:MPIRequirement attempts to be as configurable as possible for running HPC jobs with MPI. An example beeflow:MPIRequirement in BEE is shown below:

beeflow:MPIRequirement:
  nodes: 10
  ntasks: 32

The values for nodes and ntasks are then passed to the template and can be used to request the required resources from the underlying scheduler on submission.

beeflow:CheckpointRequirement

BEE is designed to manage workflows that include long-running scientific simulations, which require checkpointing and restarting. We implemented the beeflow:CheckpointRequirement for this purpose. If a step in a workflow includes this requirement and the task stops, for example because the job reached its time limit, a subtask will run to continue the simulation using the specified checkpoint file.

An example beeflow:CheckpointRequirement in BEE is shown below:

beeflow:CheckpointRequirement:
     enabled: true
     file_path: checkpoint_output
     container_path: checkpoint_output
     file_regex: backup[0-9]*.crx
     restart_parameters: -R
     num_tries: 3

In the above example, file_path is the location of the checkpoint file. file_regex specifies the regular expression for the possible checkpoint filenames. The value of restart_parameters is added to the run command, followed by the path to the latest checkpoint file. num_tries specifies the maximum number of times the task will be restarted.
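As a rough walk-through of the same example, the sketch below annotates each field and shows how a restart command might be assembled. The checkpoint filenames and the <base command> placeholder are hypothetical, and this page does not say how BEE decides which matching file is the latest, so the final comment is only an assumption about the general shape of the command.

```yaml
beeflow:CheckpointRequirement:
  enabled: true
  # Location of the checkpoint file(s) written by the simulation.
  file_path: checkpoint_output
  container_path: checkpoint_output
  # Would match names such as backup0001.crx or backup0002.crx
  # (hypothetical files).
  file_regex: backup[0-9]*.crx
  # Added to the run command, followed by the latest checkpoint path.
  restart_parameters: -R
  # The task is restarted at most three times.
  num_tries: 3

# If backup0002.crx were the latest matching file, the restarted
# command would look roughly like:
#   <base command> -R checkpoint_output/backup0002.crx
```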

beeflow:SchedulerRequirement

This requirement is designed for specifying additional information that will be passed to a scheduler such as Slurm on job submission. It currently supports the following options:

  • timeLimit - time limit for the job, currently given in the format that Slurm uses.

  • account - may be useful if running jobs with different accounts (if you want to run all workflows with the same account it’s best to set this with the default_account option under the job section in the bee.conf file).

  • partition - partition to launch job with.

An example is shown below:

beeflow:SchedulerRequirement:
  timeLimit: 00:00:10
  account: account12345
  partition: scaling
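CWL allows several hints on the same step. The sketch below combines the scheduler options with the MPI example from earlier on this page; the step name, tool file, and empty in/out entries are placeholders. The time limit is quoted here as a defensive choice, since some YAML 1.1 parsers interpret unquoted colon-separated digits as base-60 numbers. Whether BEE needs the quotes is not confirmed by this page.

```yaml
steps:
  simulate:                   # placeholder step
    run: simulate.cwl
    in: {}
    out: []
    hints:
      beeflow:MPIRequirement:
        nodes: 10
        ntasks: 32
      beeflow:SchedulerRequirement:
        # Quoted so the value is always read as a string.
        timeLimit: "00:00:10"
        account: account12345
        partition: scaling
```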

beeflow:ScriptRequirement

Some tasks may require small additional commands for setup or teardown, such as loading modules, setting up checkpointing files, or cleaning up after a run. The script requirement enables this by adding shell scripts that run before and after a task: the pre_script runs before the task and the post_script runs after it. The scripts must be within the workflow directory. The desired shell interpreter must be specified both in the beeflow:ScriptRequirement section of the CWL file and in the shebang line of each script; otherwise an error will be returned. An error is also expected if the two specify different interpreters. The default shell is /bin/bash. Currently, we only support running scripts outside of a container. We are considering adding container support in the future.

ScriptRequirement currently supports the following options:

  • enabled - Enables pre/post script support.

  • pre_script - Path to the pre_script relative to the workflow directory.

  • post_script - Path to the post_script relative to the workflow directory.

  • shell - Desired shell interpreter. Must match shell interpreter defined in pre/post scripts.

An example beeflow:ScriptRequirement is shown below:

beeflow:ScriptRequirement:
  enabled: True
  pre_script: before.sh
  post_script: after.sh
  shell: /bin/bash
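The following sketch shows how the hint might look on a step when the scripts live in a subdirectory of the workflow directory. The scripts/ directory, the file names, the step name, and the tool file are hypothetical, and it assumes that BEE accepts relative paths into subdirectories. The comments restate the interpreter rule described above.

```yaml
steps:
  simulate:                   # placeholder step
    run: simulate.cwl
    in: {}
    out: []
    hints:
      beeflow:ScriptRequirement:
        enabled: true
        # Paths are relative to the workflow directory.
        pre_script: scripts/load_modules.sh
        post_script: scripts/cleanup.sh
        # Both scripts would need the matching shebang line #!/bin/bash
        shell: /bin/bash
```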