Common Workflow Language (CWL)
BEE workflows are written in the Common Workflow Language (CWL). CWL allows workflows to be written in a simple YAML format that can represent “steps” (or tasks) of a workflow as well as how those steps depend on each other. Each step has a base command to be used along with inputs and outputs. Special options are used to indicate how those inputs and outputs can be used to form the actual command to be run on a system. In addition, requirements and hints are special options that determine what a particular step requires and the environment in which the step can run. Dependencies are expressed naturally through the inputs and outputs that flow between the tasks.
In the past, each workflow orchestration system used its own workflow format, based on its needs at the time. As a result, a workflow written for one system was time-consuming to port to another. This is one of the main reasons the CWL standard was created, and today a number of workflow systems, including BEE, support it. By using this standard we hope to make it easier for users to both write and extend their workflows for different systems.
BEE currently supports a subset of the CWL standard. On this page we’ll try to list any differences between the CWL standard and what BEE supports, as well as extensions that you might want to use.
For general information about writing workflows please take a look at the Common Workflow Language User Guide. If you have more questions or are interested in particular details of CWL, then you might want to take a look at the specification.
BEE-Specific Requirements/Hints
Requirements and hints are both used to help BEE configure the environment for
each step. When a requirement is encountered in a workflow, the workflow system
must be able to fulfill it or raise an error indicating a failure. On the
other hand, if a hint is not supported, the implementation can safely ignore
it. Since the standard only includes a limited number of requirements, not all
of which are useful in an HPC setting, we’ve added some extensions that are
prefixed with beeflow: (for example, beeflow:MPIRequirement). Please specify
beeflow extensions as hints in the steps of your CWL workflow specification.
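As a rough illustration of where such a hint goes, the sketch below attaches a beeflow extension to a single workflow step. The step name simulate, the file simulate.cwl, and the input and output names are hypothetical placeholders; only the placement of the hints block under the step reflects the guidance above.

```yaml
# Minimal sketch of a CWL workflow with one step (all names are hypothetical).
cwlVersion: v1.0
class: Workflow

inputs:
    input_file: File

outputs:
    result:
        type: File
        outputSource: simulate/result

steps:
    simulate:
        run: simulate.cwl            # hypothetical tool description
        in:
            input_file: input_file
        out: [result]
        hints:
            # beeflow extensions are listed under the step's hints
            beeflow:MPIRequirement:
                nodes: 2
                ntasks: 8
```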
DockerRequirement
A DockerRequirement (a CWL standard specification) is used to run a step with
a container. BEE does not support Docker, but it does support Charliecloud and
also has limited support for Singularity.

An example DockerRequirement in BEE is shown below:
DockerRequirement:
    dockerFile: "dockerfile-name"
    beeflow:containerName: "some-container"
This example includes two suboptions: a dockerFile option that specifies the
name of a dockerfile, and an extension, beeflow:containerName, that gives the
name of the container to build. Below are some of the suboptions that BEE
supports and how they can affect running a step with a container.
[Table: Suboption Name | Usage/Meaning/Requirements. The suboption names and descriptions are missing from this copy; four of the listed suboptions are marked "Not supported".]
beeflow:MPIRequirement
BEE also includes a special requirement for running MPI jobs. Note that CWL has
also experimented with a cwltool-specific hint (see their paper). That
experimental extension includes basic support for running MPI jobs, but doesn’t
include all of the options that are important for the HPC systems that BEE is
designed to run on. For instance, many jobs require MPI runtime information
that needs to be passed to the underlying scheduler (such as the MPI type
passed to Slurm with the --mpi={version} option).
BEE’s beeflow:MPIRequirement attempts to be as configurable as possible for
running HPC jobs with MPI. An example beeflow:MPIRequirement in BEE is shown
below:
beeflow:MPIRequirement:
    nodes: 10
    ntasks: 32
The values for nodes and ntasks are then passed to the template and can be
used to request the required resources from the underlying scheduler on
submission.
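To make the hand-off to the scheduler more concrete, the annotated copy below notes how these two values could map onto Slurm's --nodes and --ntasks options. The mapping in the comments is an assumption for illustration only; the template BEE actually uses may render the values differently.

```yaml
beeflow:MPIRequirement:
    nodes: 10     # could be rendered as: #SBATCH --nodes=10
    ntasks: 32    # could be rendered as: #SBATCH --ntasks=32
```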
beeflow:CheckpointRequirement
BEE is designed to manage workflows that include long-running scientific
simulations, which require checkpointing and restarting. We implemented the
beeflow:CheckpointRequirement for this purpose. If a step in a workflow
includes this requirement and the task stops, for example because the job hit
its time limit, a subtask will run to continue the simulation using the
specified checkpoint file.

An example beeflow:CheckpointRequirement in BEE is shown below:
beeflow:CheckpointRequirement:
    enabled: true
    file_path: checkpoint_output
    container_path: checkpoint_output
    file_regex: backup[0-9]*.crx
    restart_parameters: -R
    num_tries: 3
In the above example, file_path is the location of the checkpoint file.
file_regex specifies the regular expression for the possible checkpoint
filenames. The restart_parameters value is added to the run command, followed
by the path to the latest checkpoint file, and num_tries specifies the maximum
number of times the task will be restarted.
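The annotated copy below walks through the same options and sketches what a restart command could look like. It assumes that file_path names a directory, that file_regex follows the usual regular-expression syntax, and that a file called backup00010.crx exists; the file name and the shape of the final command are hypothetical. Under the usual syntax the unescaped . matches any character, so a stricter pattern would escape it.

```yaml
beeflow:CheckpointRequirement:
    enabled: true
    file_path: checkpoint_output         # assumed: directory searched for checkpoint files
    container_path: checkpoint_output    # kept as in the example above
    file_regex: backup[0-9]*.crx         # matches e.g. backup00010.crx
    restart_parameters: -R               # placed before the checkpoint path
    num_tries: 3                         # at most 3 restarts
# On a restart, the command could then look roughly like:
#   <original command> -R checkpoint_output/backup00010.crx
```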
beeflow:SchedulerRequirement
This requirement is designed for specifying additional information that will be passed to a scheduler such as Slurm on job submission. It currently supports the following options:
- timeLimit: time limit for the job, in the format that Slurm currently uses.
- account: may be useful if running jobs with different accounts (if you want to run all workflows with the same account, it’s best to set this with the default_account option under the job section in the bee.conf file).
- partition: partition to launch the job with.
An example is shown below:
beeflow:SchedulerRequirement:
    timeLimit: 00:00:10
    account: account12345
    partition: scaling
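As a hedged illustration, the variant below annotates each option with the sbatch option it resembles; whether BEE passes the values through exactly this way is an assumption. Quoting the time value is a defensive choice, not something the text above requires: some YAML 1.1 parsers read unquoted colon-separated digits such as 1:30:00 as a base-60 number.

```yaml
beeflow:SchedulerRequirement:
    timeLimit: "00:00:10"    # hours:minutes:seconds, as in: sbatch --time=00:00:10
    account: account12345    # as in: sbatch --account=account12345
    partition: scaling       # as in: sbatch --partition=scaling
```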
beeflow:ScriptRequirement
Some tasks may require small additional commands for setup or teardown, such as
loading modules, setting up checkpointing files, or cleaning up after a run.
The script requirement enables this by adding shell scripts that run before
and after a task: the pre_script runs before the task and the post_script runs
after it. The scripts must be located within the workflow directory. The
desired shell interpreter must be specified both in the
beeflow:ScriptRequirement section of the CWL file and in the shebang line of
each script; otherwise an error is returned. An error is also expected if the
two specify different interpreters. The default shell is /bin/bash. Currently,
scripts can only be run outside of a container; we are considering adding
container support in the future.
ScriptRequirement currently supports the following options:
- enabled: enables pre/post script support.
- pre_script: path to the pre_script, relative to the workflow directory.
- post_script: path to the post_script, relative to the workflow directory.
- shell: desired shell interpreter. Must match the shell interpreter defined in the pre/post scripts.

An example beeflow:ScriptRequirement is shown below:
beeflow:ScriptRequirement:
    enabled: True
    pre_script: before.sh
    post_script: after.sh
    shell: /bin/bash
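The annotated copy below spells out the interpreter-matching rule. The comments describe what the first line of each script would need to be, assuming both scripts sit at the top level of the workflow directory; the script contents themselves are not part of this page.

```yaml
beeflow:ScriptRequirement:
    enabled: True
    pre_script: before.sh    # first line of before.sh would be: #!/bin/bash
    post_script: after.sh    # first line of after.sh would be:  #!/bin/bash
    shell: /bin/bash         # must agree with both shebang lines
```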