Skip to content

Calrissian CWL Extension: DaskGatewayRequirement

Overview

The DaskGatewayRequirement is a custom CWL extension developed for the Calrissian CWL runner, enabling dynamic provisioning of Dask clusters through Dask Gateway on Kubernetes. This extension allows a CommandLineTool to declare runtime needs for distributed processing resources (e.g., cores, memory) via a standard CWL ProcessRequirement.

When used in a workflow step, Calrissian will:

  • Provision a Dask cluster with the specified resource limits.
  • Run the CWL tool with access to this Dask cluster.
  • Shutdown the cluster after tool execution completes.

Schema

The DaskGatewayRequirement is defined in the following custom schema:

$base: https://calrissian-cwl.github.io/schema#
$namespaces:
  cwl: "https://w3id.org/cwl/cwl#"
$graph:
  - $import: https://w3id.org/cwl/CommonWorkflowLanguage.yml

  - name: DaskGatewayRequirement
    type: record
    extends: cwl:ProcessRequirement
    inVocab: false
    doc: "Indicates that a process requires a Dask cluster procured via [Dask Gateway](https://gateway.dask.org/) runtime."
    fields:
      class:
        type: 'string'
        doc: "Always 'DaskGatewayRequirement'"
        jsonldPredicate:
          "_id": "@type"
          "_type": "@vocab"
      workerCores:
        type: ['int', 'cwl:Expression']
        doc: Number of cpu-cores available for a Dask worker.
      workerCoresLimit:
        type: ['int', 'cwl:Expression']
        doc: Maximum number of cpu-cores available for a Dask worker.
      workerMemory:
        type: ['string', 'cwl:Expression']
        doc: Maximum memory (e.g., '4G') per Dask worker.
      clusterMaxCores:
        type: ['int', 'cwl:Expression']
        doc: Cluster-wide core count upper limit.
      clusterMaxMemory:
        type: ['string', 'cwl:Expression']
        doc: Cluster-wide memory upper limit.

Example Usage in CommandLineTool

cwlVersion: v1.2
$namespaces:
  s: https://schema.org/
  calrissian: https://calrissian-cwl.github.io/schema#
s:softwareVersion: 1.4.1
schemas:
  - http://schema.org/version/9.0/schemaorg-current-http.rdf

class: CommandLineTool
id: eopf-sentinel-2
requirements:
  DockerRequirement:
    dockerPull: ttl.sh/eopf-sentinel-2:2h@sha256:...
  EnvVarRequirement:
    envDef: {}
  calrissian:DaskGatewayRequirement:
    workerCores: 1
    workerCoresLimit: 1
    workerMemory: "4G"
    clusterMaxCore: 5
    clusterMaxMemory: "20G"
baseCommand: ["eopf-sentinel-2-proc"]
arguments: []
inputs:
  item-url:
    type: string
    inputBinding:
      position: 1
      prefix: "--item-url"
outputs:
  stac-catalog:
    type: Directory
    outputBinding:
      glob: .

Runtime Behavior

When Calrissian encounters a DaskGatewayRequirement, the following workflow is triggered:

  • Cluster Provisioning - Calrissian connects to the Dask Gateway specified by the CLI flag --dask-gateway-url and provisions a Dask cluster matching the requested cores and memory.
  • Tool Execution - The CommandLineTool is launched with environment variables or context pointing to the Dask cluster (e.g., DASK_SCHEDULER_ADDRESS). This allows the tool to use Dask's distributed scheduler as needed.
  • Cluster Teardown - Once execution finishes, Calrissian shuts down the Dask cluster, ensuring efficient resource usage.

CLI Integration

A new CLI option is introduced in Calrissian to support this functionality:

--dask-gateway-url <URL>

Description:

Specifies the base URL of the Dask Gateway (e.g., https://dask-gateway.example.com). This option is optional but required when using DaskGatewayRequirement.

Notes

This extension assumes that the tool being run is compatible with Dask and can use a Dask cluster.

The resource requests (workerCores, workerMemory, etc.) can also be dynamic via CWL expressions.

Error handling for failed cluster creation or shutdown is included in Calrissian’s internal execution flow.