Execution Guide
This guide details how to execute the Sentinel-2 Monthly Mosaic Generator module, leveraging Dask for parallel processing in a distributed cluster environment and Calrissian for CWL workflow orchestration.
Distributed Cluster Setup
To enable parallel processing with Dask, a Dask Gateway object is required to create a Dask Cluster.
Prerequisites:
- Dask Gateway Deployment: It is assumed that a Dask Gateway is already deployed and accessible. The dev-platform-eoap repository provides a streamlined method for deploying Dask Gateway on Kubernetes, which can then be integrated with a Code-Server for module execution.
- Dask Cluster Environment Variable: Ensure the DASK_CLUSTER environment variable is set to point to your Dask cluster. For example:
  export DASK_CLUSTER=eoap-dask-gateway.600b64a112eb404888df41006e19666f
- Dask Worker Image: For Dask to utilize your module's dependencies and code, a custom Docker image for the Dask workers must be built and made accessible to the Kubernetes cluster. This image should contain all the necessary Python packages and your module's code.
- Building and Providing the Worker Image: The easiest way to build the worker image for this module is to navigate into the cloudless-mosaic/ directory (where the Dockerfile for the worker is located) and run a Docker build command:
  cd cloudless-mosaic/
  docker build -t cloudless-mosaic-worker:latest .
  Once built, this image needs to be available to your Kubernetes cluster.
- Using ttl.sh for Temporary Images: For development or testing, ttl.sh provides an anonymous, ephemeral registry. This allows you to quickly build and push a temporary, pullable image without authentication, which expires after a set duration (e.g., 1 hour).
  IMAGE_NAME=cloudless-mosaic-worker
  docker build -t ttl.sh/${IMAGE_NAME}:1h .
  docker push ttl.sh/${IMAGE_NAME}:1h
  echo "Temporary image available at: ttl.sh/${IMAGE_NAME}:1h"
  You would then use this ttl.sh image name in your Dask Gateway configuration or Calrissian workflow.
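With the prerequisites above in place, attaching to the cluster from Python might look like the following minimal sketch. It assumes the dask-gateway client package is installed and that the gateway address and authentication are supplied through the Dask configuration. The helper names resolve_cluster_name and connect_to_cluster are hypothetical and are not part of the module.

```python
import os


def resolve_cluster_name(env=None):
    """Return the cluster name stored in DASK_CLUSTER, or raise if it is unset."""
    env = os.environ if env is None else env
    name = env.get("DASK_CLUSTER", "").strip()
    if not name:
        raise RuntimeError("DASK_CLUSTER is not set; export it before running the module")
    return name


def connect_to_cluster(cluster_name):
    """Attach to an existing Dask Gateway cluster and return a distributed client."""
    # Imported lazily so resolve_cluster_name works without dask-gateway installed.
    from dask_gateway import Gateway

    # Assumption: the gateway address and auth come from the Dask configuration
    # (the gateway.address key), so no arguments are passed here.
    gateway = Gateway()
    cluster = gateway.connect(cluster_name)
    return cluster.get_client()


# Typical use inside a notebook or script (requires a reachable gateway):
#   client = connect_to_cluster(resolve_cluster_name())
```

Failing early when DASK_CLUSTER is missing gives a clearer error than a connection attempt against an empty cluster name.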
Once the Dask Gateway is available, you can proceed with running the processing module:
- Direct Execution: The provided Jupyter Notebook demonstrates how to execute the module directly using its main function.
- Command-Line Interface (CLI): Alternatively, the module can be executed via its CLI, offering a more traditional command-line approach.
Calrissian CWL Execution
This project includes a Common Workflow Language (CWL) workflow for automating the generation of Sentinel-2 monthly mosaics. The workflow leverages a DaskGatewayRequirement extension, enabling CWL runners like Calrissian to optimize parallel processing on a Dask cluster.
Environment Setup:
Ensure the environment described in the Distributed Cluster Setup section is established, including the DASK_CLUSTER environment variable.
Execution Method
The workflow file can be executed using Calrissian in two primary ways:
- Direct Calrissian Command: Execute the workflow directly using the calrissian command:
calrissian \
--stdout /calrissian/results.json \
--stderr /calrissian/app.log \
--max-ram 16G \
--max-cores "8" \
--tmp-outdir-prefix /calrissian/tmp/ \
--outdir /calrissian/results \
--usage-report /calrissian/usage.json \
--tool-logs-basepath /calrissian/logs \
--pod-serviceaccount calrissian-sa \
--dask-gateway-url "http://traefik-dask-gateway.eoap-dask-gateway.svc.cluster.local:80" \
https://github.com/eoap/dask-app-package/releases/download/1.0.1/cloudless-mosaic.1.0.1.cwl \
--resolution 100 \
--start-date 2020-01-01 \
--end-date 2020-12-31
- Kubernetes Job Submission: For robust and scalable execution, you can submit the workflow as a Kubernetes Job:
---
apiVersion: batch/v1
kind: Job
metadata:
  name: calrissian-mosaic
spec:
  ttlSecondsAfterFinished: 60
  template:
    spec:
      serviceAccountName: default
      securityContext:
        runAsUser: 0
        runAsGroup: 0
      containers:
        - name: calrissian
          image: calrissian:0.19.0
          command: ["calrissian"]
          args:
            - --debug
            - --stdout
            - /calrissian/results.json
            - --stderr
            - /calrissian/app.log
            - --max-ram
            - 16G
            - --max-cores
            - "8"
            - --tmp-outdir-prefix
            - /calrissian/tmp/
            - --outdir
            - /calrissian/results
            - --usage-report
            - /calrissian/usage.json
            - --tool-logs-basepath
            - /calrissian/logs
            - --dask-gateway-url
            - "http://traefik-dask-gateway.eoap-dask-gateway.svc.cluster.local:80"
            - https://github.com/eoap/dask-app-package/releases/download/1.0.1/cloudless-mosaic.1.0.1.cwl
            - --resolution
            # Container args must be strings, so the number is quoted.
            - "100"
            - --start-date
            - "2020-01-01"
            - --end-date
            - "2020-12-31"
          volumeMounts:
            - name: calrissian-volume
              mountPath: /calrissian
          env:
            - name: CALRISSIAN_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
      restartPolicy: Never
      volumes:
        - name: calrissian-volume
          persistentVolumeClaim:
            claimName: calrissian-claim
  backoffLimit: 3
Important Note: Ensure you are using a Calrissian image that supports DaskGateway. The latest version can be found here.
Input Data
The module retrieves Sentinel-2 imagery from the Planetary Computer STAC API based on the specified parameters:
- Time Range: Defined by --start-date and --end-date.
- Area of Interest (AOI): A bounding box (min_lon,min_lat,max_lon,max_lat).
- Bands: Specific spectral bands to retrieve (e.g., nir, red, green).
- Collection: The Sentinel-2 collection to use (e.g., sentinel-2-l2a).
- Max Cloud Cover: Filters images to include only those below a specified cloud cover percentage.
- Max Items: Limits the maximum number of STAC items to process.
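The parameters above map naturally onto a STAC item search. The following is a minimal sketch of that mapping, not the module's actual implementation: the function name build_search_params and its default values are hypothetical, and it assumes the search is issued through pystac-client, whose Client.search method accepts the collections, bbox, datetime, query, and max_items keyword arguments.

```python
def build_search_params(start_date, end_date, bbox, collection="sentinel-2-l2a",
                        max_cloud_cover=20, max_items=1000):
    """Translate the module's input parameters into STAC search keyword arguments."""
    if len(bbox) != 4:
        raise ValueError("bbox must be (min_lon, min_lat, max_lon, max_lat)")
    return {
        "collections": [collection],
        "bbox": list(bbox),
        # STAC expresses a closed time range as "start/end".
        "datetime": f"{start_date}/{end_date}",
        # Keep only items whose reported cloud cover is below the threshold.
        "query": {"eo:cloud_cover": {"lt": max_cloud_cover}},
        "max_items": max_items,
    }


# The bounding box below is an arbitrary illustration, not a value from the module.
params = build_search_params("2020-01-01", "2020-12-31", (-122.5, 47.0, -121.5, 48.0))
print(params["datetime"])  # prints 2020-01-01/2020-12-31

# Assumed usage with pystac-client (requires network access):
#   from pystac_client import Client
#   catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
#   items = catalog.search(**params).item_collection()
```

Band selection does not appear in the search itself; under this sketch it would be applied later, when the assets of the returned items are loaded.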
Output Data
Upon successful execution, the pipeline generates the following outputs:
- Cloud-Optimized GeoTIFFs (COGs): Monthly median mosaics are saved as COGs, organized into directories named monthly-mosaic-YYYY-MM.
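To anticipate which output directories a run should produce, the naming scheme above can be expanded for a given date range. This is a small sketch using only the standard library; the function name monthly_output_dirs is hypothetical, and it assumes one directory per calendar month touched by the range, which may differ from the module's behavior for months without usable imagery.

```python
from datetime import date


def monthly_output_dirs(start_date, end_date):
    """List one monthly-mosaic-YYYY-MM directory name per calendar month in range."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if end < start:
        raise ValueError("end_date must not precede start_date")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"monthly-mosaic-{year:04d}-{month:02d}")
        # Advance to the next calendar month, rolling over at December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


dirs = monthly_output_dirs("2020-01-01", "2020-12-31")
print(dirs[0], dirs[-1], len(dirs))  # prints monthly-mosaic-2020-01 monthly-mosaic-2020-12 12
```

Comparing this list against the contents of the results directory is a quick way to spot months that were skipped.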