# Training Module & CWL Runner
This Application Package provides a CWL document containing a top-level workflow with a single CommandLineTool step that executes the training pipeline. It also supports parallel execution, allowing users to specify multiple sets of hyperparameters or training configurations. This makes it suitable for large-scale experiments and hyperparameter tuning on platforms such as a Minikube cluster.
To execute the training workflow, users can choose between cwltool and Calrissian as their CWL runners.
## Inputs

| Parameter | Type | Description |
|---|---|---|
| `MLFLOW_TRACKING_URI` | string | Environment variable pointing to the MLflow tracking server. |
| `stac_reference` | string | URL pointing to a STAC catalog. The model reads GeoParquet annotations from the collection's assets. |
| `BATCH_SIZE` | int[] | Number of samples per training batch. |
| `CLASSES` | int | Number of land cover classes to classify. |
| `DECAY` | float[] | Learning-rate decay value used in training. |
| `EPOCHS` | int[] | Number of training epochs. |
| `EPSILON` | float[] | Epsilon value (model hyperparameter). |
| `LEARNING_RATE` | float[] | Initial learning rate for the optimizer. |
| `LOSS` | string[] | Loss function for training. Options: `binary_crossentropy`, `cosine_similarity`, `mean_absolute_error`, `mean_squared_logarithmic_error`, `squared_hinge`. |
| `MEMENTUM` | float[] | Momentum parameter used in the optimizers. |
| `OPTIMIZER` | string[] | Optimization algorithm. Options: `Adam`, `SGD`, `RMSprop`. |
| `REGULARIZER` | string[] | Regularization technique to avoid overfitting. Options: `l1l2`, `l1`, `l2`, `None`. |
| `SAMPLES_PER_CLASS` | int | Number of training samples per class. |
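For illustration, a `params.yaml` supplying these inputs might look like the following. All values here are hypothetical placeholders; array-typed parameters take YAML lists, and each combination can be trained in parallel:

```yaml
# Hypothetical example values -- adjust to your environment and experiment.
MLFLOW_TRACKING_URI: http://my-mlflow.local:5000   # assumed tracking server URL
stac_reference: https://example.com/stac/catalog.json
BATCH_SIZE: [16, 32]
CLASSES: 10
DECAY: [0.0001]
EPOCHS: [50]
EPSILON: [1.0e-7]
LEARNING_RATE: [0.001, 0.0001]
LOSS: [binary_crossentropy]
MEMENTUM: [0.9]
OPTIMIZER: [Adam, SGD]
REGULARIZER: [l2]
SAMPLES_PER_CLASS: 500
```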
## How to execute the Application Package
Before running the application with a CWL runner, make sure to download and use the latest version of the CWL document:

```bash
cd training/app-package
VERSION=$(curl -s https://api.github.com/repos/eoap/machine-learning-process/releases/latest | jq -r '.tag_name')
curl -L -o "tile-sat-training.cwl" \
  "https://github.com/eoap/machine-learning-process/releases/download/${VERSION}/tile-sat-training.${VERSION}.cwl"
```
### Run the Application Package

There are two methods to execute the application:

- Executing the tile-based training using `cwltool` in a terminal:

  ```bash
  cwltool --podman --debug --parallel tile-sat-training.cwl#tile-sat-training params.yaml
  ```

- Executing the tile-based training using `calrissian` in a terminal:

  ```bash
  calrissian --debug --stdout /calrissian/out.json --stderr /calrissian/stderr.log --usage-report /calrissian/report.json --parallel --max-ram 10G --max-cores 2 --tmp-outdir-prefix /calrissian/tmp/ --outdir /calrissian/results/ --tool-logs-basepath /calrissian/logs tile-sat-training.cwl#tile-sat-training params.yaml
  ```
You can monitor pod creation using the `kubectl` command below:

```bash
kubectl get pods
```
## How the CWL document is designed

The CWL workflow can be executed using either `cwltool` or `calrissian`. The execution requires a `params.yaml` file, which supplies all the necessary inputs defined in the CWL specification. The workflow is structured to run the module according to the diagram outlined below.

The `[]` in the diagram above indicates that the user may pass a list of values for that parameter to the application package.
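To make the list-valued inputs concrete, the fragment below sketches how a workflow step can fan out over hyperparameter arrays using CWL's scatter feature. This is an illustrative simplification, not the actual `tile-sat-training.cwl`; the step id, input names, and output are assumptions:

```yaml
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  LEARNING_RATE: float[]
  OPTIMIZER: string[]
steps:
  train:
    run: '#training-tool'            # hypothetical CommandLineTool id
    scatter: [LEARNING_RATE, OPTIMIZER]
    scatterMethod: flat_crossproduct  # one run per parameter combination
    in:
      LEARNING_RATE: LEARNING_RATE
      OPTIMIZER: OPTIMIZER
    out: [model]
outputs:
  models:
    type: File[]
    outputSource: train/model
```

With `--parallel`, the runner can execute the scattered training runs concurrently.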
## For developers
Users can train multiple tile-based classifiers using the CWL runner, with model weights tracked as artifacts in MLflow. Once training is complete, the next step is to retrieve the best-performing model, based on the chosen evaluation metric, from the MLflow artifact registry and convert it to ONNX format. This process is detailed in the "Export the Best Model to ONNX Format" guide. The resulting ONNX model can then be integrated into the inference application package.
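As an illustrative sketch of the selection step, the logic for picking the best-performing run by a chosen metric could look like the following. This is a hypothetical helper, not the project's actual code; in practice the run records would come from the MLflow tracking server (e.g. via `mlflow.search_runs()`), but plain dicts are used here so the example is self-contained:

```python
# Hypothetical sketch: select the best training run by an evaluation metric.
# Each run is a dict with a "run_id" and a "metrics" mapping, mimicking the
# shape of data returned by an MLflow runs query.

def best_run(runs, metric, maximize=True):
    """Return the run whose `metric` is best (highest if maximize, else lowest)."""
    scored = [r for r in runs if metric in r["metrics"]]
    if not scored:
        raise ValueError(f"no run reports metric {metric!r}")
    return (max if maximize else min)(scored, key=lambda r: r["metrics"][metric])

runs = [
    {"run_id": "a1", "metrics": {"val_accuracy": 0.82}},
    {"run_id": "b2", "metrics": {"val_accuracy": 0.91}},
    {"run_id": "c3", "metrics": {"val_accuracy": 0.88}},
]

print(best_run(runs, "val_accuracy")["run_id"])  # -> b2
```

The winning run's model artifact would then be downloaded from the MLflow artifact store and converted to ONNX, as described in the "Export the Best Model to ONNX Format" guide.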
## Troubleshooting

Users might encounter memory-related issues when executing workflows with CWL runners (especially with `cwltool`). These issues can often be mitigated by reducing the `ramMax` parameter (e.g. `ramMax: 1000`) specified in the CWL file, which helps prevent excessive memory allocation.
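For example, the resource hint in the CWL document might be lowered like this (illustrative fragment; `ramMax` is expressed in mebibytes per the CWL `ResourceRequirement` specification):

```yaml
requirements:
  ResourceRequirement:
    coresMax: 1
    ramMax: 1000   # MiB; lower this if the runner over-allocates memory
```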