Lessons Learned from Building a Machine Learning process for Geospatial Data
Introduction
This page highlights the technical challenges, design decisions, and key insights gained while developing the machine learning process for geospatial data pipeline.
It also includes recommendations for future improvements and practical advice for replicating or extending the setup.
Design Decisions
Modular Workflow Templates
Decision: Separate the CWL execution, training pipeline, and inference pipeline into distinct workflow templates.
Outcome: * Enhanced reusability for other geospatial pipelines requiring similar preprocessing steps.
STAC Integration
Decision: Leverage the STAC API, Geoparquet, and DuckDB for querying and storing geospatial data.
Outcome: * Improved interoperability with other geospatial tools and standards.
Tracking the process
Decision: Use MLFLOW exclusively for tracking the process of training workflow and selecting the best model candidate.
Test inference with Sentinel-2 product
Decision: Use Stars tool to stage-in a sentinel-2 product ready to pass to inference module.
Challenges and Solutions
Build Docker Images
Challenge: Initially, we used an advanced tooling technique that leveraged Taskfile to build a Kaniko-based image and reference the CWL files. The image was then pushed to ttl.sh, a temporary image registry. This will help us to execute the application packages using calrissian. However, this process was slow and hard to debug, often failing due to the large size of the Kaniko images.
Solution: We now push the Docker images to a dedicated GitHub Container Registry.