Machine Learning Process
Introduction
This learning resource demonstrates a Machine Learning (ML) system for classification of Sentinel-2 images into 10 different classes using cloud-native technologies. The system leverages MLFLOW to track the training process and selects the best candidate trained model from MLFLOW server.
There are two workflows developed: one for training a deep learning model classifier on EuroSAT dataset, and one for running a prediction on a real world Sentinel-2 data. The automation is achieved using Kubernetes-native tools, making the setup scalable, modular, and suitable for Earth Observation and geospatial applications.
Key Components
This setup integrates the following technologies and concepts:
MLFLOW
- Manages end-to-end ML workflows, from development to production
- End-to-end MLOps solution for traditional ML, including integrations with traditional ML models, and Deep learning one
- Simple, low-code performance tracking with autologging
- State-of-the-art UI for model analysis and comparison
STAC & DuckDB
- Acts as the primary data source by providing geospatial data in a standardized format. The STAC collection includes references to the EUROSAT benchmark dataset, which can be queried using DuckDB.
High-Level Architecture
The system is designed to handle the following flow:
-
Training pipeline: A CNN model trained on EuroSAT dataset which already exist on a dedicated STAC endpoint. The MLFLOW tracks the whole process to monitor the life cycle of training.
-
Inference: Run the inference pipeline to perform tile-based classification on Sentinel-2 L1C products.
-
Workflow Execution: Both training and inference pipeline will be executed using the CWL-based algorithm.
Why Use This Setup?
This setup demonstrates the power of combining machine learning paradigms with container-native workflows to enable scalable geospatial analysis.
It is particularly suited for Earth observation and scientific workflows because:
- Scalability: Kubernetes ensures workflows can handle varying loads effectively.
- Modularity: Components can be easily reused or replaced for other applications.
- Automation: Events trigger workflows without manual intervention, enabling real-time processing.
Through this resource, you'll learn to implement a cloud-native pipeline for tile-based classification, which can be extended to other geospatial or scientific applications.