Key Capabilities
- Real-time air quality monitoring across Sri Lanka
- Multi-source environmental data integration
- ML-powered air quality forecasting
- Automated ETL pipeline with data quality checks
- Interactive public-facing dashboards
- Optimized time-series query performance
What is Pneumetra?
Air quality monitoring in Sri Lanka has historically been fragmented — data scattered across government agencies, weather services, and satellite feeds with no unified view. Pneumetra addresses this by centralising environmental data into a single cloud analytics platform built on Snowflake, enabling continuous monitoring, historical analysis, and predictive forecasting of air quality across the country.
The platform serves a broad audience: policymakers who need reliable data to inform environmental regulations, researchers studying pollution trends over time, and the general public seeking accessible air quality information for their region.
Key Features
Data Ingestion
Pneumetra pulls weather data from the OpenMeteo API. A scheduled ingestion script polls the API daily for new readings and keeps the database continuously up to date.
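As a rough illustration, a minimal version of that daily job might look like the sketch below. The Open-Meteo endpoint and hourly variable names are real API parameters, but the coordinates and surrounding structure are assumptions, not the production script.

```python
# Minimal sketch of the daily ingestion job. The coordinates and overall
# structure are illustrative assumptions, not the production values.
import requests

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"  # public Open-Meteo endpoint

def fetch_daily_weather(latitude: float, longitude: float) -> dict:
    """Pull the latest hourly weather readings for one location."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m,relative_humidity_2m,wind_speed_10m",
        "past_days": 1,  # only fetch the last day's readings on each run
    }
    response = requests.get(OPEN_METEO_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    payload = fetch_daily_weather(6.9271, 79.8612)  # Colombo
    print(f"Fetched {len(payload['hourly']['time'])} hourly readings")
```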
Scalable ETL Pipelines
Real-time ETL pipelines handle over 6K data points daily. Each stage — ingest, validate, transform, serve — runs in isolated Docker containers, enabling fault recovery. Data quality checks at the validation layer handle missing values and anomalous readings common in environmental sensor data.
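The validation layer's checks could be sketched as follows; the column names, plausibility thresholds, and staleness window here are illustrative assumptions rather than the actual rules.

```python
# Sketch of the validation stage's data quality checks. Column names and
# thresholds are assumptions for illustration only.
import pandas as pd

def validate_readings(df: pd.DataFrame) -> pd.DataFrame:
    """Flag missing values and anomalous sensor readings before transform."""
    df = df.copy()
    # Flag rows with missing core measurements rather than dropping them,
    # so the transform stage can decide how to impute.
    df["missing_pm25"] = df["pm2_5"].isna()

    # Flag physically implausible readings (negative or extreme PM2.5).
    df["anomalous"] = (df["pm2_5"] < 0) | (df["pm2_5"] > 1000)

    # Flag stale readings: an unchanging value over several consecutive
    # hours often indicates a stuck sensor.
    df["stale"] = df["pm2_5"].rolling(window=6).std().eq(0)
    return df
```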
Machine Learning Forecasting & Alerts
The analytics engine applies ML models to historical AQI data to forecast pollution trends at a regional level. When predicted values exceed configurable thresholds, the system triggers early warning alerts — a capability actively used by over 50,000 users to plan around pollution events.
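A stripped-down version of the forecast-and-alert loop might look like the sketch below, assuming a simple lag-based linear model and a fixed threshold of 150; the production models and alert delivery channels are not documented here.

```python
# Illustrative forecast-and-alert sketch. The autoregressive linear model,
# lag count, and threshold are assumptions, not the deployed configuration.
import numpy as np
from sklearn.linear_model import LinearRegression

def forecast_next_aqi(history: np.ndarray, n_lags: int = 24) -> float:
    """Fit a simple autoregressive model on hourly AQI and predict one step ahead."""
    X = np.array([history[i : i + n_lags] for i in range(len(history) - n_lags)])
    y = history[n_lags:]
    model = LinearRegression().fit(X, y)
    return float(model.predict(history[-n_lags:].reshape(1, -1))[0])

def maybe_alert(history: np.ndarray, threshold: float = 150.0) -> None:
    """Trigger an early warning when the forecast crosses the threshold."""
    predicted = forecast_next_aqi(history)
    if predicted > threshold:
        print(f"ALERT: forecast AQI {predicted:.0f} exceeds threshold {threshold:.0f}")
```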
Interactive Dashboards
Streamlit dashboards connect directly to Snowflake to render live visualisations of AQI indices, regional comparisons, and trend forecasts. The interface is designed to be usable by non-technical audiences, translating raw sensor data into clear, actionable environmental insights.
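A minimal sketch of such a page, assuming a hypothetical AQI_READINGS table and placeholder credentials (in practice these would come from Streamlit secrets or environment variables):

```python
# Sketch of a dashboard page wired to Snowflake. Connection parameters,
# table, and region list are placeholders, not the production setup.
import snowflake.connector
import streamlit as st

@st.cache_resource
def get_connection():
    return snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="DASHBOARD_WH", database="PNEUMETRA", schema="PUBLIC",
    )

st.title("Sri Lanka Air Quality")
region = st.selectbox("Region", ["Colombo", "Kandy", "Galle"])

cur = get_connection().cursor()
cur.execute(
    "SELECT reading_ts, aqi FROM AQI_READINGS WHERE region = %s ORDER BY reading_ts",
    (region,),
)
df = cur.fetch_pandas_all()  # Snowflake returns uppercase column names
st.line_chart(df, x="READING_TS", y="AQI")
```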
Technical Approach
Data moves through a four-stage pipeline:
- Ingest — REST API adapters pull from the API on configurable schedules
- Validate — Quality checks flag missing values, outliers, and stale readings before they propagate downstream
- Transform — Cleansed data is standardised into a unified AQI schema and loaded into Snowflake (a minimal sketch of this step follows the list)
- Serve — Snowflake powers both the Streamlit dashboards and the ML feature store used for forecasting
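To make the transform step concrete, here is one way the standardisation could look, assuming a hypothetical AqiRecord schema and an Open-Meteo-style payload; the real unified schema is not documented here.

```python
# Sketch of the transform stage: normalising one source's raw payload into
# a unified AQI schema. Field names on both sides are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AqiRecord:
    """One row in the (hypothetical) unified schema loaded into Snowflake."""
    reading_ts: datetime
    region: str
    pollutant: str
    value: float
    source: str

def transform_open_meteo(raw: dict, region: str) -> list[AqiRecord]:
    """Map an Open-Meteo-style hourly payload onto the unified schema."""
    hourly = raw["hourly"]
    return [
        AqiRecord(
            reading_ts=datetime.fromisoformat(ts).replace(tzinfo=timezone.utc),
            region=region,
            pollutant="pm2_5",
            value=v,
            source="open-meteo",
        )
        for ts, v in zip(hourly["time"], hourly["pm2_5"])
    ]
```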
Snowflake as the Analytical Core
Snowflake's separation of storage and compute proved critical for this workload. Real-time ingestion runs on a dedicated warehouse while analytical queries and ML feature extraction run independently, avoiding resource contention. Time-series query performance was further optimised through clustering keys on timestamp and region columns.
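The clustering change itself is a single piece of Snowflake DDL; the table and column names below are assumptions, issued through the Python connector for illustration.

```python
# Sketch of the clustering-key optimisation. CLUSTER BY is standard
# Snowflake DDL; the table and column names are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>"
)
conn.cursor().execute(
    "ALTER TABLE PNEUMETRA.PUBLIC.AQI_READINGS CLUSTER BY (READING_TS, REGION)"
)
```

Clustering on the timestamp and region columns lets Snowflake prune micro-partitions for the time-range-plus-region filters that dominate dashboard queries.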
Infrastructure on GCP
All services are containerised with Docker and deployed on Google Cloud Platform. PostgreSQL handles operational metadata — pipeline run history, audit logs, and station configuration — keeping these concerns separate from the analytical layer in Snowflake.
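Recording pipeline run history in the operational store might look like the following sketch, assuming a hypothetical pipeline_runs table.

```python
# Sketch of writing pipeline run metadata to PostgreSQL. The pipeline_runs
# table and its columns are illustrative assumptions.
import psycopg2

def record_run(dsn: str, pipeline: str, status: str, rows_processed: int) -> None:
    """Append one run record to the operational metadata store."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO pipeline_runs (pipeline, status, rows_processed, finished_at)
                VALUES (%s, %s, %s, NOW())
                """,
                (pipeline, status, rows_processed),
            )
```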
Challenges
Integrating heterogeneous data sources was the primary engineering challenge. Monitoring stations report at different intervals, weather APIs return varying schema versions, and satellite imagery arrives asynchronously. The REST API abstraction layer and schema-on-write approach in Snowflake gave the team flexibility to onboard new sources without pipeline downtime.
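The abstraction layer could plausibly be structured as below: each source implements a common interface, so onboarding a new source means adding a class rather than modifying the pipeline core. The class and method names are assumptions.

```python
# Sketch of a source-adapter abstraction. Names are assumptions; the point
# is that the pipeline core never changes when a new source is added.
from abc import ABC, abstractmethod

class SourceAdapter(ABC):
    """Every source implements one method returning rows in a common shape."""

    @abstractmethod
    def fetch(self) -> list[dict]:
        ...

class OpenMeteoAdapter(SourceAdapter):
    def fetch(self) -> list[dict]:
        # call the weather API and reshape its payload (omitted here)
        return []

def run_ingest(adapters: list[SourceAdapter]) -> list[dict]:
    # The pipeline iterates over whatever adapters are registered, so a new
    # source is a new class, not a pipeline change.
    rows: list[dict] = []
    for adapter in adapters:
        rows.extend(adapter.fetch())
    return rows
```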
Handling missing values in environmental datasets required building imputation logic directly into the transform stage, using neighbouring station readings and historical averages to fill gaps rather than dropping incomplete records.
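A sketch of that imputation logic, assuming hourly PM2.5 readings keyed by station and a hypothetical neighbour map:

```python
# Sketch of the gap-filling logic: prefer neighbouring stations' readings,
# fall back to the station's historical hourly average. Column names and
# the neighbour map are illustrative assumptions.
import pandas as pd

def impute_pm25(df: pd.DataFrame, neighbours: dict[str, list[str]]) -> pd.DataFrame:
    """df has columns: reading_ts (datetime), station, pm2_5 (gaps as NaN)."""
    wide = df.pivot(index="reading_ts", columns="station", values="pm2_5")

    # First choice: mean of neighbouring stations at the same timestamp.
    for station, near in neighbours.items():
        neighbour_mean = wide[near].mean(axis=1)
        wide[station] = wide[station].fillna(neighbour_mean)

    # Fallback: that station's historical average for the same hour of day.
    hour = wide.index.hour
    for station in wide.columns:
        hourly_avg = wide[station].groupby(hour).transform("mean")
        wide[station] = wide[station].fillna(hourly_avg)

    return wide.reset_index().melt(id_vars="reading_ts", value_name="pm2_5")
```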