Key Capabilities
- Real-time air quality monitoring across Sri Lanka
- Multi-source environmental data integration
- ML-powered air quality forecasting
- Automated ETL pipeline with data quality checks
- Interactive public-facing dashboards
- Optimized time-series query performance
What is Pneumetra?
Air quality monitoring in Sri Lanka has historically been fragmented — data scattered across government agencies, weather services, and satellite feeds with no unified view. Pneumetra addresses this by centralising environmental data into a single cloud analytics platform built on Snowflake, enabling continuous monitoring, historical analysis, and predictive forecasting of air quality across the country.
The platform serves a broad audience: policymakers who need reliable data to inform environmental regulations, researchers studying pollution trends over time, and the general public seeking accessible air quality information for their region.
Key Features
Data Ingestion
Pneumetra pulls weather data from the OpenMeteo API. A scheduled ingestion script polls the API daily for new readings and keeps the database continuously up to date.
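As a rough illustration, a minimal version of that daily job might look like the sketch below. The Open-Meteo endpoint and hourly variable names are real API parameters, but the coordinates and surrounding structure are assumptions, not the production script.

```python
# Minimal sketch of the daily ingestion job. The coordinates and overall
# structure are illustrative assumptions, not the production values.
import requests

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"  # public Open-Meteo endpoint

def fetch_daily_weather(latitude: float, longitude: float) -> dict:
    """Pull the latest hourly weather readings for one location."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m,relative_humidity_2m,wind_speed_10m",
        "past_days": 1,  # only fetch the last day's readings on each run
    }
    response = requests.get(OPEN_METEO_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    payload = fetch_daily_weather(6.9271, 79.8612)  # Colombo
    print(f"Fetched {len(payload['hourly']['time'])} hourly readings")
```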
Scalable ETL Pipelines
Real-time ETL pipelines handle over 6K data points daily. Each stage — ingest, validate, transform, serve — runs in isolated Docker containers, enabling fault recovery. Data quality checks at the validation layer handle missing values and anomalous readings common in environmental sensor data.
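The validation layer's checks could be sketched as follows; the column names, plausibility thresholds, and staleness window here are illustrative assumptions rather than the actual rules.

```python
# Sketch of the validation stage's data quality checks. Column names and
# thresholds are assumptions for illustration only.
import pandas as pd

def validate_readings(df: pd.DataFrame) -> pd.DataFrame:
    """Flag missing values and anomalous sensor readings before transform."""
    df = df.copy()
    # Flag rows with missing core measurements rather than dropping them,
    # so the transform stage can decide how to impute.
    df["missing_pm25"] = df["pm2_5"].isna()

    # Flag physically implausible readings (negative or extreme PM2.5).
    df["anomalous"] = (df["pm2_5"] < 0) | (df["pm2_5"] > 1000)

    # Flag stale readings: an unchanging value over several consecutive
    # hours often indicates a stuck sensor.
    df["stale"] = df["pm2_5"].rolling(window=6).std().eq(0)
    return df
```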
Machine Learning Forecasting & Alerts
The analytics engine applies ML models to historical AQI data to forecast pollution trends at a regional level. When predicted values exceed configurable thresholds, the system triggers early warning alerts — a capability actively used by over 50,000 users to plan around pollution events.
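A stripped-down version of the forecast-and-alert loop might look like the sketch below, assuming a simple lag-based linear model and a fixed threshold of 150; the production models and alert delivery channels are not documented here.

```python
# Illustrative forecast-and-alert sketch. The autoregressive linear model,
# lag count, and threshold are assumptions, not the deployed configuration.
import numpy as np
from sklearn.linear_model import LinearRegression

def forecast_next_aqi(history: np.ndarray, n_lags: int = 24) -> float:
    """Fit a simple autoregressive model on hourly AQI and predict one step ahead."""
    X = np.array([history[i : i + n_lags] for i in range(len(history) - n_lags)])
    y = history[n_lags:]
    model = LinearRegression().fit(X, y)
    return float(model.predict(history[-n_lags:].reshape(1, -1))[0])

def maybe_alert(history: np.ndarray, threshold: float = 150.0) -> None:
    """Trigger an early warning when the forecast crosses the threshold."""
    predicted = forecast_next_aqi(history)
    if predicted > threshold:
        print(f"ALERT: forecast AQI {predicted:.0f} exceeds threshold {threshold:.0f}")
```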
Interactive Dashboards
Streamlit dashboards connect directly to Snowflake to render live visualisations of AQI indices, regional comparisons, and trend forecasts. The interface is designed to be usable by non-technical audiences, translating raw sensor data into clear, actionable environmental insights.
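A minimal sketch of such a page, assuming a hypothetical AQI_READINGS table and placeholder credentials (in practice these would come from Streamlit secrets or environment variables):

```python
# Sketch of a dashboard page wired to Snowflake. Connection parameters,
# table, and region list are placeholders, not the production setup.
import snowflake.connector
import streamlit as st

@st.cache_resource
def get_connection():
    return snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="DASHBOARD_WH", database="PNEUMETRA", schema="PUBLIC",
    )

st.title("Sri Lanka Air Quality")
region = st.selectbox("Region", ["Colombo", "Kandy", "Galle"])

cur = get_connection().cursor()
cur.execute(
    "SELECT reading_ts, aqi FROM AQI_READINGS WHERE region = %s ORDER BY reading_ts",
    (region,),
)
df = cur.fetch_pandas_all()  # Snowflake returns uppercase column names
st.line_chart(df, x="READING_TS", y="AQI")
```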
Technical Approach
Data moves through a four-stage pipeline:
- Ingest — REST API adapters pull from the API on configurable schedules
- Validate — Quality checks flag missing values, outliers, and stale readings before they propagate downstream
- Transform — Cleansed data is standardised into a unified AQI schema and loaded into Snowflake (a minimal sketch of this step follows the list)
- Serve — Snowflake powers both the Streamlit dashboards and the ML feature store used for forecasting
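To make the transform step concrete, here is one way the standardisation could look, assuming a hypothetical AqiRecord schema and an Open-Meteo-style payload; the real unified schema is not documented here.

```python
# Sketch of the transform stage: normalising one source's raw payload into
# a unified AQI schema. Field names on both sides are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AqiRecord:
    """One row in the (hypothetical) unified schema loaded into Snowflake."""
    reading_ts: datetime
    region: str
    pollutant: str
    value: float
    source: str

def transform_open_meteo(raw: dict, region: str) -> list[AqiRecord]:
    """Map an Open-Meteo-style hourly payload onto the unified schema."""
    hourly = raw["hourly"]
    return [
        AqiRecord(
            reading_ts=datetime.fromisoformat(ts).replace(tzinfo=timezone.utc),
            region=region,
            pollutant="pm2_5",
            value=v,
            source="open-meteo",
        )
        for ts, v in zip(hourly["time"], hourly["pm2_5"])
    ]
```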
Snowflake as the Analytical Core
Snowflake's separation of storage and compute proved critical for this workload. Real-time ingestion runs on a dedicated warehouse while analytical queries and ML feature extraction run independently, avoiding resource contention. Time-series query performance was further optimised through clustering keys on timestamp and region columns.
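The clustering change itself is a single piece of Snowflake DDL; the table and column names below are assumptions, issued through the Python connector for illustration.

```python
# Sketch of the clustering-key optimisation. CLUSTER BY is standard
# Snowflake DDL; the table and column names are assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>"
)
conn.cursor().execute(
    "ALTER TABLE PNEUMETRA.PUBLIC.AQI_READINGS CLUSTER BY (READING_TS, REGION)"
)
```

Clustering on the timestamp and region columns lets Snowflake prune micro-partitions for the time-range-plus-region filters that dominate dashboard queries.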
Infrastructure on GCP
All services are containerised with Docker and deployed on Google Cloud Platform. PostgreSQL handles operational metadata — pipeline run history, audit logs, and station configuration — keeping these concerns separate from the analytical layer in Snowflake.
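Recording pipeline run history in the operational store might look like the following sketch, assuming a hypothetical pipeline_runs table.

```python
# Sketch of writing pipeline run metadata to PostgreSQL. The pipeline_runs
# table and its columns are illustrative assumptions.
import psycopg2

def record_run(dsn: str, pipeline: str, status: str, rows_processed: int) -> None:
    """Append one run record to the operational metadata store."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO pipeline_runs (pipeline, status, rows_processed, finished_at)
                VALUES (%s, %s, %s, NOW())
                """,
                (pipeline, status, rows_processed),
            )
```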
Challenges
Integrating heterogeneous data sources was the primary engineering challenge. Monitoring stations report at different intervals, weather APIs return varying schema versions, and satellite imagery arrives asynchronously. The REST API abstraction layer and schema-on-write approach in Snowflake gave the team flexibility to onboard new sources without pipeline downtime.
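The abstraction layer could plausibly be structured as below: each source implements a common interface, so onboarding a new source means adding a class rather than modifying the pipeline core. The class and method names are assumptions.

```python
# Sketch of a source-adapter abstraction. Names are assumptions; the point
# is that the pipeline core never changes when a new source is added.
from abc import ABC, abstractmethod

class SourceAdapter(ABC):
    """Every source implements one method returning rows in a common shape."""

    @abstractmethod
    def fetch(self) -> list[dict]:
        ...

class OpenMeteoAdapter(SourceAdapter):
    def fetch(self) -> list[dict]:
        # call the weather API and reshape its payload (omitted here)
        return []

def run_ingest(adapters: list[SourceAdapter]) -> list[dict]:
    # The pipeline iterates over whatever adapters are registered, so a new
    # source is a new class, not a pipeline change.
    rows: list[dict] = []
    for adapter in adapters:
        rows.extend(adapter.fetch())
    return rows
```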
Handling missing values in environmental datasets required building imputation logic directly into the transform stage, using neighbouring station readings and historical averages to fill gaps rather than dropping incomplete records.
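A sketch of that imputation logic, assuming hourly PM2.5 readings keyed by station and a hypothetical neighbour map:

```python
# Sketch of the gap-filling logic: prefer neighbouring stations' readings,
# fall back to the station's historical hourly average. Column names and
# the neighbour map are illustrative assumptions.
import pandas as pd

def impute_pm25(df: pd.DataFrame, neighbours: dict[str, list[str]]) -> pd.DataFrame:
    """df has columns: reading_ts (datetime), station, pm2_5 (gaps as NaN)."""
    wide = df.pivot(index="reading_ts", columns="station", values="pm2_5")

    # First choice: mean of neighbouring stations at the same timestamp.
    for station, near in neighbours.items():
        neighbour_mean = wide[near].mean(axis=1)
        wide[station] = wide[station].fillna(neighbour_mean)

    # Fallback: that station's historical average for the same hour of day.
    hour = wide.index.hour
    for station in wide.columns:
        hourly_avg = wide[station].groupby(hour).transform("mean")
        wide[station] = wide[station].fillna(hourly_avg)

    return wide.reset_index().melt(id_vars="reading_ts", value_name="pm2_5")
```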