Hybrid Approach for Refining and Reconstructing Air Pollution Data in Tehran Megacity Using Machine Learning

Document Type: Original Research

Authors
1 PhD Candidate, Department of Environmental Engineering, Faculty of Civil and Environmental Engineering, Tarbiat Modares University, Tehran, Iran
2 Assistant Professor, Department of Environmental Engineering, Faculty of Civil and Environmental Engineering, Tarbiat Modares University, Tehran, Iran
Abstract
Air pollution is a major challenge in megacities, and managing it depends on high-quality data. In developing countries such as Iran, reliable ground-based data are difficult to obtain. Satellite data offer a promising alternative, but missing values and outliers remain significant obstacles. This study addresses incomplete air pollution data in Tehran with a hybrid approach to data refinement and reconstruction. The dataset comprises NO₂, CO, and O₃ observations from Sentinel-5P and meteorological variables from ERA5-Land, covering December 2018 to March 2025. For all three pollutants, the share of missing data was high in December because of weather conditions, and CO had the highest proportion of missing data overall. Outliers were identified in two stages: a univariate Robust Z-score followed by a multidimensional Isolation Forest (IF). The cold months contained the most outliers, and NO₂ had more outliers than the other pollutants. Missing values were reconstructed with the LightGBM algorithm, yielding R² values of 0.61, 0.50, and 0.38 for NO₂, O₃, and CO, respectively. Despite data limitations and the use of a simpler model than the complex spatio-temporal algorithms of previous studies, the results are considered satisfactory, particularly for NO₂ and O₃. This research demonstrates the potential of integrating satellite and meteorological data with machine learning to enhance air quality monitoring in data-scarce urban environments.
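
The first, univariate stage of the outlier screening named in the abstract can be illustrated with a minimal sketch. It assumes the common median/MAD form of the robust Z-score with the 0.6745 scaling constant and a cutoff of 3.5; the abstract states neither the exact formula nor the threshold, so both are assumptions, and the function names and sample values are hypothetical. Missing observations are represented as `None` and are never flagged.

```python
from statistics import median


def robust_z_scores(values):
    """Return median/MAD-based z-scores; missing entries (None) stay None."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    # Median absolute deviation: a spread estimate that resists extreme values.
    mad = median(abs(v - med) for v in observed)
    if mad == 0:
        # No spread among the observed values, so nothing can be scored as extreme.
        return [None if v is None else 0.0 for v in values]
    # 0.6745 rescales the MAD so the scores are comparable to ordinary
    # z-scores when the data are roughly normal (assumed convention).
    return [None if v is None else 0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Mark observations whose absolute robust z-score exceeds the cutoff.

    The 3.5 cutoff is an assumed, commonly used default, not a value
    taken from the study.
    """
    scores = robust_z_scores(values)
    return [s is not None and abs(s) > threshold for s in scores]


# Hypothetical daily NO2 values (arbitrary units) with one gap and one spike.
no2 = [1.1, 1.3, 1.2, None, 1.4, 9.8, 1.0, 1.2]
flags = flag_outliers(no2)
```

In this sketch only the spike at 9.8 is flagged. The multidimensional Isolation Forest stage and the LightGBM reconstruction are not shown here.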
