Ahmadi, Mahdia; Ibrahim, Ahmad Gamal; Jvarsheishvili, Mariam; Igrejas, Getúlio; Izidorio, Felipe; Lopes, Rui Pedro; Soares, Caio; Rodrigues, Pedro João
2026-03-25
2026-03-25
2025
Ahmadi, Mahdia; Ibrahim, Ahmad Gamal; Jvarsheishvili, Mariam; Igrejas, Getúlio; Izidorio, Felipe; Lopes, Rui Pedro; Soares, Caio; Rodrigues, João Pedro (2025). Synthetic data generation for volatile organic compounds recognition. In RECPAD 2025 - 31st Portuguese Conference on Pattern Recognition. Aveiro, Portugal.
http://hdl.handle.net/10198/36269
Machine learning (ML) models for recognizing volatile organic compounds (VOCs) are typically developed with limited datasets, and gathering sensor data at scale is expensive, which hinders their development. The Bosch BME688 is a multi-gas sensor that provides detailed environmental data, but building representative datasets with it requires large experimental campaigns. To overcome this issue, we introduce a Python library for synthetic data generation for the BME688. The tool uses Kernel Density Estimation (KDE) to model the empirical gas resistance distribution under various heater profiles and uses mathematical gas mixing to generate configurable multi-gas simulations. Validation experiments on coffee and oil gases show that the resulting datasets retain the statistical characteristics of actual measurements, both at the stepwise level of gas resistance distributions and at the multivariate level with Principal Component Analysis (PCA). The library supports reproducible ML experimentation, prototyping of ML algorithms on gas mixtures at configurable percentages, and systematic evaluation of VOC recognition systems.
The contribution of this work is a modular and lightweight framework that addresses data scarcity, facilitates reproducible research, and speeds up the development of ML-based air quality monitoring solutions.
eng
Synthetic data generation for volatile organic compounds recognition
conference paper
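
The two ideas named in the abstract, KDE-based sampling of gas resistance per heater step and percentage-based gas mixing, can be illustrated with a minimal standard-library sketch. This is not the library's API. The function names (`kde_sample`, `synth_cycles`, `mix`), the rule-of-thumb bandwidth, the use of log10 resistance, the linear mixing rule, and the toy numbers are all assumptions made for illustration, and the abstract does not specify the actual mixing formula.

```python
import random
import statistics


def kde_sample(values, n, rng):
    """Draw n samples from a Gaussian KDE fitted to `values`.

    Sampling from a Gaussian KDE amounts to picking an observed point at
    random and adding Gaussian noise whose standard deviation is the bandwidth.
    """
    sigma = statistics.stdev(values)
    # Assumption: normal-reference rule-of-thumb bandwidth.
    bandwidth = 1.06 * sigma * len(values) ** (-1 / 5)
    return [rng.choice(values) + rng.gauss(0.0, bandwidth) for _ in range(n)]


def synth_cycles(measured, n, rng):
    """Generate n synthetic heater cycles.

    `measured` holds one list of log10 gas resistance readings per heater
    step. Each step is sampled independently, which is a simplification:
    correlations between steps are ignored.
    """
    per_step = [kde_sample(step_values, n, rng) for step_values in measured]
    return [list(cycle) for cycle in zip(*per_step)]


def mix(cycle_a, cycle_b, fraction_a):
    """Hypothetical mixing rule: linear blend in log-resistance space."""
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must lie in [0, 1]")
    return [fraction_a * a + (1.0 - fraction_a) * b
            for a, b in zip(cycle_a, cycle_b)]


# Toy placeholder readings (log10 ohms), NOT real BME688 measurements:
# three heater steps, five readings each.
coffee_log_r = [[5.10, 5.15, 5.08, 5.12, 5.11],
                [4.60, 4.66, 4.58, 4.63, 4.61],
                [4.20, 4.25, 4.18, 4.22, 4.21]]
oil_log_r = [[5.90, 5.95, 5.88, 5.92, 5.91],
             [5.40, 5.46, 5.38, 5.43, 5.41],
             [5.00, 5.05, 4.98, 5.02, 5.01]]

rng = random.Random(0)  # fixed seed for reproducibility
coffee = synth_cycles(coffee_log_r, 100, rng)
oil = synth_cycles(oil_log_r, 100, rng)
# A 70 % coffee / 30 % oil mixture under the assumed linear rule.
mixture = [mix(c, o, 0.7) for c, o in zip(coffee, oil)]
```

Working in log10 resistance keeps every synthetic value positive once it is converted back with `10 ** x`. A real validation would compare per-step distributions and PCA projections of synthetic and measured cycles, as the abstract describes.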