Finnish Physical Society – Suomen Fyysikkoseura – Finlands Fysikerförening

Linus Lind: Atmospheric compound identification from electron ionization mass spectrometry data

Linus Lind, Aalto University

Organic aerosols account for up to 90 % of particulate matter present in the lower atmosphere. These aerosol particles affect air quality and key atmospheric processes, yet the extent of their contribution remains uncertain without molecular-level understanding. Moreover, the chemical space of these aerosols is vast: the number of plausible distinct compounds is currently estimated to be on the order of millions. Detecting these compounds is challenging, and in field measurement campaigns, the majority of detected signals remain chemically unassigned. This poses a major bottleneck in determining the environmental impact and fate of atmospheric organic compounds.

Mass spectrometry (MS) is currently the primary analytical method for detecting the molecular composition of atmospheric organic compounds. In MS, molecules are ionized and fragmented, and the mass-to-charge ratios of the resulting ions are recorded as a spectrum that reflects the structure of the original compound (Figure 1). Interpreting mass spectra to elucidate the chemical structures of molecules remains challenging, particularly for electron ionization mass spectrometry (EI-MS), which produces extensively fragmented ions, leading to dense and information-rich spectra. Manual analysis of such data is time-consuming, even for an expert, and does not scale to the data volumes produced by modern instrument setups. Moreover, structure identification from EI-MS is inherently ill-posed: different structural isomers can produce identical or near-identical fragmentation spectra, so perfect identification from EI-MS alone is not achievable, even in principle.

Figure 1: A traditional electron ionization mass spectrometry measurement and analysis setup. The sampled molecules are first separated using gas chromatography and subsequently passed into the inlet of the mass spectrometer. The molecules are then fragmented with high-energy electrons and detected using a time-of-flight mass analyzer. The detected ions form a mass spectrum, which can be analyzed downstream with various software tools.

Many existing spectrum-to-structure prediction methods are retrieval-based: a measured spectrum is either matched directly against a library of reference spectra, or first encoded into a molecular fingerprint that is then used to retrieve the closest match from a molecular database. However, these methods are ineffective for novel compounds, as the relevant spectral libraries and compound databases simply do not exist. This is particularly true in atmospheric measurement campaigns, where signals remain unassigned precisely because no existing database covers them.
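The first retrieval strategy, direct library matching, can be illustrated with a minimal sketch. It assumes spectra are stored as dictionaries from integer m/z to intensity and scores them by cosine similarity, a common choice for spectral matching; the thesis summary does not say which scoring function any particular tool uses. The compound names and peak values are made up for illustration.

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_library_match(query, library):
    """Return (name, score) of the reference spectrum most similar to the query."""
    scored = ((name, cosine_similarity(query, ref)) for name, ref in library.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical toy library: names and peaks are invented, not real reference data.
library = {
    "compound_A": {43: 100.0, 58: 30.0, 71: 10.0},
    "compound_B": {29: 40.0, 45: 100.0, 73: 20.0},
}
query = {43: 90.0, 58: 35.0, 71: 5.0}

name, score = best_library_match(query, library)
print(name, round(score, 3))
```

The sketch also makes the limitation concrete: `best_library_match` can only ever return a name that is already in `library`, so a compound with no reference spectrum is at best mapped to its nearest known neighbor.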

To move beyond traditional structure retrieval, the thesis adapts DiffMS, a diffusion-based spectrum-to-structure model from the metabolomics domain, to EI-MS data of atmospheric organic compounds. The adapted framework, DiffEIMS, employs a decoupled architecture in which an encoder predicts molecular fingerprints from EI-MS spectra and a graph denoising diffusion decoder generates candidate structures from these fingerprints (Figure 2). Because the candidates are generated de novo, DiffEIMS can propose structures beyond existing databases. Adapting the model to the atmospheric setting required retraining on an in silico dataset of 166 434 atmospheric molecules paired with simulated EI-MS spectra, adding support for formal charges in nitrogen-containing functional groups, and introducing peak data augmentation strategies during training. A further contribution of this work was a systematic study to find effective hyperparameter configurations for DiffEIMS.
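As a rough illustration of what peak data augmentation might look like, the sketch below log-scales peak intensities and then applies a small random multiplicative perturbation to each peak. The function names, the `log1p` transform, the normalization to the largest peak, and the 10 % noise level are all assumptions made for this example; they are not taken from the thesis and may differ from the strategies actually used in DiffEIMS.

```python
import math
import random

def log_scale(intensities):
    """Apply log1p to raw intensities and normalise so the largest peak is 1."""
    scaled = [math.log1p(max(i, 0.0)) for i in intensities]
    top = max(scaled, default=0.0)
    return [s / top for s in scaled] if top > 0 else scaled

def perturb(intensities, rel_noise=0.1, rng=random):
    """Multiply each intensity by a random factor in [1 - rel_noise, 1 + rel_noise]."""
    return [i * (1.0 + rng.uniform(-rel_noise, rel_noise)) for i in intensities]

# Hypothetical raw peak intensities for one spectrum (illustrative values only).
raw = [999.0, 120.0, 15.0, 0.0]

rng = random.Random(0)          # fixed seed so the example is reproducible
scaled = log_scale(raw)         # compresses the dynamic range of the peaks
augmented = perturb(scaled, rel_noise=0.1, rng=rng)
print(scaled)
print(augmented)
```

In a training loop, a fresh perturbation would typically be drawn each time a spectrum is seen, so that the model does not come to rely on exact intensity values.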

Figure 2: Schematic of the DiffEIMS model. The molecular-formula-constrained encoder maps the EI-MS spectrum into a molecular fingerprint, and the graph-diffusion decoder generates candidate molecular structures conditioned on that fingerprint and the known molecular formula. The two parts are pretrained separately and then fine-tuned end-to-end.

On a held-out test set of 16 384 atmospheric molecules, DiffEIMS achieves a top-1 accuracy of 6 %, a top-10 accuracy of 20 %, a top-10 Tanimoto similarity of 0.59, and a chemical validity rate of 95 %. Given the ill-posedness of the problem, the top-10 accuracy and Tanimoto similarity metrics are more informative than top-1 accuracy alone: even when the exact structure is not recovered on the first attempt, the candidates tend to remain structurally similar to the true molecule (Figure 3). Two findings from the hyperparameter optimization stand out. First, peak-intensity information matters: reducing spectra to the mere presence or absence of peaks substantially degrades performance, whereas log-scaling intensities and perturbing them during training improve generalization. Second, when the decoder is conditioned on the true fingerprints rather than on the predicted ones, top-1 and top-10 accuracies reach approximately 80 % and 97 %, respectively. The spectrum-to-fingerprint encoder is therefore the primary bottleneck and the natural target for further improvement.
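For readers unfamiliar with these metrics, the sketch below shows how top-k accuracy and a top-k Tanimoto similarity can be computed. It assumes that fingerprints are represented as sets of "on" bit indices, that structures are compared through some canonical string identifier, and that the top-k Tanimoto similarity is the best similarity among the first k candidates; the thesis may define or implement these details differently. All data in the example is invented.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def top_k_tanimoto(candidate_fps, true_fp, k):
    """Best Tanimoto similarity to the true fingerprint among the first k candidates."""
    return max((tanimoto(fp, true_fp) for fp in candidate_fps[:k]), default=0.0)

def top_k_accuracy(ranked_candidates, truths, k):
    """Fraction of test molecules whose true structure is among the first k candidates."""
    hits = sum(truth in cands[:k] for cands, truth in zip(ranked_candidates, truths))
    return hits / len(truths)

# Hypothetical fingerprints (sets of on-bit indices) for one test molecule.
true_fp = {1, 2, 3, 4}
candidate_fps = [{1, 2, 5}, {1, 2, 3, 4, 6}, {7}]
best_sim = top_k_tanimoto(candidate_fps, true_fp, k=10)

# Hypothetical ranked candidate identifiers for two test molecules.
ranked = [["B", "A", "C"], ["X", "Y", "Z"]]
truths = ["A", "Q"]
acc_top1 = top_k_accuracy(ranked, truths, k=1)
acc_top2 = top_k_accuracy(ranked, truths, k=2)
print(best_sim, acc_top1, acc_top2)
```

In the toy data, the first molecule is missed at rank 1 but recovered at rank 2, and its best candidate still shares most fingerprint bits with the truth. This is the situation the similarity metric is meant to capture.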

Figure 3: Example output of DiffEIMS for three test molecules. The ground-truth atmospheric compound (left) is shown next to several candidate structures generated by the model from its EI-MS spectrum. Even when the exact structure is not recovered, the generated candidates tend to capture the overall connectivity and functional groups of the target molecule.

Without an exact match, the structurally similar candidates produced by DiffEIMS can still guide experimental analysis in atmospheric chemistry. The natural next steps are to apply DiffEIMS to experimental EI-MS data from laboratory and field campaigns and to improve the spectrum-to-fingerprint encoder. More broadly, this work opens a new direction for combining deep generative modeling with EI-MS in molecular-level atmospheric science.

The full thesis can be found here: Atmospheric compound identification from electron ionization mass spectrometry data.