GitHub - odea-project/qAlgorithms

Usage Notice

At this moment, qAlgorithms is provided purely for reasons of scientific transparency. There are still known flaws in the implementation, and we are steadily working on a sensible way to validate the rest. For this reason, you should not use qAlgorithms to answer your research questions (yet).

Introduction

qAlgorithms is a standalone non-target screening analysis workflow for processing Liquid Chromatography High-Resolution Mass Spectrometry (LC-HRMS) data. Our goal is to provide a single-file processing pipeline that requires no additional user input, delivers error estimates for its results and is very fast and computationally efficient (see Design Philosophy for more).

The individual algorithms within qAlgorithms are not restricted to the domain of HPLC-HRMS, although we do not actively pursue other applications.

We are always open to suggestions and feedback regarding your usage of our algorithms, so do not hesitate to open an issue on our GitHub page.

Additionally, we are interested in measurement data that allows us to test our quality parameters with some degree of confidence. If you have performed an experiment which resulted in data that allows you to infer its general quality (on the basis of the experimental conditions) or measured a large number of replicates of one sample, we would appreciate your support. You can contact us by opening an issue or sending a request to our Zenodo community. There, we also provide example files for trying out qAlgorithms as well as NTS-related posters.

Installation and Usage

The following only applies to the standalone project. For users interested in using qAlgorithms from R, we are collaborating with the patRoon project by Helmus et al. patRoon will include bindings for "ready to use" parts of qAlgorithms relatively soon. Bindings to other languages and data analysis frameworks are planned, but not currently in development.

Windows

The entire qAlgorithms workflow is provided as an executable under "Releases" on our GitHub repository. There is no need to download or compile the source code.

On Windows, double-clicking the .exe will open qAlgorithms in interactive mode. This mode is extremely primitive and not officially supported. We recommend that all users start qAlgorithms.exe from PowerShell, which is pre-installed on all Windows PCs. If you are unfamiliar with using the shell, refer to the basic PowerShell demonstration for qAlgorithms.

To build from source, follow the instructions for Linux after installing MinGW. After installation, run the command

pacman -Syu mingw-w64-x86_64-zlib

from the installed UCRT environment.

Linux

Currently, no compiled Linux releases are provided. We recommend that you clone the repository and compile from source. The only dependencies apart from a modern kernel are a compiler (we only test with GCC, but Clang should also work), GNU Make, CMake and the zlib headers. The script install.sh provided in the main directory checks for missing dependencies and installs them through your package manager, although they should already be present on almost all Linux installations. The script also runs the CMake setup described below and compiles the program.

We require CMake 3.25 or later and only test for GCC 15 or later. The program requires features from the C++17 standard. All should already be present or readily available on any modern system.

Build qAlgorithms by executing these commands:

git clone https://github.com/odea-project/qAlgorithms.git
cd ./qAlgorithms
mkdir build
cmake -S . -B ./build
cd ./build
cmake --build . -j

Usage

To use qAlgorithms for processing mass spectrometry data, you need to convert your measurements into .mzML files, for example with msconvert or ThermoRawFileParser. Currently, only MS1 data can be used, so you can save some disk space by filtering out all other MS levels at this stage.

You can find pre-converted files which are confirmed to be processed correctly here.

qAlgorithms is a command line utility which reads mzML files and writes its results as CSV. You can pass individual files or an entire directory, which is searched recursively for .mzML files. All output is written into one directory, which you must also specify. Below are some commands you will likely use (replace example paths with your system paths):

Display the help menu, listing all available options:

  ./qAlgorithms.exe -h

Process the file measurement.mzML and write a complete feature list into the directory "results":

  ./qAlgorithms.exe -i C:/example/path/measurement.mzML -o ../my/results -printfeatures

Search the directory "allMeasurements" and all subdirectories for files ending in .mzML and process them. All intermediate results (centroids, bins and features) are written to .csv files and saved to the "results" directory:

  ./qAlgorithms.exe -i ./allMeasurements -o ./results -printall
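If you prefer to start qAlgorithms from a script instead of typing the commands by hand, a thin wrapper can assemble the arguments for you. The following is a minimal Python sketch and not part of qAlgorithms. The helper names build_command and run_qalgorithms are made up for illustration, and only the options that appear in the examples above (-i, -o, -printfeatures, -printall) are used.

```python
import subprocess

# Output options taken from the examples above. The help menu (-h) lists
# the complete set; this tuple is only an assumption for the sketch.
OUTPUT_FLAGS = ("-printfeatures", "-printall")


def build_command(executable, input_path, output_dir, output_flag):
    """Assemble the argument list for one qAlgorithms call (hypothetical helper)."""
    if output_flag not in OUTPUT_FLAGS:
        # Without an output option, the command line interface writes nothing.
        raise ValueError(f"unknown output option: {output_flag}")
    return [str(executable), "-i", str(input_path), "-o", str(output_dir), output_flag]


def run_qalgorithms(executable, input_path, output_dir, output_flag="-printfeatures"):
    """Run qAlgorithms and raise CalledProcessError if it exits with an error."""
    command = build_command(executable, input_path, output_dir, output_flag)
    return subprocess.run(command, check=True)


# Only builds and prints the command; nothing is executed here.
command = build_command("./qAlgorithms.exe", "./allMeasurements", "./results", "-printall")
print(" ".join(command))
```

Calling run_qalgorithms with the same arguments would execute the program. Passing the arguments as a list avoids quoting problems with paths that contain spaces.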

Some things to keep in mind:

qAlgorithms can only process profile mode data at this point. The later steps rely on uncertainty data generated during centroiding, and we have not yet found a way to estimate it for data that is already centroided. The ability to process centroids will be added once that problem is solved.
Check out the documentation for more details on using qAlgorithms.
If you do not specify which results you want, no output will be written when using the command line interface.
If the program crashes, check if your problem matches one of the known program errors on our issues page. We will not fix known problems before the program is considered to be stable.
If multiple copies of the same file are found during recursive search, only one of them will be processed.
The different quality scores do not serve as a way to remove peaks from your results. They only indicate how well the data at every step fit our model assumptions regarding the mathematical properties of real peaks. All peaks which are provided in the peak table are statistically significant. The best current use for quality scores is the prioritisation of peaks during further analysis.
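Using the scores to prioritise peaks without discarding any can be as simple as sorting the feature list. The sketch below assumes a column called "score" and uses a made-up three-row table; the real qAlgorithms output has its own column names, which are not reproduced here.

```python
import csv
import io

# Made-up stand-in for a feature list. Column names and values are
# illustrative assumptions, not actual qAlgorithms output.
EXAMPLE_CSV = """mz,rt,intensity,score
301.1410,412.5,15200,0.93
152.0706,98.2,830,0.41
445.1200,633.0,97000,0.78
"""


def prioritise(csv_text, score_column="score"):
    """Return all rows sorted by score, best first. No row is dropped."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: float(row[score_column]), reverse=True)


ranked = prioritise(EXAMPLE_CSV)
for row in ranked:
    print(row["mz"], row["rt"], row["score"])
```

Every feature stays in the list; only the order in which you look at them changes.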

Design Philosophy

qAlgorithms is free (as in freedom) software licensed under GPL-3 (see LICENSE).

The algorithms within qAlgorithms are rooted in well-established statistical tests and employ standard linear regression as the main problem solving strategy. This allows us to be fully deterministic without requiring the user to supply and optimise algorithm parameters, which is a time-consuming and error-prone process even for domain experts.

While this design allows users to treat processing as a "black box", we always provide additional layers of data. The user can thus either treat the output features as the commonly found mz -- RT -- intensity triplet or consider more precise descriptors. These are summarised in the data quality score, a number between zero and one, that is roughly equivalent to the commonly used R². For in-depth details, the outcomes of all processing steps are preserved within the output and every feature can be traced back to the profile points that produced it.
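As a reminder of what R² expresses, the sketch below fits a straight line by ordinary least squares and computes the coefficient of determination for it. It only illustrates the familiar quantity the data quality score is compared to; it is not the score's definition, and the function name and the input values are made up.

```python
def linear_fit_r2(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: variation the fitted line does not explain.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation around the mean of y.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Illustrative values only: a slightly noisy straight line.
slope, intercept, r2 = linear_fit_r2([0, 1, 2, 3], [1.2, 2.9, 5.1, 6.8])
print(slope, intercept, r2)
```

A value close to one means the model explains almost all of the variation in the data, while a value near zero means it explains almost none.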

qAlgorithms is developed with computational performance as a core concern. With qAlgorithms, we aim for the fastest time-to-insight possible. If a long time is spent waiting for results, the entire process is more prone to insufficient validation and more sensitive to potential hardware failures.

We use C++, but heavily limit the language features we use to stay closer to pure C. For the time being, the C++ features we rely on are mainly std::vector, the pugixml library for reading mzML data and some compile-time execution. This choice is meant to make the source code easier to read and understand for chemists, who are generally inexperienced programmers. By limiting the use of heavily templated data structures we also improve runtime performance of the code.

We hope that by demonstrating the effectiveness of our approach, more software written by researchers for non-target questions will adopt or improve on these ideals.

Documentation

All currently existing documentation can be found in the ./docs directory of this repository. It is incomplete as of now and focuses on the theoretical considerations behind implemented methods rather than the implementation. For those details, refer to the commented code.

Current Offerings

qPeaks Algorithm

qPeaks uses a comprehensive peak model developed by Reuschenbach, Renner et al. [https://doi.org/10.1021/acs.analchem.4c00494] to identify peaks within the bins generated by qBinning. Every produced peak is statistically significant, sidestepping the need for further filtering steps like a minimum intensity requirement. The scores generated tell you how well every processing step leading to a peak worked, and allow you to make a statement about the confidence of your results. Like all other parts of the qAlgorithms project, qPeaks requires no user parameters.

The mathematical description of the peak model is available in detail in the documentation.

The algorithm as it is implemented includes a set of decision rules for selecting the best fit and modifications to the design matrix that were not part of the initial publication and are as of now undocumented.

qBinning Algorithm

The qBinning algorithm utilises the centroids generated by qCentroids to produce extracted ion chromatograms. Like qCentroids, it requires no user parameters. Binning allows you to reduce the number of centroids considered in future analysis by roughly 30%. The current qBinning program is based on the algorithm presented by Reuschenbach, Renner et al. [https://doi.org/10.1021/acs.analchem.3c01079], but implements additional steps for finding the largest number of statistically sound bins. Additionally, the prediction interval is used instead of the mass error. The function fitted to the simulated data has also been changed to provide even higher accuracy. For details on the implementation, refer to src/qalgorithms_qbin.cpp. The method used to determine the critical value for the test is described in the documentation.

qPattern Algorithm

The qPattern algorithm (as of now unpublished, name subject to change) is a newly developed componentisation strategy to group features produced with qPeaks. It uses linear regressions to estimate shape similarity of features and group related features into components.

Acknowledgements

While we minimize overall library usage, some functionality is only possible because of libraries written by others under permissive open-source licenses. In alphabetical order:

Cephes, used for calculating the F statistic.
libcerf, used for calculating the error functions.
pugixml, used to read in mzML documents.
simdutf, used for fast, vectorised decoding of base64 in mzML.
zlib, for decompression of mzML binary data.

With the exception of zlib, we directly include the full or partial source code of used libraries (also called "vendoring"). This should reduce the potential for compilation failures to almost zero, because zlib is included on all systems we expect to be used for data analysis of spectra in any context. The choice to use system libraries for zlib was made because compression and decompression are the only highly security-relevant parts of this program and because from a performance perspective, faster libraries on a system may override the ZLIB environment variable.
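To show what the base64 and zlib steps do to the binary data in an mzML file, the sketch below decodes one array with the Python standard library. In mzML, numeric arrays are stored as base64 text, optionally zlib-compressed, holding little-endian 32-bit or 64-bit floats. qAlgorithms itself does this in C++ with simdutf and zlib; the function name and the round-trip values here are made up for illustration.

```python
import base64
import struct
import zlib


def decode_binary_array(text, compressed=True, double_precision=True):
    """Decode the text content of one mzML binary data array into floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    width = 8 if double_precision else 4
    code = "d" if double_precision else "f"
    count = len(raw) // width
    # "<" selects little-endian byte order, as used by mzML.
    return list(struct.unpack(f"<{count}{code}", raw))


# Round trip with illustrative values: encode the way a converter would
# write them, then decode again.
mz_values = [100.0, 200.5, 300.25]
packed = struct.pack("<3d", *mz_values)
encoded = base64.b64encode(zlib.compress(packed)).decode("ascii")
decoded = decode_binary_array(encoded)
print(decoded)
```

Whether an array is compressed and which precision it uses is declared per array in the mzML file, so a real reader has to take both settings from there rather than from defaults.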

Development Roadmap

For concrete goals, check out the todo list. If you are interested in solving one of these items, you are welcome to open a pull request.

Our goal is to provide a specialised, high-precision tool for analytical data processing of HRMS data. All current and future additions to qAlgorithms are developed with the goal of reducing the potential for human error and increasing result reliability.

The current main priority is establishing a comprehensive validation strategy for our peak fitting and overall feature detection workflow.

qAlgorithms - Transforming data into insight.
