At this moment, qAlgorithms is provided purely for reasons of scientific transparency. There are
still known flaws in the concrete implementation, and we are working on a sound strategy for
validating the remaining parts. For this reason, you should not use qAlgorithms
to answer your research questions (yet).
qAlgorithms is a standalone non-target screening analysis workflow for processing Liquid Chromatography
High-Resolution Mass Spectrometry (LC-HRMS) data. Our goal is to provide a single-file
processing pipeline that does not require any additional user input, delivers error estimates for results
and is computationally efficient (see Design Philosophy for more).
The individual algorithms within qAlgorithms are not restricted to the domain of HPLC-HRMS,
although we do not actively pursue other applications.
We are always open to suggestions and feedback regarding your usage of our algorithms, so do not hesitate to open an issue on our GitHub page.
Additionally, we are interested in measurement data that allows us to test our quality
parameters with some degree of confidence. If you have performed an experiment which
resulted in data that allows you to infer its general quality (on basis of the experimental
conditions) or measured a large amount of replicates of one sample, we would appreciate your
support. You can contact us by opening an issue or sending a request to our Zenodo community.
The Zenodo community also hosts example files for trying out qAlgorithms as well as NTS-related posters.
The following only applies to the standalone project. For users interested in using qAlgorithms
from R, we are collaborating with the patRoon project by Helmus et al.
patRoon will include bindings for "ready to use" parts of qAlgorithms relatively soon. Bindings to other
languages and data analysis frameworks are planned, but not currently in development.
The entire qAlgorithms workflow is provided as an executable under "Releases"
on our GitHub repository. There is no need to download or compile the source code.
On Windows, double-clicking the .exe will open qAlgorithms in interactive mode. This mode
is extremely primitive and not officially supported.
We recommend that all users start qAlgorithms.exe from PowerShell, which is pre-installed on all Windows PCs.
If you are unfamiliar with using the shell, refer to the basic PowerShell demonstration for qAlgorithms.
To build from source, follow the instructions for linux after installing mingw. After installation, run the command
pacman -Syu mingw-w64-x86_64-zlib
from the installed UCRT environment.
Currently, no compiled Linux releases are provided. We recommend that you clone the repository
and compile from source. The only dependencies apart from a modern kernel are a compiler (we
only test for GCC, but Clang should also work), GNU Make, CMake and the zlib headers. The
script install.sh provided in the main directory checks for missing dependencies and
installs them through your package manager, although they should already be present on almost
all Linux installations. The script also runs the CMake setup described below and compiles
the program.
We require CMake 3.25 or later and only test for GCC 15 or later. The program requires features from the C++17 standard. All should already be present or readily available on any modern system.
Build qAlgorithms by executing these commands:
git clone https://github.com/odea-project/qAlgorithms.git
cd ./qAlgorithms
mkdir build
cmake -S . -B ./build
cd ./build
cmake --build . -j
To use qAlgorithms for processing mass spectrometry data, you need to convert your
measurements into .mzML files, for example with msconvert or
ThermoRawFileParser.
Currently, only MS1 data can be used, so you can save some disk space by filtering out all other MS levels at this stage.
You can find pre-converted files which are confirmed to be processed correctly here.
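Before processing, it can help to check what a converted file actually contains. The sketch below is not part of qAlgorithms; it counts the spectra in an mzML document by MS level and by representation (profile or centroid), using the PSI-MS accessions MS:1000511 (ms level), MS:1000128 (profile spectrum) and MS:1000127 (centroid spectrum). The function names are made up for this example, and it assumes that these cvParam elements are direct children of each spectrum element, which may not hold for files that use referenceable parameter groups.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO

MS_LEVEL = "MS:1000511"   # PSI-MS term "ms level"
PROFILE = "MS:1000128"    # PSI-MS term "profile spectrum"
CENTROID = "MS:1000127"   # PSI-MS term "centroid spectrum"


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def summarise_mzml(source):
    """Count spectra per (MS level, representation) in an mzML document.

    `source` is a file name or a binary file object.
    """
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "spectrum":
            continue
        level, mode = "unknown", "unknown"
        # Assumption: the relevant cvParams are direct children of <spectrum>
        for child in elem:
            if local_name(child.tag) != "cvParam":
                continue
            accession = child.get("accession")
            if accession == MS_LEVEL:
                level = child.get("value")
            elif accession == PROFILE:
                mode = "profile"
            elif accession == CENTROID:
                mode = "centroid"
        counts[(level, mode)] += 1
        elem.clear()  # release the spectrum content, measurement files are large
    return counts


# Tiny stand-in document; real files carry many more elements per spectrum
demo = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml"><run><spectrumList>
<spectrum index="0">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
  <cvParam cvRef="MS" accession="MS:1000128" name="profile spectrum"/>
</spectrum>
<spectrum index="1">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
  <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum"/>
</spectrum>
</spectrumList></run></mzML>"""

summary = summarise_mzml(BytesIO(demo))
for (level, mode), n in sorted(summary.items()):
    print(f"MS{level} {mode}: {n} spectra")
```

A file that only reports centroid spectra at MS level 1 would need to be converted again without peak picking.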
qAlgorithms is a command line utility which reads mzML files and writes its results
as csv. You can select individual files or an entire directory to search for
.mzML files recursively. All output is written into one directory, which you also
must specify. Below are some commands you will likely use (replace example paths with your system paths):
Display the help menu, listing all available options:
./qAlgorithms.exe -h
Process the file measurement.mzML and write a complete feature list into the directory "results":
./qAlgorithms.exe -i C:/example/path/measurement.mzML -o ../my/results -printfeatures
Search the directory "allMeasurements" and all subdirectories for files ending in .mzML and process them. All intermediate results, those being centroids, bins and features, are written to a .csv file and saved to the "results" directory:
./qAlgorithms.exe -i ./allMeasurements -o ./results -printall
Some things to keep in mind:
- qAlgorithms can only process profile mode data at this point. We rely on uncertainty data generated during centroiding for the following steps and have not found a way to estimate them so far. The ability to process centroids will be added once that problem is solved.
- Check out the documentation for more details on using qAlgorithms.
- If you do not specify which results you want, no output will be written when using the command line interface.
- If the program crashes, check if your problem matches one of the known program errors on our issues page. We will not fix known problems before the program is considered to be stable.
- If multiple copies of the same file are found during recursive search, only one of them will be processed.
- The different quality scores do not serve as a way to remove peaks from your results. They only indicate how well the data at every step fit our model assumptions regarding the mathematical properties of real peaks. All peaks which are provided in the peak table are statistically significant. The best current usage for quality scores is prioritisation of peaks during further analysis.
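Prioritising by quality score without discarding anything could look like the following sketch. The column names (mz, rt, intensity, score) and the values are invented for illustration; check the header of your own feature list for the actual names.

```python
import csv
from io import StringIO


def prioritise(rows, score_column):
    """Return all rows sorted by descending score; nothing is filtered out."""
    return sorted(rows, key=lambda row: float(row[score_column]), reverse=True)


# Hypothetical feature table; the real output columns may be named differently
demo_csv = StringIO(
    "mz,rt,intensity,score\n"
    "301.1410,412.5,15000,0.62\n"
    "195.0877,118.2,98000,0.97\n"
    "455.2903,655.0,4300,0.81\n"
)
features = list(csv.DictReader(demo_csv))
ranked = prioritise(features, "score")
for row in ranked:
    print(row["mz"], row["rt"], row["score"])
```

Sorting instead of thresholding keeps every statistically significant peak in the table while still putting the best-fitting ones first.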
qAlgorithms is free (as in freedom) software licensed under GPL-3 (see LICENSE).
The algorithms within qAlgorithms are rooted in well-established statistical
tests and employ standard linear regression as the main problem solving strategy.
This allows us to be fully deterministic without requiring the user to supply
and optimise algorithm parameters, which is a time-consuming and error-prone
process even for domain experts.
While this design allows users to treat processing as a "black box", we always provide additional layers of data. For the user, it is thus possible to treat the output features as the commonly found mz -- RT -- intensity triplet or to consider more precise descriptors. These are summarised in the data quality score, a number between zero and one, that is roughly equivalent to the commonly used R². For in-depth details, the outcomes of all processing steps are preserved within the output and every feature can be traced back to the profile points that produced it.
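To illustrate why linear regression can be deterministic and parameter-free for peak-shaped data, consider a generic textbook case (this is not the qPeaks model, and all function names are made up here): the logarithm of a Gaussian is a parabola, so ordinary least squares on ln(intensity) recovers the peak position in closed form. The sketch also computes R² for the fit, the quantity the data quality score is compared to above.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_log_parabola(xs, ys):
    """Least-squares fit of ln(y) = b0 + b1*x + b2*x^2; returns (b, R^2)."""
    logs = [math.log(y) for y in ys]
    design = [[1.0, x, x * x] for x in xs]
    # Normal equations: (X^T X) b = X^T ln(y)
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * v for row, v in zip(design, logs)) for i in range(3)]
    b = solve_linear_system(xtx, xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in design]
    mean = sum(logs) / len(logs)
    ss_res = sum((v - f) ** 2 for v, f in zip(logs, fitted))
    ss_tot = sum((v - mean) ** 2 for v in logs)
    return b, 1.0 - ss_res / ss_tot


# Noise-free Gaussian sampled at seven points: height 1000, centre 0.5, sigma 1.2
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1000.0 * math.exp(-((x - 0.5) ** 2) / (2 * 1.2 ** 2)) for x in xs]
coeffs, r_squared = fit_log_parabola(xs, ys)
centre = -coeffs[1] / (2 * coeffs[2])  # vertex of the fitted parabola
print(round(centre, 3), round(r_squared, 3))
```

The same input always yields the same coefficients, and there is no tuning parameter for the user to choose.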
qAlgorithms is developed with computational performance as a core concern. With qAlgorithms,
we aim for the fastest time-to-insight possible. If a long time is spent waiting for results,
the entire process is more prone to insufficient validation and more sensitive to potential
hardware failures.
We use C++, but heavily limit the used language features to be closer to pure C. This mainly results in widespread use of std::vector for the time being, as well as the pugixml library for reading mzML data and some application of compile-time execution. This is a choice made to reduce the difficulty for chemists, who are generally inexperienced programmers, in reading and understanding the source code. By limiting the use of heavily templated data structures we also improve runtime performance of the code.
We hope that by demonstrating the effectiveness of our approach, more software written by researchers for non-target questions will adopt or improve on these ideals.
All currently existing documentation can be found in the ./docs directory of this repository.
It is incomplete as of now and focuses on the theoretical considerations behind implemented
methods rather than the implementation. For those details, refer to the commented code.
qPeaks uses a comprehensive peak model developed by Reuschenbach,
Renner et al. [https://doi.org/10.1021/acs.analchem.4c00494] to
identify peaks within the bins generated by qBinning. Every produced peak is statistically
significant, sidestepping the need for further filtering steps like a minimum
intensity requirement. The scores generated tell you how well each processing
step leading up to a peak worked, and allow you to make
a statement about the confidence of your results. Like all other parts of the
qAlgorithms project, qPeaks requires no user parameters.
The mathematical description of the peak model is available in detail in the documentation.
The algorithm as it is implemented includes a set of decision rules for selecting the best fit and modifications to the design matrix that were not part of the initial publication and are as of now undocumented.
The qBinning algorithm utilises the centroids generated by qCentroids to
produce extracted ion chromatograms. Like qCentroids, it requires no user
parameters. Binning allows you to reduce the number of centroids considered
in future analysis by roughly 30%. The current qBinning program is based
on the algorithm presented by Reuschenbach, Renner et al. [https://doi.org/10.1021/acs.analchem.3c01079],
but implements additional steps for finding the highest number of statistically
sound bins. Additionally, the prediction interval is used instead of the mass error.
The function fitted to the simulated data has also been changed to provide even higher accuracy.
For details on the implementation, refer to src/qalgorithms_qbin.cpp. The method
used to determine the critical value for the used test is described in the documentation.
The qPattern algorithm (as of now unpublished, name subject to change) is a newly
developed componentisation strategy to group features produced with qPeaks. It uses
linear regressions to estimate shape similarity of features and group related features
into components.
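Since qPattern is unpublished, the following is only a guess at the general idea and not its implementation: regress the intensity profile of one feature on that of another over shared scans and read the coefficient of determination as a measure of shape similarity. The function name and all intensity values are invented for this example.

```python
def regress_profile(reference, candidate):
    """Simple linear regression of candidate on reference.

    Returns (slope, intercept, r_squared). Both profiles must hold the
    intensities of the same scans in the same order.
    """
    n = len(reference)
    if n != len(candidate) or n < 3:
        raise ValueError("profiles must cover the same three or more scans")
    mean_x = sum(reference) / n
    mean_y = sum(candidate) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    syy = sum((y - mean_y) ** 2 for y in candidate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, candidate))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile carries no shape information")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For one predictor, R^2 equals the squared Pearson correlation
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up intensities over seven shared scans
monoisotopic = [120, 900, 4100, 9800, 4300, 950, 130]
isotope = [15, 100, 450, 1080, 470, 105, 14]          # same shape, lower abundance
unrelated = [2500, 1200, 400, 300, 900, 2600, 5200]   # different elution behaviour

related_fit = regress_profile(monoisotopic, isotope)
unrelated_fit = regress_profile(monoisotopic, unrelated)
print("related:   slope %.3f, R^2 %.4f" % (related_fit[0], related_fit[2]))
print("unrelated: slope %.3f, R^2 %.4f" % (unrelated_fit[0], unrelated_fit[2]))
```

In this picture the slope would carry the intensity ratio between two co-eluting features, while R² says how well their shapes agree.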
While we minimize overall library usage, some functionality is only possible because of libraries written by others under permissive open-source licenses. In alphabetical order:
- Cephes, used for calculating the F statistic.
- libcerf, used for calculating the error functions.
- pugixml, used to read in mzML documents.
- simdutf, used for fast, vectorised decoding of base64 in mzML.
- zlib, for decompression of mzML binary data.
With the exception of zlib, we directly include the full or partial source code of used libraries (also called "vendoring"). This should reduce the potential for compilation failures to almost zero, because zlib is included on all systems we expect to be used for data analysis of spectra in any context. The choice to use system libraries for zlib was made because compression and decompression are the only highly security-relevant parts of this program and because from a performance perspective, faster libraries on a system may override the ZLIB environment variable.
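The decoding path that simdutf and zlib serve can be sketched with the Python standard library. In mzML, a binary data array is stored as base64 text that wraps little-endian 32-bit or 64-bit floats, optionally zlib-compressed. The function name and the example values are made up here, and this is not how qAlgorithms implements the step.

```python
import base64
import struct
import zlib


def decode_binary_array(text, compressed, double_precision):
    """Decode the text content of an mzML <binary> element into floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    width, code = (8, "d") if double_precision else (4, "f")
    if len(raw) % width != 0:
        raise ValueError("payload length does not match the declared precision")
    count = len(raw) // width
    # mzML stores binary arrays in little-endian byte order
    return list(struct.unpack(f"<{count}{code}", raw))


# Round trip with made-up m/z values to show the layers involved
mz_values = [100.0025, 100.0050, 100.0075]
packed = struct.pack("<3d", *mz_values)
payload = base64.b64encode(zlib.compress(packed)).decode("ascii")
decoded = decode_binary_array(payload, compressed=True, double_precision=True)
print(decoded)
```

Whether an array is compressed and which precision it uses is declared by cvParam entries next to the binary element, so a real reader would take the two flags from there.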
For concrete goals, check out the todo list. If you are interested in working on one of these items, you are welcome to open a pull request.
Our goal is to provide a specialised, high-precision tool for analytical data processing
of HRMS data. All current and future additions to qAlgorithms are developed with the
goal of reducing the potential for human error and increasing result reliability.
The current main priority is establishing a comprehensive validation strategy for our peak fitting and overall feature detection workflow.
qAlgorithms - Transforming data into insight.