Critical Assessment of Small Molecule Identification

News

March 29th, 2017
The CASMI 2016 Cat 2+3 paper is out!

Jan 20th, 2017
Organisation of CASMI 2017 is underway, stay tuned!

Dec 4th, 2016
The MS1 peak lists for Category 2+3 have been added for completeness.

May 6th, 2016
The winners and full results are available.

April 25th, 2016
The solutions are public now.

April 18th, 2016
The contest is now closed. The results are fantastic and will be published soon!

April 9th, 2016
All teams who submit before the deadline April 11th will be allowed to update the submission until Friday 15th.

February 12th, 2016
New categories 2 and 3 and data for automatic methods released. 10 new challenges in category 1.

January 25th, 2016
E. Schymanski and S. Neumann joined the organising team, additional contest data coming soon.

January 11th, 2016
New CASMI 2016 raw data files are available.

Results in Category 1

Please note that Challenge-3 was removed from Category 1, since the MS/MS was acquired from the wrong precursor ion.

Summary of Challenge wins

	Vaniya	Allen	Nothias-Scaglia	Nikolic	Allard	Bertrand	Kind
Gold	14	7	11	15	2	6	12
Silver	1	7	1	3	3	3	1
Bronze	0	1	0	0	4	4	1

Summary statistics per participant

	Mean rank	Median rank	Top	Top3	Top10	Mean RRP	Median RRP
Vaniya	5.25	1.0	14	15	15	0.989	1.000
Allen	3.47	2.0	7	12	16	0.971	0.993
Nothias-Scaglia	1.25	1.0	11	11	12	0.994	1.000
Nikolic	1.22	1.0	14	18	18	0.785	1.000
Allard	3.40	2.5	2	6	10	0.661	0.727
Bertrand	5.29	2.0	6	8	12	0.781	0.933
Kind	19.62	1.0	12	14	15	0.875	1.000

Summary of Rank by Challenge and Participant

For each challenge, the rank of the winner(s) is highlighted in bold. If a submission did not contain the correct candidate, this is denoted as "-". If a participant did not take part in a challenge, the cell is left empty.

This summary is also available as CSV download.

	Vaniya	Allen	Nothias-Scaglia	Nikolic	Allard	Bertrand	Kind
challenge-001	1.0	4.5	-	2.0	3.0	7.0	5.0
challenge-002	1.0	1.0	1.0	2.0	2.0	1.0	1.0
challenge-004	1.0	1.0	1.0	1.0	2.0	6.0	1.0
challenge-005	-	-	-	1.0	-	-	-
challenge-006	-	2.5		1.0	7.0	6.0	-
challenge-007	1.0	1.0	1.0	1.0	2.0	1.0	3.0
challenge-008	1.0	4.0	1.0	2.0	4.0		2.0
challenge-009	1.0	1.0	1.0	1.0	1.0	2.0	1.0
challenge-010	1.0	1.0	1.0	1.0	1.0	1.0	1.0
challenge-011	1.0	19.0	-	1.0	8.0	4.0	1.0
challenge-012	1.0	2.0	1.0	1.0	4.0	1.0	1.0
challenge-013	1.0	3.0	1.0	1.0	-	-	1.0
challenge-014	68.0	8.0	-	2.0	-	29.0	292.0
challenge-015	1.0	1.0	4.0	1.0		1.0	1.0
challenge-016	2.0	2.0	1.0	1.0		2.0	1.0
challenge-017	1.0	1.0	1.0	1.0		-	1.0
challenge-018	1.0	4.0	-	1.0		12.0	1.0
challenge-019	1.0	3.0	1.0	1.0		1.0	1.0
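The columns of the per-participant summary can be recomputed from the ranks in the table above. The following is a minimal sketch, assuming that "-" and empty cells are excluded from the statistics and that Top, Top3 and Top10 count ranks of 1, at most 3 and at most 10; the function name is hypothetical and RRP is left out because its definition is not given here. Under these assumptions the sketch reproduces the Nikolic and Vaniya rows.

```python
from statistics import mean, median


def summarise_ranks(ranks):
    """Summarise one participant's ranks.

    `ranks` holds one entry per challenge: a float rank, or None where the
    correct candidate was missing ("-") or nothing was submitted.
    Assumption: None entries are ignored in every statistic.
    """
    found = [r for r in ranks if r is not None]
    return {
        "mean_rank": round(mean(found), 2),
        "median_rank": median(found),
        "top": sum(r == 1.0 for r in found),
        "top3": sum(r <= 3.0 for r in found),
        "top10": sum(r <= 10.0 for r in found),
    }


# Ranks copied from the Nikolic and Vaniya columns (challenge-001 to challenge-019).
nikolic = [2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0,
           1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
vaniya = [1.0, 1.0, 1.0, None, None, 1.0, 1.0, 1.0, 1.0,
          1.0, 1.0, 1.0, 68.0, 1.0, 2.0, 1.0, 1.0, 1.0]

print(summarise_ranks(nikolic))
print(summarise_ranks(vaniya))
```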

Participant information and abstracts

Participant:	Avaniya
Authors:	Vaniya, Arpana [1], Stephanie N. Samra [1], Mine Palazoglu [1], 
		Hiroshi Tsugawa [2], and Oliver Fiehn [1]
Affiliations: 	[1] Genome Center, University of California, Davis 
		[2] RIKEN Center for Sustainable Resource Science (CSRS), Wako, Japan

ParticipantID:	avaniya002
Category:	Category 1
Automatic methods: Yes

Abstract: 

The challenges were first searched against multiple mass spectral libraries to find
the best match.  The MS/MS data was converted to msp format to be searched against
NIST14, METLIN, MassBank, ReSpect, and LipidBlast using NIST MS Search 2.0.
Candidates with a reverse dot product score of 500 were confirmed by examining the match
of the experimental MS/MS to the reference MS/MS.  Top candidates from the MS library search
were used to validate candidates in the MS-FINDER and MetFrag results.  MS-FINDER,
Seven Golden Rules, SIRIUS 3.1.3 and MetFrag were used with the same method as for the
submission titled avaniya001-category1.  For this submission, candidates that were
found in both the MS library search and MS-FINDER or MetFrag were weighted more heavily.
Candidates for challenges with no hits from the MS library search remained unchanged
from the submission titled avaniya001-category1.  Final scores and SMILES were
reported for submission to CASMI 2016. Multiple candidates were submitted for each
challenge.

Participant:          Allen
Authors:              Felicity Allen, Russ Greiner, David Wishart
Affiliations:         Department of Computing Science
		      University of Alberta, Canada

ParticipantID:        felicityallen
Category:             category1 and category2
Automatic pipeline:   yes
Spectral libraries:   no

Abstract

A list of candidate structures was obtained by querying all of the following
databases for all candidates within the required mass ranges (determined as above):
HMDB  http://www.hmdb.ca/
ChEBI http://www.ebi.ac.uk/chebi/
ChEMBL https://www.ebi.ac.uk/chembl/
Metlin http://metlin.scripps.edu/
FOODB http://foodb.ca/
T3DB http://www.t3db.ca/
DrugBank http://www.drugbank.ca/
ECMDB http://www.ecmdb.ca/
YMDB http://www.ymdb.ca/
PlantDB Privately held list of 200,000 plant and plant-derived compounds.

The MS1 spectra were then predicted for each candidate molecular formula using
the emass program by A. Rockwood and P. Haimi [1].  These predicted spectra
were compared to the provided MS1 spectra (restricted to within 10 Da of the 
monoisotopic mass of the molecular formula), and an MS1_SCORE was produced 
for each molecular formula based on the closeness of this match. The scoring
metric used was:  
MS1_SCORE = ( (WP + WR + DP)_5ppm + (WP + WR + DP)_10ppm + (WP + WR + DP)_50ppm )/10
where 
WP = intensity weighted precision (0-100)
WR = intensity weighted recall (0-100)
DP = dot product (0-1) x 100
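As an illustration of the formula above, here is a minimal sketch that combines the three components at each mass tolerance. It assumes WP, WR and DP have already been computed for each tolerance, since the abstract does not spell out how; the function name and input layout are hypothetical.

```python
TOLERANCES_PPM = (5, 10, 50)


def ms1_score(components):
    """Combine MS1 match components into MS1_SCORE.

    `components` maps a tolerance in ppm to a tuple (WP, WR, DP), where
    WP and WR are on a 0-100 scale and DP is the raw dot product (0-1).
    """
    total = 0.0
    for tol in TOLERANCES_PPM:
        wp, wr, dp = components[tol]
        total += wp + wr + dp * 100.0  # DP is rescaled to 0-100
    return total / 10.0


perfect = {tol: (100.0, 100.0, 1.0) for tol in TOLERANCES_PPM}
print(ms1_score(perfect))
```

With this scaling, a perfect match at all three tolerances gives the maximum possible value of 90.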

[1] Rockwood A. and Haimi P., "Efficient calculation of accurate masses 
    of isotopic peaks.", Journal of the American Society for Mass Spectrometry, 
    17:3 p415-9 2006.	

For all candidate structures, CFM was used to produce a score for the MS2 spectra.
The original CFM positive and negative models were used, which were trained
on data from the Metlin database.  Mass tolerances of 10 ppm were used
and the Jaccard score was applied for spectral comparisons. The input spectrum
was repeated for the low, medium and high energies.
The Jaccard score was summed across the three energies and multiplied by 300
to give the MS2_SCORE.
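The MS2 scoring step can be sketched as follows. This is not CFM-ID's own code: the greedy peak matching, the function names and the input layout are assumptions, and the final scaling simply follows the wording above (sum over three energies, times 300).

```python
def jaccard_score(measured_mz, predicted_mz, tol_ppm=10.0):
    """Jaccard index of two peak lists: matched peaks / peaks in the union.

    Assumption: each predicted peak can be matched at most once, and a
    match means the m/z values agree within `tol_ppm`.
    """
    unmatched = sorted(predicted_mz)
    matched = 0
    for mz in sorted(measured_mz):
        for i, candidate in enumerate(unmatched):
            if abs(candidate - mz) <= mz * tol_ppm * 1e-6:
                matched += 1
                del unmatched[i]
                break
    union = len(measured_mz) + len(predicted_mz) - matched
    return matched / union if union else 0.0


def ms2_score(measured_mz, predicted_by_energy):
    """Sum the Jaccard score over the three energies and multiply by 300."""
    energies = ("low", "medium", "high")
    return 300.0 * sum(
        jaccard_score(measured_mz, predicted_by_energy[e]) for e in energies
    )
```

The same measured spectrum is compared against the prediction at each energy, matching the statement that the input spectrum was repeated for the low, medium and high energies.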

[2] Allen F., Pon A., Wilson M., Greiner R., Wishart D., "CFM-ID: A
    web server for annotation, spectrum prediction and metabolite
    identification from tandem mass spectra", Nucleic Acids Research,
    Web Server Edition 2014.

[3] Allen F., Greiner R., Wishart D., "Competitive Fragmentation
    Modeling of ESI-MS/MS spectra for putative metabolite
    identification", Metabolomics, 11:1, p98-110, 2015.

For all candidates, a DB_SCORE was produced according to which of the above databases
each candidate was found in, adding 50 for each database, except ChEMBL, which added only 10.

The results were ranked according to the sum of the above three scores:
TOTAL_SCORE = MS2_SCORE + DB_SCORE + MS1_SCORE
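A minimal sketch of the database score and the final ranking follows. The database weights come from the text above; the candidate dictionaries, their keys and the function names are hypothetical.

```python
# 50 points per database, except ChEMBL, which contributes 10.
DB_WEIGHTS = {
    "HMDB": 50.0, "ChEBI": 50.0, "ChEMBL": 10.0, "Metlin": 50.0,
    "FOODB": 50.0, "T3DB": 50.0, "DrugBank": 50.0, "ECMDB": 50.0,
    "YMDB": 50.0, "PlantDB": 50.0,
}


def db_score(databases):
    """Sum the weights of the distinct databases that contain a candidate."""
    return sum(DB_WEIGHTS[name] for name in set(databases))


def total_score(candidate):
    """TOTAL_SCORE = MS2_SCORE + DB_SCORE + MS1_SCORE."""
    return (candidate["ms2_score"]
            + db_score(candidate["databases"])
            + candidate["ms1_score"])


def rank_candidates(candidates):
    """Return candidates ordered from highest to lowest TOTAL_SCORE."""
    return sorted(candidates, key=total_score, reverse=True)


example = [
    {"id": "A", "ms1_score": 85.0, "ms2_score": 120.0, "databases": []},
    {"id": "B", "ms1_score": 80.0, "ms2_score": 100.0,
     "databases": ["HMDB", "ChEMBL"]},
]
print([c["id"] for c in rank_candidates(example)])
```

In this made-up example, candidate B overtakes A despite weaker spectral scores because it is found in two databases, which shows how strongly the database term can influence the ranking.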

Participant:	  Nothias-Scaglia
Authors:    	  Louis-Felix Nothias (1), Ricardo Silva (1), Florent Olivon (2), 
	    	  Alex Melnik (1), Marc Litaudon (2)
Affiliations: 	  (1) Skaggs School of Pharmacy and Pharmaceutical Sciences, 
	      	  University of California San Diego, La Jolla, CA 92037, USA
	      	  (2) Institut de Chimie des Substances Naturelles, CNRS-ICSN, 
	      	  University of Paris-Saclay, 1 avenue de la terrasse, 91190, 
	      	  Gif-sur-Yvette, France

ParticipantID:         GNPS with MS in silico tools
Category:    	       category1
Automatic pipeline:    partial
Spectral libraries:    yes

Abstract

As part of CASMI 2016, we used the GNPS (Global Natural Products
Social molecular networking) dereplication workflow [1] and tested
different combinations of in silico tools for mass spectrometry in
three different proposals.

The proposal “GnpsCSIFingerID” was prepared by using: (A) Sirius3 for
molecular formula calculation [2]; (B) GNPS for MS/MS spectral
matching; (C) CSI:FingerID for in silico MS/MS spectral matching for
challenges 1-14 [3]; (D) CFM-ID [4] with a candidate list retrieved
from Dictionary of Natural Products or SciFinder for challenges 15-19.

(A) Candidate molecular formulas of challenges 1-19 were calculated
using Sirius 3.1 with the provided MS1 and MS/MS peak lists (atoms
C, H, N, O, S, P and halogens; 20 ppm max error). Candidate molecular
formulas were manually curated based on natural product likeness.

(B) Both MS/MS peak lists and raw MS/MS spectra were converted to .mgf
format and uploaded to the GNPS web platform (http://gnps.ucsd.edu). A
spectral library search was conducted via a GNPS dereplication
workflow (with all spectral libraries available in March
2016). Annotations were confirmed based on the fitting score,
inspection of MS/MS spectral matching with mirror plot, and
consistency with the molecular formula from Sirius3. Additionally,
searches were conducted with METLIN [5] and NIST spectral libraries
[6]. Then, in silico tools for tandem mass spectrometry were used to
establish a list of candidates for each challenge.

(C) CSI:FingerID was used for challenges 1-14 (positive ion mode). The
top 10 candidates were considered for the putative molecular
formula. The "biological database" filter was not used, and the same
candidate rank order was kept (not the match score).

(D) Because negative ion mode is not available in CSI:FingerID, CFM-ID
was used for challenges 15-19. A list of candidates was retrieved from
Dictionary of Natural Products or SciFinder (challenge 18) by
searching the hypothetical molecular formula(s). The output score of
CFM-ID was used to rank candidates.

Finally, the candidate list for each challenge was made by ranking the
spectral library hit in first position, followed by the candidates from
in silico tools.  The candidates for challenges 3, 10 and 18 were
found to be non-natural products. Thus, these challenges should be
regarded as unannotated. Furthermore, no candidates are proposed for
challenge 6, because the hypothetical molecular formula was not
available in CSI:FingerID (the mass deviation of the monoisotopic
parent ion exceeded 15 ppm).
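The final ordering described above could be sketched as follows. The function name and the use of plain string identifiers are hypothetical, and dropping duplicates is an assumption that the abstract does not state.

```python
def build_candidate_list(library_hits, in_silico_candidates):
    """Place spectral library hits first, then the in silico candidates.

    Both inputs are ordered lists of candidate identifiers (for example
    SMILES strings). A candidate that appears in both lists keeps its
    first, library-derived position (an assumption).
    """
    ordered = []
    seen = set()
    for candidate in list(library_hits) + list(in_silico_candidates):
        if candidate not in seen:
            seen.add(candidate)
            ordered.append(candidate)
    return ordered


print(build_candidate_list(["hit1"], ["cand1", "hit1", "cand2"]))
```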

[1] GNPS - Global Natural Products Social molecular networking, http://gnps.ucsd.edu
[2] Böcker, S.; Dührkop, K. Fragmentation Trees Reloaded. J Cheminform 2016, 8 (1), 1–26.
[3] Dührkop, K.; Shen, H.; Meusel, M.; Rousu, J.; Böcker, S. Searching
Molecular Structure Databases with Tandem Mass Spectra Using
CSI:FingerID. PNAS 2015, 112 (41), 12580–12585.
[4] Allen, F.; Greiner, R.; Wishart, D. Competitive Fragmentation
Modeling of ESI-MS/MS Spectra for Putative Metabolite
Identification. Metabolomics 2014.
[5] Smith, C. A.; O’Maille, G.; Want, E. J.; Qin, C.; Trauger, S. A.;
Brandon, T. R.; Custodio, D. E.; Abagyan, R.; Siuzdak, G. METLIN: A
Metabolite Mass Spectral Database. Therapeutic drug monitoring 2005,
27 (6), 747–751.
[6] NIST Mass spectrometry datacenter, http://chemdata.nist.gov

Participant:	   Nikolic
Author:            Dejan Nikolic
Affiliations:      UIC/NIH Center for Botanical Dietary Supplements Research
	           Department of Medicinal Chemistry & Pharmacognosy, 
	           College of Pharmacy, University of Illinois at Chicago

ParticipantID:        Nikolic
Category:             Category1
Automatic methods:    No

Abstract 

Structure candidates were determined on a case-by-case basis using a
manual method outlined in the previous publication from the CASMI 2013
contest (1). The method involves searching the elemental
composition in the SciFinder and Reaxys databases, restricting the hits
to naturally occurring compounds. Publicly available spectral
libraries such as MassBank, METLIN and ReSpect were also
consulted. Hits returned from the searches were manually scrutinized
by attempting to rationalize the experimental spectrum with the
candidate structures. For ranking candidate structures, a subjective
confidence scale from 0.60 to 1.00 was used. The overall confidence in
the assignment was assessed based on several factors including
spectral library match (if applicable), the ability to rationalize as
many fragment ions as possible as well as the overall experience in
working with a particular class of compounds. The confidence scale
ranking brackets are defined as follows:

1.00: Full confidence that the single candidate is the correct structure. 
0.90 to 0.99: High confidence that candidate is the correct structure. 
0.80 to 0.89: Good confidence that candidate is the correct structure. 
0.70 to 0.79: Fair confidence that candidate is the correct structure. 
0.60 to 0.69: Poor confidence that candidate is the correct structure. 
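The brackets above map directly onto a small lookup. This is only a sketch of the scale as listed; the function name and the short labels are hypothetical, and the scores themselves were assigned subjectively by the participant.

```python
def confidence_bracket(score):
    """Return the confidence label for a score on the 0.60 to 1.00 scale."""
    if not 0.60 <= score <= 1.00:
        raise ValueError("score must lie between 0.60 and 1.00")
    if score == 1.00:
        return "Full"
    if score >= 0.90:
        return "High"
    if score >= 0.80:
        return "Good"
    if score >= 0.70:
        return "Fair"
    return "Poor"


for value in (1.00, 0.95, 0.85, 0.75, 0.65):
    print(value, confidence_bracket(value))
```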

For some challenges (e.g. Ch 4, 6, 8, 14) the data fit several
structural isomers equally well, which reduces the overall confidence
that the highest ranking candidate is the correct structure. It was
noted that for some of the originally posted challenges (1-9) there is
a discrepancy between the raw data in the original manufacturer's
format and the peak list provided. In those cases the original file
was used for evaluation.

Reference

(1) Newsome, A. and Nikolic D. CASMI 2013: Identification of small
molecules by tandem mass spectrometry combined with database and
literature mining; Mass Spectrometry 3, S0034 (2014)

Participant:	      Allard
Authors:              Allard, Pierre-Marie(1) and Houriet, Joëlle(1)
Affiliations:         (1) Laboratory of Phytochemistry and Bioactive Natural Products, 
		      School of Pharmaceutical Sciences, University of
		      Geneva, Quai-Ernest Ansermet 30, 1211 Geneva, Switzerland
                      
ParticipantID:        pma
Category:	      category1
Automatic pipeline:   yes
Spectral libraries:   yes

Abstract: 

We processed only data of category 1, in positive mode (challenges 1 to 14). 

Data conversion: 

Data of challenges 1 to 9 were converted to .mzXML format using
ProteoWizard. Fragmentation spectra of the ions of interest were
extracted and saved as .mgf files. The parent ion mass in the .mgf files
was corrected to fit the exact mass of the ion of interest when necessary.

Molecular network generation:

(Molecular network was generated to assess possible structural
relationship between metabolites and to generate a common .mgf file)

All .mgf files (challenges 1 to 14) were uploaded to GNPS servers
(http://gnps.ucsd.edu) and treated in the data treatment workflow
using the following parameters: The data were clustered with
MS-Cluster with a parent mass tolerance of 0.8 Da and an MS/MS fragment
ion tolerance of 0.5 Da to create consensus spectra. A network was
then created, where edges were filtered to have a cosine score above
0.7 and more than 6 matched peaks. Further edges between two nodes
were kept in the network if, and only if, each of the nodes appeared
in each other's respective top 10 most similar nodes.  The spectra in
the network were then searched against all available GNPS spectral
libraries. A GNPS library hit was taken into account for challenge 3
since it was a permanently charged compound which was not included in
the ISDBs. Its score was set as the highest.
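The edge filtering rules described above can be sketched as follows. This is not the GNPS implementation: the input layout and function name are hypothetical, and computing each node's top 10 neighbours only among edges that already pass the cosine and matched-peak thresholds is an assumption.

```python
from collections import defaultdict


def filter_edges(pairs, min_cosine=0.7, min_matched=6, top_k=10):
    """Filter candidate network edges.

    `pairs` maps a node pair (a, b) to (cosine, matched_peaks). An edge is
    kept if its cosine exceeds `min_cosine`, it has more than `min_matched`
    matched peaks, and each node is in the other's `top_k` most similar nodes.
    """
    kept = {
        (a, b): cos
        for (a, b), (cos, n_matched) in pairs.items()
        if cos > min_cosine and n_matched > min_matched
    }
    neighbours = defaultdict(list)
    for (a, b), cos in kept.items():
        neighbours[a].append((cos, b))
        neighbours[b].append((cos, a))
    # For every node, the set of its top_k neighbours by cosine score.
    top = {
        node: {other for _, other in sorted(scored, reverse=True)[:top_k]}
        for node, scored in neighbours.items()
    }
    return {
        (a, b): cos
        for (a, b), cos in kept.items()
        if b in top[a] and a in top[b]
    }


pairs = {
    ("A", "B"): (0.90, 8),
    ("A", "C"): (0.80, 7),
    ("B", "C"): (0.95, 3),  # too few matched peaks
}
print(filter_edges(pairs, top_k=1))
```

With `top_k=1` in this toy example, only the A-B edge survives: B-C fails the matched-peak threshold, and C is not A's most similar node.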

In-Silico Databases (ISDB) spectral match:

Two in-silico MS/MS fragmentation databases were queried: an ISDB
created from data of the Dictionary of Natural Products and an ISDB
created from data of the UNPD database (http://pkuxxj.pku.edu.cn/UNPD/).
ISDBs were generated using cfm-id (https://sourceforge.net/projects/cfm-id/)
as described in: Allard, P.-M.; Péresse, T.; Bisson, J.; Gindro, K.;
Marcourt, L.; Pham, V. C.; Roussi, F.; Litaudon, M.; Wolfender,
J.-L. Anal. Chem. 2016, 88, 3317–3323.

The clustered .mgf file obtained was searched against both ISDBs
using tremolo (http://proteomics.ucsd.edu/Software/Tremolo/) for the
spectral match and an in-house script for annotation of the hits. The
spectral search was made using the following parameters:

tolerance.PM_tolerance=0.01
SCORE_THRESHOLD=0.1
TOP_K_RESULTS=10

The detailed workflow to perform the spectral match, the scripts and the
UNPD-ISDB are available here: http://oolonek.github.io/ISDB/

Participant:	      Bertrand
Authors:              Bertrand, Samuel(1)
Affiliations:         (1) Groupe Mer, Molécules, Santé-EA 2160, UFR des Sciences 
		      Pharmaceutiques et Biologiques, Université de Nantes, France 

ParticipantID:        SamuelBLCMS
Category:             category1
Automatic methods:    yes
Spectral libraries:   no

Abstract
The challenge data were automatically treated using R, XCMS [1], IPO
[2], CAMERA [3], SIRIUS3 [4], MeHaloCoA [5], RMassBank [6] and CFM-ID
[7] as follows, with results stored in a MySQL database
throughout the process:

1- LC-MS data were converted to centroid mode using ProteoWizard if necessary.
2- LC peak detection was achieved using XCMS after peak detection
   optimisation with IPO. For some challenges, peak
   detection was optimised manually.
3- In the challenge-related peak, neutral losses and adducts were searched
   within PCgroups using CAMERA.
4- MS2 spectra of the ions related to each challenge were retrieved using RMassBank.
5- For each challenge, molecular formulas were obtained using SIRIUS3 and
   discriminated based on isotopic distribution, MS2 fragmentation
   (calculated by SIRIUS) and adduct redundancy (number of occurrences
   of the MF among all adducts over the maximum number of occurrences
   of an MF among all proposed MFs). The presence of S, Cl and Br atoms
   was automatically detected using MeHaloCoA.
6- Molecular formulas of compounds (corrected for adduct information)
   were searched in various databases for CAS number, InChI,
   InChIKey, SMILES and Mol: AntiBase, ChEBI, DNP, DMNP, KNAPSACK, UNPD,
   KEGG, LipidMaps. For each compound found in the databases, missing
   data were completed (as far as possible) using OpenBabel [8], CTS [9],
   CACTUS [10] and ChemSpider [11].
7- Similarity between simulated and measured MS2 spectra was evaluated
   and scored using CFM-ID (when possible).
8- The final score (SF) was calculated from the MF score (SMF) and the
   MS2 similarity score (SMS2) as follows: SF=SMF+SMS2.
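The adduct redundancy term and the final score from the steps above can be sketched as follows. The function names and input layout are hypothetical. How isotopic distribution, MS2 fragmentation and adduct redundancy are combined into SMF is not stated in the abstract, so SMF is taken as an input here.

```python
from collections import Counter


def adduct_redundancy(formulas_per_adduct):
    """Score each molecular formula (MF) by how many adducts propose it.

    `formulas_per_adduct` maps an adduct label to the list of neutral MFs
    proposed for it. The score is the number of adducts proposing the MF
    divided by the highest such count among all proposed MFs.
    """
    counts = Counter(
        mf for formulas in formulas_per_adduct.values() for mf in set(formulas)
    )
    if not counts:
        return {}
    max_count = max(counts.values())
    return {mf: n / max_count for mf, n in counts.items()}


def final_score(smf, sms2):
    """SF = SMF + SMS2."""
    return smf + sms2


proposals = {
    "[M+H]+": ["C6H12O6", "C7H8O5"],
    "[M+Na]+": ["C6H12O6"],
}
print(adduct_redundancy(proposals))
```

In this made-up example, the formula supported by both adducts receives the full redundancy score and the one supported by a single adduct receives half of it.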

Note: when peak detection was unsuccessful
(Challenges 1-2, 5-9 and 16), the raw MS spectra (available on the
CASMI website) were manually introduced for calculation. No structures
were submitted for Challenges 3 and 8 due to the absence of structures in
the databases.

Bibliography:
[1] R. Tautenhahn, et al., BMC Bioinf., 2008, 9, 504.        
[2] G. Libiseller, et al., BMC Bioinf., 2015, 16, 118.
[3] C. Kuhl, et al., Anal. Chem., 2012, 84, 283.
[4] S. Böcker, et al., Bioinformatics, 2009, 25, 218.
[5] http://yguitton.github.io/MeHaloCoA/
[6] http://bioconductor.org/packages/RMassBank/
[7] F. Allen, et al., Metabolomics, 2014, 11, 98.
[8] N. O'Boyle, et al., J. Cheminformatics, 2011, 3, 33.
[9] G. Wohlgemuth, et al., Bioinformatics, 2010, 26, 2647.
[10] http://cactus.nci.nih.gov/chemical/structure
[11] H.E. Pence, et al., Journal of Chemical Education, 2010, 87, 1123.

Participant:	      Kind
Authors:              Tobias Kind 
Affiliations:         UC Davis Genome Center - Metabolomics

ParticipantID:        tkind
Category:             category1
Automatic methods:    no

Abstract
This is a submission for the http://www.casmi-contest.org/2016/
Category 1: Best Structure Identification on Natural Products

The challenges for Category 1 are natural products from several organisms 
of different possible origin (plants, fungi, marine sponges, algae or 
micro-algae), acquired on QToF instruments from Waters and Agilent.
Based on the MS and MS/MS and other data, the goal is to determine 
the correct molecular structure at the given retention time using 
the spectral data and the additional information provided. 

(1) Molecular formulas were determined with the Seven Golden Rules
[http://fiehnlab.ucdavis.edu/projects/Seven_Golden_Rules] and
Sirius [https://bio.informatik.uni-jena.de/software/sirius/]
In some cases the provided data were not sufficient and additional data
were extracted from the raw files using ProteoWizard and MZmine.

(2) Formulae were then queried in Dictionary of Natural Products
[http://dnp.chemnetbase.com/] and UNPD [http://pkuxxj.pku.edu.cn/UNPD/]
as well as ChemSpider [http://www.chemspider.com/] and REAXYS
[https://www.reaxys.com] to obtain molecular structures.

(3) Candidate molecules obtained from the natural product databases 
were downloaded as SMILES or InChI and InChIKey and then 
submitted to different programs to rank them.

CFM-ID was used to generate MS/MS spectra
[https://sourceforge.net/projects/cfm-id/]. Additionally the MS-Finder
software [http://prime.psc.riken.jp/Metabolomics_Software/] and 
CSI-FingerID [http://www.csi-fingerid.org/] were used for compound ranking.

Subsequently all compound data were converted into MGF format and MS/MS
spectra were submitted to the NIST14 GUI MS/MS database search and manual peak inspection.
In some cases, additional neutral losses and characteristic product ion peaks
were investigated with the MS-Finder GUI.

This manual process of compound annotation is highly unsustainable, 
error-prone, frustrating and time-consuming. Fully automated
processes have to be developed. More importantly, completely
unknown compounds cannot be elucidated with this workflow,
because MS/MS data and retention time are not sufficient
for complete structure elucidation.

Details per Challenge and Participant. See legend at bottom for more details

The details table is also available as HTML and as CSV download. The individual submissions are also available for download.