1. Status of this Template
This document provides guidelines for annotating metaproteomics experiments in SDRF-Proteomics format. This template extends the core SDRF-Proteomics specification with metaproteomics-specific metadata fields.
Status: Draft - This template is under active development and may change significantly.
Version: 1.0.0-dev - 2026-01
2. Abstract
Metaproteomics is the study of proteins expressed by microbial communities, analyzing complex samples containing proteins from many different organisms simultaneously. Key applications include gut microbiome characterization, environmental microbiology, host-microbiome interactions, and industrial biotechnology.
Metaproteomics experiments have unique characteristics: multi-organism samples, complex sample matrices, databases constructed from metagenomics data, taxonomic attribution challenges, and critical environmental context metadata. This template defines the additional metadata requirements for annotating metaproteomics datasets.
3. Connections to Other Omics Fields
Metaproteomics data is typically integrated with:
- Metagenomics: For database construction and taxonomic context
- Metatranscriptomics: For gene expression correlation
- 16S/ITS amplicon sequencing: For community composition
- Metabolomics: For functional interpretation
- Environmental databases: ENVO, MIxS standards
When available, linking to related omics datasets is highly RECOMMENDED.
4. Additional Ontologies
In addition to the ontologies supported by the core SDRF-Proteomics specification, metaproteomics templates utilize:
- Environment Ontology (ENVO): For environmental sample classification (http://www.obofoundry.org/ontology/envo.html)
- Gazetteer (GAZ): For geographic locations
- MIxS: Minimum Information about any (x) Sequence standards
- NCBI Taxonomy: For organism classification
5. Checklist
This section defines the metadata columns required and recommended for metaproteomics experiments.
5.1. Required Columns
The following columns are REQUIRED for metaproteomics experiments in addition to the core SDRF-Proteomics requirements:
| Column Name | Description | Cardinality | Ontology/CV | Example Values |
|---|---|---|---|---|
| characteristics[environmental sample type] | Type of environmental or biological sample being analyzed | 1 | ENVO, EFO | soil, marine sediment, human gut, wastewater, rhizosphere, oral cavity, skin |
| characteristics[sample collection method] | Method used to collect the sample | 1 | Controlled vocabulary or free text | grab sample, core sampling, swab collection, filtration, fecal collection, biopsy |
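The check below is a minimal sketch of how the two required columns could be verified before submission. It is not part of sdrf-pipelines; the function name `check_required_columns` is hypothetical, and it assumes the SDRF file is tab-separated with a single header row.

```python
import csv
import io

# Column names taken from the Required Columns table (lower-cased for matching).
REQUIRED = (
    "characteristics[environmental sample type]",
    "characteristics[sample collection method]",
)


def check_required_columns(sdrf_text):
    """Return (missing_columns, empty_cells) for a tab-separated SDRF string.

    empty_cells is a list of (line_number, column_name) pairs, where line 1
    is the header row.
    """
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    if not rows:
        return list(REQUIRED), []
    header = [name.strip().lower() for name in rows[0]]
    missing = [col for col in REQUIRED if col not in header]
    empty = []
    for line_no, row in enumerate(rows[1:], start=2):
        for col in REQUIRED:
            if col not in header:
                continue
            idx = header.index(col)
            value = row[idx].strip() if idx < len(row) else ""
            if not value:
                empty.append((line_no, col))
    return missing, empty
```

Calling `check_required_columns(open("your_file.sdrf.tsv").read())` returns two lists that should both be empty for a file that satisfies this section.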
5.2. Recommended Columns
The following columns are RECOMMENDED for metaproteomics experiments:
| Column Name | Description | Cardinality | Ontology/CV | Example Values |
|---|---|---|---|---|
| characteristics[geographic location] | Geographic location where sample was collected | 0..1 | GAZ or free text with coordinates | Baltic Sea, Amazon rainforest, lat:52.5 lon:13.4, GAZ:00002459 |
| characteristics[collection date] | Date when sample was collected | 0..1 | ISO 8601 date format | 2024-03-15, 2024-03, March 2024 |
| characteristics[depth] | Depth at which sample was collected (for water/sediment) | 0..1 | Numeric value with unit | 10 m, 0-5 cm, surface |
| characteristics[altitude] | Altitude/elevation for terrestrial samples | 0..1 | Numeric value with unit | 500 m, 1200 m above sea level |
| characteristics[temperature] | Environmental temperature at collection | 0..1 | Numeric value with unit (Celsius) | 25 C, 4 C, 37 C |
| characteristics[pH] | pH of the sample or environment | 0..1 | Numeric value | 7.4, 5.5, 8.2 |
| characteristics[host organism] | Host organism for host-associated microbiomes | 0..1 | NCBITaxon | homo sapiens, mus musculus, zea mays |
| characteristics[host disease] | Disease state of host if applicable | 0..1 | MONDO, EFO | inflammatory bowel disease, normal, colorectal cancer |
| characteristics[host age] | Age of host organism | 0..1 | Standard age format | 45Y, 8W, 3M |
| comment[database search strategy] | Strategy for sequence database construction | 0..1 | Free text description | matched metagenome, RefSeq bacteria, GTDB, UniRef90, custom assembly |
| comment[metagenome accession] | Accession number for associated metagenome data | 0..1 | Database accession | SRR12345678, ERR12345678, PRJNA123456 |
| comment[sample storage] | How sample was stored before processing | 0..1 | Free text description | flash frozen, -80C, RNAlater, 4C for 24h |
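As an illustration of the "standard age format" used for characteristics[host age], the sketch below parses values such as 45Y, 8W, and 3M. It assumes ages are written as integer counts followed by Y, M, W, or D (in that order when combined); the helper name `parse_host_age` is hypothetical and not part of any SDRF tool.

```python
import re

# Assumed grammar: optional years, months, weeks, days, e.g. "45Y", "8W", "40Y5M".
_AGE_PATTERN = re.compile(r"^(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?$")
_UNITS = ("years", "months", "weeks", "days")


def parse_host_age(value):
    """Parse an age string into a dict such as {"years": 45}.

    Raises ValueError when the value does not follow the assumed format.
    """
    text = value.strip().upper()
    match = _AGE_PATTERN.match(text)
    # The pattern also matches the empty string, so reject that explicitly.
    if not text or match is None:
        raise ValueError(f"not a recognized age value: {value!r}")
    return {
        unit: int(number)
        for unit, number in zip(_UNITS, match.groups())
        if number is not None
    }
```

Free-text ages such as "adult" raise `ValueError`, which makes them easy to flag during curation.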
5.3. Environmental Context Columns
5.3.1. Aquatic Samples
| Column Name | Description | Example Values |
|---|---|---|
| characteristics[water body type] | Type of water body | ocean, lake, river, estuary, pond |
| characteristics[salinity] | Salinity measurement | 35 PSU, freshwater, brackish |
| characteristics[dissolved oxygen] | Dissolved oxygen concentration | 8.5 mg/L, hypoxic, anoxic |
| characteristics[chlorophyll] | Chlorophyll concentration if measured | 2.5 ug/L |
| characteristics[sampling depth zone] | Ecological depth zone | epipelagic, mesopelagic, benthic |
5.3.2. Soil Samples
| Column Name | Description | Example Values |
|---|---|---|
| characteristics[soil type] | Soil classification | sandy loam, clay, peat |
| characteristics[soil horizon] | Soil horizon sampled | A horizon, O horizon, B horizon |
| characteristics[land use] | Land use type | agricultural, forest, urban, grassland |
| characteristics[vegetation] | Dominant vegetation | deciduous forest, corn field, prairie |
5.3.3. Host-Associated Samples
| Column Name | Description | Example Values |
|---|---|---|
| characteristics[body site] | Specific body site for host-associated samples | stool, oral cavity, skin, vaginal |
| characteristics[host diet] | Diet information for host | omnivore, vegan, western diet, high-fiber |
| characteristics[antibiotic treatment] | Recent antibiotic exposure | none, amoxicillin 7 days prior, broad-spectrum |
6. Organism Annotation in Metaproteomics
The characteristics[organism] column presents a unique challenge in metaproteomics because samples contain multiple organisms. Several approaches are supported:
6.1. Option 1: Use "metagenome" or Environment-Specific Term
For samples where the community composition is unknown or highly complex:
| characteristics[organism] | Example Samples |
|---|---|
| metagenome | General environmental samples |
| human gut metagenome | Human fecal samples, gut biopsies |
| soil metagenome | Soil samples |
| marine metagenome | Ocean water, marine sediment |
| mouse gut metagenome | Mouse fecal samples |
These terms have corresponding NCBITaxon identifiers (e.g., NCBITaxon:749906 for "gut metagenome").
6.2. Option 2: Specify Host Organism
For host-associated microbiomes where host context is important:
| characteristics[organism] | characteristics[microbiome source] |
|---|---|
| homo sapiens | gut microbiome |
| mus musculus | gut microbiome |
| arabidopsis thaliana | rhizosphere microbiome |
6.3. Option 3: Multiple Organism Columns
For controlled or known communities (e.g., defined consortia, mock communities):
| characteristics[organism] | characteristics[organism] | characteristics[organism] |
|---|---|---|
| escherichia coli | bacteroides fragilis | lactobacillus acidophilus |
Note: The approach should be chosen based on the experimental context and the database search strategy used.
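The three options can be summarized in a small helper. This is only a sketch: the function `organism_columns` and its parameters are hypothetical, and the precedence it applies (known members first, then host, then a metagenome term) is an assumption made for illustration rather than a rule of this template.

```python
ORGANISM = "characteristics[organism]"
MICROBIOME_SOURCE = "characteristics[microbiome source]"


def organism_columns(known_members=None, host=None, microbiome_source=None,
                     metagenome_term="metagenome"):
    """Return (column name, value) pairs for one sample's organism annotation."""
    if known_members:
        # Option 3: one characteristics[organism] column per known member.
        return [(ORGANISM, member) for member in known_members]
    if host:
        # Option 2: the host organism plus the microbiome source.
        if not microbiome_source:
            raise ValueError("a microbiome source is needed when a host is given")
        return [(ORGANISM, host), (MICROBIOME_SOURCE, microbiome_source)]
    # Option 1: a generic or environment-specific metagenome term.
    return [(ORGANISM, metagenome_term)]
```

For example, `organism_columns(host="homo sapiens", microbiome_source="gut microbiome")` yields the first row of the Option 2 table, while `organism_columns(metagenome_term="soil metagenome")` yields an Option 1 annotation.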
7. Database Search Strategy Annotation
Metaproteomics database construction is critical and should be documented:
| Strategy | Description | When to Use |
|---|---|---|
| Matched metagenome | Database from metagenome of same sample | When metagenome sequencing was performed |
| Reference database | Public reference database (RefSeq, UniProt) | When no metagenome available |
| Custom assembly | De novo assembled from sample metagenome | For novel environments |
| GTDB/UHGG | Genome-resolved metagenome databases | For gut microbiome studies |
| Hybrid approach | Combination of reference and sample-specific | Complex communities with some known members |
Document the strategy in comment[database search strategy].
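A curation script could cross-check the strategy against the metagenome accession. The sketch below is illustrative only: `check_database_annotation` is a hypothetical helper, the rule that a "matched metagenome" strategy should come with an accession is an assumption rather than a requirement of this template, and the regular expression covers only INSDC run (SRR/ERR/DRR) and BioProject (PRJNA/PRJEB/PRJDB) accession styles.

```python
import re

STRATEGY = "comment[database search strategy]"
ACCESSION = "comment[metagenome accession]"

# Assumed accession styles: INSDC runs (e.g. SRR12345678) and BioProjects (e.g. PRJNA123456).
_ACCESSION_PATTERN = re.compile(r"^(?:[SED]RR\d{6,}|PRJ[EDN][A-Z]\d+)$")


def check_database_annotation(row):
    """Return a list of problems found in one SDRF row (a dict of column -> value)."""
    problems = []
    strategy = row.get(STRATEGY, "").strip()
    accession = row.get(ACCESSION, "").strip()
    if not strategy:
        problems.append("database search strategy is not documented")
    if strategy.lower() == "matched metagenome" and not accession:
        problems.append("matched metagenome strategy without a metagenome accession")
    if accession and not _ACCESSION_PATTERN.match(accession):
        problems.append(f"unrecognized accession format: {accession}")
    return problems
```

An empty list means the row documents its strategy and, where given, carries an accession in one of the assumed formats.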
8. Mock Community Samples
For defined/mock microbial communities used as standards:
| Column Name | Description | Example Values |
|---|---|---|
| characteristics[mock community] | Identifier or name of mock community standard | ZymoBIOMICS Microbial Community Standard, ATCC MSA-1000 |
| characteristics[mock community composition] | Description of community composition | 8 bacteria + 2 yeasts at defined ratios, even mix of 10 species |
| comment[expected organism list] | List of organisms expected in mock community | E. coli;B. subtilis;S. cerevisiae;L. fermentum |
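Because comment[expected organism list] is a semicolon-separated string in the example above, it can be split and compared against the organisms actually identified. The following is a rough sketch with hypothetical helper names; it assumes names are matched case-insensitively and written the same way in both lists.

```python
def parse_expected_organisms(value):
    """Split a semicolon-separated organism list, dropping empty entries."""
    return [name.strip() for name in value.split(";") if name.strip()]


def unexpected_organisms(expected_value, identified):
    """Return identified organisms that are not in the expected list (sorted)."""
    expected = {name.lower() for name in parse_expected_organisms(expected_value)}
    return sorted(name for name in identified if name.lower() not in expected)
```

For a mock community standard, a non-empty result from `unexpected_organisms` points to possible contamination or false-positive taxonomic assignments.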
9. Example SDRF Files
9.1. Human Gut Microbiome Example
| source name | characteristics[organism] | characteristics[environmental sample type] | characteristics[host disease] | ... | comment[database search strategy] | comment[metagenome accession] | comment[data file] |
|---|---|---|---|---|---|---|---|
| subject_001_stool | human gut metagenome | human gut | inflammatory bowel disease | ... | matched metagenome | SRR12345678 | IBD_001.raw |
| subject_002_stool | human gut metagenome | human gut | normal | ... | matched metagenome | SRR12345679 | healthy_002.raw |
9.2. Environmental Sample Example
| source name | characteristics[organism] | characteristics[environmental sample type] | characteristics[geographic location] | characteristics[depth] | ... | comment[database search strategy] | comment[data file] |
|---|---|---|---|---|---|---|---|
| baltic_station1_summer | marine metagenome | marine sediment | Baltic Sea | 50 m | ... | GTDB marine subset | baltic_sed_001.raw |
| soil_farm_plot1 | soil metagenome | agricultural soil | Iowa, USA | 0-10 cm | ... | custom assembly | soil_plot1.raw |
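SDRF files are tab-separated text, so the environmental example can be produced with Python's standard `csv` module. This is a minimal sketch: the helper `to_sdrf_tsv` is hypothetical, and the columns elided as "..." in the table are simply left out here, so the output is not a complete SDRF file and still needs the core SDRF-Proteomics columns.

```python
import csv
import io

COLUMNS = [
    "source name",
    "characteristics[organism]",
    "characteristics[environmental sample type]",
    "characteristics[geographic location]",
    "characteristics[depth]",
    "comment[database search strategy]",
    "comment[data file]",
]

# Values copied from the environmental sample example table.
ROWS = [
    ["baltic_station1_summer", "marine metagenome", "marine sediment",
     "Baltic Sea", "50 m", "GTDB marine subset", "baltic_sed_001.raw"],
    ["soil_farm_plot1", "soil metagenome", "agricultural soil",
     "Iowa, USA", "0-10 cm", "custom assembly", "soil_plot1.raw"],
]


def to_sdrf_tsv(columns, rows):
    """Serialize a header and rows as tab-separated text."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("every row needs one value per column")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the result of `to_sdrf_tsv(COLUMNS, ROWS)` to a `.sdrf.tsv` file gives a starting point that can then be extended with the remaining columns.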
10. Quality Control Annotations
For metaproteomics, consider documenting:
| Column Name | Description | Example Values |
|---|---|---|
| comment[biomass estimation] | Estimated microbial biomass in sample | 1e9 cells/g, high biomass, low biomass |
| comment[host contamination] | Level of host protein contamination if known | low (<5%), moderate (5-20%), high (>20%) |
| comment[contaminant database] | Contaminant databases used in search | cRAP, MaxQuant contaminants |
11. Best Practices for Metaproteomics Annotation
- Use appropriate organism terms: Choose between metagenome terms, host organism, or explicit multi-organism annotation based on your experimental design.
- Document database strategy: The sequence database used dramatically affects results - always document construction strategy.
- Link to metagenomics data: When available, provide accession numbers for related metagenome/metatranscriptome data.
- Include environmental context: Geographic location, collection date, and environmental parameters enhance data reusability.
- Specify host information: For host-associated microbiomes, document host species, health status, and relevant metadata.
- Note sample handling: Storage conditions and processing time can affect community composition.
- Consider MIxS standards: Align with Minimum Information about any (x) Sequence standards where applicable.
- Use controlled vocabularies: Prefer ENVO terms for environmental classification and NCBITaxon for organisms.
12. Template File
The metaproteomics SDRF template file is available in this directory:
13. Validation
Metaproteomics SDRF files should be validated using the sdrf-pipelines tool:
    pip install sdrf-pipelines
    parse_sdrf validate-sdrf --sdrf_file your_file.sdrf.tsv
Note: Metaproteomics-specific validation rules are under development.
14. Authors and Maintainers
This template was developed by the SDRF-Proteomics community with contributions from metaproteomics researchers.
For questions or suggestions, please open an issue on the GitHub repository.
15. References
- Wilmes P, Bond PL. (2004) The application of two-dimensional polyacrylamide gel electrophoresis and downstream analyses to a mixed community of prokaryotic microorganisms. Environmental Microbiology.
- Muth T, et al. (2015) The MetaProteomeAnalyzer: a powerful open-source software suite for metaproteomics data analysis and interpretation. Journal of Proteome Research.
- Chatterjee S, et al. (2023) A comprehensive and scalable database search system for metaproteomics. Nature Communications.
- ENVO: Environment Ontology (http://www.obofoundry.org/ontology/envo.html)
- MIxS: Minimum Information about any (x) Sequence (https://genomicsstandardsconsortium.github.io/mixs/)