SDRF-Proteomics

1. Introduction

This document provides guidelines for annotating general sample metadata in SDRF-Proteomics format. These guidelines apply to samples from any organism (human, animal, plant, microorganism).

For human-specific metadata (disease staging, comorbidities, treatment history), see the Human Template.

Version 1.1.0 - 2026-01

2. Best Practices

  1. Use lowercase for all controlled vocabulary values (except proper nouns in disease names).

  2. Use ontology terms mapped to MONDO (diseases), CL (cell types), UBERON (anatomy), PATO (phenotypes).

  3. Be consistent with format across all samples in a dataset.

  4. Document unknowns explicitly: use not available when a value is unknown and not applicable when the field is not relevant. Never leave cells empty.

  5. Validate before submission using sdrf-pipelines to check ontology mappings.
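
As a rough illustration of the "never leave cells empty" practice, the sketch below scans a tab-delimited SDRF table for blank cells. It is not part of sdrf-pipelines; the function name find_empty_cells and the sample rows are hypothetical and used only for illustration.

```python
import csv
import io


def find_empty_cells(sdrf_text):
    """Return (row_number, column_name) pairs for empty or whitespace-only cells.

    Assumes a tab-delimited table whose first row is the header.
    """
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader)
    problems = []
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        # Treat cells missing at the end of a short row as empty.
        padded = row + [""] * (len(header) - len(row))
        for column_name, cell in zip(header, padded):
            if not cell.strip():
                problems.append((row_number, column_name))
    return problems


# Hypothetical example data: sample_2 has no age value.
example = (
    "source name\tcharacteristics[organism]\tcharacteristics[age]\n"
    "sample_1\thomo sapiens\t40Y\n"
    "sample_2\thomo sapiens\t\n"
)
print(find_empty_cells(example))
```

Each reported cell would then be filled with not available or not applicable, as appropriate.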

3. General Formatting Conventions

3.1. Capitalization Rules

Most controlled vocabulary values should be lowercase:

  • Organism names: homo sapiens, mus musculus

  • Organism parts: blood, liver, brain

  • Sex values: male, female

Exceptions (retain proper noun capitalization):

  • Ancestry categories (geographic populations): African, European, South Asian

  • Cell line names: HeLa, HEK293, K562

Note
Validators should normalize common capitalization variations (e.g., accept both Homo sapiens and homo sapiens), but submitters should use lowercase for consistency.
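
The normalization described in this note could look roughly like the following sketch. The set of lowercase columns is an assumption drawn from the examples in this section, and normalize_value is a hypothetical helper, not a function of any existing validator.

```python
# Columns whose values this section recommends writing in lowercase.
# This set is an assumption for illustration, not an exhaustive list.
LOWERCASE_COLUMNS = {
    "characteristics[organism]",
    "characteristics[organism part]",
    "characteristics[sex]",
}


def normalize_value(column, value):
    """Lowercase values in lowercase-only columns; keep other values as written."""
    cleaned = value.strip()
    if column.strip().lower() in LOWERCASE_COLUMNS:
        return cleaned.lower()
    # Cell line names, ancestry categories, etc. keep their capitalization.
    return cleaned


print(normalize_value("characteristics[organism]", "Homo sapiens"))
print(normalize_value("characteristics[cell line]", "HeLa"))
```

The design choice here is an explicit allow-list of columns, so that proper nouns in other columns are never lowercased by accident.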

4. Organism

Column: characteristics[organism]

Use the scientific name in lowercase. The validator will map to the correct ontology term.

Value | NCBITaxon ID | Description
homo sapiens | NCBITaxon:9606 | Human
mus musculus | NCBITaxon:10090 | Mouse

5. Organism Part / Tissue

Column: characteristics[organism part]

Ontology: UBERON for mammals/vertebrates, Plant Ontology (PO) for plants, FlyBase Anatomy (FBbt) for Drosophila.

Use lowercase for all values. For cell line samples, use not applicable or specify the tissue of origin (e.g., cervix for HeLa).

Note

Do NOT use characteristics[tissue] - Use characteristics[organism part] instead.

The characteristics[tissue] column should not be used as a replacement for characteristics[organism part]. When tissue-related columns are needed, they should be used alongside characteristics[organism part] for additional categorization (e.g., characteristics[tissue supergroup] for higher-level grouping).

Value | UBERON ID | Description
blood plasma | UBERON:0001969 | Blood plasma
liver | UBERON:0002107 | Liver tissue
brain | UBERON:0000955 | Brain
heart | UBERON:0000948 | Heart

6. Age

Column: characteristics[age]

Format: {Number}{Unit} where Unit is: Y (Years), M (Months), W (Weeks), D (Days).

6.1. Age Formats

Age Type | Format | Example | Description
Exact age | {Number}{Unit} | 40Y | 40 years old
Exact age | {Number}{Unit} | 8W | 8 weeks old
Age range | {Lower}{Unit}-{Upper}{Unit} | 40Y-50Y | Between 40 and 50 years
Greater than | >{Number}{Unit} | >18Y | Older than 18 years
Less than | <{Number}{Unit} | <65Y | Younger than 65 years
Greater or equal | >={Number}{Unit} | >=21Y | 21 years or older

Important

When the exact age is unavailable, consider using:

  1. Age ranges (e.g., 40Y-50Y) when approximate age is known

  2. Comparison operators (e.g., >18Y) when only bounds are known

  3. Developmental stage (e.g., adult, juvenile, embryonic) when age is unknown but developmental stage can be inferred from other metadata

Developmental stage is useful for animal studies, pooled samples, or historical samples where exact age is unavailable.
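
The age formats listed in section 6.1 can be checked with a regular expression. The sketch below is a minimal example that assumes whole-number ages and only the operators shown in the table (>, <, >=); the names AGE_PATTERN and is_valid_age are hypothetical.

```python
import re

# One age token: a whole number followed by Y, M, W, or D (assumption: no decimals).
_AGE = r"\d+[YMWD]"

# Either a range (40Y-50Y) or a single age with an optional >=, >, or < prefix.
AGE_PATTERN = re.compile(rf"^(?:{_AGE}-{_AGE}|(?:>=|>|<)?{_AGE})$")

# Placeholder values this document allows in place of a real value.
SPECIAL_VALUES = {"not available", "not applicable"}


def is_valid_age(value):
    """Return True if value matches one of the age formats in section 6.1."""
    return value in SPECIAL_VALUES or bool(AGE_PATTERN.match(value))


for candidate in ["40Y", "8W", "40Y-50Y", ">18Y", "<65Y", ">=21Y", "40 years"]:
    print(candidate, is_valid_age(candidate))
```

The sketch does not check that the lower bound of a range is smaller than the upper bound, or that both bounds use the same unit.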

7. Sex

Column: characteristics[sex]

Requirement: REQUIRED for human samples, OPTIONAL for other organisms. Allowed values are drawn from the PATO ontology under PATO:0000047 (biological sex).

Allowed values:

Value | Description
male | Male biological sex
female | Female biological sex

Note
For cell lines, use the sex of the original donor if known (e.g., female for HeLa cells), otherwise use not available. Use anonymized when the information exists but has been redacted for privacy reasons (e.g., clinical studies with de-identified data).

8. Developmental Stage

Column: characteristics[developmental stage]

Ontology: UBERON, EFO

Value | Description
adult | Sexually mature organism
embryonic | Pre-birth/pre-hatching

For model organisms, use specific stages when applicable: embryonic day 14 (mouse E14), 24 hpf (zebrafish 24 hours post fertilization).

9. Disease

Column: characteristics[disease]

9.1. Healthy Samples: normal vs healthy vs control

For samples without disease, the terminology matters for standardization:

normal (PATO:0000461) - Recommended

  • Standard term in pathology ("normal tissue" vs "tumor tissue")

  • Well-defined ontology mapping to PATO:0000461

  • Widely used in existing proteomics datasets

healthy (SIO:001012) - Accepted alternative

  • More intuitive for clinical/human samples

  • Has valid ontology support (Semanticscience Integrated Ontology)

  • Validators should accept both normal and healthy

control - Avoid as a disease state

  • "Control" is an experimental design concept, not a disease state

  • Ambiguous: a control could be healthy, vehicle-treated, time-zero, or even a different disease used for comparison

The characteristics[disease] column captures the actual disease state of the sample, while factor value[disease] indicates the experimental comparison groups:

source name | ... | characteristics[disease] | ... | factor value[disease]
healthy_1 | ... | normal | ... | normal
tumor_1 | ... | breast carcinoma | ... | breast carcinoma
adjacent_1 | ... | normal | ... | normal

In this example:

  • healthy_1: Healthy control sample from a healthy individual

  • tumor_1: Disease case sample (tumor tissue)

  • adjacent_1: Adjacent normal tissue from a cancer patient

Both healthy_1 and adjacent_1 have normal as their disease state (no tumor cells), but they come from different individuals. The experimental design and comparison groups are defined by the factor values.
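
The normal / healthy / control guidance in this section can be expressed as a small lookup. This is only a sketch: the category labels and the function review_disease_value are hypothetical, and treating "controls" like "control" is an assumption.

```python
RECOMMENDED = {"normal"}                # PATO:0000461
ACCEPTED = {"healthy"}                  # SIO:001012
DISCOURAGED = {"control", "controls"}   # assumption: plural handled the same way


def review_disease_value(value):
    """Classify a characteristics[disease] value for disease-free samples."""
    cleaned = value.strip().lower()
    if cleaned in DISCOURAGED:
        return "discouraged"  # experimental design concept, not a disease state
    if cleaned in ACCEPTED:
        return "accepted"
    if cleaned in RECOMMENDED:
        return "recommended"
    return "other"  # e.g. an actual disease name such as breast carcinoma


for value in ["normal", "healthy", "Control", "breast carcinoma"]:
    print(value, "->", review_disease_value(value))
```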

9.2. Disease Examples

Disease | SDRF Value | Ontology Link
Healthy/no disease | normal | PATO:0000461
Healthy (alternative) | healthy | SIO:001012
Breast cancer | breast carcinoma | MONDO:0004989

Tip
For animal disease models, use the human disease name to facilitate cross-species comparisons.

10. Phenotype

Column: characteristics[phenotype]

Ontology: EFO:0000651

Requirement: Optional

Describes observable characteristics or traits of a sample that result from genotype, environment, and treatment interactions. Captures how the sample behaves or appears, not what disease it has.

Column | Purpose | Example
characteristics[disease] | What disease/pathology is present | "breast cancer", "Alzheimer disease"
characteristics[genotype] | What genetic variant is present | "KRAS G12D mutant", "wild type"
characteristics[phenotype] | How the sample behaves/appears | "gefitinib sensitive", "HER2 positive"

Common phenotype categories:

  • Drug response: gefitinib sensitive, cisplatin resistant

  • Molecular markers: HER2 positive, ER positive

  • Expression states: overexpressing Neurogenin3, FOXP3 expression

  • Functional traits: undifferentiated, adipogenic

  • Environmental responses: high fat diet, heat shock response

When NOT to use phenotype: For disease names use characteristics[disease], for genetic variants use characteristics[genotype], for treatments use characteristics[compound] or characteristics[treatment].

11. Cell Type

Column: characteristics[cell type]

Ontology: Cell Ontology (CL)

For cell lines, optionally include the cell type of origin.

Value | CL Link
epithelial cell | CL:0000066
T cell | CL:0000084
B cell | CL:0000236
macrophage | CL:0000235
fibroblast | CL:0000057

For detailed immune cell studies, use specific subtypes: CD4-positive, alpha-beta T cell (CL:0000624), regulatory T cell (CL:0000815).

12. Material Type

Column: characteristics[material type]

Ontology: PRIDE:0000837

Material type describes the nature of the biological material being analyzed.

Requirement: Optional

Allowed values:

Value | Description | When to Use
tissue | Solid tissue sample from an organism | Biopsies, surgical specimens, autopsy samples
cell | Individual cells or cell suspensions | FACS-sorted cells, primary cell cultures, dissociated tissue
cell line | Established immortalized cell line | HeLa, HEK293, A549, etc.
organism part | A part of an organism's anatomy | Organ samples, body fluids (when not whole tissue)
whole organism | Complete organism sample | Single-celled organisms, small model organisms (C. elegans, yeast)
synthetic | Artificially synthesized material | Synthetic peptide libraries, recombinant proteins

Special values:

  • not available: When material type is unknown

  • not applicable: When the concept doesn’t apply (e.g., for computational datasets)

Examples:

Sample Type | characteristics[material type] | characteristics[organism part] | characteristics[cell line]
Liver biopsy | tissue | liver | not applicable
HeLa cells | cell line | cervix | HeLa
Sorted T cells | cell | blood | not applicable
Mouse whole brain | tissue | brain | not applicable
E. coli culture | whole organism | not applicable | not applicable
Synthetic peptides | synthetic | not applicable | not applicable

Note
For cell line samples, use cell line as the material type. The specific cell line name goes in characteristics[cell line]. The tissue of origin can be specified in characteristics[organism part].
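
A minimal sketch of how the allowed values and the cell line note above could be checked for a single row. The function check_material_type is hypothetical, and the second rule (a cell line sample should carry a cell line name) is an interpretation of the note, not a rule taken from an existing validator.

```python
ALLOWED_MATERIAL_TYPES = {
    "tissue", "cell", "cell line", "organism part", "whole organism",
    "synthetic", "not available", "not applicable",
}


def check_material_type(row):
    """Return a list of problems for one SDRF row given as a column -> value dict."""
    column = "characteristics[material type]"
    if column not in row:
        return []  # the column is optional
    problems = []
    material = row[column].strip()
    if material not in ALLOWED_MATERIAL_TYPES:
        problems.append(f"unexpected material type: {material!r}")
    # Assumption: a 'cell line' sample should name the line in characteristics[cell line].
    cell_line = row.get("characteristics[cell line]", "").strip()
    if material == "cell line" and cell_line in {"", "not applicable"}:
        problems.append("material type is 'cell line' but no cell line name is given")
    return problems


print(check_material_type({
    "characteristics[material type]": "cell line",
    "characteristics[organism part]": "cervix",
    "characteristics[cell line]": "HeLa",
}))
```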

Column | Description | Example Values | When to Use | Example Dataset
characteristics[treatment] | Treatment applied to the sample | dexamethasone, vehicle control, untreated | Drug treatment studies | PXD017291
characteristics[time point] | Time of sample collection | 0h, 24h, 7d, baseline | Time-course experiments | PXD007555
characteristics[dose] | Dose of treatment if applicable | 10 mg/kg, 100 nM, high dose | Dose-response studies |
characteristics[patient bmi] | Body mass index (for human studies) | 25.3 kg/m2, 30.1 kg/m2 | Metabolic or obesity-related studies | PXD005780
characteristics[smoking status] | Smoking history of the patient | never smoked, current smoker, former smoker | Lung or cardiovascular studies |

For additional columns, see the SDRF Terms Reference and the Human Template.

13. PTM Enrichment

Column: characteristics[enrichment process]

Ontology: EFO

Value | Description
enrichment of phosphorylated Protein | Phosphoproteomics enrichment
not applicable | No PTM enrichment performed

14. Depletion

Column: characteristics[depletion]

Use this column for blood/plasma samples to indicate whether abundant proteins were depleted.

Values: no depletion, depletion, not applicable.

15. Patient-Derived Xenografts (PDX)

Column: characteristics[xenograft]

When annotating PDX samples, metadata (age, sex) MUST refer to the original patient, not the host organism.

source name | characteristics[organism] | characteristics[xenograft] | characteristics[age]
tumor_001 | homo sapiens | not applicable | 65Y
pdx_001 | homo sapiens | pancreatic adenocarcinoma grown in nude mice | 65Y

16. Synthetic Peptide Libraries

Column: characteristics[synthetic peptide]

Values: synthetic (sample is a synthetic peptide library), not synthetic (biological sample).

For synthetic libraries, most sample metadata can be not applicable. The organism MAY be specified if the library was designed from the peptides of a specific species.

17. Spiked-in Samples

Column: characteristics[spiked compound]

For samples spiked with peptides, proteins, or mixtures (e.g., for quantification standards or retention time alignment), use key-value pairs:

Key | Meaning | Example | Required for
CT | Compound type | peptide, protein, mixture | All
QY | Quantity | 10 fmol, 20 nmol | All
PS | Peptide sequence | PEPTIDESEQ | Peptides
AC | UniProt accession | A9WZ33 | Proteins
CN | Compound name | iRT mixture | Optional
CV | Compound vendor | Biognosys | Mixtures (required)

Example: characteristics[spiked compound]: CT=peptide;PS=PEPTIDESEQ;QY=10 fmol

The injected mass of the main sample SHOULD be specified in characteristics[mass]. For multiple spiked components, repeat the column. If the spiked component is another biological sample (e.g., E. coli lysate), annotate it in its own row with characteristics[mass] specified for both components.
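
The key-value syntax for characteristics[spiked compound] can be parsed and checked against the "Required for" column of the key table. The sketch below is one possible approach; the helper names parse_spiked_compound and missing_keys are hypothetical.

```python
# Keys required for every spiked compound, plus keys required per compound type,
# following the "Required for" column of the key table.
ALWAYS_REQUIRED = {"CT", "QY"}
REQUIRED_BY_TYPE = {"peptide": {"PS"}, "protein": {"AC"}, "mixture": {"CV"}}


def parse_spiked_compound(value):
    """Split a 'KEY=value;KEY=value' string into a dict."""
    fields = {}
    for part in value.split(";"):
        key, separator, field_value = part.partition("=")
        if not separator:
            raise ValueError(f"expected KEY=value, got {part!r}")
        fields[key.strip()] = field_value.strip()
    return fields


def missing_keys(fields):
    """Return the required keys that are absent, sorted alphabetically."""
    required = ALWAYS_REQUIRED | REQUIRED_BY_TYPE.get(fields.get("CT", ""), set())
    return sorted(required - fields.keys())


parsed = parse_spiked_compound("CT=peptide;PS=PEPTIDESEQ;QY=10 fmol")
print(parsed)
print(missing_keys(parsed))
```

For the example value from this section, missing_keys returns an empty list; a protein entry without an AC key would be reported as missing AC.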

18. Ontologies