Quick Start Tutorial - SDRF-Proteomics

What is SDRF?

SDRF (Sample and Data Relationship Format) is a simple tab-separated text file, editable in spreadsheet software such as Excel, that describes your proteomics experiment. It connects your biological samples to your mass spectrometry data files.

The Core Concept

Think of SDRF as a table where:

Each row = one sample-to-file relationship
Each column = one piece of information about that sample or file
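
To make the row/column idea concrete, here is a minimal Python sketch that reads a tiny SDRF with the standard library's csv module. The two-row file is embedded as a string and its sample names are hypothetical; a real file would be opened from disk instead.

```python
import csv
import io

# A tiny illustrative SDRF; the values are hypothetical.
SDRF_TEXT = (
    "source name\tcharacteristics[organism]\tcomment[data file]\n"
    "sample_001\thomo sapiens\tsample_001.raw\n"
    "sample_002\thomo sapiens\tsample_002.raw\n"
)

def read_sdrf(text):
    """Parse tab-separated SDRF text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = read_sdrf(SDRF_TEXT)
for row in rows:
    # Each row links one sample to one data file.
    print(row["source name"], "->", row["comment[data file]"])
```

Note that csv.DictReader keeps only the last value when a column name is repeated, so a file with repeated column names would be safer to read with csv.reader.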

Why use SDRF? When you submit data to repositories like PRIDE, SDRF ensures your experiment is fully described and can be reanalyzed by others. It's becoming a standard requirement for proteomics data submission. Learn more in the specification →

Learn from Real Examples

The best way to learn is by example. Browse real SDRF files from published proteomics datasets and use them as templates for your own experiments.

Tip: Find datasets similar to yours by organism, instrument, or acquisition method. You can view the complete SDRF structure and use it as a starting point. Browse Examples in SDRF Explorer →

Step 1: Choose Your Template

Templates are pre-made SDRF files with the right columns already set up for your experiment type. Pick one based on your organism and experiment:

Core Templates (by organism)

Your Samples	Template	Includes
Human samples	human	age, sex, ancestry, ethnicity
Mouse, rat, zebrafish	vertebrates	developmental stage, strain
Insects, worms	invertebrates	developmental stage
Plants	plants	cultivar, growth conditions
Other / Not sure	ms-proteomics	minimal required columns

Specialized Templates (by experiment type)

Experiment Type	Template	Adds
Cell line studies	cell-lines	cell line name, Cellosaurus ID
DIA acquisition	dia-acquisition	DIA isolation window
Immunopeptidomics	immunopeptidomics	HLA alleles, MHC class
Cross-linking MS	crosslinking	crosslinker, cleavability
Single-cell proteomics	single-cell	single cell identifier

Tip: You can combine templates! Start with a core template (e.g., "human") and add columns from specialized templates as needed.

Step 2: Understand Column Types

SDRF columns follow a naming pattern that tells you what kind of information they contain:

characteristics[...]

Sample metadata: describes the biological sample.

Examples: characteristics[organism], characteristics[disease], characteristics[organism part]

comment[...]

Data file metadata: describes the data file or MS run.

Examples: comment[data file], comment[instrument], comment[label]

factor value[...]

What you're studying: the experimental variable you're comparing.

Examples: factor value[disease], factor value[compound], factor value[time]

Important: Column names are case-sensitive and spacing matters!

characteristics[organism] — Correct
Characteristics[organism] — Wrong (capital C)
characteristics [organism] — Wrong (space before bracket)
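
The naming rules above can be checked mechanically. The sketch below is not the official validator: the regular expression and the short list of plain columns are assumptions based only on the columns named in this tutorial.

```python
import re

# Lowercase prefix immediately followed by a bracketed term, nothing else.
BRACKET_COLUMN = re.compile(r"(characteristics|comment|factor value)\[([^\[\]]+)\]")

# Plain (non-bracket) columns that appear in this tutorial; not an exhaustive list.
PLAIN_COLUMNS = {"source name", "assay name", "technology type"}

def classify_column(name):
    """Return the column type, or None if the name breaks the pattern."""
    match = BRACKET_COLUMN.fullmatch(name)
    if match:
        return match.group(1)
    if name in PLAIN_COLUMNS:
        return "plain"
    return None

for name in [
    "characteristics[organism]",
    "Characteristics[organism]",   # capital C
    "characteristics [organism]",  # space before bracket
]:
    print(repr(name), "->", classify_column(name))
```

A return value of None flags a header worth a second look before running the real validator.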

Step 3: Fill in Sample Information

Open your template in Excel, Google Sheets, or any spreadsheet software. For each sample, fill in the biological information:

Column	What to Write	Example	Notes
`source name`	A unique identifier for your biological sample	patient_001	Unique to the sample; repeats for fractions/replicates
`characteristics[organism]`	Species name (lowercase)	homo sapiens	Use scientific name from NCBI Taxonomy
`characteristics[organism part]`	Tissue or body part	liver	Use terms from UBERON
`characteristics[disease]`	Disease name, or "normal"	hepatocellular carcinoma	Use "normal" for healthy samples

Tip: Don't stress about finding exact ontology terms. Write the common name (e.g., "liver", "breast cancer") and the validator will check it for you. You can always refine later.

Step 4: Add Data File Information

For each row, also specify information about the raw file and how it was acquired:

Column	What to Write	Example	Notes
`assay name`	A name for this MS run	run_001	Often same as source name
`comment[label]`	Type of labeling	label free sample	Or TMT126, TMT127N, etc.
`comment[instrument]`	Mass spectrometer used	Q Exactive HF	From PSI-MS ontology
`comment[data file]`	Your raw file name	sample_001.raw	Exact filename including extension

One row = one sample-to-file relationship. In multiplexed experiments (TMT/iTRAQ), multiple samples share the same file, so you'll have multiple rows pointing to the same raw file. In fractionated experiments, one sample spans multiple files, so you'll have multiple rows for the same sample. More about SDRF structure →
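
As a rough illustration of this rule, the following Python sketch groups rows both ways: by data file and by sample. The four (source name, data file) pairs are hypothetical.

```python
from collections import defaultdict

# Hypothetical (source name, data file) pairs, one per SDRF row.
rows = [
    ("sample_A", "multiplex_1.raw"),
    ("sample_B", "multiplex_1.raw"),
    ("sample_C", "fraction_1.raw"),
    ("sample_C", "fraction_2.raw"),
]

samples_per_file = defaultdict(set)
files_per_sample = defaultdict(set)
for source_name, data_file in rows:
    samples_per_file[data_file].add(source_name)
    files_per_sample[source_name].add(data_file)

# A file shared by several samples suggests multiplexing.
multiplexed_files = sorted(f for f, s in samples_per_file.items() if len(s) > 1)
# A sample spread over several files suggests fractionation or repeated runs.
multi_file_samples = sorted(s for s, f in files_per_sample.items() if len(f) > 1)

print("files shared by several samples:", multiplexed_files)
print("samples with several files:", multi_file_samples)
```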

Step 5: Define Your Experimental Variables

Factor values tell analysis tools what you're comparing in your experiment. This is crucial for downstream analysis!

What is a Factor Value?

A factor value is the experimental variable you're studying. If your experiment compares cancer vs. healthy tissue, then disease is your factor. The values would be "hepatocellular carcinoma" and "normal".

Experiment Type	Factor Value Column	Example Values
Disease vs. healthy	`factor value[disease]`	cancer, normal
Drug treatment	`factor value[compound]`	aspirin, DMSO
Time course	`factor value[time]`	0 hour, 6 hour, 24 hour
Tissue comparison	`factor value[organism part]`	liver, kidney, heart
Multiple variables	Multiple factor columns	Both disease AND time

Common point of confusion: factor values often duplicate information from characteristics columns, and that is intentional. The factor value explicitly marks which characteristic is the experimental variable.
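
When a factor value mirrors a characteristics column of the same name, as in the disease example, the two cells in a row should agree. Here is a minimal sketch of that consistency check; the helper name factor_mismatches is hypothetical, and the third row is a deliberate mismatch.

```python
def factor_mismatches(rows, variable):
    """Return indices of rows where factor value[variable] differs
    from characteristics[variable]."""
    factor_col = f"factor value[{variable}]"
    char_col = f"characteristics[{variable}]"
    return [i for i, row in enumerate(rows) if row[factor_col] != row[char_col]]

rows = [
    {"characteristics[disease]": "hepatocellular carcinoma",
     "factor value[disease]": "hepatocellular carcinoma"},
    {"characteristics[disease]": "normal",
     "factor value[disease]": "normal"},
    # Deliberate mismatch for illustration.
    {"characteristics[disease]": "normal",
     "factor value[disease]": "hepatocellular carcinoma"},
]

print(factor_mismatches(rows, "disease"))
```

Not every factor has a matching characteristics column, so this check only applies where both columns exist.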

Complete Example

Here's a minimal valid SDRF file for a human liver cancer study, including all required columns from the human template:

source name	characteristics[organism]	characteristics[organism part]	characteristics[disease]	characteristics[biological replicate]	characteristics[age]	characteristics[sex]	assay name	technology type	comment[proteomics data acquisition method]	comment[label]	comment[instrument]	comment[cleavage agent details]	comment[fraction identifier]	comment[technical replicate]	comment[data file]	factor value[disease]
patient_001	homo sapiens	liver	hepatocellular carcinoma	1	55Y	male	run_001	proteomic profiling by mass spectrometry	Data-dependent acquisition	label free sample	Q Exactive HF	NT=Trypsin;AC=MS:1001251	1	1	patient_001.raw	hepatocellular carcinoma
patient_002	homo sapiens	liver	hepatocellular carcinoma	2	62Y	female	run_002	proteomic profiling by mass spectrometry	Data-dependent acquisition	label free sample	Q Exactive HF	NT=Trypsin;AC=MS:1001251	1	1	patient_002.raw	hepatocellular carcinoma
control_001	homo sapiens	liver	normal	1	48Y	male	run_003	proteomic profiling by mass spectrometry	Data-dependent acquisition	label free sample	Q Exactive HF	NT=Trypsin;AC=MS:1001251	1	1	control_001.raw	normal
control_002	homo sapiens	liver	normal	2	51Y	female	run_004	proteomic profiling by mass spectrometry	Data-dependent acquisition	label free sample	Q Exactive HF	NT=Trypsin;AC=MS:1001251	1	1	control_002.raw	normal

Column groups: sample metadata (characteristics), data file metadata (comment), experimental variable (factor value).

Required columns in this example:

Sample metadata: source name, organism, organism part, disease, biological replicate, age, sex
Data file metadata: assay name, technology type, proteomics data acquisition method, label, instrument, cleavage agent, fraction identifier, technical replicate, data file
Factor value: the experimental variable being compared (disease)

What this example tells us:

2 biological replicates per condition (numbered 1-2 within each factor value group)
No fractionation (fraction identifier = 1 for all)
Single injection per sample (technical replicate = 1)
Label-free DDA proteomics with trypsin digestion on Q Exactive HF
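
The same conclusions can be derived from the table itself. This sketch copies four of the columns from the example rows and recomputes the design summary; it is an illustration, not part of any SDRF tool.

```python
# Selected columns copied from the four example rows above.
rows = [
    {"source name": "patient_001", "factor value[disease]": "hepatocellular carcinoma",
     "comment[fraction identifier]": "1", "comment[technical replicate]": "1"},
    {"source name": "patient_002", "factor value[disease]": "hepatocellular carcinoma",
     "comment[fraction identifier]": "1", "comment[technical replicate]": "1"},
    {"source name": "control_001", "factor value[disease]": "normal",
     "comment[fraction identifier]": "1", "comment[technical replicate]": "1"},
    {"source name": "control_002", "factor value[disease]": "normal",
     "comment[fraction identifier]": "1", "comment[technical replicate]": "1"},
]

# Distinct samples per condition, used here as the biological replicate count.
samples_by_condition = {}
for row in rows:
    condition = row["factor value[disease]"]
    samples_by_condition.setdefault(condition, set()).add(row["source name"])
replicates_per_condition = {c: len(s) for c, s in samples_by_condition.items()}

# Any fraction identifier other than "1" means the samples were fractionated.
fractionated = {row["comment[fraction identifier]"] for row in rows} != {"1"}

# The highest technical replicate number gives the injections per sample.
max_technical_replicate = max(int(row["comment[technical replicate]"]) for row in rows)

print(replicates_per_condition, fractionated, max_technical_replicate)
```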

Step 6: Validate Your File

Before submission, validate your SDRF to catch errors early:

Option 1: Command Line (Recommended)

# Install the validator
pip install sdrf-pipelines

# Validate your file
parse_sdrf validate-sdrf --sdrf_file your_file.sdrf.tsv

Best for batch validation and integration into pipelines.

Option 2: With Template Check

# Validate against a specific template
parse_sdrf validate-sdrf --sdrf_file your_file.sdrf.tsv --template human

Checks that all required columns for your template are present.
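
For pipeline integration, the two commands above can be wrapped in Python. This is a sketch under two assumptions: that parse_sdrf is on your PATH, and that it reports failure with a non-zero exit code. The helper names are hypothetical; only the command-line arguments come from this section.

```python
import subprocess

def build_validate_command(sdrf_path, template=None):
    """Build the parse_sdrf argument list shown in this section."""
    command = ["parse_sdrf", "validate-sdrf", "--sdrf_file", sdrf_path]
    if template is not None:
        command += ["--template", template]
    return command

def validate(sdrf_path, template=None):
    """Run the validator and return True when the exit code is 0.

    Assumption: a non-zero exit code means validation failed.
    """
    result = subprocess.run(build_validate_command(sdrf_path, template))
    return result.returncode == 0

# Only prints the command; call validate(...) to actually run it.
print(build_validate_command("your_file.sdrf.tsv", template="human"))
```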

Validation checks for:

Correct column names and formatting
Valid ontology terms (organism, disease, etc.)
Required columns present
No empty cells where values are required

Common Scenarios

TMT/iTRAQ Multiplexed Samples

For multiplexed experiments, multiple samples share the same raw file. Each sample gets its own row with a different label:

source name	comment[label]	comment[data file]
sample_A	TMT126	multiplex_1.raw
sample_B	TMT127N	multiplex_1.raw
sample_C	TMT127C	multiplex_1.raw

Full TMT documentation →
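
A quick sanity check for multiplexed tables is that no label appears twice within the same raw file. The sketch below assumes each channel carries exactly one sample per file; the helper name duplicate_labels is hypothetical.

```python
from collections import Counter, defaultdict

# (source name, label, data file) triples from the table above.
rows = [
    ("sample_A", "TMT126", "multiplex_1.raw"),
    ("sample_B", "TMT127N", "multiplex_1.raw"),
    ("sample_C", "TMT127C", "multiplex_1.raw"),
]

def duplicate_labels(rows):
    """Return {data file: [labels used more than once in that file]}."""
    labels_by_file = defaultdict(Counter)
    for _source, label, data_file in rows:
        labels_by_file[data_file][label] += 1
    duplicates = {}
    for data_file, counts in labels_by_file.items():
        repeated = sorted(label for label, n in counts.items() if n > 1)
        if repeated:
            duplicates[data_file] = repeated
    return duplicates

print(duplicate_labels(rows))
```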

Fractionated Samples

If you fractionated your sample before MS, add a fraction identifier column:

source name	comment[fraction identifier]	comment[data file]
sample_001	1	sample_001_F01.raw
sample_001	2	sample_001_F02.raw
sample_001	3	sample_001_F03.raw

Fractionation documentation →
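
If your fraction identifiers are meant to run from 1 to n without gaps, as in the table above, a short check can catch a missing row. That numbering convention is an assumption of this sketch, and the helper name missing_fractions is hypothetical.

```python
from collections import defaultdict

# (source name, fraction identifier, data file) triples from the table above.
rows = [
    ("sample_001", "1", "sample_001_F01.raw"),
    ("sample_001", "2", "sample_001_F02.raw"),
    ("sample_001", "3", "sample_001_F03.raw"),
]

def missing_fractions(rows):
    """Return {source name: [missing ids]}, assuming ids run 1..max."""
    fractions = defaultdict(set)
    for source, fraction, _data_file in rows:
        fractions[source].add(int(fraction))
    gaps = {}
    for source, ids in fractions.items():
        missing = sorted(set(range(1, max(ids) + 1)) - ids)
        if missing:
            gaps[source] = missing
    return gaps

print(missing_fractions(rows))
```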

Technical Replicates

Same sample run multiple times? Use the same source name with different assay names and data files:

source name	assay name	comment[technical replicate]	comment[data file]
sample_001	sample_001_rep1	1	sample_001_rep1.raw
sample_001	sample_001_rep2	2	sample_001_rep2.raw
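
Rows like these are repetitive, so they can be generated. The sketch below follows the naming pattern of the table above (sample name plus _rep and the replicate number); that pattern is a convention of this example, not a requirement, and the function name is hypothetical.

```python
def technical_replicate_rows(source_name, n_replicates, extension=".raw"):
    """Build one row per technical replicate of a single sample."""
    rows = []
    for rep in range(1, n_replicates + 1):
        assay_name = f"{source_name}_rep{rep}"
        rows.append({
            "source name": source_name,
            "assay name": assay_name,
            "comment[technical replicate]": str(rep),
            "comment[data file]": assay_name + extension,
        })
    return rows

for row in technical_replicate_rows("sample_001", 2):
    print("\t".join(row.values()))
```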

Cell Line Experiments

For cell lines, include the cell line name and Cellosaurus accession:

source name	characteristics[cell line]	characteristics[cellosaurus accession]
hela_001	HeLa	CVCL_0030
hek_001	HEK293	CVCL_0045

Find accessions at Cellosaurus.

Cell lines template →
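
A format check can catch typos in accessions before validation. This sketch assumes a Cellosaurus accession is "CVCL_" followed by four uppercase letters or digits, which matches the two examples above. It checks the shape only, not whether the accession exists.

```python
import re

# Assumed shape: "CVCL_" plus four uppercase letters or digits.
CELLOSAURUS_ACCESSION = re.compile(r"CVCL_[A-Z0-9]{4}")

def looks_like_cellosaurus(accession):
    """Return True if the string has the assumed accession shape."""
    return CELLOSAURUS_ACCESSION.fullmatch(accession) is not None

for accession in ["CVCL_0030", "CVCL_0045", "cvcl_0030", "CVCL_30"]:
    print(accession, looks_like_cellosaurus(accession))
```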

Common Mistakes to Avoid

Wrong	Correct	Why
Source Name	source name	Column names must be lowercase
characteristics [organism]	characteristics[organism]	No space before the bracket
control (for healthy samples)	normal	Use "normal" for healthy tissue/samples
(empty cell)	not available	Never leave cells empty; use "not available" or "not applicable"
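
Empty cells are easy to miss in a wide spreadsheet. Here is a minimal sketch that fills them with a placeholder; the function name is hypothetical, and whether "not available" or "not applicable" is the right value for a given cell is still your call.

```python
def fill_empty_cells(rows, placeholder="not available"):
    """Return copies of the rows with empty or whitespace-only cells replaced."""
    filled = []
    for row in rows:
        filled.append({
            # (value or "") also covers None, which csv.DictReader uses for short rows.
            column: value if (value or "").strip() else placeholder
            for column, value in row.items()
        })
    return filled

rows = [
    {"source name": "patient_001", "characteristics[age]": ""},
    {"source name": "patient_002", "characteristics[age]": "62Y"},
]
print(fill_empty_cells(rows))
```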

Finding the Right Terms

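SDRF uses ontology terms to ensure consistency. The sources mentioned in the steps above are NCBI Taxonomy for organisms, UBERON for organism parts, the PSI-MS ontology for instruments, and Cellosaurus for cell lines.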

Next Steps

Browse Examples: See real SDRF files from published datasets in ProteomeXchange (SDRF Explorer).
All SDRF Terms: Complete reference of all columns, their requirements, and valid values (SDRF Terms Reference).
Full Specification: Detailed documentation for advanced use cases and edge cases (Read Specification).
Get Help: Questions? Open an issue on GitHub to reach the bigbio team (Ask a Question).
