This guide covers template selection, combination rules, key column guidance, format conventions, and YAML schema definitions. It complements the per-template YAML definitions and auto-generated documentation.
Source of truth: The YAML file for each template (located in sdrf-templates/) is the authoritative definition of its columns, validators, and metadata. To add, remove, or modify a column, edit the corresponding YAML file. All other documentation, including this guide, is derived from or supplements those YAMLs.
YAML templates do not define every possible column. Templates define the most common columns with their validators and requirement levels. Additional terms listed in TERMS.tsv (e.g., characteristics[xenograft], characteristics[mass]) can be added to any SDRF file without needing a corresponding YAML definition. The YAML templates capture what is most frequently needed; TERMS.tsv serves as a broader registry of recognised column names and ontology mappings.
Template files location: sdrf-templates/
1. Template Architecture
1.1. Layers
Templates are organized in a hierarchy with two internal construction layers (base, sample-metadata) and three user-facing layers (technology, sample, experiment):
base                  Infrastructure (identifiers, data files, versioning)    [internal]
sample-metadata       Shared sample biology (organism, tissue, disease)       [internal]
technology templates  MS or affinity proteomics columns                       [layer: technology]
organism templates    Species-specific metadata (human, vertebrates, etc.)    [layer: sample]
clinical templates    Clinical/oncology metadata                              [layer: sample]
experiment templates  Technique-specific columns (DIA, crosslinking, etc.)    [layer: experiment]
1.2. Inheritance
Every template inherits all columns from its parent chain. Child templates can override parent column properties (e.g., changing a column from optional to required).
base (construction artifact - not user-facing)
├── sample-metadata (intermediate - sample columns shared by most templates)
│   ├── ms-proteomics (technology)
│   │   ├── dia-acquisition (experiment)
│   │   ├── crosslinking (experiment)
│   │   ├── immunopeptidomics (experiment)
│   │   └── single-cell (experiment)
│   ├── affinity-proteomics (technology)
│   │   ├── olink (experiment)
│   │   └── somascan (experiment)
│   ├── human (sample)
│   ├── vertebrates (sample)
│   ├── invertebrates (sample)
│   ├── plants (sample)
│   ├── cell-lines (experiment)
│   ├── clinical-metadata (sample)
│   │   └── oncology-metadata (sample)
│   └── (columns: organism, organism part, cell type, disease, biological replicate...)
└── metaproteomics (sample - MIxS-aligned, excludes sample-metadata when combined)
    ├── human-gut (sample)
    ├── soil (sample)
    └── water (sample)
1.3. Version Pinning in extends
The extends field supports an optional version constraint using the @ separator. This controls which version of the parent template a child inherits from, preventing unintended breaking changes when parent templates are updated.
Formats:
| Format | Example | Meaning |
|---|---|---|
| Name only (latest) | | Always inherits from the latest available version. Backwards compatible with existing templates. |
| Lower bound | | Inherits from the latest version >= 1.1.0. Follows Python/PEP 440 version specifier syntax. |
| Explicit range | | Inherits from the latest version within [1.1.0, 2.0.0). Use when an upper bound is needed. |
| Exact version | | Inherits from exactly that version. Use sparingly: it prevents receiving patch fixes. |
Recommended convention: Use lower-bound constraints (>=major.minor.0) for all templates. Add an upper bound (<next_major.0.0) only when a known incompatible major version exists. This ensures:
- Patch and minor updates propagate automatically: bug fixes, new optional columns, and documentation changes in the parent are inherited without child template changes.
- Upper bounds are added when needed: once a parent releases a breaking major version, children can add <next_major.0.0 to stay on the compatible range.
Example: When ms-proteomics is at version 1.1.0:
# single-cell inherits from ms-proteomics >=1.1.0
extends: ms-proteomics@>=1.1.0
# If ms-proteomics is bumped to 1.2.0: single-cell automatically inherits 1.2.0 ✓
# If ms-proteomics is bumped to 2.0.0: single-cell inherits 2.0.0 (review needed)
# → If incompatible, add upper bound: extends: ms-proteomics@>=1.1.0,<2.0.0
Resolution order: When loading templates, the registry resolves version constraints by:
- Scanning all available versions on disk for the parent template
- Filtering to versions matching the constraint
- Selecting the latest matching version
- Falling back to the manifest latest if no match (with a warning)
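The resolution steps above can be sketched in a few lines of Python. This is a minimal illustration, not the registry's actual code: the function names (parse_version, matches, resolve_parent) are hypothetical, only plain major.minor.patch versions are handled (no -dev suffixes), and the == operator for exact pins is assumed from PEP 440.

```python
import re

def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

# Comparison operators supported by this sketch (a subset of PEP 440).
_OPS = {
    ">=": lambda v, bound: v >= bound,
    "<=": lambda v, bound: v <= bound,
    "==": lambda v, bound: v == bound,
    ">": lambda v, bound: v > bound,
    "<": lambda v, bound: v < bound,
}

def matches(version, constraint):
    """Check a version against a comma-separated constraint such as '>=1.1.0,<2.0.0'."""
    for clause in constraint.split(","):
        m = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([\d.]+)\s*", clause)
        if m is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = m.groups()
        if not _OPS[op](parse_version(version), parse_version(bound)):
            return False
    return True

def resolve_parent(extends, available, manifest_latest):
    """Return (parent_name, resolved_version, used_fallback)."""
    name, _, constraint = extends.partition("@")
    if not constraint:
        # Name only: take the latest available version.
        return name, max(available, key=parse_version), False
    candidates = [v for v in available if matches(v, constraint)]
    if candidates:
        return name, max(candidates, key=parse_version), False
    # No version satisfies the constraint: fall back (a real loader would warn).
    return name, manifest_latest, True

print(resolve_parent("ms-proteomics@>=1.1.0,<2.0.0", ["1.1.0", "1.2.0", "2.0.0"], "2.0.0"))
```

The third element of the result signals that the fallback was used, which is where a loader would emit the warning mentioned above.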
1.4. Layer Definitions
| Layer | Purpose | Examples |
|---|---|---|
| base | Internal construction artifact with infrastructure columns (source name, assay name, technology type, etc.). Never used directly by users. | base |
| sample-metadata | Intermediate template with sample-level columns shared by most templates (organism, organism part, cell type, biological replicate, pooled sample, disease, biosample accession). Extends base. | sample-metadata |
| technology | Defines the proteomics technology. Exactly one required per SDRF. | ms-proteomics, affinity-proteomics |
| sample | Defines organism-specific sample metadata. One recommended per SDRF. | human, vertebrates, invertebrates, plants |
| experiment | Defines methodology-specific columns. Multiple can be combined if not mutually exclusive. | dia-acquisition, single-cell, olink, cell-lines |
1.5. Combination Rule
An SDRF file is built by combining templates:
- ONE technology template (required): ms-proteomics or affinity-proteomics
- ONE sample template (recommended): either an organism template (human, vertebrates, invertebrates, plants) OR a metaproteomics template (human-gut, soil, water). These two groups are mutually exclusive.
- ZERO or more experiment templates: dia-acquisition, single-cell, crosslinking, immunopeptidomics (extend ms-proteomics), olink, somascan (extend affinity-proteomics)
- ZERO or more clinical templates: clinical-metadata, oncology-metadata
- ZERO or one cell-lines template: when samples are cultured cell lines
When templates are combined, columns from all selected templates are merged. If the same column appears in multiple templates, the most specific definition wins (experiment > technology > sample-metadata). Templates may declare excludes to remove columns from other templates in the combination (see Column Exclusion).
1.6. Mutual Exclusivity
- Technology: ms-proteomics and affinity-proteomics cannot be combined
- Organism: only one of human, vertebrates, invertebrates, plants per study (multi-species studies use the most specific applicable template)
- Metaproteomics: metaproteomics (and its children human-gut, soil, water) are mutually exclusive with organism templates (human, vertebrates, invertebrates, plants). Only one metaproteomics child template per study.
- Platform: olink and somascan cannot be combined
- Affinity experiments: olink/somascan require affinity-proteomics (not ms-proteomics)
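The exclusivity rules above lend themselves to a simple set-based check. The following is a sketch only: check_combination and the hard-coded sets are hypothetical and stand in for whatever the validator derives from the mutually_exclusive_with and requires fields of each template.

```python
# Template groups taken from the rules above (hard-coded for illustration).
TECHNOLOGY = {"ms-proteomics", "affinity-proteomics"}
ORGANISM = {"human", "vertebrates", "invertebrates", "plants"}
METAPROTEOMICS = {"metaproteomics", "human-gut", "soil", "water"}
AFFINITY_EXPERIMENTS = {"olink", "somascan"}

def check_combination(templates):
    """Return a list of rule violations for the chosen template names."""
    chosen = set(templates)
    problems = []
    if len(chosen & TECHNOLOGY) != 1:
        problems.append("exactly one technology template is required")
    if len(chosen & ORGANISM) > 1:
        problems.append("only one organism template per study")
    if chosen & ORGANISM and chosen & METAPROTEOMICS:
        problems.append("organism and metaproteomics templates are mutually exclusive")
    if len(chosen & (METAPROTEOMICS - {"metaproteomics"})) > 1:
        problems.append("only one metaproteomics child template per study")
    if AFFINITY_EXPERIMENTS <= chosen:
        problems.append("olink and somascan cannot be combined")
    if chosen & AFFINITY_EXPERIMENTS and "affinity-proteomics" not in chosen:
        problems.append("olink/somascan require affinity-proteomics")
    return problems

print(check_combination(["ms-proteomics", "human", "olink"]))
```

An empty list means the combination passes these checks; the sketch does not cover clinical or cell-lines rules.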
2. How to Choose Templates
2.1. By Study Type
| Study type | Templates to combine |
|---|---|
| Human clinical DDA proteomics | |
| Human cancer proteomics | |
| Mouse DDA proteomics | |
| Plant stress proteomics | |
| Human cell line drug screen | |
| Human XL-MS structural study | |
| In-vitro XL-MS (purified proteins) | |
| Human immunopeptidomics / HLA ligandomics | |
| Mouse immunopeptidomics / H-2 ligandomics | |
| Human DIA proteomics | |
| Mouse DIA proteomics | |
| Human single-cell proteomics | |
| Human gut metaproteomics | |
| Soil metaproteomics | |
| Aquatic metaproteomics (ocean, lake, river) | |
| Human Olink plasma study | |
| Human SomaScan serum study | |
| Drosophila phosphoproteomics | |
| Zebrafish developmental proteomics | |
| Arabidopsis drought response proteomics | |
| Mouse cancer model proteomics | |
2.2. By Organism
| Organism | Template |
|---|---|
| Homo sapiens | |
| Mouse, rat, zebrafish, other vertebrates | |
| Drosophila, C. elegans, insects | |
| Arabidopsis, rice, maize, crops | |
| Cultured cell lines (any origin) | |
2.3. By Technology
| Technology | Template |
|---|---|
| Any mass spectrometry proteomics | |
| DDA (data-dependent acquisition) | |
| DIA / SWATH / diaPASEF | |
| TMT / iTRAQ / SILAC | |
| Label-free quantification | |
| Olink PEA | |
| SomaScan aptamer | |
3. Format Conventions
3.1. Column Naming
- All column names inside brackets must be lowercase: characteristics[disease], comment[dia method], not characteristics[Disease] or comment[DIA method].
- Prefix: characteristics[…] for sample properties, comment[…] for technical/analytical metadata.
3.2. Ontology Terms
- Use lowercase for organism names: homo sapiens, not Homo sapiens
- Use ontology-controlled vocabulary wherever validators specify ontology validation
3.3. CV Term Format (Controlled Vocabulary)
Used for enzymes, modifications, instruments, and other ontology-mapped values:
NT={name};AC={accession}
Examples:
- NT=Trypsin;AC=MS:1001251
- NT=Oxidation;MT=Variable;TA=M;AC=Unimod:35
- NT=DSS;AC=XLMOD:02001
3.4. Numbers with Units
Format: {value} {unit} (space-separated)
- Mass tolerance: 10 ppm, 0.02 Da
- Collision energy: 30 NCE, 27 eV
- Tissue mass: 50 mg, 1 g
- Cell diameter: 15 um
- Temperature: 25 °C, -80 °C
- Time: 24 hour, 5 day, 30 minute
3.5. Age Format (Human)
See Age in SAMPLE-GUIDELINES for the complete format reference, including compound ages (30Y6M), ranges (40Y-50Y), and comparison operators (>=21Y).
3.6. Special Values
- not available: value exists but is unknown
- not applicable: column does not apply to this sample
- normal: healthy/control for disease columns (PATO:0000461)
- untreated: no treatment applied (for treatment columns)
- pooled: pooled sample (for biological replicate)
3.7. Multiple Values
Some columns allow multiple values (cardinality: multiple):
- Use multiple columns with the same header name
- Example: characteristics[organism part] can appear twice for mixed-tissue samples
3.8. Accession Formats
- BioSample: SAMN12345678, SAMEA12345678, SAMD1234567
- Cellosaurus: CVCL_0030, CVCL_0004
- Metagenome: MGYA00001234 (ENA), SRP123456 (SRA)
4. Value Types Reference
SDRF cell values are not parsed by inspecting the value itself. Instead, each column’s YAML template definition declares which value types are accepted through a list of validators. The validator determines how the value is interpreted and validated.
This section provides a formal reference of all supported value types, their patterns, and examples.
4.1. How Value Types Are Determined
The parsing process for any SDRF cell value follows this logic:
1. Look up the column name in the active YAML template(s)
2. Retrieve the declared list of validators for that column
3. Validate the cell value against each declared validator
4. If no validators are declared, the value is accepted as free text
The value type is a property of the column definition, not of the value itself. The same string 10 could be valid as an identifier, a free-text value, or invalid (if the column requires a unit), depending on the column’s declared validators.
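The lookup-then-validate logic can be illustrated with a small dispatch table. This is a simplified sketch under stated assumptions: validate_cell, VALIDATORS, and the toy TEMPLATE (including the comment[run id] and characteristics[note] columns) are hypothetical, and only two validator types are modelled.

```python
import re

def _values(cell, params):
    # Enumerated list, compared case-insensitively.
    return cell.lower() in {v.lower() for v in params["values"]}

def _pattern(cell, params):
    flags = 0 if params.get("case_sensitive", True) else re.IGNORECASE
    return re.fullmatch(params["pattern"], cell, flags) is not None

# Hypothetical registry mapping validator_name to a checking function.
VALIDATORS = {"values": _values, "pattern": _pattern}

# Toy column definitions standing in for the active YAML template(s).
TEMPLATE = {
    "characteristics[sex]": [
        {"validator_name": "values", "params": {"values": ["male", "female", "intersex"]}},
    ],
    "comment[run id]": [
        {"validator_name": "pattern", "params": {"pattern": r"\d+"}},
    ],
}

def validate_cell(column, cell, template):
    """Return the names of the declared validators that reject the cell value."""
    declared = template.get(column, [])  # steps 1-2: look up column, get validators
    failures = []
    for spec in declared:                # step 3: check against each validator
        check = VALIDATORS[spec["validator_name"]]
        if not check(cell, spec.get("params", {})):
            failures.append(spec["validator_name"])
    return failures                      # step 4: no validators means free text

for column in ("comment[run id]", "characteristics[sex]", "characteristics[note]"):
    print(column, validate_cell(column, "10", TEMPLATE))
```

The same string 10 passes, fails, or is accepted as free text depending only on the column it appears in.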
4.2. Value Type Catalog
The following table lists all supported value types. The Type Name corresponds to the validator_name used in YAML template definitions. The Pattern column shows the formal regular expression or grammar that values must match. For complete validator configuration options, see Column-Level Validators.
| Type Name | Description | Pattern / Grammar | Examples |
|---|---|---|---|
| ontology | Ontology-controlled term. Accepts free text (exact ontology term name), ontology URI, or key=value representation. | Free text matching an ontology term, or | |
| values | Enumerated list. Value must exactly match one of the allowed entries (case-insensitive). | Exact match from a declared list | |
| number_with_unit | Numeric value followed by a unit string, separated by optional whitespace. | | |
| pattern | Custom regular expression. Used for domain-specific formats not covered by other validators (e.g., age format, collision energy steps). | Any valid regular expression declared in | |
| accession | Database accession identifier with a known prefix and suffix pattern. | | |
| identifier | Alphanumeric identifier with configurable character set. | | |
| date | ISO 8601 date at configurable precision. | | |
| structured_kv | Semicolon-separated key=value pairs with declared keys and value patterns. | | |
| semver | Semantic version string with optional prefix and prerelease suffix. | | |
| single_cardinality_validator | Ensures a cell contains a single value (no semicolon-separated lists). Used in combination with other validators. | No semicolons allowed in value | (any single value) |
4.3. Sentinel Values
Regardless of value type, most columns accept the following reserved words. These are controlled at the column level via the allow_not_available and allow_not_applicable flags in the YAML definition:
- not available: the value exists but is unknown
- not applicable: the column does not apply to this sample
Additional sentinel values may be accepted per column via the special_values parameter (e.g., pooled, untreated, room temperature).
4.4. Range and Compound Formats
Some columns accept range expressions or compound values. These are not a generic SDRF feature — they are defined per column using pattern validators with specific regular expressions:
- Age ranges: 40Y-50Y, defined by the age pattern validator in the human/vertebrate templates
- m/z ranges: 400 m/z-1200 m/z, defined by a pattern validator in the ms-proteomics template
- Stepped collision energy: 25 NCE;27 NCE;30 NCE (semicolon-separated repeated number_with_unit values), defined by a pattern validator
To determine if a column supports ranges or compound values, consult its YAML template definition and check the declared pattern validator’s regex.
5. YAML Template Schema
This section describes the technical implementation of SDRF-Proteomics templates for developers and maintainers. These YAML definitions are used by the sdrf-pipelines validator to check SDRF files for compliance.
5.1. Complete Template Structure
# =============================================================================
# TEMPLATE METADATA (required)
# =============================================================================
name: human # Unique template identifier (used in --template flag)
description: Human SDRF template # Human-readable description
version: 1.1.0 # Semantic version (major.minor.patch)
# =============================================================================
# INHERITANCE AND RELATIONSHIPS (optional)
# =============================================================================
extends: sample-metadata@>=1.0.0 # Parent template with version constraint
usable_alone: false # Can this template be used without others?
layer: sample # Layer: base, sample-metadata, technology, sample, experiment
# Mutual exclusivity - templates that cannot be combined with this one
mutually_exclusive_with:
  - vertebrates
  - invertebrates
  - plants
# Required layers - for templates that need specific layer types
requires:
  - layer: technology # Requires a technology template
  - layer: sample # Requires a sample template
# =============================================================================
# TEMPLATE-LEVEL VALIDATORS (optional)
# =============================================================================
validators:
  - validator_name: min_columns
    params:
      min_columns: 12
  - validator_name: trailing_whitespace_validator
    params: {}
# =============================================================================
# COLUMN DEFINITIONS (required)
# =============================================================================
columns:
  - name: characteristics[disease]
    description: Disease state of the sample
    requirement: required # required | recommended | optional
    type: string # string | integer (default: string)
    cardinality: single # single | multiple (default: single)
    allow_not_applicable: true # Allow "not applicable" value
    allow_not_available: true # Allow "not available" value
    allow_pooled: false # Allow "pooled" value (for replicates)
    validators:
      - validator_name: ontology
        params:
          ontologies:
            - mondo
            - efo
            - doid
            - pato
          error_level: warning # error | warning
          description: The disease should be a valid ontology term
          examples:
            - normal
            - breast cancer
            - diabetes mellitus
5.2. Template Properties Reference
| Property | Required | Type | Description |
|---|---|---|---|
| name | Yes | string | Unique template identifier. Used in validation commands ( |
| description | Yes | string | Human-readable description of the template’s purpose. |
| version | Yes | string | Semantic version (e.g., |
| extends | No | string | Parent template name with optional version constraint. Supports: |
| usable_alone | No | boolean | If |
| layer | No | string | Template layer: |
| mutually_exclusive_with | No | list | Templates that cannot be combined with this one. |
| requires | No | list | Layer requirements (e.g., |
| validators | No | list | Template-level validators applied to the entire SDRF file. |
| columns | Yes | list | Column definitions with validation rules. |
| excludes | No | object | Column exclusion rules applied when this template is combined with others. See Column Exclusion. |
5.3. Column Exclusion
When templates are combined, a template may need to exclude columns that another template brings through inheritance. For example, metaproteomics extends base (not sample-metadata) and defines its own sample-level metadata aligned with GSC MIxS standards. When combined with ms-proteomics, the sample-metadata columns inherited by ms-proteomics (organism, disease, cell type, etc.) are not appropriate — metaproteomics replaces them with environment-specific fields.
The excludes property supports three complementary exclusion strategies, applied in order:
excludes:
  # Strategy 1: Exclude all columns inherited from a named template
  templates:
    - sample-metadata # Remove all columns that originated from sample-metadata
  # Strategy 2: Exclude columns by prefix category
  categories:
    - characteristics # Remove all characteristics[...] columns from other templates
    - comment # Remove all comment[...] columns from other templates
  # Strategy 3: Exclude specific columns by name
  columns:
    - characteristics[organism]
    - characteristics[disease]
    - characteristics[cell type]
5.3.1. Strategy 1: Template-level exclusion (excludes.templates)
Removes all columns that originated from the named parent template when combining with another template that inherits from it.
# metaproteomics.yaml
name: metaproteomics
extends: base
excludes:
  templates:
    - sample-metadata
When metaproteomics is combined with ms-proteomics, the merge logic tracks column provenance. Columns that ms-proteomics inherited from sample-metadata (organism, organism part, cell type, disease, biological replicate, etc.) are removed from the combined result. Columns defined directly in ms-proteomics itself (instrument, label, cleavage agent, etc.) are preserved.
Provenance tracking: During inheritance resolution, each column is tagged with its originating template. When excludes.templates is evaluated, columns whose origin matches a listed template are removed.
Use case: A template that provides its own complete set of sample-level metadata and needs to replace (not supplement) the standard sample-metadata columns.
5.3.2. Strategy 2: Category-level exclusion (excludes.categories)
Removes columns from other templates that match a given prefix category (characteristics or comment).
excludes:
  categories:
    - characteristics # Exclude all characteristics[...] from other templates
This removes all characteristics[…] columns contributed by other templates in the combination. Columns defined by the excluding template itself (and its own inheritance chain) are preserved.
Scope: Only affects columns from other templates in the combination, not from the excluding template’s own hierarchy. For example, if metaproteomics excludes category characteristics, its own characteristics[environmental sample type] is kept, but characteristics[organism] from ms-proteomics (via sample-metadata) is removed.
Use case: Broad exclusion when a template replaces an entire category of metadata. For instance, metaproteomics defines its own characteristics[…] columns for environmental/host metadata and does not want the generic sample-metadata characteristics[…] columns.
5.3.3. Strategy 3: Column-level exclusion (excludes.columns)
Removes specific named columns from other templates.
excludes:
  columns:
    - characteristics[organism]
    - characteristics[disease]
    - characteristics[cell type]
This is the most precise strategy, removing only the listed columns from the combination result.
Use case: When only a small number of specific columns conflict and the rest should be preserved.
5.3.4. Combining Strategies
All three strategies can be used together. They are applied in order: templates first, then categories, then individual columns. The union of all excluded columns is removed.
# Example: metaproteomics excludes sample-metadata entirely
excludes:
  templates:
    - sample-metadata
# Example: exclude only specific columns
excludes:
  columns:
    - characteristics[organism]
    - characteristics[disease]
    - characteristics[cell type]
5.3.5. Version Independence
Exclusion operates on template names only, independent of version. When excludes.templates lists sample-metadata, it matches all columns originating from sample-metadata regardless of which version (1.0.0, 1.1.0, etc.) was resolved during inheritance. This ensures that exclusion rules remain stable across version upgrades — a template does not need to update its excludes list when a dependency bumps versions.
5.3.6. Precedence Rules
- Self-preservation: A template’s excludes never removes its own columns or columns from its own parent chain. Exclusion only applies to columns from other templates in the combination.
- Order of application: templates → categories → columns.
- Union semantics: All three strategies contribute to a single exclusion set. A column excluded by any strategy is removed.
- Strictest wins: If two templates both define the same column and one excludes it, the excluding template’s version (if any) is kept. If the excluding template does not define it, the column is removed entirely.
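The union and self-preservation rules can be sketched as a filter over provenance-tagged columns. This is an illustration under assumptions: apply_excludes, column_category, and the (column, origin) pair representation are hypothetical and not taken from the actual merge implementation.

```python
def column_category(name):
    """'characteristics[organism]' -> 'characteristics'; plain names map to themselves."""
    return name.split("[", 1)[0]

def apply_excludes(columns, excludes, own_chain):
    """Drop excluded columns from a combined column list.

    columns   -- list of (column_name, origin_template) pairs after merging
    excludes  -- dict with optional 'templates', 'categories', 'columns' lists
    own_chain -- names of the excluding template and its parents (never removed)
    """
    templates = set(excludes.get("templates", []))
    categories = set(excludes.get("categories", []))
    names = set(excludes.get("columns", []))
    kept = []
    for name, origin in columns:
        if origin in own_chain:
            kept.append((name, origin))  # self-preservation
            continue
        # Union semantics: a hit from any one strategy removes the column.
        excluded = (
            origin in templates
            or column_category(name) in categories
            or name in names
        )
        if not excluded:
            kept.append((name, origin))
    return kept

combined = [
    ("source name", "base"),
    ("characteristics[organism]", "sample-metadata"),
    ("characteristics[disease]", "sample-metadata"),
    ("comment[instrument]", "ms-proteomics"),
    ("characteristics[environmental sample type]", "metaproteomics"),
]
result = apply_excludes(combined, {"templates": ["sample-metadata"]}, {"metaproteomics", "base"})
print([name for name, _ in result])
```

Because origins are template names without versions, the same call works regardless of which sample-metadata version was resolved.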
5.4. Column Properties Reference
| Property | Required | Type | Description |
|---|---|---|---|
| name | Yes | string | Column header exactly as in SDRF (e.g., |
| description | Yes | string | What the column contains and how to fill it. |
| requirement | Yes | string | |
| type | No | string | Data type: |
| cardinality | No | string | |
| allow_not_applicable | No | boolean | Allow |
| allow_not_available | No | boolean | Allow |
| allow_pooled | No | boolean | Allow |
| validators | No | list | Column-level validators for value checking. |
6. Validator Reference
6.1. Template-Level Validators
These validators apply to the entire SDRF file structure.
6.1.1. min_columns
Ensures minimum column count.
- validator_name: min_columns
  params:
    min_columns: 12
6.1.2. trailing_whitespace_validator
Checks for trailing whitespace in cell values.
- validator_name: trailing_whitespace_validator
  params: {}
6.1.3. column_order
Validates expected column order (source name first, data file last).
- validator_name: column_order
  params: {}
6.1.4. empty_cells
Checks for empty cells that should have values.
- validator_name: empty_cells
  params: {}
6.1.5. combination_of_columns_no_duplicate_validator
Ensures column combinations are unique (no duplicate rows).
- validator_name: combination_of_columns_no_duplicate_validator
  params:
    column_name: # Must be unique (error if duplicates)
      - source name
      - assay name
      - comment[label]
    column_name_warning: # Should be unique (warning if duplicates)
      - source name
      - assay name
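The uniqueness check amounts to counting value tuples over the listed columns. A minimal sketch follows, assuming rows are available as dictionaries keyed by column header; duplicate_combinations is a hypothetical helper, not the validator's real function.

```python
from collections import Counter

def duplicate_combinations(rows, column_names):
    """Return the value combinations that occur in more than one row."""
    counts = Counter(tuple(row[name] for name in column_names) for row in rows)
    return [combo for combo, n in counts.items() if n > 1]

rows = [
    {"source name": "s1", "assay name": "run1", "comment[label]": "TMT126"},
    {"source name": "s1", "assay name": "run1", "comment[label]": "TMT127N"},
    {"source name": "s2", "assay name": "run2", "comment[label]": "TMT126"},
]

# Error-level check: all three columns together must be unique.
print(duplicate_combinations(rows, ["source name", "assay name", "comment[label]"]))
# Warning-level check: source name + assay name repeat in multiplexed runs.
print(duplicate_combinations(rows, ["source name", "assay name"]))
```

With the configuration above, the first call would drive errors and the second would drive warnings.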
6.2. Column-Level Validators
These validators apply to individual column values.
6.2.1. ontology
Validates values against ontology terms.
- validator_name: ontology
  params:
    ontologies: # List of ontology prefixes
      - mondo
      - efo
      - doid
    parent_term: MS:1000044 # Optional: restrict to children of this term
    error_level: warning # error or warning
    description: Human-readable validation message
    examples:
      - normal
      - breast cancer
6.2.2. pattern
Validates values against a regular expression.
- validator_name: pattern
  params:
    # Age format: optional comparison operator + number + unit in strict Y>M>W>D order, with optional range
    # Examples: 45Y, 6M, 3W, 14D, 30Y6M, 6M2W, 40Y-50Y, 6M-12M, >18Y, >=21Y, <65Y
    pattern: ^(>=?|<=?)?(\d+[Yy](\d+[Mm](\d+[Ww](\d+[Dd])?)?)?|\d+[Mm](\d+[Ww](\d+[Dd])?)?|\d+[Ww](\d+[Dd])?|\d+[Dd])(-((\d+[Yy](\d+[Mm](\d+[Ww](\d+[Dd])?)?)?|\d+[Mm](\d+[Ww](\d+[Dd])?)?|\d+[Ww](\d+[Dd])?|\d+[Dd])))?$
    case_sensitive: false # Optional, default: true
    description: "Age format: 45Y, 6M, 30Y6M (Y>M>W>D order), ranges like 40Y-50Y, or comparison operators like >18Y, >=21Y, <65Y"
    examples:
      - 45Y
      - 6M
      - 3W
      - 30Y6M
      - 40Y-50Y
      - ">18Y"
      - ">=21Y"
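To see what the age pattern accepts, it can be exercised directly with Python's re module. The regular expression below is copied from the definition above; the is_valid_age helper is hypothetical, and applying re.IGNORECASE is an assumption about how case_sensitive: false is honoured.

```python
import re

# Pattern copied verbatim from the YAML definition above.
AGE_PATTERN = r"^(>=?|<=?)?(\d+[Yy](\d+[Mm](\d+[Ww](\d+[Dd])?)?)?|\d+[Mm](\d+[Ww](\d+[Dd])?)?|\d+[Ww](\d+[Dd])?|\d+[Dd])(-((\d+[Yy](\d+[Mm](\d+[Ww](\d+[Dd])?)?)?|\d+[Mm](\d+[Ww](\d+[Dd])?)?|\d+[Ww](\d+[Dd])?|\d+[Dd])))?$"

AGE_RE = re.compile(AGE_PATTERN, re.IGNORECASE)

def is_valid_age(value):
    """True if the value matches the age format (single age, range, or comparison)."""
    return AGE_RE.match(value) is not None

for value in ["45Y", "30Y6M", "40Y-50Y", ">=21Y", "6M30Y", "45"]:
    print(value, is_valid_age(value))
```

The last two values are rejected: 6M30Y breaks the Y>M>W>D order and 45 has no unit.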
6.2.3. values
Validates against a fixed list of allowed values.
- validator_name: values
  params:
    values:
      - male
      - female
      - intersex
    error_level: error
    description: Sex must be one of the allowed values
6.2.4. single_cardinality_validator
Ensures single value per cell (no semicolon-separated values).
- validator_name: single_cardinality_validator
  params: {}
6.2.5. number_with_unit
Validates values in <number> <unit> format. Replaces complex regex patterns for fields like mass tolerance, temperature, concentration, and depth.
- validator_name: number_with_unit
  params:
    units: [ppm, Da] # Required: allowed unit strings
    allow_negative: false # Optional: allow negative numbers (default: false)
    allow_decimal: true # Optional: allow decimal numbers (default: true)
    special_values: [] # Optional: extra allowed literals (e.g., "room temperature")
    error_level: error # error or warning
    description: Mass tolerance with unit
    examples:
      - 10 ppm
      - 0.5 Da
Parameters:
| Parameter | Required | Default | Description |
|---|---|---|---|
| units | Yes | - | List of allowed unit strings (e.g., |
| allow_negative | No | false | Allow negative numbers (useful for temperature) |
| allow_decimal | No | true | Allow decimal numbers (e.g., |
| special_values | No | | Additional literal values accepted verbatim (e.g., |
not available and not applicable are handled automatically by the column-level allow_not_available and allow_not_applicable flags — do not add them to special_values.
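How the number_with_unit parameters interact can be sketched with a small regex builder. This is an approximation for illustration only: the number_with_unit function below is hypothetical, and the exact number grammar used by the real validator (for example, whether ".5" or "1e3" is accepted) is not specified here.

```python
import re

def number_with_unit(cell, units, allow_negative=False, allow_decimal=True, special_values=()):
    """Return True if the cell is '<number> <unit>' with an allowed unit."""
    if cell in special_values:
        return True  # extra literals are accepted verbatim
    sign = "-?" if allow_negative else ""
    number = r"\d+(\.\d+)?" if allow_decimal else r"\d+"
    unit = "|".join(re.escape(u) for u in units)
    # Whitespace between number and unit is treated as optional.
    return re.fullmatch(rf"{sign}{number}\s*({unit})", cell) is not None

print(number_with_unit("10 ppm", ["ppm", "Da"]))
print(number_with_unit("-80 °C", ["°C"], allow_negative=True))
print(number_with_unit("room temperature", ["°C"], special_values=["room temperature"]))
print(number_with_unit("10", ["ppm", "Da"]))
```

The last call fails because a bare number carries no unit, matching the earlier remark that the string 10 is invalid in a column that requires one.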
6.2.6. accession
Validates accession identifiers with prefix + suffix format. Supports predefined formats or custom prefix/suffix patterns.
# Predefined format (recommended for known databases)
- validator_name: accession
  params:
    format: biosample # Predefined format name
    error_level: error
    description: BioSample accession
    examples:
      - SAMN12345678
      - SAMEA12345678
      - SAMD1234567
# Custom prefix + suffix
- validator_name: accession
  params:
    prefix: "CVCL_" # Literal or simple regex
    suffix: "[A-Z0-9]+" # Simple regex (default: \d+)
    error_level: error
Predefined formats:
| Format | Prefix | Suffix | Example |
|---|---|---|---|
| | | | SAMN12345678 |
| | | | CVCL_0030 |
| | | | PXD000001 |
6.2.7. identifier
Validates alphanumeric identifiers with optional special values. Useful for individual/patient IDs and cell identifiers.
- validator_name: identifier
  params:
    charset: "[A-Za-z0-9_-]" # Character class (default: [A-Za-z0-9_-])
    special_values: [anonymized, pooled] # Extra allowed literals
    error_level: error
    description: Patient or individual identifier
The charset parameter accepts a regex character class. Common values:
- [A-Za-z0-9_-] (default): alphanumeric with underscores and hyphens
- [A-Za-z0-9_.-]: also allows dots (for cell identifiers)
6.2.8. date
Validates ISO 8601 dates at variable precision levels.
- validator_name: date
  params:
    format: iso8601 # Currently the only supported format
    precision: [year, month, day] # Allowed precision levels
    error_level: warning
    description: Sample collection date
    examples:
      - "2024"
      - "2024-01"
      - "2024-01-15"
The precision list controls which granularity levels are accepted. For example, [day] would only accept full YYYY-MM-DD dates, while [year, month, day] accepts any level.
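A precision-aware date check can be sketched with the standard library. The valid_date helper below is hypothetical and only covers the three calendar precisions listed above; the length check is an added assumption so that unpadded forms such as 2024-1 are rejected.

```python
from datetime import datetime

# strptime format and expected string length for each precision level.
_FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}
_LENGTHS = {"year": 4, "month": 7, "day": 10}

def valid_date(cell, precision=("year", "month", "day")):
    """Return True if the cell is an ISO 8601 date at one of the allowed precisions."""
    for level in precision:
        if len(cell) != _LENGTHS[level]:
            continue  # wrong granularity (or not zero-padded) for this level
        try:
            datetime.strptime(cell, _FORMATS[level])
            return True
        except ValueError:
            pass  # right length but not a real date at this precision
    return False

print(valid_date("2024-01"))
print(valid_date("2024-01", precision=("day",)))
print(valid_date("2024-02-30"))
```

Restricting precision to day rejects 2024-01, and impossible calendar dates fail at any precision.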
6.2.9. structured_kv
Validates semicolon-separated key=value pair formats. Useful for ontology-annotated values with NT=name;AC=accession structure.
- validator_name: structured_kv
  params:
    separator: ";" # Separator between pairs (default: ";")
    fields: # Required key=value definitions
      - key: NT
        value: ".+" # Simple regex for the value
      - key: AC
        value: "(XLMOD|CHEBI|UNIMOD):\\d+"
    error_level: error
    description: "Cross-linker: NT=name;AC=ontology:accession"
    examples:
      - "NT=DSS;AC=XLMOD:02001"
      - "NT=BS3;AC=XLMOD:02000"
Each entry in fields defines a required key and a regex pattern for its value.
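The field-by-field check can be sketched as follows. This is a hypothetical helper (validate_structured_kv) that mirrors the configuration above; how the real validator treats undeclared keys or repeated keys is not covered.

```python
import re

def validate_structured_kv(cell, fields, separator=";"):
    """Return a list of problems; an empty list means the cell is valid."""
    pairs = {}
    for part in cell.split(separator):
        key, sep, value = part.partition("=")
        if not sep:
            return [f"not a key=value pair: {part!r}"]
        pairs[key.strip()] = value.strip()
    problems = []
    for field in fields:
        value = pairs.get(field["key"])
        if value is None:
            problems.append(f"missing key {field['key']}")
        elif re.fullmatch(field["value"], value) is None:
            problems.append(f"bad value for {field['key']}: {value!r}")
    return problems

# Field definitions mirroring the cross-linker example above.
FIELDS = [
    {"key": "NT", "value": ".+"},
    {"key": "AC", "value": r"(XLMOD|CHEBI|UNIMOD):\d+"},
]

print(validate_structured_kv("NT=DSS;AC=XLMOD:02001", FIELDS))
print(validate_structured_kv("NT=DSS", FIELDS))
```

Returning a list of problems rather than a boolean makes it easy to surface every issue in one validation message.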
6.2.10. semver
Validates semantic version strings.
- validator_name: semver
  params:
    prefix: "v" # Optional prefix character
    allow_prerelease: true # Allow -alpha, -beta, -rc.1 suffixes (default: true)
    error_level: error
    description: SDRF specification version
    examples:
      - v1.1.0
      - v1.0.0
7. Supported Ontologies
| Prefix | Ontology | Common Use |
|---|---|---|
| | NCBI Taxonomy | organism |
| | Uberon Anatomy Ontology | organism part |
| | Cell Ontology | cell type |
| | Cell Line Ontology | cell line |
| | BRENDA Tissue Ontology | tissue, cell type, cell line |
| | MONDO Disease Ontology | disease |
| | Experimental Factor Ontology | disease, experimental factors |
| | Disease Ontology | disease |
| | Phenotype and Trait Ontology | phenotypes (including "normal") |
| | PSI-MS Ontology | instrument, cleavage agent, mass analyzer |
| | PRIDE Ontology | acquisition method, labels, affinity instruments |
| | UNIMOD | post-translational modifications |
| | XLMOD | crosslinking reagents |
| | Human Ancestry Ontology | ancestry category (human) |
| | Environment Ontology | environmental samples (metaproteomics) |
| | Gazetteer | geographic locations |
| | ChEBI | chemical entities, treatments |
8. Template Inheritance Rules
When templates are combined, the following rules apply:
8.1. Inheritance Behavior
- Column inheritance: Child templates inherit all columns from parent templates
- Validator inheritance: Child templates inherit all validators from parent templates
- Column override: Child templates can redefine inherited columns with stricter requirements
8.2. Requirement Strengthening
Child templates may strengthen but not weaken requirements:
| Parent Requirement | Allowed in Child | Not Allowed |
|---|---|---|
| | | - |
| | | |
| | | |
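The strengthen-but-not-weaken rule reduces to comparing positions in an ordered scale. The sketch below assumes the ordering optional < recommended < required, which follows from the requirement levels used elsewhere in this guide; is_allowed_override is a hypothetical name.

```python
# Requirement levels ranked from weakest to strongest.
LEVELS = {"optional": 0, "recommended": 1, "required": 2}

def is_allowed_override(parent_requirement, child_requirement):
    """A child may keep or strengthen the parent's requirement, never weaken it."""
    return LEVELS[child_requirement] >= LEVELS[parent_requirement]

print(is_allowed_override("optional", "required"))
print(is_allowed_override("required", "recommended"))
```

Under this ordering an optional parent column places no restriction on the child, while a required parent column can only stay required.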
8.3. Multi-Template Combination
When multiple templates are combined:
- All columns from all templates are included
- If the same column appears in multiple templates, the strictest requirement wins
- Validators are merged (all validators apply)
- Mutual exclusivity is checked first
Example valid combination:
ms-proteomics + human + dia-acquisition + cell-lines
      ↓           ↓           ↓              ↓
  technology    sample    experiment     experiment
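The merge rules for combined templates can be sketched as a fold over column dictionaries. This is illustrative only: combine and the dictionary layout are hypothetical, validators are represented by name, and the mutual-exclusivity check that runs first is left out.

```python
# Requirement levels ordered from weakest to strongest.
LEVELS = ["optional", "recommended", "required"]

def combine(templates):
    """Merge column definitions; each template maps column name -> spec dict."""
    merged = {}
    for template in templates:
        for name, spec in template.items():
            if name not in merged:
                merged[name] = {
                    "requirement": spec["requirement"],
                    "validators": list(spec["validators"]),
                }
                continue
            current = merged[name]
            # Strictest requirement wins.
            current["requirement"] = max(
                current["requirement"], spec["requirement"], key=LEVELS.index
            )
            # Validators are merged: keep every distinct validator.
            for validator in spec["validators"]:
                if validator not in current["validators"]:
                    current["validators"].append(validator)
    return merged

a = {"characteristics[cell line]": {"requirement": "optional", "validators": ["ontology"]}}
b = {"characteristics[cell line]": {"requirement": "required",
                                    "validators": ["ontology", "single_cardinality_validator"]}}
print(combine([a, b]))
```

The result does not depend on the order in which the templates are listed, since max over the levels is symmetric.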
9. Creating a New Template: Step-by-Step Guide
This section walks through creating a new YAML template from scratch. The example creates a hypothetical top-down template for top-down proteomics experiments.
9.1. Step 1: Choose the Parent Template
Every template (except base) must extend a parent. Choose based on where your template fits in the hierarchy:
| You are defining… | Extend | Layer |
|---|---|---|
| A new proteomics technology (rare) | | |
| A new organism type | | |
| A new MS experiment type | | |
| A new affinity experiment type | | |
| A template with its own sample metadata (e.g., metaproteomics) | | |
For our example, top-down proteomics is a mass spectrometry technique, so we extend ms-proteomics and use the experiment layer.
9.2. Step 2: Create the Directory Structure
Templates live in the sdrf-templates repository. Each template has a versioned directory:
sdrf-templates/
└── top-down/
    └── 1.0.0-dev/
        ├── top-down.yaml       # Template schema (required)
        └── top-down.sdrf.tsv   # Example SDRF file (required)
Use the -dev suffix for the initial version until the template is reviewed and accepted by the community.
9.3. Step 3: Write the YAML Schema
Start with the template metadata, then define the columns your experiment type needs. Only define columns that are new or different from the parent — inherited columns from ms-proteomics do not need to be repeated.
# ===========================================================================
# TEMPLATE METADATA
# ===========================================================================
name: top-down
description: >
  SDRF template for top-down proteomics experiments where intact proteins
  are analyzed without prior enzymatic digestion. Extends ms-proteomics
  with top-down-specific columns for intact mass analysis.
version: 1.0.0-dev
extends: ms-proteomics@>=1.1.0
usable_alone: false
layer: experiment

# ===========================================================================
# COLUMN DEFINITIONS: only columns new or changed from parent
# ===========================================================================
columns:
  # --- New column: protein separation method ---
  - name: comment[protein separation method]
    description: >
      Method used to separate intact proteins before MS analysis
      (e.g., GELFrEE, SEC, RPLC, CZE). Use "not applicable" if no
      separation was performed (direct infusion).
    requirement: recommended
    allow_not_applicable: true
    allow_not_available: true
    validators:
      - validator_name: ontology
        params:
          ontologies:
            - ms
            - pride
          error_level: warning
          description: >
            Protein separation method should be a valid PSI-MS or PRIDE
            ontology term.
          examples:
            - size exclusion chromatography
            - gel electrophoresis
            - capillary zone electrophoresis
            - reversed-phase liquid chromatography
            - not applicable

  # --- New column: intact mass range ---
  - name: comment[precursor mass range]
    description: >
      The mass range of intact protein precursors analyzed, in Daltons.
      Format: "min-max Da" (e.g., "5000-50000 Da").
    requirement: optional
    allow_not_available: true
    validators:
      - validator_name: pattern
        params:
          pattern: ^(\d+-\d+\s*Da|not available)$
          case_sensitive: false
          description: >
            Precursor mass range in format "min-max Da".
          examples:
            - 5000-50000 Da
            - 10000-100000 Da
            - not available

  # --- Override inherited column: make cleavage agent "not applicable" ---
  # In top-down experiments, proteins are not digested, so the cleavage
  # agent column should always be "not applicable". We override the
  # inherited column to make this explicit.
  - name: comment[cleavage agent details]
    description: >
      Cleavage agent is not applicable for top-down experiments where
      intact proteins are analyzed. Use "not applicable".
    requirement: required
    allow_not_applicable: true
    allow_not_available: false
    validators:
      - validator_name: values
        params:
          values:
            - not applicable
          error_level: warning
          description: >
            Top-down experiments analyze intact proteins. Cleavage agent
            should be "not applicable".
Key points demonstrated in this example:
- New columns (comment[protein separation method], comment[precursor mass range]) are defined with their own validators.
- Overriding an inherited column (comment[cleavage agent details]) restricts the parent's definition: here it forces the value to "not applicable", since top-down experiments don't use enzymatic digestion.
- Descriptions explain the "why": not just what the field is, but when to use not applicable and what format to follow.
- Examples are provided in the validators of the new columns; they serve as documentation and can be used by tools to generate autocomplete suggestions.
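To make the override behaviour concrete, the sketch below resolves a child template's columns against its parent's. It assumes that a child column with the same name updates the parent's properties one by one and that new columns are appended after the inherited ones; the function name `resolve_columns` and the parent definitions are placeholders, not the real ms-proteomics columns or the sdrf-pipelines resolver.

```python
from typing import Dict, List


def resolve_columns(parent_columns: List[dict], child_columns: List[dict]) -> List[dict]:
    """Return the effective columns of a child template."""
    # Copy each parent column so the parent template is never mutated.
    resolved: Dict[str, dict] = {col["name"]: dict(col) for col in parent_columns}
    for col in child_columns:
        # Properties set by the child replace the parent's; the rest are kept.
        resolved.setdefault(col["name"], {}).update(col)
    return list(resolved.values())


# Placeholder parent definition; NOT the real ms-proteomics columns.
parent = [
    {"name": "comment[instrument]", "requirement": "required"},
    {"name": "comment[cleavage agent details]", "requirement": "required",
     "allow_not_applicable": False},
]
child = [
    {"name": "comment[protein separation method]", "requirement": "recommended"},
    {"name": "comment[cleavage agent details]", "allow_not_applicable": True},
]

effective = resolve_columns(parent, child)
for column in effective:
    print(column["name"], column)
```

Only the two child entries had to be written out; the untouched parent column is inherited as-is, which is why a template should list only new or changed columns.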
9.4. Step 4: Create an Example SDRF File
Every template must include an example .sdrf.tsv file that passes validation. This file demonstrates correct usage and serves as a starting point for users.
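As a rough illustration of what such a file contains, the following sketch renders a one-row, tab-separated SDRF table in memory. The column list is an illustrative subset (it is not the full set required by ms-proteomics), and the sample values, file name, and helper name `render_sdrf` are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Illustrative subset of columns; the parent template requires more than these.
COLUMNS = [
    "source name",
    "characteristics[organism]",
    "assay name",
    "comment[data file]",
    "comment[protein separation method]",
    "comment[precursor mass range]",
    "comment[cleavage agent details]",
]

# Hypothetical sample row; the top-down values reuse the examples from Step 3.
ROWS = [
    {
        "source name": "sample 1",
        "characteristics[organism]": "homo sapiens",
        "assay name": "run 1",
        "comment[data file]": "sample1.raw",
        "comment[protein separation method]": "size exclusion chromatography",
        "comment[precursor mass range]": "5000-50000 Da",
        "comment[cleavage agent details]": "not applicable",
    },
]


def render_sdrf(columns: List[str], rows: List[Dict[str, str]]) -> str:
    """Render rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sdrf_text = render_sdrf(COLUMNS, ROWS)
print(sdrf_text)
```

Writing the result to top-down.sdrf.tsv and running the validator on it is still needed; this only shows the tabular shape of the file.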
9.5. Step 5: Test Locally
Before submitting, validate your example file against your template:
# Install or upgrade the validator
pip install --upgrade sdrf-pipelines
# Validate the example SDRF file with the new template
parse_sdrf validate-sdrf \
    --sdrf_file top-down/1.0.0-dev/top-down.sdrf.tsv \
    --template ms-proteomics \
    --custom_template top-down/1.0.0-dev/top-down.yaml
Check that:
- The example file passes validation with no errors.
- All required columns from the parent template are present.
- New columns validate correctly (ontology terms resolve, patterns match).
- The not applicable and not available values work where expected.
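Before running the full validator, a pattern can be sanity-checked on its own. The sketch below tests the comment[precursor mass range] pattern from Step 3 against its listed examples and a few values that should be rejected. It assumes the validator anchors the pattern at the start of the value and maps case_sensitive: false to a case-insensitive match; the helper name `matches_pattern` and the counter-examples are not part of sdrf-pipelines.

```python
import re

# Pattern copied from the comment[precursor mass range] validator in Step 3.
PATTERN = r"^(\d+-\d+\s*Da|not available)$"


def matches_pattern(value: str, pattern: str = PATTERN, case_sensitive: bool = False) -> bool:
    """Return True if the value matches the pattern."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.match(pattern, value, flags) is not None


examples = ["5000-50000 Da", "10000-100000 Da", "not available"]
counter_examples = ["5000 Da", "5-50 kDa", "not applicable"]

for value in examples + counter_examples:
    print(f"{value!r}: {matches_pattern(value)}")
```

Note that "not applicable" is rejected by this pattern, which is consistent with the column setting only allow_not_available.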
9.6. Step 6: Submit a Pull Request
Submit the template to the sdrf-templates repository via pull request. The PR must include:
- The YAML schema file (top-down/1.0.0-dev/top-down.yaml).
- The example SDRF file (top-down/1.0.0-dev/top-down.sdrf.tsv).
- A PR description explaining: what experiment type the template covers, why it is needed, and which parent it extends.
Once reviewed and merged, the template appears in the templates.yaml manifest and becomes available to all users of sdrf-pipelines. A documentation page is auto-generated from the YAML definition. The -dev suffix is removed when the template is promoted to a stable release (e.g., 1.0.0).
9.7. Quick Reference: Template YAML Checklist
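| Field | Checklist |
|---|---|
| name | Lowercase with hyphens. Unique across all templates. |
| description | One or two sentences. Mention what it extends and what experiment type it covers. |
| version | Use the -dev suffix (e.g., 1.0.0-dev) until the template is reviewed and accepted. |
| extends | Must be an existing template name with a version constraint (e.g., ms-proteomics@>=1.1.0). |
| layer | One of: technology, sample, experiment. |
| usable_alone | Almost always false. |
| columns | Only define columns that are new or that override a parent column. Do not repeat inherited columns unchanged. |
| Validators | Use ontology validators where controlled vocabulary terms exist, typed validators for common structured formats, and pattern validators only for formats not covered by typed validators. |
| Reserved words | Set allow_not_applicable and allow_not_available explicitly for each column. |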
9.8. Best Practices
- Only define what is new. Inherited columns from the parent template do not need to be repeated. Only add a column to your template if it is new or if you need to override the parent's definition (e.g., change requirement from optional to required, or restrict allowed values).
- Use ontologies over patterns. Prefer ontology validators for fields where controlled vocabulary terms exist. Use typed validators (number_with_unit, accession, identifier, date, structured_kv, semver) for common structured formats. Use pattern validators only for unique formats not covered by typed validators.
- Provide clear descriptions. Explain not just what the field is, but when to use not applicable vs not available, and give format guidance.
- Always include examples. Examples in validators serve as documentation and help tools generate suggestions.
- Test with real data. Your example SDRF should represent a realistic experiment, not a toy file. If possible, base it on a real public dataset.
10. Validation Commands
Validate SDRF files using sdrf-pipelines:
# Install
pip install sdrf-pipelines
# Validate with single template
parse_sdrf validate-sdrf --sdrf_file file.sdrf.tsv --template ms-proteomics
# Validate with multiple templates
parse_sdrf validate-sdrf --sdrf_file file.sdrf.tsv \
    --template ms-proteomics \
    --template human \
    --template dia-acquisition
# Check template compatibility
parse_sdrf check-templates --templates ms-proteomics,human,dia-acquisition
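When validation is scripted, for example in a CI job, the multi-template command can be assembled programmatically. The sketch below only builds and prints the argument list using the flags shown above; the helper name `build_validate_command` is hypothetical, and actually running the command requires sdrf-pipelines to be installed.

```python
import shlex
from typing import List


def build_validate_command(sdrf_file: str, templates: List[str]) -> List[str]:
    """Build the validate-sdrf argument list, one --template flag per template."""
    command = ["parse_sdrf", "validate-sdrf", "--sdrf_file", sdrf_file]
    for template in templates:
        command += ["--template", template]
    return command


command = build_validate_command(
    "file.sdrf.tsv", ["ms-proteomics", "human", "dia-acquisition"]
)
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing the list form to subprocess.run avoids shell quoting problems with file names that contain spaces.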
11. Versioning
11.1. Template Versions
Templates follow semantic versioning (MAJOR.MINOR.PATCH):
- MAJOR: Breaking changes (column removals, requirement escalations)
- MINOR: Additions (new optional/recommended columns)
- PATCH: Fixes (description updates, validator corrections)
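The versioning rules can be illustrated with a small helper that bumps a version for a given kind of change and checks a minimum-version constraint such as the one used in extends. This is a sketch under stated assumptions: the function names are hypothetical, only the >= operator is handled, and a -dev suffix is ignored when comparing versions, which may differ from how sdrf-pipelines treats it.

```python
import re
from typing import Tuple

# Accepts "1.1.0", "v1.1.0" and "1.0.0-dev"; the -dev suffix is ignored here.
VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-dev)?$")


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into integers."""
    match = VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = parse_version(version)
    if change == "major":   # e.g. column removal, requirement escalation
        return f"{major + 1}.0.0"
    if change == "minor":   # e.g. new optional/recommended column
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # e.g. description update, validator correction
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


def satisfies_minimum(version: str, constraint: str) -> bool:
    """Check a '>=X.Y.Z' constraint; other operators are not handled."""
    if not constraint.startswith(">="):
        raise ValueError(f"unsupported constraint: {constraint!r}")
    return parse_version(version) >= parse_version(constraint[2:])


print(bump("1.1.0", "minor"))                 # prints "1.2.0"
print(satisfies_minimum("1.2.0", ">=1.1.0"))  # prints "True"
```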
11.2. Recording Templates in SDRF
Use comment[sdrf template] to record which templates were used:
NT=ms-proteomics;VV=v1.1.0 NT=human;VV=v1.1.0
Multiple comment[sdrf template] columns are supported when templates are combined, one column per template.
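A value in this format can be split into its key-value fields with a few lines of Python. The sketch assumes fields are separated by semicolons and keys from values by the first equals sign, as in the example above; the function name `parse_template_value` is hypothetical, and the meaning of the NT and VV keys is taken as given rather than interpreted.

```python
from typing import Dict


def parse_template_value(value: str) -> Dict[str, str]:
    """Split 'KEY=value;KEY=value' into a dictionary."""
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        key, separator, field_value = part.partition("=")
        if not separator:
            raise ValueError(f"expected KEY=value, got {part!r}")
        fields[key.strip()] = field_value.strip()
    return fields


cells = ["NT=ms-proteomics;VV=v1.1.0", "NT=human;VV=v1.1.0"]
for cell in cells:
    print(parse_template_value(cell))
```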
12. References
- sdrf-pipelines - SDRF validation tool