Metadata schema


Metadata fields are subdivided into five categories (File, Sample, Aliquot, Case, and General). The recommended practice is to enter as much metadata as possible when you first upload files to the Platform. For instance, for raw sequencing files, you should enter Platform (sequencing platform) and Sample ID. Of these fields, there are six metadata fields that we highly suggest you set for your data. While your tasks may run correctly without them, these metadata fields will help optimize your analyses. These fields are labeled in the File table below with a suggested tag in the Name column.

Please keep in mind that fields have to be specified exactly as listed under the Name column in the tables below. If a field is not specified exactly as in the table, the Platform will interpret it as a custom metadata field (see below).
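
The exact-match rule can be illustrated with a small sketch. It assumes the same rule applies to API keys as to names, and it uses only a subset of the API keys from the File and Sample tables; `STANDARD_KEYS` and `split_metadata` are hypothetical names for illustration, not part of the Platform.

```python
# Illustrative subset of API keys, copied from the File and Sample tables.
STANDARD_KEYS = {
    "reference_genome", "quality_scale", "platform", "platform_unit_id",
    "paired_end", "library_id", "file_segment_number", "experimental_strategy",
    "sample_id", "sample_type", "sample_uuid",
}


def split_metadata(metadata):
    """Separate keys that match the schema exactly from all other keys.

    Hypothetical helper: anything that is not an exact match is reported
    as custom, mirroring the behavior described in the text.
    """
    standard = {k: v for k, v in metadata.items() if k in STANDARD_KEYS}
    custom = {k: v for k, v in metadata.items() if k not in STANDARD_KEYS}
    return standard, custom


standard, custom = split_metadata({
    "sample_id": "S1",
    "Sample_ID": "S1",            # different capitalization: not an exact match
    "platform": "Illumina HiSeq",
    "sequencing_center": "Lab A",  # not in the schema: custom field
})
print(sorted(standard))
print(sorted(custom))
```

In this sketch, `Sample_ID` ends up among the custom fields because the comparison is case-sensitive.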

File

In the following table, you will find the name, description, and values of metadata fields for File. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

There are six metadata fields that we highly suggest you set for your data. While your tasks may run correctly without them, these metadata fields will help optimize your analyses. These fields are labeled in the table below with a red suggested tag in the Name column.

Name	API key	Description	Values
Reference genome	`reference_genome`	The reference assembly (such as HG19 or GRCh37) to which the nucleotide sequence of a case can be aligned.	string. Suggested values: human_g1k_v37, human_g1k_v37_decoy, ucsc.hg19, Homo_sapiens.Ensembl.GRCh37, Homo_sapiens.GRCh38.dna.primary_assembly, ion_torrent.hg19, mouse_mm9_ucsc, ens_mouse_mm9_genome, mouse_mm10_ucsc
Quality scale (suggested)	`quality_scale`	For raw reads, this value denotes the sequencing technology and quality format. For BAM and SAM files, this value should always be ‘Sanger’. Enter this value for all FASTQ files, unless they are used in a workflow with a FASTQ quality scale detector wrapper.	Choose from one of the following options: sanger, illumina13, illumina15, illumina18, solexa. Or, enter no value.
Platform (suggested)	`platform`	Only some tools and workflows require a value for the Platform field. However, it is recommended that you set it whenever possible, unless you are certain that your workflow will work without it.	string. Suggested values: Illumina HiSeq, Illumina GA, ABI capillary sequencer, Illumina MiSeq, ABI SOLiD, Ion Torrent PGM, LS 454, Illumina HiSeq X Ten, Illumina, Helicos, PacBio, Not available
Platform unit ID (suggested)	`platform_unit_id`	This is an identifier for lanes (Illumina) or slides (SOLiD), for cases where a library was split and run over multiple lanes on the flow cell or over multiple slides. The platform unit ID refers to the lane ID or the slide ID. The value supplied in the Platform unit ID field will be written to the read group tag (@RG:PU) in SAM or BAM files. All aligner apps add read group fields to the aligned BAM file on the basis of Platform unit ID metadata.	string
Paired end (suggested)	`paired_end`	For paired-end read files, this field indicates whether the read file is the left end or the right end of the sequenced fragment. Set ‘1’ for left end and ‘2’ for right end reads. This is used to group pairs. If the FASTQ file is a single-end read, this field should be left as ‘-’. Note: It is important for two members of paired-end reads to have identical Sample ID, Library ID, Platform unit ID, and File segment number.	This takes a value of 1 or 2. Note: For single-end sequencing, no value is needed.
Library ID (suggested)	`library_id`	This is an identifier for the sequencing library preparation. The value set in this field does not affect whether or not the workflow runs successfully. However, all files that come from the same sequencing library must have the same value. The Library ID will be written to the read group tag (@RG:LB) in SAM or BAM files. All aligner apps are programmed to add RG fields to the aligned BAM according to the Library ID.	string
File segment number (suggested)	`file_segment_number`	If the sequencing reads for a single library, sample, and lane are divided into multiple (smaller) files, the File segment number is used to enumerate these. Otherwise, this field can be left blank. This information can be used for batching when processing files with a workflow.	Integer.
Experimental strategy	`experimental_strategy`	This is the method or protocol used to perform the laboratory analysis.	string. Suggested values: DNA-Seq, WXS, WGS, Amplicon, Bisulfite-Seq, RNA-Seq, miRNA-Seq, Total RNA-Seq, Not available
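
The note on Paired end says that both members of a pair should share Sample ID, Library ID, Platform unit ID, and File segment number. The following is a minimal sketch of how such grouping could work, assuming metadata is available as plain dictionaries keyed by API key. The function `group_pairs` and the file names are hypothetical, and this is not a description of how the Platform itself groups files.

```python
from collections import defaultdict

# The four fields that two members of a pair are expected to share.
PAIR_KEYS = ("sample_id", "library_id", "platform_unit_id", "file_segment_number")


def group_pairs(files):
    """Group (name, metadata) records into (left, right) pairs.

    Files whose paired_end value is not 1 or 2 (for example '-' or unset)
    are treated as single-end and skipped. If two files share the same key
    and the same end, the later one overwrites the earlier one in this sketch.
    """
    groups = defaultdict(dict)
    for name, meta in files:
        end = str(meta.get("paired_end", ""))
        if end not in ("1", "2"):
            continue
        key = tuple(meta.get(k) for k in PAIR_KEYS)
        groups[key][end] = name
    # Keep only groups where both ends are present.
    return {
        key: (ends["1"], ends["2"])
        for key, ends in groups.items()
        if len(ends) == 2
    }


shared = {
    "sample_id": "S1",
    "library_id": "L1",
    "platform_unit_id": "lane1",
    "file_segment_number": 1,
}
files = [
    ("a_R1.fastq", {**shared, "paired_end": "1"}),
    ("a_R2.fastq", {**shared, "paired_end": "2"}),
    ("b.fastq", {"sample_id": "S2", "paired_end": "-"}),
]
pairs = group_pairs(files)
print(pairs)
```

If any of the four shared fields differs between the two read files, they land in different groups and no pair is formed, which is why the note asks for identical values.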

Sample

In the following table, you will find the name, description, and values of metadata fields for Sample. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

Name	API key	Description	Values
Sample ID	sample_id	A human readable identifier for a sample or specimen, which could contain some metadata information. A sample or specimen is material taken from a biological entity for testing, diagnosis, propagation, treatment, or research purposes, including but not limited to tissues, body fluids, cells, organs, embryos, body excretory products, etc.	This takes a string.
Sample type	sample_type	The type of material taken from a biological entity for testing, diagnosis, propagation, treatment, or research purposes. This includes tissues, body fluids, cells, organs, embryos, body excretory products, etc.	This takes a string. Suggested values: Blood Derived Normal, Buccal Cell Normal, Primary Blood Derived Cancer - Peripheral Blood, Recurrent Blood Derived Cancer - Peripheral Blood, Primary Tumor, Recurrent Blood Derived Cancer - Bone Marrow, Recurrent Tumor, Solid Tissue Normal, Metastatic, Additional - New Primary, Additional Metastatic, Human Tumor Original Cells, Primary Blood Derived Cancer - Bone Marrow, Cell Lines, Xenograft Tissue, Bone Marrow Normal, Fibroblasts from Bone Marrow Normal, Not available
Sample UUID	sample_uuid	A unique identifier for the sample or specimen used in the investigation, such as a Universally Unique Identifier (UUID). A sample or specimen is material taken from a biological entity for testing, diagnosis, propagation, treatment, or research purposes, including but not limited to tissues, body fluids, cells, organs, embryos, body excretory products, etc.	This takes a string.

Aliquot

In the following table, you will find the name, description, and values of metadata fields for Aliquot. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

Name	API key	Description	Value
Aliquot ID	aliquot_id	A human readable identifier for an aliquot, which may contain metadata information. The aliquot is a product or unit extracted from a sample of a specimen and prepared for the analysis.	This takes a string.
Aliquot UUID	aliquot_uuid	The unique identifier for an aliquot, such as a Universally Unique Identifier (UUID). The aliquot is a product or unit extracted from a sample of a specimen and prepared for the analysis.	This takes a string.

Case

In the following table, you will find the name, description, and values of metadata fields for Case. The Case category is further subdivided by the following properties: Diagnosis, Demographic, Status, and Prognosis. These properties are shown in parentheses after the metadata field's name in the first column. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

Name	API key	Description	Value
Case ID	case_id	An identifier, such as a number or a string that may contain metadata information, for a subject who has taken part in the investigation or study.	This takes a string.
Case UUID	case_uuid	A unique identifier, such as a Universally Unique Identifier (UUID), for a subject who has taken part in the investigation or study.	This takes a string.
Primary site (Diagnosis)	primary_site	The anatomical site where the primary tumor is located in the organism.	This takes a string. Suggested values: Adrenal Gland, Bile Duct, Bladder, Blood, Brain, Breast, Cervix, Colorectal, Esophagus, Eye, Head And Neck, Liver, Lung, Lymph Nodes, Kidney, Mesenchymal, Mesothelium, Nervous System, Ovary, Pancreas, Prostate, Skin, Stomach, Uterus, Testis, Thymus, Thyroid, Not available
Disease type (Diagnosis)	disease_type	The type of the disease or condition studied.	This takes a string. Suggested values: Acute Myeloid Leukemia, Adrenocortical Carcinoma, Bladder Urothelial Carcinoma, Brain Lower Grade Glioma, Breast Invasive Carcinoma, Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma, Cholangiocarcinoma, Chronic Myelogenous Leukemia, Colon Adenocarcinoma, Esophageal Carcinoma, Glioblastoma Multiforme, Head and Neck Squamous Cell Carcinoma, Kidney Chromophobe, Kidney Renal Clear Cell Carcinoma, Kidney Renal Papillary Cell Carcinoma, Liver Hepatocellular Carcinoma, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Lymphoid Neoplasm Diffuse Large B-cell Lymphoma, Mesothelioma, Ovarian Serous Cystadenocarcinoma, Pancreatic Adenocarcinoma, Pheochromocytoma and Paraganglioma, Prostate Adenocarcinoma, Rectum Adenocarcinoma, Sarcoma, Skin Cutaneous Melanoma, Stomach Adenocarcinoma, Testicular Germ Cell Tumors, Thymoma, Thyroid Carcinoma, Uterine Carcinosarcoma, Uterine Corpus Endometrial Carcinoma, Uveal Melanoma, Not available
Gender (Demographic)	gender	The collection of behaviors and attitudes that distinguish people on the basis of societal roles expected for the two sexes.	Choose from the following: Female, Male
Age at diagnosis (Diagnosis)	age_at_diagnosis	The age in years of the case at the initial pathological diagnosis of disease or cancer.	This takes a non-negative integer.
Vital status (Status)	vital_status	The state of being living or deceased for cases that are part of the investigation.	Choose from the following: Alive, Dead, Lost to follow-up, Unknown, Not available
Days to death (Prognosis)	days_to_death	The number of days from the date of the initial pathological diagnosis to the date of death for the case in investigation.	This takes a non-negative integer.
Race (Demographic)	race	The race of the case in the investigation.	This takes a string. Suggested values: White, American Indian or Alaska Native, Black or African American, Asian, Native Hawaiian or other Pacific Islander, Not reported, Not available
Ethnicity (Demographic)	ethnicity	A socially defined category of people based on common ancestral, cultural, biological, and social factors.	This takes a string. Suggested values: Hispanic or Latino, Not Hispanic or Latino, Not reported, Not available
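
Some Case fields take constrained values: non-negative integers for Age at diagnosis and Days to death, and a fixed set of choices for Gender and Vital status. As a rough sketch, these constraints could be checked before upload as shown below. The function `check_case` is a hypothetical helper, metadata is assumed to be a plain dictionary keyed by API key, and the check is not something the source says the Platform performs.

```python
# Allowed choices, copied from the Value column of the Case table.
GENDER = {"Female", "Male"}
VITAL_STATUS = {"Alive", "Dead", "Lost to follow-up", "Unknown", "Not available"}


def is_non_negative_int(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def check_case(meta):
    """Return a list of problems found in the constrained Case fields."""
    problems = []
    for key in ("age_at_diagnosis", "days_to_death"):
        if key in meta and not is_non_negative_int(meta[key]):
            problems.append(f"{key}: expected a non-negative integer")
    if "gender" in meta and meta["gender"] not in GENDER:
        problems.append("gender: expected Female or Male")
    if "vital_status" in meta and meta["vital_status"] not in VITAL_STATUS:
        problems.append("vital_status: expected one of the listed choices")
    return problems


print(check_case({"age_at_diagnosis": 54, "gender": "Female", "vital_status": "Alive"}))
print(check_case({"age_at_diagnosis": -1, "gender": "female"}))
```

Fields that only list suggested values, such as Race or Primary site, are deliberately left unchecked here, since the table describes them as free strings.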

General

In the following table, you will find the name, description, and values of metadata fields for General. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

Name	API key	Description	Value
Investigation	investigation	A value denoting the project or study that generated the data.	This takes a string.
