Create a multisample VCF from Variant Store

If you have access to VCF files in the TOPMed datasets, you have been able to mix and combine samples from those VCF files into joint multisample VCF files by running standard bioinformatics tools for VCF processing, to generate VCF files for specific subsets of samples and genomic regions. This process can take several manual steps, can take time, and incurs cloud computing costs.

To speed up and streamline this flow, you can now use a new feature on BioDataCatalyst powered by Seven Bridges Platform, where using a set of API calls you can get a VCF file for the studies and genomic regions of your choice.

IMPORTANT NOTE: The data transfer into the project incurs cloud costs. Please see the Cost of data transfer section below.

Prerequisites

Procedure

Copy the python script below, modify the inputs and run it. The script will:

  • Kick off the transfer of data into your project
  • Monitor the data transfer and wait for completion
  • Once the raw data is transferred, start a set of CWL tasks that will convert the raw data into actual usable VCF file
import requests
import time
import datetime
import sevenbridges as sbg
from sevenbridges.errors import NotFound

# USER INPUT

# Project ID for the project where the data should be exported. 
# Must be a controlled project.
project = "user-name/vs-export"

# List of study with consents values to gather in the VCF
studies = ["phs001211.c2", "phs001546.c1"] 
# studies=[
#     "phs001624.c1",
#     "phs000993.c2",
#     "phs001416.c1",
#     "phs000997.c1",
#     "phs000974.c1",
#     "phs001032.c1",
#     "phs001466.c1",
#     "phs001515.c1",
#     "phs001514.c1",
#     "phs001682.c1",
#     "phs000956.c2",
#     "phs001217.c1",
#     "phs001662.c2",
#     "phs000954.c1",
#     "phs001608.c1",
#     "phs000988.c1",
#     "phs000920.c2",
#     "phs001345.c1",
#     "phs001368.c2",
#     "phs001237.c1",
#     "phs000964.c1",
#     "phs001598.c1",
#     "phs001207.c1",
#     "phs001607.c3",
#     "phs001211.c1",
#     "phs001607.c5",
#     "phs000993.c1",
#     "phs001607.c1",
#     "phs000946.c1",
#     "phs001729.c2",
#     "phs001211.c2",
#     "phs001602.c1",
#     "phs001218.c2",
#     "phs001547.c1",
#     "phs001607.c4",
#     "phs001543.c1",
#     "phs001604.c1",
#     "phs000921.c2",
#     "phs001143.c1",
#     "phs000972.c1",
#     "phs001402.c1",
#     "phs001607.c2",
#     "phs001472.c1",
#     "phs001387.c3",
#     "phs001466.c3",
#     "phs001735.c1",
#     "phs001237.c2",
#     "phs001412.c1",
#     "phs001726.c1",
#     "phs001368.c1",
#     "phs001189.c1",
#     "phs001024.c1",
#     "phs001612.c2",
#     "phs001735.c2",
#     "phs000964.c4",
#     "phs001728.c2",
#     "phs001435.c1",
#     "phs001293.c1",
#     "phs001359.c1",
#     "phs000964.c3",
#     "phs001416.c2",
#     "phs001062.c1",
#     "phs001062.c2",
#     "phs001545.c1",
#     "phs001412.c2",
#     "phs001466.c2",
#     "phs001546.c1",
#     "phs000964.c2",
#     "phs001725.c1",
#     "phs001644.c1",
#     "phs001215.c1",
#     "phs001368.c4",
#     "phs000974.c2",
#     "phs001603.c1",
#     "phs001605.c2",
#     "phs001395.c2",
#     "phs000951.c1",
#     "phs001730.c2",
#     "phs001446.c1",
#     "phs000951.c2",
#     "phs001040.c1",
#     "phs001732.c2",
#     "phs001514.c2",
#     "phs001467.c1",
#     "phs001606.c1",
#     "phs001293.c2",
#     "phs001434.c1",
#     "phs001727.c2",
#     "phs001468.c1",
#     "phs001612.c1",
#     "phs001544.c1",
#     "phs001395.c1"
# ]

# Credentials profile
sb_profile = 'bdc'

# Merge app ID
try:
    app = api.apps.get(project + '/sbg-merge-variant-store-export-chunks')
except NotFound:
    app = api.apps.get('admin/sbg-public-data/sbg-merge-variant-store-export-chunks').copy(project=project)


# Optional parameter where only a specific chromosome or region can be exported
# Uncomment below to use this parameter
regions = [ # Optional parameter - defaults to the entire genome.
            {
                "chromosome": "1",
                "start": 1000000,
                "end": 1100000
            },
#             {
#                 "chromosome": "2"
#             },
            {
                "chromosome": "12", 
                "start": 1000000,
                "end": 1100000
            }
        ]


# Initialize the API client
config = sbg.Config(profile=sb_profile)
api = sbg.Api(config=config)

# Define and submit the export request
headers = {
    "X-SBG-Auth-Token": api.token,
    "Content-Type": "application/json"
}
body = {
    "project": project,
    "inputs": {
        "studies": studies,
        "regions": regions
    },
    "instance_type": "X-Large"
}
r = requests.post(url=api.url+'/topmed-vcf-export', json=body, headers=headers)
export_id = r.json()['id']

print("Export started. Please do not close the session until it is done")

# Monitor the export until done
while True:
    try:
        r = requests.get(url=api.url + '/topmed-vcf-export/' + export_id, headers=headers)
        if r.json()['status'] in ['Completed', 'Failed']:
           print("Export completed")
            break
        time.sleep(10)
    except KeyboardInterrupt:
        r = requests.post(url=api.url + '/topmed-vcf-export/' + export_id + '/actions/abort', headers=headers)
        time.sleep(10)
        print("Aborted export")
        break #abort unload


# Navigate through the results
user_dir = api.files.query(project=project, names=[api.users.me().username])[0]
time_dir = api.files.query(parent=user_dir.id)[-1]
export_dir = api.files.query(parent=time_dir.id)[-1]

# Optionally print out the resulting directory tree
# for result_dir in api.files.query(parent=export_dir):
#     print(result_dir.name)
#     for f in api.files.query(parent=result_dir):
#         print ('--' + f.name)


# Start off tasks for resulting folders
task = api.tasks.create(
    app=app,
    name="Merge Variant Store Export chunks - " + str(datetime.datetime.now()),
    project=project,
    inputs={"in_directory": export_dir},
    run=True
)
print(f"Task {task.id} started in project {project}")

Resulting files

The result of the process should be one multisample VCF file per region specified in the script inputs. If the regions are unspecified, the script will generate one VCF file per chromosome.

The multisample VCFs will contain all the samples from the specified study with consent code values.
The multisample VCF files will contain variants that have at least one genotype with non-reference alleles. This means that if for a specific position all samples in the resulting file would have a GT value of 0/0, this variant is omitted from the file.

Cost of data transfer

Empirical data shows cloud costs based on number of samples and genomic region length in the table below.

Number of samplesGenomic regionTotal costs
100Chromosome 11$30
1000Entire genome$300

*The costs are presented as a guideline, actual costs may differ