Import data from BioData Catalyst Powered by Gen3


The purpose of this feature is to allow you to bring data from BioData Catalyst Powered by Gen3 into a project on BioData Catalyst Powered by Seven Bridges so you can analyze it further. The procedure consists of two steps: selecting and exporting the data from BioData Catalyst Powered by Gen3, then importing it into a project on BioData Catalyst Powered by Seven Bridges.


Prerequisite: an active NIH or BioData Catalyst Developer account with access to controlled data.

Selecting and exporting data from BioData Catalyst powered by Gen3

This step takes place on BioData Catalyst powered by Gen3.

Before you can import the data into your project, you first need to export it from BioData Catalyst Powered by Gen3 by following the procedure below:

  1. On the top navigation bar click Exploration.
  2. In the pane on the left, select the Data tab.
  3. Use the available filters in the Filters section to select the patient data you are interested in.
  4. When you're done with filtering, click Export to Seven Bridges to start the export process.

Importing data into a project on BioData Catalyst powered by Seven Bridges

This step takes place on BioData Catalyst powered by Seven Bridges.

After using the export feature, you will land on a page for setting up the import to the Platform (further discussed below).

  1. Select the project you want to import the data to. Important note: this data can only be imported into a controlled project. Alternatively, you can create a new project and import the files there.
  2. (Optional) Add tags for the files.
  3. (Optional) Choose the method for resolving naming conflicts in case files with the same names already exist in your project.
  4. Click Import Data.

The files will be imported. A notification in the upper-right corner shows the exact number of files that were imported.

The following file formats will be available:

  • Multisample VCF
  • BAM
  • Raw clinical TSV
  • Raw clinical Avro (PFB file)

To use raw clinical data in your analyses, first extract the data from the archive and convert it to JSON, following the procedure below.

Extract and convert Avro files to JSON

This step takes place in Data Studio on BioData Catalyst powered by Seven Bridges.

  1. Access the project containing the data you imported from BioData Catalyst powered by Gen3.
  2. Open the Data Studio tab. This takes you to the Data Studio home page. If you have previous analyses, they will be listed on this page.
  3. In the top-right corner click Create new analysis. The Create new analysis wizard is displayed.
  4. Name your analysis in the Analysis name field.
  5. Select JupyterLab as the analysis environment.
  6. Keep the default Environment setup. Each setup is a preinstalled set of libraries that is tailored for a specific purpose. Learn more.
  7. Keep the default Instance type and Suspend time settings.
  8. Click Start. The Platform will start acquiring an adequate instance for your analysis, which may take a few minutes.
  9. Once the Platform has acquired an instance for your analysis, the JupyterLab home screen is displayed.
  10. In the Notebook section, select Python 3. A new blank Jupyter notebook opens.
  11. Start by installing the fastavro library:
!pip install fastavro
  12. Import the reader function from fastavro and import the json Python library:
from fastavro import reader
import json
  13. Unpack the gzip archive to get the Avro file. Make sure to replace <gzip-file-name> in the code below with the actual name of the archive. The /sbgenomics/project-files/ path is the standard path used to reference project files in a Data Studio analysis.
!tar -xzvf /sbgenomics/project-files/<gzip-file-name>.avro.gz
  14. Use the reader function of the fastavro library to read the Avro file and collect its records in the file list. Make sure to replace <avro-file-name> with the name of the Avro file that was extracted from the gzip archive in the previous step.
file = []
with open('<avro-file-name>.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        file.append(record)
  15. Save the content to a JSON file. Make sure to replace <json-file-name> with the actual name you want to use for the JSON file.
with open('<json-file-name>.json', 'w') as fp:
    json.dump(file, fp, indent=2)
  16. Finally, copy the JSON file to the /sbgenomics/output-files/ directory within the analysis. When you stop the Data Studio analysis, the JSON file will be saved to the project and will be available for further use in tasks within that project.
!cp <json-file-name>.json /sbgenomics/output-files/