This Quickstart Guide leads you through a simple RNA sequencing analysis. It uses the BioData Catalyst powered by Seven Bridges API, but its steps parallel the Quickstart for the visual interface.
We have written this example in Python, but the concepts can be adapted to your preferred programming language. We encourage you to try this analysis yourself, as an aid to creating a script for your own custom analysis.
Objective:
We will use the API to create a project, add files to it, add a workflow, create and run a task, then download the outputs.
On this page
- Requirements
- Preparatory work
- API Quickstart
- 1 Create a project
- 2 Add files to the project
- 3 Get a copy of the correct public workflow
- 4 Build a file processing list for your analysis
- 5 Format, create, and start your tasks
- 6 Check task completion
- 7 Download Files
- Visualize files on the Platform visual interface
- Download files via the API
Requirements
To run the code in this tutorial, you will need:
- Python version 2.7.x (Python 3.x is not 100% compatible with all the code used in this example)
- The Python requests module. If you do not already have this, it can be installed via Python's package management system, pip:
$ sudo pip install requests
- The authentication token associated with your account, which you can get by going to the Developer Dashboard after logging into your account. Remember to keep your authentication token secure!
- This project analyzes TCGA Controlled Data that is available on the Platform. To access the particular data file used here, you will need to have been awarded TCGA Controlled Data access through dbGaP.
- We show three ways of adding the controlled data file to your project. You can choose your preferred method:
- Find the file(s) you need with the Case Explorer and Data Browser. To learn more about this, follow the QuickStart for the visual interface.
- Copy the file(s) from another project that you are a member of using the API.
- Upload your own private data to analyze using the Command line uploader.
Preparatory work
To interact with the API, we send and receive data as JSON objects. Each JSON object received will represent one of the following:
- A requested resource, or resources, listed in the items field in the JSON array,
- An error, accompanied by text detailing the error in the message field.
Most of the GET, POST, and PATCH requests will only signal their success or failure by means of an HTTP status code in the response.
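As a minimal sketch of the distinction described above, the helper below sorts a decoded response body into a list, an error, or a single resource. The name classify_response and the sample dictionaries are illustrative assumptions, not part of the API; only the items and message fields come from the description above.

```python
def classify_response(body):
    """Return 'list', 'error', or 'detail' for a decoded JSON response body."""
    if 'items' in body:
        return 'list'    # several resources, held in the items field
    if 'message' in body:
        return 'error'   # the error text is in the message field
    return 'detail'      # anything else is treated as one resource


# Made-up bodies, for illustration only
print(classify_response({'items': [{'id': 'a'}, {'id': 'b'}]}))
print(classify_response({'message': 'Unauthorized'}))
print(classify_response({'id': 'a', 'name': 'example'}))
```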
First we import the necessary Python libraries and define the names of our new project and the desired workflow.
In the code below, please replace the AUTH_TOKEN string with your authentication token!
# IMPORTS
import time as timer
from requests import request
import json
from urllib2 import urlopen
import os
# GLOBALS
FLAGS = {'targetFound': False,  # target project already exists on the Platform
         'taskRunning': False,  # task is still running
         'startTasks': True     # (False) create, but do NOT start tasks
         }
# project we will create in the Platform (Settings > Project name in GUI)
TARGET_PROJECT = 'Quickstart_API'
TARGET_APP = 'RNA-seq Alignment - STAR for TCGA PE tar' # app to use
INPUT_EXT = 'tar.gz'
# TODO: replace AUTH_TOKEN with yours here
AUTH_TOKEN = 'AUTH_TOKEN'
Since we are going to write the functions that interact with API in Python, we'll prepare a function that converts the information we send and receive into JSON.
# FUNCTIONS
def api_call(path, method='GET', query=None, data=None, flagFullPath=False):
    """ Translates all the HTTP calls to interface with the Platform

    This code adapted from the Seven Bridges platform API v1.1 example
    https://docs.sbgenomics.com/display/developerhub/Quickstart
    flagFullPath is novel, added to smoothly resolve pagination issues with the Platform API"""
    # only dicts and lists are serialized into the request body
    data = json.dumps(data) if isinstance(data, (dict, list)) else None
    base_url = 'https://api.sb.biodatacatalyst.nhlbi/v2/'
    headers = {
        'X-SBG-Auth-Token': AUTH_TOKEN,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }
    if flagFullPath:
        # path is already a complete URL (e.g. an href returned by the API)
        response = request(method, path, params=query, data=data, headers=headers)
    else:
        response = request(method, base_url + path, params=query, data=data, headers=headers)
    response_dict = json.loads(response.content) if response.content else {}
    if response.status_code // 100 != 2:  # anything other than 2xx is an error
        print response_dict['message']
        raise Exception('Server responded with status code %s.' % response.status_code)
    return response_dict

def hello():  # for debugging
    print("Is it me you're looking for?")
    return True
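To see what api_call() sends without contacting the server, the sketch below assembles the same URL, headers, and body and simply returns them. build_request and is_success are hypothetical helpers written for this illustration, and the base URL and token in the example are placeholders rather than real values.

```python
import json


def build_request(base_url, token, path, data=None, full_path=False):
    """Assemble the URL, headers, and body that api_call() passes to request()."""
    url = path if full_path else base_url + path
    headers = {
        'X-SBG-Auth-Token': token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }
    # Only dicts and lists are serialized; anything else means "no body"
    body = json.dumps(data) if isinstance(data, (dict, list)) else None
    return url, headers, body


def is_success(status_code):
    """True for any 2xx status code; // behaves the same in Python 2 and 3."""
    return status_code // 100 == 2


# Placeholder base URL and token, for illustration only
url, headers, body = build_request('https://example.invalid/v2/', 'PLACEHOLDER_TOKEN',
                                   'projects', data={'name': 'Quickstart_API'})
print(url)
print(body)
```

Keeping the request assembly separate from the network call makes it easier to check a payload before it is sent.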
We will not only create objects but also interact with them, so this demo uses object-oriented programming.
We have created a class, API, defined below. Generally, the API calls will either return a list of things (e.g. myFiles is plural) or a very detailed description of one thing (e.g. myFile is singular). The appropriate structure is created automatically in the response_to_fields() method.
# CLASSES
class API(object):
    # making a class out of the api() function, adding other methods
    def __init__(self, path, method='GET', query=None, data=None, flagFullPath=False):
        self.flag = {'longList': False}
        response_dict = api_call(path, method, query, data, flagFullPath)
        self.response_to_fields(response_dict)
        if self.flag['longList']:
            self.long_list(response_dict, path, method, query, data)

    def response_to_fields(self, rd):
        if 'items' in rd.keys():
            # get * {files, projects, tasks, apps} (object name plural)
            if len(rd['items']) > 0:
                self.list_read(rd)
            else:
                self.empty_read(rd)
        else:
            # get details about ONE {file, project, task, app}
            # (object name singular)
            self.detail_read(rd)

    def list_read(self, rd):
        # store each key of the returned items as a list attribute
        n = len(rd['items'])
        keys = rd['items'][0].keys()
        m = len(keys)
        for jj in range(m):
            temp = [None]*n
            for ii in range(n):
                temp[ii] = rd['items'][ii][keys[jj]]
            setattr(self, keys[jj], temp)
        # 'and' short-circuits, so a response without 'links' does not raise KeyError
        if ('links' in rd.keys()) and (len(rd['links']) > 0):
            self.flag['longList'] = True

    def empty_read(self, rd):  # in case an empty project is queried
        self.href = []
        self.id = []
        self.name = []
        self.project = []

    def detail_read(self, rd):
        # store each key of the single returned object as an attribute
        keys = rd.keys()
        m = len(keys)
        for jj in range(m):
            setattr(self, keys[jj], rd[keys[jj]])

    def long_list(self, rd, path, method, query, data):
        prior = rd['links'][0]['rel']
        # Normally .rel[0] is the next, and .rel[1] is prior.
        # If .rel[0] = prior, then you are at END_OF_LIST
        keys = rd['items'][0].keys()
        m = len(keys)
        while prior == 'next':
            rd = api_call(rd['links'][0]['href'], method, query, data, flagFullPath=True)
            prior = rd['links'][0]['rel']
            n = len(rd['items'])
            for jj in range(m):
                temp = getattr(self, keys[jj])
                for ii in range(n):
                    temp.append(rd['items'][ii][keys[jj]])
                setattr(self, keys[jj], temp)
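The two ideas in the class above, turning a list of items into one list per field and following 'next' links until the listing ends, can be sketched without any network access. In the sketch below, items_to_columns and collect_pages are hypothetical names, the pages are made-up stand-ins for server responses, and every item is assumed to carry the same keys.

```python
def items_to_columns(items):
    """Turn a list of dicts into a dict of lists, one list per key."""
    columns = {}
    for item in items:
        for key, value in item.items():
            columns.setdefault(key, []).append(value)
    return columns


def collect_pages(first_page, fetch):
    """Gather items from first_page and from every page reachable via a 'next' link.

    fetch is any callable that takes an href and returns the decoded page.
    """
    items = list(first_page.get('items', []))
    page = first_page
    while True:
        next_links = [link['href'] for link in page.get('links', [])
                      if link.get('rel') == 'next']
        if not next_links:
            break  # no 'next' link: this was the last page
        page = fetch(next_links[0])
        items.extend(page.get('items', []))
    return items_to_columns(items)


# Two made-up pages standing in for server responses
PAGES = {
    'page2': {'items': [{'id': 'f3', 'name': 'c.gtf'}],
              'links': [{'rel': 'prev', 'href': 'page1'}]},
}
page1 = {'items': [{'id': 'f1', 'name': 'a.tar.gz'}, {'id': 'f2', 'name': 'b.fasta'}],
         'links': [{'rel': 'next', 'href': 'page2'}]}

files = collect_pages(page1, PAGES.get)
print(files['name'])
```

Searching the links for a 'next' entry, rather than reading the first link only, also covers a last page that returns no links at all.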
API Quickstart
1. Create a project
All work on the Platform is carried out inside a project. For this task, we can either use a project that has already been created, or we can use the API to create one. Here we will create a new project, TARGET_PROJECT, which we set in the definitions above to be 'Quickstart_API'. However, since we first want to check that the named project doesn't already exist, we'll also GET a list of all the projects you can access.
The project's name and description will be sent in the call to create the project, and its billing_group will be set to the first billing group on your account.
if __name__ == "__main__":
    # Did you remember to change the AUTH_TOKEN?
    if AUTH_TOKEN == 'AUTH_TOKEN':
        print "You need to replace 'AUTH_TOKEN' string with your actual token. Please fix it."
        exit()
    # list all billing groups on your account
    billingGroups = API('billing/groups')
    # Select the first billing group, this is "Pilot_funds(USER_NAME)"
    print billingGroups.name[0], \
        'will be charged for this computation. Approximate price is $4 for example STAR RNA seq (n=1) \n'
    # list all projects you are part of
    existingProjects = API(path='projects')  # make sure your project doesn't already exist
    # set up the information for your new project
    NewProject = {
        'billing_group': billingGroups.id[0],
        'description': "A project created by the API Quickstart",
        'name': TARGET_PROJECT,
        'tags': ['tcga']
    }
    # Check to make sure your project doesn't already exist on the platform
    for ii, p_name in enumerate(existingProjects.name):
        if TARGET_PROJECT == p_name:
            FLAGS['targetFound'] = True
            break
    # Make a shiny, new project
    if FLAGS['targetFound']:
        # GET existing project details (we need them later)
        myProject = API(path=('projects/' + existingProjects.id[ii]))
    else:
        # POST new project
        myProject = API(method='POST', data=NewProject, path='projects')
        # (re)list all projects, to check that new project posted
        existingProjects = API(path='projects')
        # GET new project details (we will need them later)
        myProject = API(path=('projects/' + existingProjects.id[0]))
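The code above takes the first entry of the refreshed project list to be the newly created project. A slightly more defensive sketch looks the project up by name instead; find_project_id is a hypothetical helper, and the names and IDs in the example are made up.

```python
def find_project_id(names, ids, target):
    """Return the ID paired with the first name equal to target, or None."""
    for name, project_id in zip(names, ids):
        if name == target:
            return project_id
    return None


# Made-up listing, in the same parallel-list shape as existingProjects.name / .id
names = ['Open Data Project', 'Quickstart_API']
ids = ['user/open-data-project', 'user/quickstart-api']
print(find_project_id(names, ids, 'Quickstart_API'))
print(find_project_id(names, ids, 'No Such Project'))
```

The same lookup also serves the existence check: a result of None means the project still has to be created.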
2. Add files to the project
Here we have shown three different options for adding data to a project:
(a) Copy files from an existing project using the API
(b) Copy files from an existing project using the visual interface
(c) Add files using the API and command line uploader.
Follow one of these methods only.
(a) Copy files from an existing project via the API
Here we will take advantage of the project that you will have created if you followed the Platform QuickStart, so if you haven't yet followed that tutorial, go and do that first. Then you will have a project named 'Quickstart' that contains files we can use for our analysis.
The following code looks for the three files in that project and copies them over to our current project, Quickstart_API.
if __name__ == "__main__":
    for ii, p_id in enumerate(existingProjects.id):
        # the name must match your existing project's name exactly
        if existingProjects.name[ii] == 'Quickstart':
            filesToCopy = API(('files?limit=100&project=' + p_id))
            break
    # Don't make extra copies of files
    # (loop through all files because we don't know what we want)
    # files currently in project
    myFiles = API(('files?limit=100&project=' + myProject.id))
    for jj, f_name in enumerate(filesToCopy.name):
        # Conditional is HARDCODED for RNA Seq STAR workflow
        if f_name[-len(INPUT_EXT):] == INPUT_EXT or f_name[-len('sta'):] \
                == 'sta' or f_name[-len('gtf'):] == 'gtf':
            if f_name not in myFiles.name:
                # file currently not in project
                api_call(path=(filesToCopy.href[jj] + '/actions/copy'), method='POST', \
                         data={'project': myProject.id, 'name': f_name}, flagFullPath=True)
(b) Copy files from an existing project using the visual interface
Again, this method takes advantage of the project that you will have created if you followed the Quickstart. So, if you haven't yet followed that tutorial, go and do that first. Then you will have a project named 'Quickstart' that contains files we can use for our analysis.
To copy those files into your project 'Quickstart_API' using the Platform visual interface:
- Select the project 'Quickstart_API' that you have just created.
- Go to the Files tab, and click Add Files.
- On the left hand side, you will see a list of locations that you can add files from. Under projects you will see 'Quickstart'. Click that project's name.
- Select the checkboxes next to the files in that project. Then click Add to project.
(c) Upload local files using the API and the command line uploader
To use this option you need to have the Platform command line uploader installed already.
If you are using this script to call the uploader, make sure to set up your $AUTH_TOKEN.
You first need to find the IDs of your projects with:
bin/f4c-uploader.sh --list-projects
which will print:
9e710b4e-148e-414f-99b0-26cfbc316719 Quickstart_API
e56092a9-482d-44fc-a98d-825a3c90c5d2 Quickstart
431d4397-8b7e-4d35-bb74-47865750aead Open Data Project
We will copy the first string since it matches our project name. Then, add the following to the Python script:
print "You need to install the command line uploader before proceeding"
ToUpload = ['G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz', 'ucsc.hg19.fasta', 'human_hg19_genes_2014.gtf']
for f_name in ToUpload:
    # -p takes the project ID printed by --list-projects; replace it with your own
    cmds = "cd ~/f4c-uploader; bin/f4c-uploader.sh -p 9e710b4e-148e-414f-99b0-26cfbc316719 " + \
        "/Users/digi/PycharmProjects/platform_API/toUpload/" + f_name
    os.system(cmds)
File directory
In the example code above, /Users/digi/PycharmProjects/platform_API/toUpload/ is the path of the directory containing files to upload. You should change this to the appropriate path on your own computer.
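One way to keep the upload call readable is to build the command as a list of arguments rather than one long string. The sketch below only constructs and prints the command; build_upload_command is a hypothetical helper, the project ID and directory are placeholders, and -p is the only uploader option used because it is the only one shown above.

```python
import os


def build_upload_command(project_id, directory, file_name):
    """Return the uploader call as an argument list (nothing is executed here)."""
    return ['bin/f4c-uploader.sh', '-p', project_id,
            os.path.join(directory, file_name)]


# Placeholder project ID and directory: substitute your own values
cmd = build_upload_command('YOUR-PROJECT-ID', '/path/to/toUpload', 'ucsc.hg19.fasta')
print(' '.join(cmd))
```

A list built this way could be handed to subprocess.call(cmd, cwd=uploader_dir), where uploader_dir is the directory the uploader was installed in; passing arguments as a list avoids quoting problems with file names that contain spaces.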
Now that your files are uploaded, it may be useful to set their metadata. For more information about metadata, please refer to the file metadata documentation page. Once the file is uploaded, we can use the API call to set the file metadata.
For this, we need to know the ID number of the file we just uploaded; this is the number used to identify the file with the API.
We can obtain the file ID by running the API call to list project files, which returns the names and IDs for all the files in the project.
See the API overview for more information on referring to files, projects and other objects on the Platform.
Once we have the file's ID, we can move on to setting its metadata. This is done with a PATCH request to the file's metadata path, which is the file's href followed by /metadata.
We include the metadata we want to set in the body of the request, in the form of a JSON dictionary. Below is an example of how this is done (replace with appropriate metadata for your own files):
# here we modify file #1, adapt appropriately
singleFile = api_call(path=myFiles.href[1], flagFullPath=True)
metadata = {
    # this is made up metadata, adapt appropriately
    "name": singleFile['name'],
    "library": "TEST",
    "file_type": "fastq",
    "sample": "example_human_Illumina",
    "seq_tech": "Illumina",
    "paired_end": "1",
    "gender": "female",
    "data_format": "awesome"
}
api_call(path=(singleFile['href'] + '/metadata'), method='PATCH', \
         data=metadata, flagFullPath=True)
3. Get a copy of the correct public workflow
There are more than 150 public apps available on the Platform. Here we query all of them, then copy the target workflow, TARGET_APP, which we set earlier to be 'RNA-seq Alignment - STAR for TCGA PE tar'.
if __name__ == "__main__":
    # GET files LIST, regardless of upload method
    myFiles = API(('files?limit=100&project=' + myProject.id))
    # Add a workflow (copy it from another project or the public apps,
    # not looping through all apps, we know exactly what we want)
    # long function call, currently 183
    allApps = API(path='apps?limit=100&visibility=public')
    myApps = API(path=('apps?limit=100&project=' + myProject.id))
    if TARGET_APP not in allApps.name:
        print("Target app (%s) does not exist in the public repository. Please check the spelling" \
              % (TARGET_APP))
    else:
        ii = allApps.name.index(TARGET_APP)
        if TARGET_APP not in myApps.name:
            # app not already in project
            temp_name = allApps.href[ii].split('/')[-2]  # copy app from public repository
            api_call(path=('apps/' + allApps.project[ii] + '/' + temp_name + '/actions/copy'), \
                     method='POST', data={'project': myProject.id, 'name': TARGET_APP})
            myApps = API(path=('apps?limit=100&project=' + myProject.id))  # update project app list
    del allApps
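The copy path in the code above is pieced together from the app's href and the project it belongs to. The sketch below isolates that step; app_copy_path is a hypothetical helper, and the href is a made-up value whose layout (app name as the second-to-last segment) is an assumption inferred from the split('/')[-2] call above.

```python
def app_copy_path(app_href, app_project):
    """Build the relative path for the copy action from an app's href and project."""
    app_name = app_href.split('/')[-2]  # second-to-last segment, as in the code above
    return 'apps/' + app_project + '/' + app_name + '/actions/copy'


# Made-up href and project, for illustration only
href = 'https://example.invalid/v2/apps/owner/public-project/example-app/0'
print(app_copy_path(href, 'owner/public-project'))
```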
4. Build a file processing list for your analysis
It's likely that you'll only have one input file and two reference files in your project. However, if multiple input files were imported, the following code will create a batch of tasks -- one for each file. This code builds the list of files:
if __name__ == "__main__":
    # Build .fileProcessing (inputs) and .fileIndex (references) lists [for workflow]
    FileProcList = ['Files to Process']
    Ind_GtfFile = None
    Ind_FastaFile = None
    for ii, f_name in enumerate(myFiles.name):
        # this conditional is for 'RNA seq STAR alignment' in
        # Quickstart_API. _Adapt_ appropriately for other workflows
        if f_name[-len(INPUT_EXT):] == INPUT_EXT:  # input file
            FileProcList.append(ii)
        elif f_name[-len('gtf'):] == 'gtf':
            Ind_GtfFile = ii
        elif f_name[-len('sta'):] == 'sta':
            Ind_FastaFile = ii
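The same sorting can be written as a small function that is easy to try out on a plain list of names. This is a sketch only: classify_files is a hypothetical name, it uses str.endswith in place of the slice comparisons above, and it returns the input indices without the leading header entry.

```python
def classify_files(names, input_ext='tar.gz'):
    """Split file names into input indices and the two reference-file indices."""
    inputs, gtf_index, fasta_index = [], None, None
    for index, name in enumerate(names):
        if name.endswith(input_ext):   # input archive
            inputs.append(index)
        elif name.endswith('gtf'):     # gene annotation reference
            gtf_index = index
        elif name.endswith('sta'):     # matches names ending in .fasta
            fasta_index = index
    return inputs, gtf_index, fasta_index


names = ['G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz',
         'ucsc.hg19.fasta',
         'human_hg19_genes_2014.gtf']
print(classify_files(names))
```

If either reference index comes back as None, the project is missing a reference file and the tasks in the next step cannot be built.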
5. Format, create, and start your tasks
Next we will iterate through the File Processing List, FileProcList, to generate one task for each input file.
if __name__ == "__main__":
    myTaskList = [None]
    for ii, f_ind in enumerate(FileProcList[1:]):
        # Start at 1 because FileProcList[0] is a header
        NewTask = {'description': 'APIs are awesome',
                   'name': ('batch_task_' + str(ii)),
                   'app': (myApps.id[0]),  # ASSUMES only single workflow in project
                   'project': myProject.id,
                   'inputs': {
                       'genomeFastaFiles': {  # .fasta reference file
                           'class': 'File',
                           'path': myFiles.id[Ind_FastaFile],
                           'name': myFiles.name[Ind_FastaFile]
                       },
                       'input_archive_file': {  # File Processing List
                           'class': 'File',
                           'path': myFiles.id[f_ind],
                           'name': myFiles.name[f_ind]
                       },
                       # .gtf reference file, !NOTE: this workflow expects a _list_ for this input
                       'sjdbGTFfile': [
                           {
                               'class': 'File',
                               'path': myFiles.id[Ind_GtfFile],
                               'name': myFiles.name[Ind_GtfFile]
                           }
                       ]
                   }
                   }
        # Create the tasks, run if FLAGS['startTasks']
        if FLAGS['startTasks']:
            # task created and run
            myTask = api_call(method='POST', data=NewTask, path='tasks/', query={'action': 'run'})
            myTaskList.append(myTask['href'])
        else:
            # task created but NOT started
            myTask = api_call(method='POST', data=NewTask, path='tasks/')
    myTaskList.pop(0)  # drop the placeholder entry
    print("%i tasks have been created. \n" % (ii+1))
    print("Enjoy a break, come back to us once you've got an email that tasks are done")
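Because the task body is deeply nested, it can help to build it in a function and inspect the JSON before posting anything. The sketch below mirrors the structure used above; file_input and build_task are hypothetical helpers, and all IDs and file names in the example are made up.

```python
import json


def file_input(file_id, file_name):
    """One file reference in the shape used for the task inputs above."""
    return {'class': 'File', 'path': file_id, 'name': file_name}


def build_task(index, app_id, project_id, fasta, archive, gtf):
    """Build the task body; fasta, archive and gtf are (file_id, file_name) pairs."""
    return {
        'description': 'APIs are awesome',
        'name': 'batch_task_' + str(index),
        'app': app_id,
        'project': project_id,
        'inputs': {
            'genomeFastaFiles': file_input(*fasta),
            'input_archive_file': file_input(*archive),
            'sjdbGTFfile': [file_input(*gtf)],  # this input is a list
        },
    }


# Made-up IDs and names, for illustration only
task = build_task(0, 'user/quickstart-api/example-app', 'user/quickstart-api',
                  fasta=('id-fasta', 'ref.fasta'),
                  archive=('id-archive', 'sample.tar.gz'),
                  gtf=('id-gtf', 'genes.gtf'))
print(json.dumps(task, indent=2, sort_keys=True))
```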
6. Check task completion
These tasks may take a long time to complete. Here are two ways to check in on them:
(a) Wait for email confirmation
No additional code is needed. An email with the status of your task will be sent to you when it completes.
(b) Poll task status
The following script will poll the task every 10 minutes and report back when it has completed.
if __name__ == "__main__":
    # if tasks were started, check if they've finished
    for href in myTaskList:
        # check on one task at a time, if any running, can not continue (no sense to query others)
        print("Pinging the Platform for task completion, will download files once all tasks completed.")
        FLAGS['taskRunning'] = True
        while FLAGS['taskRunning']:
            task = api_call(path=href, flagFullPath=True)
            if task['status'] == 'COMPLETED':
                FLAGS['taskRunning'] = False
            elif task['status'] in ('FAILED', 'ABORTED'):  # NOTE: leave loop on ANY failure
                print "Task failed, can not continue"
                exit()
            else:
                timer.sleep(600)  # still running: wait 10 minutes before polling again
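The polling logic can be tried out without a running task by passing in the status source and the sleep function. In the sketch below, wait_for_task is a hypothetical helper, the statuses are fed from a made-up sequence, and treating ABORTED as a failure alongside FAILED is an assumption.

```python
import time

FAILURE_STATES = ('FAILED', 'ABORTED')  # assumption: both mean the task will not finish


def wait_for_task(get_status, interval=600, sleep=time.sleep, max_polls=None):
    """Poll get_status() until the task completes (True) or fails (False)."""
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status == 'COMPLETED':
            return True
        if status in FAILURE_STATES:
            return False
        if max_polls is not None and polls >= max_polls:
            raise RuntimeError('gave up after %d polls' % polls)
        sleep(interval)  # only wait when the task is still in progress


# Made-up status sequence and a sleep that records instead of waiting
statuses = iter(['RUNNING', 'RUNNING', 'COMPLETED'])
waits = []
finished = wait_for_task(lambda: next(statuses), sleep=waits.append)
print(finished, waits)
```

In real use, get_status would wrap the task request, for example a function that fetches the task by its href and returns its status field.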
7. Download Files
It may be useful to quickly download some summary files to visualize the results.
Visualize files on the Platform visual interface
To visualize the files produced by your task:
- Log in to the Platform, and go to the Quickstart_API project
- Click on the Files tab and select the files produced by the task. Clicking on any file will bring up its metadata and an option to visualize it. There is also an option to download the file.
Download files via the API
You can do this by listing the files in your project again (myNewFiles in the code below) and iterating through that list:
from urllib2 import urlopen
import os

def download_files(fileList):
    # download a list of files from URLs
    dl_dir = 'downloads/'
    try:  # make sure we have the download directory
        os.stat(dl_dir)
    except OSError:
        os.mkdir(dl_dir)
    for ii in range(1, len(fileList)):  # skip first [0] entry, it is a text header
        url = fileList[ii]
        # recover the file name from the last segment of the download URL
        file_name = url.split('/')[-1]
        file_name = file_name.split('?')[0]
        file_name = file_name.split('%2B')[1]
        u = urlopen(url)
        f = open((dl_dir + file_name), 'wb')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (file_name, file_size)
        file_size_dl = 0
        block_sz = 1024*1024
        prior_percent = 0
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8)*(len(status)+1)
            # report progress roughly every 20%
            if (file_size_dl * 100. / file_size) > (prior_percent+20):
                print status + '\n'
                prior_percent = (file_size_dl * 100. / file_size)
        f.close()

# Check which files have been generated (only taking small files to avoid long times)
myNewFiles = API(('files?project=' + myProject.id))  # calling again to see what was generated
dlList = ["links to file downloads"]
for ii, f_name in enumerate(myNewFiles.name):
    # downloading only the summary files. Adapt for whichever files you need
    if (f_name[-4:] == '.out'):
        dlList.append(api_call(path=('files/' + myNewFiles.id[ii] + '/download_info'))['url'])
T0 = timer.time()
download_files(dlList)
print timer.time() - T0, "seconds download time"
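The file name handling in download_files() depends on the download URL containing an encoded plus sign (%2B). A more tolerant sketch is shown below; file_name_from_url is a hypothetical helper, the URL is made up, and the reading that the text before the plus sign is a prefix to drop is an assumption based on the split('%2B')[1] call above.

```python
try:
    from urllib.parse import urlsplit, unquote  # Python 3
except ImportError:
    from urlparse import urlsplit               # Python 2
    from urllib import unquote


def file_name_from_url(url, strip_prefix=True):
    """Extract a local file name from the last path segment of a download URL."""
    name = unquote(urlsplit(url).path.split('/')[-1])  # query string is ignored
    if strip_prefix and '+' in name:
        name = name.split('+', 1)[1]  # keep only the part after the first '+'
    return name


# Made-up URL, for illustration only
url = 'https://example.invalid/bucket/prefix%2Bsample.Log.final.out?signature=abc'
print(file_name_from_url(url))
```

Unlike indexing the result of split('%2B') directly, this version still returns a usable name when the URL contains no encoded plus sign.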
Good luck and have fun!