Project and file locations (Multi-cloud)
The steady decrease in sequencing costs has led to an exponential increase in the amount of data generated by NGS technology. Because of its volume, this data is usually stored in the cloud, which makes downloading or moving the files an extremely expensive operation.
To optimize task execution on the Platform and its costs, it is important to understand the basics of where your files are stored and where execution takes place.
BioData Catalyst powered by Seven Bridges currently works with two cloud providers, Amazon Web Services (AWS) and Google Cloud Platform (GCP), both of which are available for file storage and computation purposes. Each of the cloud infrastructure providers has cloud resources hosted in multiple locations across the world, called regions.
If you store your files in AWS and GCP US regions, the Platform lets you manage all your work from a single space and spin up your chosen computation resources where your data lives, which makes cost optimization of your work much easier.
In order to use files as inputs for a task on the Platform, the files need to be uploaded/imported to the Platform, accessed from a mounted Amazon Simple Storage Service (S3) or Google Cloud Storage (GCS) bucket, or used from publicly available files (Seven Bridges Public Reference Files and available public datasets).
The files and the computation instances that are used to process the files may be located in the same cloud provider's region, but can also be in different regions or even on different cloud providers. If the two locations are in the same region within one cloud provider's infrastructure, there will be no additional data transfer costs as the Platform can use the files directly from the location where they are stored.
However, if these locations are in different regions or belong to different cloud providers, there will be additional data transfer costs as the Platform will need to transfer the files from the location where they are stored, to the location where computation is done. This data transfer is charged by the cloud infrastructure provider and passed through by Seven Bridges with no extra charge or fee.
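The rule described above (no transfer cost within one region, additional cost across regions or providers) can be sketched as a small comparison. This is only an illustration, not Platform code: the `Location` type and the `parse_location` and `transfer_kind` functions are hypothetical names, and the `provider:region` identifier format is assumed from the project location options listed on this page.

```python
from typing import NamedTuple


class Location(NamedTuple):
    provider: str
    region: str


def parse_location(identifier: str) -> Location:
    """Split an identifier such as 'aws:us-east-1' into provider and region."""
    provider, _, region = identifier.partition(":")
    if not provider or not region:
        raise ValueError(f"expected 'provider:region', got {identifier!r}")
    return Location(provider, region)


def transfer_kind(file_location: str, compute_location: str) -> str:
    """Classify the data transfer needed to bring a file to the computation."""
    stored = parse_location(file_location)
    compute = parse_location(compute_location)
    if stored.provider != compute.provider:
        return "cross-provider"  # additional transfer costs apply
    if stored.region != compute.region:
        return "cross-region"    # additional transfer costs apply
    return "none"                # same provider and region: no transfer cost
```

For example, `transfer_kind("google:us-west1", "aws:us-east-1")` classifies the transfer as cross-provider, while two identical identifiers yield `"none"`.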
Seven Bridges provides full transparency of data transfer costs charged by cloud infrastructure providers and allows you to optimize them by defining your project location as described below. For a complete list of data transfer prices, please refer to Amazon S3 Data Transfer Pricing and Google Cloud Storage Network Pricing.
When creating a project, you need to define the location (cloud provider and region) where the tasks in the project will be executed. The available options are:
- aws:us-east-1 - AWS US East (N.Virginia)
- google:us-west1 - GCP US West (Oregon)
The selected location is where analysis (task or Data Studio) execution takes place, i.e. the exact region where computation resources (virtual instances and their attached disks) will be spun up.
In addition, all files added to the project (analysis outputs, user uploads and imports) are stored at this location. This option helps you organize your workspace in the most cost-optimal way, leaving you in control of execution and file organization.
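As a rough mental model, a project can be pictured as a record whose location is fixed at creation and determines both where computation runs and where new files are stored. The sketch below is not the Platform API: the `Project` class and its methods are hypothetical, and only the two location identifiers come from the list above.

```python
from dataclasses import dataclass

# Project locations listed on this page.
AVAILABLE_LOCATIONS = {
    "aws:us-east-1": "AWS US East (N. Virginia)",
    "google:us-west1": "GCP US West (Oregon)",
}


@dataclass(frozen=True)  # frozen: the location is set once, at creation
class Project:
    name: str
    location: str

    def __post_init__(self) -> None:
        if self.location not in AVAILABLE_LOCATIONS:
            raise ValueError(f"unsupported project location: {self.location!r}")

    def compute_location(self) -> str:
        """Where instances and attached disks are spun up."""
        return self.location

    def upload_destination(self) -> str:
        """Where uploads, imports, and analysis outputs are stored."""
        return self.location
```

Because both methods return the same value, files uploaded to a project and the instances that process them always share a location in this model.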
In order to use files as inputs for a task on the Platform, the files need to be:
- Uploaded/imported to the Platform. Files uploaded to a project are stored at the location (cloud provider and region) that was selected as the project location when the project was created. This allows for cost optimization, as all analyses (tasks and Data Studio) are executed on computation instances at the same location and therefore cause no additional data transfer costs.
- Accessed from a mounted Amazon Simple Storage Service (S3) or Google Cloud Storage (GCS) bucket. The Platform allows you to mount your own external bucket where you store the files that you want to use as inputs for an analysis. In this case, there will be additional data transfer costs if your files are hosted at a different location from the one defined for your project. See below for details.
- Used from publicly available files (Seven Bridges Public Reference Files and available public datasets). Files that are available through the Public Reference Files repository and public datasets on the Platform are hosted in all available project locations and will not cause additional data transfer costs, regardless of your project's location.
- Generated as outputs from a previously completed task within the same project. Task outputs are stored at the defined project location and will not cause any additional data transfer costs when used as inputs in a different task within the same project.
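The four input sources above differ in whether they can lead to data transfer costs. A minimal sketch of that distinction follows; the `InputSource` enum and `incurs_transfer_cost` function are hypothetical names used only for illustration, and the sketch covers just the cases in this list (files copied between projects are discussed separately below).

```python
from enum import Enum
from typing import Optional


class InputSource(Enum):
    UPLOADED = "uploaded or imported to the project"
    MOUNTED_BUCKET = "mounted S3 or GCS bucket"
    PUBLIC = "public reference files and datasets"
    TASK_OUTPUT = "output of a task in the same project"


def incurs_transfer_cost(
    source: InputSource,
    project_location: str,
    bucket_location: Optional[str] = None,
) -> bool:
    """Return True if using the input can cause additional transfer costs."""
    if source is InputSource.MOUNTED_BUCKET:
        if bucket_location is None:
            raise ValueError("bucket_location is required for mounted buckets")
        # Costs arise only when the bucket is hosted somewhere else.
        return bucket_location != project_location
    # Uploads and task outputs live at the project location, and public
    # files are hosted in all available project locations.
    return False
```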
File transfer costs
Additional file transfer costs charged by the cloud infrastructure provider can arise in the situations described below.
Using files copied between projects that are in different locations
When you copy a file between projects that are in different locations, the file will not be physically copied to the target project's location. Instead, it will be used from the location where it was originally uploaded, as shown in the diagram below:
When you set such a file as an input in your task, a warning will be displayed below the input saying that the location of the file is different from the project location. Please be aware that this will cause additional costs, as the file will need to leave the region or cloud provider where it is stored to reach the computation instance where it will be processed.
When a task containing such files is completed, the cost of data transfer is included in the total costs of task execution and will be charged together with the task price. Data transfer cost will be shown as a separate item in the task price tooltip on the task page.
Seven Bridges provides full transparency of data transfer costs charged by cloud infrastructure providers and passes them through with no extra charge or fee. For a complete list of data transfer prices, please refer to Amazon S3 Data Transfer Pricing and Google Cloud Storage Network Pricing.
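The way data transfer appears as a separate item within the total task price can be illustrated with simple arithmetic. The figures below, including the per-GB rate, are made up for the example and are not actual AWS or GCP prices; the `task_price_breakdown` function is likewise a hypothetical helper.

```python
from decimal import Decimal


def task_price_breakdown(
    compute_cost: Decimal,
    transferred_gb: Decimal,
    rate_per_gb: Decimal,
) -> dict:
    """Split a task price into compute and data transfer items."""
    # The provider's transfer charge is passed through with no added fee.
    transfer_cost = transferred_gb * rate_per_gb
    return {
        "compute": compute_cost,
        "data_transfer": transfer_cost,
        "total": compute_cost + transfer_cost,
    }


# Hypothetical task: $12.40 of compute plus 250 GB moved at $0.02 per GB.
breakdown = task_price_breakdown(
    compute_cost=Decimal("12.40"),
    transferred_gb=Decimal("250"),
    rate_per_gb=Decimal("0.02"),
)
```

With these illustrative numbers, the data transfer item is $5.00 and the total task price is $17.40.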
Using files from mounted AWS S3 or GCS buckets
Another scenario that can incur data transfer costs is the use of files from mounted cloud storage buckets. If your bucket is in a location that is different from the location of the project in which you want to use the files, the files will need to be transferred to the project's location when an analysis is executed, as this is where the computation takes place.
In this case, the additional file transfer costs are charged directly to your account with the cloud storage provider. Note also that the Platform does not display a warning about the input file's location being different from the project location when such files are used as task inputs.
To optimize your task execution costs, when creating a project and planning the analyses that will be executed within that project, please keep in mind where your input files are stored and choose a project location that matches the location of your files, if possible.
Also, please keep in mind that the location of a project cannot be changed once the project has been created.
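Since the project location cannot be changed later, it can help to estimate up front which location would require the least data to be moved. The following is a minimal sketch of that reasoning, assuming you know roughly how much input data you hold at each location; the `choose_project_location` function and the sizes in the example are hypothetical.

```python
from typing import Dict, Iterable


def choose_project_location(
    input_gb_by_location: Dict[str, float],
    available_locations: Iterable[str],
) -> str:
    """Pick the project location that leaves the least data to transfer."""

    def gb_to_transfer(candidate: str) -> float:
        # Everything stored outside the candidate location would have to move.
        return sum(
            size
            for location, size in input_gb_by_location.items()
            if location != candidate
        )

    return min(available_locations, key=gb_to_transfer)


# Hypothetical inputs: most of the data already sits in AWS us-east-1.
best = choose_project_location(
    {"aws:us-east-1": 800.0, "google:us-west1": 150.0},
    ["aws:us-east-1", "google:us-west1"],
)
```

This compares data volume only; actual costs also depend on each provider's transfer pricing, so treat the result as a first approximation.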
When exporting a file from BioData Catalyst powered by Seven Bridges to an attached volume, export is possible only to a volume that is in the same location (cloud provider and region) as the location of your project. Therefore, exporting to a volume will not cause any additional file transfer costs, as the transfer will take place within the same location.