As a researcher, you can spend months or even years getting a project to the stage where samples can be sent off for sequencing. After countless late nights in the lab and meticulous sample prep, your samples are finally on their way to the sequencing facility.
When the processed FASTQ files are ready to be downloaded from the sequencing facility, you save them to your lab's external hard drive. But what happens when you go to access your files and they're not there? After much troubleshooting, you come to the dreaded conclusion: the hard drive simply failed, and all of the files were lost.
Don’t let this be you.
Deciding where and how you store your data is as important as the experiment itself. The data must be kept safe and secure in a redundant storage system, so that no single point of failure can jeopardize all of it.
You keep biological samples at the correct temperature in expensive fridges, so why leave your data at the mercy of a single point of hardware failure? At least in the story above, it was no one’s fault in particular.
Let’s turn back time. Instead of a faulty hard drive, what if you could store your data in the cloud?
Let’s take a deeper look at the differences between using a storage server or cloud object storage for those valuable files.
Cloud object storage is a mechanism for storing and retrieving files, known as objects, on an individual basis. Each object is an atomic entity: it cannot be overwritten in place, only replaced with a new version. Objects live in what is known as a bucket, a globally unique namespace for your little cubby on the cloud. Cloud object storage is offered by all the major cloud providers and is very fast at reading and writing data from servers within that same cloud.
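To make the bucket/object/version model concrete, here is a toy in-memory sketch of how a versioned object store behaves. This is purely illustrative (real providers expose this behavior through their own SDKs); the bucket and file names are made up.

```python
# Toy model of a versioned object store: "overwriting" an object
# actually appends a new version, and old versions stay retrievable.
class Bucket:
    def __init__(self, name):
        self.name = name          # bucket names are globally unique
        self._objects = {}        # key -> list of versions (newest last)

    def put(self, key, data):
        # Writes never destroy data; they add a version.
        self._objects.setdefault(key, []).append(data)

    def get(self, key, version=-1):
        # Default: latest version. Pass an index for an older one.
        return self._objects[key][version]

bucket = Bucket("my-lab-fastq-files")            # hypothetical bucket name
bucket.put("sample1.fastq.gz", b"run-1 data")
bucket.put("sample1.fastq.gz", b"run-2 data")    # versioned, not clobbered
print(bucket.get("sample1.fastq.gz"))            # b'run-2 data'
print(bucket.get("sample1.fastq.gz", 0))         # b'run-1 data'
```

The key point is that a careless re-upload does not silently destroy the previous copy, which is exactly the failure mode a lone hard drive invites.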
A storage server is a computer on your lab’s network running software that exposes files over the network - typically via NFS (Network File System). The hard drives in the storage server are usually set up with RAID (Redundant Array of Inexpensive Disks), hardware or software that spreads data and parity information across the disks so that no data is lost when some number of drives fail.
It’s like photocopying a single photo and placing each photocopy into a number of boxes. When you lose a single box, it is still possible to retrieve the same photo from a different box.
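The photocopy analogy can be sketched in code. Below is a minimal illustration of single-parity redundancy (the idea behind RAID 5): XOR-ing the data blocks yields a parity block, and any one lost block can be rebuilt by XOR-ing the survivors. Real RAID operates on disk stripes, not tiny byte strings.

```python
# Toy single-parity redundancy: lose any one block, rebuild it from the rest.
from functools import reduce

def xor_blocks(blocks):
    # XOR corresponding bytes across all blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

disk1 = b"ACGT"
disk2 = b"TTAA"
disk3 = b"GGCC"
parity = xor_blocks([disk1, disk2, disk3])   # stored on a fourth disk

# Disk 2 dies: recover its contents from the remaining disks plus parity.
rebuilt = xor_blocks([disk1, disk3, parity])
print(rebuilt)  # b'TTAA'
```

RAID 6 extends this idea with a second, independent parity calculation so that two drives can fail at once.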
Using a storage server, you must be on the same network to access the files. Not only does this mean you cannot play with the data from home, but it also means that the computing infrastructure used to run heavy duty bioinformatic pipelines must also be accessible from the same network.
On the cloud, data can be retrieved over the public internet no matter where you are. When you are running pipelines on the cloud, the data can be copied very quickly onto virtual machines since the cloud provider has optimized their internal infrastructure.
On a storage server, Linux users and groups must be created and maintained to match the lab’s current roster of employees and projects. Permissions are limited to read/write by owner and group on a per-file/folder basis, and which group can read or write a file or folder is stored as metadata on the file or folder itself.
It is straightforward to set up a single data admin who can write files while the rest of the lab can only read them. But if you want, for example, one write admin per project, with most members having read access and a limited few having write access (for running pipelines and writing processed results), it can get tricky. You may end up resorting to creating fixed-use groups and maintaining user membership across multiple groups.
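A minimal sketch of the Unix model described above, using Python's standard library: the read/write bits really do live on the file itself, with exactly one owner and one group per file. The temp file here stands in for a shared data folder.

```python
# Demonstrating per-file owner/group permission bits.
import os, stat, tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Owner may read/write; the file's group may only read; others get nothing.
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)

mode = os.stat(path).st_mode
print(bool(mode & stat.S_IRGRP))  # True  -> group members can read
print(bool(mode & stat.S_IWGRP))  # False -> group members cannot write
os.remove(path)
```

Because each file carries exactly one group, expressing "project A admins write, project A members read, everyone else nothing" forces you to mint a new group per role per project - the combinatorial pain described above.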
Conversely, with cloud storage, access control is not limited by metadata stored on the object itself. Any number of access control rules can be applied on a per-object basis, in addition to rules inherited from policies that apply to all buckets in the organization. This leads to fewer mistakes and a simple, straightforward way of controlling which lab members can perform which operations on which data. More importantly, it means you can grant access to collaborators, regardless of physical proximity!
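Conceptually, cloud access control is a list of rules evaluated against each request, separate from the objects themselves. The sketch below mimics that idea; the principals, prefixes, and operations are all hypothetical, and real providers express this with IAM policies rather than a Python list.

```python
# Toy per-object access rules, decoupled from the objects they govern.
ACL_RULES = [
    {"principal": "alice@lab.org",          "prefix": "project-a/",         "ops": {"read", "write"}},
    {"principal": "bob@lab.org",            "prefix": "project-a/",         "ops": {"read"}},
    {"principal": "collaborator@other.edu", "prefix": "project-a/results/", "ops": {"read"}},
]

def allowed(principal, op, key):
    # Grant access if any rule matches this user, operation, and object prefix.
    return any(
        rule["principal"] == principal
        and key.startswith(rule["prefix"])
        and op in rule["ops"]
        for rule in ACL_RULES
    )

print(allowed("alice@lab.org", "write", "project-a/raw/s1.fastq.gz"))  # True
print(allowed("bob@lab.org",   "write", "project-a/raw/s1.fastq.gz"))  # False
```

Note that granting the external collaborator read access took one rule, with no new Unix groups and no account on a lab machine.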
A bonus feature that cloud object storage provides is an audit log of all data access and modification events. This way, you can identify when the data was downloaded, changed or uploaded.
For organizations that must follow HIPAA, encryption at rest can be used to comply with HIPAA requirements for encryption of Protected Health Information (PHI).
When storing data on a disk, we can encrypt the disk so that every time the server boots, an encryption key must be provided to access the data. While we still need to limit who has access to the running server, this prevents a malicious individual from accessing the data from stolen hard disks. Be careful not to lose the encryption keys - without them, the data will be rendered useless!
In the case of cloud object storage, encrypting objects with keys you control means that even the cloud provider’s employees cannot view your files.
Major cloud providers already support HIPAA and other data security policies. For example on Google Cloud, each bucket is automatically encrypted at rest with keys managed by the provider, and you have the option of providing your own keys if you so wish.
Due to the popularity of cloud object storage, bioinformatic pipeline tools are beginning to natively support object storage URLs. That means that the pipeline tool can understand the source of an object and control the download itself, alleviating the bioinformatician from ensuring that absolute paths on a filesystem match what the pipeline software expects. Furthermore pipeline tools are adding support for writing outputs back to cloud object storage.
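Under the hood, "native support" mostly means the tool recognizes the URL scheme and handles the transfer itself. Here is a rough sketch of how a pipeline tool might distinguish an object storage URL from a local path; the bucket and file names are invented for illustration.

```python
# Sketch: routing an input reference to object storage vs. the filesystem.
from urllib.parse import urlparse

def parse_input(ref):
    parsed = urlparse(ref)
    if parsed.scheme in ("s3", "gs"):
        # Remote object: the tool can fetch it itself via the provider's API.
        return {"bucket": parsed.netloc, "key": parsed.path.lstrip("/")}
    # Anything else is treated as a plain filesystem path.
    return {"path": ref}

print(parse_input("gs://my-lab-data/project-a/sample1.fastq.gz"))
# {'bucket': 'my-lab-data', 'key': 'project-a/sample1.fastq.gz'}
print(parse_input("/mnt/storage/sample1.fastq.gz"))
# {'path': '/mnt/storage/sample1.fastq.gz'}
```

The practical upshot: you pass the same `s3://` or `gs://` URL everywhere, instead of hoping that every machine in the pipeline mounts the storage server at the same absolute path.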
The cost per usable gigabyte will be higher than the raw cost of the hard disks, since RAID setups dedicate extra disk capacity to redundancy. That redundancy is what allows one or more of the hard disks to fail without data loss.
Let’s assume a 4 TB hard drive costs $140. A RAID 6 setup using four 4 TB drives yields 8 TB of usable storage and can tolerate up to two simultaneous drive failures, for a static upfront cost of $560. You can also budget roughly $14 per year for drive replacement (4 drives × a 2.5% chance of failure per drive per year × $140).
Conversely, the pricing model for cloud object storage is per GB per month. Assuming $0.023/GB/month, it will take ~3 months to reach the upfront cost of the local server. So on paper, running a local server is cheaper. However, you need to factor in the cost of your own time to install and maintain the hardware and software, and to make sure failed drives actually get replaced! The cost per GB may be lower, but the responsibility for not losing everything still rests on you. At the end of the day, you are paying for a service in addition to the storage itself.
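The arithmetic above, spelled out (using the article's assumed prices, which will vary by vendor and region):

```python
# Local RAID 6 server vs. cloud object storage, using the assumptions above.
drive_cost = 140.0                              # $ per 4 TB drive
server_upfront = 4 * drive_cost                 # four drives -> $560
replacement_per_year = 4 * 0.025 * drive_cost   # expected ~$14/year in failed drives

usable_gb = 8 * 1000                            # RAID 6: 8 TB usable of 16 TB raw
cloud_per_gb_month = 0.023
cloud_per_month = usable_gb * cloud_per_gb_month     # $184/month for 8,000 GB

months_to_break_even = server_upfront / cloud_per_month
print(round(cloud_per_month), round(months_to_break_even, 1))  # 184 3.0
```

So the cloud bill overtakes the hardware bill in about a quarter - before accounting for your time, electricity, and the cost of a mistake.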
Another factor to consider is that cloud object storage typically offers different pricing tiers based on how often you plan to access the data. Cheaper per-GB monthly options, known as “cold storage”, cost less to store but more to download. Using these storage tiers effectively can shave a lot of cost off your cloud bill.
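For rarely touched archives, the cold-tier trade-off works out heavily in your favor. The figures below are rough illustrative numbers, not any provider's actual price list:

```python
# Standard vs. cold tier for an 8,000 GB archive over one year.
# All prices are illustrative assumptions, not vendor quotes.
standard_store = 0.023   # $/GB/month, standard tier
cold_store = 0.004       # $/GB/month, hypothetical cold tier
retrieval_fee = 0.02     # $/GB charged per download from the cold tier

gb = 8000
months = 12
downloads_per_year = 1   # archived data, touched about once a year

standard_cost = gb * standard_store * months
cold_cost = gb * cold_store * months + gb * retrieval_fee * downloads_per_year

print(round(standard_cost), round(cold_cost))  # 2208 544
```

The retrieval fee only dominates if you download the data frequently - in which case the standard tier is the right home for it anyway.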
Don’t leave the safety of your data to chance, or to whichever lab member happens to miss the warning light on a storage server. Cloud storage is safe and secure, makes data access rules simple to manage, and enables collaboration across lab members and beyond.