October 23, 2024

Data banks for Biobanks

Author Profile
Jan Zheng

Spreadsheets should be the first choice for any new biobank project. But when do you know it’s time move on?

When most labs start a biobank project, spreadsheets should always be their first choice for storing and working with . They’re cheap, fast, and easy to use, everyone’s familiar with them, and with Google Sheets and Excel’s collaborative features, any lab member should be able to update it with ease.

Spreadsheets encourage experimentation and flexibility, and doesn’t impose many rigid structures around the type of information the biobank needs to collect. This makes them very easy to pick up.

Spreadsheets should comfortably manage a few hundred items in the bank. But when the collection grows to many hundreds to thousands of items, they start becoming unwieldy. Spreadsheets are also mostly ideal for simple data. It’s hard to upload images and PDFs to a spreadsheet. It’s even harder to manage multiple tables of data that need to reference each other.

For example, a simple biobank might manage one table for phages, another for bacterial hosts, and a third for the physical batches of phages. The list of phage batches should reference the list of phages and the list of hosts, but also contain its own details, like batch lot number, expiration date, the person responsible, etc. While there can only be one listing for P. aeruginosa and a phage Pae7, there can be multiple listings of phage batches.

While spreadsheets are great for new and smaller projects, they get unwieldy when more structure, consistency, and reliability is needed. For that, there are databases.

Spreadsheets vs. Databases

Spreadsheets are tools meant to explore and interact freely and quickly with data. Databases are tools meant to make data predictable and consistent, which requires structure, constraints, order, and correctness.

These additional constraints give databases a lot more power — there’s a reason why databases power the modern world — but they also add a lot more complexity. The additional structures means users can’t just “move things around” anymore in the data. Data is now organized into well-characterized columns: email column can only contain valid email addresses. Any other content won’t be saved.

For a long time, databases were very hard and expensive to use, but that’s not the case anymore, as there are plenty of tools that bridge the gap between spreadsheets and “real” databases.

User-friendly Database tools

Today, databases don’t have to be scary anymore. Gone are the days of having to write SQL code just to see the content in a database. Instead, there are a number of tools today that are just as easy (or even easier) than Google Sheets that have great database-like functionality. Airtable for example looks similar to a spreadsheet, but runs like a database. You can create multiple tabs, like a spreadsheet. You can also link and reference rows from one table to another, and filter, sort, and view your data in different ways. For Phage Directory, we’ve been using Airtable to power anything from the website to our blog that we’ve been running for over six years.

Airtable
Airtable is a database with a spreadsheet-like interface. This is an example of a library system

Airtable enforces the structure and consistency of databases, but has a very approachable interface. They even support uploading files and images, and turning databases into websites.

Another great tool that we use a lot at Phage Directory is Notion, which is kind of a cross between Apple Notes and Airtable. In fact, this entire post is written in Notion, and served to the labspace.ai website using a custom data pipeline (the original post can be found here: https://phagedirectory.notion.site/Data-banks-for-Biobanks-fff6478089c681d0be0ccb30e50aa8b8).

While use
While use Notion to write blog posts for Labspace, Notion also comes with data structures, that make it great for a database with lots of text content — like blog posts!

Notion offers a lot of flexibility for labs that use bio banks as lab resources, as Notion can double up as a great electronic lab notebook. It has collaborative and comment features like a Google Doc, and multiple people can work in it at the same time.

What about traditional database tools?

If Notion and Airtable are so great, do labs even need “real” databases like Postgres or even SQLite? Notion and Airtable still come with lots of constraints, but they should handle 80% of the cases. If you’re trying to store hundreds of thousands of genomes, from raw reads all the way through annotations in Notion, that’s probably not a good idea.

Notion doesn’t impose row restrictions, but their interface becomes fairly unusable at around 1,000 rows of data.

Airtable caps out at about 50,000 rows of data, and 1GB of file uploads.

Unfortunately, there are few publicly available tools spanning the range from Airtable to “real” databases. If you need to collect lots of types of data with different relationships, including graph relationships (e.g. a network of related items, like Facebook’s friends networks), then Airtable and Notion won’t cut it.

The best way to start using databases is to start with “simpler,” user-friendly databases like MongoDB, Turso, or Supabase, and use an Airtable-like interface like Outerbase or libsqlstudio to explore the data.

For object data like PDFs, images and fasta files, if you have hundreds to thousands of large files spanning hundreds of MBs or GBs, I’d suggest using Cloudflare R2, as they have a generous 10GB free tier, and no egress pricing (which means you pay every time someone else downloads a file. This is how AWS S3 makes billions of dollars).

Despite the simplicity of MongoDB or Turso or Cloudflare R2, they still take significant time and technical experience to set up. You still need to write code to be able to connect these tools. You would probably still need some kind of user interface to make it easy for users to add new data and upload files, and you probably need another Airtable-like interface to browse and build reports from the data — like image browsers or genomic analyses — to take full advantage of this data.

Electronic lab notebooks (ELN) and lab inventory management software (LIMS) exist for this occasion, but many of them are hundreds of thousands of dollars a year, and are built for a more rigid way of working. Adopting ELNs and LIMS would require you to adapt your lab’s process to the way these tools work, not the other way around. This works for more traditional workflows like small molecule drug design, but imposes too much structure for new therapy development like CAR T-Cell or phage therapy design.

So what’s next?

Hopefully this wasn’t all too overwhelming.

If you’re just getting started: use Google Sheets or Excel. These are great for getting started, and doesn’t impose too much constraints while you think about your data, protocols, and processes.

If you’re overwhelmed with too many rows, move on to try either Airtable or Notion. Try to impose some more structure in the forms of schemas around your data and protocols.

If you’re the 5% of labs outgrowing the features of Airtable, or need a lot more data or complexity than what Airtable can handle — try to move onto a combination of Turso, Supabase, and Cloudflare R2. For this, you should consult your bioinformatics engineer!

If you don’t have a bioinformatics engineer, or they’re busy with doing actual science, reach out to me (Jan) and I can walk you through the process and help you with your next steps.