It’s been a long road to mastering the cell, but biological scientists think they are getting closer and closer to understanding the fundamental mechanics of the kernels of life that make up our bodies. Decades after the sequencing of the first human genome, we now have a much more comprehensive understanding of how to discover a cell’s functions — and increasingly, the tools to actually analyze and prove that our models and theories about them are correct.
That’s been the domain of single-cell analysis and a novel technique in genetic science, which has been dubbed “perturbation biology”: making extremely small changes to the genetic code inside of cells and then observing how that cell’s functions change. What began with 18 cells and limited observational data in a single lab has now grown exponentially to hundreds of thousands of cells and millions of observations globally. That massive increase in data has forced the creation of a whole new set of analytical tools to process this data and derive foundational insights into the workings of cells.
How do all of these new laboratory experiments work and what kind of software tools are needed to progress the most advanced theories today? Joining host Danny Crichton on “Securities” this episode is Rahul Satija, an associate professor at New York University and a core member of the New York Genome Center as well as Lux’s own Shaq Vayda.
We’ll talk about how biological tools like CRISPR power perturbation bio, why scientists are increasingly moving away from indirect experiments to direct experiments and what that means for the future of the field, how we comprehend cell heterogeneity, if we’re getting closer to “fundamental truth” in biology, and finally, why theoretical molecular scientists are increasingly going to need large-scale clinical trials for the next generation of health treatments.
Produced by Christopher Gates
Music by George Ko
Transcript
Danny Crichton:
When it comes to the big challenges of human health, the neurodegenerative disorders, the cancers, the cavalcade of other diseases that afflict us, it always feels like molecular biology has never been closer and yet never been farther away from offering treatments. Every new advance seems to just add new complexities forever pushing our final understanding just out of reach. Well, that might just be changing this decade and that's the subject of today's show.
Hello, I'm Danny Crichton, and you're listening to Securities by Lux Capital, where we construct and defend the future of humanity's prosperity through the hard sciences and technology. Today I'm joined by Rahul Satija, an associate professor at New York University and a core member of the New York Genome Center, the Big Apple's hub for genomic and sequencing research. Rahul, thanks for joining us.
Rahul Satija:
Thank you so much for having me, Danny.
Danny Crichton:
Also joining us is my Lux partner, Shaq Vayda. Shaq, welcome.
Shaq Vayda:
Wonderful to be here.
Danny Crichton:
I mentioned that molecular biology has been challenged to answer some of the biggest questions in the science of life, but a new paper in Nature on perturbation bio has been making some big waves. Rahul, why is there so much excitement about these small-scale genetic changes and why might they answer the largest questions in biology?
Rahul Satija:
Yeah, that's a great question. It's a big human genome, right? It's billions and billions of bases. And I think one of the questions that so many labs are interested in, and certainly my own, is understanding how it works.
One of the amazing things about the genome is how much small changes can matter. Individual diseases can be caused by single base pair edits, one change out of 3 billion bases, and all of a sudden you go from having someone who's perfectly healthy to having a terminal illness. And then you can have other changes, billions of other changes that have absolutely no effect. And I think one of the incredible mysteries of biology is what parts of the genome matter and when do they matter and how do they matter? But every single base has the potential to have an impact and a function on human health. And to unravel that mystery, we have to start looking at the genome in very, very small components. And that's a hard problem because it's a big genome. But that's I think what makes a lot of this research exciting and fun.
Danny Crichton:
So I think about the history here. You're going back to the '90s, we had this whole sequencing, the Human Genome Project, trying to get a map of everything going on in the genome. And that was the competition between the publicly funded group and Celera Genomics. And then we got into the late 2000s, early 2010s where it was about SNPs, these single nucleotide polymorphisms. So these small changes, and that led to both the identification of certain diseases that were caused by those single changes, as well as a lot of the popular products around genome analysis. They're looking at your ancestry, connecting the dots and saying, "Well, if you have these sort of changes to your genome, it leads to this sort of background," like Ancestry.com and 23andMe. But it sounds like given the new tools we have with CRISPR and others, along with some new analytical tools we can talk about, we're really opening up this field in a way that didn't exist 10 years ago.
Rahul Satija:
This has been the goal since the genome was first sequenced. And I actually grew up not too far from the NIH and also near TIGR (The Institute for Genomic Research), which was a competing group. And I was in middle school when the genome was first sequenced and they had those press conferences and those announcements. And it was exciting. We were hearing about it in school. We were hearing about it in the news, and I think we thought maybe at the time, all right, now we've reached this milestone. We've sequenced the human genome. All of a sudden biology is going to change. And we quickly realized that the sequencing wasn't enough. Just because we know the sequence of 3 billion bases in the human genome doesn't mean we understand how it works.
And that's led now to decades of research in what's often called functional genomics, to develop tools and technologies that can help us interpret the genome and try to understand how each individual base functions. And tools like CRISPR and tools like high-throughput perturbations and functional screens are exactly what enable that to happen. And there have been incredible advances in that, really in the past five to 10 years in particular.
Danny Crichton:
So let's zoom in on that. So we're talking about perturbation bio, this idea of changing these single nucleotides. What does that mean? How do you do that in a lab? How does that happen and how can you actually figure out what that little change does to a cell and to a person more broadly?
Rahul Satija:
Yeah, so it's these tools which are often called genome engineering tools, which are ways of modifying or editing the genome. So when we do DNA sequencing, we can read the genome, but now we're trying to change it or edit it or write it. And in many ways, the goal is similar to debugging a computer program or understanding a computer program. If you have a program with thousands and thousands of lines of code and you go in and you change one of them, you run the program before the change, you run the program after the change, you look for a difference in output. And if you can do that over and over and over again, you can reconstruct or understand a program from first principles.
And so I think that's a similar goal for genome engineering tools in biology. And that's precisely what tools like CRISPR enable you to do. They enable you to go into a cell, a living cell, and change a specific piece of DNA. And we can get into the details about exactly how much you're changing and the types of changes you can make, and there are continual advances in this area. But the goal is you run an experiment where you make no changes and you run an experiment where you make a very small targeted change, and then you observe the cell over time. You can observe the cell by taking pictures of it, or you can observe the cell by seeing if it lives or dies. Or, and this is some of what my lab works on, you can observe the cell by trying to sequence it and seeing what molecules are inside it. But fundamentally, you want to be able to associate that change that you made with the downstream consequences. And if you can do that over and over and over again, you can start trying to interpret the genome.
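The before-and-after comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names and counts are made up, the function names are hypothetical, and a log2 ratio of mean counts is just one simple way to summarize a difference, not necessarily what any particular lab uses.

```python
import math
from statistics import mean

def log2_fold_change(control, perturbed, pseudocount=1.0):
    """Log2 ratio of mean expression (perturbed vs. control) for one gene.

    The pseudocount avoids dividing by zero when a gene is not detected.
    """
    return math.log2((mean(perturbed) + pseudocount) / (mean(control) + pseudocount))

def perturbation_effects(control_cells, perturbed_cells):
    """Compare per-gene readouts between unedited and edited cells.

    Each argument maps a gene name to a list of per-cell counts.
    """
    return {
        gene: log2_fold_change(control_cells[gene], perturbed_cells[gene])
        for gene in control_cells
    }

# Made-up counts for illustration only (not real data).
control_cells = {"GENE_A": [7, 7, 7], "GENE_B": [3, 3, 3]}
perturbed_cells = {"GENE_A": [1, 1, 1], "GENE_B": [3, 3, 3]}

effects = perturbation_effects(control_cells, perturbed_cells)
for gene, effect in effects.items():
    print(gene, round(effect, 2))
```

A strongly negative value for one gene and a value near zero for another is the kind of signal that links a specific edit to a downstream consequence; repeating the comparison across many edits is what builds up the larger picture.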
Shaq Vayda:
And I always ask, why does any of this matter? And what are the rules that we even have that kind of help us understand where to even look? And there's this amazing thing in biology known as the central dogma. And the central dogma basically tells us that we have DNA, which is transcribed into RNA, which is then translated into proteins.
And an analogy that I always like to use, which in some ways understates the complexity of biology, is that DNA is like the source code. We're all familiar with source code. That's where the heart of the program lives. And then it needs to go through a compiler. In this case, we can kind of think of the RNA as sort of a pseudocompiler. And ultimately that gets translated into machine code, which is what the machines that we use actually understand. And we have this example of a thing called a ribosome, which is like the printer, which gives you the visual representation of what you're ultimately trying to do.
To Rahul's point, we need to look at all of those different players in the cell and understand exactly what is everyone doing at any given point in time.
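The DNA to RNA to protein flow can be made concrete with a toy Python sketch. It assumes a made-up 12-base sequence and a deliberately partial codon table containing only the four codons that sequence uses; real translation involves the full 64-codon table and far more machinery.

```python
# Partial codon table: only the codons used in the example below.
# "*" marks a stop codon.
CODON_TABLE = {"AUG": "M", "GCU": "A", "UGG": "W", "UAA": "*"}

def transcribe(dna_coding_strand):
    """DNA -> RNA: on the coding strand this amounts to replacing T with U."""
    return dna_coding_strand.upper().replace("T", "U")

def translate(rna):
    """RNA -> protein: read three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCTTGGTAA"      # made-up sequence for illustration
rna = transcribe(dna)     # "AUGGCUUGGUAA"
protein = translate(rna)  # "MAW" (Met-Ala-Trp)
print(dna, "->", rna, "->", protein)
```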
Rahul Satija:
Yeah, and I think that is a good analogy. And DNA is effectively an instruction manual for a human being to some extent. And the thing that I'll add on to that, which is I think one of the just very basic and fundamentally exciting questions that we look at in my lab, is that every cell, we have trillions of cells in our body and to a first approximation, every cell has exactly the same DNA sequence, but our cells are incredibly different. So we have heart cells and lung cells and brain cells and neurons and the immune cells, all of which have incredibly different functions. The cells in our eye can process light, the cells in our blood can combat infections. The cells in our heart pump blood. They do completely and totally different things, yet they have exactly the same genome. They have the same set of instructions.
And so the way that works is exactly as you were saying: different subsets of instructions are activated in different cells. So we turn on a different portion of the genome in our immune cells than we turn on in our neurons. And how that process works, how each cell knows which portion of the genome to activate and which portions of the genome should stay silent, and how that even changes over time, because the same cell over time will activate different portions of the genome, that is an incredible mystery in biology. That's also something that really motivates the use of sequencing technologies for single-cell analysis. And it's just a very fundamental question that we're really excited to try to answer.
Danny Crichton:
Well, I think when I go over the last 10 years, we had the human genome project, we sequenced it, and then it became about sequencing everything else that was going on. We had the proteomics, the metabolomics. I'm saying that really fast 'cause I always get very tough on that word. And you're sort of trying to figure out what actually is getting expressed, because it's one thing to have a copy of the protein in your genetic code, but how many of those proteins actually get made in each cell and how they interact with each other, that's something that we haven't known and that we're focusing on right now.
Rahul Satija:
A lot of these functional genomics technologies that we'll talk about today were enabled by fundamental advances in DNA sequencing technology. And we often call this next-generation sequencing for sort of parallel or massively scalable ways of doing DNA sequencing technology.
The first big example of that was the sequencing of large genomes, including the human genome. And so we refer to those technologies as bulk sequencing because they don't work on individual cells. They often work on large collections of cells. And that's because typically when you run a DNA sequencer, you'll take a biological sample, you'll extract DNA from that. Maybe many of you actually did that in your high school biology classes. I certainly did. And I remember that process of purifying DNA from a calf serum or whatever it is that we used. But anyway, once you extract that DNA, you can run a series of molecular reactions that will then enable you to put it on a DNA sequencer and learn the sequence of nucleotides that's present.
But that's a complex technological step. And traditionally, certainly 10 or 15 years ago, and certainly when the Human Genome Project was started, you needed a huge biological sample to be able to extract enough DNA to eventually put on a DNA sequencer. So often we needed millions of cells in order to be able to extract enough material to be able to sequence. And that was primarily a technological limitation. We would've loved to be able to sequence individual cells. We just couldn't do it yet.
Now, when we were doing DNA sequencing, it's not such a problem because as I mentioned earlier to a first approximation, all of our cells kind of have the same DNA sequence to begin with. So it doesn't matter as much if we're only sequencing one cell or a hundred cells together or a million cells together because they're kind of all the same.
But if we're trying to sequence other things, for example RNA, as you mentioned, Shaq, it's different. So RNA is the portion of a cell's genome that's actually activated. So in a liver cell, you'll activate a completely different set of instructions than you will in an immune cell. If you start sequencing RNA from mixtures of different cells, that becomes more of a problem because they're very, very different from each other. And so you start to measure a bit of a mess. So that kind of describes how the sequencing technology started to evolve over time, first trying to sequence DNA, then eventually sequencing RNA, which really motivated this goal of being able to look at smaller and smaller samples.
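A small Python sketch shows why averaging RNA across a mixture can mislead. The two cell types, the gene names, and the expression values are all invented for illustration; the only point is that the bulk average describes a cell that does not exist in the sample.

```python
from statistics import mean

# Made-up expression levels for two genes in two hypothetical cell types.
liver_cell = {"GENE_L": 10.0, "GENE_I": 0.0}
immune_cell = {"GENE_L": 0.0, "GENE_I": 10.0}

def bulk_profile(cells):
    """Average each gene across all cells, as a bulk measurement effectively does."""
    genes = cells[0].keys()
    return {gene: mean(cell[gene] for cell in cells) for gene in genes}

# An even mixture of the two cell types.
mixture = [liver_cell] * 5 + [immune_cell] * 5
bulk = bulk_profile(mixture)

print(bulk)                                  # both genes look moderately "on"
print(any(cell == bulk for cell in mixture)) # no single cell matches the average
```

Measured one cell at a time, the same sample separates cleanly into two groups, which is the extra resolution that single-cell sequencing is after.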
Danny Crichton:
But no, I think when we get into the bulk sequencing, the way I think of it, going to what you said earlier, is that we were working from millions of cells and you sort of had to average across all this huge sample, and now we can get both into the single cell and these single proteins and the single sequences inside that cell. And for the first time, you're kind of getting at an atomistic kernel of biology where before we were always kind of indirect, trying to figure out exactly what's going on. But it seems to me like we're hitting the edge. This is as deep as we would hopefully need to go.
Rahul Satija:
Yeah, and I think as our technologies get more advanced, we can ask more and more targeted, specific questions, the real questions that we're actually after, not approximations and averages, but really to be able to do a very clean experiment where we're able to perturb a particular gene or a particular DNA sequence and see what happens in an individual cell. That's really a remarkable kind of accomplishment that we've gotten to.
And again, it's not to say that the technologies we have haven't been incredibly useful and informative. We learned an extraordinary amount of biology from bulk measurements and the ability to measure an average across millions of cells still is extraordinarily informative and valuable. We're always, as technology developers and biologists trying to get to that next level of resolution, trying to be able to ask that next level of question. And hopefully that's what single-cell analysis enables us to do.
Danny Crichton:
And so you're empowered with this new technology. You can do single-cell analysis. You presumably have sequenced and done a lot of work on individual cells, presumably hundreds, thousands of single cells. Are they all the same or are we learning that there's sort of heterogeneity across them?
Rahul Satija:
So they are definitely not all the same. And that, of course, is something we've known from the beginning of biology. The breadth and complexity of biological systems is something we've always appreciated. Of course, the ability to sequence cells as opposed to just see them under a microscope or see that they have different sizes or shapes or functions, gives us a deeper understanding, or at least we hope it does. But that realization of how complex and beautifully intricate these systems are is exactly why we wanted to develop these types of technologies in the first place.
Just as a fun story, for the very first single-cell experiment that we ever did, I was a postdoc at the Broad Institute in Boston, and I was working with a really talented team of people: Aviv Regev, Nir Hacohen, Joshua Levin and Alex Shalek. We wanted to do our first single-cell sequencing experiments, but we were worried that this is a new technology. And if we just sequence a lot of cells and find that they're different, then reviewers are going to say, "Well, maybe your technology isn't working. Maybe the cells were exactly the same, and so you're just getting technical differences as opposed to real biological differences."
So we actually had a meeting where we tried to think of can we find a set of cells that we really think are going to be exactly the same and run them for our pilot experiment? And we actually settled on a system where the cells had already stopped dividing because if they're dividing, then that's already a source of change. So we picked a system, bone marrow-derived dendritic cells. We wanted them to be identical, and we wanted the sequencing data to tell us they were identical. And after that happened, then we would feel more comfortable doing more complex experiments because we believed the technology.
And what actually happened is that even in this subset of cells that we assumed would be exactly the same, we already were seeing huge differences. And at first we thought, "Okay, well this might be some sort of artifact." But we went through and validated these differences and understood them biologically. And that was kind of exciting to us. We had just sequenced 18 cells. That was our first experiment. It was an 18-cell paper. Even in a group of cells that we chose to try to show they were the same, we already were seeing big differences. That opened our eyes, and it was really an exciting moment for me.
Shaq Vayda:
As part of the analysis piece, this probably produces just a little bit of data. Can't imagine that we were able to just process that with a simple sort of Excel spreadsheet. So maybe some of the technologies that your team has been thinking about to help analyze would be helpful to walk through.
Rahul Satija:
Yeah, and as part of that, as I mentioned, that first experiment was 18 cells. Our bodies consist of trillions of cells. So that was fun as kind of an initial pilot, but it's nowhere near the scale of data that we need to be able to generate. And we can talk more about this, but over the next few years, a series of incredible inventions came out in single-cell analysis that enabled us to now sequence hundreds of thousands of cells simultaneously in a single experiment rather than dozens. And that's why these technologies have taken off: the ability to do highly scalable experiments routinely in any biological lab.
And those technologies got commercialized. And now there's a huge single-cell market, which maybe many of you are aware of, which is very widely used internationally, where labs are generating millions and millions of cells worth of sequencing data every day. And so what that creates then is a challenge analytically for how do you make sense and interpret that kind of data.
And we were fortunate, my lab, just because we were working on these types of problems relatively early. That was just the focus of my postdoctoral research. We started to build analysis methods and software packages that we could use to analyze our own data. And actually one of the very first data sets that I had analyzed after the pilot was an experiment from the zebrafish embryo, which is this really beautiful system where cells are developing and growing into different organs and lineages very, very quickly. We were looking at this data and we were trying to visualize it spatially and kind of realized that each of these cells reminded us of dots on a pointillist painting.
So the very first version of the software that we used to analyze our own data, I named Seurat after the pointillist painter, just sort of evoking that image of lots of individual dots, each of which on their own isn't that informative, but you put them together and you organize them in the right way, and all of a sudden you get this beautiful picture.
And so that was a software package that we were, again, really building for ourselves, but then realized over time that this is something that maybe other labs could use. There weren't that many at the time, in 2014 when we first developed this, that were using the technology, but we took the step of releasing it online as open source software. And then as the technology field grew for single-cell analysis, all of a sudden the demand for these software tools grew really exponentially.
And now there's a whole host of really incredible user-friendly, open source, freely available analytical toolkits that use a variety of ecosystems from R to Python to all sorts of things. But there are tens of thousands of labs that use these tools. And Seurat has something like 50,000 downloads every month and about a million and a half downloads since we launched it. And it's been really incredible to see the community's response to something that we originally built for ourselves but realized would have value more generally.
Danny Crichton:
I'll highlight a couple of things. We're going back to the Human Genome Project. Billions of dollars spent to sequence exactly one DNA sequence. Then we expand that, and it brought the cost massively down to the hundreds. Now we're getting at the single-cell level. Now you're telling me you have hundreds of thousands of cells in parallel, all at the same time. You're able to sequence with software that's in the cloud that's also exponentially growing in terms of scale. To me, the exponentials on exponentials here is really unique for biology. I mean, I think of it as still a person in a lab doing an experiment in some ways. The parallelization you're able to do today, to me, is really eye-opening.
Shaq Vayda:
This is really the why now. This is why we're excited about this space. And building atlases is a relatively new concept. And 10 years ago, it would've been so cost prohibitive. You would've never built an atlas. But today, we can actually go through the effort of collecting that data for each individual cell and then potentially even providing it as open source, allowing folks to then build on top of those data sets and create higher and higher orders of value for the world.
Rahul Satija:
Yeah, and I think that analogy holds. The first genome was an incredible technological advance that took decades and billions of dollars. And now you can sequence the genome for a couple hundred bucks and you can do this for tens of thousands of patients. And that same analogy applies for single-cell: we started with 18. It took us months. It was this huge endeavor, and now we routinely sequence hundreds of thousands. That is how exciting biological research works. It's just constantly, constantly improving. Not in incremental ways, but by orders of magnitude.
But the questions have stayed the same. How does the genome work? What are the types of cells that we have? How do cells malfunction in disease? The same basic questions are there, but the technologies get better and better. And maybe at every point you think this is really the moment where we're going to have a breakthrough. And I always sort of laugh at that, but I do think that this is going to be a very exciting next decade for molecular biology and genomics.
Danny Crichton:
But we talked about how we're getting closer and closer to the fundamental unit. We now know the proteins. We're now getting into those structures within a single cell. We're about as small as we can go in terms of the building blocks of life. And then you're also doing all this parallelization of the sequencing. So you're getting all this data. And so I guess the open question I have is around statistical power: are you getting enough data? Is there enough parallelization going on with the tens of thousands of cells that you're sequencing to actually be able to answer those sort of next-generation questions?
Rahul Satija:
Yeah, I hope so, but it's a good question and it's not a particularly easy answer. So I think there are some areas where we don't yet have enough data, and we're starting to realize that. I'll take Alzheimer's disease as an example. This is a very exciting area. We want to understand what are the cells that are responsible for causing neurodegeneration in a variety of diseases, Alzheimer's being an example. It's becoming more and more common now to be able to obtain access to postmortem tissue, to brains after people pass away, for example, from patients that have Alzheimer's or patients that don't, to be able to do single-cell sequencing and to compare.
And you can sequence hundreds and hundreds of thousands of cells and you get a lot of data. There's a lot of statistics there. But the fundamental underlying challenge is that the patients are heterogeneous. People get Alzheimer's disease for all sorts of different reasons. Their treatment is different. Their age is different. So it's not just about the number of cells we sequence, it's also about the number of underlying patients and samples that we have.
And so I think as single-cell starts to move more into the clinic, there's a growing realization that we can't do these experiments on four or five people. We need to do these experiments on cohorts of hundreds and eventually thousands of people. And so that's been a really interesting kind of intersection with large-scale clinical research, which traditionally isn't applied to new and cutting-edge technologies. Usually those are slightly different communities. And now I think there's a growing appreciation of the fact that those two things need to come together.
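The point that the number of patients matters, and not just the number of cells, can be illustrated with a short Python sketch. It collapses per-cell measurements into one value per patient before any comparison, an approach often called pseudobulk aggregation; it is shown here as one possible illustration, not as the method discussed in the conversation. The patient IDs, values, and function name are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def per_patient_means(cells):
    """Collapse per-cell values into one average per patient.

    `cells` is a list of (patient_id, value) pairs.
    """
    by_patient = defaultdict(list)
    for patient_id, value in cells:
        by_patient[patient_id].append(value)
    return {patient: mean(values) for patient, values in by_patient.items()}

# Made-up data: six cells, but only two patients.
cells = [
    ("patient_1", 2.0), ("patient_1", 4.0), ("patient_1", 6.0),
    ("patient_2", 8.0), ("patient_2", 10.0), ("patient_2", 12.0),
]

summary = per_patient_means(cells)
n_cells = len(cells)        # how many measurements were taken
n_patients = len(summary)   # how many independent individuals they came from

print(summary)
print(f"{n_cells} cells, but only {n_patients} independent patients")
```

Sequencing more cells from the same two people would grow the first number without changing the second, which is why larger cohorts are needed to say anything about a disease across patients.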
Danny Crichton:
Oh, Rahul, thank you so much for joining us.
Rahul Satija:
Thank you so much for having me. This was a lot of fun.