Krishna Veeramah
From blood to bone: A brief review of how genetic analysis has helped untangle sub-Saharan Africa's prehistory
A synthesis of how the analysis of modern and ancient DNA has helped better understand African prehistory.
FULL TRANSCRIPT
Okay, so Lawrence and Fred asked me to give a talk on genomics and how it's relevant to deep history in Africa. And I'll confess it's not really what my lab works on much anymore. It's something that we did when I was a PhD student. So I don't have any big impressive new research to present. So this is why I'm giving you a kind of review of the history of how genetics has helped us understand African, and in particular sub-Saharan African, prehistory. So you've probably all at some point heard the phrase, there is more genetic diversity in Africa than anywhere else in the world, which is true. However, that in itself is not particularly interesting. Fruit flies have more genetic diversity than humans. The reason it's interesting from the point of view of a population geneticist like myself is that you can then layer that on to the immense amount of other kinds of diversity that you have in the continent.
Phenotypic diversity, ecological diversity. And the two elements that have really framed how population geneticists like me have thought about human African genetic diversity are the variation in subsistence strategies, the farmers and hunter-gatherers, and language: the immense language diversity, and trying to see how that is structured within the context of the genetic data. People have always been interested in particular in the spread of Bantu languages and how that is reflected in the genetic data, and also in what the genetic data says about click-speaking individuals, hunter-gatherers, and these kinds of things. Now, before people could really look at DNA, and like I said, this is an overview or review, so I'm going to go all the way back to the beginning, we didn't have a direct ability to look at DNA. We had to look at so-called classical genetic markers, proxies for the DNA.
And this kind of started with Ludwik Hirszfeld in 1919, when he found that the blood groups we had, A and B, actually had variation in them depending on what populations you looked at. And you can see that they talked about races here and used certain words that we really wouldn't deem acceptable. But the principle was that there is variation in populations for blood groups, and that would reflect underlying genetic variation, because DNA makes RNA makes protein. And over time, stretching all the way to the seventies, people developed more and more assays where they could look at protein variation as a proxy for DNA variation and try to see if they could use that to understand human population history and the variation that exists. He is not the only one that did this, but of course the most famous example, almost the culmination of using these classical genetic markers, was by Luca Cavalli-Sforza, who in I think 84 and in 94 published this very famous work, the “History and geography of human genes”, which is this kind of incredible effort to assemble all the data that existed and try and come up with ways to explain how people have moved all around the world.
And of course, he looked at Africa, and in this case he looked at 49 African populations. Here's a tree he made, let me see if this works, yep. There's this tree he made, and obviously from the perspective of what we do today it's a fairly primitive thing, but it does have features in it that would give a kind of little indicator of what we would see in the future when we looked at real DNA. It identified that there were differences in some of these groups: hunter-gatherers like the Bambuti groups that were, until very recently, hunter-gatherers, the San, and the Eastern African individuals from Ethiopia were different to other sub-Saharan African individuals from a classical genetic point of view. And it showed that speakers of Bantu languages in particular, individuals from all over the Bantu spread across Africa, from South Africa, southeast, southwest, north, genetically clustered together, even though you would think that they would show more spread given how widely distributed they are.
Genetically they seemed, from classical markers, very, very, very homogeneous. And this allowed Cavalli to present these kinds of models, using the genetic data from the trees to think about how populations might spread, and for the Bantu expansion, what kind of streams of the Bantu expansion might exist through Eastern and Western Africa and which ones came first. And obviously there's a lot of simplifications here, but it was kind of a first attempt to really do that, to model migration with genetic data, and also to think about things not necessarily in a tree-like manner. And this is something that's become very important in the modern genomic era: to think that there are actually populations that are the result of admixture, the process of two populations coming together and creating a third. And this idea, for example, that the San and the Ethiopians may have been a result of mixtures of different types of other people, in this case sub-Saharan African and Near Eastern.
In one case, this is probably somewhat correct and in the other case probably not. It was limited data, but it was kind of a starting point for how everyone would really think about contextualizing genetic data. It hasn't actually changed that much in some ways. It wasn't until really the 80s, in 87, that we actually started to look at real DNA, and the real signature of this is this very famous paper by Rebecca Cann in 87, which essentially identified this mitochondrial Eve and obviously had such huge implications in distinguishing between the multi-regional and out of Africa hypotheses. And this was the beginning of a series of papers in the nineties and early 2000s that were dominated by genetic work looking at uniparental markers, the mitochondria and the Y chromosome. And this was driven essentially by the ability to look at restriction fragment length polymorphisms, places in DNA that we couldn't necessarily sequence, but we could tell that they were different between people, and also by the development of PCR, which allowed us to actually go and get the bits of DNA we were interested in, amplify them and examine them as much as we can, and that really revolutionized molecular biology in so many ways.
So, like I said, the 90s and early 2000s was really the Y chromosome and mitochondrial era. If you wanted, you could call it the Genographic era, the National Geographic era, because Spencer Wells was kind of running around telling people who they really were with their Y chromosomes and their mitochondrial DNA. And that became a real kind of precursor for things like 23andMe, where you'd actually go and send your DNA to them and they would tell you your Y chromosome and mitochondria. It became very popularized. They were actually sending it here to the University of Arizona with my old postdoc advisor, Mike Hammer. And so when you're looking at Y and mitochondria, though, it's important to understand what you are looking at in terms of history. So how do we use genetic data to understand history?
Now fundamentally, we don't actually care about DNA. What we care about is, for any particular individual or set of individuals, the genealogy back in the past. And if we can work out what that genealogy is, we understand that the genealogy of all our relatives is shaped by our history, where we've moved, who we've encountered in the past. So, we want to get at this genealogy, and DNA is essentially a kind of fuzzy lens for looking at that genealogy. Now, when we only look at the Y chromosome or the mitochondrial DNA, we only get a very limited view of that genealogy. So, if we're looking at the Y chromosome, for example, of this guy here, what we're really looking at is DNA that comes from ancestors along a purely paternal line. When we're looking at mitochondrial DNA, it's a purely maternal line.
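To put a rough number on how limited that view is, here is a minimal Python sketch. It assumes an idealized pedigree with no inbreeding, so that going back g generations there are 2^g distinct ancestors (an assumption for illustration; real pedigrees overlap as you go back). It then counts the one purely paternal ancestor traced by the Y chromosome and the one purely maternal ancestor traced by mitochondrial DNA. The function name is hypothetical.

```python
def uniparental_fraction(generations: int) -> float:
    """Fraction of ancestors `generations` back that Y + mtDNA trace.

    Assumes an idealized pedigree with 2**generations distinct ancestors
    (no inbreeding), which is a simplification.
    """
    if generations < 1:
        raise ValueError("generations must be at least 1")
    total_ancestors = 2 ** generations
    # One ancestor on the purely paternal line (Y chromosome) and one on
    # the purely maternal line (mitochondrial DNA).
    traced = 2
    return traced / total_ancestors


for g in (1, 5, 10):
    print(f"{g} generations back: {uniparental_fraction(g):.4%} of ancestors")
```

Under these assumptions the two uniparental markers cover both parents one generation back, but only 2 of 1,024 ancestors ten generations back.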
What we're not doing is looking at all the other ancestors that have contributed to someone's history. And of course that's very important, right? There's obviously reasons why there might be interest in just purely male and female lines, but this is a huge number of other individuals and they all contain the remnants of our history. It doesn't mean Y and mtDNA are not important. They showed some very broad patterns that are not so different to what Cavalli showed. So, for example, here is a principal component analysis of populations based on mitochondria, and it showed that Eastern Africans were different to Southeastern Africans, but then the Central Africans and West Africans you actually couldn't really differentiate that well through mitochondria. You would see similar things in the Y chromosome, where you would see that there were actually Y chromosome types that were found in Eastern Africans from Ethiopia and in click-speaking Khoisan that weren't found in Niger-Kordofanian speakers from West Africa and Central Africa.
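For readers who have not met principal component analysis, the idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the analysis from any of the studies mentioned: the two populations, their allele frequencies (0.1 and 0.9) and the helper names are all made up, and only the first component is computed, using simple power iteration.

```python
import random


def pc1_scores(genotypes, n_iter=50, seed=0):
    """Score each individual (row) on the first principal component.

    `genotypes` is a list of rows of 0/1/2 allele counts, one column per
    variable site. Uses power iteration, so only PC1 is returned.
    """
    n_ind, n_sites = len(genotypes), len(genotypes[0])
    # Center every site on its mean allele count across individuals.
    means = [sum(row[j] for row in genotypes) / n_ind for j in range(n_sites)]
    centered = [[row[j] - means[j] for j in range(n_sites)] for row in genotypes]

    rng = random.Random(seed)
    axis = [rng.gauss(0.0, 1.0) for _ in range(n_sites)]
    for _ in range(n_iter):
        # Project individuals onto the current axis, then update the axis.
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        axis = [sum(centered[i][j] * scores[i] for i in range(n_ind))
                for j in range(n_sites)]
        norm = sum(a * a for a in axis) ** 0.5
        if norm == 0.0:
            raise ValueError("genotypes have no variation")
        axis = [a / norm for a in axis]
    return [sum(x * a for x, a in zip(row, axis)) for row in centered]


def simulate_population(rng, n_individuals, n_sites, allele_freq):
    """Draw 0/1/2 genotypes for a made-up population (toy data only)."""
    return [[(rng.random() < allele_freq) + (rng.random() < allele_freq)
             for _ in range(n_sites)]
            for _ in range(n_individuals)]


rng = random.Random(1)
pop_a = simulate_population(rng, 20, 200, 0.1)
pop_b = simulate_population(rng, 20, 200, 0.9)
scores = pc1_scores(pop_a + pop_b)
scores_a, scores_b = scores[:20], scores[20:]
print("PC1 range, population A:", min(scores_a), max(scores_a))
print("PC1 range, population B:", min(scores_b), max(scores_b))
```

With allele frequencies this different, the two toy populations fall on opposite sides of the first component. Real populations are far closer to each other, which is part of why a single locus like mitochondrial DNA often cannot separate them.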
And so you would see these kind of deep variations between populations. And the way this one in particular has always been explained is that there's this deep connection between Ethiopians and Khoisan and other speakers of click languages like the Sandawe and Hadza, where they carry Y chromosomes that are in the basal parts of the tree, the deepest part of the tree, however you want to describe it. And it's completely absent from the majority of people that live in Africa, the Niger-Kordofanian speakers, Niger-Congo speakers. And maybe this is because at one point in time this kind of ancient primordial population was spread much more widely across the continent, across Eastern Africa, across Southwestern Africa. But then at some point it gets erased by Bantu speakers coming along and leaving these little pockets of these ancient Y chromosomes around, which is an interesting idea.
However, you have to be very careful about interpreting Y chromosome and mitochondria. It is just one lineage along the human tree. And indeed, while this kind of deep clade was seen as meaningful back then, some years later we found that actually the deepest Y chromosome clade, A00, which was 300,000 years diverged from any other Y chromosome that we had found, was found in the Mbo, in Cameroon, in the Grassfields area. And no one would say that the Mbo are some kind of remnant of an ancient African hunter-gatherer population. It's just the fact that sometimes with the Y chromosome, you're looking at one lineage, you get random results. It's kind of a random thing. So you have to be very careful about just looking at these single loci because they lack a lot of power. They also lack a lot of resolution. So, when I was doing my PhD, we collected a lot of samples from Cameroon and the Cross River region.
We were interested in the Bantu expansion, and from Ghana we had hundreds and hundreds of samples from individual towns. And when we looked at the mitochondria and the Y chromosome, we really didn't find a lot of genetic differentiation with the Y and mtDNA. We didn't see any real correlates with genetic distance. So, here's a PCA, and each dot is a different Cross River Nigerian population, but you can see they're intermingled amongst South African Bantu and Rwandans. There doesn't seem to be a lot of genetic difference there based on geography. And similarly, we didn't see a lot of difference based on language. There wasn't any correlation between language and genetic diversity. And this is not because it doesn't exist, but it's just a limitation of just looking at Y and mtDNA, because last year we were able to look at these same populations using a whole genome approach.
And by doing that, we can now actually distinguish individual towns at the genetic level. We couldn't see that information with the Y and mtDNA. We could only do it by looking at the whole genome. So what is it about the whole genome that gives us this power? So, like I said, this is what we would normally look at if we were just looking at Y chromosome and mitochondrial DNA. Now, as we entered the late 2000s and 2010s, we moved into this new genomic era. There were two technical innovations: microarrays, which allowed us to look at a million different positions in the genome simultaneously, and then eventually DNA sequencing, so we could look at most of the DNA across the whole genome and the places where people showed variability. Now if we look at this same individual where we were looking at the Y and the mitochondria, but now we look at some chromosome like chromosome four, we've got this chromosome four here that this individual inherited from their father and this chromosome four here that they inherited from their mother.
And then we look at a specific position along that chromosome, say, position 1 million or whatever it is. When we are looking at that position, what we can see is there's essentially a lineage back in time that reflects the inheritance of the DNA at that particular position. When we look at the other chromosome, we can see this inheritance. So now we're not looking along this line, we're looking at this line, and we're looking at DNA from different ancestors in the past. Then there's this process of recombination, which shuffles DNA during meiosis. So, the DNA that you get from your parents is kind of this mixed-up version. And so, let's say there's a recombination event here in these ancestors, then that switches who we get to see in that part of the DNA.
And similarly, we move to another part of the DNA where there's another recombination event, and now we see another set of ancestors. So by looking at the whole genome, we see much more than just this one ancestor here and this one ancestor here. We actually get to see DNA from many, many parts of our ancestry, which obviously gives us this extra power. And you can see the more DNA we can look at, the more of the genome we can look at, the more of these ancestors we see and the more complete a picture of this genealogy we can get, which is ultimately what we want. If we can get this whole genealogy, that's exactly what we want, and that provides us with a wealth of information. So, Sarah Tishkoff's paper in 2009 is really the first big paper that did this at a broad level across Africa.
And so hopefully at this point you're all familiar with these kind of cluster analyses, admixture analyses, where each line is essentially one individual and the color represents the genetic ancestry that that individual has. So if you see a bunch of lines that are orange, those individuals share a very common ancestry, but individuals can also have a mixture of orange and green because they've got slightly different ancestries. And you'll see that, not unlike what we saw with the mitochondria, Eastern Africa is kind of different from Central Africa and Western Africa: you see this more purple ancestry here versus this kind of orange. And this pattern actually is not dissimilar to what we've seen when we've moved from SNP arrays to whole genomes, where basically you can see there's a kind of West African and Central African similarity, differences in Eastern Africa and so on.
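As a reading aid for these plots, here is a small Python sketch of the underlying data structure: each individual is just a set of ancestry proportions that sum to one, drawn as a colored bar. The individuals and their proportions below are made up purely for illustration, and the bar is drawn horizontally with letters (O for orange, G for green) rather than as a vertical colored line.

```python
def ancestry_bar(proportions, width=20):
    """Render one individual's ancestry proportions as a text bar.

    `proportions` maps an ancestry label to its share; shares must sum to 1.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("ancestry proportions must sum to 1")
    bar = ""
    cumulative = 0.0
    for label, share in proportions.items():
        cumulative += share
        # Fill up to the rounded cumulative position so the bar is `width` long.
        bar += label[0].upper() * (round(cumulative * width) - len(bar))
    return bar


# Made-up individuals, purely for illustration.
individuals = {
    "ind_1": {"orange": 1.0, "green": 0.0},
    "ind_2": {"orange": 0.75, "green": 0.25},
    "ind_3": {"orange": 0.0, "green": 1.0},
}
for name, proportions in individuals.items():
    print(name, ancestry_bar(proportions))
```

Here ind_1 and ind_3 each carry a single ancestry, while ind_2 is the kind of mixed individual described above.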
And this of course has allowed people to look at the Bantu expansion as well and look at the route. And it's not unfair to say this looks a lot like the initial Cavalli figure, because that's had such an influence. Maybe it's not exactly the same route, but that initial figure has really shaped how we think we're going to use genetic data when we look at these analyses. As for the things that might stand out to you, there are a few populations here in Central Africa and Southern Africa that look a little different. We have the Batwa Pygmies, Mbuti Pygmies, and the San and the Khoisan populations, and they have a different ancestry. Again, something that was hinted at in Cavalli's original trees, even if it was looking at it in a different way.
And this actually led to me taking some of that data and some other data and asking not just can we see who's different to who, but can we model the explicit relationships of these individuals? And so I kind of did the first modeling of how San were related to Pygmies and Niger-Kordofanian populations. And we found that the best model was that San were the most diverged human population amongst Africans and basically amongst humans. And this divergence took place somewhere between 100 and 200,000 years ago, depending on various assumptions you make. And to be honest, this was based on fairly simplistic data, but this pattern has been replicated a number of times with whole genome data since then. However, I will draw your attention to the model that was favored. It's a very simplistic model, this tree-like model. And you'll see later that it was kind of a really poor representation of how things really are.
But that's kind of how we often think in evolution. I've seen many, many trees from other people here looking at fossils. We think of a tree-like system and that's how we understand the world. And so, what is the problem that you have with autosomal DNA? So genome-wide data has given us a lot, but when I look at this kind of thing, I ask, is this the best we can do in terms of understanding human history? All this money that we spend sequencing all these base pairs and collecting all these samples, so that we can make these kind of tree-like things, which is nice, it's nice to know the San are the most diverged, but is it really telling us anything fundamentally important about history, or just these very macro-level, broad-level things?
And so part of the problem is even genome-wide data doesn't give us everything if we look at only modern populations. So, if I look at this guy here, these are all the individuals that I showed you before that had contributed some DNA to this individual. And you can see when you go to the grandparents, actually there's one grandparent who, even though it's clearly their grandparent, didn't contribute any DNA to this person in the future. And if we go all the way back five generations, only 25% of the individuals actually get represented in this individual's DNA. The other ones are clearly part of their family, but just by the luck of the game, their DNA doesn't make it into this guy. And so this is why history is written by the genealogical victors, because these are the ones that we see today, but we don't get to see any of these other guys who were probably very important.
And David Reich, as most of you probably know, actually has this nice figure in his book that really formalizes this concept. This is just a cartoon, but these are the real estimates. And if I look at myself and I go 15 generations back, it turns out that of all the ancestors that existed 15 generations ago, only 3% of them probably contribute DNA to my genome. 97% of my ancestors never actually contribute any DNA to me. So that's a huge amount of that genealogy that we're missing, of the past that we're missing, just because we have to limit ourselves to looking at my genome or modern genomes. Now hopefully you see where this is going. Why can't we just go and fill in the gaps by sequencing some of the bones? Why can't we go and take some of these skulls from Turkana and use our amazing sequencing technology, so that we can fill in some of these gaps here that we couldn't get by looking at modern samples?
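The 3% figure can be roughly reproduced with a back-of-the-envelope calculation, sketched below in Python. Everything in it is an assumption layered on the talk rather than something the speaker presented: it uses a common rule of thumb that g generations back your autosomal genome is split into about 2 × (22 + 33 × (g − 1)) blocks (22 autosomes, roughly 33 crossovers per meiosis), treats each block as landing independently and uniformly on one of 2^g distinct ancestors, and ignores inbreeding and variation in block size. The function name is hypothetical.

```python
def fraction_contributing(generations: int,
                          autosomes: int = 22,
                          crossovers_per_meiosis: int = 33) -> float:
    """Rough expected fraction of ancestors that contribute any DNA.

    ASSUMPTIONS (not from the talk): a rule-of-thumb block count, blocks
    assigned independently and uniformly to ancestors, and an idealized
    pedigree of 2**generations distinct ancestors.
    """
    if generations < 1:
        raise ValueError("generations must be at least 1")
    total_ancestors = 2 ** generations
    # Approximate number of autosomal blocks traced back this far, counting
    # both the maternal and paternal copies of the genome.
    blocks = 2 * (autosomes + crossovers_per_meiosis * (generations - 1))
    # Probability that a given ancestor receives none of the blocks.
    p_missed = (1 - 1 / total_ancestors) ** blocks
    return 1 - p_missed


for g in (1, 5, 10, 15):
    print(f"{g:2d} generations back: {fraction_contributing(g):.1%} contribute DNA")
```

Under these crude assumptions the value at 15 generations lands close to the 3% quoted above, while almost every ancestor a few generations back still contributes, which fits the remark that the five-generation picture is just a cartoon.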
And of course, when I was doing my PhD, we never thought this was remotely attainable. But the world has completely changed because of this guy here, Svante Pääbo, when he sequenced a Neanderthal genome and rightly received the Nobel Prize last year. And it has been just an absolute revelation in terms of how we can now get DNA from ancient material. And indeed, last month we hit 10,000 ancient genomes, which is quite an achievement. Within 10 years, we've sequenced 10,000 ancient individuals at certain levels. Most of this data, but not everything, is generated by three big groups: David Reich at Harvard, Johannes Krause at the Max Planck in Leipzig, and Eske Willerslev. Now this is great, and my lab has been doing a lot of work on ancient DNA in Europeans, and I have a student that's got 500 genomes they're analyzing as one chapter in their PhD thesis.
However, what you will notice down here is Africa. Of these 10,000 genomes, only 3%, a very small amount, are actually of African origin. And that is an issue if we want to use ancient DNA to understand history. So why is this a problem? So when we die, the enzymes that protect our genome are no longer around. So, the DNA starts to break up into little pieces and even changes, we get postmortem damage. Now we've got very good in the ancient DNA world at sequencing these tiny little bits and even correcting some of these changes. However, the rate at which this happens is very different depending on the environment. So, the oldest DNA that we've been able to sequence so far is about 1.2 million years old, which is a mammoth. It comes from permafrost. So, DNA survives really well in dry, cold environments. It does terribly generally in hot, humid environments, which obviously is problematic for getting material from Africa. So that has been a big stumbling block. Any kind of area that's tropical in particular is very, very difficult.
Now, these are all the genomes that have been sequenced and published. I'm sure there's more hidden in David Reich's lab somewhere. But these are all the genomes that we have from Africa. And you can see there's a lot from Kenya, which is hopefully nice to see. There's only 166, and like I said, I have a PhD student working on 500, and there are only 13 that are more than 5,000 years old. So that is a huge limitation on what we can do in the ancient DNA world. When we are thinking about Europe and how ancient DNA has transformed our understanding of Europe, we've basically got individuals that cover the entire span of when humans have been in Europe. So we have everything, basically. We obviously don't have what we want from Africa. That doesn't mean ancient DNA hasn't been useful. So the first ancient African genome that was sequenced was in 2016, which was an Ethiopian individual. Well, he wasn't Ethiopian, but he was from where Ethiopia is today, around 5,000 years old, from Mota, the Mota cave. And kind of interestingly, or maybe not interestingly depending how you think about it, when they looked at who this individual was genetically closest to amongst modern populations, he was most similar to Ethiopians, but that's interesting in itself because it suggests some level of continuity for 5,000 years, right?
That's an interesting finding, but it wasn't exactly the same as well. It lacked some ancestry that modern Eastern Africans have. And it turns out that this ancestry that Mota lacked, that a modern Ethiopian walking around has, is the same genetic ancestry that's actually found in the first Neolithic farmers that went from Anatolia and spread farming all across Europe. So actually it looks like they went in both directions to some extent. They had a much bigger effect in Europe, but they did try and go back into Africa. And that's just from one genome that you get these really nice insights. Another paper, which I'm on, but really I did nothing, David Reich and the other people did all the work. They were able to get four samples from the Shum Laka archeological context, which is in the Grassfields of Cameroon, a key place for understanding the Bantu expansion, where we think the proto-Bantu languages come from.
And they had two individuals that were 8,000 years old and two individuals that were 3,000 years old. So kind of either side of the Bantu expansion. And what was striking is these individuals, 8,000 or 3,000 years old, 5,000 years separated, were genetically pretty similar. Again, we have this kind of 5,000 years of genetic continuity almost in this small area. But they were very, very different from the West Africans, the Bantu speakers, that live today. Even the 3,000-year-old ones were more like Central African hunter-gatherers. And this is the kind of insight we could only have gained from ancient DNA. We would never have known that this kind of diversity existed 3,000 years ago by looking at modern populations. We're just blind to it unless we get ancient DNA. Similarly, we were just talking about pastoralism, so I'll highlight a paper from another Stony Brook alumna, Elizabeth Sawchuk, who's been really heavily involved in a lot of this ancient DNA work in Africa, where they were looking at a bunch of pastoralists and pre-pastoralists, Late Stone Age, from Kenya, from the Turkana Basin.
And again, the remarkable thing you see is those individuals from the Late Stone Age have a genetic signal that looks a lot like Eastern African foragers and is very distinct from what would be the pastoralist genome, which suggests a big shift in ancestry and maybe some kind of migration and admixture. What's even more interesting is that once you have this shift, from about 3,000 to 2,000 years ago, even a little bit later, there are pastoralist samples from a variety of different pastoralist cultures, and though there's this incredible archeological diversity about them, genetically they're all pretty similar. And so that's an interesting insight as well. It tells us about the demographics of that process: a lot of cultural diversity, but actually a fairly small genetic pool.
Now, those have been interesting. I want to just draw attention to this paper by Brenna Henn, who was a faculty member here, very sad that she left some days ago. This paper came out I think two weeks ago, big news in the New York Times, and it goes back to that first model that I came up with about San being the most divergent, and the Pygmies and Niger-Kordofanians. As we've got more and more genetic data, it's been harder and harder to fit that model, because we now have a lot more things that just don't seem to fit. So, people have invoked things like archaic admixture from a ghost archaic hominin that maybe introgresses into West Africans, kind of mimicking what we see with Neanderthals, where we have actual DNA, but we still maintain that the first divergence of human populations is 100, 200,000 years ago.
It's just introgression from this ghost archaic hominin. What Brenna was able to show is that actually, with a much more complex model involving population structure that lasts for about a million years, with all these kind of loosely connected gene flows, you can fit the data just as well, if not better, but it suggests a radically different kind of model. Rather than thinking about trees, it's now a graph, it's a network of things, and it's really not just about bifurcations, it's coming together and things like that. The problem is, if you're going to a million years for ancient DNA, these are all just made-up populations. These are just models, we don't know anything about them. It would be great to have ancient DNA to even get some of these guys, but we don't know who they were. But if we could, it would be important. And as Brenna points out in the paper, other African populations, such as pre-Holocene ancient DNA samples, could further test our proposed models. We can get really empirical about this rather than these kind of guessing things. So, this all leads to the natural conclusion of what is needed, which is a paleogenomics lab at the Turkana Basin Institute. And I will end there because I'm pretty sure I went over time, but thank you.
The Turkana Basin Institute is an international research institute to facilitate research and education in paleontology, archeology and geology in the Turkana Basin of Kenya.