Mikael Fortelius / Indrė Žliobaitė
Collection Bias: Making use of what everybody knows
As everybody knows, and as Richard Leakey was fond of pointing out, fossils are collected under many kinds of bias. Mikael Fortelius and his team explore how knowledge of such biases might be obtained post hoc, from understanding of the general processes and priorities involved.
FULL TRANSCRIPT
Hi everybody. Hope you can hear me. Okay. So my name is Daniel Green. I'm a postdoctoral scientist at Columbia and a colleague of Mikael and Indre. I work with them in the Turkana Basin and, like them, was introduced to that area really by Richard and Meave. I'm afraid that neither Mikael nor Indre can be here today, so they've asked me to present their talk in their place. Their talk is titled “Collection Bias: making use of what everyone knows”. Mikael and Indre extend their warm greetings to everyone here. They encourage you to reach out to them. You can see that I've noted their emails up above, and they, above all, are thinking about Richard during this week in tribute to his life and work.
So Mikael and Indre would like to start from remarks that Richard made when they gave a presentation in the spring of 2021 about their work in the Turkana Basin, mostly an analysis of fossil mammalian data. This was in TBI's excellent seminar series. Richard was very positive, almost polite even, but he also made some important and critical remarks that form the basis of their talk today. You might even call this talk a kind of belated response to what Richard said at the time. They've transcribed Richard's remarks above, and I note one particularly important passage, in bolded text, that they would like you to pay attention to. So, I'll just read there: “When I started the work at Turkana and conditions were very different. We had a very specific set of instructions as to what was to be noted and collected and what was not to be. And Louise has told me recently how many elephants they're finding now, skulls and mandibles. I said, yes, we found them at the time, but we didn't collect them because we didn't know where to put the damn things. And I think that the same goes with postcranial elements of pigs and bovids, there's a tremendous amount of material that wasn't collected and that was a deliberate policy”.
“I think we need to keep in mind that there was an element of deliberate bias occasioned by the circumstances under which we were working that I haven't heard to have been incorporated into caveats as to the reliability of some of the results of the analysis of this data”. So, Mikael and Indre conclude with one last quote from him: “It's terribly exciting what you're doing and as we go forward, I think we'll get an awful lot of information. But remember tectonics and I think also goats have had a huge impact on what you find and see and what lived where in the past”.
When Mikael and Indre refer in their title to what everyone knows, they mean this. There's bias, bias, bias everywhere. Should we give up? Mikael’s instincts say no, not at all. He's a tooth man and he has spent his lifetime looking at a similar problem. The teeth that we find in the fossil record are damaged in many ways and almost invariably worn by use. People used to think of tooth wear as a problem for paleontologists, but for Mikael it's always been a source of valuable information. Unworn teeth may be beautiful, but they're actually far less informative. Mikael and Indre suggest that the same may apply to collecting biases. They are, if you like, not a bug, but a feature of collections. Let's consider briefly the kinds of biases involved in the creation of what we call the fossil record. Much of this is familiar stuff and so this will be brief. The first is, of course, what originally lived there. Paleontologists think of this as macroecology and address it in terms of scaling relationships including metabolism, trophic structure and so forth. This talk won't touch on this today. Secondly, we have what is known as taphonomy: What was preserved, short and long-term? What was exposed? What was found? This is interesting, but of course also falls outside this talk's focus today. And finally we have what Richard was talking about, using databases to understand what happened in the past. In this example, you have the remotely located Buluk locality.
So here are some records from the Buluk locality, where all sorts of challenges apply even today. You can ask Ellen Miller about it, she's here in the audience. Some of these challenges are, for instance, where people can go in the first place and what they choose to collect, based upon interest and the ability to identify and store what was actually collected. There are also collection processes: collectors' decisions, what expertise is available at the time of collection in the field and in the lab, what people actually publish in the end, and what ends up in databases. There are post-collection processes: curatorial work and editorial work. These are Mikael and Indre's foci today, and they repeat, they regard these things not as bugs but as features.
Mikael’s recollection of his youth is more blurred than the photos shown above. In the cold north, his circumstances were obviously far different from those of Richard Leakey at the equator. But Mikael’s father kept National Geographic magazine, and so they were actually well-informed of the amazing work of the Leakey family and of Richard's discoveries in the Turkana Basin almost as they happened, and they were enormously excited by them. Although Richard was only 10 years ahead of Mikael in age, Mikael says that Richard was so staggeringly far ahead of him in mileage that it never even occurred to Mikael that their paths might one day cross. Mikael didn't even become a professional paleontologist, he says, until Richard had already moved on to more urgent concerns such as conservation. And Mikael's own fieldwork, when it started, didn't take him to Africa but to the deserts of Asia. You can therefore imagine Mikael's astonishment when, in the spring of 2010, he received a letter from Richard inviting him to a workshop at Lake Turkana. Mikael says that he still doesn't quite know what had happened to him, but let him just say that it didn't occur to him to turn the invitation down.
So there Mikael was, and you can see him circled in red in the middle of the Turkana scene. These were early days in the development of the TBI field stations. There was a lot of excitement and it wasn't just about the science. Mikael had never imagined that he would do more than attend the meeting and get to see Lake Turkana for once in his life. But somehow, he got more involved and ended up suggesting to me that they could do something there similar to what Mikael had been doing in Asia in the previous decade. Fossil rainfall, ecometrics and so forth.
And so, Mikael did get involved more and more. Sitting with Meave on the roof of the lab building at Turkwel, poring over the data, combining additional files from here and there with Rene Bobe's database into one extended master file, learning the meaning of collecting areas, air photos with pinpricks and popcorn gravels. A high point was meeting Isaiah Nengo at Turkwel before Isaiah met the fateful 13-million-year-old juvenile ape called Alesi that catapulted him into fame and perhaps eventually also contributed to his tragically premature death at the height of his career. Mikael and Indre are proud to be carrying on Isaiah's vision in the Turkana Miocene Project, well represented at this meeting, as all of you know. After several rejections, Mikael and others finally got a project funded by the Academy of Finland. The ECHOES Project would bring together older veterans and younger researchers to compile, revise, and analyze the full set of mammal data from the Turkana Basin in Kenya. ECHOES organized two workshops for the participants and collaborators. This photo is from the first workshop in Helsinki in the autumn of 2017. The photo is from the very cool interior of the science campus of the University of Helsinki.
And here they all are, squinting in the September sun. ECHOES was, of course, what brought Mikael's colleague and co-presenter Indre into this mix, and this is where he hands things over to her. And I should just note that Indre and Mikael are both very excitedly watching.
So, Indre entered the ECHOES Project as a data analyst. This is one of the first outcomes, where Indre, Mikael and others reconstructed paleoclimate in the Plio-Pleistocene of the Turkana Basin based upon dental traits of mammalian communities. So, this is their famous ecometric work. This paper was published in a special volume on the occasion of Richard's 70th birthday. While doing this work, it was obvious that if we want to reason for real from the fossilized communities to the living ecosystems at large, it's not so much a question of running the fanciest analytical methods as it is about making good sense of the data. As always, where does the data come from? What does the data represent? The ECHOES project was a lot about making sense of data at large. It was about the synthesis of the data. The picture here shows the second ECHOES workshop in Turkana, out of which came the synthesis file. Here you can see a snapshot of this file compiled by Kari Lintulaakso. It's a compilation of over 50 years of collection data from multiple sources and databases in the Turkana Basin. Colors indicate where attention is still needed.
You know them very well, so you won't be surprised of course. Multiple data compilations and revisions followed and allowed Indre, Mikael, and other ECHOES participants to look at the history of collection, which naturally varied over the years in the collecting areas that were covered. That's the normal way that collection field research works, depending upon funding, research questions, teams and more. So here you can just see how many specimens were collected over time. And then you can see that there's version 47 there, and there are even more versions that have been produced afterwards. Naturally, collection volumes also varied taxonomically. This shows the proportion of taxonomic groups collected in different years. Indre and Mikael see, for instance, downward trends in large fossil mammals collected as collections filled up, and also as the work of individual researchers became less constrained. Carnivores, on the other hand, were relatively stable at all times. Perhaps researchers were collecting everything that was found for these specimens. Primates and artiodactyls also do not show trends, but show higher variance, perhaps depending upon the collection method. The peak in rodents relates to the PhD thesis of Dr. Kyalo Manthi, and this suggests just what kind of impact a student and a Kenyan researcher can have on a global scientific effort centered in Turkana.
We see that collection proportions have varied over the years, even in the same basin with comparable geological and taphonomic circumstances. And to an extent we know why they varied, as highlighted by Richard earlier in this talk. The majority of the fossil record is like that. It's collected following research and funding priorities. If this is inevitably so, what then about analyzing the communities? How can we rigorously reason from fossil databases about the living communities they represent? And what is a fossil community after all? The good news is that we can analyze the collection processes statistically, somewhat like taphonomic processes. We can quantify from observational data how fossils make it into the fossil record, and we can use that information for reasoning about communities, including answering what a fossil community is. So, this is Lothagam. Indre, Mikael, and others from the Lothagam Research Project and TMP collected data for this pilot study during the TMP field season last year. The data covers a bit less than 300 fossil observations, out of which 42% were collected. The features noted in this study include the body part and taxonomic ID to the family level, the size of the fossil, its preservation, its completeness, and most importantly of course, what fossils the team chose to collect and not to collect. They then used this data for a pilot study to build a statistical model to predict the probability of collection.
The simplest model is a logistic regression. It's a linear model with a non-linear twist at either end. The weights on the Y axis, which you can see over there on the left, can be interpreted similarly as for a regular linear regression: the higher the weight, the higher the contribution towards the probability of collection. So here this model is fit to the Lothagam data set just described. And you can see, for instance, that primates and carnivores are highly likely to be collected, while crocodiles, fish and turtles are highly unlikely to be collected, or less likely to be collected. Undetermined taxa have a low probability of collection. Among body parts, horn cores and teeth have high probabilities, while vertebrae and such have low probabilities. Surprisingly, and somewhat counter to expectations, size, preservation, and completeness in this model do not carry much weight in either direction. Digging into the data, the field and fieldwork practices, it became obvious that their importance, that is, the importance of size, preservation, and completeness, depends on specific context, but how? So, there are other ways of looking at this data as well. This is a decision tree, another classifier, a nonlinear local model that can capture different dependencies in different circumstances. Each leaf at the end of the tree represents an outcome. The percentages in orange and blue predict the probability of collection, with orange indicating high probabilities and blue indicating low probabilities of collection.
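To make the logistic regression described in this passage concrete, here is a minimal sketch in plain Python. It is not the authors' model or data: the observations are invented toy records, the feature names (`taxon=...`, `part=...`) are hypothetical, and the fitting routine is a simple gradient descent with a small L2 penalty chosen only for illustration. What it shows is the interpretation given in the talk: each category gets a weight, and a higher weight means a higher contribution towards the probability of collection.

```python
import math

# Toy observations, invented for illustration only (NOT the Lothagam data).
# Each record is (taxon, body_part, collected), with collected = 1 or 0.
OBSERVATIONS = [
    ("primate", "tooth", 1),
    ("primate", "vertebra", 1),
    ("carnivore", "tooth", 1),
    ("carnivore", "vertebra", 1),
    ("bovid", "horn_core", 1),
    ("bovid", "tooth", 1),
    ("bovid", "vertebra", 0),
    ("crocodile", "tooth", 0),
    ("crocodile", "vertebra", 0),
    ("fish", "vertebra", 0),
    ("turtle", "shell", 0),
    ("undetermined", "vertebra", 0),
]


def one_hot_features(records):
    """Turn each record into a 0/1 vector with one column per category."""
    names = sorted({f"taxon={t}" for t, _, _ in records}
                   | {f"part={p}" for _, p, _ in records})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for taxon, part, _ in records:
        row = [0.0] * len(names)
        row[index[f"taxon={taxon}"]] = 1.0
        row[index[f"part={part}"]] = 1.0
        rows.append(row)
    return names, rows


def sigmoid(z):
    """The 'non-linear twist': squash a linear score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000, l2=0.01):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, k = len(rows), len(rows[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(rows, labels):
            score = bias + sum(w * xi for w, xi in zip(weights, x))
            err = sigmoid(score) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias -= lr * grad_b / n
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
    return weights, bias


names, rows = one_hot_features(OBSERVATIONS)
labels = [collected for _, _, collected in OBSERVATIONS]
weights, bias = fit_logistic(rows, labels)
weight_by_name = dict(zip(names, weights))


def predict(taxon, part):
    """Predicted probability of collection for one taxon and body part."""
    score = (bias
             + weight_by_name.get(f"taxon={taxon}", 0.0)
             + weight_by_name.get(f"part={part}", 0.0))
    return sigmoid(score)


# Print weights from most to least favourable for collection.
for name, w in sorted(weight_by_name.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {w:+.2f}")
```

Calling `predict("primate", "tooth")` and `predict("fish", "vertebra")` on this toy set gives a high and a low probability respectively, which mirrors the direction of the pattern described in the talk but says nothing about the real magnitudes.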
We can see here in this pilot study that if we're dealing with an elephant, for instance, then teeth are very likely to be collected and feet are unlikely to be collected. If we're looking at artiodactyls, things are more complicated. Collection depends on the body part, preservation, completeness, and size. So this is a pruned tree, pruned deliberately for better generalization here in this talk. Thus, we can also see that some probabilities are not 100% or 0%, which would mean collect or not collect. For instance, look at carnivores and perissodactyls. We can potentially grow a deeper tree from these nodes to find what further depends upon what. At the same time, primates are still very confidently likely to be collected and reptiles are still very unlikely to be collected. So collection depends, and this way we can potentially quantify what it depends on. Cross-validation, perhaps in other studies, reassures us that many such patterns are generic and predictable. We can model the collection process to quantify what we already know intuitively. This can then be used to reason about fossil communities at large and the processes involved. Thus, our response to Richard's remark at the beginning of this talk is that it's not a hopeless mess: we, the community, can deal with it, also quantitatively, as knowledge accumulates.
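The decision tree idea described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the tree from the pilot study: the records are invented, the feature names (`taxon`, `part`, `complete`) are hypothetical, the split criterion (Gini impurity) is an assumption since the talk does not name one, and limiting depth with `max_depth` stands in for pruning. Each leaf reports the fraction of its records that were collected, so a shallow tree gives intermediate probabilities and a deeper one resolves them.

```python
from collections import defaultdict

# Toy records invented for illustration only (NOT the pilot-study data).
RECORDS = [
    {"taxon": "elephant", "part": "tooth", "complete": "yes", "collected": 1},
    {"taxon": "elephant", "part": "tooth", "complete": "no", "collected": 1},
    {"taxon": "elephant", "part": "foot", "complete": "yes", "collected": 0},
    {"taxon": "elephant", "part": "foot", "complete": "no", "collected": 0},
    {"taxon": "primate", "part": "tooth", "complete": "no", "collected": 1},
    {"taxon": "primate", "part": "foot", "complete": "no", "collected": 1},
    {"taxon": "reptile", "part": "tooth", "complete": "yes", "collected": 0},
    {"taxon": "reptile", "part": "foot", "complete": "yes", "collected": 0},
    {"taxon": "artiodactyl", "part": "tooth", "complete": "yes", "collected": 1},
    {"taxon": "artiodactyl", "part": "tooth", "complete": "no", "collected": 0},
    {"taxon": "artiodactyl", "part": "foot", "complete": "yes", "collected": 1},
    {"taxon": "artiodactyl", "part": "foot", "complete": "no", "collected": 0},
]
FEATURES = ["taxon", "part", "complete"]


def collection_rate(rows):
    """Fraction of records in this group that were collected."""
    return sum(r["collected"] for r in rows) / len(rows)


def gini(rows):
    """Gini impurity of a group: 0 when all records share one outcome."""
    p = collection_rate(rows)
    return 2 * p * (1 - p)


def split(rows, feature):
    """Group records by the value they take for one categorical feature."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r)
    return groups


def weighted_gini(rows, feature):
    """Average impurity of the groups produced by splitting on a feature."""
    return sum(len(g) / len(rows) * gini(g)
               for g in split(rows, feature).values())


def grow(rows, features, max_depth):
    """Grow a tree; a small max_depth plays the role of pruning."""
    rate = collection_rate(rows)
    if max_depth == 0 or not features or gini(rows) == 0.0:
        return {"leaf": True, "p_collect": rate, "n": len(rows)}
    best = min(features, key=lambda f: weighted_gini(rows, f))
    remaining = [f for f in features if f != best]
    children = {value: grow(group, remaining, max_depth - 1)
                for value, group in split(rows, best).items()}
    return {"leaf": False, "feature": best, "p_collect": rate,
            "children": children}


def predict(tree, record):
    """Walk from the root to a leaf and return its collection rate."""
    node = tree
    while not node["leaf"]:
        child = node["children"].get(record[node["feature"]])
        if child is None:  # unseen category: fall back to this node's rate
            return node["p_collect"]
        node = child
    return node["p_collect"]


shallow = grow(RECORDS, FEATURES, max_depth=1)  # heavily "pruned"
deep = grow(RECORDS, FEATURES, max_depth=2)     # one level deeper

elephant_tooth = {"taxon": "elephant", "part": "tooth", "complete": "no"}
print(predict(shallow, elephant_tooth), predict(deep, elephant_tooth))
```

On this toy set the shallow tree leaves elephants at an intermediate probability, while the deeper tree separates elephant teeth from elephant feet and splits artiodactyls on a different feature. That matches the point made in the talk that what matters depends on context; the specific numbers carry no empirical meaning.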
Mikael and Indre are very much missing all of you and looking forward to when they can see you next. Indre will be in the field in Turkana soon so many of you will see her there. She's just won a major grant from the Academy of Finland to continue this work looking at macroevolutionary processes and in particular tracking mammalian and plant functional communities from the Oligocene to the present. She's hiring a number of graduate students, so students in the back, if you're interested or if you have graduate students that you might have in mind, please do be in touch with her and with Mikael. With that, please thank the two of them who are watching.