Denné N. Reed
A comprehensive database of the published hominin fossils from the Lake Turkana basin
The Turkana Basin represents one of the largest assemblages of early hominin fossils. This paper presents the first, comprehensive, validated digital catalog of all the published hominin fossil material from this area.
FULL TRANSCRIPT
Thank you everyone. It's an honor and a pleasure to be here and participate in the symposium commemorating the legacy of Richard Leakey. And today I'd like to share some new work towards developing a comprehensive online database of hominid fossils from the Turkana Basin. Work that builds on the discoveries of Leakey and others that have dedicated their efforts to expanding the human fossil record.
Now, our understanding of human evolution is based on a global set of fossil, archeological, and geological discoveries, of which the Turkana Basin represents one of the most significant. Over the past sixty years of research there, the combined efforts of several teams have amassed tens of thousands of vertebrate fossils, including over a thousand hominid fossils. Now, from a paleobiological perspective, the Omo-Turkana Basin represents a geographically contiguous sedimentary system, as Kay pointed out, and understanding the larger scope of human evolution in the system, I think, requires combining fossil evidence from the Omo and Fejej regions in the northern end of the basin with the fossil record at Koobi Fora, Allia Bay, and Ileret, as well as the deposits in West Turkana, not to mention those from Turkwel, Lothagam, and Kanapoi in the southern portion of the basin. Fully realizing the scientific potential of these fossil deposits requires integrating diverse data across multiple projects and bringing them together digitally. A system for integrating paleoanthropological data has long been recognized as necessary, and yet its development poses, I think, one of the major challenges facing paleoanthropology today. It's a challenging problem, which is why there's no overarching catalog of the hominid fossil record, or of the Turkana Basin for that matter, much less a comprehensive catalog of the associated faunal, isotopic, geochronological, and related information that informs our wider interpretation of these fossils.
This challenge was the impetus for initiating the Origins project. Origins is an online resource for human origins studies that seeks to bring together information on hominid fossils, sites, taxa, and nomenclature. We recently released the first component of the project, focusing on hominid nomenclature. This module features a comprehensive online catalog listing all the taxonomic names introduced into paleoanthropology since Linnaeus in 1758, which currently is about 200-plus names.
Not only is this listing comprehensive, each name is evaluated against its source publication to assess the availability of the name and also its potential validity. As such, I think it provides a shared open-access resource that the paleoanthropological community can build upon. Now additionally, each hominin listed in Origins is linked to its respective type fossil. So, behind the scenes, Origins contains an expansive database of over 3000 hominid fossil specimens from over 500 Miocene to Pleistocene ape sites across Africa and Eurasia, along with stratigraphic and geochronological context about those sites. We're currently working on finalizing the validation of the type fossils in the database and developing public-facing web pages to present this, which represents the first step in opening the fossils module in Origins. In addition to the type fossils, I'm excited today to talk about the progress in developing the Origins Turkana database.
So the main goal of the Origins Turkana project is to establish a comprehensive online catalog of all the published hominid fossils from the Turkana Basin region of northern Kenya and southern Ethiopia. And this represents a major step towards the larger aim of a comprehensive catalog of the entire human fossil record. Origins got its start back in 2018, when I presented an earlier version of this project at the AAPA meetings in Austin. And this initial version, let's call it 0.1, aggregated data from existing publicly available digital sources for all hominid fossils of the Miocene through early Pleistocene. Now, in creating this early version, I focused not only on bringing together the data, but also on establishing the digital infrastructure and the models to support a large-scale online catalog of the human fossil record.
Now, I was able to do this quickly by staging Origins on the Paleo Core platform. Paleo Core is an open-source platform for paleoanthropology that I built with NSF funding starting in 2012. The platform currently serves 12 major projects and hosts data for hundreds of thousands of vertebrate fossils, artifacts, and geological specimens. So, Origins is one of the projects on Paleo Core, but whereas the other Paleo Core projects feature data related to individual projects, much of which is proprietary, Origins features publicly accessible data across many sites. So, in late 2020, during one of the meetings of the African Rift Valley Research Consortium, or ARVRC, I learned that Sandrine Prat and her colleague Francois Marchal were compiling a database of hominid fossils from the Turkana Basin region, and I reached out to them to collaborate. Now, Francois and Sandrine are both actively engaged in field research in the Turkana Basin, and since 1997, Francois has been assembling a database of published hominin fossils. And in 2018, Francois and Sandrine presented a poster at the UISPP World Congress documenting their work. Now, by combining their carefully curated catalog with the first version of the Origins database, we have together produced a detailed, systematically structured, and comprehensive database of the hominin fossil record that integrates data from multiple Plio-Pleistocene sites in the Turkana Basin. It is linked directly to the primary literature, it's geospatially aware, and it links each fossil to contextual data about its location, geological context, geochronometric age, et cetera.
This slide illustrates the basic workflow we use to create the Origins Turkana online catalog. Fossils were collected in the field, resulting in hard-copy catalogs of their discovery. In this case, that's the published literature, and those hard copies were digitized to Excel spreadsheets that were compiled by Francois and Sandrine, and these were then ingested into Origins. The ingested data were aligned with previously ingested digital data based on the fossil catalog numbers. So, aligning the data, that is, matching the catalog numbers against numbers that were already in the first version, Origins 0.1, proved a real challenge, and it highlights the first point I want to emphasize in this talk, which is the need for paleoanthropology to develop a system of stable, globally unique identifiers for all fossil specimens. Once we had the data aligned by catalog number, I was able to map the remaining database fields from the Marchal-Prat database onto the Origins data structure.
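To make the alignment step concrete, here is a minimal sketch of matching catalog numbers across two sources. It is not the actual Origins code: the function names, the `catalog_number` field, and the regular expression are assumptions for illustration, and the sample strings only show formatting variants. Real catalog numbers come in more variants than this pattern handles.

```python
import re
from typing import Optional

# Assumed pattern: "KNM", a two-letter locality code, a number, and an
# optional single part letter, with loose spacing and hyphenation.
_CATALOG_RE = re.compile(
    r"^\s*KNM[\s\-]*([A-Z]{2})[\s\-]*(\d+)[\s\-]*([A-Z])?\s*$",
    re.IGNORECASE,
)

def normalize_catalog_number(raw: str) -> Optional[str]:
    """Return a canonical key such as 'KNM-ER 1470', or None if unparseable."""
    match = _CATALOG_RE.match(raw)
    if match is None:
        return None
    locality, number, part = match.groups()
    key = f"KNM-{locality.upper()} {int(number)}"
    if part:
        key += f"-{part.upper()}"
    return key

def align_by_catalog_number(new_rows, existing_rows):
    """Pair new rows with existing rows that share a normalized catalog number."""
    existing = {}
    for row in existing_rows:
        key = normalize_catalog_number(row["catalog_number"])
        if key is not None:
            existing[key] = row
    matched, unmatched = [], []
    for row in new_rows:
        key = normalize_catalog_number(row["catalog_number"])
        if key is not None and key in existing:
            matched.append((existing[key], row))
        else:
            unmatched.append(row)  # needs manual review
    return matched, unmatched
```

Rows that land in `unmatched` are the ones a person has to resolve by hand, which is the cost that stable, globally unique identifiers would remove.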
From there, I used Python scripts to cross-validate the data for consistency and accuracy. In this step, I worked with my collaborators to iteratively evaluate the data and root out problems. Validation reveals, and this is common, a lot of errors in the source publications. So you find things like catalog numbers that are mistyped or incomplete, or cases where there are disagreements about taxonomic assignment between different investigators. This brings me to the second major takeaway that I'd like to present in this talk: online resources, if they're maintained, can be iteratively refined and perfected over time to provide high-quality data the community can draw upon. The resulting database contains 1,231 unique hominin fossil entries from the 10 Turkana sites, including 10 type specimens. These fossils were derived from over a hundred source publications. And if you want a challenge, see if you can think of 9 or 10 extinct species associated with the type specimens collected from the Turkana Basin.
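The kinds of checks described above can be sketched in a few lines. This is an illustrative example, not the project's validation scripts: the record layout, the field names, and the placeholder specimens and taxa are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: one row per (specimen, source publication) pair.
RECORDS = [
    {"catalog_number": "KNM-XX 1", "taxon": "Taxon A", "source": "Publication 1"},
    {"catalog_number": "KNM-XX 1", "taxon": "Taxon B", "source": "Publication 2"},
    {"catalog_number": "KNM-XX 2", "taxon": "Taxon A", "source": "Publication 1"},
    {"catalog_number": "KNM-XX", "taxon": "Taxon A", "source": "Publication 3"},
]

def find_taxon_conflicts(records):
    """Map each catalog number to its taxa when sources disagree."""
    taxa_by_specimen = defaultdict(set)
    for rec in records:
        taxa_by_specimen[rec["catalog_number"]].add(rec["taxon"])
    return {
        number: sorted(taxa)
        for number, taxa in taxa_by_specimen.items()
        if len(taxa) > 1
    }

def find_incomplete_numbers(records):
    """Return records whose catalog number contains no digits at all."""
    return [
        rec for rec in records
        if not any(ch.isdigit() for ch in rec["catalog_number"])
    ]

conflicts = find_taxon_conflicts(RECORDS)
incomplete = find_incomplete_numbers(RECORDS)
```

Checks like these only flag candidates; deciding which source is right still takes a person going back to the publications.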
The online administrative interface features a list view of the fossils that can be filtered according to different criteria, and each listing is associated with a detailed view for updating information about each fossil, including images and geospatial locations associated with the fossils. These pages are restricted to the development team. Currently, we're working on the public-facing pages that'll be available once the project is published. Part of what sets Origins apart is the attention directed to developing a common conceptual model for paleoanthropological data. This is work I've been doing in collaboration with Emily Coco at NYU. We surveyed several existing paleoanthropological data sets and identified 12 foundational classes of information that nearly all projects have in common. So paleoanthropological data sets commonly track people, institutions, and projects; that's the who. The specimens themselves are the what. Locations and places, geospatially as well as stratigraphically, that's your geological context, which really provides the where. The ages in geological terms, as well as the events associated with the discovery and publication of fossils, give you the when.
And then there are another six sets of information that further enrich your information about the specimens. So that would include perceptual types, which might be information about taxonomic IDs or cultural typologies, and functional typing, which can include anatomical elements, artifactual functional types, or geological facies. Then you need to keep track of the various units that you're using in your typology, the taxa or types, the nomina associated with them, associated measurements and character states perhaps, and references and publications. So these are the types of information that we all keep track of. We then link these core concepts together to form a standardized conceptual model for paleoanthropology that's centered around the specimen.
Now, on the left, we have the various classes of information that I talked about that are related to each individual specimen, and on the right we see subclasses that specimens can be assigned to and information specific to those subclasses. So, for example, there are fossils that are associated with biological taxa, and taxa with nomina. And we have similar classificatory systems that would also apply to artifacts and geological samples. Each box here represents a broad class of information, and each class has attributes, that is, more detailed information relevant to that class. So specimens, for instance, would have catalog numbers or globally unique identifiers, and maybe a disposition indicating where that specimen is stored. Likewise, a subclass of specimen, say a fossil, gets all the attributes of the superclass, the specimen, and also additional attributes that are unique to fossils. So that might be things like the biological sex, the anatomical element, et cetera. Now, because all of this is modeled in Python, we can also take advantage of object-oriented programming principles such as inheritance and encapsulation to implement these subtypes in the model.
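The specimen and fossil relationship described here maps directly onto class inheritance. The sketch below uses plain Python dataclasses; it is not the Origins or Paleo Core source, and the attribute names and the placeholder values are assumptions based only on the attributes mentioned in the talk.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Specimen:
    """Superclass: attributes shared by every kind of specimen."""
    catalog_number: str
    guid: Optional[str] = None         # globally unique identifier, if any
    disposition: Optional[str] = None  # where the specimen is stored

@dataclass
class Fossil(Specimen):
    """Subclass: inherits Specimen attributes and adds fossil-specific ones."""
    biological_sex: Optional[str] = None
    anatomical_element: Optional[str] = None

# Placeholder values for illustration only.
example = Fossil(
    catalog_number="KNM-XX 1",
    disposition="Example Museum",
    anatomical_element="femur",
)
fossil_field_names = [f.name for f in fields(Fossil)]
```

`fossil_field_names` lists the inherited attributes first and the fossil-specific ones after them, which is the superclass and subclass relationship on the slide. Artifacts or geological samples would be further subclasses of `Specimen` with their own attributes.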
When you put it all together, you get a complete conceptual model. Importantly, each of the classes and attributes that's defined in this model is part of an existing scientific data standard, such as Dublin Core or Darwin Core. Origins uses the standardized data model to provide a systematic framework for managing paleontological, archeological, and geological data. A second important feature of Paleo Core and Origins is the application programming interface, or API. An API facilitates easy access to the data through different client applications such as QGIS, R, or Jupyter Notebooks. And this reflects a core design principle behind Paleo Core: maintaining the data in a shared online repository that can be accessed through the browser or through a variety of different client applications. So, using the API, the full Origins Turkana database can be downloaded to a data frame with just two lines of code in R or Python, and the geospatial data can be read directly into QGIS from the Paleo Core servers. The API greatly facilitates statistical analysis and queries. For example, we can quickly chart the distribution of fossil specimens by year of discovery or summarize fossils by location or site.
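As a rough illustration of the kind of summary mentioned here, the sketch below counts specimens by year and by site. The talk does not give the API endpoint or response format, so this example starts from a small JSON payload with a hypothetical shape; the field names `site` and `year_collected` and all the values are made up for illustration.

```python
import json
from collections import Counter

# Stand-in for an API response; the real payload shape is not specified
# in the talk, and these records are placeholders, not real data.
payload = """
[
  {"catalog_number": "KNM-XX 1", "site": "Site A", "year_collected": 1970},
  {"catalog_number": "KNM-XX 2", "site": "Site A", "year_collected": 1970},
  {"catalog_number": "KNM-XX 3", "site": "Site B", "year_collected": 1985}
]
"""

def count_by(records, field):
    """Count records by the value of one field, skipping missing values."""
    return Counter(
        rec[field] for rec in records if rec.get(field) is not None
    )

records = json.loads(payload)
by_year = count_by(records, "year_collected")
by_site = count_by(records, "site")

# Print a simple text chart of specimens per year.
for year in sorted(by_year):
    print(year, "#" * by_year[year])
```

With a data-frame library, the same grouping would be a one-line group-and-count once the records are loaded.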
The standardized conceptual model and the API, working together, highlight the application of FAIR best practices for data management. Now, FAIR is an acronym signifying data that are findable, accessible, interoperable, and reusable. At the same time, we recognize that these data reflect the cultural heritage of Kenya and Ethiopia and should be managed and treated in an ethical manner that reflects all the relevant stakeholders. So CARE refers to principles of ethical data sovereignty that offer collective benefit, authority to control, responsibility, and ethical treatment between the scientific community and the local institutions that are curating their cultural heritage.
Looking forward, Origins can have a profound impact on how we conduct paleoanthropological research. First, by using ontologies to link fossil specimen data to a wider array of related information. For example, we've started linking the anatomical elements represented by fossils in Origins to the UBERON ontology. This is work I'm currently doing with my graduate student, Jiver Johnson. UBERON is a cross-species anatomy ontology containing a rich suite of information about vertebrate anatomy, including skeletal elements, muscle origins and insertions, developmental groupings, et cetera. So, for example, by indicating that the fossil KNM-WT 15000-G, the left femur from the Turkana Boy, is an instance of UBERON 981, we'd know that KNM-WT 15000-G is a femur, which is an endochondral skeletal element. We'd know it's part of the hind limb, that it's part of the appendicular system, and that it's the origin for the quadriceps muscles or the insertion for some of the adductor muscles.
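The inference described here can be sketched with a tiny hand-written graph. This is not UBERON itself: the three relations below are simplified from the statements in the talk, the specimen identifiers are placeholders, and a real implementation would load the published ontology rather than a Python dictionary.

```python
# Simplified, hand-written relations based on the example in the talk.
RELATIONS = {
    "femur": [
        ("is_a", "endochondral skeletal element"),
        ("part_of", "hind limb"),
    ],
    "hind limb": [("part_of", "appendicular system")],
}

# Placeholder specimens, each linked to one anatomy term.
INSTANCES = {
    "specimen-1": "femur",
    "specimen-2": "femur",
    "specimen-3": "hind limb",
}

def inferred_facts(term):
    """Follow is_a and part_of links transitively from an anatomy term.

    A chain that contains any part_of link yields a part_of fact;
    a chain of only is_a links yields an is_a fact.
    """
    facts = set()
    stack = [(term, "is_a")]
    while stack:
        current, path_rel = stack.pop()
        for rel, target in RELATIONS.get(current, []):
            combined = "part_of" if "part_of" in (path_rel, rel) else "is_a"
            if (combined, target) not in facts:
                facts.add((combined, target))
                stack.append((target, combined))
    return facts

def specimens_of(term):
    """List specimens recorded as instances of the given term."""
    return sorted(s for s, t in INSTANCES.items() if t == term)

femur_facts = inferred_facts(INSTANCES["specimen-1"])
femora = specimens_of("femur")
```

Recording a single link from a specimen to the femur term is enough to derive that the specimen is part of the hind limb and of the appendicular system, without storing those facts on the specimen record.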
Not only could a human user know this, and most of us do already, but doing this kind of linking, combining facts about KNM-WT 15000, would also allow any machine learning algorithm or AI with access to Origins and the UBERON ontology to know the same information. An AI or large language model such as ChatGPT trained on the Origins and UBERON ontologies could then respond accurately to natural language questions about the human fossil record. So right now, if you ask ChatGPT how many fossil femora there are from Turkana Basin collections, you'll not get an accurate answer. ChatGPT doesn't know or have access to modern collection records, but it does know that KNM-WT 15000 includes a femur, which is interesting. Accessible online databases such as Origins present new avenues for AI-assisted search and querying of the fossil record. Now, traditional relational databases are well-suited for high-volume, high-velocity, low-variety queries. Think banking transactions or library searches, where the same basic search is being conducted repeatedly, thousands or millions of times, often in rapid succession. But scientific data, and especially paleoanthropological data, are different. We often want to ask much more complicated and unpredictable questions. AI systems present a powerful new approach if they have access to reliable data.
So to sum up, Turkana Origins is a comprehensive digital catalog of published hominin fossils. The digital infrastructure will benefit generally from establishing a system of globally unique identifiers, which will aid in linking data together. These digital resources can be iteratively improved and, if maintained, can provide a foundational scientific resource that we can all build upon. Conceptual modeling is a method for sharing and integrating data, and advancing the conceptual modeling to ontologies opens broad potential for the application of artificial intelligence and machine learning. As a final note, this is a reconstruction of Galileo's telescope, perhaps the scientific instrument that best represents all the ways that instruments allow us to expand our view and see the unseen.
Richard Leakey's legacy includes not only a wealth of new fossil data, but also a legacy of institutions to support the ongoing and accelerated accumulation of new data about human origins. In addition to this material legacy, I think it's worth considering the future role of information systems as instruments for peering into the complexities that arise out of the new data coming in fast and furious, as Louise phrased it yesterday. From this perspective, Richard Leakey's legacy includes the data and resources to peer deeper into the unseen of our past and to make sense of it for protecting our future. Thank you.
The Turkana Basin Institute is an international research institute to facilitate research and education in paleontology, archeology and geology in the Turkana Basin of Kenya.