Paleoinformatics


Norman MacLeod (Department of Palaeontology, The Natural History Museum, Cromwell Road, London, SW7 5BD, UK)

Robert Guralnick (Museum of Paleontology, 1101 Valley Life Sciences Bldg., University of California, Berkeley, CA 94720, USA)

 


ABSTRACT

Paleoinformatics is that area of paleontology concerned with the management of information, including the preservation of systematic information and expertise. Because paleontology is such an information-rich and integrative field, the management of its data has always been problematic. In the last several years, however, the difficulties associated with providing access to and ensuring the integrity of paleontological data have been exacerbated owing to (1) problems associated with the ease with which paleontologists can now collect, manipulate, and analyze information (e.g., lack of attention paid to data management, access, and integration issues); (2) the legacy of older data, often in nonelectronic formats; and (3) the decline in professional positions for paleontologists, especially trained systematists. In addition, the current crisis in systematics has reached the point where serious consideration must be given to actively preserving systematic expertise and collections for some taxonomic groups against the possibility that professional-level practitioners may not be available. Moreover, synthetic, Earth-systems-based research agendas will require access to multiple databases if they are to address questions of importance. For all these reasons, the paleontological community must take a more active role in managing the global paleontological database and making sure that our data continue to be available to qualified practitioners for the purposes of research and education.


INTRODUCTION

Providing structure and access to data forms the heart of the scientific community's concept of informatics. For many paleontologists the term 'database' refers specifically to a highly structured data-retrieval system that allows sophisticated record sorts and Boolean queries to be made. This kind of database occupies one end of the information spectrum. Any electronic format that allows information to be stored, updated, extended, and retrieved, however, can be useful to scientists and educators in general and paleontologists in particular, especially insofar as paleontology is an indivisibly multidisciplinary field. It is on this larger concept of the paleontological information spectrum that we believe paleoinformatics will have its greatest impact.

Although relatively few significant paleontological databases currently exist in electronic format, this is changing rapidly. Many museums, universities, government geological and paleontological surveys, and individual researchers are actively engaged in creating extensive collections of paleontological data in a variety of electronic formats. While some of these collections will be withheld from the public domain, many will be (at least in principle) available for public use through the World Wide Web, ftp, e-mail, and other means.

The biological community, especially molecular genetics, faced a similar situation in the early years of this decade and chose to respond by creating methods (e.g., search engines, metadatabases) to facilitate the location of and access to information held in a variety of different repositories. This was the crucible out of which bioinformatics developed. Over the next several decades it will be necessary for paleontologists to follow the biologists’ lead and create similar data location and access structures. It is of the utmost importance to note, however, that the information base of modern biological informatics consists of relatively simple types of data (e.g., gene sequences, classifications, specimen catalogues). A paleontological informatics system will need to be considerably more complex because of the wide potential user base and considerably more flexible in terms of its data-handling capabilities because of the diversity of data types used in the many subfields of paleontology. Although some might see this as a potentially insurmountable barrier to the development of such a system, others, including the authors, see it as a challenge and an opportunity to begin work on informatics systems that will fill practical, general-purpose needs. What seems clear to all, however, is that sciences that do not take the initiative to develop practical informatics access to their data (regardless of its inherent complexity) can expect decreased research opportunities, funding levels, and interest levels among scientific colleagues and the general public in the coming century. In short, we, the global paleontological community, must take up the challenge of paleoinformatics. There is no alternative.


DATABASES

A database is any object that has a structure, information that fits into that structure, and a way to query information in the structure and retrieve subsets or full sets of that information. The goal of a database is to organize information and automate some tasks (e.g., searching through text for a specific criterion or matching text) that would be difficult otherwise. Some items that are usually not considered to be databases qualify under our definition. For example, World Wide Web pages usually have a specific structure for information and are intended for information retrieval and, thus, should be considered loosely structured databases.
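The three parts of this definition (a structure, information that fits the structure, and a way to query it) can be illustrated with a minimal sketch. The table layout, the catalogue numbers, the placeholder taxon and period names, and the helper find_specimens are all hypothetical and invented for illustration; they do not describe any existing collection database.

```python
import sqlite3

# 1. Structure: a hypothetical specimen table (schema invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimens ("
    "catalog_no TEXT PRIMARY KEY, taxon TEXT, period TEXT, is_type INTEGER)"
)

# 2. Information that fits the structure: placeholder records, not real data.
conn.executemany(
    "INSERT INTO specimens VALUES (?, ?, ?, ?)",
    [
        ("EX-0001", "Taxon A", "Period X", 1),
        ("EX-0002", "Taxon A", "Period X", 0),
        ("EX-0003", "Taxon B", "Period Y", 0),
    ],
)


def find_specimens(conn, period, types_only=False):
    """3. Query: return catalogue numbers for a period, optionally types only."""
    sql = "SELECT catalog_no FROM specimens WHERE period = ?"
    if types_only:
        sql += " AND is_type = 1"
    sql += " ORDER BY catalog_no"
    return [row[0] for row in conn.execute(sql, (period,))]


# Retrieve a full subset, then a narrower one, from the same structure.
period_x = find_specimens(conn, "Period X")
period_x_types = find_specimens(conn, "Period X", types_only=True)
```

The same three parts are present, more loosely, in a set of web pages: the page layout is the structure, the text is the information, and a text search is the query.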

The reason for bringing up definitions and goals is to make clear the essential role of databases in paleoinformatic endeavors. Databases do not have to stand alone as objects but can instead be considered potentially hierarchical, related objects through which information can be pulled. Given this definition, information stored in a structured database like a museum collection can be accessed through the World Wide Web by allowing communication between the web client, the web server, and the native database language of that collection's database. Such a query must be able to go not only to that structured database but also, for example, to other web sites that have other structured databases or to an industry stratigraphic database that might hold types of information other than those stored in the museum database. These different types of data can be assembled on the fly into a new database that can then be queried again to retrieve more specific information or to ask a particular question incorporating all the initial information. This, in a nutshell, is the paleoinformatics model: multiple databases that can pass information to each other on the fly to assemble new composite databases that can then be queried and used to answer specific research questions. This model is more ambitious than, say, GenBank, which is built to handle accession and retrieval of a very simple kind of data, gene sequences. GenBank, however, has shown how on-the-fly analysis can be built into web database design. BLAST searches allow a user to enter a sequence and run a similarity check against all the sequences already in the database to determine quickly the identity of the entered sequence.
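The composite-database model described above might be sketched as follows. This is only an illustration under stated assumptions: the two source functions stand in for remote databases that would in practice be reached over the web, and every name, field, and value in them (museum_records, stratigraphic_records, build_composite, the localities, the ages) is hypothetical.

```python
import sqlite3


def museum_records():
    """Stand-in for a museum collection database (hypothetical records)."""
    return [
        {"catalog_no": "EX-0001", "taxon": "Taxon A", "locality": "Section 1"},
        {"catalog_no": "EX-0002", "taxon": "Taxon B", "locality": "Section 2"},
        {"catalog_no": "EX-0003", "taxon": "Taxon A", "locality": "Section 2"},
    ]


def stratigraphic_records():
    """Stand-in for an industry stratigraphic database (hypothetical records)."""
    return [
        {"locality": "Section 1", "age_ma": 380.0},
        {"locality": "Section 2", "age_ma": 70.0},
    ]


def build_composite(sources):
    """Assemble rows pulled from several sources into one new in-memory database.

    `sources` maps a table name to a non-empty list of dicts sharing the same keys.
    Table and column names come from this sketch itself, not from user input.
    """
    conn = sqlite3.connect(":memory:")
    for table, rows in sources.items():
        columns = list(rows[0].keys())
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(row[col] for col in columns) for row in rows],
        )
    return conn


# Pull from both sources on the fly and assemble the composite database.
composite = build_composite(
    {"museum": museum_records(), "strat": stratigraphic_records()}
)

# Query the composite: which catalogued specimens come from localities
# older than 300 million years? Neither source could answer this alone.
old_specimens = composite.execute(
    "SELECT m.catalog_no, m.taxon, s.age_ma "
    "FROM museum AS m JOIN strat AS s ON m.locality = s.locality "
    "WHERE s.age_ma > ? ORDER BY m.catalog_no",
    (300.0,),
).fetchall()
```

The design point is that the composite is itself a database in the sense defined earlier, so it can be queried again or passed on as one source for a further composite.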


COLLECTIONS

The goal of paleontological collections is to provide physical documentation for paleontological research by curating and making available a representative sample of every taxon in the history of life and allied information. This problem is global, and there is a danger that valuable information will be lost. We recognize, however, that space is limited and always will be. We recommend that funding agencies in each nation recognize that fossil collections are part of humanity’s common heritage and make appropriate provisions for their care.

Collections that are not accessible are useless. Getting information on all collections into a database would take a long time for many museums, so the first priority might be getting type collection material online. Gross inventories are a realistic goal, but it may be that nothing else will be possible. Making information available online may be a goal in itself. A realistic goal is to computerize museum collections as best we can and make this information available online.

Orphaned paleontological collections represent a serious threat and a great opportunity for museums. Unbounded acquisition of paleontological materials is not a realistic goal for these institutions or for the paleontological community. Therefore, criteria for evaluating the value of present and future paleontological collections must be developed. Overflow materials should be placed in educational institutions where possible, especially in the developing world; but one must realize that some paleontological collections will need to be deaccessioned and disposed of.


SYSTEMATICS

Systematics is and will remain the fundamental data of paleontology. The status of systematics in paleontology has changed, however. Academic paleontologists - whose careers are largely dependent on their ability to secure research grants - have found that they have less time to spend on systematic research since such topics are no longer a priority with funding agencies. Industrial paleontologists - whose livelihoods are dependent on their ability to make accurate identifications as quickly as possible - have been allowed to develop idiosyncratic taxonomies (often in the absence of access to adequate comparative material) with the result that important information and insights are prevented from crossing the academic-industrial divide. Government paleontologists - whose traditional role is that of regulation and the assessment of resources - find themselves unable to justify needed work on paleontological systematics as governments attempt to decrease the size of geological surveys and reorder research priorities. Museum paleontologists - the traditional arbiters of paleontological systematics - find themselves increasingly isolated from their academic, industrial, and governmental colleagues while at the same time feeling sustained pressure from their administrators to increase the relevance of their research, often with an emphasis on economic return.

Because of these realities, paleoinformatics is needed to provide the continuity of access to and development of systematic information on which all paleontologists depend. Some types of paleontology have always been practiced by more-or-less isolated individuals who come together infrequently to exchange views and listen to debates on current issues. In the very near future most paleontological systematists may find themselves in this situation, one in which a downward spiral in skills, motivation, and appreciation represents a real danger. To maintain the essential connection to collections and, even more importantly, to fellow paleontological systematists, some mechanism for preserving the systematic enterprise must be developed.

Paleoinformatics can make a substantial contribution to paleontological systematics by providing any paleontologist who has access to computers and the internet with the ability to access state-of-the-art distributed databases to seek answers to their own systematic and taxonomic questions (as well as those of their students) and, in so doing, facilitate both research and educational objectives. Moreover, the existence of such a system will stimulate systematic research, especially from individuals who have access to fossils in remote localities but lack the requisite access to libraries, collections, and experts necessary to complete their systematic research. It is our considered opinion that a truly global systematics reference system that can be accessed and evaluated by most practicing paleontological systematists would result in a renaissance of systematic research in which tangible progress on a number of presently more-or-less static fronts could be achieved. We have already seen the sort of effects that traditional databases can have on paleontological research (e.g., such large taxonomic monographs as the Treatise on Invertebrate Paleontology or The Fossil Record 2). Electronic versions of these and other, presently smaller, databases will have an even larger impact.


ELECTRONIC PUBLICATION

The most effective way in which systematic paleontological data can be prepared for an electronic database—a prerequisite for creating a paleoinformatics system—is to encourage rapid migration to electronic publication (e.g., Palaeontologia Electronica) and the submission of the resulting data to appropriate electronic data archives (e.g., PaleoBank, the Plant Fossil Record). As has been discussed elsewhere (see MacLeod et al., this volume) computer technology has developed to the point where electronic paleontological journals and comprehensive, public domain and commercial paleontological databases can be launched. Support for these approaches to the collection and dissemination of paleontological data will require some adjustment of traditional practices and attitudes, but the potential advantages—as well as the economics—far outweigh the disadvantages.

Older literature presents a special problem. Although older literature is extremely important to systematic research, its post hoc conversion to electronic formats is widely regarded as being more time and labor intensive than the production of new systematic data, much of which is developed in electronic formats. Nevertheless, the conversion of both structured files (e.g., back issues of journals) and relatively unstructured files (e.g., index card files and museum register books) into electronic formats is a very active area of research in computer science. Collaborations between computer scientists and paleontological systematists can be very fruitful but would be most readily facilitated within the structure of an overall paleoinformatics initiative.


SUMMARY

As in all science, paleontology is about information—its acquisition, its collection, and its interpretation. In addition, as in all science, paleontology is about people; people and their relationship to the information provided by fossils. Traditionally, the interface between paleontological data and paleontologists took place in a relatively small number of locations (e.g., museums, universities, and industrial and government laboratories), augmented, of course, by books, journals, and other print-based media. In the next century, regardless of the personal feelings of individual paleontologists, electronic media and the dissemination of nonstructured, distributed data will supplant the current modes of communication. If paleontology is to remain at the forefront of relevance to evolutionary biology, geology, oceanography, climatology, systematics, and any of the other fields in which it now plays a crucial role, paleontology must develop improved ways to distribute, access, assemble, synthesize, and (perhaps most importantly) maintain the quality of its data in the coming world of electronic communications. Development of a practical paleontological informatics system will be neither easy nor inexpensive. Indeed, it may severely tax the resources of the entire paleontological community for some time to come. Nevertheless, it must be organized, and it must be prioritized. In short, it must be done. Failure to do so will consign paleontology to a marginal role in 21st century science. Alternatively, support of informatics by all those interested in the future of our science may pave the way for a renaissance in paleontology and place it at the center of the emerging Earth-systems research movement.


