The PaleoNet Forum

The PaleoNet Forum: An Irregular Electronic Journal
July, 1996: Volume 2, Issue 4

Interactive Manipulation of Enigmatic Palaeontological Data

Dilshat Hewzulla and Michael Boulter

Palaeobiology Research Unit, University of East London, Romford Road, London E15 4LZ, boulter@uel.ac.uk

Abstract

This article demonstrates how a complex data set can be used critically and selectively from a remote terminal over the internet. It does this by taking a set of extinct plant records of variable quality and allowing different reliability factors to be applied remotely while these data are summarized. We introduce a new analytical tool: a program that lets you use a fuzzy system to manipulate database records. The program is particularly relevant to electronic publishing. The background and theory of these techniques are discussed. Finally, the methods are demonstrated by a program that allows you to manipulate data from Benton's (1993) "The Fossil Record 2".

Why Interactive Manipulation?

The International Organisation of Palaeobotany's Plant Fossil Record database (www.uel.ac.uk/palaeo/) was one of the first large collections of palaeontological data on the internet and is becoming a well-known resource from which to monitor plant evolution and migration. As well as storing taxonomic and bibliographic details of plant fossil occurrences that have been published in the scientific literature, the database contains stratigraphical and geographical information and will soon offer interactive facilities of the kind outlined here. When we first started to analyze these data in a systematic way it became clear that they contain many idiosyncratic features (see Box 1). These make it very difficult to provide and interpret summaries of data contained within the database.

___________________________________________________________

Box 1.

Fuzzy features (such as time, identification, occurrence) of data in the Plant Fossil Record database.

There are different variables characterising each global region, such as:

- area of outcrop of a particular Stage

- number of localities and exposures or cores

- number of palaeontologists employed through different generations

- the taxonomic approach of each school

- reliability of stratigraphic and taxonomic identification

- economic importance of a particular region.

The data from different sources always have different structures and different standards.

___________________________________________________________

The size of the database will continually increase as it receives information from internet users as well as from other sources (Lhotak & Boulter, 1995). When a large database that receives information from users all over the world becomes accessible on the internet, problems relating to the quality of data increase significantly. The process of data cleaning is discussed by Lhotak & Boulter (1995). The database should assign different reliability factors to incoming information according to the sources and characteristics of the data. Lhotak & Boulter (1995) suggest that only data from two types of sources should be used in a publicly-available database: a scientific journal, or the specimen label or catalogue of a curating museum. Decisions on validity and/or nomenclature, of course, remain within the control of the nomenclature commissions.

Some sources inevitably contain incomplete information (Benton 1993) because of the incompleteness of palaeontological occurrences. When processing data in such a database, many complex decisions must be made. Fortunately, inaccurate or wrong information can often be detected on a graphical display or printout. Thus, users may be able to estimate the reliability of various data records from a graphical display. It is also possible to compare database record summaries to other data using various mathematical techniques. For this it is best if the mathematical manipulation of the data takes place on the user's machine, so that he or she can interactively change the reliability factors according to the graphical representation of the data, the results of the computation or personal preference.

Reliability and Fuzzy Features of Data

Imagine a lot of people with different levels of expertise or authority expressing their ideas about a topic at a scientific meeting. Some of them have a stronger influence on the final decision than others, because they are more experienced in this area or more skilled at debate. But through influence and negotiation the ideas of others also have some influence on the final decision. This kind of debate is very common within the analysis of palaeontological data. As we move from the occurrence level to higher classification levels, we encounter more incomplete data, and more uncertainties. There is also much controversy at all levels. Even with the use of quantitative methods and techniques there can still be major disagreements over the best way to classify (or otherwise represent) some fossils within a database (e.g., Boulter, Spicer & Thomas, 1988). The reliability feature presented in our database analyzer enables the user to select different reliability values and thereby test which of the numbers that lie across a boundary are correct. To make use of data of variable quality, our methods allow a debate to take place between the user and the database. The specific approach we employ is discussed in the last section of this paper.

Because of the vagueness and ambiguity in the palaeontological record, the data exhibit "fuzziness" as well as unreliability. For example, the geological timescale is composed of intervals with fuzzy boundaries. The numbers of occurrences recorded in the database are also fuzzy numbers. If an occurrence number (the number of taxa recorded in each one-million-year interval) is recorded as 200, this implies that it may also be 201 or 199, because there may be some vagueness in the classification of some fossils. Klir & Folger (1988a) have described a database that can accommodate such imprecise (but common) information types. Such a database can store and manipulate not only precise facts but also subjective opinions, judgments, and values that can be specified in linguistic terms. Because the database can receive and manipulate information about the reliability and fuzzy features of the data records, the number of complex decisions that must be made while adding records to the database is greatly reduced. Because users can change the reliability and fuzzy features of the data by using an application program sent to their machine with the data, they can perform many experiments. In our implementation of a fuzzy database analyzer the reliability feature is separated from the fuzzy feature. In this way the meanings of the parameters relating to reliable and fuzzy data attributes are more easily understood.

Data Models

Buckles and Petry (1983) present a model for a fuzzy relational database that contains, as a special case, the classical crisp model of a relational database. The fuzzy relational data model they propose differs from the crisp model in two ways: first, elements of the tuples (also known as rows) in a table may be subsets of the domain universal set and, second, a similarity is defined on each domain universal set (Klir & Folger 1988a). In the fuzzy data model, the element of each record may be described by a possibility-set and a certainty-set. The possibility-set describes all possible values of the element. The certainty-set reflects the grades of certainty of the elements in the possibility-set in the domain implied by the element of the record.

In some cases, it is convenient to store only the parameters of mathematical functions that describe the possibility-set and the certainty-set. Alternatively one may have to store all the elements in the possibility-set and certainty-set. For example, the time range of the Jurassic may be described by a mathematical function (see the graphic representation in Fig.1). In this situation we would only have to store the parameters of that function to estimate the duration of the Jurassic. We could then represent other stratigraphic ranges by changing the parameters of this mathematical function. In other words, all the stratigraphic assignments could be represented by the same fuzzy implementation by assigning these parameters different values. In the same way, we can model the fuzzy features of an occurrence number (see Fig.2).

Fig.1. Graphic representation of the mathematical function which models the information about the stratigraphic range of a fossil record that is recorded as Jurassic. x represents time; y represents the certainty of the corresponding time values. w and m are the parameters of the mathematical function.

Fig.2. Graphic representation of the mathematical function which models the vagueness about the value of an occurrence (the number of taxa recorded corresponding to a stratigraphic range), generated either by the incompleteness of the fossil record or by the different possible classifications. x represents occurrences; y represents the certainty of the corresponding occurrence. m is the parameter of the mathematical function.
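The two functions shown in Figs 1 and 2 can be sketched in Java roughly as follows. The text does not give the formulae, so the shapes used here are assumptions: the stratigraphic range is modelled as a trapezoid in which m is taken to be the midpoint of the range and w the distance from m at which certainty falls to zero, and the occurrence number as a triangle in which m is taken to be the spread around the recorded value. The class name MembershipSketch, the 0-100 certainty scale and the numeric values are illustrative only.

```java
/** Illustrative membership functions; shapes and the 0-100 scale are assumptions. */
final class MembershipSketch {

    /**
     * Certainty (0-100) that time x lies in a range with assumed midpoint m.
     * Certainty is 100 within w/2 of m and falls linearly to 0 at distance w.
     */
    static int rangeCertainty(double x, double m, double w) {
        double d = Math.abs(x - m);
        if (d <= w / 2) return 100;
        if (d >= w) return 0;
        return (int) Math.round(100 * (w - d) / (w / 2));
    }

    /**
     * Certainty (0-100) that the true occurrence number is x when `recorded`
     * was stored; falls linearly to 0 at a distance of m (assumed spread, m > 0).
     */
    static int occurrenceCertainty(int x, int recorded, int m) {
        int d = Math.abs(x - recorded);
        return Math.max(0, 100 - (100 * d) / m);
    }

    public static void main(String[] args) {
        // Hypothetical parameters, not authoritative dates for the Jurassic.
        System.out.println(rangeCertainty(175.0, 175.0, 30.0));  // at the midpoint
        System.out.println(rangeCertainty(197.5, 175.0, 30.0));  // on the fuzzy boundary
        System.out.println(occurrenceCertainty(199, 200, 2));    // one below the recorded count
    }
}
```

Because only the parameters are stored, a different stratigraphic range is obtained simply by calling rangeCertainty with different values of m and w.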

In short, the implementation of fuzzy features of database records may vary, and there is no fuzzy feature design model that can accommodate all situations. Because the design of a fuzzy feature structure reflects one's subjective idea about the data, such designs may differ between, and even within, databases. As new data records are added to the database, both the structure and the fuzzy semantics of records may differ from those of records already stored in the database, even though these records represent the same kind of information. For example, the referenced authority for the stratigraphy of a particular unit might vary from one author to the next. The stratigraphic problems generated from the different opinions of different authors have been widely described (e.g., Boulter, Spicer & Thomas, 1988). In most present databases some of this information has to be left out or added subjectively when we add new records to the database because of the difference between the actual (fuzzy) structure of the data and that accepted by the database.

In these situations, it is more reasonable to dynamically incorporate the fuzzy attributes of the data and use these to improve the overall search and/or summary results. In our implementation of a fuzzy analyzer we use object-oriented techniques to achieve this goal. Applications of object-oriented technology to database systems have been widely discussed (e.g., Bertino, Negri, Pelagatti & Sbattella, 1994; Nicol, Wilkes & Manola, 1993). Using an object-oriented paradigm, data are encapsulated by an object, which has a set of instance attributes and methods. The interface of an object is separated from its implementation and provides the means whereby data objects can interact with other objects. This allows objects to use the services provided by other objects without knowing how the services are implemented. Thus, an object's implementation may change without affecting other objects or applications using the services provided by that object. The fuzzy implementation of objects may vary, but they maintain a common interface in order to be able to communicate with each other and to be manipulated by various applications. If an object has such an interface, it is called a fuzzy object, and can be manipulated by fuzzy procedures.

The interface of fuzzy objects is briefly described in the Java language as:

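public interface FuzzyObject {

    // Reliability of the data represented by this object.
    public void setReliability(int reliability);
    public int getReliability();

    // The fuzzy implementation can be replaced, or shared, at run time.
    public void setFuzzyImplementation(FuzzyImplementation fuzzyimpl);
    public FuzzyImplementation getFuzzyImplementation();

    // These three messages are delivered to the fuzzy implementation object.
    public Object getElementFromSet(int index);
    public int getCertaintyFromSet(int index);
    public void castToPossibility(PossibilitySet possibilityset);

    // .....
}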

In our example, reliability in the fuzzy object is expressed by an integer that lies between two extreme values, one of which indicates that the data represented by the object are false, and the other of which indicates that the data are true. Consequently we can control the influence of the data on the various fuzzy computations.
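One way to let reliability control the influence of a record is to scale each certainty that the record contributes. The sketch below assumes a 0-100 scale for both values (0 meaning the data are treated as false, 100 as true); the text does not state the actual extreme values or the weighting rule, and the name ReliabilitySketch is hypothetical.

```java
/** Minimal sketch: scaling a certainty by the reliability of its source. */
final class ReliabilitySketch {
    static final int MIN_RELIABILITY = 0;    // assumed extreme: data regarded as false
    static final int MAX_RELIABILITY = 100;  // assumed extreme: data regarded as true

    /** Returns the certainty after weighting; both arguments must lie in 0-100. */
    static int weighted(int certainty, int reliability) {
        if (certainty < MIN_RELIABILITY || certainty > MAX_RELIABILITY
                || reliability < MIN_RELIABILITY || reliability > MAX_RELIABILITY) {
            throw new IllegalArgumentException("values must lie between 0 and 100");
        }
        return certainty * reliability / MAX_RELIABILITY;
    }

    public static void main(String[] args) {
        System.out.println(weighted(80, 100)); // fully trusted record: prints 80
        System.out.println(weighted(80, 50));  // half-trusted record: prints 40
        System.out.println(weighted(80, 0));   // rejected record: prints 0
    }
}
```

With a rule of this kind, setting the reliability of a record to the lower extreme removes its influence from a computation without deleting the record itself.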

A fuzzy object's fuzzy implementation can be changed by sending it the message "setFuzzyImplementation", and other objects can obtain a fuzzy object's implementation by sending it the message "getFuzzyImplementation". Consequently the fuzzy features of objects can be changed dynamically by the user in an interactive environment. Also, a fuzzy implementation object can be shared by several other fuzzy objects. The last three messages listed above will be delivered to the fuzzy implementation object. All fuzzy implementation objects have a common interface:

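public interface FuzzyImplementation {

    // Element at the given position in the possibility-set.
    public Object getElementFromSet(int index);

    // Certainty of the element at the given position.
    public int getCertaintyFromSet(int index);

    // Calculates the certainty of each element of the given possibility-set.
    public void castToPossibility(PossibilitySet possibilityset);

    // ......
}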

A fuzzy implementation object represents one kind of implementation of a fuzzy feature, the parameters of which are stored in the corresponding fuzzy object. Any element in the possibility-set and the certainty-set of such an object can be retrieved by sending the messages "getElementFromSet" and "getCertaintyFromSet". The "castToPossibility" message is very important in various calculations. When a possibility-set object is sent to a fuzzy object as the parameter of the message "castToPossibility", the certainty of each element in the possibility-set is calculated, and the result is stored in the possibility-set object as a certainty-set. An object representing the fuzzy data can be mapped to any subset of the domain universal set by sending the "castToPossibility" message. In other words, the application can send any possibility-set to any fuzzy object without knowing whether the object contains the related information. If the fuzzy object does not contain the information represented by the possibility-set, the state of the possibility-set object will not be changed.
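The behaviour of "castToPossibility" can be illustrated with a minimal sketch. Everything beyond the message name is an assumption made for the example: the internals of PossibilitySet, the TimeRange implementation with its margin parameter, the 0-100 certainty scale, and the reduction of the interface to a single method.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of casting a fuzzy object onto a possibility-set; all names are hypothetical. */
final class CastSketch {

    /** Candidate answers, each with the certainty accumulated so far. */
    static final class PossibilitySet {
        private final Map<Object, Integer> certainty = new LinkedHashMap<>();

        PossibilitySet(Object... candidates) {
            for (Object c : candidates) certainty.put(c, 0);
        }

        /** Snapshot of the candidates, safe to iterate while certainties change. */
        List<Object> elements() { return new ArrayList<>(certainty.keySet()); }

        int certaintyOf(Object element) { return certainty.getOrDefault(element, 0); }

        void addCertainty(Object element, int amount) {
            if (certainty.containsKey(element)) certainty.merge(element, amount, Integer::sum);
        }
    }

    /** Reduced form of the interface in the text: only castToPossibility is shown. */
    interface FuzzyImplementation {
        void castToPossibility(PossibilitySet possibilitySet);
    }

    /** Assumed implementation: a time range in Ma with a fuzzy margin at each end. */
    static final class TimeRange implements FuzzyImplementation {
        private final int youngest, oldest, margin;

        TimeRange(int youngest, int oldest, int margin) {
            this.youngest = youngest;
            this.oldest = oldest;
            this.margin = margin;
        }

        @Override
        public void castToPossibility(PossibilitySet possibilitySet) {
            for (Object e : possibilitySet.elements()) {
                if (!(e instanceof Integer)) continue; // unrelated element: state unchanged
                int t = (Integer) e;
                if (t >= youngest && t <= oldest) {
                    possibilitySet.addCertainty(e, 100);   // inside the range
                } else if (t >= youngest - margin && t <= oldest + margin) {
                    possibilitySet.addCertainty(e, 50);    // inside the fuzzy boundary
                }
            }
        }
    }

    public static void main(String[] args) {
        PossibilitySet times = new PossibilitySet(150, 170, 210, 300);
        new TimeRange(145, 200, 10).castToPossibility(times);
        for (Object t : times.elements()) {
            System.out.println(t + " Ma: " + times.certaintyOf(t));
        }
    }
}
```

A possibility-set of unrelated candidates (for example, plant organ names) passes through TimeRange untouched, which is the property that lets an application send any possibility-set to any fuzzy object.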

When the states of a fuzzy object are themselves fuzzy objects, it is called a complex fuzzy object. The possibility- and certainty-sets of a complex fuzzy object can be derived by interacting with its component fuzzy objects. This implementation of complex fuzzy objects is based on the Extension Principle introduced by Zadeh (1965).
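As an illustration of the Extension Principle, the sketch below derives the certainty-set of a complex object whose value is the sum of two fuzzy occurrence numbers. The certainty of each possible sum is the maximum, over all pairs of component values that give that sum, of the smaller of the two component certainties. The choice of addition as the combining operation, the triangular shape, the 0-100 scale and the class name are assumptions made for the example, not details taken from the program.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch of the Extension Principle for the sum of two fuzzy numbers. */
final class ExtensionPrincipleSketch {

    /** Triangular fuzzy number: value -> certainty (0-100); assumed shape, spread > 0. */
    static Map<Integer, Integer> triangle(int recorded, int spread) {
        Map<Integer, Integer> set = new TreeMap<>();
        for (int x = recorded - spread + 1; x < recorded + spread; x++) {
            set.put(x, 100 - (100 * Math.abs(x - recorded)) / spread);
        }
        return set;
    }

    /**
     * Certainty of each possible sum z is the maximum, over all pairs with
     * x + y = z, of the minimum of the two component certainties.
     */
    static Map<Integer, Integer> add(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        Map<Integer, Integer> sum = new TreeMap<>();
        for (Map.Entry<Integer, Integer> x : a.entrySet()) {
            for (Map.Entry<Integer, Integer> y : b.entrySet()) {
                int z = x.getKey() + y.getKey();
                int certainty = Math.min(x.getValue(), y.getValue());
                sum.merge(z, certainty, Integer::max);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two hypothetical component counts, recorded as 200 and 10 taxa.
        // Prints {208=50, 209=50, 210=100, 211=50, 212=50}
        System.out.println(add(triangle(200, 2), triangle(10, 2)));
    }
}
```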

Manipulating Enigmatic Data

When a query is made to the database, there may be large amounts of information relating to the query available to be processed. Some of these data are imprecise, some contradict each other, and others are vague. However, the process has to arrive at the most likely result by manipulating the data. This situation is very similar to that of interpersonal communication, which consists of a vast array of different types of simultaneously communicated signals (words, voice tone, body posture, clothing, etc.), many of which conflict with each other. It is often difficult to determine the precise intention and meaning of the communication, both because of distortion from environmental noise and because of ambivalence on the part of the sender (Klir & Folger 1988b). Nevertheless, the receiver must respond appropriately in the face of this fuzzy or vague information. Yager (1980) has suggested an approach to model this process through the use of fuzzy set theory. A modified version of this approach is used here to resolve problems relating to queries in our example database.

At the beginning of this approach, the possibility-set object has to be generated, representing a set of all the possible answers. There are three alternatives for this process:

1. If the user wants to know which is the correct answer from several possibilities, the possibility-set object will be constructed with the help of the user, interactively. The advantage of this alternative is that the influence of imprecise information on the final result may be greatly reduced, and the data become very useful, even when their quality is very low.

2. It is also possible to generate the possibility-set object from the semantics of the data or query itself. For example, in a stratigraphic range query where there is a lowest number and a highest number, the range of acceptable answers can be decided a priori by the user.

3. Finally, the range of possible answers can be decided by browsing all the possibility-sets of all input objects. In this method, the possibility-set object will be completely controlled by the input data. It is an ideal solution in most cases.

All three methods may be used simultaneously in order to generate the most appropriate possibility-set object. Once generated, the possibility-set object is sent to all the fuzzy objects selected by the application program, and it accumulates the certainties of each possible answer by interacting with those fuzzy objects. If this process is the last step of the whole procedure, the final result is decided by comparing the degree of certainty of each element in the possibility-set, and the element with the highest certainty is accepted as the final result. The reliability of the answer is calculated according to its certainty and the number of fuzzy objects participating in the process. However, the result may be ambiguous if several elements in the possibility-set share the maximum certainty. If the process is not the last step of the whole procedure, the possibility-set is reconstructed to represent a subset of the original possibility-set, each element of which has a higher certainty than a user-defined value. A fuzzy object is then constructed from this possibility-set, and participates in the mathematical procedure as the input to the following process.
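The selection and pruning steps described above can be sketched as follows, taking as input the certainties that a possibility-set has already accumulated. The class and method names are hypothetical, and the reliability formula (mean certainty per participating fuzzy object) is an assumption, since the text says only that reliability depends on the certainty and on the number of participating objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the final selection and pruning steps for a possibility-set. */
final class QueryResolutionSketch {

    /** Indices of all candidates that share the highest accumulated certainty. */
    static List<Integer> best(int[] accumulated) {
        List<Integer> winners = new ArrayList<>();
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < accumulated.length; i++) {
            if (accumulated[i] > max) {
                max = accumulated[i];
                winners.clear();
            }
            if (accumulated[i] == max) winners.add(i);
        }
        return winners;
    }

    /** True when more than one candidate has the maximum certainty. */
    static boolean isAmbiguous(int[] accumulated) {
        return best(accumulated).size() > 1;
    }

    /** Indices of candidates whose certainty is strictly above the user-defined value. */
    static List<Integer> above(int[] accumulated, int threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < accumulated.length; i++) {
            if (accumulated[i] > threshold) kept.add(i);
        }
        return kept;
    }

    /** Assumed formula: mean certainty contributed per participating fuzzy object. */
    static int answerReliability(int accumulatedCertainty, int participants) {
        return participants == 0 ? 0 : accumulatedCertainty / participants;
    }

    public static void main(String[] args) {
        int[] accumulated = {120, 250, 90}; // hypothetical totals from three fuzzy objects
        System.out.println(best(accumulated));          // last step: the accepted answer
        System.out.println(above(accumulated, 100));    // intermediate step: pruned subset
        System.out.println(answerReliability(250, 3));
    }
}
```

When best returns more than one index the result is ambiguous, and an application might then ask the user to adjust the reliability factors and repeat the query.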

The application program Fuzzy Analyzer for Enigmatic Data is written according to the ideas presented above. It receives uncertain or unreliable data, calculates the variations acceptable within a Fuzzy System, and plots the curve representing the most probable estimate given the assigned possibility-set and certainty-set constraints. You can manipulate the data according to the instructions given on the homepage. Another of the demonstrations included allows you to manipulate all the data in "The Fossil Record 2" (Benton 1993).

References

Bertino, E., Negri, M., Pelagatti, G. & Sbattella, L. (1994) Applications of object-oriented technology to the integration of heterogeneous database systems. Distributed and Parallel Databases, 2, 343-370.

Benton, M.J. (Ed.) (1993) The Fossil Record 2. Chapman & Hall, London. 845pp.

Boulter, M.C., Spicer, R.A. & Thomas, B.A. (1988) Patterns of plant extinction from some palaeobotanical evidence. In: G.R. Larwood (Ed.) Extinction and Survival in the Fossil Record. Oxford University Press, Oxford, 1-36.

Buckles, B.P. & Petry, F.E. (1983) Information-theoretical characterization of fuzzy relational databases. IEEE Trans. on Systems, Man, and Cybernetics, 13, 74-77.

Klir, G.J. & Folger, T.A. (1988a) Computer science. Fuzzy Sets, Uncertainty, and Information, 6, 260-265.

Klir, G.J. & Folger, T.A. (1988b) Interpersonal communication. Fuzzy Sets, Uncertainty, and Information, 6, 234-239.

Lhotak, M. & Boulter, M.C. (1995) Towards the creation of an international database of palaeontology. In: J.R.A. Giles (Ed.) Geological Data Management. Geol. Soc. Sp. Pub. 97, 55-64.

Nicol, J.R., Wilkes, C.T. & Manola, F.A. (1993) Object-orientation in heterogeneous distributed computing systems. Computer, 26, 57-67.

Yager, R.R. (1980) On modelling interpersonal communication. In: P.P. Wang & S.K. Chang (Eds.) Fuzzy Sets: theory and applications to policy analysis and information systems. Plenum Press, New York, 309-320.

Zadeh, L. A. (1965) Fuzzy Sets. Inf. Control, 8, 338-353.