Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University
Information Retrieval, Extraction, and Fusion from Heterogeneous, Distributed Data Sources
This research seeks to develop, implement, and evaluate algorithmic and systems solutions for selective, proactive, reactive, and customizable information retrieval, extraction, transformation, and fusion from heterogeneous, distributed, dynamic, semi-structured, and unstructured data and knowledge sources (including traditional databases, text repositories, image collections, sensors, and simulations).
Our approach to the design of information retrieval agents builds on our recent designs of customizable mobile agents for selective retrieval of journal articles, news articles, and similar documents. These agents acquire knowledge of user preferences using machine learning techniques.
The ability to selectively retrieve data is only a necessary first step. Effective use of data from heterogeneous, distributed data sources requires interoperability between the various data sources and clients. Integrating data that is distributed over multiple relatively autonomous databases, or data that is heterogeneous in form and/or content, poses a much more challenging task. Approaches to processing heterogeneous data sources can be broadly classified into two categories: multidatabase systems and mediator-based systems. The former use traditional database techniques, while the latter rely on a set of rules to locate data and bridge semantic mismatches.
The multidatabase systems approach to database interoperability uses object-oriented views to provide integrated access to heterogeneous, distributed databases. The object-oriented view mechanism takes advantage of the underlying object structure to incorporate the rich semantics of the common data types. In our approach, an object-oriented view consists of three parts: a data description section, a data derivation section, and a methods section. For example, a view for extracting a bag-of-words representation from a document will have a data description section that defines the bag-of-words representation and a derivation section that contains the code necessary to parse the text and map it into a bag of words.
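The three-part view structure described above can be illustrated with a minimal Python sketch of the bag-of-words example. This is not the project's implementation: the names BagOfWords, BagOfWordsView, derive, and term_frequency are hypothetical, and the tokenization rule (lowercase alphanumeric runs) is an assumption made only for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BagOfWords:
    """Data description: the representation the view exposes (hypothetical)."""
    counts: Counter = field(default_factory=Counter)


class BagOfWordsView:
    """Hypothetical object-oriented view over a single text document."""

    def __init__(self, text: str):
        self._text = text

    def derive(self) -> BagOfWords:
        """Data derivation: parse the text and map it into a bag of words.

        Assumption: a word is a run of lowercase letters or digits.
        """
        tokens = re.findall(r"[a-z0-9]+", self._text.lower())
        return BagOfWords(Counter(tokens))

    def term_frequency(self, word: str) -> int:
        """Methods section: an operation offered on the derived data."""
        return self.derive().counts[word.lower()]


# Usage: derive the representation, then query it through a method.
view = BagOfWordsView("Agents retrieve data; agents fuse data.")
bag = view.derive()
```

Keeping the description, derivation, and methods separate means that a different parser could be swapped into derive without changing what clients see.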
Our approach borrows from both the multidatabase and the mediator-based approaches to design and implement an object-oriented data warehouse based on object-oriented views (developed in the context of multidatabase systems) using knowledge-based software agents (proposed as part of the mediator-based approach). The data derivation section of a view is used as part of an agent program to extract the necessary data and transform it as needed. Views can be defined over multiple data sources. Agents using such views can gather and fuse relevant information from distributed, heterogeneous data sources. In this context, we will investigate the design of views for associative pattern retrieval based on our recent work on neural architectures for pattern storage, flexible pattern matching, and information retrieval. We are also investigating the use of the XML metadata language in our design of object-oriented views and mediators for transforming the data gathered from structured, semi-structured, and unstructured data sources into forms that can be effectively processed by suitable machine learning algorithms. To keep the research focused, our initial emphasis will be on tools for dealing with data sources that are encountered in applications involving monitoring of distributed systems (e.g., system log data) and data-driven knowledge discovery in bioinformatics (e.g., molecular sequences, spectrograms, and protein structures).
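As an illustration of views defined over multiple data sources, the following Python sketch shows an agent step that runs each view's derivation and fuses the results into one common representation. It is a sketch under stated assumptions, not the project's design: the LogEvent schema, the two source formats, and the function names derive_from_csv, derive_from_records, and gather_and_fuse are all hypothetical, chosen to echo the system log example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class LogEvent:
    """Common data description shared by all views (hypothetical schema)."""
    timestamp: int
    host: str
    message: str


def derive_from_csv(lines: Iterable[str]) -> List[LogEvent]:
    """Derivation for a source that stores 'timestamp,host,message' lines."""
    events = []
    for line in lines:
        ts, host, message = line.split(",", 2)
        events.append(LogEvent(int(ts), host.strip(), message.strip()))
    return events


def derive_from_records(records: Iterable[dict]) -> List[LogEvent]:
    """Derivation for a source that yields dicts with different field names."""
    return [LogEvent(int(r["time"]), r["node"], r["text"]) for r in records]


def gather_and_fuse(views: List[Tuple[Callable, Iterable]]) -> List[LogEvent]:
    """Agent step: run each view's derivation, then merge events by time."""
    fused: List[LogEvent] = []
    for derive, source in views:
        fused.extend(derive(source))
    return sorted(fused, key=lambda event: event.timestamp)


# Usage: two differently shaped sources fused into one time-ordered list.
csv_source = ["20,alpha,disk full", "5,alpha,boot"]
record_source = [{"time": 10, "node": "beta", "text": "login"}]
fused = gather_and_fuse([
    (derive_from_csv, csv_source),
    (derive_from_records, record_source),
])
```

Each derivation bridges one source's format to the shared description, so adding a new source only requires a new derivation function.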
This research is closely integrated with the education and training of graduate and undergraduate students in Computer Science and Bioinformatics and Computational Biology at Iowa State University.
© Vasant Honavar, 1999.