This project aims to improve the detection of new and significant events or changes within a topic of interest in a multilingual stream of messages. It will develop a novel, automatic, probabilistic system that extracts entities, events, and relations and uses them to populate an ontology, or Knowledge Representation (KR), that can be used to identify significant new information. The KR is a unifying framework for representing acquired knowledge independently of its source, including sources in different languages; the system will encode both reinforcing and contradicting information by conditioning probabilities on source identity and source reliability. The extracted facts will be used to build a probabilistic KR (PKR) that represents the system's view (or knowledge) of the world using probabilistic relational models. The PKR has many uses, but we will evaluate progress in this project by building a First Story Detection system, as described below, and comparing its results with the current state of the art in discovering significant new relations and attribute values in newly received data.
By understanding incoming text at a deeper level than the simple bag-of-words models commonly used in today's text search and classification systems, the project expects to improve the accuracy of discovering new information in a specific subtopic and of answering a user's question about particular entities or events. Current systems combine a high false alarm rate with a low detection rate: many incoming messages appear to be new even though they provide no significant new information.
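To make the false alarm problem concrete, the following is a minimal sketch of a bag-of-words novelty detector of the kind the project aims to improve on. It is an illustrative baseline, not the project's system: the function names, the whitespace tokenizer, and the similarity threshold of 0.5 are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_story_flags(messages, threshold=0.5):
    """Flag a message as a 'first story' when no earlier message
    is at least `threshold` similar to it (threshold is an assumption)."""
    seen = []
    flags = []
    for text in messages:
        vec = bag_of_words(text)
        nearest = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(nearest < threshold)
        seen.append(vec)
    return flags


stream = [
    "earthquake hits the capital",
    "earthquake hits the capital",
    "strong tremor shakes the main city",
]
flags = first_story_flags(stream)
```

In this toy stream the third message is a paraphrase of the first, yet it shares only the word "the" with it, so a word-overlap detector flags it as new. A representation based on extracted entities, events, and relations is intended to avoid this kind of false alarm.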
By analyzing comparable documents, that is, documents in multiple languages that describe the same event, the project will obtain a rich set of facts that may reinforce or contradict one another. The richness of a comparable corpus requires introducing a probabilistic model into the KR to handle information fusion. The PKR will be able to condition the probability of a fact on its source and on related facts about the same entity or related entities. The PKR will learn its parameters from the comparable corpus.
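As a rough illustration of conditioning a fact's probability on its sources, the sketch below fuses reinforcing and contradicting reports about a single binary fact. It is a deliberately simplified stand-in for the probabilistic relational models the project proposes: it assumes sources report independently, that each source's reliability is the probability that its report is correct, and that reliabilities are given by hand rather than learned from the comparable corpus. The function name and the example reliabilities are hypothetical.

```python
import math


def fuse_reports(reports, prior=0.5):
    """Posterior probability that a fact is true.

    reports: iterable of (asserts_fact, reliability) pairs, where
        asserts_fact is True if the source supports the fact and False
        if it contradicts it, and reliability is in the open interval (0, 1).
    prior: assumed prior probability of the fact before any report.

    Assumes sources are conditionally independent (a naive Bayes
    simplification, not the project's actual model).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for asserts_fact, reliability in reports:
        weight = math.log(reliability / (1.0 - reliability))
        log_odds += weight if asserts_fact else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two equally reliable sources that agree reinforce the fact.
agree = fuse_reports([(True, 0.8), (True, 0.8)])

# Two equally reliable sources that disagree cancel out.
conflict = fuse_reports([(True, 0.8), (False, 0.8)])

# A reliable source outweighs a weaker contradicting one.
mixed = fuse_reports([(True, 0.9), (False, 0.6)])
```

Under these assumptions, agreement raises the posterior above either source's individual reliability, an even disagreement returns the prior, and an uneven disagreement favors the more reliable source. Conditioning on related facts about the same or related entities, as the PKR is meant to do, would require a richer model than this.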