search engines (AltaVista, Yahoo!, etc.) are built upon IR technology.
Similarly, though the technology is much newer, it is likely that many people
will soon be using automated summarizers to condense (or at least, to extract
the major contents of) single long documents or large sets of documents of any
length. [...]
In this context, multilingualism on the Web is another complicating factor.
People will write in their own language for several reasons -- convenience,
secrecy, and local applicability -- but that does not mean that other people are
not interested in reading what they have to say! This is especially true for
companies involved in technology watch (say, a computer company that wants to
see, daily, all the Japanese newspaper and other articles that pertain to what
it makes) or some government intelligence agencies (the people who provide the
most up-to-date information for use by your government officials in making
policy, etc.). One of the main problems faced by these kinds of people is the
flood of information, so they tend to hire 'weak' bilinguals who can rapidly
scan incoming text and throw out what is not relevant, passing the relevant
material to professional translators. Obviously, a combination of SUM and MT
(machine translation) will help here: since MT is slow, it helps if you can do
SUM in the foreign language first, and then just do a quick and dirty MT on the
much shorter result, allowing either a human or an automated IR-based text
classifier to decide whether to keep or reject the article.
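
The summarize-first, translate-second idea can be sketched in a few lines of
Python. This is only a toy illustration under loud assumptions, not the method
of any real system: the summarizer is a simple word-frequency sentence
extractor, the 'MT' is a word-for-word lookup in a hypothetical LEXICON, and
the classifier just checks for hypothetical WATCH_TERMS. It also assumes the
foreign language separates words with spaces, which is not true of Japanese.

```python
import re
from collections import Counter

# Hypothetical toy resources, invented for this illustration only.
LEXICON = {"el": "the", "ordenador": "computer", "nuevo": "new",
           "es": "is", "barato": "cheap", "usa": "uses", "un": "a"}
WATCH_TERMS = {"computer", "chip"}
ARTICLE = "El ordenador nuevo es barato. El ordenador usa un chip nuevo. Hoy llueve."


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=1):
    """Extractive SUM: keep the sentences with the highest average word frequency."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)


def quick_mt(text, lexicon):
    """Quick and dirty 'MT': word-for-word lookup; unknown words pass through."""
    return " ".join(lexicon.get(w, w) for w in re.findall(r"\w+", text.lower()))


def is_relevant(english_text, watch_terms):
    """Stand-in for an IR-based classifier: any watch term present?"""
    return bool(set(english_text.split()) & watch_terms)


def triage(article, lexicon, watch_terms):
    """SUM in the foreign language, then MT on the short result, then filter."""
    summary = summarize(article)        # runs on the full foreign text (cheap)
    gloss = quick_mt(summary, lexicon)  # MT runs only on the summary (expensive step)
    return is_relevant(gloss, watch_terms), gloss


print(triage(ARTICLE, LEXICON, WATCH_TERMS))
```

The point of the ordering is that the expensive step (MT) only ever sees the
one-sentence summary, never the full article. The rough gloss it produces is
poor English, but it is good enough for a keep-or-reject decision.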
For these kinds of reasons, the US Government has over the past five years been
funding research in MT, SUM, and IR, and is interested in starting a new program
of research in Multilingual IR. The goal is that you will one day be able to
open Netscape or Explorer or the like, type in your query in (say) English, and
have the engine return texts in *all* the languages of the world. You will have
them clustered by subarea, summarized by cluster, and the foreign summaries
translated: all the kinds of things that you would like to have.
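
To make the retrieval step a little more concrete, here is a minimal sketch of
one possible approach: translate the English query word into each document
language, then match it against the texts in that language. Query translation
is an assumption made for this illustration, not necessarily how any
particular system works, and QUERY_LEXICON and DOCS are hypothetical toy data.
Clustering, summarization, and translation of the results are left out.

```python
from collections import defaultdict

# Hypothetical toy data, invented for this illustration only.
QUERY_LEXICON = {
    "baby": {"en": "baby", "es": "bebé", "id": "bayi"},
}
DOCS = [
    {"id": 1, "lang": "en", "text": "new baby food safety rules"},
    {"id": 2, "lang": "es", "text": "el bebé duerme mejor con música"},
    {"id": 3, "lang": "id", "text": "harga susu bayi naik"},
    {"id": 4, "lang": "es", "text": "el mercado sube hoy"},
]


def search_all_languages(query, docs, lexicon):
    """Return matching document ids, grouped by document language."""
    translations = lexicon.get(query.lower(), {})
    hits = defaultdict(list)
    for doc in docs:
        # Look up the query term in this document's language, if we have one.
        term = translations.get(doc["lang"])
        if term and term in doc["text"].lower().split():
            hits[doc["lang"]].append(doc["id"])
    return dict(hits)


print(search_all_languages("baby", DOCS, QUERY_LEXICON))
```

A query word with no entry in the lexicon simply returns nothing, which shows
where such a scheme is weakest: coverage depends entirely on the quality of
the translation resource.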
You can see a demo of our version of this capability, using English as the user
language and a collection of approximately 5,000 texts in English, Japanese,
Arabic, Spanish, and Indonesian, by visiting the MuST Multilingual Information
Retrieval, Summarization, and Translation System.
Type in your query word (say, 'baby', or whatever you wish) and press
'Enter/Return'. In the middle window you will see the headlines (or just
keyword