Summarizer Gets the Idea
The flow of a document, including the topics it covers and the ways those topics relate to each other, is clear to people. It would be useful if computer systems that process documents could also learn to take topic information into account.
Teaching a computer to discern a document's topics and create a summary that puts the topics in the correct order is a bit like teaching it how to put together the pieces of a jigsaw puzzle. Current methods focus on finding the right match for a given piece.
MIT and Cornell University researchers have developed a system that does the equivalent of putting pieces that show parts of a mountain and pieces that show parts of the sky into separate groups, and putting the sky pieces above the mountain pieces.
After training on subject-specific sets of documents and document summaries, the researchers' automatic classification algorithm, or content model, can extract the topic structure of a group of related documents. It selects and orders topics to generate a summary.
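The select-and-order idea can be illustrated with a toy sketch. This is not the researchers' content model. It assumes every sentence already carries a topic label, and it learns only two things from training pairs: how often a topic survives into the summary, and where the topic tends to appear in a document. The function names, the topic labels, and the 0.5 threshold are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Learn selection and ordering statistics.

    examples: list of (doc_topics, summary_topics) pairs, where each
    element is a list of topic labels (labels are assumed to be given).
    """
    seen = Counter()                   # documents containing each topic
    kept = Counter()                   # documents whose summary keeps the topic
    position_sum = defaultdict(float)  # summed relative positions per topic
    position_n = Counter()             # number of positions summed per topic

    for doc_topics, summary_topics in examples:
        for topic in set(doc_topics):
            seen[topic] += 1
            if topic in summary_topics:
                kept[topic] += 1
        for i, topic in enumerate(doc_topics):
            # Relative position in [0, 1]; a one-sentence document maps to 0.
            position_sum[topic] += i / max(len(doc_topics) - 1, 1)
            position_n[topic] += 1

    keep_prob = {t: kept[t] / seen[t] for t in seen}
    avg_pos = {t: position_sum[t] / position_n[t] for t in position_n}
    return keep_prob, avg_pos


def summarize(sentences, keep_prob, avg_pos, threshold=0.5):
    """Pick one sentence per likely-summary topic, in the learned topic order.

    sentences: list of (topic, text) pairs for a new document.
    """
    chosen = {}
    for topic, text in sentences:
        if keep_prob.get(topic, 0.0) >= threshold and topic not in chosen:
            chosen[topic] = text  # keep the first sentence seen for the topic
    ordered_topics = sorted(chosen, key=lambda t: avg_pos[t])
    return [chosen[t] for t in ordered_topics]


# Hypothetical movie-domain training data: document topics and summary topics.
examples = [
    (["plot", "cast", "trivia", "reception"], ["plot", "reception"]),
    (["plot", "trivia", "cast", "reception"], ["plot", "cast", "reception"]),
    (["cast", "plot", "reception"], ["plot", "reception"]),
]
keep_prob, avg_pos = train(examples)

new_doc = [
    ("reception", "Critics liked it."),
    ("trivia", "Shot in 30 days."),
    ("plot", "A heist goes wrong."),
]
summary = summarize(new_doc, keep_prob, avg_pos)
```

In this toy data the plot and reception topics always reach the summary and plot usually comes first, so the sketch keeps those two sentences and puts the plot sentence ahead of the reception sentence, even though the new document lists them in the opposite order. A real system would also have to discover the topics themselves from unlabeled text, which is the hard part this sketch leaves out.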
The researchers put together prototype software that can automatically create capsule summaries of, for example, movies from a movie information database.
The content model could also eventually help search engines determine the overall topic and domain of discourse of a Web page and only return on-topic pages.
The model can already be used to produce capsule summaries in restricted domains. Adapting it to provide better search engine results could take 10 years, said Cornell researcher Lillian Lee.
The researchers presented the work at the North American Chapter of the Association for Computational Linguistics Human Language Technology (HLT/NAACL) 2004 conference in Boston, Massachusetts, May 2 to 7.
