
FREQUENTLY ASKED QUESTIONS WEB PAGES AUTOMATIC TEXT SUMMARIZATION
The American University in Cairo
School of Sciences and Engineering
FREQUENTLY ASKED QUESTIONS WEB PAGES
AUTOMATIC TEXT SUMMARIZATION
A Thesis submitted to
Department of Computer Science and Engineering
In partial fulfillment of the requirements for
The degree of Master of Science
by Yassien M. Shaalan
Bachelor of Science, Computer Engineering, Cairo University
Under the Supervision of Dr. Ahmed Rafea
April 2011
The American University in Cairo
School of Sciences and Engineering
FREQUENTLY ASKED QUESTIONS WEB PAGES
AUTOMATIC TEXT SUMMARIZATION
A Thesis Submitted by
Yassien Mohamed Shaalan
To the Department of Computer Science and Engineering
April 2011
In partial fulfillment of the requirements for the degree of Master of Science
Has been approved by
Dr. . . . . . . . . . . . . . . . . . . . . . .
Thesis Committee Chair / Adviser
Affiliation
Dr. . . . . . . . . . . . . . . . . . . . . . .
Thesis Committee Chair / Adviser
Affiliation
Dr. . . . . . . . . . . . . . . . . . . . . . .
Thesis Committee Reader / Examiner
Affiliation
Dr. . . . . . . . . . . . . . . . . . . . . . .
Thesis Committee Reader / Examiner
Affiliation
Dr. . . . . . . . . . . . . . . . . . . . . . .
Thesis Committee Reader / Examiner
Affiliation
Department Chair
Date
Dean
Date
DEDICATION
This thesis is dedicated to my beloved parents and wife, without whose profound support and help it would not have been possible to accomplish anything. An honourable dedication goes to my dear uncle Dr. Mohamed Labib for his unfailing assistance, excellent advice, superior help, and for encouraging me and believing in what I am capable of doing. Finally, I would like to dedicate it to honour the memory of my father, who passed away during the course of this research.
ACKNOWLEDGMENTS
I would like to express my deepest sense of gratitude to each and every person who helped or supported me in completing my thesis. It has been a challenging and rich experience. Accordingly, it is my pleasure to take the time to convey my gratitude to them.
First, I would like to express my sincere gratitude to my supervisor Dr. Ahmed Rafea for his outstanding advice, great help, patient guidance and the support he provided through his rich professional experience.
I would like to express my sincere thanks to my thesis committee for their supportive guidance
and constructive feedback in evaluating my thesis proposal.
I would also like to acknowledge the help of my colleague Micheal Azmy, who provided me with the Web page segmentation tool with which I started my research project. Special thanks go to my dear friends Ahmed Fekry, Taysser Abo El Magd, Marwa Shaalan, Ali Sayed, Islam Shaker, Ahmed Yassien, Reda Salim, Mohamed El Farran, Mahmoud Khedr and Essam Abd’ AL, who volunteered to help me in my experiments.
I am also thankful to all my professors at The American University in Cairo (AUC) for their continuous support, guidance and advice. I would like to thank all the Computer Science and Engineering staff for their unfailing assistance.
ABSTRACT
This research is directed towards automating the summarization of frequently asked questions (FAQ) Web pages, a task that captures the most salient pieces of information in the answer to each question. There exist thousands of FAQ pages on many subjects, not only online but also offline, as most products nowadays come with a leaflet titled FAQ describing their functionality and usage. Moreover, FAQ Web pages are laid out in a consistent manner: each question has a distinctive heading style (e.g., bold, underlined, or tagged), and the answer follows in a lower style, usually a smaller font, and may be split into subheadings if the answer is logically divided. This uniformity in the creation of FAQ Web pages, together with their abundance online, makes them highly amenable to automated summarization.
To achieve this objective, we propose an approach that applies Web page segmentation to detect Q/A (question and answer) units and uses a set of selective statistical sentence extraction features for summary generation. The proposed approach is English language dependent.
These features are question-answer similarity, query overlap, sentence location within the answer paragraphs, and capitalized word frequency. The choice of these features was mainly influenced by analyzing the different question types and anticipating the form of the expected correct answer.
The first feature, sentence similarity, evaluates the semantic similarity between the question sentence and each sentence in the answer. It does so by comparing the word senses of the question and answer words, assigning each pair of words a numerical value, and accumulating these values into a score for the whole sentence.
The second feature, query overlap, extracts nouns, adverbs, adjectives and gerunds from the question sentence, automatically formulates a query from them, and counts the number of matches with each answer sentence.
The third feature, location, gives a higher score to sentences at the beginning of the answer and progressively lower scores to the following sentences.
The fourth feature, capitalized word frequency, computes the frequency of capitalized words in a sentence. We give each feature a weight and then linearly combine them in a single equation to give a cumulative score for each sentence. The features were combined by a custom weighting score function. Each feature on its own was found to perform well in some cases, depending on the kind of evidence available. A pilot experiment and its analysis helped us obtain a suitable combination of feature weights.
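As a sketch of this combination (the weights w1 to w4 below are placeholders; the values actually used were obtained from the pilot experiment reported later in the thesis), the cumulative score of an answer sentence s for a question q takes the form:
Score(s) = w1 * Similarity(q, s) + w2 * QueryOverlap(q, s) + w3 * Location(s) + w4 * Capitalization(s)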
The thesis makes three main contributions. First, the proposed feature combination is shown to be a highly preferable solution to the problem of summarizing FAQ Web pages. Second, Web page segmentation enables differentiation between the various constructs that reside in a Web page, which allows us to filter out those constructs that are not Q/A units. Third, the approach detects answers that are logically divided into a set of paragraphs under smaller headings representing the logical structure, and thus generates a conclusive summary with all such paragraphs included.
The automatically generated summaries are compared to summaries generated by a widely used commercial tool named Copernic Summarizer 2.1 1 . The comparison is performed via a formal evaluation process involving human evaluators. To verify the effectiveness of our approach we conducted four experiments. The first experiment tested the summarization quality of our system against the Copernic system in terms of which system produces more readable, concise summaries. The second experiment tested the summarization quality with respect to the questions' discipline. The third experiment measures and compares the human evaluators' performance in evaluating both summarizers. The fourth experiment analyzes the evaluators' scores and their homogeneity in evaluating our system and the Copernic system. In general, it was found that our system, FAQWEBSUMM, performs much better than the Copernic summarizer. The overall average scores for all pages indicate that it is superior to the other system by approximately 20%, which is quite promising. This superiority comes from knowing the Web page structure in advance, which helps in reaching better results than applying a general solution to all types of pages.
Keywords: Web Document Summarization, FAQ, Question Answering and Web Page
Segmentation.
1. http://www.copernic.com/en/products/summarizer/
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION ........................................................................................................ 1
1.1 OVERVIEW ........................................................................................................................ 1
1.2 BACKGROUND .................................................................................................................. 2
1.3 PROBLEM DEFINITION ...................................................................................................... 4
1.4 MOTIVATION .................................................................................................................... 5
1.5 OBJECTIVE ........................................................................................................................ 6
1.6 THESIS LAYOUT ................................................................................................................ 7
CHAPTER 2 LITERATURE SURVEY ............................................................................................... 8
2.1 INTRODUCTION ................................................................................................................. 8
2.2 TAXONOMY OF SUMMARIZATION CATEGORIES ................................................................ 9
2.3 SINGLE-DOCUMENT SUMMARIZATION ............................................................................ 10
2.3.1 MACHINE LEARNING METHODS ............................................................................... 11
2.3.2 DEEP NATURAL LANGUAGE ANALYSIS METHODS ................................................... 15
2.4 MULTI-DOCUMENT SUMMARIZATION ............................................................................. 17
2.5 OTHER TASKS IN TEXT SUMMARIZATION ....................................................................... 21
2.5.1 SHORT SUMMARIES .................................................................................................. 21
2.5.2 QUERY BASED SUMMARIZATION .............................................................................. 23
2.5.3 SENTENCE COMPRESSION BASED SUMMARIZATION ................................................. 24
2.5.4 STRUCTURE BASED SUMMARIZATION ...................................................................... 26
2.5.5 LINK BASED SUMMARIZATION ................................................................................. 29
2.5.6 EMAIL MESSAGES SUMMARIZATION ........................................................................ 29
2.5.7 CUSTOMER REVIEW SUMMARIZATION ..................................................................... 31
2.5.8 BLOG SUMMARIZATION ............................................................................................ 31
2.5.9 QUESTION ANSWERING & FAQ SUMMARIZATION ................................................... 32
2.6 SUMMARIZATION EVALUATION ...................................................................................... 34
CHAPTER 3 TEXT SUMMARIZATION PROJECTS .......................................................................... 39
3.1 EXTRACTIVE BASED PROJECTS ....................................................................................... 39
3.1.1 MACHINE LEARNING BASED PROJECTS .................................................................... 39
3.1.2 STATISTICAL BASED PROJECTS ................................................................................ 41
3.1.3 QUESTION ANSWERING BASED PROJECTS ................................................................ 44
3.2 ABSTRACTIVE BASED PROJECTS ..................................................................................... 45
3.3 HYBRID BASED PROJECTS .............................................................................................. 46
CHAPTER 4 PROPOSED APPROACH TO FAQ WEB PAGES SUMMARIZATION ................................. 48
4.1 PROPOSED METHODOLOGY OVERVIEW .......................................................................... 48
4.2 WEB PAGE SEGMENTATION ............................................................................................ 49
4.2.1 APPLYING WEB PAGE SEGMENTATION TO FAQ WEB PAGES ................................... 50
4.3 FEATURES SELECTION .................................................................................................... 52
4.3.1 SEMANTIC SIMILARITY FEATURE.............................................................................. 56
4.3.3 LOCATION FEATURE ................................................................................................. 61
4.3.4 WORD CAPITALIZATION FEATURE ............................................................................ 61
4.3.5 COMBINING FEATURES ............................................................................................. 62
4.4 PILOT EXPERIMENT “DETECTING OPTIMUM FEATURE WEIGHTS” .................................. 63
4.4.1 OBJECTIVE ................................................................................................................ 63
4.4.2 EXPERIMENT DESCRIPTION ....................................................................................... 63
4.4.3 RESULTS ................................................................................................................... 64
4.4.4 DISCUSSION .............................................................................................................. 65
CHAPTER 5 SYSTEM DESIGN AND IMPLEMENTATION ................................................................... 67
5.1 FAQWEBSUMM OVERALL SYSTEM DESIGN ............................................................... 67
5.2 PRE-PROCESSING STAGE ................................................................................................. 69
5.2.1 PRIMARY PRE-PROCESSING....................................................................................... 69
5.2.2 SECONDARY PRE-PROCESSING.................................................................................. 75
5.3 SUMMARIZATION STAGE ................................................................................................ 77
5.4 FAQWEBSUMM SYSTEM IMPLEMENTATION ISSUES ................................................... 81
CHAPTER 6 SYSTEM EVALUATION ............................................................................................... 84
6.1 WHY COPERNIC SUMMARIZER? ..................................................................................... 84
6.2 EVALUATION METHODOLOGY ....................................................................................... 85
6.3 DATASET........................................................................................................................ 88
6.4 EXPERIMENTAL RESULTS AND ANALYSIS ...................................................................... 88
6.4.1 EXPERIMENT 1 “PERFORMANCE QUALITY EVALUATION” ........................................ 89
6.4.1.1 OBJECTIVE ......................................................................................................... 89
6.4.1.2 DESCRIPTION ...................................................................................................... 89
6.4.1.3 RESULTS ............................................................................................................. 89
6.4.1.4 DISCUSSION ........................................................................................................ 90
6.4.2 EXPERIMENT 2 “PAGE DISCIPLINE PERFORMANCE ANALYSIS” ................................ 93
6.4.2.1 OBJECTIVE .......................................................................................................... 93
6.4.2.2 EXPERIMENT DESCRIPTION ................................................................................. 93
6.4.2.3 RESULTS ............................................................................................................. 94
6.4.2.4 DISCUSSION ........................................................................................................ 97
6.4.3 EXPERIMENT 3 “HUMAN EVALUATORS’ PERFORMANCE ANALYSIS” ....................... 99
6.4.3.1 OBJECTIVE .......................................................................................................... 99
6.4.3.2 EXPERIMENT DESCRIPTION ................................................................................. 99
6.4.3.3 RESULTS ............................................................................................................. 99
6.4.3.4 DISCUSSION ...................................................................................................... 100
6.4.4 EXPERIMENT 4 “ANALYZING EVALUATORS’ HOMOGENEITY” ............................... 101
6.4.4.1 OBJECTIVE ........................................................................................................ 101
6.4.4.2 EXPERIMENT DESCRIPTION ............................................................................... 101
6.4.4.3 RESULTS ........................................................................................................... 101
6.4.4.4 DISCUSSION ...................................................................................................... 102
CHAPTER 7 CONCLUSION AND FUTURE WORK ........................................................................... 103
REFERENCES .............................................................................................................................. 107
APPENDIX A PILOT EXPERIMENT DETAILED RESULTS .................................................................. 117
APPENDIX B DATASET DESCRIPTION .......................................................................................... 127
APPENDIX C DETAILED EXPERIMENTAL RESULTS ....................................................................... 129
APPENDIX D SAMPLE OUTPUT SUMMARIES................................................................................. 152
LIST OF FIGURES
Figure 4.1 An Example of Logically Divided Answer into Sub Headings……….…………...51
Figure 4.2 An Example of a Good Summary to Question in Figure 1…………….…………..51
Figure 4.3 An Example of a Bad Summary to Question in Figure 1…………….….................52
Figure 4.4 Features’ Scores Comparison…………………………………….………...............64
Figure 5.1 FAQWEBSUMM Overall System Architecture………………...…………............68
Figure 5.2 Segment Attributes Definitions. ………………………………….………………..69
Figure 5.3 Web Page Segmentation Example………………………………………………….70
Figure 5.4 FAQWEBSUMM Second Stage of Pre-processing…………….………………….72
Figure 5.5 FAQWEBSUMM Third Stage of Pre-processing……………….…………............73
Figure 5.6 XML File Structure Example after SBD and POS Stages……….………………...74
Figure 5.7 FAQWEBSUMM Fourth Stage of Pre-processing………………………...............75
Figure 5.8 FAQWEBSUMM Specialized FAQ pre-processing……………………………….76
Figure 5.9 FAQWEBSUMM Summarization Core…………………………………................77
Figure 5.10 Pseudo Code of Summarization Algorithm……………………………………….78
Figure 5.11 Pseudo Code for Calculating Similarity Feature Score….......................................79
Figure 5.12 Pseudo Code for Calculating Location Feature Score….........................................79
Figure 5.13 Pseudo Code for Calculating Query Overlap Feature Score……………………...80
Figure 5.14 Pseudo Code for Calculating Capitalized Words Feature Score………………….81
Figure 6.1 Performance Distributions………………………………………………….............90
Figure 6.2 FAQWEBSUMM Score Distribution………………………………………............91
Figure 6.3 Copernic Score Distribution………….…………………………………………….92
Figure 6.4 Score Distribution Comparison…………………………………………………….93
Figure 6.5 Page Discipline Score Comparison.....……………………………………………..96
LIST OF TABLES
Table 4.1 Yes/No Question Type Examples………….………………………………………54
Table 4.2 Question Word Question Type Examples……….…………………………………55
Table 4.3 Choice Question Type Examples………………………..…………………………55
Table 4.4 Summary Evaluation for Selection Features………………….……………............64
Table 5.1 FAQWEBSUMM System Requirements………………………….……………….83
Table 6.1 FAQ Web Page Evaluation Quality Matrix…………………………….……..........85
Table 6.2 Average Page Scores Categorized by Discipline…..……………………………….95
Table 6.3 t-Test Values for All Pages by All Evaluators…….…………………......................97
Table 6.4 Evaluators Improvement Ratio…………………………………………..................99
Table 6.5 t-Test Results over Evaluator Scores……………………………….……………...100
Table 6.6 t-Test Results Comparing Evaluators’ Scores………………………......................102
LIST OF ACRONYMS
FAQ - Frequently Asked Questions
Q/A - Question Answer Unit
FAQWEBSUMM - Frequently Asked Questions Web Pages Summarization
XML - Extensible Markup Language
HTML - Hyper Text Markup Language
CMS - Content Management System
PHP - Hypertext Preprocessor
RFC - Request for Comments
SDS - Single Document Summarization
MDS - Multi-Document Summarization
QBS - Query Based or Biased Summarization
DUC - Document Understanding Conference
AI - Artificial Intelligence
NLP - Natural Language Processing
ML - Machine Learning
HMM - Hidden Markov Model
TREC - Text Retrieval Conference
ROUGE - Recall Oriented Understudy for Gisting Evaluation
GA - Genetic Algorithms
MCBA - Modified Corpus Based Approach
LSA - Latent Semantic Analysis
TRM - Text Relationship Map
MRM - Mathematical Regression Model
FFNN - Feed Forward Neural Network
PNN - Probabilistic Neural Network
GMM - Gaussian Mixture Model
GP - Genetic Programming
UMLS - Unified Medical Language System
NE - Named Entity
TAC - Text Analysis Conference
TFIDF - Term Frequency Inverse Document Frequency
RU - Relative Utility
CBRU - Cluster Based Relative Utility
CSIS - Cross Sentence Information Subsumption
MEAD - Platform for Multi-Document Multi-Lingual Summarization
HIERSUM - Hierarchical Summarization Model
LDA - Latent Dirichlet Allocation
TOC - Table of Contents
K.U.Leuven - A System for Generating Very Short Summaries
UAM - A System for Generating Very Short Summaries
TBS - Title Based Summarization
BAYESUM - A Bayesian Query Based Summarization Project
PDA - Personal Digital Assistant
ILP - Integer Linear Programming
AUC - Area under the ROC Curve
INEX - Initiative for the Evaluation of XML Retrieval Dataset
SUMMAC - TIPSTER Summarization Dataset
CMSA - Collective Message Summarization Approach
IMSA - Individual Message Summarization Approach
CRF - Conditional Random Fields
QALC - A Question Answering System
P/R - Precision Recall Evaluation
EDU - Elementary Discourse Unit
ANOVA - Analysis of Variance
BLEU - Bilingual Evaluation Understudy
SCU - Summary Content Unit
NeuralSumm - A Neural Network Based Summarization Project
ClassSumm - A Classification Based Summarization Project
SweSum - The Swedish Text Summarization Project
TS-ISF - Inverse Sentence Frequency
GistSumm - The Gist Summarizer Project
OTS - The Open Text Summarization Project
InText - A Text Summarization Project
FociSum - A Question Answering Based Summarization Project
WebSumm - A Web Document Summarization Project
TRESTLE - Text Retrieval Extraction and Summarization for Large Enterprises
KIP - Knowledge Intensive Process
MUC - Message Understanding Conference
Summons - A Multi-Document Summarization Project
MultiGen - A Multi-Document Summarization Project
FUF/SURGE - Functional Unification Formalism / Syntactic Realization Grammar
SBD - Sentence Boundary Disambiguation
POS - Part of Speech Tagging
IDE - Integrated Development Environment
JVM - JAVA Virtual Machine
JDK - JAVA Development Kit
OS - Operating System
MFC - Microsoft Foundation Class Library
t-Test - Student's t-Test
URL - Uniform Resource Locator
Chapter 1
INTRODUCTION
1.1 Overview
New innovative technologies, such as high-speed networks, portable devices,
inexpensive massive storage, along with the remarkable growth of the Internet, have led
to an enormous increase in the amount and availability of all types of documents
anywhere and anytime. Consequently, it has been realized that added value is not gained
merely through larger quantities of data, but through easier access to the required
information at the right time and in the most suitable form. Thus, there is a strong need
for improved means of facilitating information access.
Not surprisingly, it has become a well-known fact that people keep abreast of world affairs by collecting bites of information from multiple sources. Additionally, some of the most important and practical applications that make use of distilling the most important information out of a source to capture some specific portion of it are the following. Multimedia news summaries watch the news and tell users what happened while they were away. Physicians' aid applications summarize and compare the recommended treatments for certain patients considering certain drugs. Meeting summarization concludes what happened at a missed teleconference and provides meeting minutes. Consequently, in this context, automatic summarization of documents can be considered a solution to the problems presented above.
Summarization, the art of abstracting key content from one or more information sources, has become an integral part of everyday life. The product of this procedure still contains the most important points of the original text. However, what counts as text usually differs depending on the context: it may include raw text documents, multimedia documents, online documents, hypertexts, etc. In fact, the need for such technology comes directly from the so-called phenomenon of information overload; access to solid and properly developed summaries is essential. Given the billions of pages the Web comprises, “please help me to select just that precise information about topic X!” is a typical problem faced by most of us nearly every day in every discipline.
Furthermore, it has been said for decades that more and more information is becoming available and that tools are needed to handle it. Consequently, this narrows down to the need for someone or something to give us an idea about the content of those dozens of documents, i.e., to give us a grasp of the gist of the text.
1.2 Background
When talking about a (relatively) new and promising technology such as summarization, it is important to offer a caution: one should avoid false expectations. For instance, one should not expect newly automated systems to be able to replace humans in their work. Instead, one should expect realistic applications that can either help people or release them from a number of boring and recurrent tasks. Though the first attempts at automatic summarization were made over 60 years ago [1-2], this area was until quite recently a rather obscure research branch. No doubt this was due to the non-availability of machine-readable texts. Only recently has a sufficient quantity of material become electronically available, especially through the World Wide Web. Consequently, research in the field has gained much interest in the past few years, and has risen to prominence since 1990 [1-2-3-4].
Systems aiming to develop a coherent summary of any kind of text need to take into account several variables, such as length, writing style and syntax, to produce a useful summary [5]. In addition, summaries can be characterized by their reduction rate, degree of informativeness and degree of well-formedness [3-5].
The reduction rate, also known as the compression rate or condensation rate, is a major factor in characterizing one summary and comparing it to another. It is simply measured by dividing the total summary length by the total source text length; the ratio should be less than one, or else the summary would be as long as the actual source text.
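As a quick illustration with hypothetical numbers: a 100-word summary produced from a 500-word source has a compression rate of 100 / 500 = 0.2, i.e., the text is reduced to one fifth of its original length.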
The degree of informativeness is another summary characteristic; it concerns the amount of information conveyed by the summary in comparison to the source, which can be thought of as fidelity to the source and relevance to the user's interests. Another very important characteristic is how well the summary is formed structurally and grammatically, that is, its degree of well-formedness.
However, a few issues arise when studying automatic summarizers. One is: do they act like humans when formulating summaries? The answer is that automatic systems seldom follow the same strategies as humans, partly because we still do not have a very clear idea of what those strategies actually are. Nor do computing systems have at their disposal the same kind of resources and background information that we humans have [6].
Another issue is how good summarizers are. This is a hard question to answer, but we can say that it depends on several factors, especially on what kind of things we expect to find in the summary. Maybe we are generically interested in what the author is saying, or maybe what we want to know is what the author says about some particular issue. Besides, we should notice that there are many different types of documents to be summarized: newspaper stories, articles, e-mails, scientific papers, Web sites, brochures, books, etc. In each case the level of text codification differs, from documents in plain text to highly codified XML 2 pages, passing through HTML 3 or any other mark-up language. Therefore, what makes a good summary is the direct answer to two questions: a good summary of what, and to obtain what. In addition, summaries tend to have problems in their final shape; for instance, they tend to have gaps or dangling anaphora. Grammatical mistakes and implausible output are further factors that can harm the form of the resulting summary [1-2-3]. In conclusion, all of these issues normally accompany the process of text summarization and need to be considered carefully. In fact, they can be considered part of the problem definition itself.
2. http://www.w3schools.com/xml/default.asp
3. http://www.w3schools.com/html/default.asp
On the other hand, one of the usual and challenging tasks to be carried out in any research is the evaluation of results. In the case of automatic text summarization, evaluation is usually done by comparing automatic summaries against some kind of reference summary built by humans, normally referred to as the “gold standard” [1-2-3]. To achieve that, one usually asks a set of referees to summarize (or produce extracts from) a test corpus of documents. It has turned out that reaching a single reference summary is not a trivial task at all: given the same document, the referees rarely agree on what information is relevant enough to be selected for a summary.
In fact, in some evaluation tests the same referees have been asked to summarize the same documents with a lapse of several weeks between one test and the other. In such cases, the referees turned out to agree with themselves only about 55% of the time [6]. However, this drawback is usually addressed using statistical methods, managing to reach some kind of average summary to be used as a benchmark. This seems to be a fairly reasonable way of evaluating automatic summarization systems. All in all, we have only shed some light here on the challenges faced by researchers in tackling automatic text summarization.
1.3 Problem Definition
Many summarization models have been proposed previously to tackle the problem
of summarizing Web pages. However, this research is directed towards summarizing a
special type of Web pages, namely, frequently asked question Web pages. The problem
can be defined formally as follows: given a FAQ 4 Web page P that consists of a set of questions {Q1, Q2, ..., Qn}, each followed by its associated answer from the set {A1, A2, ..., An}, the task of Web page summarization is to extract, for each question Q in P, a subset of sentences from its answer that best represents a brief answer to that question.
4. http://en.wikipedia.org/wiki/FAQ
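A minimal sketch of this formulation as a data model follows (illustrative only; the type and method names are hypothetical and are not taken from the FAQWEBSUMM implementation):

import java.util.List;

// A FAQ Web page P modelled as a list of question/answer units.
class FaqPage {
    List<QaUnit> units;            // (Q1, A1), (Q2, A2), ..., (Qn, An)
}

class QaUnit {
    String question;               // Qi
    List<String> answerSentences;  // the sentences of Ai
}

// The summarization task: for every question, select the subset of its
// answer sentences that best represents a brief answer to that question.
interface FaqSummarizer {
    List<String> summarize(QaUnit unit);
}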
1.4 Motivation
Prior to starting this work on FAQ Web page summarization, we carried out an informal pilot study to gauge the usability and importance of this type of page. It was found that there exist thousands of FAQ pages available on many subjects, not only online but also offline, as most products nowadays come with a leaflet titled FAQ describing their functionality and usage. In fact, this applies to most if not every available discipline: commercial products, academia, service providers, manuals, etc.
Moreover, there exist several Web sites online that catalog and provide search capabilities for FAQs, for example the Internet FAQ Consortium 5 . Furthermore, FAQs nowadays tend to be stored in content management systems (CMS) 6 online, or in simple text files. Since 1998, a number of specialized software programs have emerged, mostly written in Perl 7 or PHP 8 ; some of them are integrated into more complex FAQ handler software applications, while others, like phpMyFAQ 9 , can be run either as stand-alone FAQs or integrated into Web applications. Therefore, the probability of encountering sites that contain FAQs across the different disciplines is high, which suggests that they deserve a satisfactory degree of consideration.
Additionally, due to the abundance of information available online, users find it difficult to extract certain pieces of information quickly and efficiently. Hence, there is a great need to convert information into a condensed form that presents a non-redundant account of the most relevant facts found across documents, to ensure easy access and review. This has proven to be quite an important and vital need, especially for World Wide Web users, who are in contact with an enormous, perhaps endless, amount of information content. As a result, the door is opened wide to the process of summarization, which comes in multiple different flavors. For instance, here are some of the forms the different types of summaries take. People tend to go through product reviews before actually buying a product. People tend to keep up with world affairs by listening to or reading news bites. People base their investments in the stock market on summaries of updates to different stocks. All in all, with summaries people can make effective decisions in less time.
5. http://www.faqs.org/faqs/
6. http://en.wikipedia.org/wiki/Content_management_system
7. http://en.wikipedia.org/wiki/Perl
8. http://en.wikipedia.org/wiki/PHP
9. http://en.wikipedia.org/w/index.php?title=PhpMyFAQ&action=edit&redlink=1
In fact, this research is mainly motivated by the importance of FAQs, especially in the World Wide Web domain as depicted earlier, along with the goal of presenting them in a more concise form based on a newly adapted summarization technique. One more motive for summarizing FAQ Web pages is the fact that the structure of FAQ pages is almost a standard; there have even been proposals for standardized FAQ formats, such as RFC 1153 10 and the Minimal Digest Format.
1.5 Objective
The main objective of this research is to develop a summarization technique that makes use of the extra knowledge extracted through the process of Web page segmentation, along with a set of selective features for sentence extraction, to improve frequently asked questions Web page summarization. The document semantics are scattered along the page and need to be detected correctly, in our case as a set of questions and answers. Therefore, it is helpful if the knowledge contained in the structure of the page can be uncovered to complement the Web page contents. Recent research work on extractive summary generation employs heuristics and does not indicate how to select the relevant features for a specific type of text. Therefore, we adopt the hypothesis that different text types need to be processed differently.
10. http://www.rfc-editor.org/info/rfc1153
1.6 Thesis Layout
The thesis is organized in the following manner. In chapter two a thorough literature survey is conducted, giving a brief account of the taxonomy of the different types of summaries, early work in text summarization, single document summarization (SDS), multi-document summarization (MDS), other approaches in text summarization and, finally, summarization evaluation.
In chapter three we list some of the most famous summarization projects in the
field of text summarization on both the academic and commercial sides.
In chapter four we introduce the proposed methodology and show how it becomes applicable using a utility such as Web page segmentation. Additionally, we show how we employ segmentation to comprehend the FAQ Web page structure in the form of questions plus answers.
In chapter five we introduce the system architecture and design to illustrate how the problem is solved. The chapter also covers some implementation issues, listing the tools used to develop our system, and describes the target environment. The developed summarization system has been given the code name FAQWEBSUMM, which reflects the proposed functionality of the system: FAQ Web page summarization.
In chapter six we present the experimental results and analysis, divided into the following sections: experimental methodology, which describes the environment in which the experiments took place, the tools used and the dataset; and experimental results, which presents the actual quantitative results in the form of charts and tables. A thorough analysis of the results is given by explaining and highlighting what is important in terms of meeting the proposed final goals.
In chapter seven a conclusion sums up the overall contribution by first listing the claims and whether they were met by the results from the proposed system. Finally, in the future work section we discuss possible extensions of this research.
Chapter 2
LITERATURE SURVEY
2.1 Introduction
The aim of this chapter is to survey the most famous work done in the field of automatic text summarization, introducing a range of fundamental terms and ideas. There are multiple summary types corresponding to the different typologies and current research lines in the field. We will present the taxonomy of summarization categories found in the literature.
Next we describe single document summarization (SDS), focusing on extractive techniques. We start by giving a brief introduction and mention some of the early ideas that sparked all modern approaches. Then we show some of the most famous work in single document summarization utilizing machine learning and statistical techniques.
Next we shift our attention to multi-document summarization (MDS), which is now seen as one of the hottest topics in the text summarization field. What makes it interesting is that it deals with a distinctive set of problems, such as the need to highlight differences between different reports and to deal with time-related issues (e.g., changing estimates of the number of casualties). Another nontrivial problem is handling cross-document co-reference, for example ensuring that mentions of different names in two different documents actually refer to the same person.
The literature of text summarization cannot simply be confined to the previous two categories of single document and multiple documents only. There exist some other summarization types that need to be introduced in this survey to give a thorough introduction to the field. These are short summaries, query based summaries, sentence compression based summaries, structure based summarization, link based summarization, email message summarization, customer review summarization, blog summarization, and question answering and FAQ summarization. We will give a brief introduction to each of these categories.
Another extremely important point to be covered in the literature survey is summarization evaluation. It is quite a hard problem to deal with, because summarization systems differ from one another in their form of input, their output and the type of audience towards whom the system is directed. All of these issues are addressed in the summarization evaluation section.
2.2 Taxonomy of Summarization Categories
Summaries can be differentiated from one another based on a set of aspects which
are purpose, form, dimension and context [3-4-5].
The purpose of a summary can be indicative, indicating the direction or orientation of the text and whether it is for or against some topic or opinion. A summary's purpose can also be informative, meaning that it only relates the facts in the text and does so in a neutral way. A critical summary is a third purpose, where the summary tries to criticize the text.
Another summary categorization can be based on the summary form: whether it is an extractive or an abstractive summary. Extractive summaries consist entirely of sentences or word sequences contained in the original document. Besides complete sentences, extracts can contain phrases and paragraphs. The problem with this type of summary is usually a lack of balance and cohesion: sentences can be extracted out of context, and anaphoric references can be broken. Abstractive summaries contain word sequences not present in the original. They are usually built from the existing content but using more advanced methods.
The dimension of the source can be another point of differentiation between summaries, as it takes into account whether the input text is a single document or multiple documents.
One more categorization of the summary type is based on the context of the summary: whether it is query based or generic. A query based summary targets only a single topic, so it is more precise and concise. On the other hand, a generic or query independent summary is usually not centered on a single topic, so it tries to capture the main theme(s) of the text and to build the whole summary around them unguided.
In this survey we will concentrate on the general types of summaries based on dimension or context, namely single-document, multi-document and other types of summarization, analyzing the several approaches currently employed by each type.
2.3 Single-document Summarization
Usually, the flow of information in a given document is not uniform, which means that some parts are more important than others. Thus the major challenge in summarization lies in distinguishing the more informative parts of a document from the less informative ones. Though there have been instances of research describing the automatic creation of abstracts, most of the work presented in the literature relies on the extraction of sentences to address the problem of single-document summarization. Unfortunately, research in the area of single document summarization has been somewhat declining since the single document summarization track was dropped from the DUC 11 challenge in 2003.
Additionally, according to [7], summarization systems tend to perform better in multi-document than in single document summarization tasks. This could be partially explained by the fact that repeated occurrences across the input documents can be used directly as an indication of importance in multi-document settings.
In this section, we describe some well-known extractive techniques in the field of
single document summarization. First, we show some of the early work that initiated
research in automatic summarization. Second, we show approaches involving machine
learning techniques. Finally, we briefly describe some deep natural language processing
techniques to tackle the problem.
Most early work on single-document summarization focused on technical documents. Hans Peter Luhn was a pioneer in using computers for information retrieval research and application development, and the most cited paper on text summarization is his [8], which describes research done at IBM in the 1950s. In this work, Luhn proposed that the frequency of a particular word in an article provides a useful measure of its significance. Several key ideas put forward in this paper have assumed importance in later work on summarization.
11. http://duc.nist.gov
Related work in [9] provides early insight into a particular feature that was thought to be very helpful in finding the salient parts of documents: sentence position. By experiment, the author showed that 85% of the time the topic sentence is the first one and only 7% of the time it comes last, while the remaining cases are randomly distributed.
The work in [10] built on the two features of word frequency and positional importance that were incorporated in the previous two works. Additionally, two other features were used: the presence of cue words (for example, words like significant, fortunately, obviously or hardly) and the structure of the document (whether the sentence is a title or heading). Weights were attached to each of these features manually to score each sentence. Evaluations showed that approximately 44% of the automatic extracts matched the manual human extracts [1].
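A minimal sketch of this style of manually weighted, feature-based sentence scoring follows (the weights, the cue-word list and the feature implementations are illustrative placeholders, not the actual values used in [10]):

import java.util.List;
import java.util.Set;

// Edmundson-style scoring: a manually weighted linear combination of
// cue-word, title-word and position evidence for a single sentence.
class WeightedSentenceScorer {
    // Hypothetical weights; in [10] such weights were set by hand.
    static final double W_CUE = 1.0, W_TITLE = 1.0, W_POSITION = 1.0;
    static final Set<String> CUE_WORDS =
            Set.of("significant", "fortunately", "obviously", "hardly");

    static double score(List<String> sentenceWords, Set<String> titleWords,
                        int sentenceIndex, int sentenceCount) {
        long cueHits = sentenceWords.stream()
                .filter(w -> CUE_WORDS.contains(w.toLowerCase())).count();
        long titleHits = sentenceWords.stream()
                .filter(w -> titleWords.contains(w.toLowerCase())).count();
        // Earlier sentences receive a higher positional score.
        double position = 1.0 - (double) sentenceIndex / sentenceCount;
        return W_CUE * cueHits + W_TITLE * titleHits + W_POSITION * position;
    }
}

Sentences are then ranked by this score and the highest-scoring ones are selected as the extract.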
In the 1980s and 1990s, interest shifted toward using AI methods, hybrid approaches and the summarization of grouped documents and multimedia documents. In fact, the main problem with the first generation of text summarization systems was that they only used informal heuristics to determine the salient topics from the text representation structures.
The second generation of summarization systems then adopted more mature knowledge representation approaches [11], based on the evolving methodological framework of hybrid, classification-based knowledge representation languages.
2.3.1 Machine Learning Methods
Lately, with the advent of new machine learning techniques in NLP, a series of publications appeared that employed statistical techniques to produce document extracts. Initially, most systems assumed feature independence and relied mostly on naive Bayes methods; others have focused on the choice of appropriate features and on learning algorithms that make no independence assumptions. Other significant approaches involve hidden Markov models and log-linear models to improve extractive summarization. Other work, in contrast, uses neural networks and genetic algorithms to improve purely extractive single document summarization.
Many machine learning systems relied on methods based on a naive Bayes classifier in order to learn from data. The classification function categorizes each sentence as worthy of extraction or not. In [12] an approach was developed that uses the distribution of votes in a Bayesian model to generate a probability of each sentence being part of the summary. Their research showed that early sentences appear in summaries more often than final ones.
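One common way of writing such a classifier (this is the standard trainable-summarizer formulation from the naive Bayes literature; the exact model used in [12] may differ) estimates, for a sentence s with feature values F1, ..., Fk, the probability that s belongs to the summary S as:

P(s in S | F1, ..., Fk) = P(s in S) * [ P(F1 | s in S) * ... * P(Fk | s in S) ] / [ P(F1) * ... * P(Fk) ]

where the individual probabilities are estimated from a training corpus of documents with reference extracts, and the sentences with the highest probability are selected.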
In [13] the problem of extracting sentences from a document was tackled using a hidden Markov model (HMM). The basic motivation behind using a sequential model is to account for local dependencies between sentences. Only three features were incorporated: the position of the sentence in the document (built into the state structure of the HMM), the number of terms in the sentence, and the likelihood of the sentence terms given the document terms. The TREC 12 dataset was used as a training corpus, while the evaluation was done by comparing the resulting extracts to human generated extracts.
It is often noted that existing approaches to summarization have assumed feature independence. In [14], the author used log-linear models to drop this assumption and showed empirically that their system produced better extracts than naive Bayes models.
In 2001-02, DUC issued a task of creating a 100-word summary of a single news article, as most of the papers in that period concentrated on news articles. Surprisingly, the best performing systems in the evaluations could not outperform the baseline with statistical significance. This extremely strong baseline, analyzed in [7], corresponds to selecting the first n sentences of a newswire article. However, this disappointing result was eventually beaten by the introduction of the NetSum system developed at Microsoft Research [15].
12. http://trec.nist.gov/
NetSum utilizes a machine learning method based on the neural network ranking algorithm RankNet [16]. The system is customized for extracting summaries of news articles by highlighting three sentences. The goal is pure extraction without any sentence compression or sentence regeneration; thus, the system is designed to extract the three sentences from a single document that best match the three document highlights. For sentence ranking they used RankNet, a ranking algorithm designed to rank a set of inputs that uses the gradient descent method for training. The system is trained on pairs of sentences from a single document, such that the first sentence in the pair should be ranked higher than the second one. Training is based on a modified back-propagation algorithm for two-layer networks, and the system relies on a two-layer neural network. They employed two types of features in their system. First, they used surface features such as sentence position in the article, word frequency in sentences and title similarity.
However, the novelty of their framework lay in the use of features that derived information from the query logs of Microsoft's news search engine 13 and from Wikipedia 14 entries. Hence, if parts of a news search query or a Wikipedia title appear frequently in the news article to be summarized, then a higher importance score is attached to the sentences containing those terms. The results of this summarization approach are encouraging: based on evaluation using ROUGE-1 [17] and comparison to the baseline system, it performs better than all previous systems for news article summarization from the DUC workshops.
In [18] machine learning algorithms were used to propose an automatic text summarization approach based on sentence segment extraction. This is done by breaking each sentence into a set of segments. Each segment is then represented by a set of predefined features, such as the location of the segment in the text, the average term frequency of the words occurring in the segment and the frequency of title words occurring in the segment. The feature scores are then combined using a machine learning algorithm to generate a vector of highly ranked sentences from which the summary is generated. Another attempt in the same direction [19] coupled the machine learning algorithm with gene expression programming, which is similar to genetic programming, to produce a better weighting scheme.
13. http://search.live.com/news
14. http://en.wikipedia.org
On the other hand, other approaches train genetic algorithms (GA) and use mathematical regression models to obtain a suitable combination of feature weights. In [20] two approaches to text summarization were proposed: a modified corpus-based approach (MCBA) and an LSA-based text relationship map (TRM).
The first is a trainable summarizer, which takes into account several features, including position, positive keywords, negative keywords, centrality and resemblance to the title, to generate summaries. Two new ideas are exploited: (a) sentence positions are ranked to emphasize the significance of different sentence positions, and (b) the score function is trained by the genetic algorithm (GA) to obtain a suitable combination of feature weights.
The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representations to construct a semantic text relationship map. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. It was found that, at a compression rate of 30%, average f-measures of 49% for MCBA, 52% for MCBA + GA, and 44% and 40% for LSA + TRM at the single-document and corpus levels, respectively, were achieved [20].
Another attempt [21] addresses the problem of improving content selection in automatic text summarization through a trainable summarizer. It takes into account several features, including sentence position, positive keywords, negative keywords, sentence centrality, sentence resemblance to the title, inclusion of named entities, inclusion of numerical data, relative sentence length, the bushy path of the sentence and aggregated similarity, to generate summaries. The authors investigate the effect of each sentence feature on the overall summarization task. They then combine all features to train genetic algorithm (GA) and mathematical regression (MRM) models to obtain a suitable combination of feature weights.
Moreover, they use all feature parameters to train a feed forward neural network (FFNN), a probabilistic neural network (PNN) and a Gaussian mixture model (GMM) in order to construct a text summarizer for each model. Furthermore, they use the models trained on one language to test summarization performance on the other language. The performance of their approach [21] is measured at several compression rates on a data corpus composed of 100 Arabic political articles and 100 English religious articles. The results of the proposed approach are promising, especially for the GMM.
Furthermore, in [22] a novel technique was proposed for summarizing text using a combination of genetic algorithms (GA) and genetic programming (GP) to optimize the rule sets and membership functions of fuzzy systems. The main goal is to develop an optimal intelligent system that extracts important sentences from texts while reducing the redundancy of data. The novelty of the proposed algorithm comes from the hypothesis that a fuzzy system can be optimized for extractive text summarization. In their work, GP is used for the structural part and GA for the string part (the membership functions). The chosen fitness function considers both local and global summary properties by taking into account various features of a given sentence, such as its relative number of thematic words as well as its location in the whole document. The developed method is compared with standard fuzzy systems as well as two commercial summarizers: Microsoft Word and Copernic Summarizer. Simulations demonstrate several significant improvements with the proposed approach.
2.3.2 Deep Natural Language Analysis Methods
In this subsection, we describe a set of papers that detail approaches towards
single document summarization involving complex natural language analysis techniques.
They tend to use a set of heuristics to create document extracts. Most of these techniques
try to model the text's discourse structure. They incorporated considerable amounts of
linguistic analysis for performing the task of summarization. They also tried to reach a
middle ground between approaches based on the analysis of the semantic structure of the
text and approaches based on word statistics of the documents.
In [23] a system was introduced that focuses on dynamic summary generation based on a user input query. The approach was designed for application in a specific domain (medical); however, it can be used in a general domain too. The idea is based on the fact that the user selects keywords to search for a document with specific requirements. However, these keywords may not match the document's main idea, and thus a pre-built document summary would be less relevant. Hence, the summary needs to be generated dynamically, according to the user requirements given by the search query.
The system is coupled with two ontology knowledge sources, WordNet and UMLS. WordNet is a widely known lexical database for English, developed at Princeton University. UMLS is maintained by the United States National Library of Medicine and includes three knowledge sources: the Metathesaurus, the Semantic Network and the Specialist Lexicon.
Their system runs in three main steps. First, they evaluate and adjust the query with regard to the WordNet and/or UMLS ontology knowledge: redundant keywords are removed and relevant ones added. Second, they calculate the distance between the document's sentences and the adjusted query; sentences below a predefined threshold are candidates for inclusion in the final document summary. Third, they calculate the distances among the candidate summary sentences; the candidates are then separated into groups based on a threshold, and the highest ranked candidate from each group becomes part of the document summary.
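A simplified sketch of the second and third steps is given below (the distance function and both thresholds are placeholders; the actual system computes distances using WordNet/UMLS knowledge, and the grouping here is a greedy approximation):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiFunction;

// Query-driven selection with redundancy reduction: keep sentences close
// to the query, then let each group of mutually close sentences contribute
// only its best member to the summary.
class QueryDrivenSelector {
    static List<String> select(List<String> sentences, String query,
                               BiFunction<String, String, Double> distance,
                               double queryThreshold, double groupThreshold) {
        // Step 2: keep sentences whose distance to the query is below the threshold.
        List<String> candidates = new ArrayList<>();
        for (String s : sentences) {
            if (distance.apply(s, query) < queryThreshold) candidates.add(s);
        }
        // Rank candidates, closest to the query first.
        candidates.sort(Comparator.comparingDouble(s -> distance.apply(s, query)));

        // Step 3: greedily skip candidates that are too close to an already
        // selected sentence, so each group is represented only once.
        List<String> summary = new ArrayList<>();
        for (String c : candidates) {
            boolean redundant = summary.stream()
                    .anyMatch(sel -> distance.apply(c, sel) < groupThreshold);
            if (!redundant) summary.add(c);
        }
        return summary;
    }
}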
Early evaluations of this system as part of DUC 2007 showed good potential. However, some problems related to redundancy reduction, lack of syntactic analysis and insufficient query analysis were highlighted at the conference, and the authors stated that these would be addressed in their future work.
Another similar approach, based on semantic analysis of the document, was
proposed in [13]. The authors propose a scoring scheme for sentence extraction based on
static and dynamic document features. Static features include sentence location and the
named entities (NE) in each sentence; dynamic features include the semantic similarity
between sentences and the user query.
To attain their goal they incorporated three main stages: preprocessing, analysis
and summary generation. In preprocessing, unnecessary elements are removed from the
document; the document is then tokenized, sentence boundaries are detected and named
entities are recognized.
In the analysis step, features are extracted and analyzed and a relevancy score is
built for each sentence; the final sentence score is calculated as a linear combination of the
weighted features. Their system was presented at TAC 2008 (Text Analysis Conference,
http://www.nist.gov/tac/). It demonstrated better performance at finding relevant content
than at removing irrelevant content.
2.4 Multi-document Summarization
Nowadays it is enormously important to develop efficient new ways of extracting
text. Single-document summarization systems support the automatic generation of
extracts, abstracts, or question-based summaries, but single-document summaries provide
limited information about the contents of a single document.
In most cases, however, a user makes an inquiry about a topic which, especially
since the introduction of the internet, usually involves a large set of related documents.
Such a query can return hundreds of documents, and although they differ in some areas,
many of these documents provide the same information. A summary of each document
would help in this case; however, these summaries would be semantically similar to one
another.
In today's community, in which time plays an important role, multi-document
summarizers therefore play an essential role. Multi-document summarization has been
very popular since the mid 1990s, mostly tackling the domain of news articles. We will
provide a brief overview of the most important ideas in that field.
In [24] they introduced MEAD, a multi-document summarizer, which generates
summaries using cluster centroids produced by a topic detection and tracking system,
where a centroid is a set of words that are statistically important to a cluster of
documents. In fact, centroids can be used both to classify relevant documents and to
identify salient sentences in a cluster.
Their algorithm runs as follows: first, related documents are grouped together into
clusters. Each document is represented as a weighted TF*IDF vector (see
http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html), and only
documents whose similarity measure falls within a certain threshold are included in a
cluster. Then MEAD decides which sentences to include in the extract by ranking them
according to a set of parameters. They used three features to compute the salience of a
sentence: centroid value, positional value, and first-sentence overlap. The centroid value
C_i for sentence S_i is computed as the sum of the centroid values C_{w,i} of all words
in the sentence. The positional value is computed so that the first sentence in a document
gets the same score C_max as the highest-ranking sentence in the document according to
the centroid value. The overlap value is computed as the inner product of the sentence
vectors for the current sentence i and the first sentence of the document. The overall score
of a sentence is the weighted sum of these three values; in fact, they used an equal weight
for all three parameters.
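The scoring scheme just described can be written down compactly. The following is a minimal sketch, assuming the centroid is already given as a word-to-weight mapping; the equal default weights follow the text, while the linear decay of the positional value after the first sentence is an assumption of this sketch (the text only fixes the first sentence's value at C_max). Variable names are illustrative, not MEAD's own.

from collections import Counter

def mead_sentence_scores(sentences, centroid, w_c=1.0, w_p=1.0, w_f=1.0):
    """Combine centroid value, positional value and first-sentence overlap
    as a weighted sum, with equal weights by default.

    sentences: list of token lists, in document order.
    centroid:  dict mapping a word to its centroid weight.
    """
    n = len(sentences)
    # Centroid value C_i: sum of the centroid weights of the words in sentence i.
    c = [sum(centroid.get(w, 0.0) for w in s) for s in sentences]
    c_max = max(c) if c else 0.0
    # Positional value: the first sentence gets C_max; later ones decay linearly (assumption).
    p = [c_max * (n - i) / n for i in range(n)]
    # First-sentence overlap: inner product with the first sentence's term counts.
    first = Counter(sentences[0]) if sentences else Counter()
    f = []
    for s in sentences:
        counts = Counter(s)
        f.append(sum(counts[w] * first[w] for w in counts))
    return [w_c * c[i] + w_p * p[i] + w_f * f[i] for i in range(n)]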
They also introduced a new evaluation scheme based on sentence utility and
subsumption. This utility is called cluster-based relative utility (CBRU, or relative utility,
RU in short) which refers to the degree of relevance (from 0 to 10) of a particular
sentence to the general topic of the entire cluster. A utility of 0 means that the sentence is
not relevant to the cluster and a 10 marks an essential sentence.
They also introduced a related notion to the RU which is cross-sentence
informational subsumption (CSIS, or subsumption). CSIS reflects that certain sentences
repeat some of the information present in other sentences and may, therefore, be omitted
during summarization. Evaluation systems could be built based on RU and thus provide a
more quantifiable measure of sentences. In order to use CSIS in the evaluation, they
introduced a new parameter, E, which tells how much to penalize a system that includes
redundant information. All in all, they found that MEAD produces summaries that are
similar in quality to the ones produced by humans.
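To make relative utility concrete, the sketch below scores a system extract as the total judge-assigned utility of its sentences divided by the utility of the best possible extract of the same length. This normalization is one common way to operationalize RU and is an assumption of the sketch; the text above only defines the per-sentence 0-10 utility scale.

def relative_utility(utilities, selected, k=None):
    """Relative utility of an extract.

    utilities: list of judge scores (0-10), one per sentence of the cluster.
    selected:  indices of the sentences the system extracted.
    k:         extract length; defaults to len(selected).
    """
    k = k or len(selected)
    achieved = sum(utilities[i] for i in selected)
    best = sum(sorted(utilities, reverse=True)[:k])   # best possible extract of size k
    return achieved / best if best else 0.0

# Example: judges rated four sentences 9, 3, 7 and 2; the system picked sentences 0 and 3.
print(relative_utility([9, 3, 7, 2], [0, 3]))   # 11/16 = 0.6875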
They also compared MEAD’s performance to an alternative method, multi-document
lead, and showed how MEAD’s sentence scoring weights can be modified to
produce summaries significantly better than the alternatives.
In [25] they presented two multi-document extraction systems: one produces a
general-purpose summary of a cluster of related documents and the other produces an
entity-based summary of documents related to a particular person.
The general-purpose summary is generated by a process that ranks sentences
based on their document and cluster “worthiness”. This means that there is a need to take
into consideration the relationship each sentence has to the set of documents (cluster) that
constitute the input to the process. In response they constructed a centroid representation
of the cluster of n related documents and then computed the cosine similarity between
each sentence in the document set and the centroid. They used the following features to
score a sentence in a cluster: (i) sentence-cluster similarity, (ii) sentence-lead-document
similarity, and (iii) absolute document position. These values are combined with
appropriate weights to produce each sentence's final score, which is used to rank the
sentences.
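The sentence-cluster similarity feature can be made concrete by building a centroid term vector for the cluster and comparing each sentence to it with the cosine measure. The sketch below uses raw term frequencies instead of the weighted vectors used in [25], so it is only a simplified stand-in for feature (i).

from collections import Counter
from math import sqrt

def cluster_centroid(documents):
    """Average term-frequency vector over the documents of a cluster."""
    total = Counter()
    for doc in documents:
        total.update(doc.lower().split())
    n = len(documents)
    return {w: c / n for w, c in total.items()}

def sentence_cluster_similarity(sentence, centroid):
    """Feature (i): cosine similarity between a sentence and the cluster centroid."""
    s_vec = Counter(sentence.lower().split())
    num = sum(s_vec[w] * centroid.get(w, 0.0) for w in s_vec)
    den = sqrt(sum(v * v for v in s_vec.values())) * sqrt(sum(v * v for v in centroid.values()))
    return num / den if den else 0.0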
On the other hand, the personality-based summary is constructed by a process that
ranks sentences according to a metric that uses co-reference and lexical information in a
person profile. In fact, they have explored three ideas: first, identify and measure
references of the key entity in a sentence; second, identify and measure if person facets or
characteristics are referred to in a sentence; and finally identify and measure mention of
information associated with the key entity. The final score is the weighted sum of all
three features mentioned above. In both summary modules, a process of redundancy
removal is applied to exclude repeated information.
In [26] they aimed to explore document impact on summarization performance.
They proposed a document-based graph model to incorporate the document-level
information and the sentence-to-document relationship into the graph-based ranking
process. The basic graph-based model is a way of deciding the importance of a vertex
within a graph based on global information recursively drawn from the entire graph. The
basic idea is that of “voting” between the vertices. A link between two vertices is
considered as a vote cast from one vertex to the other vertex. The score associated with a
vertex is determined by the votes that are cast for it, and the score of the vertices casting
these votes.
Unfortunately, the basic graph-based model is built on the single-layer sentence
graph and the transition probability between two sentences in the Markov chain depends
only on the sentences themselves, not taking into account the document-level information
and the sentence-to-document relationship.
However, to incorporate the document-level information and the sentence-to-document
relationship, they [26] proposed the document-based graph model based on a
two-layer link graph including both sentences and documents.
Moreover, they devised three methods to evaluate document importance within a
document set. The first measures the cosine similarity between the document and the
whole document set and uses it as the importance score of the document. The second
measures the average similarity between the document and every other document in the
set. The third constructs a weighted graph between documents and uses the PageRank
algorithm (see http://www.markhorrell.com/seo/pagerank.html) to compute rank scores of
the documents as their importance scores; the link weight between two documents is
computed using the cosine measure.
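The third of these document-importance measures can be sketched as power-iteration PageRank over a cosine-weighted document graph. The damping factor and iteration count below are standard choices, not necessarily those of [26], and the similarity matrix is assumed to be computed beforehand.

def pagerank_importance(sim, d=0.85, iters=50):
    """Rank documents with PageRank on a weighted similarity graph.

    sim: square matrix (list of lists) where sim[i][j] is the cosine similarity
         between document i and document j (sim[i][i] should be 0).
    """
    n = len(sim)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]            # total outgoing link weight per node
    for _ in range(iters):
        new = []
        for j in range(n):
            # Weighted votes flowing into document j from every other document i.
            incoming = sum(
                scores[i] * sim[i][j] / out_weight[i]
                for i in range(n) if out_weight[i] > 0
            )
            new.append((1 - d) / n + d * incoming)
        scores = new
    return scores

# Example: three documents, where documents 0 and 1 are the most similar pair.
print(pagerank_importance([[0.0, 0.8, 0.3],
                           [0.8, 0.0, 0.4],
                           [0.3, 0.4, 0.0]]))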
On the other hand, they proposed four methods to evaluate the sentence to
document correlation. The first three methods are based on sentence position in the
document, under the assumption that the first sentences in a document are usually more
important than other sentences. The last method is based on the content similarity
between the sentence and the document. In fact, the experimental results on DUC 2001
and DUC 2002 demonstrated the effectiveness of their proposed model.
In [27] they present a generative probabilistic model for multi-document
summarization.
They started with a simple word frequency based model and then
constructed a sequence of models each injecting more structure into the representation of
document set content. Their final model, HIERSUM, utilizes a hierarchical LDA (Latent
Dirichlet Allocation) style model to represent content specificity as a hierarchy of topic
vocabulary distributions. The resulting model produces general summaries as well as
summaries for any of the learned sub-topics.
Their work [27] relies on the observation that the content of document collections
is highly structured, consisting of several topical themes, each with its own vocabulary
and ordering preferences. In addition, they rely on the hypothesis that a user may be
interested in the general content of a document collection or in one or more of the
sub-stories residing in the documents. As a result, they adapt their topic modeling
approach to model this aspect of document set content: rather than drawing a single
content distribution for a document collection, they draw a general content distribution
together with more specific distributions for its sub-topics. In fact, at the task of producing
generic DUC-style summaries, HIERSUM yields state-of-the-art ROUGE performance.
2.5 Other Tasks in Text Summarization
2.5.1 Short Summaries
The generation of very short summaries (less than 75 bytes), a problem also called
headline generation, was introduced by DUC as a summarization task. The logic behind
this task is that the most relevant ideas of a document are usually expressed in a few
sentences, which allows for this simplification: most of the document can be discarded.
It is well known that topic segmentation can be used as a
preprocessing step in numerous natural language processing applications.
In [28], they adapted a topic segmentation algorithm for the production of
automatic short summaries. Their proposed algorithm detects thematic structures in texts
using generic text structure cues. It associates key terms with each topic or subtopic and
outputs a tree-like table of content (TOC). Finally, they use these TOCs for producing
single-document abstracts as well as multi-document abstracts and extracts. They argue
that the text structure trees reflect the most important terms. One clear advantage of this
approach is that it requires no prior training, which makes it generally applicable.
Another attempt for tackling the problem of producing short summaries is the
UAM system [29]. It starts by selecting the most relevant sentences, using a genetic
algorithm and a combination of some heuristics as the fitness function. The weights for
each heuristic were obtained with another genetic algorithm built on top of them.
Secondly, they extract verb phrases and their arguments from those sentences. For
generating summaries, they connect highly ranked sentences with prepositions and
conjunctions whenever possible. Finally, extracts that still had space were completed
with the top-frequency words from the documents and collections and with noun phrases
selected by means of a combination of criteria.
According to their experiments [29], they found out that by setting the summary
length limit to 150 or 200, they would obtain summaries that were both grammatical and
informative. However, forcing a limit to the summary length had the consequence of
leaving out relevant verb phrases, and they were forced to fill in empty spaces with
keywords.
The K.U.Leuven summarization system [30] provides another solution to the problem
of generating short summaries. It participated in DUC in three main tasks:
generating very short single-document summaries (headlines), short multi-document
summaries, and short summaries focused by questions.
Considering the first task, they employ one of two methods depending on the
headline focus. The first is picking out keywords, which seems a good approach to cover
all the content. The second is applying sentence compression techniques, which are more
appropriate for good readability; additionally, the second method comes closer to the
way humans construct headlines.
Considering the second task, they cluster the term vectors of the important
sentences of single documents. Then they select important sentences based on the number
of keywords they contain (the same approach as in the case of the headlines).
Considering the third task, it requires summaries answering certain questions like
where is X? What is Y? When is Z?, where X is a person, Y is a thing and Z is an event.
Their system consists of a succession of filters/sentence selection modules: selecting
indicative sentences for the input person, thing or event, intersecting them with sentences
which are important for the whole document and filtering out indirect speech.
Finally, to eliminate redundant content while fitting into required length, they
cluster the resulting sentences from all the documents in a set with the covering method.
The documents are considered in chronological order and the sentences in the reading
order in the documents.
2.5.2 Query Based Summarization
It is also known as query-biased summarization. This summarization technique is
based on a keyword set entered by the user that serves as a search query. The query is
considered the central reference around which all information is extracted and ordered
into one or more semi-coherent statements.
In [31] they proposed a query focused summarization system named BAYESUM.
It leverages the common case in which multiple documents are relevant to a single query.
Their algorithm functions by asking the following question: what is it about these
relevant documents that differentiates them from the non-relevant documents?
In fact, BAYESUM can be seen as providing a statistical formulation of this exact
question. BAYESUM is built on the concept of language models for information
retrieval. The idea behind the language modeling techniques used in information retrieval
is to represent either queries or documents (or both) as probability distributions, and then
use standard probabilistic techniques for comparing them.
A relevant approach is presented in [32] where they proposed a supervised
sentence ranking approach for use in extractive summarization. The system provides
query-focused multi-document summarization, both for single summaries and for series
of update summaries.
They broke the problem into three main tasks [32]. First, they perform text
normalization which is the process of preparing and cleaning text in the input dataset by
removing meta-data and then performing sentence segmentation. Second, they perform a
process of supervised sentence ranking through employing machine learning techniques
to qualify certain features to be used in sentence scoring. Finally, they select highly
scored sentences from a ranked list.
Another example of query based summarization is answering XML queries by
means of data summaries [33]. It is a summarized representation of XML data, based on
the concept of instance patterns, which both provides succinct information and is directly
queried. The physical representation of instance patterns exploits item sets or association
rules to summarize the content of XML datasets. Instance patterns may be used for
answering queries, either when fast and approximate answers are required, or when the
actual dataset is not available, for example when it is currently unreachable [33].
Query-biased summaries (QBS) have become a standard feature in the result
presentation of search engines. However, an apparent disadvantage of QBS is the
generation cost at query time: the summary needs to be generated for every single
document presented in response to a potentially diverse range of queries.
Accordingly, [34] proposed a solution to the previous problem through the use of
document titles as an alternative to queries. Since they use document titles, the summary
can be pre-generated statically. When the summaries are pre-generated, presenting them
to the users becomes a simple lookup in a database.
To justify their [34] title-biased approach of summary generation, they made three
research hypotheses. First, top ranking documents tend to have a query term in the title.
Second, searchers prefer to visit a document when a query term appears in the title. Third,
there is no significant difference between QBS and TBS in supporting search tasks. Their
experimental results showed that title-biased summaries are a promising alternative to
query-biased summaries, due to the behavior of existing retrieval systems as well as
searchers’ information seeking behavior.
2.5.3 Sentence Compression Based Summarization
Automatic sentence compression can be broadly described as the task of creating
a grammatical summary of a single sentence with minimal information loss. Sentence
compression is also known as sentence simplification, shortening or reduction. It has
recently attracted much attention, in part because of its relevance to some vital
applications. Examples include the generation of subtitles from spoken transcripts [35],
the display of text on small screens such as mobile phones or PDAs [36-37].
Another example of sentence compression is presented in [38]. They propose a
text simplification process that seeks to reduce the complexity of sentences in biomedical
abstracts in order to improve the performance of syntactic parsers on the processed
sentences. Syntactic parsing is typically one of the first steps in a text mining pipeline.
They classified English sentences into three categories: 1) normal English sentences, as
in newswire text; 2) normal biomedical English sentences, i.e. those which can be parsed
without a problem by Link Grammar (www.link.cs.cmu.edu/link/); and 3) complex
biomedical English sentences, i.e. those which cannot be parsed by Link Grammar.
Their approach runs in three steps [38]: 1. preprocessing through the removal of
spurious phrases; 2. replacement of gene names; 3. replacement of noun phrases. They
evaluated the proposed method using a corpus of biomedical sentences annotated with
syntactic links. Experimental results showed an improvement of 2.90% for the
Charniak-McClosky [37] parser and of 4.23% for the Link Grammar parser when processing
simplified sentences rather than the original sentences in the corpus. Sentence compression
is thus another way of summarizing besides sentence selection; in fact, some systems use
sentence compression as a complementary phase after sentence selection to give a more
concise summary.
In [39] they propose a joint content selection and compression model for single-document
summarization. They evaluate their approach on the task of generating “story
highlights”: a small number of brief, self-contained sentences that allow readers to
quickly gather information on news stories. Their output summaries must meet additional
requirements such as sentence length, overall length, topic coverage and, importantly,
grammaticality. They combine phrase and dependency information into a single data
structure, which allows it to express grammaticality as constraints across phrase
dependencies. Then they encode these constraints through the use of integer linear
programming (ILP), a well-studied optimization framework that is able to search the
entire solution space efficiently. A key aspect of their approach is the representation of
content by phrases rather than entire sentences. Salient phrases are selected to form the
summary.
In [40] they also offered a text summarization approach based on sentence
compression, with two goals to be achieved simultaneously: compressions should be
grammatical, and they should retain the most important pieces of information.
These two goals often conflict. As a solution, they devised two models to tackle the
problem: one utilizes a noisy-channel approach while the other utilizes a decision-tree
approach.
Considering the noisy-channel approach, they rely on the hypothesis that a long
string was originally a short string to which someone added additional, optional text.
Compression is thus a matter of identifying the original short string; it is not critical
whether the “original” string is real or hypothetical. As a result, they map the problem
onto a noisy-channel application.
Considering the decision-tree model, as in the noisy-channel approach, they
assume that they are given a parse tree t as input. The goal is to “rewrite” t into a smaller
tree s, which corresponds to a compressed version of the original sentence subsumed by t.
The work in [41] offered another solution to the summarization task through
sentence compression: rather than simply shortening a sentence by deleting words or
constituents, as in previous work, they rewrite it using additional operations such as
substitution, reordering, and insertion. They also presented a new corpus that is suited to
their task along with a discriminative tree-to-tree transduction model that enables them to
detect structural and lexical mismatches. Their model incorporates a novel grammar
extraction method, which uses a language model for coherent output, and can be easily
tuned to a wide range of compression specific loss functions.
2.5.4 Structure Based Summarization
In this section, we present summarization techniques that make use of the
structure of the document: hierarchical summarization, fractal summarization, and other
techniques whose main goal is to generate a summary based on the structure of a given
text.
Hierarchical summarization, as shown in [42-44], depends mainly on two stages:
the first identifies the salience of each sentence in a document and ranks the sentences
accordingly, and the second builds a tree over all sentences such that its root is the
sentence with the highest salience. The advantage of this approach is that the input
document does not have to be in HTML and nothing is assumed about the document
structure; instead, the algorithm is able to infer the document's hierarchical structure
automatically. In addition, the work in [43] employed hierarchical summarization in a
web mail system and showed that the use of hierarchical summarization reduces the
number of bytes per user request by more than half.
Fractal summarization, on the other hand, is derived from the document structure
and fractal theory. Fractals are mathematical objects with a high degree of redundancy:
they are made of transformed repetitive copies of themselves or of parts of themselves.
Similar to geometrical fractals, documents have a hierarchical structure with multiple
levels: chapters, sections, subsections, paragraphs, sentences, and terms. Fractal
summarization generates a brief skeleton of the summary at the first stage, and the details
of the summary at the different levels of the document are generated on demand when
users request them. This summarization method significantly reduces the computation
load in comparison to the generation of an entire summary in one batch by a traditional
automatic summarization algorithm.
In [45] they incorporated this technique to provide handheld devices with
condensed summaries that can fit into their small displays. Their algorithm runs as
follows. The original document is represented as a fractal tree structure according to its
document structure. The fractal value of each node is calculated as the sum of the
sentence weights under the node, where sentence weights are computed by traditional
summarization methods relying on salient thematic features. Among the features used to
score each sentence are the TF*IDF score, the location of the sentence, a heading feature
and cue phrases. Each feature score is then assigned a weight to compute the overall score
of a sentence. Depending on a preset quota, they select the highest ranked sentences of
each fractal node for the summary.
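A minimal sketch of this scoring step is given below: each sentence gets a weighted combination of simple thematic features, the fractal value of a node is the sum of the sentence scores beneath it, and a quota of top sentences is kept per node. The feature set, weights and tree depth here are placeholders of this sketch, not the ones used in [45].

from collections import Counter

def sentence_score(tokens, doc_freqs, position, n_sentences, w_tf=1.0, w_loc=0.5):
    """Illustrative sentence weight: term-frequency salience plus a location bonus."""
    tf = sum(doc_freqs[t] for t in tokens) / max(len(tokens), 1)
    loc = (n_sentences - position) / n_sentences      # earlier sentences score higher
    return w_tf * tf + w_loc * loc

def fractal_summary(sections, quota=1):
    """sections: dict mapping a section name to its list of sentences (token lists).
    Returns the top `quota` sentences per section and each section's fractal value."""
    doc_freqs = Counter(t for sents in sections.values() for s in sents for t in s)
    summary, fractal_value = {}, {}
    for name, sents in sections.items():
        scored = [
            (sentence_score(s, doc_freqs, i, len(sents)), s)
            for i, s in enumerate(sents)
        ]
        fractal_value[name] = sum(score for score, _ in scored)   # node value = sum of weights
        summary[name] = [s for _, s in sorted(scored, key=lambda x: -x[0])[:quota]]
    return summary, fractal_value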
Experiments [45] have shown that fractal summarization outperforms traditional
summarization. According to [45], fractal summarization can achieve up to 88.75%
precision and 84% on average, while traditional summarization can achieve a maximum
of 77.5% precision and 67% on average. In fact, fractal summarization is applicable in
social networks for summarizing events that have a certain structure and contain large
amounts of text, such as events on Facebook (www.facebook.com).
Comparing the hierarchical approach to the fractal approach, the fractal
summarization method relies on the input Web document being formatted in HTML in
order to infer its structure, which is not required in the hierarchical approach. Regarding
social networks, the hierarchical approach allows the use of a mobile phone for checking
one's mail messages by letting the user access hierarchical summaries of the items in his
or her inbox.
Another attempt to tackle the problem of structure-based summarization is
presented in [46], which proposes a novel approach to summarizing XML documents. The
novelty of this approach lies in the fact that it is based on features drawn not only from the
content of documents, but also from their logical structure.
Their sentence extraction-based method [46] employs a novel machine learning
approach based on the area under the ROC curve (AUC, see
http://gim.unmc.edu/dxtests/ROC3.htm). The main rationale of this
approach is to automatically combine different features, each being a numerical
representation of a given extraction criterion. Then the summarizer learns how to best
combine sentence features based on its effectiveness at assigning higher scores to
summary sentences than to non-summary sentences. This ordering criterion corresponds
exactly to what the learnt function is used for, i.e. ordering sentences. In other words,
they view the problem of sentence extraction as an ordering task to find which features
are more effective for producing summaries.
They [46] evaluated their model using the INEX [47] and SUMMAC
(http://www.itl.nist.gov/iaui/894.02/related_projects/tipster_summac/cmplg.html) datasets.
Their findings can be summarized as follows: the inclusion of features from the logical
structure of documents increases the effectiveness of the summarizer, and that the novel
machine learning approach is also effective and well-suited to the task of summarization
in the context of XML documents. Furthermore, their approach is generic and is therefore
applicable to elements of varying granularity within the XML tree.
2.5.5 Link Based Summarization
In [48] it has been shown that summaries of hypertext link targets are useful for
user cognition. Therefore, extending this technique may also be useful for summarization.
By using the context and content surrounding a link more efficiently, quality summaries
can be generated [49]. Using hyperlinks as summary markers within the page content
helps in studying how non-linear resources can help us create summaries.
In [49] they introduced two new algorithms to tackle the problem of
summarization by context, where summarization by context brings together two
separate fields: summarization algorithms and the context of Web documents. The first
algorithm uses both the content and the context of the document, while the second one is
based only on elements of the context.
It is shown [49] that summaries taking context into account are usually much
more relevant than those made only from the content of the target document. This is
based on the hypothesis that when a document points to another one, it often includes a
description of its link to that page, and that the context of a page is enough to discriminate
it. They define the context of a Web page as the textual content of all the documents
linking to it. These techniques have met some success in their applications; however, they
suffer from some problems, because summaries generated in this manner rely on a
profusion of associative and referential links, whereas the majority of websites lack real
hypertext structures and use purely structural linking [50].
2.5.6 Email Messages Summarization
Over the past few decades, email has become the preferred medium of
communication for many organizations and individuals. In addition, for many users
emails have evolved from a mere communication system into a means of organizing
workflow, storing information and tracking tasks. Moreover, with the great advancements
in mobile technologies, one can easily check one's email anywhere, anytime using a
web-enabled mobile device such as a cell phone or PDA. However, users face some
challenges due to limitations imposed by the mobile environment itself, for example the
small display size and the limited internet connectivity and bandwidth.
Email summarization fits well in addressing some of these challenges. In fact,
email summarization is somewhat different from ordinary summarization tasks, since
emails are usually short, do not always obey grammatical rules, and are typically written
in a chatty style [51].
One approach, presented in [52], works by first extracting simple noun phrases as
candidate units for representing the document meaning, and then using machine learning
algorithms to select the most prominent ones.
Another recent attempt [53] used clue words to provide an email summary of any
length as requested by the user. The authors define “a clue word in node (fragment) F [as]
a word which also appears in a semantically similar form in a parent or a child node of F
in the fragment quotation graph” [53].
In addition, [54] introduced the SmartMail system, a prototype system for
automatically identifying action items (tasks) in email messages. SmartMail presents the
user with a task-focused summary of a message. The final summary is produced by
identifying the task-related sentences in the message and then reformulating each task
related sentence as a brief (usually imperative) summation of the task. Therefore, the user
can simply add these action items to their “to do” list.
Another attempt that went further is presented in [55], where they
proposed two approaches to email thread summarization: Collective Message
Summarization Approach (CMSA) that applies a multi-document summarization
approach and Individual Message Summarization Approach (IMSA) which deals with the
problem as a sequence of single-document summarization tasks. Both approaches are
driven by sentence compression. Instead of a purely extractive approach, they employ
linguistic and statistical methods to generate multiple compressions, and then select from
those candidates to produce the final summary.
Experimental results [55] highlighted two important points. First, CMSA represents
the better approach to email thread summarization. Second, current sentence
compression techniques do not improve summarization performance in this genre.
2.5.7 Customer Review Summarization
With the rapid expansion of e-commerce, people are more likely to express their
opinions and hands-on experiences on products or services they have purchased. These
reviews are important for both business organizations and individual customers.
Companies may alter their marketing strategies based on these reviews.
Accordingly, review summarization can be a solution to this problem. However,
review summarization differs from ordinary text summarization in that it tries to extract
the features being reviewed and to determine whether the customer opinion about each is
positive or negative.
One proposed methodology is feature-based summarization [56], where given a
set of customer reviews of a particular product, the task involves three subtasks: First,
they identify the product features that customers have expressed their opinions on.
Second, for each feature, they identify review sentences that give positive or negative
opinions. Finally, they produce a summary using the discovered information.
Another attempt [57] employed a specialized summarizer for a certain type of
products with certain features. Furthermore, in [58] they proposed an approach that
focuses mainly on object feature based review summarization. They formulate the review
mining task as a joint structure tagging problem. They propose a new machine learning
framework based on Conditional Random Fields (CRFs). A CRF is a discriminative
model which can easily integrate various features; it can employ rich features to jointly
extract positive opinions, negative opinions and object features from review sentences.
In fact, their experiments [58] on movie review and product review data sets
showed that structure-aware models outperform many state-of-the-art approaches to
review mining.
2.5.8 Blog Summarization
Blogs are a self-publishing medium on the Web that has been growing quickly and
becoming more and more popular (http://www.technorati.com/weblog/2006/11/161.html).
Blogs allow millions of people to easily publish, read, respond to, and share their ideas,
experiences and expertise. Thus it is important to
have a tool to find and summarize the most informative and influential opinions from the
massive and complex blogosphere.
In [59] they proposed a blog summarization approach targeting Twitter (www.twitter.com). The
content of such a site is an extraordinarily large number of small textual messages, posted
by millions of users, at random or in response to perceived events or situations. Their
developed algorithm takes a trending phrase or any phrase specified by a user, collects a
large number of posts containing the phrase, and provides an automatically created
summary of the posts related to the term.
However, the work in [60] relied on the hypothesis that the behavior of an
individual is directly or indirectly affected by the thoughts, feelings, and actions of others
in a population. The same relation can be applied on the blogosphere. The conversation
in the blogosphere usually starts from innovators, who initiate ideas and opinions; then
followers are primarily influenced by the opinions of these innovators. Thus the opinions
of the influential innovators represent the millions of blogs and thousands of
conversations on any given topic. Accordingly, they summarize the blogosphere by
capturing the most influential blogs with highly innovative opinions.
2.5.9 Question Answering & FAQ Summarization
Question Answering is the task of automatically formulating an answer to satisfy
the need of a user. In other words it retrieves answers to questions rather than full
documents or best-matching passages. It has recently received attention from the
information retrieval, information extraction, machine learning, and natural language
processing communities as it is a multi-disciplinary task.
In [61] they have done some work on complex question decomposition in order to
help extract accurate and responsive answers for question-driven multi-document
summarization. Typically, complex questions address a topic that relates to many entities,
events and complex relations between them. In their work they presented three methods
for decomposing complex questions and evaluated their impact on the responsiveness of
the answers they enable. Moreover, they argue that the quality of question-focused
summaries depends in part on how complex questions are decomposed. The first method
decomposes questions based on syntactic information, whereas the other two use
semantic and coherence information for question decomposition. Their experimental
results [61] showed that combining the two semantic-based question decomposition
methods achieved the highest responsiveness scores, with an improvement of 25%.
In [62] they proposed a conceptual model for automatically transforming topical
forum articles into FAQ summary, and empirically demonstrated the acceptability of this
model via a prototype system. Their experiment indicated time and manpower savings
in producing FAQs and illustrated the technical feasibility of such a model.
Further research on automatically summarizing multiple documents in [63]
produces frequently asked questions (FAQs) for a topical forum. Their work mainly aimed
at enhancing the FAQ presentation model of [62] together with solving the problem of
domain terminology extraction used in domain identification. The research was based on
the traditional four-part structure of Chinese articles, namely Introduction (I), Body (B),
Sub-theme (S), and Conclusion (C).
Their experiments [63] showed that the informativeness and readability of the
FAQ summary are significantly improved by incorporating the native-language
composition structure, which is more familiar to users' writing and reading style, together
with the topical groups or concepts, in presenting the summary.
In [64] they presented a question answering system named QALC. Their system
consisted of four main modules: question analysis, document retrieval, document
processing and answer extraction.
The question analysis module determines some information about the question:
expected type of the answer, category of the question, keywords, etc.
This information is mainly used by the second module to retrieve related
documents which are then indexed and a subset of the highest ranked ones is kept.
The third module performs named entity tagging on the indexed documents. The
final module is in charge of extracting the answers from weighted sentences: first, the
sentences of the documents are weighted according to the presence of the terms of the
question and of named entities, and their linear distance. Then answers are extracted from
the sentences depending on the expected type of the answer. They proposed using a
syntactic distance between the syntactic tree structures of the question and the answer,
instead of a linear distance between the terms of the question and answer, to select
sentences in their question answering system.
In [65] they introduced a statistical model for query-relevant summarization by
characterizing the relevance of a document to a query on a collection of FAQ documents.
Their approach is an extractive summarization: selecting either complete sentences or
paragraphs for summary generation. They view each answer in a FAQ as a summary of
the document relative to the question which preceded it. That is, a FAQ with N
question/answer pairs comes equipped with N different queries and summaries: the
answer to the K-th question is a summary of the document relative to the K-th question.
They proposed a principled statistical model for answer ranking with a probabilistic
interpretation: given a query q and a document d, the model finds the summary s that best
answers q within d. In other words, they trained a machine learning module to learn the
mapping from a document and a query to a summary.
2.6 Summarization Evaluation
The summarization process is a challenging task in itself, yet another challenging
problem is encountered once we decide to carry out a summarization task: how to
evaluate the results. This is difficult because of the non-uniformity of input and output
across most summarization systems.
There exist two major types of summarization evaluation methods: intrinsic and
extrinsic [2, 3].
Intrinsic evaluation compares automatically generated summaries against some
gold standard summary, an ideal summary that is mostly human generated. Mainly, it
measures the degree of coherence and informativeness. Therefore, if we have a known
good summary or some human-authored gold standard, intrinsic methods are very
suitable.
On the other hand, extrinsic evaluation measures the performance of the
automatically generated summary in relation to performing a particular task; it is also
called task-based evaluation. However, in some cases extrinsic evaluation can be time
consuming and expensive, and thus it requires a careful amount of planning.
Extraction-based summaries are widely evaluated using recall and precision
scores [7]. Given an input text, a human's extract, and a system's extract, these scores
quantify how closely the system's extract corresponds to the human's extract. For each
unit (word, sentence, paragraph, section, etc.), we let correct equal the number of units
extracted by both the system and the human; wrong equal the number of units extracted
by the system but not by the human; and missed equal the number of units extracted by
the human but not by the system. Precision, correct / (correct + wrong), reflects how many
of the system's extracted sentences were good, and recall, correct / (correct + missed),
reflects how many of the human-chosen sentences the system managed to find.
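With correct, wrong and missed defined this way, the two scores reduce to two short formulas; the helper below is a direct transcription of those definitions over sets of extracted unit identifiers.

def extract_precision_recall(system, human):
    """Precision and recall of a system extract against a human extract.
    Both arguments are sets of extracted unit identifiers (e.g. sentence indices)."""
    correct = len(system & human)          # chosen by both
    wrong = len(system - human)            # chosen by the system only
    missed = len(human - system)           # chosen by the human only
    precision = correct / (correct + wrong) if system else 0.0
    recall = correct / (correct + missed) if human else 0.0
    return precision, recall

# Example: the system extracted sentences {1, 3, 5}, the human chose {1, 2, 3, 4}.
print(extract_precision_recall({1, 3, 5}, {1, 2, 3, 4}))   # (0.666..., 0.5)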
Despite the wide use and acceptance of recall and precision in extractive
summary evaluation, this type of evaluation suffers from several problems.
First, the nature of human variation [6-66] poses the problem of the non-uniformity
of human choice: the same summary can receive different recall and precision scores
when compared to different human summaries. This makes it difficult to define a single
gold standard.
Moreover, the system may choose a good sentence for extraction and still be
penalized by P/R evaluation. Therefore, it might be more beneficial to rely on recall,
which highlights the degree of overlap rather than the non-overlap.
Another problem with P/R evaluation is granularity [7-67]: operating at the
sentence level is not the best granularity for assessing the content of a source text.
Sentences vary in length, and different wording conveys different meaning; shorter
sentences are not always better, as sometimes a longer sentence conveys more salient
information. Moreover, two different sentences can still convey the same meaning (which
occurs frequently in multi-document summarization); a human would select only one of
them, and the system is penalized for choosing the other.
Relative utility [67] has been proposed as a way to address the human variation
and semantic equivalence problems in P/R evaluation. It requires that multiple judges
each score each sentence to detect its suitability for inclusion in a summary. They also
address the problem of semantic equivalence implicitly. This approach seems quite
appealing, but unfortunately it requires a great deal of manual effort for sentence tagging.
On the other hand, DUC has been carrying out large-scale evaluations of
summarization systems on a common dataset since 2001. It attracts a good deal of
interest, as twenty sites on average participate in the evaluation process each year.
Although DUC adopts an evaluation approach based on a single human model, it tries to
overcome this drawback by dividing the whole dataset into subsets and assigning each
subset to a different annotator [66].
Additionally, DUC addresses the problem of sentence granularity by creating
elementary discourse units (EDUs) as the basis for evaluation. These EDUs correspond to
clauses; each human-made summary is divided into EDUs, and a system summary is
evaluated by the degree to which it covers the different EDUs in the model. The average
score, called coverage, is the average EDU score for the summary under evaluation. The
measure is recall-oriented, in essence measuring what fraction of the model EDUs were
covered by a summary [66].
DUC also supports abstractive summarization by using human abstracts as models
instead of human selections of sentences, although this requires more human involvement.
The availability of result data from the different systems participating in DUC has
allowed researchers to study the factors that influence the performance of summarization
systems. Analysis of variance (ANOVA, see
http://en.wikipedia.org/wiki/Analysis_of_variance) has shown that the input and the model
creator turned out to be the most significant factors [68] affecting summarization system
evaluation.
Two lines of research on evaluation [7] emerged in an effort to address some of
the issues raised by the DUC evaluation protocol: First, developing cheap automatic
methods for comparing human gold-standards with automatic summaries. Second,
developing better analysis of human variation of content selection, and using multiple
models to avoid result dependence on the gold-standard.
Another important issue to be addressed here in this context is automatic
summarization evaluation measures and how systems can be automatically evaluated
using some automatic evaluation tools.
In fact, the trend of using automatic evaluation tools is not new; it has been
known and widely used especially for machine translation evaluation with the BLEU
technique [69]. BLEU is an automatic machine translation evaluation technique designed
to be easy, fast and inexpensive. Inspired by the success of the BLEU n-gram overlap
based measure, similar n-gram matching was tried for summarization, since machine
translation and text summarization can both be viewed as similar natural language
processing tasks operating on text.
The ROUGE [70] system for automatic evaluation of summarization was developed
using DUC scores and the computation of n-gram overlap between a summary and a
set of models. One of the most appealing merits of ROUGE is that it is recall-oriented,
unlike BLEU, which is precision-oriented; this enables it to correlate better with DUC
coverage scores. Another reason for its good correlation may be its use of a numerous set
of parameters such as word stemming, stop-word removal and n-gram size.
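The core computation behind such a recall-oriented n-gram measure can be sketched as follows. This is a simplified ROUGE-N-style recall over one or more reference summaries, without the stemming and stop-word options mentioned above, so its scores are only indicative of how the real toolkit behaves.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=2):
    """Recall-oriented n-gram overlap: matched reference n-grams / total reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat",
                     ["the cat was on the mat"], n=2))   # 3 of 5 reference bigrams match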
The Pyramid Method [71], on the other hand, is concerned with analysis of the
variation in human summaries, as well as how evaluation results can be made less
dependent on the model used for evaluation. Multiple human abstracts are analyzed
manually to derive a gold-standard for evaluation.
The analysis is semantically driven: information with the same meaning, even
when expressed using different wording in different summaries, is marked as expressing
the same summary content unit (SCU). Each SCU is assigned a weight equal to the
number of human summarizers who expressed it in their summaries. SCU analysis [7]
shows that summaries that differ in content can be equally good, and the method assigns a
score that is stable with respect to the models when 4 or 5 human summaries are
employed. A drawback of this approach is that it is very labor intensive. The method was
also introduced for the evaluation of abstractive summaries, and requires analysis that is
unnecessary for extractive summaries [7]. The different automatic evaluation approaches
give different results, and sometimes it is not totally clear what the scores mean or which
automatic measure is to be preferred. This raises the question of which score to use and
when exactly to use it, which sums up the overall problem researchers face with the
automatic evaluation of summaries.
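To illustrate the SCU weighting, the sketch below computes a pyramid-style score for a peer summary as the total weight of the SCUs it expresses divided by the maximum weight obtainable with the same number of SCUs. This normalization is a common formulation and is assumed here rather than taken verbatim from [71]; the SCU identification itself remains manual.

def pyramid_score(scu_weights, expressed):
    """scu_weights: dict SCU id -> number of human summarizers who expressed it.
    expressed:   set of SCU ids found in the peer summary."""
    achieved = sum(scu_weights.get(scu, 0) for scu in expressed)
    # Ideal score: the heaviest |expressed| SCUs available in the pyramid.
    ideal = sum(sorted(scu_weights.values(), reverse=True)[:len(expressed)])
    return achieved / ideal if ideal else 0.0

# Four human summaries produced SCUs A (weight 4), B (3), C (1); the peer expressed A and C.
print(pyramid_score({"A": 4, "B": 3, "C": 1}, {"A", "C"}))   # 5/7, roughly 0.71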
Chapter 3
TEXT SUMMARIZATION PROJECTS
The following is a list of the most famous and well-known text summarization projects
in both the academic and commercial fields. The list is categorized by the summarization
approach adopted in summary generation: extraction-based, abstraction-based or hybrid.
3.1 Extractive Based Projects
The work done regarding extractive summarization can be categorized by
approaches which explicitly use natural language processing (NLP) techniques based on
computational linguistics and machine learning, and approaches that use non-NLP
techniques. The non-NLP based approaches can in turn be divided into statistical
methods, data mining methods, and knowledge based and question answering methods.
In fact, attempts mostly concentrate on machine learning approaches, statistical
approaches and question answering based approaches. As a result, we will introduce the
most famous text summarization projects covering these three categories. The following
section lists some projects that mainly rely on machine learning techniques.
3.1.1 Machine Learning Based Projects
Neural Summarizer (NeuralSumm) [72] is an automatic text summarizer that is
based upon a neural network that, after training, is capable of identifying relevant
sentences in a source text for producing the extract. The NeuralSumm system makes use
of a machine learning technique and runs in four processes: first, text segmentation;
second, feature extraction; third, classification; and fourth, summary production. In fact,
the learning process is primarily unsupervised, since it is based on a self-organizing map,
which clusters information from the training texts. NeuralSumm produces two clusters:
one that represents the important sentences of the training texts (which, thus, should be
included in the summary) and another that represents the non-important sentences (which,
thus, should be discarded).
ClassSumm [73] summarization project employs a classification system that
produces extracts based on a Machine Learning approach, in which summarization is
considered as a classification task. Actually, it is based on a Naïve Bayes classifier. In
order to perform the summarization process, the system performs the same four processes
employed by NeuralSumm as previously explained. First, text pre-processing is similar to
the one performed by TF-ISF-Summ [74]. Second, features extracted from each sentence
are of two kinds: statistical, i.e., based on measures and counting taken directly from the
text components, and linguistic, in which case they are extracted from a simplified
argumentative structure of the text. Third, summary generation is considered as a two-valued classification problem: sentences should be classified as relevant-to-extract or not.
In other words, according to the values of the features for each sentence, the classification
algorithm must “learn” which ones must belong to the summary. Finally, the sentences to
include in the summary will be those above the cutoff and, thus, those with the highest
probabilities of belonging to it.
The Text Summarization Project (http://www.site.uottawa.ca/tanka/ts.html) is
presented by the University of Ottawa. Unfortunately, few details are available about this
research project beyond what is found in its proposal. The proposal mentions that they use
machine learning techniques to identify keywords in the original text, where keyword
identification signals the importance of the sentences containing those keywords. In
addition, they mention plans to use surface-level statistics such as word or keyword
frequency analysis and possibly some shallow linguistic features such as the position of
sentences within their paragraphs.
The SweSum research project (http://www.nada.kth.se/%7Exmartin/swesum/index-eng.html)
is presented by the Royal Institute of Technology in Sweden. It targets extractive
summaries. Their work resembles the work of ISI
Summarist (http://www.isi.edu/natural-language/projects/SUMMARIST.html). The
summarizer supports both Swedish and English in the newspaper and academic domains.
The main idea behind this summarizer is that it ranks sentences according to weighted
word-level features and then extracts the highest scoring sentences for the summary. The
feature weights were trained on a tagged Swedish news corpus. In addition, this
summarization tool can be integrated with search engine results to give quick extracts.
The Text Analyst text summarization product
(http://www.megaputer.com/html/textanalyst.html) targets the textual content present in
users' offline documents. Fortunately, the official Web site of this summarization product
gives a brief description of the product's system design. It works in three steps. First, it
constructs a semantic network from the source document using a trained neural network;
the vendor stresses that this semantic network construction is fully independent of prior
domain-specific knowledge. Second, it presents the user with a graphical representation of
the concepts and relationships found in the original document, which the user can select
according to his or her preference. Finally, it selects sentences with matching concepts and
relationships for extraction and inclusion in the final summary.
3.1.2 Statistical Based Projects
The following section lists some approaches that mainly rely on statistical
techniques in the process of formulating extractive summaries. Statistical approaches are
based on measures and counts taken directly from the text components.
Gist Summarizer (GistSumm) [75] is an automatic summarizer based on a novel
extractive method, called the gist-based method. It focuses on matching lexical items of
the source text against lexical items of a gist sentence, which is supposed to be the
sentence of the source text that best expresses its main idea and can be determined by
means of a word frequency distribution. GistSumm comprises three main processes: text
segmentation, sentence ranking, and extract production. For GistSumm to work, the
following premises must hold: (a) every text is built around a main idea, namely, its gist;
(b) it is possible to identify in a text just one sentence that best expresses its main idea,
namely, the gist sentence.
The Term Frequency-Inverse Sentence Frequency based summarizer (TF-ISF-Summ)
[74] is an automatic summarizer that makes use of the TF-ISF (term frequency-inverse
sentence frequency) metric to rank sentences in a given text and then extract the most
relevant ones. Similarly to GistSumm, the approach used by this system has three main
steps: text pre-processing, sentence ranking, and extract generation. It adapts Salton's
TF*IDF information retrieval measure [76] in that, instead of signaling which documents
to retrieve, it pinpoints those sentences of a source text that must be included in a summary.
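A minimal TF-ISF ranking can be written directly from this analogy to TF*IDF: term frequency within a sentence times the inverse of the number of sentences containing the term. The tokenization, logarithmic form and averaging below are choices of this sketch, not necessarily those of [74].

from collections import Counter
from math import log

def tf_isf_rank(sentences, top_k=2):
    """Rank sentences by their mean TF-ISF value and return the top_k of them."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Sentence frequency: in how many sentences does each term occur?
    sf = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = sum(tf[t] * log(n / sf[t]) for t in tf) / max(len(toks), 1)
        scores.append((score, i))
    return [sentences[i] for _, i in sorted(scores, reverse=True)[:top_k]]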
The ISI Summarist summarization project is presented by the University of
Southern California. It targets Web document summarization. It has been integrated with
the Systran translation system (http://www.systranet.com/translate), developed by Systran
Software Inc. of La Jolla, CA, in order to provide a gisting tool for news articles in any
language supported by Systran. The overall system design can be divided into three main
modules. First, a topic identification module applies statistical techniques to surface
features such as word position and word count; the authors also plan to use cue phrases
and discourse structure in topic identification. Second, the identified concepts are
interpreted into a chain of lexically connected sentences. Third, the sentences with the
most general concepts are extracted. Though the extracted sentences form good
summaries, their plan for future work is to enhance the extracted sentences to generate a
more coherent summary.
The Open Text Summarizer (OTS, http://libots.sourceforge.net/) is an open source
tool for summarizing texts. The program reads a text and decides which sentences are
important and which are not. It ships with Ubuntu, Fedora and other Linux distributions.
OTS supports many (25+) languages, which are configured in XML files. OTS
incorporates NLP techniques via an English
language lexicon with synonyms and cue terms as well as rules for stemming and
parsing. These are used in combination with a statistical word frequency based method
for sentence scoring.
Copernic Summarizer [77] is a multilingual summarizer that can operate on four
languages: English, French, Italian and Spanish. It can summarize Word documents, Web
pages, PDF files, email messages and even text from the Clipboard. It can be integrated
into applications such as Internet Explorer, Netscape Navigator, Adobe Acrobat, Acrobat
Reader, Outlook Express, Eudora, Microsoft Word, and Outlook. They operate based on
their patent pending WebEssence technology, which automatically removes irrelevant
text and content (such as ads and navigation items) from Web pages, Copernic
Summarizer focuses on the essential text elements, which results in even more relevant
summaries. This is achieved through the following steps [77]. First, they utilize a
Statistical model (S-Model). This model is used in order to find the vocabulary of the text
to detect similar concepts. Second, they apply a Knowledge Intensive Processes (KIP). It
imitates the way in which human make summary texts by taking into account the
following steps. (a) Language detection where it detects the language (English, German,
French or Spanish) of the document for applying its specific process. (b) The limits of
sentence recognition, where it identifies the beginning and endings of each sentence. (c)
Concept extraction where it uses some machine learning techniques to extract keywords.
(d) Document Segmentation, where it organizes the information that it can be divided
into larger related segments. (f) Sentence Selection, where it selects sentences according
to their importance (weight) discarding those that decrease readability and coherence.
Microsoft Office Word Summarizer is a text summarizer that can be found in
Microsoft Office Word 2003 and Microsoft Office Word 2007. This tool can generate
summaries by stating the desired number of sentences or words, or by stating a percentage
of the words in the original text. It offers various ways of visualizing summaries; one is
highlighting important sentences in the original document. The summary created is the
result of an analysis of keywords, which are selected by assigning a score to each word.
The most frequent words in the document receive the highest scores, are considered
important, and are thus included in the summary.
The InText 31 text summarization product targets the textual content present in
users' offline documents. Unfortunately, there are no specific details about the operation
of this product on its official Web site. However, the vendor states that it extracts key
sentences by using keywords, although the exact technique is not mentioned. In addition,
the description mentions that the user may choose one of several extraction techniques.
3.1.3 Question Answering Based Projects
The FociSum 32 project is presented by Columbia University. It adopts a question
answering approach to summarization. First, it extracts sentences that answer key
questions about event participants, organizations, places and other question types; the
result of this stage is simply a concatenation of sentence fragments extracted from the
original document. Second, the system finds the foci of the document under consideration
using a named entity extractor, and a question generator is then used to define the
relationships between the extracted entities. Third, the system parses the document on the
basis of syntactic form to find candidate answers to the previously generated questions.
Finally, the output is a set of sentence fragments and clauses that represent the original
document.
MITRE's WebSumm 33 text summarization product targets the textual content
present in single or multiple Web documents. It can be integrated with search engines.
The output of this product is an extractive summary whose selection is directed by the
user's query, which can be considered a question to be answered. The main idea behind
this system is that it represents all source document(s) as a network of sentences. In
response, the system uses the query terms to extract or select those nodes related to the
31 http://www.islandsoft.com/products.html
32 http://www.cs.columbia.edu/%7Ehjing/sumDemo/FociSum/
33 http://www.mitre.org/news/the_edge/july_97/first.html
specific query words. Moreover, this product can handle similar and contrasting
sentences across the different documents.
3.2 Abstractive Based Projects
Abstractive summarization systems alter the original wording of sentences by
merging information from different sentences, removing parts of sentences, or even
adding new sentences based on document understanding. The work done on
abstractive summarization mainly uses natural language processing (NLP) techniques
based on computational linguistics and machine learning. In fact, abstractive
approaches are very labor intensive and require deep automatic understanding of
documents. As a result, there exist few projects based on this approach in comparison to
extractive approaches.
The TRESTLE 34 research project is presented by the University of Sheffield. It
targets summaries in the news domain. Unfortunately, not many details about the exact
system architecture are available on its official Web site. From the general information
we have, we know that it applies a concept identification module based on the
recommendations about information style presented by the Message Understanding
Conference 35 (MUC). It then uses the identified concepts to assign degrees of importance
to the different sentences, and formulates the summary based on the information present
in those sentences.
The Summons 36 research project is presented by Columbia University. It
targets multi-document summaries in the news domain. The system is designed over the
results of a MUC-style information extraction process and is based on a template with
instantiated slots of pre-defined semantics. The summary is then generated by a
sophisticated natural language generation stage, which consists of three sub-stages:
content selection, sentence planning and surface generation. The system benefits from
using the notion of templates; this
34 http://nlp.shef.ac.uk/trestle/
35 http://en.wikipedia.org/wiki/Message_Understanding_Conference
36 http://www.cs.columbia.edu/%7Ehjing/sumDemo
is because templates usually have well-defined semantics. Therefore, summaries produced
by this type of generation are of high quality, comparable to professional human
abstracts. One apparent drawback of this project is that it is domain specific, relying on
news article templates for the information extraction stage.
3.3 Hybrid Based Projects
In this section we show hybrid approaches which combine extraction based
techniques with more traditional natural language processing techniques to produce
abstractive summaries.
The Cut and Paste system 37 targets single-document, domain independent texts. The
system is designed to operate on the results of sentence extraction summarizers and then
performs key concept extraction from the extracted sentences. The stage following
concept identification combines these concepts to form new sentences. In other words,
the system cuts the surface form of the extracted key concepts and then pastes them into
new sentences. This is done in two main steps. First, it reduces sentences by removing
any extraneous information, a process known as sentence compaction, using probabilities
learnt from a training corpus. Second, the reduced sentences are merged using rules such
as adding extra information about the speakers, merging common sentence elements and
adding conjunctives.
The MultiGen system 38 is presented by Columbia University. It targets
multi-document summaries in the news domain. The main idea behind this system is that
it extracts sentence fragments that can be considered key points of the information
presented in the given set of related documents. This task is accomplished in three main
steps. First, machine learning techniques are used to group paragraph-sized units of text
into clusters of related topics. Second, sentences from these clusters are parsed and the
resulting parse
37 http://www.cs.columbia.edu/%7Ehjing/sumDemo/CPS/
38 http://www.cs.columbia.edu/~hjing/sumDemo/
trees are merged to form a logical representation of the commonly occurring concepts. Third, the
resulting logical representation is turned back into sentences using the FUF/SURGE
grammar 39. The matching of concepts is performed via linguistic knowledge such as
stemming, part-of-speech tagging, synonymy and verb classes.
39 http://www.cs.bgu.ac.il/surge/index.html
Chapter 4
PROPOSED APPROACH FOR FAQ WEB PAGES
SUMMARIZATION
This research targets FAQ Web page text summarization. Our approach is based
on segmenting Web pages into Q/A block units and extracting the most salient sentences
out of them. FAQs are generally well-structured documents, so the task of extracting the
constituent parts (questions and answers in our case) is amenable to automation.
Consequently, after a FAQ Web page is correctly segmented into Q/A units, we select
the best sentence(s) from the answer to each question based on some salient features.
The proposed approach is English language dependent.
In this chapter, the first section gives an overview of the methodology we propose
to tackle the problem of summarizing FAQ Web pages. Next, we give a brief overview of
Web page segmentation in general and how it captures the semantic content residing in a
given page. Additionally, we show how we employ Web page segmentation to comprehend
the FAQ Web page structure in the form of questions plus answers. Finally, we describe
the details of the feature selection we employ and the logic behind how the features are combined.
4.1 Proposed Methodology Overview
FAQ Web pages are typically organized as a question with a heading followed by its
answers in a different style; this is almost a standard for building a FAQ page. In most
FAQ Web pages the text is not scarce, giving summarizers a good opportunity to work.
Moreover, this type of page is more informative than most other types of Web pages, as it
targets the questions and concerns of a Web site's visitors. The questions may also be
grouped according to their degree of relatedness to answer a set of semantically related
issues. These observations made it clear that FAQ Web page summarization may benefit
from utilizing a Web page segmentation algorithm to correctly identify each question and
its answer(s) as a first step, and later to summarize it.
In fact, this research's main goal is to extract sentences for the final summary based
on some selection features that signal the importance of certain sentences. Related research
has shown that when humans summarize a document they tend to use ready-made text
passages (extracts) [78]: 80% of the sentences in their summaries closely matched the
source document, while the remainder was a set of new sentences added based on
understanding the document. As a result, most automatic text summarization approaches
depend on selecting sentence(s) from the source document based on salient document
features, such as thematic, location, title, and cue features.
Moreover, extractive approaches tend to be faster, easily extendable, and retain
most of the structure of the original document instead of flattening it. However, their main
disadvantage is that, with certain techniques, the extracted sentences may become
misleading in terms of structure.
After FAQ Web pages are correctly segmented into question and answer units
utilizing a segmentation algorithm, we apply the selection feature modules used to form
the final summary.
4.2 Web Page Segmentation
A very common mistaken notion is that Web pages are the smallest indivisible
units of the Web [79]. In fact, Web pages consist of various content units, such as
navigation, decoration, interaction and contact information, that may not be directly
related to the main topic of the page [79]. A single Web page may cover multiple
topics that distract the reader from the main topic of the page. Therefore, being
able to understand the actual semantic structure of the page can aid the summarization
process enormously. Knowing the relationship between all units residing in the Web page
(in our case the different headings) will uncover their different degrees of importance
relative to each other.
4.2.1 Applying Web Page Segmentation to FAQ Web Pages
As mentioned earlier, FAQ Web pages are organized such that a question has a
higher-level heading or distinct style, followed by its answer with a lower-level heading
and a different style. Therefore, it makes sense to use the algorithm described in [80],
which extracts the hierarchical structure of Web documents based on the identification of
headings and the relationships between the different headings.
One objective of using this algorithm is to help us filter out most of the
misleading units residing in the Web page. As mentioned before, Web pages usually
contain other content (decoration, navigation, contact information, etc.) besides the text,
which is our main interest.
Another objective of applying this algorithm is to determine the boundaries of the
question and answer units. We could have retrieved the whole text residing in the Web
page and then processed it, but we might lose some valuable information hidden in the
structure of the answer. For example, if the answer to a question is divided into lower sub-headings,
each sub-heading either conveys a different type of information or a different
degree of importance. The figure below shows an example illustrating that case.
How do I add CNN.com to my list of favorites?
If you are using Internet Explorer:
• Open the CNN home page http://www.cnn.com/ on Internet Explorer and click on Favorites.
• Click on "Add to favorites". A window will open confirming that Internet Explorer will add this page
to your Favorites list and confirm the name of the page.
• Click OK to continue.
• You may also file CNN within a folder in your list of Favorites.
• Click Create In to file the page in an existing folder or click New Folder to add another folder to your
list.
If you are using Netscape Navigator:
• Open the CNN.com home page http://www.cnn.com/ in Netscape.
• Click Bookmarks (on the upper left of the page, find Bookmarks next to the Location and URL).
• Choose Add Bookmark to automatically add CNN to your list of bookmarked web sites.
• Choose File Bookmark to file your CNN bookmark in a separate folder.
• Bookmarks can also be found under the Window option on your menu bar.
Figure 4.1 An Example of Logically Divided Answer into Sub Headings.
A good answer to the previous example should consider both headings, the one
about Internet Explorer and the one about Netscape Navigator. Lacking the knowledge
that these are two different but equally weighted headings would result in a non-informative
summary. A good conclusive 25% summary of the previous question would be
something like this:
If you are using Internet Explorer:
• Open the CNN home page http://www.cnn.com/ on Internet Explorer and click on
Favorites.
• Click on "Add to favorites". A window will open confirming that Internet Explorer will
add this page to your Favorites list and confirm the name of the page.
If you are using Netscape Navigator:
• Open the CNN.com home page http://www.cnn.com/ in Netscape.
• Click Bookmarks (on the upper left of the page, find Bookmarks next to the Location and
URL).
Figure 4.2 An Example of a Good Summary to the Question in Figure 4.1.
A bad 25% summary of the previous question, lacking that knowledge, would be
something like this:
If you are using Internet Explorer:
• Open the CNN home page http://www.cnn.com/ on Internet Explorer and click on
Favorites.
• Click on "Add to favorites". A window will open confirming that Internet Explorer
will add this page to your Favorites list and confirm the name of the page.
• Click OK to continue.
You may also file CNN within a folder in your list of Favorites.
Figure 4.3 An Example of a Bad Summary to the Question in Figure 4.1.
4.3 Features Selection
In the literature, many features have been proposed; some tackle specific
problems while others are thought to be more general. As previously mentioned, [8]
proposed that the frequency of a particular word in an article provides a useful measure
of its significance. Related work in [9] provides early insight into a particular
feature that was thought to be very helpful, namely sentence position in the paragraph.
By experiment, the author showed that 85% of the time the topic sentence is the first one and
only 7% of the time it comes last, while the other cases are randomly distributed.
On the other hand, the work in [10] was based on the two features of word
frequency and positional importance which were incorporated in the previous two works.
Additionally, two other features were used: the presence of cue words (for example,
words like significant, fortunately, obviously or hardly), and the structure of the document
(whether the sentence is in a title or heading). In [20-21] they account for a set of features
to generate their summaries. In [20] they proposed an approach that takes into account
several kinds of document features, including position, positive keyword, negative
keyword, centrality, and resemblance to the title, to generate summaries. Positive
keywords are defined as the keywords frequently included in the summary. In contrast,
negative keywords are the keywords that are unlikely to occur in the summary. The
centrality of a sentence implies its similarity to others, which can be measured as the
degree of vocabulary overlap between it and other sentences. In other words, if a sentence
contains more concepts identical to those of other sentences, it is more significant.
Generally, the sentence with the highest centrality denotes the centroid of the document.
The resemblance-to-title feature implies that the more a sentence overlaps with the
document title, the more important it is. These last two features are basically used when
formulating generic summaries.
In [21] they account for some other features such as inclusion of named entities,
relative sentence length and aggregated similarity. The similarity feature is simply the
vocabulary overlap between the two nodes (two sentences) under consideration divided by
the length of the longer sentence (for normalization). Aggregate similarity measures the
importance of a sentence by summing the similarity weights of the links connecting a node
(sentence) to other nodes, instead of merely counting them. Weights were attached to each
of these features manually to score each sentence. In conclusion, these features perform
differently on different text types depending on different kinds of evidence: sometimes a
feature works perfectly on a certain problem type and behaves badly on another. Another
observation concerns the number of features included: the larger the number of features
included in summary generation, the harder the weight assignment scheme. In addition,
including unnecessary, non-problem-dependent features may ruin the weighting scheme
and negatively influence the other features. Therefore, when devising a feature set to
solve a summarization problem, one must carefully study the problem at hand to highlight
the most useful feature set and to include a feasible number of features so that the weights
can be adjusted.
In response, given our problem of summarizing FAQs, we find ourselves
approaching the domain of question answering, which is the task of automatically
answering a posed question. Consequently, in order to define the features useful for
selecting the most suitable sentences from an answer paragraph to a given question,
one has to determine the type of question being posed so as to determine the most
appropriate answer.
In fact, questions can be classified based on the expected type of answer. The
simplest widely accepted question type classification distinguishes three basic types
of question 40.
First, Yes/No questions, where the expected answer is explicitly either "Yes" or
"No". An example of this type of question is depicted in Table 4.1.
Second, question word questions, where the answer provides a certain degree of
informative detail. An example of this type of question is depicted in Table 4.2.
Third, choice questions, where the answer is mentioned in the question itself. An
example of this type of question is depicted in Table 4.3.
Auxiliary verb   Subject   Main verb              Answer "Yes or No"
Do               you       want dinner?           Yes, I do.
Can              you       drive?                 No, I can't.
Has              she       finished her work?     Yes, she has.
Table 4.1 Yes/No Question Type Examples.
40 www.englishclub.com/grammar/verbs-questions_types.htm
Question word   Auxiliary verb   Subject   Main verb       Answer "Information"
Where           do               you       live?           In Paris.
When            will             we        have lunch?     At 1pm.
Why             hasn't           Tara      done it?        Because she can't.
Table 4.2 Question Word Question Type Examples.
Auxiliary verb   Subject   Main verb                     Answer "In the question"
Do               you       want tea or coffee?           Coffee, please.
Will             we        meet John or James?           John.
Did              she       go to London or New York?     She went to London.
Table 4.3 Choice Question Type Examples.
In light of the previous question type classification, we came to the conclusion
that we have to consider selection features that can address each type. Moreover, we aim
at finding a general solution that, without knowing the exact type of question, is still able
to find a good answer to it. Therefore, the main idea in reaching our goal comes down to
choosing selection features that target each type of question and then finding a way to
combine them linearly.
For the first question type, the first sentence in the answer text usually contains
the primary answer, while the sentences that come after it just add extra details.
Therefore, a selection feature that gives a higher score to opening sentences is highly
preferable for answering that type of question.
On the other hand, to answer the second and third question types, one may need
semantic similarity measures to weight the answer sentences against the question
sentence. In addition, the answer may contain named entities denoting people, places,
organizations, products, etc., which signal the importance of the sentences containing
them. Thus, one might utilize selection features that highlight sentences possessing
capitalized words.
Based on the analysis of the different question types, we propose four selection
feature criteria that we believe help answer the three main question types and, in turn,
extract the most salient sentences of the source text for the final summary. Each feature
is given a weight whose calculation is discussed later. The four features are then linearly
combined in a single equation to give each sentence an individual score based on the
degree of importance assigned to the four different features.
4.3.1 Semantic Similarity Feature
The first feature used in developing the FAQWEBSUMM system is referred to as
"Similarity". It evaluates the semantic similarity between two sentences: the question
sentence and each sentence in the answer. This feature was explicitly chosen to answer
questions of types two and three. In fact, semantic similarity is a confidence score that
reflects the semantic relation between the meanings of two sentences. We use a similarity
score calculator developed by Dao and Simpson [81]. The similarity evaluation relies on
dictionary-based algorithms that capture the semantic similarity between two sentences
and is heavily based on the WordNet 41 semantic dictionary.
In fact, WordNet is a lexical database which is available online and provides a
large repository of English lexical items. It is designed to establish connections between
four types of parts of speech: noun, verb, adjective and adverb. It contains each word with
both its explanation and its synonyms. The specific meaning of one word under one part
of speech is called a sense. Each group of words having the same sense is
41 http://wordnet.princeton.edu/
combined together as they share the same synonymous meaning. Each group has a gloss
that defines the concept it represents. For example, the words night, nighttime and dark
constitute a single group that has the following gloss: the time after sunset and before
sunrise while it is dark outside. Groups are connected to one another through some
explicit semantic relations. Some of these relations (hypernym and hyponym for nouns,
hypernym and troponym for verbs) constitute is-a-kind-of hierarchies, while others
constitute is-a-part-of (meronymy for nouns) hierarchies.
Mainly, the algorithm in [81] evaluates how similar two sentences are to each
other. The process of evaluating the similarity between two sentences is done in five
main steps:
• First, partition the sentences into a list of words and then tokens. This can be done
through the process of extracting the basic forms of words by removing stop
words and abbreviations.
• Second, perform part-of-speech tagging to identify the type of each word (noun, verb,
etc.). In [81] they employed two types of taggers: the first attaches syntactic
roles to each word (subject, object, ...) and the second attaches only functional
roles (noun, verb, ...).
• Third, perform stemming, which is the process of removing morphological and
inflectional endings of words. In [81] they used the Porter stemming algorithm. It
works as follows: first, it splits words into possible morphemes to get an
intermediate form, and then it maps stems to categories and affixes to meanings,
e.g. foxes -> fox + s -> fox.
• Fourth, perform semantic disambiguation, which can be defined as the process of
finding the most appropriate sense for a given word in a given sentence. In [81]
they modified and adapted the Lesk algorithm [82], which mainly counts the
number of words that are shared between two glosses. The algorithm in [81] runs
as follows:
1. Select a context: to optimize computational time when the sentence is long,
they define a context of K words around the target word (or K-nearest
neighbors) as the sequence of words surrounding the target word. This
reduces the computational space, which decreases the processing time. For
example, if K is four, there will be two words to the left of the target word
and two words to the right.
2. For each word in the selected context, they look up and list all the
possible senses of both noun and verb word types.
3. For each sense of a word, they list the following relations:
o Its own gloss/definition that includes example texts that
WordNet provides to the glosses.
o The gloss of the synsets that are connected to it through the
hypernym relations. If there is more than one hypernym for a word
sense, then the glosses for each hypernym are concatenated into a
single gloss string.
o The gloss of the synsets that are connected to it through the
hyponym relations.
o The gloss of the synsets that are connected to it through the
meronym relations.
o The gloss of the synsets that are connected to it through the
troponym relations.
4. Combine all possible gloss pairs gathered in the previous steps
and compute the relatedness by searching for overlaps. The overall score is
the sum of the scores for each relation pair. When computing the
relatedness between two synsets s1 and s2, the pair hype-hype means the
gloss for the hypernym of s1 is compared to the gloss for the hypernym of s2,
and the pair hype-hypo means that the gloss for the hypernym of s1 is
compared to the gloss for the hyponym of s2, as indicated in [81] by the
following equation.
OverallScore(s1, s2) = Score(hype(s1)-hypo(s2)) + Score(gloss(s1)-hypo(s2)) + Score(hype(s1)-gloss(s2)) + ...   (1)
Note: ( OverallScore(s1, s2) is also equivalent to OverallScore(s2, s1) ).
To score the overlaps they use a new scoring mechanism that differentiates
between single-word and N-consecutive-word overlaps and effectively
treats each gloss as a bag of words. Measuring the overlap between two
strings is reduced to the problem of finding the longest common sub-string
with maximal consecutive words. Each overlap containing N consecutive
words contributes N^2 to the score of the gloss sense combination. For
example, an overlap "ABC" has a score of 3^2 = 9, while the two overlaps
"AB" and "C" have a score of 2^2 + 1^2 = 5 (a small sketch of this scoring
is given after this list).
5. Once each combination has been scored, they pick the sense with the
highest score as the most appropriate sense for the target word in the
selected context space.
• The above method allows us to find the most appropriate sense for each word in a
sentence. To compute the similarity between two sentences, they consider the
semantic similarity between word senses, which they capture based on path
length similarity. Thus, the scoring mechanism works as follows: it builds a
semantic similarity relative matrix R[m, n] of each pair of word senses, where
R[i, j] is the semantic similarity between the most appropriate sense of the word
at position i of sentence X and the most appropriate sense of the word at position
j of sentence Y. The similarity between senses can be computed as in [81] by
equation 2:
Sim(s, t) = 1 / distance(s, t)   (2)
where distance is the shortest path length from sense s to sense t by using node
counting. In other words, R [i,j] is also the weight of edge connecting i to j. The
match results from the previous step are combined into a single similarity value
for two sentences. There are many strategies to acquire an overall similarity of
two sets. To compute the overall score as in [81] they just evaluate equation 3:
Sim(X, Y) = Σ match(X, Y) / (|X| + |Y|)   (3)
where match(X, Y) are the matching word tokens between X and Y.
• Finally, the similarity is computed by dividing the sum of similarity values of all
match candidates of both sentences X and Y by the total number of set tokens.
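The consecutive-overlap scoring idea of step 4 above can be sketched as follows. This is only an illustration under the assumption of a greedy longest-common-run search over word lists; it is not the exact implementation of [81].

    import java.util.*;

    public class GlossOverlapScorer {
        // Repeatedly finds the longest common run of consecutive words between the two
        // glosses, adds (run length)^2 to the score, removes the matched words, and repeats.
        public static int overlapScore(List<String> gloss1, List<String> gloss2) {
            List<String> a = new ArrayList<>(gloss1);
            List<String> b = new ArrayList<>(gloss2);
            int score = 0;
            while (true) {
                int bestLen = 0, bestI = -1, bestJ = -1;
                for (int i = 0; i < a.size(); i++) {
                    for (int j = 0; j < b.size(); j++) {
                        int len = 0;
                        while (i + len < a.size() && j + len < b.size()
                                && a.get(i + len).equals(b.get(j + len))) {
                            len++;
                        }
                        if (len > bestLen) { bestLen = len; bestI = i; bestJ = j; }
                    }
                }
                if (bestLen == 0) break;
                score += bestLen * bestLen;                // N consecutive words contribute N^2
                a.subList(bestI, bestI + bestLen).clear(); // consume the matched words
                b.subList(bestJ, bestJ + bestLen).clear();
            }
            return score;
        }

        public static void main(String[] args) {
            // Overlaps "A B" (2^2 = 4) and "C" (1^2 = 1) give a total score of 5.
            System.out.println(overlapScore(
                    Arrays.asList("A", "B", "C"),
                    Arrays.asList("A", "B", "X", "C")));
        }
    }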
We use this method to compute the similarity between the question sentence(s) and
the answer sentences. In fact, we give each sentence in the answer a numeric score
representing its degree of similarity to the question as computed by this similarity
measure. This selection measure is considered the primary feature of the FAQWEBSUMM
system, as it responds to two question types out of three, as previously mentioned. After
the score is computed, it is normalized based on the highest value obtained among the
answer sentences.
4.3.2 Query Overlap Feature
The second feature used in FAQWEBSUMM is referred to as "Query
Overlap". This feature is a simple but very useful method, as stated in [31-34-65]. It is
also chosen to answer questions of types two and three. It scores each sentence by the
number of desirable words it contains. In our case the query is formed from the question
sentence(s). The question is tokenized and tagged using a POS tagger to identify the
important word types. It was found that certain word types are more important than
others; for example, nouns, adjectives and adverbs are found to be more informative as
they state the main purpose of the text.
The system extracts the following word types: nouns, adverbs, adjectives and
gerunds (which represent verbs) from each question, and formulates a query with the
extracted words. The system then scores each sentence of the answer by the frequency of
occurrence of the query words.
Unlike the similarity score of the first feature, this feature performs a direct match:
a word in the query has to appear exactly the same in the answer, individually and not in
relation to the other words in the question. This is because the first feature takes into
account the whole sentence, and less semantically related words lower the overall score,
a consequence we try to avoid here. However, this feature is spawned from the first one
as a natural extension. In fact, the logic behind including two selection features that
target semantically related content is that in two out of three question types (word
questions and choice questions), or 66% of cases, the expected answers are directly
related to the question.
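For illustration, the query overlap score can be sketched as below. The Brown-style tag prefixes and the helper class are assumptions made for the sketch, not the actual FAQWEBSUMM code.

    import java.util.*;

    public class QueryOverlapScorer {
        // Builds a query from the question's informative words (given as word/POS-tag pairs)
        // and counts how many words of an answer sentence appear in that query.
        public static int overlapScore(List<String[]> taggedQuestionWords, List<String> answerWords) {
            Set<String> query = new HashSet<>();
            for (String[] wordAndTag : taggedQuestionWords) {
                String word = wordAndTag[0].toLowerCase();
                String tag = wordAndTag[1];
                // Keep nouns, adjectives, adverbs and gerunds (Brown-style tags assumed).
                if (tag.startsWith("nn") || tag.startsWith("jj")
                        || tag.startsWith("rb") || tag.equals("vbg")) {
                    query.add(word);
                }
            }
            int score = 0;
            for (String w : answerWords) {
                if (query.contains(w.toLowerCase())) score++;   // exact, direct match only
            }
            return score;
        }
    }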
4.3.3 Location Feature
The third feature used in the FAQWEBSUMM system is the location feature.
As previously depicted, the significance of a sentence is indicated by its location, based
on the hypothesis that topic sentences tend to occur at the beginning or sometimes at the
end of documents or paragraphs [2-3-4]; this is especially relevant for Yes/No questions.
We simply calculate this feature for each sentence in the answer by giving the
highest score to the first sentence and then decreasing the score for the following sentences.
For example, if we have a paragraph of five sentences, the first sentence takes a score of 5
and the following sentence takes a score of 4, until we reach the last sentence in the
paragraph with a score of 1. Later, these scores are normalized for combination with the
other features.
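A minimal sketch of this positional scoring, before normalization (the class name is our own):

    public class LocationScorer {
        // The first sentence of an answer gets the highest score (n) and the last gets 1;
        // the caller later normalizes these values by the maximum within the answer.
        public static double[] score(int numSentences) {
            double[] scores = new double[numSentences];
            for (int i = 0; i < numSentences; i++) {
                scores[i] = numSentences - i;
            }
            return scores;
        }
    }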
4.3.4 Word Capitalization Feature
The fourth feature used in the FAQWEBSUMM system is the capitalized word
frequency. The logic behind this feature is that capitalized words are important, as stated
previously, especially in response to the second question type, which requires listing a
higher degree of detail. They are important because they often stand for a person's name,
corporation's name, product's name, country's name, etc. Therefore, they usually signal
importance for the sentences containing them, telling the summarizer that these sentences
are good candidates for inclusion as they tend to carry a piece of information that is
salient enough to be worth consideration.
This feature is simply computed by counting the number of capitalized words in a
given sentence. These scores are later normalized for combination with the other features.
We have also set a threshold on the number of capitalized words in a given sentence: if
the number of detected capitalized words exceeds this value, it means that all the text is
written in capitals. In that case it would be meaningless to give importance to such
sentences, as they all have the same format, so we ignore the feature by giving it a zero
score value.
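A minimal sketch of this feature; the threshold parameter and class name are illustrative assumptions, since the thesis does not fix a particular value here:

    public class CapitalizationScorer {
        // Counts capitalized words in a sentence; if the count exceeds the all-caps
        // threshold, the sentence is assumed to be written entirely in capitals and
        // the feature carries no information, so the score is set to zero.
        public static int score(String[] words, int allCapsThreshold) {
            int count = 0;
            for (String w : words) {
                if (!w.isEmpty() && Character.isUpperCase(w.charAt(0))) count++;
            }
            return count > allCapsThreshold ? 0 : count;
        }
    }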
4.3.5 Combining Features
Each of the previous features produces a numeric score that reflects how important
it considers each sentence in the answer. The total score is a linear combination of the
four features, but they are not equally weighted; the weights depend on how we consider
the contribution of each feature to the final score. We compute the score of each sentence
as follows:
Total Score of Sentence(s) = α1(Similarity(s)) + α2(Query Overlap(s)) + α3(Location(s)) + α4(Uppercase Words(s))   (4)
where α1, α2, α3 and α4 are weights given to each feature according to its expected
contribution to the overall score. A pilot experiment was performed to determine the
optimum weights to be included in the final formula. The next section gives details of the
findings of this experiment.
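A compact sketch of Eq. 4, assuming each feature score has already been normalized to [0, 1] within the answer; the weight values themselves are determined only by the pilot experiment described next.

    public class FeatureCombiner {
        // Linear combination of the four normalized feature scores (Eq. 4).
        public static double totalScore(double similarity, double queryOverlap,
                                        double location, double uppercaseWords,
                                        double a1, double a2, double a3, double a4) {
            return a1 * similarity + a2 * queryOverlap
                    + a3 * location + a4 * uppercaseWords;
        }
    }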
4.4 Pilot Experiment “Detecting Optimum Feature Weights”
4.4.1 Objective
The main objective of this experiment is to determine the optimum weight of each
feature in eq.4.
4.4.2 Experiment Description
The proposed method for achieving this objective is as follows:
• First, formulate a summary by utilizing the scoring of one feature at a time.
• Second, compare the features' scores and rank them according to their
performance gain.
• Finally, devise a weighting scheme (i.e., give a range to each feature weight) and
evaluate it in the same way as in step one to make sure that the devised
methodology is sound and correct.
The experiment was done on the basis of one FAQ Web page with 29 questions in
total, which represents a sample of approximately 8% of the dataset we use. Fifteen
questions were of the Yes/No type, while fourteen were informative questions of type
two. Each summary was evaluated by one human evaluator.
The evaluation was done on the basis of whether the summary is readable,
informative and short. The compression rate used in generating these summaries was
25%; it was chosen to give a satisfactorily readable and understandable compressed
summary. The evaluator gives a score ranging from very bad to excellent according to
whether the summary meets the preset quality criteria and to what extent.
4.4.3 Results
The figure below shows a graphical comparison of the summary scores for all 29
questions provided by all features. This is done by giving the quality metric the following
numerical values: Excellent 1, Very Good 0.8, Good 0.6, Bad 0.4 and Very Bad 0.2. The
detailed score comparison between all features, along with a sample summary output, is
presented in Appendix A.
[Figure: line chart of per-question summary scores (0 to 1) for F1 Similarity, F2 Location, F3 Query and F4 Capitalized Words over questions 1 to 29.]
Figure 4.4 Features' Scores Comparison.
The table below shows the average score produced by each feature in response to
the given questions.
Feature                  Average Score
F1 Similarity            0.800
F2 Location              0.793
F3 Query Overlap         0.558
F4 Capitalized Words     0.496
Table 4.4 Summary Evaluation for Selection Features.
4.4.4 Discussion
Based on the above, we can see that the semantic similarity feature F1 produced the
highest average score in response to all questions, while the location feature F2 came
second, the query overlap feature F3 third and the capitalized words frequency F4 last.
Moreover, we can observe that the highest-valued and most constant curves were those of
F1 and F2, as they always succeed in generating an output even if it is not of great quality.
On the other hand, the answer sentences may not contain any capitalized words,
which results in summary generation failure, as is apparent in the curve above. The
same applies to F3: sometimes the answer contains no direct association between
query words and answer sentence words, hence the need to compare word senses as is
done in the similarity feature. However, we can notice that in some cases F3 and
F4 show sudden improvement in comparison to the other features.
In conclusion, it was found that each of the features mentioned above performed
well in some cases, based on different kinds of evidence. However, we believe that if we
combine the features' contributions based on the above scores, we can improve the overall
score and avoid the cases where a feature fails by backing it up with the other features.
As a result, we proposed the following weighting scheme. We give the highest
weight (α1) to the sentence similarity feature, as its average score was the best amongst
the features. Next, we give the second highest weight (α3) to the location feature, as its
average score was the second best. The query overlap feature is given a weight (α2) lower
than the location feature, as it is considered a spawn of the similarity feature and scored
the third-best average. The capitalized words feature is given the least weight (α4) based
on the same logic. The weights can take arbitrary values but must obey the following
inequality:
α1 > α3 > α2 > α4   (5)
Applying this weighting scheme to the same experimental data used above for
evaluating the features individually, we found the following:
• First, the average score in answering all questions was 0.834, which is higher than
the best value scored by any of the previous features individually. This means
that the proposed combination methodology improved the overall score.
• Second, the feature combination avoids the cases where some features fail, as it
turns to an alternative feature when scoring the different sentences.
The detailed scores for all questions, along with a sample summary output for the
combined feature scheme, are presented in Appendix A. In fact, this is quite an encouraging
result, because when different features are combined they may interfere with each other and
produce bad results, which did not occur in our sample test. The final conclusion can
be reached only after performing a large-scale experiment, as will be seen in the experimental
results and analysis chapter. That experiment will show how our proposed methodology
performs in comparison to a real-world commercial application and whether our hypothesis
in developing this scoring scheme stands or not.
Chapter 5
SYSTEM DESIGN AND IMPLEMENTATION
In this chapter we present the adopted system architecture and implementation
details. First, we give an overview of the overall proposed FAQWEBSUMM system
design, describing the different phases of the summarization process. Next, we describe
in detail the two main modules that constitute our system, namely the Pre-processing
module and the Summarization module. Finally, we present the tools used to implement
our system and describe the target environment.
5.1 FAQWEBSUMM Overall System Design
The FAQWEBSUMM system was designed from the very beginning to be as
extendable as possible. As the approach adopted for automatic text summarization in this
research is extraction based, the idea of system expansion or extension directly influenced
the system design. FAQWEBSUMM was designed to have a solid core that allows other
future types of summaries besides FAQs.
The system is composed of two main modules, namely the pre-processing module
and the summarization core module. The pre-processing module is responsible for
preparing the input data and putting it into an appropriate form (Q/A units) ready to be
used as input by the summarization module. In turn, the summarization module is
responsible for summarization quota calculation, sentence scoring by each selection
feature, and finally summary generation, which will be explained in detail later. The
figure below shows the overall system architecture as described above.
[Figure: Input HTML pages → Pre-Processing Module → Summarization Core Module → Final Summary.]
Figure 5.1 FAQWEBSUMM Overall System Architecture
The sequence of operations undertaken by the summarization process, starting
from an input HTML FAQ page up to producing a summary for that page, is as follows.
The system first receives the input HTML pages, which are forwarded to the pre-processing
module. The pre-processing module is divided into two internal modules.
The first internal pre-processing module runs in two main steps. The first step is
to run the segmentation tool and construct the Web page semantic structure. The output
of this step is an XML file describing the entire page in the form of semantic segments,
which will be described in detail afterwards.
The second step is responsible for building a parser interface to the XML output
file produced by the segmentation module. The parser enables us to comprehend the
segment structure of the given pages, where segments correspond to fragments of the
Web page. However, the segments produced by the segmentation tool are not simply the
desired Q/A segments. Thus we needed to include another pre-processing module, a
Question Detection module, to filter out segments that do not correspond to real Q/A
units. After Q/A units are detected correctly, we proceed to the summarization module to
score answer sentences and later select the best ones for summary generation.
5.2 Pre-processing Stage
As stated earlier, Web pages are divided into a set of segments. A segment mainly
consists of a higher-level heading, which in our FAQ case is the question sentence(s),
along with some text that serves as the answer(s) under a lower-level heading. In fact, the
answer may be scattered over a set of child segments if it is in the form of points, bullets
or sub-paragraphs defined by smaller sub-headings. The pre-processing module is
responsible for filtering out irrelevant segments, leaving only Q/A segments. This is done
in two main steps. First, all segments in the given page are detected based on their heading
level, which is provided by the segmentation tool. Then some filtering criteria are defined
and applied to allow only Q/A segments into summary generation.
5.2.1 Primary Pre-processing
This module runs the segmentation tool on the given page, and the output is a set
of segments. These segments can be described as follows. The two main attributes that
describe a segment are the heading and the text. The heading may be a question, if it is a
relevant segment, or it may be any heading title in the page, along with some text that
serves as the answer. There are some other attributes such as segment ID, parent ID, level,
order, length and number of sub-segments. Definitions of these attributes are given
in Figure 5.2.
Segment ID: A unique identifier for the segment throughout the page.
Parent ID: The parent segment identifier.
Heading: The heading text.
Level: Indicates a segment's level within the document's hierarchy.
Order: Indicates the segment's position with respect to its siblings, from left to right.
Length: Indicates the number of words in the segment.
Text: Indicates the main text within the segment.
NoOfSubSegments: Indicates the number of children this segment possesses.
Figure 5.2 Segment Attributes Definitions.
Normally, Web pages are divided into a set of semantically related segments
having a hierarchical relation in the form of parent-child relationships. In fact,
sometimes the answer to a question is logically scattered across sub-paragraphs, bullets
or points having lower-level headings while still falling under the same heading.
Detecting this type of relation gives us the opportunity to produce conclusive summaries
as a result of knowing the differences between sub-headings. The following figure shows
the hierarchical relations of a given page.
[Figure: a Document node with three level-1 segments (Parent ID = 0); one of them (Segment ID = 1) has two level-2 child segments (Segment IDs 4 and 5, Parent ID = 1).]
Figure 5.3 Web Page Segmentation Example
After Web pages are segmented and put into the previous form, FAQWEBSUMM
maps these segments to its internal hierarchical structure. The internal
structure is divided into Page, Q/A, Sentence, Word and Token. The Page consists of a
number of Q/A units. A Q/A unit consists of the question, which may contain one or more
sentences, along with the answer, which is another set of sentences. A sentence consists
of a set of words, which in turn consist of a set of tokens or letters. Devising this hierarchy
helped in attaining more control over the different constructs.
The segmentation tool provides us with only two constructs (Document and a set of
Segments) from which we build our own internal structure. The Document construct
maps to our Page construct, which represents the whole page with all the questions in it. The
Segment construct maps to our Q/A unit. Unfortunately, the Segment construct only
contains raw text in its heading and text nodes, representing the question and the
answer text with no explicit mapping to sentence or word constructs.
In fact, we perform two steps to identify both constructs correctly. The first step
is to extract sentences from the raw text found in the segment. This is done by applying a
sentence boundary disambiguation (SBD) module, which addresses the natural language
processing problem of deciding where sentences begin and end 42. This module is
very important because, if sentences are mistakenly detected, the final summary will be
badly evaluated; the whole summarization system is built on sentence extraction, and the
compression rate is computed based on the number of sentences in each Q/A unit.
However, this is a very challenging task because punctuation marks are often ambiguous.
For example, a period may denote an abbreviation, a decimal point, an ellipsis, or part of
an email address rather than the end of a sentence. About 47% of the periods in the Wall
Street Journal corpus denote abbreviations [83]. In addition, question marks and
exclamation marks may appear in embedded quotations, emoticons, computer code, and
slang. The most straightforward approach to detecting sentences is to look for a period;
if the preceding token is not an abbreviation and the following token starts with a capital
letter, then the period most probably marks the end of a sentence. This approach [83]
works in most cases with an accuracy of around 90%, depending on the list of
abbreviations used. We used a Java sentence detection 43 library offered by
LingPipe 44 that is trained over two types of corpora: an English news text corpus and
an English biomedical text corpus. Its accuracy of over 90%, as stated by LingPipe,
qualified it to be used by our system. The figure below shows the input and
output of the second step of pre-processing.
42 http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
43 http://alias-i.com/lingpipe/Web/demo-sentence.html
44 http://www.alias-i.com/lingpipe/
[Figure: XML File "A" → Sentence Boundary Disambiguation (SBD) Module → XML File "B" → proceed to next phase.]
Figure 5.4 FAQWEBSUMM Second Stage of Pre-processing.
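The system itself relies on LingPipe's trained sentence detector, but the abbreviation-aware heuristic described above can be sketched as follows; the abbreviation list and the regular expression are illustrative assumptions only.

    import java.util.*;
    import java.util.regex.*;

    public class SimpleSentenceSplitter {
        private static final Set<String> ABBREVIATIONS =
                new HashSet<>(Arrays.asList("mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."));

        // Splits at '.', '!' or '?' followed by whitespace and a capital letter,
        // unless the token ending with the period is a known abbreviation.
        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<>();
            Matcher m = Pattern.compile("[.!?]+(?=\\s+[A-Z])").matcher(text);
            int start = 0;
            while (m.find()) {
                String candidate = text.substring(start, m.end()).trim();
                String lastToken = candidate.substring(candidate.lastIndexOf(' ') + 1).toLowerCase();
                if (ABBREVIATIONS.contains(lastToken)) continue;   // not a real boundary
                sentences.add(candidate);
                start = m.end();
            }
            if (start < text.length()) sentences.add(text.substring(start).trim());
            return sentences;
        }
    }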
The next step after sentences are successfully detected is to identify words and
their tokens. In addition, we need to identify the different word types, as they are used by
the selection features involved. This is done by applying a Part-of-Speech (POS) Tagging
module; POS tagging 45 is also called grammatical tagging or word category
disambiguation.
POS tagging can be defined as the process of marking up the words in a text (corpus) as
corresponding to a particular part of speech, based on both a word's definition and its
context, i.e. its relationship with adjacent and related words in a phrase, sentence, or
paragraph. In other words, it is the identification of the word type, for example noun,
verb, adjective, adverb, etc. We use part-of-speech tagging to detect the type of the words
in the sentences under consideration. This helps us detect the exact types of words with
different levels of importance according to the criteria defined in the selection process,
which will be described later. The part-of-speech tagger used in FAQWEBSUMM
is offered by LingPipe 46. It is a general English tagger derived from the Brown
Corpus 47, the first major corpus of English for computer analysis, which was
developed at Brown University by Henry Kucera and Nelson Francis. It consists of
about 1,000,000 words of running English prose text, made up of 500 samples from
45 http://en.wikipedia.org/wiki/Part-of-speech_tagging
46 http://alias-i.com/lingpipe/Web/demo-pos.html
47 http://en.wikipedia.org/wiki/Brown_Corpus
randomly chosen publications. The figure below shows the input and output of the third
step of preprocessing.
[Figure: XML File "B" → Part of Speech Tagging (POS) Module → XML File "C" → proceed to next phase.]
Figure 5.5 FAQWEBSUMM Third Stage of Pre-processing.
The XML file now has the same basic structure as the output of the Web segmentation
tool: a root node representing the whole document, followed by a
set of nodes representing the different segments. Each segment node contains two main
nodes: the heading node, which contains the question text, and the text node, which
contains the question's answer. The text within these two nodes is the input to the two
previously mentioned modules of sentence detection and part-of-speech tagging. The
sentence detection module creates an XML node with identifier "s" to represent a
sentence and divides the text into a set of s nodes. The part-of-speech tagger also creates
a different node, called the token node, where it sets the token type, whether noun, verb,
adverb, etc. The figure below shows an example of the XML file structure after
performing this stage on an input question.
<Segment>
<SegmentID>20</SegmentID>
<Heading>
<s i="0">
<token pos="nn">Q</token>
<token pos=".">.</token>
<token pos="do">Do</token>
<token pos="ppss">we</token>
<token pos="vb">draw</token>
<token pos="abn">all</token>
<token pos="dts">these</token>
<token pos="nn">picture</token>
<token pos=".">?</token>
</s>
</Heading>
<ParentID>19</ParentID>
<Level>6</Level>
<Order>1</Order>
<Length>22</Length>
<Text>
<s i="0">
<token pos="np">A</token>
<token pos=".">.</token>
<token pos="rb">No</token>
<token pos="--">-</token>
<token pos="cc">and</token>
<token pos="cs">that</token>
<token pos="'">'</token>
<token pos="vbz">s</token>
<token pos="at">the</token>
<token pos="jj">whole</token>
<token pos="nn">point</token>
<token pos=".">.</token>
</s>
<s i="1">
<token pos="dts">These</token>
<token pos="ber">are</token>
<token pos="at">the</token>
<token pos="jj">original</token>
<token pos="nns">paintings</token>
<token pos="vbn">produced</token>
<token pos="in">by</token>
<token pos="at">the</token>
<token pos="nns">studios</token>
<token pos=".">.</token>
</s>
</Text>
<NoOfSubSegments>0</NoOfSubSegments>
</Segment>
Figure 5.6 XML File Structure Example after SBD and POS Stages.
5.2.2 Secondary Pre-processing
The input data is now ready to be processed by the FAQWEBSUMM system. A
file reader module needs to be built to translate the XML file into the internal structure
described earlier. The file reader expects the XML file to be in the previously described
form. This pre-processing phase is very crucial to the whole system, as it may introduce
many undesired faults if the structure is not adequately mapped to the internal
FAQWEBSUMM structure. The output of this phase is the expected internal
representation of the page in the form of the Page, Q/A, Sentence and Word structure. The
figure below shows the input and output of the fourth step of pre-processing.
[Figure: XML File "C" → File Reader & Parser → internal form (Page, Segment, Paragraph, etc.) → proceed to next phase.]
Figure 5.7 FAQWEBSUMM Fourth Stage of Pre-processing.
The pre-processing module then continues with some more specific steps to put the
data into the exact form needed for summarization. FAQWEBSUMM does some more
pre-processing to remove so-called bad segments. Bad segments are defined by
FAQWEBSUMM as those segments that meet one or more of the following criteria. Note
that the segmentation tool provides us with all segments detected in a Web page, which
include decoration, navigation or any other irrelevant segments besides the Q/A segments.
The first criterion is being an empty-text, non-question segment. We therefore
implemented a question detection module to identify and filter out non-question segments.
In fact, this is a challenging task as well, as falsely detecting questions results in
introducing bad or unnecessary segments.
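The question detection criterion can be sketched with a simple heuristic such as the one below; the question-word list and the "Q." prefix handling are illustrative assumptions, and the actual module may use additional cues.

    import java.util.*;

    public class QuestionDetector {
        private static final List<String> QUESTION_STARTERS = Arrays.asList(
                "what", "why", "how", "when", "where", "who", "which",
                "do", "does", "did", "is", "are", "can", "could", "will", "should", "may");

        // A segment heading is treated as a question if it ends with '?' or starts
        // with a typical question word or auxiliary verb.
        public static boolean isQuestion(String heading) {
            String h = heading.trim().toLowerCase();
            if (h.isEmpty()) return false;
            if (h.endsWith("?")) return true;
            h = h.replaceFirst("^q[.:)]\\s*", "");   // strip a leading "Q." / "Q:" marker
            String firstWord = h.split("\\s+")[0];
            return QUESTION_STARTERS.contains(firstWord);
        }
    }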
The second criterion is to remove segments that contain only questions with no
answers. This case happens frequently, as many FAQ Web page designers first list all
the questions at the top of the page and then proceed to answer each question on its
own. The segmentation tool detects this form of gathered questions as a complete
segment or set of segments.
The third criterion is to remove segments that contain non-questions in their
heading node and are not children of other question nodes. FAQWEBSUMM has to
differentiate between important segments and segments that are considered noise in the
input. Therefore, the system tries to tailor the output of the segmentation tool, which is
not specifically directed towards a system like FAQWEBSUMM. The figure below
shows the steps of the final specialized FAQ pre-processing.
[Figure: Remove empty non-question segments → Remove segments with questions only → Remove non-question segments → end of pre-processing → proceed to next phase.]
Figure 5.8 FAQWEBSUMM Specialized FAQ pre-processing.
5.3 Summarization Stage
As previously mentioned, the core of the summarization system is extendable,
meaning it is designed to enable the addition of new types of summaries other than FAQ
summaries. For a new summarization type, we just implement its own handling criteria
and the system proceeds in a plug-and-play manner. The core and operation of the
FAQWEBSUMM summarizer are depicted in the figure below.
[Figure: the FAQWEBSUMM summarizer takes the input from the pre-processing phase, calculates a rate for each sentence per Q/A, calculates the Q/A scores, and then formalizes the final summary by sorting the scores and extracting the summary sentences.]
Figure 5.9 FAQWEBSUMM Summarization Core.
The summarization stage is responsible for selecting the most suitable sentence(s)
as an answer to a given question. The summarization algorithm runs in four main steps.
 The first step is to specify the number of sentences to be extracted from each
answer. This is done by multiplying the compression ratio by the total number of
sentences in a given answer.
 The second step is to score all sentences in each answer against the previously
specified four selection features. Figure 5.10 shows pseudo code that illustrates
the first two steps of the summarization stage.
PROCEDURE Summarize_FAQ (Segment QASeg, NUM Comp_Ratio){
  /*First, calculate the number of sentences to be extracted
    from each question's answer*/
  Num_Ext_Sent = floor(Comp_Ratio * Total_Num_Sent);
  /*Second, calculate the combined total score for each
    sentence in the QA segment*/
  Seg_Sentence_Scores = Calc_Seg_Score(QASeg);
  /*Third, sort sentences in descending order of their total
    scores*/
  Sort_Seg_Sentence_Scores(Seg_Sentence_Scores);
  /*Finally, formalize the summary by picking the top ranked
    sentences*/
  Formalize_Summary();
}

PROCEDURE Calc_Seg_Score (Segment QASeg){
  /*This procedure calculates all sentence scores and then
    combines them*/
  for each Answer Sent in QASeg do{
    /*Calculate each of the four scores and normalize it to 1
      by dividing all values by the highest score value*/
    Sim_Score = Calc_Similarity_Score(QASeg);
    Sim_Score = Normalize(Sim_Score);
    Loc_Score = Calc_Location_Score(QASeg);
    Loc_Score = Normalize(Loc_Score);
    Query_Score = Calc_Query_Score(QASeg);
    Query_Score = Normalize(Query_Score);
    Capital_Score = Calc_Upper_Words_Score(QASeg);
    Capital_Score = Normalize(Capital_Score);
    /*The score combination follows Eq. 1*/
    Total_Sent_Score = Sim_Score*0.5 + Loc_Score*0.25
                     + Query_Score*0.15 + Capital_Score*0.1;
    All_Sent_Scores.Add(Total_Sent_Score);
  }
  return All_Sent_Scores;
}
Figure 5.10 Pseudo Code of Summarization Algorithm
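The Normalize call above is not spelled out in the pseudo code. A minimal sketch of the max-normalization described in its comment (dividing every value by the highest score so that the largest becomes 1) could look like the following Python; it is assumed here to operate on the list of scores of all sentences in an answer:

def normalize(scores):
    # Scale a list of feature scores so that the highest score becomes 1.0.
    highest = max(scores) if scores else 0.0
    if highest == 0.0:
        return [0.0 for _ in scores]
    return [score / highest for score in scores]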
The feature calculations are computed as follows:
 The similarity score is computed by performing the following set of
actions on both the question sentence(s) and each of the answer sentences.
First, perform word stemming. Second, find the most appropriate sense for
every word. Third, build a semantic similarity relative matrix for each pair
of words. Fourth, compute the overall similarity score by dividing the sum
of similarity values of all match candidates of both sentences under
consideration by the total number of set tokens. The following figure
shows pseudo code for calculating the feature score.
PROCEDURE Calc_Similarity_Score (Segment QASeg) {
  /*Given two sentences X and Y, we build a semantic similarity
    relative matrix R[m, n] of each pair of word senses*/
  for each Answer Sent in QASeg do {
    Sum_X = 0; Sum_Y = 0;
    |X| = Num of Words in Question Sent;
    |Y| = Num of Words in Answer Sent;
    for each Word i in Question Sent do {
      Max_i = 0;
      for each Word j in Answer Sent do {
        if (R[i, j] > Max_i)
          Max_i = R[i, j];
      }
      Sum_X += Max_i;
    }
    for each Word j in Answer Sent do {
      Max_j = 0;
      for each Word i in Question Sent do {
        if (R[i, j] > Max_j)
          Max_j = R[i, j];
      }
      Sum_Y += Max_j;
    }
    Overall_Sim = (Sum_X + Sum_Y) / (|X| + |Y|);
    All_Sent_Scores.AddSentScore(Overall_Sim);
  }
  return All_Sent_Scores;
}
Figure 5.11 Pseudo Code for Calculating Similarity Feature Score.
 The location feature is computed by giving the first sentence a score
equal to the total sentence count, the next sentence the total count minus
one, and so on. The following figure shows pseudo code for calculating
the feature score.
PROCEDURE Calc_Location_Score (Segment QASeg) {
  Num_Sent = QASeg.GetNumAnswerSent();
  for each Answer Sent in QASeg do {
    Sent_Score = Num_Sent - Sent_Index;
    All_Sent_Scores.AddSentScore(Sent_Score);
  }
  return All_Sent_Scores;
}
Figure 5.12 Pseudo Code for Calculating Location Feature Score.
 The query overlap feature is computed in two steps. First, we formulate a
query from the question sentence(s) by extracting the previously specified
word types. Then we count the frequency of the query words in each
sentence. The following figure shows pseudo code for calculating the feature score.
PROCEDURE Calc_Query_Score (Segment QASeg) {
  Query = ExtractQuery(QASeg.GetQuestion());
  for each Answer Sent in QASeg do {
    Query_Words = 0;
    for each Word in Answer Sent do {
      if Word is in Query THEN
        Query_Words++;
    }
    All_Sent_Scores.AddSentScore(Query_Words);
  }
  return All_Sent_Scores;
}

PROCEDURE ExtractQuery (Question q) {
  for each Word in q DO {
    if Word Type is noun OR adjective OR adverb OR gerund THEN
      Add Word to Query;
  }
  return Query;
}
Figure 5.13 Pseudo Code for Calculating Query Overlap Feature Score.
 Finally, the capitalized words frequency is computed by counting the
occurrences of capitalized words in each sentence. After each step we
normalize all scores so that they can be combined. Figure 5.14 shows
pseudo code for calculating the capitalized words feature score.
PROCEDURE Calc_Upper_Words_Score (Segment QASeg) {
  Num_Sent = QASeg.GetNumAnswerSent();
  for each Answer Sent in QASeg do {
    Upper_Case = 0;
    for each Word in Answer Sent do {
      if Word is Capitalized THEN
        Upper_Case++;
    }
    All_Sent_Scores.AddSentScore(Upper_Case);
  }
  return All_Sent_Scores;
}
Figure 5.14 Pseudo Code for Calculating Capitalized Words Feature Score.
 The final step in the score computation is the linear combination, as shown in the
pseudo code in Figure 5.10.
 The third step in the summarization stage is sorting the sentences in descending
order of their total scores.
 The final summarization step is the summary formation, which selects the top
scored sentences in response to each question for summary generation; a minimal
sketch of these last two steps is given below.
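The following short Python sketch illustrates these last two steps, assuming the scored sentences are held in parallel lists; the real Sort_Seg_Sentence_Scores and Formalize_Summary routines live in the C++ core, and restoring the original sentence order in the output is an assumption made here for readability:

import math

def formalize_summary(sentences, scores, compression_ratio=0.25):
    # Number of sentences to extract, as in the first summarization step.
    num_extract = max(1, math.floor(compression_ratio * len(sentences)))
    # Sort sentence indices in descending order of their total scores.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Pick the top scored sentences and emit them in their original order (assumed).
    chosen = sorted(ranked[:num_extract])
    return [sentences[i] for i in chosen]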
5.4 FAQWEBSUMM System Implementation Issues
In this section we list the tools used to implement FAQWEBSUMM and the target
environment. The FAQWEBSUMM system modules were developed in multiple
programming languages.
First, the Web segmentation module is developed in ASP.NET 48 as a Web application,
implemented using the Visual Studio 2005 49 tool. It is a module external to
FAQWEBSUMM: the input HTML pages are introduced to the Web segmentation
module and its output takes the form of XML files.
48 http://www.asp.net/
49 http://msdn.microsoft.com/en-us/library/ms950416.aspx
Second, the primary preprocessing modules that perform sentence boundary
disambiguation and part-of-speech tagging are implemented in Java, using the
NetBeans 6.1 IDE 50. This choice was made because the LingPipe SDK offers helper
libraries that support these tasks well. NetBeans is an open source integrated
development environment (IDE) for developing software in Java, JavaScript, PHP,
Python, Ruby, Groovy, C, C++, Scala, Clojure, and others. The NetBeans IDE can run
anywhere a Java Virtual Machine (JVM) is installed, including Windows, Mac OS,
Linux, and Solaris. A Java Development Kit (JDK) is required for Java development
functionality, but is not required for development in other programming languages. In
addition, it allows applications to be developed from a set of modular software
components called modules. The 6.1 release provides improved performance, with
lower memory usage and faster execution.
Third, the output of the two Java preprocessing modules is a processed XML file.
The core of the internal preprocessing, along with the feature calculation and handling,
is implemented as an MFC 51 8.0 C++ application. The Microsoft Foundation Class
Library (also Microsoft Foundation Classes or MFC) is a library that wraps portions of
the Windows API in C++ classes, including functionality that enables them to use a
default application framework. Classes are defined for many of the handle-managed
Windows objects and also for predefined windows and common controls. MFC 8.0 was
released with Visual Studio 2005 and has a large amount of third-party resources
available for it. We mainly use it for developing our application due to its efficiency and
performance. The target environment to run all these applications is Microsoft Windows
XP, while the minimum system requirements to run them are as follows:
50 http://netbeans.org/community/releases/61/
51 http://en.wikipedia.org/wiki/Microsoft_Foundation_Class_Library
Requirement | Professional
Processor | 600 MHz processor; Recommended: 1 gigahertz (GHz) processor
RAM | 192 MB; Recommended: 256 MB
Available Hard Disk Space | 300 MB of available space required on system drive
Operating System | Windows 2000 Service Pack 4, Windows XP Service Pack 2, Windows Server 2003 Service Pack 1, or Windows Vista. For a 64-bit computer: Windows Server 2003 Service Pack 1 x64 editions, or Windows XP Professional x64 Edition
CD-ROM Drive or DVD-ROM Drive | Required
Video | Recommended: 1024 x 768, High Color 16-bit
Mouse | Microsoft mouse or compatible pointing device
Previously Installed Software | Microsoft .NET Framework; Java SDK 1.4
Table 5.1 FAQWEBSUMM System Requirements.
Chapter 6
SYSTEM EVALUATION
In this chapter, we present the evaluation procedure and data for our newly
adapted FAQ Web page summarization method in comparison with the commercial tool
"Copernic Summarizer 2.1". First, we briefly explain why we chose Copernic
Summarizer to evaluate our system against. Second, we present our evaluation
methodology. Third, we describe our evaluation dataset. Fourth, we introduce the
experimental results generated by the experiments we conducted, and then discuss and
analyze these results in relation to our proposed hypothesis.
6.1 Why Copernic Summarizer?
Several reasons motivated us to use Copernic Summarizer as a competitor in
evaluating our system. Copernic Summarizer is a widely used commercial multilingual
summarizer that can also summarize various types of documents [77]. Moreover, it was
previously used as a benchmark to evaluate other summarization methods [22, 84].
In fact, in [84] it was used to evaluate a question answering approach that is
somewhat similar to our work. Additionally, [85] presented a study of Copernic
Summarizer, Microsoft Office Word Summarizer 2003 and Microsoft Office Word
Summarizer 2007, with the objective of detecting which of them gives summaries most
similar to those made by a human. The summaries were evaluated with the ROUGE
system, and Copernic Summarizer scored the best results in comparison to the other
summarizers.
6.2 Evaluation Methodology
The evaluation was designed on the basis of a twenty-five percent compression
ratio for both FAQWEBSUMM and the Copernic Summarizer. This ratio was chosen to
give satisfactorily readable and understandable compressed summaries. Five human
evaluators were employed to evaluate the summaries of the FAQ pages produced by
both FAQWEBSUMM and Copernic. Each evaluator was requested to evaluate the
extracted sentences that can be considered the most important ones for a given FAQ
Web page.
There are two main quality metrics that the evaluators considered while carrying
out their evaluation. First, does the number of extracted sentences obey the twenty-five
percent compression ratio? Second, how good is the quality of each selected sentence
and to what extent does it best answer the question under consideration? The quality
matrix is shown below.
Page Name: Page x - part (A)

Question | Very Bad | Bad | Good | Very Good | Excellent | Numeric Score
Question 1 | | | | | 1 | 1
Question 2 | | | 1 | | | 0.6
Question n | | | | 1 | | 0.8
Table 6.1 FAQ Web Page Evaluation Quality Matrix.
The table above is given to each evaluator twice for each page, once for the
FAQWEBSUMM system and once for the Copernic Summarizer. The evaluator has the
original FAQ Web page along with the summary from each system and evaluates every
question in each page without knowing which summary comes from which system, for
integrity purposes; the two summaries are simply tagged A and B, one for each system.
Each evaluator gives a score ranging from Very Bad to Excellent, as depicted above,
according to whether the summary meets the preset quality criteria and to what extent.
After the evaluator finishes the page evaluation, we compute a numeric score as follows:
Excellent is given a value of 1, Very Good 0.8, Good 0.6, Bad 0.4 and Very Bad 0.2.
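As a small illustration of this scoring step (a sketch only, not the tooling actually used in the experiments), the mapping and the per-page average can be computed as follows:

RATING_VALUES = {"Excellent": 1.0, "Very Good": 0.8, "Good": 0.6, "Bad": 0.4, "Very Bad": 0.2}

def page_score(ratings):
    # Convert one evaluator's per-question ratings into a numeric page average.
    numeric = [RATING_VALUES[r] for r in ratings]
    return sum(numeric) / len(numeric)

# Example: ratings of Excellent, Good and Very Good give (1.0 + 0.6 + 0.8) / 3 = 0.8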
This helps in plotting graphs of the same questions across different pages and across
different evaluators, computing an average score for each page, and cross-referencing
the evaluation of one evaluator with the others.
Copernic formulates its summary by detecting the important concepts of the
whole page and summarizing on that basis. To make the comparison fair, we fed the
Copernic Summarizer each question and its answers as an individual document and then
concatenated the questions and their summaries to form the summary of the page.
All in all, we use two comparison methodologies in evaluating both systems.
The first compares the average scores of each question in each page and, in turn, of all
the pages together. The second applies the widely used Student's t-test 52 to find out
whether the scores produced by the two systems for each page are statistically
different.
In simple terms, the t-Test compares the actual difference between two means in
relation to the variation in the data (expressed as the standard deviation of the difference
between the means). The formula for the t-Test is a ratio. The top part of the ratio is just
the difference between the two means or averages. The bottom part is a measure of the
variability or dispersion of the scores. The difference between the means is the signal
that, in this case, we think our program or treatment introduced into the data; the bottom
part of the formula is a measure of variability that is essentially noise that may make it
harder to see the group difference 53 . It can be calculated by using the following
equation 54 .
52 http://en.wikipedia.org/wiki/Student%27s_t-test
53 http://www.socialresearchmethods.net/kb/stat_t.php
54 http://www.okstate.edu/ag/agedcm4h/academic/aged5980a/5980/newpage26.htm
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}        eq. (6)

where \bar{X}_1 and \bar{X}_2 are the mean evaluation scores of the two systems, s_1^2 and s_2^2 are the variances of the scores, and n_1 and n_2 are the numbers of scores in each group.
We apply it by choosing the level of significance required (normally p = 0.05) and
read the tabulated t value. If the calculated t value exceeds the tabulated value we say that
the means are significantly different at that level of significance. A significant difference
at p = 0.05 means that if the null hypothesis were correct (i.e. the samples do not differ)
then we would expect to get a t value as great as this on less than 5% of occasions. So we
can be reasonably confident that the evaluation scores do differ from one another, but we
still have nearly a 5% chance of being wrong in reaching this conclusion 55 .
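As an illustration of how the test can be applied to the per-question scores of one page, the following short Python sketch computes the t statistic of eq. (6); it assumes an unpaired two-sample form of the test (the thesis does not state which statistical software was used):

import math
from statistics import mean, variance

def t_value(scores_a, scores_b):
    # Difference of the two means divided by the pooled standard error (eq. 6).
    standard_error = math.sqrt(variance(scores_a) / len(scores_a) +
                               variance(scores_b) / len(scores_b))
    return (mean(scores_a) - mean(scores_b)) / standard_error

# The computed |t| is then compared with the tabulated t value at p = 0.05;
# if it exceeds the tabulated value, the difference is reported as significant.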
Statistical tests allow us to make statements with a degree of precision, but cannot
actually prove or disprove anything. A significant result at the 95% probability level tells
us that our data are good enough to support a conclusion with 95% confidence (but there
is a 1 in 20 chance of being wrong).
55 http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress4a.html
6.3 Dataset
Due to the lack of FAQ summarization datasets, we created our own dataset.
Twenty-two FAQ Web pages were collected manually and at random, covering nine
different disciplines in order to span a wide and diverse range of FAQ topics, namely:
Software related topics, Customer Services, Business, Art, Health, Society, Academic,
News, and Sports. The dataset has a total of 390 questions with an average of 17
questions per page, while the average number of answers per question is one, with a
total of approximately 2000 sentences.
Appendix B describes the dataset in more detail. It shows the page-domain
mapping, a brief description of the pages, the Web page URLs and the number of
questions residing in each page.
6.4 Experimental Results and Analysis
In this section, we will investigate the experimental results generated by both our
system -FAQWEBSUMM- and the Copernic system. In this research we conducted four
experiments to examine and support the validity of our hypothesis. In our experiments,
we used the dataset that is described in section 6.3.
The first experiment tests the summarization performance quality of our system
against the Copernic system in terms of which system produces short quality summaries.
The second and the third experiments are derived from the experimental results of the
first experiment.
The second experiment tests the summarization quality with respect to the
questions' discipline, to see whether the summaries we produce in certain disciplines are
better than in others.
The third experiment measures and compares the human evaluators'
performance in evaluating both summarizers.
The fourth experiment analyzes the evaluators’ scores and their homogeneity in
evaluating our system and the Copernic system as well. For each experiment, we state the
experiment’s objective, show the experimental results and finally analyze these results.
6.4.1 Experiment 1 “Performance Quality Evaluation”
6.4.1.1 Objective
The main objective of this experiment is to measure the summarization quality of
our system in comparison to the Copernic system, in terms of which system gives more
readable and informative short summaries. In other words, we compare the numeric
scores given by all human evaluators to all questions in the dataset for both
summarizers.
6.4.1.2 Description
This experiment runs on all questions of all pages residing in the dataset. In this
experiment, eight human evaluators were involved to evaluate the summaries provided
by both summarizers. Each page is evaluated by three different evaluators, not
necessarily the same three for all pages.
The involved human evaluators were asked to score the extracted answer(s) for
each question of all pages in the dataset. Their only criteria are the quality of the
extracted sentences in response to the question(s) under consideration and whether or
not the summary follows the preset compression ratio.
6.4.1.3 Results
Appendix C shows the scores for each question in each page in the form of an
average over three human evaluators, once for our system and once for Copernic. The
overall average scores over all questions in all pages were 0.732 for FAQWEBSUMM
and 0.610 for Copernic, an improvement in favor of our new approach of approximately
20%. In addition, Appendix D shows a sample of summaries and their evaluation scores.
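As a quick arithmetic check of the reported figure (assuming the improvement ratio is taken relative to Copernic's average): (0.732 - 0.610) / 0.610 ≈ 0.20, i.e. a relative improvement of roughly 20%.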
6.4.1.4 Discussion
Based on the above, you can see that, in general, the FAQWEBSUMM system
performs much better than the Copernic summarizer. The overall average scores for all
pages indicate that it is superior to the other system by approximately 20%. This is an
encouraging result that motivates continuing down this path of research in search of
better results. In addition, after applying the t-test to the evaluation scores in
Appendix C, it was found that the differences between the two systems are statistically
significant at the 95% confidence level.
Moreover, figure 6.1 shows a summarized graphical representation that highlights
the performance distribution of both systems. It shows three categories. The blue
category shows how often our system scored higher than Copernic. The purple category
shows how often Copernic scored higher than our system. Finally, the yellow category
shows how often a tie existed between the scores of both systems.
[Figure data: FAQWEBSUMM better in 200 questions (51%); tie in 123 questions (32%); Copernic better in 67 questions (17%).]
Figure 6.1 Performance Distributions.
As you can see, our system performed better in 51% of the cases while Copernic
performed better in only 17% of the cases leaving 32% of the cases where both systems
had the same score.
The figure below shows a summarized graphical representation of the questions'
score distribution. It shows how the scores are distributed among the five scoring
categories, namely Very Bad, Bad, Good, Very Good and Excellent. In addition, for each
category, it shows the number of questions it represents and their percentage of the
total number of questions.
[Figure data: FAQWEBSUMM score distribution: Excellent 118 (30%), Very Good 174 (45%), Good 80 (21%), Bad 13 (3%), Very Bad 5 (1%).]
Figure 6.2 FAQWEBSUMM Score Distribution.
Note that the largest category is Very Good, as 45% of all questions had that score.
The second largest category is Excellent, covering 30% of the answers. This means that
75% of the summaries of all questions got at least a Very Good score. The Good
category covers 21% of the remaining 25%, leaving only 4% to be divided between Bad
(3%) and Very Bad (1%). These results seem very promising and their distribution is
encouraging.
On the other hand, the figure below shows the same as Figure 6.2 but it represents
questions’ score distribution for the Copernic system.
[Figure data: Copernic score distribution: Excellent 64 (16%), Very Good 134 (35%), Good 106 (27%), Bad 39 (10%), Very Bad 47 (12%).]
Figure 6.3 Copernic Score Distribution.
As you can see, the largest category is again Very Good, but with only 35% of the
summaries of all questions receiving that score. The second largest category is Good,
with 27% of the answers. This means that more than half of the questions fall between
Good and Very Good. The Excellent category takes only 16% of the remaining 38%,
leaving 22% to be divided between Bad (10%) and Very Bad (12%).
Note that the score distribution of our system differs markedly from Copernic's
score distribution. In addition, the figure below shows a graphical representation of the
comparison between the score distributions of our system and Copernic in terms of the
number of questions.
[Figure: bar chart of the number of questions per score value (Very Bad, Bad, Good, Very Good, Excellent) for FAQWEBSUMM and Copernic.]
Figure 6.4 Score Distribution Comparison.
As you can see, FAQWEBSUMM has a larger score distribution in the best two
categories Excellent and Very Good covering 75% of the questions space. On the other
hand, Copernic has a slightly higher score in the Good Category along with the other two
low score categories Bad and Very Bad.
6.4.2 Experiment 2 “Page Discipline Performance Analysis”
6.4.2.1 Objective
The main objective of this experiment is to measure the summarization quality of
our system in comparison to the Copernic system in terms of which system gives higher
scores with respect to the different pages’ disciplines. In other words, we test which
system performs better on the different pages’ categories and whether the pages’
discipline has an impact on the summarization quality or not.
6.4.2.2 Experiment Description
This experiment uses the results of the previous run on all pages of the dataset. It
measures the impact of the different pages' disciplines on the performance evaluation.
The results of this experiment are divided into two main parts. First, we show how our
system outperforms the Copernic system by reporting the improvement ratio, in percent,
for the pages in each discipline. Second, we show the t-test values, i.e. whether our
scores are statistically significantly better than Copernic's for the pages in each
discipline.
6.4.2.3 Results
The table below shows the average scores of all questions in each page as given
by its three human evaluators, once for our system and once for Copernic. Overall, the
scores are in favor of the FAQWEBSUMM system.
Domain | Page | FAQWEBSUMM | Copernic | Improvement Ratio
Software (Q1-Q113) | Page 1 (Q1-Q8) | 0.600 | 0.367 | 63.6%
 | Page 2 (Q9-Q43) | 0.699 | 0.512 | 36.5%
 | Page 3 (Q44-Q55) | 0.694 | 0.550 | 26.3%
 | Page 4 (Q56-Q107) | 0.676 | 0.603 | 12.1%
 | Page 5 (Q108-Q113) | 0.488 | 0.522 | -6.4%
 | Average | 0.632 | 0.511 | 23.6%
Customer Service (Q114-Q143) | Page 6 (Q114-Q120) | 0.847 | 0.561 | 50.9%
 | Page 7 (Q121-Q138) | 0.678 | 0.581 | 16.6%
 | Page 8 (Q139-Q143) | 0.720 | 0.767 | -6.2%
 | Average | 0.748 | 0.637 | 17.5%
Business (Q144-Q180) | Page 9 (Q144-Q153) | 0.800 | 0.587 | 36.4%
 | Page 10 (Q154-Q180) | 0.555 | 0.535 | 3.7%
 | Average | 0.678 | 0.561 | 20.8%
Art (Q181-Q245) | Page 11 (Q181-Q209) | 0.788 | 0.618 | 27.5%
 | Page 12 (Q210-Q217) | 0.795 | 0.450 | 76.7%
 | Page 13 (Q218-Q245) | 0.862 | 0.774 | 11.4%
 | Average | 0.815 | 0.614 | 32.8%
Health (Q246-Q308) | Page 14 (Q246-Q254) | 0.867 | 0.600 | 44.4%
 | Page 15 (Q255-Q286) | 0.804 | 0.664 | 21.0%
 | Page 16 (Q287-Q308) | 0.833 | 0.745 | 11.8%
 | Average | 0.835 | 0.670 | 24.6%
Society (Q309-Q344) | Page 17 (Q309-Q318) | 0.813 | 0.647 | 25.8%
 | Page 18 (Q319-Q339) | 0.775 | 0.539 | 43.6%
 | Page 19 (Q340-Q344) | 0.733 | 0.720 | 1.9%
 | Average | 0.774 | 0.635 | 21.8%
News (Q345-Q362) | Page 20 (Q345-Q362) | 0.688 | 0.496 | 38.9%
Academic (Q363-Q378) | Page 21 (Q363-Q378) | 0.696 | 0.683 | 1.9%
Sports (Q379-Q390) | Page 22 (Q379-Q390) | 0.720 | 0.648 | 11.2%
Table 6.2 Average Page Scores Categorized by Discipline.
The figure below shows a graphical representation of the comparison of the quality
scores of our system and Copernic with respect to the pages' discipline.
[Figure: bar chart of the average quality score per page discipline for FAQWEBSUMM and Copernic.]
Figure 6.5 Page Discipline Score Comparison.
A detailed comparison to the scores’ distribution for the different pages’
disciplines for both systems is presented in Appendix C.
Table 6.3 shows the t-test results for the evaluation scores of all question
summaries produced by the two systems, for each page. The overall t value is computed
over all values in each discipline.
Domain | Page | t-Test | Overall t-Test
Software (Q1-Q113) | Page 1 (Q1-Q8) | Significant | Significant
 | Page 2 (Q9-Q43) | Significant |
 | Page 3 (Q44-Q55) | Significant |
 | Page 4 (Q56-Q107) | Significant |
 | Page 5 (Q108-Q113) | Not Significant |
Customer Service (Q114-Q143) | Page 6 (Q114-Q120) | Significant | Significant
 | Page 7 (Q121-Q138) | Significant |
 | Page 8 (Q139-Q143) | Not Significant |
Business (Q144-Q180) | Page 9 (Q144-Q153) | Significant | Significant
 | Page 10 (Q154-Q180) | Not Significant |
Art (Q181-Q245) | Page 11 (Q181-Q209) | Significant | Significant
 | Page 12 (Q210-Q217) | Significant |
 | Page 13 (Q218-Q245) | Significant |
Health (Q246-Q308) | Page 14 (Q246-Q254) | Significant | Significant
 | Page 15 (Q255-Q286) | Significant |
 | Page 16 (Q287-Q308) | Significant |
Society (Q309-Q344) | Page 17 (Q309-Q318) | Significant | Significant
 | Page 18 (Q319-Q339) | Significant |
 | Page 19 (Q340-Q344) | Not Significant |
News (Q345-Q362) | Page 20 (Q345-Q362) | Significant | Significant
Academic (Q363-Q378) | Page 21 (Q363-Q378) | Not Significant | Not Significant
Sports (Q379-Q390) | Page 22 (Q379-Q390) | Not Significant | Not Significant
Table 6.3 t-Test Values for All Pages by All Evaluators.
6.4.2.4 Discussion
Based on the above, you can see that the FAQWEBSUMM approach performs
better than the Copernic summarizer with respect to the average score comparison.
Moreover, after applying the t-test to the evaluation scores, as can be seen in Table 6.3,
it was found that the differences between the two systems are statistically significant for
16 pages, while only 6 pages were found not statistically significant. In addition, the
t-test scores were found significant in seven of the nine domains: Software, Customer
Service, Business, Art, Health, Society and News. Based on these results, most of the
pages showed a statistically significant difference in favor of our system, which is an
excellent outcome.
Moreover, counting the pages in which our tool scores better than the Copernic
tool, we find that our tool scored better in 91% of the pages. The highest improvement
ratio obtained by our tool was 76.7%, for the second Art page, while the highest
improvement ratio for the Copernic tool was 10%, for a human rights page.
On the other hand, if we take a deeper look at Table 6.3, we observe a relation
between the number of questions residing in a page or discipline and the degree of
significance of its score. In our dataset, the lowest number of questions in a page is 5,
while the highest is 52. If we divide the range between the lowest and highest values
into 5 sets, with a pace of 10 questions between each set, we find the following.
 For the first set, with small numbers of questions between 5 and 15, there are
11 pages, 7 of them detected as significant and 4 as not significant.
 For the second set, with 16 to 25 questions, there are 5 pages, 4 of them
detected as significant and only 1 not.
 For the third set, with still more questions, there are 5 pages, 4 detected as
significant and only 1 not.
 For the fourth and fifth sets, there is 1 page in each set, with the highest
numbers of questions, and both were significant.
In other words, the probability of being significant for a small number of
questions, from 5 to 15, was 63%. Moving to mid values of 25 to 35 questions, the
probability was 75%, while for the largest two sets the probability reached 100%. We
can therefore draw the conclusion that the higher the number of questions in a page
summarized by our system, the more probable it is that its scores are significantly better
than Copernic's.
6.4.3 Experiment 3 “Human Evaluators’ Performance Analysis”
6.4.3.1 Objective
The main objective of this experiment is to measure and compare the human
evaluators’ performance in evaluating both summarizers.
6.4.3.2 Experiment Description
This experiment uses the results of the previous run on all pages of the dataset. The
results of this experiment are divided into two main parts. First, we show how our
system outperforms the Copernic system by reporting the improvement ratio, in percent,
for each evaluator. Second, we show the t-test values, i.e. whether the overall scores of
all pages for each evaluator are significantly different from Copernic's or not.
6.4.3.3 Results
The table below shows, for each evaluator, the average scores of the summaries they
were assigned from both summarizers, along with the improvement ratio of our system
over Copernic's.
Evaluator | FAQWEBSUMM | Copernic Summarizer | Improvement Ratio
Evaluator 1 | 0.774 | 0.721 | 7.3%
Evaluator 2 | 0.582 | 0.464 | 25.4%
Evaluator 3 | 0.677 | 0.524 | 29.0%
Evaluator 4 | 0.756 | 0.622 | 21.6%
Evaluator 5 | 0.597 | 0.506 | 17.8%
Evaluator 6 | 0.845 | 0.635 | 33.0%
Evaluator 7 | 0.770 | 0.620 | 24.1%
Evaluator 8 | 0.828 | 0.662 | 25.0%
Average | | | 26.3%
Table 6.4 Evaluators Improvement Ratio.
As you can see, the overall average improvement of our summaries over Copernic's,
as stated by all evaluators, is approximately 26.3% in favor of our new approach.
Table 6.5 shows, for each evaluator, the result of computing the statistical
significance over all pages he or she scored.
Evaluator | Degree of Significance
Evaluator 1 | Not Significant
Evaluator 2 | Significant
Evaluator 3 | Significant
Evaluator 4 | Significant
Evaluator 5 | Not Significant
Evaluator 6 | Significant
Evaluator 7 | Significant
Evaluator 8 | Significant
Table 6.5 t-Test Results Over Evaluator Scores.
As you can see, the individual scores of six out of eight evaluators were found to be
statistically significant. In other words, the scores of 75% of the evaluators signaled a
statistically significant difference between our system's scores and Copernic's.
6.4.3.4 Discussion
Considering the evaluators' data, we found that the average scores of all pages
they evaluated are in favor of our system. The highest improvement ratio, obtained by
the sixth evaluator, was 33%, while the lowest was 7.3%. Moreover, the average
improvement over all evaluators is approximately 26%. This means that 100% of the
evaluators gave better scores, on average, to our system.
On the other hand, after applying the t-test on the pages evaluated by each
evaluator, it showed that in general there is a statistically significant difference between
our system and Copernic. Based on the above results, we can see that the scores of 75%
of the evaluators signaled a statistically significant difference between our system's
scores and Copernic's, while only 25% of the evaluator scores did not.
6.4.4 Experiment 4 “Analyzing Evaluators’ Homogeneity”
6.4.4.1 Objective
The main objective of this experiment is to measure the human evaluators' degree
of homogeneity in scoring summaries for each system. In other words, we test how
similar or different the evaluations of pages common to a set of evaluators are.
6.4.4.2 Experiment Description
This experiment uses the results of the previous run on all pages of the dataset. It
compares the scores of all evaluators and measures the degree of agreement for common
pages. This means that if we have two summaries, one from our system and the other
from Copernic, and two evaluators, then we check the evaluations of evaluator 1 and
evaluator 2 for the summary generated by our system and look for similarities and
differences between the two evaluations. The same applies to Copernic. We show
comparisons in terms of the t-test for pairs of evaluators who scored the same pages, to
measure the degree of homogeneity in their scores.
6.4.4.3 Results
In order to measure the degree of overlap, or homogeneity, between the different
evaluators, Table 6.6 was derived by comparing the scores given by pairs of evaluators
to the same pages. The table shows each pair of evaluators along with two t-test results,
one for our system and one for Copernic. The t-test result indicates whether the scores
provided by the two evaluators to the same system (once comparing their scores for our
system and once for Copernic) are significantly different from each other or not.
Evaluator (a) | Evaluator (b) | FAQWEBSUMM | Copernic
1 | 2 | Significant | Significant
1 | 3 | Significant | Significant
1 | 4 | Not Significant | Not Significant
1 | 5 | Significant | Significant
2 | 3 | Not Significant | Not Significant
2 | 4 | Not Significant | Not Significant
2 | 5 | Not Significant | Not Significant
3 | 4 | Significant | Significant
3 | 5 | Significant | Not Significant
4 | 5 | Not Significant | Not Significant
6 | 7 | Not Significant | Not Significant
6 | 8 | Not Significant | Not Significant
7 | 8 | Not Significant | Not Significant
Table 6.6 t-Test Results Comparing Evaluators’ Scores.
A detailed t-Test comparison between each pair of evaluators for individual pages
is presented in Appendix C.
6.4.4.4 Discussion
Considering the comparison between pairs of evaluators who scored the same
page set, we found that 8 evaluator pairs out of 13 did not signal statistical significance
in scoring our system; this means that, for those pairs, the scores given to the same pages
by different evaluators were not significantly different. In other words, 62% of the
evaluator pairs for the same page sets were compatible and agreed on the same scoring
quality.
On the other hand, we found that 9 evaluator pairs out of 13 did not signal
statistical significance in scoring Copernic. This means that 69% of the evaluator pairs
were compatible and agreed on the same scoring quality for Copernic. Both observations
support our hypothesis that our system is better than Copernic at summarizing FAQ Web
pages, since most of the evaluators agreed on the scores they assigned to both systems,
which strengthens the reliability of the comparison.
Chapter 7
CONCLUSION AND FUTURE WORK
Conclusion
In this research we presented the preliminary results of our FAQ Web pages
automatic text summarization approach, which is English language dependent. It was
found that FAQ Web pages are laid out so that each question has a specific heading
style, e.g. bold, underlined, or tagged. The answer then follows in a lower style, usually
a smaller font, and may be split into subheadings if the answer is logically divided.
Based on the above, FAQ Web page summarization can benefit from utilizing
Web page segmentation algorithms, which are based on visual cues, to first identify the
hierarchical structure of the different headings and then extract question and answer
segments out of it. In addition, we devised a new combination of selection features to
perform the summary generation task. These features are, namely, question-answer
similarity, query overlap, sentence location in answer paragraphs and capitalized words
frequency. The choice of these features was influenced by the process of analyzing the
different question types and anticipating the expected correct answer.
The first feature, sentence similarity, evaluates the semantic similarity between
the question sentence and each sentence in the answer. It does so by comparing the word
senses of the question and answer words, assigning each pair of words a numerical
value, and then accumulating a value for the whole sentence.
The second feature, query overlap, extracts nouns, adverbs, adjectives and
gerunds from the question sentence, automatically formulates a query with them, and
counts the number of matches with each of the answer sentences.
The third feature, location, gives a higher score to sentences at the beginning of
the answer and lowers the score for the following sentences.
The fourth feature, capitalized words frequency, computes the frequency of
capitalized words in a sentence.
We give each feature a weight and then linearly combine them in a single
equation to give a cumulative score for each sentence. The different document features
were combined by a home-grown weighting score function. It was found that each
feature, used on its own, performed well in some cases based on different kinds of
evidence. Pilot experimentation and analysis helped us obtain a suitable combination of
feature weights.
In our experiments, we showed the results of four experiments that supported the
validity of our hypothesis. The first experiment tested the summarization performance
quality of our system against the Copernic system in terms of which system produces
readable, short, quality summaries. The second experiment tested the summarization
quality with respect to the questions' discipline. The third experiment measured and
compared the human evaluators' performance in evaluating both summarizers. The
fourth experiment analyzed the evaluators' scores and their homogeneity in evaluating
our system and the Copernic system.
In general, it was found that the FAQWEBSUMM system performs much better
than the Copernic summarizer. The overall average scores for all pages indicate that it is
superior to the other system by approximately 20%, which seems quite promising. In
addition, the overall results for all pages indicate statistically significant improvements
for our approach in 62% of the cases when compared with the commercial
summarization tool. This superiority comes from knowing the Web page structure in
advance, which helps in reaching better results than applying a general solution to all
types of pages.
On the other hand, with respect to the page discipline, the scores of seven of the
nine page domains (77%), namely Software, Customer Service, Business, Art, Health,
Society and News, were found significant, and only two were found not significant due
to the lack of test data in their categories, which contained only one page each.
Moreover, we believe that if we had more pages in those two disciplines (Academic
and Sports) we would have had the chance to show that our home-grown tool is superior
in all disciplines, which would mean that our approach works for all types of pages.
Considering the human evaluators, it was found that the average improvement over
all evaluators is in favor of our new approach by approximately 26.3%, and 100% of the
evaluators scored our tool better than Copernic's on average. In addition, considering the
comparison between pairs of evaluators who scored the same page set, we found that 8
evaluator pairs out of 13 did not signal statistical significance in scoring our system; that
is, the scores given to the same pages by different evaluators were not significantly
different. This means that 62% of the evaluator pairs for the same page sets were
compatible and agreed on the same scoring quality.
Main Contributions:
One of the main contributions of this research is demonstrating that the proposed
feature selection combination is a highly preferable solution to the problem of
summarizing FAQ Web pages.
Another contribution is that, by utilizing Web page segmentation, we were able to
differentiate between the different constructs residing in Web pages, which enabled us
to filter out those constructs that we deem unnecessary. As a result, this research
supports the hypothesis that by analyzing the structure of FAQ Web pages we can get
better summarization results.
The last contribution is related to answers that are divided into a set of paragraphs
under smaller headings representing the logical structure introduced by the page creator.
The final summary is a subset of all those paragraphs, not only one of them. This is also
a benefit of using the segmentation tool, since it lets us know the exact structure of the
answer.
Future Work:
Several improvements were identified that need to be addressed in further
research. First of all, the pre-processing phase should be improved, as it may result in
major data loss if not handled carefully.
There is also a need to experiment on a different dataset with different types of
Web pages covering different genres, to employ different human evaluators with
different backgrounds, and to consider a way of utilizing automatic evaluation.
Another future enhancement is to work on complex question decomposition and
analysis in order to help extract accurate and responsive answers to these questions.
Typically, complex questions address a topic that relates to many entities and events and
even complex relations between them. In fact, the quality of question-focused summaries
depends in part on how complex questions are decomposed. Therefore, by decomposing
questions and finding the relationships between their entities, it would be easier to
identify related entities in answers, and thus to select the more accurate and responsive
sentences as the summary.
It would also be a good enhancement to exploit automatic tuning of the feature
weights based on the question type, as different question types (What, When, How,
Why, etc.) require different answers. Therefore, different features may be more
appropriate for a specific question type while others may be less effective. Moreover,
some answers may be structurally different from others, as some may contain bullets or
points that need to be processed differently. As a result, sentence compaction may then
be introduced instead of selecting the whole sentence as we do now.
REFERENCES
[1] D. Das and F.T. Martins. “A Survey on Automatic Text Summarization”. Literature
Survey for the Language and Statistics II course at CMU, 2007.
[2] K.Jones. “Automatic summarising: The State of the Art”, Information Processing and
Management: an International Journal, Volume 43, Issue 6, pp.1449-1481, 2007.
[3] S. Afantenos , V. Karkaletsis , P. Stamatopoulos, “Summarization From Medical
Documents: A Survey”, Artificial Intelligence in Medicine Journal, Volume.33, Issue 2,
pp.157-177, 2005.
[4] R. Basagic, D. Krupic, B. Suzic, D. Dr.techn, C. Gütl, "Automatic Text Summarization Group", Institute for Information Systems and Computer Media, Graz. http://www.iicm.tugraz.ac.at/cguetl/courses/isr/uearchive/uews2009/Ue10%20%20Automatic%20Text%20Summarization.pdf {Retrieved on 12-04-2011}
[5] L. Alemany, I. Castell´on, S. Climent, M. Fort, L. Padr´o, and H. Rodr´ıguez.
“Approaches to Text Summarization: Questions and Answers”. Inteligencia Artificial,
Revista Iberoamericana de Inteligencia Artificial Journal, Volume 8, Issue 22: pp. 79–
102, 2003.
[6] S. Roca, "Automatic Text Summarization". http://www.uoc.edu/humfil/digithum/digithum3/catala/Art_Climent_uk/climent/climent.html {Retrieved on 10-04-2011}
[7] A. Nenkova. "Summarization Evaluation for Text and Speech: Issues and Approaches". In Proceedings of INTERSPEECH 2006, 9th International Conference on Spoken Language Processing, pp. 2023-2026, Pittsburgh, USA, 2006.
[8] H. Luhn. “The Automatic Creation of Literature Abstracts”. IBM Journal of Research
Development, Volume 2, Issue 2, pp.159-165, 1958.
[9] P. Baxendale. “Machine-Made Index for Technical Literature - An Experiment”. IBM
Journal of Research Development, Volume 2, Issue 4, pp. 354-361, 1958.
[10] H. Edmundson “New Methods in Automatic Extracting”. Journal of the ACM,
Volume 16, Issue 2, pp. 264-285, 1969.
[11] U. Reimer and U. Hahn. "A Formal Model of Text Summarization Based on
Condensation Operators of a Terminological Logic"; In Proceedings of the Workshop on
Intelligent Scalable Summarization Conference, pp. 97–104, Madrid, Spain, 1997.
[12] T. Nomoto, “Bayesian Learning in Text Summarization”, In Proceedings of the
Conference on Human Language Technology and Empirical Methods in Natural
Language Processing , pp. 249-256, Vancouver, Canada, 2005.
[13] A. Bawakid, M. Oussalah: "A Semantic Summarization System”, In Proceedings of
the 1st Text Analysis Conference, pp. 285-291, Gaithersburg, USA,2008.
[14] M. Osborne. “Using Maximum Entropy for Sentence Extraction”. In Proceedings of
the ACL-02 Workshop on Automatic Summarization, pp. 1-8, Morristown, USA, 2002.
[15] K. Svore , L. Vanderwende, and C. Burges. “Enhancing Single-Document
Summarization by Combining RankNet and Third-Party Sources”. In Proceedings of the
EMNLP-CoNLL-07, pp. 448-457, Prague, Czech Republic, 2007.
[16] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G.
Hullender “Learning to Rank Using Gradient Descent”. In Proceedings of the 22nd
International Conference on Machine Learning, pp. 89-96, New York, USA, 2005.
[17] C.Y. Lin. "ROUGE: A Package for Automatic Evaluation of Summaries". In Proceedings of the ACL-04 Workshop, pp. 74-81, Barcelona, Spain, 2004.
[18] W. Chuang, J. Yang, “Extracting Sentence Segments for Text Summarization: a
Machine Learning Approach”, In Proceedings of the 23rd Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, pp.152-159,
Athens, Greece, 2000
[19] Z. Xie, X. Li, B. Di Eugenio, P. C. Nelson, W. Xiao, T. M. Tirpak, “Using Gene
Expression Programming to Construct Sentence Ranking Functions for Text
Summarization”, In Proceedings of the 20th international conference on Computational
Linguistics Table of Contents, pp. 1381-1385, Stroudsburg , USA, 2004.
[20] J.Yeh, H. Ke, W. Yang, I. Meng, “Text Summarization Using a Trainable
Summarizer and Latent Semantic Analysis”, Information Processing and Management:
an International Journal, Volume.41, Issue 1, p.75-95, 2005.
[21] M. Abdel Fattah , F. Ren, “GA, MR, FFNN, PNN and GMM Based Models for
Automatic Text Summarization”, Computer Speech and Language Journal, Volume 23
Issue 1, p.126-144, 2009
[22] A. Kiani and M. Akbarzadeh. "Automatic Text Summarization Using: Hybrid Fuzzy GA-GP". In Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 977-983, Vancouver, Canada, 2006.
[23] R. Verma, P. Chen, W. Lu. "A Semantic Free-text Summarization System Using
Ontology Knowledge"; In Proceedings of the Document Understanding Conference, pp.
439-445, Rochester, New York, 2007.
[24] D. Radev , H. Jing , M. Styś , D. Tam, “Centroid-Based Summarization of Multiple
Documents”, Information Processing and Management Journal, Volume 40, Issue 6,
pp.919-938, 2004.
[25] H. Saggion and R. Gaizauskas. “Multi-document Summarization by Cluster/Profile
Relevance and Redundancy Removal”. In Proceedings of the 4th Document
Understanding Conference, pp386-392. Boston, USA, 2004.
[26] X. Wan. “An Exploration of Document Impact on Graph-based Multi-Document
Summarization”. In Proceedings of the Empirical Methods in Natural Language
Processing Conference, pp. 755–762, Honolulu, USA, 2008.
[27] A. Haghighi and L. Vanderwende. “Exploring Content Models for Multi-Document
Summarization”. In Proceedings of Human Language Technologies Conference, pp. 362–
370, Boulder, USA, 2009.
[28] R. Angheluta, R. De Busser, and M. Moens. “The Use of Topic Segmentation for
Automatic Summarization”. In Proceedings of the Second Document Understanding
Conference, pp. 264-271, Philadelphia, USA, 2002.
[29] E. Alfonseca, J. Guirao, and A. Sandoval. “Description of the UAM System for
Generating Very Short Summaries”, In Proceedings of the 4th Document Understanding
Conference, pp. 226-232, Boston,, USA, 2004.
[30] R. Angheluta, R. Mitra, X. Jing, and M. Moens. “K.U. Leuven Summarization
System”, at DUC 2004. In Proceedings of the 4th Document Understanding Conference,
pp. 286-292, Boston, USA, 2004.
[31] H. Daum´e and D. Marcu. “Bayesian Query-Focused Summarization”. In
Proceedings of the 44th Annual Meeting of the Association for Computational
Linguistics and the Twenty-First International Conference on Computational Linguistics,
pp 305–312, Sydney, Australia, 2006.
[32] S. Fisher and B. Roark. “Feature Expansion for Query-Focused Supervised
Sentence Ranking”. "; In Proceedings of the Document Understanding Conference, pp.
213-221, Rochester, New York, 2007.
[33] E. Baralis , P. Garza , E. Quintarelli , L. Tanca, “Answering XML Queries by
Means of Data Summaries”, ACM Transactions on Information Systems Journal, Volume
25, Issue 3, pp.10-16, 2007.
[34] H. Joho, D. Hannah, and J. Jose. “Emulating Query Biased Summaries Using
Document Titles”. In Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pp 709–710, New
York, USA, 2008.
[35] Y. Vandeghinste. “Sentence Compression for Automated Subtitling: A Hybrid
Approach”. In Proceedings of the ACL Workshop on Text Summarization, pp.89–95,
Barcelona, Spain, 2004.
[36] H. Jing, “Sentence Reduction for Automatic Text Summarization”, In Proceedings
of the Sixth Conference on Applied Natural Language Processing, pp.310-315, Seattle,
USA, 2000.
[37] D. McClosky and E. Charniak.. “Self-Training for Biomedical Parsing”. In
Proceedings of the Association for Computational Linguistics Conference, pp.852-865
,Columbus, USA, 2008.
[38] S. Jonnalagadda, L. Tari, J. Hakenberg, C. Baral, and G. Gonzalez. “Towards
Effective Sentence Simplification for Automatic Processing of Biomedical Text”. In
Proceedings of Human Language Technologies Conference, pp. 177–180, Boulder, USA,
2009.
[39] K. Woodsend , M. Lapata, “Automatic Generation of Story Highlights”, In
Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics, pp.565-574, Uppsala, Sweden, 2010.
[40] K. Knight , D. Marcu, “Summarization Beyond Sentence Extraction: A Probabilistic
Approach to Sentence Compression”, Journal of Artificial Intelligence, Volume 139,
Issue 1, pp.91-107, 2002.
[41] T. Cohn and M. Lapata. “Sentence Compression Beyond Word Deletion”. In
Proceedings of the 22nd International Conference on Computational Linguistics, pp.
137–144, Manchester, UK, 2008.
[42] D. Radev, O. Kareem, J. Otterbacher, "Hierarchical Text Summarization for WAP-Enabled Mobile Devices", In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 679-679, Salvador, Brazil, 2005.
[43] J. Otterbacher, D. Radev, O. Kareem, "News to go: Hierarchical Text Summarization for Mobile Devices", In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 589-596, Seattle, USA, 2006.
[44] J. Otterbacher, D. Radev, O. Kareem, “Hierarchical Summarization for Delivering
Information to Mobile Devices”, Information Processing and Management Journal,
Volume 44, Issue 2, pp.931-947, 2008
[45] C. Yang, F. Wang, “Fractal Summarization for Mobile Devices to Access Large
Documents on the Web”, In Proceedings of the 12th international conference on World
Wide Web, pp.215-224, Budapest, Hungary , 2003.
[46] M. Amini, A. Tombros, N. Usunier, M. Lalmas, and P. Gallinari. “Learning to
Summarise XML Documents Using Content and Structure”. In Proceedings of the 14th
International Conference on Information and Knowledge Management, pp. 297–298,
Bremen, Germany, 2005.
[47] N. Fuhr, S. Malik, and M. Lalmas. “Overview of the Initiative for the Evaluation of
XML Retrieval (inex) 2003”. In Proceedings of the 2nd INEX Workshop, pp.1-11,
Dagstuhl, Germany, 2004.
[48] S. Harper, Y. Yesilada, C. Goble, and R. Stevens. “How Much is Too Much in a
Hypertext Link? Investigating Context and Preview – A Formative Evaluation”. In
[49] J. Delort, B. Bouchon-Meunier and M. Rifqi. “Enhanced Web Document
Summarization Using Hyperlinks”. In Proceedings of the 14th ACM Conference on
Hypertext and Hypermedia, pp. 208-215, Nottingham, UK, 2003.
[50] S. Harper and N. Patel. “Gist Summaries for Visually Impaired Surfers”. In
Proceedings of the 7th international ACM SIGACCESS Conference on Computers and
Accessibility, pp. 90-97, New York, USA, 2005.
[51] A. Dalli, Y. Xia, Y. Wilks, “FASIL Email Summarisation System”, In Proceedings
of the 20th International Conference on Computational Linguistics,pp.994-999, Geneva,
Switzerland, 2004
[52] E. Toukermann , S. Muresan , J. L. Klavans, “GIST-IT: Summarizing Email Using
Linguistic Knowledge and Machine Learning”, In Proceedings of the workshop on
Human Language Technology and Knowledge Management, pp.1-8, Toulouse, France,
2001.
[53] G. Carenini, R. Ng, X. Zhou, “Summarizing Email Conversations with Clue
Words”, In Proceedings of the 16th International Conference on World Wide Web, pp.
91-100, Banff, Canada, 2007.
[54] S. Oliver, E. Ringger, M. Gamon, and R. Campbell. “Task-Focused Summarization
of Email”. In Proceedings of the ACL-04 Workshop Text Summarization, pp 43-50,
Barcelona, Spain, 2004.
[55] D. Zajic , B.. Dorr , J. Lin, “Single-Document and Multi-Document Summarization
Techniques for Email Threads Using Sentence Compression”, Information Processing
and Management Journal, Volume 44, Issue 4, pp.1600-1610, 2008
[56] M. Hu , B. Liu. “Mining and Summarizing Customer Reviews”, In Proceedings of
the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 168-177, Seattle, USA, 2004.
[57] L. Zhuang , F. Jing , X. Zhu, “Movie Review Mining and Summarization”, In
Proceedings of the 15th ACM international conference on Information and Knowledge
Management, pp. 43-50, Arlington, USA, 2006,
[58] F. Li, C. Han, M. Huang, X. Zhu, Y. Xia, S. Zhang, and H. Yu. “Structure Aware
Review Mining and Summarization”. In Proceedings of the 23rd International
Conference on Computational Linguistics Association for Computational Linguistics, pp.
653-661, Beijing, China, 2010.
[59] B. Sharifi, M. Hutton, and J. Kalita. “Summarizing Microblogs Automatically”. In
Proceedings of the 2010 Annual Conference of the North American Chapter of the
Association for Computational Linguistics Association for Computational Linguistics ,
pp. 685-688, LA, USA, 2010.
[60] X. Song, Y. Chi, K. Hino, L. B.Tseng “Summarization System by Identifying
Influential Blogs”, In Proceedings of the International Conference on Weblogs and Social
Media, pp. 325-326, Boulder, U.S.A., 2007.
[61] F. Lacatusu, A. Hickl, and S. Harabagiu,. “Impact of Question Decomposition on the
Quality of Answer Summaries", In Proceedings of the 5th International Conference on
Language Resources and Evaluation, pp. 233-236, Genoa, Italy, 2006.
[62] Y. Tao, C. Huang and C. Yang. “Designing an automatic FAQ abstraction for
internet forum”. Journal of Information Management, Volume 13, Issue 2, pp. 89-112,
2006.
[63] Y. Tao, S. Liu and C. Lin, “Summary of FAQs from a Topical Forum Based on the
Native Composition Structure”, Expert Systems with Applications Journal, Volume 38,
Issue 1, pp. 527-535, 2011.
[64] V. Barbier and A.-Laure Ligozat. “A Syntactic Strategy for Filtering Sentences in a
Question Answering”, In Proceedings of the 5th International Conference on Recent
Advances in Natural Language Processing , pp.18-24, Borovets, Bulgaria, September
2005.
[65] A. Berger , V. Mittal, “Query-Relevant Summarization Using FAQs”, In
Proceedings of the 38th Annual Meeting on Association for Computational Linguistics ,
pp.294-301, Hong Kong, 2000.
[66] D. Radev and D. Tam, “Single-Document and Multi-Document Summary
Evaluation Via Relative Utility,” In Proceedings of 12th International Conference on
Information and Knowledge Management, pp.21-30, New Orleans , USA, 2003.
[67] D. Harman and P. Over. “The Effects of Human Variation in DUC Summarization
Evaluation”. In Proceedings of the ACL-04 Workshop, pp. 10-17, Barcelona, Spain,
2004.
[68] K. McKeown, V. Hatzivassiloglou, R. Barzilay, B. Schiffman, D. Evans, S. Teufel,
“Columbia Multi-Document Summarization: Approach and Evaluation,” In Proceedings
of the Document Understanding Conference, pp.217-226, New Orleans, USA, 2001.
[69] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A Method for Automatic
Evaluation of Machine Translation,” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, Philadelphia, USA, 2002.
[70] C. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” In
Proceedings of the Workshop on Text Summarization Branches Out, pp. 74-81, Barcelona,
Spain, 2004.
[71] A. Nenkova and R. Passonneau, “Evaluating Content Selection in Summarization:
The Pyramid Method,” In Proceedings of the HLT/NAACL, pp 145-152, Boston, USA ,
2004.
[72] T. Pardo, L. Rino, M. Nunes: “NeuralSumm: A Connexionist Approach to
Automatic Text Summarization”. In: Proceedings of the 4th Encontro Nacional de
Inteligência Artificial, pp. 210-218, Campinas, Brazil, 2003.
[73] J. Neto , A. Freitas , C. Kaestner, “Automatic Text Summarization Using a Machine
Learning Approach”, In Proceedings of the 16th Brazilian Symposium on Artificial
Intelligence: Advances in Artificial Intelligence, pp.205-215, Porto de Galinhas/Recife,
Brazil, 2002.
[74] T. Pardo, L. Rino, M. Nunes. “GistSumm: A Summarization Tool Based on a New
Extractive Method”. In Proceedings of the 6th Workshop on Computational Processing of
the Portuguese, pp. 210–218, Porto Alegre, Brazil, 2003.
[75] L. Neto, J. Santos, A. Kaestner, C. Freitas. “Document Clustering and Text
Summarization”. In Proceedings of the 4th International Conference of Practical
Applications of Knowledge Discovery and Data Mining, pp. 41–55, Manchester, UK,
2000.
[76] G. Salton, C. Buckley. “Term-Weighting Approaches in Automatic Text Retrieval”, Information Processing and Management, pp. 513–523, Ithaca, USA, 1987.
[77] Copernic Summarizer, “Technologies White Paper, 2003”. http://www.copernic.com/data/pdf/summarization-whitepapereng.pdf, {Retrieved on 10-04-2011}
[78] B. Endres-Niggemeyer. “SimSum: An Empirically Founded Simulation of Summarizing”, Information Processing & Management Journal, Volume 36, Issue 4, pp. 659-682, 2000.
[79] D. Cai, S. Yu, J. Wen, and W. Ma, “VIPS: A Vision-Based Page Segmentation Algorithm”, http://research.microsoft.com/pubs/70027/tr-2003-79.pdf, {Retrieved on 12-04-2011}.
[80] M. Azmy, S. El-Beltagy, and A. Rafea. “Extracting the Latent Hierarchical Structure
of Web Documents”, In Proceedings of the International Conference on Signal-Image
Technology and Internet-Based Systems, pp. 385-393, Hammamet, Tunisia, 2006.
[81] N. Dao, T. Simpson. "Measuring Similarity between Sentences" (online),
http://wordnetdotnet.googlecode.com/svn/trunk/Projects/Thanh/Paper/WordNetDotNet_S
emantic_Similarity.pdf, {Retrieved on 10-04-2011}
[82] M. Lesk. “Automatic Sense Disambiguation Using Machine Readable Dictionaries:
How to Tell a Pine Cone from an Ice Cream Cone”, In Proceedings of the 5th annual
International Conference on Systems Documentation, pp.24-26, Toronto, Canada, 1986.
[83] E. Stamatatos, N. Fakotakis, and G. Kokkinakis. "Automatic Extraction of Rules for Sentence Boundary Disambiguation". University of Patras. http://www.ling.gu.se/~lager/Mutbl/Papers/sent_bound.ps. {Retrieved 10-04-2011}
[84] H. Chen, C. Lin, and E. Hovy. “Automatic Summarization System Coupled with a Question-Answering System (QAAS)”, In Proceedings of the COLING ’02 Workshop on
Multilingual Summarization and Question Answering, pp. 365-374, Taipei, Taiwan,
2002.
[85] R. A. García Hernández, Y. Ledeneva, G. Matías Mendoza, M. Á. Hernández Dominguez, J. Chavez, A. Gelbukh, J. L. Tapia Fabela, “Comparing Commercial Tools and State-of-the-Art Methods for Generating Text Summaries”, In Proceedings of the 8th Mexican International Conference on Artificial Intelligence, IEEE Computer Society, Washington, DC, USA, 2009.
APPENDIX A
PILOT EXPERIMENT DETAILED RESULTS
The table below shows the summaries produced by running our system with each feature individually, as well as the summaries produced by our proposed combined weighted scheme; a sketch of such a weighted combination is given below. The table also provides the scores as stated by the human evaluator involved. The summaries below are for a Web page that can be found at the following link:
http://www.cartoonfactory.com/faq.html
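For orientation, the combined scheme merges the per-feature sentence scores into one ranking score before the top answer sentences are selected. The Python sketch below illustrates such a weighted combination, assuming feature scores normalised to [0, 1]; the weights are illustrative placeholders, not the tuned values used by FAQWEBSUMM.

# Minimal sketch of a combined weighted sentence-scoring scheme.
# The weights are illustrative assumptions, not FAQWEBSUMM's tuned values.
FEATURE_WEIGHTS = {
    "similarity": 0.4,         # similarity between the sentence and the question
    "location": 0.3,           # position of the sentence within the answer block
    "query": 0.2,              # overlap with the question's keywords
    "capitalized_words": 0.1,  # density of capitalised words in the sentence
}

def combined_score(feature_scores):
    """Weighted sum of per-feature scores, each assumed to lie in [0, 1]."""
    return sum(weight * feature_scores.get(name, 0.0)
               for name, weight in FEATURE_WEIGHTS.items())

def summarize(candidate_sentences, max_sentences=2):
    """Rank candidate answer sentences and keep the highest-scoring ones."""
    ranked = sorted(candidate_sentences,
                    key=lambda s: combined_score(s["features"]),
                    reverse=True)
    return [s["text"] for s in ranked[:max_sentences]]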
Score
Question
SIMILARITY
Good
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
Excellent
Very Bad
1 Q. Do we draw all these picture?
These are the original paintings produced by the studios to make
cartoons , or reproductions from those studios based on their
cartoons and produced using the same techniques .
A . No - and that ' s the whole point .
None.
Very Bad
Excellent
None.
A. No - and that’s the whole point.
Good
Good
Very Bad
Very
Good
Good
2 Q. How are these pictures made?
A . That is a long and complicated process .
A . That is a long and complicated process .
None.
We have a couple of pages that show the various steps in creating
Limited Editions and Fine Art Sericels.
A . That is a long and complicated process.
Feature
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
Excellent
Very
Good
Very Bad
Very Bad
3 Q. Do you have a catalog? If so could you send me one?
Instead , we are publishing our catalog online .
A . Sorry , we no longer publish a printed catalog .
None.
None.
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Excellent
Excellent
Very
Good
Very Bad
Instead, we are publishing our catalog online.
4 Q. I live in England ( or Australia , or Japan ) Do you ship
here ?
No problem - we ship worldwide via FedEx .
A . Yes .
None.
Bad
Excellent
For additional information, please see our Foreign Orders page.
No problem - we ship worldwide via FedEx.
Bad
Excellent
Bad
5 Q. Do you have a good recipe for Sea Bass?
We love seafood .
A . No , but if you do , please send it to us .
We love seafood .
Very Bad
Bad
None.
We love seafood.
Very
Good
Good
Very
Good
Very Bad
Combined
Excellent
SIMILARITY
Very
Good
LOCATION
Very
Good
6 Q. With all these different types of art , I am confused ... what
should I do ?
The worst thing you can do is jump in and buy something and later
regret it .
A . Ask questions . Bug us .
Make sure all your questions have been answered fully .
None.
A. Ask questions. Know each of the different types of art available
and as much other information as you can gather.
7 Q. What type of art is the best investment?
A . While animation art tends to be a good investment , that is not
always the case .
A . While animation art tends to be a good investment , that is not
always the case . Even if you do see an increase in the value of
your collection , it may not be as dramatic as some of the stories
you may have heard . If you are seriously concerned ab
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
Very
Good
Very
Good
Very
Good
A . While animation art tends to be a good investment , that is not
always the case . If you are seriously concerned about a high return
on investment , we would recommend you invest in the stock
market . If you are just concerned about your investment
Buy Something You Love. Know What You Are Buying. Buy
Something You Really Love.
A . While animation art tends to be a good investment, that is not
always the case.
Very
Good
Good
Very Bad
8 Q. How do we know we can trust The Cartoon Factory?
The bottom line is , at some point , you are going to have to trust us
to do business with us .
A . This is a tough question , and one we try hard to answer .
None.
Very Bad
Good
None.
A . This is a tough question, and one we try hard to answer.
Very
Good
9 Q. What happens if I do have a problem with The Cartoon
Factory?
We work hard to satisfy our customers , but we work harder if they
ever experience a problem . All of these policies are outlined on our
About The Cartoon Factory page . If - not that this will happen , but
if there is a problem , you can appeal to your c
A . First , let us know . We work hard to satisfy our customers , but
we work harder if they ever experience a problem . We are
committed to our customers complete satisfaction . We , like any
business , occasionally make mistakes .
In a more long term sense , we guarantee the authenticity of all the
art we sell for life . All of these policies are outlined on our About
The Cartoon Factory page. If you would like a little more assurance
, just make your purchase by credit card . You
All of these policies are outlined on our About The Cartoon Factory
page.
We work hard to satisfy our customers, but we work harder if they
ever experience a problem. ( I think we have made two ... ) If there
has been a problem - in shipping, in framing, or if it is just not the
image you thought it was - we will take care of
Good
10 Q . Does The Cartoon Factory buy Animation Art ?
To get the ball rolling , drop us an e - mail with as complete a
description of your artwork as possible !
SIMILARITY
Very
Good
LOCATION
Good
QUERY
CAPITALIZED
WORDS
Good
Combined
SIMILARITY
Good
LOCATION
QUERY
CAPITALIZED
WORDS
Good
Bad
A . Under some circumstances , yes .
Which could be about anything , really .
Very Bad
Combined
Excellent
None.
It has to be something either we are looking for , want terribly bad ,
something that just strikes our fancy , or something at such a good
price we can ' t pass it up .
SIMILARITY
Excellent
LOCATION
QUERY
Bad
Very Bad
CAPITALIZED
WORDS
Very
Good
Combined
Good
11 Q . I have a cel of XXX ... what is it worth ?
Do you want to know it ' s current market value , it ' s replacement
value , what you should sell it for , or what we would buy it for ? If
you did not purchase the artwork from The Cartoon Factory , for us
to do anything , you must bring the artwork int
A . Again , there is a long , involved answer to this question . One
thing you must ask yourself is : What kind of appraisal do you want
?
None.
If you did not purchase the artwork from The Cartoon Factory, for
us to do anything, you must bring the artwork into our gallery in
person, or ship it to our gallery so that we can accurately evaluate
the artwork.
A . Again , there is a long , involved answer to this question . Do
you want to know it ' s current market value , it ' s replacement
value , what you should sell it for , or what we would buy it for ?
Excellent
Excellent
Very Bad
12 Q . Do you know Mickey Mouse?
A . Yes .
A . Yes .
None.
Very Bad
Excellent
None.
A. Yes.
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
Good
Bad
Very Bad
CAPITALIZED
WORDS
Very
Good
13 Q. How are cartoons made?
You might also want to check out books on this subject .
A . Again , there is no short answer to this question .
None
We are working on a page that will address this question, though...
In the meantime, we do have some excellent online displays of how
Limited Editions and Fine Art Sericels are made.
Combined
Bad
SIMILARITY
Very
Good
LOCATION
Good
QUERY
Very
Good
CAPITALIZED
WORDS
Good
Combined
Very
Good
SIMILARITY
LOCATION
Good
Very
Good
A. Again, there is no short answer to this question.
14 Q. How many cels does a typical cartoon yield?
First , the mechanics of film : film runs at 24 frames per second . (
Don ' t think it matters that video is 30 fps - they are still shot on
film ... ) As a matter of course , though , what is typically done , at
least by big budget films , is 12 cels pe
First , the mechanics of film : film runs at 24 frames per second .
That number never changes . Ever . So , the MOST any cartoon can
be animated to is 24 fps . ( Don ' t think it matters that video is 30
fps - they are still shot on film ... ) As a matt
Of these , you may find that while the main character may be
animated at 12 fps , secondary characters in the scene may be
animated at 4 fps , or used as part of a " loop " of animated cels .
For example , lets imagine a generic scene of Homer talking to
So, the MOST any cartoon can be animated to is 24 fps. For
example, lets imagine a generic scene of Homer talking to Marge.
Marge may only blink in this example scene, which would mean
there are only 5 cels of Marge, even though there may be 300 of
Homer
First , the mechanics of film : film runs at 24 frames per second .
So , the MOST any cartoon can be animated to is 24 fps . ( Don ' t
think it matters that video is 30 fps - they are still shot on film ... )
As a matter of course, though, what is typic
15 Q. Can you help me with my school project ? What is the
History of Animation ?
( Or how are cartoons made ... or ... )
A number of great books have been written on these subjects , and
they use 100 ' s of pages !
A . We are sorry , but that is currently well beyond the scope of
what we can do .
A number of great books have been written on these subjects , and
they use 100 ' s of pages !
QUERY
CAPITALIZED
WORDS
Good
Combined
Good
None.
A number of great books have been written on these subjects, and
they use 100 ' s of pages !
Very
16 Q. Does The Cartoon Factory stock video tapes , plush toys
or other items other than cels ?
The Cartoon Factory is just an Animation Art Gallery - what we sell
SIMILARITY
Very Bad
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
Good
Very
Good
Very
Good
Very
Good
Very
Good
Excellent
Very
Good
Very
Good
Very
Good
Very
Good
are Animation Art Cels .
A . No , we do not .
The Cartoon Factory is just an Animation Art Gallery - what we sell
are Animation Art Cels .
The Cartoon Factory is just an Animation Art Gallery- what we sell
are Animation Art Cels.
The Cartoon Factory is just an Animation Art Gallery - what we sell
are Animation Art Cels .
17 Q. Can you tell me if " XXXX " is on videotape , or when it
will air again ?
What may or may not be on video , or when it may be released is a
matter that the studios plan and decide , not something we have any
voice in , or any knowledge of .
A . Again , The Cartoon Factory is just an Animation Art Gallery what we sell are Animation Art Cels .
A . Again , The Cartoon Factory is just an Animation Art Gallery what we sell are Animation Art Cels .
Again, The Cartoon Factory is just an Animation Art Gallery- what
we sell are Animation Art Cels.
A. Again, The Cartoon Factory is just an Animation Art Gallery what we sell are Animation Art Cels.
SIMILARITY
Excellent
LOCATION
QUERY
CAPITALIZED
WORDS
Excellent
Very Bad
Combined
Excellent
18 Q. Is "XXXXX" on DVD when it be released on DVD?
A . Sorry , we are not a DVD store ; what is and is not on DVD , or
when it may be released on DVD is not something we concern
ourselves with .
A . Sorry , we are not a DVD store ; what is and is not on DVD , or
when it may be released on DVD is not something we concern
ourselves with .
None.
We recommend Amazon.com, or their affiliate at the Big Cartoon
DataBase Video Store.
A. Sorry, we are not a DVD store; what is and is not on DVD, or
when it may be released on DVD is not something we concern
ourselves with.
Very
Good
19 Q. Can you send me a list of all the Disney Feature
Cartoons?
Why should we when The Big Cartoon DataBase already lists all
the Disney cartoons online, with plenty of great facts?
SIMILARITY
Good
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
Very
Good
Very
Good
Very
Good
Very
Good
SIMILARITY
Very
Good
LOCATION
Excellent
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
Very
Good
Very Bad
Excellent
Very
Good
Very
Good
Very
Good
Very
Good
Very
Good
Excellent
Very
Good
A. We could, but we won't.
Why should we when The Big Cartoon DataBase already lists all
the Disney cartoons online, with plenty of great facts?
Why should we when The Big Cartoon DataBase already lists all
the Disney cartoons online, with plenty of great facts?
Why should we when The Big Cartoon DataBase already lists all
the Disney cartoons online, with plenty of great facts?
20 Q. What was the cartoon that had the little red xxxx that
floated when he ate biscuits?
If you really need a question answered , and it ' s really bugging you
, we suggest posting the question to a place that deals with cartoon
trivia , such as this cartoon forum .
A . Trivia questions - of any type - are not something we can take
the time to answer .
If you really need a question answered, and it's really bugging you,
we suggest posting the question to a place that deals with cartoon
trivia, such as this cartoon forum.
None.
A. Trivia questions - of any type - are not something we can take
the time to answer.
21 Q. Can I receive any free products from Disney?
This is a very important point : We are not Disney , Warner Bros .
or any of the other studios .
A . We don ' t know - you ' ll have to ask Disney .
We don't know- you'll have to ask Disney.
This is a very important point: We are not Disney, Warner Bros. or
any of the other studios.
This is a very important point: We are not Disney, Warner Bros. or
any of the other studios.
22 Q. My son / daughter / wife / friend is a great artist! Who
should they call at Disney to show their work?
The artists and producers are in the production end , we deal with
the consumer products divisions of the studios .
A . We do not deal in that part of the business .
QUERY
Good
CAPITALIZED
WORDS
Good
Combined
Excellent
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
Very
Good
Very
Good
Very
Good
Very Bad
Very
Good
In many case, if you want to know what Disney requires of young
artists, you should probably ask them- we would be quite remiss to
assume to speak for them.
In many case, if you want to know what Disney requires of young
artists, you should probably ask them- we would be quite remiss to
assume to speak for them.
The artists and producers are in the production end, we deal with
the consumer products divisions of the studios.
23 Q. My son / daughter / wife / friend is a great artist! Where
should they go to school?
If your son or daughter is presently in school , they would do much
better to involve their schools ' career counselors ( or those at a
local college ) , people who deal with this sort of thing every day .
And if they are out of school , they should rese
A . Again , we do not deal creative end of the business . Nor are we
very familiar with schools or schooling .
If your son or daughter is presently in school , they would do much
better to involve their schools ' career counselors ( or those at a
local college ) , people who deal with this sort of thing every day .
And if they are out of school , they should resea
None.
If your son or daughter is presently in school , they would do much
better to involve their schools ' career counselors ( or those at a
local college ) , people who deal with this sort of thing every day .
And if they are out of school, they should resea
24 Q. How can I paint Cels, and what materials are used?
A . As we do not do any production ourselves , this is not
something we can really address directly .
A . As we do not do any production ourselves , this is not
something we can really address directly .
A . As we do not do any production ourselves , this is not
something we can really address directly .
SIMILARITY
Excellent
LOCATION
Excellent
QUERY
CAPITALIZED
WORDS
Excellent
Combined
Excellent
None.
A. As we do not do any production ourselves, this is not something
we can really address directly.
Very
25 Q. So how can I contact the studios directly?
A . We have listed various studios addresses and phone numbers
SIMILARITY
Very Bad
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
SIMILARITY
LOCATION
QUERY
CAPITALIZED
WORDS
Combined
Good
Very
Good
Very
Good
below .
Good
Very
Good
A . We have listed various studios addresses and phone numbers
below .
A . We have listed various studios addresses and phone numbers
below .
We suggest you try calling information in Los Angeles (area codes
213, 818 or 310) for other studios.
A. We have listed various studios addresses and phone numbers
below.
Excellent
Excellent
Excellent
26 Q . Do you have e - mail addresses for the studios?
A . No , we do not have any public e - mail addresses .
A . No , we do not have any public e - mail addresses .
A . No , we do not have any public e - mail addresses .
Very Bad
Excellent
None.
A. No, we do not have any public e - mail addresses.
SIMILARITY
Very
Good
LOCATION
Very
Good
QUERY
Good
CAPITALIZED
WORDS
Very
Good
Combined
Excellent
27 Q. Can I use the images on The Cartoon Factory site?
We are NOT the copyright holder for any of these images , and you
can not get that permission from us at The Cartoon Factory . Use of
this site acknowledges agreement to these Terms of Use . To use
any images from this site for any purpose is a violatio
A . First of all , you need to understand that every cartoon character
was created by someone ; therefore someone owns these characters
. Not you , and certainly not us . To legally use any copyrighted
cartoon character for any reason , you must have the
All of the images on our site are copyrighted , and as such , are
protected by US and international copyright law . Additionally , the
use of this site and it ' s contents and systems are governed by our
Terms of Use . To use any images from this site fo
We are NOT the copyright holder for any of these images, and you
can not get that permission from us at The Cartoon Factory. Or, link
to the Clip Art section of NetLinks.
We are NOT the copyright holder for any of these images, and you
can not get that permission from us at The Cartoon Factory. All of
the images on our site are copyrighted, and as such, are protected by
US and international copyright law. Use of this sit
Excellent
28 Q. Can you provide a link exchange with our site?
A . Unfortunately , The Cartoon Factory site is not currently set up
SIMILARITY
to provide this option
LOCATION
Excellent
QUERY
CAPITALIZED
WORDS
Very
Good
Very
Good
Combined
Excellent
SIMILARITY
Very
Good
LOCATION
Excellent
QUERY
Good
CAPITALIZED
WORDS
Good
Combined
Very
Good
A . Unfortunately , The Cartoon Factory site is not currently set up
to provide this option
If you have a cartoon or comics related site , you might consider
adding it to the Toon . Com directory , which is cartoon and comics
- specific .
The Cartoon Factory site is not currently set up to provide this
option.
A. Unfortunately, The Cartoon Factory site is not currently set up to
provide this option.
29 Q. Can The Cartoon Factory License me or my company to
use cartoon images?
To legally use any copyrighted cartoon character for any reason ,
you must have the permission of the copyright owner . We are
NOT the copyright holder for any of these images , and you can not
get licensing permission or permissions for use from us at T
A . No . Not at all . Ever. To legally use any copyrighted cartoon
character for any reason , you must have the permission of the
copyright owner .
We are NOT the copyright holder for any of these images , and you
can not get licensing permission or permissions for use from us at
The Cartoon Factory . All of the images on our site are copyrighted
, and as such , are protected by US and international
We are NOT the copyright holder for any of these images, and you
can not get licensing permission or permissions for use from us at
The Cartoon Factory. Any infringement of property of The Cartoon
Factory will be prosecuted
To legally use any copyrighted cartoon character for any reason,
you must have the permission of the copyright owner. We are NOT
the copyright holder for any of these images, and you can not get
licensing permission or permissions for use from us at The
APPENDIX B
DATASET DESCRIPTION
Page ID | Domain | Description | URL | # Questions
1 | Software | FAQs about learning CPP programming. | http://www.parashift.com/c++-faq-lite/how-tolearn-cpp.html | 8
2 | Software | FAQs about learning Java. | http://java.sun.com/products/jdk/faq.html | 35
3 | Software | FAQs about Java tutorials. | http://www.iam.ubc.ca/guides/javatut99/information/FAQ.html | 12
4 | Software | FAQs about Google maps. | http://code.google.com/apis/maps/faq.html | 52
5 | Software | FAQs about AI Software agents. | http://www.davidreilly.com/topics/software_agents/ | 6
6 | Customer Service | FAQs about pets travelling in Delta airlines. | http://www.delta.com/help/faqs/pet_travel/index.jsp | 7
7 | Customer Service | FAQs about Delta airlines check-in. | http://www.delta.com/help/faqs/checkin/ | 18
8 | Customer Service | FAQs about DHL shipping. | http://www.dhlusa.com/custserv/faq.asp?PageID=SH&nav=FAQ/Shipping | 5
9 | Business | FAQs about eCommerce. | http://www.nhbis.com/ecommerce_faqs.html | 10
10 | Business | FAQs about project management. | http://www.maxwideman.com/ | 27
11 | Art | FAQs about cartoon making. | http://www.cartoonfactory.com/faq.html | 29
12 | Art | FAQs about the GRAMMY award. | http://www2.grammy.com/GRAMMY_Awards/Voting/FAQs/ | 8
13 | Art | FAQs about Shakespeare’s life. | http://absoluteshakespeare.com/trivia/faq/faq.htm | 28
14 | Health | FAQs about Google health. | http://www.google.com/intl/en-US/health/faq.html | 9
15 | Health | FAQs about Avian Flu. | http://www.who.int/csr/disease/avian_influenza/avian_faqs/en/ | 32
16 | Health | FAQs about soy food. | http://www.soyfoods.org/health/faq | 22
17 | Society | FAQs about communism. | http://www.angelfire.com/mn2/Communism/faq.html | 10
18 | Society | FAQs about environmental safety. | http://www.safety.rochester.edu/FAQ/answers.html | 21
19 | Society | FAQs about human rights. | http://web.worldbank.org/WBSITE/EXTERNAL/EXTSITETOOLS/0,,contentMDK:20749693~pagePK:98400~piPK:98424~theSitePK:95474,00.html | 5
20 | News | FAQs about CNN News. | http://www.cnn.com/feedback/help/ | 18
21 | Academic | FAQs about AUC exchange program. | http://www.aucegypt.edu/students/IPO/FAQ/Pages/AUC.aspx | 16
22 | Sports | FAQs about Football. | http://football.about.com/od/frequentlyaskedquestions/Common_Questions.htm | 12
Total Number of Questions: 390
Approximate Number of Sentences: 2000
Table 1.0 Dataset Description.
APPENDIX C
DETAILED EXPERIMENTAL RESULTS
The table below shows, for each question in each page, the average score given by the three human evaluators, once for our system (FAQWEBSUMM) and once for Copernic, together with the relative improvement ratio.
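For reference, the improvement ratio reported alongside these scores is the relative gain of FAQWEBSUMM over Copernic on the same question. The Python sketch below reproduces that computation; mapping the five qualitative grades onto a 0.2-1.0 scale before averaging the three evaluators is an assumption that is merely consistent with the tabulated values, not a definition quoted from the text.

# Sketch of how the tabulated values can be reproduced.
# Assumption: each qualitative grade maps to a number on a 0.2-1.0 scale
# and the three evaluators' values are averaged per question.
GRADE_VALUE = {"Very Bad": 0.2, "Bad": 0.4, "Good": 0.6,
               "Very Good": 0.8, "Excellent": 1.0}

def average_score(grades):
    """Average the numeric values of the evaluators' grades for one question."""
    return sum(GRADE_VALUE[g] for g in grades) / len(grades)

def improvement_ratio(faqwebsumm_score, copernic_score):
    """Relative improvement of FAQWEBSUMM over Copernic, as a fraction."""
    return (faqwebsumm_score - copernic_score) / copernic_score

# Example: averages of 0.8/3 (~0.267) and 1.4/3 (~0.467) give the table's -42.9%.
print(round(improvement_ratio(0.8 / 3, 1.4 / 3) * 100, 1))  # -42.9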
Page
Page 1
Q1-Q8
Page 2
Q9-Q43
Question FAQWEBSUMM
Q1
0.267
Q2
0.667
Q3
0.600
Q4
0.533
Q5
0.667
Q6
0.733
Q7
0.600
Q8
0.733
Q9
0.800
Q 10
0.867
Q 11
0.600
Q 12
0.600
Q 13
0.800
Q 14
0.933
Q 15
0.733
Q 16
1.000
Q 17
0.800
Q 18
0.600
Q 19
0.333
Q 20
0.667
Q 21
0.600
Q 22
0.467
Q 23
0.733
Q 24
0.800
Q 25
0.867
Q 26
0.733
Q 27
0.733
Q 28
0.800
Q 29
0.867
Copernic
0.467
0.333
0.467
0.533
0.267
0.267
0.200
0.400
0.667
0.800
0.533
0.467
0.200
0.933
0.200
0.200
0.200
0.733
0.200
0.533
0.667
0.667
0.733
0.533
0.533
0.533
0.200
0.533
0.600
Improvement
Ratio
-42.9%
100.0%
28.6%
0.0%
150.0%
175.0%
200.0%
83.3%
20.0%
8.3%
12.5%
28.6%
300.0%
0.0%
266.7%
400.0%
300.0%
-18.2%
66.7%
25.0%
-10.0%
-30.0%
0.0%
50.0%
62.5%
37.5%
266.7%
50.0%
44.4%
Page 3
Q44-Q55
Page 4
Q56-Q107
Q 30
Q 31
Q 32
Q 33
Q 34
Q 35
Q 36
Q 37
Q 38
Q 39
Q 40
Q 41
Q 42
0.933
0.733
0.800
0.867
0.800
0.600
0.667
0.600
0.600
0.667
0.533
0.400
0.300
0.200
0.400
0.733
0.533
0.467
0.600
0.467
0.600
0.467
0.933
0.533
0.600
0.300
366.7%
83.3%
9.1%
62.5%
71.4%
0.0%
42.9%
0.0%
28.6%
-28.6%
0.0%
-33.3%
0.0%
Q 43
Q 44
Q 45
Q 46
Q 47
Q 48
Q 49
Q 50
Q 51
Q 52
Q 53
Q 54
Q 55
Q 56
Q 57
Q 58
Q 59
Q 60
Q 61
Q 62
Q 63
Q 64
Q 65
Q 66
Q 67
Q 68
Q 69
0.300
0.867
0.800
0.600
0.600
0.867
0.467
0.733
0.867
0.867
0.400
0.533
0.733
0.667
0.8
0.867
0.867
0.467
0.667
0.867
0.800
0.600
0.733
0.867
0.733
0.600
0.600
0.300
0.867
0.533
0.467
0.600
0.467
0.867
0.467
0.267
0.533
0.600
0.200
0.733
0.667
0.867
0.400
0.867
0.467
0.200
0.533
0.667
0.800
0.200
0.867
0.733
0.800
0.800
0.0%
0.0%
50.0%
28.6%
0.0%
85.7%
-46.2%
57.1%
225.0%
62.5%
-33.3%
166.7%
0.0%
0.0%
-7.7%
116.8%
0.0%
0.0%
233.5%
62.7%
19.9%
-25.0%
266.5%
0.0%
0.0%
-25.0%
-25.0%
Page 5
Q108-Q113
Q 70
Q 71
Q 72
Q 73
Q 74
Q 75
Q 76
Q 77
Q 78
Q 79
Q 80
Q 81
Q 82
Q 83
Q 84
Q 85
Q 86
Q 87
Q 88
Q 89
Q 90
Q 91
Q 92
Q 93
Q 94
Q 95
Q 96
Q 97
Q 98
Q 99
Q 100
Q 101
Q 102
Q 103
Q 104
Q 105
Q 106
Q 107
Q 108
Q 109
Q 110
0.733
0.867
0.667
0.867
0.800
0.533
0.733
0.867
0.667
0.467
0.867
0.867
0.933
0.800
0.667
0.733
0.600
0.533
0.600
0.733
0.867
0.600
0.800
0.667
0.733
0.733
0.400
0.533
0.600
0.600
0.467
0.467
0.600
0.533
0.533
0.533
0.467
0.400
0.533
0.467
0.333
0.733
0.867
0.733
0.200
0.200
0.533
0.733
0.600
0.200
0.467
0.800
0.867
0.600
0.733
0.333
0.600
0.867
0.800
0.667
0.733
0.867
0.733
0.800
0.800
0.667
0.667
0.533
0.333
0.600
0.533
0.467
0.267
0.600
0.400
0.600
0.533
0.467
0.400
0.600
0.467
0.667
0.0%
0.0%
-9.0%
333.5%
300.0%
0.0%
0.0%
44.5%
233.5%
0.0%
8.4%
0.0%
55.5%
9.1%
100.3%
22.2%
-30.8%
-33.4%
-10.0%
0.0%
0.0%
-18.1%
0.0%
-16.6%
9.9%
9.9%
-25.0%
60.1%
0.0%
12.6%
0.0%
74.9%
0.0%
33.3%
-11.2%
0.0%
0.0%
0.0%
-11.2%
0.0%
-50.1%
Page 6
Q114-Q120
Page 7
Q121-Q138
Page 8
Q139-Q143
Page 9
Q144-Q153
Q 111
Q 112
0.533
0.467
0.333
0.467
60.1%
0.0%
Q 113
Q 114
Q 115
Q 116
Q 117
Q 118
Q 119
0.600
0.867
0.733
0.867
0.800
0.800
1.000
0.600
0.867
0.200
0.200
0.800
0.800
0.200
0.0%
0.0%
266.5%
333.5%
0.0%
0.0%
400.0%
Q 120
Q 121
Q 122
Q 123
Q 124
Q 125
Q 126
Q 127
Q 128
Q 129
Q 130
Q 131
Q 132
Q 133
Q 134
Q 135
Q 136
Q 137
Q 138
Q 139
Q 140
Q 141
Q 142
0.867
0.733
0.733
0.933
0.667
0.800
0.733
0.933
0.867
0.467
0.867
0.933
0.400
0.867
0.800
0.933
0.600
0.667
0.667
0.600
0.733
0.800
0.533
0.867
0.800
0.400
0.933
0.667
0.800
0.400
0.933
0.467
0.600
0.867
0.933
0.800
0.867
0.467
0.933
0.867
0.867
0.200
0.600
0.733
0.800
0.733
0.0%
-8.3%
83.3%
0.0%
0.0%
0.0%
83.3%
0.0%
85.7%
-22.2%
0.0%
0.0%
-50.0%
0.0%
71.4%
0.0%
-30.8%
-23.1%
233.4%
0.0%
0.0%
0.0%
-27.3%
Q 143
Q 144
Q 145
Q 146
Q 147
0.933
0.733
0.867
0.667
0.867
0.933
0.733
0.867
0.800
0.800
0.0%
0.0%
0.0%
-16.7%
8.3%
Page 10
Q154-Q180
Page 11
Q181-Q209
Q 148
Q 149
Q 150
Q 151
Q 152
0.933
1.000
0.867
0.800
0.600
0.667
0.333
0.600
0.200
0.267
40.0%
200.0%
44.5%
300.0%
125.0%
Q 153
Q 154
Q 155
Q 156
Q 157
Q 158
Q 159
Q 160
Q 161
Q 162
Q 163
Q 164
Q 165
Q 166
Q 167
Q 168
Q 169
Q 170
Q 171
Q 172
Q 173
Q 174
Q 175
Q 176
Q 177
Q 178
Q 179
Q 180
Q 181
Q 182
Q 183
Q 184
Q 185
0.667
0.400
0.467
0.600
0.667
0.533
0.400
0.667
0.533
0.600
0.333
0.667
0.600
0.667
0.600
0.733
0.667
0.467
0.400
0.533
0.667
0.600
0.400
0.467
0.600
0.600
0.533
0.600
0.733
0.467
0.800
0.867
0.467
0.600
0.400
0.600
0.267
0.667
0.600
0.400
0.667
0.400
0.600
0.800
0.667
0.667
0.467
0.467
0.800
0.600
0.400
0.400
0.400
0.467
0.667
0.333
0.600
0.600
0.600
0.533
0.400
0.400
0.533
0.800
0.200
0.200
11.1%
0.0%
-22.2%
124.7%
0.0%
-11.2%
0.0%
0.0%
33.3%
0.0%
-58.4%
0.0%
-10.0%
42.8%
28.5%
-8.4%
11.2%
16.8%
0.0%
33.3%
42.8%
-10.0%
20.1%
-22.2%
0.0%
0.0%
0.0%
50.0%
83.3%
-12.4%
0.0%
333.5%
133.5%
Page 12
Q210-Q217
Page 13
Q218-Q245
Q 186
Q 187
Q 188
Q 189
Q 190
Q 191
Q 192
Q 193
Q 194
Q 195
Q 196
Q 197
Q 198
Q 199
Q 200
Q 201
Q 202
Q 203
Q 204
Q 205
Q 206
Q 207
Q 208
Q 209
Q 210
Q 211
Q 212
Q 213
Q 214
Q 215
Q 216
0.733
0.733
0.333
0.733
0.667
0.867
1.000
0.467
0.800
0.733
0.867
0.533
0.933
0.867
0.667
0.800
0.800
0.733
0.800
0.800
0.867
0.733
0.867
0.800
0.733
0.667
0.800
1.000
0.933
0.933
1.000
0.600
0.733
0.400
0.667
0.400
0.600
0.200
0.867
0.867
0.733
0.467
0.600
0.467
0.867
0.800
0.867
0.800
0.733
0.800
0.800
0.200
0.600
0.867
0.733
0.600
0.533
0.600
0.200
0.400
0.200
0.867
22.2%
0.0%
-16.8%
9.9%
66.8%
44.5%
400.0%
-46.1%
-7.7%
0.0%
85.7%
-11.2%
99.8%
0.0%
-16.6%
-7.7%
0.0%
0.0%
0.0%
0.0%
333.5%
22.2%
0.0%
9.1%
22.2%
25.1%
33.3%
400.0%
133.3%
366.5%
15.3%
Q 217
Q 218
Q 219
Q 220
Q 221
Q 222
Q 223
Q 224
0.800
1.000
0.867
0.733
0.800
0.800
0.667
0.733
0.200
1.000
0.867
0.733
0.800
0.800
0.667
0.733
300.0%
0.0%
0.0%
0.0%
0.0%
0.0%
0.0%
0.0%
Page 14
Q246-Q254
Page 15
Q255-Q286
Q 225
Q 226
Q 227
Q 228
Q 229
Q 230
Q 231
Q 232
Q 233
Q 234
Q 235
Q 236
Q 237
Q 238
Q 239
Q 240
Q 241
Q 242
Q 243
Q 244
0.667
0.933
0.733
0.867
0.933
0.933
0.933
0.867
0.867
1.000
1.000
0.867
0.933
0.867
0.733
0.867
1.000
0.600
0.867
0.800
0.733
0.933
0.733
0.667
0.867
0.200
0.933
0.733
0.733
0.867
0.867
0.867
0.933
0.800
0.600
0.800
0.733
0.600
0.867
0.667
-9.0%
0.0%
0.0%
30.0%
7.6%
366.5%
0.0%
18.3%
18.3%
15.3%
15.3%
0.0%
0.0%
8.4%
22.2%
8.4%
36.4%
0.0%
0.0%
19.9%
Q 245
Q 246
Q 247
Q 248
Q 249
Q 250
Q 251
Q 252
Q 253
0.933
0.800
0.933
0.733
1.000
1.000
0.800
1.000
0.867
0.933
0.733
0.800
0.733
0.667
0.867
0.333
0.400
0.400
0.0%
9.1%
16.6%
0.0%
49.9%
15.3%
140.2%
150.0%
116.8%
Q 254
Q 255
Q 256
Q 257
Q 258
Q 259
Q 260
Q 261
Q 262
0.667
0.867
0.800
0.933
0.867
0.933
0.733
1.000
0.667
0.467
0.600
0.733
0.200
0.867
0.800
0.800
0.667
0.200
42.8%
44.5%
9.1%
366.5%
0.0%
16.6%
-8.4%
49.9%
233.5%
Page 16
Q287-Q308
Q 263
Q 264
Q 265
Q 266
Q 267
Q 268
Q 269
Q 270
Q 271
Q 272
Q 273
Q 274
Q 275
Q 276
Q 277
Q 278
Q 279
Q 280
Q 281
Q 282
Q 283
Q 284
Q 285
0.733
0.667
0.733
0.733
0.867
0.733
0.933
0.733
0.933
0.733
0.867
0.533
0.800
0.867
0.800
1.000
0.667
0.667
0.800
0.867
0.667
0.867
0.867
0.733
0.333
0.200
0.733
0.600
0.800
0.933
0.800
0.933
0.933
0.800
0.533
0.667
0.733
0.333
0.867
0.733
0.533
0.600
0.533
0.533
0.800
0.867
0.0%
100.3%
266.5%
0.0%
44.5%
-8.4%
0.0%
-8.4%
0.0%
-21.4%
8.4%
0.0%
19.9%
18.3%
140.2%
15.3%
-9.0%
25.1%
33.3%
62.7%
25.1%
8.4%
0.0%
Q 286
Q 287
Q 288
Q 289
Q 290
Q 291
Q 292
Q 293
Q 294
Q 295
Q 296
Q 297
Q 298
Q 299
0.867
0.933
1.000
0.600
0.867
0.800
0.800
1.000
1.000
0.933
0.867
0.667
0.733
0.933
0.867
0.933
0.933
0.733
0.733
0.467
0.667
0.533
0.867
0.333
0.667
0.933
0.467
0.933
0.0%
0.0%
7.2%
-18.1%
18.3%
71.3%
19.9%
87.6%
15.3%
180.2%
30.0%
-28.5%
57.0%
0.0%
Page 17
Q309-Q318
Page 18
Q319-Q339
Q 300
Q 301
Q 302
Q 303
Q 304
Q 305
Q 306
Q 307
Q 308
Q 309
Q 310
Q 311
Q 312
Q 313
Q 314
Q 315
Q 316
Q 317
0.800
0.667
0.867
0.600
0.933
0.533
0.867
0.933
0.933
0.867
0.800
0.867
0.933
0.733
0.800
0.667
0.867
0.933
0.667
0.733
0.800
0.867
0.800
0.867
0.533
0.933
0.800
0.533
0.667
0.733
0.867
0.600
0.200
0.667
0.933
0.533
19.9%
-9.0%
8.4%
-30.8%
16.6%
-38.5%
62.7%
0.0%
16.6%
62.7%
19.9%
18.3%
7.6%
22.2%
300.0%
0.0%
-7.1%
75.0%
Q 318
Q 319
Q 320
Q 321
Q 322
Q 323
Q 324
Q 325
Q 326
Q 327
Q 328
Q 329
Q 330
Q 331
Q 332
Q 333
Q 334
Q 335
Q 336
Q 337
Q 338
Q 339
0.667
0.667
0.733
0.933
0.733
0.733
0.667
0.667
0.867
0.667
0.800
0.933
0.867
0.800
0.800
0.733
0.800
0.800
0.800
0.667
0.933
0.667
0.733
0.600
0.733
0.600
0.200
0.467
0.333
0.667
0.733
0.667
0.400
0.467
0.200
0.467
0.733
0.667
0.533
0.667
0.333
0.333
0.733
0.800
-9.0%
11.2%
0.0%
55.5%
266.5%
57.0%
100.3%
0.0%
18.3%
0.0%
100.0%
99.8%
333.5%
71.3%
9.1%
9.9%
50.1%
19.9%
140.2%
100.3%
27.3%
-16.6%
Page 19
Q340-Q344
Page 20
Q345-Q362
Page 21
Q363-Q378
Q 340
Q 341
Q 342
Q 343
0.733
0.800
0.800
0.733
0.800
0.733
0.400
0.800
-8.4%
9.1%
100.0%
-8.4%
Q 344
Q 345
Q 346
Q 347
Q 348
Q 349
Q 350
Q 351
Q 352
Q 353
Q 354
Q 355
Q 356
Q 357
Q 358
Q 359
Q 360
Q 361
Q 362
Q 363
Q 364
Q 365
Q 366
Q 367
Q 368
Q 369
Q 370
Q 371
Q 372
Q 373
Q 374
Q 375
Q 376
Q 377
0.600
0.933
0.733
0.667
0.733
0.733
0.800
0.867
0.733
0.800
0.800
0.733
0.600
0.600
0.533
0.733
0.267
0.467
0.667
1.000
0.733
0.467
0.733
0.667
0.733
0.733
0.733
0.800
0.533
0.667
0.600
0.667
0.667
0.733
0.867
0.200
0.467
0.667
0.333
0.800
0.400
0.800
0.267
0.200
0.867
0.533
0.467
0.667
0.533
0.600
0.200
0.533
0.400
1.000
0.667
0.867
0.733
0.667
0.533
0.733
0.733
0.267
0.467
0.667
0.800
0.667
0.733
0.733
-30.8%
366.5%
57.0%
0.0%
120.1%
-8.4%
100.0%
8.4%
174.5%
300.0%
-7.7%
37.5%
28.5%
-10.0%
0.0%
22.2%
33.5%
-12.4%
66.8%
0.0%
9.9%
-46.1%
0.0%
0.0%
37.5%
0.0%
0.0%
199.6%
14.1%
0.0%
-25.0%
0.0%
-9.0%
0.0%
Page 22
Q379-Q390
Overall
Average
Q 378
Q 379
Q 380
Q 381
Q 382
Q 383
Q 384
Q 385
Q 386
Q 387
Q 388
Q 389
0.667
0.267
0.733
0.733
0.467
0.733
0.867
0.733
0.800
0.933
0.933
0.800
0.667
0.267
0.733
0.733
0.467
0.533
0.400
0.733
0.800
0.933
0.933
0.800
0.0%
0.0%
0.0%
0.0%
0.0%
37.5%
116.8%
0.0%
0.0%
0.0%
0.0%
0.0%
Q 390
0.467
0.467
0.0%
0.732
0.610
20.1%
Table 2.0 Questions average scores.
The set of figures below shows a comparison of the score distributions for both
summarizers with respect to each page discipline.
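The percentages in these charts can be obtained by counting how many scores in each discipline fall into each qualitative grade; the Python sketch below shows one assumed way of producing that tally, not code taken from the system.

from collections import Counter

def score_distribution(grades):
    """Percentage of scores falling into each qualitative grade."""
    counts = Counter(grades)
    total = sum(counts.values())
    return {grade: round(100.0 * count / total, 1)
            for grade, count in counts.items()}

# Hypothetical example: grades collected for one discipline.
print(score_distribution(["Good", "Very Good", "Excellent",
                          "Good", "Very Bad", "Very Good"]))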
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Software" and Copernic "Software".]
Figure 1 Software Discipline Score Distribution Comparison.
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Customer Service" and Copernic "Customer Service".]
Figure 2 Customer Service Discipline Score Distribution Comparison.
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Business" and Copernic "Business".]
Figure 3 Business Discipline Score Distribution Comparison.
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Art" and Copernic "Art".]
Figure 4 Art Discipline Score Distribution Comparison.
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Health" and Copernic "Health".]
Figure 5 Health Discipline Score Distribution Comparison.
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Society" and Copernic "Society".]
Figure 6 Society Discipline Score Distribution Comparison.
Very Good
Excellent
Excellent
11%
Very Bad
6%
Excellent
6%
Good
22%
Very Bad
22%
Very Good
22%
Very Bad
Bad
Good
Bad
17%
Very Good
61%
Good
33%
FAQWebSum "New s"
Very Good
Excellent
Copernic "New s"
Figure 7 News Discipline Score Distribution Comparison.
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Academic" and Copernic "Academic".]
Figure 8 Academic Discipline Score Distribution Comparison.
[Pie charts: score distributions (Very Bad, Bad, Good, Very Good, Excellent) for FAQWebSum "Sports" and Copernic "Sports".]
Figure 9 Sports Discipline Score Distribution Comparison.
The set of figures below shows a comparison of the score distributions for both
summarizers with respect to each human evaluator.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 10 Evaluator 1 Score Distribution Comparison.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 11 Evaluator 2 Score Distribution Comparison.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 12 Evaluator 3 Score Distribution Comparison.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 13 Evaluator 4 Score Distribution Comparison.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 14 Evaluator 5 Score Distribution Comparison.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 15 Evaluator 6 Score Distribution Comparison.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 16 Evaluator 7 Score Distribution Comparison.
[Pie charts: FAQWebSum and Copernic score distributions (Very Bad, Bad, Good, Very Good, Excellent) for this evaluator.]
Figure 17 Evaluator 8 Score Distribution Comparison.
The following set of tables shows the t-Test comparison between each pair of evaluators who scored the same set of pages. Each table contains a record for every page scored by the pair, indicating whether the difference between the two evaluators' scores was significant for each summarizer; in other words, it shows whether the two evaluators' scores are homogeneous. If the scores are comparable within a small tolerance, the t-Test result is not significant; otherwise it is significant.
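As a concrete illustration, such a check can be implemented as a paired t-test over the two evaluators' scores for the questions of one page. The Python sketch below assumes a two-tailed paired test and a 0.05 significance level, since the exact test settings are not restated in this appendix.

from scipy import stats

# Sketch of the per-page evaluator comparison: a paired t-test over the two
# evaluators' scores for the same questions.  The paired, two-tailed form and
# the 0.05 threshold are assumptions for illustration.
def compare_evaluators(scores_a, scores_b, alpha=0.05):
    """Return 'Significant' if the two evaluators' scores differ significantly."""
    t_statistic, p_value = stats.ttest_rel(scores_a, scores_b)
    return "Significant" if p_value < alpha else "Not Significant"

# Hypothetical example: two evaluators scoring the same five questions of a page.
print(compare_evaluators([0.6, 0.8, 0.6, 1.0, 0.8],
                         [0.4, 0.8, 0.4, 0.6, 0.6]))  # Significant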
Page | FAQWEBSUMM | Copernic
5 | Significant | Significant
8 | Not Significant | Not Significant
20 | Significant | Significant
21 | Significant | Significant
22 | Not Significant | Not Significant
Table 2 Evaluators 1 and 2 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
4 | Significant | Significant
6 | Not Significant | Not Significant
9 | Significant | Not Significant
20 | Significant | Significant
21 | Significant | Significant
22 | Not Significant | Not Significant
Table 3 Evaluators 1 and 3 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
4 | Significant | Not Significant
6 | Not Significant | Not Significant
8 | Not Significant | Not Significant
9 | Not Significant | Not Significant
Table 4 Evaluators 1 and 4 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
5 | Not Significant | Not Significant
10 | Significant | Significant
Table 5 Evaluators 1 and 5 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
1 | Not Significant | Not Significant
2 | Significant | Not Significant
7 | Not Significant | Not Significant
20 | Not Significant | Significant
21 | Not Significant | Not Significant
22 | Not Significant | Not Significant
Table 6 Evaluators 2 and 3 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
1 | Not Significant | Not Significant
3 | Not Significant | Not Significant
7 | Significant | Significant
8 | Not Significant | Not Significant
Table 7 Evaluators 2 and 4 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
2 | Not Significant | Not Significant
3 | Not Significant | Not Significant
5 | Not Significant | Not Significant
10 | Significant | Not Significant
Table 8 Evaluators 2 and 5 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
1 | Not Significant | Not Significant
4 | Significant | Significant
6 | Not Significant | Not Significant
7 | Significant | Significant
9 | Significant | Not Significant
Table 9 Evaluators 3 and 4 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
2 | Significant | Not Significant
Table 10 Evaluators 3 and 5 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
3 | Not Significant | Not Significant
Table 11 Evaluators 4 and 5 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
11 | Not Significant | Not Significant
12 | Not Significant | Not Significant
13 | Not Significant | Not Significant
14 | Not Significant | Not Significant
15 | Significant | Not Significant
16 | Not Significant | Not Significant
17 | Not Significant | Not Significant
18 | Significant | Not Significant
19 | Not Significant | Not Significant
Table 12 Evaluators 6 and 7 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
11 | Not Significant | Not Significant
12 | Not Significant | Not Significant
13 | Not Significant | Not Significant
14 | Not Significant | Not Significant
15 | Not Significant | Not Significant
16 | Not Significant | Not Significant
17 | Not Significant | Not Significant
18 | Not Significant | Significant
19 | Not Significant | Not Significant
Table 13 Evaluators 6 and 8 t-Test Comparison.
Page | FAQWEBSUMM | Copernic
11 | Not Significant | Not Significant
12 | Significant | Not Significant
13 | Not Significant | Significant
14 | Significant | Not Significant
15 | Significant | Not Significant
16 | Not Significant | Not Significant
17 | Not Significant | Not Significant
18 | Significant | Significant
19 | Not Significant | Not Significant
Table 14 Evaluators 7 and 8 t-Test Comparison.
APPENDIX D
SAMPLE OUTPUT SUMMARIES
The table below shows sample summaries provided by running our system against Copernic. The table also shows the average score given by all human evaluators in response to each question. Summaries provided below are for a Web page that can be found at the following link:
http://www2.grammy.com/GRAMMY_Awards/Voting/FAQs/
1-What's the difference between an entry and a nomination?
FAQWEBSUMM Very Good
Entries are recordings submitted for GRAMMY consideration .
Copernic Good
Entries that meet all eligibility requirements are then voted on by The Academy ' s voting members and the results of that vote are the nominations.
2-What are the eligibility requirements?
FAQWEBSUMM Good
For the 52nd Annual GRAMMY Awards , albums must be released between Oct . 1 , 2008 and August 31 , 2009 .
Copernic Bad
sales by label to a branch or recognized independent distributor, via the Internet, or mail order / retail sales for a nationally marketed product.
3-How are recordings entered?
FAQWEBSUMM Excellent
The Academy accepts entries online from its members and from registered labels.
Copernic Good
Entrants are provided information on how to submit their recordings electronically for consideration.
4-Who can vote?
FAQWEBSUMM Excellent
Recording Academy voting members only .
Copernic Very Bad
None
5-Who qualifies as a Voting Member?
FAQWEBSUMM Excellent
Recording Academy voting members are professionals with creative or technical credits on six commercially released tracks ( or their equivalent ) .
Copernic Bad
These may include vocalists, conductors, songwriters, composers, engineers, producers, instrumentalists, arrangers, art directors, album notes writers, narrators, and music video artists and technicians.
6-How many GRAMMY categories are there?
FAQWEBSUMM Excellent
There are currently 29 fields ( Pop , Gospel , Classical , etc . ) and 109 categories within those fields .
Copernic Very Bad
None
7-How are categories changed or added?
FAQWEBSUMM Excellent
Proposals for changes to the categories are reviewed each year by The Academy ' s Awards & Nominations Committee , with final approval by The Academy ' s Trustees .
Copernic Excellent
Proposals for changes to the categories are reviewed each year by The Academy ' s Awards & Nominations Committee, with final approval by The Academy ' s Trustees.
8-What is the difference between Record Of The Year and Song Of The Year?
FAQWEBSUMM Excellent
The Record Of The Year category recognizes the artist ' s performance as well as the overall contributions of the producer ( s ) , recording engineer ( s ) and / or mixer ( s ) if other than the artist .
Copernic Very Bad
None
Summaries provided below are for a Web page that can be found at the following link:
http://www.google.com/intl/en-US/health/faq.html
1. What kind of health information can I store in Google Health?
FAQWEBSUMM Excellent
You can store as much or as little information in Google Health as you want.
Copernic Very Good
You can store as much or as little information in Google Health as you want. You can store wellness data, medical records, or both in Google Health including personalized wellness goals around weight or exercise as an example, or more traditional medical history such as your medications, allergies, procedures, immunizations, conditions, health insurance information and test results.
2. How can Google Health help me?
FAQWEBSUMM Excellent
Google Health offers a simple online dashboard you can use to track all your health - related information and goals.
Copernic Excellent
Google Health offers a simple online dashboard you can use to track all your health - related information and goals.
3. Is Google Health a new way to search for health information?
FAQWEBSUMM Very Good
Google Health helps you organize, track, monitor, and act on your health information.
Copernic Very Good
Google Health helps you organize, track, monitor, and act on your health information. com search results page but you are no longer inside the Google Health product.
4. How much does Google Health cost to use?
FAQWEBSUMM Excellent
Google Health is available from Google at no charge. There is no cost to sign up. And if you already have a Google account set up then you are set.
Copernic Good
Google Health is available from Google at no charge. There is no charge to doctors ' offices, hospitals, retail pharmacies, and health device makers and other application developers that partner with Google Health. Some third party companies that are integrated with Google Health and provide customized services may charge you directly but it ' s up to you to elect to work with them.
5. If Google doesn’t charge for Google Health, how does Google make money off of it?
FAQWEBSUMM Excellent
Much like other Google products we offer, Google Health is made available at no charge to anyone who uses it.
Copernic Excellent
Much like other Google products we offer, Google Health is made available at no charge to anyone who uses it.
6. How does Google Health protect the privacy of my health information?
FAQWEBSUMM Very Good
You should know two main things up front: We will never sell your personal health information or data we will not share your health data with individuals or third parties unless you explicitly tell us to do so or except in certain limited circumstances described in our Privacy Policy.
Copernic Very Bad
None
7. Will my employer or health insurance provider be able to see my Google Health profile?
FAQWEBSUMM Excellent
You are in control of who views your Google Health profile.
Copernic Very Bad
None
8. Does the data I store in Google Health get used for other Google products, like Search?
FAQWEBSUMM Very Good
Yes, we share information between Google products to enable cross - product functionality.
Copernic Good
Does the data I store in Google Health get used for other Google products, like Search? For example, Google Health can help you save your doctors ' contact information in your Google Contact List.
9. Is Google Health a PHR (personal health record)?
FAQWEBSUMM Good
A personal health record (PHR) is a patient - directed information tool that allows the patient to enter and gather information from a variety of healthcare information systems such as hospitals, physicians, health insurance plans, and retail pharmacies.
Copernic Good
Is Google Health a PHR (personal health record)? We believe it ' s not enough to offer a place where you can store, manage, and share your health information.