FREQUENTLY ASKED QUESTIONS WEB PAGES AUTOMATIC TEXT SUMMARIZATION
The American University in Cairo
School of Sciences and Engineering

FREQUENTLY ASKED QUESTIONS WEB PAGES AUTOMATIC TEXT SUMMARIZATION

A thesis submitted to the Department of Computer Science and Engineering in partial fulfillment of the requirements for the degree of Master of Science

by Yassien Mohamed Shaalan
Bachelor of Science, Computer Engineering, Cairo University

Under the supervision of Dr. Ahmed Rafea
April 2011

DEDICATION

This thesis is dedicated to my beloved parents and wife, without whose profound support and help it would not have been possible to accomplish anything. An honourable dedication goes to my dear uncle Dr. Mohamed Labib for his unfailing assistance, excellent advice, superior help, and for encouraging me and believing in what I am capable of doing. Finally, I would like to dedicate it to honour the memory of my father, who passed away during the course of this research.

ACKNOWLEDGMENTS

I would like to express my deepest sense of gratitude to each and every person who helped or supported me in completing my thesis. It has been a challenging and rich experience. Accordingly, it is my pleasure to take the time to convey my gratitude to them. First, I would like to express my sincere gratitude to my supervisor Dr. Ahmed Rafea for his outstanding advice, great help, patient guidance and the support he provided through his rich professional experience. I would like to express my sincere thanks to my thesis committee for their supportive guidance and constructive feedback in evaluating my thesis proposal. I would also like to acknowledge the help of my colleague Micheal Azmy, who provided me with his Web page segmentation tool, with which I started my research project. Special thanks to my dear friends - Ahmed Fekry, Taysser Abo El Magd, Marwa Shaalan, Ali Sayed, Islam Shaker, Ahmed Yassien, Reda Salim, Mohamed El Farran, Mahmoud Khedr and Essam Abd' AL - who participated voluntarily in my experiments. I am also thankful to all my professors at The American University in Cairo (AUC) for their continuous support, guidance and advice. I would like to thank all the Computer Science and Engineering staff for their unfailing assistance.

ABSTRACT

This research is directed towards automating the summarization of frequently asked questions (FAQ) Web pages, a task that captures the most salient pieces of information in the answer to each question. In fact, it was found that thousands of FAQ pages are available on many subjects, not only online but also offline, as most products nowadays come with a leaflet titled FAQ describing their functionality and usage.
Moreover, FAQ Web pages are laid out in a manner where each question has a specific heading style, e.g. bold, underlined, or tagged, and the answer follows in a lower style, usually a smaller font, and may be split into subheadings if the answer is logically divided. This uniformity in the creation of FAQ Web pages, together with their abundance online, makes them quite amenable to automation. To achieve this objective, an approach is proposed that applies Web page segmentation to detect Q/A (question and answer) units, along with the use of selective statistical sentence extraction features for summary generation. The proposed approach is English language dependent. The features are, namely, question-answer similarity, query overlap, sentence location in answer paragraphs, and capitalized words frequency. The choice of these features was mainly influenced by analyzing the different question types and anticipating the expected correct answer. The first feature, sentence similarity, evaluates the semantic similarity between the question sentence and each sentence in the answer. It does so by comparing the word senses of the question and answer words, assigning each pair of words a numerical value and, in turn, an accumulated value to the whole sentence. The second feature, query overlap, extracts nouns, adverbs, adjectives and gerunds from the question sentence, automatically formulates a query from them, and counts the number of matches with each of the answer sentences. The third feature, location, gives a higher score to sentences at the beginning of the answer and lessens the score for the following sentences. The fourth feature, capitalized words frequency, computes the frequency of capitalized words in a sentence. We give each feature a weight and then linearly combine them in a single equation to give a cumulative score for each sentence. The different document features were combined by a home-grown weighting score function. It was found that each of the features used on its own performed well in some cases, based on different kinds of evidence. Pilot experimentation and analysis helped us obtain a suitable combination of feature weights. The thesis proposes three main contributions. First, the proposed feature selection combination is a highly preferable solution to the problem of summarizing FAQ Web pages. Second, utilizing Web page segmentation enables the differentiation between the different constructs residing in Web pages, which allows us to filter out those constructs that are not Q/A. Third, detecting answers that are logically divided into a set of paragraphs under smaller headings representing the logical structure, thus generating a conclusive summary with all paragraphs included. The automatically generated summaries are compared to summaries generated by a widely used commercial tool named Copernic Summarizer 2.1 (http://www.copernic.com/en/products/summarizer/). The comparison is performed via a formal evaluation process involving human evaluators. In order to verify the efficiency of our approach we conducted four experiments. The first experiment tested the summarization quality of our system against the Copernic system in terms of which system produces readable, short, quality summaries. The second experiment tested the summarization quality with respect to the questions' discipline. The third experiment measures and compares the human evaluators' performance in evaluating both summarizers.
The fourth experiment analyzes the evaluators' scores and their homogeneity in evaluating our system and the Copernic system. In general, it was found that our system, FAQWEBSUMM, performs much better than the Copernic summarizer. The overall average scores for all pages indicate that it is superior to the other system by approximately 20%, which is quite promising. The superiority comes from knowing the Web page structure in advance, which helps in reaching better results than applying a general solution to all types of pages.

Keywords: Web Document Summarization, FAQ, Question Answering, Web Page Segmentation.
TABLE OF CONTENTS

Chapter 1 - Introduction
  1.1 Overview
  1.2 Background
  1.3 Problem Definition
  1.4 Motivation
  1.5 Objective
  1.6 Thesis Layout
Chapter 2 - Literature Survey
  2.1 Introduction
  2.2 Taxonomy of Summarization Categories
  2.3 Single-Document Summarization
    2.3.1 Machine Learning Methods
    2.3.2 Deep Natural Language Analysis Methods
  2.4 Multi-Document Summarization
  2.5 Other Tasks in Text Summarization
    2.5.1 Short Summaries
    2.5.2 Query Based Summarization
    2.5.3 Sentence Compression Based Summarization
    2.5.4 Structure Based Summarization
    2.5.5 Link Based Summarization
    2.5.6 Email Messages Summarization
    2.5.7 Customer Review Summarization
    2.5.8 Blog Summarization
    2.5.9 Question Answering & FAQ Summarization
  2.6 Summarization Evaluation
Chapter 3 - Text Summarization Projects
  3.1 Extractive Based Projects
    3.1.1 Machine Learning Based Projects
    3.1.2 Statistical Based Projects
    3.1.3 Question Answering Based Projects
  3.2 Abstractive Based Projects
  3.3 Hybrid Based Projects
Chapter 4 - Proposed Approach to FAQ Web Pages Summarization
  4.1 Proposed Methodology Overview
  4.2 Web Page Segmentation
    4.2.1 Applying Web Page Segmentation to FAQ Web Pages
  4.3 Features Selection
    4.3.1 Semantic Similarity Feature
    4.3.2 Query Overlap Feature
    4.3.3 Location Feature
    4.3.4 Word Capitalization Feature
    4.3.5 Combining Features
  4.4 Pilot Experiment "Detecting Optimum Feature Weights" (Objective, Experiment Description, Results, Discussion)
Chapter 5 - System Design and Implementation
  5.1 FAQWEBSUMM Overall System Design
  5.2 Pre-processing Stage
    5.2.1 Primary Pre-processing
    5.2.2 Secondary Pre-processing
  5.3 Summarization Stage
  5.4 FAQWEBSUMM System Implementation Issues
Chapter 6 - System Evaluation
  6.1 Why Copernic Summarizer?
  6.2 Evaluation Methodology
  6.3 Dataset
  6.4 Experimental Results and Analysis
    6.4.1 Experiment 1 "Performance Quality Evaluation" (Objective, Description, Results, Discussion)
    6.4.2 Experiment 2 "Page Discipline Performance Analysis" (Objective, Description, Results, Discussion)
    6.4.3 Experiment 3 "Human Evaluators' Performance Analysis" (Objective, Description, Results, Discussion)
    6.4.4 Experiment 4 "Analyzing Evaluators' Homogeneity" (Objective, Description, Results, Discussion)
Chapter 7 - Conclusion and Future Work
References
Appendix A - Pilot Experiment Detailed Results
Appendix B - Dataset Description
Appendix C - Detailed Experimental Results
Appendix D - Sample Output Summaries

LIST OF FIGURES

Figure 4.1 An Example of an Answer Logically Divided into Sub-Headings
Figure 4.2 An Example of a Good Summary for the Question in Figure 4.1
Figure 4.3 An Example of a Bad Summary for the Question in Figure 4.1
Figure 4.4 Features' Scores Comparison
Figure 5.1 FAQWEBSUMM Overall System Architecture
Figure 5.2 Segment Attributes Definitions
Figure 5.3 Web Page Segmentation Example
Figure 5.4 FAQWEBSUMM Second Stage of Pre-processing
Figure 5.5 FAQWEBSUMM Third Stage of Pre-processing
Figure 5.6 XML File Structure Example after SBD and POS Stages
Figure 5.7 FAQWEBSUMM Fourth Stage of Pre-processing
Figure 5.8 FAQWEBSUMM Specialized FAQ Pre-processing
Figure 5.9 FAQWEBSUMM Summarization Core
Figure 5.10 Pseudo Code of the Summarization Algorithm
Figure 5.11 Pseudo Code for Calculating the Similarity Feature Score
Figure 5.12 Pseudo Code for Calculating the Location Feature Score
Figure 5.13 Pseudo Code for Calculating the Query Overlap Feature Score
Figure 5.14 Pseudo Code for Calculating the Capitalized Words Feature Score
Figure 6.1 Performance Distributions
Figure 6.2 FAQWEBSUMM Score Distribution
Figure 6.3 Copernic Score Distribution
Figure 6.4 Score Distribution Comparison
Figure 6.5 Page Discipline Score Comparison

LIST OF TABLES

Table 4.1 Yes/No Question Type Examples
Table 4.2 Question-Word Question Type Examples
Table 4.3 Choice Question Type Examples
Table 4.4 Summary Evaluation for Selection Features
Table 5.1 FAQWEBSUMM System Requirements
Table 6.1 FAQ Web Page Evaluation Quality Matrix
Table 6.2 Average Page Scores Categorized by Discipline
Table 6.3 t-Test Values for All Pages by All Evaluators
Table 6.4 Evaluators Improvement Ratio
Table 6.5 t-Test Results over Evaluator Scores
Table 6.6 t-Test Results Comparing Evaluators' Scores

LIST OF ACRONYMS

FAQ - Frequently Asked Questions
Q/A - Question Answer Unit
FAQWEBSUMM - Frequently Asked Questions Web Pages Summarization
XML - Extensible Markup Language
HTML - Hyper Text Markup Language
CMS - Content Management System
PHP - Hypertext Preprocessor
RFC - Request for Comments
SDS - Single Document Summarization
MDS - Multi-Document Summarization
QBS - Query Based or Biased Summarization
DUC - Document Understanding Conference
AI - Artificial Intelligence
NLP - Natural Language Processing
ML - Machine Learning
HMM - Hidden Markov Model
TREC - Text Retrieval Conference
ROUGE - Recall Oriented Understudy for Gisting Evaluation
GA - Genetic Algorithms
MCBA - Modified Corpus Based Approach
LSA - Latent Semantic Analysis
TRM - Text Relationship Map
MRM - Mathematical Regression Model
FFNN - Feed Forward Neural Network
PNN - Probabilistic Neural Network
GMM - Gaussian Mixture Model
GP - Genetic Programming
UMLS - Unified Medical Language System
NE - Named Entity
TAC - Text Analysis Conference
TFIDF - Term Frequency Inverse Document Frequency
RU - Relative Utility
CBRU - Cluster Based Relative Utility
CSIS - Cross-Sentence Informational Subsumption
MEAD - Platform for Multi-Document Multi-Lingual Summarization
HIERSUM - Hierarchical Summarization Model
LDA - Latent Dirichlet Allocation
TOC - Table of Contents
K.U.Leuven - A System for Generating Very Short Summaries
UAM - A System for Generating Very Short Summaries
TBS - Title Based Summarization
BAYESUM - A Bayesian Query Based Summarization Project
PDA - Personal Digital Assistant
ILP - Integer Linear Programming
AUC - Area under the ROC Curve
INEX - Initiative for the Evaluation of XML Retrieval Dataset
SUMMAC - TIPSTER Summarization Dataset
CMSA - Collective Message Summarization Approach
IMSA - Individual Message Summarization Approach
CRF - Conditional Random Fields
QALC - A Question Answering System
P/R - Precision/Recall Evaluation
EDU - Elementary Discourse Unit
ANOVA - Analysis of Variance
BLEU - Bilingual Evaluation Understudy
SCU - Summary Content Unit
NeuralSumm - A Neural Network Based Summarization Project
ClassSumm - A Classification Based Summarization Project
SweSum - The Swedish Text Summarization Project
TS-ISF - Inverse Sentence Frequency
GistSumm - The Gist Summarizer Project
OTS - The Open Text Summarization Project
InText - A Text Summarization Project
FociSum - A Question Answering Based Summarization Project
WebSumm - A Web Document Summarization Project
TRESTLE - Text Retrieval Extraction and Summarization for Large Enterprises
KIP - Knowledge Intensive Process
MUC - Message Understanding Conference
Summons - A Multi-Document Summarization Project
MultiGen - A Multi-Document Summarization Project
FUF/SURGE - Functional Unification Formalism / Syntactic Realization Grammar
SBD - Sentence Boundary Disambiguation
POS - Part of Speech Tagging
IDE - Integrated Development Environment
JVM - Java Virtual Machine
JDK - Java Development Kit
OS - Operating System
MFC - Microsoft Foundation Class Library
t-Test - Student's t-Test
URL - Uniform Resource Locator
Chapter 1 INTRODUCTION

1.1 Overview

New innovative technologies, such as high-speed networks, portable devices and inexpensive massive storage, along with the remarkable growth of the Internet, have led to an enormous increase in the amount and availability of all types of documents, anywhere and anytime. Consequently, it has been realized that added value is not gained merely through larger quantities of data, but through easier access to the required information at the right time and in the most suitable form. Thus, there is a strong need for improved means of facilitating information access. Not surprisingly, it has become a well-known fact that people keep abreast of world affairs by collecting information bites from multiple sources. Additionally, some of the most important practical applications that can make use of distilling the most important information out of a source, to capture some specific portion of information, are the following: multimedia news summaries that watch the news and tell users what happened while they were away; physicians' aid applications that summarize and compare the recommended treatments for certain patients considering certain drugs; and meeting summarization that concludes what happened at a missed teleconference and provides meeting minutes.
Consequently, in this context, automatic summarization of documents can be considered the solution to the problems presented above. Summarization, the art of abstracting key content from one or more information sources, has become an integral part of everyday life. The product of this procedure still contains the most important points of the original text. However, what counts as "text" usually differs with the context: it may include raw text documents, multimedia documents, online documents, hypertexts, etc. In fact, the need for such technology comes directly from the so-called phenomenon of information overload; access to solid and properly developed summaries is essential. Given the billions of pages the Web comprises, "please help me to select just that precise information about the topic x!!" is a typical problem faced by most of us nearly every day in every discipline. Furthermore, it has been said for decades that more and more information is becoming available and that tools are needed to handle it. Consequently, this narrows down to the conclusion of finding someone or something to give us an idea about the content of those dozens of documents, i.e., to give us a grasp of the gist of the text.

1.2 Background

When talking about a (relatively) new and promising technology such as summarization, it is important to caution against false expectations. For instance, one should not expect newly automated systems to be able to replace humans in their work. Instead, one should expect realistic applications which could either help people or release them from a number of boring and recurrent tasks. Though the first attempts at automatic summarization were made over 60 years ago [1-2], this area was until quite recently a rather obscure research branch. No doubt this was due to the non-availability of machine-readable texts. Only recently has a sufficient quantity of material become electronically available, especially through the World Wide Web. Consequently, research in this field has gained much interest in the past few years and has exploded into prominence since 1990 [1-2-3-4]. Systems interested in developing a coherent summary of any kind of text need to take into account several variables, such as length, writing style and syntax, to make a useful summary [5]. In addition, summaries can be characterized by reduction rate, degree of informativeness and degree of well-formedness [3-5].

The reduction rate, also known as the compression rate or condensation rate, is a great factor in characterizing and comparing one summary to another. It is simply measured by dividing the total summary length by the total source text length, and the ratio should be less than one, or else the summary would be the same as the actual source text (see the formula below). The degree of informativeness is another summary characterizer, mainly concerned with the degree of information conveyed by the summary in comparison to the source; it can be thought of as fidelity to the source and relevance to the user's interests. Another very important summary characterizer is how well the summary is formed structurally and grammatically, or the degree of well-formedness.
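Expressed compactly (assuming summary and source lengths are measured in the same units, e.g. words or sentences):

```latex
\[
  \text{compression rate} \;=\; \frac{\operatorname{length}(\text{summary})}{\operatorname{length}(\text{source})} \;<\; 1
\]
```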
However, a few issues arise when learning about the topic of automatic summarizers. One is: do they act like humans when formulating summaries? The answer is that automatic systems seldom behave following the same strategies as humans; one reason may be that we still do not have a very clear idea of what those strategies actually are, nor do computing systems have at their disposal the same kind of resources and background information that we humans have [6]. Another issue is how good summarizers are. This is a hard question to answer, but we can say that it depends on several factors, especially on what kind of things we expect to find in the summary. Maybe we are generically interested in knowing what the author is saying, or maybe what we want to know is what the author says about some particular issue. Besides, we should notice that there are many different types of documents to be summarized: newspaper news, articles, e-mails, scientific papers, Web sites, brochures, books, etc. In each case the level of text significance differs, from documents in plain text to highly codified XML pages (http://www.w3schools.com/xml/default.asp), passing through HTML (http://www.w3schools.com/html/default.asp) or any other mark-up language. Therefore, what makes a good summary is the direct answer to the following questions: a good summary of what, and for what purpose. In addition, summaries tend to have problems in their final shape; for instance, they tend to have gaps or dangling anaphors. Grammatical mistakes and implausible output can also harm the form of the resulting summary [1-2-3]. In conclusion, all of these issues normally accompany the process of text summarization and need to be considered carefully. In fact, they can be considered part of the problem definition itself.

On the other hand, one of the usual and challenging tasks to be carried out in any research is the evaluation of results. In the case of automatic text summarization, evaluation is usually done by comparing automatic summaries against some kind of reference summary built by humans, normally referred to as the "gold standard" [1-2-3]. To achieve that, one usually asks a set of referees to summarize (or extract from) a test corpus of documents. It has been shown that reaching a single reference summary is not a trivial task at all: given the same document, rarely do the referees agree on what information is relevant enough to be selected in a summary. In fact, in some evaluation tests the same referees have been asked to summarize the same documents with a lapse of several weeks between one test and the other. In such cases, referees turned out to agree with themselves in a range of only about 55% [6]. However, this drawback is usually addressed using statistical methods, so as to reach some kind of average summary to be used as a benchmark. This seems to be a fairly reasonable way of evaluating automatic summarization systems. All in all, we have only shed some light on the challenges faced by researchers in tackling the process of automatic text summarization.

1.3 Problem Definition

Many summarization models have been proposed previously to tackle the problem of summarizing Web pages. However, this research is directed towards summarizing a special type of Web page, namely, frequently asked questions (FAQ, http://en.wikipedia.org/wiki/FAQ) Web pages. The problem can be defined formally as follows: given a FAQ Web page P that consists of a set of questions {Q1, Q2, ..., Qn}, each followed by an associated answer from the set {A1, A2, ..., An}, the task of FAQ Web page summarization is to extract, for each question Qi in P, a subset of sentences from its answer Ai that best represents a brief answer to that question.
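To make the definition concrete, a minimal illustrative sketch of the data model it implies is shown below in Java. The type and method names are hypothetical, and the lead-sentence baseline is only a placeholder for the scoring approach developed later in the thesis.

```java
import java.util.List;

/** A question together with the sentences of its associated answer, as found on a FAQ page. */
record QAUnit(String question, List<String> answerSentences) {}

/** A FAQ Web page P viewed as an ordered list of Q/A units. */
record FaqPage(String url, List<QAUnit> units) {}

/** The task: for each question Qi in P, extract a subset of sentences of its answer Ai that
 *  best represents a brief answer. This toy baseline simply keeps the first answer sentence. */
public class LeadSentenceBaseline {
    public static List<List<String>> summarize(FaqPage page) {
        return page.units().stream()
                .map(u -> u.answerSentences().subList(0, Math.min(1, u.answerSentences().size())))
                .toList();
    }

    public static void main(String[] args) {
        FaqPage page = new FaqPage("http://example.org/faq", List.of(
                new QAUnit("How do I reset the device?",
                        List.of("Hold the power button for ten seconds.",
                                "The device was first released in 2009."))));
        System.out.println(summarize(page));
    }
}
```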
1.4 Motivation

Prior to starting this work on FAQ Web page summarization, we carried out an informal pilot study to gauge the usability and importance of this type of page. It was found that there exist thousands of FAQ pages available on many subjects, not only online but also offline, as most products nowadays come with a leaflet titled FAQ describing their functionality and usage. This applies to most, if not every, available discipline: commercial products, academia, service providers, manuals, etc. Moreover, there exist several Web sites online that catalog and provide search capabilities for FAQs, for example the Internet FAQ Consortium (http://www.faqs.org/faqs/). Furthermore, FAQs nowadays tend to be stored in online content management systems (CMS, http://en.wikipedia.org/wiki/Content_management_system) or in simple text files. Since 1998, a number of specialized software programs have emerged, mostly written in Perl (http://en.wikipedia.org/wiki/Perl) or PHP (http://en.wikipedia.org/wiki/PHP); some of them are integrated into more complex FAQ handler software applications, while others, like phpMyFAQ (http://en.wikipedia.org/w/index.php?title=PhpMyFAQ&action=edit&redlink=1), can be run either as a stand-alone FAQ or integrated into Web applications. Therefore, the probability of accessing sites in the different disciplines that contain FAQs is quite high, which suggests that they should be given a satisfactory degree of consideration.

Additionally, due to the abundance of information available online, users find it difficult to extract certain pieces of information quickly and efficiently. Hence there is a great need to convert information into a condensed form that presents a non-redundant account of the most relevant facts found across documents, to ensure easy access and review. This has proven to be quite an important and vital need, especially for World Wide Web users, who are in contact with an enormous, perhaps endless, amount of information content. As a result, it opens the door wide to the process of summarization, which comes in many different flavors. For instance, here are some forms of the different types of summaries: people tend to go through product reviews before actually buying a product; people tend to keep up with world affairs by listening to or reading news bites; people base their investments in the stock market on summaries of updates to different stocks. All in all, with summaries people can make effective decisions in less time.

In fact, this research is mainly motivated by the importance of FAQs, especially in the World Wide Web domain as depicted earlier, along with how to present them in a more concise form based on a newly adapted summarization technique. One more motive for summarizing FAQ Web pages is that the structure of FAQ pages is almost a standard; there have even been proposals for standardized FAQ formats, such as RFC 1153 (http://www.rfc-editor.org/info/rfc1153) and the Minimal Digest Format.
1.5 Objective

The main objective of this research is to develop a summarization technique that makes use of the extra knowledge extracted from the process of Web page segmentation, along with the usage of selective features for sentence extraction, to improve frequently asked questions Web page summarization. The document semantics are scattered along the page and need to be detected correctly; in our case, a set of questions and answers. Therefore, it would be helpful if the knowledge contained in the structure of the page could be uncovered to complement the Web page contents. Recent research work on extractive summary generation employs heuristics and does not indicate how to select the relevant features for a specific type of text. Therefore, we adopt the hypothesis that different text types need to be processed differently.
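As a rough illustration of the selective-feature scoring pursued here, the sketch below combines the four features named in the abstract (question-answer similarity, query overlap, location, capitalized words frequency) into a single weighted score. The weights and the assumption that each feature is normalized to [0, 1] are illustrative placeholders, not the tuned values reported in the pilot experiment.

```java
/** Illustrative weighted linear combination of the four sentence features described in the
 *  abstract. Feature scores are assumed normalized to [0, 1]; the weights are placeholders. */
public final class SentenceScorer {
    // Hypothetical weights; the thesis tunes these through a pilot experiment.
    private static final double W_SIMILARITY = 0.4;
    private static final double W_QUERY_OVERLAP = 0.3;
    private static final double W_LOCATION = 0.2;
    private static final double W_CAPITALIZATION = 0.1;

    /** Cumulative score of one answer sentence with respect to its question. */
    public static double score(double similarity, double queryOverlap,
                               double location, double capitalization) {
        return W_SIMILARITY * similarity
             + W_QUERY_OVERLAP * queryOverlap
             + W_LOCATION * location
             + W_CAPITALIZATION * capitalization;
    }

    public static void main(String[] args) {
        // Toy example: a sentence that is semantically close to the question and appears early.
        System.out.printf("score = %.3f%n", score(0.8, 0.5, 1.0, 0.2));
    }
}
```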
1.6 Thesis Layout

The thesis is organized in the following manner. In chapter two a thorough literature survey is conducted, giving a brief overview of the taxonomy of the different types of summaries, early work in text summarization, single document summarization (SDS), multi-document summarization (MDS), other approaches in text summarization and, finally, summarization evaluation. In chapter three we list some of the most famous summarization projects in the field of text summarization on both the academic and commercial sides. In chapter four we introduce the proposed methodology and how it is made applicable using a utility like Web page segmentation; additionally, we show how we employ it to comprehend the FAQ Web page structure in the form of questions plus answers. In chapter five we introduce the system architecture and design to illustrate how the problem is solved. It also covers some implementation issues, such as the tools used to develop our system, and describes the target environment. The developed summarization system has been given the code name FAQWEBSUMM; the name reflects the proposed functionality of the system, which performs FAQ Web page summarization. Chapter six presents the experimental results and analysis, divided into the following sections: the experimental methodology, which describes the environment where the experiments took place, the tools used and the dataset; the actual quantitative results, in the form of charts and tables; and a thorough analysis of the results, explaining and highlighting what is important in terms of meeting the proposed final goals. In chapter seven a conclusion sums up the overall contribution by first listing the claims and whether they were met by results from the proposed system; later, in the future work, we discuss possible extensions to this research.

Chapter 2 LITERATURE SURVEY

2.1 Introduction

The aim of this chapter is to survey the most famous work done in the field of automatic text summarization by introducing a range of fundamental terms and ideas. There are multiple summary types corresponding to the different typologies and current research lines in the field. We will show the taxonomy of summarization categories found in the literature. Next we describe single document summarization (SDS), focusing on extractive techniques. We start by giving a brief introduction to it and mention some of the early work and ideas that were the spark for all modern approaches. Then we show some of the most famous work in single document summarization utilizing machine learning techniques and statistical techniques. Next we shift our attention to multi-document summarization (MDS), which is now seen as one of the hottest topic areas in the text summarization field. What makes it interesting is that it deals with a distinct set of problems. Some of these problems are the need to highlight differences between different reports and to deal with time-related issues (e.g., changing estimates of numbers of casualties). Another non-trivial problem is handling cross-document co-references, for example, ensuring that mentions of different names in two different documents actually refer to the same person. The literature on text summarization cannot simply be confined to the previous two categories of single document and multiple documents. There exist other summarization types that need to be introduced in this survey to give a thorough introduction to the field. These are short summaries, query based summaries, sentence compression summaries, structure based summarization, link based summarization, email messages summarization, customer review summarization, blog summarization, and question answering and FAQ summarization. We will give a brief introduction to each of these categories. Another extremely important point to be covered in the literature survey is summarization evaluation. It is quite a hard problem to deal with, because summarization systems differ from one another in the form of input, output and the type of audience to whom the system is directed. All of these issues will be addressed in the summarization evaluation section.

2.2 Taxonomy of Summarization Categories

Summaries can be differentiated from one another based on a set of aspects, which are purpose, form, dimension and context [3-4-5]. The purpose of a summary can be indicative, to indicate the direction or orientation of the text and whether it is for or against some topic or opinion. A summary's purpose can also be informative, meaning it only tells the facts in the text and presents them neutrally. A critical summary is another purpose, where the summary tries to criticize the text. Another summary categorization can be based on the summary form and whether it is an extractive or abstractive summary. Extractive summaries consist entirely of sentences or word sequences contained in the original document; besides complete sentences, extracts can contain phrases and paragraphs. The problem with this type of summary is usually a lack of balance and cohesion: sentences can be extracted out of context and anaphoric references can be broken. Abstractive summaries contain word sequences not present in the original; they are usually built from the existing content but using more advanced methods. The dimension of the source can be another point of differentiation between summaries, as it takes into account whether the input text is a single document or multiple documents. One more categorization of the summary type is based on the context of the summary, that is, whether it is query based or generic. A query based summary targets only a single topic, so it is more precise and concise.
On the other hand, generic or query-independent summaries are usually not centered on a single topic, so they try to capture the main theme(s) of the text and to build the whole summary around them unguided. In this survey we concentrate on general types of summaries based on dimension or context (single-document, multi-document and other types of summarization), analyzing the several approaches currently incorporated by each type.

2.3 Single-document Summarization

Usually, the flow of information in a given document is not uniform, which means that some parts are more important than others. Thus the major challenge in summarization lies in distinguishing the more informative parts of a document from the less informative ones. Though there have been instances of research describing the automatic creation of abstracts, most of the work presented in the literature relies on the extraction of sentences to address the problem of single-document summarization. Unfortunately, research in the area of single document summarization has been somewhat declining since the single document summarization track was dropped from the DUC (http://duc.nist.gov) challenge in 2003. Additionally, according to [7], the general performance of summarization systems tends to be better in multi-document than in single document summarization tasks. This could be partially explained by the fact that repetitive occurrences across the input documents can be used directly as an indication of importance in multi-document environments.

In this section, we describe some well-known extractive techniques in the field of single document summarization. First, we show some of the early work that initiated research in automatic summarization. Second, we show approaches involving machine learning techniques. Finally, we briefly describe some deep natural language processing techniques that tackle the problem.

Most early work on single-document summarization focused on technical documents. Hans Peter Luhn was the pioneer in using computers for information retrieval research and application development, and the most cited paper on text summarization is his [8], which describes research done at IBM in the 1950s. In this work, Luhn proposed that the frequency of a particular word in an article provides a useful measure of its significance. Several key ideas put forward in this paper have assumed importance in later work on summarization. Related work in [9] provides early insight on a particular feature that was thought to be very helpful in finding salient parts of documents: sentence position. By experiment, he showed that 85% of the time the topic sentence is the first one and only 7% of the time it comes last, while the other cases are randomly distributed. The work in [10] was based on the two features of word frequency and positional importance, which were incorporated in the previous two works. Additionally, two other features were used: the presence of cue words (for example, words like significant, fortunately, obviously or hardly) and the structure of the document (whether the sentence is a title or heading). Weights were attached to each of these features manually to score each sentence. However, evaluations showed that approximately 44% of the automatic extracts matched the manual human extracts [1].
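A minimal sketch of the kind of frequency- and position-based scoring these early works describe follows. The stopword list, the positional bonus and the decision to return only the single best sentence are illustrative assumptions, not Luhn's or Edmundson's exact formulations.

```java
import java.util.*;

/** Toy Luhn-style extractor: score sentences by the frequency of their significant words,
 *  with a small bonus for early position, and return the top-ranked sentence. */
public final class LuhnStyleScorer {
    private static final Set<String> STOPWORDS =
            Set.of("the", "a", "of", "and", "to", "in", "is", "it", "that");

    public static String bestSentence(List<String> sentences) {
        // Document-level frequencies of non-stopwords ("significant" words).
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isBlank() && !STOPWORDS.contains(w)) freq.merge(w, 1, Integer::sum);

        String best = sentences.get(0);
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < sentences.size(); i++) {
            double score = 0;
            for (String w : sentences.get(i).toLowerCase().split("\\W+"))
                score += freq.getOrDefault(w, 0);
            score += (sentences.size() - i) * 0.5;   // illustrative positional bonus
            if (score > bestScore) { bestScore = score; best = sentences.get(i); }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestSentence(List.of(
                "Text summarization extracts the most important sentences.",
                "Frequency of significant words is a useful measure of importance.",
                "The weather was pleasant that day.")));
    }
}
```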
In the 1980s and 1990s, interest shifted toward using AI methods, hybrid approaches, and summarization of groups of documents and multimedia documents. In fact, the main problem with the first generation of text summarization systems was that they only used informal heuristics to determine the salient topics from the text representation structures. The second generation of summarization systems then adopted more mature knowledge representation approaches [11], based on the evolving methodological framework of hybrid, classification-based knowledge representation languages.

2.3.1 Machine Learning Methods

Lately, with the advent of new machine learning techniques in NLP, a series of publications appeared that employed statistical techniques to produce document extracts. Initially most systems assumed feature independence and relied mostly on naive-Bayes methods; others have focused on the choice of appropriate features and on learning algorithms that make no independence assumptions. Other significant approaches involve hidden Markov models and log-linear models to improve extractive summarization. Other work, in contrast, uses neural networks and genetic algorithms to improve purely extractive single document summarization.

Many machine learning systems relied on methods based on a naive-Bayes classifier in order to learn from data. The classification function categorizes each sentence as worthy of extraction or not. In [12] they developed an approach that uses the distribution of votes in a Bayesian model to generate a probability of each sentence being part of the summary. Their research showed that early sentences appear in summaries more often than final ones. In [13] they tackled the problem of extracting a sentence from a document using a hidden Markov model (HMM). The basic motivation behind using a sequential model is to account for local dependencies between sentences. They incorporated only three features: the position of the sentence in the document (built into the state structure of the HMM), the number of terms in the sentence, and the likeliness of the sentence terms given the document terms. They used the TREC (http://trec.nist.gov/) dataset as a training corpus, while the evaluation was done by comparing their extracts to human generated extracts. Most systems claim that existing approaches to summarization have always assumed feature independence. In [14], the author used log-linear models to relax this assumption and showed empirically that their system produced better extracts than naive-Bayes models.
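The sketch below illustrates the naive-Bayes style of extract/non-extract classification discussed above, using three binary features loosely inspired by those works. The prior and conditional probabilities are invented toy values, not probabilities trained from any corpus, and the feature set is my own illustration rather than that of [12] or [13].

```java
/** Toy naive-Bayes extract/non-extract decision over three binary sentence features:
 *  early position, above-average length, and overlap with the title.
 *  All probabilities are illustrative placeholders, not values learned from a corpus. */
public final class NaiveBayesSentenceClassifier {
    private static final double P_EXTRACT = 0.2;                  // prior P(extract)
    // P(feature = true | class) for [extract, non-extract]
    private static final double[] P_EARLY = {0.7, 0.2};
    private static final double[] P_LONG  = {0.6, 0.4};
    private static final double[] P_TITLE = {0.5, 0.1};

    /** Returns P(extract | features) under the naive independence assumption. */
    public static double posterior(boolean early, boolean longSentence, boolean titleOverlap) {
        double likeExtract = like(early, P_EARLY[0]) * like(longSentence, P_LONG[0]) * like(titleOverlap, P_TITLE[0]);
        double likeOther   = like(early, P_EARLY[1]) * like(longSentence, P_LONG[1]) * like(titleOverlap, P_TITLE[1]);
        double joint = P_EXTRACT * likeExtract;
        return joint / (joint + (1 - P_EXTRACT) * likeOther);
    }

    private static double like(boolean present, double pTrue) {
        return present ? pTrue : 1 - pTrue;
    }

    public static void main(String[] args) {
        System.out.printf("P(extract | early, long, title overlap) = %.3f%n",
                posterior(true, true, true));
    }
}
```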
Thus, system is designed to extract three sentences from single document that best match three document highlights. For sentence ranking they used the RankNet, a ranking algorithm designed to rank a set of inputs that uses the gradient descent method for training. The system is trained on pairs of sentences in single document, such that first sentence in the pair should be ranked higher than second one. Training is based on modified back-propagation algorithm for two layer networks. The system relies on a two layer neural network. They employed two types of features in their system. First, they used some surface features like sentence position in article, word frequency in sentences and title similarity. However, the novelty of their framework lay in the use of features that derived information from query logs from Microsoft's news search engine13 and Wikipedia 14 entries. Hence, if the parts of news search query or Wikipedia title appear frequently in the news article to be summarized, then higher importance score is attached to the sentences containing these terms. The results of this summarization approach are encouraging. Based on evaluation using ROUGE-1 [17] and comparing to the baseline system, this system performs better than all previous systems for news article summarization from DUC workshops. In [18] they used machine learning algorithms to propose an automatic text summarization approach through sentence segment extraction. This is simply done by breaking each sentence into a set of segments. Each segment is then represented by a set of predefined features like the location of the segment in text, the average term frequency of each word occurring in the segment and also the frequency of the title words occurring in the sentence segment. The feature scores are then combined using a machine learning algorithm to generate a vector of highly ranked sentences to generate the summary from. Another attempt in the same direction [19] coupled the use of machine learning algorithm 13 14 http://search.live.com/news http://en.wikipedia.org 13 with the use of gene expression programming similar to genetic programming to produce a better weighting scheme. On the other hand, other approaches train genetic algorithms (GA) and use mathematical regression models to obtain a suitable combination of feature weights. In [20] they proposed two approaches to address text summarization: modified corpus-based approach (MCBA) and LSA-based T.R.M. The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries. Two new ideas are exploited: (a) sentence positions are ranked to emphasize the significances of different sentence positions, and (b) the score function is trained by the genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. It was found out that with compression rate of 30%, an average f-measure of 49% for MCBA, 52% for MCBA+ GA, 44% and 40% for LSA + T.R.M. in single-document and corpus level were achieved respectively [20]. 
Another attempt [21] addresses the problem of improving content selection in automatic text summarization through a trainable summarizer. It takes into account several features, including sentence position, positive keyword, negative keyword, sentence centrality, sentence resemblance to the title, sentence inclusion of named entities, sentence inclusion of numerical data, sentence relative length, the bushy path of the sentence, and aggregated similarity for each sentence, to generate summaries. They investigate the effect of each sentence feature on the overall summarization task. Then they combine all features to train genetic algorithm (GA) and mathematical regression (MRM) models to obtain a suitable combination of feature weights. Moreover, they use all feature parameters to train a feed forward neural network (FFNN), a probabilistic neural network (PNN) and a Gaussian mixture model (GMM) in order to construct a text summarizer for each model. Furthermore, they use the models trained on one language to test summarization performance on the other language. The performance of their approach [21] is measured at several compression rates on a data corpus composed of 100 Arabic political articles and 100 English religious articles. The results of the proposed approach are promising, especially for the GMM approach.

Furthermore, in [22] they proposed a novel technique for summarizing text using a combination of genetic algorithms (GA) and genetic programming (GP) to optimize rule sets and membership functions of fuzzy systems. Their main goal is to develop an optimal intelligent system to extract important sentences from texts by reducing the redundancy of data. The novelty of the proposed algorithm comes from the hypothesis that the fuzzy system is optimized for extractive text summarization. In their work, GP is used for the structural part and GA for the string part (membership functions). The chosen fitness function considers both local properties and global summary properties by considering various features of a given sentence, such as its relative number of thematic words as well as its location in the whole document. Their method is compared with standard fuzzy systems as well as two commercial summarizers: Microsoft Word and Copernic Summarizer. Simulations demonstrate several significant improvements with the proposed approach.
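A recurring idea in these works is searching for feature weights that best reproduce human extracts. The sketch below uses simple random-mutation hill climbing as a deliberately simplified stand-in for the genetic algorithms described above, and its fitness function is an invented placeholder rather than agreement with any real reference extracts.

```java
import java.util.Random;

/** Toy weight search for a linear sentence-scoring function: a hill-climbing simplification of
 *  GA-based weight tuning. The fitness function is a made-up stand-in for agreement with
 *  human reference extracts. */
public final class WeightTuner {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[] weights = {0.25, 0.25, 0.25, 0.25};      // start from uniform feature weights
        double best = fitness(weights);
        for (int iter = 0; iter < 10_000; iter++) {
            double[] candidate = weights.clone();
            int i = rnd.nextInt(candidate.length);
            candidate[i] = Math.max(0, candidate[i] + rnd.nextGaussian() * 0.05); // mutate one weight
            normalize(candidate);
            double f = fitness(candidate);
            if (f > best) { best = f; weights = candidate; }                      // keep improvements only
        }
        System.out.printf("best fitness %.4f with weights %.2f %.2f %.2f %.2f%n",
                best, weights[0], weights[1], weights[2], weights[3]);
    }

    /** Placeholder fitness: in a real setting this would be, e.g., f-measure against reference extracts. */
    private static double fitness(double[] w) {
        double[] target = {0.4, 0.3, 0.2, 0.1};           // pretend these weights match humans best
        double err = 0;
        for (int i = 0; i < w.length; i++) err += Math.abs(w[i] - target[i]);
        return 1.0 - err;
    }

    private static void normalize(double[] w) {
        double sum = 0;
        for (double v : w) sum += v;
        for (int i = 0; i < w.length; i++) w[i] /= sum;
    }
}
```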
2.3.2 Deep Natural Language Analysis Methods

In this subsection, we describe a set of papers that detail approaches to single document summarization involving complex natural language analysis techniques. They tend to use a set of heuristics to create document extracts. Most of these techniques try to model the text's discourse structure and incorporate considerable amounts of linguistic analysis to perform the task of summarization. They also try to reach a middle ground between approaches based on the analysis of the semantic structure of the text and approaches based on word statistics of the documents.

In [23] they introduced a system that focuses on dynamic summary generation based on a user input query. This approach has been designed for application in a specific domain (medical); however, it can be used in the general domain too. Their idea is based on the fact that the user selects the keywords to search for a document with specific requirements. However, these keywords may not match the document's main idea, and thus the document summary provided would be less relevant. Hence, the summary needs to be generated dynamically, according to the user requirements given by the search query. The system is coupled with two ontology knowledge sources, WordNet and UMLS. WordNet is a widely known lexical database for English, developed at Princeton University. UMLS is maintained by the United States National Library of Medicine and includes three knowledge sources: the Metathesaurus, the Semantic Network and the Specialist lexicon. Their system runs in three main steps. First, they evaluate and adjust the query with regard to the WordNet and/or UMLS ontology knowledge: redundant keywords are removed and relevant ones added. Second, they calculate the distance between the document's sentences and the adjusted query; sentences below a predefined threshold are candidates for inclusion in the final document summary. Third, they calculate the distance among the candidate summary sentences; the candidates are then separated into groups based on the threshold, and the highest ranked candidate from each group becomes part of the document summary. Early evaluations of this system as part of DUC 2007 showed good potential. However, some problems related to redundancy reduction, lack of syntax analysis and insufficient query analysis were highlighted by the conference, and the authors state that these will be addressed in future work.

Another, similar approach based on the semantic analysis of the document has been proposed in [13]. In this approach the authors propose a scoring scheme for sentence extraction based on static and dynamic document features. Static features include sentence locations and the named entities (NE) in each sentence. Dynamic features used for scoring include the semantic similarity between sentences and the user query. In attaining their goal they incorporated three main stages: preprocessing, analysis and summary generation. In the preprocessing stage, they remove unnecessary elements from the document; the document is then tokenized, and sentence boundaries are detected along with named entity recognition. In the analysis step, the extraction and analysis of features is done and a relevancy score is built for the sentences. Finally, the sentence score is calculated as a linear combination of the weighted features. Their system was presented at TAC (http://www.nist.gov/tac/) 2008, and demonstrated better performance at finding relevant content than at removing irrelevant content.
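A minimal sketch of the query-driven selection scheme just described follows. Cosine similarity over raw term counts stands in for the ontology-based distance of [23], and the threshold value and method names are arbitrary illustrations.

```java
import java.util.*;

/** Toy query-based sentence selection: keep sentences sufficiently similar to the query,
 *  using cosine similarity over term counts as a stand-in for ontology-based distance. */
public final class QueryBasedSelector {
    public static List<String> select(String query, List<String> sentences, double threshold) {
        Map<String, Integer> q = termCounts(query);
        List<String> selected = new ArrayList<>();
        for (String s : sentences)
            if (cosine(q, termCounts(s)) >= threshold) selected.add(s);
        return selected;
    }

    private static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isBlank()) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    private static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (var e : a.entrySet()) dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        for (int v : a.values()) na += (double) v * v;
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(select("drug treatment side effects",
                List.of("The treatment has mild side effects.",
                        "The hospital was founded in 1950.",
                        "Patients reported no side effects from the drug."), 0.3));
    }
}
```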
2.4 Multi-document Summarization

It has become enormously important to develop efficient ways of extracting text. Single-document summarization systems support the automatic generation of extracts, abstracts or question-based summaries, but a single-document summary provides only limited information about the contents of one document. In most cases, especially since the introduction of the Internet, a user makes an inquiry about a topic that involves a large set of related documents; such a query returns hundreds of documents, and although they differ in places, many of them provide the same information. A summary of each document would help, but those summaries would be semantically similar. In a community where time plays an important role, multi-document summarizers therefore play an essential role. Multi-document summarization has been popular since the mid-1990s and has mostly been applied to news articles. We will provide a brief overview of the most important ideas in this field.

In [24] the authors introduce MEAD, a multi-document summarizer that generates summaries using cluster centroids produced by a topic detection and tracking system, where a centroid is a set of words that are statistically important to a cluster of documents. Centroids can be used both to classify relevant documents and to identify salient sentences in a cluster. The algorithm first groups related documents into clusters: each document is represented as a weighted TF*IDF vector (http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html), and only documents whose similarity falls within a certain threshold are included in a cluster. MEAD then decides which sentences to include in the extract by ranking them according to three features: centroid value, positional value, and first-sentence overlap. The centroid value Ci of sentence Si is the sum of the centroid values Cw,i of all words in the sentence. The positional value gives the first sentence in a document the same score Cmax as the highest-ranking sentence in the document according to the centroid value. The overlap value is the inner product of the sentence vectors of the current sentence and the first sentence of the document. The score of a sentence is the weighted sum of these values, and equal weights are used for all three parameters. The authors also introduce a new evaluation scheme based on sentence utility and subsumption. Cluster-based relative utility (CBRU, or relative utility, RU) measures the degree of relevance, from 0 to 10, of a particular sentence to the general topic of the entire cluster: 0 means the sentence is not relevant and 10 marks an essential sentence. A related notion, cross-sentence informational subsumption (CSIS, or subsumption), reflects that certain sentences repeat information present in other sentences and may therefore be omitted during summarization. Evaluation systems built on RU provide a more quantifiable measure of sentences, and to use CSIS in the evaluation they introduce a new parameter, E, which controls how much a system is penalized for including redundant information. Overall, they found that MEAD produces summaries similar in quality to those produced by humans; they also compared MEAD's performance to an alternative method, multi-document lead, and showed how MEAD's sentence scoring weights can be modified to produce summaries significantly better than the alternatives.
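A simplified rendering of the MEAD scoring just described is given below. The centroid here is simply the term-frequency profile of the cluster and the positional value decays linearly, which are simplifying assumptions; the equal weighting of the three features follows [24].

# A simplified sketch of MEAD-style scoring: centroid value, positional value
# and first-sentence overlap, combined with equal weights.
from collections import Counter

def mead_scores(doc_sentences, centroid, w=(1.0, 1.0, 1.0)):
    tokenized = [s.lower().split() for s in doc_sentences]
    # Centroid value: sum of centroid weights of the words in the sentence.
    c_scores = [sum(centroid.get(t, 0.0) for t in toks) for toks in tokenized]
    c_max = max(c_scores) or 1.0
    first = Counter(tokenized[0])
    scores = []
    for i, toks in enumerate(tokenized):
        positional = c_max * (len(tokenized) - i) / len(tokenized)  # first sentence gets C_max
        overlap = sum((Counter(toks) & first).values())             # inner-product-style overlap
        scores.append(w[0] * c_scores[i] + w[1] * positional + w[2] * overlap)
    return scores

cluster_text = "centroid based summarization ranks sentences by centroid words".split()
centroid = dict(Counter(cluster_text))
doc = ["Centroid based summarization ranks sentences.",
       "Unrelated filler text appears here.",
       "Centroid words matter most."]
print(mead_scores(doc, centroid))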
In [25] the authors present two multi-document extraction systems: one produces a general-purpose summary of a cluster of related documents, and the other an entity-based summary of documents related to a particular person. The general-purpose summary ranks sentences by their document and cluster "worthiness", which requires taking into account the relationship each sentence has to the set of documents (the cluster) that constitutes the input. To this end they construct a centroid representation of the cluster of n related documents and compute the cosine similarity between each sentence in the document set and the centroid. A sentence in a cluster is scored with three features: (i) sentence-cluster similarity, (ii) sentence-lead-document similarity, and (iii) absolute document position; these values are combined with appropriate weights to produce the final score used to rank the sentences. The personality-based summary, on the other hand, ranks sentences according to a metric that uses co-reference and lexical information in a person profile. Three ideas are explored: identifying and measuring references to the key entity in a sentence; identifying and measuring whether person facets or characteristics are referred to in a sentence; and identifying and measuring mentions of information associated with the key entity. The final score is the weighted sum of these three features. In both summary modules, a redundancy-removal step excludes repeated information.

In [26] the authors explore the impact of documents on summarization performance. They propose a document-based graph model that incorporates document-level information and the sentence-to-document relationship into the graph-based ranking process. The basic graph-based model decides the importance of a vertex within a graph from global information drawn recursively from the entire graph; the underlying idea is "voting" between vertices, where a link between two vertices is a vote cast from one vertex to the other, and the score of a vertex is determined by the votes cast for it and by the scores of the vertices casting those votes. The basic model, however, is built on a single-layer sentence graph, and the transition probability between two sentences in the Markov chain depends only on the sentences themselves, ignoring document-level information and the sentence-to-document relationship. To incorporate these, the authors propose a document-based graph model built on a two-layer link graph containing both sentences and documents. They devise three methods to evaluate the importance of a document within a document set: the cosine similarity between the document and the whole document set; the average similarity between the document and every other document in the set; and a weighted graph between documents, with link weights computed by the cosine measure, on which the PageRank algorithm (http://www.markhorrell.com/seo/pagerank.html) computes the documents' rank scores. They also propose four methods to evaluate sentence-to-document correlation: the first three are based on sentence position in the document, under the assumption that the first sentences of a document are usually more important than the others, and the fourth is based on the content similarity between the sentence and the document. Experimental results on DUC 2001 and DUC 2002 demonstrate the effectiveness of the proposed model.
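The sketch below implements only the basic single-layer ranking that [26] starts from: a PageRank-style power iteration over a cosine-similarity sentence graph. The two-layer document-based extension of [26] is not reproduced here, and the damping factor and iteration count are generic assumptions.

# A compact sketch of basic graph-based sentence ranking by power iteration.
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def graph_rank(sentences, damping=0.85, iterations=30):
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    row_sums = [sum(row) or 1.0 for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n +
                  damping * sum(scores[j] * sim[j][i] / row_sums[j] for j in range(n))
                  for i in range(n)]
    return scores

docs = ["Graph ranking scores sentences by their neighbours.",
        "Neighbours with similar words reinforce each other.",
        "An isolated sentence receives little support."]
print(graph_rank(docs))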
In [27] the authors present a generative probabilistic model for multi-document summarization. They start with a simple word-frequency-based model and then construct a sequence of models, each injecting more structure into the representation of document-set content. Their final model, HIERSUM, uses a hierarchical LDA (Latent Dirichlet Allocation) style model to represent content specificity as a hierarchy of topic vocabulary distributions, producing general summaries as well as summaries for any of the learned sub-topics. The work relies on the observation that the content of document collections is highly structured, consisting of several topical themes, each with its own vocabulary and ordering preferences, and on the hypothesis that a user may be interested either in the general content of a document collection or in one or more of the sub-stories it contains. The topic-modeling approach is therefore adapted to capture this aspect of document-set content: rather than drawing a single content distribution for a document collection, a general content distribution is drawn. At the task of producing generic DUC-style summaries, HIERSUM yields state-of-the-art ROUGE performance.

2.5 Other Tasks in Text Summarization

2.5.1 Short Summaries

The generation of very short summaries (less than 75 bytes), a problem also called headline generation, was introduced by the DUC as a summarization task. The logic behind this task is that the most relevant ideas are usually expressed in a few sentences, which allows for this simplification, and most of the document can be discarded. Topic segmentation is well known as a preprocessing step in numerous natural language processing applications. In [28] a topic segmentation algorithm is adapted to produce automatic short summaries. The algorithm detects thematic structures in texts using generic text-structure cues, associates key terms with each topic or subtopic, and outputs a tree-like table of contents (TOC). These TOCs are then used to produce single-document abstracts as well as multi-document abstracts and extracts, on the argument that the text-structure trees reflect the most important terms. One clear advantage of this approach is that it requires no prior training, which makes it generally applicable.

Another attempt at producing short summaries is the UAM system [29]. It starts by selecting the most relevant sentences using a genetic algorithm with a combination of heuristics as the fitness function; the weight of each heuristic is obtained with another genetic algorithm built on top of them. Next, verb phrases and their arguments are extracted from those sentences. To generate summaries, highly ranked sentences are connected with prepositions and conjunctions whenever possible; extracts that still have space are completed with the top-frequency words from the documents and collections and with noun phrases selected by a combination of criteria. According to their experiments [29], setting the summary length limit to 150 or 200 yields summaries that are both grammatical and informative.
However, forcing a limit on the summary length had the consequence of leaving out relevant verb phrases, and the empty spaces had to be filled with keywords. The K.U.Leuven summarization system [30] provides another solution to the problem of generating short summaries. It participated in the DUC in three main tasks: very short single-document summaries (headlines), short multi-document summaries, and short summaries focused by questions. For the first task, one of two methods is used depending on the headline focus: picking out keywords, which is a good way to cover all the content, or applying sentence compression techniques, which give better readability and come closer to the way humans construct headlines. For the second task, the term vectors of the important sentences of single documents are clustered, and important sentences are then selected based on the number of keywords they contain (the same approach as for headlines). The third task requires summaries answering questions such as Where is X? What is Y? When is Z?, where X is a person, Y a thing and Z an event. The system consists of a succession of filters and sentence-selection modules: selecting sentences indicative of the input person, thing or event, intersecting them with sentences that are important for the whole document, and filtering out indirect speech. Finally, to eliminate redundant content while fitting the required length, the resulting sentences from all documents in a set are clustered with the covering method; the documents are considered in chronological order and the sentences in reading order within each document.

2.5.2 Query Based Summarization

Query-based summarization, also known as query-biased summarization, is driven by a keyword set entered by the user as a search query. The query is treated as the central reference around which all information is extracted and ordered into one or more semi-coherent statements. In [31] the authors propose a query-focused summarization system named BAYESUM. It exploits the common case in which multiple documents are relevant to a single query, and effectively asks: what is it about these relevant documents that differentiates them from the non-relevant documents? BAYESUM can be seen as a statistical formulation of exactly this question. It builds on the concept of language models for information retrieval, whose idea is to represent queries or documents (or both) as probability distributions and then compare them with standard probabilistic techniques. A related approach is presented in [32]: a supervised sentence-ranking method for extractive summarization that provides query-focused multi-document summarization, both for single summaries and for series of update summaries. The problem is broken into three main tasks [32]: text normalization, which prepares and cleans the input dataset by removing meta-data and performing sentence segmentation; supervised sentence ranking, which uses machine learning techniques to determine which features to use in sentence scoring; and, finally, selection of highly scored sentences from a ranked list.
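The language-model view of relevance that BAYESUM builds on can be illustrated with a plain query-likelihood scorer, shown below. The add-one smoothing and the vocabulary size are simplifying assumptions, and BAYESUM's actual model, which contrasts relevant with non-relevant documents, is considerably richer; this sketch shows only the underlying retrieval idea.

# A generic query-likelihood scorer with add-one smoothing.
import math
from collections import Counter

def query_log_likelihood(query, passage, vocab_size=10000):
    counts = Counter(passage.lower().split())
    total = sum(counts.values())
    score = 0.0
    for term in query.lower().split():
        p = (counts.get(term, 0) + 1) / (total + vocab_size)  # Laplace smoothing
        score += math.log(p)
    return score

passages = ["Language models rank passages by query likelihood.",
            "Completely unrelated sentence about cooking."]
query = "query likelihood language models"
best = max(passages, key=lambda p: query_log_likelihood(query, p))
print(best)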
Another example of query-based summarization is answering XML queries by means of data summaries [33]: a summarized representation of XML data, based on the concept of instance patterns, which both provides succinct information and can be queried directly. The physical representation of instance patterns exploits item sets or association rules to summarize the content of XML datasets. Instance patterns may be used for answering queries either when fast, approximate answers are required or when the actual dataset is not available, for example because it is currently unreachable [33].

The query-biased summary has become a standard feature in the result presentation of search engines. An apparent disadvantage of QBS, however, is the generation cost at query time: a summary must be generated for every single document presented in response to a potentially diverse range of queries. Accordingly, [34] proposes a solution based on using document titles as an alternative to queries. Because titles are fixed, the summaries can be pre-generated statically, and presenting them to the users becomes a simple database lookup. To justify this title-biased approach to summary generation, three research hypotheses are made [34]: top-ranking documents tend to have a query term in the title; searchers prefer to visit a document when a query term appears in the title; and there is no significant difference between QBS and TBS in supporting search tasks. The experimental results showed that title-biased summaries are a promising alternative to query-biased summaries, given the behavior of existing retrieval systems as well as searchers' information-seeking behavior.

2.5.3 Sentence Compression Based Summarization

Automatic sentence compression can be broadly described as the task of creating a grammatical summary of a single sentence with minimal information loss. It is also known as sentence simplification, shortening or reduction. It has recently attracted much attention, in part because of its relevance to several vital applications, such as the generation of subtitles from spoken transcripts [35] and the display of text on small screens such as mobile phones or PDAs [36-37]. Another example of sentence compression is presented in [38]: a text simplification process that reduces the complexity of sentences in biomedical abstracts in order to improve the performance of syntactic parsers on the processed sentences, syntactic parsing being typically one of the first steps in a text mining pipeline. English sentences are classified into three categories: (1) normal English sentences, as in newswire text; (2) normal biomedical English sentences, which can be parsed without problems by Link Grammar (www.link.cs.cmu.edu/link/); and (3) complex biomedical English sentences, which cannot be parsed by Link Grammar. The approach runs in three steps [38]: preprocessing through removal of spurious phrases; replacement of gene names; and replacement of noun phrases. The method was evaluated on a corpus of biomedical sentences annotated with syntactic links. Experimental results showed an improvement of 2.90% for the Charniak-McClosky parser [37] and 4.23% for the Link Grammar parser when processing simplified sentences rather than the original sentences in the corpus.
Sentence compression can be seen as another route to summarization besides sentence selection; indeed, some systems use sentence compression as a complementary phase after sentence selection to produce a more concise summary. In [39] the authors propose a joint content selection and compression model for single-document summarization. They evaluate it on the task of generating "story highlights": a small number of brief, self-contained sentences that allow readers to quickly gather information on news stories. The output summaries must meet additional requirements such as sentence length, overall length, topic coverage and, importantly, grammaticality. Phrase and dependency information are combined into a single data structure, which allows grammaticality to be expressed as constraints across phrase dependencies; these constraints are encoded with integer linear programming (ILP), a well-studied optimization framework that can search the entire solution space efficiently. A key aspect of the approach is that content is represented by phrases rather than entire sentences, and salient phrases are selected to form the summary.

In [40] the authors also offer a compression-based summarization approach with two goals to be achieved simultaneously: compressions should be grammatical, and they should retain the most important pieces of information. Since these two goals can conflict, two models are devised, one based on a noisy channel and the other on a decision tree. The noisy-channel model relies on the hypothesis that a long string was originally a short string to which someone added additional, optional text; compression is then a matter of identifying the original short string, and it is not critical whether the "original" string is real or hypothetical. The decision-tree model, as in the noisy-channel approach, takes a parse tree t as input; the goal is to "rewrite" t into a smaller tree s, which corresponds to a compressed version of the original sentence subsumed by t. The work in [41] offers yet another solution to summarization through sentence compression: rather than simply shortening a sentence by deleting words or constituents, as in previous work, it rewrites the sentence using additional operations such as substitution, reordering and insertion. The authors also present a new corpus suited to their task, along with a discriminative tree-to-tree transduction model that enables them to detect structural and lexical mismatches. The model incorporates a novel grammar extraction method, uses a language model for coherent output, and can easily be tuned to a wide range of compression-specific loss functions.

2.5.4 Structure Based Summarization

In this section, we present summarization techniques that make use of the structure of the document: hierarchical summarization, fractal summarization, and other techniques whose main goal is to generate a summary based on the structure of a given text.
Hierarchical summarization, as shown in [42-44], depends mainly on two stages: the first identifies the salience of each sentence in a document and ranks the sentences accordingly, and the second builds a tree over all sentences whose root is the sentence with the highest salience. An advantage of this approach is that the input document does not have to be in HTML and nothing is assumed about the document structure; the algorithm infers the document's hierarchical structure automatically. The work in [43] employed hierarchical summarization in a web mail system for checking one's web mail and showed that it reduces the number of bytes per user request by more than half.

Fractal summarization, on the other hand, is derived from the document structure and fractal theory. Fractals are mathematical objects with a high degree of redundancy, being made of transformed repetitive copies of themselves or of parts of themselves. Similarly to geometrical fractals, documents have a hierarchical structure with multiple levels: chapters, sections, subsections, paragraphs, sentences and terms. Fractal summarization generates a brief skeleton of the summary at the first stage, and the details of the summary at different levels of the document are generated on demand when users request them. This significantly reduces the computation load compared with generating an entire summary in one batch with a traditional automatic summarization algorithm. In [45] this technique is used to provide handheld devices with condensed summaries that fit their small displays. The algorithm runs as follows: the original document is represented as a fractal tree according to its document structure, and the fractal value of each node is computed as the sum of the weights of the sentences under the node, obtained by traditional summarization methods from salient thematic features such as the TF*IDF score, sentence location, headings and cue phrases. A weight is assigned to each feature score to compute the overall score of a sentence and, depending on a preset quota, the highest-ranked sentences of each fractal are selected for the summary. Experiments [45] showed that fractal summarization outperforms traditional summarization: it achieved up to 88.75% precision and 84% on average, whereas traditional summarization achieved at most 77.5% precision and 67% on average. Fractal summarization is also applicable in social networks for summarizing events that have a certain structure and contain large amounts of text, such as events on Facebook (www.facebook.com).

Comparing the hierarchical approach to the fractal approach, the fractal summarization method relies on the input Web document being formatted in HTML in order to infer its structure, which is not required by the hierarchical approach. Regarding social networks, the hierarchical approach allows a mobile phone to be used for checking one's mail by giving the user access to hierarchical summaries of the items in his or her inbox.
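The quota-driven selection at the heart of fractal summarization can be sketched as below: the summary quota is split among the children of each node in proportion to their aggregated sentence weights, and the leaves contribute their top sentences. The tree layout and the weights are illustrative assumptions, not those of [45].

# A toy sketch of fractal-style quota allocation over a document tree.
def tree_weight(node):
    own = sum(w for w, _ in node["sentences"])
    return own + sum(tree_weight(c) for c in node["children"])

def allocate(node, quota, extract):
    # node = {"sentences": [(weight, text), ...], "children": [node, ...]}
    if not node["children"]:
        top = sorted(node["sentences"], reverse=True)[:max(quota, 0)]
        extract.extend(text for _, text in top)
        return
    weights = [tree_weight(c) for c in node["children"]]
    total = sum(weights) or 1.0
    for child, w in zip(node["children"], weights):
        allocate(child, round(quota * w / total), extract)

doc = {"sentences": [], "children": [
    {"sentences": [(0.9, "Section one key sentence."), (0.2, "Minor detail.")], "children": []},
    {"sentences": [(0.4, "Section two supporting point.")], "children": []},
]}
summary = []
allocate(doc, quota=2, extract=summary)
print(summary)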
Another attempt at structure-based summarization is presented in [46]: a novel approach to summarizing XML documents whose novelty lies in using features not only from the content of documents but also from their logical structure. The sentence-extraction method employs a machine learning approach based on the Area Under the ROC Curve (AUC, http://gim.unmc.edu/dxtests/ROC3.htm). The main rationale is to automatically combine different features, each being a numerical representation of a given extraction criterion: the summarizer learns how to best combine sentence features based on its effectiveness at assigning higher scores to summary sentences than to non-summary sentences. This ordering criterion corresponds exactly to what the learned function is used for, namely ordering sentences; in other words, sentence extraction is viewed as an ordering task, and the aim is to find which features are most effective for producing summaries. The model was evaluated on the INEX [47] and SUMMAC (http://www.itl.nist.gov/iaui/894.02/related_projects/tipster_summac/cmplg.html) datasets. The findings can be summarized as follows: including features from the logical structure of documents increases the effectiveness of the summarizer, and the machine learning approach is effective and well suited to the task of summarization in the context of XML documents. Furthermore, the approach is generic and therefore applicable to elements of varying granularity within the XML tree.

2.5.5 Link Based Summarization

It has been shown in [48] that summaries of hypertext link targets are useful for user cognition, so extending this technique may also be useful for summarization: by using the context and content surrounding a link more efficiently, quality summaries can be generated [49]. Using hyperlinks as summary markers within the page content helps in studying how non-linear resources can support summary creation. In [49] two new algorithms are introduced for summarization by context, which brings together two separate fields: summarization algorithms and the context of Web documents. The first algorithm uses both the content and the context of the document, while the second relies only on elements of the context. Summaries that take context into account are shown to be usually much more relevant than those made only from the content of the target document [49]. This is based on the hypothesis that when a document points to another, it often includes a description of the page it links to, and that the context of a page is sufficient to discriminate it; the context of a Web page is taken to be the textual content of all the documents linking to it. These techniques have met some success in their applications, but they do suffer from problems because the summaries generated in this manner rely on a profusion of associative and referential links, in contrast to the majority of websites, which lack real hypertext structures and use purely structural linking [50].

2.5.6 Email Messages Summarization

Over the past few decades, email has become the preferred medium of communication for many organizations and individuals. For many users, email has evolved from a mere communication system into a means of organizing workflow, storing information and tracking tasks. Moreover, with the great advances in mobile technologies, one can easily check email anywhere and at any time using a Web-enabled mobile device such as a cellular phone or PDA.
However, users face some challenges imposed by the mobile environment itself, for example the small display size and the limited Internet connectivity and bandwidth. Email summarization fits naturally as a solution to some of these challenges. In fact, email summarization is somewhat different from ordinary summarization tasks, since emails are usually short, do not obey grammatical rules, and are typically written in a chatty style [51]. One approach [52] works by first extracting simple noun phrases as candidate units for representing the document meaning and then using machine learning algorithms to select the most prominent ones. Another recent attempt [53] used clue words to provide an email summary of any length requested by the user; the authors define "a clue word in node (fragment) F [as] a word which also appears in a semantically similar form in a parent or a child node of F in the fragment quotation graph" [53]. In addition, [54] introduced SmartMail, a prototype system for automatically identifying action items (tasks) in email messages. SmartMail presents the user with a task-focused summary of a message, produced by identifying the task-related sentences and reformulating each as a brief (usually imperative) summation of the task, so the user can simply add these action items to a "to do" list. Going further, [55] proposes two approaches to email thread summarization: the Collective Message Summarization Approach (CMSA), which applies a multi-document summarization approach, and the Individual Message Summarization Approach (IMSA), which treats the problem as a sequence of single-document summarization tasks. Both approaches are driven by sentence compression: instead of a purely extractive approach, linguistic and statistical methods are used to generate multiple compressions, from which candidates are selected to produce the final summary. The experimental results [55] highlight two points: CMSA is the better approach to email thread summarization, and current sentence compression techniques do not improve summarization performance in this genre.

2.5.7 Customer Review Summarization

With the rapid expansion of e-commerce, people increasingly express their opinions and hands-on experiences with products or services they have purchased. These reviews are important both for business organizations, which may adjust their marketing strategies accordingly, and for individual customers. Review summarization can address this need, but it differs from ordinary text summarization in that it tries to extract the features being reviewed and to determine whether the customer's opinion is positive or negative. One proposed methodology is feature-based summarization [56]: given a set of customer reviews of a particular product, the task involves three subtasks: identifying the product features on which customers have expressed opinions; identifying, for each feature, the review sentences that give positive or negative opinions; and producing a summary from the discovered information. Another attempt [57] employed a summarizer specialized for a certain type of product with certain features.
Furthermore, in [58] the authors propose an approach that focuses mainly on object-feature-based review summarization. They formulate the review mining task as a joint structure tagging problem and propose a new machine learning framework based on Conditional Random Fields (CRFs). A CRF is a discriminative model that can easily integrate various features; it can employ rich features to jointly extract positive opinions, negative opinions and object features from review sentences. Their experiments [58] on movie review and product review datasets showed that structure-aware models outperform many state-of-the-art approaches to review mining.

2.5.8 Blog Summarization

The blog is a self-publishing medium on the Web that has been growing quickly and becoming more and more popular (http://www.technorati.com/weblog/2006/11/161.html). Blogs allow millions of people to easily publish, read, respond to, and share their ideas, experiences and expertise, so it is important to have tools that find and summarize the most informative and influential opinions in the massive and complex blogosphere. In [59] a blog summarization approach targeting Twitter (www.twitter.com) is proposed. The content of such a site is an extraordinarily large number of small textual messages, posted by millions of users, at random or in response to perceived events or situations. The algorithm takes a trending phrase, or any phrase specified by a user, collects a large number of posts containing the phrase, and automatically creates a summary of the posts related to the term. The work in [60], by contrast, relies on the hypothesis that the behavior of an individual is directly or indirectly affected by the thoughts, feelings and actions of others in a population, and that the same relation applies in the blogosphere: conversations usually start from innovators, who initiate ideas and opinions, and followers are primarily influenced by the opinions of these innovators. The opinions of the influential innovators thus represent the millions of blogs and thousands of conversations on any given topic, so the blogosphere is summarized by capturing the most influential blogs with highly innovative opinions.

2.5.9 Question Answering & FAQ Summarization

Question answering is the task of automatically formulating an answer that satisfies the need of a user; in other words, it retrieves answers to questions rather than full documents or best-matching passages. As a multi-disciplinary task, it has recently received attention from the information retrieval, information extraction, machine learning and natural language processing communities. In [61] the authors work on complex question decomposition in order to help extract accurate and responsive answers for question-driven multi-document summarization. Typically, complex questions address a topic that involves many entities, events and complex relations between them. Three methods for decomposing complex questions are presented and their impact on the responsiveness of the resulting answers is evaluated; the authors argue that the quality of question-focused summaries depends in part on how complex questions are decomposed. The first method decomposes questions based on syntactic information, whereas the other two use semantic and coherence information for question decomposition.
Their experimental results [61] showed that combining the two semantic-based question decomposition methods achieved the highest responsiveness scores, by 25%. In [62] a conceptual model is proposed for automatically transforming topical forum articles into a FAQ summary, and the acceptability of this model is demonstrated empirically via a prototype system; the experiment indicated the time and manpower savings in producing FAQs and illustrated the technical feasibility of the model. Research on automatically summarizing multiple documents in [63] produces the frequently asked questions (FAQs) of a topical forum. This work aimed mainly at enhancing the FAQ presentation model of [62] while also solving the problem of domain terminology extraction used in domain identification. The research was based on the traditional four-part structure of Chinese articles: Introduction (I), Body (B), Sub-theme (S) and Conclusion (C). The experiments [63] showed that the informativeness and readability of the FAQ summary improve significantly when the native-language composition structure, which is more familiar to the users' writing and reading style, is incorporated with the topical groups or concepts in presenting the text summarization.

In [64] a question answering system named QALC is presented. It consists of four main modules: question analysis, document retrieval, document processing and answer extraction. The question analysis module determines information about the question: the expected type of the answer, the category of the question, keywords, and so on. This information is mainly used by the second module to retrieve related documents, which are then indexed, and a subset of the highest-ranked documents is kept. The third module tags named entities in the indexed documents. The final module is in charge of extracting the answers from weighted sentences: first, the sentences of the documents are weighted according to the presence of question terms and named entities and their linear distance; then answers are extracted from the sentences depending on the expected answer type. The authors propose using a syntactic distance between the syntactic tree structures of the question and the answer, instead of a linear distance between their terms, to select sentences in their question answering system.

In [65] a statistical model for query-relevant summarization is introduced that characterizes the relevance of a document to a query on a collection of FAQ documents. The approach is extractive, selecting either complete sentences or paragraphs for summary generation. Each answer in a FAQ is viewed as a summary of the document relative to the question that precedes it: a FAQ with N question/answer pairs comes equipped with N different queries and summaries, the answer to the k-th question being a summary of the document relative to the k-th question. The authors propose a principled statistical model for answer ranking with a probabilistic interpretation: s (summary) is the best answer to q (query) within d (document). In other words, a machine learning module is trained to map a document and a query to a summary.

2.6 Summarization Evaluation

The summarization process is a challenging enough task to overcome, but another challenging problem is encountered once we decide to carry out a summarization task: how do we evaluate the results?
This is quite a challenging question because of the non-uniformity of input and output across summarization systems. There are two major types of summarization evaluation methods: intrinsic and extrinsic [2, 3]. Intrinsic evaluation compares automatically generated summaries against some gold-standard summary, an ideal summary that is mostly human generated, and mainly measures the degree of coherence and informativeness. If a known good summary or a human-authored gold standard is available, intrinsic methods are therefore very suitable. Extrinsic evaluation, on the other hand, measures the performance of the automatically generated summary in relation to performing a particular task, and is also called task-based evaluation; in some cases it can be time-consuming and expensive, and thus it requires a careful amount of planning.

Extraction-based summaries are widely evaluated using recall and precision scores [7]. Given an input text, a human's extract, and a system's extract, these scores quantify how closely the system's extract corresponds to the human's extract. For each unit (word, sentence, paragraph, section, etc.), correct is the number of sentences extracted by both the system and the human; wrong is the number of sentences extracted by the system but not by the human; and missed is the number of sentences extracted by the human but not by the system. Precision thus reflects how many of the system's extracted sentences were good, and recall reflects how many good sentences the system missed.

Despite the wide use and acceptance of recall and precision in extractive summary evaluation, this type of evaluation suffers from several problems. First, human variation [6, 66] means that human choices are not uniform: the same summary can receive different recall and precision scores when compared to different human summaries, which makes it difficult to define a proper gold standard. Moreover, a system may choose a good sentence for extraction and still be penalized by P/R evaluation; it might therefore be more beneficial to rely on recall, which highlights the degree of overlap rather than non-overlap. Another problem with P/R evaluation is granularity [7, 67]: operating at the sentence level is not the best granularity for assessing the content of a source text. Sentences vary in length, and different wording conveys different meaning; shorter sentences are not always better, as a longer sentence sometimes conveys more salient information. Moreover, two different sentences can convey the same meaning, which happens frequently in multi-document summarization; a human would select only one of them, and the system is penalized for choosing the other. Relative utility [67] has been proposed as a way to address the human variation and semantic equivalence problems in P/R evaluation. It requires multiple judges to score each sentence for its suitability for inclusion in a summary, and it addresses the problem of semantic equivalence implicitly. This approach is appealing, but it requires a great deal of manual effort in the sentence tagging operation.
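The precision/recall computation just described is simple enough to state directly; the short function below follows the correct/wrong/missed definitions above, with sentence identifiers standing in for the actual extracted units.

# Precision and recall for extract evaluation.
def extract_pr(system_ids, human_ids):
    system, human = set(system_ids), set(human_ids)
    correct = len(system & human)   # chosen by both system and human
    wrong = len(system - human)     # chosen only by the system
    missed = len(human - system)    # chosen only by the human
    precision = correct / (correct + wrong) if system else 0.0
    recall = correct / (correct + missed) if human else 0.0
    return precision, recall

# The human picked sentences 1, 4 and 7; the system picked 1, 2 and 4.
print(extract_pr({1, 2, 4}, {1, 4, 7}))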
The DUC, on the other hand, has been carrying out large-scale evaluations of summarization systems on a common dataset since 2001, and it attracts considerable interest, with twenty sites on average participating in the evaluation each year. Although the DUC adopts evaluation based on a single human model, it tries to overcome this drawback by dividing the data into subsets and assigning each subset to a different annotator [66]. Additionally, the DUC addresses the problem of sentence granularity by creating elementary discourse units (EDUs) as the basis for evaluation. These EDUs correspond to clauses; each human-made summary is divided into EDUs, and a summary is evaluated by the degree to which it covers the different EDUs in the model. The average score, called coverage, is the average EDU score for the summary under evaluation; the measure is recall-oriented, in essence measuring what fraction of the model EDUs are covered by a summary [66]. The DUC also supports abstractive summarization by using human abstracts as models instead of human selections of sentences, although this requires more human involvement. Because the result data of the different systems participating in the DUC are available, researchers have been able to study the factors that influence the performance of summarization systems; an analysis of variance (ANOVA, http://en.wikipedia.org/wiki/Analysis_of_variance) showed that the input and the model creator were the most significant factors affecting the evaluation of summarization systems [68].

Two lines of research on evaluation [7] emerged in an effort to address some of the issues raised by the DUC evaluation protocol: first, developing cheap automatic methods for comparing human gold standards with automatic summaries; second, developing better analyses of human variation in content selection, using multiple models to avoid dependence of results on a single gold standard.

Another important issue in this context is automatic summarization evaluation measures and how systems can be evaluated with automatic tools. This trend is not new; it has been known and widely used especially in machine translation evaluation with the BLEU technique [69], a machine translation evaluation technique designed to be easy, fast and inexpensive. Inspired by the success of BLEU's n-gram overlap measure, similar n-gram matching was tried for summarization, since machine translation and text summarization can both be viewed as similar natural language processing tasks in a textual context. The ROUGE system [70] for automatic evaluation of summarization was developed using the DUC scores and computes the n-gram overlap between a summary and a set of models. One of the most appealing merits of ROUGE is that it is recall-oriented, unlike BLEU, which is precision-oriented; this enables it to correlate better with DUC coverage scores. Its use of numerous parameters, such as word stemming, stop-word removal and n-gram size, may also contribute to this.
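ROUGE-N, the most commonly reported variant, is essentially n-gram recall against the reference summaries. The sketch below computes it for a single reference with plain tokenization; real ROUGE [70] adds options such as stemming, stop-word removal and multiple references, which are omitted here.

# A bare-bones ROUGE-N recall computation against a single reference.
from collections import Counter

def ngrams(text, n):
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n(candidate, reference, n=2):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())   # reference n-grams also in the candidate
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))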
The Pyramid Method [71], on the other hand, is concerned with analyzing the variation in human summaries and with making evaluation results less dependent on the model used for evaluation. Multiple human abstracts are analyzed manually to derive a gold standard for evaluation. The analysis is semantically driven: information with the same meaning, even when expressed with different wording in different summaries, is marked as expressing the same summary content unit (SCU). Each SCU is assigned a weight equal to the number of human summarizers who expressed it in their summaries. SCU analysis [7] shows that summaries that differ in content can be equally good, and it assigns a score that is stable with respect to the models when four or five human summaries are employed. A drawback of this approach is that it is very labor-intensive; it was also introduced for the evaluation of abstractive summaries and requires analysis that is unnecessary for extractive summaries [7]. The different automatic evaluation approaches give different results, and it is sometimes not totally clear what the scores mean and which automatic measure is to be preferred. This raises the question of which score to use and when exactly to use it, and it sums up the overall problem researchers face with the automatic evaluation of summaries.

Chapter 3

TEXT SUMMARIZATION PROJECTS

The following is a list of the most famous and well-known text summarization projects in both the academic and commercial fields, categorized by the summarization approach adopted in summary generation: extraction-based, abstraction-based or hybrid.

3.1 Extractive Based Projects

The work on extractive summarization can be categorized into approaches that explicitly use natural language processing (NLP) techniques based on computational linguistics and machine learning, and approaches that use non-NLP techniques. The non-NLP approaches can in turn be divided into statistical methods, data mining methods, and knowledge-based and question answering methods. In fact, most attempts concentrate on machine learning approaches, statistical approaches and question-answering-based approaches, so we introduce the most famous text summarization projects covering these three categories.

3.1.1 Machine Learning Based Projects

The following projects rely mainly on machine learning techniques. The Neural Summarizer (NeuralSumm) [72] is an automatic text summarizer based on a neural network that, after training, is capable of identifying the relevant sentences of a source text for producing the extract. NeuralSumm runs in four processes: text segmentation, feature extraction, classification, and summary production. The learning process is primarily unsupervised, since it is based on a self-organizing map that clusters information from the training texts. NeuralSumm produces two clusters: one representing the important sentences of the training texts (which should therefore be included in a summary) and another representing the unimportant sentences (which should be discarded).

The ClassSumm [73] summarization project employs a classification system that produces extracts based on a machine learning approach in which summarization is considered a classification task; it is based on a Naïve Bayes classifier. The system performs the same four processes as NeuralSumm. First, text pre-processing is similar to that of TF-ISF-Summ [74]. Second, the features extracted from each sentence are of two kinds: statistical, based on measures and counts taken directly from the text components, and linguistic, extracted from a simplified argumentative structure of the text. Third, summary generation is considered a two-valued classification problem: sentences are classified as relevant-to-extract or not; in other words, from the feature values of each sentence, the classification algorithm must "learn" which sentences belong in the summary. Finally, the sentences included in the summary are those above the cutoff, i.e., those with the highest probabilities of belonging to it.
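The classification view taken by ClassSumm can be illustrated with a toy Bernoulli Naive Bayes over two binary sentence features, as below. The features, labels and training examples are invented for illustration and are not those used in [73].

# A toy illustration of summarization as two-class sentence classification.
from collections import defaultdict

def featurize(sentence, position, keywords):
    words = set(sentence.lower().split())
    return (position == 0, bool(words & keywords))   # (is_first, has_keyword)

def train_nb(examples):
    # examples: list of (feature_tuple, label); Laplace-smoothed Bernoulli NB
    prior = defaultdict(int)
    cond = defaultdict(lambda: defaultdict(int))
    for feats, label in examples:
        prior[label] += 1
        for i, f in enumerate(feats):
            cond[label][(i, f)] += 1
    def predict(feats):
        best, best_p = None, -1.0
        for label, count in prior.items():
            p = count / len(examples)
            for i, f in enumerate(feats):
                p *= (cond[label][(i, f)] + 1) / (count + 2)
            if p > best_p:
                best, best_p = label, p
        return best
    return predict

train = [((True, True), "summary"), ((False, True), "summary"),
         ((False, False), "discard"), ((False, False), "discard")]
predict = train_nb(train)
print(predict(featurize("Keywords signal importance.", 0, {"keywords"})))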
The Text Summarization Project (http://www.site.uottawa.ca/tanka/ts.html) is presented by the University of Ottawa. Few details are available about this research project beyond its proposal, which mentions the use of machine learning techniques to identify keywords in the original text, keyword identification signaling the importance of the sentences containing those keywords. The proposal also mentions plans to use surface-level statistics such as word or keyword frequency analysis and possibly some linguistic features such as the position of sentences within their paragraphs.

The SweSum research project (http://www.nada.kth.se/%7Exmartin/swesum/index-eng.html) is presented by the Royal Institute of Technology in Sweden. It targets extractive summaries and resembles the work of ISI Summarist. The summarizer supports both Swedish and English in the newspaper and academic domains. The main idea is to rank sentences according to weighted word-level features and then extract the highest-scoring sentences for the summary; the feature weights were trained on a tagged Swedish news corpus. In addition, this summarization tool can be integrated with search engine results to give quick extracts.

The Text Analyst text summarization product (http://www.megaputer.com/html/textanalyst.html) targets the textual content of users' offline documents. Its official Web site gives a brief description of the system design, which works in three steps. First, a semantic network is constructed from the source document using a trained neural network; the developers stress that this construction is fully independent of prior domain-specific knowledge. Second, the user is shown a graphical representation of the concepts and relationships present in the original document, to select from according to their preference. Finally, sentences with matching concepts and relationships are selected for extraction and inclusion in the final summary.

3.1.2 Statistical Based Projects

The following projects rely mainly on statistical techniques in the process of formulating extractive summaries; statistical approaches are based on measures and counts taken directly from the text components. The Gist Summarizer (GistSumm) [75] is an automatic summarizer based on a novel extractive method called the gist-based method. It is focused on matching the lexical items of the source text against the lexical items of a gist sentence, supposed to be the sentence of the source text that best expresses its main idea, which can be determined by means of a word frequency distribution. GistSumm comprises three main processes: text segmentation, sentence ranking, and extract production. For GistSumm to work, the following premises must hold: (a) every text is built around a main idea, namely its gist; and (b) it is possible to identify in a text a single sentence that best expresses that main idea, namely the gist sentence.

The Term Frequency-Inverse Sentence Frequency-based Summarizer (TF-ISF-Summ) [74] is an automatic summarizer that makes use of the TF-ISF (term frequency, inverse sentence frequency) metric to rank the sentences of a given text and then extract the most relevant ones. Similarly to GistSumm, it has three main steps: text pre-processing, sentence ranking, and extract generation. It adapts Salton's TF*IDF information retrieval measure [76]: instead of signaling which documents to retrieve, it pinpoints the sentences of a source text that must be included in a summary.
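TF-ISF transfers the TF*IDF idea from documents to sentences: a term is informative for a sentence if it is frequent there but appears in few other sentences. A short sketch of this ranking follows; the length normalization is a simplifying assumption.

# A short sketch of TF-ISF sentence ranking.
import math
from collections import Counter

def tf_isf_rank(sentences):
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    sent_freq = Counter(t for toks in tokenized for t in set(toks))  # sentence frequency per term
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum(tf[t] * math.log(n / sent_freq[t]) for t in tf) / (len(toks) or 1)
        scores.append(score)
    return sorted(zip(scores, sentences), reverse=True)

sents = ["Inverse sentence frequency rewards distinctive terms.",
         "Common terms shared by every sentence are discounted.",
         "Distinctive terms make a sentence stand out."]
print(tf_isf_rank(sents)[0][1])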
The ISI Summarist project (http://www.isi.edu/natural-language/projects/SUMMARIST.html) is presented by the University of Southern California and targets the summarization of Web documents. It has been integrated into the Systran translation system (http://www.systranet.com/translate), developed by Systran Software Inc. of La Jolla, CA, in order to provide a gisting tool for news articles in any language supported by Systran. The overall system design divides into three main modules. First, a topic identification module applies statistical techniques to surface features such as word position and word count; the developers are also working toward using cue phrases and discourse structure in topic identification. Second, the identified concepts are interpreted into a chain of lexically connected sentences. Third, the sentences carrying the most general concepts are extracted. Although the extracted sentences form good summaries, planned future work is to enhance them so as to generate a more coherent summary.

The Open Text Summarizer (OTS, http://libots.sourceforge.net/) is an open source tool for summarizing texts. The program reads a text and decides which sentences are important and which are not. It ships with Ubuntu, Fedora and other Linux distributions and supports more than 25 languages, which are configured in XML files. OTS incorporates NLP techniques via an English-language lexicon with synonyms and cue terms as well as rules for stemming and parsing; these are used in combination with a statistical word-frequency-based method for sentence scoring.

Copernic Summarizer [77] is a multilingual summarizer that can operate on four languages: English, French, Italian and Spanish. It can summarize Word documents, Web pages, PDF files, email messages and even text from the clipboard, and it can be integrated into applications such as Internet Explorer, Netscape Navigator, Adobe Acrobat, Acrobat Reader, Outlook Express, Eudora, Microsoft Word and Outlook. It operates on the basis of the patent-pending WebEssence technology, which automatically removes irrelevant text and content (such as ads and navigation items) from Web pages so that the summarizer focuses on the essential text elements, resulting in even more relevant summaries. This is achieved through the following steps [77]. First, a statistical model (S-Model) is used to find the vocabulary of the text and detect similar concepts. Second, a Knowledge Intensive Process (KIP) is applied, which imitates the way humans summarize texts through the following steps: (a) language detection, identifying the language (English, German, French or Spanish) of the document so that its specific process is applied; (b) sentence boundary recognition, identifying the beginning and end of each sentence; (c) concept extraction, using machine learning techniques to extract keywords; (d) document segmentation, organizing the information so that it can be divided into larger related segments; and (e) sentence selection, choosing sentences according to their importance (weight) and discarding those that decrease readability and coherence.
The Microsoft Office Word Summarizer is a text summarizer found in Microsoft Office Word 2003 and Microsoft Office Word 2007. This tool can generate summaries by stating the desired number of sentences or words, or a percentage of the words in the original text, and it offers various ways of visualizing summaries, one of which is highlighting the important sentences in the original document. The summary is the result of an analysis of keywords, selected by assigning a score to each word: the most frequent words in the document receive the highest scores, are considered important, and are therefore included in the summary.

The InText text summarization product (http://www.islandsoft.com/products.html) targets the textual content of users' offline documents. No specific details about the operation of this product are given on its official Web site, beyond the statement that it extracts key sentences by using keywords, although the exact technique is not mentioned; the description also mentions that the user may choose one of several extraction techniques.

3.1.3 Question Answering Based Projects

The FociSum project (http://www.cs.columbia.edu/%7Ehjing/sumDemo/FociSum/) is presented by Columbia University and adopts a question answering approach to the summarization process. First, it extracts sentences that answer key questions about event participants, organizations, places and other question types; the result of this stage is a concatenation of sentence fragments extracted from the original document. Second, the system finds the foci of the document under consideration using a named entity extractor, and a question generator is used to define the relationships between the extracted entities. Third, the system parses the document on the basis of syntactic form to find candidate answers to the previously generated questions. The output is a set of sentence fragments and clauses that represent the original document.

Mitre's WebSumm text summarization product (http://www.mitre.org/news/the_edge/july_97/first.html) targets the textual content of single or multiple Web documents and can be integrated with search engines. The output of this product is an extractive summary whose selection is directed by the user's query, which can be considered a question to be answered. The main idea behind the system is that all source documents are represented as a network of sentences, and the system uses the query terms to extract or select the nodes related to the specific query words.
Moreover, this product can handle similar and contrasting sentences across the different documents.

3.2 Abstractive Based Projects

Abstractive summarization systems alter the original wording of sentences by merging information from different sentences, removing parts of sentences, or even adding new sentences based on document understanding. The work on abstractive summarization mainly uses natural language processing (NLP) techniques based on computational linguistics and machine learning. In fact, abstractive approaches are very labor intensive and require extensive work on automatic deep understanding of documents. As a result, only a few projects are based on this approach in comparison to extractive approaches. The TRESTLE research project (http://nlp.shef.ac.uk/trestle/) is presented by the University of Sheffield. It targets summaries in the news domain. Unfortunately, few details about the exact system architecture are available on its official web site. From the general information available, we know that it applies a concept identification module based on the recommendations about information style presented by the Message Understanding Conference (MUC, http://en.wikipedia.org/wiki/Message_Understanding_Conference). It then uses the identified concepts to assign degrees of importance to the different sentences, and formulates the summary from the information present in these sentences. The Summons research project (http://www.cs.columbia.edu/%7Ehjing/sumDemo) is presented by Columbia University. It targets multi-document summaries in the news domain. The system is designed over the results of a MUC-style information extraction process. It is based on a template with instantiated slots of pre-defined semantics. The summary is then generated by a sophisticated natural language generation stage, which consists of three sub-stages: content selection, sentence planning and surface generation. The system benefits from the notion of templates because they usually have well-defined semantics. Therefore, summaries produced by this type of generation are of high quality, comparable to professional human abstracts. One apparent drawback of this project is that it is domain specific, relying on news article templates for the information extraction stage.

3.3 Hybrid Based Projects

In this section we show hybrid approaches, which combine extraction-based techniques with more traditional natural language processing techniques to produce abstractive summaries. The Cut and Paste system (http://www.cs.columbia.edu/%7Ehjing/sumDemo/CPS/) targets single-document, domain-independent texts. The system is designed to operate on the results of sentence extraction summarizers; it then extracts key concepts from the extracted sentences. The stage following concept identification combines these concepts to form new sentences. In other words, the system cuts the surface form of the extracted key concepts and then pastes them into new sentences. This is done in two main steps. First, it reduces sentences by removing extraneous information, a process known as sentence compaction; this process uses probabilities learnt from a training corpus.
Second, the reduced sentences are merged using rules such as adding extra information about the speakers, merging common sentence elements and adding conjunctives. The MultiGen system (http://www.cs.columbia.edu/~hjing/sumDemo/) is presented by Columbia University. It targets multi-document summaries in the news domain. The main idea behind this system is that it extracts sentence fragments from the given text that can be considered key points of information in the set of related documents. This task is accomplished in three main steps. First, machine learning techniques are used to group paragraph-sized units of text into clusters of related topics. Second, sentences from these clusters are parsed and the resulting parse trees are merged to form a logical representation of the commonly occurring concepts. Third, the resulting logical representation is turned back into sentences using the FUF/SURGE grammar (http://www.cs.bgu.ac.il/surge/index.html). The matching of concepts is performed via linguistic knowledge such as stemming, part-of-speech tagging, synonymy and verb classes.

Chapter 4 PROPOSED APPROACH FOR FAQ WEB PAGES SUMMARIZATION

This research targets FAQ Web pages text summarization. Our approach is based on segmenting Web pages into Q/A block units and extracting the most salient sentences out of them. FAQs are generally well-structured documents, so the task of extracting the constituent parts (questions and answers in our case) is amenable to automation. Consequently, after FAQ Web pages are correctly segmented into Q/A units, we select the best sentence(s) from the answer to each question based on some salient features. The proposed approach is English language dependent. In this chapter, the first section gives an overview of the methodology we propose to tackle the problem of summarizing FAQ Web pages. Next, we give a brief overview of Web page segmentation in general and how it captures the semantic content that resides in a given page. Additionally, we show how we employ Web page segmentation to comprehend the FAQ Web page structure in the form of questions plus answers. Finally, we describe the details of the selection features we employ and the logic behind how they are combined.

4.1 Proposed Methodology Overview

FAQ Web pages are typically organized as a question with a heading followed by its answer in a different style; this is almost a standard for building a FAQ page. In most FAQ Web pages the text is not scarce, giving a good opportunity for summarizers to work. Moreover, this type of page is more informative than most other types of Web pages as it targets the questions and concerns of a Web site's visitors. The questions may also be grouped according to their degree of relatedness in order to answer a set of semantically related issues. These observations made it clear that FAQ Web page summarization may benefit from utilizing a Web page segmentation algorithm to correctly identify each question and its answer(s) as a first step and later on to summarize it. In fact, this research's main goal is to extract sentences for the final summary based on some selection features that signal the importance of certain sentences.
Related research has shown that when humans summarize a document they tend to use ready-made text passages (extracts) [78]: 80% of the sentences in their generated summaries closely matched the source document, while the remainder was a set of new sentences added based upon understanding the document. As a result, most automatic text summarization approaches depend on selecting sentences from the source document based on salient features of the document, such as thematic, location, title, and cue features. Moreover, extractive approaches tend to be faster, easily extendable, and retain most of the structure of the original document instead of flattening it. However, their main disadvantage is that, with certain techniques, they may become misleading in terms of structure. After the FAQ Web pages are correctly segmented into question and answer units using a segmentation algorithm, we apply the selection feature modules that are used to form the final summary.

4.2 Web Page Segmentation

A very common mistaken notion is that Web pages are the smallest undividable units of the Web [79]. In fact, Web pages consist of various content units, such as navigation, decoration, interaction and contact information, that may not be directly related to the main topic of the page [79]. A single Web page may cover multiple topics that distract the reader from the main topic of the page. Therefore, being able to understand the actual semantic structure of the page can aid the summarization process enormously. Knowing the relationship between all the units that reside in the Web page (in our case the different headings) will uncover the different degrees of importance of these units in relation to each other.

4.2.1 Applying Web Page Segmentation to FAQ Web Pages

As mentioned earlier, FAQ Web pages are organized so that a question appears as a higher-level heading or in a different style, followed by its answer under a lower-level heading and in a different style. Therefore, it makes sense to use the algorithm described in [80], which extracts the hierarchical structure of Web documents based on the identification of headings and the relationships between the different headings. One objective of using this algorithm is to help us filter out most of the misleading units residing in the Web page; as we mentioned before, Web pages usually contain other content (decoration, navigation, contact information, etc.) beside the text which is our main interest. Another objective of applying this algorithm is to establish the boundaries of the question and answer units. We could have retrieved the whole text residing in the Web page and then processed it, but we might lose some valuable information hidden in the structure of the answer. For example, if the answer to a question is divided into lower sub-headings, this means that each sub-heading either conveys a different type of information or carries a different degree of importance. The figure below shows an example illustrating that case.

How do I add CNN.com to my list of favorites?
If you are using Internet Explorer:
• Open the CNN home page http://www.cnn.com/ on Internet Explorer and click on Favorites.
• Click on "Add to favorites". A window will open confirming that Internet Explorer will add this page to your Favorites list and confirm the name of the page.
• Click OK to continue.
• You may also file CNN within a folder in your list of Favorites.
• Click Create In to file the page in an existing folder or click New Folder to add another folder to your list.
If you are using Netscape Navigator:
• Open the CNN.com home page http://www.cnn.com/ in Netscape.
• Click Bookmarks (on the upper left of the page, find Bookmarks next to the Location and URL).
• Choose Add Bookmark to automatically add CNN to your list of bookmarked web sites.
• Choose File Bookmark to file your CNN bookmark in a separate folder.
• Bookmarks can also be found under the Window option on your menu bar.
Figure 4.1 An Example of an Answer Logically Divided into Sub-Headings.

A good answer to the previous example should consider both headings, the one about Internet Explorer and the one about Netscape Navigator. Lacking the knowledge that these are two different but equally weighted headings would result in a non-informative summary. A good, conclusive 25% summary of the previous question would be something like the following:

If you are using Internet Explorer:
• Open the CNN home page http://www.cnn.com/ on Internet Explorer and click on Favorites.
• Click on "Add to favorites". A window will open confirming that Internet Explorer will add this page to your Favorites list and confirm the name of the page.
If you are using Netscape Navigator:
• Open the CNN.com home page http://www.cnn.com/ in Netscape.
• Click Bookmarks (on the upper left of the page, find Bookmarks next to the Location and URL).
Figure 4.2 An Example of a Good Summary of the Question in Figure 4.1.

A bad 25% summary, lacking that knowledge, would be something like the following:

If you are using Internet Explorer:
• Open the CNN home page http://www.cnn.com/ on Internet Explorer and click on Favorites.
• Click on "Add to favorites". A window will open confirming that Internet Explorer will add this page to your Favorites list and confirm the name of the page.
• Click OK to continue. You may also file CNN within a folder in your list of Favorites.
Figure 4.3 An Example of a Bad Summary of the Question in Figure 4.1.

4.3 Features Selection

In the literature many features have been proposed, some tackling specific problems while others are thought to be more general. As previously mentioned, [8] proposed that the frequency of a particular word in an article provides a useful measure of its significance. Related work in [9] provides early insight on another feature that was thought to be very helpful: the position of a sentence in its paragraph. By experiment, it was shown that 85% of the time the topic sentence is the first one and only 7% of the time it comes last, while the other cases are randomly distributed. The work in [10] was based on the two features of word frequency and positional importance that were incorporated in the previous two works; additionally, two other features were used: the presence of cue words (for example, words like significant, fortunately, obviously or hardly) and the structure of the document (whether the sentence is in a title or heading). In [20-21] they account for a set of features to generate their summaries. In [20] they proposed an approach that takes into account several kinds of document features, including position, positive keyword, negative keyword, centrality, and resemblance to the title, to generate summaries. Positive keywords are defined as the keywords frequently included in the summary.
In contrast to positive keywords, negative keywords are keywords that are unlikely to occur in the summary. The centrality of a sentence denotes its similarity to the others, which can be measured as the degree of vocabulary overlap between it and the other sentences. In other words, if a sentence contains more concepts identical to those of other sentences, it is more significant; generally, the sentence with the highest centrality denotes the centroid of the document. The resemblance-to-title feature implies that the more a sentence overlaps with the document title, the more important it is. These last two features are basically used when formulating generic summaries. In [21] they account for some other features, such as the inclusion of named entities, relative sentence length and aggregated similarity. Their similarity feature is simply the vocabulary overlap between the two nodes (two sentences) under consideration divided by the length of the longer of the two sentences (for normalization). Aggregate similarity measures the importance of a sentence by summing the weights of the links connecting a node (sentence) to other nodes instead of counting them. Weights were attached to each of these features manually to score each sentence. In conclusion, these features perform differently on different text types depending on different kinds of evidence: some work perfectly on a certain problem type and behave badly on others. Another observation concerns the number of features included: the larger the number of features used in summary generation, the harder the weight assignment scheme becomes. In addition, including unnecessary, non-problem-dependent features may ruin the weighting scheme and exert a negative influence on the other features. Therefore, when devising a feature set to solve a summarization problem, one must carefully study the problem at hand to highlight the most useful feature set and to include a feasible number of features so that the weights can be adjusted. Accordingly, given the definition of our problem of summarizing FAQs, we find ourselves approaching the domain of question answering, which is the task of automatically answering a posed question. Consequently, in order to define the useful features to be used in selecting the most suitable sentences from an answer paragraph for a given question, one has to determine the type of question being posed so as to determine the most appropriate answer. In fact, questions can be classified based on the expected type of answer. The simplest widely accepted question type classification (www.englishclub.com/grammar/verbs-questions_types.htm) distinguishes three basic types of question. First, Yes/No questions, where the expected answer is explicitly either "Yes" or "No"; an example of this type of question is depicted in Table 4.1. Second, Question Word questions, where the answer requires listing a certain degree of informative detail; an example is depicted in Table 4.2. Third, Choice questions, where the answer is mentioned in the question itself; an example is depicted in Table 4.3.

Auxiliary verb | Subject | Main verb | Answer ("Yes" or "No")
Do | you | want dinner? | Yes, I do.
Can | you | drive? | No, I can't.
Has | she | finished her work? | Yes, she has.
Table 4.1 Yes/No Question Type Examples.
Question word | Auxiliary verb | Subject | Main verb | Answer ("Information")
Where | do | you | live? | In Paris.
When | will | we | have lunch? | At 1pm.
Why | hasn't | Tara | done it? | Because she can't.
Table 4.2 Question Word Question Type Examples.

Auxiliary verb | Subject | Main verb | OR | Answer ("In the question")
Do | you | want | tea or coffee? | Coffee, please.
Will | we | meet | John or James? | John.
Did | she | go | to London or New York? | She went to London.
Table 4.3 Choice Question Type Examples.

In the light of the previous question type classification, we concluded that we have to think about possible selection features that can address each type. Moreover, we aim at finding a general solution that does not know the exact type of question and yet is able to find a good answer to it. Therefore, the main idea for reaching our goal has come down to choosing selection features that target each type of question and later finding a way of combining them linearly. For the first question type, the first sentence in the answer text contains the primary answer, while the sentences that follow it just add some extra details. Therefore, a selection feature that gives a higher score to opening sentences is highly preferable for answering that type of question. On the other hand, for the second and third question types, one may need semantic similarity measures to weight the answer sentences against the question sentence. In addition, the answer may also contain certain named entities describing people, places, organizations, products, etc., which signal the importance of the sentences containing them. Thus one might utilize selection features that highlight sentences possessing capitalized words. Based on this analysis of the different question types, we propose the use of four selection feature criteria that we believe help answer the three main question types and, in turn, extract the most salient sentences of the source text to be introduced into the final summary. We then give each feature a weight, whose calculation we discuss later. The four features are linearly combined in a single equation that gives an individual score to each sentence based on the degree of importance given to the four different features.

4.3.1 Semantic Similarity Feature

The first feature that we use in developing the FAQWEBSUMM system is referred to as "Similarity". It evaluates the semantic similarity between two sentences: the question sentence and each sentence in the answer. This feature was explicitly chosen to answer questions of types two and three. In fact, semantic similarity is a confidence score that reflects the semantic relation between the meanings of two sentences. We use a similarity score calculator developed by Dao and Simpson [81]. The similarity feature evaluation depends on dictionary-based algorithms that capture the semantic similarity between two sentences, relying heavily on the WordNet semantic dictionary (http://wordnet.princeton.edu/). WordNet is a lexical database, available online, that provides a large repository of English lexical items. It is designed to establish connections between four types of part of speech: noun, verb, adjective and adverb. It contains each word along with its explanation and its synonyms. The specific meaning of one word under one part of speech is called a sense. Words having the same sense are combined into a group for having the same synonymous meaning. Each group has a gloss that defines the concept it represents.
For example, the words night, nighttime and dark constitute a single group that has the following gloss: the time after sunset and before sunrise while it is dark outside. Groups are connected to one another through explicit semantic relations. Some of these relations constitute is-a-kind-of hierarchies (hypernymy and hyponymy for nouns, hypernymy and troponymy for verbs), while others constitute is-a-part-of hierarchies (holonymy and meronymy for nouns). Mainly, the algorithm in [81] evaluates how similar two sentences are to each other. The process of evaluating the similarity between two sentences is done in five main steps. First, partition the sentences into a list of words and then tokens; this includes extracting the basic forms of words and removing stop words and abbreviations. Second, perform part-of-speech tagging to identify the type of each word (noun, verb, etc.). In [81] they employed two types of taggers: the first attaches syntactic roles to each word (subject, object, ...) and the second attaches only functional roles (noun, verb, ...). Third, perform stemming, which is the process of removing morphological and inflectional endings of words. In [81] they used the Porter stemming algorithm; it first splits words into possible morphemes to get an intermediate form and then maps stems to categories and affixes to meanings, e.g., foxes -> fox + s -> fox. Fourth, perform semantic disambiguation, which can be defined as the process of finding the most appropriate sense for a given word in a given sentence. In [81] they modified and adapted the Lesk algorithm [82], which basically counts the number of words shared between two glosses. The adapted algorithm runs as follows:
1. Select a context: to optimize computation time when the sentence is long, they define a context of K words around the target word (its K nearest neighbors), i.e., the sequence of words starting K words to the left of the target word and ending K words to its right. This reduces the computational space and decreases processing time. For example, if K is four, there will be two words to the left of the target word and two words to the right.
2. For each word in the selected context, they look up and list all the possible senses of both noun and verb word types.
3. For each sense of a word, they list the following relations:
o Its own gloss/definition, including the example texts that WordNet provides with the glosses.
o The glosses of the synsets that are connected to it through hypernym relations. If there is more than one hypernym for a word sense, the glosses of the hypernyms are concatenated into a single gloss string.
o The glosses of the synsets that are connected to it through hyponym relations.
o The glosses of the synsets that are connected to it through meronym relations.
o The glosses of the synsets that are connected to it through troponym relations.
4. Combine all possible gloss pairs obtained in the previous steps and compute their relatedness by searching for overlap. The overall score is the sum of the scores of each relation pair. When computing the relatedness between two synsets s1 and s2, the pair hype-hype means that the gloss of the hypernym of s1 is compared to the gloss of the hypernym of s2, and the pair hype-hypo means that the gloss of the hypernym of s1 is compared to the gloss of the hyponym of s2, as indicated in [81] by the following equation:

OverallScore(s1, s2) = Score(hype(s1)-hypo(s2)) + Score(gloss(s1)-hypo(s2)) + Score(hype(s1)-gloss(s2)) + ...   (1)
Note that OverallScore(s1, s2) is equivalent to OverallScore(s2, s1). To score the overlap they use a new scoring mechanism that differentiates between N single-word overlaps and one N-consecutive-word overlap and effectively treats each gloss as a bag of words. Measuring the overlap between two strings is reduced to the problem of finding the longest common substring with maximal consecutive words. Each overlap that contains N consecutive words contributes N^2 to the score of the gloss sense combination. For example, an overlap "A B C" has a score of 3^2 = 9, whereas the two separate overlaps "A B" and "C" have a score of 2^2 + 1^2 = 5.
5. Once each combination has been scored, they pick the sense with the highest score as the most appropriate sense for the target word in the selected context space.
The above method allows us to find the most appropriate sense for each word in a sentence. To compute the similarity between two sentences, they consider the semantic similarity between word senses, which they capture using path length similarity. The scoring mechanism works as follows: it builds a semantic similarity relative matrix R[m, n] over the pairs of word senses, where R[i, j] is the semantic similarity between the most appropriate sense of the word at position i of sentence X and the most appropriate sense of the word at position j of sentence Y. The similarity between two senses is computed as in [81] by equation 2:

Sim(s, t) = 1 / distance(s, t)   (2)

where distance(s, t) is the shortest path length from sense s to sense t obtained by node counting. In other words, R[i, j] is also the weight of the edge connecting i to j. The match results from the previous step are combined into a single similarity value for the two sentences. There are many strategies for acquiring an overall similarity of two sets; to compute the overall score, [81] evaluates equation 3:

Sim(X, Y) = ( Σ Sim(x, y) over all match candidates (x, y) in match(X, Y) ) / ( |X| + |Y| )   (3)

where match(X, Y) denotes the matching word tokens between X and Y. That is, the similarity is computed by dividing the sum of the similarity values of all match candidates of both sentences X and Y by the total number of tokens in the two sets. We use this method to compute the similarity between the question sentence(s) and the answer sentences. In fact, we give each sentence in the answer a numeric score represented by its degree of similarity to the question as computed by this similarity measure. This selection measure is considered the primary feature of the FAQWEBSUMM system, as it responds to two question types out of three, as previously mentioned. After the score is computed, it is normalized by the highest value obtained among the answer sentences.

4.3.2 Query Overlap Feature

The second feature that we use in FAQWEBSUMM is referred to as "Query Overlap". This feature is a simple but very useful method, as stated in [31-34-65]. It is also chosen to answer questions of types two and three. It scores each sentence by the number of desirable words it contains. In our case the query is formed from the question sentence(s). The question is tokenized and tagged using a POS tagger to identify the important word types. It was found that certain word types are more important than others; for example, nouns, adjectives and adverbs are found to be more informative as they state the main purpose of the text. The system extracts the following word types from each question: nouns, adverbs, adjectives and gerunds (which represent verbs), and formulates a query with the extracted words.
The system then scores each sentence of the answer by the frequency of occurrence of the query words. Unlike the similarity score of the first feature, this feature does a direct match: a query word has to appear in exactly the same form in the answer, with each word considered individually rather than in relation to the other words in the question. This is because the first feature takes the whole sentence into account, so less semantically related words lower the overall score, whereas here we try to avoid that consequence. However, this feature is spawned from the first feature as a natural extension. In fact, the logic behind including two selection features that target semantically related content is that in two out of the three question types (the Question Word type and the Choice type), or 66% of cases, the expected answers are directly related to the question.

4.3.3 Location Feature

The third feature that we use in the FAQWEBSUMM system is the location feature. As previously depicted, the significance of a sentence is indicated by its location, based on the hypothesis that topic sentences tend to occur at the beginning or sometimes at the end of documents or paragraphs [2-3-4], especially for Yes/No questions. We simply calculate this feature for each sentence in the answer by giving the highest score to the first sentence and then decreasing the score for the following sentences. For example, if we have a paragraph of five sentences, the first sentence takes a score of 5 and the following sentence takes a score of 4, until we reach the last sentence in the paragraph with a score of 1. Later, these scores are normalized for combination with the other features.

4.3.4 Word Capitalization Feature

The fourth feature used in the FAQWEBSUMM system is the capitalized word frequency. The logic behind using this feature is that capitalized words are important, as stated previously, especially in response to the second question type, which requires listing a higher degree of detail. In fact, they are important as they often stand for a person's name, a corporation's name, a product's name, a country's name, etc. Therefore, they usually signal the importance of the sentences containing them; they tell the summarizer that these sentences are good candidates for inclusion, as they tend to carry a piece of information that is salient enough to be worth considering. This feature is simply computed by counting the number of capitalized words in a given sentence. Later these scores are normalized for combination with the other features. We also set a threshold on the number of capitalized words in a given sentence: if the number of detected capitalized words exceeds this value, it means that all the text is written in capitals. In that case it is meaningless to give importance to these sentences, as they all have the same format, so we ignore the feature by giving the sentence a zero score.

4.3.5 Combining Features

Each of the previous features produces a numeric score that reflects how important it considers each sentence in the answer. The total score is a linear combination of the four features, but they are not equally weighted; the weights depend on how we consider the contribution of each feature to the final score. We compute the score of each sentence as follows:

Total Score of Sentence(s) = α1(Similarity(s)) + α2(Query Overlap(s)) + α3(Location(s)) + α4(Uppercase Words(s))   (4)

where α1, α2, α3 and α4 are the weights given to each feature according to its expected contribution to the overall score.
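To make the combination in Eq. 4 concrete, the following Python sketch scores the sentences of one answer using simplified stand-ins for the four features. It is illustrative only: the similarity function is a plain token-overlap placeholder for the WordNet-based measure of Section 4.3.1, the tokenization is naive, and the weight values are hypothetical placeholders rather than the ones determined by the pilot experiment described next.

```python
# Minimal sketch of the per-sentence scoring in Eq. 4 (stand-in features, assumed weights).
def similarity(question, sentence):
    # Placeholder for the WordNet path-length similarity of [81]:
    # here, a simple token-overlap ratio between the two sentences.
    q, s = set(w.lower() for w in question), set(w.lower() for w in sentence)
    return len(q & s) / (len(q) + len(s)) if q and s else 0.0

def query_overlap(query_words, sentence):
    # Count occurrences of the query words (nouns, adjectives, adverbs, gerunds).
    words = [w.lower() for w in sentence]
    return sum(words.count(q.lower()) for q in query_words)

def location_scores(n_sentences):
    # First sentence gets n, the next n-1, ..., the last gets 1.
    return [n_sentences - i for i in range(n_sentences)]

def capitalized(sentence, threshold=0.8):
    # Number of capitalized words; zero if almost every word is capitalized.
    caps = sum(1 for w in sentence if w[:1].isupper())
    return 0 if caps > threshold * len(sentence) else caps

def normalize(scores):
    # Divide by the highest value obtained among the answer sentences.
    top = max(scores) if scores and max(scores) > 0 else 1.0
    return [s / top for s in scores]

def score_answer(question, query_words, answer_sentences,
                 w=(0.5, 0.15, 0.25, 0.1)):   # (a1, a2, a3, a4): illustrative values only
    f1 = normalize([similarity(question, s) for s in answer_sentences])
    f2 = normalize([query_overlap(query_words, s) for s in answer_sentences])
    f3 = normalize(location_scores(len(answer_sentences)))
    f4 = normalize([capitalized(s) for s in answer_sentences])
    return [w[0]*a + w[1]*b + w[2]*c + w[3]*d for a, b, c, d in zip(f1, f2, f3, f4)]

if __name__ == "__main__":
    question = "Do we draw all these pictures ?".split()
    answer = [s.split() for s in [
        "No - and that 's the whole point .",
        "These are the original paintings produced by the studios .",
    ]]
    print(score_answer(question, ["pictures", "draw"], answer))
```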
A pilot experiment was performed to determine the optimum weights to be included in the final formula. The next section gives details of the findings of this experiment.

4.4 Pilot Experiment "Detecting Optimum Feature Weights"

4.4.1 Objective

The main objective of this experiment is to determine the optimum weight of each feature in Eq. 4.

4.4.2 Experiment Description

The proposed method for achieving this objective is as follows. First, formulate a summary by utilizing the scoring of one feature at a time. Second, compare the features' scores and rank them according to their performance gain. Finally, devise a weighting scheme (that is, give a range to each feature weight) and score it in the same way as in step one to make sure that the devised methodology is sound and correct. The experiment was done on the basis of one FAQ Web page with 29 questions in total, which represents a sample of approximately 8% of the dataset we use. Fifteen questions were of the Yes/No type, while fourteen were informative questions of type two. Each summary was evaluated by one human evaluator. The evaluation was done on the basis of whether the summary is readable, informative and short. The compression rate used in generating these summaries was 25%; it was chosen to give a satisfactorily readable and understandable compressed summary. The evaluator gives a score ranging from very bad to excellent according to whether the summary meets the preset quality criteria and to what extent.

4.4.3 Results

The figure below shows a graphical comparison of the summary scores of all 29 questions for each feature. The quality metric is given the following numerical values: Excellent 1, Very Good 0.8, Good 0.6, Bad 0.4 and Very Bad 0.2. The detailed score comparison between all features, along with a sample summary output, is presented in Appendix A.

Figure 4.4 Features' Scores Comparison (per-question scores, questions 1-29, for F1 Similarity, F2 Location, F3 Query Overlap and F4 Capitalized Words).

The table below shows the average score produced by each feature in response to the given questions.

Feature | Average Score
F1 Similarity | 0.800
F2 Location | 0.793
F3 Query Overlap | 0.558
F4 Capitalized Words | 0.496
Table 4.4 Summary Evaluation for Selection Features.

4.4.4 Discussion

Based on the above, we can see that the semantic similarity feature F1 produced the highest average score over all questions, the location feature F2 came second, the query overlap feature F3 came third and the capitalized words frequency F4 came last. Moreover, the highest-valued and most stable curves were those of F1 and F2, as they always succeed in generating an output even if it is not of great quality. On the other hand, the answer sentences may not contain any capitalized words, which results in a summary generation failure, as is apparent in the curve above. The same applies to F3: sometimes the answer contains no direct association between query words and answer sentence words, hence the need to compare word senses as is done in the similarity feature. However, we can notice that in some cases F3 and F4 show sudden improvement in comparison to the other features. In conclusion, it was found that each of the features mentioned above performed well in some cases based on different kinds of evidence.
However, we believe that if we combine the features' contributions based on the above feature scores, we can improve the overall score and avoid the cases where an individual feature fails by backing it up with the other features. As a result, we propose the following weighting scheme. We give the highest weight (α1) to the sentence similarity feature, as its average score was the best amongst the features. Next, we give the second highest weight (α3) to the location feature, as its average score was the second best. The query overlap feature is given a weight (α2) lower than that of the location feature, as it is considered a spawn of the similarity feature and scored the third highest average. The capitalized words feature is given the least weight (α4), based on the same logic. The weights can take arbitrary values as long as they obey the following inequality:

α1 > α3 > α2 > α4   (5)

Applying this weighting scheme to the same experimental data used above to evaluate the features individually, we found the following. First, the average score over all questions was 0.834, which is higher than the best value scored by any of the individual features. This means that the proposed combination methodology improved the overall score. Second, the feature combination avoids the cases where some features fail, as it falls back on alternative features when scoring the different sentences. The detailed scores for all questions, along with a sample summary output of the combined feature scheme, are presented in Appendix A. In fact, this is quite an impressive result, because when different features are combined they may interfere with each other and produce bad results, which did not occur in our sample test. The final conclusion can be reached only after performing a large-scale experiment, as will be seen in the experimental results and analysis chapter, which will show how our proposed methodology performs in comparison to a real-world commercial application and whether the hypothesis behind this scoring scheme stands or not.

Chapter 5 SYSTEM DESIGN AND IMPLEMENTATION

In this chapter we present the adopted system architecture and the system implementation details. First, we give an overview of the overall proposed FAQWEBSUMM system design, describing the different phases of the summarization process. Next, we describe in detail the two main modules that constitute our system, namely the Preprocessing module and the Summarization module. Finally, we present the tools used to implement our system and describe the target environment.

5.1 FAQWEBSUMM Overall System Design

The FAQWEBSUMM system was designed from the very beginning to be as extendable as possible. As the approach to automatic text summarization adopted in this research is extraction based, the idea of system expansion or extension directly influenced the system design. FAQWEBSUMM was designed to have a solid core that allows other future types of summaries beside FAQs. The system is mainly composed of two main modules, namely the preprocessing module and the summarization core module. The preprocessing module is responsible for preparing the input data and putting it into an appropriate form (Q/A units) ready to be used as input by the summarization module.
In turn, the summarization module is responsible for summarization quota calculation, sentence scoring by each selection feature and, finally, summary generation, which will be explained in detail later. The figure below shows the overall system architecture as described above.

Figure 5.1 FAQWEBSUMM Overall System Architecture (input HTML pages, Pre-Processing Module, Summarization Core Module, final summary).

The sequence of operations undertaken by the summarization process, starting from an input HTML FAQ page up to producing a summary for that page, is as follows. The system first receives the input HTML pages, which are forwarded to the preprocessing module. The preprocessing module is divided into two internal modules. The first internal preprocessing module runs in two main steps. The first step is to run the segmentation tool and construct the Web page semantic structure; the output of this step is an XML file describing the entire page in the form of semantic segments, which will be described in detail afterwards. The second step builds a parser interface to the XML file produced by the segmentation module. The parser enables us to comprehend the segment structure of the given pages, where segments correspond to fragments of the Web page. However, the segments produced by the segmentation tool are not simply the desired Q/A segments; thus we needed to include another preprocessing module, the Question Detection module, to filter out segments that do not correspond to real Q/A units. After the Q/A units are detected correctly, we proceed to the summarization module to score the answer sentences and later to select the best ones for summary generation.

5.2 Pre-processing Stage

As stated earlier, Web pages are divided into a set of segments. A segment mainly consists of a higher-level heading, which in our FAQ case is the question sentence(s), along with some text under a lower-level heading that serves as the answer. In fact, the answer may be scattered over a set of child segments if it is in the form of points, bullets or sub-paragraphs defined by smaller sub-headings. The preprocessing module is responsible for filtering out irrelevant segments, leaving only Q/A segments. This is done in two main steps. First, detect all segments in the given page based on their heading level, which is provided by the segmentation tool we have. Then define and apply some filtering criteria that allow only Q/A segments into summary generation.

5.2.1 Primary Pre-processing

This module runs the segmentation tool on the given page, and the output is a set of segments. These segments can be described as follows. The two main attributes that describe a segment are the heading and the text. The heading may be a question, if it is a relevant segment, or may be any heading title in the page, along with some text that serves as the answer. There are some other attributes, such as segment ID, parent ID, level, order, length and number of sub-segments. Definitions of these attributes are given in Figure 5.2.

Segment ID: a unique identifier for the segment throughout the page.
Parent ID: the parent segment identifier.
Heading: the heading text.
Level: indicates the segment's level within the document's hierarchy.
Order: indicates the segment's position with respect to its siblings, from left to right.
Length: indicates the number of words in the segment.
Text: indicates the main text within the segment.
NoOfSubSegments: indicates the number of child segments this segment possesses.
Figure 5.2 Segment Attributes Definitions.

Normally, Web pages are divided into a set of semantically related segments having a hierarchical relation in the form of parent-child relationships. In fact, sometimes the answer to a question is logically scattered over sub-paragraphs, bullets or points having a lower heading yet still falling under the same heading. Detecting this type of relation gives us the opportunity to produce conclusive summaries as a result of knowing the differences between sub-headings. The following figure shows the hierarchical relations of a given page.

Figure 5.3 Web Page Segmentation Example (a document with top-level segments at Level 1 and Parent ID 0, and child segments, e.g. Segment IDs 4 and 5, at Level 2 with Parent ID 1).

After Web pages are segmented and put into the previous form, FAQWEBSUMM maps these segments to its internal hierarchical structure. The internal structure is divided into Page, Q/A, Sentence, Word and Token. The Page consists of a number of Q/A units. A Q/A unit consists of the question, which may contain one or more sentences, along with the answer, which is another set of sentences. A sentence consists of a set of words, which in turn consist of a set of tokens or letters. Devising this hierarchy helped in attaining more control over the different constructs. The segmentation tool provides us with only two constructs (the document and the set of segments) from which we build our own internal structure. The document construct maps to our Page construct, which represents the whole page with all the questions in it. The segment construct maps to our Q/A unit. Unfortunately, the segment construct only contains raw text in the heading and text nodes, representing the question and answer text, with no explicit mapping to sentence or word constructs. We therefore perform two steps to identify both constructs correctly. The first step is to extract sentences from the raw text found in the segment. This is done by applying a sentence boundary disambiguation (SBD) module; SBD is the problem in natural language processing of deciding where sentences begin and end (http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation). This module is very important because if sentences are mistakenly detected, the final summary will be badly evaluated; the whole summarization system is built on sentence extraction, and the compression rate is computed based on the number of sentences in each Q/A unit. However, this is a challenging task because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, a decimal point, an ellipsis, or part of an email address rather than the end of a sentence; about 47% of the periods in the Wall Street Journal corpus denote abbreviations [83]. In addition, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. The most straightforward approach to detecting sentence boundaries is to look for a period and check that the token preceding it is not a known abbreviation and that the following token starts with a capital letter; in that case the period most probably ends the sentence. This approach [83] works in most cases, with an accuracy of around 90%, depending on the list of abbreviations used.
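The short Python sketch below illustrates this baseline period-based heuristic; it is not the LingPipe model the system actually uses, and the abbreviation list is an illustrative assumption that would need extending for real text.

```python
# Minimal sketch of the baseline sentence-boundary heuristic described above.
# Assumed, illustrative abbreviation list; real systems use much larger lists.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc.", "inc.", "vs."}

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            is_abbrev = tok.lower() in ABBREVIATIONS
            next_capitalized = nxt == "" or nxt[:1].isupper()
            # End the sentence only if this token is not a known abbreviation
            # and the next token begins with a capital letter (or text ends).
            if not is_abbrev and next_capitalized:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Open the CNN home page. Click OK to continue. "
                      "Dr. Smith, e.g. the admin, can help."))
```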
We use a Java sentence detection library offered by LingPipe (http://www.alias-i.com/lingpipe/, demo at http://alias-i.com/lingpipe/Web/demo-sentence.html) that is trained over two types of corpora: an English news text corpus and an English biomedical text corpus. Its reported accuracy of over 90%, as stated by LingPipe, qualified it for use in our system. The figure below shows the input and output of the second step of preprocessing.

Figure 5.4 FAQWEBSUMM Second Stage of Pre-processing (XML file "A", Sentence Boundary Disambiguation (SBD) module, XML file "B").

The next step after sentences are successfully detected is to identify words and their tokens. In addition, we need to identify the different word types, as they are used by the selection features involved. This is done by applying a Part of Speech Tagging module; POS tagging (http://en.wikipedia.org/wiki/Part-of-speech_tagging) is also called grammatical tagging or word category disambiguation. It can be defined as the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both their definition and their context, i.e., their relationship with adjacent and related words in a phrase, sentence, or paragraph. In other words, it is the identification of word types such as nouns, verbs, adjectives, adverbs, etc. We use part-of-speech tagging in order to detect the types of words in the sentences under consideration. This helps us detect the exact types of words with different levels of importance according to the criteria defined in the selection process, which will be depicted later. The part-of-speech tagger used in FAQWEBSUMM is offered by LingPipe (http://alias-i.com/lingpipe/Web/demo-pos.html). It is a general English tagger derived from the Brown Corpus (http://en.wikipedia.org/wiki/Brown_Corpus), the first major corpus of English for computer analysis, which was developed at Brown University by Henry Kucera and Nelson Francis. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. The figure below shows the input and output of the third step of preprocessing.

Figure 5.5 FAQWEBSUMM Third Stage of Pre-processing (XML file "B", Part of Speech Tagging (POS) module, XML file "C").

The XML file now has the same basic structure as the output of the Web segmentation tool: a root node representing the whole document, followed by a set of nodes representing the different segments. Each segment node contains two main nodes: the heading node, which contains the question text, and the text node, which contains the question's answer. The text within these two nodes is the input to the two previously mentioned modules of sentence detection and part-of-speech tagging. The sentence detection module creates an XML node with identifier "s" to represent a sentence and divides the text into a set of s nodes. The part-of-speech tagging module also creates a different node, the token node, where it sets the token type, whether noun, verb, adverb, etc. The figure below shows an example of the XML file structure after performing this stage on an input question.
<Segment>
  <SegmentID>20</SegmentID>
  <Heading>
    <s i="0">
      <token pos="nn">Q</token>
      <token pos=".">.</token>
      <token pos="do">Do</token>
      <token pos="ppss">we</token>
      <token pos="vb">draw</token>
      <token pos="abn">all</token>
      <token pos="dts">these</token>
      <token pos="nn">picture</token>
      <token pos=".">?</token>
    </s>
  </Heading>
  <ParentID>19</ParentID>
  <Level>6</Level>
  <Order>1</Order>
  <Length>22</Length>
  <Text>
    <s i="0">
      <token pos="np">A</token>
      <token pos=".">.</token>
      <token pos="rb">No</token>
      <token pos="--">-</token>
      <token pos="cc">and</token>
      <token pos="cs">that</token>
      <token pos="'">'</token>
      <token pos="vbz">s</token>
      <token pos="at">the</token>
      <token pos="jj">whole</token>
      <token pos="nn">point</token>
      <token pos=".">.</token>
    </s>
    <s i="1">
      <token pos="dts">These</token>
      <token pos="ber">are</token>
      <token pos="at">the</token>
      <token pos="jj">original</token>
      <token pos="nns">paintings</token>
      <token pos="vbn">produced</token>
      <token pos="in">by</token>
      <token pos="at">the</token>
      <token pos="nns">studios</token>
      <token pos=".">.</token>
    </s>
  </Text>
  <NoOfSubSegments>0</NoOfSubSegments>
</Segment>
Figure 5.6 XML File Structure Example after SBD and POS Stages.

5.2.2 Secondary Pre-processing

The input data is now ready to be processed by the FAQWEBSUMM system. A file reader module translates the XML file into the internal structure described earlier. The file reader expects the XML file to be in the previously described form. This preprocessing phase is crucial to the whole system, as it may introduce many undesired faults if the structure is not adequately mapped to the internal FAQWEBSUMM structure. The output of this phase is the internal representation of the page in the form of the Page, Q/A, Sentence and Word structure. The figure below shows the input and output of the fourth step of preprocessing.

Figure 5.7 FAQWEBSUMM Fourth Stage of Pre-processing (XML file "C", File Reader & Parser, internal form: Page, Segment, Paragraph, etc.).

The preprocessing module then continues with some more specific preprocessing steps to put the data into the exact form needed for summarization. FAQWEBSUMM does some additional preprocessing to remove so-called bad segments. Bad segments are defined by FAQWEBSUMM as those segments that meet one or more of the following criteria. Note that the segmentation tool provides us with all segments detected in a Web page, which include decoration, navigation or any other irrelevant segments besides the Q/A segments. The first criterion is an empty-text segment that is also a non-question segment. We therefore implemented a question detection module to identify and filter out non-question segments. This is a challenging task as well, as falsely detected questions introduce bad or unnecessary segments. The second criterion removes segments that contain only questions with no answers. This case happens frequently, as many FAQ Web page designers first list all the questions at the top of the page and then proceed to answer each question on its own; the segmentation tool detects this form of gathered questions as a complete segment or set of segments. The third criterion removes segments containing non-questions in their heading node that are not children of other question nodes. FAQWEBSUMM has to differentiate between important segments and segments that are considered spam in the input.
Therefore, the system tailors the output of the segmentation tool, which is not specifically directed towards a system like FAQWEBSUMM. The figure below shows the steps of the final, specialized FAQ preprocessing.

Figure 5.8 FAQWEBSUMM Specialized FAQ Pre-processing (remove empty non-question segments, remove segments with questions only, remove non-question segments, end of preprocessing).

5.3 Summarization Stage

As previously mentioned, the core of the summarization system is extendable, meaning it is designed to enable the addition of new types of summaries other than FAQ summaries. For a new summarization type we just implement its own handling criteria and the system proceeds in a plug-and-play manner. The core and operation of the FAQWEBSUMM summarizer are depicted in the figure below.

Figure 5.9 FAQWEBSUMM Summarization Core (calculate the sentence rate per Q/A, calculate the Q/A sentence scores, sort the scores, extract the summary sentences, and formalize the final summary).

The summarization stage is responsible for selecting the most suitable sentence(s) as the answer to a given question. The summarization algorithm runs in four main steps. The first step is to specify the number of sentences to be extracted from each answer; this is done by multiplying the compression ratio by the total number of sentences in the given answer. The second step is to score all sentences in each answer against the previously specified four selection features. Figure 5.10 shows pseudo code illustrating the first two steps of the summarization stage.

PROCEDURE Summarize_FAQ (Segment QASeg, NUM Comp_Ratio) {
    /* First, calculate the number of sentences to be extracted from each question */
    Num_Ext_Sent = floor(Comp_Ratio * Total_Num_Sent);
    /* Second, calculate the combined total score for each sentence in the Q/A segment */
    Seg_Sentence_Scores = Calc_Seg_Score(QASeg);
    /* Third, sort sentences in descending order of their total scores */
    Sort_Seg_Sentence_Scores(Seg_Sentence_Scores);
    /* Finally, formalize the summary by picking the top-ranked sentences */
    Formalize_Summary();
}

PROCEDURE Calc_Seg_Score (Segment QASeg) {
    /* This procedure calculates all sentence scores and then combines them */
    for each Answer Sent in QASeg do {
        /* Calculate each of the four scores and normalize it to 1 by dividing all values by the highest score value */
        Sim_Score = Calc_Similarity_Score(QASeg);
        Sim_Score = Normalize(Sim_Score);
        Loc_Score = Calc_Location_Score(QASeg);
        Loc_Score = Normalize(Loc_Score);
        Query_Score = Calc_Query_Score(QASeg);
        Query_Score = Normalize(Query_Score);
        Capital_Score = Calc_Upper_Words_Score(QASeg);
        Capital_Score = Normalize(Capital_Score);
        /* The score combination follows Eq. 4 */
        Total_Sent_Score = Sim_Score*0.5 + Loc_Score*0.25 + Query_Score*0.15 + Capital_Score*0.1;
        All_Sent_Scores.Add(Total_Sent_Score);
    }
    return All_Sent_Scores;
}
Figure 5.10 Pseudo Code of the Summarization Algorithm.

The feature calculations are computed as follows. The similarity score is computed by performing the following set of actions on the question sentence(s) and each of the answer sentences. First, perform word stemming. Second, find the most appropriate sense for every word. Third, build a semantic similarity relative matrix for each pair of words.
Fourth, compute the overall similarity score by dividing the sum of the similarity values of all match candidates of both sentences under consideration by the total number of set tokens. The following figure shows pseudo code for calculating this feature score.

PROCEDURE Calc_Similarity_Score (Segment QASeg) {
    /* Given two sentences X and Y, we build a semantic similarity relative matrix R[m, n] of each pair of word senses */
    for each Answer Sent in QASeg do {
        Sum_X = 0; Sum_Y = 0;
        |X| = Num of Words in Question Sent;
        |Y| = Num of Words in Answer Sent;
        for each Word in Question Sent do {
            Max_i = 0;
            for each Word in Answer Sent do {
                if (R[i, j] > Max_i) Max_i = R[i, j];
            }
            Sum_X += Max_i;
        }
        for each Word in Answer Sent do {
            Max_j = 0;
            for each Word in Question Sent do {
                if (R[i, j] > Max_j) Max_j = R[i, j];
            }
            Sum_Y += Max_j;
        }
        Overall_Sim = (Sum_X + Sum_Y) / (|X| + |Y|);
        All_Sent_Scores.AddSentScore(Overall_Sim);
    }
    return All_Sent_Scores;
}
Figure 5.11 Pseudo Code for Calculating the Similarity Feature Score.

The location feature is computed by giving the first sentence a score equal to the total sentence count, the next sentence the total sentence count minus one, and so on. The following figure shows pseudo code for calculating this feature score.

PROCEDURE Calc_Location_Score (Segment QASeg) {
    Num_Sent = QASeg.GetNumAnswerSent();
    for each Answer Sent in QASeg do {
        Sent_Score = Num_Sent - Sent_Index;
        All_Sent_Scores.AddSentScore(Sent_Score);
    }
    return All_Sent_Scores;
}
Figure 5.12 Pseudo Code for Calculating the Location Feature Score.

The query overlap feature is computed in two steps. First, we formulate a query from the question sentence(s) by extracting the previously specified word types. Then we compute their frequency in each sentence. The following figure shows pseudo code for calculating this feature score.

PROCEDURE Calc_Query_Score (Segment QASeg) {
    Query = ExtractQuery(QASeg.GetQuestion());
    for each Answer Sent in QASeg do {
        Query_Words = 0;
        for each Word in Answer Sent do {
            if Word is in Query THEN Query_Words++;
        }
        All_Sent_Scores.AddSentScore(Query_Words);
    }
    return All_Sent_Scores;
}

PROCEDURE ExtractQuery (Question q) {
    for each Word in q do {
        if Word Type is noun OR adjective OR adverb OR gerund THEN Add Word to Query;
    }
    return Query;
}
Figure 5.13 Pseudo Code for Calculating the Query Overlap Feature Score.

Finally, the capitalized word frequency is computed by counting the occurrences of capitalized words. After each step we normalize all scores so that they can be combined. Figure 5.14 shows pseudo code for calculating the capitalized words feature score.

PROCEDURE Calc_Upper_Words_Score (Segment QASeg) {
    Num_Sent = QASeg.GetNumAnswerSent();
    for each Answer Sent in QASeg do {
        Upper_Case = 0;
        for each Word in Answer Sent do {
            if Word is Capital THEN Upper_Case++;
        }
        All_Sent_Scores.AddSentScore(Upper_Case);
    }
    return All_Sent_Scores;
}
Figure 5.14 Pseudo Code for Calculating the Capitalized Words Feature Score.

The final step in the score computation process is the linear combination, as shown in the pseudo code in Figure 5.10. The third step in the summarization stage sorts the scores in descending order of the total score of each sentence. The final summarization step is the summary formation, which takes place by selecting the top-scoring sentences in response to each question for summary generation.

5.4 FAQWEBSUMM System Implementation Issues

In this section we list the tools that we used to implement FAQWEBSUMM and describe the target environment.
5.4 FAQWEBSUMM System Implementation Issues

In this section, we list the tools that we used to implement FAQWEBSUMM and its target environment. The FAQWEBSUMM system modules were developed in multiple programming languages. First, the Web segmentation module is developed in ASP.NET (http://www.asp.net/) as a Web application, implemented using the Visual Studio 2005 tool (http://msdn.microsoft.com/en-us/library/ms950416.aspx). It is an external module to FAQWEBSUMM: the input HTML pages are introduced to this external Web segmentation module, and its output takes the form of XML files.

Second, the primary preprocessing modules, which perform sentence boundary disambiguation and part-of-speech tagging, are implemented in Java using the NetBeans 6.1 IDE (http://netbeans.org/community/releases/61/). This is because the LingPipe SDK offers helper libraries that support these tasks properly. NetBeans is an open source integrated development environment (IDE) for developing software with Java, JavaScript, PHP, Python, Ruby, Groovy, C, C++, Scala, Clojure, and others. The NetBeans IDE can run anywhere a Java Virtual Machine (JVM) is installed, including Windows, Mac OS, Linux, and Solaris. A Java Development Kit (JDK) is required for Java development functionality, but is not required for development in other programming languages. In addition, it allows applications to be developed from a set of modular software components called modules. The 6.1 release provides improved performance, using less memory and running faster.

Third, the output of the two Java preprocessing modules is a processed XML file. The core of the internal preprocessing, along with the feature calculation and handling, is implemented as an MFC 8.0 C++ application (http://en.wikipedia.org/wiki/Microsoft_Foundation_Class_Library). The Microsoft Foundation Class Library (also Microsoft Foundation Classes or MFC) is a library that wraps portions of the Windows API in C++ classes, including functionality that enables applications to use a default application framework. Classes are defined for many of the handle-managed Windows objects and also for predefined windows and common controls. MFC 8.0 was released with Visual Studio 2005 and has a large amount of third-party resources. We mainly use it in developing our application because of its efficiency and performance.

The target environment to run all these applications is Microsoft Windows XP, while the minimum system requirements are as follows:

Processor: 600 MHz processor; Recommended: 1 gigahertz (GHz) processor
RAM: 192 MB; Recommended: 256 MB
Available Hard Disk Space: 300 MB of available space required on the system drive
Operating System: Windows 2000 Service Pack 4, Windows XP Service Pack 2, Windows Server 2003 Service Pack 1, or Windows Vista; for a 64-bit computer: Windows Server 2003 Service Pack 1 x64 editions or Windows XP Professional x64 Edition
CD-ROM Drive or DVD-ROM Drive: Required
Video: Recommended: 1024 x 768, High Color 16-bit
Mouse: Microsoft mouse or compatible pointing device
Previously Installed Software: Microsoft .NET Framework; Java SDK 1.4

Table 5.1 FAQWEBSUMM System Requirements.
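The three modules exchange data through files, so an end-to-end run is essentially a small pipeline: the ASP.NET segmenter writes a segmentation XML file, the Java preprocessor rewrites it as a processed XML file, and the C++ core is then invoked on that file. The sketch below illustrates such an orchestration in Java; the program names, jar file, and paths are hypothetical and are not part of the thesis tooling.

import java.io.IOException;

public class PipelineRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        String segmented = "page_segments.xml";   // assumed output of the Web segmentation module
        String processed = "page_processed.xml";  // assumed output of the Java preprocessing modules

        // Stage 2: sentence boundary disambiguation and POS tagging (hypothetical entry point).
        new ProcessBuilder("java", "-jar", "faq_preprocessor.jar", segmented, processed)
                .inheritIO().start().waitFor();

        // Stage 3: hand the processed XML to the native summarization core (hypothetical executable).
        new ProcessBuilder("faqwebsumm_core.exe", processed)
                .inheritIO().start().waitFor();
    }
}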
Chapter 6

SYSTEM EVALUATION

In this chapter, we present the evaluation procedure and data for our newly adapted FAQ Web-page summarization method in comparison with the commercial tool "Copernic Summarizer 2.1". First, we give a brief account of why we chose Copernic Summarizer for evaluating our system. Second, we present our evaluation methodology. Third, we describe our evaluation dataset. Fourth, we introduce the experimental results generated by the experiments we conducted, and we then discuss and analyze these results in relation to our proposed hypothesis.

6.1 Why Copernic Summarizer?

Several reasons motivated us to use Copernic Summarizer as a competitor in evaluating our system. Copernic Summarizer is a widely used commercial summarizer that operates on multiple languages and can summarize various types of documents [77]. Moreover, it was previously used as a benchmark to evaluate other summarization methods [22, 84]. In fact, in [84] it was used to evaluate a question answering approach that is somewhat similar to our work. Additionally, [85] presented a study on Copernic Summarizer, Microsoft Office Word Summarizer 2003 and Microsoft Office Word Summarizer 2007, with the objective of detecting which of them gives summaries most similar to those made by a human. The summaries were evaluated with the ROUGE system, and Copernic Summarizer scored the best results in comparison to the other summarizers.

6.2 Evaluation Methodology

The evaluation was designed on the basis of a twenty-five percent compression ratio for both FAQWEBSUMM and Copernic Summarizer. This ratio was chosen to give satisfactory, readable and understandable compressed summaries. Five human evaluators were employed to evaluate the summaries of the FAQ pages produced by both FAQWEBSUMM and Copernic. Each evaluator was requested to evaluate the extracted sentences, which can be considered the most important ones for a given FAQ Web page. There are two main quality criteria that the evaluators considered while carrying out their evaluation. First, does the number of extracted sentences obey the twenty-five percent compression ratio? Second, how good is the quality of the selected sentences, and to what extent do they best answer the question under consideration? The quality matrix is shown below.

Page Name: Page x - part (A)
Question      Very Bad   Bad   Good   Very Good   Excellent   Numeric Score
Question 1       -        -     -        -           1             1
Question 2       -        -     1        -           -             0.6
Question n       -        -     -        1           -             0.8

Table 6.1 FAQ Web Page Evaluation Quality Matrix.

The table above is given to each evaluator twice for each page, once for the FAQWEBSUMM system and once for Copernic Summarizer. The evaluator has the original FAQ Web page along with the summary from each system and evaluates every question in each page without knowing which summary represents which system, for integrity purposes; the summaries are simply tagged as A and B, one for each system. Each evaluator gives a score ranging from Very Bad to Excellent, as depicted above, according to whether the summary meets the preset quality criteria and to what extent. After the evaluator finishes the page evaluation, we compute a numeric score as follows: we give the quality rating Excellent a value of 1, Very Good 0.8, Good 0.6, Bad 0.4 and Very Bad 0.2. This helps in plotting graphs of the same questions across different pages and different evaluators, computing an average score for each page, and cross-referencing the evaluation with other evaluators. Copernic formulates its summary by detecting the important concepts of the whole page and summarizing on that basis. In order to make the comparison fair, we fed the Copernic Summarizer each question and its answers as an individual document and then concatenated the questions and their summaries to form the summary of the page.
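The numeric scoring described above is a simple mapping followed by an average. The sketch below illustrates it; the class and method names are illustrative only and are not part of the evaluation tooling used in the thesis.

import java.util.*;

public class EvaluationSheet {

    // Mapping of the categorical ratings to the numeric scale used in the evaluation.
    private static final Map<String, Double> SCALE = Map.of(
            "Excellent", 1.0, "Very Good", 0.8, "Good", 0.6,
            "Bad", 0.4, "Very Bad", 0.2);

    /** Average numeric score of one blind summary (tagged "A" or "B") for a page. */
    static double pageScore(List<String> perQuestionRatings) {
        double sum = 0.0;
        for (String rating : perQuestionRatings) sum += SCALE.get(rating);
        return sum / perQuestionRatings.size();
    }

    public static void main(String[] args) {
        // Example: one evaluator's ratings for a five-question page.
        List<String> ratings = List.of("Excellent", "Very Good", "Good", "Very Good", "Bad");
        System.out.printf("Average page score = %.2f%n", pageScore(ratings));   // prints 0.72
    }
}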
All in all, we use two comparison methodologies in evaluating the two systems. The first is comparing the average scores of each question in each page and, in turn, of all the pages together. The second is applying the widely used Student's t-Test (http://en.wikipedia.org/wiki/Student%27s_t-test) to find out whether the scores produced by the two systems for each page differ statistically. In simple terms, the t-Test compares the actual difference between two means in relation to the variation in the data (expressed as the standard deviation of the difference between the means). The formula for the t-Test is a ratio. The top part of the ratio is just the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores. The difference between the means is the signal that, in this case, we think our program or treatment introduced into the data; the bottom part of the formula is a measure of variability that is essentially noise that may make it harder to see the group difference (http://www.socialresearchmethods.net/kb/stat_t.php). The statistic can be calculated using the following equation (http://www.okstate.edu/ag/agedcm4h/academic/aged5980a/5980/newpage26.htm):

t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)        eq. (6)

where mean1 and mean2 are the two group means, s1^2 and s2^2 are the group variances, and n1 and n2 are the group sizes.

We apply the test by choosing the required level of significance (normally p = 0.05) and reading the tabulated t value. If the calculated t value exceeds the tabulated value, we say that the means are significantly different at that level of significance. A significant difference at p = 0.05 means that, if the null hypothesis were correct (i.e. the samples do not differ), we would expect to get a t value as great as this on less than 5% of occasions. So we can be reasonably confident that the evaluation scores do differ from one another, but we still have nearly a 5% chance of being wrong in reaching this conclusion (http://www.biology.ed.ac.uk/research/groups/jdeacon/statistics/tress4a.html). Statistical tests allow us to make statements with a degree of precision, but cannot actually prove or disprove anything. A significant result at the 95% probability level tells us that our data are good enough to support a conclusion with 95% confidence (but there is a 1 in 20 chance of being wrong).

6.3 Dataset

Due to the lack of FAQ summarization datasets, we created our own dataset. Twenty-two FAQ Web pages were collected manually and at random, covering nine different disciplines in order to span a wide and diverse range of FAQ topics, namely: Software related topics, Customer Services, Business, Art, Health, Society, Academic, News, and Sports. The dataset has a total of 390 questions with an average of 17 questions per page, while the average number of answers per question is one, with a total of approximately 2000 sentences. Appendix B describes the dataset in more detail. It shows the page-domain mapping, a brief description of the pages, the Web page URLs, and the number of questions residing in each page.

6.4 Experimental Results and Analysis

In this section, we investigate the experimental results generated by both our system, FAQWEBSUMM, and the Copernic system. In this research we conducted four experiments to examine and support the validity of our hypothesis. In our experiments, we used the dataset described in Section 6.3. The first experiment tests the summarization performance quality of our system against the Copernic system in terms of which system produces short quality summaries.
The second and the third experiments are derived from the experimental results of the first experiment. The second experiments tests the summarization quality with respect to the questions’ discipline and see if summaries we produced in certain disciplines are better than others. The third experiment to measures and compares the human evaluators’ performance in evaluating both summarizers. 88 The fourth experiment analyzes the evaluators’ scores and their homogeneity in evaluating our system and the Copernic system as well. For each experiment, we state the experiment’s objective, show the experimental results and finally analyze these results. 6.4.1 Experiment 1 “Performance Quality Evaluation” 6.4.1.1 Objective The main objective of this experiment is to measure the summarization quality of our system in comparison to the Copernic system in terms of which system gives more readable and informative short summaries. In other words, we compare the numeric scores given by all human evaluators to all questions in the dataset given by both summarizers. 6.4.1.2 Description This experiment runs on all questions of all pages resides in the dataset. In this experiment, eight human evaluators were involved to evaluate summaries provided by both summarizers. Each page is evaluated by three different evaluators and not necessarily the same for all pages. The involved human evaluators were asked to score the extracted answer(s) for each question of all pages introduced in the dataset. Their only preference is the quality of the extracted sentences in response to the question(s) under consideration and whether or not the summary follows the preset compression ratio. 6.4.1.3 Results Appendix C shows the scores of evaluating each question in each page in the form of an average score of three human evaluators one for our system and the other for Copernic. The overall average scores for all questions in all pages for FAQWEBSUMM and Copernic were 0.732 and 0.610 respectively with an improvement in favor of our 89 new approach by approximately 20 %. In addition, Appendix D shows a sample of summaries and their evaluation scores. 6.4.1.4 Discussion Based on the above, you can see that, in general, the FAQWEBSUMM system performs much better than the Copernic summarizer. The overall average scores for all pages indicate that it is superior to the other system by approximately 20%. In fact, this is a good enthusiastic result that urges in continuing down this path of research and digging for better results. On the other hand, after applying the t-Test to the evaluation scores in Appendix C, it was found that the differences between the two systems are extremely statistically significant with a 95% confidence rate. Moreover, figure 6.1 shows a summarized graphical representation that highlights the performance distribution of both systems. It shows three categories. The blue category shows how often our system scored higher than Copernic. The purple category shows how often Copernic scored higher than our system. Finally, the yellow category shows how often a tie existed between the scores of both systems. FAQWebSum Better in, 200, 51% Tie, 123, 32% Copernic Better in, 67, 17% Figure 6.1 Performance Distributions. 90 As you can see, our system performed better in 51% of the cases while Copernic performed better in only 17% of the cases leaving 32% of the cases where both systems had the same score. The figure below shows a summarized graphical representation of the questions’ score distribution. 
It shows how the scores are distributed between the five scoring categories namely Very Bad, Bad, Good, Very Good and Excellent. In addition, for each category, it shows the number of questions it represents and their percent ratio from the total number of questions. Very Bad, 5, 1% Bad, 13, 3% Excellent, 118, 30% Good , 80, 21% Very Bad Bad Good Very Good Excellent Very Good , 174, 45% Figure 6.2 FAQWEBSUMM Score Distribution. Note that the largest category distribution is the Very Good category as 45% of all questions had that score. The second largest category is the Excellent covering 30% of the answers. It means that 75% of the summaries of all questions got at least very good scores. The Good Category covers 21% of the remaining 25% leaving only 4% to be divided between Bad 3% and only 1% to the Very Bad category. In fact, these results seem to be very promising and their distribution is very encouraging. On the other hand, the figure below shows the same as Figure 6.2 but it represents questions’ score distribution for the Copernic system. 91 Very Bad, 47, 12% Excellent, 64, 16% Bad, 39, 10% Very Bad Bad Good Very Good Excellent Very Good , 134, 35% Good , 106, 27% Figure 6.3 Copernic Score Distribution. As you can see, the largest category distribution is the Very Good category but with 35% of the summaries of all questions had that score. The second largest category is the Good category with 27% of the answers. It means that more than half of the space is ranged between Good and Very Good. The Excellent Category takes only 16% of the remaining 38% leaving 22 % to be divided between Bad 10% and 12% to the Very Bad category. Note that the score distribution of our system showed significant difference in comparison to the Copernic’s score distribution. In addition, the figure below shows a graphical representation of the comparison between the score distributions of both our system and Copernic in terms of number of questions. 92 200 Number of Questions 180 160 140 120 FAQWebSum 100 Copernic 80 60 40 20 0 Very Bad Bad Good Very Good Excellent Score Value Figure 6.4 Score Distribution Comparison. As you can see, FAQWEBSUMM has a larger score distribution in the best two categories Excellent and Very Good covering 75% of the questions space. On the other hand, Copernic has a slightly higher score in the Good Category along with the other two low score categories Bad and Very Bad. 6.4.2 Experiment 2 “Page Discipline Performance Analysis” 6.4.2.1 Objective The main objective of this experiment is to measure the summarization quality of our system in comparison to the Copernic system in terms of which system gives higher scores with respect to the different pages’ disciplines. In other words, we test which system performs better on the different pages’ categories and whether the pages’ discipline has an impact on the summarization quality or not. 6.4.2.2 Experiment Description This experiment uses the previous run on all pages of the dataset. It measures the impact of different pages’ disciplines on the performance evaluation. The results of this experiment are divided into two main parts. First, we will show how our system 93 outperforms the Copernic system by showing the improvement ratio in percent for pages in each discipline. Second, we will show the t-Test values in terms of whether our scores are statistically significant better than that of Copernic’s for pages in each discipline. 
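Before presenting the results, the per-discipline comparison itself is straightforward: the page scores of each system are averaged within a discipline and the relative improvement of FAQWEBSUMM over Copernic is reported. The sketch below illustrates this with the two Business pages; the class and method names are illustrative only, and the definition of the improvement ratio as (ours - theirs) / theirs appears consistent with the values reported in Table 6.2.

public class DisciplineComparison {

    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    public static void main(String[] args) {
        // Per-page average scores for one discipline (here: the two Business pages).
        double[] faqwebsumm = {0.800, 0.555};
        double[] copernic   = {0.587, 0.535};

        double ours = mean(faqwebsumm);
        double theirs = mean(copernic);
        double improvement = (ours - theirs) / theirs * 100.0;   // relative improvement in percent

        // Prints: FAQWEBSUMM 0.678 vs Copernic 0.561 -> improvement 20.8%
        System.out.printf("FAQWEBSUMM %.3f vs Copernic %.3f -> improvement %.1f%%%n",
                ours, theirs, improvement);
    }
}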
6.4.2.3 Results

The table below shows the average score of all questions in each page as given by its three human evaluators, once for our system and once for Copernic. Overall, the scores are in favor of the FAQWEBSUMM system.

Domain                          Page                    FAQWEBSUMM   Copernic   Improvement Ratio
Software "Q1-Q113"              Page 1 "Q1-Q8"             0.600       0.367        63.6%
                                Page 2 "Q9-Q43"            0.699       0.512        36.5%
                                Page 3 "Q44-Q55"           0.694       0.550        26.3%
                                Page 4 "Q56-Q107"          0.676       0.603        12.1%
                                Page 5 "Q108-Q113"         0.488       0.522        -6.4%
                                Average                    0.632       0.511        23.6%
Customer Service "Q114-Q143"    Page 6 "Q114-Q120"         0.847       0.561        50.9%
                                Page 7 "Q121-Q138"         0.678       0.581        16.6%
                                Page 8 "Q139-Q143"         0.720       0.767        -6.2%
                                Average                    0.748       0.637        17.5%
Business "Q144-Q180"            Page 9 "Q144-Q153"         0.800       0.587        36.4%
                                Page 10 "Q154-Q180"        0.555       0.535         3.7%
                                Average                    0.678       0.561        20.8%
Art "Q181-Q245"                 Page 11 "Q181-Q209"        0.788       0.618        27.5%
                                Page 12 "Q210-Q217"        0.795       0.450        76.7%
                                Page 13 "Q218-Q245"        0.862       0.774        11.4%
                                Average                    0.815       0.614        32.8%
Health "Q246-Q308"              Page 14 "Q246-Q254"        0.867       0.600        44.4%
                                Page 15 "Q255-Q286"        0.804       0.664        21.0%
                                Page 16 "Q287-Q308"        0.833       0.745        11.8%
                                Average                    0.835       0.670        24.6%
Society "Q309-Q344"             Page 17 "Q309-Q318"        0.813       0.647        25.8%
                                Page 18 "Q319-Q339"        0.775       0.539        43.6%
                                Page 19 "Q340-Q344"        0.733       0.720         1.9%
                                Average                    0.774       0.635        21.8%
News "Q345-Q362"                Page 20 "Q345-Q362"        0.688       0.496        38.9%
Academic "Q363-Q378"            Page 21 "Q363-Q378"        0.696       0.683         1.9%
Sports "Q379-Q390"              Page 22 "Q379-Q390"        0.720       0.648        11.2%

Table 6.2 Average Page Scores Categorized by Discipline.

The figure below shows a graphical comparison of score quality with respect to page discipline for both our system and Copernic.

[Bar chart comparing the average scores of FAQWEBSUMM and Copernic for each page discipline.]

Figure 6.5 Page Discipline Score Comparison.

A detailed comparison of the score distributions for the different page disciplines of both systems is presented in Appendix C. Table 6.3 shows the t-Test results for the evaluation scores of all question summaries produced by the two systems for each page. The overall t value for a discipline is computed over all the scores of its pages.

Domain                          Page                    t-Test            Overall t-Test
Software "Q1-Q113"              Page 1 "Q1-Q8"          Significant       Significant
                                Page 2 "Q9-Q43"         Significant
                                Page 3 "Q44-Q55"        Significant
                                Page 4 "Q56-Q107"       Significant
                                Page 5 "Q108-Q113"      Not Significant
Customer Service "Q114-Q143"    Page 6 "Q114-Q120"      Significant       Significant
                                Page 7 "Q121-Q138"      Significant
                                Page 8 "Q139-Q143"      Not Significant
Business "Q144-Q180"            Page 9 "Q144-Q153"      Significant       Significant
                                Page 10 "Q154-Q180"     Not Significant
Art "Q181-Q245"                 Page 11 "Q181-Q209"     Significant       Significant
                                Page 12 "Q210-Q217"     Significant
                                Page 13 "Q218-Q245"     Significant
Health "Q246-Q308"              Page 14 "Q246-Q254"     Significant       Significant
                                Page 15 "Q255-Q286"     Significant
                                Page 16 "Q287-Q308"     Significant
Society "Q309-Q344"             Page 17 "Q309-Q318"     Significant       Significant
                                Page 18 "Q319-Q339"     Significant
                                Page 19 "Q340-Q344"     Not Significant
News "Q345-Q362"                Page 20 "Q345-Q362"     Significant       Significant
Academic "Q363-Q378"            Page 21 "Q363-Q378"     Not Significant   Not Significant
Sports "Q379-Q390"              Page 22 "Q379-Q390"     Not Significant   Not Significant

Table 6.3 t-Test Values for All Pages by All Evaluators.
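The per-page significance decisions in Table 6.3 are obtained by comparing the two vectors of per-question scores for a page. The sketch below computes an unpaired two-sample t statistic for such a comparison; the thesis applies Student's t-Test at p = 0.05 but does not spell out the exact variant, so this is an illustration under that assumption rather than the original evaluation code, and the sample scores are hypothetical.

public class TTestSketch {

    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    static double variance(double[] x, double m) {
        double s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return s / (x.length - 1);                 // sample variance
    }

    /** t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2), as in eq. (6). */
    static double tStatistic(double[] a, double[] b) {
        double m1 = mean(a), m2 = mean(b);
        return (m1 - m2) / Math.sqrt(variance(a, m1) / a.length + variance(b, m2) / b.length);
    }

    public static void main(String[] args) {
        double[] ours   = {1.0, 0.8, 0.8, 0.6, 1.0};   // hypothetical per-question scores, system A
        double[] theirs = {0.6, 0.6, 0.8, 0.4, 0.6};   // hypothetical per-question scores, system B
        System.out.printf("t = %.3f (compare against the tabulated value at p = 0.05)%n",
                tStatistic(ours, theirs));
    }
}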
6.4.2.4 Discussion

Based on the above, you can see that the FAQWEBSUMM approach performs better than the Copernic summarizer with respect to the average score comparison. Moreover, we applied the t-Test to the evaluation scores, as can be seen in Table 6.3. It was found that the differences between the two systems are statistically significant for 16 pages, while only 6 pages were found to be non-significant. In addition, you can see that in seven of the nine domains, namely Software, Customer Service, Business, Art, Health, Society and News, the t-Test scores were found significant.

Based on the above results, you can see that most of the pages showed a statistically significant difference in favor of our system, which is an excellent result. Moreover, counting the pages in which our tool scores better than the Copernic tool, we found that our tool scored better in 91% of the pages. The highest improvement ratio obtained by our tool was 76.7% for an Art page (Page 12), while the highest improvement ratio for the Copernic tool was 10% for a human rights page.

On the other hand, if we take a deeper look at Table 6.3, we observe a relation between the number of questions residing in a page or discipline and the degree of significance of its score. In our dataset, the lowest number of questions in a page is 5, while the highest is 52. If we divide the range between the lowest and highest values into 5 sets, with a step of 10 questions between sets, we find the following. For the first set, with a small number of questions between 5 and 15, there exist 11 pages, 7 of which were detected as significant and 4 as not significant. For the second set, with 16 to 25 questions, there exist 5 pages, 4 of them detected as significant and only 1 not. For the third set, with still more questions, there exist 5 pages, 4 detected as significant and only 1 not. For the fourth and fifth sets, there exists 1 page in each set, with the highest numbers of questions, and both of them were significant. In other words, the probability of being significant for a small number of questions from 5 to 15 was 63%. Moving to mid values of 25 to 35 questions, the probability was 75%, while for the largest two sets the probability reached 100%. Therefore, we can draw the following conclusion: the higher the number of questions residing in a page summarized by our system, the more probable it is that its scores are significantly better than Copernic's.

6.4.3 Experiment 3 "Human Evaluators' Performance Analysis"

6.4.3.1 Objective

The main objective of this experiment is to measure and compare the human evaluators' performance in evaluating both summarizers.

6.4.3.2 Experiment Description

This experiment uses the previous run on all pages of the dataset. The results of this experiment are divided into two main parts. First, we show how our system outperforms the Copernic system by giving the improvement ratio in percent for each evaluator. Second, we show the t-Test values indicating whether, for each evaluator, the overall scores of all pages differ significantly between our system and Copernic.

6.4.3.3 Results

The table below shows, for each evaluator, the average scores of the summaries they were assigned from both summarizers, along with the improvement ratio of our system over Copernic's.

Evaluator      FAQWEBSUMM   Copernic Summarizer   Improvement Ratio
Evaluator 1       0.774           0.721                 7.3%
Evaluator 2       0.582           0.464                25.4%
Evaluator 3       0.677           0.524                29.0%
Evaluator 4       0.756           0.622                21.6%
Evaluator 5       0.597           0.506                17.8%
Evaluator 6       0.845           0.635                33.0%
Evaluator 7       0.770           0.620                24.1%
Evaluator 8       0.828           0.662                25.0%
Average                                                26.3%

Table 6.4 Evaluators Improvement Ratio.
As you can see, the overall average improvement of our summaries over Copernic's, as stated by all evaluators, is in favor of our new approach by approximately 26.3%. Table 6.5 shows, for each individual evaluator, the result of computing the statistical significance over all pages he or she scored.

Evaluator      Degree of Significance
Evaluator 1    Not Significant
Evaluator 2    Significant
Evaluator 3    Significant
Evaluator 4    Significant
Evaluator 5    Not Significant
Evaluator 6    Significant
Evaluator 7    Significant
Evaluator 8    Significant

Table 6.5 t-Test Results Over Evaluator Scores.

As you can see, the individual scores of six evaluators out of eight were detected as statistically significant. In other words, the scores of 75% of all evaluators signaled a statistically significant difference between our system's scores and Copernic's.

6.4.3.4 Discussion

Considering the evaluators' data, we found that the average scores of all pages they evaluated are in favor of our system. The highest improvement ratio, obtained by the sixth evaluator, was 33%, while the lowest was 7.3%. Moreover, the average improvement for all evaluators is approximately 26%. This means that 100% of the evaluators gave better scores, on average, to our system. On the other hand, applying the t-Test to the pages evaluated by each evaluator showed that, in general, there is a statistically significant difference between our system and Copernic. Based on the above results, we can see that the scores of 75% of all evaluators signaled a statistically significant difference between our system's scores and Copernic's, while only 25% of the evaluators' scores did not.

6.4.4 Experiment 4 "Analyzing Evaluators' Homogeneity"

6.4.4.1 Objective

The main objective of this experiment is to measure the human evaluators' degree of homogeneity in scoring the summaries of each system. In other words, we test how different and how similar the evaluations are for pages common to a set of evaluators.

6.4.4.2 Experiment Description

This experiment uses the previous run on all pages of the dataset. It compares the scores of all evaluators and measures the degree of agreement for common pages. This means that if we have two summaries, one from our system and the other from Copernic, and we have two evaluators, then we check the evaluations of evaluator 1 and evaluator 2 for the summary generated by our system and look for similarities and differences between the two evaluations; the same applies to Copernic. We show comparisons in terms of the t-Test for pairs of evaluators who scored the same pages, in order to measure the degree of homogeneity in their scores.

6.4.4.3 Results

In order to measure the degree of overlap, or homogeneity, between the different evaluators, Table 6.6 was derived by comparing the scores of pairs of evaluators on the same pages. The table shows each pair of evaluators along with two t-Test results, one for our system and the other for Copernic. The t-Test result here indicates whether the scores provided by the two evaluators to the same system, once comparing their scores for our system and once comparing their scores for Copernic, are significantly different from each other or not.
Evaluator (a)   Evaluator (b)   FAQWEBSUMM        Copernic
      1               2         Significant       Significant
      1               3         Significant       Significant
      1               4         Not Significant   Not Significant
      1               5         Significant       Significant
      2               3         Not Significant   Not Significant
      2               4         Not Significant   Not Significant
      2               5         Not Significant   Not Significant
      3               4         Significant       Significant
      3               5         Significant       Not Significant
      4               5         Not Significant   Not Significant
      6               7         Not Significant   Not Significant
      6               8         Not Significant   Not Significant
      7               8         Not Significant   Not Significant

Table 6.6 t-Test Results Comparing Evaluators' Scores.

A detailed t-Test comparison between each pair of evaluators for individual pages is presented in Appendix C.

6.4.4.4 Discussion

Considering the comparison between pairs of evaluators who scored the same page set, we found that 8 evaluator pairs out of 13 did not signal statistical significance in scoring our system; that is, the scores given to the same pages by different evaluators were not significantly different. This means that 62% of the evaluator pairs for the same page sets were compatible and agreed on the same scoring quality. On the other hand, we found that 9 evaluator pairs out of 13 did not signal statistical significance in scoring Copernic's summaries. This means that 69% of the evaluator pairs were compatible and agreed on the same scoring quality for Copernic. Both statements support our hypothesis that our system is better than Copernic in summarizing FAQ Web pages, because most of the evaluators agreed on the same scoring for both systems.

Chapter 7

CONCLUSION AND FUTURE WORK

Conclusion

In this research we presented the preliminary results of our FAQ Web pages automatic text summarization approach, which is English-language dependent. It was found that FAQ Web pages are organized such that each question has a specific heading style, e.g. bold, underlined, or tagged, while the answer follows in a different, lower style, usually a smaller font, and may be scattered over subheadings if the answer is logically divided. Based on the above, FAQ Web page summarization can benefit from utilizing Web page segmentation algorithms, which are based on visual cues, first to identify the hierarchical structure of the different headings and then to extract question and answer segments out of it. In addition, we devised a new combination of selection features to perform the summary generation task. These features are question-answer similarity, query overlap, sentence location in answer paragraphs, and capitalized words frequency. The choice of these features was influenced by the process of analyzing the different question types and anticipating the expected correct answer. The first feature, Sentence Similarity, evaluates the semantic similarity between the question sentence and each sentence in the answer. It does so by comparing the word senses of the question and answer words, assigning each pair of words a numerical value, and then accumulating a value for the whole sentence. The second feature, Query Overlap, extracts the following word types from the question sentence: nouns, adverbs, adjectives and gerunds; it automatically formulates a query from them and counts the number of matches with each of the answer sentences. The third feature, Location, gives a higher score to sentences at the beginning of the answer and lower scores to the following sentences. The fourth feature, Capitalized Words Frequency, computes the frequency of capitalized words in a sentence.
We give each feature a weight and then linearly combine them in a single equation to give a cumulative score for each sentence. The different document features were combined by a home grown weighting score function. It was found that using each of the features solely performed well in some cases based on different kinds of evidence. Pilot experimentations and analysis helped us in obtaining a suitable combination of feature weights. In our experiments, we showed the results of four experiments that supported the validity of our hypothesis. The first experiment tested the summarization performance quality of our system against the Copernic system in terms of which system produces readable short quality summaries. The second experiment tested the summarization quality with respect to the questions’ discipline. The third experiment to measures and compares the human evaluators’ performance in evaluating both summarizers. The fourth experiment analyzes the evaluators’ scores and their homogeneity in evaluating our system and the Copernic system as well. In general, it was found out that the FAQWEBSUMM system performs much better than the Copernic summarizer. The overall average scores for all pages indicate that it is superior to the other system by approximately 20% which seems to be quite promising. In addition, the overall average for all pages indicates statistical significant improvements for our approach in 62% of the cases when compared with the commercial summarization tool. However, the superiority comes from the idea of knowing the web page structure in advance that helps in reaching better results than applying a general solution to all types of pages. On the other hand, with respect to the page discipline, the scores of seven page domains (77%) -Software, Customer Service, Art, Health, society and News- out of nine were detected significant and only two were detected as not significant due to the lack of test data in their categories that contained only one page. 104 Moreover, we believe if we have more pages in those disciplines -Academic and Sports- we could have had the chance of proving that our home grown tool to be superior in all disciplines which means that our approach work for all types of pages. Considering human evaluators, it was found out that the average improvement for all evaluators is in favor of our new approach by approximately 26.3 %. Taking the number of evaluators in which they scored our tool to be better than Copernic’s in the average case was 100 %. In addition, considering the comparison between pairs of evaluators who scored the same page set, we found that 8 evaluator pair records out of 13 did not signal statistical significance in scoring our system-this means that the scores weren’t different for the same pages by different evaluators. This means that 62 % of evaluators to same page sets were compatible and they all agree to the same scoring quality. Main Contributions: One of the main contributions in this research demonstrating that using the proposed feature selection combination is a highly preferable solution to the problem of summarizing FAQ Web pages. Another contribution is by utilizing Web page segmentation we were able to differentiate between the different constructs resides in Web pages, which enabled us to filter out those constructs that we see unnecessary. As a result, this research proves the hypotheses that by analyzing the structure of the FAQ Web pages, we can get better summarization results. 
The last contribution is related to those answers that are divided into a set of paragraphs under smaller headings representing the logical structure introduced by the page creator. The final summary is a subset of all those paragraphs not only one of them. This is also a benefit of using the segmentation tool so that we know the exact structure of the answer. 105 Future Work: There are some improvements identified and need to be additionally addressed in further research. First of all, improving the pre-processing phase, as it may result in major data loss if not considered carefully. There is also a need to experiment on a different dataset with different types of Web pages covering different genres, employ different human evaluators with different background and consider a way of utilizing automatic evaluation. Another future enhancement is to do some work on complex question decomposition and analysis in order to help extracting accurate and responsive answers to these questions. Typically, complex questions address a topic that relates to many entities, events and even complex relations between them. In fact, the quality of question-focused summaries depends in part on how complex questions are decomposed. Therefore, by decomposing questions and finding the relationship between its entities, it would be easier to identify related entities in answers, thus selecting the more accurate responsive sentences as summary. It would be also a good enhancement to exploit automatic tuning for feature weighting based on the question type. As different question types –What, When, How, why, etcrequires different answer. Therefore, different features may be more appropriate when considering a specific question type and others may be less effective. Moreover, some answers may be structurally different than others as some may contain bullets or points that needs to be processed differently. As a result sentence compaction may be introduced then and not selecting the whole sentence as we do now. 106 REFERENCES [1] D. Das and F.T. Martins. “A Survey on Automatic Text Summarization”. Literature Survey for the Language and Statistics II course at CMU, 2007. [2] K.Jones. “Automatic summarising: The State of the Art”, Information Processing and Management: an International Journal, Volume 43, Issue 6, pp.1449-1481, 2007. [3] S. Afantenos , V. Karkaletsis , P. Stamatopoulos, “Summarization From Medical Documents: A Survey”, Artificial Intelligence in Medicine Journal, Volume.33, Issue 2, pp.157-177, 2005. [4] R. Basagic, D. Krupic, B. Suzic, D. Dr.techn, C. Gütl “Automatic Text Summarization Group “Institute for Information Systems and Computer Media, Graz http://www.iicm.tugraz.ac.at/cguetl/courses/isr/uearchive/uews2009/Ue10%20%20Automatic%20Text%20S ummarization.pdf {Retrieved on 12-04-2011} [5] L. Alemany, I. Castell´on, S. Climent, M. Fort, L. Padr´o, and H. Rodr´ıguez. “Approaches to Text Summarization: Questions and Answers”. Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial Journal, Volume 8, Issue 22: pp. 79– 102, 2003. [6] S. Roca, “Automatic Text Summarization” http://www.uoc.edu/humfil/digithum/digithum3/catala/Art_Climent_uk/climent/climent.h tml {Retrieved on 10-04-2011} [7] A. Nenkova. “Summarization Evaluation for Text and Speech: Issues and Approaches”. INTERSPEECH 2006. 9th International Conference on Spoken Language Processing, pp. 2023-2026, Pittsburgh, USA , 2006. [8] H. Luhn. “The Automatic Creation of Literature Abstracts”. 
IBM Journal of Research Development, Volume 2, Issue 2, pp.159-165, 1958. [9] P. Baxendale. “Machine-Made Index for Technical Literature - An Experiment”. IBM Journal of Research Development, Volume 2, Issue 4, pp. 354-361, 1958. [10] H. Edmundson “New Methods in Automatic Extracting”. Journal of the ACM, Volume 16, Issue 2, pp. 264-285, 1969. 107 [11] U. Reimer and U. Hahn. "A Formal Model of Text Summarization Based on Condensation Operators of a Terminological Logic"; In Proceedings of the Workshop on Intelligent Scalable Summarization Conference, pp. 97–104, Madrid, Spain, 1997. [12] T. Nomoto, “Bayesian Learning in Text Summarization”, In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing , pp. 249-256, Vancouver, Canada, 2005. [13] A. Bawakid, M. Oussalah: "A Semantic Summarization System”, In Proceedings of the 1st Text Analysis Conference, pp. 285-291, Gaithersburg, USA,2008. [14] M. Osborne. “Using Maximum Entropy for Sentence Extraction”. In Proceedings of the ACL-02 Workshop on Automatic Summarization, pp. 1-8, Morristown, USA, 2002. [15] K. Svore , L. Vanderwende, and C. Burges. “Enhancing Single-Document Summarization by Combining RankNet and Third-Party Sources”. In Proceedings of the EMNLP-CoNLL-07, pp. 448-457, Prague, Czech Republic, 2007. [16] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender “Learning to Rank Using Gradient Descent”. In Proceedings of the 22nd International Conference on Machine Learning, pp. 89-96, New York, USA, 2005. [17] C. Lin,.Y. Rouge: “A package for Automatic Evaluation of Summaries”. In Proceedings of the ACL-04 Workshop, pp 74-81, Barcelona, Spain, 2004. [18] W. Chuang, J. Yang, “Extracting Sentence Segments for Text Summarization: a Machine Learning Approach”, In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.152-159, Athens, Greece, 2000 [19] Z. Xie, X. Li, B. Di Eugenio, P. C. Nelson, W. Xiao, T. M. Tirpak, “Using Gene Expression Programming to Construct Sentence Ranking Functions for Text Summarization”, In Proceedings of the 20th international conference on Computational Linguistics Table of Contents, pp. 1381-1385, Stroudsburg , USA, 2004. [20] J.Yeh, H. Ke, W. Yang, I. Meng, “Text Summarization Using a Trainable Summarizer and Latent Semantic Analysis”, Information Processing and Management: an International Journal, Volume.41, Issue 1, p.75-95, 2005. 108 [21] M. Abdel Fattah , F. Ren, “GA, MR, FFNN, PNN and GMM Based Models for Automatic Text Summarization”, Computer Speech and Language Journal, Volume 23 Issue 1, p.126-144, 2009 [22] A. Kiani and M. Akbarzadeh. “Automatic Text Summarization Using: Hybrid Fuzzy GA-GP”. In Proceedings of IEEE International Conference on Fuzzy Systems, pp. 977983, Vancouver, Canada, 2006. [23] R. Verma, P. Chen, W. Lu. "A Semantic Free-text Summarization System Using Ontology Knowledge"; In Proceedings of the Document Understanding Conference, pp. 439-445, Rochester, New York, 2007. [24] D. Radev , H. Jing , M. Styś , D. Tam, “Centroid-Based Summarization of Multiple Documents”, Information Processing and Management Journal, Volume 40, Issue 6, pp.919-938, 2004. [25] H. Saggion and R. Gaizauskas. “Multi-document Summarization by Cluster/Profile Relevance and Redundancy Removal”. In Proceedings of the 4th Document Understanding Conference, pp386-392. Boston, USA, 2004. [26] X. Wan. 
“An Exploration of Document Impact on Graph-based Multi-Document Summarization”. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pp. 755–762, Honolulu, USA, 2008. [27] A. Haghighi and L. Vanderwende. “Exploring Content Models for Multi-Document Summarization”. In Proceedings of Human Language Technologies Conference, pp. 362– 370, Boulder, USA, 2009. [28] R. Angheluta, R. De Busser, and M. Moens. “The Use of Topic Segmentation for Automatic Summarization”. In Proceedings of the Second Document Understanding Conference, pp. 264-271, Philadelphia, USA, 2002. [29] E. Alfonseca, J. Guirao, and A. Sandoval. “Description of the UAM System for Generating Very Short Summaries”, In Proceedings of the 4th Document Understanding Conference, pp. 226-232, Boston,, USA, 2004. [30] R. Angheluta, R. Mitra, X. Jing, and M. Moens. “K.U. Leuven Summarization System”, at DUC 2004. In Proceedings of the 4th Document Understanding Conference, pp. 286-292, Boston, USA, 2004. 109 [31] H. Daum´e and D. Marcu. “Bayesian Query-Focused Summarization”. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics and the Twenty-First International Conference on Computational Linguistics, pp 305–312, Sydney, Australia, 2006. [32] S. Fisher and B. Roark. “Feature Expansion for Query-Focused Supervised Sentence Ranking”. "; In Proceedings of the Document Understanding Conference, pp. 213-221, Rochester, New York, 2007. [33] E. Baralis , P. Garza , E. Quintarelli , L. Tanca, “Answering XML Queries by Means of Data Summaries”, ACM Transactions on Information Systems Journal, Volume 25, Issue 3, pp.10-16, 2007. [34] H. Joho, D. Hannah, and J. Jose. “Emulating Query Biased Summaries Using Document Titles”. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 709–710, New York, USA, 2008. [35] Y. Vandeghinste. “Sentence Compression for Automated Subtitling: A Hybrid Approach”. In Proceedings of the ACL Workshop on Text Summarization, pp.89–95, Barcelona, Spain, 2004. [36] H. Jing, “Sentence Reduction for Automatic Text Summarization”, In Proceedings of the Sixth Conference on Applied Natural Language Processing, pp.310-315, Seattle, USA, 2000. [37] D. McClosky and E. Charniak.. “Self-Training for Biomedical Parsing”. In Proceedings of the Association for Computational Linguistics Conference, pp.852-865 ,Columbus, USA, 2008. [38] S. Jonnalagadda, L. Tari, J. Hakenberg, C. Baral, and G. Gonzalez. “Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text”. In Proceedings of Human Language Technologies Conference, pp. 177–180, Boulder, USA, 2009. [39] K. Woodsend , M. Lapata, “Automatic Generation of Story Highlights”, In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp.565-574, Uppsala, Sweden, 2010. 110 [40] K. Knight , D. Marcu, “Summarization Beyond Sentence Extraction: A Probabilistic Approach to Sentence Compression”, Journal of Artificial Intelligence, Volume 139, Issue 1, pp.91-107, 2002. [41] T. Cohn and M. Lapata. “Sentence Compression Beyond Word Deletion”. In Proceedings of the 22nd International Conference on Computational Linguistics, pp. 137–144, Manchester, UK, 2008. [42] D. Radev , O. Kareem , J. Otterbacher, “Hierarchical Text Summarization for WAPEnabled Mobile Devices”, In Proceedings of the 28th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 
679-679, Salvador, Brazil , 2005. [43] J. Otterbacher , D. Radev , O. Kareem, “News to go: Hierarchical Text Summarization for Mobile Devices”, In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval , pp. 589596, Seattle, USA, 2006. [44] J. Otterbacher, D. Radev, O. Kareem, “Hierarchical Summarization for Delivering Information to Mobile Devices”, Information Processing and Management Journal, Volume 44, Issue 2, pp.931-947, 2008 [45] C. Yang, F. Wang, “Fractal Summarization for Mobile Devices to Access Large Documents on the Web”, In Proceedings of the 12th international conference on World Wide Web, pp.215-224, Budapest, Hungary , 2003. [46] M. Amini, A. Tombros, N. Usunier, M. Lalmas, and P. Gallinari. “Learning to Summarise XML Documents Using Content and Structure”. In Proceedings of the 14th International Conference on Information and Knowledge Management, pp. 297–298, Bremen, Germany, 2005. [47] N. Fuhr, S. Malik, and M. Lalmas. “Overview of the Initiative for the Evaluation of XML Retrieval (inex) 2003”. In Proceedings of the 2nd INEX Workshop, pp.1-11, Dagstuhl, Germany, 2004. [48] S. Harper, Y. Yesilada, C. Goble, and R. Stevens. “How Much is Too Much in a Hypertext Link? Investigating Context and Preview – A Formative Evaluation”. In 111 [49] J. Delort, B. Bouchon-Meunier and M. Rifqi. “Enhanced Web Document Summarization Using Hyperlinks”. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, pp. 208-215, Nottingham, UK, 2003. [50] S. Harper and N. Patel. “Gist Summaries for Visually Impaired Surfers”. In Proceedings of the 7th international ACM SIGACCESS Conference on Computers and Accessibility, pp. 90-97, New York, USA, 2005. [51] A. Dalli, Y. Xia, Y. Wilks, “FASIL Email Summarisation System”, In Proceedings of the 20th International Conference on Computational Linguistics,pp.994-999, Geneva, Switzerland, 2004 [52] E. Toukermann , S. Muresan , J. L. Klavans, “GIST-IT: Summarizing Email Using Linguistic Knowledge and Machine Learning”, In Proceedings of the workshop on Human Language Technology and Knowledge Management, pp.1-8, Toulouse, France, 2001. [53] G. Carenini, R. Ng, X. Zhou, “Summarizing Email Conversations with Clue Words”, In Proceedings of the 16th International Conference on World Wide Web, pp. 91-100, Banff, Canada, 2007. [54] S. Oliver, E. Ringger, M. Gamon, and R. Campbell. “Task-Focused Summarization of Email”. In Proceedings of the ACL-04 Workshop Text Summarization, pp 43-50, Barcelona, Spain, 2004. [55] D. Zajic , B.. Dorr , J. Lin, “Single-Document and Multi-Document Summarization Techniques for Email Threads Using Sentence Compression”, Information Processing and Management Journal, Volume 44, Issue 4, pp.1600-1610, 2008 [56] M. Hu , B. Liu. “Mining and Summarizing Customer Reviews”, In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177, Seattle, USA, 2004. [57] L. Zhuang , F. Jing , X. Zhu, “Movie Review Mining and Summarization”, In Proceedings of the 15th ACM international conference on Information and Knowledge Management, pp. 43-50, Arlington, USA, 2006, 112 [58] F. Li, C. Han, M. Huang, X. Zhu, Y. Xia, S. Zhang, and H. Yu. “Structure Aware Review Mining and Summarization”. In Proceedings of the 23rd International Conference on Computational Linguistics Association for Computational Linguistics, pp. 653-661, Beijing, China, 2010. [59] B. Sharifi, M. Hutton, and J. Kalita. 
“Summarizing Microblogs Automatically”. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics Association for Computational Linguistics , pp. 685-688, LA, USA, 2010. [60] X. Song, Y. Chi, K. Hino, L. B.Tseng “Summarization System by Identifying Influential Blogs”, In Proceedings of the International Conference on Weblogs and Social Media, pp. 325-326, Boulder, U.S.A., 2007. [61] F. Lacatusu, A. Hickl, and S. Harabagiu,. “Impact of Question Decomposition on the Quality of Answer Summaries”, In Proceedings of the 5fth International Conference on Language Resources and Evaluation, pp. 233-236, Genoa, Italy, 2006. [62] Y. Tao, C. Huang and C. Yang. “Designing an automatic FAQ abstraction for internet forum”. Journal of Information Management, Volume 13, Issue 2, pp. 89-112, 2006. [63] Y. Tao, S. Liu and C. Lin, “Summary of FAQs from a Topical Forum Based on the Native Composition Structure”, Expert Systems with Applications Journal, Volume 38, Issue 1, pp. 527-535, 2011. [64] V. Barbier and A.-Laure Ligozat. “A Syntactic Strategy for Filtering Sentences in a Question Answering”, In Proceedings of the 5th International Conference on Recent Advances in Natural Language Processing , pp.18-24, Borovets, Bulgaria, September 2005. [65] A. Berger , V. Mittal, “Query-Relevant Summarization Using FAQs”, In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics , pp.294-301, Hong Kong, 2000. [66] D. Radev and D. Tam, “Single-Document and Multi-Document Summary Evaluation Via Relative Utility,” In Proceedings of 12th International Conference on Information and Knowledge Management, pp.21-30, New Orleans , USA, 2003. 113 [67] D. Harman and P. Over. “The Effects of Human Variation in DUC Summarization Evaluation”. In Proceedings of the ACL-04 Workshop, pp. 10-17, Barcelona, Spain, 2004. [68] K. McKeown, V. Hatzivassiloglou, R. Barzilay, B. Schiffman, D. Evans, S. Teufel, “Columbia Multi-Document Summarization: Approach and Evaluation,” In Proceedings of the Document Understanding Conference, pp.217-226, New Orleans, USA, 2001. [69] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A Method for Automatic Evaluation of Machine Translation,” In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311-318, Philadelphia, USA, 2002. [70] C. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” In Proceedings of the Workshop on Text Summarization Branches, pp. 74-81, Barcelona, Spain, 2004. [71] A. Nenkova and R. Passonneau, “Evaluating Content Selection in Summarization: The Pyramid Method,” In Proceedings of the HLT/NAACL, pp 145-152, Boston, USA , 2004. [72] T. Pardo, L. Rino, M. Nunes: “NeuralSumm: A Connexionist Approach to Automatic Text Summarization”. In: Proceedings of the 4th Encontro Nacional de Inteligência Artificial, pp. 210-218, Campinas, Brazil, 2003. [73] J. Neto , A. Freitas , C. Kaestner, “Automatic Text Summarization Using a Machine Learning Approach”, In Proceedings of the 16th Brazilian Symposium on Artificial Intelligence: Advances in Artificial Intelligence, pp.205-215, Porto de Galinhas/Recife, Brazil, 2002. [74] T. Pardo, L. Rino, M. Nunes. “GistSumm: A Summarization Tool Based on a New Extractive Method”. In Proceedings of the 6th Workshop on Computational Processing of the Portuguese, pp. 210–218, Porto Alegre, Brazil,2003. [75] L. Neto, J. Santos, A. Kaestner, C. Freitas. “Document Clustering and Text Summarization”. 
In Proceedings of the 4th International Conference of Practical Applications of Knowledge Discovery and Data Mining, pp. 41–55, Manchester, UK, 2000. 114 [76] G. Salton, C. Buckley. “Term-Weighting Approaches In Automatic Text Retrieval. Information Processing and Management, pp. 513–523, Ithaca, USA, 1987. [77] Copernic Summarizer, “Technologies WhitePaper, 2003”. http://www.copernic.com/data/pdf/summarization-whitepapereng.pdf, {Retrieved on 10 04-2011} [78] E. Niggemeyer B. SimSum. “An Empirically Founded Simulation of Summarizing”. Information Processing & Management Journal, Volume 36, Issue 4, pp. 659-682, 2000. [79] D. Cai, S. Yu, J. Wen, and W. Ma “VIPS: A Vision-Based Page Segmentation Algorithm” http://research.microsoft.com/pubs/70027/tr-2003-79.pdf, {Retrieved on 1204-2011}. [80] M. Azmy, S. El-Beltagy, and A. Rafea. “Extracting the Latent Hierarchical Structure of Web Documents”, In Proceedings of the International Conference on Signal-Image Technology and Internet-Based Systems, pp.385-393, Hammamat, Tunisia, 2006. [81] N. Dao, T. Simpson. "Measuring Similarity between Sentences" (online), http://wordnetdotnet.googlecode.com/svn/trunk/Projects/Thanh/Paper/WordNetDotNet_S emantic_Similarity.pdf, {Retrieved on 10-04-2011} [82] M. Lesk. “Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone”, In Proceedings of the 5th annual International Conference on Systems Documentation, pp.24-26, Toronto, Canada, 1986. [83] E. Stamatatos, N. Fakotakis, and G. Kokkinakis. "Automatic Extraction of Rules for Sentence Boundary Disambiguation". University of Patras. http://www.ling.gu.se/~lager/Mutbl/Papers/sent_bound.ps. {Retrieved 10-04-2011} [84] H. Chen and C. Lin and E. Hovy. “Automatic Summarization System coupled with a Question-Answering System (QAAS)” In Proceedings of the COLING ’02 Workshop on Multilingual Summarization and Question Answering, pp. 365-374, Taipei, Taiwan, 2002. 115 [85] R. Arnulfo, G. Hernández, Y. Ledeneva, G. Matías, M. Ángel, H. Dominguez, J. Chavez, A. Gelbukh, J. Luis, T. Fabela, “Comparing Commercial Tools and State-of-theArt Methods for Generating Text Summaries”, In Proceedings of the 8th Mexican International Conference on Artificial Intelligence IEEE Computer Society Washington, DC, USA, 2009. 116 APPENDIX A PILOT EXPERIMENT DETAILED RESULTS The table below shows summaries provided by running our system using each feature individually also it shows summaries provided by our proposed combined weighted scheme. The table also provides the scores as stated by the human evaluator involved. Summaries provided below are for a Web page that can be found at the following link: http://www.cartoonfactory.com/faq.html Score Question SIMILARITY Good LOCATION QUERY CAPITALIZED WORDS Combined Excellent Very Bad 1 Q. Do we draw all these picture? These are the original paintings produced by the studios to make cartoons , or reproductions from those studios based on their cartoons and produced using the same techniques . A . No - and that ' s the whole point . None. Very Bad Excellent None. A. No - and that’s the whole point. Good Good Very Bad Very Good Good 2 Q. How are these pictures made? A . That is a long and complicated process . A . That is a long and complicated process . None. We have a couple of pages that show the various steps in creating Limited Editions and Fine Art Sericels. A . That is a long and complicated process. 
Feature SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED Excellent Very Good Very Bad Very Bad 3 Q. Do you have a catalog? If so could you send me one? Instead , we are publishing our catalog online . A . Sorry , we no longer publish a printed catalog . None. None. 117 WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED WORDS Excellent Excellent Very Good Very Bad Instead, we are publishing our catalog online. 4 Q. I live in England ( or Australia , or Japan ) Do you ship here ? No problem - we ship worldwide via FedEx . A . Yes . None. Bad Excellent For additional information, please see our Foreign Orders page. No problem - we ship worldwide via FedEx. Bad Excellent Bad 5 Q. Do you have a good recipe for Sea Bass? We love seafood . A . No , but if you do , please send it to us . We love seafood . Very Bad Bad None. We love seafood. Very Good Good Very Good Very Bad Combined Excellent SIMILARITY Very Good LOCATION Very Good 6 Q. With all these different types of art , I am confused ... what should I do ? The worst thing you can do is jump in and buy something and later regret it . A . Ask questions . Bug us . Make sure all your questions have been answered fully . None. A. Ask questions. Know each of the different types of art available and as much other information as you can gather. 7 Q. What type of art is the best investment? A . While animation art tends to be a good investment , that is not always the case . A . While animation art tends to be a good investment , that is not always the case . Even if you do see an increase in the value of your collection , it may not be as dramatic as some of the stories you may have heard . If you are seriously concerned ab 118 QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined Very Good Very Good Very Good A . While animation art tends to be a good investment , that is not always the case . If you are seriously concerned about a high return on investment , we would recommend you invest in the stock market . If you are just concerned about your investment Buy Something You Love. Know What You Are Buying. Buy Something You Really Love. A . While animation art tends to be a good investment, that is not always the case. Very Good Good Very Bad 8 Q. How do we know we can trust The Cartoon Factory? The bottom line is , at some point , you are going to have to trust us to do business with us . A . This is a tough question , and one we try hard to answer . None. Very Bad Good None. A . This is a tough question, and one we try hard to answer. Very Good 9 Q. What happens if I do have a problem with The Cartoon Factory? We work hard to satisfy our customers , but we work harder if they ever experience a problem . All of these policies are outlined on our About The Cartoon Factory page . If - not that this will happen , but if there is a problem , you can appeal to your c A . First , let us know . We work hard to satisfy our customers , but we work harder if they ever experience a problem . We are committed to our customers complete satisfaction . We , like any business , occasionally make mistakes . In a more long term sense , we guarantee the authenticity of all the art we sell for life . All of these policies are outlined on our About The Cartoon Factory page. If you would like a little more assurance , just make your purchase by credit card . 
You All of these policies are outlined on our About The Cartoon Factory page. We work hard to satisfy our customers, but we work harder if they ever experience a problem. ( I think we have made two ... ) If there has been a problem - in shipping, in framing, or if it is just not the image you thought it was - we will take care of Good 10 Q . Does The Cartoon Factory buy Animation Art ? To get the ball rolling , drop us an e - mail with as complete a description of your artwork as possible ! SIMILARITY Very Good LOCATION Good QUERY CAPITALIZED WORDS Good Combined SIMILARITY Good 119 LOCATION QUERY CAPITALIZED WORDS Good Bad A . Under some circumstances , yes . Which could be about anything , really . Very Bad Combined Excellent None. It has to be something either we are looking for , want terribly bad , something that just strikes our fancy , or something at such a good price we can ' t pass it up . SIMILARITY Excellent LOCATION QUERY Bad Very Bad CAPITALIZED WORDS Very Good Combined Good 11 Q . I have a cel of XXX ... what is it worth ? Do you want to know it ' s current market value , it ' s replacement value , what you should sell it for , or what we would buy it for ? If you did not purchase the artwork from The Cartoon Factory , for us to do anything , you must bring the artwork int A . Again , there is a long , involved answer to this question . One thing you must ask yourself is : What kind of appraisal do you want ? None. If you did not purchase the artwork from The Cartoon Factory, for us to do anything, you must bring the artwork into our gallery in person, or ship it to our gallery so that we can accurately evaluate the artwork. A . Again , there is a long , involved answer to this question . Do you want to know it ' s current market value , it ' s replacement value , what you should sell it for , or what we would buy it for ? Excellent Excellent Very Bad 12 Q . Do you know Mickey Mouse? A . Yes . A . Yes . None. Very Bad Excellent None. A. Yes. SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY Good Bad Very Bad CAPITALIZED WORDS Very Good 13 Q. How are cartoons made? You might also want to check out books on this subject . A . Again , there is no short answer to this question . None We are working on a page that will address this question, though... In the meantime, we do have some excellent online displays of how Limited Editions and Fine Art Sericels are made. 120 Combined Bad SIMILARITY Very Good LOCATION Good QUERY Very Good CAPITALIZED WORDS Good Combined Very Good SIMILARITY LOCATION Good Very Good A. Again, there is no short answer to this question. 14 Q. How many cels does a typical cartoon yield? First , the mechanics of film : film runs at 24 frames per second . ( Don ' t think it matters that video is 30 fps - they are still shot on film ... ) As a matter of course , though , what is typically done , at least by big budget films , is 12 cels pe First , the mechanics of film : film runs at 24 frames per second . That number never changes . Ever . So , the MOST any cartoon can be animated to is 24 fps . ( Don ' t think it matters that video is 30 fps - they are still shot on film ... ) As a matt Of these , you may find that while the main character may be animated at 12 fps , secondary characters in the scene may be animated at 4 fps , or used as part of a " loop " of animated cels . For example , lets imagine a generic scene of Homer talking to So, the MOST any cartoon can be animated to is 24 fps. 
For example, lets imagine a generic scene of Homer talking to Marge. Marge may only blink in this example scene, which would mean there are only 5 cels of Marge, even though there may be 300 of Homer First , the mechanics of film : film runs at 24 frames per second . So , the MOST any cartoon can be animated to is 24 fps . ( Don ' t think it matters that video is 30 fps - they are still shot on film ... ) As a matter of course, though, what is typic 15 Q. Can you help me with my school project ? What is the History of Animation ? ( Or how are cartoons made ... or ... ) A number of great books have been written on these subjects , and they use 100 ' s of pages ! A . We are sorry , but that is currently well beyond the scope of what we can do . A number of great books have been written on these subjects , and they use 100 ' s of pages ! QUERY CAPITALIZED WORDS Good Combined Good None. A number of great books have been written on these subjects, and they use 100 ' s of pages ! Very 16 Q. Does The Cartoon Factory stock video tapes , plush toys or other items other than cels ? The Cartoon Factory is just an Animation Art Gallery - what we sell SIMILARITY Very Bad 121 LOCATION QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined Good Very Good Very Good Very Good Very Good Excellent Very Good Very Good Very Good Very Good are Animation Art Cels . A . No , we do not . The Cartoon Factory is just an Animation Art Gallery - what we sell are Animation Art Cels . The Cartoon Factory is just an Animation Art Gallery- what we sell are Animation Art Cels. The Cartoon Factory is just an Animation Art Gallery - what we sell are Animation Art Cels . 17 Q. Can you tell me if " XXXX " is on videotape , or when it will air again ? What may or may not be on video , or when it may be released is a matter that the studios plan and decide , not something we have any voice in , or any knowledge of . A . Again , The Cartoon Factory is just an Animation Art Gallery what we sell are Animation Art Cels . A . Again , The Cartoon Factory is just an Animation Art Gallery what we sell are Animation Art Cels . Again, The Cartoon Factory is just an Animation Art Gallery- what we sell are Animation Art Cels. A. Again, The Cartoon Factory is just an Animation Art Gallery what we sell are Animation Art Cels. SIMILARITY Excellent LOCATION QUERY CAPITALIZED WORDS Excellent Very Bad Combined Excellent 18 Q. Is "XXXXX" on DVD when it be released on DVD? A . Sorry , we are not a DVD store ; what is and is not on DVD , or when it may be released on DVD is not something we concern ourselves with . A . Sorry , we are not a DVD store ; what is and is not on DVD , or when it may be released on DVD is not something we concern ourselves with . None. We recommend Amazon.com, or their affiliate at the Big Cartoon DataBase Video Store. A. Sorry, we are not a DVD store; what is and is not on DVD, or when it may be released on DVD is not something we concern ourselves with. Very Good 19 Q. Can you send me a list of all the Disney Feature Cartoons? Why should we when The Big Cartoon DataBase already lists all the Disney cartoons online, with plenty of great facts? SIMILARITY Good 122 LOCATION QUERY CAPITALIZED WORDS Combined Very Good Very Good Very Good Very Good SIMILARITY Very Good LOCATION Excellent QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION Very Good Very Bad Excellent Very Good Very Good Very Good Very Good Very Good Excellent Very Good A. 
We could, but we won't. Why should we when The Big Cartoon DataBase already lists all the Disney cartoons online, with plenty of great facts? Why should we when The Big Cartoon DataBase already lists all the Disney cartoons online, with plenty of great facts? Why should we when The Big Cartoon DataBase already lists all the Disney cartoons online, with plenty of great facts? 20 Q. What was the cartoon that had the little red xxxx that floated when he ate biscuits? If you really need a question answered , and it ' s really bugging you , we suggest posting the question to a place that deals with cartoon trivia , such as this cartoon forum . A . Trivia questions - of any type - are not something we can take the time to answer . If you really need a question answered, and it's really bugging you, we suggest posting the question to a place that deals with cartoon trivia, such as this cartoon forum. None. A. Trivia questions - of any type - are not something we can take the time to answer. 21 Q. Can I receive any free products from Disney? This is a very important point : We are not Disney , Warner Bros . or any of the other studios . A . We don ' t know - you ' ll have to ask Disney . We don't know- you'll have to ask Disney. This is a very important point: We are not Disney, Warner Bros. or any of the other studios. This is a very important point: We are not Disney, Warner Bros. or any of the other studios. 22 Q. My son / daughter / wife / friend is a great artist! Who should they call at Disney to show their work? The artists and producers are in the production end , we deal with the consumer products divisions of the studios . A . We do not deal in that part of the business . 123 QUERY Good CAPITALIZED WORDS Good Combined Excellent SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined Very Good Very Good Very Good Very Bad Very Good In many case, if you want to know what Disney requires of young artists, you should probably ask them- we would be quite remiss to assume to speak for them. In many case, if you want to know what Disney requires of young artists, you should probably ask them- we would be quite remiss to assume to speak for them. The artists and producers are in the production end, we deal with the consumer products divisions of the studios. 23 Q. My son / daughter / wife / friend is a great artist! Where should they go to school? If your son or daughter is presently in school , they would do much better to involve their schools ' career counselors ( or those at a local college ) , people who deal with this sort of thing every day . And if they are out of school , they should rese A . Again , we do not deal creative end of the business . Nor are we very familiar with schools or schooling . If your son or daughter is presently in school , they would do much better to involve their schools ' career counselors ( or those at a local college ) , people who deal with this sort of thing every day . And if they are out of school , they should resea None. If your son or daughter is presently in school , they would do much better to involve their schools ' career counselors ( or those at a local college ) , people who deal with this sort of thing every day . And if they are out of school, they should resea 24 Q. How can I paint Cels, and what materials are used? A . As we do not do any production ourselves , this is not something we can really address directly . A . As we do not do any production ourselves , this is not something we can really address directly . A . 
As we do not do any production ourselves , this is not something we can really address directly . SIMILARITY Excellent LOCATION Excellent QUERY CAPITALIZED WORDS Excellent Combined Excellent None. A. As we do not do any production ourselves, this is not something we can really address directly. Very 25 Q. So how can I contact the studios directly? A . We have listed various studios addresses and phone numbers SIMILARITY Very Bad 124 LOCATION QUERY CAPITALIZED WORDS Combined SIMILARITY LOCATION QUERY CAPITALIZED WORDS Combined Good Very Good Very Good below . Good Very Good A . We have listed various studios addresses and phone numbers below . A . We have listed various studios addresses and phone numbers below . We suggest you try calling information in Los Angeles (area codes 213, 818 or 310) for other studios. A. We have listed various studios addresses and phone numbers below. Excellent Excellent Excellent 26 Q . Do you have e - mail addresses for the studios? A . No , we do not have any public e - mail addresses . A . No , we do not have any public e - mail addresses . A . No , we do not have any public e - mail addresses . Very Bad Excellent None. A. No, we do not have any public e - mail addresses. SIMILARITY Very Good LOCATION Very Good QUERY Good CAPITALIZED WORDS Very Good Combined Excellent 27 Q. Can I use the images on The Cartoon Factory site? We are NOT the copyright holder for any of these images , and you can not get that permission from us at The Cartoon Factory . Use of this site acknowledges agreement to these Terms of Use . To use any images from this site for any purpose is a violatio A . First of all , you need to understand that every cartoon character was created by someone ; therefore someone owns these characters . Not you , and certainly not us . To legally use any copyrighted cartoon character for any reason , you must have the All of the images on our site are copyrighted , and as such , are protected by US and international copyright law . Additionally , the use of this site and it ' s contents and systems are governed by our Terms of Use . To use any images from this site fo We are NOT the copyright holder for any of these images, and you can not get that permission from us at The Cartoon Factory. Or, link to the Clip Art section of NetLinks. We are NOT the copyright holder for any of these images, and you can not get that permission from us at The Cartoon Factory. All of the images on our site are copyrighted, and as such, are protected by US and international copyright law. Use of this sit Excellent 28 Q. Can you provide a link exchange with our site? A . Unfortunately , The Cartoon Factory site is not currently set up SIMILARITY 125 to provide this option LOCATION Excellent QUERY CAPITALIZED WORDS Very Good Very Good Combined Excellent SIMILARITY Very Good LOCATION Excellent QUERY Good CAPITALIZED WORDS Good Combined Very Good A . Unfortunately , The Cartoon Factory site is not currently set up to provide this option If you have a cartoon or comics related site , you might consider adding it to the Toon . Com directory , which is cartoon and comics - specific . The Cartoon Factory site is not currently set up to provide this option. A. Unfortunately, The Cartoon Factory site is not currently set up to provide this option. 29 Q. Can The Cartoon Factory License me or my company to use cartoon images? To legally use any copyrighted cartoon character for any reason , you must have the permission of the copyright owner . 
We are NOT the copyright holder for any of these images , and you can not get licensing permission or permissions for use from us at T A . No . Not at all . Ever. To legally use any copyrighted cartoon character for any reason , you must have the permission of the copyright owner . We are NOT the copyright holder for any of these images , and you can not get licensing permission or permissions for use from us at The Cartoon Factory . All of the images on our site are copyrighted , and as such , are protected by US and international We are NOT the copyright holder for any of these images, and you can not get licensing permission or permissions for use from us at The Cartoon Factory. Any infringement of property of The Cartoon Factory will be prosecuted To legally use any copyrighted cartoon character for any reason, you must have the permission of the copyright owner. We are NOT the copyright holder for any of these images, and you can not get licensing permission or permissions for use from us at The

APPENDIX B
DATASET DESCRIPTION

Page ID | Domain | Description | URL | # Questions
1 | Software | FAQs about learning CPP programming. | http://www.parashift.com/c++-faq-lite/how-to-learn-cpp.html | 8
2 | Software | FAQs about learning Java. | http://java.sun.com/products/jdk/faq.html | 35
3 | Software | FAQs about Java tutorials. | http://www.iam.ubc.ca/guides/javatut99/information/FAQ.html | 12
4 | Software | FAQs about Google maps. | http://code.google.com/apis/maps/faq.html | 52
5 | Software | FAQs about AI Software agents. | http://www.davidreilly.com/topics/software_agents/ | 6
6 | Customer Service | FAQs about pets travelling in Delta airlines. | http://www.delta.com/help/faqs/pet_travel/index.jsp | 7
7 | Customer Service | FAQs about Delta airlines check-in. | http://www.delta.com/help/faqs/checkin/ | 18
8 | Customer Service | FAQs about DHL shipping. | http://www.dhl-usa.com/custserv/faq.asp?PageID=SH&nav=FAQ/Shipping | 5
9 | Business | FAQs about eCommerce. | http://www.nhbis.com/ecommerce_faqs.html | 10
10 | Business | FAQs about project management. | http://www.maxwideman.com/ | 27
11 | Art | FAQs about cartoon making. | http://www.cartoonfactory.com/faq.html | 29
12 | Art | FAQs about the GRAMMY award. | http://www2.grammy.com/GRAMMY_Awards/Voting/FAQs/ | 8
13 | Art | FAQs about Shakespeare’s life. | http://absoluteshakespeare.com/trivia/faq/faq.htm | 28
14 | Health | FAQs about Google health. | http://www.google.com/intl/en-US/health/faq.html | 9
15 | Health | FAQs about Avian Flu. | http://www.who.int/csr/disease/avian_influenza/avian_faqs/en/ | 32
16 | Health | FAQs about soy food. | http://www.soyfoods.org/health/faq | 22
17 | Society | FAQs about communism. | http://www.angelfire.com/mn2/Communism/faq.html | 10
18 | Society | FAQs about environmental safety. | http://www.safety.rochester.edu/FAQ/answers.html | 21
19 | Society | FAQs about human rights. | http://web.worldbank.org/WBSITE/EXTERNAL/EXTSITETOOLS/0,,contentMDK:20749693~pagePK:98400~piPK:98424~theSitePK:95474,00.html | 5
20 | News | FAQs about CNN News. | http://www.cnn.com/feedback/help/ | 18
21 | Academic | FAQs about AUC exchange program. | http://www.aucegypt.edu/students/IPO/FAQ/Pages/AUC.aspx | 16
22 | Sports | FAQs about Football. | http://football.about.com/od/frequentlyaskedquestions/Common_Questions.htm | 12
Total Number of Questions: 390
Approximate Number of Sentences: 2000
Table 1.0 Dataset Description.

APPENDIX C
DETAILED EXPERIMENTAL RESULTS

The table below shows, for each question on each page, the average score given by three human evaluators to the summary produced by our system (FAQWEBSUMM) and to the summary produced by Copernic, together with the corresponding improvement ratio.
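The exact numeric scale behind these averages is not restated here, so the following minimal sketch is only an illustration of how the per-question figures could be derived. It assumes the five qualitative ratings map to Very Bad = 0.2, Bad = 0.4, Good = 0.6, Very Good = 0.8 and Excellent = 1.0 (an assumption consistent with the averages reported below) and computes the improvement ratio as the relative difference between the FAQWEBSUMM and Copernic averages; the example ratings are hypothetical.

    # Assumed mapping of the qualitative ratings to numbers (not stated explicitly here).
    RATING_VALUE = {"Very Bad": 0.2, "Bad": 0.4, "Good": 0.6, "Very Good": 0.8, "Excellent": 1.0}

    def average_score(ratings):
        # Average of the qualitative ratings the human evaluators gave to one summary.
        return sum(RATING_VALUE[r] for r in ratings) / len(ratings)

    def improvement_ratio(faqwebsumm_avg, copernic_avg):
        # Relative improvement of FAQWEBSUMM over Copernic, in percent.
        return (faqwebsumm_avg - copernic_avg) / copernic_avg * 100.0

    # Hypothetical example: three evaluators rate the two summaries of one question.
    ours = average_score(["Good", "Very Good", "Very Good"])    # 0.733
    theirs = average_score(["Bad", "Very Bad", "Good"])         # 0.400
    print(round(ours, 3), round(theirs, 3), "%.1f%%" % improvement_ratio(ours, theirs))  # 0.733 0.4 83.3%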
Page Page 1 Q1-Q8 Page 2 Q9-Q43 Question FAQWEBSUMM Q1 0.267 Q2 0.667 Q3 0.600 Q4 0.533 Q5 0.667 Q6 0.733 Q7 0.600 Q8 0.733 Q9 0.800 Q 10 0.867 Q 11 0.600 Q 12 0.600 Q 13 0.800 Q 14 0.933 Q 15 0.733 Q 16 1.000 Q 17 0.800 Q 18 0.600 Q 19 0.333 Q 20 0.667 Q 21 0.600 Q 22 0.467 Q 23 0.733 Q 24 0.800 Q 25 0.867 Q 26 0.733 Q 27 0.733 Q 28 0.800 Q 29 0.867 129 Copernic 0.467 0.333 0.467 0.533 0.267 0.267 0.200 0.400 0.667 0.800 0.533 0.467 0.200 0.933 0.200 0.200 0.200 0.733 0.200 0.533 0.667 0.667 0.733 0.533 0.533 0.533 0.200 0.533 0.600 Improvement Ratio -42.9% 100.0% 28.6% 0.0% 150.0% 175.0% 200.0% 83.3% 20.0% 8.3% 12.5% 28.6% 300.0% 0.0% 266.7% 400.0% 300.0% -18.2% 66.7% 25.0% -10.0% -30.0% 0.0% 50.0% 62.5% 37.5% 266.7% 50.0% 44.4% Page 3 Q44-Q55 Page 4 Q56Q107 Q 30 Q 31 Q 32 Q 33 Q 34 Q 35 Q 36 Q 37 Q 38 Q 39 Q 40 Q 41 Q 42 0.933 0.733 0.800 0.867 0.800 0.600 0.667 0.600 0.600 0.667 0.533 0.400 0.300 0.200 0.400 0.733 0.533 0.467 0.600 0.467 0.600 0.467 0.933 0.533 0.600 0.300 366.7% 83.3% 9.1% 62.5% 71.4% 0.0% 42.9% 0.0% 28.6% -28.6% 0.0% -33.3% 0.0% Q 43 Q 44 Q 45 Q 46 Q 47 Q 48 Q 49 Q 50 Q 51 Q 52 Q 53 Q 54 Q 55 Q 56 Q 57 Q 58 Q 59 Q 60 Q 61 Q 62 Q 63 Q 64 Q 65 Q 66 Q 67 Q 68 Q 69 0.300 0.867 0.800 0.600 0.600 0.867 0.467 0.733 0.867 0.867 0.400 0.533 0.733 0.667 0.8 0.867 0.867 0.467 0.667 0.867 0.800 0.600 0.733 0.867 0.733 0.600 0.600 0.300 0.867 0.533 0.467 0.600 0.467 0.867 0.467 0.267 0.533 0.600 0.200 0.733 0.667 0.867 0.400 0.867 0.467 0.200 0.533 0.667 0.800 0.200 0.867 0.733 0.800 0.800 0.0% 0.0% 50.0% 28.6% 0.0% 85.7% -46.2% 57.1% 225.0% 62.5% -33.3% 166.7% 0.0% 0.0% -7.7% 116.8% 0.0% 0.0% 233.5% 62.7% 19.9% -25.0% 266.5% 0.0% 0.0% -25.0% -25.0% 130 Page 5 Q108Q113 Q 70 Q 71 Q 72 Q 73 Q 74 Q 75 Q 76 Q 77 Q 78 Q 79 Q 80 Q 81 Q 82 Q 83 Q 84 Q 85 Q 86 Q 87 Q 88 Q 89 Q 90 Q 91 Q 92 Q 93 Q 94 Q 95 Q 96 Q 97 Q 98 Q 99 Q 100 Q 101 Q 102 Q 103 Q 104 Q 105 Q 106 Q 107 Q 108 Q 109 Q 110 0.733 0.867 0.667 0.867 0.800 0.533 0.733 0.867 0.667 0.467 0.867 0.867 0.933 0.800 0.667 0.733 0.600 0.533 0.600 0.733 0.867 0.600 0.800 0.667 0.733 0.733 0.400 0.533 0.600 0.600 0.467 0.467 0.600 0.533 0.533 0.533 0.467 0.400 0.533 0.467 0.333 0.733 0.867 0.733 0.200 0.200 0.533 0.733 0.600 0.200 0.467 0.800 0.867 0.600 0.733 0.333 0.600 0.867 0.800 0.667 0.733 0.867 0.733 0.800 0.800 0.667 0.667 0.533 0.333 0.600 0.533 0.467 0.267 0.600 0.400 0.600 0.533 0.467 0.400 0.600 0.467 0.667 131 0.0% 0.0% -9.0% 333.5% 300.0% 0.0% 0.0% 44.5% 233.5% 0.0% 8.4% 0.0% 55.5% 9.1% 100.3% 22.2% -30.8% -33.4% -10.0% 0.0% 0.0% -18.1% 0.0% -16.6% 9.9% 9.9% -25.0% 60.1% 0.0% 12.6% 0.0% 74.9% 0.0% 33.3% -11.2% 0.0% 0.0% 0.0% -11.2% 0.0% -50.1% Page 6 Q114Q120 Page 7 Q121Q138 Page 8 Q139Q143 Page 9 Q144Q153 Q 111 Q 112 0.533 0.467 0.333 0.467 60.1% 0.0% Q 113 Q 114 Q 115 Q 116 Q 117 Q 118 Q 119 0.600 0.867 0.733 0.867 0.800 0.800 1.000 0.600 0.867 0.200 0.200 0.800 0.800 0.200 0.0% 0.0% 266.5% 333.5% 0.0% 0.0% 400.0% Q 120 Q 121 Q 122 Q 123 Q 124 Q 125 Q 126 Q 127 Q 128 Q 129 Q 130 Q 131 Q 132 Q 133 Q 134 Q 135 Q 136 Q 137 Q 138 Q 139 Q 140 Q 141 Q 142 0.867 0.733 0.733 0.933 0.667 0.800 0.733 0.933 0.867 0.467 0.867 0.933 0.400 0.867 0.800 0.933 0.600 0.667 0.667 0.600 0.733 0.800 0.533 0.867 0.800 0.400 0.933 0.667 0.800 0.400 0.933 0.467 0.600 0.867 0.933 0.800 0.867 0.467 0.933 0.867 0.867 0.200 0.600 0.733 0.800 0.733 0.0% -8.3% 83.3% 0.0% 0.0% 0.0% 83.3% 0.0% 85.7% -22.2% 0.0% 0.0% -50.0% 0.0% 71.4% 0.0% -30.8% -23.1% 233.4% 0.0% 0.0% 0.0% -27.3% Q 143 Q 144 Q 145 Q 146 Q 147 0.933 0.733 0.867 0.667 0.867 0.933 0.733 
0.867 0.800 0.800 0.0% 0.0% 0.0% -16.7% 8.3% 132 Page 10 Q154Q180 Page 11 Q181Q209 Q 148 Q 149 Q 150 Q 151 Q 152 0.933 1.000 0.867 0.800 0.600 0.667 0.333 0.600 0.200 0.267 40.0% 200.0% 44.5% 300.0% 125.0% Q 153 Q 154 Q 155 Q 156 Q 157 Q 158 Q 159 Q 160 Q 161 Q 162 Q 163 Q 164 Q 165 Q 166 Q 167 Q 168 Q 169 Q 170 Q 171 Q 172 Q 173 Q 174 Q 175 Q 176 Q 177 Q 178 Q 179 Q 180 Q 181 Q 182 Q 183 Q 184 Q 185 0.667 0.400 0.467 0.600 0.667 0.533 0.400 0.667 0.533 0.600 0.333 0.667 0.600 0.667 0.600 0.733 0.667 0.467 0.400 0.533 0.667 0.600 0.400 0.467 0.600 0.600 0.533 0.600 0.733 0.467 0.800 0.867 0.467 0.600 0.400 0.600 0.267 0.667 0.600 0.400 0.667 0.400 0.600 0.800 0.667 0.667 0.467 0.467 0.800 0.600 0.400 0.400 0.400 0.467 0.667 0.333 0.600 0.600 0.600 0.533 0.400 0.400 0.533 0.800 0.200 0.200 11.1% 0.0% -22.2% 124.7% 0.0% -11.2% 0.0% 0.0% 33.3% 0.0% -58.4% 0.0% -10.0% 42.8% 28.5% -8.4% 11.2% 16.8% 0.0% 33.3% 42.8% -10.0% 20.1% -22.2% 0.0% 0.0% 0.0% 50.0% 83.3% -12.4% 0.0% 333.5% 133.5% 133 Page 12 Q210Q2117 Page 13 Q218Q245 Q 186 Q 187 Q 188 Q 189 Q 190 Q 191 Q 192 Q 193 Q 194 Q 195 Q 196 Q 197 Q 198 Q 199 Q 200 Q 201 Q 202 Q 203 Q 204 Q 205 Q 206 Q 207 Q 208 Q 209 Q 210 Q 211 Q 212 Q 213 Q 214 Q 215 Q 216 0.733 0.733 0.333 0.733 0.667 0.867 1.000 0.467 0.800 0.733 0.867 0.533 0.933 0.867 0.667 0.800 0.800 0.733 0.800 0.800 0.867 0.733 0.867 0.800 0.733 0.667 0.800 1.000 0.933 0.933 1.000 0.600 0.733 0.400 0.667 0.400 0.600 0.200 0.867 0.867 0.733 0.467 0.600 0.467 0.867 0.800 0.867 0.800 0.733 0.800 0.800 0.200 0.600 0.867 0.733 0.600 0.533 0.600 0.200 0.400 0.200 0.867 22.2% 0.0% -16.8% 9.9% 66.8% 44.5% 400.0% -46.1% -7.7% 0.0% 85.7% -11.2% 99.8% 0.0% -16.6% -7.7% 0.0% 0.0% 0.0% 0.0% 333.5% 22.2% 0.0% 9.1% 22.2% 25.1% 33.3% 400.0% 133.3% 366.5% 15.3% Q 217 Q 218 Q 219 Q 220 Q 221 Q 222 Q 223 Q 224 0.800 1.000 0.867 0.733 0.800 0.800 0.667 0.733 0.200 1.000 0.867 0.733 0.800 0.800 0.667 0.733 300.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 134 Page 14 Q246Q254 Page 15 Q255Q286 Q 225 Q 226 Q 227 Q 228 Q 229 Q 230 Q 231 Q 232 Q 233 Q 234 Q 235 Q 236 Q 237 Q 238 Q 239 Q 240 Q 241 Q 242 Q 243 Q 244 0.667 0.933 0.733 0.867 0.933 0.933 0.933 0.867 0.867 1.000 1.000 0.867 0.933 0.867 0.733 0.867 1.000 0.600 0.867 0.800 0.733 0.933 0.733 0.667 0.867 0.200 0.933 0.733 0.733 0.867 0.867 0.867 0.933 0.800 0.600 0.800 0.733 0.600 0.867 0.667 -9.0% 0.0% 0.0% 30.0% 7.6% 366.5% 0.0% 18.3% 18.3% 15.3% 15.3% 0.0% 0.0% 8.4% 22.2% 8.4% 36.4% 0.0% 0.0% 19.9% Q 245 Q 246 Q 247 Q 248 Q 249 Q 250 Q 251 Q 252 Q 253 0.933 0.800 0.933 0.733 1.000 1.000 0.800 1.000 0.867 0.933 0.733 0.800 0.733 0.667 0.867 0.333 0.400 0.400 0.0% 9.1% 16.6% 0.0% 49.9% 15.3% 140.2% 150.0% 116.8% Q 254 Q 255 Q 256 Q 257 Q 258 Q 259 Q 260 Q 261 Q 262 0.667 0.867 0.800 0.933 0.867 0.933 0.733 1.000 0.667 0.467 0.600 0.733 0.200 0.867 0.800 0.800 0.667 0.200 42.8% 44.5% 9.1% 366.5% 0.0% 16.6% -8.4% 49.9% 233.5% 135 Page 16 Q287Q308 Q 263 Q 264 Q 265 Q 266 Q 267 Q 268 Q 269 Q 270 Q 271 Q 272 Q 273 Q 274 Q 275 Q 276 Q 277 Q 278 Q 279 Q 280 Q 281 Q 282 Q 283 Q 284 Q 285 0.733 0.667 0.733 0.733 0.867 0.733 0.933 0.733 0.933 0.733 0.867 0.533 0.800 0.867 0.800 1.000 0.667 0.667 0.800 0.867 0.667 0.867 0.867 0.733 0.333 0.200 0.733 0.600 0.800 0.933 0.800 0.933 0.933 0.800 0.533 0.667 0.733 0.333 0.867 0.733 0.533 0.600 0.533 0.533 0.800 0.867 0.0% 100.3% 266.5% 0.0% 44.5% -8.4% 0.0% -8.4% 0.0% -21.4% 8.4% 0.0% 19.9% 18.3% 140.2% 15.3% -9.0% 25.1% 33.3% 62.7% 25.1% 8.4% 0.0% Q 286 Q 287 Q 288 Q 289 Q 290 Q 291 Q 292 Q 293 Q 294 Q 295 Q 296 Q 297 Q 298 Q 
299 0.867 0.933 1.000 0.600 0.867 0.800 0.800 1.000 1.000 0.933 0.867 0.667 0.733 0.933 0.867 0.933 0.933 0.733 0.733 0.467 0.667 0.533 0.867 0.333 0.667 0.933 0.467 0.933 0.0% 0.0% 7.2% -18.1% 18.3% 71.3% 19.9% 87.6% 15.3% 180.2% 30.0% -28.5% 57.0% 0.0% 136 Page 17 Q309Q318 Page 18 Q319Q339 Q 300 Q 301 Q 302 Q 303 Q 304 Q 305 Q 306 Q 307 Q 308 Q 309 Q 310 Q 311 Q 312 Q 313 Q 314 Q 315 Q 316 Q 317 0.800 0.667 0.867 0.600 0.933 0.533 0.867 0.933 0.933 0.867 0.800 0.867 0.933 0.733 0.800 0.667 0.867 0.933 0.667 0.733 0.800 0.867 0.800 0.867 0.533 0.933 0.800 0.533 0.667 0.733 0.867 0.600 0.200 0.667 0.933 0.533 19.9% -9.0% 8.4% -30.8% 16.6% -38.5% 62.7% 0.0% 16.6% 62.7% 19.9% 18.3% 7.6% 22.2% 300.0% 0.0% -7.1% 75.0% Q 318 Q 319 Q 320 Q 321 Q 322 Q 323 Q 324 Q 325 Q 326 Q 327 Q 328 Q 329 Q 330 Q 331 Q 332 Q 333 Q 334 Q 335 Q 336 Q 337 Q 338 Q 339 0.667 0.667 0.733 0.933 0.733 0.733 0.667 0.667 0.867 0.667 0.800 0.933 0.867 0.800 0.800 0.733 0.800 0.800 0.800 0.667 0.933 0.667 0.733 0.600 0.733 0.600 0.200 0.467 0.333 0.667 0.733 0.667 0.400 0.467 0.200 0.467 0.733 0.667 0.533 0.667 0.333 0.333 0.733 0.800 -9.0% 11.2% 0.0% 55.5% 266.5% 57.0% 100.3% 0.0% 18.3% 0.0% 100.0% 99.8% 333.5% 71.3% 9.1% 9.9% 50.1% 19.9% 140.2% 100.3% 27.3% -16.6% 137 Page 19 Q340Q344 Page 20 Q345Q362 Page 21 Q363Q378 Q 340 Q 341 Q 342 Q 343 0.733 0.800 0.800 0.733 0.800 0.733 0.400 0.800 -8.4% 9.1% 100.0% -8.4% Q 344 Q 345 Q 346 Q 347 Q 348 Q 349 Q 350 Q 351 Q 352 Q 353 Q 354 Q 355 Q 356 Q 357 Q 358 Q 359 Q 360 Q 361 Q 362 Q 363 Q 364 Q 365 Q 366 Q 367 Q 368 Q 369 Q 370 Q 371 Q 372 Q 373 Q 374 Q 375 Q 376 Q 377 0.600 0.933 0.733 0.667 0.733 0.733 0.800 0.867 0.733 0.800 0.800 0.733 0.600 0.600 0.533 0.733 0.267 0.467 0.667 1.000 0.733 0.467 0.733 0.667 0.733 0.733 0.733 0.800 0.533 0.667 0.600 0.667 0.667 0.733 0.867 0.200 0.467 0.667 0.333 0.800 0.400 0.800 0.267 0.200 0.867 0.533 0.467 0.667 0.533 0.600 0.200 0.533 0.400 1.000 0.667 0.867 0.733 0.667 0.533 0.733 0.733 0.267 0.467 0.667 0.800 0.667 0.733 0.733 -30.8% 366.5% 57.0% 0.0% 120.1% -8.4% 100.0% 8.4% 174.5% 300.0% -7.7% 37.5% 28.5% -10.0% 0.0% 22.2% 33.5% -12.4% 66.8% 0.0% 9.9% -46.1% 0.0% 0.0% 37.5% 0.0% 0.0% 199.6% 14.1% 0.0% -25.0% 0.0% -9.0% 0.0% 138 Page 22 Q379Q390 Overall Average Q 378 Q 379 Q 380 Q 381 Q 382 Q 383 Q 384 Q 385 Q 386 Q 387 Q 388 Q 389 0.667 0.267 0.733 0.733 0.467 0.733 0.867 0.733 0.800 0.933 0.933 0.800 0.667 0.267 0.733 0.733 0.467 0.533 0.400 0.733 0.800 0.933 0.933 0.800 0.0% 0.0% 0.0% 0.0% 0.0% 37.5% 116.8% 0.0% 0.0% 0.0% 0.0% 0.0% Q 390 0.467 0.467 0.0% 0.732 0.610 20.1% Table 2.0 Questions average scores. The set of figures below shows a comparison of the score distributions for both summarizers with respect to each page discipline. Very Bad 3% Excellent 19% Excellent 10% Bad 5% Very Bad 18% Bad 8% Very Good 25% Good 37% Very Bad Bad Good Very Good Excellent Good 39% Very Good 36% FAQWebSum "Sof tw are" Copernic "Sof tw are" Figure 1 Software Discipline Score Distribution Comparison. 139 Bad 3% Very Bad 13% Good 13% Bad 7% Excellent 37% Excellent 40% Very Bad Good 13% Bad Good Very Good Excellent Very Good 44% Very Good 30% Copernic "Customer Service" FAQWebSum "Customer Service" Figure 2 Customer Service Discipline Score Distribution Comparison. Excellent 14% Excellent 3% Bad 14% Very Bad 8% Very Good 30% Bad 24% Very Bad Bad Good Very Good 30% Very Good Excellent Good 42% Good 35% FAQWebSum "Business" Copernic "Business" Figure 3 Business Discipline Score Distribution Comparison. 
[Pie-chart figures comparing the FAQWebSum and Copernic score distributions; captions follow.]
Figure 4 Art Discipline Score Distribution Comparison.
Figure 5 Health Discipline Score Distribution Comparison.
Figure 6 Society Discipline Score Distribution Comparison.
Figure 7 News Discipline Score Distribution Comparison.
Figure 8 Academic Discipline Score Distribution Comparison.
Figure 9 Sports Discipline Score Distribution Comparison.
The set of figures below shows a comparison of the score distributions for both summarizers with respect to each human evaluator.
Figure 10 Evaluator 1 Score Distribution Comparison.
Figure 11 Evaluator 2 Score Distribution Comparison.
Figure 12 Evaluator 3 Score Distribution Comparison.
Figure 13 Evaluator 4 Score Distribution Comparison.
Figure 14 Evaluator 5 Score Distribution Comparison.
Figure 15 Evaluator 6 Score Distribution Comparison.
Figure 16 Evaluator 7 Score Distribution Comparison.
Figure 17 Evaluator 8 Score Distribution Comparison.
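The percentages visualised in these pie charts can be tabulated directly from the evaluators' ratings. The following minimal sketch is illustrative only; the rating list is hypothetical and not taken from the experiments.

    from collections import Counter

    RATING_LABELS = ["Very Bad", "Bad", "Good", "Very Good", "Excellent"]

    def score_distribution(ratings):
        # Percentage of summaries that received each rating, as shown in the pie charts.
        counts = Counter(ratings)
        return {label: round(100.0 * counts[label] / len(ratings)) for label in RATING_LABELS}

    # Hypothetical ratings given by one evaluator to five summaries.
    print(score_distribution(["Excellent", "Very Good", "Very Good", "Good", "Bad"]))
    # {'Very Bad': 0, 'Bad': 20, 'Good': 20, 'Very Good': 40, 'Excellent': 20}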
The following set of tables shows the t-Test comparison between each pair of evaluators who scored the same set of pages. Each table lists, for every page scored by that pair, whether the difference between the two evaluators' scores for each summarizer was statistically significant. In other words, it shows whether the scores of the two evaluators are homogeneous: if the scores agree within a small tolerance, the t-Test result is Not Significant, and it is Significant otherwise.

Page FAQWEBSUMM Copernic
5 Significant Significant
8 Not Significant Not Significant
20 Significant Significant
21 Significant Significant
22 Not Significant Not Significant
Table 2 Evaluators 1 and 2 t-Test Comparison.

Page FAQWEBSUMM Copernic
4 Significant Significant
6 Not Significant Not Significant
9 Significant Not Significant
20 Significant Significant
21 Significant Significant
22 Not Significant Not Significant
Table 3 Evaluators 1 and 3 t-Test Comparison.

Page FAQWEBSUMM Copernic
4 Significant Not Significant
6 Not Significant Not Significant
8 Not Significant Not Significant
9 Not Significant Not Significant
Table 4 Evaluators 1 and 4 t-Test Comparison.

Page FAQWEBSUMM Copernic
5 Not Significant Not Significant
10 Significant Significant
Table 5 Evaluators 1 and 5 t-Test Comparison.

Page FAQWEBSUMM Copernic
1 Not Significant Not Significant
2 Significant Not Significant
7 Not Significant Not Significant
20 Not Significant Significant
21 Not Significant Not Significant
22 Not Significant Not Significant
Table 6 Evaluators 2 and 3 t-Test Comparison.

Page FAQWEBSUMM Copernic
1 Not Significant Not Significant
3 Not Significant Not Significant
7 Significant Significant
8 Not Significant Not Significant
Table 7 Evaluators 2 and 4 t-Test Comparison.

Page FAQWEBSUMM Copernic
2 Not Significant Not Significant
3 Not Significant Not Significant
5 Not Significant Not Significant
10 Significant Not Significant
Table 8 Evaluators 2 and 5 t-Test Comparison.

Page FAQWEBSUMM Copernic
1 Not Significant Not Significant
4 Significant Significant
6 Not Significant Not Significant
7 Significant Significant
9 Significant Not Significant
Table 9 Evaluators 3 and 4 t-Test Comparison.

Page FAQWEBSUMM Copernic
2 Significant Not Significant
Table 10 Evaluators 3 and 5 t-Test Comparison.

Page FAQWEBSUMM Copernic
3 Not Significant Not Significant
Table 11 Evaluators 4 and 5 t-Test Comparison.

Page FAQWEBSUMM Copernic
11 Not Significant Not Significant
12 Not Significant Not Significant
13 Not Significant Not Significant
14 Not Significant Not Significant
15 Significant Not Significant
16 Not Significant Not Significant
17 Not Significant Not Significant
18 Significant Not Significant
19 Not Significant Not Significant
Table 12 Evaluators 6 and 7 t-Test Comparison.

Page FAQWEBSUMM Copernic
11 Not Significant Not Significant
12 Not Significant Not Significant
13 Not Significant Not Significant
14 Not Significant Not Significant
15 Not Significant Not Significant
16 Not Significant Not Significant
17 Not Significant Not Significant
18 Not Significant Significant
19 Not Significant Not Significant
Table 13 Evaluators 6 and 8 t-Test Comparison.

Page FAQWEBSUMM Copernic
11 Not Significant Not Significant
12 Significant Not Significant
13 Not Significant Significant
14 Significant Not Significant
15 Significant Not Significant
16 Not Significant Not Significant
17 Not Significant Not Significant
18 Significant Significant
19 Not Significant Not Significant
Table 14 Evaluators 7 and 8 t-Test Comparison.
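The text above does not fix the exact test procedure (test variant or significance level), so the sketch below is only an illustration: it assumes a paired two-sample t-test at the 0.05 level applied to the per-question scores that two evaluators gave to the same page, and returns the labels used in the tables. The score lists are hypothetical.

    from scipy.stats import ttest_rel

    def compare_evaluators(scores_a, scores_b, alpha=0.05):
        # Paired t-test over the per-question scores two evaluators gave to one page;
        # returns the label used in the tables above.
        _, p_value = ttest_rel(scores_a, scores_b)
        return "Significant" if p_value < alpha else "Not Significant"

    # Hypothetical per-question scores of two evaluators for the same page.
    evaluator_a = [0.6, 0.8, 0.8, 0.6, 1.0, 0.8]
    evaluator_b = [0.6, 0.6, 0.8, 0.6, 0.8, 0.8]
    print(compare_evaluators(evaluator_a, evaluator_b))  # Not Significant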
APPENDIX D
SAMPLE OUTPUT SUMMARIES

The table below shows sample summaries produced by running our system and Copernic on the same questions. It also shows the average score given by all the human evaluators in response to each question. The summaries below are for a Web page that can be found at the following link: http://www2.grammy.com/GRAMMY_Awards/Voting/FAQs/

1 - What's the difference between an entry and a nomination?
FAQWEBSUMM (Very Good): Entries are recordings submitted for GRAMMY consideration.
Copernic (Good): Entries that meet all eligibility requirements are then voted on by The Academy's voting members and the results of that vote are the nominations.

2 - What are the eligibility requirements?
FAQWEBSUMM (Good): For the 52nd Annual GRAMMY Awards, albums must be released between Oct. 1, 2008 and August 31, 2009.
Copernic (Bad): sales by label to a branch or recognized independent distributor, via the Internet, or mail order / retail sales for a nationally marketed product.

3 - How are recordings entered?
FAQWEBSUMM (Excellent): The Academy accepts entries online from its members and from registered labels.
Copernic (Good): Entrants are provided information on how to submit their recordings electronically for consideration.

4 - Who can vote?
FAQWEBSUMM (Excellent): Recording Academy voting members only.
Copernic (Very Bad): None.

5 - Who qualifies as a Voting Member?
FAQWEBSUMM (Excellent): Recording Academy voting members are professionals with creative or technical credits on six commercially released tracks (or their equivalent).
Copernic (Bad): These may include vocalists, conductors, songwriters, composers, engineers, producers, instrumentalists, arrangers, art directors, album notes writers, narrators, and music video artists and technicians.

6 - How many GRAMMY categories are there?
FAQWEBSUMM (Excellent): There are currently 29 fields (Pop, Gospel, Classical, etc.) and 109 categories within those fields.
Copernic (Very Bad): None.

7 - How are categories changed or added?
FAQWEBSUMM (Excellent): Proposals for changes to the categories are reviewed each year by The Academy's Awards & Nominations Committee, with final approval by The Academy's Trustees.
Copernic (Excellent): Proposals for changes to the categories are reviewed each year by The Academy's Awards & Nominations Committee, with final approval by The Academy's Trustees.

8 - What is the difference between Record Of The Year and Song Of The Year?
FAQWEBSUMM (Excellent): The Record Of The Year category recognizes the artist's performance as well as the overall contributions of the producer(s), recording engineer(s) and/or mixer(s) if other than the artist.
Copernic (Very Bad): None.

Summaries provided below are for a Web page that can be found at the following link: http://www.google.com/intl/en-US/health/faq.html
1. What kind of health information can I store in Google Health?
FAQWEBSUMM (Excellent): You can store as much or as little information in Google Health as you want.
Copernic (Very Good): You can store as much or as little information in Google Health as you want. You can store wellness data, medical records, or both in Google Health including personalized wellness goals around weight or exercise as an example, or more traditional medical history such as your medications, allergies, procedures, immunizations, conditions, health insurance information and test results.

2. How can Google Health help me?
FAQWEBSUMM (Excellent): Google Health offers a simple online dashboard you can use to track all your health-related information and goals.
Copernic (Excellent): Google Health offers a simple online dashboard you can use to track all your health-related information and goals.

3. Is Google Health a new way to search for health information?
FAQWEBSUMM (Very Good): Google Health helps you organize, track, monitor, and act on your health information.
Copernic (Very Good): Google Health helps you organize, track, monitor, and act on your health information. com search results page but you are no longer inside the Google Health product.

4. How much does Google Health cost to use?
FAQWEBSUMM (Excellent): Google Health is available from Google at no charge. There is no cost to sign up. And if you already have a Google account set up then you are set.
Copernic (Good): Google Health is available from Google at no charge. There is no charge to doctors' offices, hospitals, retail pharmacies, and health device makers and other application developers that partner with Google Health. Some third party companies that are integrated with Google Health and provide customized services may charge you directly but it's up to you to elect to work with them.

5. If Google doesn't charge for Google Health, how does Google make money off of it?
FAQWEBSUMM (Excellent): Much like other Google products we offer, Google Health is made available at no charge to anyone who uses it.
Copernic (Excellent): Much like other Google products we offer, Google Health is made available at no charge to anyone who uses it.

6. How does Google Health protect the privacy of my health information?
FAQWEBSUMM (Very Good): You should know two main things up front: We will never sell your personal health information or data we will not share your health data with individuals or third parties unless you explicitly tell us to do so or except in certain limited circumstances described in our Privacy Policy.
Copernic (Very Bad): None.

7. Will my employer or health insurance provider be able to see my Google Health profile?
FAQWEBSUMM (Excellent): You are in control of who views your Google Health profile.
Copernic (Very Bad): None.

8. Does the data I store in Google Health get used for other Google products, like Search?
FAQWEBSUMM (Very Good): Yes, we share information between Google products to enable cross-product functionality.
Copernic (Good): Does the data I store in Google Health get used for other Google products, like Search? For example, Google Health can help you save your doctors' contact information in your Google Contact List.

9. Is Google Health a PHR (personal health record)?
FAQWEBSUMM (Good): A personal health record (PHR) is a patient-directed information tool that allows the patient to enter and gather information from a variety of healthcare information systems such as hospitals, physicians, health insurance plans, and retail pharmacies.
Copernic (Good): Is Google Health a PHR (personal health record)? We believe it's not enough to offer a place where you can store, manage, and share your health information.