Digging Deep for Hidden Information in the Web Part 2: Automated hyperlink
Digging Deep for Hidden Information in the Web
  Part 1: Automated blog analysis
  Part 2: Automated hyperlink analysis

Part 1: Automated Blog Analysis
  Analysing Public Science Debates through Blogs and Online News Sources

Part 1 Contents
  Background
    Blogs
    Online news sources
    RSS
  Tracking public science debates
  Detecting public science debates

Background
  Blogs, public opinion, online news, RSS

Background
  There are millions of bloggers
  Bloggers are almost normal human beings
  Automatically tracking bloggers’ postings may give insights into public opinion

Blog tracking companies
  IBM WebFountain
  Intelliseek BlogPulse
    “Monitor, measure and leverage consumer-generated media”
  Others growing…

RSS Format
  Rich Site Summary/Really Simple Syndication
  XML technology
  Used for frequently updated information sources (blogs, news, academic journals)

RSS Readers
  Users subscribe to the RSS feeds of favourite blogs/sites/journals/searches
  Notified when updates are available
  User-controlled ‘push’ technology

Tracking Public Science Debates

Blog keyword searches
  Technorati: “Searches weblogs by keyword and for links”
    Stem cell research
  Blogdigger
    Stem cell research
  IceRocket
    Allows advanced searches
    Allows genuine date range search (Google only allows “last updated” date range searches)

Track evolution over time
  What is changing about interest in stem cell research/GM food?
  Are experts good at identifying changes in public interest?
  How can experts be sure/can they be supported with quantitative information?
  Can blogs be used to generate time series reflecting changes in “public interest”?

Free science debate graphs
  Solves the trend identification problem?
  Blogpulse
    Offers free automatic blog searches and keyword-generated clicksearch graphs
    Stem cell research
    GM food
    Mobile phone radiation

Research graphs
  Time-consuming to collect data
  Give control over the data source

Detecting Public Science Debates

How to detect a new debate?
  Heuristic methods
    E.g. read papers, scan relevant blogs
  Automatic methods
    E.g. look for sudden increase in usage of science-related words in blogs?

Free hot topic searches
  Blog keyword search (sort by date)
    Technorati: “Searches weblogs by keyword and for links”
      Stem cell research
    Blogdigger blog search
  Hot topic searches
    Blogdex – top contagious information
    Bloglines – today’s hot topics (most popular links)
  Searches find the really big science debates?

Specialist research tools
  Commercial software
    Intelliseek/IBM
  Mozdeh RSS monitor
    Generates sub-collections
    Generates word time series
    Allows keyword searches
    Identifies hot topics

Mozdeh Science Concern Corpus
  A collection of blog postings containing a fear word AND a science word
  Trend detection used to identify hot “science fear” topics
  Data cleaning to remove spam
  Need manual scanning of list of words experiencing biggest usage increase

Classification of top 5 words
  Word       Max. daily increase (feeds)   Classification
  stem       19%                           Science fear (stem cell research)
  orlean     16%                           Information (about hurricane)
  hurricane  16%                           Duplicate of ‘orlean’
  katrina    15%                           Duplicate of ‘orlean’
  june       14%                           Temporal descriptor

Classification of top 200 words
  [Figure: bar chart of the number of words (axis 0 to 80) in each category: Random, Temporal Descriptor, Duplicate, Other, Threat, Prediction, Progress, Information, Fear of Science]
  The words come from multiple stories
  E.g. new medical cure
  Hot science fear words
    7.5% of top 200
    Represent new public fears of science stories

Unexpected results?
  Social science research
    Sudden burst of discussion over fears of the economic theories of Karl Rove, an influential advisor to George Bush
  Computer security
    Concern over spyware features in a software vendor’s products
    Research showing that consumers’ PIN numbers could be revealed by poor printing

Conclusions
  Many free tools support exploration of Consumer Generated Media
  Also room for specialist research tools

References
  http://www.blogpulse.com/
  http://www.blogpulse.com/www2006workshop/
  http://www.creen.org/
  Thelwall, M., Prabowo, R. & Fairclough, R. (2006, to appear). Are raw RSS feeds suitable for broad issue scanning? A science concern case study. Journal of the American Society for Information Science and Technology.

Acknowledgement
  The work was supported by a European Union grant for activity code NEST2003-Path-1. It is part of the CREEN project (Critical Events in Evolving Networks, contract 012684, http://www.creen.org/)

Part 2: Automated hyperlink analysis
  Link analysis as a social science technique

Link Analysis Manifesto
  Links are:
    A wonderful new source of information about relationships between people, organisations and information
    An easy to collect data source
  But:
    Results should be interpreted with care

Part 2 Contents
  Academic link analysis – mainly from an information science perspective
  A general social science link analysis methodology
  Commercial applications

Why Count Links?
  Individual hyperlinks may reflect connections between web page contents or creators
  Counts of large numbers of hyperlinks may reflect wider underlying social processes
  Links may reflect phenomena that have previously been difficult to study
    E.g. informal scholarly communication

Why Count University Links?
  To map patterns of communication between researchers in a country
  Which universities collaborate a lot?
  Which universities collaborate with government or industry?
  Which universities are using the web effectively?

Counting links
  Search engines will count them for you!
  Yahoo! advanced queries, e.g. links from Wolves Uni. to Oxford Uni., or back
    domain:ox.ac.uk AND linkdomain:wlv.ac.uk
    domain:wlv.ac.uk AND linkdomain:ox.ac.uk
  Google link queries
    Find links to specific URLs, e.g. links to the University home page
    link:www.wlv.ac.uk

Counting links
  Can use a special purpose web crawler or robot
    Visits all the pages in a web site
    Counts the links in the site
  Can use “advanced” counting methods

Some Inter-University Hyperlink Patterns
  Mainly for the UK and Europe

Links to UK universities against their research productivity
  The reason for the strong correlation is the quantity of Web publication, not its quality
  This is different to citation analysis

Most links are only loosely related to research
  90% of links between UK university sites have some connection with scholarly activity, including teaching and research
  But less than 1% are equivalent to citations
  So link counts do not measure research dissemination but are more a natural byproduct of scholarly activity
    Cannot use link counts to assess research
    Can use link counts to track an aspect of communication

UK universities tend to link to their neighbours
  Universities cluster geographically

Language is a factor in international interlinking
  English is the dominant language for Web sites in the Western EU
  In a typical country, 50% of pages are in the national language(s) and 50% in English
  Non-English-speaking countries extensively interlink in English

Total university Web pages by language
  Language     Total university Web pages
  Others          328,644
  Danish           86,107
  Portuguese      172,804
  Finnish         444,974
  Norwegian       458,961
  Italian         488,172
  French          885,432
  Greek           941,420
  Dutch           962,092
  Swedish       1,008,353
  Spanish       1,094,442
  German        2,888,072
  English      12,379,256

University Web page languages
  [Figure: stacked bar chart of the share (0% to 100%) of university Web pages in English, German, Swedish, Dutch, French and other languages, by country: fr, it, de, es, gr, no, nl, pt, ch, be, dk, at, se, uk, ie, fi]

Patterns of international communication
  Counts of links between EU universities in Swedish are represented by arrow thickness.
  Counts of links between EU universities in French are represented by arrow thickness.
  Which language???
  Which language???
  Which language?
  Who is isolated?
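
The "Counting links" slide above mentions a special-purpose crawler that visits pages and counts their links. As a rough illustration only, the per-page counting step might look like the sketch below. It is not the crawler used in the research: the class and function names (LinkCollector, count_links_by_host) and the sample page are hypothetical, and a real crawler would also need to fetch pages, follow internal links, and respect robots.txt.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def count_links_by_host(html, base_url):
    """Count links from one page to other hosts, keyed by target host name."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    source_host = urlparse(base_url).hostname
    counts = Counter()
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        host = parsed.hostname
        if host and host != source_host:
            counts[host] += 1  # only links leaving the source site
    return counts


# Hypothetical page content, used purely for illustration.
SAMPLE_PAGE = """
<html><body>
<a href="http://www.ox.ac.uk/">Oxford</a>
<a href="http://www.ox.ac.uk/research/">Oxford research</a>
<a href="/about/">About us (internal link)</a>
<a href="mailto:someone@example.org">Email</a>
<a href="http://www.creen.org/">CREEN</a>
</body></html>
"""

counts = count_links_by_host(SAMPLE_PAGE, "http://www.wlv.ac.uk/index.html")
for host, n in counts.most_common():
    print(host, n)
```

Counting by target host is only one possible choice. The "advanced" counting methods mentioned on the slide (for example, counting by domain or by site rather than by page) would change how the keys of the counter are derived.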

International link patterns
  The next slide is a (Kamada-Kawai) network of the interlinking of the “top” 5 universities in AEAN countries (Asia and Europe), with arrows representing at least 100 links and unconnected universities removed.

The rich get richer on the web
  Link creation obeys the ‘rich get richer’ law
    Sites which already have a lot of links attract the most new links
    Some sites have a huge number of links: most have one or none

Rich get richer example: Links from Australian university pages
  The anomalies are also interesting

Part 3: A General Social Science Link Analysis Methodology
  A general framework for using link counts in social sciences research
    For research into link creation or
    Together with other sources, for research into other online or offline phenomena
  Applicable when there are enough links relevant to the research question to count
    For collections of large web sites or
    For large collections of small web sites

Nine stages for a research project
  1. Formulate an appropriate research question, taking into account existing knowledge of web structure
  2. Conduct a pilot study
  3. Identify web pages or sites that are appropriate to address the research question
  4. Collect link data from a commercial search engine or a personal crawler, taking appropriate accuracy safeguards
  5. Apply data cleansing techniques to the links, if possible, and select an appropriate counting method
  6. Partially validate the link count results through correlation tests, if possible
  7. Partially validate the interpretation of the results through a link classification exercise
  8. Report results with an interpretation consistent with the link classification exercise, including either a detailed description of the classification or exemplars to illustrate the categories
  9. Report the limitations of the study and the parameters used in data collection and processing

The theoretical perspective for link counting
  In order to be able to reliably interpret link counts, all links should be created individually and independently, by humans, through equivalent gravity judgments (e.g., about the quality of the information in the target page).
  Additionally, links to a site should target pages created by the site owner or somebody else closely associated with the site.

Commercial applications
  Of link analysis

Commercial applications
  Find out who links to your web site
    More links mean more visitors
    Check if your web site is being recognised
  Find out who isn’t linking to your site
    But is linking to a competitor’s web site!
    Gives ideas about where to get new customers or links from
  Takes an hour of advanced searches
  Simple but very valuable!

Conclusion
  There is a lot of hidden information in the web: in blogs and hyperlinks

Co-authors
  Ray Binns, Viv Cothey, Ruth Fairclough, Gareth Harries, Xuemei Li, Peter Musgrove, Teresa Page-Kennedy, Nigel Payne, Rudy Prabowo, Liz Price, David Stuart, David Wilkinson, Alesia Zuccala, University of Wolverhampton.
  Rong Tang, Catholic University of America.
  Han-Woo Park, YeungNam University, South Korea.
  Paul Wouters, Andrea Scharnhorst, The Virtual Knowledge Studio for the Humanities and Social Sciences, Amsterdam, The Netherlands.