Making content searchable anywhere using IBM® WebSphere® Portal’s publishing Seedlist Framework
Eitan Shapiro
IBM Software Group, Information Management
Haifa HA, Israel

Constantin Radchenko
Software Developer: WPLC.J2EE/Portal/WebSphere-based technology
IBM Software Group, Information Management
Haifa HA, Israel

January 2009

© Copyright International Business Machines Corporation 2009. All rights reserved.

Abstract: If you are developing an enterprise document-management application that serves as a platform to generate, manage, and publish content, you might wonder, “How can I make all the content available to end users in an effective and usable manner?” The answer is to enable users to search the content of the entire product. This white paper describes how you can make your published enterprise application content available for crawling by IBM® search engines such as IBM WebSphere® Portal Search Engine and IBM Omnifind Enterprise Edition 8.5. Learn how to achieve this functionality by using Content Provider Framework, also known as Seedlist Framework, and more specifically by implementing a simple set of APIs that returns the publishing content while handling critical aspects of security, rich metadata, and effective updates.

Contents

1 Introduction
2 Overview
3 Prerequisites
4 Seedlist Framework architecture
5 Experimenting with the FileSystem Seedlist
6 FileSystem Seedlist Servlet
7 Implementing FileSystem Retriever
    7.1 FileSystem Retriever structure
    7.2 Retriever package implementation
        7.2.1 RetrieverFactory interface implementation
        7.2.2 RetrieverService interface implementation
        7.2.3 Guidelines for Results interfaces implementation
        7.2.4 Incremental publishing for incremental crawling
8 WebSphere Portal Seedlist Framework
9 Conclusion
10 Downloads
11 Resources
12 About the authors

1 Introduction

This white paper provides a full overview of IBM WebSphere Portal’s Content Provider Framework, also known as Seedlist Framework.
This framework was developed to publish content from different IBM repositories in a single format that is rich enough to handle the requirements of these repositories. This format is called Seedlist format, and it can be viewed as an extension of the Sitemap format, the de facto standard for publishing content on the World Wide Web. The Seedlist format is based on the ATOM syndication format [RFC4287].

The need to develop a single format emerged from the following observations:

• Search engines cannot develop crawlers fast enough to keep pace with the proliferation of new internal content sources and new third-party content systems.
• Standard Web crawling is becoming more and more inefficient because Web content is created and changed more rapidly today than ever before. The crawler crawls an ever-growing set of documents, while it actually needs only the delta of modified or newly created documents.
• Web crawling can't reach all content; for example, most crawlers can't follow links that are manipulated by JavaScript™ code.
• Content metadata is growing rapidly. It needs to be indexed in a generic and consistent way among all types of content.

The Seedlist Framework and Seedlist format comprise an application-independent mechanism to integrate heterogeneous content sources with a search engine. Any new content can be crawled without the need to wait for the search engine to build a custom crawler. ATOM/REST-based integration makes it easy for content sources anywhere on any platform to be accessed by a search engine. Content owners decide the level of metadata, including security information, to be crawled, and they can ensure that only recently changed documents get crawled. Content owners also manage the relationship between the pieces of content they manage, and how and where those pieces get displayed in the hierarchical structure.
Details about the Seedlist format can be found in the paper "A Richer Format for Providing Content Crawling Metadata" in the "Downloads" section. That paper can help you better understand the concept of a Seedlist and how to implement the required APIs in Seedlist Framework for publishing your repository content.

2 Overview

The following sections of this paper explain more about the Framework, what APIs must be implemented, and how to implement them. An example of publishing a file system using the Seedlist format is also provided. The full code of the example is included in the Downloads section and can be easily tried out on any J2EE server.

After you have implemented the required APIs over your content system and installed the Seedlist Framework, you can test your Seedlist by using the built-in Seedlist Crawler in WebSphere Portal 6.1. For more information, refer to the WebSphere Portal 6.1 Information Center and search for the section "Crawling an external site using a seedlist."

3 Prerequisites

To get the most from this white paper, you should have a good understanding of J2EE technology, REST-style architecture, and the ATOM syndication format. To use the supplied code, you also need a basic understanding of how to handle servlets.

This paper includes the source code for a full implementation that generates a seedlist for a file system, and we describe how the mandatory APIs should be implemented.

To use Seedlist Framework and to generate a FileSystem Seedlist, you need a J2EE server such as IBM WebSphere Application Server. To crawl through the FileSystem Seedlist and to make the content searchable, you need a Seedlist Crawler such as the crawler in WebSphere Portal 6.1. After the generated FileSystem Seedlist has been crawled by the Seedlist Crawler, you can search the content from WebSphere Portal Search Center.
After you implement the mandatory APIs in your own content system, you can repeat the experiment, crawl through your own system content, and search it from WebSphere Portal Search Center.

4 Seedlist Framework architecture

Seedlist Framework is built on four main components. The consumer of the framework is a Crawler, which uses the Seedlist Framework to read the content and index it in a search engine. The flow between the components is a serial process. Every full cycle is triggered by the Seedlist Crawler and ends when the Crawler receives the Seedlist with the requested list of documents.

The Framework is designed to allow applications to enhance two aspects of the system, content retrieval and output formatting:

• The Retriever package is the API for content retrieval. This package is intended to be overridden by every application that wants to generate a Seedlist using the Seedlist Framework. We will show how to implement a Retriever package over a file system, which can be used as reference code for any other implementation.
• The Formatter package is the API for output formatting. The Seedlist Formatter package is ready for use; you do not need to implement your own formatter. The Seedlist Framework can, however, work with your own formatter.

Another part of Seedlist Framework that is product specific is the Servlet. Most of its work is delegated to the Seedlist Service that is part of the Seedlist Framework. In the Servlet code, you are expected to instantiate the correct Seedlist Retriever and Formatter and pass those objects to the Seedlist Service.

In WebSphere Portal, the Servlet is based on the Eclipse extension point mechanism that is built into WebSphere Application Server 6.1. This architecture lets you install Retrievers and Formatters on a WebSphere Portal server, on which Seedlist Framework detects them automatically. The high-level architecture of Seedlist Framework in WebSphere Portal is shown in figure 1.
NOTE: This paper does not cover the details of writing a Seedlist Retriever as an extension point. Instead, our focus is on a solution that can run on any J2EE server rather than only on a WebSphere Portal server. The WebSphere Portal solution is described to emphasize that the Servlet can be implemented in different ways and that any content system on a WebSphere Portal server that wants to use the Seedlist does not need to supply its own Servlet. The content system needs to supply only a Seedlist Retriever as an extension point.

Figure 1. Seedlist Framework high-level architecture

[Figure not reproduced. Labels in the diagram: Application Space (the Content Provider provides the Retriever implementation); Search Engine Space; Seedlist Formatter implementation; Retriever; Formatter; Seedlist Service; Portal Seedlist Servlet; Seedlist Crawler; Search Index; Extension Point Registry. Raw data flows from the Retriever to the Formatter, and the Seedlist flows to the Seedlist Crawler. The Servlet receives an HTTP request that names a specific Retriever and Formatter and gets the extensions for them from the Extension Point Registry. The WebSphere Portal Seedlist Servlet uses extension points to find the relevant Retriever and Formatter implementations.]

The Seedlist Framework components are described in Table 1.

Table 1. Seedlist Framework components

Name: Seedlist Servlet
Description: Seedlist Servlet is the backbone of the Framework. The Servlet
• processes HTTP requests for the actions getChildren and getDocuments
• creates the relevant Retriever and Formatter
• delegates the request to the Seedlist Service
• returns the output (Seedlist) to the Crawler
Comments: Product specific; e.g., in WebSphere Portal it includes code to support Virtual portals

Name: Seedlist Service
Description: Seedlist Service is the backbone of the Framework. The Service
• processes requests for the actions getChildren and getDocuments
• delegates them to an appropriate Retriever
• passes result data to the Formatter
• returns the output (Seedlist) to the Seedlist Servlet
Comments: Part of the Seedlist Framework

Name: Retriever
Description: Retriever obtains content from a specific content repository. It gets
• children entries (sub-Seedlists and immediate documents)
• all descendent documents
• the number of all descendent documents without retrieving them
Comments: Must be implemented by the content system

Name: Formatter
Description: Formatter represents retrieved data in a specified format (for example, ATOM, Google sitemap)
Comments: Ready for you to use, as is, such as Seedlist Formatter

Name: Extension Point Registry
Description: Mechanism to find implementations of the Retriever and Formatter APIs
Comments: Specific for WebSphere Portal implementation

In addition to Seedlist Framework components, there are Seedlist Framework actors (see table 2).

Table 2. Seedlist Framework actors

Name: Seedlist Crawler
Description: Crawls the content that is published in the Seedlist for indexing

Name: Search Index
Description: Is the search engine that indexes the content and lets you search it later

To help you better understand the interactions between the Seedlist Framework components, the sequence diagram in figure 2 describes a simple scenario of the Crawler that fetches the first 100 documents from some content system.

Figure 2. Seedlist Framework sequence diagram

[Figure not reproduced. Participants: Seedlist Crawler, Seedlist Servlet, Seedlist Service, Retriever, Formatter, Extension Point Registry. Messages, in order:]

1. The Crawler asks for the first 100 documents.
2. The Servlet creates the Retriever and Formatter based on the passed extension point IDs.
3. The Servlet passes the request with parameters, Retriever, and Formatter to the Seedlist Service.
4. The Service asks the Retriever for the first 100 documents.
5. The Retriever fetches 100 documents from the content system.
6. The Service sends the returned documents for formatting.
7. The Formatter formats 100 documents in Seedlist format.
8. The formatted output is returned to the Servlet.
9. The formatted output is returned to the Crawler.

For details about the parameters that are passed by the calls between the Crawler and the Servlet, see the document "Seedlist Framework REST API" in the Downloads section.
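The retrieve-then-format sequence in figure 2 can be sketched in a few lines of plain Java. This is only an illustration of the delegation order, not the Framework's API: the Retriever and Formatter interfaces below are hypothetical stand-ins with invented signatures, and the "feed" output is a placeholder string rather than real Seedlist markup.

```java
import java.util.ArrayList;
import java.util.List;

public class SeedlistFlowSketch {

    /** Hypothetical stand-in for a Retriever: returns raw entries from a content system. */
    public interface Retriever {
        List<String> getDocuments(int start, int count);
    }

    /** Hypothetical stand-in for a Formatter: turns raw entries into an output string. */
    public interface Formatter {
        String format(List<String> documents);
    }

    /** Plays the role of the Seedlist Service: ask the Retriever, then hand the result to the Formatter. */
    public static String handleRequest(Retriever retriever, Formatter formatter, int start, int count) {
        List<String> documents = retriever.getDocuments(start, count);
        return formatter.format(documents);
    }

    public static void main(String[] args) {
        // Plays the role of the Servlet: create a Retriever and a Formatter, then delegate.
        Retriever retriever = (start, count) -> {
            List<String> docs = new ArrayList<>();
            for (int i = start; i < start + count; i++) {
                docs.add("doc-" + i); // placeholder for content fetched from a repository
            }
            return docs;
        };
        Formatter formatter = docs -> "<feed entries=\"" + docs.size() + "\"/>"; // placeholder output

        // The Crawler's request for the first 100 documents.
        System.out.println(handleRequest(retriever, formatter, 0, 100));
    }
}
```

Keeping retrieval and formatting behind separate interfaces is what lets a content system supply only the Retriever while reusing an existing Formatter.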
5 Experimenting with the FileSystem Seedlist

From reading about Seedlist Framework you probably already understand that you must implement your own Servlet over the Seedlist Service and also implement the Retriever package before you can use the Framework for generating a Seedlist that represents your system content.

To simplify the process of implementing a Retriever, complete code for a Retriever over a general file system is supplied in the "Downloads" section. The code can be used as a starting point and reference code when you implement your own Retriever. In addition, to simplify the process of implementing a Servlet, we supply a simple Servlet implementation that supports only the FileSystem Retriever.

This section describes the Servlet code and explains the Seedlist Service API, describes the file system implementation and explains the Retriever API, raises important points about implementation, and warns you about pitfalls. This section also contains recommendations and general conventions for Seedlist Retriever implementation.

Before you read further, install the Seedlist Framework. Refer to the Downloads section for the file IlelSeedlist.zip that includes the file seedlist-installation-guide.doc. After installation, try to generate a Seedlist for your own file system. Also, try to generate a Seedlist for the list of directories in one of your directories. The following URL returns all folders under the directory C:\ibm\wp_profile (the encoded form is C%3A%5Cibm%5Cwp_profile):

http://<host>:<port>/seedlist/myserver?Format=html&SeedlistId=C%3A%5Cibm%5Cwp_profile&Action=GetChildren

Type this URL in your Web browser, using the correct host, port, and directory name for your site. Notice that the supplied action, GetChildren, returns all content items below the supplied folder.
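If you want to build such a URL for another directory, the SeedlistId value must be URL-encoded. The following is a minimal sketch that uses the standard java.net.URLEncoder class; the class and method names (SeedlistUrlSketch, childrenUrl) are invented for illustration, and the path /seedlist/myserver is simply copied from the example URL above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SeedlistUrlSketch {

    /** Builds a GetChildren URL in HTML format for the given folder (hypothetical helper). */
    public static String childrenUrl(String host, int port, String folder) {
        try {
            // URLEncoder turns ':' into %3A and '\' into %5C.
            String seedlistId = URLEncoder.encode(folder, "UTF-8");
            return "http://" + host + ":" + port
                    + "/seedlist/myserver?Format=html&SeedlistId=" + seedlistId
                    + "&Action=GetChildren";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Host and port are placeholders; substitute the values for your server.
        System.out.println(childrenUrl("localhost", 10040, "C:\\ibm\\wp_profile"));
    }
}
```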
Also notice that we use HTML format and not ATOM format, which is the default format expected by the Seedlist Crawler. In Figure 3 you can see the output of the URL on one of our servers in HTML format.

Figure 3. FileSystem Seedlist example in HTML format to get folders

You can also ask for documents from a specific folder in your file system by using the GetDocuments action, as seen in this URL:

http://<host>:<port>/seedlist/myserver?Format=html&SeedlistId=C%3A%5Cibm%5Cwp_profile&Action=GetDocuments

Figure 4 shows the output of the URL on one of our servers in HTML format.

Figure 4. FileSystem Seedlist example in HTML format to get files

For more information about URL formats and the different options, see "Seedlist Framework REST API" in the Downloads section.

The next step is to crawl through the Seedlist and to index its content by using the WebSphere Portal Seedlist Crawler. For details on setting up such a collection and Crawler, refer again to the section "Crawling an external site using a Seedlist" in the WebSphere Portal 6.1 Information Center.

The important thing to notice is that most of the files on your system are not indexed because the WebSphere Portal Crawler does not support the file://<file path> URI format. Only text files are indexed with their content because the content is published inline. Figure 5 shows an example of search results of a file system collection.

Figure 5. Search results of a file system collection

After you’ve generated the FileSystem Seedlist and the content has been crawled and indexed, the next step is to learn more about the Seedlist Servlet that works over the Seedlist Service and about the Retriever API and its implementation.

6 FileSystem Seedlist Servlet

This paper addresses two implementations of the Seedlist Servlet:

• The WebSphere Portal implementation that is described in the “Seedlist Framework architecture” section.
• A simple Servlet implementation that is written specifically to support only the FileSystem Retriever. This Servlet can be used as a simple reference for how to write your own Servlet over the Seedlist Service API for your own system. The source code can be extracted from the ilel-seedlist.Servlet.ear package in the Downloads section.

Seedlist Servlet processes Seedlist REST requests. Three actions are supported:

• GetChildren
• GetDocuments
• GetNumberOfDocuments

Most of the functionality is included within the Seedlist Service. The main purposes of the Servlet are to instantiate the correct Retriever and Formatter and to delegate the action to the Seedlist Service.

The Seedlist Service is available through the com.ibm.ilel.seedlist.service.SeedlistService interface. The Service can be obtained from the com.ibm.ilel.seedlist.service.SeedlistFactory interface during Servlet initialization. You can find the javadoc for SeedlistFactory and SeedlistService in the ilel-seedlist-javadoc.zip file in the Downloads section.

Listing 1 shows how the HTTP Seedlist request is handled in the Servlet.

Listing 1. Handling the HTTP Seedlist Servlet request

    // obtained during the Servlet initialization process
    static final Map<String, FormatterFactory> ffactories;
    static final Map<String, Properties> fproperties;
    SeedlistFactory slfactory;

    Action action = slfactory.createAction(request, response);
    // no security for FileSystem Seedlist Retriever
    action.setUserCredentials(null, null);
    Properties validParams = action.validate();
    // validated Servlet-specific parameters
    Properties validServletParams = getServletParams(request);

    // obtain retriever factory and init properties
    RetrieverFactory rfactory =
        new com.ibm.ilel.seedlist.retriever.filesystem.RetrieverFactoryImp();
    Properties rprops = null; // filesystem seedlist retriever uses no Properties

    // format parameter value is case-insensitive : atom/ATOM
    String ftype = validServletParams.getProperty(FORMAT_PARAM).toLowerCase();
    // obtain formatter factory and init properties
    FormatterFactory ffactory = ffactories.get(ftype);
    Properties fprops = fproperties.get(ftype);

    SeedlistService slservice = slfactory.createSeedlistService(
        rfactory, rprops, ffactory, fprops);
    URLResolver urlResolver =
        new UrlResolverImp(request, validParams, validServletParams);
    slservice.handleRequest(action, urlResolver, validParams);

While handling an HTTP Seedlist request, SeedlistFactory creates an appropriate Action object according to the request parameters. The created Action object validates the Servlet request parameters and throws a SeedlistException if the parameters are not valid. The Action object returns the valid parameters as a Properties object that is passed later to the SeedlistService for handling the request.

The next step is to create the relevant Seedlist Retriever and Formatter factories that are used to create the Retriever and Formatter services inside SeedlistService. All these objects are passed to SeedlistFactory to create a SeedlistService instance that handles the request itself.
The main function of the SeedlistService is handleRequest(), which executes the request. The function gets the action, a list of valid parameters, and a URLResolver. It obtains the documents from the Retriever and passes those documents to the specified Formatter.

To handle formatting properly, SeedlistService needs a URLResolver instance, which provides all the Seedlist URLs to:

• the current Seedlist page
• the next and previous Seedlist feed pages
• the feed page that represents Documents (GetDocuments action)
• the feed page that represents Children (GetChildren action)

Seedlist Framework implements SimpleUrlResolver, which deals with URL generation. Notice that the generated URLs use either parameters that are passed on the request or default parameters. For example, if the parameter Range is not passed on a Servlet request, its default value is used by URLResolver when the URL is created.

The Servlet developer can provide a custom URLResolver. For example, the FileSystem Seedlist Servlet uses com.ibm.ilel.seedlist.Servlet.url.UrlResolverImp, which is an extension of SimpleUrlResolver that overrides the addServletParams() function; a developer can replace it with another implementation.

7 Implementing FileSystem Retriever

In the FileSystem Seedlist the hierarchy of content is built from folders and files. The folders are the Seedlists and Sub-Seedlists. The files are the leaves of the content tree, which we refer to as Documents. A Seedlist differs from a Document in that a Seedlist can contain other Seedlists or Documents.

Before going into more detail, we recommend you obtain the source code of the FileSystem Retriever, which can be extracted from the ilel-seedlist.filesystem.jar package in the Downloads section.
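To make the folder-and-file hierarchy concrete, here is a minimal sketch that splits the immediate children of a directory into Seedlists (sub-folders) and Documents (files) by using only java.io.File. It is not taken from the supplied FileSystem Retriever; the class name FolderChildren and its fields are invented for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FolderChildren {

    /** Sub-folders: these play the role of Sub-Seedlists. */
    public final List<File> seedlists = new ArrayList<>();

    /** Regular files: these play the role of Documents, the leaves of the content tree. */
    public final List<File> documents = new ArrayList<>();

    /** Classifies the immediate children of the given folder. */
    public static FolderChildren of(File folder) {
        FolderChildren result = new FolderChildren();
        // listFiles() returns null if the path is not a directory or cannot be read.
        File[] entries = folder.listFiles();
        if (entries != null) {
            for (File entry : entries) {
                if (entry.isDirectory()) {
                    result.seedlists.add(entry);
                } else if (entry.isFile()) {
                    result.documents.add(entry);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : ".");
        FolderChildren children = FolderChildren.of(folder);
        System.out.println(children.seedlists.size() + " sub-Seedlists, "
                + children.documents.size() + " Documents");
    }
}
```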
7.1 FileSystem Retriever structure

After extracting the FileSystem Retriever code, notice that the root package is defined as com.ibm.ilel.seedlist.retriever.filesystem and contains several sub-packages:

• com.ibm.ilel.seedlist.retriever.filesystem.imp (contains implemented/extended classes that are specific for the FileSystem Retriever)
• com.ibm.ilel.seedlist.retriever.filesystem.test (contains JUnit tests for the FileSystem Retriever)
• com.ibm.ilel.seedlist.retriever.filesystem.resources (contains the resource bundle properties file that is used for translation of error messages and other localized notes)

7.2 Retriever package implementation

When you open the supplied Seedlist Framework API javadoc, ilel-seedlist-javadoc.zip, you can see that the API is organized in seven different packages; the javadoc describes each package and its purpose. The package that we focus on is the Retriever package.

Listing 2 specifies the list of interfaces that are mandatory for a fully functional Retriever. The full list of interfaces required to implement the Retriever is much longer, but the other interfaces have default implementations that can be used as is. The default implementations can be found in ilel-seedlist.jar under the package com.ibm.ilel.seedlist.imp, and they were designed to be extended, if needed.

Listing 2. Mandatory interfaces for a Retriever implementation

    com.ibm.ilel.seedlist.retriever.RetrieverFactory
    com.ibm.ilel.seedlist.retriever.RetrieverService
    com.ibm.ilel.seedlist.common.EntrySet
    com.ibm.ilel.seedlist.common.Document
    com.ibm.ilel.seedlist.common.Seedlist

Here we describe the implementation of the five mandatory interfaces. Some of these interfaces also have abstract implementation classes to make the required development simpler and faster. The Retriever follows the factory and service design pattern. The main entry point to the package is the factory interface.
7.2.1 RetrieverFactory interface implementation

RetrieverFactory lets you create the objects that are required for retrieving the content, such as ApplicationInfo (credential information), RetrieverService, and RetrieverRequest. It must implement the com.ibm.ilel.seedlist.retriever.RetrieverFactory interface, and it can extend the com.ibm.ilel.seedlist.imp.AbstractRetrieverFactory class, which provides a basic implementation to create the ApplicationInfo instance. Listing 3 shows the implementation of RetrieverFactory for the FileSystem.

Listing 3. Code sample from the RetrieverFactory implementation

    public class RetrieverFactoryImp extends AbstractRetrieverFactory {

        public RetrieverService getRetrieverService(Properties prop,
                HttpServletRequest servletRequest,
                HttpServletResponse servletResponse) throws SeedlistException {
            // the Servlet request and response are not needed by this Retriever
            return new RetrieverServiceImp(prop);
        }

        public RetrieverRequest createRequest(String seedlistId)
                throws SeedlistException {
            return new RetrieverRequestImp(seedlistId);
        }

        public String getVersion() {
            return "FileSystem 1.0";
        }
    }

NOTE: It is recommended to define the correct Retriever versioning string rather than to use the default string from the AbstractRetrieverFactory class. The versioning string can be any unique string that defines the type and version of the implementation.

Notice that, although the function getRetrieverService() gets a Servlet request and response as parameters, these parameters are not passed to the RetrieverServiceImp constructor. These additional parameters are not used in the FileSystem Retriever implementation, but they might be used in other implementations. If needed, the Retriever can read any HTTP parameters that are passed on the request and that are not covered by the Seedlist specification. Passing the Servlet request and response means that a new service must be created for each new request to the Servlet.
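As a rough illustration of how a Retriever might pick out HTTP parameters that are not part of the Seedlist specification, the sketch below filters a parameter map of the shape returned by the Servlet API's getParameterMap(). The set of Seedlist parameter names contains only the names mentioned in this paper (Action, Format, SeedlistId, Range, State) and is an assumption, not the complete list from "Seedlist Framework REST API"; the class and method names are invented, and exact, case-sensitive name matching is also an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ExtraParamsSketch {

    // Assumption: only the parameter names that appear in this paper; the real list may be longer.
    static final Set<String> SEEDLIST_PARAMS =
            new HashSet<>(Arrays.asList("Action", "Format", "SeedlistId", "Range", "State"));

    /**
     * Returns the first value of every parameter whose name is not in SEEDLIST_PARAMS.
     * The input has the same shape as HttpServletRequest.getParameterMap().
     */
    public static Map<String, String> extraParams(Map<String, String[]> allParams) {
        Map<String, String> extra = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : allParams.entrySet()) {
            String[] values = entry.getValue();
            boolean hasValue = values != null && values.length > 0;
            if (hasValue && !SEEDLIST_PARAMS.contains(entry.getKey())) {
                extra.put(entry.getKey(), values[0]);
            }
        }
        return extra;
    }
}
```

A factory could call such a helper with servletRequest.getParameterMap() and pass the result to the service constructor, which is one reason a new service instance is needed per request.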
7.2.2 RetrieverService interface implementation

After obtaining a Retriever request object and a Retriever service instance from RetrieverFactory, we can look more closely at the service, which implements the com.ibm.ilel.seedlist.retriever.RetrieverService interface.

Implementation of the RetrieverServiceImp constructor depends strictly on the Retriever itself, and usually the constructor initializes internal services for a particular content model. For example, FileSystem Retriever initializes rootSeedlistId, which is the root folder for a published file system. Listing 4 shows the constructor of the FileSystem Retriever Service.

Listing 4. Code sample of RetrieverService constructor

    public class RetrieverServiceImp implements RetrieverService {
        ...
        // FileSystem root SeedlistId property
        private static final String ROOT_SEEDLIST_ID_PROP = "RootSeedlistId";
        // default value for root SeedlistId property
        private static final String ROOT_SEEDLIST_ID_DEFAULT = "C:" + File.separator;
        // FileSystem root Seedlist (obtained from service properties)
        private static File rootSeedlist;

        public RetrieverServiceImp(Properties properties) {
            String rootSeedlistId = properties.getProperty(
                ROOT_SEEDLIST_ID_PROP, ROOT_SEEDLIST_ID_DEFAULT);
            rootSeedlist = new File(rootSeedlistId);
        }
        ...
    }

This service class must implement the three functions that are listed in Listing 5. Notice that the two functions getDocuments() and getChildren() have equivalent actions at the REST API level.

Listing 5. Three mandatory functions that every Retriever service must implement

    public int getNumberOfDocuments(ApplicationInfo appInfo,
        RetrieverRequest request) throws SeedlistException;

    public EntrySet getDocuments(ApplicationInfo appInfo,
        RetrieverRequest request) throws SeedlistException;

    public EntrySet getChildren(ApplicationInfo appInfo,
        RetrieverRequest request) throws SeedlistException;

The relevant function is called by the Seedlist Servlet based on the action passed in the REST call. The call to getDocuments() is relevant when the Crawler only wants to go through all the documents one after another. The call to getChildren() is relevant when the Crawler wants to traverse the tree structure of the content repository.

For clarity and simplicity we define a new class called FileSystemContent. This class has functionality that is similar to RetrieverService. The next sections present the implementation of RetrieverService, a wrapper over the FileSystemContent class. Detailed file system information is not given so that you can concentrate on the general flow.

Listing 6 is a code sample that implements the getChildren() function. This function returns the folders and documents that are directly under the specified Seedlist ID, a folder in the file system.

Listing 6. Code sample that implements the getChildren function

    // get the folder to start from
    File seedlist = getSeedlist(request);
    // use last update date for filtering of returned files
    Date lastUpdateDate = getLastUpdateDate(request);

    FileSystemContent fsContent = new FileSystemContent(request);
    fsContent.traverseChildren(seedlist, lastUpdateDate);
    List<Document> documents = fsContent.getDocuments();
    List<Seedlist> seedlists = fsContent.getSeedlists();
    EntrySetImp entrySet = new EntrySetImp(documents, seedlists, request);

    /*
     * Return the Timestamp only for the first request in a session, so that
     * Documents added during pagination (in the same session) are not missed
     */
    if (isFirstRequest(request)) {
        entrySet.setTimestamp(createNewTimestamp(fsContent.getTimestamp()));
    }
    return entrySet;

The first step in the getChildren() function is to obtain the Seedlist ID from the passed request and create from it a com.ibm.ilel.seedlist.common.Seedlist object that is suitable for the Retriever. Note that this step is also the first step in the functions getNumberOfDocuments() and getDocuments(). For the FileSystem Retriever, the Seedlist ID represents a folder. If no Seedlist ID is specified, the root folder of the file system is returned. Listing 7 displays a code sample to get the requested Seedlist object in the FileSystem Retriever.

Listing 7. Code sample to get the requested Seedlist object

    private File getSeedlist(RetrieverRequest request) {
        String id = request.getSeedlistId();
        // fall back to the root folder when no Seedlist ID is supplied
        return (id == null || id.length() <= 0) ? rootSeedlist : new File(id);
    }

The required Documents and Seedlists (file system folders) are obtained by use of the FileSystemContent class. Now you can create a result object that implements the EntrySet interface. We will discuss its implementation and relevant guidelines later in this paper.

Listing 8 shows a code sample to implement the getDocuments() function.
This function returns all the documents under the specified Seedlist ID, which represents a folder in the file system. Notice that, when the EntrySet is created, only documents are passed. Null is passed for the list of Seedlists because this action type must return only documents and not folders.

Listing 8. Code sample to implement the getDocuments function

    // initialize current request session parameters
    File seedlist = getSeedlist(request); // Get the folder to start from
    // use last update date for filtering of returned files
    Date lastUpdateDate = getLastUpdateDate(request);
    FileSystemContent fsContent = new FileSystemContent(request);
    fsContent.traverseDocuments(seedlist, lastUpdateDate);
    List<Document> documents = fsContent.getDocuments();
    EntrySetImp entrySet = new EntrySetImp(documents, null, request);
    /*
     * Return Timestamp only for the first request in session, so don't miss
     * Documents, if they are added during pagination (in the same session)
     */
    if (isFirstRequest(request)) {
        entrySet.setTimestamp(createNewTimestamp(fsContent.getTimestamp()));
    }
    return entrySet;

Listing 9 shows a code sample to implement the getNumberOfDocuments() function. This function returns the number of documents under the specified Seedlist ID, a folder in the file system. Its flow is similar to the getDocuments() function in the FileSystem Retriever.

Listing 9. Code sample to implement the getNumberOfDocuments function

    // initialize current request session parameters
    File seedlist = getSeedlist(request); // Get the folder to start from
    // use last update date for filtering of counted files
    Date lastUpdate = getLastUpdateDate(request);
    FileSystemContent fsContent = new FileSystemContent(request);
    return fsContent.countDocuments(seedlist, lastUpdate);

Optimize Retriever using the State parameter

Having gone through the three main functions of the RetrieverService, note that the code sample of the function getDocuments() in Listing 8 was simplified for clarity: the optimization code was removed. Listing 10 shows the optimized code that uses the State parameter.

The State parameter is an opaque object that is created by the Retriever itself on the first crawler request and can store any required information. This parameter helps the Retriever to jump directly to the correct start content item and to return the requested page without passing through all objects that were retrieved in previous requests.

The requirement is that the Retriever must be stateless so that different crawlers can crawl through it at the same time. This means that the state information cannot be kept inside the Retriever. The solution is to pass the data as an opaque state object between the Crawler and the Retriever throughout the crawling session.

Listing 10. Code sample to implement the getDocuments function

    // initialize current request session parameters
    File seedlist = getSeedlist(request); // Get the folder to start from
    // use last update date for filtering of returned files
    Date lastUpdateDate = getLastUpdateDate(request);
    // obtain internal state of FileSystem seedlist retrieving model
    String fsState = getInternalState(request);
    FileSystemContent fsContent = new FileSystemContent(request);
    fsContent.traverseDocuments(seedlist, lastUpdateDate, fsState);
    List<Document> documents = fsContent.getDocuments();
    EntrySetImp entrySet = new EntrySetImp(documents, null, request);
    // create State to use it in the next request for optimization reasons
    String newFsState = fsContent.createInternalState();
    State newState = createNewState(newFsState, request, documents.size());
    entrySet.setState(newState);
    /*
     * Return Timestamp only for the first request in session, so don't miss
     * Documents, if they are added during pagination (in the same session)
     */
    if (isFirstRequest(request)) {
        entrySet.setTimestamp(createNewTimestamp(fsContent.getTimestamp()));
    }
    return entrySet;

In the FileSystem example, the following occurs:

1. The State is encoded as a concatenation of the start index and some internal state of the FileSystem content model.
2. The prefixed start index is used only to validate the received State parameter.
3. The FileSystem content model stores the absolute path of the last retrieved file.
4. The State object is returned by the EntrySet interface and is appended to the "next page" URL that is supplied in the Seedlist output.
5. When the crawler accesses this "next page" URL, the State object is passed back to the Retriever as an HTTP parameter in a manner that is transparent to the crawler.

Notice that the function getChildren() ignores the State parameter because it cannot optimize retrieval of immediate files or folders.
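The State flow described above can be illustrated with a small, self-contained round trip. This is only a sketch under stated assumptions, not the Framework's code: the class StateRoundTrip and its two methods are hypothetical names, a plain String stands in for the com.ibm.ilel.seedlist.common.State object, and IllegalArgumentException stands in for SeedlistException.

```java
// Hypothetical illustration of the <start_index>|<internal_state> encoding.
// Plain Strings replace the Framework's State interface for brevity.
final class StateRoundTrip {
    // Same separator character as the FileSystem example
    private static final char STATE_SEPARATOR = '|';

    /** Composes the state for the next page: <next_start_index>|<last_file_path>. */
    static String encode(int startIndex, int returnedEntries, String lastFilePath) {
        return (startIndex + returnedEntries) + String.valueOf(STATE_SEPARATOR) + lastFilePath;
    }

    /**
     * Checks that the state was produced for the requested page and returns
     * the internal part (here, the path of the last retrieved file).
     */
    static String decodeInternalState(String state, int requestedStartIndex) {
        int separatorIndex = state.indexOf(STATE_SEPARATOR);
        if (separatorIndex == -1) {
            throw new IllegalArgumentException("State has no separator");
        }
        int stateStartIndex;
        try {
            stateStartIndex = Integer.parseInt(state.substring(0, separatorIndex));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("State start index is not a number", e);
        }
        // Pages must be requested sequentially for the optimization to be valid
        if (stateStartIndex != requestedStartIndex) {
            throw new IllegalArgumentException("State does not match the requested start index");
        }
        return state.substring(separatorIndex + 1);
    }

    public static void main(String[] args) {
        // First page started at index 0 and returned 25 documents
        String state = encode(0, 25, "/data/docs/report.txt");
        // The crawler follows the "next page" URL, which asks for start index 25
        System.out.println(decodeInternalState(state, 25));
    }
}
```

Because only the first separator is searched for, a file path that itself contains the separator character is still recovered intact; the numeric prefix can never contain it.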
In the function getDocuments(), by contrast, the State object is obtained from the received request object. The function then retrieves the list of documents and generates a new State object for the next request. The new State object, which implements the interface com.ibm.ilel.seedlist.common.State, is set on the EntrySet object that is returned from the Retriever.

Listing 11 shows how to check the consistency of the request State and how to return the FileSystem state from the Retriever request. Note that the State is considered inconsistent if the start index defined by the State does not match the start index defined by the Retriever request. This validity check ensures that the requested pages are sequential; if they are not, we cannot optimize the request handling, and thus we must not use the information from the State object. The two start indexes are always identical as long as the crawler uses the "next page" URLs published on the Seedlist.

Listing 11. Obtaining the State from the Retriever request

    private String getContentModelState(RetrieverRequest request)
            throws SeedlistException {
        State state = request.getState();
        if (isEmptyState(state)) {
            return null;
        }
        String stateInfo = state.asString();
        // start index and internal FileSystem state are separated by '|'
        int stateSeparatorIndx = stateInfo.indexOf(STATE_SEPARATOR);
        if (stateSeparatorIndx == -1) {
            //...throw SeedlistException...
        }
        // check consistency of start index for state
        try {
            int startState = Integer.parseInt(
                    stateInfo.substring(0, stateSeparatorIndx));
            if (startState != request.getStartIndex()) {
                //...throw SeedlistException...
            }
        } catch (NumberFormatException e) {
            //...throw SeedlistException...
        }
        // obtain FileSystem content state
        return stateInfo.substring(stateSeparatorIndx + 1);
    }

After successfully traversing the content, a new State for the next request is created.
For the FileSystem, the State is encoded as a concatenation of the start index and the absolute path of the last retrieved file, or as null if no files or folders are returned. The State is defined as the following string: <start_index>|<internal_filesystem_state>.

Listing 12 shows how the prefixed start index is calculated for a new State and how the whole State for the next request is created. In the FileSystem Retriever, a new State is created only if documents are collected; if no documents are collected, the existing FileSystem State is used in the next request.

Listing 12. Code sample to create a new opaque State object

    private static final char STATE_SEPARATOR = '|';

    private State createNewState(String internalState,
            RetrieverRequest request, int numberOfEntries) {
        if (internalState == null) {
            return request.getState();
        }
        // compose new state as <new_start_index>|<absolute_file_name>
        StringBuffer newStateStr = new StringBuffer();
        newStateStr.append(request.getStartIndex() + numberOfEntries);
        newStateStr.append(STATE_SEPARATOR);
        newStateStr.append(internalState);
        return new StateImp(newStateStr.toString());
    }

7.2.3 Guidelines for Results interfaces implementation

At this point, we still haven’t covered the implementation of the EntrySet, Document, and Seedlist interfaces. Let’s now go through the implementation of these interfaces and define some implementation guidelines.

A. Recommendations when implementing the com.ibm.ilel.seedlist.common.EntrySet interface:

1. Extend the abstract class com.ibm.ilel.seedlist.imp.AbstractEntrySet. It defines basic class members as protected and provides default implementations for getter functions.

2. Use Collections.EMPTY_LIST rather than null for the Documents and Seedlists lists in the EntrySetImp constructor (or setter functions). Note that the getSeedlists() and getDocuments() functions return an Iterator and must not throw a NullPointerException when the Documents (or Seedlists) list is empty or null.

3. Most of the data of EntrySet is not set directly on EntrySet but is included in the com.ibm.ilel.seedlist.common.Metadata object. The Metadata interface includes categories (com.ibm.ilel.seedlist.common.Category), fields (com.ibm.ilel.seedlist.common.Field), and access control lists (ACLs). Use the default Metadata implementation class com.ibm.ilel.seedlist.imp.MetadataImp, which provides a full set of required functionalities and can be extended, if required.

4. Each com.ibm.ilel.seedlist.common.Entry is expected to include relevant metadata such as fields, categories, and ACLs. Notice that EntrySet fields apply to all Document and Seedlist Entries. Put a field at the EntrySet level only if it is relevant to all Entries.

5. At the same time, com.ibm.ilel.seedlist.common.FieldInfos, which includes indexing instructions, is usually common for all Entries. For example, the FileSystem Retriever assumes that every Entry includes fields like title, description, and last update date, as shown in listing 13.

Listing 13. Creation of FieldInfo parameters for FileSystem Retriever EntrySet

    // Message code for Title field name
    private static final String MSG_FIELD_TITLE_NAME_INFO =
            "SEEDLISTRTVFILESYS0201I";
    ...

    // Set field info for title, description and last update date
    private FieldInfo[] obtainFieldsInfo(Locale locale)
            throws SeedlistException {
        FieldInfo[] fieldsInfo = new FieldInfo[3];
        fieldsInfo[0] = createFieldInfo(
                Field.FIELD_TITLE, MSG_FIELD_TITLE_NAME_INFO,
                FIELD_TITLE_DESC, FieldInfo.TYPE_STRING, locale);
        fieldsInfo[1] = createFieldInfo(
                Field.FIELD_DESCRIPTION, MSG_FIELD_DESCRIPTION_NAME_INFO,
                FIELD_DESCRIPTION_DESC, FieldInfo.TYPE_STRING, locale);
        fieldsInfo[2] = createFieldInfo(
                Field.FIELD_UPDATE_DATE, MSG_FIELD_UPDATE_DATE_NAME_INFO,
                FIELD_UPDATE_DATE_DESC, FieldInfo.TYPE_DATE, locale);
        return fieldsInfo;
    }

    // Create FieldInfo with localized message for name
    private FieldInfo createFieldInfo(String id, String msgCode,
            String desc, int type, Locale locale) throws SeedlistException {
        SeedlistMessage field = new SeedlistMessage(
                SeedlistMessage.SEVERITY_INFORMATIONAL, msgCode,
                RetrieverFactoryImp.FILESYSTEM_RETRIEVER_RES_BUNDLE_NAME);
        return new FieldInfoImp(id, field.getMessage(locale), desc, type);
    }

For this reason, the FileSystem Retriever sets up FieldInfos on the EntrySet rather than separately on every Entry. A FieldInfo on the EntrySet does not harm Entries that have no such field. Therefore, it is recommended to define all FieldInfos on the EntrySet and not on each Entry, which decreases the size of the returned EntrySet and of the data transmitted over the network. Also in listing 13, note that names are translated based on the locale via the SeedlistMessage class.

6. EntrySet also contains categories that are common for all Entries in the returned set. Listing 14 shows the definition for content source category of FileSystem Retriever. Recall that Seedlists differ from Documents in that a Seedlist can contain other Seedlists or Documents. Seedlists represent nodes in the content hierarchy, whereas Documents represent leaves.

Listing 14. Code sample of how to create field parameters for FileSystem Retriever EntrySet

    MetadataImp metadata = new MetadataImp();
    metadata.setFields(obtainFields(dir));
    ...
    setMetadata(metadata);
    ...

    private Field[] obtainFields(File dir) {
        ArrayList fields = new ArrayList();
        fields.add(new FieldImp(dir.getName(), Field.FIELD_TITLE));
        fields.add(new FieldImp(dir.getAbsolutePath(), Field.FIELD_DESCRIPTION));
        fields.add(new FieldImp(new Date(dir.lastModified()),
                Field.FIELD_UPDATE_DATE));
        return (Field[]) fields.toArray(new Field[fields.size()]);
    }

B. Recommendations when implementing the com.ibm.ilel.seedlist.common.Seedlist interface:

1. Extend the abstract class com.ibm.ilel.seedlist.imp.AbstractSeedlist. Similar to AbstractEntrySet, it defines basic class members as protected, and it provides default implementations for getter functions.

2. In the same manner as EntrySetImp, SeedlistImp sets most of its properties through the Metadata object. Use the default Metadata implementation class com.ibm.ilel.seedlist.imp.MetadataImp, which provides a full set of required functionalities and can be extended, if required. Listing 14 above displays the code sample to define several folder fields, such as the last modification time of the folder.

C. Recommendations when implementing the com.ibm.ilel.seedlist.common.Document interface:

1. Extend the abstract class com.ibm.ilel.seedlist.imp.AbstractDocument.

2. Each piece of content has two URLs: one for its display page and one for its content or crawl page. When crawling through content in the enterprise, the search engines can analyze the “crawl” URL while redirecting users to a proper “display” URL when a result is displayed and clicked. For a display link, the developer can use the basic class com.ibm.ilel.seedlist.imp.LinkImp, which takes a java.net.URI in its constructor. For a crawl URL, providing a link to the content is preferred whenever possible; in that case the same LinkImp class can be used.
When content can’t be accessed over a network, an inline content representation can be used, and the content text is included in the resulting Seedlist ATOM feed. In such a case, the developer can use the basic class com.ibm.ilel.seedlist.imp.DocumentContentImp. Listing 15 shows how FileSystem Retriever inline content is created for text files.

Listing 15. Code sample of inline content creation for text files of FileSystem Retriever

    // Default encoding for text files is UTF-8
    private static final String DEFAULT_ENCODING = "UTF-8";
    // Represents text MIME type
    private static final String TEXT_MIME_TYPE = "text/plain";

    private DocumentContent createFileContent(File file)
            throws FileNotFoundException {
        InputStream istream = new FileInputStream(file);
        DocumentContentImp docContent = new DocumentContentImp(istream);
        docContent.setEncoding(DEFAULT_ENCODING);
        docContent.setLocale(Locale.ENGLISH);
        docContent.setType(TEXT_MIME_TYPE);
        return docContent;
    }

3. Just like EntrySetImp and SeedlistImp, DocumentImp sets most of its properties through the Metadata object. Use the default Metadata implementation class com.ibm.ilel.seedlist.imp.MetadataImp, which provides a full set of required functionalities and can be extended, if required.

7.2.4 Incremental publishing for incremental crawling

One of the main capabilities that Seedlist Framework introduces is an effective mechanism for incremental publishing that results in incremental crawling. This mechanism is critical when crawling through large content systems where the goal is to have a frequently updated search index. In addition, it reduces the load on the content system because only the content items that changed need to be returned to the crawler, rather than all content items in the system.

The Timestamp Servlet parameter is used to define the logical time of the last crawling session (CS). The logical time object is an opaque object like the State object described earlier.
It is created by the Retriever, and only the Retriever can "understand" it. This object can hold a real date or another data structure that represents the last CS time for your Retriever. When the Timestamp is passed to the Seedlist Retriever, the Retriever must return the documents that were updated since the specified Timestamp. Because not all the updates are returned in one call, several sequential requests are sent to the Seedlist Servlet with increasing range values (the "start" parameter). This sequence of requests is called the "pagination process", during which the same Timestamp parameter is passed.

In general, when starting an incremental CS, the crawler should add the Timestamp parameter to the Seedlist URL of the first request in the current CS: &Timestamp=<timestamp from the last crawling session>. In all successive requests in the same CS, the Timestamp parameter is included automatically in the next URL by the Seedlist URLResolver. The crawler is expected to read the Timestamp element from the Seedlist output on the last page of the CS and use that Timestamp in the next CS.

Listing 16 shows how the last update date (or any other Timestamp suitable for your Retriever) is obtained from the request to filter returned files. All Documents (files) that are updated after this date must be returned for the current request. Give priority to an explicitly specified Date parameter on the request; if it is not defined, use the Timestamp parameter.

Listing 16. Code sample to get the last update date (or timestamp) for filtering documents

    private Date getLastUpdateDate(Request request) throws SeedlistException {
        if (request.getDate() != null) {
            return request.getDate();
        }
        State ts = request.getTimestamp();
        if (isEmptyState(ts)) {
            return null;
        }
        try {
            // timestamp comprises last retrieving request time in millis
            return new Date(ByteBuffer.wrap(ts.getStateData()).getLong());
        } catch (BufferUnderflowException e) {
            //...throw new SeedlistException...
        }
    }

The advantage of using the Timestamp parameter over the Date parameter is the elimination of time-zone synchronization issues. In addition, the Timestamp parameter lets you save additional information to optimize the retrieval of the updated content, if required.

Listing 17 describes the creation of a new Timestamp according to the time of the current request. Notice that, because the Timestamp is an opaque object, we reuse the State interface for its representation.

Listing 17. Code example to create new opaque Timestamp object

    private State createNewTimestamp(long timeInMillis) {
        // a long occupies exactly 8 bytes
        ByteBuffer timeBuf = ByteBuffer.allocate(8);
        // define timestamp as current time in milliseconds
        timeBuf.putLong(timeInMillis);
        return new StateImp(timeBuf.array());
    }

State Machine regarding State and Timestamp parameters

In this section we describe the different states in which a system can be, in relation to the State and Timestamp (TS) objects that are passed between the Crawler and the Publishing backend system (see figure 6):

1. The crawler starts its first full crawling session with (0, 0), which means there is no State and no TS. It asks for the first page of the content list.

2. The crawler continues and asks for the next pages of the first crawling session with (1, 0), which means there is a State object on the request, but there still is no TS because we are still in the first full crawling session.

3. The crawler finishes the first full crawling session and, after some delay, starts another session to get updates. The crawler is at (0, 1): it asks for the first page, so it has no State object, but it already has a TS object that was returned by the last request of the previous session and is now used to get only the updates.

4. The crawler continues and asks for the next pages with (1, 1), which means there is a State object as well as a TS that is being used to get only the updates. The passed TS does not change between requests for consecutive pages in the same crawling session.

5. The crawler moves back to (0, 0) only when it wants to perform a full crawl again and disregard the TS.

Figure 6. State and Timestamp handling State Machine
[Figure: the four states (0, 0), (1, 0), (0, 1), and (1, 1), written as (State, TS). 0 – There is NO such parameter on the request. 1 – There is such a parameter on the request.]

8 WebSphere Portal Seedlist Framework

As described above in Section 4, Seedlist Framework architecture, the Servlet code is expected to be specific to the project. In this section we cover the implementation in WebSphere Portal and its specific characteristics.

The implementation in WebSphere Portal is based on the Eclipse extension point mechanism, which lets WebSphere Portal customers develop their own Retriever, install the implementation on the server, and generate the Seedlist for their content system without any additional effort. A built-in Retriever, called the Discovery Retriever, discovers those Retrievers. This service implements the Seedlist Retriever API and, with the getChildren() call, returns a list of all Seedlist Retriever extension point implementations that are available in the system. With the getDocuments() call, it provides a list of all available Documents in all Seedlist Retrievers.

Figure 7 shows the output of the URL: https://<host>:<port>/seedlist/myserver?Format=html

Figure 7. Discovery Retriever that displays available Retrievers on the system

Notice that in the WebSphere Portal Servlet, the Source parameter of the Seedlist REST API is used to pass the extension point ID. The URL above does not include the Source parameter, so the Discovery Retriever is used by default. For the Action, the default is GetChildren. The Discovery Retriever finds the WebSphere Portal Retriever and the FileSystem Retriever and displays them in an HTML view, the specified format.
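To make these request parameters concrete, the following sketch composes Seedlist request URLs from name/value pairs. It is an illustration only: the class SeedlistUrlSketch and its buildUrl method are hypothetical, the host, port, and extension point ID are placeholders, and the parameter names (Format, Source, Action, Timestamp) are taken from this paper rather than from the REST API specification. The Action value GetDocuments and the Timestamp value are assumptions; in practice the Timestamp is an opaque value copied from the previous crawling session's output.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: joins a base Seedlist URL with query parameters.
final class SeedlistUrlSketch {

    /** Appends the parameters to the base URL in insertion order, URL-encoded. */
    static String buildUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char delimiter = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(delimiter)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            delimiter = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and port
        String base = "https://portal.example.com:10035/seedlist/myserver";

        // No Source parameter: the Discovery Retriever answers by default
        Map<String, String> discovery = new LinkedHashMap<String, String>();
        discovery.put("Format", "html");
        System.out.println(buildUrl(base, discovery));

        // Incremental crawl of one Retriever (all values below are assumptions)
        Map<String, String> incremental = new LinkedHashMap<String, String>();
        incremental.put("Source", "com.example.filesystem.retriever");
        incremental.put("Action", "GetDocuments");
        incremental.put("Timestamp", "AAABHvY=");
        System.out.println(buildUrl(base, incremental));
    }
}
```

A LinkedHashMap keeps the parameters in the order they were added, which makes the generated URLs predictable and easy to compare in logs.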
In Figure 7, each Retriever is shown with its extension point name as the description.

9 Conclusion

Seedlist Framework is a simple and intuitive framework that simplifies the task of making your enterprise content searchable by publishing the content in Seedlist format. This format is supported by different search engines. The Framework handles the critical and problematic aspects of enterprise content, such as access control, metadata publishing with indexing instructions, and incremental crawling for effectiveness and scalability.

We have described the Framework architecture and guided you through creation of a FileSystem Seedlist. We’ve also provided a step-by-step outline of the FileSystem Retriever implementation as reference code for your own Retriever development.

10 Downloads

This section describes all the packages and source code that are needed to write a Seedlist Retriever and to generate a Seedlist using Seedlist Framework. In addition, it lists the additional papers on the Seedlist, API specifications, and Javadoc.

Name: IlelSeedlist.zip
Size:
Download method:
Description: The zip file includes:
1. ilel-seedlist-javadoc.zip
2. ilel-seedlist.jar
3. ilel-seedlist.servlet.ear
4. ilel-seedlist.filesystem.jar
5. seedlist-installation-guide.doc

ilel-seedlist-javadoc.zip includes:
1. Javadoc of the Seedlist Framework API

ilel-seedlist.jar includes:
1. Seedlist Framework API and its default implementations
2. ATOM/XML, HTML, XSLT Seedlist Formatters
3. Seedlist Service that connects the different framework parts. It validates request parameters, calls the appropriate Retriever function, and formats the retrieved data according to the required format.

ilel-seedlist.servlet.ear includes:
Servlet that instantiates the FileSystem Retriever, instantiates the ATOM/XML Formatter, and processes requests by using the Seedlist Service. The package includes the source code.

ilel-seedlist.filesystem.jar includes:
Seedlist Retriever implementation of File System. The package includes the source code.

Name: SeedlistSpecification.zip
Size:
Download method:
Description: The zip file includes:
1. Seedlist Framework REST API specification, which includes the HTTP parameters and the output format.
2. "A Richer Format for Providing Content Crawling Metadata" paper (written by David Konopnicki and Laurent Hasson).

11 Resources

• developerWorks article, WebSphere Portal Search Toolbox for WebSphere Portal Version 6.0
• developerWorks article, Integrating IBM Lotus Sametime with the IBM Lotus Quickr Search REST service
• developerWorks article, Introducing the Search and Indexing API in WebSphere Portal V6.0
• developerWorks article, IBM Search and Index APIs (SIAPI) for WebSphere Information Integrator OmniFind Edition
• developerWorks white paper, IBM Search and Index (SIAPI) V6.0 Javadoc
• WebSphere Portal product documentation
• WebSphere Portal zone

12 About the authors

Eitan Shapiro holds a BSc degree in Information Systems Engineering from the Technion, Haifa, Israel. He joined IBM in 2005 and is the Team Lead of the Haifa Search Technologies Team, which develops search solutions for WebSphere Portal and Lotus Quickr.

Constantin Radchenko holds a BSc degree in Software Engineering from the Technion, Haifa, Israel. He joined IBM in 2006, where he is a software developer on the Haifa Enterprise Information Discovery Team, which develops search solutions for WebSphere Portal and Lotus Quickr.

**********************************************************************
Trademarks
• developerWorks, IBM, and WebSphere are trademarks or registered trademarks of IBM Corporation in the United States, other countries, or both.
• Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
• Other company, product, and service names may be trademarks or service marks of others.
**********************************************************************