Making content searchable anywhere using IBM® WebSphere® Portal’s publishing Seedlist Framework

Eitan Shapiro
IBM Software Group, Information Management
Haifa HA, Israel
Constantin Radchenko
Software Developer: WPLC.J2EE/Portal/WebSphere-based technology
IBM Software Group, Information Management
Haifa HA, Israel
January 2009
© Copyright International Business Machines Corporation 2009 All rights reserved.
Abstract: If you are developing an enterprise document-management application that serves as a
platform to generate, manage, and publish content, you might wonder, “How can I make all the
content available to end users in an effective and usable manner?” The answer is to enable users
to search the content of the entire product.
This white paper describes how you can make your published enterprise application content
available for crawling by IBM® search engines such as IBM WebSphere® Portal Search Engine
and IBM OmniFind Enterprise Edition 8.5. Learn how to achieve this functionality by using
Content Provider Framework, also known as Seedlist Framework, and more specifically by
implementing a simple set of APIs that returns the publishing content while handling critical
aspects of security, rich metadata, and effective updates.
Page 1 of 26
Contents
1 Introduction
2 Overview
3 Prerequisites
4 Seedlist Framework architecture
5 Experimenting with the FileSystem Seedlist
6 FileSystem Seedlist Servlet
7 Implementing FileSystem Retriever
7.1 FileSystem Retriever structure
7.2 Retriever package implementation
7.2.1 RetrieverFactory interface implementation
7.2.2 RetrieverService interface implementation
7.2.3 Guidelines for Results interfaces implementation
7.2.4 Incremental publishing for incremental crawling
8 WebSphere Portal Seedlist Framework
9 Conclusion
10 Downloads
11 Resources
12 About the authors
1 Introduction
This white paper provides a full overview of IBM WebSphere Portal’s Content Provider
Framework, also known as Seedlist Framework. This framework was developed to publish
content from different IBM repositories in a single format that is rich enough to handle the
requirements of these repositories.
This format is called Seedlist format, and it can be viewed as an extension of the Sitemap
format, the de facto standard for publishing content on the World Wide Web. The Seedlist
format is based on the ATOM syndication format [RFC4287].
The need to develop a single format emerged from the following observations:
• Search engines cannot develop crawlers fast enough to keep pace with the proliferation
of new internal content sources and new third-party content systems.
• Standard Web crawling is becoming more and more inefficient because Web content is
created and changed more rapidly today than ever before. The crawler crawls an
ever-growing set of documents, while it actually needs only the delta of modified or newly
created documents.
• Web crawling can't reach all content; for example, most crawlers can't follow links that
are manipulated by JavaScript™ code.
• Content metadata is growing rapidly. It needs to be indexed in a generic and consistent
way among all types of content.
The Seedlist Framework and Seedlist format comprise an application-independent mechanism
to integrate heterogeneous content sources with a search engine. Any new content can be
crawled without the need to wait for the search engine to build a custom crawler.
ATOM/REST-based integration makes it easy for content sources anywhere on any platform to
be accessed by a search engine. Content owners decide the level of metadata, including
security information, to be crawled, and they can ensure that only recently changed documents
get crawled. Content owners also manage the relationship between the pieces of content they
manage, and how and where those pieces get displayed in the hierarchical structure.
Details about the Seedlist format can be found in the paper "A Richer Format for Providing
Content Crawling Metadata" in the "Downloads" section. This paper can help you better
understand the concept of a Seedlist and how to implement the required APIs in Seedlist
Framework for publishing your repository content.
2 Overview
The following sections of this paper explain more about the Framework, what APIs must be
implemented, and how to implement them. An example of publishing a file system using the
Seedlist format is also provided. The full code of the example is included in the Downloads
section and can be easily tried out on any J2EE server.
After you have implemented the required APIs over your content system and installed the
Seedlist Framework, you can test your Seedlist by using the built-in Seedlist Crawler in
WebSphere Portal 6.1. For more information, refer to the WebSphere Portal 6.1 Information
Center and search for the section "Crawling an external site using a seedlist."
3 Prerequisites
To get the most from this white paper, you should have a good understanding of J2EE
technology, REST-style architecture, and ATOM syndication format. To use the supplied code,
you also need a basic understanding of how to handle servlets.
This paper includes the source code for a full implementation to generate a seedlist for a file
system, and we describe how the mandatory APIs should be implemented. To use Seedlist
Framework and to generate a FileSystem Seedlist, you need a J2EE server such as IBM
WebSphere Application Server. To crawl through the FileSystem Seedlist and to make the
content searchable, you need a Seedlist Crawler such as the crawler in WebSphere Portal 6.1.
After the generated FileSystem Seedlist has been crawled by the Seedlist Crawler, you can
search the content from WebSphere Portal Search Center. After you implement the mandatory
APIs in your own content system, you can repeat the experiment, crawl through your own
system content, and search it from WebSphere Portal Search Center.
4 Seedlist Framework architecture
Seedlist Framework is built on four main components. The consumer of the framework is a
Crawler, which uses the Seedlist Framework to read the content and index it in a search engine.
The flow between the components is a serial process. Every full cycle is triggered by the
Seedlist Crawler and ends when the Crawler receives the Seedlist with the requested list of
documents.
The Framework is designed to allow applications to enhance two aspects of the system, content
retrieval and output formatting:
• The Retriever package is the API for content retrieval. This package is intended to be
overridden by every application that wants to generate a Seedlist using the Seedlist
Framework. We will show how to implement a Retriever package, which can be used as
reference code for any other implementation, over a file system.
• The Formatter package is the API for output formatting. The Seedlist Formatter package
is ready for use; you do not need to implement your own formatter. The Seedlist
Framework can, however, work with your own formatter.
Another part of Seedlist Framework that is product specific is the Servlet. Most of its work is
delegated to the Seedlist Service that is part of the Seedlist Framework. In the Servlet code, you
are expected to instantiate the correct Seedlist Retriever and Formatter and pass those objects
to the Seedlist Service.
In WebSphere Portal, the Servlet is based on the Eclipse extension point mechanism that is
built into WebSphere Application Server 6.1. This architecture lets you install Retrievers and
Formatters on a WebSphere Portal server, on which Seedlist Framework detects them
automatically. The high-level architecture of Seedlist Framework in WebSphere Portal is shown
in figure 1.
NOTE: This paper does not cover the details of writing a Seedlist Retriever as an extension
point. Instead, our focus is on a solution that can run on any J2EE server rather than only on a
WebSphere Portal server.
The WebSphere Portal solution is described to emphasize that the Servlet can be implemented
in different ways and that any content system on a WebSphere Portal server that wants to use
the Seedlist does not need to supply its own Servlet. The content system needs to supply only
Seedlist Retriever as an extension point.
Figure 1. Seedlist Framework high-level architecture
[Figure not reproduced. Labeled elements: Application Space (Content Provider provides Retriever
implementation); Search Engine Space (Seedlist Formatter implementation); Retriever; Formatter;
Seedlist Service; Portal Seedlist Servlet; Seedlist Crawler; Search Index; Extension Point
Registry. Labeled flows: raw data; Seedlist; HTTP request with specific Retriever and Formatter;
get extension for Retriever and Formatter. Annotation: the WebSphere Portal Seedlist Servlet uses
extension points to find the relevant Retriever and Formatter implementations.]
The Seedlist Framework components are described in Table 1.
Table 1. Seedlist Framework components

Name: Seedlist Servlet
Description: Seedlist Servlet is the backbone of the Framework. The Servlet
• processes HTTP requests for actions getChildren, getDocuments
• creates relevant Retriever and Formatter
• delegates the request to the Seedlist Service
• returns the output (Seedlist) to the Crawler
Comments: Product specific; e.g., in WebSphere Portal it includes code to support Virtual portals

Name: Seedlist Service
Description: Seedlist Service is the backbone of the Framework. The Service
• processes requests for actions: getChildren/getDocuments
• delegates them to an appropriate Retriever
• passes result data to Formatter
• returns the output (Seedlist) to the Seedlist Servlet
Comments: Part of the Seedlist Framework

Name: Retriever
Description: Retriever obtains content from specific content repository. It gets
• children entries (sub-Seedlists and immediate documents)
• all descendent documents
• number of all descendent documents without their retrieval
Comments: Must be implemented by the content system

Name: Formatter
Description: Formatter represents retrieved data into a specified format (for example,
ATOM, Google sitemap)
Comments: Ready for you to use, as is, such as Seedlist Formatter

Name: Extension Point Registry
Description: Mechanism to find implementations of the Retriever and Formatter APIs
Comments: Specific for WebSphere Portal implementation
In addition to Seedlist Framework components, there are Seedlist Framework actors (see table
2).
Table 2. Seedlist Framework actors

Name: Seedlist Crawler
Description: Crawls the content that is published in the Seedlist for indexing

Name: Search Index
Description: Is the search engine that indexes the content and lets you search it later
To help you better understand the interactions between the Seedlist Framework components,
the sequence diagram in figure 2 describes a simple scenario of the Crawler that fetches the
first 100 documents from some content system.
Figure 2. Seedlist Framework sequence diagram
[Figure not reproduced. Participants: Seedlist Crawler, Seedlist Servlet, Seedlist Service,
Retriever, Formatter, Extension Point Registry. Messages, in order:
1. Crawler asks for the first 100 documents
2. Creates Retriever and Formatter based on the passed extension point IDs
3. Passes request with parameters, Retriever and Formatter
4. Asks for the first 100 documents
5. Fetches 100 documents from the content system
6. Sends the returned documents for formatting
7. Formats 100 documents in Seedlist
8. Returns formatted output
9. Returns formatted output]
For details about the parameters that are passed by the calls between the Crawler and the
Servlet, see the document "Seedlist Framework REST API" in the Downloads section.
5 Experimenting with the FileSystem Seedlist
From reading about Seedlist Framework you probably already understand that you must
implement your own Servlet over the Seedlist Service and also implement the Retriever
package before you can use the Framework for generating a Seedlist that represents your
system content.
To simplify the process of implementing a Retriever, a complete implementation for a general
file system is supplied in the "Downloads" section. The code can be used as a starting point
and reference code when you implement your own Retriever.
In addition, to simplify the process of implementing a Servlet, we supply a simple Servlet
implementation that supports only the FileSystem Retriever.
This section describes the Servlet code and explains the Seedlist Service API, describes the file
system implementation and explains the Retriever API, raises important points about
implementation, and warns you about pitfalls. This section also contains recommendations and
general conventions for Seedlist Retriever implementation.
Before you read further, install the Seedlist Framework and try to generate a Seedlist for your
own file system. Refer to the Downloads section for the file IlelSeedlist.zip that includes the file
seedlist-installation-guide.doc.
After installation, try to generate a Seedlist for your file system. Also, try to generate a Seedlist
for the list of directories in one of your directories. The following URL returns all folders under
the directory C:\ibm\wp_profile (the encoded form is C%3A%5Cibm%5Cwp_profile):
http://<host>:<port>/seedlist/myserver?Format=html&SeedlistId=C%3A%5Cibm%5Cwp_profile&Action=GetChildren
Type this URL in your Web browser, using the correct host, port, and directory name for your
site. Notice that the supplied action, GetChildren, returns all content items below the supplied
folder. Also notice that we use HTML format and not ATOM format, which is the default format
expected by the Seedlist Crawler. In Figure 3 you can see the output of the URL on one of our
servers in HTML format.
Figure 3. FileSystem Seedlist example in HTML format to get folders
You can also ask for documents from a specific folder in your file system by using the
GetDocuments action, as seen in this URL:
http://<host>:<port>/seedlist/myserver?Format=html&SeedlistId=C%3A%5Cibm%5Cwp_profile&Action=GetDocuments
Figure 4 shows the output of the URL on one of our servers in HTML format.
Figure 4. FileSystem Seedlist example in HTML format to get files
For more information about URL formats and the different options, see "Seedlist Framework
REST API" in the Downloads section.
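The request URLs shown above can also be assembled programmatically. The following is a minimal sketch and not part of the supplied code: the class SeedlistUrlBuilder and its buildUrl method are hypothetical names, and the host and port in main are placeholders. Only the parameter names (Format, SeedlistId, Action) and the servlet path are taken from the example URLs. The sketch uses the standard java.net.URLEncoder to encode the SeedlistId value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of Seedlist Framework): builds request URLs
// in the same shape as the example URLs in this section.
class SeedlistUrlBuilder {

    // baseUrl is, for example, "http://<host>:<port>/seedlist/myserver"
    static String buildUrl(String baseUrl, String format, String seedlistId, String action) {
        try {
            // URLEncoder turns ':' into %3A and '\' into %5C
            String encodedId = URLEncoder.encode(seedlistId, "UTF-8");
            return baseUrl + "?Format=" + format
                    + "&SeedlistId=" + encodedId
                    + "&Action=" + action;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://localhost:10040/seedlist/myserver"; // placeholder host and port
        System.out.println(buildUrl(base, "html", "C:\\ibm\\wp_profile", "GetChildren"));
        System.out.println(buildUrl(base, "html", "C:\\ibm\\wp_profile", "GetDocuments"));
    }
}
```

Encoding the folder name in code avoids hand-typing escape sequences such as %3A and %5C when you experiment with different folders.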
The next step is to crawl through the Seedlist and to index its content by using the WebSphere
Portal Seedlist Crawler. For details on setting up such a collection and Crawler, refer again to
the section "Crawling an external site using a Seedlist" in the WebSphere Portal 6.1 Information
Center.
The important thing to notice is that most of the files on your system are not indexed because
the WebSphere Portal Crawler does not support file://<file path> URI format. Only text
files are indexed with their content because the content is published inline. Figure 5 shows an
example of search results of a file system collection.
Figure 5. Search results of a file system collection
After you’ve generated the FileSystem Seedlist and the content has been crawled and indexed,
the next step is to learn more about the Seedlist Servlet that works over the Seedlist Service
and about the Retriever API and its implementation.
6 FileSystem Seedlist Servlet
This paper addresses two implementations of the Seedlist Servlet:
• The WebSphere Portal implementation that is described in the “Seedlist Framework
architecture” section.
• A simple Servlet implementation that is written specifically to support only the
FileSystem Retriever. This Servlet can be used as a simple reference for how to write
your own Servlet over the Seedlist Service API for your own system. The source code
can be extracted from the ilel-seedlist.Servlet.ear package in the Downloads section.
Seedlist Servlet processes Seedlist REST requests. Three actions are supported:
• GetChildren
• GetDocuments
• GetNumberOfDocuments
Most of the functionality is included within the Seedlist Service. The main purposes of the
Servlet are to instantiate the correct Retriever and Formatter and to delegate the action to the
Seedlist Service.
The Seedlist Service is available through the com.ibm.ilel.seedlist.service.SeedlistService
interface. The Service can be obtained from the com.ibm.ilel.seedlist.service.SeedlistFactory
interface during Servlet initialization. You can find the javadoc for SeedlistFactory and
SeedlistService in the ilel-seedlist-javadoc.zip file in the Downloads section.
Listing 1 shows how the HTTP Seedlist request is handled in the Servlet.
Listing 1. Handling the HTTP Seedlist Servlet request
// obtained during the Servlet initialization process
static final Map<String, FormatterFactory> ffactories;
static final Map<String, Properties> fproperties;
SeedlistFactory slfactory;

Action action = slfactory.createAction(request, response);
// no security for FileSystem Seedlist Retriever
action.setUserCredentials(null, null);
Properties validParams = action.validate();
// validated Servlet-specific parameters
Properties validServletParams = getServletParams(request);
// obtain retriever factory and init properties
RetrieverFactory rfactory =
    new com.ibm.ilel.seedlist.retriever.filesystem.RetrieverFactoryImp();
Properties rprops = null; // FileSystem Seedlist Retriever uses no Properties
// format parameter value is case-insensitive: atom/ATOM
String ftype = validServletParams.getProperty(FORMAT_PARAM).toLowerCase();
// obtain formatter factory and init properties
FormatterFactory ffactory = ffactories.get(ftype);
Properties fprops = fproperties.get(ftype);
SeedlistService slservice = slfactory.createSeedlistService(
    rfactory, rprops, ffactory, fprops);
URLResolver urlResolver = new UrlResolverImp(request, validParams, validServletParams);
slservice.handleRequest(action, urlResolver, validParams);
While handling an HTTP Seedlist request, SeedlistFactory creates an appropriate Action object
according to the request parameters. The created Action object validates the Servlet request
parameters and throws a SeedlistException if the parameters are not valid. The
Action object returns a list of valid parameters as a Properties object that is passed later to the
SeedlistService for handling the request.
The next step is to create the relevant Seedlist Retriever and Formatter factories that are used
to create the Retriever and Formatter services inside SeedlistService. All these objects are
passed to SeedlistFactory to create a SeedlistService instance that handles the request itself.
The main function of the SeedlistService is handleRequest(), which executes the request. The
function gets the action, a list of valid parameters, and URLResolver. The main function of the
service is to obtain the documents from the Retriever and pass those documents to the
specified Formatter.
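The delegation described above can be pictured with a minimal sketch. The types below (SketchRetriever, SketchFormatter, SketchSeedlistService) are simplified, hypothetical stand-ins and not the actual Seedlist Framework interfaces, whose signatures are documented in the javadoc. The sketch shows only the order of the calls: the service asks the Retriever for raw data and hands that data to the Formatter.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Retriever: returns raw document names.
interface SketchRetriever {
    List<String> getDocuments(String seedlistId);
}

// Hypothetical stand-in for a Formatter: turns raw data into output text.
interface SketchFormatter {
    String format(List<String> documents);
}

// Hypothetical stand-in for the Seedlist Service: it holds no content logic,
// it only connects the Retriever to the Formatter.
class SketchSeedlistService {
    private final SketchRetriever retriever;
    private final SketchFormatter formatter;

    SketchSeedlistService(SketchRetriever retriever, SketchFormatter formatter) {
        this.retriever = retriever;
        this.formatter = formatter;
    }

    String handleRequest(String seedlistId) {
        List<String> documents = retriever.getDocuments(seedlistId); // fetch raw data
        return formatter.format(documents);                          // format it for the caller
    }

    public static void main(String[] args) {
        SketchSeedlistService service = new SketchSeedlistService(
                id -> Arrays.asList(id + "/a.txt", id + "/b.txt"),
                docs -> String.join("\n", docs));
        System.out.println(service.handleRequest("root"));
    }
}
```

Because the service depends only on the two interfaces, a different content source or output format can be plugged in without changing the service itself, which is the property the Framework relies on.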
To handle formatting properly, SeedlistService needs a URLResolver instance, which provides the
Seedlist URLs of the following pages:
• current Seedlist page
• next and previous Seedlist feed pages
• feed page that represents Documents (GetDocuments action)
• feed page that represents Children (GetChildren action)
Seedlist Framework provides SimpleUrlResolver, which handles URL generation. Notice that
the generated URLs use either parameters that are passed on request or default parameters.
For example, if parameter Range is not passed on a Servlet request, its default value is used by
URLResolver when the URL is created.
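The fallback to a default value can be illustrated with a minimal sketch. This is not the SimpleUrlResolver code: the class DefaultingUrlSketch, its method, and the default value "100" are hypothetical and chosen only for illustration. The parameter name Range comes from the text above.

```java
import java.util.Properties;

// Hypothetical sketch of "use the request parameter if present, else a default".
class DefaultingUrlSketch {
    static final String RANGE_PARAM = "Range";
    static final String RANGE_DEFAULT = "100"; // assumed value, for illustration only

    // Builds a page URL; requestParams holds the parameters passed on the request.
    static String pageUrl(String baseUrl, Properties requestParams) {
        // getProperty(key, default) returns the default when the key is absent
        String range = requestParams.getProperty(RANGE_PARAM, RANGE_DEFAULT);
        return baseUrl + "?" + RANGE_PARAM + "=" + range;
    }

    public static void main(String[] args) {
        Properties none = new Properties();
        Properties explicit = new Properties();
        explicit.setProperty(RANGE_PARAM, "50");
        System.out.println(pageUrl("http://localhost/seedlist/myserver", none));
        System.out.println(pageUrl("http://localhost/seedlist/myserver", explicit));
    }
}
```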
The Servlet developer can provide a custom URLResolver. For example, the FileSystem Seedlist
Servlet uses com.ibm.ilel.seedlist.Servlet.url.UrlResolverImp, which is an extension of
SimpleUrlResolver that overrides the addServletParams() function. A developer can replace it
with another implementation.
7 Implementing FileSystem Retriever
In the FileSystem Seedlist the hierarchy of content is built from folders and files. The folders are
the Seedlists and Sub-Seedlists. The files are the leaves of the content tree, which we refer to
as Documents. A Seedlist differs from a Document in that a Seedlist can contain other Seedlists
or Documents.
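The mapping from folders and files to Seedlists and Documents can be sketched with the standard java.io.File API. This is an illustrative sketch only, not the supplied FileSystem Retriever code; the class FolderChildrenSketch and its members are hypothetical names.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits the direct children of a folder into
// sub-Seedlists (folders) and Documents (files).
class FolderChildrenSketch {
    final List<File> seedlists = new ArrayList<File>();
    final List<File> documents = new ArrayList<File>();

    static FolderChildrenSketch of(File folder) {
        FolderChildrenSketch result = new FolderChildrenSketch();
        // listFiles() returns null if the path is not a directory or cannot be read
        File[] children = folder.listFiles();
        if (children != null) {
            for (File child : children) {
                if (child.isDirectory()) {
                    result.seedlists.add(child); // can contain further entries
                } else {
                    result.documents.add(child); // leaf of the content tree
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FolderChildrenSketch children = of(new File(System.getProperty("user.dir")));
        System.out.println("Seedlists: " + children.seedlists.size());
        System.out.println("Documents: " + children.documents.size());
    }
}
```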
Before going into more detail, we recommend you obtain the source code of the FileSystem
Retriever, which can be extracted from the ilel-seedlist.filesystem.jar package in the Downloads
section.
7.1 FileSystem Retriever structure
After extracting the FileSystem Retriever code, notice that the root package is defined as
com.ibm.ilel.seedlist.retriever.filesystem and contains several sub-packages:
• com.ibm.ilel.seedlist.retriever.filesystem.imp (contains implemented/extended classes
that are specific for the FileSystem Retriever)
• com.ibm.ilel.seedlist.retriever.filesystem.test (contains JUnit tests for the
FileSystem Retriever)
• com.ibm.ilel.seedlist.retriever.filesystem.resources (contains the resource bundle
properties file that is used for translation of error messages and other localized notes)
7.2 Retriever package implementation
When you open the supplied Seedlist Framework API javadoc, ilel-seedlist-javadoc.zip, you can
see that the API is organized in seven different packages, each of which you can review,
reading more about each package and its purpose.
The package that we focus on is the Retriever package. Listing 2 specifies the list of interfaces
that are mandatory for a fully functional Retriever. The full list of required interfaces to
implement the Retriever is much longer, but other interfaces have their default implementation
that can be easily used, as is. The default implementation of the interfaces can be found in
ilel-seedlist.jar under the package com.ibm.ilel.seedlist.imp, and it was designed to be
extended, if needed.
Listing 2. Mandatory interfaces for a Retriever implementation
com.ibm.ilel.seedlist.retriever.RetrieverFactory
com.ibm.ilel.seedlist.retriever.RetrieverService
com.ibm.ilel.seedlist.common.EntrySet
com.ibm.ilel.seedlist.common.Document
com.ibm.ilel.seedlist.common.Seedlist
Here we describe the implementation of the five mandatory interfaces. In general, some of
these interfaces also have abstract implementation classes to make the required development
simpler and faster. The Retriever follows the factory and service design pattern. The main entry
point to the package is the factory interface.
7.2.1 RetrieverFactory interface implementation
RetrieverFactory lets you create the objects that are required for retrieving the content, such as
ApplicationInfo (credential information), RetrieverService, and RetrieverRequest. Your factory must
implement the com.ibm.ilel.seedlist.retriever.RetrieverFactory interface, and it can extend the
com.ibm.ilel.seedlist.imp.AbstractRetrieverFactory class, which provides a basic implementation to
create the ApplicationInfo instance. Listing 3 shows the implementation of RetrieverFactory for
the FileSystem.
Listing 3. Code sample from the RetrieverFactory implementation
public class RetrieverFactoryImp extends AbstractRetrieverFactory {

    // The Servlet request and response are not used by the FileSystem Retriever
    public RetrieverService getRetrieverService(Properties prop,
            HttpServletRequest servletRequest,
            HttpServletResponse servletResponse) throws SeedlistException {
        return new RetrieverServiceImp(prop);
    }

    public RetrieverRequest createRequest(String seedlistId) throws SeedlistException {
        return new RetrieverRequestImp(seedlistId);
    }

    // Unique string that defines the type and version of this implementation
    public String getVersion() {
        return "FileSystem 1.0";
    }
}
NOTE: It is recommended to define the correct Retriever versioning string rather than to use the
default string from the AbstractRetrieverFactory class. The versioning string can be any unique
string that defines the type and version of the implementation.
Notice that, despite the fact that the function getRetrieverService() gets a Servlet request and
response as parameters, these parameters are not passed to the RetrieverServiceImp
constructor. Although these additional parameters are not used in the FileSystem Retriever
implementation, they might be used in other implementations.
If needed, the Retriever can read any HTTP parameters that are passed on the request and that
are not covered by the Seedlist specification. Passing the Servlet request and response means
that a new service must be created for each new request to the Servlet.
7.2.2 RetrieverService interface implementation
After obtaining a Retriever request object and a Retriever service instance from
RetrieverFactory, we can further observe the service, which implements the
com.ibm.ilel.seedlist.retriever.RetrieverService interface.
Implementation of the RetrieverServiceImp constructor depends strictly on the Retriever itself,
and usually the constructor initializes internal services for a particular content model. For
example, FileSystem Retriever initializes rootSeedlistId, which is the root folder for a published
file system. Listing 4 shows the constructor of the FileSystem Retriever Service.
Listing 4. Code sample of RetrieverService constructor
public class RetrieverServiceImp implements RetrieverService {
    ...
    // FileSystem root SeedlistId property
    private static final String ROOT_SEEDLIST_ID_PROP = "RootSeedlistId";
    // default value for root SeedlistId property
    private static final String ROOT_SEEDLIST_ID_DEFAULT = "C:" + File.separator;
    // FileSystem root Seedlist (obtained from service properties)
    private static File rootSeedlist;

    public RetrieverServiceImp(Properties properties) {
        String rootSeedlistId = properties.getProperty(
            ROOT_SEEDLIST_ID_PROP, ROOT_SEEDLIST_ID_DEFAULT);
        rootSeedlist = new File(rootSeedlistId);
    }
    ...
}
This service class must implement the three functions that are listed in Listing 5. Notice that the
functions getDocuments() and getChildren() have equivalent actions at the REST API
level.
Listing 5. Three mandatory functions that every Retriever service must implement
public int getNumberOfDocuments(ApplicationInfo appInfo, RetrieverRequest request)
    throws SeedlistException;
public EntrySet getDocuments(ApplicationInfo appInfo, RetrieverRequest request)
    throws SeedlistException;
public EntrySet getChildren(ApplicationInfo appInfo, RetrieverRequest request)
    throws SeedlistException;
The relevant function is called by the Seedlist Servlet based on the action passed in the REST
call. The call to getDocuments() is relevant when the Crawler only wants to go through all the
documents one after another. The call to getChildren() is relevant when the Crawler wants
to traverse the tree structure of the content repository.
For clarity and simplicity we define a new class called FileSystemContent. This class has
functionality that is similar to RetrieverService. The next sections present the implementation of
RetrieverService, a wrapper over the FileSystemContent class. Detailed file system information
is not given so that you can concentrate on the general flow.
Listing 6 is a code sample that implements the getChildren() function. This function returns
folders and documents that are directly under the specified Seedlist ID, a folder in the file
system.
Listing 6. Code sample that implements the getChildren function
// Get the folder to start from
File seedlist = getSeedlist(request);
// Use the last update date to filter the returned files
Date lastUpdateDate = getLastUpdateDate(request);
FileSystemContent fsContent = new FileSystemContent(request);
fsContent.traverseChildren(seedlist, lastUpdateDate);
List<Document> documents = fsContent.getDocuments();
List<Seedlist> seedlists = fsContent.getSeedlists();
EntrySetImp entrySet = new EntrySetImp(documents, seedlists, request);
/*
 * Return the Timestamp only for the first request in a session, so that
 * Documents added during pagination (in the same session) are not missed
 */
if (isFirstRequest(request)) {
    entrySet.setTimestamp(createNewTimestamp(fsContent.getTimestamp()));
}
return entrySet;
The first step in the getChildren() function is to obtain the Seedlist ID from the passed request
and create from it a com.ibm.ilel.seedlist.common.Seedlist object that is suitable for the
Retriever. Note that this step is also the first step in the functions getNumberOfDocuments() and
getDocuments().
For the FileSystem Retriever, the Seedlist ID represents some folder. If no Seedlist ID is
specified, the root folder of the file system is returned. Listing 7 displays a code sample to get
the requested Seedlist object in the FileSystem Retriever.
Listing 7. Code sample to get the requested Seedlist object
private File getSeedlist(RetrieverRequest request) {
String id = request.getSeedlistId();
return (id == null || id.length() <= 0) ? rootSeedlist : new File(id);
}
The required Documents and Seedlists (file system folders) are obtained by use of the
FileSystemContent class. Now you can create a result object that implements the EntrySet
interface. We will discuss its implementation and relevant guidelines later in this paper.
Listing 8 shows a code sample to implement the getDocuments() function. This function returns
all the documents under the specified Seedlist ID that represents a folder in the file system.
Notice that, when EntrySet is created, only documents are passed. Null is passed for the list of
Seedlists because the action type must return only documents and not folders.
Listing 8. Code sample to implement the getDocuments function
// Initialize current request session parameters:
// get the folder to start from
File seedlist = getSeedlist(request);
// Use the last update date to filter the returned files
Date lastUpdateDate = getLastUpdateDate(request);
FileSystemContent fsContent = new FileSystemContent(request);
fsContent.traverseDocuments(seedlist, lastUpdateDate);
List<Document> documents = fsContent.getDocuments();
// Only documents are passed; null is passed for the list of Seedlists
EntrySetImp entrySet = new EntrySetImp(documents, null, request);
/*
 * Return the Timestamp only for the first request in a session, so that
 * Documents added during pagination (in the same session) are not missed
 */
if (isFirstRequest(request)) {
    entrySet.setTimestamp(createNewTimestamp(fsContent.getTimestamp()));
}
return entrySet;
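The traversal that Listing 8 delegates to FileSystemContent is not shown in this paper. As a rough illustration only, the following sketch shows one way to collect all descendant files of a folder and to filter them by a last update date. The class DescendantDocumentsSketch and its methods are hypothetical names, and treating a null date as "no filter" is an assumption; the supplied FileSystemContent class might work differently.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical sketch of a recursive document traversal with a date filter.
class DescendantDocumentsSketch {

    // Returns every file below the folder that was modified after lastUpdateDate.
    // A null lastUpdateDate is treated as "no filter" (assumption of this sketch).
    static List<File> collect(File folder, Date lastUpdateDate) {
        List<File> documents = new ArrayList<File>();
        walk(folder, lastUpdateDate, documents);
        return documents;
    }

    private static void walk(File folder, Date lastUpdateDate, List<File> out) {
        File[] children = folder.listFiles(); // null if not a readable directory
        if (children == null) {
            return;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                walk(child, lastUpdateDate, out); // descend into sub-folders
            } else if (lastUpdateDate == null
                    || child.lastModified() > lastUpdateDate.getTime()) {
                out.add(child);
            }
        }
    }

    public static void main(String[] args) {
        File start = new File(System.getProperty("user.dir"));
        System.out.println(collect(start, null).size() + " documents found");
    }
}
```

The sketch does not guard against cyclic links in the file system, which a production traversal would need to consider.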
Listing 9 shows a code sample to implement the getNumberOfDocuments() function. This
function returns the number of documents under the specified Seedlist ID, a folder in the file
system. Its flow is similar to the getDocuments() function in the FileSystem Retriever.
Listing 9. Code sample to implement the getNumberOfDocuments function
// Initialize current request session parameters: get the folder to start from
File seedlist = getSeedlist(request);
// Use the last update date to filter the counted files
Date lastUpdate = getLastUpdateDate(request);
FileSystemContent fsContent = new FileSystemContent(request);
return fsContent.countDocuments(seedlist, lastUpdate);
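The FileSystemContent class itself is not listed in this paper. As a rough illustration of the kind of
work that countDocuments() has to do, the following minimal sketch counts the files under a folder
that were modified after a given date. The class name UpdatedFileCounter and its method are
hypothetical and are not part of the Seedlist Framework API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Date;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of the Seedlist Framework API. */
final class UpdatedFileCounter {

    /**
     * Counts regular files under root (recursively) whose last-modified time
     * is strictly after lastUpdate. A null lastUpdate counts every file.
     */
    static long countUpdatedFiles(Path root, Date lastUpdate) throws IOException {
        final long threshold = (lastUpdate == null) ? Long.MIN_VALUE : lastUpdate.getTime();
        // Files.walk keeps directory handles open, so close the stream when done
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toFile().lastModified() > threshold)
                    .count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Count all files under the folder given as first argument (default: current folder)
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(countUpdatedFiles(root, null));
    }
}
```

Treating a null date as "count everything" mirrors the behavior of a full crawl, in which no last
update date is supplied on the request.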
Optimize Retriever using the State parameter
Having gone through the three main functions of the RetrieverService, note that the code sample
of the function getDocuments() was simplified for clarity: the optimization code was removed.
Listing 10 shows the optimized code that uses the State parameter. The State parameter is an
opaque object that is created by the Retriever itself on the first crawler request and can store
any required information.
This parameter helps the Retriever to jump directly to the correct start content item and to return
the requested page without passing through all objects that were retrieved in previous requests.
The requirement is that the Retriever must be stateless so that different crawlers can crawl
through it at the same time. This means that the state information cannot be managed by the
Retriever. The solution is to pass the data as an opaque state object between the Crawler and
the Retriever throughout the crawling session.
Listing 10. Code sample to implement the getDocuments function
// Initialize current request session parameters: get the folder to start from
File seedlist = getSeedlist(request);
// Use the last update date to filter the returned files
Date lastUpdateDate = getLastUpdateDate(request);
// obtain internal state of FileSystem seedlist retrieving model
String fsState = getInternalState(request);
FileSystemContent fsContent = new FileSystemContent(request);
fsContent.traverseDocuments(seedlist, lastUpdateDate, fsState);
List<Document> documents = fsContent.getDocuments();
EntrySetImp entrySet = new EntrySetImp(documents, null, request);
// create State to use it in the next request for optimization reasons
String newFsState = fsContent.createInternalState();
State newState = createNewState(newFsState, request, documents.size());
entrySet.setState(newState);
/*
* Return Timestamp only for the first request in session, so don't miss
* Documents, if they are added during pagination (in the same session)
*/
if (isFirstRequest(request)) {
entrySet.setTimestamp(createNewTimestamp(fsContent.getTimestamp()));
}
return entrySet;
In the FileSystem example, the following occurs:
1. The State is encoded as a concatenation of the start index and some internal state of the
FileSystem content model.
2. The prefixed start index is used only for validation of the received State parameter.
3. The FileSystem content model stores the absolute path of the last retrieved file.
4. The State object is returned by the EntrySet interface and is appended to the "next page"
URL that is supplied in the Seedlist format.
5. When the crawler gets to this "next page" URL and accesses it, the State object is
passed back to the Retriever as an HTTP parameter in a manner that is transparent to
the crawler.
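To see why storing the absolute path of the last retrieved file lets the Retriever skip previously
returned items, consider the following minimal sketch. It assumes that the content model visits
files in sorted path order; the class ResumablePager is hypothetical and only stands in for the
internal logic of FileSystemContent, which is not shown in this paper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch; the real FileSystemContent traversal is not shown in this paper. */
final class ResumablePager {

    /**
     * Returns up to pageSize paths that sort strictly after lastPath.
     * A null lastPath means "start from the beginning".
     */
    static List<String> nextPage(TreeSet<String> sortedPaths, String lastPath, int pageSize) {
        // tailSet(lastPath, false) jumps directly past the last retrieved file
        Iterable<String> remaining = (lastPath == null)
                ? sortedPaths
                : sortedPaths.tailSet(lastPath, false);
        List<String> page = new ArrayList<String>();
        for (String path : remaining) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(path);
        }
        return page;
    }

    public static void main(String[] args) {
        TreeSet<String> paths = new TreeSet<String>(
                Arrays.asList("/docs/a.txt", "/docs/b.txt", "/docs/c.txt"));
        String last = null;
        List<String> page = nextPage(paths, last, 2);
        while (!page.isEmpty()) {
            System.out.println(page);
            // The last path of the page plays the role of the internal state
            last = page.get(page.size() - 1);
            page = nextPage(paths, last, 2);
        }
    }
}
```

Because the only information carried between requests is the last path, the paging logic itself
keeps no per-crawler data, which is what allows several crawlers to use it at the same time.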
Notice that the function getChildren() ignores the State parameter because it cannot optimize
retrieval of immediate files or folders. In the function getDocuments(), on the other hand, the
State object is obtained from the received request object and used to get the list of documents,
and then a new State object is generated for the next request. The new State object, which
implements the interface com.ibm.ilel.seedlist.common.State, is set on the EntrySet object that
is returned from the Retriever.
Listing 11 shows how to check the entire request State consistency and to return the FileSystem
state from the Retriever request. Note that the State is considered inconsistent if the start index
defined by the State does not match the start index defined by the Retriever request.
This validity check ensures that the requested pages are sequential; if the requested pages are
not sequential, we cannot optimize the request handling, and thus we must not use the
information from the State object. The two start indexes are always identical while the crawler
uses the published "next page" URLs on the Seedlist.
Listing 11. Obtaining the State from the Retriever request
private String getContentModelState(RetrieverRequest request) throws SeedlistException {
State state = request.getState();
if (isEmptyState(state)) { return null; }
String stateInfo = state.asString();
// start index and internal FileSystem state are separated by '|'
int stateSeparatorIndx = stateInfo.indexOf(STATE_SEPARATOR);
if (stateSeparatorIndx == -1) {
//...throw SeedlistException...
}
// check consistency of start index for state
try {
int startState = Integer.parseInt(
stateInfo.substring(0, stateSeparatorIndx));
if (startState != request.getStartIndex()) {
//...throw SeedlistException...
}
} catch (NumberFormatException e) {
//...throw SeedlistException...
}
// obtain FileSystem content state
return stateInfo.substring(stateSeparatorIndx + 1);
}
After successfully traversing the content, a new State for the next request is created. For the
FileSystem, the State is encoded as a concatenation of the start index and absolute path of the
last retrieved file, or as null, if no files or folders are returned. The State is defined as the
following string: <start_index>|<internal_filesystem_state>.
Listing 12 shows how the prefixed start index is calculated for a new State and how the whole
State for the next request is created. In FileSystem Retriever, a new State is created only if
documents are collected; if no documents are collected, the existing FileSystem State is used in
the next request.
Listing 12. Code sample to create a new opaque State object
private static final char STATE_SEPARATOR = '|';
private State createNewState(String internalState, RetrieverRequest request,
int numberOfEntries) {
if (internalState == null) {
return request.getState();
}
// compose new state as <new_start_index>|<absolute_file_name>
StringBuffer newStateStr = new StringBuffer();
newStateStr.append(request.getStartIndex() + numberOfEntries);
newStateStr.append(STATE_SEPARATOR);
newStateStr.append(internalState);
return new StateImp(newStateStr.toString());
}
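The encoding and validation logic of Listings 11 and 12 can be tried in isolation. The following
self-contained sketch works on plain strings instead of the framework's State and
RetrieverRequest types, and it throws IllegalArgumentException where the Retriever would throw
SeedlistException. The class StateCodec is hypothetical and exists only for this illustration.

```java
/** Hypothetical, string-based stand-in for the State handling in Listings 11 and 12. */
final class StateCodec {

    private static final char STATE_SEPARATOR = '|';

    /** Builds "<next_start_index>|<internal_state>" for the next request. */
    static String encode(int startIndex, int numberOfEntries, String internalState) {
        return (startIndex + numberOfEntries) + String.valueOf(STATE_SEPARATOR) + internalState;
    }

    /**
     * Returns the internal state if the prefixed start index equals the
     * start index of the current request; rejects the state otherwise.
     */
    static String decode(String state, int requestedStartIndex) {
        // Only the first separator matters, so the internal state may itself contain '|'
        int separatorIndex = state.indexOf(STATE_SEPARATOR);
        if (separatorIndex == -1) {
            throw new IllegalArgumentException("Missing separator in state: " + state);
        }
        int stateStartIndex;
        try {
            stateStartIndex = Integer.parseInt(state.substring(0, separatorIndex));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid start index in state: " + state, e);
        }
        if (stateStartIndex != requestedStartIndex) {
            throw new IllegalArgumentException("State does not match requested start index");
        }
        return state.substring(separatorIndex + 1);
    }

    public static void main(String[] args) {
        // First page started at 0 and returned 20 entries; the last one was /docs/report.txt
        String state = encode(0, 20, "/docs/report.txt");
        System.out.println(state);
        System.out.println(decode(state, 20));
    }
}
```

Splitting on the first separator only is a deliberate choice in this sketch: the prefix is always
numeric, so a separator character inside a file path cannot be mistaken for the delimiter.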
7.2.3 Guidelines for Results interfaces implementation
At this point, we still haven’t covered the implementation of EntrySet, Document, and Seedlist
interfaces. Let’s now go through the implementation of these interfaces and define some
implementation guidelines.
A. Recommendations when implementing the com.ibm.ilel.seedlist.common.EntrySet interface:
1. Extend the abstract class com.ibm.ilel.seedlist.imp.AbstractEntrySet. It defines basic class
members as protected and provides default implementations for getter functions.
2. Use Collections.EMPTY_LIST rather than null for the Documents and Seedlists lists in the
EntrySetImp constructor (or setter functions). Note that the getSeedlists() and
getDocuments() functions return an Iterator and must not throw a
NullPointerException when the Documents (or Seedlists) list is empty or null.
3. Most of the data of EntrySet is not set directly by EntrySet but is included in the
com.ibm.ilel.seedlist.common.Metadata object. The Metadata interface includes
categories (com.ibm.ilel.seedlist.common.Category), fields
(com.ibm.ilel.seedlist.common.Field), and access control list (ACLs). Use the default
Metadata implementation class com.ibm.ilel.seedlist.imp.MetadataImp that provides a
full set of required functionalities and that can be extended, if required.
4. Each com.ibm.ilel.seedlist.common.Entry is expected to include relevant metadata such
as fields, categories, and ACLs. Notice that EntrySet fields apply to all
Document and Seedlist Entries. Put fields at the EntrySet level
only if they are relevant to all Entries.
5. At the same time, com.ibm.ilel.seedlist.common.FieldInfos, which includes indexing
instructions, is usually common to all Entries. For example, the FileSystem Retriever
assumes that every Entry includes fields like title, description, and last update date, as
shown in Listing 13.
Listing 13. Creation of FieldInfo parameters for FileSystem Retriever EntrySet
// Message code for Title field name
private static final String MSG_FIELD_TITLE_NAME_INFO = "SEEDLISTRTVFILESYS0201I";
...
// Set field info for title, description and last update date
private FieldInfo[] obtainFieldsInfo(Locale locale)
throws SeedlistException {
FieldInfo[] fieldsInfo = new FieldInfo[3];
fieldsInfo[0] = createFieldInfo(
Field.FIELD_TITLE, MSG_FIELD_TITLE_NAME_INFO,
FIELD_TITLE_DESC, FieldInfo.TYPE_STRING, locale);
fieldsInfo[1] = createFieldInfo(
Field.FIELD_DESCRIPTION, MSG_FIELD_DESCRIPTION_NAME_INFO,
FIELD_DESCRIPTION_DESC, FieldInfo.TYPE_STRING, locale);
fieldsInfo[2] = createFieldInfo(
Field.FIELD_UPDATE_DATE, MSG_FIELD_UPDATE_DATE_NAME_INFO,
FIELD_UPDATE_DATE_DESC, FieldInfo.TYPE_DATE, locale);
return fieldsInfo;
}
// Create FieldInfo with localized message for name
private FieldInfo createFieldInfo(String id, String msgCode, String desc,
int type, Locale locale) throws SeedlistException {
SeedlistMessage field = new
SeedlistMessage(SeedlistMessage.SEVERITY_INFORMATIONAL, msgCode,
RetrieverFactoryImp.FILESYSTEM_RETRIEVER_RES_BUNDLE_NAME);
return new FieldInfoImp(id, field.getMessage(locale), desc, type);
}
For this reason, the FileSystem Retriever sets up FieldInfos on the EntrySet rather than
separately on every Entry. A FieldInfo on the EntrySet does not harm Entries that have no
such fields. Therefore, it is recommended to define all FieldInfos on the EntrySet and not on
each Entry, to decrease the size of the returned EntrySet and of the data transmitted over the
network.
Also in Listing 13, note that names are translated based on the locale via the
SeedlistMessage class.
6. EntrySet also contains categories that are common to all Entries in the returned set. Listing
14 shows the definition for content source category of FileSystem Retriever.
Recall that Seedlists differ from Documents in that a Seedlist can contain other Seedlists or
Documents. Seedlists represent nodes in the content hierarchy, whereas Documents represent
leaves.
Listing 14. Code sample of how to create field parameters for FileSystem Retriever EntrySet
MetadataImp metadata = new MetadataImp();
metadata.setFields(obtainFields(dir));
...
setMetadata(metadata);
...
private Field[] obtainFields(File dir) {
ArrayList<Field> fields = new ArrayList<Field>();
fields.add(new FieldImp(dir.getName(), Field.FIELD_TITLE));
fields.add(new FieldImp(dir.getAbsolutePath(), Field.FIELD_DESCRIPTION));
fields.add(new FieldImp(new Date(dir.lastModified()), Field.FIELD_UPDATE_DATE));
return fields.toArray(new Field[fields.size()]);
}
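Recommendation A.2 above (prefer an empty list over null so that the Iterator getters never fail)
can be sketched as follows. The class NullSafeEntrySet is hypothetical: it does not extend
AbstractEntrySet and only shows the null-to-empty normalization. It uses the typed
Collections.emptyList() method, which returns the same immutable empty list as
Collections.EMPTY_LIST without an unchecked warning.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration; it does not extend the framework's AbstractEntrySet. */
final class NullSafeEntrySet<D, S> {

    private final List<D> documents;
    private final List<S> seedlists;

    NullSafeEntrySet(List<D> documents, List<S> seedlists) {
        // Normalize null to an empty list once, so the getters need no null checks
        this.documents = (documents == null) ? Collections.<D>emptyList() : documents;
        this.seedlists = (seedlists == null) ? Collections.<S>emptyList() : seedlists;
    }

    /** Never throws NullPointerException, even if null was passed to the constructor. */
    Iterator<D> getDocuments() {
        return documents.iterator();
    }

    /** Never throws NullPointerException, even if null was passed to the constructor. */
    Iterator<S> getSeedlists() {
        return seedlists.iterator();
    }

    public static void main(String[] args) {
        // Documents only, as in a getDocuments() result; no Seedlists are passed
        NullSafeEntrySet<String, String> entrySet =
                new NullSafeEntrySet<String, String>(Arrays.asList("a.txt", "b.txt"), null);
        System.out.println(entrySet.getSeedlists().hasNext());
    }
}
```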
B. Recommendations when implementing com.ibm.ilel.seedlist.common.Seedlist interface:
1. Extend the abstract class com.ibm.ilel.seedlist.imp.AbstractSeedlist. Similar to
AbstractEntrySet, it defines basic class members as protected, and it provides default
implementation for getter functions.
2. In the same manner as EntrySetImp, SeedlistImp sets most of its properties through the
Metadata object. Use the default Metadata implementation class
com.ibm.ilel.seedlist.imp.MetadataImp that provides a full set of required functionalities
and that can be extended, if required. Listing 14 above displays the code sample to
define several folder fields, such as the last modification time of the folder.
C. Recommendations when implementing com.ibm.ilel.seedlist.common.Document interface:
1. Extend the abstract class com.ibm.ilel.seedlist.imp.AbstractDocument.
2. Each piece of content has two URLs: one for its display page and one for its content or
crawl page. When crawling through content in the enterprise, the search engines can
analyze the “crawl” URL while redirecting users to a proper “display” URL when a result
is displayed and clicked.
For a display link, the developer can use the basic class
com.ibm.ilel.seedlist.imp.LinkImp that gets java.net.URI in a constructor.
For a crawl URL, a link to the content is preferred if one can be provided, in which case the
same LinkImp class can be used. When content can’t be accessed over a network, an inline
content representation can be used, and the content text is included in the resulting Seedlist
ATOM feed. In such a case, the developer can use the basic class
com.ibm.ilel.seedlist.imp.DocumentContentImp.
Listing 15 shows how FileSystem Retriever inline content is created for text files.
Listing 15. Code sample of inline content creation for text files of FileSystem Retriever
// Default encoding for text files is UTF-8
private static final String DEFAULT_ENCODING = "UTF-8";
// Represents text MIME type
private static final String TEXT_MIME_TYPE = "text/plain";
private DocumentContent createFileContent(File file) throws FileNotFoundException {
InputStream istream = new FileInputStream(file);
DocumentContentImp docContent = new DocumentContentImp(istream);
docContent.setEncoding(DEFAULT_ENCODING);
docContent.setLocale(Locale.ENGLISH);
docContent.setType(TEXT_MIME_TYPE);
return docContent;
}
3. Just as with EntrySetImp and SeedlistImp, DocumentImp sets most of its properties
through the Metadata object. Use the default Metadata implementation class
com.ibm.ilel.seedlist.imp.MetadataImp that provides a full set of required functionalities
and that can be extended, if required.
7.2.4 Incremental publishing for incremental crawling
One of the main capabilities that Seedlist Framework introduces is an effective mechanism for
incremental publishing that results in incremental crawling. This mechanism is critical when
crawling through large content systems where the goal is to have a frequently updated search
index. In addition, it reduces the load on the content system because the crawler needs to
fetch only the content items that changed, rather than all content items in the system.
The Timestamp Servlet parameter is used to define the logical time of the last crawling session
(CS). The logical time object is an opaque object like the State object described earlier. It is
created by the Retriever, and only the Retriever can "understand" it.
This object can hold a real date or another data structure that represents the last CS time for
your Retriever. When the Timestamp is passed to the Seedlist Retriever, the Retriever must return
the documents that were updated since the specified Timestamp. Because not all the updates are
returned in one call, several sequential requests are sent to the Seedlist Servlet with increasing
range values ("start" parameter). This process of sequential requests is called the "pagination
process", during which the same Timestamp parameter is passed.
In general, when starting an incremental CS, the crawler should add the Timestamp parameter
to the Seedlist URL of the first request in the current CS:
&Timestamp=<timestamp from the last crawling session>.
In all successive requests in the same CS, the Timestamp parameter is included automatically
in the next URL by Seedlist URLResolver. The crawler is expected to read the Timestamp
element from the Seedlist output on the last page of the CS and use that Timestamp in the next
CS.
Listing 16 shows how the last update date (or any other Timestamp suitable for your
Retriever) is obtained from the request to filter returned files. All Documents (files) that were
updated after this date must be returned for the current request. Give priority to an explicitly
specified Date parameter on the request; if it is not defined, use the Timestamp parameter.
Listing 16. Code sample to get the last update date (or timestamp) for filtering documents
private Date getLastUpdateDate(Request request) throws SeedlistException {
if (request.getDate() != null) {
return request.getDate();
}
State ts = request.getTimestamp();
if (isEmptyState(ts)) {
return null;
}
try {
// timestamp comprises last retrieving request time in millis
return new Date(ByteBuffer.wrap(ts.getStateData()).getLong());
} catch (BufferUnderflowException e) {
//...throw new SeedlistException...
}
}
The advantage of using the Timestamp parameter over the Date parameter is the elimination of
time-zone synchronization issues. In addition, the Timestamp parameter lets you save
additional information to optimize the retrieval of the updated content, if required.
Listing 17 describes the creation of a new Timestamp according to the time of the current
request. Notice that, because Timestamp is an opaque object, we reuse the State interface for
its representation.
Listing 17. Code example to create new opaque Timestamp object
private State createNewTimestamp(long timeInMillis) {
// a long occupies exactly 8 bytes
ByteBuffer timeBuf = ByteBuffer.allocate(8);
// store the given time in milliseconds as the timestamp
timeBuf.putLong(timeInMillis);
return new StateImp(timeBuf.array());
}
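The byte-level encoding used in Listings 16 and 17 can be verified with a small round trip. The
following sketch works on raw byte arrays instead of the framework's State and StateImp types,
and it throws IllegalArgumentException where the Retriever would throw SeedlistException. The
class TimestampCodec is hypothetical and exists only for this illustration.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.util.Date;

/** Hypothetical, byte-array-based stand-in for the Timestamp handling in Listings 16 and 17. */
final class TimestampCodec {

    /** Encodes a time in milliseconds as the 8 bytes of a long (big-endian by default). */
    static byte[] encode(long timeInMillis) {
        return ByteBuffer.allocate(8).putLong(timeInMillis).array();
    }

    /** Decodes the bytes back into a Date; an absent timestamp yields null (full crawl). */
    static Date decode(byte[] data) {
        if (data == null || data.length == 0) {
            return null;
        }
        try {
            return new Date(ByteBuffer.wrap(data).getLong());
        } catch (BufferUnderflowException e) {
            // Fewer than 8 bytes: the timestamp was truncated or is not ours
            throw new IllegalArgumentException("Timestamp must hold at least 8 bytes", e);
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        byte[] opaque = encode(now);
        // The decoded date carries the same millisecond value that was encoded
        System.out.println(decode(opaque).getTime() == now);
    }
}
```

Because the value is produced and consumed only by the Retriever, the crawler can treat these
bytes as opaque, which is what removes the time-zone concerns mentioned above.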
State Machine regarding State and Timestamp parameters
In this section we describe the different states in which a system can be with respect to the
State and Timestamp (TS) objects that are passed between the Crawler and the Publishing
backend system (see Figure 6):
1. Crawler starts its first full crawling session with (0, 0), which means there is no State
and no TS. It asks for the first page of the content list.
2. Crawler continues and asks for the next pages of the first crawling session with (1, 0),
which means there is a State object on the request, but there still is no TS because we
are still in the first full crawling session.
3. Crawler finishes the first full crawling session and starts another session after some
delay to get updates. Crawler is at (0, 1), which means it does not have a State object. It
asks for the first page; however, it already has a TS object that was returned from the
previous session by the last request, and now it is used to get only the updates.
4. Crawler continues and asks for the next pages with (1, 1), which means there is a State
object as well as a TS that is being used to get only the updates. The passed TS does
not change between requests for consecutive pages on the same crawling session.
5. Crawler moves back to (0, 0) only when it wants to perform a full crawl again and wants
to disregard the TS.
Figure 6. State and Timestamp handling State Machine
[Figure: state machine over the four (State, TS) pairs (0, 0), (1, 0), (0, 1), and (1, 1).
Legend: 0 = there is no such parameter on the request; 1 = there is such a parameter on the
request.]
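The four (State, TS) combinations described above can also be expressed in code. The following
minimal sketch classifies a request by the presence of the two parameters; the enum CrawlPhase
and its constant names are hypothetical and are not part of the Seedlist Framework API.

```java
/** Hypothetical classification of a request by the presence of State and Timestamp. */
enum CrawlPhase {
    FULL_FIRST_PAGE,        // (0, 0): no State, no TS
    FULL_NEXT_PAGE,         // (1, 0): State only
    INCREMENTAL_FIRST_PAGE, // (0, 1): TS only
    INCREMENTAL_NEXT_PAGE;  // (1, 1): State and TS

    /** Maps the (State, TS) pair of a request to its phase in the crawling session. */
    static CrawlPhase of(boolean hasState, boolean hasTimestamp) {
        if (hasTimestamp) {
            return hasState ? INCREMENTAL_NEXT_PAGE : INCREMENTAL_FIRST_PAGE;
        }
        return hasState ? FULL_NEXT_PAGE : FULL_FIRST_PAGE;
    }

    public static void main(String[] args) {
        // Print the phase for each of the four combinations
        boolean[] flags = {false, true};
        for (boolean hasTimestamp : flags) {
            for (boolean hasState : flags) {
                System.out.println("(" + (hasState ? 1 : 0) + ", " + (hasTimestamp ? 1 : 0)
                        + ") -> " + of(hasState, hasTimestamp));
            }
        }
    }
}
```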
8 WebSphere Portal Seedlist Framework
As described above in Section 4, Seedlist Framework architecture, the Servlet code is expected
to be specific to the project. In this section we cover the specific implementation in WebSphere
Portal and its specific characteristics.
The implementation in WebSphere Portal is based on the Eclipse extension point mechanism,
which actually lets WebSphere Portal customers develop their own Retriever, install the
implementation on the server, and generate the Seedlist for their content system without any
additional effort.
A built-in Retriever, called the Discovery Retriever, discovers those Retrievers. This service
implements the Seedlist Retriever API and, with the getChildren() call, returns a list of all
Seedlist Retriever extension point implementations that are available in the system. With the
getDocuments() call, it also provides a list of all available Documents in all Seedlist
Retrievers.
Figure 7 shows the output of the URL:
https://<host>:<port>/seedlist/myserver?Format=html
Figure 7. Discovery Retriever that displays available Retrievers on the system
Notice that in the WebSphere Portal Servlet, the Source parameter of the Seedlist REST API is
used to pass the extension point ID. The URL above does not include the Source parameter, so
the Discovery Retriever is used by default.
The default Action is GetChildren. The Discovery Retriever finds the WebSphere Portal
Retriever and the FileSystem Retriever and displays them in an HTML view, which is the format
specified in the URL. For each Retriever, we can see the extension point name as the description.
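To address a specific Retriever instead of the Discovery Retriever, the request URL must carry
the Source parameter. The following minimal sketch assembles such a URL from the parameter
names mentioned above (Source, Action, and Format). The host, the port, and the extension point
ID used in it are placeholders chosen for illustration, and the helper class SeedlistUrlBuilder is
hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that appends URL-encoded query parameters to a Seedlist URL. */
final class SeedlistUrlBuilder {

    /** Appends the parameters in iteration order; returns baseUrl unchanged if there are none. */
    static String build(String baseUrl, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(URLEncoder.encode(param.getKey(), "UTF-8"))
               .append('=')
               .append(URLEncoder.encode(param.getValue(), "UTF-8"));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("Source", "com.example.myretriever"); // placeholder extension point ID
        params.put("Action", "GetChildren");
        params.put("Format", "html");
        // Placeholder host and port
        System.out.println(build("https://portal.example.com:10035/seedlist/myserver", params));
    }
}
```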
9 Conclusion
Seedlist Framework is a simple and intuitive framework that simplifies the task of making your
enterprise content searchable by publishing the content in Seedlist format. This format is
supported by different search engines. The Framework handles all the critical and problematic
aspects of enterprise content such as access control, metadata publishing with indexing
instructions, and incremental crawling for effectiveness and scalability.
We have described the Framework architecture and guided you through creation of a
FileSystem Seedlist. We’ve also provided a step-by-step outline of the FileSystem Retriever
implementation as reference code for your own Retriever development.
10 Downloads
This section describes all the packages and source code that are needed to write a Seedlist
Retriever and to generate a Seedlist using Seedlist Framework. In addition, it lists the additional
papers on the Seedlist, API specifications, and Javadoc.
Name / Size / Download method / Description

Name: IlelSeedlist.zip
Description: The zip file includes:
1. ilel-seedlist-javadoc.zip
2. ilel-seedlist.jar
3. ilel-seedlist.servlet.ear
4. ilel-seedlist.filesystem.jar
5. seedlist-installation-guide.doc
ilel-seedlist-javadoc.zip includes:
1. javadoc of Seedlist Framework API
ilel-seedlist.jar includes:
1. Seedlist Framework API and its default implementations
2. ATOM/XML, HTML, XSLT Seedlist Formatters
3. Seedlist Service that connects the different framework parts. It validates request
parameters. It calls appropriate Retriever function. It formats retrieved data according to
required format.
ilel-seedlist.servlet.ear includes:
Servlet that instantiates the FileSystem Retriever, instantiates ATOM/XML Formatter, and
processes requests by using Seedlist Service. The package includes the source code.
ilel-seedlist.filesystem.jar includes:
Seedlist Retriever implementation of File System. The package includes the source code.

Name: SeedlistSpecification.zip
Description: The zip file includes:
1. Seedlist Framework REST API Specification that includes the HTTP parameters and the
output format.
2. "A Richer Format for Providing Content Crawling Metadata" paper (written by David
Konopnicki and Laurent Hasson).
11 Resources
• developerWorks article, WebSphere Portal Search Toolbox for WebSphere Portal Version 6.0
• developerWorks article, Integrating IBM Lotus Sametime with the IBM Lotus Quickr Search
REST service
• developerWorks article, Introducing the Search and Indexing API in WebSphere Portal V6.0
• developerWorks article, IBM Search and Index APIs (SIAPI) for WebSphere Information
Integrator OmniFind Edition
• developerWorks white paper, IBM Search and Index (SIAPI) V6.0 Javadoc
• WebSphere Portal product documentation
• WebSphere Portal zone
12 About the authors
Eitan Shapiro holds a BSc degree in Information Systems Engineering from the
Technion, Haifa, Israel. He joined IBM in 2005 and is the Team Lead of the
Haifa Search Technologies Team, which develops search solutions for
WebSphere Portal and Lotus Quickr.
Constantin Radchenko holds a BSc degree in Software Engineering from the
Technion, Haifa, Israel. He joined IBM in 2006, where he is a software
developer on the Haifa Enterprise Information Discovery Team, which develops
search solutions for WebSphere Portal and Lotus Quickr.
**********************************************************************
Trademarks
• developerWorks, IBM, and WebSphere are trademarks or registered trademarks of IBM
Corporation in the United States, other countries, or both.
• Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun
Microsystems, Inc. in the United States, other countries, or both.
• Other company, product, and service names may be trademarks or service marks of others.
**********************************************************************