Design overview of the search mechanism

Search is implemented entirely on the client side; no server is involved. When a user enters a query, JavaScript running in the browser processes it and displays the matching results by comparing the query against a generated 'index', which also resides in the client-side web browser.

The search mechanism has two main parts.

Indexing: First, the content in the docs/content folder is traversed and the words in it are indexed. This is done by nw-cms.jar. You can invoke it with the ant index command from the root of the webhelp directory. You can recompile it and rebuild the jar file with ant build-indexer. The indexer has extended support for English, German, and French: text in those languages is stemmed first to get the root word, and the root words are then indexed. For CJK (Chinese, Japanese, Korean) languages, it uses bi-gram tokenizing to break up the words. (CJK languages do not have spaces between words.)

When we run ant index, it generates five output files:

htmlFileList.js - This contains an array named fl, which stores details of all the files indexed by the indexer.
htmlFileInfoList.js - This includes some metadata about the indexed files in an array named fil. It includes the file name, the file (HTML) title, and a summary of the content. The format looks like this:
    fil["4"]= "ch03.html@@@Developer Docs@@@This chapter provides an overview of how webhelp is implemented.";
index-*.js (three index files) - These three files store the actual index of the content. The index is added to an array named w.

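As a rough illustration of the fil entry format shown above, the sketch below splits one entry into its three fields. The parseFileInfo helper is hypothetical and is not part of WebHelp; it only assumes that "@@@" separates the file name, the title, and the summary, as in the sample entry.

```javascript
// Split one htmlFileInfoList.js entry into its "@@@"-separated fields.
// parseFileInfo is a hypothetical helper written for this illustration.
function parseFileInfo(entry) {
  const parts = entry.split("@@@");
  return {
    file: parts[0],
    title: parts[1],
    // Re-join the remainder in case the summary itself contains "@@@".
    summary: parts.slice(2).join("@@@"),
  };
}

// Sample data copied from the format shown in the text.
const fil = [];
fil["4"] = "ch03.html@@@Developer Docs@@@This chapter provides an overview of how webhelp is implemented.";

const info = parseFileInfo(fil["4"]);
console.log(info.file + " | " + info.title);
```

The search code could use fields like these to show a title and a short summary for each matching file.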
Querying: Query processing happens entirely on the client side. The following JavaScript files handle it:

nwSearchFnt.js - This handles the user query and returns the search results. It tokenizes the query words, drops unnecessary punctuation and common words, performs stemming if the DocBook language supports it, and so on.
{$indexer-language-code}_stemmer.js - This includes the stemming library. nwSearchFnt.js calls the stemmer method in this file for stemming. For example: var stem = stemmer(foobar);

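The query steps described above (tokenizing, dropping punctuation and common words, stemming) can be sketched roughly as follows. This is not the code in nwSearchFnt.js: the prepareQuery function and the STOP_WORDS list are hypothetical, and the stemmer function here is a toy stand-in for the one a real {$indexer-language-code}_stemmer.js file would provide.

```javascript
// Hypothetical stop-word list; the actual list used by WebHelp may differ.
const STOP_WORDS = new Set(["a", "an", "and", "the", "of", "to", "in", "is"]);

// Toy stand-in for the stemmer(word) function from a *_stemmer.js file.
// It only strips a trailing "s"; it is NOT a real Snowball stemmer.
function stemmer(word) {
  return word.length > 3 && word.endsWith("s") ? word.slice(0, -1) : word;
}

// Turn a raw query string into a list of stemmed search terms.
function prepareQuery(query) {
  return query
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // replace punctuation with spaces
    .split(/\s+/) // tokenize on whitespace
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w)) // drop empties and common words
    .map((w) => stemmer(w)); // reduce each word to its stem
}

const terms = prepareQuery("Adding the new stemmers, quickly!");
console.log(terms);
```

Each resulting term would then be looked up in the generated index; how that lookup works depends on the layout of the w array, which is not covered here.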
Adding new stemmers is very simple.

Currently, only English, French, and German stemmers are integrated into WebHelp, but the code is extensible, so you can add a new stemmer in a few steps.

What you need:

You'll need two versions of the stemmer: one written in JavaScript and another in Java. Fortunately, Snowball contains Java stemmers for a number of popular languages, and these are already included with the package. You can see the full list in Adding support for other (non-CJKV) languages. If your language is listed there, you then have to find a JavaScript version of the stemmer. New stemmers are generally added to the Snowball Stemmers in other languages location. If a JavaScript stemmer for your language is available, download it. Otherwise, you can write a new stemmer in JavaScript using the Snowball algorithm fairly easily. The algorithms are at Snowball.

Then, name the JS stemmer exactly like this: {$language-code}_stemmer.js. For example, for Italian (it), name it it_stemmer.js. Then copy it to the docbook-webhelp/template/content/search/stemmers/ folder. (This assumes docbook-webhelp is the root folder for webhelp.)

Note: Make sure you have changed the webhelp.indexer.language property in build.properties to your language.

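To make the naming rule concrete, here is a minimal sketch of what a new stemmer file might look like. It assumes only what the text states: the file is named after the language code and exposes a stemmer function that the search code calls. The stemmerFileName helper is hypothetical, and the stemmer body is a placeholder, not an implementation of the Snowball Italian algorithm.

```javascript
// Build the expected stemmer file name for a language code.
// stemmerFileName is a hypothetical helper used only for illustration.
function stemmerFileName(languageCode) {
  return languageCode + "_stemmer.js";
}

// Skeleton of it_stemmer.js: the file must provide a stemmer(word) function.
// Placeholder body: a real stemmer would apply the Snowball rules for the
// language; this one only lowercases the word.
function stemmer(word) {
  return word.toLowerCase();
}

// Folder named in the text; assumes docbook-webhelp is the webhelp root.
const target = "docbook-webhelp/template/content/search/stemmers/" + stemmerFileName("it");
console.log(target);
```

Whatever the stemmer does internally, the Java and JavaScript versions should produce the same stems, since the index is built with one and queried with the other.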
Now two easy changes are needed in the indexer.

Open docbook-webhelp/indexer/src/com/nexwave/nquindexer/IndexerTask.java in a text editor and add your language code to the supportedLanguages String array.

Example 3.1. Add new language to supportedLanguages array

Change the array from:

    private String[] supportedLanguages = {"en", "de", "fr", "cn", "ja", "ko"};
    // currently extended support available for
    // English, German, French and CJK (Chinese, Japanese, Korean) languages only.

to:

    private String[] supportedLanguages = {"en", "de", "fr", "cn", "ja", "ko", "it"};
    // currently extended support available for
    // English, German, French, CJK (Chinese, Japanese, Korean), and Italian languages only.

Now, open docbook-webhelp/indexer/src/com/nexwave/nquindexer/SaxHTMLIndex.java and find the code where it initializes the stemmer (search for SnowballStemmer stemmer;). Then add code to initialize the stemmer object for your language, as in the example below. The class names are at: docbook-webhelp/indexer/src/com/nexwave/stemmer/snowball/ext/.

Example 3.2. Initialize the correct stemmer based on the webhelp.indexer.language specified

    SnowballStemmer stemmer;
    if (indexerLanguage.equalsIgnoreCase("en")) {
        stemmer = new EnglishStemmer();
    } else if (indexerLanguage.equalsIgnoreCase("de")) {
        stemmer = new GermanStemmer();
    } else if (indexerLanguage.equalsIgnoreCase("fr")) {
        stemmer = new FrenchStemmer();
    } else if (indexerLanguage.equalsIgnoreCase("it")) { // If language code is "it" (Italian)
        stemmer = new italianStemmer(); // Initialize the stemmer to an italianStemmer object.
    } else {
        stemmer = null;
    }

That's all. Now run ant build-indexer to compile and build the Java code. Then run ant webhelp to generate the output from your DocBook file.

For any questions, contact us or email the DocBook mailing list <docbook-apps@lists.oasis-open.org>.