5 Replies Latest reply on May 4, 2003 6:43 PM by h2o_polo

    Lucene as a Search module

    h2o_polo

      Ok, after seven round trips on the train and two weekends, the end of the beginning of the search module is here.

      There is a Readme file in search/src/etc

      Alex.

        • 1. Re: Lucene as a Search module
          h2o_polo

          This stuff will not compile until the patch for the Page class is applied. I sent it in the previous post.

          Alex.

          • 2. Re: Lucene as a Search module

            I looked at it and it seems cool. There are some glitches, but nothing important.

            * Be pragmatic please *

            1.Correct me if I am wrong :

            in indexContext, the context parameter contains the words that will be analyzed ?

            for the html module, that would contain the content of a page ?

            2.There is a thing I want to change. Reverse the control of the indexing process :

            Instead of having a module push its content into the search module, I want the search module to pull the content from all the modules (at a regular time interval, for instance). Of course, a module should implement some kind of interface to be indexable.

            julien

            • 3. Re: Lucene as a Search module
              h2o_polo

              > in indexContext, the context parameter contains the
              > words that will be analyzed ?

              Yes.

              > for the html module, that would contain the
              > content of a page ?

              It can be anything that can be uniquely identified by the URL (another parameter of the same method) and can be represented as a String (for now, since Lucene only ships a handful of Analyzers).

              There is a little problem with the HTML situation. Not all of the HTML is content to be indexed, since some of it is formatting tags, and the StandardAnalyzer or the SimpleAnalyzer will index the tags too. To avoid this we will need another analyzer. There is a demo that comes with Lucene, but I have not explored it yet.

              > 2.There is a thing I want to change.
              > Reverse the control of the indexing process :
              >
              > Instead of having a module push its content into
              > the search module, I want the search module to pull
              > the content from all the modules (at a regular time
              > interval, for instance). Of course, a module
              > should implement some kind of interface to be
              > indexable.

              Yes, I remember the visitor you were talking about.

              So the module that needs to be indexed will have to implement something like:

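              import java.util.Collection;

              interface Searchable {
                  // let the search module pull this module's content
                  void indexMe(SearchModule sm);
                  // URLs that identify every indexable item
                  Collection getURLs();
                  // content to analyze for the given URL
                  String getContent(String url);
              }

              abstract class AbstractSearchable implements Searchable {
                  public void indexMe(SearchModule sm) {
                      // assumes an indexContext overload that accepts a Searchable
                      sm.indexContext(this);
                  }

                  public abstract Collection getURLs();
                  public abstract String getContent(String url);
              }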
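
              To make the pull idea concrete, here is a minimal sketch of what the search module side could look like. It is only an illustration: PullingSearchModule, pullAll, indexedContent and the toy HtmlModule are made-up names, indexMe is left out to keep it short, and a plain map stands in for the real Lucene index.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: every name except Searchable, getURLs and getContent is hypothetical.
class PullIndexingSketch {

    // Contract a module implements so the search module can pull from it.
    interface Searchable {
        Collection<String> getURLs();
        String getContent(String url);
    }

    // Stand-in for the search module; a map replaces the real Lucene index.
    static class PullingSearchModule {
        private final List<Searchable> sources = new ArrayList<Searchable>();
        private final Map<String, String> index = new LinkedHashMap<String, String>();

        void register(Searchable source) {
            sources.add(source);
        }

        // One pull pass over all registered modules; a timer could call
        // this on a regular interval. Returns the number of documents seen.
        int pullAll() {
            int count = 0;
            for (Searchable source : sources) {
                for (String url : source.getURLs()) {
                    index.put(url, source.getContent(url));
                    count++;
                }
            }
            return count;
        }

        String indexedContent(String url) {
            return index.get(url);
        }
    }

    // Toy module exposing two pages.
    static class HtmlModule implements Searchable {
        private final Map<String, String> pages = new LinkedHashMap<String, String>();

        HtmlModule() {
            pages.put("/html/index", "welcome page");
            pages.put("/html/about", "about page");
        }

        public Collection<String> getURLs() {
            return pages.keySet();
        }

        public String getContent(String url) {
            return pages.get(url);
        }
    }

    public static void main(String[] args) {
        PullingSearchModule search = new PullingSearchModule();
        search.register(new HtmlModule());
        System.out.println("indexed " + search.pullAll() + " documents");
    }
}
```

              With this shape the module never touches the index itself, which is what makes pull simple for the module, at the price of the index being only as fresh as the last pass.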

              Suggestions are welcome.

              I am not sure how to handle expirations yet.

              Thanks,
              Alex.

              • 4. Re: Lucene as a Search module

                1.In fact pull vs push, I am not sure about it.

                Pull: it's simpler for the module, but the index may not be up to date.

                Push: each time the content changes, the module can modify the index in the search module. In that case I would like to use JMX notifications with special Notification subclasses.

                2.Do we need a custom analyzer for HTML, or would filtering the tags be enough ?

                3.I understand the concept of a search unit and I think it's ok. But I am a little bit sceptical about the need for an EJB for that. That's very static content. We could have an XML format that describes it, stored as a plain MBean attribute: each time the setter is called, the XML is parsed and the object structure is created.

                I also think that a component that wants to index data should describe its own units and not follow a schema set by the search module. Then in the search module we can manipulate these units and arrange them.

                eg :

                html module creates units : html/page

                suppose we have another module that delivers html content and indexes it as : another/page

                in the search module we see both and decide to unify them in the same unit :

                (html/page & another/page) -> html

                .that means each time the html module indexes some content it goes into html, because of the mapping.
                .the same goes for the other module.

                I want most of the search configuration in the search module, not scattered across every module.
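
                The mapping in the example can be sketched roughly as follows. This is only an illustration under assumptions: UnitRegistry, declareUnit, mapUnit and groupFor are hypothetical names, and nothing here is persisted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: all class and method names here are hypothetical.
class UnitMappingSketch {

    // Registry kept inside the search module.
    static class UnitRegistry {
        // declared unit -> group chosen in the search module
        private final Map<String, String> unitToGroup = new LinkedHashMap<String, String>();

        // Called by a module to declare one of its own units.
        void declareUnit(String unit) {
            if (!unitToGroup.containsKey(unit)) {
                unitToGroup.put(unit, unit); // an unmapped unit stands alone
            }
        }

        // Called from the search module's configuration to unify units.
        void mapUnit(String unit, String group) {
            if (!unitToGroup.containsKey(unit)) {
                throw new IllegalArgumentException("unknown unit: " + unit);
            }
            unitToGroup.put(unit, group);
        }

        // Resolves where content indexed under a unit ends up.
        String groupFor(String unit) {
            String group = unitToGroup.get(unit);
            if (group == null) {
                throw new IllegalArgumentException("unknown unit: " + unit);
            }
            return group;
        }
    }

    public static void main(String[] args) {
        UnitRegistry registry = new UnitRegistry();
        // Each module only declares its own units.
        registry.declareUnit("html/page");
        registry.declareUnit("another/page");
        // The unification lives in the search module only.
        registry.mapUnit("html/page", "html");
        registry.mapUnit("another/page", "html");
        System.out.println(registry.groupFor("another/page"));
    }
}
```

                The modules only ever call declareUnit, so the knowledge that both units end up in html stays in one place.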

                julien

                • 5. Re: Lucene as a Search module
                  h2o_polo

                  > 1.In fact pull vs push, I am not sure about it.
                  >
                  > Pull: it's simpler for the module, but the index may
                  > not be up to date.
                  >
                  > Push: each time the content changes, the module
                  > can modify the index in the search module. In that
                  > case I would like to use JMX notifications with
                  > special Notification subclasses.

                  I think eventually I would do both, but push would be done first.
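
                  For the push side, a rough sketch of the JMX notification idea could look like this. The javax.management classes are the standard ones, but ContentChangedNotification, its type string, savePage and the map used as an index are assumptions made for the example. In a real setup the modules would be registered MBeans rather than plain objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.Notification;
import javax.management.NotificationBroadcasterSupport;
import javax.management.NotificationListener;

// Sketch only: the nested classes and the type string are hypothetical.
class PushIndexingSketch {

    // Notification subclass carrying the changed URL and its new content.
    static class ContentChangedNotification extends Notification {
        static final String TYPE = "search.content.changed"; // made-up type string
        private final String url;
        private final String content;

        ContentChangedNotification(Object source, long sequence, String url, String content) {
            super(TYPE, source, sequence);
            this.url = url;
            this.content = content;
        }

        String getUrl() {
            return url;
        }

        String getContent() {
            return content;
        }
    }

    // Content module: broadcasts a notification whenever a page changes.
    static class HtmlModule extends NotificationBroadcasterSupport {
        private long sequence = 0;

        void savePage(String url, String content) {
            sendNotification(new ContentChangedNotification(this, ++sequence, url, content));
        }
    }

    // Search module: updates its index (a map here) on each notification.
    static class SearchModule implements NotificationListener {
        final Map<String, String> index = new LinkedHashMap<String, String>();

        public void handleNotification(Notification n, Object handback) {
            if (n instanceof ContentChangedNotification) {
                ContentChangedNotification changed = (ContentChangedNotification) n;
                index.put(changed.getUrl(), changed.getContent());
            }
        }
    }

    public static void main(String[] args) {
        HtmlModule html = new HtmlModule();
        SearchModule search = new SearchModule();
        html.addNotificationListener(search, null, null); // no filter, no handback
        html.savePage("/html/index", "welcome page");
        System.out.println(search.index);
    }
}
```

                  With the default NotificationBroadcasterSupport constructor the listener runs in the thread that calls sendNotification, so the index is updated before savePage returns.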

                  > 2.Do we need a custom analyzer for HTML, or
                  > would filtering the tags be enough ?

                  If you just filter, you would also filter out content that looks like HTML. If you make a custom filter that knows how not to mess up content that is itself HTML, it becomes as complicated as an Analyzer :-) I am just replying without digging too deep, though. I will look at it more closely later on. It is not the first priority now, just a thought.
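
                  A small sketch of the filtering concern, assuming the simplest possible filter (a regular expression that drops everything between '<' and the next '>'). TagFilterSketch and stripTags are made-up names, and this is not how Lucene or its demo handles HTML.

```java
import java.util.regex.Pattern;

// Sketch only: a deliberately naive tag filter, not a Lucene API.
class TagFilterSketch {

    // Matches '<', then anything up to and including the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replaces every match with a space so neighbouring words stay separated.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Plain markup: the tags go away and the words survive.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // Unescaped comparison operators: "< b and c >" is treated as a tag
        // and the words inside it are lost.
        System.out.println(stripTags("if a < b and c > d then stop"));
    }
}
```

                  The second call loses "b and c", which is the kind of damage a filter has to know how to avoid, and that is where the complexity creeps in.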

                  > 3.I understand the concept of search unit and I
                  > think it's ok. But I am a little bit sceptical about the
                  > need of an EJB for that. That's very static content.

                  This point is arguable in my view, especially if you allow modules to create their own search units. There are two scenarios here: either (1) the module that creates its own search unit knows how to attach it to a group that already exists in the hierarchy, or (2) an admin eventually attaches it to a group. Case #1 is no good: I think we should keep the internals of the SearchModule from being exposed to other modules. In case #2, the ability to dynamically manipulate the structure and persist it at the same time is essential. And what can be better than EJB :-)

                  > html module creates units : html/page
                  >
                  > suppose we have another module that delivers html
                  > content and indexes it as : another/page
                  >
                  > in search module we see both and decide to unify
                  > them in the same unit :
                  >
                  > (html/page & another/page) -> html
                  >
                  > .that means each time the html module indexes some
                  > content it goes into html, because of the mapping.
                  > .the same goes for the other module.

                  Yep, that is what is happening now, except that html/page and another/page would both need to be aware that their search unit is html. I think your idea of modules declaring their units and an admin arranging them into groups is a good one. I will try to implement it soon.

                  Alex.