4 Replies Latest reply on Oct 7, 2004 9:08 PM by jae77

    New Module ? : Search engines bots (spider) detection

    cnovara

      I want to know which bots crawl my Nukes website. Plenty of PHP-style tools exist for this, but I wrote a Nukes module to do it. Based on a list of bot signatures (IP subnet or HTTP user-agent matching), it logs bot requests at the Nukes (application) level. This way I can query the resulting log simply by connecting to my Nukes site.

      Are you interested in these bot issues? Let me know.

      Note for cooper and theute: more work is needed to package it cleanly. I've added a "plug" in UserModule (its handler) that invokes the module only on a new session. We can discuss it if you're interested.
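A minimal sketch of the kind of signature matching described above (user-agent fragment or IPv4 subnet). It is not the module's actual code: the class and method names (`BotSignatureMatcher`, `BotSignature`, `detect`) and the sample bot names and addresses are assumptions made up for illustration, and only IPv4 subnets are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: detect a bot by user-agent fragment or IPv4 subnet. */
final class BotSignatureMatcher {

    /** One signature entry: a bot name plus either an agent fragment or a CIDR subnet. */
    static final class BotSignature {
        final String botName;
        final String agentFragment; // lower-cased; null when this is a subnet rule
        final int network;          // IPv4 network address packed into an int
        final int mask;             // IPv4 netmask packed into an int

        private BotSignature(String botName, String agentFragment, int network, int mask) {
            this.botName = botName;
            this.agentFragment = agentFragment;
            this.network = network;
            this.mask = mask;
        }

        static BotSignature forAgent(String botName, String fragment) {
            return new BotSignature(botName, fragment.toLowerCase(Locale.ROOT), 0, 0);
        }

        /** Accepts CIDR notation such as "192.0.2.0/24". */
        static BotSignature forSubnet(String botName, String cidr) {
            String[] parts = cidr.split("/");
            int prefix = Integer.parseInt(parts[1]);
            int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
            return new BotSignature(botName, null, toInt(parts[0]) & mask, mask);
        }

        boolean matches(String userAgent, String remoteAddr) {
            if (agentFragment != null) {
                return userAgent != null
                        && userAgent.toLowerCase(Locale.ROOT).contains(agentFragment);
            }
            if (remoteAddr == null) {
                return false;
            }
            try {
                return (toInt(remoteAddr) & mask) == network;
            } catch (IllegalArgumentException e) {
                return false; // not a dotted-quad IPv4 address (for example IPv6)
            }
        }
    }

    private final List<BotSignature> signatures = new ArrayList<BotSignature>();

    void add(BotSignature signature) {
        signatures.add(signature);
    }

    /** Returns the name of the first matching bot, or null for an ordinary visitor. */
    String detect(String userAgent, String remoteAddr) {
        for (BotSignature signature : signatures) {
            if (signature.matches(userAgent, remoteAddr)) {
                return signature.botName;
            }
        }
        return null;
    }

    /** Packs a dotted-quad IPv4 address into an int. */
    static int toInt(String dottedQuad) {
        String[] octets = dottedQuad.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dottedQuad);
        }
        int value = 0;
        for (String octet : octets) {
            value = (value << 8) | (Integer.parseInt(octet) & 0xFF);
        }
        return value;
    }
}
```

In a real module the signature list would presumably be loaded from the signatures table, and a non-null result from `detect` would be written to the log table.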

        • 1. Re: New Module ? : Search engines bots (spider) detection

          sounds good, can you package it better :-) ?

          • 2. Re: New Module ? : Search engines bots (spider) detection
            cnovara

            I've done a full CMP/CMR implementation, the module implementation is OK again, and the HTML template is OK, but I need 2 DB tables: signatures and log. I'm not fully satisfied with the way CMP creates them (type control is not strict enough I think; there is a lot to say about it, maybe on another dev thread ...). Is it possible to use a good old DDL? I'm trying to work out how to generate it in the build process. Another issue: is it necessary to test it under hsqldb? I haven't even tried it. We should also discuss the UserModule plug. For now, here it is:

            private Handler handler = new Handler()
            {
               public void process(Signature signature, NukesRequest req, NukesResponse resp, Handler.Next next)
               {
                  ...
                  synchronized (sessionIdToUserStatMap)
                  {
                     if (sessionIdToUserStatMap.containsKey(sessionId))
                     {
                        stat = (UserStat)sessionIdToUserStatMap.get(sessionId);
                     }
                     else
                     {
                        sessionIdToUserStatMap.put(sessionId, stat = new UserStat(sessionId));
                        // CN Added, call BotcrawlModule (new session only)
                        try
                        {
                           server.invoke(ObjectNameFactory.create("nukes.modules:name=botcrawl"), "signIn",
                                         new Object[] { req }, new String[] { NukesRequest.class.getName() });
                        }
                        catch (Exception e)
                        {
                           // TODO Module not there, cry silently
                           log.error("", e);
                        }
                     }
                  }

                  ...
                  next.process(signature, req, resp);
               }
            };
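The `server.invoke(...)` call in the handler is a plain JMX operation invocation by `ObjectName`. Below is a small standalone sketch of that pattern using only `javax.management` from the JDK. The `Botcrawl` MBean and its `signIn(String)` operation are hypothetical stand-ins (the real module takes a `NukesRequest`), and catching `InstanceNotFoundException` separately is just one possible way to handle the "module not there" case from the TODO, not necessarily what Nukes should do.

```java
import java.lang.management.ManagementFactory;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical sketch: call an optional module through the MBean server. */
final class BotcrawlInvokeSketch {

    /** Standard MBean interface; JMX expects the name to be the class name + "MBean". */
    public interface BotcrawlMBean {
        void signIn(String userAgent);
        int getSignInCount();
    }

    /** Stand-in for the real module; it only counts the calls it receives. */
    public static final class Botcrawl implements BotcrawlMBean {
        private int count;

        public synchronized void signIn(String userAgent) {
            count++;
        }

        public synchronized int getSignInCount() {
            return count;
        }
    }

    /** Returns true if the module was registered and invoked, false if it is absent. */
    static boolean notifyBotcrawl(MBeanServer server, ObjectName name, String userAgent)
            throws Exception {
        try {
            server.invoke(name, "signIn",
                    new Object[] { userAgent },
                    new String[] { String.class.getName() });
            return true;
        } catch (InstanceNotFoundException e) {
            // The optional module is not deployed: skip it quietly.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("nukes.modules:name=botcrawl");

        // First call happens before the module is registered.
        System.out.println("invoked: " + notifyBotcrawl(server, name, "ExampleBot/1.0"));

        Botcrawl module = new Botcrawl();
        server.registerMBean(module, name);
        System.out.println("invoked: " + notifyBotcrawl(server, name, "ExampleBot/1.0"));
        System.out.println("sign-ins: " + module.getSignInCount());

        server.unregisterMBean(name);
    }
}
```

Distinguishing the missing-module case from other failures keeps a genuinely broken module visible in the log while letting an installation without the module run normally.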


            • 3. Re: New Module ? : Search engines bots (spider) detection
              theute

              For the DB there is an installer but it's in 1.1 not in HEAD yet.

              If it comes packaged with Nukes, it would be nice to make it work with hsqldb. Even though hsqldb is not useful for a production website, it is very neat to be able to try Nukes without having to set up or install anything.

              A good way to show your boss Nukes running on your machine in a couple of seconds (it all depends on your download time :) )

              • 4. Re: New Module ? : Search engines bots (spider) detection
                jae77

                 

                "cnovara" wrote:
                I've done a full CMP/CMR implementation, the module implementation is OK again, and the HTML template is OK, but I need 2 DB tables: signatures and log. I'm not fully satisfied with the way CMP creates them (type control is not strict enough I think; there is a lot to say about it, maybe on another dev thread ...). Is it possible to use a good old DDL? I'm trying to work out how to generate it in the build process. Another issue: is it necessary to test it under hsqldb? I haven't even tried it. We should also discuss the UserModule plug.


                if you control what is being inserted through the java code, then the type creation provided by jboss should be more than sufficient.

                i still think that a pure cmp solution is the way to go b/c it makes it that much easier to install. however, certain database differences, and the need to load "primer" data, still keep it from being the best solution in the world. perhaps as ldbc (http://ldbc.sf.net, as julien mentioned before) becomes more mature, it may help solve these problems.