So you want the Google spiders to visit but not crawl the cids, or do you want to block them altogether?
Thanks Christian. I'll get rid of the s:links, which are (currently) all over my site.
I might go wild and decide to write my own sub-class of s:link that's able to detect robots.
For the performance problem, I'm upgrading Seam to SP1 right now; perhaps whatever bug was in 2.0.2.GA is what is killing my site.
I put on SP1, but we're still having massive stability problems on this site. It's not even high-traffic, but it can only stay up for a couple of hours at a time. It's getting a couple thousand hits an hour, which should be nothing.
This is really bad... I'll try getting rid of all the s:links to see if that helps.
Check that you are not running out of JVM heap space, and check how many HTTP sessions are created/in use. Running out of memory is not handled gracefully on the Java platform; it looks like stuff is crashing randomly (on JBoss I've seen lots of transaction and database connection errors).
Definitely set a maximum number of sessions. Calculating that number is a bit difficult though.
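A rough sketch of the counting half of that. The class and its names are my own invention, not a Seam or servlet API; in a real deployment it would be driven by an HttpSessionListener (sessionCreated calling tryAcquire, sessionDestroyed calling release), which I've left out so the sketch stays self-contained:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical cap on concurrent sessions. An HttpSessionListener would
// delegate to this; if tryAcquire() returns false, the app can refuse to
// create a new session (or invalidate it immediately).
class SessionCap {
    private final int max;
    private final AtomicInteger active = new AtomicInteger(0);

    SessionCap(int max) { this.max = max; }

    // Returns true if a new session is allowed under the cap.
    boolean tryAcquire() {
        while (true) {
            int n = active.get();
            if (n >= max) return false;
            if (active.compareAndSet(n, n + 1)) return true;
        }
    }

    void release() { active.decrementAndGet(); }

    int active() { return active.get(); }
}
```

The compare-and-set loop keeps the check-then-increment atomic without a lock, so it's safe under concurrent request threads.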
Thanks for the suggestions Christian. I will check those.
I did change a few things just now: I put propagation=none on all the s:links except for the few where it is needed. I then put a block in robots.txt to keep robots from crawling onto pages like the contact-us form, which start long-running conversations. After those changes, it is now staying up reliably. I really got burned by not thinking carefully about s:link and long-running conversations.
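The robots.txt block looks something like this (the paths here are made up for illustration; the real ones are whatever pages start long-running conversations):

```
User-agent: *
Disallow: /contact.seam
Disallow: /search.seam
```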
What I'm going to do is leave it the way it is, but I will write my own version of s:link that automatically enables propagation=none for bots. Maybe I'll add a property like propagation=nobots. That will let me keep conversations going for actual users, but not for bots. I do want users' conversations to keep going, so, for example, if a user does a search, his search terms and maybe the top search result will show up. But that's not needed for bots.
I'll post the tag and code when I get it done. If people like it, maybe it could be rolled back into the official Seam distribution. I already got one tag in there (I wrote s:swing) and I hope to contribute more.
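For the bot check itself, something along these lines is what the s:link subclass would consult before deciding whether to propagate the conversation. The class name and token list are my own sketch, not Seam API, and a real token list would be much longer:

```java
import java.util.Locale;

// Sketch of the check behind a hypothetical propagation="nobots":
// if the User-Agent looks like a crawler, render the link with
// propagation=none so no long-running conversation is started.
class BotDetector {
    // A few common crawler tokens; real lists are much longer.
    private static final String[] BOT_TOKENS = {
        "googlebot", "bingbot", "slurp", "baiduspider", "crawler", "spider", "bot"
    };

    static boolean isBot(String userAgent) {
        if (userAgent == null) return true; // no UA header: treat as a bot
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : BOT_TOKENS) {
            if (ua.contains(token)) return true;
        }
        return false;
    }
}
```

Treating a missing User-Agent as a bot is a design choice: real browsers always send one, and the misbehaving crawlers are exactly the ones that sometimes don't.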
How about a BotFilter or something if it turns out that there are multiple places where the problem might occur?
Might be useful for others, too. Something that can be configured through components.xml or web.xml.
I'm DEFINITELY putting a botfilter on this whole thing, mainly to catch bots that ignore robots.txt, of which there are quite a few. I want to a) exclude them from getting real content and b) provide them with an infinite tree of random HTML pages they can download. If you look in your logs you'll see there are some really obvious bots that don't even check for robots.txt and use user-agent strings pretending to be regular browsers. I don't want these on my site because they couldn't be doing anything good.
Of course the nasty bots aren't identifying themselves properly through headers, either :-/
Perhaps your filter should count access rates, and if you get more than 10 requests in 10 seconds, you are redirected to a page that says: "Sorry for the inconvenience, but we think you might be a bot, please solve this CAPTCHA" ;-)
I like the idea of using a CAPTCHA to get out of the dog house. But there are easier ways of getting into the dog house than just looking at request rates. Some of the misbehaving bots do space their requests, but they do obviously bad bot things, like:
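The rate-based check could be as simple as a sliding window per client. Everything here is a sketch of my own (class names, the per-IP map), and the 10-requests-in-10-seconds threshold is just the number suggested above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the rate check a BotFilter might run per client IP.
class RateWindow {
    private final int maxRequests;
    private final long windowMillis;
    private final Map<String, Deque<Long>> hits = new HashMap<>();

    RateWindow(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    // Record one request; returns true if the client is now over the limit.
    synchronized boolean overLimit(String clientIp, long nowMillis) {
        Deque<Long> times = hits.computeIfAbsent(clientIp, k -> new ArrayDeque<>());
        // Drop timestamps that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() > windowMillis) {
            times.pollFirst();
        }
        times.addLast(nowMillis);
        return times.size() > maxRequests;
    }
}
```

A production version would also evict idle IPs from the map so it doesn't grow without bound, and would key on something sturdier than the raw remote address when behind a proxy.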
1. Put a URL in robots.txt, but don't link to it from anywhere else. Anything that fetches that URL is a bad bot.
2. Create an area in robots.txt like /junk/*. Then, on some obscure page, create a link to something in /junk/. Make that link a one-pixel transparent image. No one could click on that, and no bot should follow it, so any bot that touches it is a bad bot and can be automatically added to the filter.
3. It's definitely a good idea to create a simple infinite tree of random inter-linked HTML. Give them something fun to index. Make sure that infinite tree is in robots.txt, or else Google will black hole your site as being a
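A sketch of that infinite tree. The class name and the /junk/ prefix are invented for illustration; the trick is that the links are derived deterministically from the requested path, so the tree is effectively infinite without storing anything server-side:

```java
import java.util.Random;

// Tarpit for bots that ignore robots.txt: every path under the disallowed
// /junk/ area renders a page of links deeper into the same tree.
class TarpitPage {
    // Child links are seeded from the path, so the same URL always
    // produces the same page, but no two pages repeat.
    static String render(String path, int linksPerPage) {
        Random rnd = new Random(path.hashCode());
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (int i = 0; i < linksPerPage; i++) {
            String child = path + "/" + Long.toHexString(rnd.nextLong());
            html.append("<a href=\"").append(child).append("\">")
                .append(child).append("</a><br/>\n");
        }
        return html.append("</body></html>").toString();
    }
}
```

Wiring this up to the /junk/ area in a servlet or filter, and keeping /junk/ disallowed in robots.txt, gives the bad bots something to chew on while the good ones never see it.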
I guess you know this but there are various options for:
<META NAME="ROBOTS" CONTENT="NONE"/>
INDEX: search engine robots should include this page.
FOLLOW: robots should follow links from this page to other pages.
NOINDEX: links can be explored, although the page is not indexed.
NOFOLLOW: the page can be indexed, but no links are explored.
NONE: robots can ignore the page.
NOARCHIVE: Google uses this to prevent archiving of the page.
See http://www.google.com/bot.html
Is cid=999 the max for a Seam app? If not, what is the max?
A couple of days ago there was the same post here...
cid is simply a long, so the max is the maximum value of a Java long.
GoogleBot was requesting the same page, with a different conversation ID (CID) parameter.
How/why do these bots do this? I.e., how does the bot know to increment/change the cid value and issue an HTTP GET request?