So you want the Google spiders to visit but not crawl the cids, or do you want to block them altogether?
Thanks Christian. I'll get rid of the s:links, which are (currently) all over my site.
I might go wild and decide to write my own sub-class of s:link that's able to detect robots.
For the performance problem, I'm upgrading Seam to SP1 right now; perhaps whatever bug was in 2.0.2.GA is what is killing my site.
I put on SP1, but we're still having massive stability problems on this site. It's not even high-traffic, but it can only stay up for a couple of hours at a time. It's getting a couple thousand hits an hour, which should be nothing.
This is really bad... I'll try getting rid of all the s:links to see if that helps.
Check that you are not running out of JVM heap space, and check how many HTTP sessions are created/in use. Running out of memory is not handled gracefully on the Java platform; it looks like stuff is crashing randomly (on JBoss I've seen lots of transaction and database connection errors).
Definitely set a maximum number of sessions. Calculating that number is a bit difficult though.
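A rough sketch of the counting half of that. The class and its names are my own invention, not a Seam or servlet API; in a real deployment it would be driven by an HttpSessionListener (sessionCreated calling tryAcquire, sessionDestroyed calling release), which I've left out so the sketch stays self-contained:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical cap on concurrent sessions. An HttpSessionListener would
// delegate to this; if tryAcquire() returns false, the app can refuse to
// create a new session (or invalidate it immediately).
class SessionCap {
    private final int max;
    private final AtomicInteger active = new AtomicInteger(0);

    SessionCap(int max) { this.max = max; }

    // Returns true if a new session is allowed under the cap.
    boolean tryAcquire() {
        while (true) {
            int n = active.get();
            if (n >= max) return false;
            if (active.compareAndSet(n, n + 1)) return true;
        }
    }

    void release() { active.decrementAndGet(); }

    int active() { return active.get(); }
}
```

The compare-and-set loop keeps the check-then-increment atomic without a lock, so it's safe under concurrent request threads.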
Thanks for the suggestions Christian. I will check those.
I did change a few things just now: I put propagation=none on all the s:links except for the few where it is needed. I then put a block in robots.txt to keep robots from crawling onto pages like the contact-us form, which start long-running conversations. After those changes, it is now staying up reliably. I really got burned by not thinking carefully about s:link and long-running conversations.
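The robots.txt block looks something like this (the paths here are made up for illustration; the real ones are whatever pages start long-running conversations):

```
User-agent: *
Disallow: /contact.seam
Disallow: /search.seam
```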
What I'm going to do is leave it the way it is, but I will write my own version of s:link that automatically enables propagation=none for bots. Maybe I'll add a property like propagation=nobots. That will let me keep conversations going for actual users, but not for bots. I do want users' conversations to keep going, so, for example, if a user does a search, his search terms and maybe the top search result will show up. But that's not needed for bots.
I'll post the tag and code when I get it done. If people like it, maybe it could be rolled back into the official Seam distribution. I already got one tag in there (I wrote s:swing) and I hope to contribute more.
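For the bot check itself, something along these lines is what the s:link subclass would consult before deciding whether to propagate the conversation. The class name and token list are my own sketch, not Seam API, and a real token list would be much longer:

```java
import java.util.Locale;

// Sketch of the check behind a hypothetical propagation="nobots":
// if the User-Agent looks like a crawler, render the link with
// propagation=none so no long-running conversation is started.
class BotDetector {
    // A few common crawler tokens; real lists are much longer.
    private static final String[] BOT_TOKENS = {
        "googlebot", "bingbot", "slurp", "baiduspider", "crawler", "spider", "bot"
    };

    static boolean isBot(String userAgent) {
        if (userAgent == null) return true; // no UA header: treat as a bot
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : BOT_TOKENS) {
            if (ua.contains(token)) return true;
        }
        return false;
    }
}
```

Treating a missing User-Agent as a bot is a design choice: real browsers always send one, and the misbehaving crawlers are exactly the ones that sometimes don't.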
How about a BotFilter or something if it turns out that there are multiple places where the problem might occur?
Might be useful for others, too. Something that can be configured through components.xml or web.xml.
I'm DEFINITELY putting a botfilter on this whole thing, mainly to catch bots that ignore robots.txt, of which there are quite a few. I want to a) exclude them from getting real content and b) provide them with an infinite tree of random HTML pages they can download. If you look in your logs you'll see there are some really obvious bots that don't even check for robots.txt and use user-agent strings pretending to be regular browsers. I don't want these on my site because they couldn't be doing anything good.
Of course the nasty bots aren't identifying themselves properly through headers, either :-/
Perhaps your filter should count access rates, and if you get more than 10 requests in 10 seconds, you are redirected to a page that says: "Sorry for the inconvenience, but we think you might be a bot, please solve this CAPTCHA" ;-)
I like the idea of using a CAPTCHA to get out of the dog house. But there are easier ways of getting into the dog house than just looking at request rates. Some of the misbehaving bots do space their requests, but they do obviously bad bot things, like:
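The rate-based check could be as simple as a sliding window per client. Everything here is a sketch of my own (class names, the per-IP map), and the 10-requests-in-10-seconds threshold is just the number suggested above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the rate check a BotFilter might run per client IP.
class RateWindow {
    private final int maxRequests;
    private final long windowMillis;
    private final Map<String, Deque<Long>> hits = new HashMap<>();

    RateWindow(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    // Record one request; returns true if the client is now over the limit.
    synchronized boolean overLimit(String clientIp, long nowMillis) {
        Deque<Long> times = hits.computeIfAbsent(clientIp, k -> new ArrayDeque<>());
        // Drop timestamps that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() > windowMillis) {
            times.pollFirst();
        }
        times.addLast(nowMillis);
        return times.size() > maxRequests;
    }
}
```

A production version would also evict idle IPs from the map so it doesn't grow without bound, and would key on something sturdier than the raw remote address when behind a proxy.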
1. Put a URL in robots.txt, but don't link to it from anywhere else. Anything that fetches that URL is a bad bot.
2. Create an area in robots.txt like /junk/*. Then, on some obscure page, create a link to something in /junk/. Make that link a one-pixel transparent image. No one could click on that, and no bot should follow it, so any bot that touches it is a bad bot and can be automatically added to the filter.
3. It's definitely a good idea to create a simple infinite tree of random inter-linked HTML. Give them something fun to index. Make sure that infinite tree is in robots.txt, or else Google will black hole your site as being a
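A sketch of that infinite tree. The class name and the /junk/ prefix are invented for illustration; the trick is that the links are derived deterministically from the requested path, so the tree is effectively infinite without storing anything server-side:

```java
import java.util.Random;

// Tarpit for bots that ignore robots.txt: every path under the disallowed
// /junk/ area renders a page of links deeper into the same tree.
class TarpitPage {
    // Child links are seeded from the path, so the same URL always
    // produces the same page, but no two pages repeat.
    static String render(String path, int linksPerPage) {
        Random rnd = new Random(path.hashCode());
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (int i = 0; i < linksPerPage; i++) {
            String child = path + "/" + Long.toHexString(rnd.nextLong());
            html.append("<a href=\"").append(child).append("\">")
                .append(child).append("</a><br/>\n");
        }
        return html.append("</body></html>").toString();
    }
}
```

Wiring this up to the /junk/ area in a servlet or filter, and keeping /junk/ disallowed in robots.txt, gives the bad bots something to chew on while the good ones never see it.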
I guess you know this but there are various options for:
<META NAME="ROBOTS" CONTENT="NONE"/>
INDEX: search engine robots should include this page.
FOLLOW: robots should follow links from this page to other pages.
NOINDEX: links can be explored, although the page is not indexed.
NOFOLLOW: the page can be indexed, but no links are explored.
NONE: robots can ignore the page.
NOARCHIVE: Google uses this to prevent archiving of the page.
See http://www.google.com/bot.html
Is cid=999 the max for a Seam app? If not, what is the max?
A couple of days ago there was the same post here...
cid is simply a long, so the max is the maximum value of a Java long.
GoogleBot was requesting the same page, with a different conversation ID (CID) parameter.
How/why do these bots do this? I.e., how does the bot know to increment/change the cid value and issue an HTTP GET request?