[opencms-dev] search engines and static vs. dynamic contents

Doychi doychi-lists at doychi-dina.ath.cx
Tue Nov 15 23:34:02 CET 2005


On 1:35:11 2005-11-16 "Code Create, Bernd Wolfsegger" <bw at code-create.com>
wrote:
<snip>
> Well, any experts here? :)

I wouldn't call myself an expert, but I've worked with Verity's K2 for a
number of years and what follows are some suggestions that generally get
thrown around to simplify the job of indexing/crawling documents.

> I don't think that a robot can do more than make http requests etc..
> (Anything else would be a security case) And that is really something
> different from accessing the servers file system.

Got it in one.

> You have a problem with such dynamically generated sites, where you
> have a "controller" JSP (always the same URL) and thousands of GET
> parameters to determine which content to show.
> But that's not the case with OpenCms. The URLs look exactly like
> static content URLs. No difference.
<snip>

Assumption:  The site has already been spidered/indexed once by the search
engine.

GET parameters are a problem, but not the only one.  It depends a little on
how the search engine spiders the site.  Some engines only check whether the
"last modified" time of the root page is newer than the spider's previous
visit; if it isn't, they skip the page and never check any pages further
down the tree.  Others check every page they already know about to see if
it is newer, and then spider onward from the newer pages.  The second method
is safer, and I suspect most search engines are using it now.
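
To make the second method concrete, here is a rough sketch in Python.  It
is only an illustration, not how K2 or any other engine actually does it.
The function name pages_to_respider and the known_pages mapping are made
up for the example, and I am assuming the spider has already collected the
"Last-Modified" header of each known page (say, with HEAD requests).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def pages_to_respider(known_pages, last_crawl):
    """Return the URLs whose Last-Modified value is newer than last_crawl.

    known_pages maps URL -> Last-Modified header value, or None if the
    server sent no such header.  Pages with a missing or unparsable
    header are always re-spidered, because the spider cannot tell
    whether they have changed.
    """
    stale = []
    for url, last_modified in known_pages.items():
        if last_modified is None:
            stale.append(url)
            continue
        try:
            modified = parsedate_to_datetime(last_modified)
        except (TypeError, ValueError):
            stale.append(url)
            continue
        if modified.tzinfo is None:
            # Treat dates without a usable zone as UTC (an assumption).
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            stale.append(url)
    return stale


if __name__ == "__main__":
    known = {
        "/index.html": "Tue, 15 Nov 2005 22:34:02 GMT",
        "/about.html": "Mon, 01 Aug 2005 10:00:00 GMT",
        "/news.html": None,
    }
    previous_visit = datetime(2005, 11, 1, tzinfo=timezone.utc)
    print(pages_to_respider(known, previous_visit))
```

Note that every known page is checked, so a change deep in the tree is
found even when the root page itself has not changed.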

If you want to make your site highly accessible to spiders I would recommend:

1) Don't use JavaScript in links that you want the spider to follow.  Some
engines will be able to follow SOME JavaScript links, but given how
flexibly JavaScript can mangle links I wouldn't trust it.

2) Don't use GET/POST parameters to change the information on a page.
Some search engines have options that allow the GET parameters to be used
in identifying pages for indexing, but I wouldn't rely on that either.

3) Make sure the "Last Modified" is set correctly.  This does two things:
it saves the spider from processing pages it doesn't have to, which
reduces the load on your servers, and it ensures that the content is
indexed if it is new.
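
For point 3, the decision the server has to make can be sketched like
this.  Again this is only a rough Python illustration; in OpenCms the
header would be set by the Java side, and the function name
conditional_response is made up for the example.  It assumes you know
when the content last changed as a timezone-aware timestamp.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(content_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at content_modified.

    content_modified must be a timezone-aware datetime.  HTTP dates have
    one-second resolution, so microseconds are dropped before comparing.
    """
    content_modified = content_modified.astimezone(timezone.utc)
    content_modified = content_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(content_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: ignore it and send the page
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if content_modified <= since:
                # Nothing new since the spider's last visit: no body needed.
                return 304, headers
    return 200, headers


if __name__ == "__main__":
    changed = datetime(2005, 11, 15, 22, 34, 2, tzinfo=timezone.utc)
    print(conditional_response(changed))
    print(conditional_response(changed, "Tue, 15 Nov 2005 22:34:02 GMT"))
```

A 304 answer is what saves the load: the spider learns the page is
unchanged without the server having to render it again.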

Anyway, I hope this helps.

--
Doychi
spdoychiam at doychi-dina.ath.cx


