[opencms-dev] search engines and static vs. dynamic contents

Code Create, Bernd Wolfsegger bw at code-create.com
Wed Nov 16 19:09:53 CET 2005


Well,

as for: 3) Make sure the "Last Modified" is set correctly:

I found that pages generated by OpenCms carry a new "Last Modified" date every 
time they are requested ...
Is it somehow possible to change this behaviour?
Does anybody know whether it has any effect to set the Last-Modified header in 
the response manually in the JSP, using the OpenCms last edited date?
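
To make the question concrete, here is a minimal sketch of what I have in
mind. It is untested against OpenCms. The class name LastModifiedSketch and
the method isNotModified are made up for illustration, and the OpenCms calls
in the trailing comment (CmsJspActionElement, readResource,
getDateLastModified) are my assumption about the API. Only the comparison
logic itself is plain Java:

```java
/**
 * Hypothetical helper: decides whether a conditional GET can be answered
 * with "304 Not Modified", given the last edited date of the resource.
 */
class LastModifiedSketch {

    /** Value HttpServletRequest.getDateHeader() returns when the header is absent. */
    static final long NO_HEADER = -1L;

    /**
     * HTTP dates carry whole seconds only, so both values are truncated
     * to seconds before they are compared.
     */
    static boolean isNotModified(long lastModifiedMillis, long ifModifiedSinceMillis) {
        if (ifModifiedSinceMillis == NO_HEADER) {
            return false; // unconditional request: always send the page
        }
        return lastModifiedMillis / 1000 <= ifModifiedSinceMillis / 1000;
    }

    public static void main(String[] args) {
        long edited = 1132164593000L; // example last edited time, epoch millis
        System.out.println(isNotModified(edited, NO_HEADER));     // no header sent
        System.out.println(isNotModified(edited, edited));        // spider copy is current
        System.out.println(isNotModified(edited + 5000, edited)); // edited after last crawl
    }
}

// In the JSP this might be wired up roughly like this (assumed, not verified):
//
//   CmsJspActionElement cms = new CmsJspActionElement(pageContext, request, response);
//   long edited = cms.getCmsObject()
//       .readResource(cms.getRequestContext().getUri()).getDateLastModified();
//   if (LastModifiedSketch.isNotModified(edited,
//           request.getDateHeader("If-Modified-Since"))) {
//       response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
//       return;
//   }
//   response.setDateHeader("Last-Modified", edited);
```

Two caveats with this approach: a page is usually assembled from several
resources (template, navigation, included elements), so the date of the
requested file alone may be too old, and I do not know whether OpenCms
overwrites the header again later in the request.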

Any opinions? :-)

Kind regards, Bernd

On Wednesday, 16. November 2005 01:34, Doychi wrote:
> On 1:35:11 2005-11-16 "Code Create, Bernd Wolfsegger" <bw at code-create.com>
> wrote:
> <snip>
>
> > Well, any experts here? :)
>
> I wouldn't call myself an expert, but I've worked with Verity's K2 for a
> number of years and what follows are some suggestions that generally get
> thrown around to simplify the job of indexing/crawling documents.
>
> > I don't think that a robot can do more than make HTTP requests etc.
> > (Anything else would be a security issue.) And that is really something
> > different from accessing the server's file system.
>
> Got it in one.
>
> > You have a problem with such dynamically generated sites, where you
> > have a "controller" JSP (always the same URL) and thousands of GET
> > parameters to determine which content to show.
> > But that's not the case with OpenCms. The URLs look exactly like
> > static content URLs. No difference.
>
> <snip>
>
> Assumption:  The site has already been spidered/indexed once by the search
> engine.
>
> Get parameters are a problem, but not the only one.  It depends a little on
> how the search engine spiders the site.  Some will only check that the
> "last modified" time of the root page is newer than when the spider last
> found the page and if it isn't new then it won't process the page, and
> won't check any pages further down the tree.  Others will check every page
> they already know about to see if it is newer and then spider from the
> newer pages.  The second method is safer and I suspect most search engines
> are using this method now.
>
> If you want to make your site highly accessible to spiders I would
> recommend:
>
> 1) Don't use JavaScript in links that you want the spider to follow.  Some
> engines will be able to follow SOME JavaScript links, but due to the
> flexibility of JavaScript to mangle links I wouldn't trust it.
>
> 2) Don't use GET/POST parameters to change the information on a page.
> Some search engines have options to allow the GET parameters to be
> used in identifying pages for indexing, but again I wouldn't recommend it.
>
> 3) Make sure the "Last Modified" is set correctly.  This does two things: it
> prevents the spider from having to process pages it doesn't have to,
> reducing the load on your servers, and it also ensures that the content is
> indexed if it is new.
>
> Anyway, I hope this helps.
>
> --
> Doychi
> spdoychiam at doychi-dina.ath.cx

-- 

[  Code Create
[  Web Content Management and Presentation


[  Bernd Wolfsegger
[  Sun Certified Programmer for Java(TM) 2 Platform


[  Lohmeyerstrasse 13
[  10587 Berlin
[  Germany
[  Fon +49 (0)30 26555788
[  Fax +49 (0)30 2651835
[  Mobile +49 (0)163 6505622

[  bw at code-create.com
[  http://www.code-create.com/