Mirroring a dynamic site

We're delighted to welcome crawlers such as the Yahoo, Inktomi, Google and MSN spiders to our web site to index content - it's in our mutual interest. Those crawlers are all well written: they analyse the pages that they collect, note patterns, and tailor their activities to make the best use of a dynamic site that offers 16 alternative versions (4 font sets x 4 colour sets) of each page.

Robots which simply collect the whole of a web site for local offline browsing, such as HTTrack and (in some guises) Wget, can be more problematic; they're simply not well suited to mirroring dynamic sites and will try to gather every possible page, skewing web statistics and in bad cases restricting the resources available to others. And it's doubtful whether any realistic use will be made of the data gathered.

I was watching HTTrack struggling to copy our dynamic website to a static mirror yesterday morning - every 4 seconds, another hit. It took 5 minutes just to get the help pages for the ad hoc query demo, by the time it had them in green on black in a tiny font, blue on yellow in a huge font, and all the intermediate settings. It's a waste of our resources and, frankly, I doubt whether the person making the mirror will find it of any use.
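To put rough numbers on that, here is a small Python sketch. Only the counts (4 font sets, 4 colour sets, one hit every 4 seconds, 5 minutes) come from the observation above; the variant labels are placeholders, and the sketch assumes every hit was a page variant rather than an image or stylesheet.

```python
from itertools import product

# Only the counts (4 font sets x 4 colour sets) come from the post;
# the labels themselves are placeholders for illustration.
FONT_SETS = ("tiny", "small", "large", "huge")
COLOUR_SETS = ("green-on-black", "blue-on-yellow", "scheme-3", "scheme-4")

# Every page exists once per (font set, colour set) combination.
variants = list(product(FONT_SETS, COLOUR_SETS))

# One hit every 4 seconds, sustained for 5 minutes.
hits_observed = (5 * 60) // 4

# If every hit were a page variant, this is how many distinct pages
# the mirror actually covered in that time.
distinct_pages = hits_observed / len(variants)

print(len(variants), hits_observed, distinct_pages)
```

On those assumptions, five minutes of steady requests covers fewer than five distinct pages of real content.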

So as a web site owner, should I discourage such mirroring, and if so, how?

My first thought is to modify my robots.txt file to disallow all downloads by Wget and HTTrack - except that I would need to check that they actually respect the standard before I go to the trouble, and in any case we WELCOME users who sensibly use the utilities to download a few pages for offline viewing.
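As a sketch of what that robots.txt might look like, the Python below holds a candidate file as a string and checks it with the standard library's urllib.robotparser. The user-agent tokens "HTTrack" and "Wget" are assumptions about what those tools match against, and the rules are illustrative rather than what is actually deployed on this site.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt: shut out the two mirroring tools entirely and
# leave everything open to all other robots. The tokens "HTTrack" and
# "Wget" are assumed, not confirmed against each tool's documentation.
ROBOTS_TXT = """\
User-agent: HTTrack
Disallow: /

User-agent: Wget
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(text):
    """Parse robots.txt text held in memory rather than fetched by URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for agent in ("HTTrack", "Wget", "Googlebot"):
    # can_fetch() reports whether this agent may request the given path.
    print(agent, parser.can_fetch(agent, "/horse/"))
```

Note that a rule like this is all or nothing: it cannot tell a user fetching a few pages from one attempting a full mirror, and it only works if the robot chooses to honour it.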

A second thought is to use my denial of service mechanism to trigger a delay, so that file accesses from a single remote host are delayed once they reach a certain threshold in a certain time - except that this would be just as likely to trap the legitimate / welcomed "bots" unless I put in some user agent specific logic, which would be high maintenance as it would need updating as new agents come along. And I certainly don't want to go down the "ban xxx IP address" road either.

A look at the HTTrack FAQ rather confirms my worries that neither of the above solutions is ideal; although HTTrack respects robots.txt, that can be turned off. Advice to users suggests that they "use time limits", "for large mirrors, ask the webmaster first", "try not to download during working hours" and "do not download too large websites - use filters", and I see all four of these pieces of advice NOT followed. It also asks "Are the pages copyrighted?" and, yes, they are.

But, in reality, it's no great issue to us if one or two users pull huge amounts of files they'll never use off our system.
(written 2006-04-12 04:56:08)

 
Associated topics are indexed under
G902 - Well House Consultants - Web site techniques, utility and visibility

This is a page archived from The Horse's Mouth at http://www.wellho.net/horse/ - the diary and writings of Graham Ellis. Every attempt was made to provide current information at the time the page was written, but things do move forward in our business - new software releases, price changes, new techniques. Please check back via our main site for current courses, prices, versions, etc - any mention of a price in "The Horse's Mouth" cannot be taken as an offer to supply at that price.


© WELL HOUSE CONSULTANTS LTD., 2008: Well House Manor • 48 Spa Road • Melksham, Wiltshire • United Kingdom • SN12 7NY
PH: 0800 043 8225 or 01225 708225 • FAX: 0845 8382 405 or 01225 707126 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho