- Why do you crawl the .dk domain?
- In order to obtain statistics on which web servers are used in Denmark.
- Why does that interest you?
- Well. It just does.
- Why don't you just look at Netcraft's survey?
- Their data isn't specific to Denmark (but I do study their data with interest).
- Why don't you look at E-soft's survey then?
- Their sample of the Danish domains is quite small, less than 10%. My data does seem to agree with theirs, though.
- How often do you crawl .dk?
- Once a month, at the start of the month.
- Why not more often?
- The crawl does take up a fair amount of bandwidth and memory, and the machine it runs on has other duties as well.
- How long does it take, then?
- Approximately 11 hours, with 20 crawlers running in parallel.
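A minimal sketch of running 20 crawlers in parallel. It is written in Python for illustration and is not the .dk-bot source. The names `crawl_all` and `visit` are hypothetical, and using threads (rather than separate processes) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(domains, visit, workers=20):
    """Call visit(domain) for every domain using a pool of worker threads.

    `visit` is a hypothetical per-domain function supplied by the caller.
    Returns a dict mapping each domain to whatever visit() returned.
    """
    domains = list(domains)  # the sequence is iterated twice below
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input
        return dict(zip(domains, pool.map(visit, domains)))
```

Crawling is mostly waiting on the network, so a thread pool is one simple way to keep 20 requests in flight at once.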
- How many times does the crawler visit a server?
- Well, that depends. Each crawler visits a given IP address only once, unless the server is running Microsoft's IIS.
- Why do you visit servers running IIS more often?
- Because IIS doesn't list the modules it uses in its Server header. So, to check for PHP, you have to check every domain.
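The visiting rule above might be sketched as follows, in Python for illustration rather than the tool's own Perl. The function names and the `server_by_ip` bookkeeping are hypothetical, and the sample header in the usage note is an assumed example of what an Apache server might send. How the per-domain PHP check on IIS is actually carried out is not covered here.

```python
import re


def server_products(server_header):
    """Split a Server header into a {name: version} dict.

    Parenthesised comments such as "(Unix)" are dropped first.
    """
    without_comments = re.sub(r"\([^)]*\)", " ", server_header)
    products = {}
    for token in without_comments.split():
        name, _, version = token.partition("/")
        products[name] = version
    return products


def needs_per_domain_check(server_header):
    """IIS does not advertise its modules, so every domain must be checked."""
    return "Microsoft-IIS" in server_products(server_header)


def should_visit(ip, server_by_ip):
    """Decide whether to visit a domain that resolves to `ip`.

    `server_by_ip` maps already-visited IP addresses to the Server
    header they returned (hypothetical bookkeeping structure).
    """
    if ip not in server_by_ip:
        return True  # first visit to this address
    return needs_per_domain_check(server_by_ip[ip])
```

For a header like `Apache/1.3.26 (Unix) PHP/4.2.2`, `server_products` exposes PHP directly, so one visit per IP address is enough. For `Microsoft-IIS/5.0` nothing beyond the server name is listed, which is why those addresses are revisited for each domain.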
- What pages does the crawler fetch from a server?
- Only the front page, and even then the crawler only asks the web server how big the page is and when it was last modified.
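Asking only for a page's size and modification time corresponds to an HTTP HEAD request, which returns headers without the page body. The sketch below assumes that is the mechanism. It is Python for illustration, not the .dk-bot source, and the function names and the `example-bot/0.1` User-Agent string are hypothetical.

```python
import http.client
from email.utils import parsedate_to_datetime


def summarize_headers(headers):
    """Pick the page size and modification time out of response headers.

    `headers` is any mapping with a .get() method.
    """
    length = headers.get("Content-Length")
    modified = headers.get("Last-Modified")
    try:
        modified_at = parsedate_to_datetime(modified) if modified else None
    except (TypeError, ValueError):
        modified_at = None  # header present but not a parsable date
    return {
        "server": headers.get("Server"),
        "size": int(length) if length and length.isdigit() else None,
        "last_modified": modified_at,
    }


def head_front_page(host, port=80, timeout=10.0):
    """Send HEAD / to the host and summarize the response headers."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/", headers={"User-Agent": "example-bot/0.1"})
        response = conn.getresponse()
        return summarize_headers(response.headers)
    finally:
        conn.close()
```

Because no body is transferred, each visit costs little more than the request and response headers.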
- How much bandwidth does the crawl consume?
- Something like 600-700 MB, I think.
- What software do you use?
- A tool I wrote myself in Perl (.dk-bot). The source code is available here.
Last updated: 2002-08-11 02:57