Q: I got some weird messages telling me that robots.txt does not allow several files to be captured. What's going on?
A: These rules, stored in a file called robots.txt, are provided by the website to specify which links or folders should not be crawled by robots and spiders – for example, /cgi-bin or large image files. HTTrack follows them by default, as is recommended. As a result, you may miss some files that would otherwise have been downloaded – check your logs to see if this is the case:

Info: Note: due to www.foobar.com remote robots.txt rules, links begining with these path will be forbidden: /cgi-bin/,/images/ (see in the options to disable this)

If you want to ignore these rules, just change the corresponding option in the option list! (but only disable this option with great care: some restricted parts of the website might be huge or not downloadable). A short sketch of such a robots.txt file is given at the end of this section.

Q: I have duplicate files! What's going on?
A: This is generally the case for top indexes (index.html and index-2.html), isn't it? This is a common issue, but it cannot easily be avoided! For example, http://www.foobar.com/ and http://www.foo
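
Sketch for the robots.txt question above: a minimal robots.txt that would produce the log message quoted there might look like the following. The host www.foobar.com and the two paths are simply the example values taken from that message, not real rules:

  # http://www.foobar.com/robots.txt (hypothetical example)
  # Ask every robot/spider to skip the CGI scripts and the image folder
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /images/

For command-line users, robots.txt handling is controlled by HTTrack's -sN switch (0 = never follow robots.txt rules, 1 = sometimes, 2 = always, the default), so something like "httrack http://www.foobar.com/ -s0" ignores them – again, use this with care.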