Q: I got some weird messages telling me that robots.txt does not allow several files to be captured. What's going on?
A: These rules, stored in a file called robots.txt, are provided by the website to specify which links or folders should not be crawled by robots and spiders – for example, /cgi-bin or large image files. HTTrack follows them by default, as is recommended. As a result, you may miss some files that would otherwise have been downloaded – check your logs to see if this is the case:

Info: Note: due to www.foobar.com remote robots.txt rules, links begining with these path will be forbidden: /cgi-bin/,/images/ (see in the options to disable this)

If you want to ignore these rules, just change the corresponding option in the option list! (but only disable this option with great care: some restricted parts of the website might be huge or not downloadable). A short sketch of such a robots.txt file is given at the end of this section.

Q: I have duplicate files! What's going on?
A: This is generally the case for top indexes (index.html and index-2.html), isn't it? This is a common issue, but it cannot easily be avoided! For example, http://www.foobar.com/ and http://www.foo
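
Sketch for the robots.txt question above: a minimal robots.txt that would produce the log message quoted there might look like the following. The host www.foobar.com and the two paths are simply the example values taken from that message, not real rules:

  # http://www.foobar.com/robots.txt (hypothetical example)
  # Ask every robot/spider to skip the CGI scripts and the image folder
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /images/

For command-line users, robots.txt handling is controlled by HTTrack's -sN switch (0 = never follow robots.txt rules, 1 = sometimes, 2 = always, the default), so something like "httrack http://www.foobar.com/ -s0" ignores them – again, use this with care.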