Does HarvestMan obey the Robots Exclusion Protocol?
Yes. HarvestMan respects the rules that website managers lay down in the robots.txt file on the web server. These rules restrict crawling of certain areas of the web site depending on the user agent of the client. (Some site owners block entire sections for all clients.) HarvestMan obeys the Robots Exclusion Protocol by default. There is a way to bypass the protocol by disabling this feature. However, it is a good idea to always leave it enabled, both to follow Internet etiquette and to avoid the risk of being fined or sued by website owners for ignoring their robots.txt rules. Python's standard library includes a robots.txt parser (the robotparser module, named urllib.robotparser in Python 3). HarvestMan uses a customised form of this module.
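
As a rough illustration of how robots.txt rules depend on the user agent, here is a minimal sketch using the stock Python 3 parser, urllib.robotparser. It is not HarvestMan's customised module. The sample rules, the "ExampleBot" and "OtherBot" agent names, and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# All clients are kept out of /private/; "ExampleBot" is blocked entirely.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text directly, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A client with no specific entry falls back to the "*" rules.
public_ok = is_allowed(parser, "OtherBot", "http://example.com/public/index.html")
private_ok = is_allowed(parser, "OtherBot", "http://example.com/private/data.html")

# A client that is named explicitly gets its own, stricter rules.
blocked_ok = is_allowed(parser, "ExampleBot", "http://example.com/public/index.html")

print(public_ok, private_ok, blocked_ok)
```

A crawler that honours the protocol would make a check like is_allowed() before each request and skip any URL for which it returns False.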