
Can I restart a project to download only links that failed in a previous run (web-page caching)?

Yes. HarvestMan has a built-in caching mechanism for documents downloaded from the network; it is available from version 1.2 and enabled by default. HarvestMan uses an MD5 checksum as the signature of each document's data and keeps the cache in hm-cache in the HarvestMan project directory. When you restart a project, HarvestMan loads the cache information for the project, if it exists. When it encounters a URL, it compares the signature of the URL's data with the signature stored in the cache for that URL. If the two match, the document has not changed, so HarvestMan skips the URL; otherwise it downloads it. The cache is regenerated at the end of every project. HarvestMan also catches keyboard interrupts, so the cache is still generated if you end the program that way, and network bandwidth is not wasted on the next run. You can turn off web-page caching by disabling the corresponding variable in the HarvestMan configuration file.
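The signature comparison described above can be illustrated with a small sketch. This is not HarvestMan's own code: the function names (`signature`, `load_cache`, `save_cache`, `run_project`), the JSON cache file, and the `fetch` callable are all assumptions made for illustration. It only shows the general idea of keeping an MD5 signature per URL, skipping documents whose signature is unchanged, retrying URLs that failed earlier, and writing the cache even when the run is interrupted.

```python
import hashlib
import json
import os


def signature(data: bytes) -> str:
    """Return the MD5 hex digest of the document data (its signature)."""
    return hashlib.md5(data).hexdigest()


def load_cache(path: str) -> dict:
    """Load the url -> signature map from a previous run, if it exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def save_cache(path: str, cache: dict) -> None:
    """Write the url -> signature map to disk."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)


def run_project(urls, fetch, cache_path):
    """Process urls and return those that were new or changed.

    `fetch` is a hypothetical callable: url -> bytes, raising OSError
    on failure. Failed urls are never cached, so a later run retries them.
    """
    old = load_cache(cache_path)
    new = dict(old)
    updated = []
    try:
        for url in urls:
            try:
                data = fetch(url)
            except OSError:
                continue  # failed link: stays out of the cache
            sig = signature(data)
            if old.get(url) == sig:
                continue  # signature unchanged: skip this url
            new[url] = sig
            updated.append(url)
    finally:
        # Runs on normal exit and on KeyboardInterrupt alike,
        # so the work done so far is not lost.
        save_cache(cache_path, new)
    return updated
```

In this sketch a second call to `run_project` with the same cache file returns only the URLs that failed or changed since the first call. The sketch fetches the data before comparing signatures for simplicity; how HarvestMan itself obtains the data for comparison is not described here.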
