My web crawler needs to use a web proxy, user authentication, cookies, a special user-agent, etc. What do I do?
WebSPHINX uses the built-in Java classes URL and URLConnection to fetch web pages. If you're running the Crawler Workbench inside a browser, your crawler uses the browser's proxy, authentication, cookies, and user-agent, so if you can visit a site manually, you can crawl it. If you're running your crawler from the command line, however, you'll have to configure Java yourself to set up your proxy, authentication, user-agent, and so forth, as in the sketch below.
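One way to do this is to configure the standard Java networking settings at the JVM level before the crawler starts, since anything fetched through URL/URLConnection picks them up. The sketch below is only an illustration, not part of WebSPHINX itself: the class name CrawlerSetup, the proxy host and port, the credentials, and the user-agent string are all placeholders, and CookieManager requires Java 6 or later.

```java
import java.net.Authenticator;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.PasswordAuthentication;

public class CrawlerSetup {

    public static void configureNetworking() {
        // Route HTTP connections through a web proxy (host and port are placeholders).
        System.setProperty("http.proxyHost", "proxy.example.com");
        System.setProperty("http.proxyPort", "8080");

        // Answer proxy or site authentication challenges with fixed credentials.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("username", "password".toCharArray());
            }
        });

        // Accept and resend cookies on all URLConnection requests (Java 6+).
        CookieHandler.setDefault(new CookieManager());

        // Set the default User-Agent string that URLConnection sends.
        System.setProperty("http.agent", "MyCrawler/1.0");
    }

    public static void main(String[] args) {
        configureNetworking();
        // ... start your WebSPHINX crawler here ...
    }
}
```

The proxy host, port, and user-agent can also be passed on the command line with -D flags (for example -Dhttp.proxyHost=... -Dhttp.proxyPort=...) instead of being set in code.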