What types of pages are automatically banned during a SocSciBot 4 crawl?
SocSciBot 4 bans pages that the site itself asks crawlers to exclude through the robots.txt protocol. It also bans URLs containing any of the following strings, all of which are commonly found in mirror sites or large collections of dynamic pages: /cgi-bin/ .cgi .dll archive /calendar/ /ftp/ ftp. /handbook/ hypermail javadoc java/doc /JDK1. /JDK/ /JDK2. /manual/ /manuals/ mirror /parser.pl/ pipermail /record= /roombooking/ sashtml /search/ sessionid timetable twiki unixhelp wwwstats webstats. If the "ban bulletin boards" option is selected, it also bans URLs containing bbs.
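
To show how a filter of this kind behaves, here is a minimal Python sketch of substring-based URL banning. It is not SocSciBot's actual code: the names BANNED_SUBSTRINGS and is_banned are hypothetical, and case-sensitive matching is an assumption, since the answer above does not say how case is handled. The robots.txt check is left out.

```python
# Illustrative sketch only, not SocSciBot's implementation.
# The substrings are copied from the list in the answer above.
BANNED_SUBSTRINGS = [
    "/cgi-bin/", ".cgi", ".dll", "archive", "/calendar/", "/ftp/", "ftp.",
    "/handbook/", "hypermail", "javadoc", "java/doc", "/JDK1.", "/JDK/",
    "/JDK2.", "/manual/", "/manuals/", "mirror", "/parser.pl/", "pipermail",
    "/record=", "/roombooking/", "sashtml", "/search/", "sessionid",
    "timetable", "twiki", "unixhelp", "wwwstats", "webstats",
]


def is_banned(url, ban_bulletin_boards=False):
    """Return True if the URL contains any banned substring.

    Matching is case-sensitive here, which is an assumption.
    """
    patterns = list(BANNED_SUBSTRINGS)
    if ban_bulletin_boards:
        # Optional extra pattern, mirroring the "ban bulletin boards" option.
        patterns.append("bbs")
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    print(is_banned("http://www.example.ac.uk/cgi-bin/form"))
    print(is_banned("http://www.example.ac.uk/bbs/", ban_bulletin_boards=True))
```

Because the match is on substrings anywhere in the URL, short patterns such as archive or mirror can also exclude legitimate pages whose paths happen to contain those words.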