Bill has been working on a project that has piqued my interest for quite some time. It’s called CrawlWall and it’s a tool that can help keep scrapers at bay.
Here’s what Bill had to say about CrawlWall when I interviewed him recently.
You’re an engineer and software developer by trade, so why do I always see you at search related events?
That’s an easy one: I attend search conferences because I build websites!
Seriously, anyone who builds websites, even the programmer, needs to understand the current state of search technologies to make sure those sites have all the required technology integrated to allow them to rank well.
BTW, if you didn’t know already, I helped get Googlebot to verify itself in order to stop 302 proxy hijacking, back in ’06 at SES in San Jose.
You seem to have a more than healthy disdain for spam and rogue bots. Is that what inspired you to create CrawlWall?
Inspiration came quickly when the scrapers were impacting my ability to be online and make money.
Scrapers started hitting my large dynamic sites as hard as a DDoS attack on a somewhat regular basis and made my servers unresponsive. If I wasn’t nearby to stop the attacks manually, sometimes the servers didn’t recover for well over an hour until the high-speed scraping subsided.
Additionally, the scrapers were harming my search engine rankings by using my content against my own site. Something had to be done. It evolved into CrawlWall.
Who do you think are the ideal users for CrawlWall?
CrawlWall is basically a firewall for websites, so anyone trying to protect their content from being stolen and their server resources from being attacked is a candidate. It’s easier to stop content from being stolen by blocking the scrapers in the first place than it is to waste time tracking stolen content after the fact. Chasing copyright infringers and sending DMCA letters is a massive waste of time and not conducive to being productive or profitable.
You’ll never stop all of the scrapers with CrawlWall, or any similar product for that matter. However, you can make a big enough dent in the problem that it becomes just a minor annoyance.
What’s your favorite feature of CrawlWall?
The fact that it’s all automatic. Once I trusted the technology was working well, I was finally able to relax and stop thinking about scrapers.
However, if I must pick one feature, it’s the CrawlWall Robots.txt Defender, which actually puts teeth in the toothless robots.txt standard. Any bot that asks for robots.txt and is disallowed is automatically blocked from any further visits to the site by CrawlWall. Likewise, spiders that are allowed into the site are blocked from visiting any pages disallowed in the robots.txt rules. No cheating on robots.txt is allowed when CrawlWall is watching.
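To make the robots.txt enforcement idea concrete, here is a minimal sketch in Python. It is not CrawlWall's actual code, which is not shown in this interview. The class name `RobotsEnforcer`, the method `allow_request`, and the choice to track blocked clients by IP address are all assumptions made for illustration. It uses the standard library's `urllib.robotparser` to evaluate the rules.

```python
from urllib.robotparser import RobotFileParser


class RobotsEnforcer:
    """Hypothetical sketch: turn robots.txt rules into hard blocks."""

    def __init__(self, robots_txt: str):
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.blocked_ips = set()  # assumption: clients are tracked by IP

    def allow_request(self, ip: str, user_agent: str, path: str) -> bool:
        """Return True if the request should be served, False if denied."""
        if ip in self.blocked_ips:
            return False
        if path == "/robots.txt":
            # A bot that is disallowed from the whole site gets to read the
            # rules once, then is blocked from any further visits.
            if not self.parser.can_fetch(user_agent, "/"):
                self.blocked_ips.add(ip)
            return True
        # Bots that are allowed in are still denied any disallowed page.
        return self.parser.can_fetch(user_agent, path)


RULES = """
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

enforcer = RobotsEnforcer(RULES)
print(enforcer.allow_request("10.0.0.1", "BadBot/1.0", "/robots.txt"))
print(enforcer.allow_request("10.0.0.1", "BadBot/1.0", "/index.html"))
print(enforcer.allow_request("10.0.0.2", "GoodBot/2.0", "/private/data"))
```

In this sketch the disallowed bot is served robots.txt once and refused afterwards, while an allowed bot is only refused the pages the rules exclude. A real deployment would also need a way to decide which clients are crawlers in the first place, which is outside the scope of this example.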
Are you working on any other projects?
Ironically, the maker of CrawlWall, which blocks crawlers from accessing websites, also authors a crawler called LinkScrubber.
The need for LinkScrubber came from trying to manage a large directory with many thousands of links. Manually verifying all of the outbound links was impossible, and none of the other link checkers we used could catch sites that transitioned away from the original owner. If other link checkers see a site return a “200 OK” status they think the site is fine, regardless of the content it currently serves. LinkScrubber actually analyzes the content being served by the external links and can identify many thousands of page profiles that indicate the site is no longer OK at all and needs to be removed from your site.
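The difference between checking a status code and checking the content can be sketched in a few lines of Python. This is only an illustration of the general idea, not LinkScrubber's implementation. The three patterns in `BAD_PAGE_PROFILES` and the function `classify_link` are hypothetical stand-ins for the many thousands of page profiles mentioned above.

```python
import re

# Hypothetical page profiles, for illustration only. A real checker would
# need far more of these, and far more robust ones.
BAD_PAGE_PROFILES = {
    "parked domain": re.compile(
        r"this domain (is|may be) for sale|domain parking", re.I
    ),
    "expired domain": re.compile(r"domain (has )?expired", re.I),
    "suspended account": re.compile(r"account (has been )?suspended", re.I),
}


def classify_link(status: int, body: str) -> str:
    """Classify a fetched link using both its HTTP status and its content."""
    if status != 200:
        return f"http error {status}"
    # A "200 OK" is not enough: inspect what the page actually serves.
    for label, pattern in BAD_PAGE_PROFILES.items():
        if pattern.search(body):
            return label
    return "ok"


print(classify_link(200, "<h1>This domain is for sale</h1>"))
print(classify_link(200, "<h1>Welcome to our bakery</h1>"))
```

A status-only checker would treat both sample pages as healthy, whereas the content check flags the first one as a parked domain.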
That’s how we came up with our slogan, which says it all: “When 200 is NOT OK!”