The first thing you do when pen-testing a …
… network: ping sweep
… host: port scan
… website: crawl
Each process begins by gathering intelligence and mapping out the attack surface as best as possible. Without a well-understood attack surface generating results from a scan/pen-test/assessment or whatever is extremely difficult. I bring this up because there’s been a lot of interesting discussion lately about evaluating web application vulnerability scanners though measuring code coverage and crawling capabilities. At WhiteHat Security we’ve spent years R&D’ing our particular crawling technology and have plenty of insights to share about the challenges in this particular area.
Crawl and code coverage capabilities are important metrics because automating as much of a website vulnerability assessment process as possible is a really good thing. Automation saves time and money and both are in short supply. The better a website crawl (even if manually assisted), the more thorough the vulnerability assessment. If you’ve ever experienced a scanner spinning out of control and not completing for hours or days (perpetually sitting at 5%), or the opposite when a report comes up clean 2 minutes later after clicking scan, there’s a good chance it’s the result of a bad crawl. The crawler of any good web application scanner has to be able to reach every nook and cranny of a website to map the attack surface, isolate the injection points, or run the risk of false negatives.
Then there is the matter of scale as some websites contain millions of links, some even growing and decaying faster than anything can keep up with. Without a database to hold the crawl data, scanners tend to slow or die at about 100,000 links due to memory exhaustion. However to use Amazon as an example, you don’t need to visit every book in the index in order to map all the functionality. But, a scanner has to be intelligent enough to know when it’s getting no more value from a particular branch and move on. Otherwise scans will go on forever when they don’t need to. Ever wonder why there are so many configuration settings for the commercial scanner products around crawling and forced browsing? Now you know. Like I said in an earlier post, each scanning feature should be an indication of the inability to overcome a technology obstacle and it needs human assistance.