Following Larry Suto’s analysis of NTOSpider, IBM’s AppScan, and HP’s WebInspect, in which he compared code coverage to links crawled and vulnerabilities found, some questioned the accuracy of his results. Personally, I didn’t understand the criticism: with only two published reviews this year, we should be grateful to see data of any kind on web application vulnerability scanners. I chalked the comments up to the standard defense mechanism of scanner vendors and product resellers. Besides, Larry Suto is a professional source code reviewer, and if he can’t figure it out, what chance does anyone else have? Well, except for the vendor itself, and this is where it gets interesting.
HP/SPI felt the issue was important enough to research and respond to directly. Jeff Forristal (HP Security Labs) set up an environment to redo Larry’s work and measured WebInspect v7.7 and v7.5 against two of the three websites. While everyone is encouraged to read the report and judge for themselves, a couple of things in the data charts (see below) really stood out to me: false positives and “vulnerability duplicates”. I’ve only briefly discussed the problem of vulnerability duplicates in the past, when describing how customers eventually mature to a Quality phase where less equals more. Obviously, people would rather not see the same vulnerability reported dozens or hundreds of times.
Look at the chart columns “Total # Findings”, “Raw False Positive”, and “Accurate # of Instances”: these compare what the scanner reported, what was false, and which vulnerabilities were valid and unique. The two versions reported nearly identical numbers of validated issues: 5 on the Roller website and 113/110 on OpenCMS. On false positives, WebInspect v7.7 did fairly well, with rates of 0% and 16% on the two websites, while v7.5 performed somewhat worse at 2% and 36%. What you have to look at closely, though, is the ratio of Total # of Findings to Accurate # of Instances (which excludes the false positives), since this measures the level of vulnerability duplicates.
On the Roller website, v7.7 reported 40 unvalidated issues and v7.5 reported 55, all of which boiled down to 5 unique vulnerabilities. That means only 12% of the v7.7 results are meaningful! For v7.5 it was 9%! On OpenCMS, the 1,258 unvalidated issues reported by v7.7 (3,756 for v7.5) came down to 113 unique vulnerabilities. Once again, only 9% and 3% of the results were necessary. Shall we call this a Vulnerability Duplicate Rate? That is a lot of data to distill, and doing so must take considerable time. For those who use these scanners: is this typical in your experience, and is it expected behavior?
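To make the arithmetic behind these percentages explicit, here is a minimal Python sketch. The function names `meaningful_share` and `findings_per_unique` are hypothetical, and the inputs are just the raw counts quoted above. It assumes 113 unique vulnerabilities for both versions on OpenCMS, as stated in this paragraph; the chart’s figure of 110 for one version would still come out at roughly 3%.

```python
# Hypothetical helpers illustrating the "Vulnerability Duplicate Rate" arithmetic.
# Inputs are the raw counts quoted in the post for the WebInspect re-test.


def meaningful_share(unique_vulns: int, total_findings: int) -> float:
    """Percent of all reported findings that map to a unique, valid vulnerability."""
    if total_findings <= 0:
        raise ValueError("total_findings must be positive")
    return 100.0 * unique_vulns / total_findings


def findings_per_unique(total_findings: int, unique_vulns: int) -> float:
    """Average number of reported findings per unique, valid vulnerability.

    Note: total_findings counts false positives as well as duplicates.
    """
    if unique_vulns <= 0:
        raise ValueError("unique_vulns must be positive")
    return total_findings / unique_vulns


# (site, scanner version, total findings reported, unique valid vulnerabilities)
# Assumption: 113 unique vulnerabilities for both versions on OpenCMS.
RESULTS = [
    ("Roller", "v7.7", 40, 5),
    ("Roller", "v7.5", 55, 5),
    ("OpenCMS", "v7.7", 1258, 113),
    ("OpenCMS", "v7.5", 3756, 113),
]

if __name__ == "__main__":
    for site, version, total, unique in RESULTS:
        share = meaningful_share(unique, total)
        ratio = findings_per_unique(total, unique)
        print(f"{site} {version}: {share:.1f}% meaningful, "
              f"{ratio:.1f} findings per unique vulnerability")
```

The shares work out to about 12.5%, 9.1%, 9.0%, and 3.0%, matching the rounded figures above. Viewed as a ratio instead, the OpenCMS v7.5 scan produced roughly 33 findings for every unique vulnerability.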
I know Ory is reading this… :), so can you give us an indication of what the accuracy rating might have been for AppScan?