Wednesday, May 02, 2007

Battle of the Colored Boxes (part 2 of 2)

Coverage and comprehensiveness are key to effective vulnerability assessment. The more vulnerabilities identified and weeded out, the harder it is for the bad guys to break in. In web application security, black box testing is a fairly standard way to measure that difficulty and a common method for improving it. That’s why when Fortify recently published a new white paper entitled “Misplaced Confidence in Application Penetration Testing” (registration required), it immediately piqued my interest. Plus a title like that is bound to generate some controversy (score 1 for marketing). I highly recommend reading their paper first before moving on and having your opinions colored by mine.

Done reading? Good, let’s move on.

Fortify, a company specializing in white box analysis tools, performed a study measuring the percentage of code coverage achieved by black box scanners. They set up a few web application test beds, wrapped the security-relevant APIs with Fortify Tracer as a way to measure, launched some commercial scanners (with and without manual configuration), and logged the results. A novel approach I hadn’t seen before in webappsec. They also surveyed what people “believed” they were getting in terms of coverage from black box scanners, but that wasn’t as interesting to me as the “actual” measurements.

Here are the highlights from the paper that interested me:

- Our experiment found that penetration testing identifies key vulnerabilities during application runtime, but only reaches, on average, between 20-30% of a given application’s potential execution paths

- Manual customization can improve tests, although this improvement is not significant: our experiments showed an increase in coverage of only 19%.

- Two sets of issues caused the majority of the “misses”. The first set involved exercising sources of input that are inaccessible through the Web, such as data read from the file system, Web Services, and certain database entries. The second, and more alarming area of missed coverage, came from areas that are accessible from the Web interface of the application, but are difficult to reach due to the structure and program logic of the application.


On first read, this doesn’t look good for black box application penetration testing, or for scanners specifically. Though several factors went unexplained that could have significantly impacted the results, especially with the first two bullet points. Taylor McKinley (a Fortify Product Manager) was kind enough to indulge my curiosity.

1) Paper: “Many of the respondents used commercial and freeware automated tools, commonly referred to as web application scanners, as their primary mechanism to conduct their application penetration tests.”

Question: Does this suggest that the respondents didn’t complete the vulnerability assessment and just pressed “go” on the scanner?

Fortify: We’re just stating that many of the survey respondents used automated tools. We’re not specifying anything about how they use these tools. To address your question, which I believe is about whether we studied manual testing or automated testing, we studied automated testing with tools. We ran them out of the box first and then customized each one for the application we were attacking. In general, some pen testers most likely supplement their use of pen testing tools with manual efforts; however, if they don’t know what the tools are doing, it’s that much more difficult to supplement them in any reasonable way.


This means the results were based on scans and “configured” scans, not full penetration tests or vulnerability assessments with experts behind them, like those that I am known to recommend. It would be interesting to see how a combo scanner / expert assessment would stack up.

2) Paper: “For our evaluation, we had an internal security expert conduct penetration tests against a test bed of five Web applications.“

Question: What were these web applications exactly? Were they demo or training web applications like WebGoat or SiteGenerator, or something like a message board, shopping cart, or real-world website, or what? This is an important aspect for context as well.

Fortify: One was our own internal test application, which is a 5MB order fulfillment application. WebGoat was another, and HacmeBooks was another. The last two are not well known but are representative of standard web applications in terms of size and functionality.

This is probably a fair enough test bed for this experiment.

3) Paper: “To address this shortcoming, we developed a tool that instruments an application like a code coverage tool, but specifically targets points where input enters the program and security-relevant APIs are used.“

Question: Can you elaborate more on how this is done?

Fortify: Fortify Tracer inserts its own Java bytecode into the Java bytecode of an application. It takes the .class files and, using aspect-oriented technology, scans through the bytecode looking for vulnerable APIs. It also contains a set of rules defining which APIs are vulnerable and which parameters to watch out for. Using aspect-oriented technology, Fortify Tracer has the ability to add code around, before, or after an API. When the aspect technology hits a particular API that is vulnerable, it will insert Fortify’s code. This allows Fortify to analyze the data coming into or going out of an API.

Admittedly I don’t know enough about Java or this type of technology to say one way or the other whether this is solid enough for measuring code execution. A control case would have been nice…
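
For readers who want a more concrete picture, the general shape of this kind of instrumentation looks something like the sketch below. To be clear, this is my own rough illustration in AspectJ terms, not Fortify’s actual rules or code; the pointcuts, class names, and the CoverageLog helper are all invented for the example.

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Fires whenever a security-relevant API is exercised, recording which ones
// the black box tests actually reached.
@Aspect
public class SecurityApiCoverageAspect {

    // Any call to a JDBC statement-execution method (a classic SQL sink).
    @Before("call(* java.sql.Statement.execute*(..))")
    public void recordSqlSink(JoinPoint jp) {
        CoverageLog.record(jp.getSignature().toLongString());
    }

    // Any read of a request parameter (a classic source of attacker input).
    @Before("call(String javax.servlet.ServletRequest.getParameter(String))")
    public void recordInputSource(JoinPoint jp) {
        CoverageLog.record(jp.getSignature().toLongString());
    }
}

// Simple collector: after a scan finishes, this set shows which of the
// instrumented APIs were actually hit.
class CoverageLog {
    private static final Set<String> reached =
            Collections.synchronizedSet(new HashSet<String>());

    static void record(String apiSignature) {
        reached.add(apiSignature);
    }

    static Set<String> reachedApis() {
        return reached;
    }
}

Weave something like that into the application, point the scanner at it, and the recorded signatures tell you which security-relevant APIs the scan actually reached.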

4) Paper: “The second, and more alarming area of missed coverage, came from areas that are accessible from the Web interface of the application, but are difficult to reach due to the structure and program logic of the application.”

Question: Was a “control” method included as part of the experiment? Meaning, if a user or QA process interacted with the website in a normal and complete usage fashion, what would the execution percentage have been? This seems to me like a vital piece of data to use as a reference point.

Fortify: Very good point. These test applications were relatively small and we know them well, so we felt very comfortable that a QA tool could hit the vast majority of the application. We don’t have the data on hand, but I agree in retrospect that would have been a good thing to have addressed in the report. I can say that we would expect a full QA test to exercise more than 80% of the application.
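
Assuming instrumentation like that is in place, turning a test run into an execution percentage is simple arithmetic. A hypothetical helper (the class and method names are made up for illustration) might look like this:

import java.util.Set;

// Given the full inventory of security-relevant APIs in an application and the
// set a test run actually exercised, report the coverage as a percentage.
public class CoverageReport {

    public static double percentCovered(Set<String> allSecurityApis, Set<String> reachedApis) {
        if (allSecurityApis.isEmpty()) {
            return 0.0;
        }
        int hit = 0;
        for (String api : allSecurityApis) {
            if (reachedApis.contains(api)) {
                hit++;
            }
        }
        return 100.0 * hit / allSecurityApis.size();
    }
}

The interesting comparison is that number for a thorough QA pass versus an out-of-the-box scan.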

5) Paper: “The tester conducted both fully-automated and manually-assisted penetration tests using two commercial tools and used Fortify Tracer® to measure the effectiveness of these tests against the test bed of applications.”

Question: There are a number of crappy commercial black box scanners on the market, so which did you use? It would be unfair of Fortify to have selected bottom-of-the-barrel scanners as representative of the entire black box testing market. This is another one of those missing vital pieces of data.

Fortify: I would like to divulge the names of these tools, but this is unfortunately not something we can do. However, short of telling you their names, is there anything I can do to convince you that you would not be disappointed? Maybe I can put it this way: we used two of the top three market-leading tools. Does that suffice?

I think that’s fair enough to make some educated guesses about whose product(s) they used.

OK, we got some measurement concerns out of the way that should be considered if someone else decides to repeat the experiment. And I really hope someone does; this is good stuff. What’s also interesting: if you take the combined total of the first two bullet point measurements (30% + 19%), that’s about the coverage I said scanners are capable of testing for (~50%). Now, if you were to perform a full vulnerability assessment with an expert, would we have improved coverage to over 80%, as mentioned in Q4? I don’t see why not. From that point of view the scanner / expert coverage doesn’t look so bad, at least not by an order of magnitude.

I think what Fortify is suggesting in the paper is not so much that black box scanners are bad or incomplete, but that their coverage will vary widely from one web application to the next. There is a lot of truth to this. Unless the end user is able to measure the depth of coverage, they have no way to know the value they’re getting (or not). I think that’s fair. Until the technology matures to the point where the coverage doesn’t vary so greatly and we begin to trust it, we’ll have to measure. Let’s move on to the third bullet point.

6) Paper: “Two sets of issues caused the majority of the “misses”. The first set involved exercising sources of input that are inaccessible through the Web, such as data read from the file system, Web Services, and certain database entries.”

Question: How would you characterize the exploitability risk of these “misses” by external attackers?

Fortify: While the risk from internal hackers is more severe for these types of attacks, the threat from an external hacker is very real and needs to be addressed.

OK, let’s ask about that then…

7) Paper: “The first are those vulnerabilities that penetration testing is simply incapable of finding, such as a privacy violation, where confidential information is written to a log file, which potentially violates PCI compliance. Log files are particularly vulnerable to attack by hackers who recognize that they are often an easy way to extract information from a system.”

Question: Can you describe a plausible scenario of how an attacker might access this data through a web application hack?

Fortify: The basic premise here is that a log file isn’t a secure location, so you’re putting private data in a place that is not thoroughly protected. A hacker might be able to exploit some vulnerability that gives them access to various parts of your network, including the log files. This is also a major threat from an insider, who has a greater likelihood of being able to gain access to the log files for a particular application. I just spoke to someone who used to work for a major bank, and he said this was a huge issue for them because employees, if they did the right thing, could gain access to all types of private data. In addition, having this type of vulnerability may be grounds for a failed audit, which could mean fines or, at the extreme, having your ability to process credit cards shut down. Lastly, creating problems with log files is a great way to stymie forensics efforts.
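
To make the scenario concrete, here is the kind of code that produces this class of finding. This is a hypothetical example, not one from the paper; note that a scanner probing from the outside sees the exact same HTTP response whether or not the logging line is there.

import java.util.logging.Logger;

// A privacy violation that black box testing can't see: confidential data
// written to an application log file.
public class PaymentHandler {

    private static final Logger LOG = Logger.getLogger(PaymentHandler.class.getName());

    public void charge(String cardNumber, String cvv, double amount) {
        // The HTTP response looks identical either way, but the full card number
        // and CVV now sit in a log file readable by anyone with file system access.
        LOG.info("Charging card " + cardNumber + " cvv=" + cvv + " amount=" + amount);
        // ... payment processing ...
    }
}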

I take this to mean they would agree that vulnerability “misses” due to functionality inaccessible through the Web are more of an insider threat. That’s fine, but it’s important to understand when comparing and contrasting software testing methodologies. So where does all of this leave us? Fortify is illustrating the limitations of black box testing and where their tools can add value, and there’s nothing wrong with that. I think it’s safe to say they’d admit that certain vulnerabilities are beyond their coverage zone as well. They are a reasonable bunch.

For myself, I can’t help but think we’re going about this measurement stuff the wrong way. We’re all busy fighting and comparing each other’s solutions down in the weeds with “who found more” and “I can find that while you can’t” nonsense. We’re missing the big picture, and that is: how do we keep a website from getting hacked? Isn’t that the point of everything we’re trying to do? That says to me that we must find a way to measure ourselves against the capabilities of the bad guys. How much harder does vulnerability assessment, whether black, white, or gray box, make it for them? In my opinion, that is where this type of research should focus, and where it would provide the most value to website owners.
