Web application security is a hot topic these days, yet comparatively little research has been performed in a field that remains wide open for creative minds. All you need to do is pick a topic and you are sure to uncover new stuff.
The work we do at WhiteHat Security enables us to push the limits of automated web application vulnerability scanning. It's vital to understand precisely what can and can't be scanned for so an expert can complete the rest. Many still believe scanning web applications is close to the capabilities of our network scanning cousins. In fact, the difference in comprehensiveness and methodology is night and day. While the validity of my estimates has been questioned by other webappsec scanner vendors, my educated guess (based on thousands of assessments) remains that only about half of the required tests for a security assessment can be performed on a purely automated basis. The other half requires human involvement, typically for identifying vulnerabilities in business logic.
This is the area where we focus on improving the most. We know it's impossible to scan for everything (undecidable problems), so why not instead focus on the areas that reduce the human time necessary to complete a thorough assessment? To us this makes the most sense from a solution standpoint. We measure, on a check-by-check basis, which tests are working and which are not. What's taking us the most time, and what can we do to speed things up? This strategy affords us the unique and agile ability to improve our assessment process faster than anyone else. We bridge the gap between our operations team (the guys that do the work) and the development team (who make the technology). The result is more complete, repeatable, and inexpensive security assessments.
For the technologists, the challenges of automated web application scanning are the interesting part. I'll describe a few.
False-Positives and Vulnerability Duplicates
Anyone who has ever run a vulnerability scanner of any type understands the problem of false-positives. They are a huge waste of time. In the webappsec world, we have that plus the problem of vulnerability duplicates. It's often very difficult to tell if 1,000 scanner-reported XSS vulnerabilities are in fact the same issue. Vulnerable parameters may be shared across different CGIs. URLs may contain dynamic content. Results can be hard to collapse down effectively, and if you can't, you're lost in a pile of script tags and single quotes.
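To make the collapsing problem concrete, here is a minimal sketch of one possible approach, not a description of any particular scanner. It assumes each finding is a (url, parameter, vulnerability class) tuple and that path segments made only of digits or long hex strings are dynamic content. The names normalize_path and collapse_findings, and the heuristic itself, are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption of this sketch: a path segment that is all digits, or a hex
# string of 8+ characters, is dynamic content (an ID), not a distinct page.
_DYNAMIC_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8,})$")


def normalize_path(url):
    """Return the URL path with ID-like segments replaced by a placeholder."""
    path = urlsplit(url).path
    segments = [
        "{id}" if _DYNAMIC_SEGMENT.match(seg) else seg
        for seg in path.split("/")
    ]
    return "/".join(segments)


def collapse_findings(findings):
    """Group (url, param, vuln_class) tuples that share a coarse signature.

    Returns a dict mapping (vuln_class, normalized_path, param) to the list
    of original URLs that fell into that group.
    """
    groups = defaultdict(list)
    for url, param, vuln_class in findings:
        key = (vuln_class, normalize_path(url), param)
        groups[key].append(url)
    return dict(groups)
```

A looser key such as (vuln_class, param) would also merge a parameter shared across different CGIs, at the risk of folding genuinely separate issues together. Where to draw that line is a judgment call, which is part of why collapsing is hard.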
It's common scanner practice to guess at files that might be left on the web server but not linked in, like login.cgi.bak, register.asp.old, or /admin/. You would think it's the easiest thing in the world to tell if a file is there or not, right!? Web servers are supposed to respond with code 404 (per the RFC), aren't they!? Sure they do, er, sometimes anyway. Sometimes there are web server handlers that respond with 200 OK no matter what, making you think something is there when it isn't. They might even give you content they think you want, but not what you asked for. How do you tell? Sometimes the web server and the servlets inside it have different 404 handlers. Some respond with 404 while others respond with 200, making it difficult to identify what exactly the web server is configured to do and when. Then there is dynamic not-found page content and multiple stages of redirects to deal with. The list of strangeness is endless, and unaccounted-for strangeness causes false-positives.
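One common way to cope with servers that answer 200 OK for everything is to ask for a name that almost certainly does not exist and compare the two responses. The sketch below illustrates that idea only. The fetch callable (returning a status code and body for a path), the function names, and the 0.9 similarity threshold are all assumptions made for the example. It does not follow redirects or handle per-servlet differences beyond probing the same directory and extension.

```python
import difflib
import posixpath
import uuid


def random_sibling(path):
    """Build a path in the same directory, with the same extension,
    whose name is random and so almost certainly does not exist."""
    directory, name = posixpath.split(path.rstrip("/"))
    _, ext = posixpath.splitext(name)
    sibling = posixpath.join(directory, uuid.uuid4().hex + ext)
    return sibling + "/" if path.endswith("/") else sibling


def looks_present(path, fetch, threshold=0.9):
    """Guess whether `path` really exists on the server.

    `fetch` is a hypothetical callable: fetch(path) -> (status, body).
    """
    status, body = fetch(path)
    if status == 404:
        return False  # the server said so plainly

    # Baseline: how does the server answer for a known-missing name?
    base_status, base_body = fetch(random_sibling(path))
    if status != base_status:
        return True  # answered differently than for a missing name

    # Same status: compare bodies. autojunk=False keeps difflib from
    # discarding frequent characters on long pages.
    matcher = difflib.SequenceMatcher(None, body, base_body, autojunk=False)
    return matcher.ratio() < threshold
```

Probing a sibling in the same directory with the same extension matters because, as noted above, different handlers on one server may treat missing files differently. A fuzzy comparison rather than an exact match is there to tolerate not-found pages that echo the requested name or other dynamic content.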
Login and Authentication Detection
Infinite web sites
We like to refer to this issue as the "calendar problem", as this was the spot we initially ran into it the most. By the way, we found the year 10,000 problem first, we think. :) When crawling web sites, sometimes there are just too many web pages (millions of items), the site grows and decays too rapidly, or unique links are generated on the fly. A good percentage of the time, while we can technically reach heights of million-plus link scans, it's simply impossible or impractical to crawl the entire website. A scanner could trap itself by going down an infinite branch of a website. A scanner needs to be smart enough to realize it's in this trap and dig itself out. Otherwise you get an infinite scan.
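As an illustration of how a crawler might notice it is in such a trap, here is a minimal sketch that gives every URL "shape" a fixed page budget. It is one generic heuristic and not the approach used behind the scenes here. The get_links callable, the function names, and the budget values are assumptions made for the example.

```python
import re
from collections import Counter, deque
from urllib.parse import parse_qsl, urlsplit


def url_shape(url):
    """Reduce a URL to its structure: host, path with digit runs collapsed,
    and sorted query parameter names (values are dropped)."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "N", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    return parts.netloc + path + "?" + "&".join(keys)


def crawl(start, get_links, max_per_shape=50, max_pages=10000):
    """Breadth-first crawl that stops following any URL shape once it has
    produced `max_per_shape` pages.

    `get_links` is a hypothetical callable: get_links(url) -> list of URLs.
    """
    seen = {start}
    shape_counts = Counter()
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        shape = url_shape(url)
        if shape_counts[shape] >= max_per_shape:
            continue  # budget spent for this pattern; likely a trap
        shape_counts[shape] += 1
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

With a calendar whose every page links to the next month, all of those URLs share one shape, so the crawl stops after the budget is spent instead of marching toward the year 10,000. The trade-off is that a legitimately large section of the site with a uniform URL pattern is cut short too, which a fixed budget cannot distinguish from a trap.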
Finding all pages and functionality
So there you have it: an overview of a handful of the challenges we push the limits on every day. I wish I could go into the innovative solutions we've developed and continue to improve behind the scenes. The bottom line is that scanning web applications is an imperfect art. If you take a list of "we support this" features from any scanner, you'll find the actual "support" will vary from one website to the next. That's why scanning combined with an assessment process is the only way to go. Together they are greater than the sum of their parts.