Friday, October 26, 2007

Why crawling matters

The first thing you do when pen-testing a …

… network: ping sweep
… host: port scan
… website: crawl

Each process begins by gathering intelligence and mapping out the attack surface as thoroughly as possible. Without a well-understood attack surface, generating results from a scan/pen-test/assessment or whatever is extremely difficult. I bring this up because there's been a lot of interesting discussion lately about evaluating web application vulnerability scanners by measuring code coverage and crawling capabilities. At WhiteHat Security we've spent years R&D'ing our particular crawling technology and have plenty of insights to share about the challenges in this area.

Crawl and code coverage capabilities are important metrics because automating as much of the website vulnerability assessment process as possible is a really good thing. Automation saves time and money, and both are in short supply. The better the website crawl (even if manually assisted), the more thorough the vulnerability assessment. If you've ever experienced a scanner spinning out of control and not completing for hours or days (perpetually sitting at 5%), or the opposite, when a report comes up clean 2 minutes later after clicking scan, there's a good chance it's the result of a bad crawl. The crawler of any good web application scanner has to be able to reach every nook and cranny of a website to map the attack surface and isolate the injection points, or run the risk of false negatives.

Whatever you do, don't assume crawling websites is simple and that scanners are just mimicking old-school search engine (Google, Yahoo, and MSN) technology. While much is similar, consider that everything those guys do is pre-login, and that's just the start of where things become difficult. Many issues exist that routinely trip up crawling, like login, maintaining login state, JavaScript links, malformed HTML, invalid HTTP messages, 200 OK'ing everything, CAPTCHAs, Ajax, RIAs, forms, non-standard URLs, and the list goes on. Search engines don't really have to worry about this stuff, and if they happen to miss portions of content they probably don't care a whole lot anyway. For example, they wouldn't be interested in indexing a bank website beyond the login form.
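To make a couple of those pitfalls concrete, here's a rough sketch in Python (using the requests library) of maintaining login state and spotting a site that 200 OKs everything. The URLs, form fields, and heuristics are made up for illustration and are not a description of how Sentinel actually works.

# Sketch only: two of the crawl pitfalls above, using the Python "requests" library.
# URLs and form field names are hypothetical placeholders.
import requests

BASE = "https://app.example.com"        # hypothetical target
session = requests.Session()            # the cookie jar keeps login state across requests

def login(username, password):
    # Maintaining login state: submit the login form once, then reuse the session.
    resp = session.post(f"{BASE}/login", data={"user": username, "pass": password})
    # Many apps return 200 even on a failed login, so look for evidence of success
    # (e.g. a logout link) instead of trusting the status code.
    return "logout" in resp.text.lower()

def looks_like_soft_404(resp, not_found_fingerprint):
    # "200 OK'ing everything": compare the body against the page returned for a URL
    # that is known not to exist, instead of relying on an HTTP 404. A real crawler
    # would fuzzy-match, since error pages often contain dynamic bits.
    return resp.status_code == 200 and resp.text == not_found_fingerprint

# Capture the site's "not found" fingerprint once with a clearly bogus URL.
fingerprint = session.get(f"{BASE}/this-page-should-not-exist-xyz").text

if login("scanner", "secret"):
    page = session.get(f"{BASE}/account/settings")
    if looks_like_soft_404(page, fingerprint):
        print("that link resolves to a custom 'not found' page, not real content")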

While WhiteHat's technology has become really good at JavaScript (Ajax) support and at overcoming most of the challenges described, no one in the industry is able to guarantee that all links have been found or all application logic flows exercised. Password resets or account activations are often not linked to from the website itself; instead, they're only visible in email. This means human assistance is often required to complete attack surface mapping. What we do with WhiteHat Sentinel is make the crawl data available to our customers for sanity checking. So if for some reason we've missed a link, they can notify us and we'll add it directly. Then we investigate why the link was missed and, if necessary, make a code update. Most often it's because the accounts we were provided did not give us access to a certain section of the website and we had no idea it existed.

Then there is the matter of scale, as some websites contain millions of links, some even growing and decaying faster than anything can keep up with. Without a database to hold the crawl data, scanners tend to slow or die at about 100,000 links due to memory exhaustion. However, to use Amazon as an example, you don't need to visit every book in the index in order to map all the functionality. But a scanner has to be intelligent enough to know when it's getting no more value from a particular branch and move on. Otherwise scans will go on forever when they don't need to. Ever wonder why there are so many configuration settings for the commercial scanner products around crawling and forced browsing? Now you know. Like I said in an earlier post, each scanning feature is an indication of a technology obstacle the scanner can't overcome without human assistance.
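For the curious, here is a rough sketch of the two ideas in that paragraph: keep the crawl frontier in a database instead of RAM, and stop descending into a branch once it stops producing new page "templates" (think /book/123 vs. /book/456 on Amazon). The schema, threshold, and template heuristic are invented for illustration and are not how our scanner works.

# Sketch only: a SQLite-backed frontier plus a crude "no more value from this
# branch" rule. The NOVELTY_LIMIT threshold and template heuristic are made up.
import re
import sqlite3
from urllib.parse import urlparse

db = sqlite3.connect("frontier.db")
db.execute("CREATE TABLE IF NOT EXISTS frontier (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

def template_of(url):
    # Collapse numeric path segments so /book/123 and /book/456 look identical.
    return re.sub(r"/\d+", "/{id}", urlparse(url).path)

seen_templates = {}      # template -> number of URLs that matched it
NOVELTY_LIMIT = 50       # arbitrary: after 50 copies of a template, skip the rest

def enqueue(url):
    tmpl = template_of(url)
    seen_templates[tmpl] = seen_templates.get(tmpl, 0) + 1
    if seen_templates[tmpl] > NOVELTY_LIMIT:
        return           # the branch is no longer adding new functionality; move on
    db.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
    db.commit()

def next_url():
    row = db.execute("SELECT url FROM frontier WHERE done = 0 LIMIT 1").fetchone()
    if not row:
        return None      # frontier exhausted
    db.execute("UPDATE frontier SET done = 1 WHERE url = ?", (row[0],))
    db.commit()
    return row[0]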

Moving forward, the one thing we're going to have to keep an eye on is scanner performance over time. The effectiveness of scanners could actually diminish rather than improve as a result of developer adoption of Web 2.0-ish technologies (Flash, JavaScript (Ajax), Java, Silverlight, etc.). Websites built with this technology have way more in common with a software application than with an easy-to-parse document. While it's possible to get some degree of crawl support for Ajax and the rest, it's not going to come anywhere close to the support for a well-known data format like HTML. As a result, attack surface mapping capabilities could in fact decline, and by extension the results. Lots of innovation will be required in many areas to overcome this problem or simply keep up with the status quo.

13 comments:

Anonymous said...

I agree, crawling is the first step before starting a test. My personal way is actually drawing up a mindmap of the application.

First I use the app: I get to know how they have built it, what does what, and also understand more about what the developers were trying to do. Without a full understanding of the application, you cannot even begin to try and test it for security issues.

I use Freemind (http://freemind.sourceforge.net/), which is a nice open source mind mapping tool, and I start to build up a visual diagram of how the site is laid out, with different child nodes indicating various aspects of the site and its functionality.

When I come across something which I feel needs later attention, I highlight it in the mind map.

Only when I feel I fully understand the app and what it does, do I go back to the mindmap and then start methodical testing.

Works for me, maybe others may benefit.

dre said...

ok i'll give in that crawling is important - especially the points that it is the first step in testing and also that there is a lot more to a security crawl than to antiquated search engine technology.

so what makes a security scanner good at crawling? for an automated code scanner - they have to be able to search for bugs faster than stepping through the code with debugger breakpoints. stepping with a fat app debugger can also be made faster through techniques like detouring or user-mode single stepping. this doesn't help or work exactly the same way for web applications, but it provides some good analogies.

so, in similar ways, these security scanners must be able to crawl via 3 major methods: automated spidering, partially-manual interactive capture/replay (proxy or no proxy), and a completely manual step-by-step mode.

however, to throw further wrenches into the problem - you have situations like GMail/GDocs where you're editing HTML or other content inside the browser. i'd be interested to hear how any web application security scanner solves this sort of problem (and i'd also be interested to hear related ones).

there is an insane amount of customization you'd want to do depending on how the applications work under the websites you are testing. let me try and list some more examples based on what you said in this post:

scanners tend to slow or die at about 100,000 links due to memory exhaustion

you could limit the crawl depth or give the scanner a maximum amount of pages to download. this is an interesting problem to solve.

If you’ve ever experienced a scanner spinning out of control and not completing for hours or days (perpetually sitting at 5%)

would time limits and max retries help here? you could have separate limits on javascript or flash links as well.

when a report comes up clean 2 minutes later after clicking scan, there’s a good chance it’s the result of a bad crawl

here's where it would be nice to change from a depth-first to a breadth-first search (get a broader search, but possibly less accurate).
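to picture the knobs i'm talking about here (page budget, crawl depth, timeouts, retries, and the depth-first vs breadth-first switch), a rough python sketch - every default is arbitrary and the link extractor is left as a stub you'd pass in:

# rough sketch of the crawl limits discussed above; none of these defaults are
# recommendations, and extract_links is whatever parser you bring along.
from collections import deque
import requests

class CrawlBudget:
    def __init__(self, max_pages=10000, max_depth=8, timeout=15,
                 max_retries=2, breadth_first=True):
        self.max_pages = max_pages
        self.max_depth = max_depth
        self.timeout = timeout
        self.max_retries = max_retries
        self.breadth_first = breadth_first

def crawl(start_url, extract_links, budget=CrawlBudget()):
    frontier = deque([(start_url, 0)])
    visited = set()
    while frontier and len(visited) < budget.max_pages:
        # breadth-first pops from the front (broad, shallow coverage);
        # depth-first pops from the back (digs into one branch first).
        url, depth = frontier.popleft() if budget.breadth_first else frontier.pop()
        if url in visited or depth > budget.max_depth:
            continue
        visited.add(url)
        resp = None
        for _ in range(budget.max_retries + 1):
            try:
                resp = requests.get(url, timeout=budget.timeout)
                break
            except requests.RequestException:
                resp = None          # retry, then give up on this URL
        if resp is None:
            continue
        for link in extract_links(resp):
            frontier.append((link, depth + 1))
    return visited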

Many issues exist that routinely trip up crawling, like login, maintaining login state, JavaScript links, malformed HTML, invalid HTTP messages, 200 OK'ing everything, CAPTCHAs, Ajax, RIAs, forms, non-standard URLs

so do you make your scanner's spider capability temporarily pause when encountering these scenarios in order to train or teach the spider? sometimes you'll want to switch to a different testing mode instead of applying one-off trainers - but it's hard to tell when and how fast you're failing.

the biggest crawl problem to solve for these web application security scanners is not to automate everything but to automate enough to know when manual intervention is necessary - and to make that intervention as quick and clean as possible. solving this for as many problems as we're facing today in web applications seems a bit optimistic.

in the real world - you'll need an expert who knows his/her web application security scanner of choice well. in my mind - the best way to increase the expertise is through increasing instructional capital. in a closed solution like Sentinel - this may prove difficult.

for the other commercial scanners - the ability to change the tests using javascript is optimal - because javascript is the universal language of web applications on the client-side. so teach your scanner experts javascript after making your tool use it in the data-driven tests.

the second issue is the interactive capture/replay, proxies, and trainers. if you use a universal spreadsheet, keyword-driven format for your capture/replay, proxy, and trainers - this is the first step. then we should create a standard that can be used across multiple web application security scanners in this way. i'm not sure there is an equivalent in quality testing but i'll take a look into solutions here. it could be some sort of new file format or even use an existing schema, although it would be nice if it was in XML.
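purely to illustrate what such a keyword-driven format could look like (the keywords and columns below are invented for this sketch - no scanner actually uses this schema), a csv-style recording plus a trivial python replayer:

# hypothetical keyword-driven capture/replay rows and a minimal replayer.
import csv
import io
import requests

RECORDING = """keyword,target,value
open,https://app.example.com/login,
type,user,alice
type,pass,secret123
submit,https://app.example.com/login,
assert_text,,Welcome back
"""

def replay(recording_csv):
    session = requests.Session()
    form_fields = {}
    last_response = None
    for row in csv.DictReader(io.StringIO(recording_csv)):
        kw, target, value = row["keyword"], row["target"], row["value"]
        if kw == "open":
            last_response = session.get(target)
        elif kw == "type":
            form_fields[target] = value                  # buffer form inputs
        elif kw == "submit":
            last_response = session.post(target, data=form_fields)
            form_fields = {}
        elif kw == "assert_text":
            assert value in last_response.text, "expected text not found"
    return last_response

# replay(RECORDING)   # needs a live target with this hypothetical login form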

thirdly - when you run into issues of session management changing inside of the application, the scanner expert will usually write more one-off tests to avoid looping or repeating tests too often. this could be done using a standardized domain-specific language. the cookie-revolver project has some well-defined terminology here, which could be a start to writing some sort of DSL for analyzing complex application logic such as session management.

this almost brings me to a final problem to solve - one that is not really related to crawling (more of a scraping or parsing problem), but that also allows me to improve my earlier points about web application security scanners and what they really do. web application security scanners are meant to attack, find vulnerabilities, and locate exploits. sure - they can't do so without some sort of surface coverage, but i'm not so certain just yet that crawling is the biggest problem facing these scanners.

identifying the results of the tests is most important if you want truly automated vulnerability reports. with fat applications - most of the time you know there is a problem when the program crashes. while this may sometimes be true when testing web applications - more likely you'll get an error page or a "user not recognized" sort of result back. it could even be a simple-looking page but with another user's account information displayed!
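to sketch what "identifying the results" might look like in code (the signatures and categories below are invented examples, not anyone's real rule set):

# illustrative response classifier: compare a test response against simple
# signature lists and the baseline page for the same request.
ERROR_SIGNATURES = ["stack trace", "sql syntax", "odbc", "unhandled exception"]
AUTH_SIGNATURES = ["user not recognized", "please log in", "access denied"]

def classify(baseline_body, test_body):
    body = test_body.lower()
    if any(sig in body for sig in ERROR_SIGNATURES):
        return "server_error"          # classic error-page evidence
    if any(sig in body for sig in AUTH_SIGNATURES):
        return "rejected"              # the application said no
    if body == baseline_body.lower():
        return "no_effect"             # same page came back; test likely ignored
    # the hardest case: a normal-looking page whose content changed, e.g. another
    # user's account details. needs diffing plus human review.
    return "changed_needs_review"

print(classify("<h1>Your orders</h1>", "<h1>Your orders</h1> stack trace at line 42"))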

getting back to how this relates to crawling: each attack path (e.g. WASC TC) or software weakness (OWASP T10, MITRE CWE) can best be found by certain types of crawling techniques on certain types of web application elements.

matching the correct crawl technique to the correct parsing technique is basically the problem that scraping attempts to solve. if you want to build a vulnerability-finding scraper - this is yet another problem that is best handled by experts familiar with their tools. i know that parsing languages can be improved by leaps and bounds but one resource that can be utilized as instructional capital (well, the first one that comes to mind) is xpath. dealing with malformed html is less of a problem when it's converted to xml - and a standard expression language is used to parse that xml. xpath does that very nicely, although people are working on better methods, parsers, and scraping technology at a pace that outclasses moore's law... these are all new and interesting fields of research.
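a small sketch of the xpath idea - lxml is my library choice for the example (its html parser tolerates broken markup and exposes the repaired tree to xpath); the markup and base url are invented:

# lxml repairs the malformed HTML into a tree, then XPath pulls out the
# attack-surface pieces: links, form actions, input names.
from lxml import html

broken = """
<html><body>
  <p>unclosed paragraph
  <a href="/account">account<a href="/logout">logout
  <form action="/search"><input name=q></form>
</body>
"""

doc = html.fromstring(broken)
doc.make_links_absolute("https://app.example.com/")   # hypothetical base URL

print(doc.xpath("//a/@href"))        # anchors survive the broken markup
print(doc.xpath("//form/@action"))   # form actions
print(doc.xpath("//input/@name"))    # injection point candidates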

web application security scanners are going to be difficult to measure using any standard benchmark. do we even have comparisons or benchmarks for network security scanners? is ncircle 360, nessus, foundstone, qualys, or rapid7 the "best" network security scanner?

if you want to test improvements to these tools - try measuring your own tool first and also let the experts learn more about their internals and how they work. let the expert users of these tools learn how to explain and re-use their vulnerability finding techniques. first build better testers, then concentrate on building better tools.

Anonymous said...

Hi Jeremiah,
Just out of curiosity, if I'm stuck using open source tools only (we don't have enough money in our budget this year for commercial scanners), which open source tool do you believe provides the best crawling capabilities? This could be independent of the open source web application testing tool itself (WebScarab, etc.) since I can always just set the testing tool as the proxy for the web spider. Any ideas? Thanks.

dre said...

@Anonymous: Comments like yours make me want to completely retract my first paragraph.

i.e. dre said... ok i'll take back that crawling is important - especially because it isn't a part of security testing and also that it's a complete waste of time if you have filesystem or CMS access.

What is the problem you are trying to solve? The best way to crawl is not to crawl - to instead have full-knowledge of the application.

If you don't have a CMS, then maybe you have SCM. If you don't have any SCM then you don't need security testing - you need SCM. If you have SCM, then you also probably have a developer tester or quality tester (or maybe at least one of each). These people should already have a full, accurate representation of the web application. Marketing or another department in any given organization has probably already asked for the same data.

Even better than SCM or CMS data (and therefore full, accurate crawl data) would be something called a "walkthrough". That's where the developer(s) who wrote the web application explain to you what it is and how it works.

Let's assume all the developers, testers, and marketroids are dead (or hate you), or that for some reason your organization has decided to do a zero-knowledge vulnerability assessment. This sounds like a political or stupidity issue. You also don't need security testing. You need to fix the glitch.

If you don't fall into the above categories, then what category do you fit into? Let me re-phrase the question. Do you live in a non-third-world-country and do you pay taxes?

Anonymous said...

@Anonymous:

Regular Firefox Power User - SpiderZilla or check out FireCAT

Regular Unix Power User - wget

Scientifically-inclined Power User - Heritrix

Regular Developer - PHP-CURL

Scientifically-inclined Developer - Some snobbish answer that involves his/her favorite programming language, HTTP driver, parsing library, javascript engine, flash decoder, bytecode weaver, etc

Regular Application Security Professional - Some snobbish answer that involves his/her favorite tool from the following list: Elza, Wapiti, Grabber, Paros, Burp Suite, Pantera, ProxMon, SPIKE proxy, HTTPBee, JBroFuzz, Sprajax, DirBuster, DFF Scanner, Nikto, w3af, cruiser, netcat, etc

Scientifically-inclined Application Security Professional - custom javascript spider that uses the Yahoo Search Site Explorer or dapper.net - along with xss proxy tunneling to scan intranets

Anonymous said...

@ daniel
Hey, thanks for mentioning FreeMind. I just checked it out and I can totally see some great uses for it.

@ Everyone
I agree with Daniel; I like to get the client to show me the app before I even start doing any crawling. That way I see how they intend for it to be used, or how an actual user does use it. This gives me a good chance to take notes on things I think I should dig into further. It's one thing to come up with whacked-out hacks, but it's a completely different thing to come up with everyday-user (human factor) weird things. ie: Where's the any button, etc...

Jeremiah Grossman said...

@anonymous, as ntp said, all those tools can be checked out to see if they are suitable for you. Personally, if I couldn't use Sentinel, I'd go for wget. It's simple, effective, command-line, and works on many things I've pointed it at. Then I can run my own text searches on the data and get a makeshift site directory.

Jeremiah Grossman said...

@dre, to your earlier comment:

You put lots of good insights in there and much of it I agree with. Clearly you've spent a great deal of time pen-testing and working with a wide variety of tools. So let's spend a couple of minutes going over the areas where we seem to be missing each other a little bit.

"you have situations like GMail/GDocs where you're editing HTML or other content inside the browser. i'd be interested to hear how any web application security scanner solves this sort of problem"

I'd be the first one to tell you that a certain percentage of websites, these examples in particular, just can't be scanned with today's technology. In fact you'd spend more time fighting with your scanner than doing the whole thing by hand. For the sake of our service and business model, we do have to pass on some of these from time to time because there is no way to provide good continuous assessments on them.

"you could limit the crawl depth or give the scanner a maximum amount of pages to download"

"would time limits and max retries help here?"

You could play around with all those types of settings, but you might be sacrificing coverage when you do. That's the tricky balance. Our scanner attempts to make these types of decisions on its own, and in the event that it makes a mistake, it becomes apparent very quickly and we can make adjustments manually. And if we do, and the same problem comes up over and over again, we can make a fix to the technology's decision making. At the end of the day though, it's very difficult to be certain a 100% crawl has been achieved.

"so do you make your scanner capability to spider temporarily pause when encountering these scenarios in order to train or teach the spider test"

To the extent that it needs it. Sometimes it does, sometimes it doesn't. Forms are a good example of where the scanner asks for human assistance. Login loops are another, or when it hits on 10 forced browsing things in a row. We take a pragmatic approach to the process and we monitor these situations in real time on hundreds of websites.

"the biggest crawl problem to solve for these web application security scanners is not to automate everything but to automate enough to know when manual intervention is necessary - and to make that intervention as quick and clean as possible"

Basically the design premise of our technology.

"web application security scanners are meant to attack, find vulnerabilities, and locate exploits"

More or less. But I prefer to see them as time savers for vulnerability assessments.

"but i'm not so certain just yet that crawling is the biggest problem facing these scanners."

Actually, if I had to pick two of the hardest problems in scanning technology (it sounds strange), crawling and maintaining login state are extremely difficult.

"do we even have comparisons or benchmarks for network security scanners?"

Not that I'm aware of. Mostly it's vuln-to-vuln comparison.

"try measuring your own tool first and also let the experts learn more about their internals and how they work. "

Unfortunately, the only people who consume or get a good look at our tech internals are us. All our customers really care about is results, and not so much how we get them.

Anonymous said...

Hi,
Thanks a lot for all your feedback. But my question was more along the lines of which web spider is most mature in not just basic crawling, but also extracting links from javascript and other advanced features. For example, I've used wget and other web spiders, but I've never actually evaluated these web spiders on how well they crawl non-crawler-friendly web pages. I'm sure that wget won't get them. But perhaps some other web spider might. Thanks for your help.

Jeremiah Grossman said...

I don't know of any open source spider capable of supporting JavaScript. I've heard of some crawling capabilities based out of the browser DOM, but I don't recall the names or who was working on it. Probably a good question to ask on the web security mailing list. Someone there is sure to know.

dre said...

@anonymous: But my question was more along the lines of which web spider is most mature in not just basic crawling, but also extracting links from javascript and other advanced features

All of these tools work out of the box:

cruiser, Grabber, Blueinfy's scanjax (scanweb2.0), TestMaker (free, but not OSS), Sprajax

Shreeraj Shah (Blueinfy) has also written some papers on the subject, including one called "Crawling Ajax-driven Web 2.0 Applications", where he demonstrates how to write a `crawl-ajax' script using both RBNarcissus and Watir. if you can download a Ruby interpreter and use Notepad then this article is at your level.

With additional programming skill, you could write a javascript crawler using HtmlUnit, JSWebUnit, Selenium IDE, Selenium RC, WebDriver, WindMill, Sahi, Watij, Watin, firewatir, scRUBYt, and probably many other popular methods out there.

some tools will "grep" for this sort of content. w3af does this. urlgrep from Blueinfy also does this. i think you're better off with a "browser driver" instead of an "application driver". what you want is something better than just a "protocol driver with grep". these are all standard words that the industry should use to describe this technology.

a browser driver is actually something that lets you do the crawling inside of the browser. this is hard to come by for javascript, although i did mention Selenium IDE (also: TestGen4Web). everyone i know uses Firebug, DOM Inspector (and the add-on InspectThis) for javascript debugging, and before Firebug there was Venkman, the Javascript debugger. Shreeraj Shah did another article for securityfocus where he demonstrated the power of Firebug and ChickenFoot for these purposes. i'm not sure if JS Commander (jscmd) is considered an application driver or a browser driver - but it is damn smooth (especially if you are working with embedded systems like the iPhone).
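for anyone who wants to see the "browser driver" idea in code, a minimal sketch with the selenium webdriver python bindings (a descendant of the selenium projects mentioned above) - the target url is hypothetical, and script-only navigation (onclick handlers, window.location assignments) still needs handling beyond this:

# "browser driver" sketch: a real browser loads the page and runs its JavaScript,
# then we read links out of the live DOM instead of grepping raw HTML.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()                 # needs Firefox and geckodriver installed
try:
    driver.get("https://app.example.com/")   # JS-generated links now exist in the DOM
    hrefs = {a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")
             if a.get_attribute("href")}
    print(sorted(hrefs))
finally:
    driver.quit()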

if you have some money, but not enough for a web application security scanner - you might want to look at Squish or QTP... and i'm also trying to find other options in this space, so please anyone let me know if you have heard of any other XHR/Ajax/JS crawler or scraper!

as a final note, since i figure someone will eventually ask: yes, the only open-source tool that currently supports javascript, flash decompiling, and EXIF data is cruiser. i would be into seeing others!

@Jeremiah: we should definitely talk more offline about some of our unanswered questions. i have a lot of insight into the process of testing crawlers for quality, performance, and security - as well as improving SaaS security assurance tools

Anonymous said...

Thanks a lot for your valuable input. I'm off to start testing the tools mentioned. Thanks again.

Anonymous said...

@ntp and dre:
Where can I find cruiser? I've found things such as the presentation about cruiser - http://2005.recon.cx/recon2005/papers/Robert_E_Lee-Jack_Louis/syllogistic_web_application_testing-recon05.pdf - as well as a mention that it would be included in OSACE, but all that I could find there was Unicornscan. Could you please point me to where I can find it? Thanks a lot.