Jeremiah Grossman: Input validation or output filtering, which is better?

This question is asked regularly with respect to solutions for Cross-Site Scripting (XSS). The answer is that input validation and output filtering are two different approaches that solve two different sets of problems, both of which include XSS. Both methods should be used whenever possible. However, this answer deserves further explanation.

Input Validation
(aka: sanity checking, input filtering, white listing, etc.)
Input validation is one of those things ranted about incessantly in web application security, and for good reason. If input validation were done properly and religiously throughout all web application code, we’d wipe out a huge percentage of vulnerabilities, XSS and SQL Injection included. I’m also a believer that developers shouldn’t have to be experts in all the crazy attacks potentially thrown at a website. There’s simply too much to learn, and their primary job should be writing new code, not becoming web application hackers. Developers should only have to concern themselves with the solutions required to mitigate any attack, no matter what it might be. This is where input validation comes into play.

Input validation should be performed on any incoming data that is not heavily controlled and trusted. This includes user-supplied data (query data, post data, cookies, referers, etc.), data in YOUR database, data from a third party (such as a web service), and data from anywhere else. Here are the steps that should be performed before any incoming data is used:

Normalize
Decode the incoming data (URL encoding, UTF-7, Unicode, US-ASCII, etc.) so that every later check runs against the same representation.

Character-set checking
Ensure the data only contains characters you expect to receive. The more restrictive the rules are the better.

Length restrictions (min/max)
Ensure the data falls within a restricted minimum and maximum number of bytes. This limits the window of opportunity for an attack, as exploits tend to require lengthy input strings.

Data format
Ensure the structure of the data is consistent with what is expected. Phone numbers should look like phone numbers, email addresses should look like email addresses, etc.
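
The four steps above can be chained into a single check. What follows is a minimal Java sketch, not a production validator. The class name InputCheck, the method validate, and the choice of URL decoding plus Unicode NFKC folding for the "normalize" step are assumptions made for illustration. It also assumes the raw value has not already been decoded by the framework.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper: runs the four steps in order and returns the
// normalized value, or null as soon as any step fails.
final class InputCheck {
    private InputCheck() {}

    static String validate(String raw, Pattern allowedChars,
                           int minBytes, int maxBytes, Pattern format) {
        if (raw == null) {
            return null;
        }

        // 1. Normalize: URL-decode once, then fold Unicode compatibility
        //    forms (for example fullwidth digits) into their plain equivalents.
        String value;
        try {
            value = URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return null; // malformed escape sequence: reject outright
        }
        value = Normalizer.normalize(value, Normalizer.Form.NFKC);

        // 2. Character-set checking: every character must be expected.
        if (!allowedChars.matcher(value).matches()) {
            return null;
        }

        // 3. Length restrictions, measured in UTF-8 bytes.
        int bytes = value.getBytes(StandardCharsets.UTF_8).length;
        if (bytes < minBytes || bytes > maxBytes) {
            return null;
        }

        // 4. Data format: the overall structure must match.
        if (!format.matcher(value).matches()) {
            return null;
        }
        return value;
    }
}
```

A caller would pass one pattern for the allowed characters and a second, stricter one for the full format, then treat a null result as a rejected request.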

Regular expression examples with iteratively more restrictive security:
(These are just samples, not recommended for production use)

Phone number:

/* 555-555-5555 */
String phone = req.getParameter("phone");

/* character-set OK */
String regex1 = "^([0-9\\-]+)$";

/* character-set with length restrictions */
String regex2 = "^([0-9\\-]{12})$";

/* with data format restrictions */
String regex3 = "^([0-9]{3})(\\-)([0-9]{3})(\\-)([0-9]{4})$";

/* getParameter returns null when the parameter is missing */
if (phone != null && phone.matches(regex3)) {

    /* data is ok, do stuff... */

}

Email Address:

/* user@somehostname.com */
String email = req.getParameter("email");

/* character-set */
String regex1 = "^([0-9a-zA-Z@\\.\\-]+)$";

/* character-set with length restrictions */
String regex2 = "^([0-9a-zA-Z@\\.\\-]{1,128})$";

/* with data format restrictions */
String regex3 = "^([0-9a-zA-Z\\.\\-]{1,64})(@)([0-9a-zA-Z\\.\\-]{1,64})"
        + "(\\.)([a-zA-Z]{2,3})$";

/* getParameter returns null when the parameter is missing */
if (email != null && email.matches(regex3)) {

    /* data is ok, do stuff... */

}

Implementation
For a variety of reasons, input validation has proved time consuming, prone to mistakes, and easy to forget about. The best approach is to define all the expected application data types (account IDs, email addresses, usernames, etc.), abstract them into reusable objects, and make them easily available from inside the development framework. Input validation is then handled behind the scenes, with no need to parse URLs or remember to apply all the relevant business logic rules. The benefit of this approach is that security becomes consistent and predictable. Plus, developers are assisted in creating software at a faster rate. Security and business goals are in alignment, which is exactly the place you want to be.

For example, let’s say you’re in an object-oriented environment working with a product purchase process:

URL:
http://website/purchase.cgi

Post Data:
product=100&quantity=4&cc=4444333322221111&exp=01/08


// check that the user is properly logged in and their account is active
if (user.isActive) {

    // make sure the product is available in the requested quantity
    if (req.product.isAvailable) {

        // calculate the total purchase price
        var total = req.product.price * req.qty;

        // make sure the credit card is valid for the purchase total
        if (req.creditcard.isValid(total)) {

            // initiate the transaction
            processOrder(user, req.product, req.qty, total, req.creditcard);

        } else {

            // inform the user with a consistent message that their credit card
            // was not accepted, and log the error to the central database
            requestFailed(req.creditcard.error);

        }

    } else {

        // inform the user with a consistent message that the item is not
        // available, and log the error to the central database
        requestFailed(req.product.error);

    }

} else {

    // inform the user with a consistent message that they are not properly
    // logged in, and log the error to the central database
    requestFailed(user.error);

}

Notice that the example code contains no input validation, direct database calls, or implicit strings. Everything is handled behind the scenes by the objects and methods. This makes mistakes less likely to occur and is extremely helpful in preventing a wide variety of attacks, including XSS, SQL Injection, and more.
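
One way to get validation "behind the scenes" is a small value type that can only be created through a method that runs the checks. The Java sketch below is an assumption about how such an object might look. PhoneNumber, parse, and the 555-555-5555 rule are illustrative names and rules, not part of any particular framework.

```java
import java.util.regex.Pattern;

// Hypothetical reusable data type: an instance can only exist if the raw
// input passed validation, so code that receives a PhoneNumber never has
// to re-check it.
final class PhoneNumber {
    // Same shape as the sample phone regex earlier in the post: 555-555-5555.
    private static final Pattern FORMAT =
            Pattern.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}");

    private final String value;

    // Private constructor: parse() is the only way in.
    private PhoneNumber(String value) {
        this.value = value;
    }

    // Returns a validated PhoneNumber, or throws if the input is missing
    // or does not match the expected format.
    static PhoneNumber parse(String raw) {
        if (raw == null || !FORMAT.matcher(raw).matches()) {
            throw new IllegalArgumentException("invalid phone number");
        }
        return new PhoneNumber(raw);
    }

    String value() {
        return value;
    }
}
```

With types like this, business logic only ever handles values that already passed validation, and a new rule has to be written in exactly one place.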

Output Filtering
When you get right down to it, XSS happens on output, when unfiltered data hits the user’s (victim’s) web browser. Untrusted data may also originate from a variety of locations, including your own database. As a developer you can never be certain that everyone else has done their job and kept potentially malicious data out of the DB. Better to play it safe when printing to screen.

Control the output encoding
Don’t let the web browser guess at a web page’s content encoding. Browsers are known for guessing wrong, which can lead to strange XSS variants. There are two ways to set the encoding: the response header and a meta tag. It’s best to use both methods to make certain the browser gets it right.

Response Header:
Content-Type: text/html; charset=utf-8
or
Content-Type: text/html; charset=iso-8859-1

Meta Tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
or
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
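
To keep the header, the meta tag, and the actual bytes in agreement, all three can be derived from one charset value. The following Java sketch is only an illustration. HtmlResponse and its fields are hypothetical names, and the bodyHtml argument is assumed to be markup that has already been output-filtered.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical holder for a response whose declared charset (header and
// meta tag) always matches the encoding used to produce the body bytes.
final class HtmlResponse {
    final String contentTypeHeader; // value for the Content-Type header
    final byte[] body;              // page bytes, encoded with the same charset

    private HtmlResponse(String contentTypeHeader, byte[] body) {
        this.contentTypeHeader = contentTypeHeader;
        this.body = body;
    }

    // bodyHtml is assumed to be already-filtered markup.
    static HtmlResponse of(String bodyHtml, Charset charset) {
        String name = charset.name().toLowerCase(Locale.ROOT);
        String contentType = "text/html; charset=" + name;
        String page = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"" + contentType + "\">"
                + "</head><body>" + bodyHtml + "</body></html>";
        return new HtmlResponse(contentType, page.getBytes(charset));
    }
}
```

The caller would send contentTypeHeader as the Content-Type response header and write body unchanged, so the browser never has to guess.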

Removing HTML/JavaScript
Many languages and frameworks have their own methods to convert special characters into their equivalent HTML entities, and it’s probably best to use one of those. If not, here is a Perl regex snippet that can be used or ported. I welcome comments on libraries people like, since I’m not familiar and up to date with all of them. As with input validation, it’s best to abstract this layer and make it second nature for developers.

$data =~ s/(<|>|\"|\'|\(|\)|:)/'&#'.ord($1).';'/sge;
or
$data =~ s/([^\w])/'&#'.ord($1).';'/sge;
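
For anyone porting the snippets, here is a rough Java equivalent of both variants. It is a sketch under assumptions: HtmlEntityEncoder and its method names are made up for illustration, the fixed character set < > " ' ( ) : is a reading of what the first snippet intends, and "word character" is taken to mean ASCII letters, digits, and underscore only.

```java
// Hypothetical port of the two Perl substitutions: matched characters are
// replaced by decimal numeric entities such as &#60; for '<'.
final class HtmlEntityEncoder {
    // Assumed character set for the narrow variant.
    private static final String SPECIAL = "<>\"'():";

    private HtmlEntityEncoder() {}

    // Narrow variant: encode only the characters listed in SPECIAL.
    static String encodeSpecial(String data) {
        return encode(data, false);
    }

    // Strict variant: encode everything except [A-Za-z0-9_].
    static String encodeNonWord(String data) {
        return encode(data, true);
    }

    private static boolean isWordChar(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
                || (cp >= '0' && cp <= '9') || cp == '_';
    }

    private static String encode(String data, boolean allNonWord) {
        StringBuilder out = new StringBuilder(data.length() * 2);
        // Walk by code point so characters outside the BMP become one entity.
        for (int i = 0; i < data.length(); ) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            boolean escape = allNonWord ? !isWordChar(cp) : SPECIAL.indexOf(cp) >= 0;
            if (escape) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

The strict variant is the safer default because nothing outside the allowed set reaches the browser unencoded. The narrow variant keeps the output readable in page source.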

9 comments:

Anonymous said...: Usually strict checking against a predefined pattern is a nightmare for users - everyone writes dates and phone numbers differently.
In such cases I prefer _extraction_ of data.
For example, instead of checking for proper arrangement of spaces, hyphens, etc. in a phone number, just remove all non-digit characters and you'll have safe and bulletproof input.

Oh, and please don't forget that + is a legal character in an e-mail username!
MTAs (gmail) can use it for tagging/filtering (username+tag@example.com); January 30, 2007 at 4:47 PM
Kyran said...: Input? Output? I'll have a little of each on my webapp plate please.; January 30, 2007 at 8:00 PM
Jeremiah Grossman said...: Hi kl,

With complex data types, for the most part I agree, but these were just examples. The point I was trying to make was to be as restrictive as you can, then balance with usability accordingly.; January 31, 2007 at 1:02 PM
Anonymous said...: haha, in your online shop example you didnt check the quantity ;-)

found some quite prominent homepages having these kind of issues.

-- beNi mybeNi.tk; February 3, 2007 at 3:27 AM
Jeremiah Grossman said...: beNi: Heheh, I would have put that part in the isAvailable method. :); February 4, 2007 at 10:20 PM
ron777 said...: J,
if possible, can you explain your regex:

$data =~ s/(<|>|\"|\'|\(|\)|:)/'&#'.ord($1).';'/sge;
or
$data =~ s/([^\w])/'&#'.ord($1).';'/sge;; October 29, 2007 at 12:56 PM
Jeremiah Grossman said...: Both regexes take special characters and convert them into HTML entities. Basically so the data can't execute as HTML.; October 29, 2007 at 1:50 PM
Fr0st said...: I am pleased to visit your blog. The type of content is awesome. Hope you carry on the task in future for a beginner like me.; November 21, 2008 at 5:42 AM
Gary said...: I like this article because, however short, it provides some basics that I haven't found in too many places - which is kind of a surprise. Many write and talk about the importance but details are hard to find.

Establishing class objects for data types may not necessarily be the ideal way to go. One can be limited by their web software architecture (class objects may not be so easy to implement) and relying on built-in object verification can easily ignore data that doesn't fit into a predefined object class. A programmer not used to consciously using input validation functions could be more likely to skip validation.

I also dislike using regular expressions for quick data validation in most cases (what a waste of a computer) - I like the comment that data should be extracted (and validated) which is what I do.; December 1, 2009 at 3:10 PM

Tuesday, January 30, 2007
