This question is asked regularly with respect to solutions for Cross-Site Scripting (XSS). The answer is that input validation and output filtering are two different approaches that solve two different sets of problems, including XSS. Both methods should be used whenever possible. However, this answer deserves further explanation.
Input Validation (aka: sanity checking, input filtering, white listing, etc.)
Input validation is one of those things ranted about incessantly in web application security, and for good reason. If input validation were done properly and religiously throughout all web application code, we’d wipe out a huge percentage of vulnerabilities, XSS and SQL Injection included. I’m also a believer that developers shouldn’t have to be experts in all the crazy attacks potentially thrown at a website. There’s simply too much to learn, and their primary job should be writing new code, not becoming web application hackers. Developers should only have to concern themselves with the solutions required to mitigate any attack, no matter what it might be. This is where input validation comes into play.
Input validation should be performed on any incoming data that is not heavily controlled and trusted. This includes user-supplied data (query data, post data, cookies, referers, etc.), data in YOUR database, data from a third party (such as a web service), or data from anywhere else. Here are the steps that should be performed before any incoming data is used:
Normalize: URL/UTF-7/Unicode/US-ASCII/etc. decode the incoming data.
Character-set checking: Ensure the data only contains characters you expect to receive. The more restrictive the rules are, the better.
Length restrictions (min/max): Ensure the data falls within a restricted minimum and maximum number of bytes. This limits the window of opportunity for an attack, as exploits tend to require lengthy input strings.
Data format: Ensure the structure of the data is consistent with what is expected. Phone numbers should look like phone numbers, email addresses should look like email addresses, etc.
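
The four steps above can be sketched in Java roughly as follows. This is a minimal illustration, not a production validator: the class name `InputCheck`, the username-style rule (letters, digits, and underscore, 3 to 20 characters), and the choice of NFKC normalization are all assumptions made for the example. It also assumes the raw value has not already been URL-decoded by the framework (servlet containers typically decode parameters for you, and decoding twice is a bug in its own right), and it needs Java 10 or later for the `URLDecoder.decode(String, Charset)` overload.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Optional;
import java.util.regex.Pattern;

final class InputCheck {
    // Hypothetical rule for a username-like field:
    // character set [A-Za-z0-9_], length 3 to 20, nothing else allowed.
    private static final Pattern ALLOWED = Pattern.compile("^[A-Za-z0-9_]{3,20}$");

    /** Returns the cleaned value, or an empty Optional if any step rejects the input. */
    static Optional<String> validate(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String decoded;
        try {
            // Step 1a: URL-decode the raw value.
            decoded = URLDecoder.decode(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            // Malformed %-escape sequence: reject rather than guess.
            return Optional.empty();
        }
        // Step 1b: fold Unicode compatibility forms (e.g. fullwidth letters) to a canonical form.
        String normalized = Normalizer.normalize(decoded, Normalizer.Form.NFKC);
        // Steps 2-4: character set, length, and format checked by one anchored pattern.
        return ALLOWED.matcher(normalized).matches()
                ? Optional.of(normalized)
                : Optional.empty();
    }
}
```

Normalizing before checking matters: if the checks ran on the still-encoded string, `%3Cscript%3E` would look like harmless letters, digits, and percent signs.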
Regular expression examples with iteratively more restrictive security:
(These are just samples, not recommended for production use)
Phone number:
/* 555-555-5555 */
String phone = req.getParameter("phone");
/* character-set OK */
String regex1 = "^([0-9\\-]+)$";
/* character-set with length restrictions */
String regex2 = "^([0-9\\-]{12})$";
/* with data format restrictions */
String regex3 = "^([0-9]{3})(\\-)([0-9]{3})(\\-)([0-9]{4})$";
/* getParameter returns null when the parameter is missing */
if (phone != null && phone.matches(regex3)) {
    /* data is ok, do stuff... */
}
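
To see why each added restriction matters, here is a small sketch that runs the same input through all three phone patterns. The class name `PhoneRegexDemo` and the sample inputs are made up for illustration; the patterns are the ones from the example, minus the capture groups.

```java
import java.util.regex.Pattern;

final class PhoneRegexDemo {
    // Same three levels as the phone example: character set, plus length, plus format.
    static final Pattern CHARSET = Pattern.compile("^[0-9\\-]+$");
    static final Pattern CHARSET_LEN = Pattern.compile("^[0-9\\-]{12}$");
    static final Pattern FORMAT = Pattern.compile("^[0-9]{3}-[0-9]{3}-[0-9]{4}$");

    /** Returns {passes character set, passes character set + length, passes full format}. */
    static boolean[] check(String s) {
        return new boolean[] {
            CHARSET.matcher(s).matches(),
            CHARSET_LEN.matcher(s).matches(),
            FORMAT.matcher(s).matches()
        };
    }
}
```

A string of twelve dashes passes the first two patterns but not the third, and `555-5555` passes only the first. Only the format-level pattern rejects everything that does not actually look like a phone number.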
Email Address:
/* user@somehostname.com */
String email = req.getParameter("email");
/* character-set */
String regex1 = "^([0-9a-zA-Z@\\.\\-]+)$";
/* character-set with length restrictions */
String regex2 = "^([0-9a-zA-Z@\\.\\-]{1,128})$";
/* with data format restrictions */
String regex3 = "^([0-9a-zA-Z\\.\\-]{1,64})(@)([0-9a-zA-Z\\.\\-]{1,64})(\\.)([a-zA-Z]{2,3})$";
/* getParameter returns null when the parameter is missing */
if (email != null && email.matches(regex3)) {
    /* data is ok, do stuff... */
}
Implementation
For a variety of reasons, input validation has proved time consuming, prone to mistakes, and easy to forget about. The best approach is to define all the expected application data types (account IDs, email addresses, usernames, etc.), abstract them into reusable objects, and make them easily available from inside the development framework. Input validation is then handled behind the scenes, with no need to parse URLs or remember to apply all the relevant business logic rules. The benefit of this approach is that security becomes consistent and predictable. Plus, developers are assisted in creating software at a faster rate. Security and business goals are in alignment, which is exactly the place you want to be.
For example, let’s say you’re in an object-oriented environment working with a product purchase process:
URL:
http://website/purchase.cgi
Post Data:
product=100&quantity=4&cc=4444333322221111&exp=01/08
// check if the user is properly logged in and their account is active
if (user.isActive) {
    // make sure the product is available in the requested quantity
    if (req.product.isAvailable) {
        // calculate the total purchase price
        var total = req.product.price * req.qty;
        // make sure the credit card is valid for the purchase total
        if (req.creditcard.isValid(total)) {
            // initiate the transaction
            processOrder(user, req.product, req.qty, total, req.creditcard);
        } else {
            // inform user that their credit card was not accepted with a consistent message and also log the error to central database.
            requestFailed(req.creditcard.error);
        }
    } else {
        // inform user that the item is not available with a consistent message and also log the error to central database.
        requestFailed(req.product.error);
    }
} else {
    // inform user that they are not properly logged in with a consistent message and also log the error to central database.
    requestFailed(user.error);
}
Notice that the example code contains no input validation, direct database calls, or implicit strings. Everything is handled behind the scenes by the objects and methods. This makes mistakes less likely to occur and is extremely helpful in preventing a wide variety of attacks, including XSS, SQL Injection, and more.
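
What one of those behind-the-scenes objects might look like is sketched below in Java. This is only an illustration of the idea: the `Quantity` class, its `parse` factory, and the one-to-two-digit limit are hypothetical and not part of any real framework. The point is that the only way to obtain a `Quantity` is through a method that validates, so code holding one never needs to re-check it.

```java
import java.util.regex.Pattern;

final class Quantity {
    // Hypothetical rule: one or two decimal digits, nothing else.
    private static final Pattern DIGITS = Pattern.compile("^[0-9]{1,2}$");

    private final int value;

    // Private constructor: callers must go through parse(), which validates.
    private Quantity(int value) {
        this.value = value;
    }

    /** Builds a Quantity from raw request data, or throws if the data is not acceptable. */
    static Quantity parse(String raw) {
        if (raw == null || !DIGITS.matcher(raw).matches()) {
            throw new IllegalArgumentException("invalid quantity");
        }
        int n = Integer.parseInt(raw);
        if (n < 1) {
            throw new IllegalArgumentException("quantity must be at least 1");
        }
        return new Quantity(n);
    }

    int value() {
        return value;
    }
}
```

A framework built this way would call something like `Quantity.parse(...)` while populating the request object, so business logic such as the purchase example only ever sees values that already passed the character-set, length, and format checks.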
Output Filtering
When you get right down to it, XSS happens on output, when the unfiltered data hits the user’s (victim’s) web browser. Plus, untrusted data may originate from a variety of locations, including your own database. As a developer you’re never really certain whether someone else is doing their job or placing potentially malicious data in the DB. Better to play it safe when printing to screen.
Control the output encoding
Don’t let the web browser guess at a web page’s content encoding. Browsers are known for making mistakes that could lead to strange XSS variants. There are two ways to set the encoding: the response header and meta tags. It’s best to use both methods to make certain the browser gets it right.
Response Header:
Content-Type: text/html; charset=utf-8
or
Content-Type: text/html; charset=iso-8859-1
Meta Tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
or
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Removing HTML/JavaScript
Many languages and frameworks have their own methods to convert special characters into their equivalent HTML entities; it’s probably best to use one of those. If not, here is a Perl regex snippet that can be used or ported. I welcome anyone to comment on libraries they like, as I’m not familiar and up to date with all of them. As with input validation, it’s best to abstract this layer and make it second nature for developers.
$data =~ s/(<|>|\"|\'|\(|\)|:)/'&#'.ord($1).';'/sge;
or
$data =~ s/([^\w])/'&#'.ord($1).';'/sge;
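
For readers who want to port the idea, here is a rough Java equivalent of the second, stricter pattern: every character that is not a word character is replaced by its numeric HTML entity. The class name `HtmlEncoder` is made up for this sketch, and it treats only ASCII letters, digits, and underscore as word characters, which is narrower than what Perl’s `\w` may match on Unicode strings. If your framework ships its own encoder, prefer that.

```java
final class HtmlEncoder {
    /** Replaces every character outside [A-Za-z0-9_] with a numeric entity such as &#60;. */
    static String encode(String data) {
        if (data == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(data.length() * 2);
        // Iterate by code point so characters outside the BMP become a single entity.
        data.codePoints().forEach(cp -> {
            boolean wordChar = (cp >= 'a' && cp <= 'z')
                    || (cp >= 'A' && cp <= 'Z')
                    || (cp >= '0' && cp <= '9')
                    || cp == '_';
            if (wordChar) {
                out.appendCodePoint(cp);
            } else {
                // append(int) writes the decimal value of the code point
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

Under these assumptions, `HtmlEncoder.encode("<script>")` yields `&#60;script&#62;`, which the browser displays as text instead of interpreting as markup.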