I have been dealing with XSS at my so-called “real job” recently, and it has come to my attention that a lot of people in this world are under the mistaken impression that it’s better to do “input filtering” than “output filtering”. As I pretty much came up with these terms myself (they may or may not exist elsewhere; I’m just too lazy to find out), I’ll define them for you:
Input Filtering: Scrubbing XSS-dangerous data out of your input before it gets saved anywhere.
Output Filtering: Scrubbing XSS-dangerous data only upon display.
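To make the distinction concrete, here’s a minimal Ruby sketch of the two approaches (the `db` hash and helper names are made up for illustration; I’m using `CGI.escapeHTML` from the standard library as the scrubber):

```ruby
require "cgi"

# Input filtering: scrub before the data is stored anywhere.
def save_comment_input_filtered(db, text)
  db[:comments] << CGI.escapeHTML(text) # stored pre-escaped
end

def render_comment_input_filtered(stored)
  stored # displayed as-is; you're trusting the DB contents
end

# Output filtering: store the raw data, scrub only at display time.
def save_comment_output_filtered(db, text)
  db[:comments] << text # stored in its original state
end

def render_comment_output_filtered(stored)
  CGI.escapeHTML(stored) # escaped at the last possible moment
end

db = { comments: [] }
save_comment_output_filtered(db, "<b>hi</b>")
render_comment_output_filtered(db[:comments].first) # => "&lt;b&gt;hi&lt;/b&gt;"
```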
Now, the most important concept here is that XSS is most dangerous when a user can see immediate results without alerting you, the web designer. So if you have a page that repeats their parameters back at them (say a search page where you put “Your search for $parameter could not be found”), that’s A) independent of input vs. output scrubbing, and B) by far the most dangerous kind of XSS vulnerability (this is what’s usually called reflected XSS). Why? Because it allows an attacker to post a link to your site that executes malicious JavaScript in the victim’s browser. Bad, bad, bad.
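A minimal sketch of that exact pattern, with the vulnerable version and the fix side by side (the function names are hypothetical):

```ruby
require "cgi"

# Vulnerable: the query parameter is echoed straight back into the page,
# so a crafted link like /search?q=<script>...</script> runs script in the
# victim's browser the moment they click it.
def search_message_unsafe(query)
  "Your search for #{query} could not be found"
end

# Fixed: escape on output, whether or not the parameter is ever stored.
def search_message_safe(query)
  "Your search for #{CGI.escapeHTML(query)} could not be found"
end

search_message_safe("<script>alert(1)</script>")
# => "Your search for &lt;script&gt;alert(1)&lt;/script&gt; could not be found"
```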
After echoing user parameters is fixed, you have to look at how you display stored data. This is where the type of scrubbing comes into play – do you scrub the data before storing to your database / file system? Or do you only scrub when you’re about to display the data?
I will soon prove that input scrubbing is for pansies who are paranoid and tend to make up pathetic lies about their imaginary 20-year-old girlfriends.
Why input filtering is inefficient
- It’s bad to store data in a display-specific way (you have to unescape it when rendering non-HTML output: PDFs, email/text reports, etc.).
- You have to modify other areas of code than just DB storage, such as searching (a search for `<blah>` won’t match data stored as `&lt;blah&gt;`), which may not be immediately obvious.
- You could just auto-filter all incoming data, but there may be cases where you really can’t or don’t want to. I personally dislike blind filtering like this unless there is no better option.
- If you have existing data, you have to check it for pre-existing problems. With large data, this can be very slow.
- If you’re truly paranoid (as I am), you still won’t trust the DB data and will need to find a way to have input filtering work nicely with output filtering. This is a whole lot more work than just doing one or the other.
- If you use a good MVC system like Rails, you can actually escape all text fields as they’re read from the database if you want. With a carefully written ActiveRecord plugin for Rails, I’d bet you could have all accessors automatically escape their data if it’s textual, and even provide a method for getting at the unsafe data. (There’s a rough sketch of this idea just after this list.)
- I still don’t like such blind scrubbing logic, but better to blindly display scrubbed data than to blindly alter data before it hits your database.
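Here’s roughly what I mean, as plain Ruby rather than an actual ActiveRecord plugin (the `EscapedAccessors` module and `Comment` class are invented for the sketch): textual readers hand back escaped data by default, with a separate raw reader for the rare code that genuinely needs the original.

```ruby
require "cgi"

# A plain-Ruby sketch of the idea, not a real ActiveRecord plugin:
# readers return escaped text by default, plus a *_raw reader for the
# unescaped original.
module EscapedAccessors
  def escaped_accessor(*names)
    names.each do |name|
      define_method(name) { CGI.escapeHTML(instance_variable_get("@#{name}").to_s) }
      define_method("#{name}_raw") { instance_variable_get("@#{name}") }
      define_method("#{name}=") { |value| instance_variable_set("@#{name}", value) }
    end
  end
end

class Comment
  extend EscapedAccessors
  escaped_accessor :body
end

c = Comment.new
c.body = %(<script>alert("xss")</script>)
c.body     # => "&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;"
c.body_raw # => "<script>alert(\"xss\")</script>"
```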
Why input filtering can be dangerous
- If you can’t trust your programmers to do proper output filtering, why would you trust them to do proper input filtering?
- Yes, input filtering is liable to be in fewer locations, particularly if you filter all incoming parameters, but it’s still not a silver bullet, and has a lot of long-term risks when mistakes do happen (read on for details).
- Compare to output filtering in terms of the bug factor:
- Bugs will happen. If you truly believe you don’t ever write code with bugs, then by all means ignore this section. I’ll get a good laugh when you tell me about your first big project that went from a two-week estimate to a six-month half-finished-and-then-rewritten-from-scratch project from hell.
- If you mess up an output filter:
- You probably have an issue that’s confined to a single area on your site (the area you messed up).
- You do a quick hotfix, and the site is once again safe.
- If you mess up an input filter:
- Every area of the site that contains the data you missed is at risk.
- You do a quick hotfix to stop anything new from coming in, but existing data is still currently at risk.
- You find and quickly fix the very obvious offending data in the database.
- You wait until the site is slow (or you can take it down) and run through all data entered since you suspect the exploit came into existence, fixing it record by record.
- If future XSS issues arise, you have to retroactively fix your old data again instead of merely fixing your filter.
- New XSS vulnerabilities won’t arise, you say? Maybe so, but how many times have we computer folk shot ourselves in the foot with presumptions about the future? (“We’ll never need more than 640K of memory,” “nobody will still be using this old software when Y2K finally hits,” etc.)
- Note that XSS attackers have discovered that in some cases, the backtick character (`` ` ``) can be used for certain JavaScript-oriented attacks, particularly inside HTML attributes. This is not a character that is scrubbed by at least two different html_escape-type functions that I know of (demonstrated below). Enjoy retroactive data-fixing? Me too!
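A quick demonstration with two escapers from Ruby’s standard library; neither touches the backtick:

```ruby
require "cgi"
require "erb"

payload = "`onmouseover=alert(1)`"

CGI.escapeHTML(payload)        # => "`onmouseover=alert(1)`"
ERB::Util.html_escape(payload) # => "`onmouseover=alert(1)`"

# If that "escaped" value lands inside an HTML attribute, a browser that
# accepts backtick-delimited attribute values (older IE did) can be tricked
# into treating onmouseover as a live attribute.
```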
Why input filtering can be better (and my incessant arguments to prove that it really can’t)
The most logical argument I was given is that in a large enterprise, control of data output gets pretty tricky. So as far as I’m concerned, large companies are the only place the below issues have even a tiny bit of merit. And even then…
- In a large enterprise, you know that nobody will inadvertently display unsafe data, because all data is safe.
- Unless of course somebody writes a program that makes changes to the DB. Less likely than a rogue program that merely displays data, I agree, but still a possibility. In an organization that’s big enough to be at risk of multiple apps (not built by the “proper” people) reading its data, I’d say there is a definite risk of apps writing to that data as well.
- At my job, there have been several cases where somebody who wasn’t even a part of IT (a manager and a content designer) modified data directly in SQL, bypassing any hope of safeguards.
- In a large enterprise, I think it’s even more important than ever that all access to the DB goes through knowledgeable IT staff. Yes, I know this is a pipe dream, but I still think proper procedures can allow output filtering to be the clearly correct option.
- You can detect problems with input filters more easily, because you have the data that could be dangerous right at your fingertips. If need be, write a program that periodically audits your data to check for unsafe characters. If you messed up an input filter, this program can save you. (A sketch of such an audit follows this list.)
- Good testing does this same thing for output filtering. It’s far harder to write perfect tests for your app’s HTML output than to write a program to audit the DB for unsafe data, but it’s still the right way to do it.
- In my opinion, this is a wasteful use of resources, when those resources are being spent to prevent data from simply being stored in its original state.
- If you have a large amount of data that is changing all the time, this solution may simply not be doable. In what situation would you have that much data changing that regularly? Oh… I don’t know… maybe in a big corporate enterprise?
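For what it’s worth, the audit program mentioned above doesn’t have to be fancy. A rough sketch, with made-up record structure and a deliberately crude check (properly input-filtered data should contain no raw `<`, `>`, quotes, or backticks):

```ruby
# Flag stored values that an input filter should have escaped. The rows and
# field names are hypothetical; in real life you'd iterate your tables.
UNSAFE = /[<>"'`]/

def audit(rows)
  rows.each_with_index do |row, id|
    row.each do |field, value|
      next unless value.is_a?(String) && value =~ UNSAFE
      puts "record #{id}, field #{field}: suspicious value #{value.inspect}"
    end
  end
end

audit([
  { name: "fine",    bio: "plain text, even &lt;escaped&gt; entities pass" },
  { name: "trouble", bio: "<script>alert(1)</script>" },
])
# record 1, field bio: suspicious value "<script>alert(1)</script>"
```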
I think this is still an important topic, so I finally cleaned it up to display properly with Markdown formatting.