Be careful of Rubyforge gem!

I just discovered a weird issue. The comments under my name will explain it better than I can here, but sufficed to say, if you use Net::HTTP in ruby, do not install the RubyForge gem!

Read all about “the issue”:http://rubyforge.org/tracker/index.php?func=detail&aid=8907&group_id=1024&atid=4025 on the rubyforge bug tracking page.

The wonderful world of Cross-site scripting (XSS) – OR – why input filtering is bad

I have been dealing with XSS at my so-called “real job” recently, and it has come to my attention that a lot of people in this world are under the mistaken impression that it’s better to do “input filtering” than “output filtering”. As I pretty much came up with these terms myself (they may or may not exist elsewhere; I’m just too lazy to find out), I’ll define them for you:

Input Filtering: Scrubbing XSS-dangerous data out of your input before it gets saved anywhere.

Output Filtering: Scrubbing XSS-dangerous data only upon display.

Now, the most important concept here is that XSS is most dangerous when a user can see immediate results without alerting you, the web designer. So if you have a page that repeats their parameters back at them (say a search page where you put “Your search for $parameter could not be found”), that’s A) independent of input vs. output scrubbing, and B) extremely by far the most dangerous kind of XSS vulnerability. Why? Because it allows a user to post a link to your site that can execute malicious javascript. Bad, bad, bad.

After echoing user parameters is fixed, you have to look at how you display stored data. This is where the type of scrubbing comes into play – do you scrub the data before storing to your database / file system? Or do you only scrub when you’re about to display the data?

I will soon prove that input scrubbing is for pansies who are paranoid and tend to make up pathetic lies about their imaginary 20-year-old girlfriends.

Why input filtering is inefficient

  • It’s bad to store data in a display-specific way (have to unencode when displaying PDF, email/text reports, etc).
  • You have to modify other areas of code than just DB storage, such as searching (search for “<blah>” won’t yield “&lt;blah&gt;”), which may not be immediately obvious.
    • You could just auto-filter all incoming data, but there may be cases where you really can’t or don’t want to. I personally dislike blind filtering like this unless there is no better option.
  • If you have existing data, you have to check it for pre-existing problems. With large data, this can be very slow.
  • If you’re truly paranoid (as I am), you still won’t trust the DB data and will need to find a way to have input filtering work nicely with output filtering. This is a whole lot more work than just doing one or the other.
  • If you use a good MVC system like Rails, you can actually escape all text fields as they’re read from the database if you want. With a carefully written ActiveRecord plugin to Rails, I’d bet you could have all accessors automatically escape their data if it’s textual. And even provide a method for getting at the unsafe data.
    • I still don’t like such blind scrubbing logic, but better to blindly display scrubbed data than to blindly alter data before it hits your database.

Why input filtering can be dangerous

  • If you can’t trust your programmers to do proper output filtering, why would you trust them to do proper input filtering?
    • Yes, input filtering is liable to be in fewer locations, particularly if you filter all incoming parameters, but it’s still not a silver bullet, and has a lot of long-term risks when mistakes do happen (read on for details).
  • Compare to output filtering in terms of the bug factor:
    • Bugs will happen. If you truly believe you don’t ever write code with bugs, then by all means ignore this section. I’ll get a good laugh when you tell me about your first big project that went from a two-week estimate to a six-month half-finished-and-then-rewritten-from-scratch project from hell.
    • If you mess up an output filter:
    • You probably have an issue that’s confined to a single area on your site (the area you messed up).
    • You do a quick hotfix, and the site is once again safe.
    • If you mess up an input filter:
    • Every area of the site that contains the data you missed is at risk.
    • You do a quick hotfix to stop anything new from coming in, but existing data is still currently at risk.
    • You find and quickly fix the very obvious offending data in the database.
    • You wait until the site is slow (or you can take it down) and run through all data entered since you suspect the exploit came into existence, fixing it record by record.
  • If future XSS issues arise, you have to retroactively fix your old data again instead of merely fixing your filter.
    • New xss vulnerabilities won’t arise, you say? Maybe so, but how many times have we computer folk shot ourselves in the foot with presumptions about the future? (We’ll never need more than 640k memory, nobody will still be using this old software when y2k finally hits, etc)
    • Note that XSS attackers have discovered that in some cases, the backtick character (`) will work to do specific JS-oriented attacks. This is not a character that is scrubbed by at least two different html_escape types of functions that I know of. Enjoy retroactive data-fixing? Me too!

Why input filtering can be better (and my incessant arguments to prove that it really can’t)

The most logical argument I was given is that in a large enterprise, control of data output gets pretty tricky. So as far as I’m concerned, large companies are the only place the below issues even have a tiny bit of merit. And even then….

  • In a large enterprise, you know that nobody will inadvertantly display unsafe data, because all data is safe.
    • Unless of course somebody writes a program that makes changes to the DB. Less likely than a rogue program that merely displays data, I agree, but still a possibility. In an organization that’s big enough to be at risk of multiple apps reading data that wasn’t built by the “proper” people, I’d say there is a definite risk that apps will be writing to said data as well.
    • At my job, there have been several cases where somebody who wasn’t even a part of IT (a manager and a content designer) modified data directly in SQL, bypassing any hope of safeguards.
    • In a large enterprise, I think it’s even more important than ever that all access to the DB goes through knowledgeable IT staff. Yes, I know this is a pipe dream, but I still think proper procedures can allow output filtering to be the clearly correct option.
  • You can detect problems with input filters more easily, because you have the data that could be dangerous right at your fingertips. If need be, write a program that periodically audits your data to check for unsafe characters. If you messed up an input filter, this program can save you.
    • Good testing does this same thing for output filtering. It’s far harder to write perfect tests for your app’s HTML output than to write a program to audit the DB for unsafe data, but it’s still the right way to do it.
    • Resource usage is wasteful in my opinion, when the resources are being used to prevent data from simply being stored in its original state.
    • If you have a large amount of data that is changing all the time, this solution may simply not be doable. In what situation would you have that much data changing that regularly? Oh… I don’t know… maybe in a big corporate enterpise?

Rants about Ruby on Rails

h2. Disclaimers

  1. Ruby on Rails is as good as the hype, it really is. Building a maintainable web application just doesn’t get much better (yet). These are just some caveats I’ve found while using ROR(Ruby on Rails), and some will probably make me sound a lot more pissy than I intend, so keep in mind I would still rather use ROR(Ruby on Rails) than PHP, Perl, java, C, C#, etc. And also keep in mind that some of the problems I’m mentioning here are going to be found in other frameworks. ORM systems try to keep us from having to write SQL, and that is guaranteed to have some issues when performance is our #1 concern.

  2. I’m also the kind of person who seems to get pissed off about almost anything. And the thing is, I’m not actually pissed off – I’m annoyed. I just happen to be vocal about my annoyances…

  3. One final note… I’m not claiming to be a Rails expert! These issues are ones that I find problematic because of people like me who learn as we go. A true Rails guru probably will have better solutions to a lot of these issues. But I’m trying to help the Rails people who haven’t had the time to learn every little nuance in Rails, because I think there are a lot more of those than there are gurus…

h2. Background

Here’s the deal – at my job, I was assigned to a big project (monstrously big when you consider I was the only developer for 90% of the project), and I chose to use ROR(Ruby on Rails) for it. I don’t know if I’m at liberty to discuss the exact details, so let’s just say I’m working on a community-oriented site, revamping their product reviews. Currently, the site has information about something like 70,000 products, and each product has anywhere from 1 to 1000 user-submitted reviews. My revamp is a temporary fix until we get a java portlet solution working (That one was definitely not my decision).

h2. Minor annoyances

My main gripes relate to performance, so I’ll cover them in detail below. First, a list of my minor gripes… keep in mind, it’s often the “little” things that add up.

  1. Rails doesn’t currently support portlets in any way (I can’t find anything about ’em on google, at least). If it did, maybe we could convince the genius management types that Rails is a suitable solution. Hell, I don’t even think portlets are half as interesting to site visitors as they are to site builders. But since they’re a requirement in the long term, they’re what we’ve got.

  2. Rails documentation is pretty sparse for some of the internals. Look at the api site (http://api.rubyonrails.com) and try to find information about RJS or specific information about using routing / named routes. It may be there, but I sure can’t find it! And much of the documentation simply doesn’t go deep enough. Look at ActiveRecord::Migration and ActiveRecord::Schema – they tend to give the simplest cases and still have undocumented functions that you have to look into the source to figure out… no big surprise, Rails is still new, but still an annoyance.

  3. The testing is weird. I haven’t tried integration testing yet, and I think it might fix some of my issues with testing, but unit tests and functional tests are a bit odd at times. I have one test that passes when run by itself, 100 out of 100 times. But it fails when run with rake. I still haven’t figured that out… and functional tests… ugh. First off, the name seems to be wrong. They’re just more view/content-oriented unit tests. Second, they act strangely if you call multiple actions in one test. Even with the recent inclusion of integration testing, it’s just more convenient to write a “quick” functional test that hits the same action in various ways. But if you do that, there are cases where the test won’t work. I’ve had to split up some tests in order to get them to work in a situation where, for some reason, the testing framework is breaking them. Like the data refuses to change after calling one action… and this only happens sometimes, not always. And with such crappy documentation about how testing works (most docs for testing are third-party docs, and generally very shallow), it’s a pain in the ass!

  4. Some things are just documented incorrectly, and seem to be done on purpose for “simplicity”. Associations, for instance. If I ask for product.reviews, the docs say I get an array back. Well, technically it is an Array. But it’s been mucked with (the specific instance itself, not the Array class) in order to allow for the association’s special functions. Let’s say I want placeholders – for products with less than 10 reviews, I want the array padded to 10 items, where nil items tell my view to show the text, “Your review could be here!” (Stupid example, yes. There are other ways to do this, etc, etc. But the point isn’t this specific example so shut up.) The moment I call @reviews.push(nil) on that “Array” I got back, I’ll hit an error that nil isn’t a review, so I can’t add it to my array. The Array object that’s sent back is modified to allow the cool association features, but this is not clearly documented! The solution? @reviews = (complexassociationlogic).dup. By adding dup, you get a new Array that hasn’t been mucked with. Horrible for performance on large Arrays, I’d bet, but if you know the Array is going to be small, you can deal with it and get a true Array back.

  5. Inconsistent API gets really old. Parameters are a great example. When parsing parameters, hashes exist for parameters that are sent in a certain way (‘object[attribute]’ as an input field’s name is the most common reason to see this). And in the testing framework, you can send parameters to an action as hash elements, such as:

    
    get :action, :param1 => {:nested1 => 1, :nested2 => 2}
    
    You cannot do this with url_for style parameters! It will not internally convert from a hash to the expected “url-friendly” values. I don’t see the usefulness in allowing the hashed params in two places, and then having them break in another! I’m hoping it’s just an oversight, but all the same, it’s difficult.

h2. Major annoyances

Quick sidenote about how badly this site was written: when a user submits a review for a product, the review is stored as a flat text file instead of in the database. When the admin looks at reviews for approval (or rejection), he’s looking at the first line on the flat text file. When he clicks “Next”, the ENTIRE FILE is rewritten to move the next line to the top of the file. And this is one of the more logical operations on the site. Other examples include the fact that there’s currently a page that shows all reviews for a given category, and the biggest category, housing over 15,000 products, ends up with a 2.3Mb page that’s got so much stuff on it, it’s totally worthless to anybody. Don’t forget the minor detail that all pages are static html, generated every night by a review generation script that takes up to two hours to run on some extrememly fast systems (yes, there really is a lot of data). Dynamic content is so much sweeter after seeing a system like this.

Moving right along… so I’ve got this crappy legacy system to work with. I’m only trying to make the system a little easier to use, and more maintainable in the short term. This means I’m not getting any content folks to rewrite pages, I’m not totally redoing layout, etc. It’s all going to look very similar to the end user, but behave much differently on the back end. The current system is all perl scripts, and is easily the worst code I’ve ever seen. And given that I’ve seen my own code from years ago, that’s saying quite a lot.

So while I’m rewriting a lot of the backend here, I have to try my hardest to keep the look and feel moderately similar to what’s there right now. Tricky, tricky. And when moving from gobs of static files to dynamically-generated output, things get really painful….

Now to unveil my #1 gripe: Rails is very stupid when it builds SQL. Watch the dev log fly by when you do things that are oh-so-simple in code. (Lame gripe? Sure… this is actually a gripe that would exist in most ORM frameworks, even. But with this project it caused me some major headaches, so instead of having a more legitimate major gripe, I have this one. Deal with it.)

First example is a Rails newbie (me around the beginning of 2006). I have a list of manufacturers, and want to show each one, its products, and the product reviews. Simple:


  <% @manufacturers.each do |mfg| %>
    

<%= @manufacturer.name %> <% mfg.products.each do |prod| %>
  • <%= product.name %> <% prod.reviews.each do |review| %>
      <% review.ratings.each do |rating| %>
    • <%= rating.name %>: <%= rating.text %>
    • <% end %>
    <% end %>
  • <% end %> <% end %>

    When there are 1000 manufacturers for a given category, each with an average of 15 products, each with an average of 10 reviews, each with ~5 rating items, you can see very quickly that the number of hits to the database is insane. And this is a relatively simple example of what I’ve had to do.

    I learned quickly about eager loading. When I load my manufactureres, I pull in all products at the same time. This means a longer hit to the DB, but a lot less time parsing. But this still means a lot of hits for reviews and ratings! What do we do?

    Well, some of the smarter folks out there might suggest using a :through association to eagerly load the reviews. Then we’re only grabing each review’s ratings. Great idea, right? Surprisingly, no. While it may be a better situation than the prior one, it’s still far too slow for a page hit. The SQL for eagerly loading products will run on our dev machines in about 1/10th of a second. Throw in reviews and it’s up to TWELVE SECONDS! (Yes, I’ve already checked indexes, thanks for the tip, smartass). What’s more, the query gives me over 200,000 rows, while the manufacturer + product query only gives me about 30,000. That’s a whole lot more data to deal with. In fact, it’s exactly the number of rows that are in the review table to begin with. (This is obvious to somebody familiar with SQL….) Either way, the page load just for that data being processed (read from database and then turned into ActiveRecord objects) can end up being in excess of a minute.

    Now the naysayers will pipe up with “use caching!”

    So on to my next major gripe: caching is not very useful for a site with huge amounts of data to present. Oh sure, only one person has to be hit with that huge penalty, then everybody else is good. No big deal. Except when you need your data refreshed every hour, and 10 places on the site are dangerously slow (I consider >30 seconds to be in this realm). All of a sudden you have 10 pissed off customers an hour, 240 a day, and eventually everybody knows something’s not right.

    Background caching is something I would love to see! I wish I were a better programmer, I’d submit a patch… but boy, wouldn’t it be great to be able to say, “start rebuilding this cache, but until it’s done, show the old cached page”? Then nobody is ever hit with that initial page load! That would make insanely large sites able to present a perfect interface to the user, while keeping the data up to date as often as needed!

    Okay, back to my situation. Not only do I need to display all the data previously mentioned, but I also need to present some advertising-related data. Each product can have many advertiser_links. Each of those links points to an advertiser. To get at that data, I’ve just added two new joins. Pulling all the data at once is simply not doable. Just adding the ratings to the above query (remember we were just doing manufacturer, product, and review) makes it too slow (47 seconds!! over 1,370,000 rows!! YAY!) for me to ever consider for a real website. Adding the advertising information isn’t even worth timing. I tried, just for kicks, and couldn’t sit around long enough to see how long it took.

    So how do I get past this? Why, I turn to my friend the paginator! That’ll fix all my woes! Right?

    Once again… no. It’ll help, but certainly isn’t the final solution. The problem is that this site just has too much data. I paginate the hell out of it, and still most pages take too long to wait for loading. Especially since I can’t eagerly load as much as I need to in order to stop hitting the DB 10-50 times per product. Pagination is part of the answer, but by itself it isn’t bringing page load down low enough to matter. Early tests revealed 5-15 seconds per page because even though I’m pulling less data, I’m still hitting the DB so many times. And when I do lots of eager loading, the DB spends so much time that I don’t gain much on getting a small result set.

    SO HOW THE HELL DO I GET PAST THIS? At this point I had hit a wall; I just didn’t know what else to do.

    NOTE: there may be more options available that I don’t know. I know enough about Rails to know that there’s still a lot more that I have to learn.

    I ended up doing what I should have done long ago: building custom SQL. Okay, I actually didn’t build SQL outside of the Rails system except once, where I needed hundreds of thousands of rows of data without the overhead of building an object for each one. But I did have to make heavy use of the find method’s :joins and :select options (once you hit those two with a :conditions option, you’re writing about 70% of the sql for the query, minus the weird table/row aliasing rails does), and I did a lot of preloading data. For instance, I’d have my list of manufacturers (with products eagerly loaded) and the load all advertiser information, stored in a hash by product id. This allowed me to hit the database only a few times, and with relatively quick transactions.

    Another little cheat was to do a lot of @product.mfgid = xxx@ or @if (product.mfgid == xxx)@, instead of @product.manufacturer = xxx@ and @if (product.manufacturer == xxx)@. I may know an id for something, but not have it loaded on that specific product instance. Punching in the extra few characters (and therefore not loading more data) can yield tremendous performance bonuses.

    Another nice solution I came up with is one I will put up as a plugin at some point – cachedfind. If findevery (all finds eventually hit this one) is called with the same parameters twice, within a given amount of time, the plugin will not hit the database again. This saved me a ton of DB hits on tables that have a lot of data that never changes. Some will say, “A good DB server will cache that for you!”, and they are right. But that still requires a hit to the DB server (rarely on the same server your code is running, at least in a large app), and Rails has to parse that data. Returning pre-parsed data that was previously being loaded 50+ times per page makes a big difference. This can end up involving a ton of data being stored in memory, so this trick is really only good for small sets of data that are rarely changed, and loaded in a lot of different places for different reasons. In my case, things like product categories (loaded for products, manufacturerers, and reviews) and advertiser info (only a dozen or so advertisers exist in the data, but the extra join on two or three queries was still painful).

    The moral of the story is simple. Rails is proven for small sites, and even medium-to-large sites written with Rails from the beginning. Redoing a huge site to work in Rails requires a lot more than what the tutorials will ever tell you. It requires a lot of banging your head into walls, trying to rattle out one more life-saving plan. It requires a tenacity that not a lot of developers will have, especially those buying into the “Rails projects go 10 times faster!” hype. It requires a lot of digging into the internals of Rails. It requires at least a small amount of brain damage.

    Is it worth it to do this when I could be working on the Java solution? I certainly think so. I find that Java is painful enough to use (compared to Ruby) that it is still easier to play with new technology that may have little quirks than to deal with “reliable” and “proven” frameworks based in Java.