Nerdmaster does something useful…

No, children, this isn’t the unbelievable (and fictitious) story of a bad little nerd who finally finds redemption via some amazingly-selfless act. This is, instead, the true story of me, the very selfish nerd who offers something useful to the world in order to make himself (myself) feel superior.

I’ve been into statistics lately, and built a really slick hypergeometric distribution probability calculator. I offer the C++ and Ruby source for free, so if you’re interested in statistics and programming, check it out.

Why do I do this? Simple – for starters, I can always use the publicity. I linked to my page from Wikipedia’s article. I wouldn’t have bothered to put my code up if I hadn’t noticed that the only other link (here comes the superiority complex) was to a calculator that’s dog slow and gets odd values in some cases – if your desirables (white marbles) are set to 10 and your sample size (marbles drawn) is set to 20, it will actually show drawing 11 desirables as being possible! Plus it doesn’t show data in a very friendly way, doesn’t really explain the algorithm, and doesn’t offer source code. So I figured I could get some math nerds onto my site by throwing that link there, and more visitors is always good, even if they’re icky math nerds.
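For the curious, here is a rough Ruby sketch of the sanity check that other calculator misses. It is not the source of my calculator, just a minimal illustration: the method names and the population of 50 marbles are made up for the example.

```ruby
# Binomial coefficient C(n, k) with exact integer arithmetic.
# Returns 0 when k is out of range, which is what makes impossible draws impossible.
def choose(n, k)
  return 0 if k < 0 || k > n
  k = [k, n - k].min
  (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i }
end

# Probability of drawing exactly `hits` desirables when `sample` marbles are
# drawn without replacement from `population` marbles, `desirables` of which
# are the ones we want. Returned as an exact Rational.
def hypergeometric_pmf(population, desirables, sample, hits)
  ways = choose(desirables, hits) * choose(population - desirables, sample - hits)
  Rational(ways, choose(population, sample))
end

# Hypothetical bag: 50 marbles, 10 white, 20 drawn.
puts hypergeometric_pmf(50, 10, 20, 10).to_f # small, but possible
puts hypergeometric_pmf(50, 10, 20, 11).to_f # 0.0, you can't draw 11 of 10
```

Using exact integers and Rationals keeps the probabilities summing to exactly 1, at the cost of some speed on very large populations.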

The other reason is to impress some C++ loving weirdos at Arena Net. I guess they think a client/server game with strict performance requirements should be written in C++ or something stupid like that. I tried to explain that VB is the only language for programming a serious game, but they just wouldn’t listen.

Anyway, enjoy the code. I licensed it as a “use however the frack you want” kind of license, so if it’s useful to you, just let me know you like it.

Sloccount’s sloppy slant – or – how to manipulate programming projects the Wheeler Way

Sloccount is my newest Awesome Software Discovery. It’s a great idea, but is far too simple to do what it claims: estimate effort and expense of a product based on lines of code. And really, I wouldn’t expect it to be that great. The model used to estimate effort is certainly not the author’s fault, as it isn’t his model. But that idiot (David Wheeler) doesn’t just say it’s a neat idea – he actually uses this horrible parody of good software to “prove” that linux is worth a billion dollars. For the record, I prefer linux for doing any kind of development. I hate Windows for development that isn’t highly visual in nature (Flash, for instance, kind of requires Win or Mac), and Macs are out of my price range for a computer that doesn’t do many games. So Linux and I are fairly good friends. I just happen to be sane about my liking of the OS. (Oh, and BSD is pretty fracking sweet, too, but Wheeler didn’t evaluate it, so neither will I)

The variables

To show the absurdity of sloccount, here’s a customized command line that assumes pretty much the cheapest possible outcome for a realistic project. The project will be extremely easy for all factors that make sense in a small business environment. We assume an Organic model as it is low-effort and the most likely situation for developing low-cost software.

Basically I’m assuming a very simple project with very capable developers. I’m not assuming the highest capabilities when it comes to the dev team because some of that stuff is just nuts – the whole team on a small project just isn’t likely to have 12+ years of experience and be in the top 10% of all developers. But the assumptions here are still extremely high – the team is at the 75th percentile in all areas, with 6-12 years of experience, but pay is very low all the same. This should show a pretty much best-case scenario.

Also, I’m setting overhead to 1 to indicate that in our environment we have no additional costs – developers work from home on their own equipment, we market via a super-cheap internet site or something (or don’t market at all and let clients do our marketing for us), etc.

Other factors (from sloccount’s documentation):

  • RELY: Very Low, 0.75
    • We are a small shop, we can correct bugs quickly, our customers are very forgiving. Reliability is just not a priority.
  • DATA: Low, 0.94
    • Little or no database to deal with. Not sure why 0.94 is the lowest value here, but it is so I’m using it.
  • CPLX: Very Low, 0.70
    • Very simple code to write for the project in question. We’re a small shop, man, and we just write whatever works, not whatever is most efficient or “cool”.
  • TIME: Nominal, 1.00
    • We don’t worry about execution time, so this isn’t a factor for us. Assume we’re writing a GUI app where most of the time, the app is idle.
  • STOR: Nominal, 1.00
    • Same as time – we don’t worry about storage space or RAM. We let our users deal with it. Small shop, niche market software, if users can’t handle our pretty minimal requirements that’s their problem.
  • VIRT: Low, 0.87
    • We don’t do much changing of our hardware or OS.
  • TURN: Low, 0.87
    • I don’t know what this means, so I’m assuming the best value on the grid.
  • ACAP: High, 0.86
    • Our analysts are good, so we save time here.
  • AEXP: High, 0.91
    • Our app experience is 6-12 years. Our team just kicks a lot of ass for being so underpaid.
  • PCAP: High, 0.86
    • Again, our team kicks ass. Programmers are very capable.
  • VEXP: High, 0.90
    • Everybody kicks ass, so virtual machine experience is again at max, saving us lots of time and money.
  • LEXP: High, 0.95
    • Again, great marks here – programmers have been using the language for 3+ years.
  • MODP: Very High, 0.82
    • What can I say? Our team is very well-versed in programming practices, and makes routine use of the best practices for maintainable code.
  • TOOL: Very High, 0.83
    • I think this is kind of a BS category, as the “best” system includes requirements gathering and documentation tools. In a truly agile, organic environment, a lot of this can be skipped simply because the small team (like 2-3 people) is so close to the codebase that they don’t have any need for complexities like “proper” requirements gathering. Those things on a small team can really slow things down a lot. So I’m still giving a Very High rating here to reflect speedy development, not to reflect the grid’s specific toolset. For stupid people (who shouldn’t even be reading this article), this biases the results against my claim, not for it.
  • SCED: Nominal, 1.00
    • Not sure why nominal is best here, but it’s the lowest-effort value so it’s what I’m choosing. Dev schedules in small shops are often very flexible, so it makes sense to choose the cheapest option here.

So our total effort will be:

0.75 * 0.94 * 0.70 * 1.00 * 1.00 *                # RELY - STOR
0.87 * 0.87 * 0.86 * 0.91 * 0.86 *                # VIRT - PCAP
0.90 * 0.95 * 0.82 * 0.83 * 1.00 *                # VEXP - SCED
2.3                                               # Base organic effort

= 0.33647 effort
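If you’d rather not punch fifteen numbers into a calculator, the multiplication above can be sketched in Ruby. The hash keys and the method name are just labels I picked for the example; the values are the ones from the list.

```ruby
# Cost-driver multipliers from the list above (the cheapest-case small shop).
MULTIPLIERS = {
  rely: 0.75, data: 0.94, cplx: 0.70, time: 1.00, stor: 1.00,
  virt: 0.87, turn: 0.87, acap: 0.86, aexp: 0.91, pcap: 0.86,
  vexp: 0.90, lexp: 0.95, modp: 0.82, tool: 0.83, sced: 1.00
}.freeze

# Base effort for the organic model, as used in the calculation above.
BASE_ORGANIC_EFFORT = 2.3

# Multiply the base effort by every cost-driver multiplier.
def effort_factor(multipliers, base = BASE_ORGANIC_EFFORT)
  multipliers.values.reduce(base) { |product, m| product * m }
end

puts effort_factor(MULTIPLIERS).round(5) # prints 0.33647
```

Trying a different rating is then a one-line change, e.g. `effort_factor(MULTIPLIERS.merge(cplx: 0.85))`.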

We’re also going to assume a cheap shop that pays only $40k a year to programmers, because it’s a small company starting out. Or the idiot boss only pays his kids fair salaries. Or something.

Command line:

sloccount --overhead 1 --personcost 40000 --effort 0.33647 1.05

Bloodsport Colosseum

For something simple like Bloodsport Colosseum, this is an overly-high, but acceptable estimate. With HTML counted, the estimate is 5.72 man-months. Without, it’s 4.18 man-months. We’ll go with the average since my HTML counter doesn’t worry about comments, and even with rhtml having embedded ruby, the HTML was usually easier than the other parts of the game. So this comes to 4.95 months. That’s just about 21 weeks (4.95 months @ 30 days a month, divided by 7 days a week = just over 21). At 40 hours a week that would work out to 840 hours. I spent around 750 hours from start (design) to finish. I was very unskilled with Ruby and Rails, so this estimate being above my actual time is certainly off (remember I estimated for people who were highly skilled), and a lot of the time I spent on the project was replacing code, not just writing new code. But overall it’s definitely an okay ballpark figure.

When you start adding more realistic data, though, things get worse.

If you simply assume the team’s capabilities are average instead of high (which is about right for BC), things get significantly worse, even though the rest of the factors stay the same:

0.75 * 0.94 * 0.70 * 1.00 * 1.00 *                # RELY - STOR
0.87 * 0.87 * 1.00 * 1.00 * 1.00 *                # VIRT - PCAP
0.90 * 0.95 * 0.82 * 0.83 * 1.00 *                # VEXP - SCED
2.3                                               # Base organic effort

= 0.4999 effort

This changes our average from 4.95 man-months to 7.3 months, or about 31 weeks. That’s 1240 hours of work, well more than I actually spent. From design to final release, including the 1000-2000 lines of code that were removed and replaced (i.e., big effort for no increase in LoC), I spent about 40% less time than the estimate here.
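The months-to-hours conversion I keep doing can be written down once. This is only a sketch of my own back-of-the-envelope convention (30-day months, 7-day weeks, 40-hour weeks); the constant and method names are made up for the example.

```ruby
DAYS_PER_MONTH = 30.0
DAYS_PER_WEEK  = 7.0
HOURS_PER_WEEK = 40.0

# Hours I actually spent on Bloodsport Colosseum, design to release.
ACTUAL_HOURS = 750.0

# Convert estimated person-months into working hours.
def person_months_to_hours(months)
  weeks = months * DAYS_PER_MONTH / DAYS_PER_WEEK
  weeks * HOURS_PER_WEEK
end

[4.95, 7.3].each do |months|
  estimate = person_months_to_hours(months)
  saved = (1 - ACTUAL_HOURS / estimate) * 100
  puts format("%.2f months -> %.0f hours; actual time was %.0f%% lower",
              months, estimate, saved)
end
```

The unrounded figures come out slightly above the 840 and 1240 hours quoted, because I rounded down to whole weeks before multiplying.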

…And for the skeptics, no, I’m not counting the rails-generated code, such as scripts/*. I only included app/, db/ (migration code), and test/.

However, this still is “close enough” for me to be willing to accept that it’s an okay estimate. No program can truly guess the effort involved in any given project just based on lines of code, so being even remotely close is probably good enough. The problem is when you look at less maintainable code.

Just for fun, you can look at the dev cost, which is $21k to $28k, depending on whether you count the HTML. I wish I could have been paid that kind of money for this code….

Murder Manor

This app took me far less time than BC (no more than 150-200 hours). I was more adept at writing PHP when I started this than I was at writing Ruby or using Rails when I started BC. But the overall code is still far worse because of my lack of proper OO and such. So I tweak the numbers again, to reflect a slightly skilled user of the language, but worse practices, software tools, and slightly more complex product (code was more complex even though BC as a project had more complex rules. Ever wonder why I switched from PHP for anything over a few hundred lines of code?):

0.75 * 0.94 * 0.85 * 1.00 * 1.00 *                # RELY - STOR
0.87 * 0.87 * 1.00 * 1.00 * 1.00 *                # VIRT - PCAP
0.90 * 0.95 * 1.00 * 1.00 * 1.00 *                # VEXP - SCED
2.3                                               # Base organic effort

WHOA. Effort jumps to 0.8919! New command line:

sloccount --overhead 1 --personcost 40000 --effort 0.8919 1.05

This puppy ends up being 3.4 months of work. That’s 14.5 weeks, or 580 hours of work — around triple my actual time spent!

Looking at salary info is something I tend to avoid because as projects get big, the numbers just get absurd. In this case, even with a mere 3500-line project, the estimate says that in the environment of cheap labor and no overhead multiplier, you’d need to pay somebody over $10k to rewrite that game. Good luck to whatever business actually takes these numbers at face value!

But these really aren’t the bad cases. Really large codebases are where sloccount gets absurd.

Big bad code

Slash ’em is a great test case. It isn’t OO, is highly complex, and has enough areas of poor code that I feel comfortable using values for average-competency programmers. So here are my parameters, in depth:

  • RELY: Very Low, 0.75
    • Free game, so not really any need to be highly-reliable.
  • DATA: Nominal, 1.00
    • The amount of data, in the form of text-based maps, data files, oracle files, etc. is pretty big, so this is definitely 1.00 or higher.
  • CPLX: Very High, 1.30
    • Complex as hell – the codebase supports dozens of operating systems, and has to keep track of a hell of a lot of data in a non-OO way. It’s very painful to read through and track things down.
  • TIME: High, 1.11
    • Originally Nethack was built to be very speedy to run on extremely slow systems. There are tons of hacks in the code to allow for speeding up of execution even today, possibly to accommodate pocket PCs or something.
  • STOR: Nominal, 1.00
    • I really can’t say for sure if Slash ’Em is worried about storage space. It certainly isn’t worried about disk, as a lot of data files are stored in a text format. But I don’t know how optimized it is for RAM use – so I choose the lowest value here.
  • VIRT: Nominal, 1.00
    • Since the app supports so many platforms, this is higher than before. I only chose Nominal because once a platform is supported it doesn’t appear its drivers change regularly if at all.
  • TURN: Low, 0.87
    • Again, I don’t know what this means, so I’m assuming the best value on the grid.
  • ACAP: Nominal, 1.00
    • Mediocre analysts
  • AEXP: Nominal, 1.00
    • Mediocre experience
  • PCAP: Nominal, 1.00
    • Mediocre programmers
  • VEXP: Nominal, 1.00
    • Okay experience with the virtual machine support
  • LEXP: Nominal, 1.00
    • Mediocre language experience
  • MODP: Nominal, 1.00
    • The code isn’t OO, which for a game like this is unfortunate, but overall the code is using functions and structures well enough that I can’t really complain about a lot other than lack of OO.
  • TOOL: Nominal, 1.00
    • Again, nominal here – the devs may have used tools for developing things, I really can’t be sure. I know there isn’t any testing going on, so I can be certain that 1.00 is the best they get.
  • SCED: Nominal, 1.00
    • The nethack and slash ’em projects are unfunded, and have never (as far as I can tell) worried about a release schedule. Gotta choose the cheapest value here.

Total:

0.75 * 1.00 * 1.30 * 1.11 * 1.00 *                # RELY - STOR
0.87 *                                            # TURN (the rest are 1.00)
2.3                                               # Base organic effort

Total is now 2.166 effort. New command line, still assuming cheap labor and no overhead:

sloccount --overhead 1 --personcost 40000 --effort 2.166 1.05

Slash ’Em is a big project, no doubt about it. But the results here are laughable at best. The project has 250k lines of code, mostly ANSI C. The estimate is that this code would take nearly 61 man-years of effort. The cost at $40k a year would be almost $2.5 million! With an average of just under 24 developers, the project could be done in two and a half years.
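To see roughly where those numbers come from, here is a sketch of the arithmetic as I understand it from sloccount’s documentation: effort in person-months is the effort factor times KSLOC raised to the exponent, and the schedule is 2.5 times effort raised to 0.38. Treat both formulas, the method name, and the round 250 KSLOC input as assumptions of this example rather than a dump of sloccount’s internals.

```ruby
# Rough COCOMO-style estimate, mirroring the --effort, --personcost and
# --overhead options used on the command line above.
def cocomo_estimate(ksloc:, effort_factor:, exponent: 1.05,
                    person_cost: 40_000, overhead: 1.0)
  person_months   = effort_factor * ksloc**exponent
  schedule_months = 2.5 * person_months**0.38 # assumed default schedule model
  {
    person_years:       person_months / 12.0,
    schedule_years:     schedule_months / 12.0,
    average_developers: person_months / schedule_months,
    cost:               person_months / 12.0 * person_cost * overhead
  }
end

estimate = cocomo_estimate(ksloc: 250.0, effort_factor: 2.166)
estimate.each { |name, value| puts format("%-20s %.1f", name, value) }
```

With a round 250 KSLOC this lands a little under the figures quoted (about 59 person-years instead of nearly 61), which is what you’d expect if the real line count is somewhat above 250k.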

I worked for a company a while ago that built niche-market software for the daycare industry. They had an application that took 2-3 people around 5 years to build. It was Visual C code, very complex, needed a lot more reliability than Slash ’Em, was similar in size (probably closer to 200k lines of code), and had a horrible design process in which the boss would change his mind about which features he wanted fairly regularly, sometimes scrapping large sections of code. That project took at most 15 man-years to produce. To me, the claim that Slash ’Em was that much bigger is a great reason to make the argument that linux isn’t worth a tenth what Wheeler claims it is. Good OS? Sure. But worth a billion dollars??

Linux and the gigabuck

I’m just not sure how anybody could buy Wheeler’s absurd claim that Linux would cost over a billion dollars to produce. Sloccount is interesting for sure, particularly for getting an idea of one project’s complexity compared to another project. But using the time and dollar estimates is a joke.

Wheeler’s own BS writeup proves how absurd his claims are: Red Hat Linux 6.2 would have taken 4500 man-years to build, while 7.1, released a year later, would have taken 8000 man-years. I’m aware that there was a lot of new open source in the project, and clearly a small team wasn’t building all the code. But to claim that the extra 13 million lines of code are worth 3500 years of effort, or 400 million dollars…. I dunno, to me that’s just a joke.

And here’s the other thing that one has to keep in mind: most projects are not written 100% in-house. So this perceived value of Linux due to the use of open source isn’t exclusive to Linux or open source. At every job I’ve had, we have used third-party code, both commercial and open source, to help us get a project done faster. At my previous job, about 75% of our code was third-party. And in one specific instance, we paid about a thousand dollars to get nearly 100,000 lines of C and Delphi code. The thing with licensing code like this is that the company doing the licensing isn’t charging every user the value of their code – they’re spreading out the cost to hundreds or even thousands of users so that even if their 100k lines are worth $50k, they can license the code to a hundred users at $1000 a pop. Each client pays 2% of the total costs – and the developers make more money than the code is supposedly worth. And clearly this saves a ton of time for the developer paying for the code in question.

If you ignore the fact that big companies can use open source (or commercially-licensed code), you can conjure up some amazing numbers indeed.

I can claim that Bloodsport Colosseum is an additional 45 months of effort simply by counting just the ruby gems I used (action mailer, action pack, active record, active support, rails, rake, RedCloth, and sqlite3-ruby). Suddenly BC is worth over $175k (remember, labor is still $40k a year and I am still assuming a low-effort project) due to all the open source I used to build it.

Where exactly do we draw the line, then? Maybe I include all of Ruby’s source code since I used it and its modules to help me build BC. Can I now claim that BC is worth more than a million dollars?

Vista is twice as good as Linux!

As a final proof of absurdity, MS has a pretty bad track record for projects taking time, and the whole corporate design/development flow slowing things down. Vista is supposed to be in the realm of 50 million lines of code. Using the same methods Wheeler used to compute linux’s cost and effort, we get Vista being worth a whole hell of a lot more:

Total physical source lines of code:                    50,000,000
Estimated Development Effort in Man-Years:              17,177
Estimated cost (same salaries as linux estimate,        $2.3 billion
  $56,286/year, overhead=2.4)

To me these numbers look just as crazy as the ones in the Linux estimate, but MS being the behemoth it is, I’m not going to try and make a case either way. Just keep in mind that MS would have had to dedicate almost 3,000 employees to working on Vista full-time in order to get 17,177 years of development done in 6.
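The Vista numbers are easy to reproduce. The sketch below assumes the estimate uses an organic effort factor of 2.4 with an exponent of 1.05 (my understanding of sloccount’s defaults), plus the salary and overhead shown in the table; the constant names are mine.

```ruby
VISTA_KSLOC           = 50_000.0 # 50 million lines, in thousands
DEFAULT_EFFORT_FACTOR = 2.4      # assumed default organic factor
DEFAULT_EXPONENT      = 1.05
SALARY                = 56_286   # dollars per year, from the table above
OVERHEAD              = 2.4      # from the table above

person_months = DEFAULT_EFFORT_FACTOR * VISTA_KSLOC**DEFAULT_EXPONENT
person_years  = person_months / 12.0
cost          = person_years * SALARY * OVERHEAD
staff_needed  = person_years / 6.0 # squeezing it all into six years

puts "Man-years: #{person_years.round}"
puts format("Cost: $%.1f billion", cost / 1e9)
puts "Full-time staff over 6 years: #{staff_needed.round}"
```

Under those assumptions the man-years come out at the 17,177 in the table, the cost at about $2.3 billion, and the headcount a bit under 3,000.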

The important thing here is that by Wheeler’s logic, Vista is actually worth more than linux. By a lot.

Linux fanatics are raving idiots

So all you Linux zealots, I salute you for being so fiercely loyal to your favorite OS, but coming up with data like this (or simply believing in and quoting it) just makes linux users look like a ravenous pack of fools. Make your arguments, push your OS, show the masses how awesome Linux can be. But make sound arguments next time.

The move to typo 4.0

Typo is my blogging software. Written in Ruby on Rails, it seemed like an ideal choice for me since I’m a big fan of the RoR movement. But like so many other open source applications, Typo has got some major problems. I’m not going to say another open source blog would have been better (though I suspect this is true from other pages I’ve found on the net), but Typo has been a major pain in the ass to upgrade.

For anybody who has to deal with this issue, I figure I’ll give a nice account here.

First, the upgrade tool is broken. If you have an old version of typo that has migrations numbered 1 through 9 instead of 001 through 009, you get conflicts during the attempt at migrating to the newest DB format. You must first delete the old migrations, then run the installer:

rm /home/username/blog_directory/db/migrate/*
typo install /home/username/blog_directory

Now you will (hopefully) get to the tests. These will probably fail if, like me, your config/database.yml file is old and doesn’t use sqlite. Or hell, if it does use sqlite but your host doesn’t support that. Anyway, so far as I’m concerned the tests should be unnecessary by the time the Typo team releases a version of Typo to the public.

Next, if you have a version of typo that uses the components directory (back before plugins were available in Rails, I’m guessing), the upgrade tool does not remove it. This is a big deal, because some of the components that are auto-loaded conflict with the plugins, causing all sorts of stupid errors. That directory has to be nuked:

rm -rf /home/username/blog_directory/components

This solves a lot of issues. I mean, a lot. If you’re getting errors about the “controller” method not being found for the CategorySidebar object, this is likely due to the components directory.

Another little quirk is that when Typo installs, it replaces the old vendor/rails directory with the newest Rails code. But it does not remove the old code! This is potentially problematic, as I ended up with a few dozen files in my vendor/rails tree that weren’t necessary, and may have caused some of my conflicts (I never was able to fully test this and now that I have things working, I’m just not interested). Very lame indeed. To fix this, kill your rails dir and re-checkout version 1.2.3:

rm -rf /home/username/blog_directory/vendor/rails
cd /home/username/blog_directory
rake rails:freeze:edge TAG=rel_1-2-3

My final gripe was the lack of even a mention that older themes may not work. I had built a custom typo theme which used some custom views. But of course I didn’t know it was the theme until I spent a little time digging through the logs to figure out why things were still broken. Turned out my theme, based on the old Azure theme and some of the old view logic for displaying articles, was trying to hit code that no longer existed. Yes, my theme was using an old view and the old layout, both of which were hitting no-longer-existing code. But better API coding for backward compatibility would have made sense, since they did give you the option to use a theme to override views and layouts. Or at the very least, a warning would have been real nice. “Danger, danger, you aren’t using a built-in theme! Take the following precautions, blah blah blah, Jesus loves you.”

How do you fix the theme issue, though, if you can’t even log in to the blog to change it? Well, like all good programmers who are obsessively in love with databases, the typo team decided to store the config data in the database. And like all bad open-source programmers, they stored that data in an amazingly stupid way. I like yaml, don’t get me wrong – it’s amazingly superior to that XML crap everybody wants to believe is useful. But in a database, storing data in yaml format seems just silly.

<rant>

PEOPLE, LISTEN UP, if you’re going to store config that’s totally and utterly NOT relational, do not use a relational database. It’s simple. Store the config file as a yaml file. If you are worried about the blog being able to write to this file, fine, store your data in the DB, but at least store it in a relational sort of way. Use a field for each config directive if they’re not likely to change a lot, or use a table that acts like a hash (one field for blog_id, one for setting_name, one for setting_value). But do something that is easy to deal with via SQL. Show me the SQL to set my theme from ‘nerdbucket’ to ‘azure’ please. When you can’t use your database in a simple, straightforward way, you’ve fracking messed up. Yes, there are exceptions to every rule, but this blog software is not one of them. It would not have been hard to store the data in a neutral format that would make editing specific settings easy.
</rant>

Sorry. Anyway, how to fix this – the database has a table called “blogs” that has a single record for my blog. This record stores the base url and a bunch of yaml for the site config. You edit the field “settings” in the blogs table, and change just the part that says “theme: blah” to “theme: azure”. If you don’t have access to a simple tool like phpmyadmin, then you’ll likely have to retrieve the value from your mysql prompt, edit it in the text editor of your choice, and then reset the whole thing, making sure to use newlines properly so as not to screw up the yaml format…. Then you are up and running and can worry about fixing the theme at your leisure.
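If hand-editing yaml at a mysql prompt sounds like a good way to break your blog, the edit itself can be sketched in Ruby once you’ve copied the field’s value out. This assumes the settings blob is a plain yaml mapping with a “theme” key; the other key and the method name are invented for the example, and getting the string in and out of the database is left to whatever tool you have.

```ruby
require "yaml"

# Parse the settings blob, swap the theme, and serialize it again so the
# newlines and indentation stay valid yaml.
def switch_theme(settings_yaml, new_theme)
  settings = YAML.load(settings_yaml)
  settings["theme"] = new_theme
  YAML.dump(settings)
end

# Hypothetical value copied out of the "settings" field of the blogs table.
old_settings = "---\nblog_name: My Blog\ntheme: nerdbucket\n"

puts switch_theme(old_settings, "azure")
```

Letting the yaml library write the result avoids the newline mistakes that are easy to make when pasting the value back by hand.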

Now, to be fair, I think I could have logged in to the admin area without fixing my theme, and then fixed it there. But with all the problems I was having, I thought it best to set the theme in the DB to see if that helped get the whole app up and running. Obviously it wasn’t the theme that was killing my admin abilities (and I can’t even remember anymore what it was). But once I hit that horrible config storage, my stupidity felt ever so much smarter compared to the person who designed typo’s DB.

Typo is pretty sweet when you don’t have to delve under the hood. But “little” things like that can make or break software, and I hope to <deity of your choice> that the next upgrade is a whole lot smoother.


UPDATE UPDATE HOORAY

One more awesome annoyance. It seems all my old blog articles are tagged as being formatted with “Markdown”. When I created them, I formatted them with “Textile”. If you’re not up on these two awesome formatting tools, take a look at a simple example (the first is how Textile appears when run through the Markdown formatter):

  • This “website”:http://www.nerdbucket.com is really sweet, dude!
  • This website is really sweet, dude!

I’ve been using Markdown lately as I kind of prefer it now. But my old articles are in Textile format. I don’t know why upgrading my fracking blog loses the chosen format, but boy is it fun going through old articles and fixing them!!
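For the link syntax at least, the fixing could be scripted instead of done by hand. Here is a minimal sketch that rewrites Textile-style links as Markdown links; it only handles the straight-quoted “text”:url form shown above, and the constant and method names are made up. Anything fancier in Textile (lists, headings, footnotes) would need a lot more than one regex.

```ruby
# Matches "link text":http://example.com, leaving trailing punctuation
# such as a comma or period outside the captured URL.
TEXTILE_LINK = /"([^"]+)":(\S*[^\s.,;:!?)])/

# Rewrite every Textile link in the text as a Markdown link.
def textile_links_to_markdown(text)
  text.gsub(TEXTILE_LINK, '[\1](\2)')
end

old_article = 'This "website":http://www.nerdbucket.com is really sweet, dude!'
puts textile_links_to_markdown(old_article)
```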

How not to benchmark software

I’ve just stumbled upon an amazingly misinformed benchmark about Flex, from an actual Adobe employee, Matt Potter.

This guy benchmarks JSON, AMFPHP, and XML as ways to transmit data between PHP and a Flex app. His findings show that XML is generally faster than either JSON or AMFPHP. This “discovery” could revolutionize the way we send and receive data on the net! Who’d have thought that XML is so efficient? Truly amazing results!

But if we choose to drop back down to Earth from the blissful land of Ignoramia, we may find that even Adobe devs can make horrible mistakes.

So why is this year-old article worth dissecting? Simple – it comes up FIRST when you search google for “json flex”, which makes it a great tool of misinformation for people looking for ways to incorporate the awesomeness of JSON into flex! Note that if it were a random article that was at least moderately hard to find, I probably wouldn’t care too much.

So Matt Potter compares XML, AMFPHP, and JSON. His first and most amazing mistake is that he’s using raw XML, but converting data structures in PHP into JSON and AMFPHP. XML is expensive to create as well as to read, so skipping that step completely invalidates his article in my opinion. But worse still, he tests against a local server. The network overhead of XML is going to be significantly worse than JSON in most cases (no idea about AMFPHP as I’ve never used it), so ignoring the 2-3x bigger data really doesn’t do much for providing a valid test.
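The size difference is easy to check for yourself. Here is a rough Ruby sketch (not Matt’s PHP setup) that serializes the same made-up records both ways and compares the byte counts; the record fields are hypothetical and the XML is built naively with no escaping, so treat the ratio as illustrative only.

```ruby
require "json"

# Hypothetical records of the kind a Flex app might request.
records = (1..100).map do |i|
  { "id" => i, "name" => "user#{i}", "score" => i * 10 }
end

json_payload = JSON.generate(records)

# Naive element-per-field XML for the same data.
xml_rows = records.map do |record|
  fields = record.map { |key, value| "<#{key}>#{value}</#{key}>" }.join
  "<record>#{fields}</record>"
end
xml_payload = "<records>#{xml_rows.join}</records>"

puts "JSON: #{json_payload.bytesize} bytes"
puts "XML:  #{xml_payload.bytesize} bytes"
puts format("XML is %.2fx the size of JSON",
            xml_payload.bytesize.to_f / json_payload.bytesize)
```

How much bigger the XML ends up depends on the shape of the data, and a fair benchmark would also time building each payload on the server and sending it over a real network, not just parsing it on the client.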

One of the comments also mentions that there’s a PHP extension for JSON that’s better than what Matt used, and Matt’s response: “I used the Zend Framework instead of the json php extension because I really think that the Zend Framework is the easiest to setup and use, and I have other examples of using the ZF that I’m going to publish”. So instead of looking for the best tool for the job, he went with the easiest tool. But for XML testing he went with the hardest but most efficient “tool”: manual creation of XML with no conversion from objects, no use of XML creation tools, nothing.

I dislike ignorance when one tries to present facts, but this article actually makes me suspicious that his intention was to “prove” XML was the best technology of the three, and that he was willing to manipulate data in any way necessary to provide evidence. It’s pretty despicable to have a position of influence (adobe employee) and abuse it to prove a totally BS point.

It should be noted that some of the comments, particularly Blaine McDonnell’s, ripped Matt apart better than I can. But when ignorance and/or deception rear their ugly heads, it never hurts to point them out one last time.

Yet another Awesome Software Discovery!

This time it’s a piece of javascript to compute ideal body weight in a variety of ways, the most interesting of which claims to tell you what other people like you consider their ideal body weight. Very “slick” little system, if you care about such things.

I’m always on the lookout for crazy new technology, so when I found this ideal weight calculator (http://www.halls.md/ideal-weight/body.htm), I was overjoyed by how many different algorithms seemed to be present. When I looked at the source and found that the author was using javascript, I was again very excited. This meant I could look at (and possibly learn) his algorithms!

And so here they are: javascript source for calculating ideal weight (http://www.halls.md/ideal-weight/body2.js).

But read the copyright message with me and bask in the author’s sheer genius! Clearly he (or maybe she? No idea, don’t care) considers the algorithms to be proprietary and will MESS YOU UP if you steal them! So I guess I won’t bother to learn them. Hell, merely looking at them is probably illegal.

So aside from the author’s painful arrogance and stupidity, what can we learn about him from this script? Simple – he thinks he’s some sort of omniscient deity (don’t mess with me lest I strike ye down, mortal! And I will know if you try: “you won’t get away with copying this code”), and yet he doesn’t have even the tiniest iota of smarts when it comes to securing what he claims is “truly my unique creation and algorithms”.

O, Great and Wonderful Physician (yes, that’s right kids, he points out to all us lesser mortals that he’s a god damn physician!), I beseech thee! A bit of simple and kind advice for you: if you want an algorithm to be protected, don’t publish it on the web. In un-obfuscated javascript no less. Obfuscation isn’t bulletproof, not by a long shot, but it’s better than nothing.

And really, go for a server-side approach if you’re as paranoid as you seem to be. Once you use javascript, everybody who visits your site has copied it. This is not because they’re all thieves, but because of a little thing called the browser cache. Not only that, but anybody can view your proprietary algorithm and rewrite it. Copyright it all you want, a rewritten version of the algorithm is going to be COMPLETELY LEGAL! Copyrights only protect exact (or very nearly exact) duplication. You need a patent to protect an algorithm. For a basic description of the algorithm, read below. I was gonna rewrite it in javascript, but it’s really quite worthless, so explaining it should piss off our good doctor well enough.

<By the way, the message should be “It’s copyrighted”, or since you’re talking about scripts (plural), maybe “They’re copyrighted”. Note the apostrophes. Apostrophes can be your friend.>

The good news is that his script is so mundane and, dare I say it, not unique – most of the script is other people’s work on pretty standard formulas. Why, you ask, is this good news? Because he doesn’t actually need to worry about people stealing it!

His “secret formula” is well worth discussion, however:

You go to the site. You put in your height and weight. His script uses a very standard formula to calculate BMI. His “people’s choice” code then cuts BMI down by 40% or 50% (gender determines this) and then adds a gender-specific value (11.5 for men, 11 + age x 0.03 for women). Then reversing the very standard BMI calculation gives you what other people supposedly consider to be an ideal weight!

That’s right, a simple algorithm that tells you what other people just like you consider ideal! But because of the simplicity of the script, it gets worse – say you’re a 440 pound, 5’6″ adult male. According to this brilliant physician, the average person of that height, weight, age, and gender thinks that 291 pounds is their ideal weight! That’s right, little ones. If you’re extremely obese, your beliefs of what is and isn’t an ideal weight become so skewed that you think being slightly less obese (but still very obese) is “ideal”. Funnier still, of course, is that as your weight changes, so will your ideal. So once our example 440lb guy gets down to 400, he thinks his ideal is 271lbs. Doesn’t matter if it takes him a month or fifteen years to drop 40lbs, his new ideal is still going to be 271.
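For the skeptical, the arithmetic behind those two numbers can be sketched in a few lines of Python. This is a reconstruction from the description above, not the doctor's actual script: it assumes the adult-male branch multiplies BMI by 0.5 and adds 11.5, and it uses the standard imperial BMI formula (703 × pounds / inches²). The function names are made up for illustration.

```python
# Standard imperial BMI: 703 * pounds / inches^2.
BMI_FACTOR = 703.0


def bmi(weight_lb: float, height_in: float) -> float:
    """Body mass index from pounds and inches."""
    return BMI_FACTOR * weight_lb / height_in ** 2


def peoples_choice_male(weight_lb: float, height_in: float) -> float:
    """Hypothetical reconstruction of the adult-male branch described above."""
    # Assumption: "cuts BMI down by 50%" means multiply by 0.5, then add 11.5.
    ideal_bmi = 0.5 * bmi(weight_lb, height_in) + 11.5
    # Reverse the BMI formula to turn the "ideal" BMI back into pounds.
    return ideal_bmi * height_in ** 2 / BMI_FACTOR


if __name__ == "__main__":
    for weight in (440, 400):
        # 5'6" is 66 inches.
        print(weight, "->", round(peoples_choice_male(weight, 66)))
```

Under those assumptions the whole thing collapses to "half your current weight plus a constant that depends only on height" (about 71 pounds at 5’6″), which is exactly why the "ideal" slides down as the scale does.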

BUT WAIT, THERE’S MORE! When you’re not an adult, the script tells you that your peers consider your desired weight to be something that is based entirely on height and weight! So the average 440 pound, 5’6″, 18-year-old male longs to weigh 131 pounds. The moment he’s older than 18.5 (no idea where the doc pulled these numbers from), he longs to weigh 291. Yup. One day he goes to sleep hoping to be in a healthy weight range, then he wakes up thinking he was wrong, and should in fact weigh more than twice his original goal.

Arrogance, stupidity, bad programming, and then weird assumptions followed by even more stupidity. This is possibly my best Awesome Software Discovery yet!

Be careful of Rubyforge gem!

I just discovered a weird issue. The comments under my name will explain it better than I can here, but suffice it to say, if you use Net::HTTP in ruby, do not install the RubyForge gem!

Read all about “the issue”:http://rubyforge.org/tracker/index.php?func=detail&aid=8907&group_id=1024&atid=4025 on the rubyforge bug tracking page.

The wonderful world of Cross-site scripting (XSS) – OR – why input filtering is bad

I have been dealing with XSS at my so-called “real job” recently, and it has come to my attention that a lot of people in this world are under the mistaken impression that it’s better to do “input filtering” than “output filtering”. As I pretty much came up with these terms myself (they may or may not exist elsewhere; I’m just too lazy to find out), I’ll define them for you:

Input Filtering: Scrubbing XSS-dangerous data out of your input before it gets saved anywhere.

Output Filtering: Scrubbing XSS-dangerous data only upon display.

Now, the most important concept here is that XSS is most dangerous when a user can see immediate results without alerting you, the web designer. So if you have a page that repeats their parameters back at them (say a search page where you put “Your search for $parameter could not be found”), that’s A) independent of input vs. output scrubbing, and B) by far the most dangerous kind of XSS vulnerability. Why? Because it allows a user to post a link to your site that can execute malicious javascript. Bad, bad, bad.
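A minimal sketch of the fix for the echo-the-parameter case, in Python. The function name and message are made up for illustration; `html.escape` is the standard-library escaper, and this assumes the text lands in ordinary HTML body content (attributes, URLs, and inline scripts need their own handling).

```python
import html


def search_message(query: str) -> str:
    """Build the 'not found' message, escaping the user's text on the way out."""
    # html.escape turns &, <, > and quotes into entities, so the browser
    # shows them as text instead of parsing them as markup.
    return "Your search for " + html.escape(query) + " could not be found"


if __name__ == "__main__":
    print(search_message("<script>alert(1)</script>"))
```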

After echoing user parameters is fixed, you have to look at how you display stored data. This is where the type of scrubbing comes into play – do you scrub the data before storing to your database / file system? Or do you only scrub when you’re about to display the data?

I will soon prove that input scrubbing is for pansies who are paranoid and tend to make up pathetic lies about their imaginary 20-year-old girlfriends.

Why input filtering is inefficient

  • It’s bad to store data in a display-specific way (have to unencode when displaying PDF, email/text reports, etc).
  • You have to modify other areas of code than just DB storage, such as searching (search for “<blah>” won’t yield “&lt;blah&gt;”), which may not be immediately obvious.
    • You could just auto-filter all incoming data, but there may be cases where you really can’t or don’t want to. I personally dislike blind filtering like this unless there is no better option.
  • If you have existing data, you have to check it for pre-existing problems. With large data, this can be very slow.
  • If you’re truly paranoid (as I am), you still won’t trust the DB data and will need to find a way to have input filtering work nicely with output filtering. This is a whole lot more work than just doing one or the other.
  • If you use a good MVC system like Rails, you can actually escape all text fields as they’re read from the database if you want. With a carefully written ActiveRecord plugin to Rails, I’d bet you could have all accessors automatically escape their data if it’s textual. And even provide a method for getting at the unsafe data.
    • I still don’t like such blind scrubbing logic, but better to blindly display scrubbed data than to blindly alter data before it hits your database.
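The search problem from the list above is easy to see in a toy example. This is only a sketch: two Python lists stand in for a database table, and `search` is a made-up stand-in for whatever query your app really runs.

```python
import html

raw = "<blah>"

# Input filtering: the escaped form is what ends up in storage.
stored_escaped = [html.escape(raw)]  # holds "&lt;blah&gt;"
# Output filtering: storage keeps the text exactly as the user typed it.
stored_raw = [raw]


def search(rows, term):
    """Naive substring search, standing in for a real database query."""
    return [row for row in rows if term in row]


def display(row):
    """Escape only at the moment the row is written into an HTML page."""
    return html.escape(row)
```

With the escaped store, `search(stored_escaped, "<blah>")` finds nothing unless the search code also knows to escape its term. With the raw store the search just works, and `display` still hands the browser the safe version.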

Why input filtering can be dangerous

  • If you can’t trust your programmers to do proper output filtering, why would you trust them to do proper input filtering?
    • Yes, input filtering is liable to be in fewer locations, particularly if you filter all incoming parameters, but it’s still not a silver bullet, and has a lot of long-term risks when mistakes do happen (read on for details).
  • Compare to output filtering in terms of the bug factor:
    • Bugs will happen. If you truly believe you don’t ever write code with bugs, then by all means ignore this section. I’ll get a good laugh when you tell me about your first big project that went from a two-week estimate to a six-month half-finished-and-then-rewritten-from-scratch project from hell.
    • If you mess up an output filter:
      • You probably have an issue that’s confined to a single area on your site (the area you messed up).
      • You do a quick hotfix, and the site is once again safe.
    • If you mess up an input filter:
      • Every area of the site that contains the data you missed is at risk.
      • You do a quick hotfix to stop anything new from coming in, but existing data is still currently at risk.
      • You find and quickly fix the very obvious offending data in the database.
      • You wait until the site is slow (or you can take it down) and run through all data entered since you suspect the exploit came into existence, fixing it record by record.
  • If future XSS issues arise, you have to retroactively fix your old data again instead of merely fixing your filter.
    • New XSS vulnerabilities won’t arise, you say? Maybe so, but how many times have we computer folk shot ourselves in the foot with presumptions about the future? (We’ll never need more than 640k memory, nobody will still be using this old software when y2k finally hits, etc)
    • Note that XSS attackers have discovered that in some cases, the backtick character (`) will work to do specific JS-oriented attacks. This is not a character that is scrubbed by at least two different html_escape types of functions that I know of. Enjoy retroactive data-fixing? Me too!
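To make the backtick point concrete, here is a sketch using Python's standard `html.escape`, which is one such function that leaves the backtick alone. The two wrapper names are hypothetical, and whether a backtick is actually dangerous depends on the browser and on where the text lands; the point is only how cheap the fix is when the filter lives on the output side.

```python
import html


def escape_v1(text: str) -> str:
    """The output filter as first shipped: stock escaping only."""
    return html.escape(text)


def escape_v2(text: str) -> str:
    """The same filter after a one-line hotfix that also encodes backticks."""
    # &#96; is the numeric character reference for the backtick.
    return html.escape(text).replace("`", "&#96;")


if __name__ == "__main__":
    print(escape_v1("a`b"))  # backtick passes straight through
    print(escape_v2("a`b"))  # backtick is now encoded
```

With output filtering, swapping `escape_v1` for `escape_v2` covers every record ever stored. With input filtering, the same change only covers data saved from that day on.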

Why input filtering can be better (and my incessant arguments to prove that it really can’t)

The most logical argument I was given is that in a large enterprise, control of data output gets pretty tricky. So as far as I’m concerned, large companies are the only place the below issues even have a tiny bit of merit. And even then….

  • In a large enterprise, you know that nobody will inadvertently display unsafe data, because all data is safe.
    • Unless of course somebody writes a program that makes changes to the DB. Less likely than a rogue program that merely displays data, I agree, but still a possibility. In an organization that’s big enough to be at risk of multiple apps reading data that wasn’t built by the “proper” people, I’d say there is a definite risk that apps will be writing to said data as well.
    • At my job, there have been several cases where somebody who wasn’t even a part of IT (a manager and a content designer) modified data directly in SQL, bypassing any hope of safeguards.
    • In a large enterprise, I think it’s even more important than ever that all access to the DB goes through knowledgeable IT staff. Yes, I know this is a pipe dream, but I still think proper procedures can allow output filtering to be the clearly correct option.
  • You can detect problems with input filters more easily, because you have the data that could be dangerous right at your fingertips. If need be, write a program that periodically audits your data to check for unsafe characters. If you messed up an input filter, this program can save you.
    • Good testing does this same thing for output filtering. It’s far harder to write perfect tests for your app’s HTML output than to write a program to audit the DB for unsafe data, but it’s still the right way to do it.
    • Resource usage is wasteful in my opinion, when the resources are being used to prevent data from simply being stored in its original state.
    • If you have a large amount of data that is changing all the time, this solution may simply not be doable. In what situation would you have that much data changing that regularly? Oh… I don’t know… maybe in a big corporate enterprise?

Two more reasons Ruby beats Perl: Chuck Norris and Chuck Norris

I was thinking of all the various Norrisisms on the web (“Chuck Norris’s tears cure cancer. If only he would ever cry…” and such), and realized that Chuck Norris would prefer ruby to perl. Two of my own very clever* quotes are below, and after reading them I think even the most die-hard perl programmer will have no choice but to convert.

* “Clever” by the Webster 1913 definition of “well-shaped; handsome”. They are sexy, even if not funny.


The ruby versions:

Chuck Norris doesn’t strip strings – when they see him, they get so excited they just strip themselves.
If Chuck Norris raises an exception, it takes two programmers, four paramedics, and at least one Chinese Healer to rescue it.

The perl versions:

Chuck Norris doesn’t trim strings – when they see him, they trim themselves out of fear.
When Chuck Norris found out perl couldn’t deal with exceptions nicely, he roundhouse kicked it. Twice. That’s why perl is so ugly.

Note how much less scared and ugly I made ruby sound.

Web security and Mobster World: a tale of woe

I belong to a forum for web game developers and I recently posted about how to keep one’s game from being a target of the most common security problems. The information seems, to me, to be so obvious, but apparently there’s a lot of ignorance about how to secure an application as well as why it matters. So let me relate a tale of exactly why website security is so damned important.

I relate the details of this hackery not only to brag (I am proud to have hacked this game so thoroughly even if it wasn’t much of a challenge), but also to point out how “minor” security issues can destroy a game (or other web application) completely. This is not a “Security on the Web 101” as much as proof that bad security can destroy a good concept.

A long time ago, in a land far far away, there lived a game designer. We’ll call him “Alphonso”. Because that’s his name. Makes things simplest that way, really….

Alphonso had a grand idea for a mobster-oriented PBBG (Persistent Browser-Based Game). His idea was pretty decent overall, and he opened up the short-lived site Mobster World. Don’t bother looking for the site, it died a long time ago. And this story will tell you why.

In this game, Alphonso had built a few key areas that I’m going to cover:

  • Logging in
  • Jobs
  • Buying Items
  • Shooting a player
  • Reading “private” messages
  • Sending messages

BUT FIRST…

The basic information in here is this: do not trust user-supplied information! You can build an HTML page with all kinds of hidden form fields and use cookies and all that stuff, but at the end of the day, if you assume that the user will supply you with a valid URL, valid cookies, and valid form fields, you will get a hacker eventually.

Logging in

This was the most absurd area. You’d put in your name, password, and the CAPTCHA image to prove you weren’t a bot. The security image was a collection of three digits. The images were shown to you on the form and you’d enter the digits in the appropriate field. Fine and dandy up to this point. Problem was, the images were shown separately (CAPTCHAs usually show a single image that contains all the numbers/letters) and this allows an attacker to analyze the filenames of each image to figure out which corresponds to a given number. But worse, the filenames were #.jpg. That is, the image representing “1” was “1.jpg”. So I could look at the form and see the <img> tags to know exactly what I needed to type – very easy for a bot to do, by the way.

When I thought the login couldn’t get any worse, I noticed a “hidden” field. In HTML, a hidden field doesn’t mean the user cannot see it! It merely means the field isn’t immediately visible! This particular hidden field contained the exact security string Alphonso was expecting. So my bot was very quickly able to grab the expected CAPTCHA string and supply it. The CAPTCHA succeeded in stopping only the most inexperienced of hackers, and those ones weren’t likely to know how to script a bot properly anyway.

Also please note that having a CAPTCHA may indeed stop bots (though rumor has it good anti-CAPTCHA technology is more accurate than most users), but it may also annoy regular users, especially those with minor-or-worse visual problems. If you insist on a CAPTCHA, at least make it accessible to all users.

Jobs

There were two kinds of jobs, where a player could perform a job to gain stats and/or money. The “big jobs” were dangerous (rob a bank, steal a car, etc), and could land you in jail if you failed. The “small jobs” weren’t dangerous – things like petty theft, bar fights, etc. They didn’t have the same rewards, and therefore I didn’t bother to try hacking them.

Each job would give you two or three options for how to perform the job, usually a situation where you could choose to be stealthy or direct or whatever (Robbing a bank via the front door or back door, and other totally unimportant crap). But when the page was created, the actions were pre-determined. The html would have hidden form fields saying whether a given button was going to be successful. This meant when I chose to rob a bank, the “front door” option would already be set up via hidden fields to succeed or fail. So one could very easily submit the form with any button they wanted so long as they set the value of that hidden field to “1” instead of “0”. Since big jobs were so risky, success yielded pretty good cash. 100% success meant tons of money and no time wasted waiting for your jailtime to end.

Moral: Don’t set up future actions in hidden fields! It’s stupid and very easy to hack! All Alphonso needed to do was do the random check after the form was submitted and this issue would not have existed.
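A sketch of what "do the random check after the form is submitted" might look like, in Python. The job names, success chances, and the `resolve_job` function are all invented for illustration; nothing here comes from Mobster World's real code.

```python
import random

# Hypothetical success chances, stored on the server only.
JOB_SUCCESS_CHANCE = {"rob_bank": 0.25, "steal_car": 0.40}


def resolve_job(job_id: str, form: dict, rng=random) -> bool:
    """Decide success AFTER the form arrives, ignoring any outcome it claims."""
    if job_id not in JOB_SUCCESS_CHANCE:
        raise ValueError("unknown job")
    # Deliberately never read something like form["success"]: the only thing
    # the client gets to tell us is which job (and option) was chosen.
    return rng.random() < JOB_SUCCESS_CHANCE[job_id]


if __name__ == "__main__":
    # A tampered form claiming guaranteed success changes nothing.
    print(resolve_job("rob_bank", {"option": "front_door", "success": "1"}))
```

The `rng` parameter is only there so the roll can be swapped out in tests. The outcome simply does not exist until the server rolls for it, so there is nothing in the page to edit.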

Buying Items

But why bother getting a bunch of cash? What a waste of time! Because the game was so poorly scripted so far, I decided to look at buying items, and sure enough I was confronted with awesome hidden fields. The hidden fields would tell the game that a certain button would buy item X at price Y. Hack the form via a bot, and you could buy any item for $0. This meant the most powerful gun for $0. All the ammo you wanted for $0. Bodyguards for $0 apiece. Bulletproof vests? $0. Medical kits: special limited time offer, two for $0!

So you buy great items for free and you realize you don’t need money.

This is a clear case of relying too heavily on the form to determine what’s going to happen. Instead of having the form store the cost of things, it should be stored somewhere on the server – database, bdb file, whatever. User buys an item, sends that item’s ID to the server, and the server pulls the price from the only source it can trust: itself.
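Roughly, in Python, with a dict standing in for the database. The item names, prices, and the `buy` function are made up for this sketch.

```python
# Hypothetical price list that lives on the server (a database table in real life).
PRICES = {"pistol": 500, "medkit": 75}


def buy(balance: int, form: dict) -> int:
    """Return the player's new balance after buying the item named in the form."""
    item_id = form.get("item_id")
    if item_id not in PRICES:
        raise ValueError("unknown item")
    # Any "price" field in the form is ignored: the server's own table is the
    # only source it can trust.
    price = PRICES[item_id]
    if balance < price:
        raise ValueError("not enough cash")
    return balance - price


if __name__ == "__main__":
    # The bot's $0 price tag is politely ignored.
    print(buy(1000, {"item_id": "pistol", "price": "0"}))
```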

Shooting a player

Mobster World was written to stress uneasy alliances. People start shooting each other and the game degrades into total chaos if some of the mob families (essentially in-game alliances) don’t force order by disciplining their members. Because of this, shooting a player was usually not a good idea without a good-sized family behind you. Unless, of course, you could cheat.

The “shoot a player” area was also plagued with hidden fields. By setting the %-to-hit field to 100, all shots would hit. The best gun only hit 50% of the time, meaning you could fire off a shot and do no damage, but still have all sorts of consequences. And if your target had bodyguards or armor (both were essentially just ways to increase bullet-taking ability), your shot could be totally wasted. So again, shooting was usually limited to a family trying to take down another family. But with a 100% chance to hit, free healing (bodyguards, body armor, medical kits), and free ammo, a cheater could do tremendous damage relatively safely.

The game allowed a shot every 10 minutes, so even a cheater had his limits, but with a single bot I was able to knock an unsuspecting don (leader of an entire mob family) down to 6 bodyguards (from 12) in a matter of about two hours. A smarter cheater could have run multiple bots and destroyed an entire in-game alliance in an hour or less.

This is exactly the same as above – there was no need for the form to ever know the chance of a successful shot. Calculate that on the server and only on the server. Yeah, you might want to display it to the user, but don’t let the user be the one who tells you anything other than the weapon they’re using (and of course validate that they own the given weapon and have ammo for it) and the player they’re trying to shoot.

Reading “private” messages

This is where we move from forms to URLs. Reading a message would require a hit to a page like “/messages.php?id=xxx”, where xxx is the id of the email. Well, because Alphonso didn’t think users could modify the URL themselves, you could put in any id you wanted, and then read anybody’s email. Using this passive cheat, you could see what your enemies planned. Following up with a similar method on the message deletion URL, you could see your enemies’ plans but keep them from letting each other know! I was able to discover that my “enemies” thought I was an ex-player they had pissed off a while before I started playing. I catered to this fear and made up all kinds of interesting stories about revenge and such.

Simple fix here – if a user requests access to anything private, make sure they are authorized to see/edit that item!
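As a sketch in Python (the message store, the `Forbidden` exception, and `read_message` are all hypothetical names; the current user is assumed to come from the server-side session, never from the URL):

```python
# Hypothetical message store keyed by id; a database table in a real game.
MESSAGES = {
    10: {"owner": "alice", "body": "Meet at the docks."},
    11: {"owner": "bob", "body": "Hit the don at midnight."},
}


class Forbidden(Exception):
    """Raised when a user asks for something that isn't theirs."""


def read_message(current_user: str, message_id: int) -> str:
    """Return the message body only if it belongs to the logged-in user."""
    message = MESSAGES.get(message_id)
    # Same error for "missing" and "not yours", so ids can't be probed.
    if message is None or message["owner"] != current_user:
        raise Forbidden("no such message")
    return message["body"]
```

The same ownership check belongs on the delete URL, and on anything else that takes an id from the user.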

Sending messages

Once I got bored of looking for “boring” exploits, I decided to check out XSS possibilities. I’m not a security expert, so I only knew how to do something similar to what the wikipedia article calls a “type-2 attack”. And I wasn’t interested in stealing these people’s accounts or anything, I just wanted to mess with their game.

When sending a message, I found that I could embed any HTML I wanted. So with very little effort, I made the private message receipt form appear to have a button on it that looked like the usual “Delete” button, and made the rest of the real page end up hidden so that the only button on the form ended up being mine. When my delete button was clicked, it actually took the user to the “Shoot a player” page, with one of my enemies as the target.

After some testing with a friend, I discovered that I could make a user run literally any action in the game: failing a big job (giving them jailtime), shooting their own don, going into hiding (forcing them to log out for 8 hours of real time, unable to perform any actions), etc. Had I been evil enough I could have logged out all the players who disliked me except for one, and systematically killed them one at a time.

With a little more tweaking, I found that I could use AJAX to actually make the person perform these actions without even clicking a button. The incoming message could be as simple as “You suck!”, and by merely viewing it, the player committed to the action(s) of my choosing.

It is important to note that many designers think they can get around this issue by stripping out <script> tags – this is not the case. I can embed malicious code in something as simple as a <b> tag just by handling a JavaScript event, such as onMouseOver. Simple solution: do not allow HTML in user-supplied data. For my “big” game (Bloodsport Colosseum), I allow formatting via Textile markup. There are many similar solutions for all kinds of scenarios, and they are, in general, far safer than trying to allow HTML in any form, even if you think you’re being careful.
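Here is a small Python sketch of why stripping <script> tags falls short. The `strip_script_tags` function is a deliberately naive filter written for this example (not anybody's real code), and the payload is a harmless stand-in.

```python
import html
import re

SCRIPT_TAG = re.compile(r"<script\b.*?>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_script_tags(text: str) -> str:
    """Naive filter: removes <script>...</script> blocks and nothing else."""
    return SCRIPT_TAG.sub("", text)


# An event handler on an innocent-looking tag; no <script> anywhere.
payload = '<b onmouseover="alert(1)">You suck!</b>'

if __name__ == "__main__":
    print(strip_script_tags(payload))  # comes out unchanged, handler intact
    print(html.escape(payload))        # every bit of markup is neutralized
```

The naive filter has nothing to remove, so the handler survives. Escaping everything (and offering something like Textile for formatting) leaves no tag for an event handler to hang on.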

Email to Alphonso

I wrote Alphonso an in-game email asking if he was aware of cheating issues. I figured he’d deny it like so many web app designers who don’t know security. He surprised me:

yes I am aware of it and thank you very much for assisting me in this game: I have other areas that I am repairing first and I will be getting there soon. Please continue to inform me of areas that you find.

At this point I felt pretty bad and told him the truth – I’d been exploiting the game from day 1, and I pointed out all the areas I thought he needed to look into.

Read my response here if you’re curious how much of a dick I can be when I’ve hacked you black and blue.

Final Word

Do not assume users won’t edit forms and submit bogus data. Do not let a user alter or view anything he doesn’t own (if he says he wants to view message id 10, make sure he is authorized to do so!). Cookies, URLs, and form fields are extremely easy to edit!

There is also the unmentioned SQL Injection attack. I can’t help much with these as I know very little about the attack, but this wikipedia article will give you a great deal of help. The most important thing here is that most database libraries have built-in features for keeping things at least moderately safe (bind variables, for instance, such as “SELECT * FROM FOO WHERE ID = ?”, where the library will make sure the variable that’s substituted for the ‘?’ is safe). USE THEM!
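For a taste of what bind variables look like in practice, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The table and data are invented for the example; other database libraries have their own placeholder syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO foo (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)


def find_by_id(connection, user_supplied_id):
    """Look up a row, passing the user's value as a bound parameter."""
    # The '?' is filled in by the library as data, never spliced into the SQL
    # text, so the value cannot change the shape of the query.
    cursor = connection.execute(
        "SELECT * FROM foo WHERE id = ?", (user_supplied_id,)
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(find_by_id(conn, 1))
    # Treated as one odd value to compare against, not as extra SQL.
    print(find_by_id(conn, "1 OR 1=1"))
```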

Web security is much more important than most programmers seem to realize. If you want a game or other app to get popular and last a long time, do not skimp on security. Or you, too, could end up with a good idea that does as well as Mobster World.

Arch Reality gives spammers the edge

On the heels of my amazing discovery of the “PC Mesh Hide Files and Folders”:http://blog.nerdbucket.com/articles/2007/01/15/revolutionary-new-software software, I make yet another Awesome Software Discovery: “jcap”:http://www.archreality.com/jcap/!

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) technology is always trying to keep ahead of spam / bot technology. This is just another techno-arms race that will probably never end. But this company, “Arch Reality”:http://www.archreality.com, has devised a “clever” image-based CAPTCHA that is 100% javascript.

This Awesome Software Discovery is “special” because it may be the only CAPTCHA system that is run in the client’s browser exclusively. In most cases, you have to have server scripting (PHP, Ruby, Perl, ASP, etc) to process CAPTCHA information, which is a bit of a pain. You have to maintain state information to know that user X was shown picture Y and such. But with this system, all you need is a client running javascript! How awesome is that? Super easy to set up, even for a web novice.

Spammers, beware! As long as we have people like Arch Reality working on our side, your days are numbered!

…or are they?

Well, this is one of those theoretically sound ideas. Much like Communism and pyramid schemes. Any web programmer will notice very quickly that this is total BS. How do spammers operate? Do they single-handedly man a thousand computers simultaneously, working feverishly to send out their spam? NO. They automate everything they can. And let me tell you, when you automate something like a web-based form submission, the last thing you want to bother with is figuring out some javascript! So what do the spammers do? They fracking ignore it! Which leads us to a CAPTCHA that actually verifies that people who have a javascript-enabled browser are, in fact, real people. WOW.

This one blows my mind. PC Mesh has a pretty crappy concept, but these folks really take the cake! Arch Reality’s only saving grace is the disclaimer that came over a year after jcap’s release:

***NOTICE (01.10.2006): The developer assumes no liability with this resource and it is provided as is. This script is referred to as a “security development” because it can provide some minimal level of security. While it does seem to be an effective elementary form of security the developer does not claim that it is an impenetrable solution and thus the developer does not recommend implementing it for the protection of highly sensitive data.

And to me, even that disclaimer is full of crap. Their product will provide literally no security. If you want proof, hit their “demo page”:http://www.archreality.com/jcap/captcha.html, then disable javascript, then type ANYTHING YOU WANT, and click Submit.

Just like a JavaScript-ignoring bot, you too can break through this so-called security development with ease! I’d like to know where they got the idea that this garbage would be “effective” at anything other than pissing off clients! Almost makes me think Arch Reality is working for the spammers….

I’ve done a small amount of digging, and sadly there are people out there who use this product, and think it provides some measure of security. This kind of ignorance is so easily avoided if the people who write software would spend the half hour to research the actual problem they’re trying to solve.

If I can reach just one person, and that person keeps from hiring these horribly untalented hacks, I’ll feel this blog post was more than worthwhile.