Arch Reality gives spammers the edge

On the heels of my amazing discovery of the “PC Mesh Hide Files and Folders”:http://blog.nerdbucket.com/articles/2007/01/15/revolutionary-new-software software, I’ve made yet another Awesome Software Discovery: “jcap”:http://www.archreality.com/jcap/!

CAPTCHA(Completely Automated Public Turing test to tell Computers and Humans Apart) technology is always trying to keep ahead of spam / bot technology. This is just another techno-arms race that will probably never end. But this company, “Arch Reality”:http://www.archreality.com, has devised a “clever” image-based CAPTCHA that is 100% javascript.

This Awesome Software Discovery is “special” because it may be the only CAPTCHA system that runs exclusively in the client’s browser. Normally you need server-side scripting (PHP, Ruby, Perl, ASP, etc) to process CAPTCHA information, which is a bit of a pain – you have to maintain state so you know that user X was shown picture Y and so on. But with this system, all you need is a client running javascript! How awesome is that? Super easy to set up, even for a web novice.

Spammers, beware! As long as we have people like Arch Reality working on our side, your days are numbered!

…or are they?

Well, this is one of those theoretically sound ideas. Much like Communism and pyramid schemes. Any web programmer will notice very quickly that this is total BS. How do spammers operate? Do they single-handedly man a thousand computers simultaneously, working feverishly to send out their spam? NO. They automate everything they can. And let me tell you, when you automate something like a web-based form submission, the last thing you want to bother with is figuring out some javascript! So what do the spammers do? They fracking ignore it! Which leads us to a CAPTCHA that actually verifies that people who have a javascript-enabled browser are, in fact, real people. WOW.
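To make the failure concrete, here’s a minimal sketch (in Ruby, with made-up field names and URL) of what a form-spamming bot actually does – it just POSTs the form data directly and never executes a single line of the page’s javascript:

    require 'net/http'
    require 'uri'

    # Hypothetical target and fields; the point is that the jcap "check"
    # lives entirely in javascript the bot never runs.
    uri  = URI.parse('http://example.com/comments')
    spam = { 'name' => 'totally_a_human', 'comment' => 'Buy my pills!' }

    Net::HTTP.post_form(uri, spam)   # form submitted, CAPTCHA never even seen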

This one blows my mind. PC Mesh has a pretty crappy concept, but these folks really take the cake! Arch Reality’s only saving grace is the disclaimer that came over a year after jcap’s release:

***NOTICE (01.10.2006): The developer assumes no liability with this resource and it is provided as is. This script is referred to as a “security development” because it can provide some minimal level of security. While it does seem to be an effective elementary form of security the developer does not claim that it is an impenetrable solution and thus the developer does not recommend implementing it for the protection of highly sensitive data.

And to me, even that disclaimer is full of crap. Their product will provide literally no security. If you want proof, hit their “demo page”:http://www.archreality.com/jcap/captcha.html, then disable javascript, then type ANYTHING YOU WANT, and click Submit.

Just like a JavaScript-ignoring bot, you too can break through this so-called security development with ease! I’d like to know where they got the idea that this garbage would be “effective” at anything other than pissing off clients! Almost makes me think Arch Reality is working for the spammers….

I’ve done a small amount of digging, and sadly there are people out there who use this product and think it provides some measure of security. This kind of ignorance would be so easily avoided if the people who write software would just spend the half hour it takes to research the actual problem they’re trying to solve.

If I can reach just one person, and that person keeps from hiring these horribly untalented hacks, I’ll feel this blog post was more than worthwhile.

Revolutionary new software!

There is a company out on the fringes of technology. Making software that most of us only dream of being able to write. Scoffing at the current obsolete methodologies and practices, these brave new developers have recently pioneered an awesome new era in software development.

This software company is clearly just another one of your typical geniuses not recognized in their time, as the very unscrupulous “CNet / Downloads.com”:http://www.download.com/PC-Mesh/3260-20_4-6263078.html reviews have been far too harsh on this enterprising company.

“PC Mesh”:http://www.pcmesh.com is, of course, the company to which I refer. It is with the most sincere amazement I discovered this little gem of a company today. Or more specifically, the discovery was their “Hide Files and Folders”:http://www.pcmesh.com/hide-files-folders.htm software.

How can I make these claims about this company? Well, for starters, their web site tells us all we need to know: “PCMesh Hide Files and Folders is a revolutionary new software product….” But I’m not an idiot – I know to do my homework and not take everything at face value, even a statement as indisputable as that one. So how do we know these guys are the real deal? I’ll go through their feature list, item by item, and explain just how brilliant they are. Some of what you’re about to read may be difficult to accept, but keep in mind that true brilliance often challenges what we’ve been taught to believe. Now, on with the -propaganda bashing- product highlights:

* Invisible from the operating system, invisible from virus attack and invisible from spying eyes that won’t even know the cloaked files or directories are present.
** Wow. Just… wow. Okay, invisible files are protected from virus attack and spies. Humans won’t know to look for cloaking software, of course, because this concept is totally new and unique, and even now that it’s out, unauthorized people would never dream of doing research to learn of this exciting new software. As for viruses, yeah, they can’t infect what they can’t find. Too bad most people want to hide data files, not fracking executables. And too bad that when you make those files visible, a virus will then see the OS reading the files and infect them. And too bad a smart virus could easily be written to infect this POS program in such a way as to destroy the data as you try to make it visible. But other than that, yeah, this software is amazingly effective.
* Encrypted files are still visible on the hard drive. This makes them vulnerable to attack from anyone who is interested enough in the content of the files to spend time trying to decipher them. And with more and more hackers intent on defeating modern encryption algorithms, a need exists for a better type of protection.
** In fact, this may be the only statement that’s partially true. Granted, most encryption today is nearly unbreakable, especially for home computers that don’t house highly-sensitive data, but otherwise this isn’t a “bad” thing to say. Better encryption standards are always a good thing, and questioning the strength of today’s encryption is certainly a worthy goal. Of course, I’m not sure where they got the idea that “more and more hackers” are intent on defeating modern encryption algorithms. Haven’t droves of hackers (and security specialists and general security enthusiasts) always been interested in defeating encryption algorithms? Without those people, we’d all still be using Caesar shift ciphers!
* In addition to rapidly becoming obsolete, current encryption programs are slow.
** Rapidly becoming obsolete? Gosh, even the encryption algorithms that are considered broken are still pretty strong for the average computer user’s needs. And anybody with data so sensitive that it needs unbreakable encryption can probably deal with the fact that they need to update encryption methods every few years.
* It takes as long as 10 minutes per 200 MB to encrypt or decrypt a file, while PCMesh Hide Files and Folders executes instantly regardless of the file size or number of files/folders being protected. Just one click is all it takes to render any file or directory invisible.
** Okay, I don’t know much about encryption speeds. I have to be honest, this could be completely true for a really awesome encryption algorithm. So let’s say they have two semi-true statements. Let’s note here that this software “executes instantly”, which tells me it flags files in some way (prefixing them with $sys$, perhaps?) and doesn’t do any kind of encryption. (If you want a quick reality check on the 10-minutes-per-200-MB figure, see the benchmark sketch just after this list.)
* Data that’s protected by PCMesh Hide Files and Folders is not visible, so it can’t be attacked. In fact, the software itself does not even run continually, so it does not announce its presence to snoopers and hackers. The only time the software is active is when it’s being used to hide or reveal protected files or directories.
** This statement (or series of statements) is so ridiculous I am amazed. “Security through obscurity”:http://en.wikipedia.org/wiki/Security_through_obscurity is just plain stupid. If an attacker simply finds out about this garbage software, they’ll know to attack the “invisible” files. And since the files are still on the system, there is no way to truly make them hidden – if this software can get to them, so can an attacker. Worse, the authors actually believe that an attacker needs something continually running in order to realize what’s going on. I imagine PCMesh is populated by people who’ve never even read a single article on security, encryption, or hacking. If hackers had no access to the internet and didn’t know how to research new “protection” schemes, they really wouldn’t ever be a threat.
* Better Than Encryption
** Though this is higher on the page than the last few items (it’s their header, in fact), I thought I’d mention it here just after the security bit, to point out how absurd the claim is. Obscurity is /never/ better than encryption for sensitive files. It’s only better than encryption for files you don’t need strong protection on, and for situations where you just want to keep the honest people honest. Nobody can currently break AES, but just spending some time hacking through this product’s disassembly (after unpacking their undoubtedly “proprietary” packed executable) would probably reveal how to find the “hidden” files. Though I suspect it’ll turn out to be similar to the “Sony rootkit”:http://en.wikipedia.org/wiki/2005_Sony_BMG_CD_copy_protection_scandal BS from 2005.
* Hide files or folders of any size instantly. There is no processing time.
** Most of the bullet point “highlights” are just repetitive crap from the “Better Than Encryption” section of this website. But this one struck me as funny. It’s really no big deal; it’s just that with computers (and pretty much anything), there’s no such thing as “no processing time”. I dunno, I’m picky, shut up.
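For what it’s worth, that 10-minutes-per-200-MB claim is easy to sanity-check yourself. Here’s a minimal Ruby sketch (mine, obviously nothing to do with PC Mesh’s code) that times AES-256 over 200 MB of junk data using the standard OpenSSL bindings – on any remotely modern machine this finishes in seconds, not minutes:

    require 'openssl'
    require 'benchmark'

    data = 'x' * (200 * 1024 * 1024)   # 200 MB of throwaway data

    cipher = OpenSSL::Cipher.new('aes-256-cbc')
    cipher.encrypt
    cipher.key = OpenSSL::Random.random_bytes(32)
    cipher.iv  = OpenSSL::Random.random_bytes(16)

    # One full encryption pass over the whole buffer
    puts Benchmark.measure { cipher.update(data) << cipher.final }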

So clearly this product kicks ASS. Go out and buy it today.

Cheap security for web-based games!

h3. This is new to me, but I imagine true security fiends have already thought about this issue plenty, so I apologize if I’m repeating “news” that’s already been mentioned.

I came across an interesting security mechanism in my quest to automate some “Kingdom of Loathing”:http://www.kingdomofloathing.com stuff in Ruby the other day. Their login system isn’t hosted on a “secure” server, which means that (under normal circumstances) anybody can snoop the network traffic, get your password, and end up stealing your account.

For a web-based game, a stolen account isn’t (usually) a big enough reward for the time spent sniffing through network traffic and hacking the account, so most such games don’t have any security on their login forms (including my web-based games, though I may change this when I’m hugely successful). For online banking, obviously the rewards are much higher, so those sites need to be secure.

But what does it mean to be a secure web site? At the time of this writing, my take is that it costs a good deal of money to have essentially public-domain technology applied to a web site just to get the stupid little “this site is secure” icon that makes people willing to put credit card numbers, social security numbers, and other private data into a web form.

I’m not saying that “VeriSign”:http://www.verisign.com is just in the business to rip people off – they provide a lot of services other than just encryption. My problem is merely that the technology of encrypting sensitive data, and assuring a user that a site is safe, shouldn’t cost an arm and a leg, especially for low-traffic sites such as a niche web game, where annual profit may be as low as one or two thousand dollars.

The technology used for website security is pretty basic, really. I mean, it’s powerful stuff and considered unbreakable, but the same security is available in libraries for dozens of languages for free – it’s just strange to me that these algorithms cost so much money ($500+ per year) to implement on a web server. Which brings us back to “Kingdom of Loathing”:http://www.kingdomofloathing.com….

I was writing a little script to automate login, trading, and some other minor things in this game. When I reached the login I found a very interesting twist – when you submit the form, your password is not sent to their servers. Instead, your password is processed and altered quite a bit:

* The server sends your browser a session-specific “challenge” code, which is essentially a one-time code for encryption (yes, I tried using the same code multiple times with no luck, which makes me think it’s probably stored on the server with your session data).
* Your browser uses javascript to first get an MD5 of your password, then an MD5 of that value concatenated with the challenge code.
* This final MD5 is what’s sent to the servers.
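For the curious, the whole hash chain is a couple of lines of Ruby. I’m assuming the concatenation is a straight append – I haven’t dug into exactly how KoL glues the two pieces together – and the values below are made up:

    require 'digest/md5'

    password  = 'hunter2'        # what the user typed
    challenge = 'a1b2c3d4e5f6'   # one-time code handed out by the server

    response = Digest::MD5.hexdigest(Digest::MD5.hexdigest(password) + challenge)
    # `response` is what actually crosses the wire -- never the password itself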

This system doesn’t provide what VeriSign provides – it only protects data going from the browser to the server, not an entire area of the site. Incoming data is totally unprotected, which can be problematic in some cases (although I’d bet clever javascript could fix that to some extent by using true two-way encryption). And let’s face it, simply using a couple of MD5 hashes and one-time keys isn’t enough to guarantee security. I’d imagine that if the reward were high enough, somebody could capture the traffic for a while, analyze it, and eventually work out either a password or a security hole of some kind.

But it isn’t so much about having 100% security in this case – it’s about having “good enough” security for the situation. If the rewards aren’t worth the time, even a simple “Caesar cipher”:http://en.wikipedia.org/wiki/Caesar_cipher is good enough security. And if we could implement a decent key exchange for a more powerful encryption system (“Diffie-Hellman”:http://en.wikipedia.org/wiki/Diffie-Hellman plus “AES / Rijndael”:http://en.wikipedia.org/wiki/Advanced_Encryption_Standard, for instance), we could ensure that small pieces of a site were 100% secure “for free”. From my minimal research, it does appear that many encryption algorithms do already exist in javascript.
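And just to underline how low the “good enough” bar can sit, the entire Caesar cipher fits in a few lines of any language – a toy illustration in Ruby:

    # Toy code only: shift letters by a fixed amount (and back with -shift).
    def caesar(text, shift)
      text.gsub(/[a-z]/i) do |ch|
        base = ch.ord < 97 ? 65 : 97
        ((ch.ord - base + shift) % 26 + base).chr
      end
    end

    caesar('attack at dawn', 3)              # => "dwwdfn dw gdzq"
    caesar(caesar('attack at dawn', 3), -3)  # round-trips back to the original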

So for all the web game developers out there, or anybody running another low-profile site – if the only security you truly need is a few form passwords here and there, the MD5 solution KoL uses is probably not a bad fit. And a fully javascript solution shouldn’t be discounted either, especially if you need high (but not necessarily perfect) security and don’t have the extra money to pay for it.

Perl vs. Ruby

Just a quick random thought about getting rid of whitespace from a string.

In Perl, you trim.

In Ruby, you strip.

Now which language is really sexier? You be the judge.

Rants about Ruby on Rails

h2. Disclaimers

  1. Ruby on Rails is as good as the hype – it really is. Building a maintainable web application just doesn’t get much better (yet). These are just some caveats I’ve found while using ROR(Ruby on Rails), and some will probably make me sound a lot more pissy than I intend, so keep in mind I would still rather use ROR(Ruby on Rails) than PHP, Perl, Java, C, C#, etc. Also keep in mind that some of the problems I’m mentioning here will show up in other frameworks too: ORM systems try to keep us from having to write SQL, and that is guaranteed to cause some issues when performance is the #1 concern.

  2. I’m also the kind of person who seems to get pissed off about almost anything. And the thing is, I’m not actually pissed off – I’m annoyed. I just happen to be vocal about my annoyances…

  3. One final note… I’m not claiming to be a Rails expert! These issues are ones that I find problematic because of people like me who learn as we go. A true Rails guru probably will have better solutions to a lot of these issues. But I’m trying to help the Rails people who haven’t had the time to learn every little nuance in Rails, because I think there are a lot more of those than there are gurus…

h2. Background

Here’s the deal – at my job, I was assigned to a big project (monstrously big when you consider I was the only developer for 90% of the project), and I chose to use ROR(Ruby on Rails) for it. I don’t know if I’m at liberty to discuss the exact details, so let’s just say I’m working on a community-oriented site, revamping their product reviews. Currently, the site has information about something like 70,000 products, and each product has anywhere from 1 to 1000 user-submitted reviews. My revamp is a temporary fix until we get a Java portlet solution working (that one was definitely not my decision).

h2. Minor annoyances

My main gripes relate to performance, so I’ll cover them in detail below. First, a list of my minor gripes… keep in mind, it’s often the “little” things that add up.

  1. Rails doesn’t currently support portlets in any way (I can’t find anything about ’em on Google, at least). If it did, maybe we could convince the genius management types that Rails is a suitable solution. Hell, I don’t even think portlets are half as interesting to site visitors as they are to site builders. But since they’re a requirement in the long term, they’re what we’ve got.

  2. Rails documentation is pretty sparse for some of the internals. Look at the API site (http://api.rubyonrails.com) and try to find information about RJS or specifics on using routing / named routes. It may be there, but I sure can’t find it! And much of the documentation simply doesn’t go deep enough. Look at ActiveRecord::Migration and ActiveRecord::Schema – they tend to cover only the simplest cases and still have undocumented functions that you have to dig into the source to figure out. No big surprise, since Rails is still new, but it’s an annoyance all the same.

  3. The testing is weird. I haven’t tried integration testing yet, and it might fix some of my issues, but unit tests and functional tests are a bit odd at times. I have one test that passes when run by itself, 100 times out of 100, yet fails when run with rake. I still haven’t figured that out. And functional tests… ugh. First off, the name seems wrong – they’re really just more view/content-oriented unit tests. Second, they act strangely if you call multiple actions in one test. Even with the recent inclusion of integration testing, it’s often more convenient to write a “quick” functional test that hits the same action in various ways, but if you do that, there are cases where the test won’t work. I’ve had to split up some tests just to get them to pass, because the testing framework was somehow breaking them – like data refusing to change after calling one action… and only sometimes, not always. And with such sparse documentation on how testing works (most of it is third-party, and generally very shallow), it’s a pain in the ass!

  4. Some things are just documented incorrectly, and it seems to be done on purpose for “simplicity”. Associations, for instance. If I ask for product.reviews, the docs say I get an array back. Well, technically it is an Array, but the specific instance (not the Array class) has been mucked with in order to allow for the association’s special functions. Let’s say I want placeholders – for products with fewer than 10 reviews, I want the array padded to 10 items, where nil items tell my view to show the text, “Your review could be here!” (Stupid example, yes. There are other ways to do this, etc, etc. But the point isn’t this specific example, so shut up.) The moment I call @reviews.push(nil) on that “Array” I got back, I hit an error that nil isn’t a review, so I can’t add it to my array. The Array object that comes back is modified to allow the cool association features, but this is not clearly documented! The solution? @reviews = (complex_association_logic).dup. By adding dup, you get a new Array that hasn’t been mucked with. Horrible for performance on large Arrays, I’d bet, but if you know the Array is going to be small, you can deal with it and get a true Array back. (There’s a tiny sketch of this gotcha just after this list.)

  5. Inconsistent API gets really old. Parameters are a great example. When Rails parses parameters, it builds nested hashes for parameters sent in a certain way (‘object[attribute]’ as an input field’s name is the most common reason to see this). And in the testing framework, you can send parameters to an action as nested hashes, such as:

    
    get :action, :param1 => {:nested1 => 1, :nested2 => 2}
    
    You cannot do this with url_for style parameters! It will not internally convert from a hash to the expected “url-friendly” values. I don’t see the usefulness in allowing the hashed params in two places, and then having them break in another! I’m hoping it’s just an oversight, but all the same, it’s difficult.
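And here’s that tiny sketch of the association-proxy gotcha from item 4 (model names are from my placeholder example, and the behavior is exactly what I described above):

    @reviews = product.reviews        # docs say Array; really an association proxy
    @reviews.push(nil)                # raises -- nil isn't a Review

    @reviews = product.reviews.dup    # plain Array copy of the same records
    @reviews.push(nil) while @reviews.size < 10   # padding works fine now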

h2. Major annoyances

Quick sidenote about how badly this site was written: when a user submits a review for a product, the review is stored as a flat text file instead of in the database. When the admin looks at reviews for approval (or rejection), he’s looking at the first line of the flat text file. When he clicks “Next”, the ENTIRE FILE is rewritten to move the next line to the top of the file. And this is one of the more logical operations on the site. Other examples include the fact that there’s currently a page that shows all reviews for a given category, and the biggest category, housing over 15,000 products, ends up as a 2.3 MB page with so much stuff on it that it’s totally worthless to anybody. Don’t forget the minor detail that all pages are static HTML, generated every night by a review generation script that takes up to two hours to run on some extremely fast systems (yes, there really is a lot of data). Dynamic content is so much sweeter after seeing a system like this.

Moving right along… so I’ve got this crappy legacy system to work with. I’m only trying to make the system a little easier to use, and more maintainable in the short term. This means I’m not getting any content folks to rewrite pages, I’m not totally redoing layout, etc. It’s all going to look very similar to the end user, but behave much differently on the back end. The current system is all perl scripts, and is easily the worst code I’ve ever seen. And given that I’ve seen my own code from years ago, that’s saying quite a lot.

So while I’m rewriting a lot of the backend here, I have to try my hardest to keep the look and feel moderately similar to what’s there right now. Tricky, tricky. And when moving from gobs of static files to dynamically-generated output, things get really painful….

Now to unveil my #1 gripe: Rails is very stupid when it builds SQL. Watch the dev log fly by when you do things that are oh-so-simple in code. (Lame gripe? Sure… this is actually a gripe that would exist in most ORM frameworks, even. But with this project it caused me some major headaches, so instead of having a more legitimate major gripe, I have this one. Deal with it.)

The first example is pure Rails newbie (me, around the beginning of 2006). I have a list of manufacturers, and want to show each one, its products, and each product’s reviews. Simple:


  <% @manufacturers.each do |mfg| %>
    <%= mfg.name %>
    <ul>
      <% mfg.products.each do |prod| %>
        <li>
          <%= prod.name %>
          <ul>
            <% prod.reviews.each do |review| %>
              <% review.ratings.each do |rating| %>
                <li><%= rating.name %>: <%= rating.text %></li>
              <% end %>
            <% end %>
          </ul>
        </li>
      <% end %>
    </ul>
  <% end %>

When there are 1000 manufacturers for a given category, each with an average of 15 products, each with an average of 10 reviews, each with ~5 rating items, you can see very quickly that the number of hits to the database is insane – one query for the manufacturer list, one per manufacturer for its products, one per product for its reviews, and one per review for its ratings, or roughly 1 + 1,000 + 15,000 + 150,000 queries for a single page. And this is a relatively simple example of what I’ve had to do.

I learned quickly about eager loading. When I load my manufacturers, I pull in all products at the same time. This means a longer hit to the DB, but a lot less time parsing. But this still means a lot of hits for reviews and ratings! What do we do?
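For the record, the eager-loading call looks something like this (Rails 1.x-era find options; the model and column names are from my made-up example, not the real site):

    @manufacturers = Manufacturer.find(:all,
      :conditions => ['category_id = ?', category.id],
      :include    => :products)   # one big query instead of one per manufacturer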

Well, some of the smarter folks out there might suggest using a :through association to eagerly load the reviews. Then we’re only grabbing each review’s ratings. Great idea, right? Surprisingly, no. While it may be a better situation than the prior one, it’s still far too slow for a page hit. The SQL for eagerly loading products will run on our dev machines in about 1/10th of a second. Throw in reviews and it’s up to TWELVE SECONDS! (Yes, I’ve already checked indexes, thanks for the tip, smartass.) What’s more, the query gives me over 200,000 rows, while the manufacturer + product query only gives me about 30,000. That’s a whole lot more data to deal with. In fact, it’s exactly the number of rows that are in the review table to begin with. (This is obvious to somebody familiar with SQL….) Either way, the page load just for that data being processed (read from the database and then turned into ActiveRecord objects) can end up being in excess of a minute.
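The setup I’m describing is roughly this – a :through association on the model plus a cascaded eager load (again, made-up names, Rails 1.x syntax):

    class Manufacturer < ActiveRecord::Base
      has_many :products
      has_many :reviews, :through => :products
    end

    # One enormous joined query instead of thousands of little ones... in theory.
    @manufacturers = Manufacturer.find(:all,
      :include => { :products => :reviews })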

Now the naysayers will pipe up with “use caching!”

So on to my next major gripe: caching is not very useful for a site with huge amounts of data to present. Oh sure, only one person has to be hit with that huge penalty, then everybody else is good. No big deal. Except when you need your data refreshed every hour, and 10 places on the site are dangerously slow (I consider >30 seconds to be in this realm). All of a sudden you have 10 pissed off customers an hour, 240 a day, and eventually everybody knows something’s not right.

Background caching is something I would love to see! If I were a better programmer, I’d submit a patch… but boy, wouldn’t it be great to be able to say, “start rebuilding this cache, but until it’s done, show the old cached page”? Then nobody is ever hit with that initial page load! Insanely large sites could present a snappy interface to the user while keeping the data as fresh as needed!
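Just to make the wish concrete, here’s a rough sketch of the idea in plain Ruby – not a Rails API, just a hypothetical file-based cache helper that serves whatever is already cached and rebuilds it in a background thread once it goes stale:

    # Hypothetical helper; the block builds the (expensive) page content.
    def read_or_refresh(path, max_age)
      fresh = File.exist?(path) && (Time.now - File.mtime(path)) < max_age
      if fresh
        File.read(path)
      elsif File.exist?(path)
        Thread.new { File.write(path, yield) }   # rebuild behind the scenes
        File.read(path)                          # everyone else gets the old copy
      else
        File.write(path, yield)                  # the very first hit still pays full price
        File.read(path)
      end
    end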

Okay, back to my situation. Not only do I need to display all the data previously mentioned, but I also need to present some advertising-related data. Each product can have many advertiser_links. Each of those links points to an advertiser. To get at that data, I’ve just added two new joins. Pulling all the data at once is simply not doable. Just adding the ratings to the above query (remember we were just doing manufacturer, product, and review) makes it too slow (47 seconds!! over 1,370,000 rows!! YAY!) for me to ever consider for a real website. Adding the advertising information isn’t even worth timing. I tried, just for kicks, and couldn’t sit around long enough to see how long it took.

So how do I get past this? Why, I turn to my friend the paginator! That’ll fix all my woes! Right?

Once again… no. It helps, but it certainly isn’t the final solution. The problem is that this site just has too much data. I paginate the hell out of it, and most pages still take too long to load, especially since I can’t eagerly load as much as I need to in order to stop hitting the DB 10-50 times per product. Pagination is part of the answer, but by itself it doesn’t bring page load down far enough to matter. Early tests showed 5-15 seconds per page: even though I’m pulling less data, I’m still hitting the DB so many times. And when I do lots of eager loading, the DB spends so long on the query that the smaller result set doesn’t gain me much.

SO HOW THE HELL DO I GET PAST THIS? At this point I had hit a wall; I just didn’t know what else to do.

NOTE: there may be more options available that I don’t know about. I know enough about Rails to know that there’s still a lot more I have to learn.

I ended up doing what I should have done long ago: building custom SQL. Okay, I actually didn’t build SQL outside of the Rails system except once, where I needed hundreds of thousands of rows of data without the overhead of building an object for each one. But I did have to make heavy use of the find method’s :joins and :select options (once you combine those two with a :conditions option, you’re writing about 70% of the SQL for the query, minus the weird table/column aliasing Rails does), and I did a lot of preloading of data. For instance, I’d load my list of manufacturers (with products eagerly loaded) and then load all advertiser information, stored in a hash by product id. This allowed me to hit the database only a few times, and with relatively quick queries.
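In practice that looks something like this (Rails 1.x find options; the table, column, and model names here are illustrative, not the real schema):

    products = Product.find(:all,
      :select     => 'products.*, mfg.name AS mfg_name',
      :joins      => 'INNER JOIN manufacturers mfg ON mfg.id = products.manufacturer_id',
      :conditions => ['products.category_id = ?', category_id])

    # Preload advertiser data once, keyed by product id, instead of touching
    # the advertiser tables for every single product in the view.
    ads_by_product = {}
    AdvertiserLink.find(:all, :include => :advertiser).each do |link|
      (ads_by_product[link.product_id] ||= []) << link
    end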

Another little cheat was to do a lot of @product.mfg_id = xxx@ or @if (product.mfg_id == xxx)@, instead of @product.manufacturer = xxx@ and @if (product.manufacturer == xxx)@. I may know an id for something without having the associated object loaded on that specific product instance. Punching in the extra few characters (and therefore not loading more data) can yield tremendous performance bonuses.

Another nice solution I came up with is one I will put up as a plugin at some point – cached_find. If find_every (all finds eventually hit this one) is called with the same parameters twice within a given amount of time, the plugin will not hit the database again. This saved me a ton of DB hits on tables that have a lot of data that never changes. Some will say, “A good DB server will cache that for you!”, and they are right. But that still requires a hit to the DB server (rarely the same machine your code is running on, at least in a large app), and Rails still has to parse that data. Returning pre-parsed data that was previously being loaded 50+ times per page makes a big difference. This can end up keeping a ton of data in memory, so this trick is really only good for small sets of data that rarely change and are loaded in a lot of different places for different reasons. In my case, things like product categories (loaded for products, manufacturers, and reviews) and advertiser info (only a dozen or so advertisers exist in the data, but the extra join on two or three queries was still painful).
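The plugin isn’t published yet, so treat this as a sketch of the gist rather than the real code. The real thing hooks find_every so every find benefits; this standalone version just shows the memoization idea (no thread-safety, no invalidation):

    class ActiveRecord::Base
      FIND_CACHE = {}
      CACHE_TTL  = 5 * 60   # seconds; only sane for small, rarely-changing tables

      def self.cached_find(*args)
        key = [name, args].inspect
        hit = FIND_CACHE[key]
        return hit[:rows] if hit && (Time.now - hit[:at]) < CACHE_TTL

        rows = find(*args)
        FIND_CACHE[key] = { :rows => rows, :at => Time.now }
        rows
      end
    end

    # Category.cached_find(:all) now hits the database at most once per TTL window.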

The moral of the story is simple. Rails is proven for small sites, and even medium-to-large sites written with Rails from the beginning. Redoing a huge site to work in Rails requires a lot more than what the tutorials will ever tell you. It requires a lot of banging your head into walls, trying to rattle out one more life-saving plan. It requires a tenacity that not a lot of developers will have, especially those buying into the “Rails projects go 10 times faster!” hype. It requires a lot of digging into the internals of Rails. It requires at least a small amount of brain damage.

Is it worth it to do this when I could be working on the Java solution? I certainly think so. Java is painful enough to use (compared to Ruby) that it’s still easier to play with new technology that has a few quirks than to deal with “reliable” and “proven” Java-based frameworks.