I’m not dead! Sadly, neither is Rails…

My main website languishes, having no significant updates in over a decade. I repeatedly give up on my attempts at building even simple new games. And now even this crappy blog goes nearly two years without any updates. Why? Because I’m pretty much in an abyss of hellish real-life stressy balls of shit. A huge part of my lack of motivation (I think) is my day job.

The university where I work is still building the Rails app I’ve bitched about many times (I sometimes call it “Cyclops”). We’ve been using this framework for nearly a decade, and our latest rewrite started three years ago. I say “latest” because we already scrapped one rewrite attempt. The stack is just that great.

But somehow, upgrading to the latest and greatest code in the Cyclops stack has made things far worse than they were with previous versions. This self-imposed train wreck is so bad that despite having a team of about six developers, and despite not having deployed to production yet, we’re spending more time debugging, or figuring out what it’ll take to make the framework do what we want, than actually implementing features.

We’ve been “full steam ahead” for three years or so, and MVP is still a ways away. On average, it eats up roughly two FTE of devs (we have a rotating crew of five to six who are on it 40-60% of their time). I relish the “off-sprint” time when I can work on literally any other project. Go projects are divine. Our massive Django project is a beautiful thing in comparison. Even digging into Java code for the hideously awful DSpace project to fix their incredibly short-sighted and inaccessible HTML-building code is easier, more maintainable, and more enjoyable. And I hate Java with a passion.

COVID changed everything for me, and in a surprisingly good way. A coworker left, and we haven’t been able to replace him, which meant his work fell to me: Drupal updates (I hate Drupal and I hate PHP), more focus on front-end accessibility work (you wouldn’t know it looking at my main site, but web accessibility is critically important to me), and random API development. All of these things are better than Cyclops.

Here’s the thing that’s killing me. We knew this was a bad framework when we implemented our original incarnation back around 2014 or so. The project took two years (it was supposed to be half that), didn’t perform well, had many errors in production, and yet the ASSHOLES in charge of it sold it as a success to the library administration (I work in a university library for those not in the know).

I have been very opposed to this beautiful lie, because while the website itself is reasonably functional, the technology stack is very brittle and requires a fair bit of maintenance just to keep it as-is. This kind of maintenance (upgrading Rails, removing, replacing, or updating gems that have security problems, digging into why some pages take 30+ seconds to render, etc.) has become not only tolerated, but accepted and even expected.

Because of how shitty the stack has been historically, nobody bats an eye when the new implementation, using the latest version of the stack, is in fact slower and less reliable. Major institutions are divesting themselves of the technology, choosing to pay for really awful commercial solutions because “at least they scale well”.

We put many hours into debugging why it takes hours to index a few thousand assets. We never found a single bottleneck we could fix. At least we managed to disprove the moronic claim that Solr was the problem (Solr hits aren’t even 1% of the total time spent). We’re expecting it to take several weeks to index our full dataset (half a million digital assets: primarily images, but also PDFs, audio, video, etc.). Reread that, please. Several weeks to index half a million assets into Solr. I just said Solr isn’t the problem, so what the fuck? Well, this is where Cyclops FUCKING SUCKS ASS. Its code that prepares an asset’s data for Solr is slow. We can’t figure out why or how to fix it, but we know that’s the section of code that’s shitty. And we know that section as a whole is critical, because without an index in Solr, the website can’t function.

Our production server looks likely to need at least 32 gigs of RAM just to serve up web pages (background workers, responsible for things like creating web-friendly versions of huge images, will likely need another 16 GB or so, but we’ll ignore that for now). The web server’s job, 95% of the time, is just to look up data in Apache Solr and produce some pretty basic HTML. And again, Solr is incredibly fast if configured well; I’ve benchmarked it repeatedly in multiple scenarios to prove my idiot colleague wrong so he’ll stop trying to fucking make Solr his scapegoat while ignoring our real problems.

For reference, the Django project I don’t hate, Open ONI, runs our Historic Oregon Newspapers website. The data is much less diverse, being 100% newspaper data, so it’s apples to oranges in terms of the data, but the general job of the systems is the same: ingest images, present web viewers for those images, and use Apache Solr to do full-text search.

Oregon News runs on a 6GB RAM server with two vCPUs. It gets more traffic than Cyclops. It runs on Django, which isn’t that much faster than Rails generally speaking. This single server runs the following:

  • Solr, for full-text search across all our newspapers
  • Apache+WSGI to serve up the Django app
  • RAIS, which decodes JP2 images (which are notoriously CPU-intensive to decode) on the fly to produce tiles for the pan-and-zoom viewer, thumbnails, and “clips”
  • MySQL for the canonical source of data (you don’t generally use Solr as your only data source)

(We also have a separate server which does processing for new newspapers that need to be ingested, converting TIFF images into JP2s, for instance. It’s also a very low-end server, but it’s not a fair comparison since the real Cyclops project handles movies, audio files, etc. I’m just trying to give a sense of how the two perform when we strictly discuss images, so the “job processor” servers aren’t something I’m considering.)

For all these tasks, ONI runs very well on 6 gigs of RAM and two vCPUs. It gets more traffic than any site we have:

  • 30k page hits a week
  • 500k to a million on-the-fly image tile requests a week
  • Over a third of page hits are full-text searches, and most hits have to do some form of searching, even if it’s just “find all issues for a given newspaper title”. Even the “static” pages always present an image viewer, which hits RAIS to generate dynamic tiles from the JP2. So don’t go thinking it’s mostly serving static content.

Our Cyclops-powered shithole needs at least 32 gigs of RAM, and we’re looking at maybe 64. It’s still slower than the news site even with a small dataset (our staging server has only a few thousand assets right now). I have no traffic stats for the new system since it’s still in development, but the current system we’ll be replacing gets about a third the traffic of Oregon News.

ONI can ingest a million newspapers’ data in about ten hours, or 600 minutes. Just the “index” operation for 1000 images in Cyclops took us about an hour. 60 minutes. For 1000 fairly simple objects. That’s 1/10th the time for 1/1000th the work. In other words, per asset, Cyclops takes roughly 100x as long just to index as ONI takes to do a full ingest.
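For anyone who wants to check my math, here’s a quick back-of-envelope sketch in Python. It just plugs in the rough numbers from this post (a million items in ~600 minutes for ONI, 1000 assets in ~60 minutes for Cyclops, half a million assets total) and assumes Cyclops indexing scales linearly with asset count. That linear scaling is an assumption, not something we’ve measured on the full dataset.

```python
# Back-of-envelope throughput comparison using the rough figures from this post.
# ASSUMPTION: Cyclops indexing time scales linearly with the number of assets.

ONI_ITEMS = 1_000_000    # items in a full ONI ingest
ONI_MINUTES = 600        # about ten hours
CYCLOPS_ITEMS = 1_000    # assets indexed by Cyclops in our test
CYCLOPS_MINUTES = 60     # about one hour, index step only
FULL_DATASET = 500_000   # half a million digital assets


def items_per_minute(items: int, minutes: float) -> float:
    """Average throughput for a batch job."""
    return items / minutes


oni_rate = items_per_minute(ONI_ITEMS, ONI_MINUTES)
cyclops_rate = items_per_minute(CYCLOPS_ITEMS, CYCLOPS_MINUTES)

# How many times slower Cyclops is, per item.
slowdown = oni_rate / cyclops_rate

# Projected wall-clock time to index the full dataset at Cyclops's rate.
full_index_minutes = FULL_DATASET / cyclops_rate
full_index_days = full_index_minutes / 60 / 24

print(f"ONI:     {oni_rate:.0f} items/min")
print(f"Cyclops: {cyclops_rate:.1f} items/min")
print(f"Slowdown: {slowdown:.0f}x")
print(f"Full Cyclops index: {full_index_days:.1f} days of nonstop indexing")
```

With those inputs it works out to a 100x slowdown and just under 21 days of round-the-clock indexing for the full dataset, which lines up with the “several weeks” estimate. And since it compares Cyclops’s index step alone against ONI’s entire ingest, if anything it understates the gap.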

This is where I get really fucking pissy. I understand that Cyclops is trying to be a generic solution for any type of digital asset management, and assets have more varied metadata, but this is absolute bullshit garbage. There is no excuse for a 100x cost to index assets. ONI indexes the full-text OCR, so the “more metadata” argument falls flat fast. And a generic solution can be a little slower, but not two orders of magnitude slower.

The project was already a horrible experience, and one I am always glad to have an excuse to avoid, but it makes my blood boil when I see our project stakeholders pushing the devs forward while they all, fucking devs included, bury their heads in the motherfucking sand pretending we aren’t speeding toward a brick wall. No matter how you look at it, NOTHING warrants the costs of reindexing for the Cyclops project. Our particular data “needs” are absurd and stupid, but they don’t justify a 100x slowdown.

If it were just reindexing that sucked, I’d probably just say “HOLY SHIT! Oh well, we’ll just have to scale horizontally when we have to reindex.” But this is indicative of the entire project. And worse, of the library community as a whole. When did library software become this black hole of garbage-wrapped shit? When did we decide that we didn’t give a fuck about building systems that were maintainable? And hell, eco-friendly. This project will be burning more CPU than the vast majority of sites or apps I’ve ever been part of, while serving hardly anybody in the community or even the universities that are part of the project. Shit, at least historic newspapers have some value; half the images in our Cyclops project are locked down for use at the university and nowhere else.

So tell me, librarian-slash-programmers: how do you justify this? How is it okay to push projects, lie about their successes, and waste thousands of dev-hours on unsustainable trash that nobody even fucking needed to begin with?

I’ve actually been told “job security” by somebody who got away from Cyclops after investing years in the stack. What a community of shitty people.
