URL: https://ludic.mataroa.blog/blog/i-accidentally-saved-half-a-million-dollars/

Ludicity


I ACCIDENTALLY SAVED HALF A MILLION DOLLARS

Published on October 29, 2023

I saved my company half a million dollars in about five minutes. This is more
money than I've made for my employers over the course of my entire career
because this industry is a sham. I clicked about five buttons.

Let's talk about why it happened and why it's a disgrace that it was even possible.


I. BACKGROUND

Let's start with some background, because it is fucking wild that an
inefficiency that took me five minutes to solve in a GUI configuration panel was
allowed to persist. We cancelled someone's contract the week before I did this.
Someone lost their job because no one could get their act together long enough
to click the button I told them to click.

A few years ago, this company decided that it wanted to create an analytics
platform, following the decision to become more "data driven". They hired some
incredibly talented people to make this happen, and then like five times as many
idiots.

At the time this was happening, I had just graduated and joined the organization
as a data scientist. We, of course, did not do any data science, because the
organization did not require any data science to be done - what they actually
needed to do was fire most of the staff in every team, leaving behind the two
people who actually had good domain knowledge, then allow them to collaborate
with good engineering teams to build sensible processes and systems. Instead,
they hired a bunch of Big Firm Consultants. You can see where this is going
already.

Nonetheless, at the time I was young and took the organization at its word.
Executives would tell us constantly how excited they were for us to roll out new
A.I. initiatives (then tell us there was no time, so could we please get that
report to them in a spreadsheet), and I'd ask for some sort of compute to
perform some machine learning, or even set up data pipelines.

It never worked. Instead, we were told that we just had to wait for the Advanced
Analytics Platform (AAP) to be deployed. You see, it's December, and it's
launching in January.

Then in January I was told to be patient, it was coming in March.

In June, I was told it had been put on hold due to Covid - this was a very
convenient excuse because they had absolutely fucked the whole project up
already, but it bought some valuable time. By the next December, I had left the
organization and the AAP was still nowhere to be seen.

We skip ahead three years. The AAP is finally ready to launch. It turns out none
of the features I needed were ever planned, so I guess they were just lying to
me before I left.

Four engineers leave the company in the same week, and I speak with the
directors because I know they need a real engineer and they can't find one.
I'm a substantially less experienced engineer than many of the readers here, but
suffice it to say that I can read documentation without panicking, which is
considered S-tier in this country. My conditions: a big pile of money, and they
had to put me on the AAP team, because it's the only team that gets actual
toys to play with.


II. IT FUCKING SUCKS

It's an insane dumpster fire spiderweb of technical debt and it's only like one
week old. Here are some fun details.

I get a friend of mine hired (big fan of nepotism), and he finds, on day one, a
file in the project's repository that deletes prod using our CI/CD pipelines if
it is ever moved into the wrong folder. It comes complete with the key and
password required for an admin account. It was produced by the former lead
engineer, who has moved on to a new role before his sins catch up with him.

The entire thing is stitched together by spreadsheets that are parsed by Python,
dropped into S3, parsed by Lambdas into more S3, the S3 files are picked up by
MongoDB, then MongoDB records are passed by another Lambda into S3, the S3 files
are pulled into Snowflake via Snowpipe, the new Snowflake data is pivoted by a
Javascript stored procedure into a relational format... and that's how you edit
someone's database access. That whole process is to upload like a 2KB CSV to a
database that has people's database roles in it.

This is considered more auditable.

Everything is transformed into a CSV because the security team demanded
something that could undergo easy scanning for malicious content, then they
never deployed the scanning tool, so we have all the downsides of the CSVs and
none of the upsides.

Every Lambda function, the backbone of all the ETL pipelines, starts with
counter = 1 because one of the early iterations used to use a counter and people
have just been copying that line over and over. Senior data engineers have been
copying that line over and over.

The test suites in the CI/CD pipelines have been failing for months, because
someone during debugging chose to use the Linux tee command to log any errors to
both stdout and a file at the same time, but tee successfully executing was
overwriting the error code from the failing tests.
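
For anyone who hasn't been bitten by this, here is a minimal sketch of the failure mode. It assumes bash is installed, uses `false` as a stand-in for the failing test command, and sends tee's file copy to /dev/null. None of those names come from the real pipeline.

```python
import subprocess


def ci_exit_code(script: str) -> int:
    """Run a shell snippet under bash and return the exit status a CI runner would see."""
    result = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
    return result.returncode


# `false` is a stand-in for a failing test command; it always exits with status 1.
FAILING_TESTS = "false"

# By default a pipeline reports the status of its LAST command, so tee's success wins.
masked = ci_exit_code(f"{FAILING_TESTS} 2>&1 | tee /dev/null")

# With pipefail, bash reports a failure from any command in the pipeline.
strict = ci_exit_code(f"set -o pipefail; {FAILING_TESTS} 2>&1 | tee /dev/null")

print(masked, strict)
```

The first pipeline exits 0 even though the "tests" failed, which is how a suite can look green for months. Turning on pipefail is one common fix, not necessarily the one that was eventually applied here.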

To get access to the password for any API we need to hit, you search for
something like service-password in an AWS service, which returns the value...
service-password (as in, literally all the values are the same as the keys),
then you use that to look up the actual password in a completely different
service. No one knows why we do this.

The script that generates configuration files for our pipelines starts with 600
lines of comments, because senior engineers have been commenting the lines out
in case they're needed later. The lines are just setting the same variables to
different values, and they're all on GitHub anyway.

This is at an organization that some percentage of readers will recognize on
sheer brand strength if they're in my country.

I'm not even getting started, but we have to stop for now because I am going to
catch fire. These details are important because now you understand the kind of
operational incompetence that allows you to waste so much money on processing
<1TB of data per day that it dwarfs your team's salary.


III. THE BUDGET

The next thing to realize is that this platform never really had a chance of
making any money for the organization. They do a little accounting trick (read:
lying) which I'll talk about in another post that makes it seem like they've had
huge wins, but really this is just many times more expensive than our previous
operational model.

The deal is that we pretend the whole team is doing something or other, and we
stay within budget because the organization can't afford to spend infinite money
on this social fiction. However, the budget for our database costs was being
drastically overrun. I'm not sure what the original estimate was, but I think
it was something like 200K for a year of operations, and we were now close to
a million dollars.

Some quick facts:

 1. We use Snowflake as our database, which charges you based on the size of the
    computer you use to run your queries.
 2. You only pay for computers while they're on.
 3. We probably run a few thousand queries per week, mostly developers
    experimenting with little tweaks for PowerBI reports that no one reads, and
    on average they take about 2 seconds to run.
 4. The computers are set to idle for 10 minutes after every query.

I noticed this about a month into joining the team, and suggested we uh... don't
have the computers run for like two orders of magnitude longer than they need to
for every query. I literally can't remember what was said, there was some Agile
bullshit about doing a discovery piece, then it just never happened.
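
The "two orders of magnitude" claim is just arithmetic. Here is a rough sketch under simplifying assumptions: each query arrives on its own, the computer was off beforehand, and billing is treated as simple wall-clock time with no minimum increments. The 60-second replacement timeout is a hypothetical value for illustration, not the setting that was actually chosen.

```python
QUERY_SECONDS = 2            # average query runtime, from the list above
OLD_IDLE_SECONDS = 10 * 60   # computers idle for 10 minutes after every query
NEW_IDLE_SECONDS = 60        # hypothetical shorter timeout, for illustration only


def billed_seconds(query_seconds: float, idle_seconds: float) -> float:
    """Time the computer stays on for one isolated query: the run plus the idle tail."""
    return query_seconds + idle_seconds


old_billed = billed_seconds(QUERY_SECONDS, OLD_IDLE_SECONDS)
new_billed = billed_seconds(QUERY_SECONDS, NEW_IDLE_SECONDS)

# How many times longer the computer is on than the query actually needs.
old_overhead = old_billed / QUERY_SECONDS
new_overhead = new_billed / QUERY_SECONDS

print(f"old: {old_billed:.0f}s on for {QUERY_SECONDS}s of work ({old_overhead:.0f}x)")
print(f"new: {new_billed:.0f}s on for {QUERY_SECONDS}s of work ({new_overhead:.0f}x)")
```

This is the worst case. When queries overlap or arrive back to back they share one idle tail, so the real-world multiplier is smaller.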


IV. JUST DOING THE FUCKING THING

Anyway, months later, they finally give me a card that says "Discovery: Optimise
Costs". Now I have to optimize costs so that I have something to say at the next
standup, and fortunately I know just the thing! I'll test my hypothesis that
this is all a sick joke, and I'm going to push the button that I secretly think
should obviously have been pushed.

We've got a new guy on another team who seems excellent, so I ask management if
I can give him admin credentials since we need competent people. They say no. I
flick him some lower-level database credentials, which I technically wasn't told
not to do since they aren't admin credentials, and he sanity-checks that it
would save money. At 4PM on the last day of the week, I ping a chat full of good
engineers and no managers to make sure I'm not about to nuke everything, then
just do it.


V. CHAOS REIGNS

I return to work the following Monday. I suspected that this would save a bunch
of money, and guess what, our projected bill dropped from a million to half a
million dollars, and everyone is losing their fucking minds.

My team has spun this as a huge cost saving, when really we just applied a fire
extinguisher to the pile of money that we had set alight.

Other teams are attacking my team, insisting that it can't be a coincidence that
the one new guy joined exactly as we did this, and how was it possible we didn't
know how to generate that kind of saving without his help? They are saying this
because it makes them seem higher status and their teams only produce money in
the land where you lie all day, but it is a fair question.

While my managers are very happy, they quietly suggest it may be unwise to roll
out the changes to all the computers (I only did a few to be safe) because it
would oversaturate the department to hear about us all day. And invite unwelcome
questions. The subtext is that if we do this all slowly enough, it might seem
like it took a lot of effort instead of just clicking buttons that I said had to
be clicked almost a year ago.

I am asked to write some PowerPoints, which include phrases like "a careful
statistical analysis of user usage patterns indicated an opportunity to more
effectively allocate resources", implying that nothing was wrong, we just needed
to collect more data before deciding not to let the expensive machines idle all
day.

Every day, I dread someone asking me to explain what the change was, because I
will have to fucking yeet some managers I like under a bus, but they can't
resist talking about the change non-stop because it is the closest some of them
will ever get to impacting the bottom line. And many of them are actually decent
managers, it's just that this whole department, like many departments, is some
sort of weird political PsyOp to get executives promoted. It's cosplaying as a
real business and the board thinks the costume is convincing.


VI. THE AFTERMATH AND TAKEAWAYS

By identifying a handful of good engineers and going totally rogue, we
outperformed the entire department pretty effortlessly. The competent people are
there, just made totally impotent by the organization, and I'm still convinced
that this place is probably better than the median organization.

I ask management for a 30K raise after saving 500K and my message is still
unread. I suspect I will eventually receive either nothing or 5K.

I have even more meetings now because everyone wants to talk about how we saved
the money. I had to make a PowerPoint. Kill me.

I would have been better off not doing anything. Let that be a lesson to you. Do
you hear me? I applied myself for five minutes against my own better judgement,
had the greatest success of my career, and have immediately been punished for
it. Learn from my mistakes, I beg of you.

PS: While this is doing immense traffic, any similar stories being sent to
ludicity.hackernews@gmail.com would be greatly appreciated so that I feel less
alone in this madhouse.
