THE COST OF APACHE KAFKA: AN ENGINEER’S GUIDE TO PRICING OUT DIY OPERATIONS


Gwen Shapira
Joshua Buss

Jun 19, 2020

When I have a small software project that I want to share with the world, I
don’t write my own version control system with a web UI. I don’t even try to run
similar software on a computer in someone’s datacenter. I don’t write a document
analyzing the pros and cons of each decision.

Instead, I just create a repository on GitHub.

To be fair, GitHub is free for open source projects, so this is an easy call.
And what if GitHub cost me $5 a month? I’m paying this much for my to-do list
management software. Five dollars a month is quite easy for an engineer to
afford even out of her own pocket.

Cloud services span a wide price range. Apache Kafka® itself is free, and
Confluent Cloud is very cheap for small use cases: about $1 a month to produce,
store, and consume a GB of data. As your usage scales and your requirements
become more sophisticated, your cost will scale too.

Get Started
If you’d like to get started with Confluent Cloud, sign up for a free trial and
use the code CL60BLOG for an additional $60 of free usage.



This is typical of databases offered as a managed service: they can be very
inexpensive for casual use and very expensive when you use them in anger. This is
what usage-based billing is all about, and it is one of the biggest benefits of
the cloud.

With large-scale use cases, it is natural to seriously consider running your own
Kafka (or, for that matter, running MinIO instead of using Amazon S3). And you
should seriously consider a decision of this magnitude; you just need to consider
it rationally. I have noticed that many software engineers and engineering
managers do not always do this.

Even at the low end of the scale, where a managed service is ridiculously
inexpensive, I see engineers run their own Kafka and not even consider a managed
service. When I ask why, the responses usually include: the joy of running Kafka
themselves, the career opportunities it creates, and perhaps most commonly, a
sense of futility—“my manager would never approve this expense.”

When you talk to engineering managers, the responses vary. Sometimes they trust
their team’s ability to deliver quality service more than they trust a service
provider. But quite often, the managers themselves don’t know how to calculate
the trade-offs involved or how to justify the necessary budget. If you are an
engineering manager, you likely have years of practice going to your manager and
saying, “This year, my team is also running Kafka, and we are spending 20 hours a
week on maintenance. I need additional headcount.”

What you likely have less experience with is going to your manager and saying,
“This year, my team is building a real-time inventory management system. This
requires an event streaming platform. We decided to use Confluent Cloud, and I
need a budget of $7,000 a month for our use case.” Engineering organizations are
built to hire engineers. Managers are incentivized and get promoted for building
large teams, and no one seems to know how to convert this budget into managed
services.

As an industry, we have not learned how to make great decisions about the use of
managed services. It is time we upped our game.


SO, HOW MUCH DOES KAFKA REALLY COST?


START WITH SOME FAIRLY OBVIOUS AND EASY-TO-QUANTIFY EXPENSES

If you are going to run Kafka on AWS, you’ll need to pay for EC2 machines to run
your brokers. If you are using a Kubernetes service like EKS, you pay for the
nodes and for the service itself (the Kubernetes masters). Most relevant EC2
instance types are EBS-only, and Kubernetes supports only EBS as a first-class
disk option, which means you need to pay for an EBS root volume in addition to
the EBS data volume. Don’t forget that until KIP-500 is merged, Kafka is not just
brokers: we need to run Apache ZooKeeper™ too, adding three or five nodes and
their storage to the calculation. We run Kafka behind a load balancer (acting
partially as a NAT layer), and since each broker needs to be addressed
individually, you’ll need to pay for the “bootstrap” route and a route for each
broker.

All these are fixed costs that you pay without sending a single byte to Kafka.
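To make these fixed costs concrete, the following minimal Python sketch totals them for a single self-managed cluster. Every price and size is a hypothetical placeholder rather than a current AWS rate, so substitute figures from your own bill. The component list mirrors the paragraph above: broker and ZooKeeper nodes, an EBS root volume on every node plus data volumes, a managed Kubernetes control plane, and one load-balancer route for bootstrap plus one per broker. The `FixedCostInputs` fields and the 730-hour month are assumptions of this sketch.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation: 24 * 365 / 12


@dataclass
class FixedCostInputs:
    # Hypothetical inputs; replace every price with your own rates.
    brokers: int
    zookeeper_nodes: int              # typically three or five
    node_hourly_usd: float            # EC2 price per broker or ZooKeeper node
    root_volume_gb: int               # EBS root volume on every node
    broker_data_gb: int               # EBS data volume per broker
    zookeeper_data_gb: int            # EBS data volume per ZooKeeper node
    ebs_usd_per_gb_month: float
    control_plane_hourly_usd: float   # managed Kubernetes masters
    lb_route_monthly_usd: float       # price per load-balancer route


def monthly_fixed_cost(c: FixedCostInputs) -> float:
    """Monthly cost paid before a single byte is sent to Kafka."""
    nodes = c.brokers + c.zookeeper_nodes
    compute = nodes * c.node_hourly_usd * HOURS_PER_MONTH
    storage_gb = (nodes * c.root_volume_gb
                  + c.brokers * c.broker_data_gb
                  + c.zookeeper_nodes * c.zookeeper_data_gb)
    storage = storage_gb * c.ebs_usd_per_gb_month
    control_plane = c.control_plane_hourly_usd * HOURS_PER_MONTH
    routes = (1 + c.brokers) * c.lb_route_monthly_usd  # bootstrap + one per broker
    return compute + storage + control_plane + routes


example = FixedCostInputs(brokers=3, zookeeper_nodes=3, node_hourly_usd=0.20,
                          root_volume_gb=20, broker_data_gb=1000,
                          zookeeper_data_gb=50, ebs_usd_per_gb_month=0.10,
                          control_plane_hourly_usd=0.10, lb_route_monthly_usd=20)
print(round(monthly_fixed_cost(example)))  # → 1356
```

With these made-up prices, a three-broker cluster costs about $1,356 a month before it carries any traffic; the point is the shape of the calculation, not the figure.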

On top of this, there are network costs. Moving data into and out of EC2 can cost
money, and depending on your network setup (VPC, private link, or public
internet), you may need to pay both when sending and when receiving data. If you
replicate data between zones or regions, make sure you account for those costs
too. And if you route traffic through ELBs, you will pay extra for that traffic.
Don’t forget to account for both ingress and egress, and keep in mind that with
Kafka, you typically read 3–5 times as much as you write.
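A rough way to see how network charges scale is to estimate monthly transfer volumes from the write rate. The sketch below assumes a replication factor no larger than the number of zones, with each replica in a different zone (so every written byte crosses zones `replication_factor - 1` times), and assumes producers, consumers, and partition leaders are spread evenly across zones, so that `(zones - 1) / zones` of client traffic reaches a leader in another zone. The per-GB rates passed to `monthly_network_cost` are placeholders; whether ingress is billed at all, and whether cross-zone traffic is billed on both sides, depends on your provider and network setup.

```python
SECONDS_PER_MONTH = 730 * 3600  # 730-hour month


def monthly_network_gb(write_mb_s: float, read_fanout: float,
                       replication_factor: int = 3, zones: int = 3) -> dict:
    """Rough monthly transfer volumes in GB (1 GB = 1,000 MB)."""
    if replication_factor > zones:
        raise ValueError("sketch assumes at most one replica per zone")
    written_gb = write_mb_s * SECONDS_PER_MONTH / 1000
    read_gb = written_gb * read_fanout  # reads are often 3-5x writes
    # Followers fetch every byte from the leader; with each replica in its
    # own zone, all replication_factor - 1 copies cross a zone boundary.
    replication_gb = written_gb * (replication_factor - 1)
    # Clients and leaders spread evenly across zones: a request reaches a
    # leader in another zone (zones - 1) / zones of the time.
    client_cross_zone_gb = (written_gb + read_gb) * (zones - 1) / zones
    return {"ingress": written_gb, "egress": read_gb,
            "cross_zone": replication_gb + client_cross_zone_gb}


def monthly_network_cost(volumes: dict, ingress_per_gb: float,
                         egress_per_gb: float, cross_zone_per_gb: float) -> float:
    """Apply placeholder per-GB rates; pass 0 for anything not billed."""
    return (volumes["ingress"] * ingress_per_gb
            + volumes["egress"] * egress_per_gb
            + volumes["cross_zone"] * cross_zone_per_gb)


volumes = monthly_network_gb(write_mb_s=10, read_fanout=3)
print(volumes)  # → {'ingress': 26280.0, 'egress': 78840.0, 'cross_zone': 122640.0}
```

Under these assumptions, a modest 10 MB/s write rate produces more cross-zone traffic than ingress and egress combined, which is why zone layout deserves a line of its own in the budget.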

Now we are running the software, ingesting data, storing it, and reading it.
We’re almost done. 🙂 You need to monitor Kafka, right? Make sure you account for
monitoring, whether through a service or self-hosted (Kafka has many important
metrics), and you’ll need a way to collect and search logs as well. These can end
up being the most expensive parts of the system, especially if you have many
partitions, because the number of metrics grows significantly with the partition
count.
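Partitions dominate monitoring cost because many Kafka metrics, such as a log’s size and end offset, are reported per partition replica. That makes the number of time series grow with partitions times replication factor, while broker-level metrics grow only with the broker count. In this sketch the per-broker and per-replica metric counts are illustrative guesses, not figures taken from Kafka; count the series your own exporter emits before pricing them.

```python
def metric_series(brokers: int, partitions: int, replication_factor: int,
                  broker_metrics: int = 500, per_replica_metrics: int = 5) -> int:
    """Estimate monitored time series for one cluster.

    broker_metrics and per_replica_metrics are illustrative guesses, not
    counts taken from Kafka; measure what your own exporter emits.
    """
    broker_level = brokers * broker_metrics
    # Every broker hosting a replica reports metrics for that replica's log.
    replica_level = partitions * replication_factor * per_replica_metrics
    return broker_level + replica_level


def monthly_monitoring_cost(series: int, usd_per_series_month: float) -> float:
    """Price the series with a hypothetical per-series monthly rate."""
    return series * usd_per_series_month


print(metric_series(6, 100, 3), metric_series(6, 10_000, 3))  # → 4500 153000
```

With these guesses, going from 100 to 10,000 partitions on the same six brokers multiplies the series count by 34, and a per-series pricing model scales the bill the same way.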


SOME ASPECTS OF RUNNING KAFKA ARE MORE CHALLENGING TO QUANTIFY

But that doesn’t mean you should skip them; if you do, you will end up paying
these costs later whether you want to or not.

It starts with capacity planning. Ideally, you start with some idea of what
workload you will run on the cluster—MB/s ingress and egress, number of
partitions, number of concurrent connections, connection rate, and request rate.
Realistically, no new project ever estimates these correctly. This means that
you will plan capacity based on some guesses and over-provision to provide a
buffer when the guesses inevitably turn out to be wrong. Capacity planning takes
time, which is money, and you have to pay for all the over-provisioned capacity,
too.
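A first-pass sizing calculation can at least make the buffer explicit. This sketch assumes you have measured how many MB/s of writes and reads one broker sustains on your chosen hardware (the `broker_write_mb_s` and `broker_read_mb_s` parameters are yours to supply), counts replication traffic on top of client traffic, and keeps each resource below `1 - headroom` of its capacity. It ignores partition count, connections, and request rate, which also matter, so treat it as a floor rather than an answer.

```python
import math


def brokers_needed(ingress_mb_s: float, egress_mb_s: float,
                   replication_factor: int,
                   broker_write_mb_s: float, broker_read_mb_s: float,
                   headroom: float = 0.4) -> int:
    """Smallest broker count keeping writes and reads below (1 - headroom).

    broker_write_mb_s and broker_read_mb_s are per-broker limits you measure
    on your own hardware; headroom is the over-provisioning buffer.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Each produced byte is written replication_factor times cluster-wide.
    cluster_writes = ingress_mb_s * replication_factor
    # Leaders serve consumers plus (replication_factor - 1) follower fetches.
    cluster_reads = egress_mb_s + ingress_mb_s * (replication_factor - 1)
    usable = 1 - headroom
    by_writes = cluster_writes / (broker_write_mb_s * usable)
    by_reads = cluster_reads / (broker_read_mb_s * usable)
    # A topic cannot have more replicas than there are brokers.
    return max(replication_factor, math.ceil(max(by_writes, by_reads)))


print(brokers_needed(100, 300, 3, broker_write_mb_s=100,
                     broker_read_mb_s=200, headroom=0.5))  # → 6
```

With no headroom the same workload fits on three brokers; the 50% buffer for wrong guesses doubles that, and you pay for the extra brokers every month.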

If you still get capacity wrong and under-provision, you’ll pay the price in
availability. That also means you and your on-call rotation will get paged,
sometimes for rather mysterious issues, and will spend significant time trying
to solve them. Expanding an already loaded cluster is a very challenging problem:
an overloaded cluster will not have the spare bandwidth, IO, and CPU that you
need in order to move workloads around. Time spent troubleshooting and the
downtime involved also have cost implications.

Getting the capacity right involves more than just choosing the number of
brokers. At Confluent, we spend significant time choosing the right components
and making sure they are all aligned and optimized—the right machine types,
right disk types, right disk sizes, broker configuration, zone alignment, load
balancers, and a lot more.

Then, there is routine maintenance. Not tons, but definitely enough to keep
people busy. You’ll want to upgrade regularly, especially with bugfix releases
and security patches. Staying on top of the latest bug fixes is critical for
avoiding disastrous incidents; it is heartbreaking to see customers lose data due
to a bug that was fixed a year ago. New releases also open up new capabilities,
including better configuration and additional monitoring. You’ll want to stay on
top of this and make sure you collect the latest metrics, since they were added
for a reason. It will also take significant time to tune alerts so that you learn
of impending disasters early without drowning in noise.

Another important aspect of routine maintenance is cluster rebalancing and
expansion. You’ll want to watch for early signs of workload imbalance and move
partitions around to get to a balanced state. This improves performance and
indirectly reduces your costs. More importantly, watching for early signs of
overload and proactively expanding the cluster keeps you from trying to expand
an already overloaded cluster.
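Spotting early imbalance can start as simply as comparing each broker’s load to the cluster average. The sketch below takes a per-broker load figure that you would pull from your monitoring (for example, bytes in per second or disk usage) and flags brokers that deviate from the mean by more than a chosen tolerance. The 20% default is an arbitrary assumption, and deciding which partitions to move, for instance with Kafka’s `kafka-reassign-partitions` tool, is left out.

```python
from statistics import mean


def imbalance_report(load_by_broker: dict[int, float],
                     tolerance: float = 0.2) -> dict[int, float]:
    """Brokers whose load deviates from the cluster mean by more than tolerance.

    Values are relative deviations: 0.333 means 33.3% above the mean, and
    negative values mean below it. The load metric is whatever you choose to
    compare, e.g. bytes in per second or disk usage per broker.
    """
    avg = mean(load_by_broker.values())
    if avg == 0:
        return {}
    report = {}
    for broker, load in load_by_broker.items():
        deviation = (load - avg) / avg
        if abs(deviation) > tolerance:
            report[broker] = round(deviation, 3)
    return report


print(imbalance_report({1: 100.0, 2: 100.0, 3: 160.0}))  # → {3: 0.333}
```

Running a check like this on a schedule turns “watch for early signs” into an alert that fires while the cluster still has the spare capacity needed to move partitions.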

There is a cost to elasticity or lack thereof. One of the more interesting
aspects around maintenance is when your cluster has periodic workloads. You may
know that you need twice the capacity for Black Friday, or on weekend events, or
daily between 5:00 p.m. and 12:00 a.m. Do you have the ability to shrink and
expand the cluster at will? Does it happen frequently enough that you need to
automate it? Are you running in the cloud where you have some degree of
elasticity or in an old-school datacenter where you need to order your entire
capacity three months in advance?

In these cases, either you run at maximum capacity 100% of the time, paying the
cost of capacity that you don’t always use, or you pay in time and effort for
manual expand/shrink operations or the effort of automating it.
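The trade-off can be framed in broker-hours. This sketch compares sizing for the peak all day with scaling up only during a daily peak window, using the 5:00 p.m. to 12:00 a.m. example (seven hours) and a hypothetical doubling of capacity. It assumes scaling happens instantly and for free, which is exactly what it does not do in practice: moving partitions takes time, and the manual or automation effort has to be subtracted from any savings.

```python
import math


def daily_broker_hours(base_brokers: int, peak_multiplier: float,
                       peak_hours_per_day: float, scale_on_demand: bool) -> float:
    """Broker-hours per day for static peak sizing vs. scaling for the peak.

    Assumes scaling is instant and free; in practice moving partitions
    takes time, and the manual or automation effort has its own cost.
    """
    if not 0 <= peak_hours_per_day <= 24:
        raise ValueError("peak_hours_per_day must be between 0 and 24")
    peak_brokers = math.ceil(base_brokers * peak_multiplier)
    if not scale_on_demand:
        return peak_brokers * 24  # sized for the peak around the clock
    off_peak_hours = 24 - peak_hours_per_day
    return peak_brokers * peak_hours_per_day + base_brokers * off_peak_hours


static = daily_broker_hours(6, 2, 7, scale_on_demand=False)  # 5 p.m. to midnight
scaled = daily_broker_hours(6, 2, 7, scale_on_demand=True)
print(static, scaled, f"{1 - scaled / static:.0%}")  # → 288 186 35%
```

If the engineering time needed to expand and shrink the cluster every day costs more than the roughly 35% of broker-hours saved here, running at peak capacity all the time is the cheaper option.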

“Why is Kafka slow?” tax. Regardless of how well you planned capacity and
tuning, someone is bound to ask, “Why is Kafka so slow?” Maybe they expected 20
ms of latency and are seeing 40 ms. Maybe it is usually 20 ms, but a few times a
day it spikes to 2,000 ms. Finding the answer and fixing the issue can be
incredibly time consuming even with a team of experts, and nearly impossible if
your Kafka team is also the Apache Cassandra team and the Elasticsearch team. Pay
the price of hiring and training experts, or pay the price of living with
slowdowns.
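One reason these questions are hard is that averages hide spikes. The sketch below computes nearest-rank percentiles over synthetic latency samples (995 requests at 20 ms and 5 at 2,000 ms, invented for illustration): the median and even p99 look healthy, while p99.9 exposes the spikes. Reporting tail percentiles alongside the mean is a common way to make the “usually 20 ms, sometimes 2,000 ms” pattern visible before someone asks about it.

```python
import math
from statistics import fmean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in ms, invented for illustration.
samples = [20.0] * 995 + [2000.0] * 5
print(f"mean={fmean(samples):.1f} p50={percentile(samples, 50)} "
      f"p99={percentile(samples, 99)} p99.9={percentile(samples, 99.9)}")
# → mean=29.9 p50=20.0 p99=20.0 p99.9=2000.0
```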

Hybrid cloud tax. Running on multiple clouds brings its own level of challenges.
Even with Kubernetes, each cloud requires a capacity planning exercise—not all
vcores are made equal; storage and networks also differ. You’ll also need to
learn all about their network routing idiosyncrasies and their different
defaults for Kubernetes hosts. Load testing is time consuming, and hybrid cloud
doubles or triples the cost. If you rely on a managed service that only exists
in one cloud, you get to enjoy the benefits of a managed service in one
environment but need to pay all the DIY tax in another.


SOME COSTS ARE NEARLY IMPOSSIBLE TO QUANTIFY

We started by talking about actual costs that you get billed for by your vendors
and service providers. Then we talked about costs of the “time is money”
variety—those are harder to quantify, but since we have some estimates for
engineering salaries, there’s a fairly straightforward formula for converting
time into money. But there are some costs that are nearly impossible to put a
specific price tag on. This doesn’t mean they are not important. In fact, the
opposite is true: they are hard to quantify because in the worst case, the price
is the entire company.
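The conversion from time to money is straightforward arithmetic. This sketch assumes a loaded hourly cost derived from salary with an overhead multiplier for benefits, taxes, and equipment, and about 1,800 working hours a year; the $180,000 salary and 1.3 multiplier are hypothetical placeholders, so ask your finance team for real figures. It applies the result to the 20 hours a week of Kafka maintenance mentioned earlier.

```python
def loaded_hourly_cost(annual_salary: float, overhead_multiplier: float = 1.3,
                       working_hours_per_year: float = 1800) -> float:
    """Hourly cost of an engineer including overhead.

    The 1.3 multiplier (benefits, taxes, equipment) and 1,800 working hours
    are placeholder assumptions; use your organization's own figures.
    """
    return annual_salary * overhead_multiplier / working_hours_per_year


def monthly_time_cost(hours_per_week: float, hourly_cost: float) -> float:
    """Convert recurring weekly effort into a monthly dollar figure."""
    weeks_per_month = 52 / 12
    return hours_per_week * weeks_per_month * hourly_cost


rate = loaded_hourly_cost(180_000)  # hypothetical salary
print(round(rate), round(monthly_time_cost(20, rate)))  # → 130 11267
```

At these assumed rates, 20 hours a week comes to roughly $11,267 a month, a number that can sit next to a managed-service quote in the same budget discussion.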

Time to market. Quite a few of the “engineering time” items we mentioned in the
previous section have to happen before your application is deployed to
production, such as capacity planning, monitoring, and tuning. Any delay, due to
lack of experience or just the fact that this is challenging work, delays your
product or application from being released to its users. In some cases, the
delay doesn’t matter at all, but in others, it gives a competitor a critical
advantage.

Engineering happiness. If you build it and deploy it, you also hold the pager
for it and are on the hook for all maintenance tasks. This type of maintenance
work isn’t what gets engineers excited to go to work in the morning. We can
tolerate a bit of paging and even routine maintenance once or twice (provided
that we have a clear plan on how to automate it away). But if there is no plan,
and if there is too much stuff that isn’t really “engineering” on our plate, we
become very unhappy very quickly. This may lead to retention problems and churn,
which are pretty easy to quantify. But in the worst scenario, you have engineers
who are disengaged and unmotivated in their current position.

Risk. If you have world-class experts running the service, the risk is not huge.
It is still there, but with a dedicated team of experts and some
over-provisioning, 99.95% uptime isn’t impossible to reach. If your engineers
are learning on the job, as is frequently the case, you risk downtime, data
loss, security breaches, and compliance issues. The impact can range from
apologizing to customers to facing a crippling loss of trust from customers and
hefty government fines. When calculating the cost of downtime, don’t forget to
account for the loss of business during the downtime, the length of the downtime
(it takes longer for nonexperts to solve issues), the time engineers spend
diagnosing and solving issues, deploying remediations, and recovering (if someone
worked until 4:00 a.m. on an incident, don’t expect much productivity the next
day), and the frequency of issues.
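The factors listed above can be combined into an expected annual cost. In this sketch every field of the hypothetical `IncidentProfile` is an estimate you would supply; none come from the article. It deliberately leaves out loss of customer trust and regulatory fines, which are much harder to put a number on.

```python
from dataclasses import dataclass


@dataclass
class IncidentProfile:
    # Every field is your own estimate; none of these values come from data.
    incidents_per_year: float
    outage_hours: float            # tends to be longer when nonexperts respond
    revenue_lost_per_hour: float
    engineers_involved: int
    response_hours: float          # diagnose, fix, deploy remediation, recover
    lost_next_day_hours: float     # productivity lost after a late-night page
    engineer_hourly_cost: float


def annual_downtime_cost(p: IncidentProfile) -> float:
    """Expected yearly cost of incidents, excluding trust and fines."""
    lost_business = p.outage_hours * p.revenue_lost_per_hour
    people_hours = p.engineers_involved * (p.response_hours + p.lost_next_day_hours)
    per_incident = lost_business + people_hours * p.engineer_hourly_cost
    return p.incidents_per_year * per_incident


profile = IncidentProfile(incidents_per_year=4, outage_hours=3,
                          revenue_lost_per_hour=10_000, engineers_involved=2,
                          response_hours=6, lost_next_day_hours=4,
                          engineer_hourly_cost=130)
print(annual_downtime_cost(profile))  # → 130400
```

Varying `outage_hours` and `incidents_per_year` is a quick way to see how much of the risk cost comes from inexperience rather than from the infrastructure itself.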


COMPARING COSTS

Now that you’ve accounted for everything that DIY costs you, it’s time to
compare different options. The alternatives to DIY are hosted or managed
offerings.

When comparing different hosted and managed offerings, it is super important to
check what is included in the price and what is an “extra.”

Do you pay for brokers? For traffic? For storage? What about traffic over the
public internet? Between zones and regions? Do you pay for ZooKeeper nodes too?
What about “special” network configurations—VPC peering, private link, and NAT
gateways? Are “batteries included” when it comes to capacity planning, upgrades,
and elasticity? Or do you still need to invest in engineering effort? If they
claim to be elastic, do they balance the load for you, or are you on the hook
for the most challenging part? Can you shrink or just expand?

Note that Kafka has relatively “thick” clients, so make sure the vendor of
choice has the capability to troubleshoot client issues and take on the dreaded
“Kafka is slow” question, rather than just responding with “the server is fine.”

After you’ve calculated exactly how much each option will cost, I recommend
provisioning what you think are equivalent-sized clusters in your top options
and using Kafka’s perf producer and consumer tools to load them. Run the test for
a few hours to make sure you see sustained capacity rather than “burst capacity,”
and keep an eye on latency as well. If latency becomes unacceptably high, you’ll
want to reduce throughput to keep it within acceptable boundaries.
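Once the runs finish, a small script can separate sustained capacity from bursts and apply a latency limit. This sketch assumes you have recorded, for each run, the throughput reported at each interval and the run’s p99 latency (Kafka’s `kafka-producer-perf-test` tool prints throughput at regular intervals and latency percentiles at the end of a run). The `RunResult` structure, the warm-up length, and the choice of the slowest steady-state interval are assumptions picked to be conservative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    target_mb_s: float
    interval_mb_s: list[float]  # throughput reported at each interval
    p99_latency_ms: float


def sustained_mb_s(run: RunResult, warmup_intervals: int) -> float:
    """Slowest post-warm-up interval, so short bursts don't inflate the result."""
    steady = run.interval_mb_s[warmup_intervals:]
    if not steady:
        raise ValueError("run is too short for the chosen warm-up")
    return min(steady)


def best_sustained_rate(runs: list[RunResult], p99_limit_ms: float,
                        warmup_intervals: int = 5) -> Optional[float]:
    """Highest sustained throughput among runs that met the latency limit."""
    acceptable = [sustained_mb_s(r, warmup_intervals) for r in runs
                  if r.p99_latency_ms <= p99_limit_ms]
    return max(acceptable, default=None)


# Hypothetical results from three runs at increasing target rates.
runs = [
    RunResult(50, [60, 55, 50, 50, 50, 50, 50], p99_latency_ms=15),
    RunResult(100, [120, 110, 100, 95, 98, 97, 99], p99_latency_ms=40),
    RunResult(150, [160, 150, 140, 120, 110, 100, 105], p99_latency_ms=300),
]
print(best_sustained_rate(runs, p99_limit_ms=50, warmup_intervals=2))  # → 95
```

The fastest run is discarded because its latency blew through the limit, and the early 120 MB/s burst of the second run does not count toward its sustained figure.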

The last step is to use the total cost and throughput from the very basic
benchmark to calculate the price in dollars per MB/s. This is a standard way to
compare cost-effectiveness. It’s been used in TPC-C benchmarks to compare
databases for the last few decades.
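The calculation itself is a single division, but it is only meaningful if the monthly cost is the full total from the accounting above (infrastructure, network, monitoring, and engineering time) and the throughput is the sustained, latency-bounded figure from the benchmark. The option names and numbers below are hypothetical.

```python
def usd_per_mb_s(monthly_cost_usd: float, sustained_mb_s: float) -> float:
    """Price-performance: total monthly cost per MB/s of sustained throughput."""
    if sustained_mb_s <= 0:
        raise ValueError("sustained throughput must be positive")
    return monthly_cost_usd / sustained_mb_s


def rank_options(options: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """options maps a name to (total monthly cost in USD, sustained MB/s)."""
    scored = [(name, usd_per_mb_s(cost, mb_s))
              for name, (cost, mb_s) in options.items()]
    return sorted(scored, key=lambda item: item[1])  # cheapest per MB/s first


# Hypothetical figures for two options.
print(rank_options({"option_a": (20_000, 100), "option_b": (12_000, 40)}))
# → [('option_a', 200.0), ('option_b', 300.0)]
```

Here the option with the larger monthly bill is the more cost-effective one, which is exactly the kind of result a comparison of sticker prices alone would miss.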

Now you can compare your options like a pro.

One last thing to remember when comparing providers: Not all SLAs are equal,
even if all providers claim 99.95% uptime in their SLAs. There are a lot of
important details related to how they measure “availability” and how easy it is
to process SLA-breach claims.
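It helps to translate an SLA percentage into a downtime budget before reading the fine print. The sketch assumes a 30-day month by default; it tells you how much unavailability the percentage allows, but not how the provider defines or measures a minute of unavailability, which is where those important details come in.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 30) -> float:
    """Minutes of unavailability an SLA percentage permits over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * period_days * 24 * 60


for sla in (99.9, 99.95, 99.99):
    print(f"{sla}%: {allowed_downtime_minutes(sla):.1f} min per 30 days")
# → 99.9%: 43.2 min per 30 days
# → 99.95%: 21.6 min per 30 days
# → 99.99%: 4.3 min per 30 days
```

Over a 365-day year, 99.95% allows about 262.8 minutes (roughly 4.4 hours) of downtime.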


OPTIMIZING FOR COSTS

It is worth noting that both you and your cloud provider can do quite a bit to
reduce the cost per MB/s, so as to run Kafka more efficiently and either get
more throughput from the same setup or get the same throughput from a more
cost-effective setup.

The easiest way you can improve your throughput without provisioning additional
capacity is by sending data to Kafka in a more efficient way. The difference
between efficient and inefficient use of Kafka can be more than three times the
throughput on the same hardware.

Luckily, the advice on being more cost efficient is exactly the same as
performance tuning advice, so it is fairly easy to find. Use the sticky
partitioner, tune linger.ms, use fewer clients, and maybe even fewer partitions
to send the same throughput using fewer requests.
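A simplified model shows why these settings cut request volume. It assumes a single producer sending keyless records at a steady rate, and that a batch closes when it reaches `batch.size` bytes or when `linger.ms` elapses. With round-robin partitioning, traffic is spread across all partitions, so each partition’s batch fills slowly; the sticky partitioner (the Java producer’s default for keyless records since Kafka 2.4, per KIP-480) fills one partition’s batch at a time. The model counts batches rather than requests, because the real producer can group batches bound for the same broker into one request, so treat the numbers as relative, not absolute.

```python
def batches_per_second(bytes_per_s: float, record_bytes: int, partitions: int,
                       linger_ms: float, batch_size: int, sticky: bool) -> float:
    """Rough number of record batches one producer sends per second.

    Simplified model: keyless records at a steady rate; a batch closes when it
    reaches batch_size bytes or after linger_ms, whichever comes first.
    """
    # Round-robin spreads traffic over every partition, so each partition's
    # batch fills slowly; the sticky partitioner fills one batch at a time.
    fill_rate = bytes_per_s if sticky else bytes_per_s / partitions
    batch_bytes = min(batch_size, fill_rate * linger_ms / 1000)
    batch_bytes = max(batch_bytes, record_bytes)  # at least one record per batch
    return bytes_per_s / batch_bytes


# 1 MB/s of 1 KB records across 100 partitions, linger.ms=10, batch.size=128 KiB.
for sticky in (False, True):
    print(sticky, batches_per_second(1_000_000, 1_000, 100, 10, 131_072, sticky))
# → False 1000.0
# → True 100.0
```

The same model shows why fewer partitions help a round-robin producer: each partition receives a larger share of the traffic, so its batches fill faster within the same `linger.ms`.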

All these performance improvements lead to greater efficiency regardless of who
runs your Kafka brokers.

However, it is nice when your vendor shares your mindset. Confluent is a bit
obsessive about Kafka performance (in a good way). We strive to introduce small
performance improvements every few weeks, and those add up. Those improvements
ship as part of Apache Kafka releases, but we deploy from master to our cloud
clusters every two weeks, so you can benefit from those improvements weeks or
months before the Apache Kafka version is even released, not to mention the
months or years it takes many companies to upgrade.


SUMMARY

The key to making good choices regarding managed services boils down to setting
aside institutional traditions and dysfunctional incentives and focusing on
making an economic decision.

Because we are not used to making economic decisions, we often miss some key
costs. I remember the day I realized that 50% of our costs were network, and
that for each four-broker cluster, we also paid for three ZooKeeper nodes and
three Kubernetes master nodes. That cross-zone traffic would have been less
expensive if it were within the same VPC, used ELB, or was a full moon.

You also need to know how to convert engineering time to budget and vice versa.
We had a weird SSL issue that caused us to perform many more IOPS than expected.
We could solve it by replacing the disks with SSDs, so the storage would be
capable of delivering all the IOPS we needed, or we could re-implement part of
the Java SSL stack inside Kafka to work around the issue. Without knowing how to
compare the cost of SSDs over time with the cost of engineering effort, it’s hard
to make the right choice. (Spoiler: Switching to SSDs was an additional $50,000 a
year. The bug took two weeks to fix. It turns out that “throw hardware at the
problem” isn’t always the right call.)
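The SSD decision can be laid out the same way. The $50,000 a year and the two weeks come from the anecdote above; the assumption that one engineer spent those two weeks, at a hypothetical loaded cost of $130 an hour, belongs to this sketch alone. It also ignores any ongoing cost of maintaining the workaround.

```python
def hardware_vs_engineering(hardware_per_year: float, years: float,
                            engineer_weeks: float, hourly_cost: float,
                            hours_per_week: float = 40) -> dict:
    """Compare a recurring hardware cost with a one-time engineering fix.

    Ignores any ongoing cost of maintaining the fix; add it if it applies.
    """
    hardware = hardware_per_year * years
    engineering = engineer_weeks * hours_per_week * hourly_cost
    cheaper = "engineering" if engineering < hardware else "hardware"
    return {"hardware": hardware, "engineering": engineering, "cheaper": cheaper}


# $50,000/year and two weeks are from the text; one engineer at $130/hour is assumed.
print(hardware_vs_engineering(50_000, years=1, engineer_weeks=2, hourly_cost=130))
# → {'hardware': 50000, 'engineering': 10400, 'cheaper': 'engineering'}
```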

The hardest part of all is the part when you, as an engineer, need to make a
compelling case to your managers, without sounding like you are whining or
trying to get out of doing your job.

And if you’re interested in a free TCO assessment to see how much you can save
with Confluent Cloud, let us know and we’d be happy to provide that for you.

To learn more, check out the Cost-Effective page as part of Project
Metamorphosis.


FURTHER READING

 * Project Metamorphosis Month 2: Cost-Effective Apache Kafka for Use Cases Big
   and Small
 * Unifying Streams and State: The Seamless Path to Real Time
 * Reducing the Total Cost of Operations for Self-Managed Apache Kafka
 * Measuring TCO: Apache Kafka vs. Confluent Cloud’s Managed Service

Gwen Shapira is an engineering leader at Confluent. She has over 15 years of
experience working with code and customers to build scalable data architectures,
integrating relational and big data technologies. Gwen is the author of
“Kafka—The Definitive Guide” and “Hadoop Application Architectures,” and she is
a frequent presenter at industry conferences. Gwen is a PMC member on the Apache
Kafka project and a committer on Apache Sqoop. When Gwen isn’t building data
pipelines or thinking up new features, you can find her pedaling on her bike
exploring the roads and trails of California, and beyond.

Joshua Buss is a site reliability engineer who’s worked on Confluent Cloud since
its infancy, focusing first on security, later on observability and analytics,
and most recently cloud costs. He’s a cloud veteran and a hardware geek who went
through the DevOps movement, wrote articles for AnandTech, and has worked in the
travel, finance, and ad tech industries. When not optimizing Confluent Cloud,
Joshua enjoys spending time with his family, friends, and cat Pixel.

