
Beware these Surprises in Datadog Pricing

Originally published at signoz.io, written by Nočnica Mellifera.

Datadog has a huge product footprint with a sophisticated user experience, but any discussion of its usefulness must include a consideration of its significant costs. Your business is probably not at the scale that generates a $65 million bill, but it is entirely possible to generate bills that rival your operations bills.

Kelly Sommers is saying something we all know: unchecked, Datadog bills can equal your operations bills.

This piece explores two ways that Datadog's pricing often turns out much larger than expected for small and mid-size engineering teams. The first is the per-host pricing that hits microservice architectures; the second is custom metrics, which can quickly get out of control and inflate your Datadog bill.

Datadog's pricing is tied directly to the number of hosts monitored. In dynamic environments, especially those built on microservices, the number of hosts can fluctuate significantly, which makes costs hard to predict. The per-host pricing model is particularly challenging for architectures that rely heavily on microservices. Here's why:

In a microservices architecture, services are often distributed across multiple hosts for scalability and fault tolerance. Each host running part of a service counts toward the per-host price, increasing the overall cost. Some users report that Datadog costs more to monitor an AWS t3.medium instance than the instance itself costs to run. This pricing structure favors larger, higher-resource instances, which often don't make sense for a proper microservice system. Even with straightforward, medium-sized AWS instances, observability costs can outstrip infrastructure costs. It's generally not great to be changing your architecture just to make observability affordable.

Microservices often use dynamic scaling to handle varying loads, which leads to a fluctuating number of hosts and makes costs difficult to predict. Further, in some containerized environments you have a choice between losing the accurate hostname and accepting that every auto-scaled host has a direct pricing impact.

Microservices are designed to do one thing well, meaning you might have many small services running on separate hosts. This granularity inflates the number of billable hosts.

Development and testing environments often mirror the production setup to some extent, adding hosts that also count toward the price. For development, test, and staging environments, a self-hosted solution often makes sense: these are private, internal, low-volume hosts. That option doesn't exist with Datadog, so if you want to monitor these pre-production environments you'll have to pay the full per-host rate and send your data to Datadog's servers.
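As a rough illustration of the per-host problem, here is a back-of-the-envelope sketch in Python. The prices are assumptions made up for the example, not quoted Datadog or AWS figures, and the point is only the shape of the math: the same total compute split across more, smaller hosts costs about the same to run but much more to monitor under flat per-host pricing.

```python
# Back-of-the-envelope sketch: same total compute, progressively smaller hosts.
# All prices are illustrative assumptions, not quoted Datadog or AWS figures.

DATADOG_PER_HOST_MONTHLY = 15.00   # assumed flat per-host monitoring price
COMPUTE_PER_VCPU_MONTHLY = 15.00   # assumed compute cost per vCPU per month

def bill(total_vcpus: int, vcpus_per_host: int) -> None:
    hosts = total_vcpus // vcpus_per_host
    compute = total_vcpus * COMPUTE_PER_VCPU_MONTHLY
    monitoring = hosts * DATADOG_PER_HOST_MONTHLY
    print(f"{hosts:>3} hosts of {vcpus_per_host:>2} vCPUs: "
          f"compute ${compute:>7.2f}, monitoring ${monitoring:>7.2f} "
          f"({monitoring / compute:.0%} of compute)")

# 64 vCPUs of capacity, sliced into smaller and smaller hosts,
# as microservice architectures tend to do.
for size in (16, 8, 4, 2):
    bill(64, size)
```

The smaller you slice the same workload, the larger the monitoring bill becomes relative to the compute bill, even though total capacity is unchanged.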
A number of comments across Reddit and Hacker News mention that Datadog's billing for custom metrics is surprisingly high. At first, ten cents per hundred ingested custom metrics seems more than fair: after all, when I'm manually adding calls in my code to send custom metrics, it would take me hours and hours to add even a few dozen new metric names, so surely I'm just paying a few bucks for custom metrics, right? In reality this cost can get out of control easily through two very common patterns.

The first pattern is dynamic keys. We're often encouraged to start with custom metrics by manually adding calls to create them, which seems like an inherently limited technique. But what happens if I start sending a metric like {user.path:status.code}? If the key is dynamic, I can create a huge metric space very quickly. Ideally I'd limit this somewhat to make sure my data is useful, but there are plenty of reasons to store a space of tens of thousands of possible keys. Worse, there's no built-in system in Datadog to clamp metric explosion automatically.

This is a significant issue because, in general, it's very bad when an operations team is looking at a big bill and the only way to improve the situation is to commit code changes to the core application. Not only is this a difficult feedback process, it's another case where we end up punishing our coders for trying to add observability data. It would be much better if there were some ops-level configuration to clamp cardinality, as there is with the OpenTelemetry Collector.

The second pattern is even more surprising than our own custom metrics' ability to swell observability bills: integrations with other tools. The interaction between plug-in and integration code and your own production data can cause custom metric counts to increase significantly. The list of integrations that send unpredictable numbers of custom metrics by default is quite long.

Integrations that send custom metrics by default, with no built-in limits: Nagios, PDH Check, OpenMetrics, Windows performance counters, WMI, Prometheus.

Integrations that send custom metrics if so configured, with no built-in limit on the number of keys reported: MySQL, Oracle, Postgres, SQL Server.

Finally, cloud integrations with AWS can also send custom metrics. It's hard to imagine running a modern service without using some of the resources listed above, and without wanting to monitor them with your observability tool.

In a theoretical example with custom metrics, we often start with a single call that generates a large number of metrics because the space of possible keys is large. In reality, the situation is much worse if you have a modern microservice architecture. We've discussed above how host-based pricing penalizes smaller, lighter containers. With custom metrics we get another nasty shock: custom metrics are counted by unique combinations of keys, values, and tags. One of the most basic tags is the host, so every single container that generates custom metrics is creating unique metrics and will quickly impact your budget.

It would be tempting to sample your data and send distribution metrics instead. But since Datadog charges not by the number of metric values but by their keys, sending distribution metrics increases the number of metrics five-fold, because each metric now also includes count, sum, min, max, and avg.
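To see how quickly these multipliers compound, here is a minimal, self-contained sketch that counts distinct billable series the way described above: one per unique combination of metric name, tag values, and reporting host, multiplied again by the aggregations a distribution metric reports. The endpoint, status code, host, and aggregation counts are made-up numbers for illustration.

```python
from itertools import product

# Count distinct billable custom-metric series under "unique tags + host" accounting.
# All inputs below are illustrative assumptions, not measurements from a real system.

routes = [f"/api/resource/{i}" for i in range(10)]   # dynamic key: one per endpoint
status_codes = ["200", "404", "500"]                 # dynamic key: one per status code
hosts = [f"host-{i}" for i in range(5)]              # every host tags its own series

# Extra series created when the metric is reported as a distribution with percentiles.
aggregations = ["count", "sum", "min", "max", "avg", "p50", "p75", "p90", "p95", "p99"]

plain_series = set(product(routes, status_codes, hosts))
distribution_series = {series + (agg,) for series in plain_series for agg in aggregations}

print(f"plain series:        {len(plain_series)}")         # 10 * 3 * 5  = 150
print(f"distribution series: {len(distribution_series)}")  # 150 * 10    = 1500
```

One metric name, three modest tag dimensions, and a distribution rollup is already well past a thousand billable series.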
The Datadog documentation site is quick to point out that you will only be charged for custom metrics if you purchase the appropriate SKU, and without that SKU you can send 300 custom metrics for free (well, included in the cost of your other products, but still at no additional cost). Let's review a scenario to see how a single latency metric can use up all 300 of those metrics.

You're a smart, in-the-know SRE who wants to add latency metrics for an unsupported framework. You know that Datadog custom metrics are expensive, so you keep things simple: just a route with a status code for the key, and a latency measurement as the value. You group route and login status so the actual metric is {user.login-status.200-latency:20}, a nice, compact metric. There are only ten different endpoints, so you know things will be well clamped. For safety's sake, and in case some reports are dropped, you report distribution and percentile metrics as well. You add the few lines of code to report your metrics and head off for the weekend. When you come in on Monday there are three voicemails from Datadog saying you need to add a SKU because you reported over 300 metrics in the last day. What happened?

All Datadog custom metrics are tagged by host, so you were actually reporting {hostA-user.login-status.200-latency:20}. You have five hosts running services instrumented with Datadog, so instead of reporting 10 metrics, one for each of your endpoints, you're actually reporting 50 metrics, ten for each host. And since you responsibly reported distribution values to make your life easier, you're also reporting p50, p75, p90, p95, and p99 plus count, sum, min, max, and avg for every combination of host and endpoint. Your single added metric is now 10 endpoints, times 5 hosts, times 10 summary metrics: 500 metrics. Without Datadog's custom metrics SKU you just exhausted all 300 custom metrics at once. Do you have more than five hosts running your code? And in this example every request returned a status code of 200; just a few more status codes would multiply the problem further.

There are a variety of possible reasons why the pricing for Datadog's custom metrics is so high. It may be that their data backend struggles with high-cardinality keys. A slightly more sinister explanation has to do with vendor lock-in. While integrations and other plug-ins can produce custom metrics, the primary way these metrics are created is by inserting custom calls into your application code that send metrics to the Datadog agent.

These custom metric calls within your code do not use any open standard or library, so they can only ever report data to the Datadog agent. When you start reporting custom metrics, you are signaling to Datadog that you're locked in and planning to stay with Datadog as your observability platform for the long term. This makes your organization less sensitive to price increases.

OpenTelemetry offers a cost-effective alternative to Datadog. As an open-source project, it eliminates the licensing costs associated with proprietary solutions like Datadog. It also reduces vendor lock-in, offering greater flexibility to switch between observability solutions without a massive overhaul of the existing setup. Companies can store their observability data in cost-efficient backends and choose from a range of open-source storage and visualization tools like SigNoz, Grafana, or Loki.

To go further, read a comprehensive price breakdown of how Datadog can cost 9x more than OpenTelemetry and SigNoz.
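As a sketch of what vendor-neutral instrumentation looks like in practice, here is a minimal example using the OpenTelemetry Python metrics API. The service name, metric name, and attributes are hypothetical; the point is that the application code never names a backend, so switching from one observability platform to another is an exporter-configuration change rather than a code change.

```python
from opentelemetry import metrics

# Vendor-neutral instrumentation: the code only talks to the OpenTelemetry API.
# Which backend receives the data (SigNoz, Grafana, or anything else accepting
# OTLP) is decided by the exporter configured in the SDK at deployment time.
# Without an SDK configured, these calls are valid no-ops.
meter = metrics.get_meter("checkout-service")  # hypothetical service name

request_latency = meter.create_histogram(
    name="http.request.latency",
    unit="ms",
    description="Latency of handled HTTP requests",
)

def record_request(route: str, status_code: int, elapsed_ms: float) -> None:
    # Attributes play the role Datadog tags play; their cardinality can be
    # limited later in an OpenTelemetry Collector pipeline, at the ops level.
    request_latency.record(elapsed_ms, {"http.route": route, "http.status_code": status_code})

record_request("/user/login", 200, 20.0)
```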