Observability Suite in 2025

What should your observability suite do for you?

Dec 10, 2024

Dear Technology Leader,

I guess it's the right time to jot down my expectations from an observability suite in 2025. After all, a month should be enough for you to fix the gaps. In 2023-24, observability tools grew by leaps and bounds, and people finally started looking at observability as something more than just instrumentation. OpenTelemetry already looks like a very strong alternative to proprietary tools. We've all gained a lot from this change, and 2025 should bring more beneficial changes. Here's what I think your observability suite should look, feel and behave like.

Your observability suite should allow you to see all production signals (logs, traces and metrics) in one place. Working across multiple tools leads to longer detection and resolution times (MTTD and MTTR); correlating is easier when all signals are visible on one plane.
Your observability suite should monitor all sources of change to the production environment. If you have to find out about a production deployment by calling someone when an alert pops up, or call a vendor to find out whether they have an outage going on, that just adds to MTTD and to human dependencies. At a minimum, your tools should track all infra changes, deployments and vendors' uptime statuses, along with application signals.
Your observability suite should store every signal as a wide event: basically, record all relevant metadata along with the value of said signal. If you have to go outside the tool to work out why an issue on only one OS version or only one node is dropping your time series (orders, transactions, views, etc.), you are not on the right platform. *This is often described as support for high-cardinality data.* Additionally, you should be able to add an arbitrary number of attributes as metadata for different categories or instances of signals.
Your observability tools should be fast, with sub-100ms response times for most dashboards and searches. If dashboards take a minute to load, or alerts take just as long to trigger, engineers are already wasting precious time.
You should be paying for your observability tool based on consumption. Pricing per headcount or per host pushes you to adopt suboptimal strategies, like giving only half of your staff access, or installing agents on a third of your machines. You should be able to track data storage expenses in real time.
Your observability suite should be able to perform well regardless of how much data you push into it. You should be able to decide whether you want to store data in queryable form for 7 days or 7 years. And you should have a choice to store different streams of data for configured time periods.
Your observability suite should promote real-time collaborative debugging with simple context-sharing tools like permalinks, time- and signal-based visual markers/bookmarks, and knowledge base integration with dashboards.
Your observability suite should be able to trace events seamlessly across complex distributed systems (if you are putting one in production), across both synchronous and asynchronous/evented boundaries.
Your observability suite should implement OpenTelemetry standards, thus allowing you to switch to a tool of your choice at any time instead of being locked in to one tool.
Your observability suite should help you set up agents and add new services with an ever improving list of defaults and alerts, thus helping you implement learnings from each incident.
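
To make the wide-event point above a little more concrete, here is a minimal Python sketch. It is only an illustration: the event fields (`os_version`, `node`, `duration_ms`), the sample data and the `error_rate_by` helper are all made up for this example and don't come from any particular tool. The idea is that when every event carries its metadata, you can slice by any attribute after the fact and see where the failures cluster.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, carrying every
# attribute we might later want to slice by (all names are assumptions).
events = [
    {"service": "checkout", "status": 200, "duration_ms": 87, "os_version": "iOS 17.4", "node": "node-a"},
    {"service": "checkout", "status": 200, "duration_ms": 92, "os_version": "iOS 17.4", "node": "node-b"},
    {"service": "checkout", "status": 500, "duration_ms": 431, "os_version": "Android 14", "node": "node-a"},
    {"service": "checkout", "status": 503, "duration_ms": 512, "os_version": "Android 14", "node": "node-b"},
    {"service": "checkout", "status": 200, "duration_ms": 101, "os_version": "Android 13", "node": "node-a"},
    {"service": "checkout", "status": 200, "duration_ms": 79, "os_version": "iOS 17.4", "node": "node-b"},
]


def error_rate_by(events, attribute):
    """Group events by an arbitrary attribute and return the share of 5xx responses per value."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event.get(attribute, "unknown")
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Slicing by node shows nothing unusual; slicing by OS version
# shows that every failure comes from a single version.
by_node = error_rate_by(events, "node")
by_os = error_rate_by(events, "os_version")
```

In this toy data set both nodes have the same error rate, while one OS version fails on every request. That is the kind of question you should be able to answer inside the tool, without exporting anything.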

Wishlist

Once you have the basics in place, you should be ready to invest in additions to your suite that further improve MTTR, like the following.

Security event monitoring - similar to the first point above, going through multiple observability tools and systems wastes time. Security incidents often cause downtime, and hence security events should be part of the comprehensive observability suite.
Observability suite as the source of truth for various compliance audits, thereby reducing time and staff cost needed to perform said audits.
Predictive intelligence - tools that help you identify anomalies, and identify and communicate possible root causes.
Intelligent alerting - tools that help you reduce unnecessary alerting and alert fatigue, for example by giving you reports on which alerts haven’t been acted upon for a long time and can be deleted.
Natural language querying and voice interfaces - need I say more than "what's the 95th percentile response time for User Service?", or even "show me which services are unhealthy right now?"
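
As a rough illustration of the intelligent alerting item above, here is a small Python sketch of a "stale alerts" report. Everything in it is an assumption made for the example: the alert record shape (`fired_count`, `last_acted_on`), the sample alert names and the 90-day threshold. A real tool would pull this from its own alert history.

```python
from datetime import datetime, timedelta

# Hypothetical alert history: how often each alert fired and when
# someone last acted on it (None means nobody ever did).
alerts = [
    {"name": "disk-usage-high", "fired_count": 42, "last_acted_on": datetime(2024, 6, 1)},
    {"name": "checkout-5xx", "fired_count": 7, "last_acted_on": datetime(2024, 12, 2)},
    {"name": "cpu-above-80", "fired_count": 130, "last_acted_on": None},
]


def stale_alerts(alerts, now, max_idle=timedelta(days=90)):
    """Return names of alerts that keep firing but were not acted on within max_idle."""
    stale = []
    for alert in alerts:
        acted = alert["last_acted_on"]
        never_or_long_ago = acted is None or now - acted > max_idle
        if alert["fired_count"] > 0 and never_or_long_ago:
            stale.append(alert["name"])
    return stale


# Candidates for deletion or re-tuning as of the given date.
report = stale_alerts(alerts, now=datetime(2024, 12, 10))
```

Here the report would flag the two alerts nobody has touched in over 90 days, which are the ones worth reviewing first.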

What’s on your wish list?
