When building applications (particularly under a product-driven methodology), every programmer, architect, and QA engineer understands the importance of tech investments and of having a tech debt management process. Business, especially one that considers itself modern or advanced, is also familiar with these concepts and even uses them from time to time.
In our company, resources are distributed in the following proportions: 80% goes to product tasks and 20% to tech tasks (to be honest, the distribution is a bit more complex and depends on the “maturity” of the product, but let’s omit those details).
However, in reality, this distribution is often disrupted by unforeseen challenges such as missed deadlines, underestimated workloads, and unexpected urgent tasks that were not accounted for during the planning sessions. Consequently, the team’s architect finds themselves disheartened as tech roadmap tasks get delayed and pushed further into the future.
You can understand the product office; they are responsible for business metrics, while tech metrics like throughput, code coverage, or code quality aren’t comprehensible or interesting to them at all. In that case, how can we show their significance?
Unfortunately, businesses recognise and value only product and business metrics. Therefore, it falls upon the development team, whether tech leads or architects, to translate these complex tech metrics into tangible monetary value. Allow me to share our experience through two examples: one involving issues with application fault tolerance and another related to our infrastructure’s readiness to meet business expectations.
For context: our company provides SaaS for e-commerce solutions to other companies.
Example 1: The Weakest Link or Fragile Infrastructure
As our application grew, we made compromises due to limited context and strict deadlines. One day we realised we had ended up in a situation where our Redis had transformed from a cache into a single point of failure (don’t ask why). The virtual machine had an uptime of around six years, but (thanks to Murphy’s law) it eventually failed, resulting in a 20-minute API downtime.
But what does a 20-minute downtime mean for the business? The API couldn’t process requests for a short while, a few users couldn’t complete transactions, but after 20 minutes everything returned to normal. Why worry? Let’s keep building new features.
To understand the impact of this incident on business, we need to quantify it in monetary terms. We considered two components:
- The number of active users who encountered errors. By analyzing the conversion funnel, we could estimate the amount of lost revenue.
- Checking the Service Level Agreement (SLA) with partners. For instance, we had the following SLA (see image below) in the agreement with one of our enterprise partners. 20 minutes of downtime in a month already means 0.046% unavailability, i.e., the error budget is almost completely consumed by a single incident, after which our fee is decreased by 0.3%. Knowing the historical revenue from this partner, it is easy to calculate how much money we would lose.
Additionally, reputational risks can be acknowledged, although quantifying them in monetary terms can be challenging.
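The SLA arithmetic above fits in a short back-of-the-envelope script. All the figures below (the error budget threshold, the penalty percentage, the partner’s revenue) are illustrative placeholders, not our real contract terms:

```python
# Illustrative placeholders only; real values come from the partner agreement
# and historical revenue reports.
MINUTES_PER_MONTH = 30 * 24 * 60       # 43,200 minutes in a 30-day month
ERROR_BUDGET_PCT = 0.05                # allowed monthly unavailability, % (assumed)
FEE_PENALTY_PCT = 0.3                  # fee reduction once the budget is exceeded, %
monthly_partner_revenue = 500_000      # historical revenue from the partner (assumed)

downtime_minutes = 20
unavailability_pct = downtime_minutes / MINUTES_PER_MONTH * 100

budget_consumed = unavailability_pct / ERROR_BUDGET_PCT
penalty_if_exceeded = monthly_partner_revenue * FEE_PENALTY_PCT / 100

print(f"Unavailability: {unavailability_pct:.3f}%")               # 0.046%
print(f"Error budget consumed: {budget_consumed:.0%}")            # 93%
print(f"Fee penalty once the budget is exceeded: ${penalty_if_exceeded:,.0f}")
```

A single 20-minute incident eats roughly 93% of the assumed monthly error budget, so one more short outage would trigger the fee reduction.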
To convince the business, it is crucial to estimate the risk of future occurrences. In our case, the Redis instance had its first failure in six years. Despite the difficulty in quantifying this risk, emphasizing the potential catastrophic consequences of such an event was enough to persuade the business to allocate resources for a new Redis cluster in the next quarter.
Example 2: Anticipating a Bright Future
At one point, we observed that our API’s traffic was approaching its maximum capacity. Metrics revealed a twofold increase in traffic during the first two months of the year. This led to the question: “What should we do? Panic or remain calm?”
Gathering current numbers
We provide a backend for partner integrations, so in the simplest case we can check the usage metrics of each of our API methods by each partner and group these metrics by integration type (e.g., partners selling physical merchandise generate much less traffic) and partner size (mid-tier, enterprise, etc.).
That’s why it is important to have a good monitoring system that not only allows for the collection of real-time metrics but is also capable of storing them for historical analysis.
In our company, we use New Relic for application metrics and Prometheus + Grafana for infrastructure metrics.
After the analysis, we understood how much load each partner generates on our API. Looking at infrastructure metrics, we could see whether there would be bottlenecks in different parts of our system (databases, message brokers, etc.) during horizontal scaling.
The next step is to bring the figures for maximum throughput up to date, which means performing load testing; it can be executed, for example, with the help of k6.
The most important thing is to record all incidents related to this problem. In our case, we hold postmortems for the cases when, during a marketing activity, a sale, etc., traffic on our application was so high that we started to cut it with rate limits (or worse, our SLOs, such as response time or transaction throughput, started to deteriorate). If the application went down, it is possible to calculate the losses during that downtime; in our situation, we could calculate losses from the number of rejected requests. It’s a good idea to put concrete monetary losses in the report instead of dry downtime numbers.
These incidents will be an additional argument when you are persuading the business side that the problem is real.
After obtaining the current numbers, we aligned them with the business plans for the year. While it can be challenging to formalize these plans, it is crucial to gather and update the necessary information. We specifically focused on new partner integrations and the anticipated load generated by different integration types.
To present our case effectively, we built a calendar highlighting months where new integrations would not be feasible, excluding peak activities like sales. By considering the forecasted revenue for each integration, we could calculate the potential lost profit.
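The calendar-based lost-profit estimate can be sketched like this. The months, partner names, and revenue forecasts are hypothetical examples, not our actual plans:

```python
# Hypothetical data: months where onboarding new integrations is not feasible,
# and the forecasted monthly revenue of each planned integration.
blocked_months = {"Mar", "Apr", "Nov"}   # capacity exhausted or peak sales activity
planned_integrations = [
    {"partner": "new-enterprise", "start": "Mar", "monthly_revenue": 80_000},
    {"partner": "new-midtier",    "start": "Jun", "monthly_revenue": 15_000},
]

# Revenue from integrations that would have to be delayed.
lost_profit = sum(
    p["monthly_revenue"]
    for p in planned_integrations
    if p["start"] in blocked_months
)
print(f"Potential lost profit from delayed integrations: ${lost_profit:,}")
```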
Lastly, we brainstormed and analysed possible solutions. There is no point in presenting our findings to the business at this stage, because it is not yet clear how many resources we need or what profit we can promise in the end.
That is why we wrote down all available options, estimated them, highlighted the risks, and predicted the results.
There is a temptation to trick the product office and leave out solutions that are not beneficial for the developers themselves (e.g., cheap ones that lead to an enormous number of crutches and smelly code). Yes, this tactic can work; however, in my opinion, it negatively affects trust between the product and development departments.
Trusting relationships are priceless because they lead to the business side recognising the importance of tech tasks, so, in the end, capacity will be allocated more willingly.
What’s the result?
By following these strategies, we successfully presented the problems and suggested solutions to the higher management levels, including business heads. Through constructive discussions and persuasive arguments, we secured the allocation of two senior developers dedicated to these tasks for the upcoming quarter, without compromising feature development or partner support.
Here is the brief summary:
- It is vital to have a good monitoring and analytics system for tech metrics (not only product metrics). It is impossible to convince someone without arguments. For example, your case is much clearer if you can present a comparison of development time in certain components (or onboarding time into the application) against code quality metrics.
- You need to convert the profit from tech tasks into understandable business metrics and make a comprehensive presentation. For example, “update the framework/software version” means little to the business side. Tell them instead that without the update you won’t pass certification next year because of deprecated software, or that critical vulnerabilities were detected in the current version and there is a risk of data compromise.
- Log all incidents with their causes and consequences, and highlight the tasks that could have prevented them but hadn’t been done. This will be an additional argument.
- You should be ready to take the initiative and dig out the necessary information by yourself (interrogate product owners!): examples of SLA in agreements with partners, plans for new integrations for different products, etc.
Sometimes, when I follow these pieces of advice, I realise that the task we wanted to do turns out to be less important than we actually thought, and it is not clear why we tried to force its implementation :)
Since ancient times, developers and managers have been opposed to each other, starring as the main characters of various memes. I have shared my experience of communicating with the product office. And how are you doing on this matter? How do you achieve mutual understanding?
The author used AI tools while writing this article. Although the initial idea and structure were their own, inspired by real-life experience of working in IT, AI was utilised to make the text grammatically correct with varied vocabulary.