Effective IT environment. How to boost system performance without spending a fortune on new servers?

Errors, lags, slow response times, frequent and prolonged downtimes – IT systems “age”, too, especially those that have been saddled with new functionalities over the years. Performance will often drop in the absence of adequate optimisation. And yet, instead of looking for possible optimisation, we often invest in expensive solutions – servers, licences – which only mask the mounting issues. What is the alternative to this never-ending infrastructure expansion?

Performance engineering experts know from experience that the existing IT environment is usually sufficient for the system to work well and without major obstacles. The problem is that we pay too little attention to the resources we already have and don’t know how to use them well.

We develop software with a focus on user experience and often forget that the number of users is gradually going to grow. We integrate solutions without analysing the environment. As a result, our actions generate bottlenecks that disrupt our systems and reduce their performance. If we don’t understand what the problem is, we might get the erroneous impression that the problem lies with servers and maintenance.

IT performance vs. complexity

System development begins with the delivery of the first basic functionalities, when the number of users is still quite low. At this stage, the original developers are usually still involved, and they know what was coded where and how everything works. Business pressure is also relatively low, because the system has not yet been deployed on a wider scale and does not yet generate significant revenue. There is still room for experimentation, short downtimes or slower operation.

However, as we achieve business success, two things happen:

  • the system begins to operate in a certain environment and changes are needed, not just in connection with planned, internal work, but also due to external variables (e.g. vulnerabilities are detected, new versions of libraries/frameworks are released, new development teams get down to work, etc.). Changes begin to accumulate, not just in the app layer, but in practically the whole, increasingly complex IT environment;
  • the system begins to be important for business, so every problem/downtime can be costly.

A dilemma arises: should we push on and develop the system at the risk of errors, or should we follow the rule that says: if it works, don’t fix it? Especially since many developer teams tend to focus on functionalities alone (which is their job after all), ignoring the performance aspect (how many users the functionality is meant to serve). In the end, we get an excessively complex system that is at the same time critical for business and revenues. And this complexity triggers more threats to performance:

  • by resolving one bottleneck, we risk uncovering an even bigger problem somewhere else;
  • we begin to have a tendency to optimise locally, instead of globally;
  • we lack a comprehensive, end-to-end (e2e) picture of the system, especially with enterprise solutions.

Performance engineering – mechanism and effects

Performance engineering focuses on identifying and eliminating bottlenecks that may have a negative impact on system performance, as well as optimising the use of existing resources. The main tasks include analysing loads, optimising code and architecture, identifying performance problems and using appropriate tools for monitoring and measuring system parameters.

As such, it is an extremely important element of IT infrastructure management, especially in a dynamic and demanding business environment.

The main actions taken as part of performance engineering include:

  • Audit/Tuning – a regular system check, which produces a list of recommendations and enables “treatment” (a backlog of issues and their management plan). An audit should be performed regularly so as to catch problems early, before they have a chance to grow and wreak havoc. We can use a checklist like this:
    • understanding Actors and traffic patterns (users/processes/API consumers);
    • analysing top transactions and their profiles and trends in different layers;
    • analysing layers (resource utilisation: CPU, I/O, network, JVM/GC, pools, etc.).
  • Troubleshooting – resuscitating the system to eliminate the problem (a reactive action, but one that still follows the planned process). Suppose we need to react to a report of a system slowdown: we look at resource saturation, check user threads, move on to the next system, re-check resources and threads, and continue until we finally reach, e.g., a database that lags. We don’t guess what the problem might be; we follow the rule: monitoring first, hypothesis later (or never?).
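As a minimal sketch of the "actors" and "top transactions" items in the checklist above, the snippet below ranks request samples by the total time they consume. The record layout (`actor`, `transaction`, `duration_ms`), the function name `rank_by_total_time` and the sample values are all hypothetical; in practice the data would come from whatever monitoring tool is in place.

```python
from collections import defaultdict


def rank_by_total_time(samples, key, limit=5):
    """Rank groups of request samples by the total time they consume.

    samples: iterable of dicts with the (hypothetical) fields
             'actor', 'transaction' and 'duration_ms'.
    key:     field to group by, e.g. 'actor' or 'transaction'.
    Returns up to `limit` tuples (group, calls, total_ms, share_of_total),
    sorted by total_ms in descending order.
    """
    calls = defaultdict(int)
    total_ms = defaultdict(float)
    for sample in samples:
        group = sample[key]
        calls[group] += 1
        total_ms[group] += sample["duration_ms"]

    grand_total = sum(total_ms.values())
    ranked = sorted(total_ms, key=total_ms.get, reverse=True)[:limit]
    return [
        (g, calls[g], total_ms[g], total_ms[g] / grand_total if grand_total else 0.0)
        for g in ranked
    ]


# Made-up samples, for illustration only.
SAMPLES = [
    {"actor": "partner_api", "transaction": "search_offer", "duration_ms": 900},
    {"actor": "partner_api", "transaction": "search_offer", "duration_ms": 1100},
    {"actor": "call_centre", "transaction": "create_sale", "duration_ms": 300},
    {"actor": "call_centre", "transaction": "search_offer", "duration_ms": 500},
    {"actor": "batch_job", "transaction": "export_report", "duration_ms": 200},
]

if __name__ == "__main__":
    for row in rank_by_total_time(SAMPLES, key="transaction"):
        print(row)
    for row in rank_by_total_time(SAMPLES, key="actor"):
        print(row)
```

Grouping the same samples once by transaction and once by actor gives two views of the same load: which operation is expensive, and who is generating it.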

By implementing performance engineering processes, we can get the following benefits:

  • IT infrastructure cost optimisation. Following an audit/tuning, we often find solutions that allow us to cut maintenance or licensing costs. This is particularly important if we’re thinking of migrating our systems to the cloud, where we pay for the computing power we actually use;
  • Better user experience (faster, more efficient systems);
  • Greater revenues/margins (owing to the first two factors);
  • Green IT. Lower resource use and a reduced carbon footprint, in line with ESG initiatives: an optimised infrastructure means fewer active processors and lower energy use.

Better system performance – how to avoid spending millions

The advantages of performance engineering are best shown using an example. In our case, that would be an actual case study: a system we designed to support our client’s sales and product management through an internal Call Centre. At one point, the system API was made available to an external partner with a view to creating a new partner channel. The project turned out to be a great success and increased our client’s sales considerably.

On the other hand, however, the system was now practically impossible to work with during lunch hours. The Call Centre couldn’t create new sales or provide customer service, because the system crashed continually. Maintenance teams were soon alerted to the problem and the first idea they had was to invest in more infrastructure, i.e. processors.

But, of course, an investment in one layer entails costs in others: if you invest in processors, you will need to invest in licences, etc. In addition, time was of the essence: it turned out that there was no space left on the virtual machine, which meant new servers had to be bought, and the whole project ballooned to a whopping three months’ worth of work.

The initial success and higher sales now became a problem. This is where we stepped in: our role was to look for a less obvious alternative that would be cheaper and faster to deploy.

Performance engineering – case study

As part of this performance engineering project, we began by analysing traffic (internal and external, generated by partners) and the components with the greatest loads. We found that the most saturated layer was the CPU on the database.

Problems in one layer caused issues in others, so we got down to eliminating the bottlenecks one by one (we reviewed code synchronisation, tuned thread and connection pools, and analysed SQL queries). We then introduced mechanisms to separate internal and external traffic and did some tuning to reduce resource use and restore the system to normal operation under expected loads.

The approach was iterative: actors -> quotas -> top transactions -> pools -> code synchronisation -> repeat.
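One way to picture the "quotas" step is a per-actor concurrency limit, so that a burst from one channel cannot take every worker away from another. The sketch below is an illustration only, not the mechanism used in the project: the class `ActorQuota`, the exception `QuotaExceeded`, the actor names and the limits are all hypothetical.

```python
import threading
from contextlib import contextmanager


class QuotaExceeded(Exception):
    """Raised when an actor already uses all of its concurrent slots."""


class ActorQuota:
    """Caps the number of requests each actor may run at the same time."""

    def __init__(self, limits):
        # limits: dict mapping actor name -> maximum concurrent requests
        self._slots = {
            actor: threading.BoundedSemaphore(limit)
            for actor, limit in limits.items()
        }

    @contextmanager
    def slot(self, actor):
        semaphore = self._slots[actor]
        # Fail fast instead of queueing, so one actor's backlog
        # does not hold on to resources that others need.
        if not semaphore.acquire(blocking=False):
            raise QuotaExceeded(actor)
        try:
            yield
        finally:
            semaphore.release()


if __name__ == "__main__":
    # Hypothetical limits: external traffic gets fewer slots than internal.
    quota = ActorQuota({"partner_api": 1, "call_centre": 2})
    with quota.slot("partner_api"):
        try:
            with quota.slot("partner_api"):
                pass
        except QuotaExceeded as exc:
            print("rejected:", exc)
        # Internal traffic is unaffected by the saturated partner quota.
        with quota.slot("call_centre"):
            print("call centre request served")
```

Rejecting excess requests immediately is a design choice made here for brevity; a real system might queue them with a timeout or return a "retry later" response instead.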

Together with our client’s maintenance team, our two performance engineering experts took just three weeks to restore the app to normal operation during lunch hours (reducing CPU loads from 100% to 60%). All the necessary changes were made in the evening/at night so as not to disrupt normal business activities. This allowed the client to avoid unnecessary investment in infrastructure, which would have amounted to several million zlotys (per year).

In addition, after analysing all the app layers (system, operation, virtualisation, databases), we found several components that significantly reduced performance (e.g. default app server settings). We introduced changes that created enough headroom to support approximately two years of linear traffic growth. This meant huge savings and allowed high app performance to be restored.
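To make the idea of headroom under linear growth concrete, here is a back-of-the-envelope sketch. It assumes that peak CPU utilisation rises by a fixed number of percentage points per month, which is a simplification. The function name and the growth figure in the example are made up, since the case study does not state a growth rate.

```python
def months_of_headroom(peak_pct, ceiling_pct, growth_pct_per_month):
    """Months until peak utilisation reaches the ceiling under linear growth.

    peak_pct:             current peak utilisation, in percent
    ceiling_pct:          utilisation we do not want to exceed, in percent
    growth_pct_per_month: assumed increase in percentage points per month
    """
    if growth_pct_per_month <= 0:
        raise ValueError("growth_pct_per_month must be positive")
    return max(0.0, (ceiling_pct - peak_pct) / growth_pct_per_month)


if __name__ == "__main__":
    # 60% peak as in the case study; ceiling and growth are assumptions.
    print(months_of_headroom(peak_pct=60, ceiling_pct=100, growth_pct_per_month=2))
```

The same calculation run before tuning, with the peak at 100%, gives zero months, which is why buying hardware looked like the only option at first.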

Generate savings in your business

Performance engineering is the key to effective IT environment management, allowing system performance to be boosted without unnecessary costs. If you have noticed issues related to system speed, instead of rushing to buy another server, book a consultation with our team.

We are an authorised Dynatrace partner with a sizable portfolio of performance engineering projects that have already generated significant savings for our clients.