Implementing Application Performance Management (APM) and DevOps requires change - and change can be painful and is often resisted. Here are some pointers on how to make your APM implementation project go smoothly.
You’ll probably find that stakeholders in IT and the business are delighted to hear that they will have reports on system performance, and that you already have a number of evangelists. However, perceived success often comes down to expectations, so we recommend getting a clear, documented set of requirements from the team upfront, describing what they want to see in the reports and how they want them delivered.
It’s a good idea at this point to form a view of your maturity level when it comes to APM. How are you finding out there’s a problem with your apps? Are your customers telling you, or are you getting alerts from other tooling? What is that other tooling? Make an inventory and classify the existing tools (monitoring for database, network etc). How many alerts are you getting? How easy is it to interpret them? Who is currently using them and how? What do you like / not like about them? Perform a gap analysis based on the capabilities described in section 3.
It’s unlikely your project will get off the ground without a solid business case, so identify one or two (probably business-critical) applications that are being significantly impacted by outages or performance issues. If you can calculate the revenue or profit lost during downtime, all the better.
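As a rough illustration of the downtime calculation, here is a minimal Python sketch. The function name, the hourly revenue figure and the outage durations are all hypothetical; substitute numbers from your own finance and incident records. It assumes revenue accrues evenly over time, which is rarely exactly true, so treat the result as an estimate.

```python
def downtime_cost(revenue_per_hour, outage_minutes, impact_fraction=1.0):
    """Estimate revenue lost during one outage.

    impact_fraction is the share of revenue affected (1.0 = full outage,
    0.5 = roughly half of transactions failing or abandoned).
    """
    if not 0.0 <= impact_fraction <= 1.0:
        raise ValueError("impact_fraction must be between 0 and 1")
    return revenue_per_hour * (outage_minutes / 60.0) * impact_fraction


# Hypothetical figures for one business-critical application.
REVENUE_PER_HOUR = 12_000          # assumed average hourly revenue
outage_minutes = [90, 45, 30]      # assumed outages over the last quarter

total_lost = sum(downtime_cost(REVENUE_PER_HOUR, m) for m in outage_minutes)
print(f"Estimated revenue lost to outages: {total_lost:,.0f}")
```

A partial degradation can be modelled by passing an impact_fraction below 1.0 for that outage.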
Additionally, you will, we hope, have baselined a number of quantifiable key metrics (MTTR, volume/severity of outages, number of people/time spent troubleshooting etc) and as part of your business case and project plan have some stated goals to achieve. Some of these may directly impact your SLAs and require a change in process - certain alerts for example may mean that your developers receive the pager call at 2am to tell them a severity one outage has occurred that requires their immediate attention. Everyone needs to be ready to respond to what your APM system can tell your organization.
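If you keep incident records with detection and resolution times, baselining MTTR can be as simple as the sketch below. The incident timestamps and the 25% improvement goal are made-up values for illustration, not recommendations.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Return the average (resolved - detected) duration as a timedelta.

    `incidents` is a list of (detected, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute a baseline")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum works on timedeltas
    )
    return total / len(incidents)


# Hypothetical incident log, e.g. exported from a ticketing system.
incidents = [
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 4, 0)),
    (datetime(2024, 1, 17, 14, 30), datetime(2024, 1, 17, 15, 30)),
]

baseline = mean_time_to_resolve(incidents)
goal = baseline * 0.75  # assumed project goal: cut MTTR by 25%
print(f"Baseline MTTR: {baseline}, stated goal: {goal}")
```

Recording the baseline before the agents go in gives you something concrete to compare against later.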
Here are some questions to ask yourself as you prepare and deploy your APM solution:
Have you defined how alerts are routed to the people named in your support and recovery process?
Has everyone who will receive, or have visibility of, the new alerts been made aware of what they are and how to interpret them?
If someone hears about your new APM solution and wants to receive alerts from it, how do they request this? What is your process for servicing this request?
Is there a process to swiftly on-board a new application?
Is the APM tool integrated with other corporate systems?
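One way to make the routing question concrete is to write the routing down as data. The following is a minimal sketch, not any vendor's API: the application names, severity levels and recipient groups are invented, and a real APM tool will have its own way of configuring this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    application: str
    severity: int  # 1 = most severe (assumed convention)
    message: str


# Hypothetical routing table: (application, severity) -> recipient groups.
ROUTES = {
    ("checkout", 1): ["dev-on-call", "ops-on-call"],
    ("checkout", 2): ["ops-on-call"],
}

# Anything without an explicit route lands in a shared queue, so no alert
# is silently dropped.
DEFAULT_ROUTE = ["ops-queue"]


def route(alert, routes=ROUTES):
    """Return the recipient groups for an alert, falling back to the default."""
    return routes.get((alert.application, alert.severity), DEFAULT_ROUTE)


print(route(Alert("checkout", 1, "Payment service not responding")))
print(route(Alert("search", 3, "Slow queries")))
```

Keeping the table in one place also gives you an obvious answer when someone asks to be added: it becomes a reviewed change to the table. On-boarding a new application means adding its rows.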
Educate and Evangelise
Your vendor may provide training materials, train-the-trainer or onsite/online training courses on your new APM tool. You’ll need training material for:
- Basic usage
- Advanced concepts (memory leaks, policies, dashboard creation, etc)
- Operations (alerts/events) training
- Reports interpretation for line-of-business and development
You need to make sure the people who will touch the product or consume the data have the information they need to be successful. Their success drives your success - you are all in this together.
Evangelise your new tool - broadcast your success and, wherever possible, quote quantified metrics (a percentage improvement in uptime, the new average MTTR etc). Your vendor should be both able and willing to work with you to monitor, record and socialise the success of the tool. They may even ask you to be a reference - supporting the tools you love helps ensure their longevity and is great for your career development.
For every problem you solve with your new APM tool take a few minutes to document the success and make it available to the team and stakeholders. Collect the following information:
Problem Description: Which application, what happened, what was the impact?
Resolution: What was the root cause and how was it resolved? (Include screenshots)
Business Impact: What was the time to resolution and how did that compare to the baselined MTTR? What was the quantified business impact and how did that compare to life before APM?
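If you want these write-ups to be consistent, a small structured record can help. The sketch below is one possible shape, with hypothetical field names and example values; the percentage is simply the reduction in resolution time relative to your baselined MTTR.

```python
from dataclasses import dataclass


@dataclass
class SuccessRecord:
    application: str
    problem: str
    root_cause: str
    resolution: str
    minutes_to_resolve: float
    baseline_mttr_minutes: float

    def improvement_pct(self):
        """Percentage reduction in resolution time versus the baselined MTTR."""
        if self.baseline_mttr_minutes <= 0:
            raise ValueError("baseline MTTR must be positive")
        saved = self.baseline_mttr_minutes - self.minutes_to_resolve
        return saved / self.baseline_mttr_minutes * 100

    def summary(self):
        """One-line summary suitable for a status update to stakeholders."""
        return (
            f"{self.application}: {self.problem} "
            f"Root cause: {self.root_cause} "
            f"Resolved in {self.minutes_to_resolve:.0f} min "
            f"({self.improvement_pct():.0f}% faster than baseline)."
        )


# Hypothetical example record.
record = SuccessRecord(
    application="checkout",
    problem="Orders timing out at peak load.",
    root_cause="Exhausted database connection pool.",
    resolution="Increased pool size and fixed a connection leak.",
    minutes_to_resolve=30,
    baseline_mttr_minutes=120,
)
print(record.summary())
```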
SaaS or On-Premise
One of the first decisions you’ll need to make at implementation is whether to host your own controller or use the vendor-managed SaaS offering. You’ll more than likely have already performed a POT/POC, so you may already have discussed this at some length with your chosen vendor. The SaaS route is easier (less work for you) since the vendor provides and manages the platform, but you may have considerations and requirements (around security in particular) that mean you would rather run the infrastructure yourself. If so, check your server procurement and set-up processes early to avoid slowing your project down. Once the APM solution is up and running it’s time to install the agents.
After you’ve deployed your agents (whether straight into production or advancing through your route-to-live) and have started using the monitored application, look at the user interface to check that the information it shows is correct.
Look at your application flow map and try to identify any missing application components
Check the business transactions - are all the expected transactions there and reporting metrics?
Are your end user experience metrics displaying?
Do you have transaction snapshots showing your custom code executing in the run time?
Send out test alerts to see if they make it to their destination
If things don’t look right you’ll need to work out why. Maybe your application is different from what you had initially conceived, or perhaps there is a problem with the monitoring. Resolve any issues you see before declaring deployment and configuration victory.
NOTE: Production Load Cannot Be Simulated Exactly
To realize the most value from your APM purchase you must run it in production. No matter how good your Quality Engineering team is, they cannot possibly code all of the weird and wonderful things your users will try to do in production. It can also be very difficult to duplicate your production application environment. For example, suppose you have 5000 JVMs spread across multiple cloud provider data centers. Replicating that environment would be time-consuming and very expensive.
Discovery and Configuration
If you’ve picked the right tool, this bit should be easy.
Your APM tool should need no more than a series of simple steps to install, and should be able to perform auto-discovery. It should be up, running, and instrumenting the distributed application within hours or even minutes. The process looks something like this: the end user installs the agents on all managed servers and virtual machines, installs the controller, and restarts the application. At that point, the tool itself should be able to handle the challenge of mapping all the databases, tiers, and nodes within the distributed application, and displaying those relationships in an intuitive, visual way.
Do you like alerts? Not many people do. Why? Because it’s difficult to set alerts that are meaningful and that preempt real system issues. When you’re setting your alerts, think about business impact - not IT infrastructure resource utilisation.
The first question to ask is:
“How can alerting be done the right way without spending more time and money than it costs to develop and run my applications?”
Your most important and actionable alerts should be based on metrics that are directly associated with business impact. Here are some examples:
End user response time (good indicator of regional issues)
Business transaction response time (good indicator of systemic issues)
Business transaction throughput rate (are we seeing the same amount of traffic as usual?)
Number of widgets sold (is there a problem preventing users from buying?)
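To illustrate the "same amount of traffic as usual" idea, here is a minimal sketch of a baseline-deviation check on business transaction throughput. It is one simple technique (a z-score against recent history), not how any particular APM product computes its baselines. The sample numbers and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard deviations
    away from the mean of `history` (e.g. orders per minute at this hour
    on previous days)."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical throughput (transactions per minute) for the same hour
# on the previous five days.
usual = [100, 102, 98, 101, 99]

print(deviates_from_baseline(usual, 100))  # typical traffic
print(deviates_from_baseline(usual, 60))   # sharp drop in traffic
```

The same check works for the other metrics in the list, such as response time or widgets sold, provided you compare like with like (same hour of day, same day of week).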
And remember, you might not always get all of this right first time and what’s right today might not be right tomorrow. But your APM tool, used properly, will enable you to deliver higher quality innovation, faster.