Thursday, September 15, 2016

Troubleshooting performance issues caused by performance monitoring tool


It have been a while after my last post because I'm now working mostly with applications and not some much with infrastructure anymore. Anyway, now I have nice real world troubleshooting story which I like to share with you.

Everyone who has even tried troubleshoot application performannce issues knows that it is very time consuming. That why we decided to acquire Application Performance Management/Monitoring tool to help with it.

After comparing our requirements to tools which are available we decided to start proof of concept project using Dynatrace APM which is provided us by their local partner company here in Finland so we was able to be sure that log data does not leave the country (which was business requirement).

Findings in PoC

Installation package issues

After testing Dynatrace Agent to couple of servers I noticed that there looks be these two bugs on installation MSI package:

  • On some servers IIS agent installation fails but I did not found any real reason to it.
  • On some servers MSI package fails to store IIS module registration to IIS configuration.
    • On this case MSI log even says that installation was done successfully.
Another issue was that there is also quite many manual configuration tasks needed after installation.

Because we are planning to deploy Dynatrace Agents to multiple servers and enabling/disabling them where needed I decided to create PowerShell scripts which can be used to install Dynatrace agents and enable/disable them when needed and include workarounds to these issues to these scripts (These scripts are btw available here.).

Performance issues caused by performance monitoring

First test results from production environment was very shocking. Reports shown for us that there was huge slowness on one of the environments.

When I investigated reasons to this issue I found that slowness mostly happened evening/night times when there was just couple of users on system.

Then on one evening when I was troubleshooting this issue I noticed that biggest delays was there immediately after iisreset and slowness was much worse than I had seen earlier on any other environment even when I was only user on system.

Because Dynatrace Agent was only difference between this and other environments which worked fine I decided to try remove it also from this environment and suprise suprise first load time after iisreset decreased to one tenth of earlier.

Situation was very interesting that tool which should help us to improve application perfomance actually caused them.

When we investigated this with Dynatrace support we got information that is it actually normal that instrumenting .NET profilers will slow down application start little bit even we did some tuning to configurations.

As solution to this we decided to configure IIS to keep application pools always running and preload them immediately after iisreset based on this guide:

That looked fixing issue so I included also that feature to my enable.ps1 -script.

Difficulties moving to production mode

After done lot of testings on test environment we assumed that we are finally ready to go production with this tool and we decided to enable monitoring agents again to couple of production environments.

Assuming was once wrong again. This time issue was that most of the application pools did not started at all after enabling Dynatrace agent. Logs showed that IIS agents started fine without any errors but .NET agents not even tried to start.

Hidden IIS feature

This issue seemed to be tricky and it took long time for me, my colleagues and Dynatrace partner company to investigate. In the end we noticed that when we removed all configurations done by script and did all configuration steps manually it suddenly started working and that happened on two different environment but not on third one even we did all steps same way than ealier.

After comparing working and not working installations I found that there is actually one hidden feature on IIS Management Console.
When you register module using IIS console it checks if binary is 32 or 64 bit version and automatically adds precondition that module will be only loaded by application pool which are running on same bitness. PowerShell cmdlet "New-WebGlobalModule" does not do that automatically so if you do not give -Precondition parameter (which is not mandatory) IIS will try load that module to both 32 and 64 applications. And actually you need give same precondition again when you run "Enable-WebGlobalModule" cmdlet.

Here you can see that precondition is not visible on IIS console:

But here you can see that precondition is still on IIS config file:

or it can be missing like we had and you cannot see that from IIS console and it caused application pools starting issue.

This is also now fixed to my scripts.

Whose fault it was?

When you use lot of time for troubleshooting and finally got problem fixed it is also interesting to use sometime to think that whose fault it was after all?

If we think all issues I have explained here then we can probably say that:

  • It is Dynatrace fault that their MSI package does not fail if installing IIS modules fails.
    • Or maybe it is Microsoft fault because their appcmd.exe tools is used inside of MSI?
    • Or maybe it is Microsoft fault that it is their AppFabric process which locks IIS config when this happens?
  • It is Microsoft fault that IIS module registration using IIS console and PowerShell works differently.
    • Or maybe it is Dynatrace fault that they don't say on their documentation that we should care about this but they are still using "/precondition" parameter inside of MSI package?
    • Or maybe it was my fault that I did not readed all Microsoft documentations that how IIS module registration should be done using PowerShell?

Anyway, it was Olli's problem and after all I was able to fix it :)

1 comment:

  1. I also found that there is same issue on "official" Dynatrace-PowerShell module which is created partly by Dynatrace employees. They probably have not seen this issue if they have only either 32 or 64 bit application pools in use and their module only enables one version of module to IIS.

    Anyway, I created pull request to same fix for their code: