ASP.NET Core MVC – Intermittent Azure App Service “hang” issue

We’re experiencing a strange issue with one of our Azure App Services. At various unpredictable points in the day, the app suddenly appears to hang for around 30-50 seconds, during which no requests get serviced. It’s as if we’re waiting on a cold start.

It’s an ASP.NET Core MVC application on .NET 7 (C#), built as a monolith. It has a DI service layer, but it isn’t API-based – everything is contained within one application. It uses Azure Cache for Redis extensively, has an Azure SQL backend, and also makes extensive use of Azure Storage (Tables, Blobs and Queues).

The app uses the async-await pattern throughout. There should be virtually no synchronous calls or anything that obviously blocks a thread. We cannot find anything that ‘locks’ any resource for any period of time.
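That said, we’re aware that a single hidden sync-over-async call (a stray `.Result` or `.Wait()` buried in a library) can starve the thread pool and produce exactly this kind of multi-second stall at low CPU. One thing we’re considering adding is a small hosted service that logs thread-pool pressure every second, so a stall can be correlated with a backlog of queued work items. A rough sketch – the one-second interval and log wording are placeholders, not something we run today:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Hypothetical diagnostic: logs thread-pool pressure once per second so a stall
// can be correlated with a backlog of queued work items or slow thread injection.
public sealed class ThreadPoolMonitor : BackgroundService
{
    private readonly ILogger<ThreadPoolMonitor> _logger;

    public ThreadPoolMonitor(ILogger<ThreadPoolMonitor> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromSeconds(1));
        try
        {
            while (await timer.WaitForNextTickAsync(stoppingToken))
            {
                ThreadPool.GetAvailableThreads(out int availableWorkers, out int availableIo);

                _logger.LogInformation(
                    "ThreadPool threads={Threads} pendingWorkItems={Pending} availableWorkers={Workers} availableIO={Io}",
                    ThreadPool.ThreadCount,
                    ThreadPool.PendingWorkItemCount,
                    availableWorkers,
                    availableIo);
            }
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown.
        }
    }
}

// Registered in Program.cs: builder.Services.AddHostedService<ThreadPoolMonitor>();
```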

It doesn’t really need to call any third party APIs, and we don’t tend to use external CDNs much. Everything we need is pretty much inside the architecture described.
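Since everything sits inside that architecture, another diagnostic we’re thinking about is a small internal probe endpoint that times each dependency (Redis, SQL, Blob Storage) independently, so that during a hang we can see at a glance which backing service, if any, has gone quiet. A rough sketch, assuming the Redis multiplexer and a `BlobServiceClient` are already registered in DI and that a “SqlReadOnly” connection string exists in configuration (the route and names are placeholders):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Configuration;
using StackExchange.Redis;

// Hypothetical internal probe: times each backing service independently so that,
// during a hang, we can see which dependency (if any) has stopped responding.
[ApiController]
[Route("internal/probe")] // placeholder route, would be locked down in reality
public sealed class DependencyProbeController : ControllerBase
{
    private readonly IConnectionMultiplexer _redis;
    private readonly BlobServiceClient _blobs;
    private readonly string _readOnlySqlConnectionString;

    public DependencyProbeController(
        IConnectionMultiplexer redis, BlobServiceClient blobs, IConfiguration config)
    {
        _redis = redis;
        _blobs = blobs;
        // "SqlReadOnly" is a placeholder name for the UK West connection string.
        _readOnlySqlConnectionString = config.GetConnectionString("SqlReadOnly")!;
    }

    [HttpGet]
    public async Task<IActionResult> Get()
    {
        long redisMs = await TimeAsync(() => _redis.GetDatabase().PingAsync());

        long sqlMs = await TimeAsync(async () =>
        {
            await using var conn = new SqlConnection(_readOnlySqlConnectionString);
            await conn.OpenAsync();
            await using var cmd = new SqlCommand("SELECT 1", conn);
            await cmd.ExecuteScalarAsync();
        });

        long blobMs = await TimeAsync(() => _blobs.GetPropertiesAsync());

        return Ok(new { redisMs, sqlMs, blobMs });
    }

    private static async Task<long> TimeAsync(Func<Task> action)
    {
        var sw = Stopwatch.StartNew();
        await action();
        return sw.ElapsedMilliseconds;
    }
}
```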

The MVC app is running on a P2V2 plan (2 vCPUs, 7 GB RAM), scaled out to two instances (session affinity on).

Redis instance is P1 Premium (6 GB cache).

Azure SQL is Standard S4 (200 DTUs), geo-replicated between UK South (R/W) and UK West (R/O). In our application we use both connection strings: read-only queries are directed to UK West and upsert/delete operations are directed to UK South, thereby “load-balancing” the SQL server.
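Because we bounce between the two connection strings, one theory is that acquiring a pooled connection against one of the servers occasionally stalls (e.g. during pool growth or a replica blip) without ever surfacing as a query timeout. We could wrap connection opens in a timing helper to catch that. A rough sketch, assuming Microsoft.Data.SqlClient and an arbitrary 500 ms “slow” threshold:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Logging;

// Hypothetical helper: logs whenever acquiring a pooled connection takes longer
// than expected, labelled per region so R/W (UK South) and R/O (UK West) opens
// can be compared during a hang.
public static class TimedSqlConnectionFactory
{
    private const int SlowOpenThresholdMs = 500; // arbitrary threshold

    public static async Task<SqlConnection> OpenAsync(
        string connectionString, string label, ILogger logger, CancellationToken ct = default)
    {
        var conn = new SqlConnection(connectionString);
        var sw = Stopwatch.StartNew();
        try
        {
            await conn.OpenAsync(ct);
        }
        catch
        {
            conn.Dispose();
            throw;
        }
        finally
        {
            sw.Stop();
            if (sw.ElapsedMilliseconds > SlowOpenThresholdMs)
            {
                logger.LogWarning("Slow SQL connection open ({Label}): {ElapsedMs} ms",
                    label, sw.ElapsedMilliseconds);
            }
        }

        return conn;
    }
}

// Example call site: var conn = await TimedSqlConnectionFactory.OpenAsync(readOnlyCs, "UKWest-RO", logger);
```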

During “normal” operation the application is extremely quick, with response times in the low-millisecond range. However, for no identifiable reason, several times per day (perhaps five times) the application suddenly “hangs” on both instances for up to 50 seconds. During this time the browser spins and nothing appears to be happening. Then all of a sudden the requests are serviced and it goes back to great performance. It’s as if the app is “cold booting”, but it isn’t – we were using it perfectly well seconds before.
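To at least narrow down where the 30-50 seconds is spent, we’re considering middleware registered at the very top of the pipeline that flags any request exceeding a threshold, so we can tell whether requests are slow inside the app or never reach it at all during the hang. A rough sketch – the 5-second threshold is arbitrary:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

// Hypothetical middleware, registered first in the pipeline: flags any request
// whose in-app time exceeds a threshold, including the start timestamp, so gaps
// in the log show whether requests were slow inside the app or never arrived.
public sealed class SlowRequestLoggingMiddleware
{
    private static readonly TimeSpan Threshold = TimeSpan.FromSeconds(5); // arbitrary

    private readonly RequestDelegate _next;
    private readonly ILogger<SlowRequestLoggingMiddleware> _logger;

    public SlowRequestLoggingMiddleware(
        RequestDelegate next, ILogger<SlowRequestLoggingMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var startedAt = DateTimeOffset.UtcNow;
        var sw = Stopwatch.StartNew();
        try
        {
            await _next(context);
        }
        finally
        {
            sw.Stop();
            if (sw.Elapsed > Threshold)
            {
                _logger.LogWarning(
                    "Slow request {Method} {Path}: started {StartedAt:O}, took {ElapsedMs} ms",
                    context.Request.Method, context.Request.Path, startedAt, sw.ElapsedMilliseconds);
            }
        }
    }
}

// Registered in Program.cs before everything else: app.UseMiddleware<SlowRequestLoggingMiddleware>();
```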

During these periods, we check as many diagnostic sources as we can, but have found nothing to point towards this sudden hang, for example:

  • App Service CPU metrics on both machines don’t go above 15%
  • No sudden spike in memory usage
  • SQL server DTU% typically 5-15% during these periods on both R/W and R/O servers
  • No spike in Redis memory usage, which sits at only around 200 MB
  • Redis server load typically 5-6%
  • No spikes in Ingress or Egress in Azure Storage data
  • Nothing of any interest in Application Insights
  • No spikes in errors, warnings, etc.
  • Nothing of interest in diagnostics event logs
  • No timeouts or any other latency issues that we can find
  • No background, scheduled or timed updates/CRON jobs running
  • Database queries are optimised and well indexed
  • Health checks remain at 100%
  • Instances are not rebooting, according to Azure logs. Uptime remains at 100%

All the pieces of architecture are well over-specced for our requirements at this stage.

There are no other obvious pieces of architecture in the request path that we can put our finger on, such as firewalls, etc.

The issue feels “internal” to MVC, .NET or the App Service itself. We cannot replicate the issue locally in development and we cannot predict when it will happen on production.

We’ve considered garbage collection (GC) pauses, database connection pool recycling, etc., but cannot find any data to suggest these are the issue.
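On the GC front, we haven’t yet logged pause data from inside the app itself. A small helper like the sketch below (using the `GC.GetGCMemoryInfo` API available since .NET 5) could be called from a periodic logger – e.g. alongside the thread-pool numbers above – to confirm or rule out long pauses around the time of a hang:

```csharp
using System;

// Hypothetical spot-check, intended to be logged periodically alongside the
// thread-pool numbers: reports what the runtime recorded for the most recent GC,
// including its pause durations.
public static class GcPauseReporter
{
    public static string Describe()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo(GCKind.Any);

        double maxPauseMs = 0;
        foreach (TimeSpan pause in info.PauseDurations)
        {
            maxPauseMs = Math.Max(maxPauseMs, pause.TotalMilliseconds);
        }

        return $"Gen={info.Generation} Compacted={info.Compacted} " +
               $"PauseTime%={info.PauseTimePercentage:F2} MaxPauseMs={maxPauseMs:F1}";
    }
}
```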

We’re a bit stumped. It’s frustrating because other than these momentary spikes throughout the day, the app is running really well and super quick.

I’ve raised an issue with Azure Support and await their feedback, but has anybody else had similar experiences with a similar architecture? Do you have any suggestions we could look at, or any logs/diagnostics we could consider adding to trace where this issue may be coming from?
