Description
So I'm noticing what looks like an anti-pattern: Azure client instances are never cached, and are instead re-materialized all the way from configuration every single time they're needed.
It starts with the AzureIAuthJanitorProviderExtensions::GetAzure method. This method uses the fluent Azure SDK to construct the base IAzure instance, which includes network calls to authenticate, retrieve subscriptions, and ultimately "bind" to the target Azure subscription. It also allocates a new RestClient instance which, if you dive deep enough, you'll find allocates a new HttpClientHandler each time. This is the root cause of the well-known HttpClient anti-pattern, which can (and ultimately will) result in TCP connection/socket exhaustion.
By itself, the current design of this method is probably as good as it can be, but it puts the onus on the caller (in this case, the various IAuthJanitorProvider implementations) to know they need to cache the resulting client instance... spoiler alert though: none of them do. 😞
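To make the "cache it once" idea concrete, here is a minimal sketch of one possible fix. Everything here is hypothetical (the class `AzureClientCache` and its shape are not in the codebase); it just shows the standard `ConcurrentDictionary` + `Lazy<Task<T>>` memoization pattern applied to the expensive IAzure construction:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Microsoft.Azure.Management.Fluent;

// Hypothetical sketch, not the project's actual API: cache the authenticated
// IAzure per subscription so the authentication round-trips and the
// RestClient/HttpClientHandler allocation happen once instead of per call.
public static class AzureClientCache
{
    // Lazy<Task<IAzure>> ensures that even under concurrent first access,
    // only one caller pays the construction cost; everyone else awaits it.
    private static readonly ConcurrentDictionary<string, Lazy<Task<IAzure>>> _clients =
        new ConcurrentDictionary<string, Lazy<Task<IAzure>>>();

    public static Task<IAzure> GetAzureAsync(string subscriptionId, Func<Task<IAzure>> factory)
    {
        return _clients.GetOrAdd(
            subscriptionId,
            _ => new Lazy<Task<IAzure>>(factory)).Value;
    }
}
```

A real fix would also need to account for credential rotation/expiry (i.e., an eviction policy), but the core idea is that GetAzure() stops being a per-call factory.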
Now that we understand the implementation and its costs, let's dive into why it ends up being a problem. If you look at the various IAuthJanitorProvider implementations, you will find they constantly call the extension method via this.GetAzure() whenever they want to do something against their respective Azure resources. For example, here's one such usage from the AppSettingsWebAppApplicationLifecycleProvider class which illustrates one of the bigger problems with the current design:
Lines 62 to 86 in eb2c5e6
```csharp
private async Task ApplySecrets(string slotName, List<RegeneratedSecret> secrets)
{
    if (secrets.Count > 1 && secrets.Select(s => s.UserHint).Distinct().Count() != secrets.Count)
    {
        throw new Exception("Multiple secrets sent to Provider but without distinct UserHints!");
    }
    IUpdate<IDeploymentSlot> updateBase = (await GetDeploymentSlot(slotName)).Update();
    foreach (RegeneratedSecret secret in secrets)
    {
        var appSettingName = string.IsNullOrEmpty(secret.UserHint) ? Configuration.SettingName : $"{Configuration.SettingName}-{secret.UserHint}";
        _logger.LogInformation("Updating AppSetting '{0}' in slot '{1}' (as {2})", appSettingName, slotName,
            Configuration.CommitAsConnectionString ? "connection string" : "secret");
        updateBase = updateBase.WithAppSetting(appSettingName,
            Configuration.CommitAsConnectionString ? secret.NewConnectionStringOrKey : secret.NewSecretValue);
    }
    _logger.LogInformation("Applying changes.");
    await updateBase.ApplyAsync();
    _logger.LogInformation("Swapping to '{0}'", slotName);
    await (await GetWebApp()).SwapAsync(slotName);
}
```
In this method you first see a call to GetDeploymentSlot, which actually calls GetWebApp, which in turn calls GetAzure(). The method then later calls GetWebApp directly, which again calls GetAzure(). So within that one method call, you've gone through the full creation and configuration of two Azure clients, and the subsequent leaking of their underlying TCP connections. This is just one highly localized example of where the problem surfaces. Here's another one from WebAppApplicationLifecycleProvider::Test:
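One possible localized mitigation (again, purely a sketch; the field and method names below are illustrative and not in the codebase) is for each provider instance to memoize the IAzure handle it gets back, so repeated GetWebApp/GetDeploymentSlot calls within a workflow reuse one client:

```csharp
// Hypothetical provider-level memoization sketch: the first call pays the
// full GetAzure() cost; subsequent calls on this provider instance reuse
// the same Task<IAzure> (and therefore the same underlying RestClient).
private Task<IAzure> _cachedAzure;

private Task<IAzure> GetAzureCached()
{
    // Note: not thread-safe as written; a real version would use
    // Lazy<Task<IAzure>> or a lock if providers are called concurrently.
    return _cachedAzure ?? (_cachedAzure = this.GetAzure());
}
```

This doesn't fix the cross-provider or cross-workflow duplication, but it would at least collapse the multiple GetAzure() calls inside a single method like ApplySecrets above down to one.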
AuthJanitor/AuthJanitor.Providers.AppServices/WebApps/WebAppApplicationLifecycleProvider.cs
Lines 13 to 23 in eb2c5e6
```csharp
public override async Task Test()
{
    var sourceDeploymentSlot = await GetDeploymentSlot(SourceSlotName);
    if (sourceDeploymentSlot == null) throw new Exception("Source Deployment Slot is invalid");
    var temporaryDeploymentSlot = await GetDeploymentSlot(TemporarySlotName);
    if (temporaryDeploymentSlot == null) throw new Exception("Temporary Deployment Slot is invalid");
    var destinationDeploymentSlot = await GetDeploymentSlot(DestinationSlotName);
    if (destinationDeploymentSlot == null) throw new Exception("Destination Deployment Slot is invalid");
}
```
Here we see the GetDeploymentSlot method being called three distinct times in just this one method, each one triggering the full client construction described above.
But wait, it gets worse. 😫 If you follow the rekeying workflow of AuthJanitor, every provider is retrieved and goes through the following method call sequence:
Test -> GetSecretToUseDuringRekeying -> BeforeRekeying -> Rekey -> CommitNewSecrets -> AfterRekeying -> OnConsumingApplicationSwapped
Each one of those methods, if you dig into each provider implementation, usually makes one or more calls to GetAzure(). If you extrapolate out N providers calling GetAzure() M times per workflow, every T interval of time (right now 2 minutes), you can see how this becomes problematic pretty quickly.
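As a purely illustrative back-of-envelope calculation (the N and M values below are assumptions, not measurements from this codebase):

```csharp
// Illustrative only: pick plausible values for N, M and the 2-minute cadence
// mentioned above to see the scale of the churn.
const int providers = 10;          // N: providers in a workflow
const int getAzureCallsPerRun = 7; // M: roughly one per lifecycle step above
const int runsPerHour = 30;        // T = 2 minutes between workflow runs

int handlersPerDay = providers * getAzureCallsPerRun * runsPerHour * 24;
// = 10 * 7 * 30 * 24 = 50,400 new HttpClientHandler instances (and their
// sockets, auth round-trips, and subscription lookups) per day.
```

Even if the real N and M are smaller, the allocation count grows linearly with time, and sockets left in TIME_WAIT from discarded handlers are exactly how the HttpClient anti-pattern manifests as port exhaustion.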
There are several ways to solve this, but I would personally want to think about it more before recommending a solution. I just wanted to surface this problem so that we understand it exists right now and can begin to have discussions about it.