Adding Transient error support #41
base: master
Conversation
|
@cgoconseils , I'm going to review and test this in one of our implementations and will report back here with feedback. How has the retry logic worked out for you? What we've done is implement Polly and do the retries there, because in addition to the API limits/transient errors mentioned in the documentation, there are other exceptions that we get as well (network/connection-level stuff), so we just use Polly for retries and handle/manage those. Knowing that the API limits have changed before, you might want to consider moving the ErrorCodes to some sort of config? |
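For reference, here's a minimal sketch of that Polly approach (not part of this PR); the error codes are the documented service-protection codes discussed later in this thread, and the retry count/back-off are illustrative:

using System;
using System.ServiceModel;
using Microsoft.Xrm.Sdk;
using Polly;

// Transient service-protection error codes (these could live in config, per the suggestion above).
var transientErrorCodes = new[]
{
    -2147015902,   // number of requests exceeded
    -2147015903,   // combined execution time exceeded
    -2147015898    // concurrent requests exceeded
};

// Retry throttling faults and lower-level connection failures with an exponential back-off.
var retryPolicy = Policy
    .Handle<FaultException<OrganizationServiceFault>>(f => Array.IndexOf(transientErrorCodes, f.Detail.ErrorCode) >= 0)
    .Or<CommunicationException>()
    .Or<TimeoutException>()
    .WaitAndRetry(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Usage: wrap any IOrganizationService call, e.g.
// var response = retryPolicy.Execute(() => service.Execute(request));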
|
Also wondering how this has panned out and how it's working? We have a version I've got to replace with CrmServiceClient so we can use OAuth; however, it still doesn't give us a facility to use the Web API yet and has a few of its own issues I'm working out. The throttle logic here wouldn't change that at all. But I'd like to know how this is working in practice and whether it's something we'd be better off using another library like Polly to manage or not. Thanks in advance for your feedback! |
|
@mohsinonxrm would you mind sharing your implementation of polly as a sample? I've hit a few snags and haven't been able to get the time to iron those out yet. |
|
@seanmcne , sure, I'll share some code snippets on that here. We've done it in a way that doesn't touch the NuGet package: we do our own batching and paging logic and then have Polly execute/retry that. With the new API limits, especially the one with a max of 40 concurrent requests, I'm wondering if we should move the implementation to use ExecuteMultipleRequest with shallow batches (see the sketch below) and work with support to increase the limit to ... say 40 parallel ExecuteMultipleRequests, because right now this implementation isn't giving us the throughput that it used to back in the day =). I know this because I've been using this PFE library since the good old 2011 days. Another thing to consider is maybe adding configuration to use multiple application user accounts (App Registrations) and divvy up the OrganizationRequests among those users so that API throughput can be maximized. Going a step further would be to use both the SOAP and Web API endpoints, but in the end Azure SQL will hit the connection limit of whatever DTU it is scaled to at that particular time, and obviously there's the impact of any plug-ins or workflows registered on those OrganizationRequests as well, so yeah... back to this, I'll share something soon. |
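To illustrate the shallow-batch idea, a minimal sketch (the batch size, settings, and the requests/service variables are placeholders, not code from this PR):

using System.Linq;
using Microsoft.Xrm.Sdk;
using Microsoft.Xrm.Sdk.Messages;

// Wrap a small number of child requests in a single ExecuteMultipleRequest (a "shallow" batch);
// several of these batches could then run in parallel, subject to the concurrency limit.
var batch = new ExecuteMultipleRequest
{
    Settings = new ExecuteMultipleSettings { ContinueOnError = true, ReturnResponses = false },
    Requests = new OrganizationRequestCollection()
};

foreach (var request in requests.Take(10))   // e.g. 10 child requests per batch
{
    batch.Requests.Add(request);
}

var batchResponse = (ExecuteMultipleResponse)service.Execute(batch);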
|
@seanmcne : I took parts out of the implementation and dumped them into a new class file; this won't compile without some additional classes, etc. |
if (e.Detail.ErrorCode == RateLimitExceededErrorCode)
{
    // Use Retry-After delay when specified
    delay = (TimeSpan)e.Detail.ErrorDetails["Retry-After"];
Need to add a check here to see if "Retry-After" exists in the ErrorDetails, as this isn't always the case. There are different types of RateLimits represented by the same RateLimitExceededErrorCode, so don't assume. I already ran into this issue.
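In other words, something like this (a sketch only; DefaultRetryDelay is a hypothetical fallback, not an existing constant):

if (e.Detail.ErrorCode == RateLimitExceededErrorCode)
{
    // Use the Retry-After delay only when the server actually supplied it;
    // some rate-limit faults share this error code but omit that detail.
    delay = e.Detail.ErrorDetails.ContainsKey("Retry-After")
        ? (TimeSpan)e.Detail.ErrorDetails["Retry-After"]
        : DefaultRetryDelay;
}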
private const int TimeLimitExceededErrorCode = -2147015903;
private const int ConcurrencyLimitExceededErrorCode = -2147015898;

private const int MaxRetries = 3;
Make this configurable, part of OrganizationServiceProxyOptions, set default to 3 if none provided
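Something along these lines, as a hypothetical addition to OrganizationServiceProxyOptions (the property below does not exist in the current library):

// Expose the retry count as an option so callers can tune it; default to 3 when not provided.
public int MaxRetries { get; set; } = 3;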
|
FYI - I've been testing this a bit, but integrated into a v9 branch using Service Client, which uses OAuth. I'm going to continue testing. I might just publish the branch up here as well so you can take a peek and contribute changes - so far it's working fairly well, but it's been a bit rocky getting ServiceClient to function well and behave consistently - getting closer though. Thanks! |
|
@seanmcne , let me know if you're available for a quick chat about some scenarios. Please let me know some time slots. |
- Removed all parallel DiscoveryService operations; use Global Discovery now instead for more responsive discovery
- Removed all OrganizationServiceManager constructors in favor of new constructors specifically for the service client
- Added 3 new constructors: one to use a CrmServiceClient and clone it, and two that will create their own CrmServiceClient, either with or without a developer-provided AppId and RedirectUri
- Retargeted the project to a minimum of .NET 4.6.2
- Added very basic throttle retry based on the public documentation provided on docs.microsoft.com and provided in PR #41
- Added more ETW tracing for exceptions and on clone of the service client
- Removed unnecessary ETW for events no longer used
- Mocking communication exceptions as faults so parallel operations continue and the customer-provided error handler is called, instead of ending the whole parallel operation for a timeout
|
@mohsinonxrm - I was finally able to do some enhancements and get a v9 build functional w/ service client. I've done a lot of high-throughput testing, though I've not fully exercised the throttling retry logic quite yet. My plan is to try to artificially lower the throttles on one instance and see if I can get some good tests; that said, I haven't had a chance to do that yet and I'm going to be pretty limited on time the next few weeks. |
|
@seanmcne , can you please publish the branch here, so that I can test that as well? I have scenarios ready that'll test all the different API limits. Missed the v9 commit, I'll play with it. |
|
Yep, take a peek, it's published up there now. I also published an alpha on nuget.org https://www.nuget.org/packages/Pfe.Microsoft.Xrm.CoreV9/9.0.0-alpha I've got a few tweaks to make (like logging errors if you're using a version of ADAL that's known not to work) - overall it's working as well as it can w/ service client. Troubleshooting login still continues to be difficult at best; other than enabling ETW tracing to a file, it's tough. But if your creds are working, it's pretty smooth sailing so far. |
|
@seanmcne , I think we need to think about wrapping the operations in core retry logic. Polly perhaps? For example, right now we're missing the retry logic on Retrieve and RetrieveMultiple. I can create a new pull request for those 2 items, but instead of implementing the retry logic for all these operations separately, I think it should be refactored and wrapped in something like a Polly retry. Thoughts? |
I guess... we can just always call Execute with the RetrieveRequest/CreateRequest/UpdateRequest/DeleteRequest as things stand right now (see the sketch below). |
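For example, a retrieve routed through Execute (a sketch; accountId and service are placeholders), which is why a single retry wrapper around Execute would cover all of these operations:

using Microsoft.Xrm.Sdk;
using Microsoft.Xrm.Sdk.Messages;
using Microsoft.Xrm.Sdk.Query;

// Equivalent of service.Retrieve(...), expressed as an Execute call.
var retrieveRequest = new RetrieveRequest
{
    Target = new EntityReference("account", accountId),
    ColumnSet = new ColumnSet("name")
};
var retrieveResponse = (RetrieveResponse)service.Execute(retrieveRequest);
Entity account = retrieveResponse.Entity;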
|
Interesting thought; the library today (and previously) just covers managing connections, returning connections (service clients/org service proxies), and managing/executing parallel operations, with the overall goal of making everything just a little easier to connect, manage, and execute bulk operations (hopefully handling the hard part around the bulk executions). All operations executed in parallel using this v9 code will handle and retry when throttled, but any generic call made against the service today (from GetProxy()) will be up to the caller to manage. I don't think wrapping all operations would be within the scope of the library today, since the concerns covered are really connecting, thread safety, and parallel operation support (not wrapping all service calls). Since the result of GetProxy() is now a CrmServiceClient, you can simply call Service.Retrieve, Service.RetrieveMultiple, Service.Execute, Service.Create/Update/Delete -- just like you always could with OrganizationServiceProxy (since both inherit from IOrganizationService); see the sketch below. |
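A quick sketch of that usage (manager stands in for an OrganizationServiceManager instance; the entity values are placeholders):

using Microsoft.Xrm.Sdk;
using Microsoft.Xrm.Sdk.Query;

// GetProxy() returns a client that implements IOrganizationService, so the usual calls work directly;
// keep it in a using block so the connection is disposed of properly. Retries here are the caller's
// responsibility - only the library's parallel operations retry on throttling.
using (var service = manager.GetProxy())
{
    var id = service.Create(new Entity("account") { ["name"] = "Contoso" });
    var account = service.Retrieve("account", id, new ColumnSet("name"));
    service.Delete("account", id);
}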
|
@seanmcne , I agree. I think I got confused a bit since I have my own implementation as well with Polly and I was reviewing/comparing that. Anyway, I'm all good, I'll be running this and will report back results in the next couple of days. I think this will reduce a lot of complexity that is currently in my implementation because of custom batching and paging logic. |
|
Thanks! Let me know if you need anything to get started, basically you can either:
|
|
Ok some new info:
|
|
@seanmcne , I tested this a bit yesterday. I passed the Uri, Username, and Password to the OrganizationServiceManager. It works fine; however, I was getting extremely poor performance. I looked at Visual Studio and it was logging tons of events (ETW logging from ADAL, something to do with a token found in the cache, etc.). I'll take a look at the new changes mentioned above. |
|
Interesting... I wonder how it'll do the retries on a batch with a mix of create/update requests - retry all of them again, or only the failed ones? Anyway, just for reference: |
|
Won't make a difference; each request's throttle is handled independently, just like the code was doing, so whether the request is a Create/Update/Retrieve/ExecuteMultiple it will retry. Also, keep in mind that ExecuteMultiple is only one request that contains child requests, but the throttle only applies to the one ExecuteMultiple request, and it's throttled before it's processed or accepted. |
|
@seanmcne , I know ExecuteMultipleRequest is counted as a single request, but I was thinking - and maybe you know the answer to this - how does the Dynamics 365 platform know that the requests inside the ExecuteMultiple message will go above the API limit, specifically this one: -2147015903 | Combined execution time of incoming requests exceeded limit of 1,200,000 milliseconds over time window of 300 seconds. Decrease number of concurrent requests or reduce the duration of requests and try again later. |
|
It doesn't know anything about the future, only the data from requests that have already taken place over the configured window of time (300 seconds in the above example). The execution time limit is a measure of seconds or milliseconds of execution per second: take 40 connections all connected at the exact same time; if all 40 connections stay connected for 1 minute, you have now consumed 40 minutes of execution per minute, and the next request in (after it hits the configured limits) will be denied due to combined execution time. The throttle metrics you quoted above allow for 4 seconds of execution time per second (1,200,000 ms over 300 seconds = 4,000 ms each second). If you go above the throttle with in-flight requests, it will block the next request (not roll back anything in-flight) because the throttles are applied before the request is processed. Hopefully that helps clear things up and doesn't cause more confusion :) |
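Putting numbers on that (just the arithmetic from the example above; the values are the documented defaults, not an API):

using System;

// 1,200,000 ms of combined execution time allowed per 300-second sliding window.
const double windowSeconds = 300;
const double executionLimitMs = 1200000;
Console.WriteLine(executionLimitMs / windowSeconds);   // 4000 ms of execution allowed per wall-clock second

// The 40-connection example: 40 requests each busy for 60 seconds consume 40 * 60 = 2400 seconds of
// execution, double the 1200-second window budget, so the next request in is rejected.
Console.WriteLine(40 * 60);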
|
@seanmcne , I was talking to another Premier support engineer a couple of months ago (early March) and I was told that the limits are applied per user, per endpoint (SOAP and Web API), and per application server (there's an affinity cookie). But in addition to what you described above, he also mentioned that any plug-ins/custom workflows/actions running as part of an OrganizationRequest will also be counted towards this API (execution time) limit - is that true, or is there more to it? Thanks a lot for the explanation above; it really helps clear things up and lines up with what I'm seeing with how I've coded up the batching/paging/retry logic with Polly. |
|
Correct: “applied per user per endpoint (SOAP and WebAPI) and per application server (there's an affinity cookie)”. The time you wait for the request to return a response (including sync plugin time) counts, but the whole thing only counts as a single request (for the burst throttle); any async time is not included in the time throttle or the burst throttle. And individual requests from plugins back to the context org service are not counted towards the burst or concurrency throttles - only external requests into the platform count towards the concurrency and burst throttles today. |
|
@seanmcne , thanks a lot for clarifying that. Not sure how useful others might find this information, but it is extremely helpful for me. |
|
You can use the app.config to disable the ADAL trace - I think if you just comment out all of the trace XML it will stop the logging output.
|
|
@seanmcne , I have turned it off directly in the code; I'll try to configure it through app.config as well. In ADAL V2, to disable logging: AdalTrace.LegacyTraceSwitch.Level = TraceLevel.Error; |
|
@seanmcne , using app.config worked. Anyway, more testing in progress. |
|
Excellent! Also, I just published a 9.0.4-alpha updated build on NuGet which should include all the code now in the v9 branch from here (I've left the 9.0.3-alpha build listed as well, just in case you hit an issue). https://www.nuget.org/packages/Pfe.Microsoft.Xrm.CoreV9/9.0.4-alpha
Here is the code I use to create a new local OrganizationServiceManager within a class, so it can be used by anything within that class (it can also be static and used across the entire codebase), and here is how to change the MaxRetryCount and RetryPauseTime on it. For this example I'm using the uri, username, password constructor versus providing my own service client; the retry values below are just examples.
this.LocalConnectionManager = new OrganizationServiceManager(OrgLocation, Username, Password);
this.OrganizationManager.MaxRetryCount = 3;                          // example value
this.OrganizationManager.RetryPauseTime = TimeSpan.FromSeconds(10);  // example value
using (var proxy = this.OrganizationManager.GetProxy())
{
WhoAmIRequest request = new WhoAmIRequest();
WhoAmIResponse response = (WhoAmIResponse)proxy.Execute(request);
}
//Parallel test - create 500 generic accounts
var createTargets = new List<Entity>();
for (var i = 0; i < 500; i++)
{
var target = new Entity("account");
target["error_name"] = String.Format("Test Account #{0}", i);
createTargets.Add(target);
}
var errorCount = 0;
try
{
//Create accounts, provide an error handler which will be called for each individual request that fails after the parallel process has completed
var createResponses = this.OrganizationManager.ParallelProxy.Create(createTargets, (e, ex) => { errorCount++; });
}
catch (AggregateException ex)
{
//add exception handling for AggregateExceptions
} |
|
@seanmcne , thanks for the above. I'm using the same pattern, and I tested with the following retry config. However, in the logs it doesn't seem to be using the PauseTime. I reviewed the implementation of the v9 client and you're passing that over to the CrmServiceClient, even for the cloned one, but for some reason I don't think it is sleeping for 10 seconds. Here are the logs:
Microsoft.Xrm.Tooling.Connector.CrmServiceClient Verbose: 16 : Retry No=1 Retry=Started IsThrottle=True Delay=00:00:01 for Command Delete
This is with the latest alpha, 9.0.4-alpha. I ran it again and this time I commented out the RetryPauseTime to see if it'll use the default. |
|
Also, |
This is expected; there is no CrmServiceClient.OrganizationServiceProxy - instead just use CrmServiceClient.Execute, .Create, .Update, etc. With OAuth authentication it uses an OrganizationWebProxyClient, but there shouldn't be any real need to interact directly with that (outside of a couple of circumstances where we need to change timeouts, etc.). To clarify, you'd use OrganizationServiceManager.GetProxy().Execute, etc., but you'd want to put the GetProxy in a using so it's disposed of properly and doesn't leak. |
Those are actually properties on the CrmServiceClient now. |
|
Thanks @seanmcne and @MattB-msft for clarifying. I'll go through my old implementation, make changes, and test more. I haven't spent much time with it, but from what I've seen so far, I feel that the performance with CrmServiceClient is nowhere near what we had before without it. I think we should release it after more testing, and then I'll probably open a performance issue with some numbers of old vs. new. I'll research just to see whether I'm just feeling it being slow or whether it is actually slow based on numbers. |
|
@seanmcne , it looks like a long-running process fails because the token expires and it doesn't auto-refresh (no managed token proxy classes with CrmServiceClient). I haven't looked at the details yet to catch it, auto-refresh it, and continue on. |
|
@mohsinonxrm this was an issue in 9.0.2.12, but I believe it was fixed; we can confirm. Can you check what version of the XRM Tooling DLLs you're using? @MattB-msft - can you confirm we fixed this in the 9.0.2.25 update of XRM Tooling? |
|
@seanmcne , sorry for the late reply. |
|
Yes that is fixed on the current nuget in public space. |
|
@seanmcne , so it has been a while since I gave any update. I've been testing it and using it. One thing I've noticed is that it doesn't handle or retry any other exceptions (Communication/Web/Socket/IO); those are still thrown and will have to be retried. Overall, it works pretty well. @MattB-msft , thanks for the info. |
|
The current pre-release on NuGet should let you handle the error via the error handler for communication errors - if I missed an exception type, let me know. If you're not using an error handler it will throw them as part of the aggregate exception, but it should not stop processing the batch for these specific errors. Thanks for the feedback! |
|
@seanmcne , everything works as expected with the exceptions and all that; I do have a handler. I just wanted to note my findings about what the CrmServiceClient will retry. I do have a suggestion, though: I hope that in the future CrmServiceClient can be open sourced on GitHub, so that we can enhance it or customize it as needed. I also hope that the Logic Apps/Flow CDS connector and the Dynamics 365 Logic Apps/Flow connector will be put on GitHub as well. On a separate note, can you shed some light (if you happen to know; otherwise, can you direct me to someone who does) on the internal workings of this CDS/D365 connector? It appears that it is sending the requests over to an APIM endpoint (in the case of CDS) or some other API endpoint on azurewebsites (in the case of D365), and from there the actual requests get executed - does it still forward the requests to the SOAP/WebAPI endpoint, or is there another channel to talk to the CDS/D365 environment? Also, I'm wondering if we can directly call the CDS APIM endpoint? |
|
@seanmcne , @MattB-msft , is this seriously happening? 20K requests/user/24 hours? It looks like the old Web API limits are still going to apply. Also, what's the point of an application user now, since per the article they can't make API calls unless they are assigned a license? Can we request these limits to be adjusted/increased based on the requirements per tenant/instance? How did Microsoft come up with this number? |
|
@seanmcne , I think this can be closed, because the back-end now uses the CrmServiceClient, which has Retry and Wait already baked in. |
|
I know it adheres to the default retry policy for XrmTooling, but I don't know if it actually looks at the Retry-After (at least not yet), because it's still using the SOAP/web endpoint, which I don't think will return that information. |
|
I was thinking the same since it has the property for CrmServiceClient.RetryPauseTime. Anyway, all good. I think we can close this one. |
|
The CrmServiceClient does respect the Retry-After and the error/retry codes that are specific to the CDS\Dataverse server.
|
Applying recommended implementation: https://docs.microsoft.com/en-us/dynamics365/customer-engagement/developer/api-limits#retry-class
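For reference, a minimal sketch of what that documented retry pattern looks like wrapped around an Execute call (the error codes are the ones referenced in this PR; the class name, retry count, and fallback delay are illustrative):

using System;
using System.ServiceModel;
using System.Threading;
using Microsoft.Xrm.Sdk;

internal static class TransientRetry
{
    private const int RateLimitExceededErrorCode = -2147015902;
    private const int TimeLimitExceededErrorCode = -2147015903;
    private const int ConcurrencyLimitExceededErrorCode = -2147015898;
    private const int MaxRetries = 3;

    public static OrganizationResponse ExecuteWithRetry(IOrganizationService service, OrganizationRequest request)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return service.Execute(request);
            }
            catch (FaultException<OrganizationServiceFault> e) when (attempt < MaxRetries &&
                (e.Detail.ErrorCode == RateLimitExceededErrorCode ||
                 e.Detail.ErrorCode == TimeLimitExceededErrorCode ||
                 e.Detail.ErrorCode == ConcurrencyLimitExceededErrorCode))
            {
                // Honor the server-provided Retry-After delay when present; otherwise back off briefly.
                var delay = e.Detail.ErrorDetails.ContainsKey("Retry-After")
                    ? (TimeSpan)e.Detail.ErrorDetails["Retry-After"]
                    : TimeSpan.FromSeconds(2 * (attempt + 1));
                Thread.Sleep(delay);
            }
        }
    }
}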