Moving to the providers team to review. There seem to be a few directly related issues that we would be ignoring if we only tried to fix provisioning, namely missed authentication checks and failing refreshes. I cannot agree at the moment with the "Expected Results" comment that provisioning should call re-authenticate, at least not until we understand the root cause of the other issues.
Michael - Failing sooner would make sense, but it does not guarantee that the credentials will stay valid between the start of provisioning and the looping checkprovisioned state, which can run for a while depending on how long provisioning takes. I believe we use the "with_provider_connection" call both to start provisioning and in the checkprovisioned call, so raising on bad credentials would belong in the "with_provider_connection" method.

checkprovisioned goes through here:
https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L3

Provisioning is initiated here:
https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L185
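To illustrate the point about credentials changing mid-provision, here is a minimal sketch. FakeProvider, BadCredentialsError and the credentials_valid flag are hypothetical stand-ins, not the ManageIQ classes; only the with_provider_connection name is borrowed from the comment above. It assumes the credential check lives inside the connection helper, so both the initial call and every later poll go through it.

```ruby
# Hypothetical stand-ins for illustration only; not the ManageIQ implementation.
class BadCredentialsError < StandardError; end

class FakeProvider
  attr_accessor :credentials_valid

  def initialize
    @credentials_valid = true
  end

  # Every caller goes through this helper, so the check runs on each call,
  # not just once when provisioning starts.
  def with_provider_connection
    raise BadCredentialsError, "credentials rejected" unless credentials_valid

    yield :connection
  end
end

provider = FakeProvider.new

# Start of provisioning: credentials are still valid.
start_result = provider.with_provider_connection { |_conn| :provisioning_started }

# Credentials become invalid while provisioning is still running.
provider.credentials_valid = false

# A later checkprovisioned-style poll fails at the connection helper.
poll_result =
  begin
    provider.with_provider_connection { |_conn| :still_provisioning }
  rescue BadCredentialsError => e
    "poll failed: #{e.message}"
  end
```

In this sketch an up-front check alone would have passed, while the later poll still fails, which is why failing sooner does not remove the need to handle bad credentials during polling.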
An OpenTimeoutException means either your network has failed, or Azure is having a really bad day and we aren't able to connect. In this case it is probably the latter, since we had at least one other report of this happening. As per Joe R's suggestion, we modified our existing authentication code to raise the appropriate exception, which in turn results in retries with exponential backoff. In other words, your refresh may still fail if Azure is down for long periods of time, but this is a more robust approach that should deal with sporadic failures: https://github.com/ManageIQ/manageiq-providers-azure/pull/278
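For readers unfamiliar with the pattern, the following is a rough sketch of retrying with exponential backoff on a connection timeout. It is not the code from the pull request: with_backoff, backoff_delay, the 60-second base, the one-hour cap and the attempt limit are all assumptions made for illustration, and the sketch retries in-process, whereas the real fix may schedule its retries differently.

```ruby
require "net/http" # loads Net::OpenTimeout

# Hypothetical helper: delay doubles with each attempt, up to a cap.
def backoff_delay(attempt, base: 60, cap: 3600)
  [base * (2**(attempt - 1)), cap].min
end

# Hypothetical helper: runs the block, retrying on Net::OpenTimeout.
# The sleeper is injectable so the example can run without really waiting.
def with_backoff(max_attempts: 4, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 1
  begin
    yield attempt
  rescue Net::OpenTimeout
    # Give up and re-raise once the attempt budget is used.
    raise if attempt >= max_attempts

    sleeper.call(backoff_delay(attempt))
    attempt += 1
    retry
  end
end

# Usage: simulate two timeouts followed by a successful check.
delays = []
calls = 0
result = with_backoff(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  calls += 1
  raise Net::OpenTimeout, "execution expired" if attempt < 3

  :ok
end
```

A short outage is absorbed by the first few retries, while a long outage still ends in a failure once the attempts run out, which matches the caveat above about Azure being down for long periods.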
*** Bug 1601583 has been marked as a duplicate of this bug. ***
Is the customer still experiencing this problem? As described in the comments above, it looks like Azure was having network issues last Friday.
Verified on 5.10.0.8 by:
1) Setting up azure w/ an http proxy
2) Bringing azure "down" by using iptables -j DROP on the proxy machine

After hitting an OpenTimeoutException we can see the auth checks are being retried with multiple attempts:

[----] W, [2018-08-03T22:03:33.619836 #63909:e2ef78] WARN -- : MIQ(ManageIQ::Providers::Azure::CloudManager#authentication_check_no_validation) type: ["default"] for [1] [azure] Validation failed: error, [] Timed out connecting to server (cause: Timed out connecting to server)
[----] I, [2018-08-03T22:03:33.642201 #63909:e2ef78] INFO -- : MIQ(MiqQueue.put) Message id: [4553], id: [], Zone: [default], Role: [ems_operations], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-08-04 02:07:33 UTC], Data: [], Args: [[], {:attempt=>3}]