Bug 1600968
| Summary: | Service Provision is Failing Because Last Auth Check Failed for Azure Provider | | |
|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | myoder |
| Component: | Providers | Assignee: | Daniel Berger <dberger> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Brandon Squizzato <bsquizza> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.9.0 | CC: | brant.evans, bsorota, bsquizza, cpelland, dberger, gblomqui, gmccullo, jfrey, jhardy, jrafanie, myoder, ndhandre, obarenbo, simaishi |
| Target Milestone: | GA | Keywords: | TestOnly, ZStream |
| Target Release: | 5.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 5.10.0.5 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1601589 | Environment: | |
| Last Closed: | 2019-02-11 14:00:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | Bug |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | Azure | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1601589 | | |
Comment 2
Greg McCullough
2018-07-13 14:14:57 UTC
Michael - Failing sooner would make sense, but it does not guarantee that the credentials will not switch from valid to invalid between provisioning and the looping checkprovisioned state, since that can run for a while depending on how long provisioning takes.

I believe we use the "with_provider_connection" call for both the start of provisioning and the checkprovisioned call, and it would be part of the "with_provider_connection" method to raise on bad credentials.

checkprovisioned goes through here: https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L3

Initiate provisioning here: https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L185

An OpenTimeoutException means either your network has failed, or Azure is having a really bad day and we aren't able to connect. In this case it is probably the latter, since we had at least one other report of this happening.

As per Joe R's suggestion, we modified things so that our existing authentication code will behave better by raising the appropriate exception, which in turn will result in retries with exponential backoff times. In other words, your refresh may still fail if Azure is down for long periods of time, but this is a more robust approach that should deal with sporadic failures: https://github.com/ManageIQ/manageiq-providers-azure/pull/278

*** Bug 1601583 has been marked as a duplicate of this bug. ***

Is the customer still experiencing this problem? As described in the comments above, it looks like Azure was having network issues last Friday.
Verified on 5.10.0.8 by:

1) Setting up azure w/ an http proxy
2) Bringing azure "down" by using iptables -j DROP on the proxy machine

After hitting an OpenTimeoutException we can see the auth checks are being retried with multiple attempts:

[----] W, [2018-08-03T22:03:33.619836 #63909:e2ef78] WARN -- : MIQ(ManageIQ::Providers::Azure::CloudManager#authentication_check_no_validation) type: ["default"] for [1] [azure] Validation failed: error, [] Timed out connecting to server (cause: Timed out connecting to server)

[----] I, [2018-08-03T22:03:33.642201 #63909:e2ef78] INFO -- : MIQ(MiqQueue.put) Message id: [4553], id: [], Zone: [default], Role: [ems_operations], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-08-04 02:07:33 UTC], Data: [], Args: [[], {:attempt=>3}]
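
For readers unfamiliar with the pattern, the "raise the appropriate exception, then retry with exponential backoff" behavior described in this bug can be illustrated with a minimal Ruby sketch. This is not the ManageIQ implementation: `ConnectTimeout`, `backoff_delay`, `with_retries`, the 60-second base, the one-hour cap, and the doubling schedule are all hypothetical names and values chosen for illustration. The log above suggests the product requeues the check through MiqQueue with an `:attempt` argument and a later "Deliver On" time; the sketch simply waits in-process instead.

```ruby
# Hypothetical stand-in for a connection timeout error; not a ManageIQ class.
class ConnectTimeout < StandardError; end

# Seconds to wait after the given failed attempt (1-based).
# Assumed schedule: base, 2*base, 4*base, ... capped at `cap`.
def backoff_delay(attempt, base: 60, cap: 3600)
  [base * (2**(attempt - 1)), cap].min
end

# Runs the block, retrying on ConnectTimeout with exponential backoff.
# `sleeper` is injectable so callers (and tests) can avoid real sleeping.
def with_retries(max_attempts: 5, base: 60, cap: 3600, sleeper: ->(s) { sleep(s) })
  attempt = 1
  begin
    yield attempt
  rescue ConnectTimeout
    # Out of attempts: re-raise so the caller sees the real failure.
    raise if attempt >= max_attempts
    sleeper.call(backoff_delay(attempt, base: base, cap: cap))
    attempt += 1
    retry
  end
end
```

The key design point is that only the timeout error triggers a retry. Any other exception, such as one signalling genuinely invalid credentials, propagates immediately, so a sporadic outage is not confused with a bad password.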