Bug 1600968

Summary: Service Provision is Failing Because Last Auth Check Failed for Azure Provider
Product: Red Hat CloudForms Management Engine Reporter: myoder
Component: Providers Assignee: Daniel Berger <dberger>
Status: CLOSED CURRENTRELEASE QA Contact: Brandon Squizzato <bsquizza>
Severity: urgent Docs Contact:
Priority: high    
Version: 5.9.0 CC: brant.evans, bsorota, bsquizza, cpelland, dberger, gblomqui, gmccullo, jfrey, jhardy, jrafanie, myoder, ndhandre, obarenbo, simaishi
Target Milestone: GA Keywords: TestOnly, ZStream
Target Release: 5.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.10.0.5 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1601589 Environment:
Last Closed: 2019-02-11 14:00:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: Bug
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Azure Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1601589    

Comment 2 Greg McCullough 2018-07-13 14:14:57 UTC
Moving to the providers team to review.

There seem to be a few directly related issues that we would be ignoring if we were to just try to fix provisioning, namely missed authentication checks and failing refreshes.

I am not sure I can agree at the moment with the "Expected Results" comment that provisioning should call re-authenticate until we try to understand the root cause of the other issues.

Comment 4 Greg McCullough 2018-07-13 19:25:28 UTC
Michael - Failing sooner would make sense, but it does not guarantee that the credentials will not switch from valid to invalid between the start of provisioning and the looping checkprovisioned state, since that state can run for a while depending on how long provisioning takes.

I believe we use the "with_provider_connection" call for both the start of provisioning and the checkprovisioned call, so raising on bad credentials would be the job of the "with_provider_connection" method itself.

checkprovisioned goes through here:
https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L3

initiate provisioning here:
https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L185
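
To illustrate the point about a shared connection method, here is a minimal Ruby sketch. It is not the ManageIQ implementation: FakeProvider, InvalidCredentialsError, start_clone and check_provisioned are hypothetical stand-ins, and only the name with_provider_connection comes from the discussion above. It shows why a credential check inside the shared method covers both the initial request and every later poll.

```ruby
# Hypothetical stand-ins; not the real ManageIQ or azure-armrest classes.
class InvalidCredentialsError < StandardError; end

class FakeProvider
  attr_accessor :credentials_valid

  def initialize
    @credentials_valid = true
  end

  # Yields a connection, raising first if the credentials are bad.
  # Every caller goes through this one method, so the initial clone
  # request and each later status poll get the same check.
  def with_provider_connection
    raise InvalidCredentialsError, "credentials rejected" unless credentials_valid
    yield :connection
  end
end

def start_clone(provider)
  provider.with_provider_connection { |_conn| :clone_started }
end

def check_provisioned(provider)
  provider.with_provider_connection { |_conn| :still_running }
end

provider = FakeProvider.new
start_result = start_clone(provider)

# Credentials go bad while the poll loop is still running.
provider.credentials_valid = false
poll_result =
  begin
    check_provisioned(provider)
  rescue InvalidCredentialsError => e
    "poll failed: #{e.message}"
  end
```

In this sketch the start succeeds and the later poll fails with the credential error, which is the valid-to-invalid switch described above.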

Comment 8 Daniel Berger 2018-07-16 18:43:25 UTC
An OpenTimeoutException means either your network has failed, or Azure is having a really bad day and we aren't able to connect. In this case it is probably the latter, since we had at least one other report of this happening.

As per Joe R's suggestion, we modified things so that our existing authentication code will behave better by raising the appropriate exception, which in turn will result in retries with exponential backoff times.

In other words, your refresh may still fail if Azure is down for long periods of time, but this is a more robust approach that should deal with sporadic failures:

https://github.com/ManageIQ/manageiq-providers-azure/pull/278
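
The general pattern described here (raise a specific exception for transient failures, then retry with growing delays) can be sketched in Ruby as follows. This is an illustration only, not the code from the pull request: ConnectTimeout, BadCredentials, MAX_ATTEMPTS, backoff_delay and check_with_retries are all hypothetical names, and the delays are arbitrary.

```ruby
# Hypothetical names throughout; an illustration, not the code from the PR.
class ConnectTimeout < StandardError; end # transient: worth retrying
class BadCredentials < StandardError; end # permanent: do not retry

MAX_ATTEMPTS = 5

# Delay in seconds before the next try: 2, 4, 8, ... for attempts 1, 2, 3, ...
def backoff_delay(attempt, base: 2)
  base**attempt
end

# Runs the block, retrying only on ConnectTimeout. The sleeper is injectable
# so the example can run without actually waiting.
def check_with_retries(sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue ConnectTimeout
    raise if attempt >= MAX_ATTEMPTS # give up and re-raise the timeout
    sleeper.call(backoff_delay(attempt))
    retry
  end
end

delays = []
calls = 0
result = check_with_retries(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  calls += 1
  raise ConnectTimeout, "timed out" if attempt < 4 # fail three times, then succeed
  :valid
end
```

Because only ConnectTimeout is rescued, a BadCredentials error would propagate on the first attempt. That is the reason for raising the appropriate exception type: the caller can tell a sporadic outage from a genuine authentication failure.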

Comment 10 Daniel Berger 2018-07-17 13:50:56 UTC
*** Bug 1601583 has been marked as a duplicate of this bug. ***

Comment 11 Bronagh Sorota 2018-07-17 15:10:44 UTC
Is the customer still experiencing this problem? As described in the comments above, it looks like Azure was having network issues last Friday.

Comment 16 Brandon Squizzato 2018-08-04 02:07:09 UTC
Verified on 5.10.0.8 by:
1) Setting up Azure with an HTTP proxy
2) Bringing Azure "down" by using iptables -j DROP on the proxy machine

After hitting an OpenTimeoutException we can see the auth checks are being retried with multiple attempts:

[----] W, [2018-08-03T22:03:33.619836 #63909:e2ef78]  WARN -- : MIQ(ManageIQ::Providers::Azure::CloudManager#authentication_check_no_validation) type: ["default"] for [1] [azure] Validation failed: error, [] Timed out connecting to server (cause: Timed out connecting to server)
[----] I, [2018-08-03T22:03:33.642201 #63909:e2ef78]  INFO -- : MIQ(MiqQueue.put) Message id: [4553],  id: [], Zone: [default], Role: [ems_operations], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-08-04 02:07:33 UTC], Data: [], Args: [[], {:attempt=>3}]
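
For anyone reading these lines later, the retry attempt and the scheduled delay can be pulled out of the MiqQueue.put entry with a short Ruby sketch. The line below is abbreviated from the log above. The UTC-4 offset for the log timestamp is an assumption, since the log does not record a zone, and the variable names are only illustrative.

```ruby
require "time"

# Abbreviated copy of the MiqQueue.put line above (fields unchanged).
queue_line = 'I, [2018-08-03T22:03:33.642201 #63909:e2ef78]  INFO -- : MIQ(MiqQueue.put) ' \
             'Command: [ExtManagementSystem.authentication_check_types], ' \
             'Deliver On: [2018-08-04 02:07:33 UTC], Args: [[], {:attempt=>3}]'

attempt    = queue_line[/:attempt=>(\d+)/, 1].to_i
deliver_on = Time.parse(queue_line[/Deliver On: \[([^\]]+)\]/, 1])

# Whole-second precision only, because Deliver On is printed without fractions.
logged_raw = queue_line[/\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})/, 1]

# ASSUMPTION: the log timestamp is local time at UTC-4; the log does not say.
logged_at = Time.parse("#{logged_raw}-04:00")

delay_seconds = (deliver_on - logged_at).to_i
puts "attempt #{attempt} scheduled #{delay_seconds} seconds after the failure"
```

Under that time zone assumption, the third attempt is queued four minutes after the failed check. That would fit a delay that grows with each attempt, though a single data point cannot confirm the schedule.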