1600968 – Service Provision is Failing Because Last Auth Check Failed for Azure Provider

Summary: Service Provision is Failing Because Last Auth Check Failed for Azure Provider

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Providers
Sub Component:
Version:	5.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Daniel Berger
QA Contact:	Brandon Squizzato
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1601583
Depends On:
Blocks:	1601589

Reported:	2018-07-13 13:46 UTC by myoder
Modified:	2021-12-10 16:38 UTC
CC List:	14 users
Fixed In Version:	5.10.0.5
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1601589
Environment:
Last Closed:	2019-02-11 14:00:16 UTC
Category:	Bug
Cloudforms Team:	Azure
Target Upstream Version:
Embargoed:

Comment 2 Greg McCullough 2018-07-13 14:14:57 UTC

Moving to the providers team to review.

There seem to be a few directly related issues that we would be ignoring if we just tried to fix provisioning, namely missed authentication checks and failing refreshes.

I am not sure I can agree at the moment with the "Expected Results" comment that provisioning should call re-authenticate until we try to understand the root cause of the other issues.

Comment 4 Greg McCullough 2018-07-13 19:25:28 UTC

Michael - Failing sooner would make sense, but it does not guarantee that the credentials stay valid: they can still switch from valid to invalid between the start of provisioning and the looping checkprovisioned state, which can run for a while depending on how long provisioning takes.

I believe we use the "with_provider_connection" call both at the start of provisioning and in the checkprovisioned call, so raising on bad credentials would be the job of the "with_provider_connection" method.

checkprovisioned goes through here:
https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L3

initiate provisioning here:
https://github.com/ManageIQ/manageiq-providers-azure/blob/master/app/models/manageiq/providers/azure/cloud_manager/provision/cloning.rb#L185
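
To illustrate the point about a shared connection wrapper, here is a minimal sketch of how one method that raises on bad credentials would cover both the start of provisioning and the later polling step. This is not the ManageIQ implementation: FakeProvider, BadCredentialsError, start_provisioning and check_provisioned are hypothetical names, and only the "with_provider_connection" name is taken from the comment above.

```ruby
# Hypothetical stand-ins for illustration only; not ManageIQ classes.
class BadCredentialsError < StandardError; end

class FakeProvider
  attr_writer :credentials_valid

  def initialize(credentials_valid)
    @credentials_valid = credentials_valid
  end

  # Single choke point: every caller gets the same credential check
  # before a connection is handed to its block.
  def with_provider_connection
    raise BadCredentialsError, "credentials invalid" unless @credentials_valid
    yield :connection
  end
end

# Both steps go through the same wrapper, so either one fails fast
# if the credentials have gone bad in the meantime.
def start_provisioning(provider)
  provider.with_provider_connection { |_conn| :started }
end

def check_provisioned(provider)
  provider.with_provider_connection { |_conn| :still_running }
end
```

In this sketch, credentials that become invalid after provisioning has started are still caught on the next check_provisioned call, which is the scenario described above.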

Comment 8 Daniel Berger 2018-07-16 18:43:25 UTC

An OpenTimeoutException means either your network has failed, or Azure is having a really bad day and we aren't able to connect. In this case probably the latter since we had at least one other report of this happening.

As per Joe R's suggestion, we modified things so that our existing authentication code will behave better by raising the appropriate exception, which in turn will result in retries with exponential backoff times.

In other words, your refresh may still fail if Azure is down for long periods of time, but this is a more robust approach that should deal with sporadic failures:

https://github.com/ManageIQ/manageiq-providers-azure/pull/278
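
For readers unfamiliar with the pattern, the following is a minimal, generic sketch of retrying with exponential backoff. It is not the code from the pull request above: TransientConnectionError, with_backoff, and the max_attempts and base_delay values are all assumptions chosen for illustration.

```ruby
# Hypothetical error class standing in for a connection timeout.
class TransientConnectionError < StandardError; end

# Runs the block, retrying on TransientConnectionError with a delay
# that doubles after each failed attempt. The sleeper is injectable
# so the delays can be observed without actually waiting.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(s) { sleep(s) })
  attempt = 1
  begin
    yield attempt
  rescue TransientConnectionError
    # Give up and re-raise once the attempt budget is spent.
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    attempt += 1
    retry
  end
end
```

A sporadic failure is absorbed by the early retries, while a long outage still surfaces as an error once max_attempts is reached, which matches the behavior described above.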

Comment 10 Daniel Berger 2018-07-17 13:50:56 UTC

*** Bug 1601583 has been marked as a duplicate of this bug. ***

Comment 11 Bronagh Sorota 2018-07-17 15:10:44 UTC

Is the customer still experiencing this problem? As described in the comments above, it looks like Azure was having network issues last Friday.

Comment 16 Brandon Squizzato 2018-08-04 02:07:09 UTC

Verified on 5.10.0.8 by:
1) Setting up Azure with an HTTP proxy
2) Bringing Azure "down" by using iptables -j DROP on the proxy machine

After hitting an OpenTimeoutException, we can see that the auth checks are retried over multiple attempts:

[----] W, [2018-08-03T22:03:33.619836 #63909:e2ef78]  WARN -- : MIQ(ManageIQ::Providers::Azure::CloudManager#authentication_check_no_validation) type: ["default"] for [1] [azure] Validation failed: error, [] Timed out connecting to server (cause: Timed out connecting to server)
[----] I, [2018-08-03T22:03:33.642201 #63909:e2ef78]  INFO -- : MIQ(MiqQueue.put) Message id: [4553],  id: [], Zone: [default], Role: [ems_operations], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-08-04 02:07:33 UTC], Data: [], Args: [[], {:attempt=>3}]
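
When checking many such lines, the retry attempt and the scheduled delivery time can be pulled out of a MiqQueue.put entry mechanically. The sketch below is a hypothetical helper (retry_info is not part of ManageIQ) and assumes the "Deliver On: [...]" and ":attempt=>N" fields appear exactly as in the line above.

```ruby
require "time"

# Extracts the retry attempt number and the scheduled delivery time
# from a MiqQueue.put log line. Returns nil if either field is missing.
def retry_info(line)
  attempt = line[/:attempt=>(\d+)/, 1]
  deliver = line[/Deliver On: \[([^\]]+)\]/, 1]
  return nil unless attempt && deliver

  { attempt: Integer(attempt), deliver_on: Time.parse(deliver).utc }
end

# Fragment copied from the log entry above.
line = 'Command: [ExtManagementSystem.authentication_check_types], ' \
       'Timeout: [600], Priority: [100], State: [ready], ' \
       'Deliver On: [2018-08-04 02:07:33 UTC], Data: [], Args: [[], {:attempt=>3}]'

info = retry_info(line)
puts "attempt #{info[:attempt]} scheduled for #{info[:deliver_on]}"
```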