Bug 1601589 - Service Provision is Failing Because Last Auth Check Failed for Azure Provider
Summary: Service Provision is Failing Because Last Auth Check Failed for Azure Provider
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: GA
Target Release: 5.9.4
Assignee: Daniel Berger
QA Contact: Brandon Squizzato
URL:
Whiteboard:
Depends On: 1600968
Blocks:
Reported: 2018-07-16 19:17 UTC by Satoe Imaishi
Modified: 2018-09-04 18:01 UTC (History)

Fixed In Version: 5.9.4.1
Doc Text:
Clone Of: 1600968
Environment:
Last Closed: 2018-09-04 18:00:53 UTC
Category: ---
Cloudforms Team: Azure




Links
System: Red Hat Product Errata
ID: RHSA-2018:2561
Priority: None
Status: None
Summary: None
Last Updated: 2018-09-04 18:01:46 UTC

Comment 2 CFME Bot 2018-07-19 00:52:44 UTC
New commit detected on ManageIQ/manageiq-providers-azure/gaprindashvili:

https://github.com/ManageIQ/manageiq-providers-azure/commit/78cb5e84e41f9969e37eca7283a7653b62636562
commit 78cb5e84e41f9969e37eca7283a7653b62636562
Author:     Bronagh Sorota <bsorota@redhat.com>
AuthorDate: Mon Jul 16 14:33:34 2018 -0400
Commit:     Bronagh Sorota <bsorota@redhat.com>
CommitDate: Mon Jul 16 14:33:34 2018 -0400

    Merge pull request #278 from djberg96/unreachable

    Default to StandardError if a connection cannot be made
    (cherry picked from commit 745543cdfd81c3883a93f2f1cc59caaa86e31ad6)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1601589

 app/models/manageiq/providers/azure/manager_mixin.rb | 2 +-
 spec/models/manageiq/providers/azure/cloud_manager_spec.rb | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
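
For readers who do not want to open the diff, here is a minimal sketch of the idea in the commit title. It is not the provider's actual code. The class names `UnauthorizedError` and `InvalidCredentialsError` and the helpers `verify_connection` and `retryable?` are hypothetical. The assumption is that only an explicit credential rejection should be reported as a credentials problem, and that any other connection failure should surface as a plain StandardError so the caller can treat it as transient.

```ruby
# Hypothetical error classes for illustration only; they are not the
# provider's real classes.
class UnauthorizedError < StandardError; end        # server rejected the credentials
class InvalidCredentialsError < StandardError; end  # treated as non-retryable here

# Runs the given block (the connection attempt) and normalises failures.
# An explicit rejection becomes InvalidCredentialsError. Anything else
# (timeout, refused connection, DNS failure) is re-raised as a plain
# StandardError so that it is not mistaken for bad credentials.
def verify_connection
  yield
rescue UnauthorizedError => err
  raise InvalidCredentialsError, "Incorrect credentials: #{err.message}"
rescue InvalidCredentialsError
  raise
rescue StandardError => err
  raise StandardError, "Unexpected response returned from system: #{err.message}"
end

# Assumed policy: every failure except a credentials rejection may be retried.
def retryable?(error)
  !error.is_a?(InvalidCredentialsError)
end
```

With this shape, a timeout raised inside the block comes out as a StandardError for which `retryable?` is true, while a rejected login comes out as `InvalidCredentialsError` and is not retried.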

Comment 12 Joe Rafaniello 2018-08-03 19:58:15 UTC
Since I failed to clarify how to test this properly, I'll try to do so here:

1) Make sure the scheduler role is enabled.

2) Modify the authentication_check_interval in the schedule_worker section so that it checks authentications soon after startup (for example, 10.minutes).  The default is 1.hour.  Note that the interval must be long enough for you to "fake" a provider being down, so give yourself enough time.  5.minutes might not be enough.

3) Restart the appliance or just kill the schedule worker pid (not kill -9, just kill).

4) Verify that authentication_check_schedule MiqQueue.put lines appear in evm.log within 10 minutes (or whatever interval you set in step 2).

grep authentication_check_schedule log/evm.log

If you don't, the schedule worker did not read the interval you set in step 2:

https://github.com/ManageIQ/manageiq/blob/ce1baa277c780ba807060d81cb179958ab1f289b/app/models/miq_schedule_worker/runner.rb#L148-L149

5) Now, make sure the authentication is valid in the UI/database.

6) Bring the provider "down" with an iptables DROP rule.

7) When the schedule enqueues the next authentication_check_schedule (as seen in step 4), it will go through all EMS and host authentications and enqueue an authentication_check for each.

8) The authentication_check for your "down" provider should fail with an error that is NOW retryable.  It should then enqueue a recheck within 2 minutes or as described here:
https://github.com/ManageIQ/manageiq/blob/d101c7997d556f08a7b0747703654ba74915b7da/app/models/mixins/authentication_mixin.rb#L236-L240

You should see "attempt" increment in the queue message for these retries but not for the initial one that came from the schedule firing.

This bug fixes the problem in step 8: we were raising a non-retryable exception, so you had to wait for the next authentication_check_interval before the check ran again.
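
To make the retry behaviour in steps 7 and 8 concrete, here is a rough sketch of the flow. It is not the code in authentication_mixin.rb. `Message`, `MAX_ATTEMPTS`, `retry_delay_seconds` and `next_check` are hypothetical names. The doubling delay that starts at 2 minutes and the cap of 6 attempts are assumptions for illustration; the real intervals are in the file linked in step 8.

```ruby
# Hypothetical in-memory stand-in for a queue message; this is not MiqQueue.
Message = Struct.new(:command, :attempt, :deliver_on)

MAX_ATTEMPTS = 6 # assumed cap, chosen for illustration

# Assumed schedule: wait 2 minutes before attempt 2, 4 minutes before
# attempt 3, 8 minutes before attempt 4, and so on.
def retry_delay_seconds(attempt)
  (2**(attempt - 1)) * 60
end

# Returns the follow-up message to enqueue, or nil when no retry is due.
# `retryable` is true for transient failures (for example a timeout) and
# false for failures such as rejected credentials.
def next_check(attempt:, retryable:, now: Time.now.utc)
  return nil unless retryable

  next_attempt = attempt + 1
  return nil if next_attempt > MAX_ATTEMPTS

  deliver_on = now + retry_delay_seconds(next_attempt)
  Message.new("authentication_check_types", next_attempt, deliver_on)
end
```

Before the fix, the failure in step 8 fell into the `retryable: false` branch, so nothing was enqueued until the schedule fired again.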

Comment 13 Brandon Squizzato 2018-08-04 02:09:14 UTC
Verified on 5.9.4.2 by:
1) Setting up Azure with an HTTP proxy
2) Bringing Azure "down" by using iptables -j DROP on the proxy machine

After hitting an OpenTimeoutException we can see the auth checks are being retried with multiple attempts:

[----] W, [2018-08-03T22:08:07.060751 #4044:59310c]  WARN -- : MIQ(ManageIQ::Providers::Azure::CloudManager#authentication_check_no_validation) type: ["default"] for [3] [azure] Validation failed: error, [] Timed out connecting to server (cause: Timed out connecting to server)
[----] W, [2018-08-03T22:08:07.061307 #4044:59310c]  WARN -- : MIQ(AuthUseridPassword#validation_failed) [ExtManagementSystem] [3], previously valid on: 2018-08-04 01:54:43 UTC, previous status: [Error]
[----] I, [2018-08-03T22:08:07.081018 #4044:59310c]  INFO -- : MIQ(MiqQueue.put) Message id: [41580],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["ManageIQ::Providers::Azure::CloudManager", 3], "ems_auth_error", {}]
[----] I, [2018-08-03T22:08:07.090988 #4044:59310c]  INFO -- : MIQ(MiqQueue.put) Message id: [41581],  id: [], Zone: [default], Role: [ems_operations], Server: [], Ident: [generic], Target id: [], Instance id: [4], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-08-04 02:12:07 UTC], Data: [], Args: [[], {:attempt=>3}]
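
If you want to pull the retry attempts out of evm.log rather than read them by eye, something like the following should work. It is only a sketch: `RETRY_LINE` and `parse_retry` are hypothetical names, and the regular expression assumes the MiqQueue.put line format shown in the log excerpt above.

```ruby
# Matches MiqQueue.put lines for ExtManagementSystem.authentication_check_types
# and captures the "Deliver On" value and the attempt number from the args.
RETRY_LINE = /Command: \[ExtManagementSystem\.authentication_check_types\].*Deliver On: \[([^\]]*)\].*:attempt=>(\d+)/

# Returns a hash with the scheduled delivery time (as a string) and the
# attempt number, or nil when the line is not a retry message.
def parse_retry(line)
  match = RETRY_LINE.match(line)
  return nil unless match

  {:deliver_on => match[1], :attempt => match[2].to_i}
end

# Example usage against a log file (path assumed):
#   File.foreach("log/evm.log") do |line|
#     retry_info = parse_retry(line)
#     puts retry_info if retry_info
#   end
```

Lines for other commands, such as the MiqEvent.raise_evm_event message above, return nil and are skipped.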

Comment 15 errata-xmlrpc 2018-09-04 18:00:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2561

