New commit detected on ManageIQ/manageiq-providers-azure/gaprindashvili:
https://github.com/ManageIQ/manageiq-providers-azure/commit/78cb5e84e41f9969e37eca7283a7653b62636562

commit 78cb5e84e41f9969e37eca7283a7653b62636562
Author:     Bronagh Sorota <bsorota>
AuthorDate: Mon Jul 16 14:33:34 2018 -0400
Commit:     Bronagh Sorota <bsorota>
CommitDate: Mon Jul 16 14:33:34 2018 -0400

    Merge pull request #278 from djberg96/unreachable

    Default to StandardError if a connection cannot be made
    (cherry picked from commit 745543cdfd81c3883a93f2f1cc59caaa86e31ad6)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1601589

 app/models/manageiq/providers/azure/manager_mixin.rb       | 2 +-
 spec/models/manageiq/providers/azure/cloud_manager_spec.rb | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
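The fix re-raises connection failures as a plain StandardError so the authentication retry path can handle them. The exact diff isn't inline here, so the following is only a hypothetical sketch of the idea (the method name and message are made up, not the actual manager_mixin.rb code):

```ruby
require 'timeout'
require 'socket'

# Hypothetical sketch: re-raise low-level connection failures as a plain
# StandardError so the caller's authentication_check treats the failure as
# retryable instead of as a fatal, non-retryable exception.
def verify_credentials_with_retryable_errors
  yield
rescue Timeout::Error, SocketError => err
  # StandardError is what the retry path expects when a host is unreachable
  raise StandardError, "Unable to reach endpoint: #{err.message}"
end
```

The point is only the error class translation: anything that signals "provider unreachable" must surface as a retryable error, not one that aborts the check cycle.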
Since I failed to clarify how to properly test this, I'll try to here:

1) Make sure the scheduler role is enabled.
2) Modify the authentication_check_interval in the schedule_worker section so it checks authentications soon (10.minutes) after startup. The default is 1.hour. Note: the interval must give you enough time to "fake" a provider being down, so 5.minutes might not be enough.
3) Restart the appliance, or just kill the schedule worker pid (not kill -9, just kill).
4) Verify you get authentication_check_schedule MiqQueue put lines in evm.log within 10 minutes (or whatever you set in step 2):

   grep authentication_check_schedule log/evm.log

   If you don't see them, the worker didn't read the interval you set in step 2:
   https://github.com/ManageIQ/manageiq/blob/ce1baa277c780ba807060d81cb179958ab1f289b/app/models/miq_schedule_worker/runner.rb#L148-L149
5) Now make sure the authentication is valid in the UI/database.
6) Bring the provider "down" via an iptables drop.
7) When the schedule enqueues the next authentication_check_schedule (as seen in step 4), it will go through all ems and host authentications and enqueue an authentication_check for each.
8) The authentication_check for your "down" provider should fail with an error that is NOW retryable. It should then enqueue a recheck within 2 minutes, as described here:
   https://github.com/ManageIQ/manageiq/blob/d101c7997d556f08a7b0747703654ba74915b7da/app/models/mixins/authentication_mixin.rb#L236-L240
   You should see "attempt" increment in the queue message for these retries, but not in the initial one that came from the schedule firing.

This bug fixes the problem in step 8: we were raising an unretryable exception, so you'd have to wait for the next authentication_check_interval.
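The recheck behavior in step 8 can be sketched as follows. This is a simplified illustration, not the real AuthenticationMixin code (the constant names and the attempt cap are assumptions); the real scheduling lives at the authentication_mixin.rb link in step 8:

```ruby
# Simplified sketch of the recheck scheduling: a failed-but-retryable
# authentication_check enqueues another check with :attempt incremented
# and a short delivery delay, until a cap is reached. Names and the cap
# value are assumptions for illustration only.
MAX_ATTEMPTS = 6          # assumed cap, not the real ManageIQ value
RETRY_DELAY  = 2 * 60     # seconds; "within 2 minutes" per step 8

def next_authentication_check(attempt, now = Time.now)
  # Past the cap, stop retrying until the next authentication_check_interval
  return nil if attempt >= MAX_ATTEMPTS
  { :deliver_on => now + RETRY_DELAY, :attempt => attempt + 1 }
end
```

This matches what the queue messages show: the retry messages carry an incrementing :attempt and a Deliver On timestamp roughly two minutes out, while the initial message from the schedule has neither.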
Verified on 5.9.4.2 by:

1) Setting up azure w/ an http proxy
2) Bringing azure "down" by using iptables -j DROP on the proxy machine

After hitting an OpenTimeoutException, we can see the auth checks are being retried with multiple attempts:

[----] W, [2018-08-03T22:08:07.060751 #4044:59310c] WARN -- : MIQ(ManageIQ::Providers::Azure::CloudManager#authentication_check_no_validation) type: ["default"] for [3] [azure] Validation failed: error, [] Timed out connecting to server (cause: Timed out connecting to server)
[----] W, [2018-08-03T22:08:07.061307 #4044:59310c] WARN -- : MIQ(AuthUseridPassword#validation_failed) [ExtManagementSystem] [3], previously valid on: 2018-08-04 01:54:43 UTC, previous status: [Error]
[----] I, [2018-08-03T22:08:07.081018 #4044:59310c] INFO -- : MIQ(MiqQueue.put) Message id: [41580], id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["ManageIQ::Providers::Azure::CloudManager", 3], "ems_auth_error", {}]
[----] I, [2018-08-03T22:08:07.090988 #4044:59310c] INFO -- : MIQ(MiqQueue.put) Message id: [41581], id: [], Zone: [default], Role: [ems_operations], Server: [], Ident: [generic], Target id: [], Instance id: [4], Task id: [], Command: [ExtManagementSystem.authentication_check_types], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-08-04 02:12:07 UTC], Data: [], Args: [[], {:attempt=>3}]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2561