Bug 1383442 - Scheduled validation check for VCenter sets status to error and never re-validates
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: GA
Target Release: 5.8.0
Assignee: Joe Rafaniello
QA Contact: Alex Newman
URL:
Whiteboard: vsphere
Depends On:
Blocks: 1391715
Reported: 2016-10-10 15:46 UTC by Jared Deubel
Modified: 2019-12-16 07:03 UTC

Fixed In Version: 5.8.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1391715
Environment:
Last Closed: 2017-06-12 17:06:27 UTC
Category: ---
Cloudforms Team: VMware
Target Upstream Version:
Embargoed:



Description Jared Deubel 2016-10-10 15:46:42 UTC
Description of problem:
We have a zone that supports 2 providers (provider A and provider B). We have successful EmsRefreshes for provider A, but provider B never refreshes.
We have validated credentials for both providers so they are correct. When initiating a full-refresh from the UI, we confirmed seeing the message put down on the queue.

For some reason, the scheduler's check on this vCenter sets the status to error and never goes back to reset it. Validation works; in fact, if you reset the status with the Authentication button, everything comes back to life.

Version-Release number of selected component (if applicable):
5.6

Comment 5 Adam Grare 2016-10-13 14:21:21 UTC
There was a temporary network interruption that caused the credential validation to fail:
WARN -- : MIQ(ManageIQ::Providers::Vmware::InfraManager#verify_credentials) #<HTTPClient::ConnectTimeoutError: execution expired>

It did not try again for a full day. The workaround is to increase how often credential validation runs. The proposed fix is to retry with exponential backoff when a temporary error is encountered.

Comment 6 Adam Grare 2016-10-15 18:41:34 UTC
Credential validation failed with the error HTTPClient::ConnectTimeoutError on 10-05 and 10-11.

There are regular checks scheduled by the wk1 appliance from 09-21 - 09-30, then once on 10-04.
The sched appliance starts regularly scheduling checks on 10-08 and 10-09, then on 10-11 it starts running auth checks every hour.

This means that no authentication checks were being scheduled for many days to correct the invalid credentials.

An MiqScheduleWorker appears to have been running the whole time so it is not clear why auth checks weren't being queued for that time period.

Comment 7 Adam Grare 2016-10-27 16:07:00 UTC
Issue appears to be with the schedule worker, moving to appliance team.

Comment 9 Joe Rafaniello 2016-11-03 15:43:35 UTC
This was fixed in https://github.com/ManageIQ/manageiq/pull/11964

"This pull request:

* Changes the default schedule for authentication check from 1.day to 1.hour

* Implements retries for Unreachable and generic errors (errors that are not missing/invalid credentials)

* The first retry occurs in 2 minutes, then 4, then 8, then 16 (last retry is scheduled ~30 minutes after the first failure)

* After the last failure, we'll fall back to the hourly schedule

* There is some logic to prevent us from queueing up the same authentication check twice, although it's a bit more complicated than I like because the "in-flight" message that fails also tries to queue up a message and SHOULD even though the "in-flight" one is in the queue."
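
The retry schedule described in the list above can be sketched roughly as follows. This is an illustration only, not the code from the pull request: the constant names, the `next_check_delay` helper, and the `:unreachable` / `:error` status symbols are all assumptions made for the example.

```ruby
# Hypothetical sketch of the retry schedule; none of these names are
# taken from ManageIQ itself.

# Delays (in minutes) between successive retries after a transient failure.
RETRY_DELAYS_MINUTES = [2, 4, 8, 16].freeze

# Regular interval (in minutes) used once retries are exhausted.
HOURLY_INTERVAL_MINUTES = 60

# Statuses assumed to be transient, i.e. not missing/invalid credentials.
RETRYABLE_STATUSES = %i[unreachable error].freeze

# Returns the number of minutes to wait before the next authentication check.
# +status+  is the outcome of the failed check (assumed symbol).
# +attempt+ is the number of retries already made for this failure (0-based).
def next_check_delay(status, attempt)
  # Credential problems are not retried early; wait for the hourly schedule.
  return HOURLY_INTERVAL_MINUTES unless RETRYABLE_STATUSES.include?(status)

  # Use the backoff table while retries remain, then fall back to hourly.
  RETRY_DELAYS_MINUTES.fetch(attempt, HOURLY_INTERVAL_MINUTES)
end
```

With these values the four retries land 2, 6, 14, and 30 minutes after the first failure, which matches the "~30 minutes" figure quoted above; a fifth failure would simply wait for the next hourly check.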


It's available in the following downstream tags:

5.7.0.7
5.7.0.8
5.7.0.9

