Bug 1383442

Summary: Scheduled validation check for VCenter sets status to error and never re-validates
Product: Red Hat CloudForms Management Engine Reporter: Jared Deubel <jdeubel>
Component: ApplianceAssignee: Joe Rafaniello <jrafanie>
Status: CLOSED CURRENTRELEASE QA Contact: Alex Newman <anewman>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 5.6.0CC: abellott, cpelland, gblomqui, gtanzill, jcutter, jfrey, jhardy, obarenbo, simaishi
Target Milestone: GAKeywords: TestOnly
Target Release: 5.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: vsphere
Fixed In Version: 5.8.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1391715 (view as bug list) Environment:
Last Closed: 2017-06-12 17:06:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: VMware Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1391715    

Description Jared Deubel 2016-10-10 15:46:42 UTC
Description of problem:
We have a zone that supports 2 providers (provider A and provider B). We have successful EmsRefreshes for provider A, but provider B never refreshes.
We have validated credentials for both providers so they are correct. When initiating a full-refresh from the UI, we confirmed seeing the message put down on the queue.

For some reason the check by the scheduler on these VC sets the status to error and never goes back and resets it. Validation works and in fact if you reset the status with the Authentication button all comes live.

Version-Release number of selected component (if applicable):
5.6

Comment 5 Adam Grare 2016-10-13 14:21:21 UTC
There was a temporary network interruption that caused the credential validation to fail:
WARN -- : MIQ(ManageIQ::Providers::Vmware::InfraManager#verify_credentials) #<HTTPClient::ConnectTimeoutError: execution expired>

It did not try again for a full day, the workaround is to increase the frequency that credential validation is done.  The proposed fix is to perform a retry with exponential backoff if a temporary error is encountered.

Comment 6 Adam Grare 2016-10-15 18:41:34 UTC
Credential validation failed with the error HTTPClient::ConnectTimeoutError on 10-05 and 10-11.

There are regular checks scheduled by the wk1 appliance from 09-21 - 09-30, then once on 10-04.
The sched appliance starts regularly scheduling checks on 10-08 and 10-09, then on 10-11 it starts running auth checks every hour.

This means that no authentication checks were being scheduled for many days to correct the invalid credentials.

An MiqScheduleWorker appears to have been running the whole time so it is not clear why auth checks weren't being queued for that time period.

Comment 7 Adam Grare 2016-10-27 16:07:00 UTC
Issue appears to be with the schedule worker, moving to appliance team.

Comment 9 Joe Rafaniello 2016-11-03 15:43:35 UTC
This was fixed in https://github.com/ManageIQ/manageiq/pull/11964

"This pull request:

* Changes the default schedule for authentication check from 1.day to 1.hour

* Implements retries for Unreachable and generic errors (errors that are not missing/invalid credentials)

* The first retry occurs in 2 minutes, then 4, then 8, then 16 (last retry is scheduled ~30 minutes after the first failure

* After the last failure, we'll fall back to the hourly schedule

* There is some logic to prevent us from queueing up the same authentication check twice, although it's a bit more complicated than I like because the "in-flight" message that fails also tries to queue up a message and SHOULD even though the "in-flight" one is in the queue."


It's available in the following downstream tags:

5.7.0.7
5.7.0.8
5.7.0.9