Bug 1383442
| Summary: | Scheduled validation check for VCenter sets status to error and never re-validates | |||
|---|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | Jared Deubel <jdeubel> | |
| Component: | Appliance | Assignee: | Joe Rafaniello <jrafanie> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Alex Newman <anewman> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 5.6.0 | CC: | abellott, cpelland, gblomqui, gtanzill, jcutter, jfrey, jhardy, obarenbo, simaishi | |
| Target Milestone: | GA | Keywords: | TestOnly | |
| Target Release: | 5.8.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | vsphere | |||
| Fixed In Version: | 5.8.0.0 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1391715 (view as bug list) | Environment: | ||
| Last Closed: | 2017-06-12 17:06:27 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | VMware | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1391715 | |||
|
Description
Jared Deubel
2016-10-10 15:46:42 UTC
There was a temporary network interruption that caused the credential validation to fail:

WARN -- : MIQ(ManageIQ::Providers::Vmware::InfraManager#verify_credentials) #<HTTPClient::ConnectTimeoutError: execution expired>

It did not try again for a full day. The workaround is to increase the frequency of credential validation. The proposed fix is to retry with exponential backoff if a temporary error is encountered.

Credential validation failed with the error HTTPClient::ConnectTimeoutError on 10-05 and 10-11. The wk1 appliance scheduled regular checks from 09-21 to 09-30, then once on 10-04. The sched appliance starts scheduling checks regularly on 10-08 and 10-09, then on 10-11 it starts running auth checks every hour. This means that no authentication checks were scheduled for many days to correct the invalid credentials. An MiqScheduleWorker appears to have been running the whole time, so it is not clear why auth checks weren't being queued during that period.

Issue appears to be with the schedule worker, moving to appliance team.

This was fixed in https://github.com/ManageIQ/manageiq/pull/11964

"This pull request:
* Changes the default schedule for authentication check from 1.day to 1.hour
* Implements retries for Unreachable and generic errors (errors that are not missing/invalid credentials)
* The first retry occurs in 2 minutes, then 4, then 8, then 16 (last retry is scheduled ~30 minutes after the first failure)
* After the last failure, we'll fall back to the hourly schedule
* There is some logic to prevent us from queueing up the same authentication check twice, although it's a bit more complicated than I like because the "in-flight" message that fails also tries to queue up a message and SHOULD even though the "in-flight" one is in the queue."

It's available in the following downstream tags: 5.7.0.7 5.7.0.8 5.7.0.9
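
The retry schedule quoted from the pull request can be sketched roughly as below. This is a minimal illustration, not the code from the pull request: the constant names, method names, and status symbols (`:unreachable`, `:error`, `:invalid`, `:incomplete`) are all assumptions made for the example, and only the delays (2, 4, 8, 16 minutes) and the hourly fallback come from the description above.

```ruby
# Illustrative sketch only: constant and method names are hypothetical,
# not taken from the ManageIQ codebase.

# Delays (in minutes) before each retry, as listed in the pull request.
RETRY_DELAYS_MINUTES = [2, 4, 8, 16].freeze

# Regular schedule interval (in minutes) used once retries are exhausted.
HOURLY_FALLBACK_MINUTES = 60

# Assumed status symbols: :unreachable and :error stand for temporary
# failures; anything else (e.g. :invalid, :incomplete) is not retried.
RETRYABLE_STATUSES = [:unreachable, :error].freeze

def retryable?(status)
  RETRYABLE_STATUSES.include?(status)
end

# Minutes to wait before the next check, given how many consecutive
# checks have failed so far (1 = only the initial check has failed).
def next_check_delay_minutes(status, consecutive_failures)
  raise ArgumentError, "consecutive_failures must be >= 1" if consecutive_failures < 1
  return HOURLY_FALLBACK_MINUTES unless retryable?(status)

  # Past the end of the retry list, fall back to the hourly schedule.
  RETRY_DELAYS_MINUTES.fetch(consecutive_failures - 1, HOURLY_FALLBACK_MINUTES)
end

# Minutes after the first failure at which each retry runs.
def retry_offsets_minutes
  elapsed = 0
  RETRY_DELAYS_MINUTES.map { |delay| elapsed += delay }
end
```

Under these assumptions, `retry_offsets_minutes` returns `[2, 6, 14, 30]`, which matches the statement that the last retry lands about 30 minutes after the first failure. A fifth consecutive failure, or any non-retryable status, yields the 60-minute fallback.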