Description of problem: We have a zone that supports 2 providers (provider A and provider B). We have successful EmsRefreshes for provider A, but provider B never refreshes. We have validated credentials for both providers so they are correct. When initiating a full-refresh from the UI, we confirmed seeing the message put down on the queue. For some reason the check by the scheduler on these VC sets the status to error and never goes back and resets it. Validation works and in fact if you reset the status with the Authentication button all comes live. Version-Release number of selected component (if applicable): 5.6
There was a temporary network interruption that caused the credential validation to fail: WARN -- : MIQ(ManageIQ::Providers::Vmware::InfraManager#verify_credentials) #<HTTPClient::ConnectTimeoutError: execution expired> It did not try again for a full day, the workaround is to increase the frequency that credential validation is done. The proposed fix is to perform a retry with exponential backoff if a temporary error is encountered.
Credential validation failed with the error HTTPClient::ConnectTimeoutError on 10-05 and 10-11. There are regular checks scheduled by the wk1 appliance from 09-21 - 09-30, then once on 10-04. The sched appliance starts regularly scheduling checks on 10-08 and 10-09, then on 10-11 it starts running auth checks every hour. This means that no authentication checks were being scheduled for many days to correct the invalid credentials. An MiqScheduleWorker appears to have been running the whole time so it is not clear why auth checks weren't being queued for that time period.
Issue appears to be with the schedule worker, moving to appliance team.
This was fixed in https://github.com/ManageIQ/manageiq/pull/11964 "This pull request: * Changes the default schedule for authentication check from 1.day to 1.hour * Implements retries for Unreachable and generic errors (errors that are not missing/invalid credentials) * The first retry occurs in 2 minutes, then 4, then 8, then 16 (last retry is scheduled ~30 minutes after the first failure * After the last failure, we'll fall back to the hourly schedule * There is some logic to prevent us from queueing up the same authentication check twice, although it's a bit more complicated than I like because the "in-flight" message that fails also tries to queue up a message and SHOULD even though the "in-flight" one is in the queue." It's available in the following downstream tags: 5.7.0.7 5.7.0.8 5.7.0.9