Description of problem:
If a get of the current state fails (due to network errors, for example), the sync status is moved to "Failed" and never recovers.

Version-Release number of selected component (if applicable):

How reproducible:
Always, assuming the call to the API server fails at least once.

Steps to Reproduce:
1. Let the daemon run long enough that a get fails in the reconciliation loop
2.
3.

Actual results:
The sync status is "Failed" even though the sync happened successfully.

Expected results:
The status should recover as soon as the daemon checks that the two versions are identical.

Additional info:
Logs:

The daemon configuration is stable: the current generation is 3, so there is no need to update the status:
I0902 17:05:08.067764 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
I0902 17:05:08.072472 2241061 daemon.go:363] nodeStateSyncHandler(): Interface not changed
I0902 17:05:08.072484 2241061 daemon.go:301] Successfully synced
I0902 17:05:08.072489 2241061 daemon.go:264] worker queue size: 0

At some point, there is an error when fetching the state:
I0902 17:05:23.067976 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
W0902 17:05:23.070931 2241061 daemon.go:358] nodeStateSyncHandler(): Failed to fetch node state XXXXX.redhat.com: rpc error: code = Unavailable desc = transport is closing
I0902 17:05:23.070953 2241061 writer.go:61] Run(): refresh trigger

It updates the status with that error:
I0902 17:05:23.112955 2241061 writer.go:125] setNodeStateStatus(): syncStatus: Failed, lastSyncError: rpc error: code = Unavailable desc = transport is closing
E0902 17:05:23.123108 2241061 daemon.go:306] error syncing: rpc error: code = Unavailable desc = transport is closing, requeuing

It requeues the item, but since the local and remote versions are the same, the status is not updated and remains in:
lastSyncError: 'rpc error: code = Unavailable desc = transport is closing'
syncStatus: Failed

I0902 17:05:23.123120 2241061 daemon.go:264] worker queue size: 0
I0902 17:05:23.123478 2241061 daemon.go:266] get item: 3
I0902 17:05:23.123502 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
I0902 17:05:23.127624 2241061 daemon.go:363] nodeStateSyncHandler(): Interface not changed
I0902 17:05:23.127656 2241061 daemon.go:301] Successfully synced
^^^^ The lines above do not update the status.
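The sketch below is a minimal, illustrative reconstruction of the reconciliation flow described in the logs, not the daemon's actual code: the names nodeStateSyncHandler and setNodeStateStatus come from the logs, while fetchNodeState, lastAppliedGeneration, and the daemon struct are assumed for illustration. It shows how a transient fetch error leaves the status "Failed", and how refreshing the status on a successful no-op sync (instead of returning before touching it) lets it recover.

package main

import "fmt"

type syncStatus string

const (
	statusSucceeded syncStatus = "Succeeded"
	statusFailed    syncStatus = "Failed"
)

// daemon is a hypothetical stand-in for the sriov-network-config-daemon state.
type daemon struct {
	lastAppliedGeneration int64
	status                syncStatus
	lastSyncError         string
}

// setNodeStateStatus mimics the writer call seen in the logs.
func (d *daemon) setNodeStateStatus(s syncStatus, errMsg string) {
	d.status, d.lastSyncError = s, errMsg
}

// nodeStateSyncHandler mirrors the behaviour in the logs: a transient fetch
// error marks the node "Failed"; the fix is to refresh the status even when
// the generation has not changed, so the stale "Failed" state clears.
func (d *daemon) nodeStateSyncHandler(fetchNodeState func() (int64, error)) error {
	gen, err := fetchNodeState()
	if err != nil {
		// A transient API error (e.g. "transport is closing") lands here.
		d.setNodeStateStatus(statusFailed, err.Error())
		return err
	}
	if gen == d.lastAppliedGeneration {
		// No changes to apply, but still record success instead of
		// returning early and leaving a stale "Failed" status behind.
		d.setNodeStateStatus(statusSucceeded, "")
		return nil
	}
	// ... apply the new desired state here, then record success.
	d.lastAppliedGeneration = gen
	d.setNodeStateStatus(statusSucceeded, "")
	return nil
}

func main() {
	d := &daemon{lastAppliedGeneration: 3}
	// First reconcile: the get fails, so the status becomes "Failed".
	_ = d.nodeStateSyncHandler(func() (int64, error) { return 0, fmt.Errorf("transport is closing") })
	// Next reconcile: same generation, sync succeeds, status recovers.
	_ = d.nodeStateSyncHandler(func() (int64, error) { return 3, nil })
	fmt.Println(d.status, d.lastSyncError) // Succeeded
}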
Verified this bug on 4.6.0-202009082256.p0

oc logs sriov-network-config-daemon-4786n | grep -i syncstatus
I0909 01:43:03.872399  616407 writer.go:125] setNodeStateStatus(): syncStatus: , lastSyncError:
I0909 01:43:09.146376  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:43:39.249977  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:44:09.349390  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:44:39.478036  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:45:09.583740  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:45:39.686482  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:46:09.804401  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:46:39.921391  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:47:10.049896  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:47:40.157610  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:48:10.260813  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
I0909 01:48:24.211777  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0909 01:48:54.313503  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0909 01:49:24.427046  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0909 01:49:54.532975  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0909 01:50:25.075616  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0909 01:50:55.202403  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0909 01:51:13.200971  616407 writer.go:125] setNodeStateStatus(): syncStatus: Failed, lastSyncError: etcdserver: leader changed
I0909 01:51:13.345190  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196