Bug 1875330 - Sync status remains in Failed and never recovers when a get of the current status fails
Summary: Sync status remains in Failed and never recovers when a get of the current status fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Federico Paolinelli
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-03 10:39 UTC by Federico Paolinelli
Modified: 2020-10-27 16:37 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:37:26 UTC
Target Upstream Version:




Links
Github openshift sriov-network-operator pull 338 (closed): Bug 1875330: Reset sync error if local status generation is equal remote. (Last Updated: 2020-09-09 07:26:08 UTC)
Red Hat Product Errata RHBA-2020:4196 (Last Updated: 2020-10-27 16:37:43 UTC)

Description Federico Paolinelli 2020-09-03 10:39:25 UTC
Description of problem:


If a get of the current state fails (because of network errors, for example), the sync status is set to "Failed" and never recovers.
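
For illustration, a minimal Go sketch of the failure mode. The type and function names are simplified assumptions based on the log messages quoted below; this is not the actual operator code.

package main

import "fmt"

// Hypothetical, simplified status record; not the operator's real types.
type nodeStatus struct {
	SyncStatus    string // "InProgress", "Succeeded", "Failed"
	LastSyncError string
	Generation    int64
}

// syncHandler mimics one reconciliation pass. If fetching the remote node
// state fails, the status is set to Failed and the item is requeued. On the
// next pass the fetch succeeds, but because the remote generation equals the
// locally recorded one, the handler returns early without touching the
// status, so the Failed state sticks.
func syncHandler(local *nodeStatus, remoteGeneration int64, fetchErr error) {
	if fetchErr != nil {
		local.SyncStatus = "Failed"
		local.LastSyncError = fetchErr.Error()
		return // item is requeued by the worker
	}
	if local.Generation == remoteGeneration {
		// BUG: nothing resets SyncStatus/LastSyncError here.
		return
	}
	local.Generation = remoteGeneration
	local.SyncStatus = "Succeeded"
	local.LastSyncError = ""
}

func main() {
	s := &nodeStatus{SyncStatus: "Succeeded", Generation: 3}
	syncHandler(s, 3, fmt.Errorf("rpc error: code = Unavailable desc = transport is closing"))
	syncHandler(s, 3, nil) // sync succeeds, but the stale status is never cleared
	fmt.Println(s.SyncStatus, "|", s.LastSyncError)
	// Prints: Failed | rpc error: code = Unavailable desc = transport is closing
}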

Version-Release number of selected component (if applicable):


How reproducible:

Always, assuming the call to the API server fails at least once.

Steps to Reproduce:
1. Let the daemon run long enough that a get fails in the reconciliation loop

Actual results:

The sync status remains "Failed" even though subsequent syncs complete successfully.

Expected results:

The status should recover as soon as the daemon verifies that the local and remote generations are identical.

Additional info:

Logs:

The daemon configuration is stable (the current generation is 3, so there is no need to update the status):

I0902 17:05:08.067764 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
I0902 17:05:08.072472 2241061 daemon.go:363] nodeStateSyncHandler(): Interface not changed
I0902 17:05:08.072484 2241061 daemon.go:301] Successfully synced
I0902 17:05:08.072489 2241061 daemon.go:264] worker queue size: 0

At some point, there is an error when fetching the state:

I0902 17:05:23.067976 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
W0902 17:05:23.070931 2241061 daemon.go:358] nodeStateSyncHandler(): Failed to fetch node state XXXXX.redhat.com: rpc error: code = Unavailable desc = transport is closing
I0902 17:05:23.070953 2241061 writer.go:61] Run(): refresh trigger

It updates the status with that error:

I0902 17:05:23.112955 2241061 writer.go:125] setNodeStateStatus(): syncStatus: Failed, lastSyncError: rpc error: code = Unavailable desc = transport is closing
E0902 17:05:23.123108 2241061 daemon.go:306] error syncing: rpc error: code = Unavailable desc = transport is closing, requeuing

The daemon requeues the item, but since the local and remote generations are the same, the status is not updated and remains:

    lastSyncError: 'rpc error: code = Unavailable desc = transport is closing'
    syncStatus: Failed


I0902 17:05:23.123120 2241061 daemon.go:264] worker queue size: 0
I0902 17:05:23.123478 2241061 daemon.go:266] get item: 3
I0902 17:05:23.123502 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
I0902 17:05:23.127624 2241061 daemon.go:363] nodeStateSyncHandler(): Interface not changed
I0902 17:05:23.127656 2241061 daemon.go:301] Successfully synced

^^^^ The successful sync above does not update the status.
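
For reference, a hedged sketch of the kind of change suggested by the linked PR title ("Reset sync error if local status generation is equal remote"). It reuses the hypothetical nodeStatus type from the sketch in the description and is not the operator's real code.

// Hypothetical fixed handler: when the generation is unchanged, a stale
// Failed status left over from a transient fetch error is reset instead of
// being skipped by the early return.
func syncHandlerFixed(local *nodeStatus, remoteGeneration int64, fetchErr error) {
	if fetchErr != nil {
		local.SyncStatus = "Failed"
		local.LastSyncError = fetchErr.Error()
		return
	}
	if local.Generation == remoteGeneration {
		if local.SyncStatus == "Failed" {
			// Generations match, so the last sync actually succeeded:
			// clear the error instead of leaving it in place forever.
			local.SyncStatus = "Succeeded"
			local.LastSyncError = ""
		}
		return
	}
	local.Generation = remoteGeneration
	local.SyncStatus = "Succeeded"
	local.LastSyncError = ""
}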

Comment 3 zhaozhanqi 2020-09-09 09:09:35 UTC
Verified this bug on 4.6.0-202009082256.p0

oc logs sriov-network-config-daemon-4786n | grep -i syncstatus
I0909 01:43:03.872399  616407 writer.go:125] setNodeStateStatus(): syncStatus: , lastSyncError: 
I0909 01:43:09.146376  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:43:39.249977  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:44:09.349390  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:44:39.478036  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:45:09.583740  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:45:39.686482  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:46:09.804401  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:46:39.921391  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:47:10.049896  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:47:40.157610  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:48:10.260813  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:48:24.211777  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:48:54.313503  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:49:24.427046  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:49:54.532975  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:50:25.075616  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:50:55.202403  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:51:13.200971  616407 writer.go:125] setNodeStateStatus(): syncStatus: Failed, lastSyncError: etcdserver: leader changed
I0909 01:51:13.345190  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:

Comment 5 errata-xmlrpc 2020-10-27 16:37:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

