Bug 1875330

Summary: Sync status remains in Failed and never recovers when a get of the current status fails
Product: OpenShift Container Platform
Reporter: Federico Paolinelli <fpaoline>
Component: Networking
Assignee: Federico Paolinelli <fpaoline>
Networking sub component: SR-IOV
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: high
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:37:26 UTC
Type: Bug

Description Federico Paolinelli 2020-09-03 10:39:25 UTC
Description of problem:


If a get of the current state fails (because of a transient network error, for example), the sync status is set to "Failed" and never recovers.
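
To illustrate the mechanism, here is a minimal, self-contained Go sketch of the pattern involved. It is not the actual daemon code: the syncState type, the handler signature, and the injected fetch function are simplified assumptions made only for this example (nodeStateSyncHandler is the real handler name visible in the logs below, everything else is made up).

package main

import (
	"errors"
	"fmt"
)

// syncState is a simplified stand-in for the status fields the daemon writes
// (syncStatus / lastSyncError) plus the last generation it applied.
type syncState struct {
	lastAppliedGeneration int64
	syncStatus            string // "", "InProgress", "Succeeded", "Failed"
	lastSyncError         string
}

// nodeStateSyncHandler mimics one reconciliation pass; fetch stands in for the
// GET against the API server that can fail transiently.
func nodeStateSyncHandler(s *syncState, fetch func() (int64, error)) error {
	latest, err := fetch()
	if err != nil {
		// The transient error is recorded in the status and the item is requeued.
		s.syncStatus = "Failed"
		s.lastSyncError = err.Error()
		return err
	}
	if latest == s.lastAppliedGeneration {
		// BUG: early return without refreshing the status, so a previously
		// recorded "Failed" is never cleared even though this sync succeeded.
		return nil
	}
	// Apply the new generation, then report success.
	s.lastAppliedGeneration = latest
	s.syncStatus = "Succeeded"
	s.lastSyncError = ""
	return nil
}

func main() {
	s := &syncState{lastAppliedGeneration: 3, syncStatus: "Succeeded"}

	// A single transient API error marks the status Failed...
	_ = nodeStateSyncHandler(s, func() (int64, error) {
		return 0, errors.New("rpc error: code = Unavailable desc = transport is closing")
	})

	// ...and the next, successful sync of the unchanged generation never clears it.
	_ = nodeStateSyncHandler(s, func() (int64, error) { return 3, nil })

	fmt.Printf("syncStatus=%s lastSyncError=%q\n", s.syncStatus, s.lastSyncError)
	// Prints syncStatus=Failed with the stale error, matching the logs below.
}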

Version-Release number of selected component (if applicable):


How reproducible:

Always, assuming the call to the API server fails at least once.

Steps to Reproduce:
1. Let the daemon run long enough for a get to fail in the reconciliation loop

Actual results:

The sync status stays "Failed" even though subsequent syncs complete successfully.

Expected results:

The status should recover as soon as the daemon checks that the local and remote versions are identical.

Additional info:

Logs:

The daemon configuration is stable: the current generation is 3, so there is no need to update the status.

I0902 17:05:08.067764 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
I0902 17:05:08.072472 2241061 daemon.go:363] nodeStateSyncHandler(): Interface not changed
I0902 17:05:08.072484 2241061 daemon.go:301] Successfully synced
I0902 17:05:08.072489 2241061 daemon.go:264] worker queue size: 0

At some point, there is an error when fetching the state:

I0902 17:05:23.067976 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
W0902 17:05:23.070931 2241061 daemon.go:358] nodeStateSyncHandler(): Failed to fetch node state XXXXX.redhat.com: rpc error: code = Unavailable desc = transport is closing
I0902 17:05:23.070953 2241061 writer.go:61] Run(): refresh trigger

It updates the status with that error:

I0902 17:05:23.112955 2241061 writer.go:125] setNodeStateStatus(): syncStatus: Failed, lastSyncError: rpc error: code = Unavailable desc = transport is closing
E0902 17:05:23.123108 2241061 daemon.go:306] error syncing: rpc error: code = Unavailable desc = transport is closing, requeuing

It requeues the item, but since the local and remote generations are the same, the status is not updated and remains:

    lastSyncError: 'rpc error: code = Unavailable desc = transport is closing'
    syncStatus: Failed


I0902 17:05:23.123120 2241061 daemon.go:264] worker queue size: 0
I0902 17:05:23.123478 2241061 daemon.go:266] get item: 3
I0902 17:05:23.123502 2241061 daemon.go:353] nodeStateSyncHandler(): new generation is 3
I0902 17:05:23.127624 2241061 daemon.go:363] nodeStateSyncHandler(): Interface not changed
I0902 17:05:23.127656 2241061 daemon.go:301] Successfully synced

^^^^ The successful sync above does not update the status.
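
For reference, below is a sketch of the general shape of a fix, reusing the simplified syncState type from the sketch in the description above. This is not the actual patch, just one way the status could be refreshed on the "nothing changed" path so that a stale Failed left by a transient error gets cleared.

// nodeStateSyncHandlerFixed refreshes the status on every successful pass,
// even when the generation has not changed (sketch only, simplified names).
func nodeStateSyncHandlerFixed(s *syncState, fetch func() (int64, error)) error {
	latest, err := fetch()
	if err != nil {
		s.syncStatus = "Failed"
		s.lastSyncError = err.Error()
		return err
	}
	if latest != s.lastAppliedGeneration {
		// Apply the new generation here.
		s.lastAppliedGeneration = latest
	}
	// Clear any stale error on a successful pass; writing only when the value
	// actually differs avoids needless status updates in the steady state.
	if s.syncStatus != "Succeeded" || s.lastSyncError != "" {
		s.syncStatus = "Succeeded"
		s.lastSyncError = ""
	}
	return nil
}

With this change the second pass in the earlier sketch would leave syncStatus at Succeeded with an empty lastSyncError, which matches the expected results.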

Comment 3 zhaozhanqi 2020-09-09 09:09:35 UTC
Verified this bug on 4.6.0-202009082256.p0

oc logs sriov-network-config-daemon-4786n | grep -i syncstatus
I0909 01:43:03.872399  616407 writer.go:125] setNodeStateStatus(): syncStatus: , lastSyncError: 
I0909 01:43:09.146376  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:43:39.249977  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:44:09.349390  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:44:39.478036  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:45:09.583740  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:45:39.686482  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:46:09.804401  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:46:39.921391  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:47:10.049896  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:47:40.157610  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:48:10.260813  616407 writer.go:125] setNodeStateStatus(): syncStatus: InProgress, lastSyncError: 
I0909 01:48:24.211777  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:48:54.313503  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:49:24.427046  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:49:54.532975  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:50:25.075616  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:50:55.202403  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError: 
I0909 01:51:13.200971  616407 writer.go:125] setNodeStateStatus(): syncStatus: Failed, lastSyncError: etcdserver: leader changed
I0909 01:51:13.345190  616407 writer.go:125] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:

Comment 5 errata-xmlrpc 2020-10-27 16:37:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196