Description of problem:
SriovNetworkNodePolicy manipulation (add/remove) can lead to an API access timeout. The issue is sporadic, but it reproduces frequently. As a result of the client connection timeout, the node is left in the Unschedulable state.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-05-21-042450

How reproducible:
Add/remove/update a SriovNetworkNodePolicy several times in a row. Please note: the issue is sporadic.

Steps to Reproduce:
1. Apply a SriovNetworkNodePolicy (an example manifest is shown after the log excerpt below).
2. Remove or update the policy and check the node schedulability.
3. Repeat several times in a row.

Actual results:
After the policy change, the node stays in Unschedulable mode all the time.

Expected results:
The sriov client is able to communicate with the api-server and, as a result of this communication, the node is switched back into the Schedulable state.

Additional info:
Logs from the sriov operator:

I0526 10:36:29.367658 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:29.367698 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:32.641181 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
E0526 10:36:50.897773 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:37:02.641322 735491 writer.go:57] Run(): period refresh
I0526 10:37:02.641341 735491 writer.go:71] pollNicStatus()
I0526 10:37:02.641347 735491 utils.go:35] DiscoverSriovDevices
I0526 10:37:02.680226 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.0
I0526 10:37:02.680274 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680319 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680726 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.2
I0526 10:37:02.680766 735491 utils.go:282] tryGetInterfaceName(): name is eno1v0
I0526 10:37:02.680935 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.3
I0526 10:37:02.680965 735491 utils.go:282] tryGetInterfaceName(): name is eno1v1
I0526 10:37:02.681168 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.4
I0526 10:37:02.681201 735491 utils.go:282] tryGetInterfaceName(): name is eno1v2
I0526 10:37:02.681446 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.5
I0526 10:37:02.681475 735491 utils.go:282] tryGetInterfaceName(): name is eno1v3
I0526 10:37:02.681758 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.6
I0526 10:37:02.681790 735491 utils.go:282] tryGetInterfaceName(): name is eno1v4
I0526 10:37:02.681833 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.1
I0526 10:37:02.681859 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:37:02.681899 735491 utils.go:282] tryGetInterfaceName(): name is eno2
E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:17.682197 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:22.898587 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0526 10:37:38.899032 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:42.682560 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:54.899500 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:02.682581 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:10.899890 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:22.682554 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:26.900321 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:38:27.689821 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0526 10:38:27.902168 735491 daemon.go:244] nodeStateChangeHandler(): current generation is 146
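For reference, the policies used for the add/remove cycles look roughly like the manifest below. The policy name, resource name, and node selector are illustrative (not the exact values from the cluster); the PF name (eno1) and VF count match the interfaces seen in the log above.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-eno1                  # illustrative name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: eno1_vfs             # illustrative resource name
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"   # illustrative selector
  numVfs: 5                          # matches eno1v0..eno1v4 in the log
  nicSelector:
    pfNames: ["eno1"]                # PF taken from the log above
  deviceType: netdevice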
Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1834914 ?
@Nikita What is your cluster network? OVN or OpenShift SDN? According to the log, the apiserver did not reply to the REST API request. It looks more like a cluster network issue.
@Peng Cluster network plugin is OVN.
Adding a bit more info, as I think we triaged the bug. When a GET of the node fails during the uncordon operation, the daemon exits, leaving the node unschedulable; a manual operation can bring it back. Also, if a GET fails when the daemon tries to retrieve the device plugin pod (when it attempts to restart it), the requested operation has no effect.
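For context, a minimal sketch of the kind of retry we have in mind for the uncordon path, assuming a recent client-go (the function name, intervals, and overall deadline are illustrative and this is not the operator's actual code):

package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// getNodeWithRetry polls the apiserver until the Node object is returned or
// the overall deadline expires, so a single "Client.Timeout exceeded" error
// does not abort the uncordon path and leave the node unschedulable.
func getNodeWithRetry(cs kubernetes.Interface, name string) (*corev1.Node, error) {
	var node *corev1.Node
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		n, getErr := cs.CoreV1().Nodes().Get(context.TODO(), name, metav1.GetOptions{})
		if getErr != nil {
			// Transient failure (e.g. client-side timeout): keep retrying.
			return false, nil
		}
		node = n
		return true, nil
	})
	return node, err
}

The idea is simply that a transient client-side timeout should trigger another attempt instead of terminating the daemon while the node is still cordoned.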
From the logs, I guess the node cannot reach 172.30.0.1; if so, this should be an OVN issue. It means all host-network pods on that node cannot access '172.30.0.1' for now.
Yes, it may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1826769. However, we are trying to mitigate this issue from the operator side by adding a retry mechanism to all the REST calls.

Regarding the log in the description, it is not complete, so it does not show the moment when the real issue happened.

> E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This message is expected in 4.4. Since we set a 15s timeout in the client, the informer will report this error every 15s if no update is received from the apiserver, which is normal when the node state has not changed, and it does not affect functionality. For the other REST calls, however, the apiserver is expected to reply immediately, so a timeout in that case causes a real problem.
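To make the 15s timeout concrete, here is a minimal sketch (illustrative only, not the operator's code) of a client configuration with that timeout. An idle list/watch gets cancelled client-side after 15s and is simply re-established, while a one-shot REST call that hits the same timeout is a genuine failure:

package main

import (
	"time"

	"k8s.io/client-go/rest"
)

// buildConfig returns an in-cluster REST config with a 15s request timeout,
// matching the behaviour described above: idle list/watch requests are
// cancelled client-side after 15s (harmless, logged by the reflector), while
// one-shot calls that exceed 15s indicate a real connectivity problem.
func buildConfig() (*rest.Config, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.Timeout = 15 * time.Second
	return cfg, nil
}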
I tested on 4.5.0-202005271737; this issue cannot be reproduced on my side. @Nikita, since the issue did not reproduce in my cluster, could you also try in your cluster to confirm whether it is fixed? Thanks.
Thanks @Federico. Moving this bug to 'verified' according to your comment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409