+++ This bug was initially created as a clone of Bug #1840104 +++

Description of problem:
SriovNetworkNodePolicy manipulation (add/remove) can lead to an api-server access timeout. The issue is sporadic but frequently reproducible. As a result of the client connection timeout, the node is left in the Unschedulable state.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-05-21-042450

How reproducible:
Add/remove/update a SriovNetworkNodePolicy several times in a row. Please note: the issue is sporadic.

Steps to Reproduce:
1.
2.
3.

Actual results:
After the policy change, the node stays in Unschedulable mode.

Expected results:
The sriov client is able to communicate with the api-server, and as a result of this communication the node is switched back into the Schedulable state.

Additional info:
Logs from the sriov operator:

I0526 10:36:29.367658 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:29.367698 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:32.641181 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
E0526 10:36:50.897773 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:37:02.641322 735491 writer.go:57] Run(): period refresh
I0526 10:37:02.641341 735491 writer.go:71] pollNicStatus()
I0526 10:37:02.641347 735491 utils.go:35] DiscoverSriovDevices
I0526 10:37:02.680226 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.0
I0526 10:37:02.680274 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680319 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680726 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.2
I0526 10:37:02.680766 735491 utils.go:282] tryGetInterfaceName(): name is eno1v0
I0526 10:37:02.680935 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.3
I0526 10:37:02.680965 735491 utils.go:282] tryGetInterfaceName(): name is eno1v1
I0526 10:37:02.681168 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.4
I0526 10:37:02.681201 735491 utils.go:282] tryGetInterfaceName(): name is eno1v2
I0526 10:37:02.681446 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.5
I0526 10:37:02.681475 735491 utils.go:282] tryGetInterfaceName(): name is eno1v3
I0526 10:37:02.681758 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.6
I0526 10:37:02.681790 735491 utils.go:282] tryGetInterfaceName(): name is eno1v4
I0526 10:37:02.681833 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.1
I0526 10:37:02.681859 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:37:02.681899 735491 utils.go:282] tryGetInterfaceName(): name is eno2
E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:17.682197 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:22.898587 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0526 10:37:38.899032 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:42.682560 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:54.899500 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:02.682581 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:10.899890 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:22.682554 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:26.900321 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:38:27.689821 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0526 10:38:27.902168 735491 daemon.go:244] nodeStateChangeHandler(): current generation is 146

--- Additional comment from Yaniv Joseph on 2020-05-26 06:52:09 CDT ---

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1834914 ?

--- Additional comment from Peng Liu on 2020-05-26 07:32:12 CDT ---

@Nikita What is your cluster network, OVN or Openshift-SDN? According to the log, the apiserver did not reply to the REST API request. It looks more like a cluster network issue.

--- Additional comment from Nikita on 2020-05-26 08:43:35 CDT ---

@Peng The cluster network plugin is OVN.

--- Additional comment from Federico Paolinelli on 2020-05-26 11:01:20 CDT ---

Adding a bit more info, as I think we triaged the bug. When a get of the node fails during the uncordon operation, the daemon exits, leaving the node unschedulable. A manual operation can bring it back. Also, if a get fails when the daemon tries to retrieve the device plugin pod (when it attempts to restart it), the requested operation has no effect.

--- Additional comment from zhaozhanqi on 2020-05-26 20:53:47 CDT ---

From the logs, I guess the node cannot reach 172.30.0.1. If so, this should be an OVN issue: it means no hostnetwork pod on that node can currently access '172.30.0.1'.

--- Additional comment from Peng Liu on 2020-05-26 22:14:04 CDT ---

Yes, it may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1826769. However, we are trying to mitigate this issue from the operator side by adding a retry mechanism to all the REST calls. Regarding the log message in the description, it is not complete, so it does not show the moment when the real issue happened.

> E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This message is expected in 4.4. Because we set a 15s timeout in the client, the informer reports such an error every 15s whenever no update is received from the apiserver, which is normal if the node state has not changed, and it does not affect functionality. But for other REST calls, the apiserver is expected to reply immediately, so if a timeout happens in those cases, it causes a problem.
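For illustration only, here is a minimal sketch of the kind of retry wrapper described above, applied to the uncordon path that Federico mentions. This is not the operator's actual code: the function name uncordonWithRetry, the backoff values, and the use of client-go's retry.OnError helper are assumptions, and the context-taking Get/Patch signatures assume a recent client-go release.

// Sketch only, not the operator's actual code.
package daemon

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// uncordonWithRetry marks the node schedulable again, retrying transient
// api-server timeouts with exponential backoff instead of giving up on the
// first failed GET and leaving the node unschedulable.
func uncordonWithRetry(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	backoff := wait.Backoff{Steps: 5, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}

	return retry.OnError(backoff, func(err error) bool {
		// Retry every error in this sketch; a real implementation would likely
		// retry only timeouts and other transient network failures.
		return true
	}, func() error {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return fmt.Errorf("failed to fetch node %s: %w", nodeName, err)
		}
		if !node.Spec.Unschedulable {
			// Already schedulable, nothing to do.
			return nil
		}
		patch := []byte(`{"spec":{"unschedulable":false}}`)
		_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
		return err
	})
}

The point is simply that a transient failure to reach 172.30.0.1 gets retried rather than causing the daemon to exit with the node still cordoned.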
Even though the operator has worked around the issue with retries, we should root-cause the problem from the ovnkube side. For example, perhaps https://bugzilla.redhat.com/show_bug.cgi?id=1834918 is a root cause.
Just to add more info, I am seeing this consistently on 4.4. I _think_ it's happening on 4.4 only, as I did not manage to reproduce it on 4.5.
During the bug scrub it was decided that this issue could be closed. Moving it to 4.7 to investigate one thing before closing.