+++ This bug was initially created as a clone of Bug #1840104 +++

Description of problem:
SriovNetworkNodePolicy manipulation (add/remove) can lead to an api-server access timeout. The issue is sporadic but frequently reproducible. As a result of the client connection timeout, the node is left in the Unschedulable state.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-05-21-042450

How reproducible:
Add/remove/update a SriovNetworkNodePolicy several times in a row. Please note: the issue is sporadic.

Steps to Reproduce:
1.
2.
3.

Actual results:
After the policy change, the node stays in Unschedulable mode.

Expected results:
The sriov client is able to communicate with the api-server, and as a result of this communication the node is switched back into the Schedulable state.

Additional info:
Logs from the sriov operator:

I0526 10:36:29.367658 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:29.367698 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:32.641181 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
E0526 10:36:50.897773 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:37:02.641322 735491 writer.go:57] Run(): period refresh
I0526 10:37:02.641341 735491 writer.go:71] pollNicStatus()
I0526 10:37:02.641347 735491 utils.go:35] DiscoverSriovDevices
I0526 10:37:02.680226 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.0
I0526 10:37:02.680274 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680319 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680726 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.2
I0526 10:37:02.680766 735491 utils.go:282] tryGetInterfaceName(): name is eno1v0
I0526 10:37:02.680935 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.3
I0526 10:37:02.680965 735491 utils.go:282] tryGetInterfaceName(): name is eno1v1
I0526 10:37:02.681168 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.4
I0526 10:37:02.681201 735491 utils.go:282] tryGetInterfaceName(): name is eno1v2
I0526 10:37:02.681446 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.5
I0526 10:37:02.681475 735491 utils.go:282] tryGetInterfaceName(): name is eno1v3
I0526 10:37:02.681758 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.6
I0526 10:37:02.681790 735491 utils.go:282] tryGetInterfaceName(): name is eno1v4
I0526 10:37:02.681833 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.1
I0526 10:37:02.681859 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:37:02.681899 735491 utils.go:282] tryGetInterfaceName(): name is eno2
E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:17.682197 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:22.898587 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0526 10:37:38.899032 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:42.682560 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:54.899500 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:02.682581 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:10.899890 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:22.682554 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:26.900321 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:38:27.689821 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0526 10:38:27.902168 735491 daemon.go:244] nodeStateChangeHandler(): current generation is 146

--- Additional comment from Yaniv Joseph on 2020-05-26 06:52:09 CDT ---

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1834914 ?

--- Additional comment from Peng Liu on 2020-05-26 07:32:12 CDT ---

@Nikita What is your cluster network, OVN or Openshift-SDN? According to the log, the apiserver did not reply to the REST API request. It looks more like a cluster network issue.

--- Additional comment from Nikita on 2020-05-26 08:43:35 CDT ---

@Peng The cluster network plugin is OVN.

--- Additional comment from Federico Paolinelli on 2020-05-26 11:01:20 CDT ---

Adding a bit more info, as I think we triaged the bug. When a get of the node fails during the uncordon operation, the daemon exits, leaving the node unschedulable. A manual operation can bring it back. Also, if a get fails when the daemon tries to retrieve the device plugin pod (when it attempts to restart it), the requested operation has no effect.

--- Additional comment from zhaozhanqi on 2020-05-26 20:53:47 CDT ---

From the logs, I guess the node cannot reach 172.30.0.1. If so, this should be an OVN issue: it means no hostnetwork pod on that node can currently access '172.30.0.1'.

--- Additional comment from Peng Liu on 2020-05-26 22:14:04 CDT ---

Yes, it may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1826769. However, we are trying to mitigate this issue from the operator side by adding a retry mechanism to all the REST calls. Regarding the log message in the description, it is not complete, so it does not show the moment when the real issue happened.

> E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This message is expected in 4.4. Because we set a 15s timeout in the client, the informer reports such an error every 15s whenever no update is received from the apiserver, which is normal if the node state has not changed, and it does not affect functionality. But for other REST calls, the apiserver is expected to reply immediately, so if a timeout happens in those cases, it causes a problem.
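For illustration only, here is a minimal sketch of the kind of retry wrapper described above, applied to the uncordon path that Federico mentions. This is not the operator's actual code: the function name uncordonWithRetry, the backoff values, and the use of client-go's retry.OnError helper are assumptions, and the context-taking Get/Patch signatures assume a recent client-go release.

// Sketch only, not the operator's actual code.
package daemon

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// uncordonWithRetry marks the node schedulable again, retrying transient
// api-server timeouts with exponential backoff instead of giving up on the
// first failed GET and leaving the node unschedulable.
func uncordonWithRetry(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	backoff := wait.Backoff{Steps: 5, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}

	return retry.OnError(backoff, func(err error) bool {
		// Retry every error in this sketch; a real implementation would likely
		// retry only timeouts and other transient network failures.
		return true
	}, func() error {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return fmt.Errorf("failed to fetch node %s: %w", nodeName, err)
		}
		if !node.Spec.Unschedulable {
			// Already schedulable, nothing to do.
			return nil
		}
		patch := []byte(`{"spec":{"unschedulable":false}}`)
		_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
		return err
	})
}

The point is simply that a transient failure to reach 172.30.0.1 gets retried rather than causing the daemon to exit with the node still cordoned.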
Even though the operator has worked around the issue with retries, we should root-cause the problem from the ovnkube side. For example, perhaps https://bugzilla.redhat.com/show_bug.cgi?id=1834918 is a root cause.
Just to add more info, I am seeing this consistently on 4.4. I _think_ it's happening on 4.4 only, as I did not manage to reproduce it on 4.5.
During the bug scrub it was decided that this issue could be closed. Moving it to 4.7 to investigate one thing before closing.