Description of problem:
SriovNetworkNodePolicy manipulation (add/remove) can lead to an API access timeout. The issue is sporadic, but it reproduces frequently. As a result of the client connection timeout, the node is left in the Unschedulable state.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-05-21-042450

How reproducible:
Add/remove/update a SriovNetworkNodePolicy several times in a row. Please note: the issue is sporadic.

Steps to Reproduce:
1. Apply a SriovNetworkNodePolicy (an example manifest is shown after the log excerpt below).
2. Remove or update the policy and check the node schedulability.
3. Repeat several times in a row.

Actual results:
After the policy change, the node stays in Unschedulable mode all the time.

Expected results:
The sriov client is able to communicate with the api-server and, as a result of this communication, the node is switched back into the Schedulable state.

Additional info:
Logs from the sriov operator:

I0526 10:36:29.367658 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:29.367698 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:36:32.641181 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
E0526 10:36:50.897773 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:37:02.641322 735491 writer.go:57] Run(): period refresh
I0526 10:37:02.641341 735491 writer.go:71] pollNicStatus()
I0526 10:37:02.641347 735491 utils.go:35] DiscoverSriovDevices
I0526 10:37:02.680226 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.0
I0526 10:37:02.680274 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680319 735491 utils.go:282] tryGetInterfaceName(): name is eno1
I0526 10:37:02.680726 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.2
I0526 10:37:02.680766 735491 utils.go:282] tryGetInterfaceName(): name is eno1v0
I0526 10:37:02.680935 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.3
I0526 10:37:02.680965 735491 utils.go:282] tryGetInterfaceName(): name is eno1v1
I0526 10:37:02.681168 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.4
I0526 10:37:02.681201 735491 utils.go:282] tryGetInterfaceName(): name is eno1v2
I0526 10:37:02.681446 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.5
I0526 10:37:02.681475 735491 utils.go:282] tryGetInterfaceName(): name is eno1v3
I0526 10:37:02.681758 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.6
I0526 10:37:02.681790 735491 utils.go:282] tryGetInterfaceName(): name is eno1v4
I0526 10:37:02.681833 735491 utils.go:287] getNetdevMTU(): get MTU for device 0000:19:00.1
I0526 10:37:02.681859 735491 utils.go:282] tryGetInterfaceName(): name is eno2
I0526 10:37:02.681899 735491 utils.go:282] tryGetInterfaceName(): name is eno2
E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:17.682197 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:22.898587 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0526 10:37:38.899032 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:37:42.682560 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:37:54.899500 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:02.682581 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:10.899890 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
W0526 10:38:22.682554 735491 writer.go:124] Failed to fetch node worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com (Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates/worker-0.cluster1.savanna.lab.eng.rdu2.redhat.com?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)); retrying...
E0526 10:38:26.900321 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0526 10:38:27.689821 735491 writer.go:111] setNodeStateStatus(): syncStatus: Succeeded, lastSyncError:
I0526 10:38:27.902168 735491 daemon.go:244] nodeStateChangeHandler(): current generation is 146
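For reference, the policies used for the add/remove cycles look roughly like the manifest below. The policy name, resource name, and node selector are illustrative (not the exact values from the cluster); the PF name (eno1) and VF count match the interfaces seen in the log above.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-eno1                  # illustrative name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: eno1_vfs             # illustrative resource name
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"   # illustrative selector
  numVfs: 5                          # matches eno1v0..eno1v4 in the log
  nicSelector:
    pfNames: ["eno1"]                # PF taken from the log above
  deviceType: netdevice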
Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1834914 ?
@Nikita What is your cluster network? OVN or OpenShift SDN? According to the log, the apiserver did not reply to the REST API request. It looks more like a cluster network issue.
@Peng Cluster network plugin is OVN.
Adding a bit more info, as I think we triaged the bug. When a GET of the node fails during the uncordon operation, the daemon exits, leaving the node unschedulable; a manual operation can bring it back. Also, if a GET fails when the daemon tries to retrieve the device plugin pod (when it attempts to restart it), the requested operation has no effect.
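For context, a minimal sketch of the kind of retry we have in mind for the uncordon path, assuming a recent client-go (the function name, intervals, and overall deadline are illustrative and this is not the operator's actual code):

package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// getNodeWithRetry polls the apiserver until the Node object is returned or
// the overall deadline expires, so a single "Client.Timeout exceeded" error
// does not abort the uncordon path and leave the node unschedulable.
func getNodeWithRetry(cs kubernetes.Interface, name string) (*corev1.Node, error) {
	var node *corev1.Node
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		n, getErr := cs.CoreV1().Nodes().Get(context.TODO(), name, metav1.GetOptions{})
		if getErr != nil {
			// Transient failure (e.g. client-side timeout): keep retrying.
			return false, nil
		}
		node = n
		return true, nil
	})
	return node, err
}

The idea is simply that a transient client-side timeout should trigger another attempt instead of terminating the daemon while the node is still cordoned.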
From the logs, I guess the node cannot reach 172.30.0.1; if so, this should be an OVN issue. It means all host-network pods on that node cannot access '172.30.0.1' for now.
Yes, it may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1826769. However, we are trying to mitigate this issue from the operator side by adding a retry mechanism to all the REST calls.

Regarding the log in the description, it is not complete, so it does not show the moment when the real issue happened.

> E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This message is expected in 4.4. Since we set a 15s timeout in the client, the informer will report this error every 15s if no update is received from the apiserver, which is normal when the node state has not changed, and it does not affect functionality. For the other REST calls, however, the apiserver is expected to reply immediately, so a timeout in that case causes a real problem.
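To make the 15s timeout concrete, here is a minimal sketch (illustrative only, not the operator's code) of a client configuration with that timeout. An idle list/watch gets cancelled client-side after 15s and is simply re-established, while a one-shot REST call that hits the same timeout is a genuine failure:

package main

import (
	"time"

	"k8s.io/client-go/rest"
)

// buildConfig returns an in-cluster REST config with a 15s request timeout,
// matching the behaviour described above: idle list/watch requests are
// cancelled client-side after 15s (harmless, logged by the reflector), while
// one-shot calls that exceed 15s indicate a real connectivity problem.
func buildConfig() (*rest.Config, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.Timeout = 15 * time.Second
	return cfg, nil
}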
I tested on 4.5.0-202005271737; this issue cannot be reproduced on my side. @Nikita, since the issue did not reproduce in my cluster, could you also try in your cluster to confirm whether it is fixed? Thanks.
Thanks @Federico. Moving this bug to 'verified' according to your comment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409