Bug 1840104
| Field | Value |
|---|---|
| Summary | [sriov] SriovNetworkNodePolicy change cause api-access timeout: client timeout Client.Timeout exceeded while awaiting headers |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | SR-IOV |
| Reporter | Nikita <nkononov> |
| Assignee | Peng Liu <pliu> |
| QA Contact | zhaozhanqi <zzhao> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| CC | fpaoline, ncredi, yjoseph |
| Version | 4.4 |
| Keywords | UpcomingSprint |
| Target Release | 4.5.0 |
| Hardware | Unspecified |
| OS | Linux |
| Type | Bug |
| Last Closed | 2020-07-13 17:41:38 UTC |
| Bug Blocks | 1840849, 1841042 |

Description
Nikita
2020-05-26 11:23:16 UTC
Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1834914?

@Nikita What is your cluster network, OVN or OpenShift-SDN? According to the log, the apiserver did not reply to the REST API request. It looks more like a cluster network issue.

@Peng The cluster network plugin is OVN.

Adding a bit more info, as I think we have triaged the bug. When a get of the nodes fails during the uncordon operation, the daemon exits, leaving the node unschedulable. A manual operation can bring it back. Also, if a get fails when the daemon tries to retrieve the device plugin pod (when it attempts to restart it), the requested operation has no effect.

From the logs, I guess the node cannot access 172.30.0.1. If so, this should be an OVN issue: it means no hostnetwork pod on that node can access 172.30.0.1 for now.

Yes, it may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1826769. However, we are trying to mitigate this issue from the operator side by adding a retry mechanism to all the REST calls.

Regarding the log message in the description: it is not complete, so it does not show the moment when the real issue happened.

> E0526 10:37:06.898220 735491 reflector.go:125] github.com/openshift/sriov-network-operator/pkg/daemon/daemon.go:112: Failed to list *v1.SriovNetworkNodeState: Get https://172.30.0.1:443/apis/sriovnetwork.openshift.io/v1/namespaces/openshift-sriov-network-operator/sriovnetworknodestates?fieldSelector=metadata.name%3Dworker-0.cluster1.savanna.lab.eng.rdu2.redhat.com&limit=500&resourceVersion=0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This message is expected in 4.4. We have set a 15s timeout in the client, so the informer reports this error every 15s if no update is received from the apiserver, which is normal when the node state has not changed. It does not affect functionality. For other REST calls, however, the apiserver is expected to reply immediately, so a timeout in those cases causes a problem.

I tested on 4.5.0-202005271737 and could not reproduce this issue on my side. @Nikita, since I was not able to reproduce this issue in my cluster before either, could you try it in your cluster too and check whether it is fixed? Thanks.

Thanks @Federico. Moving this bug to 'verified' according to your comment.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
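
The comments above mention mitigating the issue on the operator side by adding a retry mechanism to all the REST calls. The operator's actual change is not shown in this report, so the following is only a minimal, language-neutral sketch (written in Python) of what such a retry wrapper could look like. The names `call_with_retry`, `TransientAPIError`, and `make_flaky_get`, as well as the attempt count and delay values, are hypothetical and chosen for illustration.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a timed-out or dropped REST call."""


def call_with_retry(func, attempts=5, initial_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are exhausted.

    Only TransientAPIError is retried; any other exception propagates
    immediately. The delay is multiplied by `factor` after each failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientAPIError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay *= factor


def make_flaky_get(failures):
    """Build a fake 'get node' call that times out `failures` times first."""
    state = {"left": failures}

    def get_node():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientAPIError("Client.Timeout exceeded while awaiting headers")
        return {"name": "worker-0", "unschedulable": False}

    return get_node


if __name__ == "__main__":
    # Two simulated timeouts, then success on the third attempt.
    node = call_with_retry(make_flaky_get(2), initial_delay=0.1)
    print(node)
```

With a wrapper of this kind, a single failed get during the uncordon step would be retried instead of being treated as fatal, which is the failure mode described in the triage comment. Whether the real fix uses backoff, a fixed interval, or a different attempt limit is not stated in this report.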