Description of problem:

Network Operator degraded with `openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s`. At the same time, both console pods, which were on master-0, were unable to reach the router. It is unclear whether this is related.

Version-Release number of selected component (if applicable):
4.3.28, IPI on Azure

How reproducible:
Intermittent; it takes several days, but the problem recurs periodically.

Steps to Reproduce:
1. Normal cluster usage on Azure

Actual results:
Network Operator becomes degraded.

Expected results:
Network Operator stays available.

Additional info:

The console pods are unable to reach the router [10.128.2.15, 10.131.0.15] due to TLS handshake errors; both console pods were on the impacted master:

2020-07-20T14:53:24.619071022Z 2020/07/20 14:53:24 http: TLS handshake error from 10.128.2.15:58432: read tcp 10.128.0.31:8443->10.128.2.15:58432: read: connection reset by peer
2020-07-20T14:53:34.694660488Z 2020/07/20 14:53:34 http: TLS handshake error from 10.131.0.15:54188: read tcp 10.128.0.31:8443->10.131.0.15:54188: read: connection reset by peer
2020-07-20T14:53:34.704729068Z 2020/07/20 14:53:34 http: TLS handshake error from 10.128.2.15:58718: read tcp 10.128.0.31:8443->10.128.2.15:58718: read: connection reset by peer

Once we deleted one of the console pods, it was rescheduled to another master and we were able to log in to the console.

At the same time, the Network Operator is reporting that it is degraded:

status:
  conditions:
  - lastTransitionTime: '2020-07-20T13:39:39Z'
    message: 'Error while updating operator configuration: could not apply
      (/v1, Kind=Namespace) /openshift-multus: could not update object
      (/v1, Kind=Namespace) /openshift-multus: Internal error occurred:
      admission plugin "MutatingAdmissionWebhook" failed to complete
      mutation in 13s'

We tried deleting the SDN, OVS, and Multus pods, but this did not address the issue. The only way to resolve it was to restart master-0.
OVS flows, sosreport from master-0, and must-gather to follow.
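Since the 13s in the error reads like a webhook timeout, a first check is which mutating webhook is timing out and what timeoutSeconds it carries. A minimal sketch, assuming cluster-admin access via oc; the commands only inspect state:

```
# List mutating webhook configurations with their per-webhook timeouts;
# a webhook configured with timeoutSeconds: 13 would match the error text.
oc get mutatingwebhookconfigurations \
  -o custom-columns=NAME:.metadata.name,TIMEOUTS:.webhooks[*].timeoutSeconds

# Confirm the multus admission controller pods are running and where.
oc -n openshift-multus get pods -o wide
```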
Thanks for the report. I'm trying to get some more eyes on this for a second opinion; I don't recognize this error, and from the output it looks like an API connectivity issue for the multus-admission-controller.

> We tried deleting the SDN, OVS, and Multus pods. This did not address the issue. The only way to resolve the issue was to restart master-0.

This also points at a possible problem on the API side. The console does seem relevant too; it may be suffering from the same underlying issue. I need another opinion on this, and if the multus-admission-controller is only a symptom of a deeper cause, I'd like to get the BZ assigned to the right engineers.
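If this is node-to-apiserver connectivity, it can be probed from the affected master directly. A hedged sketch: the API endpoint is a placeholder to replace with the cluster's real URL, and the label/container names are assumptions about this release:

```
# From the affected master, probe the apiserver health endpoint
# out of the host network namespace.
oc debug node/master-0 -- chroot /host \
  curl -k -m 5 https://<api-server-endpoint>:6443/healthz

# Tail the multus admission controller for watch/connection errors.
oc -n openshift-multus logs -l app=multus-admission-controller \
  -c multus-admission-controller --tail=50
```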
Thanks to some investigation from Aniket Bhat, we also found this error:

```
2020-07-10T21:53:58.671180642Z W0710 21:53:58.670857       1 reflector.go:302] github.com/k8snetworkplumbingwg/net-attach-def-admission-controller/pkg/controller/controller.go:181: watch of *v1.Pod ended with: very short watch: github.com/k8snetworkplumbingwg/net-attach-def-admission-controller/pkg/controller/controller.go:181: Unexpected watch close - watch lasted less than a second and no items received
```

It's still not a smoking gun, but it provides some more insight. Research on this message points at the possibility of the node not being able to communicate with the API server.
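To see whether these short-watch errors cluster around apiserver disruptions, their occurrences can be counted and cross-checked against the kube-apiserver logs on master-0. A rough sketch, under the same label/container assumptions as above:

```
# Count "very short watch" occurrences in the admission controller logs.
oc -n openshift-multus logs -l app=multus-admission-controller \
  -c multus-admission-controller --timestamps | grep -c 'very short watch'

# Scan kube-apiserver logs for TLS/timeout errors in the same window.
oc -n openshift-kube-apiserver logs -l app=openshift-kube-apiserver \
  -c kube-apiserver --tail=200 | grep -iE 'tls handshake|timeout'
```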
Ever since we upgraded our OpenShift cluster (Azure IPI) from OpenShift 4.3.22 to 4.3.28, we have been getting a similar issue:

"Error creating build pod: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s"

At the same time, we also see the errors below reported on the dashboard:

"The API server has an abnormal latency of 13.016189253374991 seconds for POST pods."
"API server is returning errors for 100% of requests for POST pods"
"The API server has a 99th percentile latency of 14.95 seconds for POST pods."
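The quoted alerts measure POST pods latency, which can be cross-checked by timing a pod creation by hand; with a healthy apiserver and webhooks this completes well under a second. A quick sketch with a placeholder pod name and image:

```
# Time a bare pod creation; multi-second wall time here reproduces the
# "POST pods" latency the alerts report.
time oc run webhook-latency-test \
  --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 30

# Clean up the test pod afterwards.
oc delete pod webhook-latency-test
```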
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days