Bug 1859240

Summary: Network Operator degraded with `openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s`
Product: OpenShift Container Platform
Reporter: rvanderp
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: high
Priority: low
Version: 4.3.z
CC: abudavis, aos-bugs, bbennett, cruhm, dosmith, fpan, mfojtik, rkshirsa, sbatsche, sttts, xxia
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clones: 1954032 (view as bug list)
Last Closed: 2020-08-31 15:17:47 UTC
Type: Bug
Bug Blocks: 1954032

Description rvanderp 2020-07-21 14:25:20 UTC
Description of problem:
Network Operator degraded with `openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s`.  At the same time, the console pods, both running on master-0, were unable to reach the router.  It is unclear if this is related.
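A minimal sketch of how the webhook behind this error can be located (the configuration name `multus.openshift.io` is an assumption, so list first and adjust):

```
# List mutating webhook configurations and inspect the multus admission
# controller's entry; the name below is assumed.
oc get mutatingwebhookconfigurations
oc get mutatingwebhookconfiguration multus.openshift.io -o yaml
# Check the webhook's backing pods.
oc -n openshift-multus get pods -o wide
```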


Version-Release number of selected component (if applicable):
4.3.28 IPI Azure

How reproducible:
It takes several days, but the problem periodically occurs.

Steps to Reproduce:
1. Normal cluster usage in Azure
2.
3.

Actual results:
Network operator becomes degraded

Expected results:
The Network Operator remains available and does not report Degraded.

Additional info:
Console pods are unable to reach the routers [10.128.2.15, 10.131.0.15] due to a TLS handshake error; both console pods were on the impacted master.
2020-07-20T14:53:24.619071022Z 2020/07/20 14:53:24 http: TLS handshake error from 10.128.2.15:58432: read tcp 10.128.0.31:8443->10.128.2.15:58432: read: connection reset by peer
2020-07-20T14:53:34.694660488Z 2020/07/20 14:53:34 http: TLS handshake error from 10.131.0.15:54188: read tcp 10.128.0.31:8443->10.131.0.15:54188: read: connection reset by peer
2020-07-20T14:53:34.704729068Z 2020/07/20 14:53:34 http: TLS handshake error from 10.128.2.15:58718: read tcp 10.128.0.31:8443->10.128.2.15:58718: read: connection reset by peer

Once we deleted one of the console pods, it was scheduled onto another master and we were able to log in to the console.
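A minimal sketch of that workaround (the pod name is a placeholder; the deployment recreates the deleted pod):

```
# See which node each console pod is running on.
oc -n openshift-console get pods -o wide
# Delete one console pod; its deployment reschedules it, ideally onto a
# different master.  <console-pod-name> is a placeholder.
oc -n openshift-console delete pod <console-pod-name>
# Confirm where the replacement pod landed.
oc -n openshift-console get pods -o wide
```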

At the same time, the Network Operator is reporting that it is degraded:
status:
  conditions:
  - lastTransitionTime: '2020-07-20T13:39:39Z'
    message: 'Error while updating operator configuration: could not apply (/v1, Kind=Namespace)
      /openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus:
      Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed
      to complete mutation in 13s'
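For reference, a sketch of how this condition can be pulled from the cluster (assuming the message comes from the network ClusterOperator and the operator configuration object):

```
# Short status of the network operator (Available/Progressing/Degraded).
oc get clusteroperator network
# Full conditions, including the Degraded message quoted above.
oc get clusteroperator network -o yaml
# The operator configuration object the network operator is trying to apply.
oc get network.operator cluster -o yaml
```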

We tried deleting the SDN, OVS, and Multus pods.  This did not address the issue.  The only way to resolve the issue was to restart master-0.
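For completeness, a sketch of the pod restarts that were attempted (the label selectors are assumptions based on the usual daemonset labels; the pods are recreated automatically):

```
# Restart the SDN and OVS daemonset pods (label selectors assumed).
oc -n openshift-sdn delete pods -l app=sdn
oc -n openshift-sdn delete pods -l app=ovs
# Restart multus and its admission controller pods.
oc -n openshift-multus delete pods -l app=multus
oc -n openshift-multus delete pods -l app=multus-admission-controller
```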

Comment 1 rvanderp 2020-07-21 14:29:42 UTC
OVS flows, sosreport from master-0, and must-gather to follow.

Comment 22 Douglas Smith 2020-07-22 17:19:11 UTC
Thanks for the report. I'm trying to get some eyes on this for a second opinion. I don't recognize this error, and from the output it looks like an API connectivity issue for the multus-admission-controller.

> We tried deleting the SDN, OVS, and Multus pods.  This did not address the issue.  The only way to resolve the issue was to restart master-0.

This also points to there possibly being a problem on the API side.

The console also seems relevant; it may be suffering the same ills.

I need another opinion on this, and if the multus-admission-controller is a symptom of a deeper cause, then I'd like to get the BZ assigned to the right engineers.

Comment 23 Douglas Smith 2020-07-23 13:14:29 UTC
Thanks to some investigation from Aniket Bhat, we also found this error:

```
2020-07-10T21:53:58.671180642Z W0710 21:53:58.670857       1 reflector.go:302] github.com/k8snetworkplumbingwg/net-attach-def-admission-controller/pkg/controller/controller.go:181: watch of *v1.Pod ended with: very short watch: github.com/k8snetworkplumbingwg/net-attach-def-admission-controller/pkg/controller/controller.go:181: Unexpected watch close - watch lasted less than a second and no items received
```

It's still not a smoking gun, but it provides some more insight. Research on this message points to the possibility that the node was unable to communicate with the API server.
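A hedged way to probe the node-to-apiserver path that this message hints at (the service address below is the usual in-cluster default and may differ; `master-0` stands in for the affected node's full name):

```
# From the suspect master, check whether the in-cluster apiserver endpoint
# answers; 172.30.0.1:443 is the default kubernetes service address and is
# an assumption here.
oc debug node/master-0 -- chroot /host curl -sk https://172.30.0.1:443/healthz
# Look for repeated watch errors in the admission controller's own logs
# (label selector assumed).
oc -n openshift-multus logs -l app=multus-admission-controller --tail=50
```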

Comment 29 Abu Davis 2020-08-10 09:24:09 UTC
Ever since we upgraded our OpenShift cluster (Azure IPI) from OpenShift 4.3.22 to 4.3.28, we have seen a similar issue: "Error creating build pod: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s".

At the same time, we also see the below errors reported on the dashboard:
"The API server has an abnormal latency of 13.016189253374991 seconds for POST pods."
"API server is returning errors for 100% of requests for POST pods"
"The API server has a 99th percentile latency of 14.95 seconds for POST pods."

Comment 52 Red Hat Bugzilla 2023-09-15 00:34:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days