Bug 1859240 - Network Operator degraded with `openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s)`
Summary: Network Operator degraded with `openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s`
Keywords:
Status: CLOSED DUPLICATE of bug 1868133
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1954032
 
Reported: 2020-07-21 14:25 UTC by rvanderp
Modified: 2023-12-15 18:31 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1954032
Environment:
Last Closed: 2020-08-31 15:17:47 UTC
Target Upstream Version:
Embargoed:



Description rvanderp 2020-07-21 14:25:20 UTC
Description of problem:
Network Operator degraded with `openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s`.  At the same time, the console pods, both running on master-0, were unable to reach the router.  It is unclear whether this is related.


Version-Release number of selected component (if applicable):
4.3.28 IPI Azure

How reproducible:
It takes several days, but the problem periodically occurs.

Steps to Reproduce:
1. Normal cluster usage in Azure

Actual results:
Network operator becomes degraded

Expected results:
Network Operator remains available and does not become degraded

Additional info:
Console pods are unable to reach the routers [10.128.2.15, 10.131.0.15] due to a TLS handshake error; both console pods were running on the impacted master.
2020-07-20T14:53:24.619071022Z 2020/07/20 14:53:24 http: TLS handshake error from 10.128.2.15:58432: read tcp 10.128.0.31:8443->10.128.2.15:58432: read: connection reset by peer
2020-07-20T14:53:34.694660488Z 2020/07/20 14:53:34 http: TLS handshake error from 10.131.0.15:54188: read tcp 10.128.0.31:8443->10.131.0.15:54188: read: connection reset by peer
2020-07-20T14:53:34.704729068Z 2020/07/20 14:53:34 http: TLS handshake error from 10.128.2.15:58718: read tcp 10.128.0.31:8443->10.128.2.15:58718: read: connection reset by peer
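
For reference, one hedged way to exercise the failing path between the router and console pods by hand is sketched below. The pod name is a placeholder, and it assumes curl is available in the router image; the IP/port come from the log lines above.

```
# Router pods live in the openshift-ingress namespace; pick one on a healthy node
oc -n openshift-ingress get pods -o wide

# From a router pod, attempt a TLS connection back to the console pod IP/port seen in the logs.
# A reset during the handshake here would match the "connection reset by peer" errors above.
oc -n openshift-ingress exec <router-pod> -- curl -kv https://10.128.0.31:8443/
```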

Once we deleted one of the console pods, it was scheduled to another master and we were able to log in to the console.
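
For anyone hitting the same symptom, the workaround amounted to something along these lines (the pod name below is a placeholder):

```
# Find the console pod scheduled on the impacted master
oc -n openshift-console get pods -o wide

# Delete it so the replacement gets scheduled on a different master
oc -n openshift-console delete pod <console-pod-on-master-0>
```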

At the same time, the Network Operator is reporting that it is degraded:
status:
  conditions:
  - lastTransitionTime: '2020-07-20T13:39:39Z'
    message: 'Error while updating operator configuration: could not apply (/v1, Kind=Namespace)
      /openshift-multus: could not update object (/v1, Kind=Namespace) /openshift-multus:
      Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed
      to complete mutation in 13s'
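
The condition above can be pulled from the cluster with commands along these lines (illustrative; the exact object carrying the message may differ by version):

```
# Degraded condition as surfaced by the ClusterOperator
oc get clusteroperator network -o yaml

# The network operator's own config object, where a status block like the one above is recorded
oc get network.operator cluster -o yaml
```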

We tried deleting the SDN, OVS, and Multus pods, but this did not address the problem.  The only way to resolve it was to restart master-0.
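
For completeness, the pod deletions we attempted were roughly the following; the node name and pod labels here are assumptions and may need adjusting. None of this helped until master-0 itself was rebooted.

```
# Recreate the networking pods on the affected node (the daemonsets recreate them automatically)
oc -n openshift-sdn    delete pod -l app=sdn    --field-selector spec.nodeName=master-0
oc -n openshift-sdn    delete pod -l app=ovs    --field-selector spec.nodeName=master-0
oc -n openshift-multus delete pod -l app=multus --field-selector spec.nodeName=master-0
```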

Comment 1 rvanderp 2020-07-21 14:29:42 UTC
OVS flows, sosreport from master-0, and must-gather to follow.

Comment 22 Douglas Smith 2020-07-22 17:19:11 UTC
Thanks for the report. I'm trying to get some eyes on this for a second opinion. I don't recognize this error, and from the output it looks like an API connectivity issue for the multus-admission-controller.

> We tried deleting the SDN, OVS, and Multus pods.  This did not address the issue.  The only way to resolve the issue was to restart master-0.

To me, this also points at the problem possibly being on the API side.

The console also seems relevant; it may be suffering from the same issue.

I need another opinion on this, and if the multus-admission-controller is a symptom of a deeper cause, then I'd like to get the BZ assigned to the right engineers.

Comment 23 Douglas Smith 2020-07-23 13:14:29 UTC
Thanks to some investigation from Aniket Bhat, we did also find this error:

```
2020-07-10T21:53:58.671180642Z W0710 21:53:58.670857       1 reflector.go:302] github.com/k8snetworkplumbingwg/net-attach-def-admission-controller/pkg/controller/controller.go:181: watch of *v1.Pod ended with: very short watch: github.com/k8snetworkplumbingwg/net-attach-def-admission-controller/pkg/controller/controller.go:181: Unexpected watch close - watch lasted less than a second and no items received
```

It's still not a smoking gun, but it provides some more insight. Research into this message suggests that the node may not have been able to communicate with the API server.
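
A couple of hedged checks that might help narrow this down; the webhook name, node name, and cluster domain below are placeholders/defaults and may differ:

```
# Is the multus admission webhook registered, and what is its failure policy?
oc get mutatingwebhookconfigurations | grep -i multus

# Basic API reachability from the suspect node, via a node debug shell
oc debug node/master-0 -- chroot /host curl -k https://api-int.<cluster-domain>:6443/healthz
```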

Comment 29 Abu Davis 2020-08-10 09:24:09 UTC
Ever since we upgraded our OpenShift cluster (Azure IPI) from OpenShift 4.3.22 to 4.3.28, we have been getting a similar issue: "Error creating build pod: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s"

At the same time, we also see the below errors reported on the dashboard:
"The API server has an abnormal latency of 13.016189253374991 seconds for POST pods."
"API server is returning errors for 100% of requests for POST pods"
"The API server has a 99th percentile latency of 14.95 seconds for POST pods."

Comment 52 Red Hat Bugzilla 2023-09-15 00:34:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

