Bug 1714699 - Rolling out OVS daemonset changes takes a very long time.
Summary: Rolling out OVS daemonset changes takes a very long time.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: David Eads
QA Contact: zhou ying
URL:
Whiteboard:
Duplicates: 1715106 1715515 (view as bug list)
Depends On:
Blocks: 1718375
 
Reported: 2019-05-28 15:50 UTC by Casey Callendrello
Modified: 2019-10-16 06:29 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1718375 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:29:21 UTC
Target Upstream Version:
Embargoed:




Links:
  Red Hat Product Errata RHBA-2019:2922 - last updated 2019-10-16 06:29:35 UTC

Description Casey Callendrello 2019-05-28 15:50:21 UTC
Description of problem:

Triggering an OVS daemonset change takes a very, very long time. Specifically, the kube controller manager seems to pause for about 20 minutes between pod deletion and creation.

It doesn't happen for every pod. It seems to happen when the OVS pod is deleted on the current k-c-m leader.


Version-Release number of selected component (if applicable): 4.2 tip


How reproducible:  very


Steps to Reproduce:
1. Spin up a tip cluster. 
2. Scale the CVO and CNO deployments to 0 (so that your OVS daemonset changes are not reverted):
  oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas 0
  oc -n openshift-network-operator scale deployment network-operator --replicas 0

3. Make a trivial change to the OVS daemonset that triggers a rollout (e.g. add an 'echo' to the run command); a non-interactive alternative is sketched below the steps:
  oc -n openshift-sdn edit ds ovs

4. Watch pod rollout:
  oc -n openshift-sdn get pods -o wide -l app=ovs -w

One of the pods will be deleted, and it will be 20-or-so minutes before it is recreated.
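
As a non-interactive alternative to step 3, patching a throwaway pod-template annotation should also trigger a rollout (the "force-rollout" annotation key is arbitrary, for illustration only):

  oc -n openshift-sdn patch ds ovs --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"force-rollout":"1"}}}}}'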

Actual results:


Expected results:


Additional info:

Comment 1 Casey Callendrello 2019-05-28 15:54:22 UTC
For example, this is one instance:

from k-c-m

I0528 14:35:50.616011       1 controller_utils.go:598] Controller ovs deleting pod openshift-sdn/ovs-mmhtb
I0528 14:35:50.625727       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"d621ea49-8151-11e9-8eb6-02357c0703f8", APIVersion:"apps/v1", ResourceVersion:"19890", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: ovs-mmhtb
I0528 14:36:52.901169       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"d621ea49-8151-11e9-8eb6-02357c0703f8", APIVersion:"apps/v1", ResourceVersion:"20018", FieldPath:""}): type: 'Warning' reason: 'FailedCreate' Error creating: Timeout: request did not complete within requested timeout 30s

Comment 2 Casey Callendrello 2019-05-28 16:18:54 UTC
I have a lot of logs from reproducing this, but there's no smoking gun. During the disruption, the apiserver complains a lot about OpenAPI problems. It makes sense: most of the time, it is when the OVS pod is deleted on the kcm leader node, and the kcm is hard-coded to talk to the apiserver on localhost.

Comment 3 Dan Winship 2019-05-28 16:52:51 UTC
Possibly related to this, kube-controller-manager does not come up successfully when using ovn-kubernetes. We haven't debugged that yet, but one possibility that could explain both cases is that k-c-m does something that depends on the pod/service network while it is (re-)deploying the network plugin pods on its own node.

Comment 4 Dan Williams 2019-05-29 04:04:26 UTC
(In reply to Casey Callendrello from comment #2)
> I have a lot of logs from reproducing this, but there's no smoking gun.
> During the disruption, the apiserver complains a lot about OpenAPI
> problems. It makes sense: most of the time, it is when the OVS pod is
> deleted on the kcm leader node, and the kcm is hard-coded to talk to the
> apiserver on localhost.

I'm probably missing something, but the only hardcoding I see is for SecureServing, which wouldn't be used for talking to the apiserver, but for healthz and debugging or something.

Comment 5 Dan Williams 2019-05-29 04:31:06 UTC
From what I can see, the controller-manager gets its apiserver from a kubeconfig rendered by https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/cmd/render/render.go#L227. Are we able to conclusively determine what apiserver address it is using? Can we add some debugging to kubernetes/cmd/kube-controller-manager/app/options/options.go to print out Master and the contents of the Kubeconfig file?
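
One way to check without patching the binary: if the operator publishes the rendered kubeconfig as a configmap (the configmap name controller-manager-kubeconfig, the openshift-kube-controller-manager namespace, and the "kubeconfig" data key below are assumptions), something like this should show which server it points at:

  oc -n openshift-kube-controller-manager get cm controller-manager-kubeconfig \
    -o jsonpath='{.data.kubeconfig}' | grep 'server:'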

kube-controller-manager is supposed to be hostNetwork (https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/bootkube/bootstrap-manifests/kube-controller-manager-pod.yaml#L13) and the apiserver is supposed to be hostNetwork, so I don't see why they shouldn't be able to talk to each other. But if there are some admission controllers in the way that aren't hostNetwork, maybe that's causing issues?

Comment 6 Casey Callendrello 2019-05-29 14:29:13 UTC
Yeah, I'm certain the k-c-m is talking to the apiserver on localhost:6443.

I'm doing just a bit more investigating; deleting the OVS pod shouldn't cause networking to die (though it will stop service IP changes from rolling out).

Comment 7 Dan Williams 2019-05-30 16:08:52 UTC
*** Bug 1715515 has been marked as a duplicate of this bug. ***

Comment 8 Dan Williams 2019-05-30 16:43:38 UTC
Ok, taking a recent example:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2053/

and one of the master nodes: ip-10-0-130-17.ec2.internal

On upgrade, kube-controller-manager deletes its node's OVS pod and then times out talking to the apiserver:

I0530 11:20:44.815682       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"9ae4bdee-82c9-11e9-bd3e-12c64ec43b90", APIVersion:"apps/v1", ResourceVersion:"28327", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: ovs-vk475
E0530 11:21:03.511611       1 resource_quota_controller.go:414] unable to retrieve the complete list of server APIs: authorization.openshift.io/v1: Get https://localhost:6443/apis/authorization.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), build.openshift.io/v1: Get https://localhost:6443/apis/build.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), oauth.openshift.io/v1: Get https://localhost:6443/apis/oauth.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), route.openshift.io/v1: Get https://localhost:6443/apis/route.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), security.openshift.io/v1: Get https://localhost:6443/apis/security.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The apiserver on the same node for the same timeframe says:

E0530 11:20:39.905058       1 available_controller.go:405] v1.packages.operators.coreos.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.packages.operators.coreos.com": the object has been modified; please apply your changes to the latest version and try again
E0530 11:20:44.875097       1 available_controller.go:405] v1.route.openshift.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.route.openshift.io": the object has been modified; please apply your changes to the latest version and try again
E0530 11:20:54.859852       1 available_controller.go:405] v1.security.openshift.io failed with: no response from https://10.128.0.56:8443: Get https://10.128.0.56:8443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0530 11:20:54.861613       1 available_controller.go:405] v1.quota.openshift.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.quota.openshift.io": the object has been modified; please apply your changes to the latest version and try again
E0530 11:20:54.872271       1 available_controller.go:405] v1.packages.operators.coreos.com failed with: no response from https://10.129.0.55:5443: Get https://10.129.0.55:5443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

10.128.0.56:8443 is an openshift-apiserver-rnxz6 pod on node ip-10-0-154-110.ec2.internal.

So clearly while OVS is down the apiserver cannot reach things on the pod network. That appears to be the problem.
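
For reference, a quick way to see which APIServices are backed by pod-network services (and therefore unreachable while OVS is down) is to list their backing services; non-aggregated groups show <none> for the service columns:

  oc get apiservice -o custom-columns='NAME:.metadata.name,SVC-NAMESPACE:.spec.service.namespace,SVC-NAME:.spec.service.name'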

Comment 9 Dan Williams 2019-05-30 17:51:48 UTC
We are not guaranteed (nor do Winship and I think we have ever guaranteed) that the pod network will continue working while OVS is down. Existing connections will continue to function, but new connections (e.g. a new HTTP request) may hang because the OVS vswitchd is not around to service the kernel's upcall.

The big question is why the kube-controller-manager is trying to talk to an apiserver on localhost and only localhost. Were it configured to talk to any available apiserver (either through an ELB or service VIP or something) it would be able to contact an apiserver on a node that didn't have OVS restarting at that exact time, and this would not be a problem.

Comment 10 Dan Winship 2019-05-30 18:10:53 UTC
So Casey isn't around today so I'm guessing a little bit, but... this bug is something Casey didn't notice until he was testing https://github.com/openshift/cluster-network-operator/pull/181, which fixed a bug that made us fail to shut down OVS correctly. And apparently, basically immediately after that merged, this bug started causing CI to fail basically all the time (bug 1715515). So it looks like the reason we never got hit by this bug before was because previously we were never shutting OVS down correctly, and our infrastructure was accidentally depending on that fact.

(And CNO #181 is now being reverted to unbreak CI.)

Comment 11 Dan Williams 2019-05-31 15:11:51 UTC
(In reply to Dan Winship from comment #10)
> So Casey isn't around today so I'm guessing a little bit, but... this bug is
> something Casey didn't notice until he was testing
> https://github.com/openshift/cluster-network-operator/pull/181, which fixed
> a bug that made us fail to shut down OVS correctly. And apparently,
> basically immediately after that merged, this bug started causing CI to fail
> basically all the time (bug 1715515). So it looks like the reason we never
> got hit by this bug before was because previously we were never shutting OVS
> down correctly, and our infrastructure was accidentally depending on that
> fact.
> 
> (And CNO #181 is now being reverted to unbreak CI.)

And the revert doesn't seem to have fixed the issue, so it is likely unrelated. The current theory from @deads2k is that the apiserver and etcd are unable to talk to each other correctly for some reason.

Comment 12 Casey Callendrello 2019-06-03 13:38:16 UTC
From what I could see, the apiserver on the same node as the k-c-m master was not responsive. It appeared to be due to difficulty accessing aggregated API resources, which makes sense, since the pod network was down.

Comment 13 David Eads 2019-06-06 20:28:23 UTC
Found it. There's a REST mapping done to make block-deletion determinations that is a very likely candidate, after finding:

I0606 20:17:04.678377       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"00b78075-8882-11e9-aa6c-02ca800c5ec6", APIVersion:"apps/v1", ResourceVersion:"52500", FieldPath:""}): type: 'Warning' reason: 'FailedCreate' Error creating: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s

via https://github.com/openshift/origin/pull/23066.

The mapping requires hitting aggregated API servers, and we can't reach them. That's the only way to get a logically complete answer, but we *could* choose to make a fake RESTMapper for certain GVKs that we want to force. I'll negotiate with Stefan for the amount of ugly he can take.
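
A quick way to confirm that aggregated discovery is what stalls is to hit one of the aggregated groups directly while OVS is down on the master your kubeconfig points at; this should hang or time out (the 10s timeout is arbitrary):

  oc get --raw /apis/route.openshift.io/v1 --request-timeout=10s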

Comment 14 David Eads 2019-06-07 13:42:21 UTC
https://github.com/openshift/origin/pull/23082 is an option for the short to medium term.

Comment 15 Casey Callendrello 2019-06-18 15:33:28 UTC
*** Bug 1715106 has been marked as a duplicate of this bug. ***

Comment 17 zhou ying 2019-06-28 03:53:04 UTC
Confirmed with the latest OCP build containing the PR; can't reproduce it.
Payload: 4.2.0-0.nightly-2019-06-27-204704

[root@dhcp-140-138 ~]# oc -n openshift-sdn get pods -o wide -l app=ovs
NAME        READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovs-97v4k   1/1     Running   0          86s     10.0.133.53    ip-10-0-133-53.us-east-2.compute.internal    <none>           <none>
ovs-kzfcr   1/1     Running   0          109s    10.0.148.1     ip-10-0-148-1.us-east-2.compute.internal     <none>           <none>
ovs-nj5fc   1/1     Running   0          65s     10.0.162.206   ip-10-0-162-206.us-east-2.compute.internal   <none>           <none>
ovs-pb5dj   1/1     Running   0          44s     10.0.134.218   ip-10-0-134-218.us-east-2.compute.internal   <none>           <none>
ovs-t2jht   1/1     Running   0          2m10s   10.0.157.14    ip-10-0-157-14.us-east-2.compute.internal    <none>           <none>

Comment 19 errata-xmlrpc 2019-10-16 06:29:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

