Bug 1718375 - Rolling out OVS daemonset changes takes a very long time.
Summary: Rolling out OVS daemonset changes takes a very long time.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.z
Assignee: David Eads
QA Contact: zhou ying
URL:
Whiteboard: 4.1.2
Duplicates: 1719423
Depends On: 1714699
Blocks:
 
Reported: 2019-06-07 15:08 UTC by Greg Blomquist
Modified: 2019-06-19 06:45 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1714699
Environment:
Last Closed: 2019-06-19 06:45:34 UTC
Target Upstream Version:
Embargoed:




Links:
GitHub openshift/origin pull 23086 (closed): Bug 1718375: hardcode a small list of mappings to allow SDN to rebootstrap (last updated 2021-02-15 23:41:25 UTC)
Red Hat Product Errata RHBA-2019:1382 (last updated 2019-06-19 06:45:44 UTC)

Description Greg Blomquist 2019-06-07 15:08:15 UTC
+++ This bug was initially created as a clone of Bug #1714699 +++

Description of problem:

Triggering an OVS daemonset change takes a very, very long time. Specifically, the kube controller manager seems to pause for about 20 minutes between pod deletion and creation.

It doesn't happen for every pod. It seems to happen when the OVS pod is deleted on the current k-c-m leader.


Version-Release number of selected component (if applicable): 4.2 tip


How reproducible:  very


Steps to Reproduce:
1. Spin up a tip cluster. 
2. Scale the CVO and CNO to 0 (so you can make OVS daemonset changes):
  oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas 0
  oc -n openshift-network-operator scale deployment network-operator --replicas 0

3. Make a trivial change to the OVS daemonset that triggers a rollout (e.g., add an 'echo' to the run command):
  oc -n openshift-sdn edit ds ovs

4. Watch pod rollout:
  oc -n openshift-sdn get pods -o wide -l app=ovs -w

One of the pods will be deleted, and it will be 20-or-so minutes before it is recreated.

Actual results:


Expected results:


Additional info:

--- Additional comment from Casey Callendrello on 2019-05-28 15:54:22 UTC ---

For example, this is one instance:

from k-c-m

I0528 14:35:50.616011       1 controller_utils.go:598] Controller ovs deleting pod openshift-sdn/ovs-mmhtb
I0528 14:35:50.625727       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"d621ea49-8151-11e9-8eb6-02357c0703f8", APIVersion:"apps/v1", ResourceVersion:"19890", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: ovs-mmhtb
I0528 14:36:52.901169       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"d621ea49-8151-11e9-8eb6-02357c0703f8", APIVersion:"apps/v1", ResourceVersion:"20018", FieldPath:""}): type: 'Warning' reason: 'FailedCreate' Error creating: Timeout: request did not complete within requested timeout 30s

--- Additional comment from Casey Callendrello on 2019-05-28 16:18:54 UTC ---

I have a lot of logs from reproducing this, but there's no smoking gun. During the time of disruption, the apiserver complains a lot about OpenAPI problems. It makes sense: often it happens when the OVS pod is deleted on the kcm leader node, and the kcm is hard-coded to talk to the apiserver on localhost.

--- Additional comment from Dan Winship on 2019-05-28 16:52:51 UTC ---

Possibly related to this, kube-controller-manager does not come up successfully when using ovn-kubernetes. Though we haven't debugged that yet, one possibility that could explain both cases is that k-c-m is doing something that depends on the pod/service network existing during the period when it is (re-)deploying the network plugin pods on its own node.

--- Additional comment from Dan Williams on 2019-05-29 04:04:26 UTC ---

(In reply to Casey Callendrello from comment #2)
> I have a lot of logs from reproducing this, but there's no smoking gun.
> During the time of disruption, the apiserver complains a lot about OpenAPI
> problems. It makes sense: often it happens when the OVS pod is deleted on
> the kcm leader node, and the kcm is hard-coded to talk to the apiserver on
> localhost.

I'm probably missing something, but the only hardcoding I see is for SecureServing which wouldn't be for talking to the apiserver, but for healthz and debugging or something.

--- Additional comment from Dan Williams on 2019-05-29 04:31:06 UTC ---

From what I can see, the controller-manager gets its apiserver from a kubeconfig, rendered by https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/pkg/cmd/render/render.go#L227. Can we conclusively determine which apiserver address it is using? Can we stick some debugging into kubernetes/cmd/kube-controller-manager/app/options/options.go to print out Master and the contents of the Kubeconfig file?
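
Not the actual patch, just a minimal standalone sketch of the kind of dump being asked for; the flag names and the idea of a separate helper program are assumptions, since in practice this would be wired into the options parsing itself:

package main

import (
	"flag"
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	// Hypothetical flags mirroring the two values the comment asks about:
	// the Master override and the path to the rendered kubeconfig.
	master := flag.String("master", "", "value that would be passed as --master")
	kubeconfig := flag.String("kubeconfig", "", "path to the rendered kubeconfig")
	flag.Parse()

	fmt.Printf("kube-controller-manager master=%q kubeconfig=%q\n", *master, *kubeconfig)

	// Dump the raw kubeconfig so the server address it points at is visible.
	data, err := ioutil.ReadFile(*kubeconfig)
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to read kubeconfig %q: %v\n", *kubeconfig, err)
		os.Exit(1)
	}
	fmt.Printf("kubeconfig contents:\n%s\n", data)
}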

kube-controller-manager is supposed to be hostNetwork (https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/bootkube/bootstrap-manifests/kube-controller-manager-pod.yaml#L13) and the apiserver is supposed to be hostNetwork, so I don't see why they shouldn't be able to talk to each other. But if there are some admission controllers in the way that aren't hostNetwork, maybe that's causing issues?

--- Additional comment from Casey Callendrello on 2019-05-29 14:29:13 UTC ---

Yeah, I'm certain the k-c-m is talking to the apiserver on localhost:6443.

I'm doing just a bit more investigating; deleting the OVS pod shouldn't cause networking to die (though it will stop service IP changes from rolling out).

--- Additional comment from Dan Williams on 2019-05-30 16:08:52 UTC ---



--- Additional comment from Dan Williams on 2019-05-30 16:43:38 UTC ---

Ok, taking a recent example:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2053/

and one of the master nodes: ip-10-0-130-17.ec2.internal

On upgrade, kube-controller-manager deletes its node's OVS pod and then times out talking to the apiserver:

I0530 11:20:44.815682       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"9ae4bdee-82c9-11e9-bd3e-12c64ec43b90", APIVersion:"apps/v1", ResourceVersion:"28327", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: ovs-vk475
E0530 11:21:03.511611       1 resource_quota_controller.go:414] unable to retrieve the complete list of server APIs: authorization.openshift.io/v1: Get https://localhost:6443/apis/authorization.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), build.openshift.io/v1: Get https://localhost:6443/apis/build.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), oauth.openshift.io/v1: Get https://localhost:6443/apis/oauth.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), route.openshift.io/v1: Get https://localhost:6443/apis/route.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), security.openshift.io/v1: Get https://localhost:6443/apis/security.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The apiserver on the same node for the same timeframe says:

E0530 11:20:39.905058       1 available_controller.go:405] v1.packages.operators.coreos.com failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.packages.operators.coreos.com": the object has been modified; please apply your changes to the latest version and try again
E0530 11:20:44.875097       1 available_controller.go:405] v1.route.openshift.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.route.openshift.io": the object has been modified; please apply your changes to the latest version and try again
E0530 11:20:54.859852       1 available_controller.go:405] v1.security.openshift.io failed with: no response from https://10.128.0.56:8443: Get https://10.128.0.56:8443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0530 11:20:54.861613       1 available_controller.go:405] v1.quota.openshift.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.quota.openshift.io": the object has been modified; please apply your changes to the latest version and try again
E0530 11:20:54.872271       1 available_controller.go:405] v1.packages.operators.coreos.com failed with: no response from https://10.129.0.55:5443: Get https://10.129.0.55:5443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

10.128.0.56:8443 is an openshift-apiserver-rnxz6 pod on node ip-10-0-154-110.ec2.internal.

So clearly while OVS is down the apiserver cannot reach things on the pod network. That appears to be the problem.

--- Additional comment from Dan Williams on 2019-05-30 17:51:48 UTC ---

We are not guaranteed (nor do Winship and I think we have ever guaranteed) that the pod network will continue working while OVS is down. Existing connections will continue to function, but new connections (e.g., a new HTTP request) may hang because the OVS vswitchd is not around to service the kernel's upcall.

The big question is why the kube-controller-manager is trying to talk to an apiserver on localhost and only localhost. Were it configured to talk to any available apiserver (through an ELB, a service VIP, or something similar), it would be able to contact an apiserver on a node that didn't have OVS restarting at that exact time, and this would not be a problem.
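
As a side note, a small standalone check (an illustration, not part of any fix) that uses client-go to resolve a kubeconfig and print the apiserver endpoint it would use; the default path below is an assumption and would need to point at the controller-manager's rendered kubeconfig on a master node:

package main

import (
	"flag"
	"fmt"

	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical default path; adjust to the controller-manager kubeconfig.
	path := flag.String("kubeconfig", "/etc/kubernetes/kubeconfig", "kubeconfig to inspect")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *path)
	if err != nil {
		panic(err)
	}
	// If this prints https://localhost:6443, the controller-manager is pinned
	// to the apiserver on its own node; an ELB or service VIP here would let it
	// reach an apiserver on a node that isn't restarting OVS at that moment.
	fmt.Println("apiserver endpoint:", cfg.Host)
}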

--- Additional comment from Dan Winship on 2019-05-30 18:10:53 UTC ---

So Casey isn't around today so I'm guessing a little bit, but... this bug is something Casey didn't notice until he was testing https://github.com/openshift/cluster-network-operator/pull/181, which fixed a bug that made us fail to shut down OVS correctly. And apparently, basically immediately after that merged, this bug started causing CI to fail basically all the time (bug 1715515). So it looks like the reason we never got hit by this bug before was because previously we were never shutting OVS down correctly, and our infrastructure was accidentally depending on that fact.

(And CNO #181 is now being reverted to unbreak CI.)

--- Additional comment from Dan Williams on 2019-05-31 15:11:51 UTC ---

(In reply to Dan Winship from comment #10)
> So Casey isn't around today so I'm guessing a little bit, but... this bug is
> something Casey didn't notice until he was testing
> https://github.com/openshift/cluster-network-operator/pull/181, which fixed
> a bug that made us fail to shut down OVS correctly. And apparently,
> basically immediately after that merged, this bug started causing CI to fail
> basically all the time (bug 1715515). So it looks like the reason we never
> got hit by this bug before was because previously we were never shutting OVS
> down correctly, and our infrastructure was accidentally depending on that
> fact.
> 
> (And CNO #181 is now being reverted to unbreak CI.)

And the revert doesn't seem to have fixed the issues, so it is likely unrelated. The current theory from @deads2k is that the apiserver and etcd are unable to talk to each other correctly for some reason.

--- Additional comment from Casey Callendrello on 2019-06-03 13:38:16 UTC ---

From what I could see, the apiserver on the same node as the k-c-m master was not responsive. It appeared to be due to difficulty accessing aggregated API resources, which makes sense, since the pod network was down.

--- Additional comment from David Eads on 2019-06-06 20:28:23 UTC ---

Found it. There's a RESTMapping done to make block-deletion determinations that is a very likely candidate, after finding:

I0606 20:17:04.678377       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"00b78075-8882-11e9-aa6c-02ca800c5ec6", APIVersion:"apps/v1", ResourceVersion:"52500", FieldPath:""}): type: 'Warning' reason: 'FailedCreate' Error creating: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s

via https://github.com/openshift/origin/pull/23066.

The mapping requires hitting aggregated API servers, and we can't reach them. That's the only way to get a logically complete answer, but we *could* choose to make a fake RESTMapper for certain GVKs that we want to force. I'll negotiate with Stefan over how much ugly he can take.
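
To illustrate the shape of that workaround, here is a rough sketch of a RESTMapper seeded with hardcoded mappings; the GVK list and function name are illustrative only, not the code or mapping set from the eventual PR:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// newHardcodedRESTMapper seeds a DefaultRESTMapper with a fixed set of
// mappings so a caller (such as the OwnerReferencesPermissionEnforcement
// admission plugin) can resolve these kinds without calling out to
// aggregated API servers.
func newHardcodedRESTMapper() meta.RESTMapper {
	mapper := meta.NewDefaultRESTMapper(nil)
	mapper.AddSpecific(
		schema.GroupVersionKind{Group: "apps", Version: "v1", Kind: "DaemonSet"},
		schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "daemonsets"},
		schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "daemonset"},
		meta.RESTScopeNamespace,
	)
	mapper.AddSpecific(
		schema.GroupVersionKind{Version: "v1", Kind: "Pod"},
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		schema.GroupVersionResource{Version: "v1", Resource: "pod"},
		meta.RESTScopeNamespace,
	)
	return mapper
}

func main() {
	m := newHardcodedRESTMapper()
	// Resolves purely from the hardcoded table, no apiserver round-trip.
	mapping, err := m.RESTMapping(schema.GroupKind{Group: "apps", Kind: "DaemonSet"}, "v1")
	if err != nil {
		panic(err)
	}
	fmt.Println("resolved without an aggregated apiserver:", mapping.Resource)
}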

--- Additional comment from David Eads on 2019-06-07 13:42:21 UTC ---

https://github.com/openshift/origin/pull/23082 is an option for the short to medium term.

Comment 1 W. Trevor King 2019-06-11 20:27:38 UTC
*** Bug 1719423 has been marked as a duplicate of this bug. ***

Comment 2 W. Trevor King 2019-06-11 23:29:59 UTC
merged: https://github.com/openshift/origin/pull/23086#event-2405710093

Comment 3 W. Trevor King 2019-06-11 23:41:29 UTC
https://openshift-release.svc.ci.openshift.org/releasestream/4.1.0-0.ci/release/4.1.0-0.ci-2019-06-11-232837 has the fix (although I'm not moving to ON_QA, since that's supposed to happen via ART tooling):

$ oc adm release info --changelog ~/.local/lib/go/src --changes-from quay.io/openshift-release-dev/ocp-release:4.1.1 registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-06-11-232837
...
### [cli, cli-artifacts, deployer, hyperkube, hypershift, node, tests](https://github.com/openshift/origin)

* [Bug 1718375](https://bugzilla.redhat.com/show_bug.cgi?id=1718375): hardcode a small list of mappings to allow SDN to rebootstrap [#23086](https://github.com/openshift/origin/pull/23086)
* [Full changelog](https://github.com/openshift/origin/compare/4b47d33946729fd09e16da3562355922aedc89b8...3915504e7f6f91c47b242aeb80802f143757b1d7)
...

Comment 4 Mike Fiedler 2019-06-11 23:50:07 UTC
Re: comment 3: whether by tooling or manually, it should go ON_QA when it is available in a downstream (nightly) build.

Comment 6 zhou ying 2019-06-13 08:36:01 UTC
Confirmed with the latest payload, 4.1.0-0.nightly-2019-06-12-205455: cannot reproduce it; most of the time the OVS pods restart within 3 minutes.

Steps:
`oc set env ds/ovs key=value6 -n openshift-sdn`
`oc -n openshift-sdn get pods -o wide -l app=ovs -w`
NAME        READY   STATUS        RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovs-2lnmh   1/1     Running       0          4m9s    10.0.150.238   ip-10-0-150-238.us-east-2.compute.internal   <none>           <none>
ovs-7t9px   1/1     Running       0          63s     10.0.136.231   ip-10-0-136-231.us-east-2.compute.internal   <none>           <none>
ovs-9c9qr   1/1     Terminating   0          2m16s   10.0.172.221   ip-10-0-172-221.us-east-2.compute.internal   <none>           <none>
ovs-9vz6m   1/1     Running       0          3m29s   10.0.134.221   ip-10-0-134-221.us-east-2.compute.internal   <none>           <none>
ovs-f7nfg   1/1     Running       0          97s     10.0.157.39    ip-10-0-157-39.us-east-2.compute.internal    <none>           <none>
ovs-hhg2n   1/1     Running       0          2m48s   10.0.173.160   ip-10-0-173-160.us-east-2.compute.internal   <none>           <none>
ovs-9c9qr   0/1   Terminating   0     2m45s   10.0.172.221   ip-10-0-172-221.us-east-2.compute.internal   <none>   <none>
ovs-9c9qr   0/1   Terminating   0     2m56s   10.0.172.221   ip-10-0-172-221.us-east-2.compute.internal   <none>   <none>
ovs-9c9qr   0/1   Terminating   0     2m56s   10.0.172.221   ip-10-0-172-221.us-east-2.compute.internal   <none>   <none>
ovs-2lqdw   0/1   Pending   0     0s    <none>   <none>   <none>   <none>
ovs-2lqdw   0/1   Pending   0     0s    <none>   ip-10-0-172-221.us-east-2.compute.internal   <none>   <none>
ovs-2lqdw   0/1   ContainerCreating   0     0s    10.0.172.221   ip-10-0-172-221.us-east-2.compute.internal   <none>   <none>
ovs-2lqdw   1/1   Running   0     2s    10.0.172.221   ip-10-0-172-221.us-east-2.compute.internal   <none>   <none>
ovs-hhg2n   1/1   Terminating   0     3m30s   10.0.173.160   ip-10-0-173-160.us-east-2.compute.internal   <none>   <none>
ovs-hhg2n   0/1   Terminating   0     4m1s   10.0.173.160   ip-10-0-173-160.us-east-2.compute.internal   <none>   <none>
ovs-hhg2n   0/1   Terminating   0     4m10s   10.0.173.160   ip-10-0-173-160.us-east-2.compute.internal   <none>   <none>
ovs-hhg2n   0/1   Terminating   0     4m10s   10.0.173.160   ip-10-0-173-160.us-east-2.compute.internal   <none>   <none>
ovs-lngh7   0/1   Pending   0     0s    <none>   <none>   <none>   <none>
ovs-lngh7   0/1   Pending   0     0s    <none>   ip-10-0-173-160.us-east-2.compute.internal   <none>   <none>
ovs-lngh7   0/1   ContainerCreating   0     0s    10.0.173.160   ip-10-0-173-160.us-east-2.compute.internal   <none>   <none>
ovs-lngh7   1/1   Running   0     2s    10.0.173.160   ip-10-0-173-160.us-east-2.compute.internal   <none>   <none>
ovs-2lnmh   1/1   Terminating   0     5m33s   10.0.150.238   ip-10-0-150-238.us-east-2.compute.internal   <none>   <none>
ovs-2lnmh   0/1   Terminating   0     6m3s   10.0.150.238   ip-10-0-150-238.us-east-2.compute.internal   <none>   <none>
ovs-2lnmh   0/1   Terminating   0     6m4s   10.0.150.238   ip-10-0-150-238.us-east-2.compute.internal   <none>   <none>
ovs-2lnmh   0/1   Terminating   0     6m4s   10.0.150.238   ip-10-0-150-238.us-east-2.compute.internal   <none>   <none>
ovs-6f4j9   0/1   Pending   0     0s    <none>   <none>   <none>   <none>
ovs-6f4j9   0/1   Pending   0     0s    <none>   ip-10-0-150-238.us-east-2.compute.internal   <none>   <none>
ovs-6f4j9   0/1   ContainerCreating   0     0s    10.0.150.238   ip-10-0-150-238.us-east-2.compute.internal   <none>   <none>
ovs-6f4j9   1/1   Running   0     1s    10.0.150.238   ip-10-0-150-238.us-east-2.compute.internal   <none>   <none>
ovs-9vz6m   1/1   Terminating   0     5m25s   10.0.134.221   ip-10-0-134-221.us-east-2.compute.internal   <none>   <none>
ovs-9vz6m   0/1   Terminating   0     5m56s   10.0.134.221   ip-10-0-134-221.us-east-2.compute.internal   <none>   <none>
ovs-9vz6m   0/1   Terminating   0     5m57s   10.0.134.221   ip-10-0-134-221.us-east-2.compute.internal   <none>   <none>
ovs-9vz6m   0/1   Terminating   0     5m57s   10.0.134.221   ip-10-0-134-221.us-east-2.compute.internal   <none>   <none>
ovs-5gqm8   0/1   Pending   0     0s    <none>   <none>   <none>   <none>
ovs-5gqm8   0/1   Pending   0     0s    <none>   ip-10-0-134-221.us-east-2.compute.internal   <none>   <none>
ovs-5gqm8   0/1   ContainerCreating   0     0s    10.0.134.221   ip-10-0-134-221.us-east-2.compute.internal   <none>   <none>
ovs-5gqm8   1/1   Running   0     1s    10.0.134.221   ip-10-0-134-221.us-east-2.compute.internal   <none>   <none>
ovs-7t9px   1/1   Terminating   0     3m32s   10.0.136.231   ip-10-0-136-231.us-east-2.compute.internal   <none>   <none>
ovs-7t9px   0/1   Terminating   0     4m3s   10.0.136.231   ip-10-0-136-231.us-east-2.compute.internal   <none>   <none>
ovs-7t9px   0/1   Terminating   0     4m10s   10.0.136.231   ip-10-0-136-231.us-east-2.compute.internal   <none>   <none>
ovs-7t9px   0/1   Terminating   0     4m10s   10.0.136.231   ip-10-0-136-231.us-east-2.compute.internal   <none>   <none>
ovs-4dd74   0/1   Pending   0     0s    <none>   <none>   <none>   <none>
ovs-4dd74   0/1   Pending   0     0s    <none>   ip-10-0-136-231.us-east-2.compute.internal   <none>   <none>
ovs-4dd74   0/1   ContainerCreating   0     0s    10.0.136.231   ip-10-0-136-231.us-east-2.compute.internal   <none>   <none>
ovs-4dd74   1/1   Running   0     1s    10.0.136.231   ip-10-0-136-231.us-east-2.compute.internal   <none>   <none>
ovs-f7nfg   1/1   Terminating   0     4m45s   10.0.157.39   ip-10-0-157-39.us-east-2.compute.internal   <none>   <none>
ovs-f7nfg   0/1   Terminating   0     5m16s   10.0.157.39   ip-10-0-157-39.us-east-2.compute.internal   <none>   <none>
ovs-f7nfg   0/1   Terminating   0     5m17s   10.0.157.39   ip-10-0-157-39.us-east-2.compute.internal   <none>   <none>
ovs-f7nfg   0/1   Terminating   0     5m17s   10.0.157.39   ip-10-0-157-39.us-east-2.compute.internal   <none>   <none>
ovs-vvhhr   0/1   Pending   0     0s    <none>   <none>   <none>   <none>
ovs-vvhhr   0/1   Pending   0     0s    <none>   ip-10-0-157-39.us-east-2.compute.internal   <none>   <none>
ovs-vvhhr   0/1   ContainerCreating   0     0s    10.0.157.39   ip-10-0-157-39.us-east-2.compute.internal   <none>   <none>
ovs-vvhhr   1/1   Running   0     1s    10.0.157.39   ip-10-0-157-39.us-east-2.compute.internal   <none>   <none>

Comment 8 errata-xmlrpc 2019-06-19 06:45:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1382

