Bug 1706082

Summary: [upgrade] Cluster upgrade should maintain a functioning cluster: replicaset "rs" never became ready

Product: OpenShift Container Platform
Reporter: Matthew Wong <mawong>
Component: Machine Config Operator
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA
QA Contact: Micah Abbott <miabbott>
Severity: high
Docs Contact:
Priority: high
Version: 4.1.0
CC: aos-bugs, calfonso, ccoleman, deads, gblomqui, jokerman, kgarriso, lmeyer, lserven, maszulik, mifiedle, mmccomas, mnguyen, pcameron, rkrawitz, schoudha, spadgett, sponnaga, walters, wking, xtian
Target Milestone: ---
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: 4.1.4
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 1721583 (view as bug list)
Environment:
Last Closed: 2019-07-04 09:01:22 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1721583    

Description Matthew Wong 2019-05-03 13:49:28 UTC
Description of problem: In e2e-aws-upgrade:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:83]: during upgrade
Expected error:
    <*errors.errorString | 0xc0013a5af0>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition",
    }
    Cluster did not complete upgrade: timed out waiting for the condition
not to have occurred

OR

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/replicasets.go:85]: Expected error:
    <*errors.errorString | 0xc00335a5e0>: {
        s: "replicaset \"rs\" never became ready",
    }
    replicaset "rs" never became ready
not to have occurred
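
For context, the check behind "never became ready" is essentially a poll of the ReplicaSet status until ReadyReplicas matches the desired replica count. A minimal client-go sketch of that kind of wait (names, interval, and timeout are assumptions, not the upstream test's exact values):

    // Sketch of the kind of check that fails with `replicaset "rs" never became
    // ready`: poll the ReplicaSet until all desired replicas report ready, or
    // give up after a timeout. Interval and timeout are assumptions, not the
    // upstream test's values.
    package example

    import (
        "context"
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    func waitForReplicaSetReady(c kubernetes.Interface, ns, name string) error {
        err := wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
            rs, err := c.AppsV1().ReplicaSets(ns).Get(context.TODO(), name, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            // Ready means every desired replica is running and passing readiness probes.
            return rs.Spec.Replicas != nil && rs.Status.ReadyReplicas == *rs.Spec.Replicas, nil
        })
        if err != nil {
            return fmt.Errorf("replicaset %q never became ready: %v", name, err)
        }
        return nil
    }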

Seen 23x in the last day, out of 32x total in the last week.

https://search.svc.ci.openshift.org/?search=replicaset+%22rs%22+never+became+ready&maxAge=168h&context=2&type=all

$ ./deck-build-log-grep "Unexpected error occurred: replicaset \"rs\" never became ready"
Unexpected error occurred: replicaset "rs" never became ready matches 23 of 340 failures

examples:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1117#openshift-tests-disruptive-cluster-upgrade-should-maintain-a-functioning-cluster-featureclusterupgrade-suiteopenshift-serial
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1101#openshift-tests-disruptive-cluster-upgrade-should-maintain-a-functioning-cluster-featureclusterupgrade-suiteopenshift-serial
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1094#openshift-tests-disruptive-cluster-upgrade-should-maintain-a-functioning-cluster-featureclusterupgrade-suiteopenshift-serial

Comment 2 W. Trevor King 2019-05-03 18:08:10 UTC
Actually, Clayton was pointing at Cloud for k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58.  So I dunno who this one belongs to.  Master seems like a plausible guess, but feel free to punt if you have better ideas.

Comment 3 Greg Blomquist 2019-05-03 19:24:12 UTC
Based on frequency, I'm moving this to 4.2 (non-4.1.0-blocker).  If anyone disagrees, pull it back in.

Comment 5 Clayton Coleman 2019-05-09 14:46:11 UTC
I’m seeing this more, and I don’t see how this isn’t a platform bug or a serious blocker for knowing our upgrades work.

We *must* know upgrades work; I expect to see a root cause before we kick it out (we kick out based on cause, not symptom).

Comment 6 Clayton Coleman 2019-05-09 14:47:44 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399

May be unrelated, but it has similar characteristics.

Comment 7 Maciej Szulik 2019-05-09 16:31:58 UTC
I've checked the last 4 failures and here are my findings:


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1136

From events:

May 03 11:31:59.038 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Available message changed from "Available: v1.apps.openshift.io is not ready: 503\nAvailable: v1.build.openshift.io is not ready: 503\nAvailable: v1.oauth.openshift.io is not ready: 503\nAvailable: v1.route.openshift.io is not ready: 503\nAvailable: v1.security.openshift.io is not ready: 503" to "Available: v1.authorization.openshift.io is not ready: 503\nAvailable: v1.user.openshift.io is not ready: 503"

Looking inside openshift-apiserver pods, ALL 3 of them contain:

W0503 11:30:43.724291       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd.openshift-etcd.svc:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.30.23.218:2379: connect: connection refused". Reconnecting...
W0503 11:30:43.728351       1 clientconn.go:1440] grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1117

From events:

04:33:26 (11979) "" MCDBootstrapSyncFailure pending config rendered-master-15fff6d40908cf99c46b23dbcb2a5ebd bootID 6f502d19-5a7e-44da-a3f1-cb0693a93833 matches current! Failed to reboot?

I haven't dug deeper, but the number of events, 12k (sic!), is more than worrying.


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399

From events:
13:25:23 (2) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "" to "StaticPodsDegraded: nodes/ip-10-0-161-216.ec2.internal pods/kube-apiserver-ip-10-0-161-216.ec2.internal container=""kube-apiserver-7"" is not ready

In 2 kube-apiserver pods' logs I see:

W0509 11:29:19.745173       1 clientconn.go:953] Failed to dial etcd-2.ci-op-ckldis1m-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399

From events:

15:04:20 (6) "openshift-controller-manager" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
15:04:20 (25) "openshift-machine-config-operator" FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.
15:04:20 (5) "openshift-kube-controller-manager" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
15:04:20 (6) "openshift-dns" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
15:04:20 (5) "openshift-kube-apiserver" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]


There's no single failure that repeats, but problems reaching nodes or etcd in the cluster suggest some networking issues. I'd like the networking team to double-check the logs for their components.

Comment 8 Phil Cameron 2019-05-09 17:45:39 UTC
API server and etcd are static pods that use host network. Cluster Network can't start until those are up. 

From the message:
FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.

Things are broken enough so that Networking can't get started.

Comment 9 David Eads 2019-05-09 17:48:51 UTC
Opened https://github.com/openshift/origin/pull/22809 to gather more logs, but based on logs I've looked at in the past, I suspect we're successfully creating the pods but they aren't running as expected.  I think this ticked up when we started creating pods more quickly.

Comment 10 David Eads 2019-05-09 17:52:15 UTC
@phil, trying to parse your message.

I agree that networking won't run until the kube-apiserver and etcd are up, but the fact that you're able to see the nodes is a clear indication that both are up and working.  It's also an indication that the scheduler is running and, if you're chasing the correct pod (you only get scheduling messages when trying to schedule pods), that the kube-controller-manager is up.  That means that:

1. etcd is up and working
2. kube-apiserver is up and working
3. kube-controller-manager is up and working (or your pod is already created somehow)
4. kube-scheduler is up and working

I think that having those working should be enough for the network to be functional.  What more do you think is required and what piece do you see as missing?
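
To make the reasoning above concrete, a minimal client-go sketch of the "if you can list nodes, the apiserver and etcd path is serving" check (an illustration under that assumption, not code from the operator or the test suite):

    // Sketch: if this List call succeeds, the request was served by the
    // kube-apiserver and answered from etcd, so both are up and working,
    // which is the inference made above. Illustration only, not test code.
    package example

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    func controlPlaneServing(c kubernetes.Interface) error {
        nodes, err := c.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            return fmt.Errorf("apiserver/etcd path is not serving: %v", err)
        }
        fmt.Printf("apiserver answered with %d nodes\n", len(nodes.Items))
        return nil
    }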

Comment 11 Dan Williams 2019-05-09 18:12:23 UTC
(In reply to Maciej Szulik from comment #7)
> I've checked the last 4 failures and here are my findings:
> 
> 
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-
> openshift-origin-installer-e2e-aws-upgrade/1136
> 
> From events:

Do you mean events.json here? Or where are you getting the events from?

Comment 12 Phil Cameron 2019-05-09 18:48:07 UTC
@David, comment 10. The pods are being prevented from starting.

FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.

Insofar as the daemonset YAML in the CNO is fixed (doesn't change), something is occasionally preventing the pods from starting. I would think that by the time the CNO runs, the nodes would let daemonsets start pods. It feels like we don't have 6 running but "Not Ready" nodes at this point. Could this be a race where the CNO starts too soon?

I am way out of my depth here. I am not sure how the cluster bring up sequences itself into a running cluster.

Comment 13 Robert Krawitz 2019-05-09 19:20:46 UTC
@pcameron @maszulik:

FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) were unschedulable.

is a normal message from the etcd-quorum-guard when a master isn't up (which it won't be between the time it starts to drain and the time it has fully come back up).  This is because the quorum guard's deployment wants 3 replicas, all on master nodes, and all on different nodes.
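
For readers who haven't seen that deployment, a minimal sketch of the shape being described -- 3 replicas, master-only, one per node -- follows; the names, labels, and image are placeholders, not the actual etcd-quorum-guard manifest:

    // Sketch of a deployment shaped like the etcd-quorum-guard described above:
    // 3 replicas, pinned to master nodes, with required anti-affinity so no two
    // replicas share a node. Names, labels, and image are placeholders; master
    // taint tolerations and probes are omitted for brevity.
    package example

    import (
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func quorumGuardStyleDeployment() *appsv1.Deployment {
        replicas := int32(3)
        labels := map[string]string{"name": "quorum-guard-example"}
        return &appsv1.Deployment{
            ObjectMeta: metav1.ObjectMeta{Name: "quorum-guard-example"},
            Spec: appsv1.DeploymentSpec{
                Replicas: &replicas,
                Selector: &metav1.LabelSelector{MatchLabels: labels},
                Template: corev1.PodTemplateSpec{
                    ObjectMeta: metav1.ObjectMeta{Labels: labels},
                    Spec: corev1.PodSpec{
                        // Only schedulable on master nodes.
                        NodeSelector: map[string]string{"node-role.kubernetes.io/master": ""},
                        // Hard anti-affinity: each replica must land on a different node,
                        // so a cordoned/draining master leaves one replica unschedulable.
                        Affinity: &corev1.Affinity{
                            PodAntiAffinity: &corev1.PodAntiAffinity{
                                RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
                                    LabelSelector: &metav1.LabelSelector{MatchLabels: labels},
                                    TopologyKey:   "kubernetes.io/hostname",
                                }},
                            },
                        },
                        Containers: []corev1.Container{{
                            Name:  "guard",
                            Image: "registry.example.com/quorum-guard:placeholder",
                        }},
                    },
                },
            },
        }
    }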

However, if it says *2* nodes are unschedulable (as seen), that's a very different matter -- that means that one node was cordoned while another node was still not schedulable.

From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399/artifacts/e2e-aws-upgrade/e2e.log we see that that is indeed somehow happening (I've annotated it with line numbers):

    4433 May 09 11:14:56.583 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-86767b79b5-6tt7d 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector. (12 times)
    4434 May 09 11:14:56.696 I node/ip-10-0-145-55.ec2.internal Node ip-10-0-145-55.ec2.internal status is now: NodeNotSchedulable
    4435 May 09 11:14:56.962 I node/ip-10-0-145-55.ec2.internal Node will reboot into config rendered-master-8dd27e0144c911a471964aca19dc5652
    4436 May 09 11:14:56.999 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Pulling image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:81eed66b4359d4a3015a3b0e92782afd6ef410e47ad7e68ed65c9dde54aa5b57"
    4437 May 09 11:14:57.113 I ns/openshift-service-ca pod/configmap-cabundle-injector-78f77bd9c9-d7q9x Container image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:8b0c9a1ad0a500f886b716bf85c8627c69376502f91915adb83f06b3ce9e9443" already present on machine
    4438 May 09 11:14:57.207 I ns/openshift-machine-config-operator pod/machine-config-operator-59b6486f69-xnbw2 Pulling image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:7e7ca4129d0909c97d3b4bd08756eb7088cfd86e9a28c56b70f3a6858830a63f"
    4439 May 09 11:14:57.331 I ns/openshift-service-ca pod/configmap-cabundle-injector-78f77bd9c9-d7q9x Created container configmap-cabundle-injector-controller
    4440 May 09 11:14:57.356 I ns/openshift-service-ca pod/configmap-cabundle-injector-78f77bd9c9-d7q9x Started container configmap-cabundle-injector-controller
    4441 May 09 11:14:57.918 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Successfully pulled image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:81eed66b4359d4a3015a3b0e92782afd6ef410e47ad7e68ed65c9dde54aa5b57"
    4442 May 09 11:14:58.203 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Created container dns-operator
    4443 May 09 11:14:58.284 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Started container dns-operator
    4444 May 09 11:14:58.935 I node/ip-10-0-162-146.ec2.internal Written pending config rendered-worker-5c17190a4fe3174649e5b19e539d750e
    4445 May 09 11:14:58.945 I node/ip-10-0-162-146.ec2.internal Draining node to update config.
    4446 May 09 11:14:58.970 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-86767b79b5-6tt7d 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.

Whether directly related or not, it does look like a second node (10.0.162.146) is being taken down before the first node to be upgraded (10.0.145.55) is fully ready.  It's only much later that 10.0.145.55 is schedulable:

    5077 May 09 11:17:51.017 W ns/openshift-dns pod/dns-default-lp7qd node/ip-10-0-145-55.ec2.internal container=dns-node-resolver container restarted
    5078 May 09 11:17:51.017 W ns/openshift-dns pod/dns-default-lp7qd node/ip-10-0-145-55.ec2.internal container=dns container restarted
    5079 May 09 11:17:51.041 I ns/openshift-apiserver pod/apiserver-f4xwn Container image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:45b6b845da1e8ec069ed206f2aadc752b27c5c2aea4bf7a4211d71b27bdc248d" already present on machine
    5080 May 09 11:17:51.236 I ns/openshift-apiserver pod/apiserver-f4xwn Created container openshift-apiserver
    5081 May 09 11:17:51.444 I ns/openshift-apiserver pod/apiserver-f4xwn Started container openshift-apiserver
    5082 May 09 11:17:51.635 I node/ip-10-0-145-55.ec2.internal Node ip-10-0-145-55.ec2.internal status is now: NodeSchedulable

However, 10.0.162.146 becomes ready much more quickly:

    4921 May 09 11:17:37.122 I node/ip-10-0-162-146.ec2.internal Node ip-10-0-162-146.ec2.internal status is now: NodeReady
    4922 May 09 11:17:37.126 I node/ip-10-0-162-146.ec2.internal Node ip-10-0-162-146.ec2.internal status is now: NodeSchedulable

Comment 14 Robert Krawitz 2019-05-09 19:42:11 UTC
Belay that last...10.0.162.146 was a worker, not a master.

Comment 15 Casey Callendrello 2019-05-09 19:42:46 UTC
Nothing jumps out as obvious network failures.

- The GRPC goaways seem to indicate that etcd is shutting down, no? In any case, the SDN wouldn't have much to do with this; it's host networking.
- The nodes being NotReady is 100% expected - they will be NotReady until the SDN has re-initialized after a reboot, after which the scheduled pod will start.

It is possible that we're rolling out reboots too quickly. Does the MCO wait for the node to be ready before proceeding with the next reboot?
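
To frame that question, the pacing being asked about amounts to a gate like the following before cordoning and rebooting the next node (a rough sketch of the idea with assumed intervals, not the machine-config-operator's actual code):

    // Sketch of the pacing question above: before moving on to cordon and
    // reboot the next node, wait until the previously rebooted node reports
    // Ready again. Illustration of the idea only; intervals are assumptions.
    package example

    import (
        "context"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    func waitForNodeReady(c kubernetes.Interface, nodeName string) error {
        return wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
            node, err := c.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
            if err != nil {
                return false, nil // keep polling through transient API errors
            }
            for _, cond := range node.Status.Conditions {
                if cond.Type == corev1.NodeReady {
                    return cond.Status == corev1.ConditionTrue, nil
                }
            }
            return false, nil
        })
    }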

What exactly is causing the tests to fail? There seem to be a lot of unexplained error conditions that are mostly transient.

Comment 16 Maciej Szulik 2019-05-10 12:19:47 UTC
*** Bug 1708454 has been marked as a duplicate of this bug. ***

Comment 17 Maciej Szulik 2019-05-10 12:47:06 UTC
Last failure after David merged https://github.com/openshift/origin/pull/22809 happened 2019-05-10 12:44 CEST (took 1h17m)
in https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/:

From looking at the dump I don't see the job controller creating 5 pods in total to fulfil the job, but then upon failure there are
ZERO running pods for this job among the dumped pods. I don't see any events that might point to anything unusual, but I need to
check the controller-manager logs for job-controller entries, since there's a chance that the backoff policy kicked in and
we created pods more slowly because the job was failing often enough. Also, I don't see any node logs, since we have
exutil.TestContext.MaxNodesToGather = 0 configured in https://github.com/openshift/origin/blob/66bcaddbb4264915cfeb798860f097ce042e6015/cmd/openshift-tests/openshift-tests.go#L271

Comment 18 David Eads 2019-05-10 13:28:51 UTC
I don't see a bug.

Have a look at https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/

1. Expanding the failing test from https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/#operator-run-template-e2e-aws-upgrade---e2e-aws-upgrade-container-test we find 

   At 2019-05-10 11:51:02 +0000 UTC - event for foo-vk4rp: {default-scheduler } Scheduled: Successfully assigned e2e-tests-sig-apps-job-upgrade-9ntz9/foo-vk4rp to ip-10-0-147-189.ec2.internal

   This gives us a pod and a node to look at.

2. Open the worker kubelet logs here: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/artifacts/e2e-aws-upgrade/nodes/workers-journal

   Filter by our pod and find multiple messages about Status for pod "foo-vk4rp_e2e-tests-sig-apps-job-upgrade-9ntz9(e2b85800-7319-11e9-a78d-1293b1f7ea52)" updated successfully:

3. Expand those statuses and dig into them and you find transitions from pending to running to pending with no restart count increasing.  ImageIDs and ContainerIDs suddenly disappear.

4. At the same time, the kubelet and, I'm guessing, the node (based on events like "message": "Updated machine ci-op-9mgil1kz-77109-s2h7v-worker-us-east-1a-lqk7t") are restarted.

5. The restart stops the pod, which doesn't get restarted before the test does a single check for running pods.


Restarting all the nodes is disruptive and will cause tests like this to fail.  We knew this.  Rolling updates of nodes cause disruption to workloads.  Why are we surprised that this test flakes when its single check for the "right" number of running pods, taken while the cluster is restarting, fails?

I don't see why this should be a blocker.  The test could be improved to wait for pods to come back to life after nodes restart, but I wouldn't add the risk into the release at this point.

Comment 19 W. Trevor King 2019-05-10 23:11:18 UTC
> The restart stops the pod, which doesn't get restarted before the test does a single check for running pods.

[1] landed upstream, and it would help turn this case up by reporting the expected pods as pending or starting instead of the current "<bool>: false".

> Rolling updates of nodes cause disruption to workloads.

Some workloads will allow for this sort of disruption; others will not.  Maybe job pods have no way to configure a rolling update strategy with max unavailable, etc.?

> I don't see why this should be a blocker.  The test could be improved to wait for pods to come back to life after nodes restart, but I wouldn't add the risk into the release at this point.

Yeah, if we can't configure the job to always keep some fraction of Pods running.  Or we relax the check to also count Pending, etc. statuses as acceptable.  If you don't consider this test a blocker, what do you see as the risk to adjusting the test logic?  And now that we're targeting 4.2 here, there's a bit more breathing room ;).

[1]: https://github.com/kubernetes/kubernetes/pull/77716
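
As a concrete version of the "relax the check" option, a hedged sketch that counts Pending pods (likely rescheduling after a node reboot) as acceptable alongside Running ones; the label selector and expected count are assumptions for illustration, not the actual test logic:

    // Sketch of the "relax the check" idea discussed above: instead of a single
    // check for Running pods, count Running and Pending pods for the job, on the
    // theory that Pending during an upgrade usually means "rescheduling after a
    // node reboot" rather than failure. Label selector and counts are assumptions.
    package example

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    func jobPodsHealthy(c kubernetes.Interface, ns string, expected int) error {
        pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
            LabelSelector: "job-name=foo", // assumed label; the job controller sets job-name on its pods
        })
        if err != nil {
            return err
        }
        active := 0
        for _, p := range pods.Items {
            if p.Status.Phase == corev1.PodRunning || p.Status.Phase == corev1.PodPending {
                active++
            }
        }
        if active < expected {
            return fmt.Errorf("only %d of %d job pods are running or pending", active, expected)
        }
        return nil
    }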

Comment 20 Clayton Coleman 2019-05-22 15:03:28 UTC
The root cause of this bug originally was that workers were still restarting while the upgrade test "post phase" was running.  That was fixed by the machine config team.

The current set of 4.2 upgrade flakes started happening after the rebase.

Comment 21 W. Trevor King 2019-06-13 20:34:53 UTC
For future reference, "wait for the compute pools" was https://github.com/openshift/origin/pull/22835 in 4.2/master and https://github.com/openshift/origin/pull/22843 in 4.1.
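
For anyone reading along, "wait for the compute pools" amounts to polling the worker MachineConfigPool until its updated/ready machine counts match the total. A hedged sketch with the dynamic client (field names as I understand the MachineConfigPool status; not the exact code in those PRs):

    // Sketch of "wait for the compute pools": poll the worker MachineConfigPool
    // until every machine is updated and ready. Uses the dynamic client; field
    // names reflect the MachineConfigPool status as I understand it and are not
    // necessarily the exact logic of the PRs referenced above.
    package example

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/dynamic"
    )

    var mcpGVR = schema.GroupVersionResource{
        Group:    "machineconfiguration.openshift.io",
        Version:  "v1",
        Resource: "machineconfigpools",
    }

    func waitForWorkerPoolUpdated(dc dynamic.Interface) error {
        return wait.PollImmediate(10*time.Second, 30*time.Minute, func() (bool, error) {
            pool, err := dc.Resource(mcpGVR).Get(context.TODO(), "worker", metav1.GetOptions{})
            if err != nil {
                return false, nil // tolerate transient API errors during the upgrade
            }
            total, _, _ := unstructured.NestedInt64(pool.Object, "status", "machineCount")
            updated, _, _ := unstructured.NestedInt64(pool.Object, "status", "updatedMachineCount")
            ready, _, _ := unstructured.NestedInt64(pool.Object, "status", "readyMachineCount")
            return total > 0 && updated == total && ready == total, nil
        })
    }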

Comment 22 Colin Walters 2019-06-13 21:23:19 UTC
Offhand it looks like this may be the same race that https://github.com/openshift/machine-config-operator/pull/773 fixed.  Though I think the `Generation` handling is also supposed to prevent this.

Comment 23 Colin Walters 2019-06-14 15:00:41 UTC
Moving back to 4.1.z so we can target

https://github.com/openshift/machine-config-operator/pull/856

Comment 29 errata-xmlrpc 2019-07-04 09:01:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1635

Comment 30 Maciej Szulik 2019-08-16 13:37:27 UTC
*** Bug 1726315 has been marked as a duplicate of this bug. ***