Bug 1721583

Summary:	[upgrade] Cluster upgrade should maintain a functioning cluster: replicaset "rs" never became ready
Product:	OpenShift Container Platform	Reporter:	Antonio Murdaca <amurdaca>
Component:	openshift-apiserver	Assignee:	Stefan Schimanski <sttts>
Status:	CLOSED ERRATA	QA Contact:	Xingxing Xia <xxia>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.2.0	CC:	aos-bugs, bbennett, calfonso, ccoleman, deads, gblomqui, jokerman, kgarriso, lserven, maszulik, mawong, mfojtik, miabbott, mifiedle, mmccomas, mnguyen, mstaeble, pcameron, rkrawitz, schoudha, sponnaga, walters, wking
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1706082	Environment:
Last Closed:	2019-10-16 06:32:02 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1706082
Bug Blocks:

Description Antonio Murdaca 2019-06-18 15:23:12 UTC

+++ This bug was initially created as a clone of Bug #1706082 +++

Description of problem: In e2e-aws-upgrade:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:83]: during upgrade
Expected error:
    <*errors.errorString | 0xc0013a5af0>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition",
    }
    Cluster did not complete upgrade: timed out waiting for the condition
not to have occurred

OR

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/replicasets.go:85]: Expected error:
    <*errors.errorString | 0xc00335a5e0>: {
        s: "replicaset \"rs\" never became ready",
    }
    replicaset "rs" never became ready
not to have occurred

seen 23x in last day out of 32x total in last week.

https://search.svc.ci.openshift.org/?search=replicaset+%22rs%22+never+became+ready&maxAge=168h&context=2&type=all

$ ./deck-build-log-grep "Unexpected error occurred: replicaset \"rs\" never became ready"
Unexpected error occurred: replicaset "rs" never became ready matches 23 of 340 failures

examples:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1117#openshift-tests-disruptive-cluster-upgrade-should-maintain-a-functioning-cluster-featureclusterupgrade-suiteopenshift-serial
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1101#openshift-tests-disruptive-cluster-upgrade-should-maintain-a-functioning-cluster-featureclusterupgrade-suiteopenshift-serial
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1094#openshift-tests-disruptive-cluster-upgrade-should-maintain-a-functioning-cluster-featureclusterupgrade-suiteopenshift-serial

--- Additional comment from W. Trevor King on 2019-05-03 18:05:29 UTC ---

We had a bunch of these yesterday, but they mostly stopped at 2019-05-02T20:09 [1] with only two since [2,3] based on [4].  Assigning to Master, since Clayton pointed at them for this yesterday ;).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1101
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1117
[3]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1136
[4]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=e2e-aws-upgrade&search=replicaset%20.*%20never%20became%20ready

--- Additional comment from W. Trevor King on 2019-05-03 18:08:10 UTC ---

Actually, Clayton was pointing at Cloud for k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58.  So I dunno who this one belongs to.  Master seems like a plausible guess, but feel free to punt if you have better ideas.

--- Additional comment from Greg Blomquist on 2019-05-03 19:24:12 UTC ---

Based on frequency, I'm moving this to 4.2 (non-4.1.0-blocker).  If anyone disagrees, pull it back in.

--- Additional comment from  on 2019-05-09 14:22:06 UTC ---

Seeing his again today: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1402/build-log.txt

--- Additional comment from Clayton Coleman on 2019-05-09 14:46:11 UTC ---

I’m seeing this more, I don’t see how this isn’t a platform bug or a serious blocker for knowing our upgrades work.

We *must* know upgrades work, I expect to see a root cause before we kick it out (we kick out based on cause, not symptom).

--- Additional comment from Clayton Coleman on 2019-05-09 14:47:44 UTC ---

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399.

May be unrelated, but has similar characteristics

--- Additional comment from Maciej Szulik on 2019-05-09 16:31:58 UTC ---

I've checked the last 4 failures and here are my findings:


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1136

From events:

May 03 11:31:59.038 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Available message changed from "Available: v1.apps.openshift.io is not ready: 503\nAvailable: v1.build.openshift.io is not ready: 503\nAvailable: v1.oauth.openshift.io is not ready: 503\nAvailable: v1.route.openshift.io is not ready: 503\nAvailable: v1.security.openshift.io is not ready: 503" to "Available: v1.authorization.openshift.io is not ready: 503\nAvailable: v1.user.openshift.io is not ready: 503"

Looking inside openshift-apiserver pods, ALL 3 of them contain:

W0503 11:30:43.724291       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd.openshift-etcd.svc:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.30.23.218:2379: connect: connection refused". Reconnect
ing...                                                                                                                                                                                                                                                                           W0503 11:30:43.728351       1 clientconn.go:1440] grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1117

From events:

04:33:26 (11979) "" MCDBootstrapSyncFailure pending config rendered-master-15fff6d40908cf99c46b23dbcb2a5ebd bootID 6f502d19-5a7e-44da-a3f1-cb0693a93833 matches current! Failed to reboot?

I haven't dug deeper but the number of events 12k (sic!) is more than worrying.


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399

From events:
13:25:23 (2) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "" to "StaticPodsDegraded: nodes/ip-10-0-161-216.ec2.internal pods/kube-apiserver-ip-10-0-161-216.ec2.internal container=""kube-apiserver-7"" is not ready

In 2 kube-apiserver pods' logs I see:

W0509 11:29:19.745173       1 clientconn.go:953] Failed to dial etcd-2.ci-op-ckldis1m-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.


https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399

From events:

15:04:20 (6) "openshift-controller-manager" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
15:04:20 (25) "openshift-machine-config-operator" FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.
15:04:20 (5) "openshift-kube-controller-manager" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
15:04:20 (6) "openshift-dns" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
15:04:20 (5) "openshift-kube-apiserver" NetworkNotReady network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]


There's no single failure that repeats, but problems reaching nodes or etcd in the cluster suggests some networking issues. I'd like the networking
team to double check the logs for their components.

--- Additional comment from Phil Cameron on 2019-05-09 17:45:39 UTC ---

API server and etcd are static pods that use host network. Cluster Network can't start until those are up. 

From the message:
FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.

Things are broken enough so that Networking can't get started.

--- Additional comment from David Eads on 2019-05-09 17:48:51 UTC ---

Opened https://github.com/openshift/origin/pull/22809 to gather more logs, but based on logs I've looked at in the past, I suspect we're successfully creating the pods and they aren't running as expected.  I think this ticked up when we started creating pods more quickly.

--- Additional comment from David Eads on 2019-05-09 17:52:15 UTC ---

@phil, trying to parse your message.

I agree that networking won't run until the kube-apiserver and etcd are up, but the fact that you're able to see the nodes is a clear indication that both are up and working.  It's also an indication that the scheduler is running and if you're chasing your correct pod (you only get scheduling messages when trying to schedule pods), the kubecontroller-manager is up.  That means that

1. etcd is up and working
2. kube-apiserver is up and working
3. kube-controller-manager is up and working (or your pod is already created somehow)
4. kube-scheduler is up and working

I think that having those working should be enough for the network to be functional.  What more do you think is required and what piece do you see as missing?

--- Additional comment from Dan Williams on 2019-05-09 18:12:23 UTC ---

(In reply to Maciej Szulik from comment #7)
> I've checked the last 4 failures and here are my findings:
> 
> 
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-
> openshift-origin-installer-e2e-aws-upgrade/1136
> 
> From events:

Do you mean events.json here? Or where are you getting the events from?

--- Additional comment from Phil Cameron on 2019-05-09 18:48:07 UTC ---

@David, comment 10. The pods are being prevented from starting.

FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.

In so far as the daemonset yaml in the cno is fixed (doesn't change), there is something that is occasionally preventing the pods from starting. I would think that by the time cno runs the nodes would let daemonsets start pods. Feels like we don't have 6 running but "Not Ready" nodes at this point. Could this be a race where cno starts too soon?

I am way out of my depth here. I am not sure how the cluster bring up sequences itself into a running cluster.

--- Additional comment from Robert Krawitz on 2019-05-09 19:20:46 UTC ---

@pcameron @maszulik:

FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) were unschedulable.

is a normal message from the etcd-quorum-guard when a master isn't up (which it won't be between when it starts to drain and hasn't fully started).  This is because of the quorum guard's deployment that wants 3 replicas, all on master nodes, and all on different nodes.

However, if it says *2* nodes are unschedulable (as seen), that's a very different matter -- that means that one node was cordoned while another node was still not schedulable.

From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1399/artifacts/e2e-aws-upgrade/e2e.log we see that that is indeed somehow happening (I've annotated it with line number):

    4433 May 09 11:14:56.583 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-86767b79b5-6tt7d 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector. (12 times)
    4434 May 09 11:14:56.696 I node/ip-10-0-145-55.ec2.internal Node ip-10-0-145-55.ec2.internal status is now: NodeNotSchedulable
    4435 May 09 11:14:56.962 I node/ip-10-0-145-55.ec2.internal Node will reboot into config rendered-master-8dd27e0144c911a471964aca19dc5652
    4436 May 09 11:14:56.999 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Pulling image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:81eed66b4359d4a3015a3b0e92782afd6ef410e47ad7e68ed65c9dde54aa5b57"
    4437 May 09 11:14:57.113 I ns/openshift-service-ca pod/configmap-cabundle-injector-78f77bd9c9-d7q9x Container image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:8b0c9a1ad0a500f886b716bf85c8627c69376502f91915adb83f06b3ce9e9443" already present on machine
    4438 May 09 11:14:57.207 I ns/openshift-machine-config-operator pod/machine-config-operator-59b6486f69-xnbw2 Pulling image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:7e7ca4129d0909c97d3b4bd08756eb7088cfd86e9a28c56b70f3a6858830a63f"
    4439 May 09 11:14:57.331 I ns/openshift-service-ca pod/configmap-cabundle-injector-78f77bd9c9-d7q9x Created container configmap-cabundle-injector-controller
    4440 May 09 11:14:57.356 I ns/openshift-service-ca pod/configmap-cabundle-injector-78f77bd9c9-d7q9x Started container configmap-cabundle-injector-controller
    4441 May 09 11:14:57.918 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Successfully pulled image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:81eed66b4359d4a3015a3b0e92782afd6ef410e47ad7e68ed65c9dde54aa5b57"
    4442 May 09 11:14:58.203 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Created container dns-operator
    4443 May 09 11:14:58.284 I ns/openshift-dns-operator pod/dns-operator-6487459cdc-9t89p Started container dns-operator
    4444 May 09 11:14:58.935 I node/ip-10-0-162-146.ec2.internal Written pending config rendered-worker-5c17190a4fe3174649e5b19e539d750e
    4445 May 09 11:14:58.945 I node/ip-10-0-162-146.ec2.internal Draining node to update config.
    4446 May 09 11:14:58.970 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-86767b79b5-6tt7d 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.

Whether directly related or not, it does look like a second node (10.0.162.146) is being taken down before the first node to be upgraded (10.0.145.55) is fully ready.  It's only much later that 10.0.145.55 is schedulable:

    5077 May 09 11:17:51.017 W ns/openshift-dns pod/dns-default-lp7qd node/ip-10-0-145-55.ec2.internal container=dns-node-resolver container restarted
    5078 May 09 11:17:51.017 W ns/openshift-dns pod/dns-default-lp7qd node/ip-10-0-145-55.ec2.internal container=dns container restarted
    5079 May 09 11:17:51.041 I ns/openshift-apiserver pod/apiserver-f4xwn Container image "registry.svc.ci.openshift.org/ocp/4.1-2019-05-09-101018@sha256:45b6b845da1e8ec069ed206f2aadc752b27c5c2aea4bf7a4211d71b27bdc248d" already present on machine
    5080 May 09 11:17:51.236 I ns/openshift-apiserver pod/apiserver-f4xwn Created container openshift-apiserver
    5081 May 09 11:17:51.444 I ns/openshift-apiserver pod/apiserver-f4xwn Started container openshift-apiserver
    5082 May 09 11:17:51.635 I node/ip-10-0-145-55.ec2.internal Node ip-10-0-145-55.ec2.internal status is now: NodeSchedulable

However, 10.0.162.146 becomes ready much more quickly:

    4921 May 09 11:17:37.122 I node/ip-10-0-162-146.ec2.internal Node ip-10-0-162-146.ec2.internal status is now: NodeReady
    4922 May 09 11:17:37.126 I node/ip-10-0-162-146.ec2.internal Node ip-10-0-162-146.ec2.internal status is now: NodeSchedulable

--- Additional comment from Robert Krawitz on 2019-05-09 19:42:11 UTC ---

Belay that last...10.0.162.146 was a worker, not a master.

--- Additional comment from Casey Callendrello on 2019-05-09 19:42:46 UTC ---

Nothing jumps out as obvious network failures.

- The GRPC goaways seem to indicate that etcd is shutting down, no? In any case, the sdn wouldn't have much to do with this; it's host networking
- The nodes being NotReady are 100% expected - they will be NotReady until the SDN has re-initialized after a reboot, after which the scheduled pod will start.

It is possible that we're rolling out reboots to quickly. Does the mco wait for the node to be ready before proceeding with the next reboot?

What exactly is causing the tests to fail? There seem to be a lot of unexplained error conditions that are mostly transient.

--- Additional comment from Maciej Szulik on 2019-05-10 12:19:47 UTC ---



--- Additional comment from Maciej Szulik on 2019-05-10 12:47:06 UTC ---

Last failure after David merged https://github.com/openshift/origin/pull/22809 happened 2019-05-10 12:44 CEST took 1h17m 
in https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/:

from looking at the dump I don't see job controller creating 5 pods in total to fulfil the job, but then upon failure in the 
dumped pods there ZERO running pods for this job. I don't see any events that might point to anything unusual, but I need to 
check controller-manager logs for job-controller logs since there might be a chance that the backoff policy kicked in and
we created pods slower when the job is failing often enough. Also I don't see any node logs, since we have 
exutil.TestContext.MaxNodesToGather = 0 configured in https://github.com/openshift/origin/blob/66bcaddbb4264915cfeb798860f097ce042e6015/cmd/openshift-tests/openshift-tests.go#L271

--- Additional comment from David Eads on 2019-05-10 13:28:51 UTC ---

I don't see a bug.

Have a look at https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/

1. Expanding the failing test from https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/#operator-run-template-e2e-aws-upgrade---e2e-aws-upgrade-container-test we find 

   At 2019-05-10 11:51:02 +0000 UTC - event for foo-vk4rp: {default-scheduler } Scheduled: Successfully assigned e2e-tests-sig-apps-job-upgrade-9ntz9/foo-vk4rp to ip-10-0-147-189.ec2.internal

   This gives us a pod and a node to look at.

2. Open the worker kubelet logs here: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1445/artifacts/e2e-aws-upgrade/nodes/workers-journal

   Filter by our pod and find multiple messages about Status for pod "foo-vk4rp_e2e-tests-sig-apps-job-upgrade-9ntz9(e2b85800-7319-11e9-a78d-1293b1f7ea52)" updated successfully:

3. Expand those statuses and dig into them and you find transitions from pending to running to pending with no restart count increasing.  ImageIDs and ContainerIDs suddenly disappear.

4. At the same time, the kubelet and I'm guessing the node (based on events like:             "message": "Updated machine ci-op-9mgil1kz-77109-s2h7v-worker-us-east-1a-lqk7t",) are restarted.

5. The restart stops the pod, which doesn't get restarted before the test does a single check for running pods.


Restarting all the nodes is disruptive and will cause tests like this to fail.  We knew this.  Rolling updates of nodes cause disruption to workloads.  Why are we surprised that this test flakes if it's single check for the "right" number of running pods while restarting a cluster fails?

I don't see why this should be a blocker.  The test could be improved to wait for pods to come back to life after nodes restart, but I wouldn't add the risk into the release at this point.

--- Additional comment from W. Trevor King on 2019-05-10 23:11:18 UTC ---

> The restart stops the pod, which doesn't get restarted before the test does a single check for running pods.

[1] landed upstream, and it would help turn this case up by reporting the expected pods as pending or starting instead of the current "<bool>: false".

> Rolling updates of nodes cause disruption to workloads.

Some workloads will allow for this sort of disruption; others will not.  Maybe job pods have no way to configure a rolling update strategy with max unavailable, etc.?

> I don't see why this should be a blocker.  The test could be improved to wait for pods to come back to life after nodes restart, but I wouldn't add the risk into the release at this point.

Yeah, if we can't configure the job to always keep some fraction of Pods running.  Or we relax the check to also count Pending, etc. statuses as acceptable.  If you don't consider this test a blocker, what do you see as the risk to adjusting the test logic?  And now that we're targeting 4.2 here, there's a bit more breathing room ;).

[1]: https://github.com/kubernetes/kubernetes/pull/77716

--- Additional comment from Clayton Coleman on 2019-05-22 15:03:28 UTC ---

The root cause of this bug originally was that workers were still restarting while the upgrade test "post phase" was running.  That was fixed by the machine config team.

The current set of 4.2 upgrade flakes started happening after the rebase.

--- Additional comment from W. Trevor King on 2019-06-13 20:34:53 UTC ---

For future reference, "wait for the compute pools" was https://github.com/openshift/origin/pull/22835 in 4.2/master and https://github.com/openshift/origin/pull/22843 in 4.1.

--- Additional comment from Colin Walters on 2019-06-13 21:23:19 UTC ---

Offhand it looks like this may be the same race that https://github.com/openshift/machine-config-operator/pull/773 fixed.  Though I think the `Generation` handling is also supposed to prevent this.

--- Additional comment from Colin Walters on 2019-06-14 15:00:41 UTC ---

Moving back to 4.1.z so we can target

https://github.com/openshift/machine-config-operator/pull/856

Comment 1 Antonio Murdaca 2019-06-18 15:23:45 UTC

master/4.2 PR: https://github.com/openshift/machine-config-operator/pull/773

Comment 3 Michael Nguyen 2019-07-01 18:57:14 UTC

Verified on 4.2.0-0.nightly-2019-06-27-041730
Verified presence of status.Configuration and spec.Configuration
Verified upgrade to 4.2.0-0.ci-2019-07-01-153500
Verified machine counts in upgrade process

Comment 4 Matthew Staebler 2019-07-18 21:41:05 UTC

Saw this again: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4189

Comment 5 Antonio Murdaca 2019-08-22 10:48:47 UTC

This really looks like an issue with Deployments - the MCC is ready and steady according to logs, master logs don't show anything wrong and the MCO just reads from the Deployment object for machine-config-controller and despite the pod being there and running, it reports that there's an unavailable replica. It doesn't look like it's MCO related. Looking around there seem to be also a bunch of route issues from other clusteroperators:

message: "Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s)",

message: "Available: "route.openshift.io.v1" is not ready: 0 (Get https://172.30.0.1:443/apis/route.openshift.io/v1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)) Available: "security.openshift.io.v1" is not ready: 0 (Get https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)) Available: "user.openshift.io.v1" is not ready: 0 (Get https://172.30.0.1:443/apis/user.openshift.io/v1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers))",

message: "OAuthClientsDegraded: the server was unable to return a response in the time allotted, but may still be processing the request (get oauthclients.oauth.openshift.io openshift-browser-client)",

Not sure who owns routes, I think network? again not sure why the deployment is off despite the pod running here but doesn't look like MCO related.

Comment 6 Ben Bennett 2019-08-23 15:40:43 UTC

The route API is provided by the OpenShift API server, passing it to them to take a look.

Comment 7 Michal Fojtik 2019-08-26 12:29:53 UTC

https://search.svc.ci.openshift.org/?search=never+became+ready&maxAge=336h&context=7&type=all this shows zero evidence this is happening anymore, closing.

Comment 8 W. Trevor King 2019-08-26 19:19:57 UTC

> https://search.svc.ci.openshift.org/?search=never+became+ready&maxAge=336h&context=7&type=all this shows zero evidence this is happening anymore, closing.

The vanilla CI search is leaky; ci-search-next is more reliable.  [1] turns it up in a few places; most recently in a PR two hours ago [2], and most recently in a release-promotion gate 4 days ago [3].  So I'm not sure we've fixed this completely, but the flake rate is certainly low, with that PR failure being the only one of the past 24 hours' 930 -e2e- jobs to see it.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=replicaset%20.*rs.*%20never%20became%20ready
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/516/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/448
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6013

Comment 9 W. Trevor King 2019-10-11 23:05:52 UTC

https://github.com/openshift/origin/pull/23944 landed, which should help with debugging this in 4.2.z

Comment 11 errata-xmlrpc 2019-10-16 06:32:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922