Bug 1868786 - upgrades from 4.5.6 -> 4.6.0-0.nightly failing: Failed to upgrade openshift-apiserver, operator was degraded
Summary: upgrades from 4.5.6 -> 4.6.0-0.nightly failing: Failed to upgrade openshift-apiserver, operator was degraded
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dan Mace
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-13 19:35 UTC by Micah Abbott
Modified: 2020-09-11 18:14 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]
Last Closed: 2020-09-11 18:14:22 UTC
Target Upstream Version:
Embargoed:



Description Micah Abbott 2020-08-13 19:35:58 UTC
Upgrades from 4.5.6 have been red for the last 24 hours; one repeatable signal I have observed is trouble upgrading the openshift-apiserver. I'm not sure whether this is the root cause of the overall upgrade failure, but I figured I should record it here as a starting point.


https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1293686718314057728

```
Aug 13 01:07:18.730 - 165s  W ns/openshift-oauth-apiserver pod/apiserver-745664ffc-wc776 node/ pod has been pending longer than a minute
Aug 13 01:07:18.951 I ns/openshift-apiserver pod/apiserver-5956588477-7m9nc node/ip-10-0-153-182.us-west-1.compute.internal container/openshift-apiserver reason/Ready
Aug 13 01:07:19.033 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/EndpointCheckCreated Created PodNetworkConnectivityCheck.controlplane.operator.openshift.io/apiserver-ip-10-0-153-182.us-west-1.compute.internal-to-kubernetes-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-apiserver because it was missing.
Aug 13 01:07:19.048 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/EndpointCheckUpdated Updated PodNetworkConnectivityCheck.controlplane.operator.openshift.io/apiserver-ip-10-0-153-182.us-west-1.compute.internal-to-kubernetes-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-apiserver because it changed.
Aug 13 01:07:19.505 I ns/openshift-multus pod/multus-l9rbh node/ip-10-0-153-182.us-west-1.compute.internal container/kube-multus reason/Ready
Aug 13 01:07:19.576 W clusteroperator/network changed Progressing to False
Aug 13 01:07:19.809 I ns/openshift-apiserver deployment/apiserver reason/ConnectivityRestored Connectivity restored after 9m41.734890364s: storage-endpoint-ip-10-0-169-229.us-west-1.compute.internal: tcp connection to 10.0.169.229:2379 succeeded
Aug 13 01:07:20.668 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "StaticPodsDegraded: pod/kube-apiserver-ip-10-0-153-182.us-west-1.compute.internal container \"kube-apiserver-check-endpoints\" is not ready: unknown reason\nNodeControllerDegraded: All master nodes are ready" to "NodeControllerDegraded: All master nodes are ready"
Aug 13 01:07:21.483 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/EndpointCheckUpdated Updated PodNetworkConnectivityCheck.controlplane.operator.openshift.io/kube-apiserver-ip-10-0-153-182.us-west-1.compute.internal-to-openshift-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-kube-apiserver because it changed.
Aug 13 01:07:21.491 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/EndpointCheckCreated Created PodNetworkConnectivityCheck.controlplane.operator.openshift.io/kube-apiserver-ip-10-0-153-182.us-west-1.compute.internal-to-openshift-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-kube-apiserver because it was missing.
Aug 13 01:07:22.844 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/EndpointCheckCreated Created PodNetworkConnectivityCheck.controlplane.operator.openshift.io/kube-apiserver-ip-10-0-169-229.us-west-1.compute.internal-to-openshift-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-kube-apiserver because it was missing.
Aug 13 01:07:22.851 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/EndpointCheckUpdated Updated PodNetworkConnectivityCheck.controlplane.operator.openshift.io/kube-apiserver-ip-10-0-169-229.us-west-1.compute.internal-to-openshift-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-kube-apiserver because it changed.
Aug 13 01:07:23.975 I ns/openshift-kube-apiserver node/ip-10-0-153-182.us-west-1.compute.internal reason/ConnectivityRestored Connectivity restored after 7m4.908507492s: openshift-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal: tcp connection to 10.128.0.10:8443 succeeded
Aug 13 01:07:24.850 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/EndpointCheckUpdated Updated PodNetworkConnectivityCheck.controlplane.operator.openshift.io/kube-apiserver-ip-10-0-204-179.us-west-1.compute.internal-to-openshift-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-kube-apiserver because it changed.
Aug 13 01:07:24.859 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/EndpointCheckCreated Created PodNetworkConnectivityCheck.controlplane.operator.openshift.io/kube-apiserver-ip-10-0-204-179.us-west-1.compute.internal-to-openshift-apiserver-endpoint-ip-10-0-153-182.us-west-1.compute.internal -n openshift-kube-apiserver because it was missing.
Aug 13 01:07:26.160 W ns/openshift-kube-apiserver node/ip-10-0-153-182.us-west-1.compute.internal reason/ConnectivityOutageDetected Connectivity outage detected: openshift-apiserver-endpoint-ip-10-0-204-179.us-west-1.compute.internal: failed to establish a TCP connection to 10.130.0.61:8443: dial tcp 10.130.0.61:8443: i/o timeout
Aug 13 01:07:26.232 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/OperatorStatusChanged Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable for apiserver.openshift-apiserver" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver" (2 times)
```
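
The first excerpt is dominated by PodNetworkConnectivityCheck create/update events plus ConnectivityOutageDetected/ConnectivityRestored transitions. For anyone reproducing this outside CI, here is a minimal sketch of pulling those check objects straight from the cluster, assuming kubeconfig access and that the CRD is served at controlplane.operator.openshift.io/v1alpha1 (which is what the event names above suggest); status fields are read generically through the dynamic client rather than typed structs:

```go
// Sketch: list PodNetworkConnectivityCheck objects in openshift-apiserver and
// print their status conditions (assumes ~/.kube/config points at the cluster).
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Group/version taken from the event messages above; treat it as an assumption.
	gvr := schema.GroupVersionResource{
		Group:    "controlplane.operator.openshift.io",
		Version:  "v1alpha1",
		Resource: "podnetworkconnectivitychecks",
	}
	checks, err := dyn.Resource(gvr).Namespace("openshift-apiserver").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range checks.Items {
		conds, _, _ := unstructured.NestedSlice(c.Object, "status", "conditions")
		fmt.Printf("%s: %v\n", c.GetName(), conds)
	}
}
```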



https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1293840616014221312

```
Aug 13 11:16:51.047 - 30s   W node/ip-10-0-171-21.us-east-2.compute.internal node is not ready
Aug 13 11:16:52.056 W ns/openshift-apiserver pod/apiserver-6cd8df84c7-m2cz7 reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. (16 times)
Aug 13 11:16:52.065 W ns/openshift-oauth-apiserver pod/apiserver-7cb45c44fc-kzhlb reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. (15 times)
Aug 13 11:16:52.072 W ns/openshift-etcd pod/etcd-quorum-guard-6d58b447d6-fqflg reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. (10 times)
Aug 13 11:16:53.039 I ns/openshift-apiserver deployment/apiserver reason/ConnectivityRestored Connectivity restored after 15.994100686s: load-balancer-api-internal: tcp connection to api-int.ci-op-7xrvlib5-77109.origin-ci-int-aws.dev.rhcloud.com:6443 succeeded
Aug 13 11:16:53.056 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-65bfd89668-5crfb reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. (11 times)
Aug 13 11:16:55.722 I node/ip-10-0-241-144.us-east-2.compute.internal reason/PendingConfig Written pending config rendered-master-3a7bb08a0b0e56f2c27ce05cd969cc81
Aug 13 11:16:55.728 I node/ip-10-0-241-144.us-east-2.compute.internal reason/Reboot Node will reboot into config rendered-master-3a7bb08a0b0e56f2c27ce05cd969cc81
Aug 13 11:16:58.689 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: storage-endpoint-ip-10-0-241-144.us-east-2.compute.internal: failed to establish a TCP connection to 10.0.241.144:2379: dial tcp 10.0.241.144:2379: connect: connection refused
Aug 13 11:16:58.795 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-endpoint-ip-10-0-241-144.us-east-2.compute.internal: failed to establish a TCP connection to 10.0.241.144:6443: dial tcp 10.0.241.144:6443: connect: connection refused
Aug 13 11:17:00.369 W ns/openshift-kube-apiserver node/ip-10-0-137-142.us-east-2.compute.internal reason/ConnectivityOutageDetected Connectivity outage detected: etcd-server-ip-10-0-241-144.us-east-2.compute.internal: failed to establish a TCP connection to 10.0.241.144:2379: dial tcp 10.0.241.144:2379: connect: connection refused
Aug 13 11:17:01.028 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-endpoint-ip-10-0-241-144.us-east-2.compute.internal: failed to establish a TCP connection to 10.0.241.144:6443: dial tcp 10.0.241.144:6443: connect: connection refused
Aug 13 11:17:01.028 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-service-cluster: failed to establish a TCP connection to 172.30.91.176:443: dial tcp 172.30.91.176:443: connect: connection refused (11 times)
Aug 13 11:17:01.071 I ns/openshift-apiserver deployment/apiserver reason/ConnectivityRestored Connectivity restored after 1.011240654s: kubernetes-apiserver-service-cluster: tcp connection to 172.30.91.176:443 succeeded
Aug 13 11:17:01.078 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-service-cluster: failed to establish a TCP connection to 172.30.91.176:443: dial tcp 172.30.91.176:443: connect: connection refused (12 times)
Aug 13 11:17:01.210 W ns/openshift-kube-apiserver node/ip-10-0-178-187.us-east-2.compute.internal reason/ConnectivityOutageDetected Connectivity outage detected: etcd-server-ip-10-0-241-144.us-east-2.compute.internal: failed to establish a TCP connection to 10.0.241.144:2379: dial tcp 10.0.241.144:2379: connect: connection refused
Aug 13 11:17:01.670 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/OperatorStatusChanged Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver" to "EncryptionStateControllerDegraded: Get \"https://172.30.0.1:443/api/v1/namespaces/openshift-apiserver/pods?labelSelector=apiserver%3Dtrue\": dial tcp 172.30.0.1:443: connect: connection refused\nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver"
Aug 13 11:17:01.890 W clusteroperator/authentication changed Available to False: APIServerDeployment_NoDeployment: APIServerDeploymentAvailable: deployment/openshift-oauth-apiserver: could not be retrieved
Aug 13 11:17:01.890 W clusteroperator/authentication changed Progressing to True: APIServerDeployment_NoDeployment: APIServerDeploymentProgressing: deployment/openshift-oauth-apiserver: could not be retrieved
Aug 13 11:17:01.918 I ns/openshift-authentication-operator deployment/authentication-operator reason/OperatorStatusChanged Status for clusteroperator/authentication changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver" to "APIServerDeploymentDegraded: deployment/openshift-oauth-apiserver: could not be retrieved\nAPIServerWorkloadDegraded: \"deployments\": invalid dependency reference: \"Get \\\"https://172.30.0.1:443/api/v1/namespaces/openshift-oauth-apiserver/secrets/etcd-client\\\": dial tcp 172.30.0.1:443: connect: connection refused\"\nAPIServerWorkloadDegraded: ",Progressing changed from False to True ("APIServerDeploymentProgressing: deployment/openshift-oauth-apiserver: could not be retrieved"),Available changed from True to False ("APIServerDeploymentAvailable: deployment/openshift-oauth-apiserver: could not be retrieved")
Aug 13 11:17:01.957 I ns/openshift-authentication-operator deployment/authentication-operator reason/OperatorStatusChanged Status for clusteroperator/authentication changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver" to "APIServerDeploymentDegraded: deployment/openshift-oauth-apiserver: could not be retrieved\nAPIServerWorkloadDegraded: \"deployments\": invalid dependency reference: \"Get \\\"https://172.30.0.1:443/api/v1/namespaces/openshift-oauth-apiserver/secrets/etcd-client\\\": dial tcp 172.30.0.1:443: connect: connection refused\"\nAPIServerWorkloadDegraded: ",Progressing changed from False to True ("APIServerDeploymentProgressing: deployment/openshift-oauth-apiserver: could not be retrieved"),Available changed from True to False ("APIServerDeploymentAvailable: deployment/openshift-oauth-apiserver: could not be retrieved") (2 times)
Aug 13 11:17:02.019 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-default-service-cluster: failed to establish a TCP connection to 172.30.0.1:443: dial tcp 172.30.0.1:443: connect: connection refused
Aug 13 11:17:02.053 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: storage-endpoint-ip-10-0-241-144.us-east-2.compute.internal: failed to establish a TCP connection to 10.0.241.144:2379: dial tcp 10.0.241.144:2379: connect: connection refused
Aug 13 11:17:02.068 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-service-cluster: failed to establish a TCP connection to 172.30.91.176:443: dial tcp 172.30.91.176:443: connect: connection refused
Aug 13 11:17:02.078 I ns/openshift-apiserver deployment/apiserver reason/ConnectivityRestored Connectivity restored after 999.123435ms: kubernetes-apiserver-service-cluster: tcp connection to 172.30.91.176:443 succeeded
Aug 13 11:17:02.086 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-service-cluster: failed to establish a TCP connection to 172.30.91.176:443: dial tcp 172.30.91.176:443: connect: connection refused (2 times)
Aug 13 11:17:02.092 I ns/openshift-apiserver deployment/apiserver reason/ConnectivityRestored Connectivity restored after 999.966789ms: kubernetes-apiserver-service-cluster: tcp connection to 172.30.91.176:443 succeeded
Aug 13 11:17:02.165 I ns/openshift-kube-apiserver configmap/cert-regeneration-controller-lock reason/LeaderElection ip-10-0-178-187_6650a731-0909-4730-92fd-967e1e6e62f3 became leader
Aug 13 11:17:02.448 I ns/openshift-etcd pod/revision-pruner-4-ip-10-0-241-144.us-east-2.compute.internal node/ip-10-0-241-144.us-east-2.compute.internal reason/Created
Aug 13 11:17:02.456 I ns/openshift-etcd-operator deployment/etcd-operator reason/PodCreated Created Pod/revision-pruner-4-ip-10-0-241-144.us-east-2.compute.internal -n openshift-etcd because it was missing (2 times)
Aug 13 11:17:02.776 W ns/openshift-authentication-operator deployment/authentication-operator reason/AuthenticationUpdateStatusFailed Failed to update authentication operator status: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again (9 times)
Aug 13 11:17:03.235 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/OperatorStatusChanged Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "EncryptionStateControllerDegraded: Get \"https://172.30.0.1:443/api/v1/namespaces/openshift-apiserver/pods?labelSelector=apiserver%3Dtrue\": dial tcp 172.30.0.1:443: connect: connection refused\nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver"
```
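
Both excerpts end with Degraded message changes on clusteroperator/openshift-apiserver and clusteroperator/authentication. As a minimal sketch of surfacing that same signal outside CI, the following lists any Degraded=True conditions via the dynamic client against config.openshift.io/v1; kubeconfig access is assumed and the filtering is illustrative, not part of the original report:

```go
// Sketch: print Degraded=True conditions for all clusteroperators
// (assumes ~/.kube/config points at the cluster under test).
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators"}
	operators, err := dyn.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, co := range operators.Items {
		conds, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
		for _, c := range conds {
			cond, ok := c.(map[string]interface{})
			if !ok {
				continue
			}
			if cond["type"] == "Degraded" && cond["status"] == "True" {
				fmt.Printf("%s Degraded: %v\n", co.GetName(), cond["message"])
			}
		}
	}
}
```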

Comment 1 Stefan Schimanski 2020-08-26 08:41:15 UTC
The network connectivity checker shows plenty of outages.

Two of the three openshift-apiserver pods (the ones that are not available) get connection refused from etcd.
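
For what it's worth, the check that the connectivity checker is logging above is just a plain TCP dial. A trivial standalone probe against one of the etcd client endpoints would look like the sketch below; the 10.0.241.144:2379 address is only an example taken from the second excerpt, so substitute a real master IP:

```go
// Sketch: mirror the connectivity checker's TCP dial against an etcd client endpoint.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Example address from the logs above; replace with a real master's etcd endpoint.
	addr := "10.0.241.144:2379"
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Printf("tcp connection to %s failed: %v\n", addr, err)
		return
	}
	defer conn.Close()
	fmt.Printf("tcp connection to %s succeeded\n", addr)
}
```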

Comment 2 Bharathi Tenneti 2020-09-09 13:07:50 UTC
Any update on this? This is blocking verification of https://bugzilla.redhat.com/show_bug.cgi?id=1868750

Comment 3 Dan Mace 2020-09-11 16:30:20 UTC
I'm confused as to how this bug is evidence of any current blocker; it refers to a couple of release promotion failures from mid-August.

As far as I can tell, no CI or nightly promotion jobs in the last day have been rejected due to any upgrade issue resembling this report:

https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.6.0-0.ci
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.6.0-0.nightly

The nearest equivalent AWS periodic jobs are not failing for any apparently related reason:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci

How do the two failures from August support the claim that there's a blocker today? Am I misunderstanding the current state of the promotion or periodic jobs?

FWIW, I took a look at those two old jobs reported in August, and:

1. The apiserver, etcd, and MCO all report a successful rollout and healthy status by the end of the 4.6 upgrade
2. The test seems to time out because of a perceived lack of MCO progress that I couldn't associate with any errors from the MCO, but I didn't dig deeper

So even in a historical context, it's not clear that these failures are of much interest with respect to etcd, except perhaps in the context of ongoing disruption flakes, which are already extensively covered elsewhere.

I'm going to downgrade the severity and priority of this bug pending more detailed information about any perceived ongoing issue, and I intend to close the bug soon if no such details materialize.

In the meantime, I'll continue looking around the release job pipeline to see if there's anything I'm missing.

Comment 4 Micah Abbott 2020-09-11 18:10:38 UTC
Yeah, I was just trying to do my best to report problems while playing build watcher. Looking at the same job (release-openshift-origin-installer-e2e-aws-upgrade) that I linked in the original report, it has been moderately healthy lately.

I'm not aware of any underlying problems with the promotion jobs, and I'm not sure why this was tagged as an UpgradeBlocker without additional details.

Comment 5 Dan Mace 2020-09-11 18:14:22 UTC
Understood, thanks! Going to go ahead and close this one out.

