Bug 1817588

Summary: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable
Product: OpenShift Container Platform
Reporter: Nikolaos Leandros Moraitis <nmoraiti>
Component: openshift-apiserver
Assignee: Luis Sanchez <sanchezl>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Xingxing Xia <xxia>
Severity: high
Docs Contact:
Priority: low
Version: 4.5
CC: aos-bugs, braander, dmace, donny, emoss, fgiudici, hongkliu, jluhrsen, mfojtik, mheon, sanchezl, sdodson, sejug, sttts, wking, wlewis, xxia
Target Milestone: ---
Keywords: Reopened, Upgrades
Target Release: ---
Flags: wking: needinfo?
Hardware: Unspecified
OS: Unspecified
Whiteboard: LifecycleReset tag-ci
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-30 20:05:24 UTC
Type: Bug

Description Nikolaos Leandros Moraitis 2020-03-26 15:55:14 UTC
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]


 Mar 26 15:05:57.233: INFO: Dumping logs locally to: /tmp/artifacts/junit
Mar 26 15:05:57.234: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec ../../cluster/log-dump/log-dump.sh: no such file or directory
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:135]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-03-26-125808
Unexpected error:
    <*errors.errorString | 0xc002ad3440>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
occurred
failed: (1h29m15s) 2020-03-26T15:05:57 "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]"


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177

Comment 1 Venkata Siva Teja Areti 2020-04-06 19:51:36 UTC
Around the time the API server operator reports Degraded, I see these error messages in the openshift-apiserver pods:

> 2020-03-26T14:53:39.107799512Z W0326 14:53:39.107753       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.131.225:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.131.225:2379: connect: connection refused". Reconnecting...
> 2020-03-26T14:53:39.265204342Z W0326 14:53:39.265162       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.131.225:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.131.225:2379: connect: connection refused". Reconnecting...

I found similar logs in the other API server pods as well. This looks like etcd connection issues. Looking at the log for the etcd pod running on node 10.0.131.225:


> 2020-03-26T14:53:24.372530093Z 2020-03-26 14:53:24.372506 I | rafthttp: stopped HTTP pipelining with peer de1b723e4dfe731a
> 2020-03-26T14:53:24.372593772Z 2020-03-26 14:53:24.372576 I | rafthttp: stopped HTTP pipelining with peer ae89652406a1c107
> 2020-03-26T14:53:24.372674993Z 2020-03-26 14:53:24.372619 I | rafthttp: stopping peer de1b723e4dfe731a...
> 2020-03-26T14:53:24.373513716Z 2020-03-26 14:53:24.373050 I | rafthttp: closed the TCP streaming connection with peer de1b723e4dfe731a (stream MsgApp v2 writer)
> 2020-03-26T14:53:24.373513716Z 2020-03-26 14:53:24.373069 I | rafthttp: stopped streaming with peer de1b723e4dfe731a (writer)
> 2020-03-26T14:53:24.373513716Z 2020-03-26 14:53:24.373456 I | rafthttp: closed the TCP streaming connection with peer de1b723e4dfe731a (stream Message writer)
> 2020-03-26T14:53:24.373513716Z 2020-03-26 14:53:24.373468 I | rafthttp: stopped streaming with peer de1b723e4dfe731a (writer)
> 2020-03-26T14:53:24.373721812Z 2020-03-26 14:53:24.373621 I | rafthttp: stopped HTTP pipelining with peer de1b723e4dfe731a
> 2020-03-26T14:53:24.373721812Z 2020-03-26 14:53:24.373705 W | rafthttp: lost the TCP streaming connection with peer de1b723e4dfe731a (stream MsgApp v2 reader)
> 2020-03-26T14:53:24.373743895Z 2020-03-26 14:53:24.373718 E | rafthttp: failed to read de1b723e4dfe731a on stream MsgApp v2 (context canceled)
> 2020-03-26T14:53:24.373743895Z 2020-03-26 14:53:24.373725 I | rafthttp: peer de1b723e4dfe731a became inactive (message send to peer failed)
> 2020-03-26T14:53:24.373743895Z 2020-03-26 14:53:24.373734 I | rafthttp: stopped streaming with peer de1b723e4dfe731a (stream MsgApp v2 reader)
> 2020-03-26T14:53:24.373877224Z 2020-03-26 14:53:24.373803 W | rafthttp: lost the TCP streaming connection with peer de1b723e4dfe731a (stream Message reader)
> 2020-03-26T14:53:24.373877224Z 2020-03-26 14:53:24.373822 I | rafthttp: stopped streaming with peer de1b723e4dfe731a (stream Message reader)
> 2020-03-26T14:53:24.373877224Z 2020-03-26 14:53:24.373830 I | rafthttp: stopped peer de1b723e4dfe731a
> 2020-03-26T14:53:24.373877224Z 2020-03-26 14:53:24.373838 I | rafthttp: stopping peer ae89652406a1c107...
> 2020-03-26T14:53:24.374182267Z 2020-03-26 14:53:24.374160 I | rafthttp: closed the TCP streaming connection with peer ae89652406a1c107 (stream MsgApp v2 writer)
> 2020-03-26T14:53:24.374276594Z 2020-03-26 14:53:24.374262 I | rafthttp: stopped streaming with peer ae89652406a1c107 (writer)
> 2020-03-26T14:53:24.374699803Z 2020-03-26 14:53:24.374677 I | rafthttp: closed the TCP streaming connection with peer ae89652406a1c107 (stream Message writer)
> 2020-03-26T14:53:24.374788879Z 2020-03-26 14:53:24.374774 I | rafthttp: stopped streaming with peer ae89652406a1c107 (writer)
> 2020-03-26T14:53:24.374839475Z 2020-03-26 14:53:24.374826 I | rafthttp: stopped HTTP pipelining with peer ae89652406a1c107
> 2020-03-26T14:53:24.374924535Z 2020-03-26 14:53:24.374908 W | rafthttp: lost the TCP streaming connection with peer ae89652406a1c107 (stream MsgApp v2 reader)
> 2020-03-26T14:53:24.374959642Z 2020-03-26 14:53:24.374948 I | rafthttp: stopped streaming with peer ae89652406a1c107 (stream MsgApp v2 reader)
> 2020-03-26T14:53:24.375053876Z 2020-03-26 14:53:24.375037 W | rafthttp: lost the TCP streaming connection with peer ae89652406a1c107 (stream Message reader)
> 2020-03-26T14:53:24.375107548Z 2020-03-26 14:53:24.375091 E | rafthttp: failed to read ae89652406a1c107 on stream Message (context canceled)
> 2020-03-26T14:53:24.37513788Z 2020-03-26 14:53:24.375125 I | rafthttp: peer ae89652406a1c107 became inactive (message send to peer failed)
> 2020-03-26T14:53:24.375167255Z 2020-03-26 14:53:24.375154 I | rafthttp: stopped streaming with peer ae89652406a1c107 (stream Message reader)
> 2020-03-26T14:53:24.37520013Z 2020-03-26 14:53:24.375186 I | rafthttp: stopped peer ae89652406a1c107
> 2020-03-26T14:53:24.397307356Z 2020-03-26 14:53:24.397263 I | embed: rejected connection from "10.0.143.234:40920" (error "read tcp 10.0.131.225:2380->10.0.143.234:40920: use of closed network connection", ServerName "")
> 2020-03-26T14:53:24.397352512Z 2020-03-26 14:53:24.397293 I | embed: rejected connection from "10.0.157.232:46956" (error "read tcp 10.0.131.225:2380->10.0.157.232:46956: use of closed network connection", ServerName "")
> 2020-03-26T14:53:24.397352512Z 2020-03-26 14:53:24.397320 I | embed: rejected connection from "10.0.157.232:46958" (error "read tcp 10.0.131.225:2380->10.0.157.232:46958: use of closed network connection", ServerName "")
> 2020-03-26T14:53:24.397352512Z 2020-03-26 14:53:24.397338 I | embed: rejected connection from "10.0.143.234:40918" (error "read tcp 10.0.131.225:2380->10.0.143.234:40918: use of closed network connection", ServerName "")

Moving this to the etcd team to get their input.

Comment 2 Daneyon Hansen 2020-04-10 18:18:41 UTC
I'm hitting this BZ with https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24858/pull-ci-openshift-origin-master-e2e-gcp-builds/1845. I see failure behavior similar to https://bugzilla.redhat.com/show_bug.cgi?id=1821654 and https://bugzilla.redhat.com/show_bug.cgi?id=1817588, so we may have duplicates.

I see the following apiserver log messages:

2020-04-09T23:51:01.148650724Z E0409 23:51:01.148590       1 watcher.go:214] watch chan error: etcdserver: mvcc: required revision has been compacted

I see the etcd leader removed:
2020-04-09T21:15:52.794681538Z 2020-04-09 21:15:52.794666 I | rafthttp: stopping peer 40e0ff5ee27c98d0...
2020-04-09T21:15:52.795099826Z 2020-04-09 21:15:52.795059 I | rafthttp: closed the TCP streaming connection with peer 40e0ff5ee27c98d0 (stream MsgApp v2 writer)
2020-04-09T21:15:52.795099826Z 2020-04-09 21:15:52.795080 I | rafthttp: stopped streaming with peer 40e0ff5ee27c98d0 (writer)
2020-04-09T21:15:52.795525514Z 2020-04-09 21:15:52.795484 I | rafthttp: closed the TCP streaming connection with peer 40e0ff5ee27c98d0 (stream Message writer)
2020-04-09T21:15:52.795525514Z 2020-04-09 21:15:52.795506 I | rafthttp: stopped streaming with peer 40e0ff5ee27c98d0 (writer)
2020-04-09T21:15:52.795616653Z 2020-04-09 21:15:52.795555 I | rafthttp: stopped HTTP pipelining with peer 40e0ff5ee27c98d0
2020-04-09T21:15:52.795650289Z 2020-04-09 21:15:52.795626 W | rafthttp: lost the TCP streaming connection with peer 40e0ff5ee27c98d0 (stream MsgApp v2 reader)
2020-04-09T21:15:52.795662826Z 2020-04-09 21:15:52.795647 E | rafthttp: failed to read 40e0ff5ee27c98d0 on stream MsgApp v2 (context canceled)
2020-04-09T21:15:52.795662826Z 2020-04-09 21:15:52.795655 I | rafthttp: peer 40e0ff5ee27c98d0 became inactive (message send to peer failed)
2020-04-09T21:15:52.795700751Z 2020-04-09 21:15:52.795675 I | rafthttp: stopped streaming with peer 40e0ff5ee27c98d0 (stream MsgApp v2 reader)
2020-04-09T21:15:52.795758997Z 2020-04-09 21:15:52.795734 W | rafthttp: lost the TCP streaming connection with peer 40e0ff5ee27c98d0 (stream Message reader)
2020-04-09T21:15:52.7958148Z 2020-04-09 21:15:52.795790 I | rafthttp: stopped streaming with peer 40e0ff5ee27c98d0 (stream Message reader)
2020-04-09T21:15:52.7958148Z 2020-04-09 21:15:52.795805 I | rafthttp: stopped peer 40e0ff5ee27c98d0
2020-04-09T21:15:52.795830451Z 2020-04-09 21:15:52.795817 I | rafthttp: removed peer 40e0ff5ee27c98d0
2020-04-09T21:15:52.80407384Z 2020-04-09 21:15:52.804006 I | rafthttp: peer 40e0ff5ee27c98d0 became active
2020-04-09T21:15:52.805805999Z 2020-04-09 21:15:52.805773 E | rafthttp: failed to write 40e0ff5ee27c98d0 on pipeline (unexpected EOF)
2020-04-09T21:15:52.819340182Z 2020-04-09 21:15:52.819283 I | embed: rejected connection from "10.0.0.6:55440" (error "EOF", ServerName "")
2020-04-09T21:15:52.819495637Z 2020-04-09 21:15:52.819452 I | embed: rejected connection from "10.0.0.6:55442" (error "read tcp 10.0.0.3:2380->10.0.0.6:55442: read: connection reset by peer", ServerName "")
2020-04-09T21:15:53.763852724Z raft2020/04/09 21:15:53 INFO: 1d99a50681726a1a [term: 3] received a MsgVote message with higher term from 66f3004c0553c2d6 [term: 4]
2020-04-09T21:15:53.763852724Z raft2020/04/09 21:15:53 INFO: 1d99a50681726a1a became follower at term 4
2020-04-09T21:15:53.763852724Z raft2020/04/09 21:15:53 INFO: 1d99a50681726a1a [logterm: 3, index: 8103, vote: 0] cast MsgVote for 66f3004c0553c2d6 [logterm: 3, index: 8103] at term 4
2020-04-09T21:15:53.763932561Z raft2020/04/09 21:15:53 INFO: raft.node: 1d99a50681726a1a lost leader 40e0ff5ee27c98d0 at term 4
2020-04-09T21:15:53.766880428Z raft2020/04/09 21:15:53 INFO: raft.node: 1d99a50681726a1a elected leader 66f3004c0553c2d6 at term 4
...


I also see plenty of:
2020-04-09T21:34:46.03330722Z 2020-04-09 21:34:46.033252 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/serviceaccount-ca\" " with result "range_response_count:1 size:6299" took too long (456.196812ms) to execute

Comment 5 Ben Parees 2020-04-28 16:01:26 UTC
*** Bug 1818071 has been marked as a duplicate of this bug. ***

Comment 6 Ben Parees 2020-04-28 16:02:53 UTC
*** Bug 1820807 has been marked as a duplicate of this bug. ***

Comment 7 Sam Batschelet 2020-05-22 21:52:35 UTC
> 2020-04-09T23:51:01.148650724Z E0409 23:51:01.148590       1 watcher.go:214] watch chan error: etcdserver: mvcc: required revision has been compacted

This error means exactly what it says and is not harmful: the watcher cannot continue because the revision it was watching from has been compacted, so it is not retriable.

> I see the etcd leader removed:

Because the bootstrap etcd member is the first to start, it is in almost all cases the leader when it is removed during bootstrap. etcd does attempt to gracefully transfer leadership, but that does not always succeed. While this is not optimal, it should not directly affect the upgrade, since it happens during bootstrap.

>I also see plenty of:
>2020-04-09T21:34:46.03330722Z 2020-04-09 21:34:46.033252 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/serviceaccount-ca\" " with result "range_response_count:1 size:6299" took too long (456.196812ms) to execute

These are warnings and common to see during e2e.

As far as general etcd health, I see 2 failures from the quorum guard:

> Apr 09 21:14:54.964015 ci-op-55x45-m-0.c.openshift-gce-devel-ci.internal (failure):
> Apr 09 21:15:04.947513 ci-op-55x45-m-0.c.openshift-gce-devel-ci.internal (failure):

> Apr 09 21:15:04.947513 ci-op-55x45-m-0.c.openshift-gce-devel-ci.internal hyperkube[1472]: I0409 21:15:04.946942    1472 prober.go:124] Readiness probe for "etcd-quorum-guard-5668c56-7gzs9_openshift-machine-config-operator(cb4625d7-0810-44ca-afef-923720abb5d9):guard" failed (failure):

This is a few seconds before etcd starts:

>   2020-04-09T21:15:14.128450714Z 2020-04-09 21:15:14.128403 I | embed: serving client requests on [::]:2379

In general, I do not see etcd as the cause of the instability. I will continue to dig.
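
For reference, the quorum-guard readiness being probed above can also be checked directly on a live cluster. A minimal sketch, assuming the etcd-quorum-guard pods still run in the openshift-machine-config-operator namespace as in the journal line above:

# Show the etcd-quorum-guard pods, their READY column, and the nodes they run on
$ oc get pods -n openshift-machine-config-operator -o wide | grep etcd-quorum-guard

A pod stuck at 0/1 READY typically indicates that the etcd member on that node is failing its health check.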

Comment 8 Sam Batschelet 2020-05-26 21:30:18 UTC
More digging:

from https://bugzilla.redhat.com/show_bug.cgi?id=1817588#c1

The reason you are seeing those logs is that the etcd member is being sent SIGTERM as part of the upgrade process. A few lines above where your excerpt starts, you will see [1]:

> 2020-03-26 14:53:23.759105 N | pkg/osutil: received terminated signal, shutting down...

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-ip-10-0-131-225.us-west-1.compute.internal_etcd_previous.log
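
For anyone retracing this, the termination line can be confirmed straight from the archived log, in the same style as the other commands in this bug (assuming the artifact at [1] is still retrievable):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-ip-10-0-131-225.us-west-1.compute.internal_etcd_previous.log | grep -n 'received terminated signal'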

Comment 12 Dan Mace 2020-07-14 14:33:18 UTC
I think Sam showed pretty convincingly there's no evidence of an etcd issue here. It seems the motivation for implicating etcd in the first place was a misunderstanding of log output.

I'm going to close this. If there's new evidence which can help inform who should open an investigation, please speak up. If you think this bug should be re-opened against the etcd team, please help us understand the justification.

Comment 13 W. Trevor King 2020-07-17 05:34:12 UTC
4.4.12 -> 4.4.13 CI died this way too [1].  I don't think INSUFFICIENT_DATA is a valid state for CI failures.  If there is insufficient data to figure out what went wrong, then it is at least a bug in our must-gather logic or other CI artifact collection.  Reopening.  Looking at the original job from comment 0:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177/build-log.txt | grep APIServerDeploymentDegraded | grep '^Mar ' | sort | tail -n1
Mar 26 14:55:09.720: INFO: cluster upgrade is Failing: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177/artifacts/e2e-aws-upgrade/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-apiserver-operator" and .reason == "OperatorStatusChanged") | .firstTimestamp + " " + (.count | tostring) + " " + .message' | tail -n2
2020-03-26T14:47:46Z 1 Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable for apiserver.openshift-apiserver" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver"
2020-03-26T14:56:28Z 1 Status for clusteroperator/openshift-apiserver changed: Degraded changed from True to False ("")
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177/artifacts/e2e-aws-upgrade/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-apiserver" and .involvedObject.kind == "Deployment") | .firstTimestamp + " " + (.count | tostring) + " " + .message'
2020-03-26T13:27:31Z 1 Scaled up replica set apiserver-6dbb989685 to 3
2020-03-26T13:27:40Z 1 Scaled down replica set apiserver-6dbb989685 to 2
2020-03-26T13:27:40Z 1 Scaled up replica set apiserver-c75fc9f to 1
2020-03-26T13:28:03Z 1 Scaled down replica set apiserver-6dbb989685 to 1
2020-03-26T13:28:03Z 1 Scaled up replica set apiserver-c75fc9f to 2
2020-03-26T13:28:43Z 1 Scaled down replica set apiserver-c75fc9f to 1
2020-03-26T13:28:43Z 1 Scaled up replica set apiserver-7df5549487 to 1
2020-03-26T13:29:42Z 1 Scaled down replica set apiserver-6dbb989685 to 0
2020-03-26T13:29:42Z 1 Scaled up replica set apiserver-7df5549487 to 2
2020-03-26T13:30:18Z 2 (combined from similar events): Scaled up replica set apiserver-7df5549487 to 3
2020-03-26T13:43:24Z 1 Scaled down replica set apiserver-7df5549487 to 2
2020-03-26T13:43:25Z 1 Scaled up replica set apiserver-5b4f45679c to 1
2020-03-26T13:43:56Z 1 Scaled down replica set apiserver-7df5549487 to 1
2020-03-26T13:43:56Z 1 Scaled up replica set apiserver-5b4f45679c to 2
2020-03-26T13:44:37Z 1 Scaled down replica set apiserver-7df5549487 to 0
2020-03-26T13:44:37Z 1 Scaled up replica set apiserver-5b4f45679c to 3

OpenShift API-server operator logs start at 14:52 [2], so after all of that.  Looking at node reboots:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177/artifacts/e2e-aws-upgrade/nodes/masters-journal | gunzip | grep 'systemd.*Shutting down\|Starting Kubernetes Kubelet' | sort
Mar 26 13:21:31.780827 ip-10-0-157-232 systemd[1]: Starting Kubernetes Kubelet...
Mar 26 13:21:33.781551 ip-10-0-143-234 systemd[1]: Starting Kubernetes Kubelet...
Mar 26 13:21:47.703891 ip-10-0-131-225 systemd[1]: Starting Kubernetes Kubelet...
Mar 26 14:45:59.889201 ip-10-0-143-234 systemd[1]: Shutting down.
Mar 26 14:46:39.993622 ip-10-0-143-234 systemd[1]: Starting Kubernetes Kubelet...
Mar 26 14:50:35.798233 ip-10-0-157-232 systemd[1]: Shutting down.
Mar 26 14:51:17.189099 ip-10-0-157-232 systemd[1]: Starting Kubernetes Kubelet...
Mar 26 14:54:54.067183 ip-10-0-131-225 systemd[1]: Shutting down.
Mar 26 14:55:35.068369 ip-10-0-131-225 systemd[1]: Starting Kubernetes Kubelet...

So it seems like this may have been the API-server operator complaining about a control-plane node reboot, or something in that vicinity.



Checking my 4.4.12 -> 4.4.13 job [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1283386123065757696/build-log.txt | grep APIServerDeploymentDegraded | grep '^Jul ' | sort | tail -n1
Jul 15 15:02:09.086: INFO: cluster upgrade is Failing: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1283386123065757696/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-apiserver-operator" and .reason == "OperatorStatusChanged") | .firstTimestamp + " " + (.count | tostring) + " " + .message' | tail -n1
2020-07-15T14:06:31Z 1 Status for clusteroperator/openshift-apiserver changed: Degraded changed from False to True ("APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable")
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1283386123065757696/artifacts/launch/nodes/masters-journal | gunzip | grep 'systemd.*Shutting down\|Starting Kubernetes Kubelet' | sort
Jul 15 13:22:54.995611 ip-10-0-172-128 systemd[1]: Shutting down.
Jul 15 13:23:01.933536 ip-10-0-217-145 systemd[1]: Shutting down.
Jul 15 13:23:02.289590 ip-10-0-180-183 systemd[1]: Shutting down.
Jul 15 13:23:57.366301 ip-10-0-172-128 systemd[1]: Starting Kubernetes Kubelet...
Jul 15 13:24:04.754364 ip-10-0-180-183 systemd[1]: Starting Kubernetes Kubelet...
Jul 15 13:24:05.720385 ip-10-0-217-145 systemd[1]: Starting Kubernetes Kubelet...

So it seems unrelated to node reboots and did not recover.  Upside is that it means we have operator logs about the decision, starting at 13:48 [3].  Downside is that those logs go into no detail about the unavailable-pod reasons.  Checking the deployment directly:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1283386123065757696/artifacts/launch/deployments.json | gunzip | jq -r '.items[] | select(.metadata.namespace == "openshift-apiserver").status'
{
  "availableReplicas": 2,
  "conditions": [
    {
      "lastTransitionTime": "2020-07-15T13:28:38Z",
      "lastUpdateTime": "2020-07-15T13:28:38Z",
      "message": "Deployment has minimum availability.",
      "reason": "MinimumReplicasAvailable",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2020-07-15T13:28:17Z",
      "lastUpdateTime": "2020-07-15T13:51:30Z",
      "message": "ReplicaSet \"apiserver-7547f7d496\" has successfully progressed.",
      "reason": "NewReplicaSetAvailable",
      "status": "True",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 3,
  "readyReplicas": 2,
  "replicas": 3,
  "unavailableReplicas": 1,
  "updatedReplicas": 3
}

Doesn't tell us which replica is unavailable.  Checking the pods directly:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1283386123065757696/artifacts/launch/pods.json | jq -r '.items[] | (.status.conditions[] | select(.type == "Ready")) as $ready | select(.metadata.namespace == "openshift-apiserver" and $ready .status == "False") | .metadata.name + "\n" + ([.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .message] | sort | join("\n")) + "\n" + (.status.containerStatuses[] | select(.name == "openshift-apiserver") | (.restartCount | tostring) + "\n" + .lastState.terminated.message)'
apiserver-7547f7d496-fjx7m
2020-07-15T13:50:42Z PodScheduled=True 
2020-07-15T13:50:50Z Initialized=True 
2020-07-15T14:04:24Z ContainersReady=False containers with unready status: [openshift-apiserver]
2020-07-15T14:04:24Z Ready=False containers with unready status: [openshift-apiserver]
18
Copying system trust bundle
F0715 15:07:19.871901       1 cmd.go:72] unable to load configmap based request-header-client-ca-file: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: no route to host

So it seems like an SDN issue there.  Reopening and assigning to the API-server folks, so they can decide where to take this.  Options I can see:

* Improve the message, so things like "172.30.0.1:443: connect: no route to host" get bubbled up somewhere without folks having to drill down into the pods.  Or other changes to make this degraded condition more actionable.
* CI-gather improvements to make this degraded message easier to diagnose in CI if there are impediments to making the degraded condition more actionable in general.
* Continue to dig into comment 0's example job, to see if they can root-cause that or find a different team to hand it off to.
* Hand off to the SDN folks to figure out why the 4.4.12 -> 4.4.13 job got the "172.30.0.1:443: connect: no route to host" error.  This may be already tracked as part of the bug 1852618 series, which is ON_QA for 4.6 and still unpatched for earlier 4.y.
* Probably several more routes I'm not thinking of at the moment :p.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1283386123065757696
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23177/artifacts/e2e-aws-upgrade/pods/openshift-apiserver-operator_openshift-apiserver-operator-fc8f8f985-dngt7_openshift-apiserver-operator.log
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1283386123065757696/artifacts/launch/pods/openshift-apiserver-operator_openshift-apiserver-operator-f7cb9f874-pm7m2_openshift-apiserver-operator.log

Comment 14 Stefan Schimanski 2020-08-03 10:20:06 UTC
There is the network checker for exactly this kind of issue. @Luis, how would this show up to the admin? Is there some alerting if the outages noticed by the network checker become really bad? If yes, I tend to close this with CURRENT_VERSION.

Comment 15 Luis Sanchez 2020-08-03 21:02:14 UTC
The network checker, check-endpoints, would detect outages and report events in similar cases. In addition to the events, there are PodNetworkConnectivityCheck resources with historical details of outages.
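
For context, a rough sketch of how an admin could inspect that data on a live cluster (the namespace and the exact status layout are assumptions based on the description above, not verified against a specific release):

# List the connectivity checks recorded by check-endpoints
$ oc get podnetworkconnectivitycheck -n openshift-apiserver
# The outage history lives in each check's status; dump it and look for outage entries
$ oc get podnetworkconnectivitycheck -n openshift-apiserver -o yaml | less
# Connectivity-related events reported alongside the checks
$ oc get events -n openshift-apiserver | grep -i connectivity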

Comment 16 W. Trevor King 2020-08-03 23:10:46 UTC
So [1] is a 4.5.4 -> 4.6.0-0.ci-2020-08-03-103018 CI update dying on (as far as the CVO was concerned) APIServerDeploymentDegraded 10h ago.  'Monitor cluster while tests execute' gathered 80 error-level events on that run.  Those errors include:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1290235003745079296/build-log.txt | grep 'clusteroperator/.*changed.*APIServerDeploymentDegraded' | head -n1
Aug 03 11:12:24.362 I ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/OperatorStatusChanged Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver",Progressing message changed from "OperatorConfigProgressing: openshiftapiserveroperatorconfigs/instance: observed generation is 3, desired generation is 4.\nAPIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: observed generation is 3, desired generation is 4." to "OperatorConfigProgressing: openshiftapiserveroperatorconfigs/instance: observed generation is 3, desired generation is 4.\nAPIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: observed generation is 4, desired generation is 5."

The gathered events also include:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1290235003745079296/build-log.txt | grep 'kube-apiserver-check-endpoints' | grep -v ' I ns/'
Aug 03 11:14:29.528 W clusteroperator/kube-apiserver changed Degraded to False: AsExpected: NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-187-68.ec2.internal container "kube-apiserver-check-endpoints" is not ready: unknown reason
Aug 03 11:19:50.977 W ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-137-64.ec2.internal node/ip-10-0-137-64.ec2.internal container/kube-apiserver-check-endpoints reason/NotReady
Aug 03 11:22:04.622 W ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-187-68.ec2.internal node/ip-10-0-187-68.ec2.internal container/kube-apiserver-check-endpoints reason/NotReady

Maybe there is more information on the pods themselves.  It's not clear to me how an admin who sees APIServerDeploymentDegraded would know to look for these particular events.  Do we alert on them or anything?  Dropping the job into PromeCIeus [2] and querying for ALERTS{alertstate="firing",severity="critical"}, we had these critical alerts firing at some point over the life of the cluster:

ALERTS{alertname="etcdMembersDown",alertstate="firing",job="etcd",namespace="openshift-etcd",pod="etcd-ip-10-0-137-64.ec2.internal",service="etcd",severity="critical"}
ALERTS{alertname="etcdInsufficientMembers",alertstate="firing",endpoint="etcd-metrics",job="etcd",namespace="openshift-etcd",pod="etcd-ip-10-0-137-64.ec2.internal",service="etcd",severity="critical"}
ALERTS{alertname="MCDDrainError",alertstate="firing",drain_time="606.395266515 sec",endpoint="metrics",err="5 tries: [error when evicting pod "router-default-66b8dc4967-zmqvk": etcdserver: request timed out, error when evicting pod "router-default-55b6d684c9-kr9hh": global timeout reached: 1m30s]",instance="10.0.174.203:9001",job="machine-config-daemon",namespace="openshift-machine-config-operator",pod="machine-config-daemon-68ll6",service="machine-config-daemon",severity="critical"}
ALERTS{alertname="KubeDeploymentReplicasMismatch",alertstate="firing",deployment="router-default",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-ingress",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentReplicasMismatch",alertstate="firing",deployment="router-default",endpoint="https-main",instance="10.129.2.29:8443",job="kube-state-metrics",namespace="openshift-ingress",pod="kube-state-metrics-77d7f49bdc-kdm6q",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentGenerationMismatch",alertstate="firing",deployment="thanos-querier",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-monitoring",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentGenerationMismatch",alertstate="firing",deployment="telemeter-client",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-monitoring",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentGenerationMismatch",alertstate="firing",deployment="prometheus-operator",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-monitoring",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentGenerationMismatch",alertstate="firing",deployment="prometheus-adapter",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-monitoring",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentGenerationMismatch",alertstate="firing",deployment="openshift-state-metrics",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-monitoring",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentGenerationMismatch",alertstate="firing",deployment="kube-state-metrics",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-monitoring",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDeploymentGenerationMismatch",alertstate="firing",deployment="grafana",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-monitoring",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="KubeDaemonSetRolloutStuck",alertstate="firing",daemonset="multus",endpoint="https-main",instance="10.131.0.17:8443",job="kube-state-metrics",namespace="openshift-multus",pod="kube-state-metrics-77d7f49bdc-k5wtf",service="kube-state-metrics",severity="critical"}
ALERTS{alertname="ClusterOperatorDown",alertstate="firing",endpoint="metrics",instance="10.0.187.68:9099",job="cluster-version-operator",name="network",namespace="openshift-cluster-version",pod="cluster-version-operator-6f8b8b6446-2lnhs",service="cluster-version-operator",severity="critical",version="4.5.4"}
ALERTS{alertname="ClusterOperatorDown",alertstate="firing",endpoint="metrics",instance="10.0.187.68:9099",job="cluster-version-operator",name="monitoring",namespace="openshift-cluster-version",pod="cluster-version-operator-6f8b8b6446-2lnhs",service="cluster-version-operator",severity="critical",version="4.6.0-0.ci-2020-08-03-103018"}
ALERTS{alertname="ClusterOperatorDown",alertstate="firing",endpoint="metrics",instance="10.0.187.68:9099",job="cluster-version-operator",name="ingress",namespace="openshift-cluster-version",pod="cluster-version-operator-6f8b8b6446-rwkqg",service="cluster-version-operator",severity="critical",version="4.6.0-0.ci-2020-08-03-103018"}
ALERTS{alertname="ClusterOperatorDegraded",alertstate="firing",condition="Degraded",endpoint="metrics",instance="10.0.187.68:9099",job="cluster-version-operator",name="network",namespace="openshift-cluster-version",pod="cluster-version-operator-6f8b8b6446-2lnhs",reason="RolloutHung",service="cluster-version-operator",severity="critical"}
ALERTS{alertname="ClusterOperatorDegraded",alertstate="firing",condition="Degraded",endpoint="metrics",instance="10.0.187.68:9099",job="cluster-version-operator",name="monitoring",namespace="openshift-cluster-version",pod="cluster-version-operator-6f8b8b6446-2lnhs",reason="UpdatingopenshiftStateMetricsFailed",service="cluster-version-operator",severity="critical"}
ALERTS{alertname="ClusterOperatorDegraded",alertstate="firing",condition="Degraded",endpoint="metrics",instance="10.0.187.68:9099",job="cluster-version-operator",name="ingress",namespace="openshift-cluster-version",pod="cluster-version-operator-6f8b8b6446-rwkqg",reason="IngressControllersDegraded",service="cluster-version-operator",severity="critical"}
ALERTS{alertname="CloudCredentialOperatorDown",alertstate="firing",severity="critical"}
ALERTS{alertname="CannotRetrieveUpdates",alertstate="firing",endpoint="metrics",instance="10.0.187.68:9099",job="cluster-version-operator",namespace="openshift-cluster-version",pod="cluster-version-operator-6f8b8b6446-2lnhs",service="cluster-version-operator",severity="critical"}

What can we do to make it more likely that admins who see their cluster stuck on APIServerDeploymentDegraded for at least 2 minutes will check and fix the underlying network issue?  Or to avoid firing a false-positive Degraded=True message if we don't think there's anything serious enough going wrong that they need to wade in and try to manually fix something?

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1290235003745079296
[2]: https://promecieus.svc.ci.openshift.org/

Comment 17 Matthew Heon 2020-08-06 19:34:14 UTC
I'm going to reopen this. We are seeing a rash of failures on 4.5 -> 4.6 upgrade with this issue.

Sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1291378585432494080

Comment 18 Stefan Schimanski 2020-08-31 10:40:43 UTC
@wking: it's not clear to me what is unclear in "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver". Do you have a suggestion for what you would have expected as a message?

Maybe

  APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver. Please check the pods in namespace openshift-apiserver.

?

There are many, many reasons why a pod might not be available. We depend on the kubelet and kube-controller-manager (kcm) to expose this information to the operator. I don't think it is worth identifying a cause like networking (which is one out of a hundred) individually everywhere. I would rather expect the networking controller to report problems, but it doesn't and is happy all the time despite the issue on the master node:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1290235003745079296/build-log.txt | grep 'clusteroperator/network' | head -n1
Aug 03 11:19:19.556 W clusteroperator/network changed Progressing to True: Deploying: Waiting for DaemonSet "openshift-multus/network-metrics-daemon" to be created\nWaiting for DaemonSet "openshift-sdn/sdn-metrics" to be created
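
For what it's worth, the suggested message would map to a couple of obvious first commands for the admin; a minimal sketch (the deployment name apiserver matches the replica-set events quoted in comment 13):

# Which openshift-apiserver pod is unavailable, and on which node is it scheduled?
$ oc get pods -n openshift-apiserver -o wide
# Why is it unavailable (container statuses, probe failures, recent events)?
$ oc describe deployment/apiserver -n openshift-apiserver
$ oc describe pods -n openshift-apiserver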

Comment 19 W. Trevor King 2020-08-31 20:08:38 UTC
> I would rather expect the networking controller to report problems, but it doesn't and is happy all the time despite the issue on the master node.

Is that a bug with KubeDaemonSetRolloutStuck [1]?  The network operator?  Both?  Another component?  Ideally by the time the operator goes degraded, and certainly by the time the critical ClusterOperatorDegraded [2] goes off 10m later, I'd like to have some signal from the cluster about what is actually going wrong and what we expect the cluster admin whose pager just went off to do about it.  Does the API server actually care that one of three pods is not available?  It seems to me that the API server does not need to complain about a single control-plane machine being down/unusable for an extended period of time, as long as:

* The etcd quorum guard or similar is protecting the cluster from losing quorum by keeping the machine-config operator from intentionally rolling one of the remaining nodes.
* The API server can successfully schedule a healthy pod on every control-plane machine that has a running etcd member within some reasonable time of the etcd member going healthy.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/170f91faabc9683a34df29d1d892027292ed0296/assets/prometheus-k8s/rules.yaml#L1348-L1378
[2]: https://github.com/openshift/cluster-version-operator/blob/05f3d334cec0f3e187c9a64a55934a1c5011122b/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L79-L86

Comment 20 Michal Fojtik 2020-09-30 20:51:15 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 22 Michal Fojtik 2020-10-30 22:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 23 Donny Davis 2020-11-02 22:26:23 UTC
This is still broken for me. Upgrading from 4.5.16 to 4.6.1

Comment 24 Michal Fojtik 2020-11-02 23:12:13 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 26 jamo luhrsen 2020-11-14 00:53:16 UTC
Looking at some newer ovn-upgrade jobs I've added recently, I think we are seeing this bug there:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1327335021974917120/build-log.txt

Nov 13 21:18:22.019: INFO: cluster upgrade is Progressing: Unable to apply 4.7.0-0.ci-2020-11-13-152113: the cluster operator openshift-apiserver is degraded
Nov 13 21:18:22.019: INFO: cluster upgrade is Failing: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

Comment 27 jamo luhrsen 2020-11-24 19:34:06 UTC
What can we do to make some progress on this bug? I'm not sure if the root cause is the same now as when this was originally filed, but this
failing test case "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]" is now biting
us with 4.7 release payload promotion. This job, for example:

https://testgrid.k8s.io/redhat-openshift-informing#release-openshift-origin-installer-e2e-aws-upgrade

Also, it seems the failure is seen in approximately 3%+ of all failing jobs over the past two days:
https://search.ci.openshift.org/?search=Cluster+should+remain+functional+during+upgrade&maxAge=48h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 28 Michal Fojtik 2020-12-24 19:58:22 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 29 Wally 2021-01-22 16:13:37 UTC
Setting blocker- since priority = low.

Comment 30 Michal Fojtik 2021-01-22 16:20:36 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 31 Francesco Giudici 2021-01-27 19:37:34 UTC
This is still happening in current upgrade jobs from 4.6.13 --> 4.7.
See https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/dashboards/overview.
A sample of the failing jobs is here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1354347102104195072

failing with:
"Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()"

Comment 32 Michal Fojtik 2021-02-26 20:07:05 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 34 Michal Fojtik 2021-05-19 19:16:49 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 37 Luis Sanchez 2021-08-30 20:05:24 UTC
Not seeing this (1) in the latest upgrades.

If this should be re-opened, I'll take a closer look at the test.

1: Specifically, not seeing the "Cluster should remain functional during upgrade" test fail due to "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver".


In the past, things that might delay pods from becoming ready included routes that took a while to be established (or other network issues), or the kubelet having trouble restarting a pod when the old one just wouldn't die.

Notes: Currently it seems to take about 7-8 minutes for the openshift-apiserver deployment to roll out. I would be curious how long the test takes to time out. Of course the API server should be available during the rollout, so it's technically still functional.
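
A hedged way to measure that rollout time directly on a cluster (the deployment name apiserver is taken from the events earlier in this bug; the 15m timeout is arbitrary):

# Block until the openshift-apiserver deployment finishes rolling out, and report how long it took
$ time oc rollout status deployment/apiserver -n openshift-apiserver --timeout=15m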