Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1765219

Summary: [Disruptive] Cluster upgrade should maintain a functioning cluster
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.3.0
Target Release: 4.3.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Fabio Bertinatto <fbertina>
Assignee: Stefan Schimanski <sttts>
QA Contact: liujia <jiajliu>
CC: aos-bugs, eparis, jokerman, sttts, xxia
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2020-01-23 11:09:08 UTC

Comment 1 Abhinav Dahiya 2019-10-30 16:14:58 UTC
This should have been fixed already by https://github.com/openshift/cluster-version-operator/pull/265

Comment 3 liujia 2019-11-05 08:41:19 UTC
From job [1], I did not see any obvious log indicating whether the failure is related to the API version issue. And from recent CI build jobs, I noticed a similar failure happening again in [2].

@Stefan Schimanski
I'm not quite sure how QE should verify this bug against PR #265. Could you help confirm?
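For example, would checking which cluster-version-operator commit is baked into the payload be enough? Something like the following (illustrative only; the payload pullspec is a placeholder):

  # List the per-component commits included in a release payload, then compare
  # the cluster-version-operator commit against the PR #265 merge commit.
  oc adm release info --commits <payload-pullspec> | grep cluster-version-operator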


[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/9490
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10456

Comment 4 liujia 2019-12-06 03:19:17 UTC
Still seeing this test failure in recent CI runs.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12127/

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:130]: during upgrade
Unexpected error:
    <*errors.errorString | 0xc000be4360>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-05-213858: 13% complete",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-05-213858: 13% complete
occurred
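When it is stuck at 13% like this, one possible way to see what the CVO is actually waiting on (illustrative commands; the openshift-cluster-version namespace and deployment name are the standard ones, not something taken from this job's artifacts):

  # Full ClusterVersion status, including the condition the test timed out on.
  oc get clusterversion version -o yaml
  # Recent CVO logs usually name the exact manifest/operator it is blocked on.
  oc logs -n openshift-cluster-version deployment/cluster-version-operator --tail=50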

Comment 5 Stefan Schimanski 2019-12-11 16:02:17 UTC
@liujia: Not every run failing with "Cluster upgrade should maintain a functioning cluster" is due to the CRD topic fixed in https://github.com/openshift/cluster-version-operator/pull/265. This was about one very specific case where upgrade could fail.

What #265 fixed was about failing updates of very early CRDs before kube-apiserver was updated to 4.3.

Looking at https://search.svc.ci.openshift.org/?search=resource+may+have+been+deleted&maxAge=168h&context=2&type=all suggests to me that this is not the case anymore. Moving back to MODIFIED.
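If it helps with verification: the CRDs in question are the ones the CVO applies very early from the release payload. Roughly, something like the following would list them (a sketch only; the pullspec is a placeholder, and the 0000_* prefixes are simply how the payload manifests are ordered):

  # Extract the payload manifests; the low-numbered 0000_* files are applied first by the CVO.
  oc adm release extract --to=manifests <payload-pullspec>
  grep -l "kind: CustomResourceDefinition" manifests/0000_*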

Comment 6 Xingxing Xia 2019-12-13 08:02:57 UTC
While verifying bug 1779237#c4 yesterday, the cluster under test also failed, stuck in the state below for 20+ hours:
oc get clusterversion
version   4.2.0-0.nightly-2019-12-11-171302   True        True          22h     Working towards 4.3.0-0.nightly-2019-12-12-021332: 13% complete
[xxia 2019-12-13 15:47:48 my]$ oc get co | grep -v "True        False         False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress                                    4.3.0-0.nightly-2019-12-12-021332   False       True          True       21h
kube-apiserver                             4.3.0-0.nightly-2019-12-12-021332   True        True          True       23h
kube-controller-manager                    4.3.0-0.nightly-2019-12-12-021332   True        True          True       23h
kube-scheduler                             4.3.0-0.nightly-2019-12-12-021332   True        False         True       23h
machine-config                             4.2.0-0.nightly-2019-12-11-171302   False       True          True       21h
monitoring                                 4.3.0-0.nightly-2019-12-12-021332   False       True          True       21h
network                                    4.3.0-0.nightly-2019-12-12-021332   True        True          True       23h
[xxia 2019-12-13 15:48:14 my]$ oc get no
NAME                                         STATUS                        ROLES    AGE   VERSION
ip-10-0-135-186.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   23h   v1.14.6+cebabbf4a
ip-10-0-141-151.us-east-2.compute.internal   Ready                         master   23h   v1.14.6+cebabbf4a
ip-10-0-147-188.us-east-2.compute.internal   Ready                         worker   23h   v1.14.6+cebabbf4a
ip-10-0-153-212.us-east-2.compute.internal   NotReady,SchedulingDisabled   master   23h   v1.14.6+cebabbf4a
ip-10-0-170-139.us-east-2.compute.internal   Ready                         master   23h   v1.14.6+cebabbf4a
[xxia 2019-12-13 15:48:31 my]$ oc describe co kube-apiserver
...
Status:
  Conditions:
    Last Transition Time:  2019-12-12T10:12:55Z
    Message:               NodeControllerDegraded: The master node(s) "ip-10-0-153-212.us-east-2.compute.internal" not ready
    Reason:                NodeControllerDegradedMasterNodesReady
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-12-13T03:06:56Z
    Message:               Progressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 9
    Reason:                Progressing
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-12-12T08:11:12Z
    Message:               Available: 3 nodes are active; 3 nodes are at revision 7; 0 nodes have achieved new revision 9
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-12-12T08:10:06Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
...
[xxia 2019-12-13 15:52:22 my]$ oc get po -n openshift-kube-apiserver -l apiserver --show-labels
NAME                                                        READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ip-10-0-141-151.us-east-2.compute.internal   3/3     Running   0          22h   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-ip-10-0-153-212.us-east-2.compute.internal   3/3     Running   0          22h   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-ip-10-0-170-139.us-east-2.compute.internal   3/3     Running   0          22h   apiserver=true,app=openshift-kube-apiserver,revision=7
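A possible next step for a NotReady node like ip-10-0-153-212 would be to check the node conditions and the machine-config rollout (illustrative commands, not output I collected here):

  # The node conditions should say why the kubelet is NotReady.
  oc describe node ip-10-0-153-212.us-east-2.compute.internal
  # A stuck machine-config pool usually explains NotReady,SchedulingDisabled nodes during an upgrade.
  oc get mcp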

I checked https://openshift-release.svc.ci.openshift.org/ , clicked the latest payload, 4.3.0-0.nightly-2019-12-13-032731, and on https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-13-032731 saw "4.2.10 (changes) - Failed". Clicking it, I saw that https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12586 also shows:
Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-13-032731: 13% complete

I'm not sure if this failure is the same as the original bug report, but the 4.2 to 4.3 upgrade indeed does not work now.

Comment 8 liujia 2019-12-13 08:17:56 UTC
(In reply to Stefan Schimanski from comment #5)
> @liujia: Not every run failing with "Cluster upgrade should maintain a
> functioning cluster" is due to the CRD topic fixed in
> https://github.com/openshift/cluster-version-operator/pull/265. This was
> about one very specific case where upgrade could fail.
> 
> What #265 fixed was about failing updates of very early CRDs before
> kube-apiserver was updated to 4.3.
> 
> Looking at
> https://search.svc.ci.openshift.org/?search=resource+may+have+been+deleted&maxAge=168h&context=2&type=all
> suggests to me that this is not the case anymore. Moving back to MODIFIED.

Thanks for pointing out the concrete error info for the failing test [Disruptive] Cluster upgrade should maintain a functioning cluster. I double-confirmed that the same error was not present in job https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12127/ from the last verification.
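For the record, the check itself is simple once the job log is downloaded (illustrative only; build-log.txt is just the usual Prow artifact name):

  # Search the downloaded log for the error string from the CI search link above.
  grep -c "resource may have been deleted" build-log.txt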

Comment 9 liujia 2019-12-13 08:23:55 UTC
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12586 also shows:
> Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-13-032731: 13% complete
> 
> I'm not sure if this failure is same as original bug report. But 4.2 to 4.3
> upgrade indeed does not work now.

Double-checked https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12586; it is not the same as the CRD API version issue fixed in PR #265. So @xxia, you could file a new bug for your new issue. I will verify this bug.

Comment 10 Xingxing Xia 2019-12-13 09:17:25 UTC
(In reply to liujia from comment #9)
> So @xxia, you could file a new bug for your new issue
Thanks, it should be the same as bug 1778904.

Comment 12 errata-xmlrpc 2020-01-23 11:09:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 13 Red Hat Bugzilla 2023-09-14 05:44:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days