Bug 1688454

Summary: OCP4 Upgrade test failed on ci-builds
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.1.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Reporter: Hongkai Liu <hongkliu>
Assignee: Abhinav Dahiya <adahiya>
QA Contact: Hongkai Liu <hongkliu>
CC: adahiya, aos-bugs, hongkliu, jmencak, jokerman, mmccomas
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Target Milestone: ---
Type: Bug
Last Closed: 2019-06-04 10:45:47 UTC

Description Hongkai Liu 2019-03-13 18:09:16 UTC
The upgrade was done on ci-builds.
After the upgrade, the console is not accessible and the oc CLI is very slow.

I am wondering what to do in this situation. Is there a rollback if the upgrade fails?


Here are the steps:

# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-11-054346   True        False         2m31s   Cluster version is 4.0.0-0.ci-2019-03-11-054346


# oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
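
### sanity check (not part of the original transcript): confirm the upstream URL landed in the spec
# oc get clusterversion version -o jsonpath='{.spec.upstream}{"\n"}'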

# oc get clusterversion version -o json | jq -r '.status.availableUpdates'
[
  {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-11-063655",
    "version": "4.0.0-0.ci-2019-03-11-063655"
  },
  {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-11-070013",
    "version": "4.0.0-0.ci-2019-03-11-070013"
  },
  {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-11-063655",
    "version": "4.0.0-0.ci-2019-03-11-063655"
  },
  {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-11-070013",
    "version": "4.0.0-0.ci-2019-03-11-070013"
  }
]
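
### note (added for reference, not from the original run): availableUpdates can list the same payload
### more than once, as above; a de-duplicated view of the candidate versions:
# oc get clusterversion version -o json | jq -r '.status.availableUpdates | unique_by(.version) | .[].version'
### the same candidates are also shown by running `oc adm upgrade` with no arguments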



Perform upgrade:

# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-11-063655
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-11-063655

### UI shows: UPDATE STATUS: Updating
### the desired version has changed as well
# oc get clusterversion version -o json | jq .status.desired.version
"4.0.0-0.ci-2019-03-11-063655"

### check the update progress
# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-11-063655   True        True          3m59s   Working towards 4.0.0-0.ci-2019-03-11-063655: 9% complete
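
### the rollout can also be followed with a simple poll instead of re-running the command by hand
### (a sketch, not from the original run):
# watch -n 30 'oc get clusterversion version'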



# oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-11-063655   True        True          31m     Working towards 4.0.0-0.ci-2019-03-11-063655: 29% complete

# oc status
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get imagestreams.image.openshift.io)
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get builds.build.openshift.io)
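
### the ServiceUnavailable errors above come from aggregated API groups served by openshift-apiserver;
### a way to see which aggregated APIs are unavailable (a sketch, not from the original run, assuming
### the standard openshift-apiserver namespace):
# oc get apiservices | grep False
# oc -n openshift-apiserver get pods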

# oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-11-063655   True        True          46m     Unable to apply 4.0.0-0.ci-2019-03-11-063655: the cluster operator machine-config has not yet successfully rolled out
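
### machine-config is the operator blocking the upgrade at this point; its pool and pod status can be
### inspected with (a sketch, not from the original run, assuming the default namespace name):
# oc get machineconfigpools
# oc -n openshift-machine-config-operator get pods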


# time oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-11-063655   True        True          87m     Unable to apply 4.0.0-0.ci-2019-03-11-063655: the cluster operator machine-config is failing

real	1m30.651s
user	0m0.170s
sys	0m0.039s


### after over 2 hours, the oc CLI seems fine, but the console is still down
# time oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-11-063655   True        True          141m    Unable to apply 4.0.0-0.ci-2019-03-11-063655: the cluster operator openshift-cloud-credential-operator is failing

real	0m0.717s
user	0m0.220s
sys	0m0.045s
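
### a quick way to see which cluster operators are still unsettled at this point (a sketch, not from
### the original run; the operator name is taken from the status message above):
# oc get clusteroperators
# oc get clusteroperator openshift-cloud-credential-operator -o yaml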

# time oc status
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get builds.build.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get imagestreams.image.openshift.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get deploymentconfigs.apps.openshift.io)

real	0m0.757s
user	0m0.184s
sys	0m0.038s

I will upload the journal logs from the masters.

Comment 3 Scott Dodson 2019-03-15 19:37:27 UTC
This is all just coming together right now; please check again in one week's time. Please close if it's resolved then.

Comment 4 Hongkai Liu 2019-03-15 21:50:27 UTC
Will check in a week.
I also heard from my team that they have had several successful updates.

Comment 5 Hongkai Liu 2019-03-19 16:48:54 UTC
Upgrade from
4.0.0-0.ci-2019-03-19-102101
to
4.0.0-0.ci-2019-03-19-122710
succeeded.

The only issue is that the `% complete` number showing the progress is not monotonically increasing.
Is that expected?

# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-19-122710
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-19-122710

...

#  oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-19-122710   True        True          11m     Working towards 4.0.0-0.ci-2019-03-19-122710: 24% complete

root@ip-172-31-31-218: ~ #  oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-19-122710   True        True          14m     Working towards 4.0.0-0.ci-2019-03-19-122710: 33% complete

#  oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-19-122710   True        True          16m     Working towards 4.0.0-0.ci-2019-03-19-122710: 37% complete

#  oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-19-122710   True        True          18m     Working towards 4.0.0-0.ci-2019-03-19-122710: 2% complete
#  oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-19-122710   True        True          18m     Working towards 4.0.0-0.ci-2019-03-19-122710: 16% complete

...

#  oc get clusterversion version
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-19-122710   True        False         18m     Cluster version is 4.0.0-0.ci-2019-03-19-122710
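
For reference, a rough way to capture how the reported percentage moves over time (an untested sketch, not something that was run here):

# while true; do echo "$(date -u +%H:%M:%S) $(oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}')"; sleep 60; done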

Comment 7 Abhinav Dahiya 2019-03-27 18:35:14 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1688454#c5 is not related to this bug.


> The upgrade was done on ci-builds.
> After the upgrade, the console is not accessible and the oc CLI is very slow.

> I am wondering what to do in this situation. Is there a rollback if the upgrade fails?

Rollbacks are not supported, nor are they performed automatically.

And based on these logs:
> # time oc status
> Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
> Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)
> Error from server (ServiceUnavailable): the server is currently unable to handle the request (get builds.build.openshift.io)
> Error from server (ServiceUnavailable): the server is currently unable to handle the request (get imagestreams.image.openshift.io)
> Error from server (ServiceUnavailable): the server is currently unable to handle the request (get deploymentconfigs.apps.openshift.io)

There was a failure to upgrade openshift-apiserver.

Can you let us know whether the upgrade is still failing? If it is, please provide `oc get co -oyaml` and `oc get clusterversion -oyaml` so that we can track which operator is failing.
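
For example, a minimal way to collect and attach what is being asked for (a sketch):

# oc get co -oyaml > clusteroperators.yaml
# oc get clusterversion -oyaml > clusterversion.yaml
# oc get co    ### quick look at which operators are not Available or are Degraded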

Comment 8 Hongkai Liu 2019-03-27 19:16:08 UTC
Thanks, Abhinav,

I already redid the test in comment 5.
The upgrade went fine there, except that the progress percentage is not monotonically increasing.

Let me know if you think I should run the update again.

Comment 9 Abhinav Dahiya 2019-03-27 19:19:46 UTC
(In reply to Hongkai Liu from comment #8)
> Thanks, Abhinav,
> 
> I already redid the test in comment 5.
> The upgrade went fine there, except that the progress percentage is not monotonically increasing.

That's a separate issue, and we are already tracking it. :)

> Let me know if you think I should run the update again.

Comment 11 errata-xmlrpc 2019-06-04 10:45:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758