Bug 1713228
| Summary: | Upgrade fail if an apiserver on the same node as CVO fails | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomáš Nožička <tnozicka> |
| Component: | Cluster Version Operator | Assignee: | Abhinav Dahiya <adahiya> |
| Status: | CLOSED NOTABUG | QA Contact: | liujia <jiajliu> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.1.0 | CC: | aos-bugs, bleanhar, ccoleman, erich, jokerman, mmccomas, wking, wsun, xxia |
| Target Milestone: | --- | Keywords: | NeedsTestCase |
| Target Release: | 4.3.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-30 17:31:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Tomáš Nožička
2019-05-23 08:10:30 UTC
Reproduced during an upgrade from 4.1.0-0.nightly-2019-05-21-060354 to 4.1.0-0.nightly-2019-05-22-050858.

$ oc get po -n openshift-kube-apiserver
kube-apiserver-ip-10-0-128-254.sa-east-1.compute.internal   2/2   Running   0   21m
kube-apiserver-ip-10-0-133-211.sa-east-1.compute.internal   2/2   Running   0   22m
kube-apiserver-ip-10-0-157-53.sa-east-1.compute.internal    2/2   Running   0   19m

`oc get po -n openshift-cluster-version -o wide` shows that the CVO pod is on 10.0.128.254, so make the pod kube-apiserver-ip-10-0-128-254.sa-east-1.compute.internal fail:

$ ssh-ocp4 core.128.254
[core@ip-10-0-128-254 ~]$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml ~/

$ oc get po -n openshift-kube-apiserver
kube-apiserver-ip-10-0-133-211.sa-east-1.compute.internal   2/2   Running   0   22m
kube-apiserver-ip-10-0-157-53.sa-east-1.compute.internal    2/2   Running   0   19m

Then run `oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-22-050858 --force`:

$ watch oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-21-060354   True        False         33m     Cluster version is 4.1.0-0.nightly-2019-05-21-060354

The output stays like this indefinitely; the upgrade makes no progress.

Workaround: repeat the following until the CVO pod is rescheduled to a master other than 10.0.128.254:

$ oc delete po -n openshift-cluster-version -l k8s-app=cluster-version-operator
$ oc get po -n openshift-cluster-version -o wide

After that, `oc get clusterversion` starts showing upgrade progress.

> Given it uses localhost we likely need to run it in HA mode on all masters with leader election.
Or we could have it exit if the local API server was unreachable, in which case it would be automatically rescheduled, possibly to a node with a working API server. If it landed on a node with a broken API server, it would just die again.
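For illustration only, the "exit when the local API server is unreachable" idea could look roughly like the sketch below. It is written in Python for brevity and is not CVO code. The function name `watch_local_apiserver`, the `probe` callable, and the failure threshold are all hypothetical. How the probe actually reaches the local API server is deliberately left to the caller.

```python
import time
from typing import Callable, Optional


def watch_local_apiserver(
    probe: Callable[[], bool],
    max_consecutive_failures: int = 3,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    max_probes: Optional[int] = None,
) -> bool:
    """Poll a health probe and report when the operator should give up.

    Returns True once `probe` has failed `max_consecutive_failures` times
    in a row. Returns False if `max_probes` probes complete first
    (max_probes=None means poll forever).

    `probe` is a hypothetical callable supplied by the caller; it should
    return True when the local API server answered and False otherwise.
    """
    failures = 0
    probes = 0
    while max_probes is None or probes < max_probes:
        probes += 1
        if probe():
            # Any success resets the streak, so a brief blip does not
            # restart the operator on the happy path.
            failures = 0
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                return True
        sleep(interval_seconds)
    return False
```

A caller would exit with a non-zero status when the function returns True, so that the pod is restarted and possibly rescheduled. Requiring several consecutive failures is one way to limit restarts when the API server is only briefly unavailable.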
We briefly discussed the same idea before, but I didn't like the multiple restarts on the happy path. Also, the image already being present on one node should make the scheduler prefer to retry it there. But you need leader election even at a scale of 1 with the recreate strategy; hopefully CVO already does that, so this would just mean changing the scale from 1 to 3 and setting anti-affinity.

This is more severe for single-master clusters.

This is only a problem if the cluster is single-master, which is unsupported; if it's multi-master, things will eventually fail over and the upgrade will continue.