Bug 1870394

Summary:

Upgrade from 4.5.6 to 4.6.0-0.nightly-2020-08-18-165040 failed because one node is NotReady

Product:

OpenShift Container Platform

Reporter:

sunzhaohua <zhsun>

Component:

Node

Assignee:

Seth Jennings <sjenning>

Status:

CLOSED DUPLICATE

QA Contact:

Sunil Choudhary <schoudha>

Severity:

medium

Docs Contact:

Priority:

low

Version:

4.5

CC:

aos-bugs, jokerman, nagrawal, schoudha, wduan, wsun

Target Milestone:

---

Keywords:

Reopened, Upgrades

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2020-09-29 05:27:50 UTC

Type:

Bug

Attachments:

Description	Flags
coredump in zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn	none
0824-coresystemd	none

Description sunzhaohua 2020-08-20 00:01:24 UTC

Description of problem:
IPI on Azure with etcd encryption and FIPS enabled: the upgrade from 4.5.6 to 4.6.0-0.nightly-2020-08-18-165040 failed because one node became NotReady.

Version-Release number of selected component (if applicable):
4.5.6->4.6.0-0.nightly-2020-08-18-165040

How reproducible:
Seen once.

Steps to Reproduce:
1. Set up a 4.5.6 cluster. After setup, openshift-apiserver is not available, so apply the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1825219#c51, then upgrade.
openshift-apiserver                        4.6.0-0.nightly-2020-08-18-165040   False        False         False      8h

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        False         4h27m   Cluster version is 4.5.6


$ oc get node
NAME                                                STATUS   ROLES    AGE   VERSION
zhsunupgrade819-ctcx8-master-0                      Ready    master   72m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-1                      Ready    master   71m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-2                      Ready    master   72m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-lsccw   Ready    worker   56m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn   Ready    worker   56m   v1.18.3+002a51f

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.6     True        False         False      4h27m
cloud-credential                           4.5.6     True        False         False      4h56m
cluster-autoscaler                         4.5.6     True        False         False      4h43m
config-operator                            4.5.6     True        False         False      4h43m
console                                    4.5.6     True        False         False      4h9m
csi-snapshot-controller                    4.5.6     True        False         False      4h24m
dns                                        4.5.6     True        False         False      4h50m
etcd                                       4.5.6     True        False         False      4h49m
image-registry                             4.5.6     True        False         False      4h34m
ingress                                    4.5.6     True        False         False      4h34m
insights                                   4.5.6     True        False         False      4h44m
kube-apiserver                             4.5.6     True        False         False      4h49m
kube-controller-manager                    4.5.6     True        False         False      4h49m
kube-scheduler                             4.5.6     True        False         False      4h48m
kube-storage-version-migrator              4.5.6     True        False         False      4h21m
machine-api                                4.5.6     True        False         False      4h41m
machine-approver                           4.5.6     True        False         False      4h45m
machine-config                             4.5.6     True        False         False      4h42m
marketplace                                4.5.6     True        False         False      4h19m
monitoring                                 4.5.6     True        False         False      3m13s
network                                    4.5.6     True        False         False      4h51m
node-tuning                                4.5.6     True        False         False      4h51m
openshift-apiserver                        4.5.6     True        False         False      3h22m
openshift-controller-manager               4.5.6     True        False         False      4h44m
openshift-samples                          4.5.6     True        False         False      4h43m
operator-lifecycle-manager                 4.5.6     True        False         False      4h50m
operator-lifecycle-manager-catalog         4.5.6     True        False         False      4h51m
operator-lifecycle-manager-packageserver   4.5.6     True        False         False      4h19m
service-ca                                 4.5.6     True        False         False      4h51m
storage                                    4.5.6     True        False         False      4h44m

2. Upgrade to 4.6.0-0.nightly-2020-08-18-165040
$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-18-165040 --force --allow-explicit-upgrade

Actual results:
One node is in NotReady status, stuck at "FIPS mode initialized". After a reboot, the node returns to the Ready state, and the cluster upgrade to 4.6.0-0.nightly-2020-08-18-165040 finally succeeds.

Before rebooting the NotReady node:
sh-4.4# ssh -i openshift-qe.pem core@zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn
FIPS mode initialized
Red Hat Enterprise Linux CoreOS 45.82.202008101249-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html
---
[systemd]
Failed Units: 5
  crio-f4f3ecc7efafaab5f8fba7052ae9999ae6e5e449a5af64d4829d26d55fcc770f.scope
  afterburn-checkin.service
  chronyd.service
  systemd-coredump
  systemd-coredump 

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        True          3h42m   Unable to apply 4.6.0-0.nightly-2020-08-18-165040: the cluster operator monitoring has not yet successfully rolled out
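
For anyone re-checking this state, the clusterversion output above can be tested mechanically. This is only a minimal sketch: `upgrade_done` is a hypothetical helper name (not part of `oc`), and it assumes the default column layout shown above (NAME, VERSION, AVAILABLE, PROGRESSING, ...).

```shell
# upgrade_done TARGET
# Reads `oc get clusterversion` output on stdin. Succeeds (exit 0) only if the
# "version" row reports VERSION == TARGET, AVAILABLE == True and
# PROGRESSING == False. Hypothetical helper; assumes the default columns.
upgrade_done() {
  awk -v target="$1" '
    # Appending "" forces a string comparison of the version fields.
    $1 == "version" {
      seen = 1
      ok = (($2 "") == (target "") && $3 == "True" && $4 == "False")
    }
    END { if (seen && ok) exit 0; exit 1 }
  '
}

# Usage (assumes a logged-in oc client):
#   oc get clusterversion | upgrade_done 4.6.0-0.nightly-2020-08-18-165040 && echo "upgrade finished"
```

With the output above the check fails, because VERSION is still 4.5.6 and PROGRESSING is True.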

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      26s
cloud-credential                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
cluster-autoscaler                         4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
config-operator                            4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
console                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h24m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      40m
dns                                        4.5.6                               True        True          False      8h
etcd                                       4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
image-registry                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      42m
ingress                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h26m
insights                                   4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-apiserver                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-controller-manager                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-scheduler                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-18-165040   True        False         False      41m
machine-api                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
machine-approver                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
machine-config                             4.5.6                               False       False         True       3h9m
marketplace                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
monitoring                                 4.5.6                               False       False         True       3h23m
network                                    4.5.6                               True        True          True       8h
node-tuning                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
openshift-apiserver                        4.6.0-0.nightly-2020-08-18-165040   True        False         False      139m
openshift-controller-manager               4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h24m
openshift-samples                          4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-18-165040   True        False         False      166m
service-ca                                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
storage                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h26m
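
The operators holding the upgrade back can be picked out of a table like the one above with a small filter. A rough sketch, assuming the default `oc get co` columns (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE); `lagging_operators` is a hypothetical name introduced here for illustration.

```shell
# lagging_operators TARGET
# Reads `oc get co` output on stdin and prints every operator that is not at
# TARGET, or is not Available, or is Progressing, or is Degraded.
# Hypothetical helper; assumes the default column order.
lagging_operators() {
  awk -v target="$1" '
    # Skip the header row and any blank or short lines.
    NR > 1 && NF >= 5 &&
    (($2 "") != (target "") || $3 != "True" || $4 != "False" || $5 != "False") {
      print $1
    }
  '
}

# Usage (assumes a logged-in oc client):
#   oc get co | lagging_operators 4.6.0-0.nightly-2020-08-18-165040
```

Applied to the table above, this would list dns, machine-config, monitoring and network, the four operators still at 4.5.6.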

$ oc get node
NAME                                                STATUS     ROLES    AGE   VERSION
zhsunupgrade819-ctcx8-master-0                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-1                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-2                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-lsccw   Ready      worker   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn   NotReady   worker   8h    v1.18.3+002a51f
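
A similar filter can flag the affected node from the listing above. Again a minimal sketch under the assumption that STATUS is the second column; `not_ready_nodes` is a hypothetical helper name, not an `oc` subcommand.

```shell
# not_ready_nodes
# Reads `oc get node` output on stdin and prints the name of every node whose
# STATUS does not start with "Ready". A cordoned but healthy node
# ("Ready,SchedulingDisabled") is therefore not reported.
# Hypothetical helper; assumes STATUS is the second column.
not_ready_nodes() {
  awk 'NR > 1 && NF >= 2 && $2 !~ /^Ready(,|$)/ { print $1 }'
}

# Usage (assumes a logged-in oc client):
#   oc get node | not_ready_nodes
```

For the listing above this prints only zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn.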



Expected results:
The upgrade succeeds.

Additional info:
must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.7896037696304830913.zip
cluster: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/107252/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Comment 1 sunzhaohua 2020-08-20 07:17:34 UTC

Created attachment 1711968 [details]
coredump in zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn

Comment 2 Wei Duan 2020-08-24 10:20:41 UTC

I hit a similar issue when upgrading from 4.5.0-0.nightly-2020-08-23-191713 to 4.6.0-0.nightly-2020-08-23-185640.
Two workers went NotReady during the upgrade.

$ oc get node
NAME                                        STATUS     ROLES    AGE     VERSION
ip-10-0-55-19.us-east-2.compute.internal    NotReady   worker   6h44m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-58-5.us-east-2.compute.internal     Ready      worker   6h45m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-58-77.us-east-2.compute.internal    Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-60-223.us-east-2.compute.internal   Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-77-95.us-east-2.compute.internal    Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-79-138.us-east-2.compute.internal   NotReady   worker   6h45m   v1.19.0-rc.2+3e083ac-dirty


After a manual reboot (done with @Sunil's help), the nodes are back in the Ready state. The related coredumps are attached. The OS version is pasted below.

ip-10-0-79-138.us-east-2.compute.internal   Ready    worker   7h29m   v1.19.0-rc.2+3e083ac-dirty   10.0.79.138   <none>        Red Hat Enterprise Linux CoreOS 46.82.202008231640-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-87.rhaos4.6.git5f02fa2.el8-dev

Comment 3 Wei Duan 2020-08-24 10:23:34 UTC

Created attachment 1712329 [details]
0824-coresystemd

Comment 4 Seth Jennings 2020-08-28 20:55:04 UTC

There was a recent issue where FIPS enablement raced and could prevent boot:
https://bugzilla.redhat.com/show_bug.cgi?id=1862957

Fixed by
https://github.com/openshift/installer/pull/4066

*** This bug has been marked as a duplicate of bug 1862957 ***