Bug 1870394

Summary: Upgrade from 4.5.6 to 4.6.0-0.nightly-2020-08-18-165040 failed due to one node is NotReady
Product: OpenShift Container Platform Reporter: sunzhaohua <zhsun>
Component: Node Assignee: Seth Jennings <sjenning>
Status: CLOSED DUPLICATE QA Contact: Sunil Choudhary <schoudha>
Severity: medium Docs Contact:
Priority: low    
Version: 4.5 CC: aos-bugs, jokerman, nagrawal, schoudha, wduan, wsun
Target Milestone: --- Keywords: Reopened, Upgrades
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-29 05:27:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                                      Flags
coredump in zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn   none
0824-coresystemd                                                 none

Description sunzhaohua 2020-08-20 00:01:24 UTC
Description of problem:
IPI on Azure with etcd encryption and FIPS enabled. Upgrade from 4.5.6 to 4.6.0-0.nightly-2020-08-18-165040 failed because one node is NotReady.

Version-Release number of selected component (if applicable):
4.5.6->4.6.0-0.nightly-2020-08-18-165040

How reproducible:
Seen once

Steps to Reproduce:
1. Set up a 4.5.6 cluster. After setup, openshift-apiserver is not available, so apply the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1825219#c51, then upgrade.
openshift-apiserver                        4.6.0-0.nightly-2020-08-18-165040   False        False         False      8h

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        False         4h27m   Cluster version is 4.5.6


$ oc get node
NAME                                                STATUS   ROLES    AGE   VERSION
zhsunupgrade819-ctcx8-master-0                      Ready    master   72m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-1                      Ready    master   71m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-2                      Ready    master   72m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-lsccw   Ready    worker   56m   v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn   Ready    worker   56m   v1.18.3+002a51f

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.6     True        False         False      4h27m
cloud-credential                           4.5.6     True        False         False      4h56m
cluster-autoscaler                         4.5.6     True        False         False      4h43m
config-operator                            4.5.6     True        False         False      4h43m
console                                    4.5.6     True        False         False      4h9m
csi-snapshot-controller                    4.5.6     True        False         False      4h24m
dns                                        4.5.6     True        False         False      4h50m
etcd                                       4.5.6     True        False         False      4h49m
image-registry                             4.5.6     True        False         False      4h34m
ingress                                    4.5.6     True        False         False      4h34m
insights                                   4.5.6     True        False         False      4h44m
kube-apiserver                             4.5.6     True        False         False      4h49m
kube-controller-manager                    4.5.6     True        False         False      4h49m
kube-scheduler                             4.5.6     True        False         False      4h48m
kube-storage-version-migrator              4.5.6     True        False         False      4h21m
machine-api                                4.5.6     True        False         False      4h41m
machine-approver                           4.5.6     True        False         False      4h45m
machine-config                             4.5.6     True        False         False      4h42m
marketplace                                4.5.6     True        False         False      4h19m
monitoring                                 4.5.6     True        False         False      3m13s
network                                    4.5.6     True        False         False      4h51m
node-tuning                                4.5.6     True        False         False      4h51m
openshift-apiserver                        4.5.6     True        False         False      3h22m
openshift-controller-manager               4.5.6     True        False         False      4h44m
openshift-samples                          4.5.6     True        False         False      4h43m
operator-lifecycle-manager                 4.5.6     True        False         False      4h50m
operator-lifecycle-manager-catalog         4.5.6     True        False         False      4h51m
operator-lifecycle-manager-packageserver   4.5.6     True        False         False      4h19m
service-ca                                 4.5.6     True        False         False      4h51m
storage                                    4.5.6     True        False         False      4h44m

2. Upgrade to 4.6.0-0.nightly-2020-08-18-165040
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-18-165040 --force --allow-explicit-upgrade

Actual results:
One node is in NotReady status; it is stuck at "FIPS mode initialized". After a reboot, the node comes back in the Ready state, and the cluster upgrade to 4.6.0-0.nightly-2020-08-18-165040 eventually succeeds.

Before rebooting the NotReady node:
sh-4.4# ssh -i openshift-qe.pem core@zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn
FIPS mode initialized
Red Hat Enterprise Linux CoreOS 45.82.202008101249-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html
---
[systemd]
Failed Units: 5
  crio-f4f3ecc7efafaab5f8fba7052ae9999ae6e5e449a5af64d4829d26d55fcc770f.scope
  afterburn-checkin.service
  chronyd.service
  systemd-coredump
  systemd-coredump 

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        True          3h42m   Unable to apply 4.6.0-0.nightly-2020-08-18-165040: the cluster operator monitoring has not yet successfully rolled out

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      26s
cloud-credential                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
cluster-autoscaler                         4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
config-operator                            4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
console                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h24m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      40m
dns                                        4.5.6                               True        True          False      8h
etcd                                       4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
image-registry                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      42m
ingress                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h26m
insights                                   4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-apiserver                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-controller-manager                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-scheduler                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-18-165040   True        False         False      41m
machine-api                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
machine-approver                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
machine-config                             4.5.6                               False       False         True       3h9m
marketplace                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
monitoring                                 4.5.6                               False       False         True       3h23m
network                                    4.5.6                               True        True          True       8h
node-tuning                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
openshift-apiserver                        4.6.0-0.nightly-2020-08-18-165040   True        False         False      139m
openshift-controller-manager               4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h24m
openshift-samples                          4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h25m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-18-165040   True        False         False      166m
service-ca                                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      8h
storage                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      3h26m
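
For triage, a listing like the one above can be filtered down to the operators that still need attention. The following is a minimal sketch, not part of the original report: `unhealthy_operators` is a hypothetical helper name, and it assumes the default `oc get co` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE).

```shell
# Hypothetical helper (not an oc subcommand): reads `oc get co` output on
# stdin and prints the operators that are not at the target version, are
# not Available, or are Degraded.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
unhealthy_operators() {
  target="$1"
  awk -v target="$target" '
    $1 == "NAME" { next }   # skip the header row
    NF < 5       { next }   # skip blank or truncated rows
    $2 != target || $3 != "True" || $5 != "False" { print $1 }
  '
}
```

Usage would look like `oc get co | unhealthy_operators 4.6.0-0.nightly-2020-08-18-165040`. Against the listing above it prints dns, machine-config, monitoring, and network.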

$ oc get node
NAME                                                STATUS     ROLES    AGE   VERSION
zhsunupgrade819-ctcx8-master-0                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-1                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-master-2                      Ready      master   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-lsccw   Ready      worker   8h    v1.18.3+002a51f
zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn   NotReady   worker   8h    v1.18.3+002a51f
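
To spot the affected node without reading the table by eye, the `oc get node` output can be filtered on the STATUS column. This is a sketch under assumptions, not something from the original report: `not_ready_nodes` is a hypothetical helper name, and it treats a cordoned status such as `Ready,SchedulingDisabled` as Ready because only the first comma-separated token is compared.

```shell
# Hypothetical helper: reads `oc get node` output on stdin and prints the
# names of nodes whose primary status is anything other than Ready.
# Assumed columns: NAME STATUS ROLES AGE VERSION
not_ready_nodes() {
  awk '
    $1 == "NAME" { next }   # skip the header row
    NF < 2       { next }   # skip blank rows
    {
      # STATUS may carry suffixes such as ",SchedulingDisabled";
      # only the first token decides readiness here.
      split($2, status, ",")
      if (status[1] != "Ready") print $1
    }
  '
}
```

Usage would look like `oc get node | not_ready_nodes`. For the listing above it prints zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn.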



Expected results:
Upgrade is successful.

Additional info:
must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.7896037696304830913.zip
cluster: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/107252/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Comment 1 sunzhaohua 2020-08-20 07:17:34 UTC
Created attachment 1711968 [details]
coredump in zhsunupgrade819-ctcx8-worker-northcentralus-pl9xn

Comment 2 Wei Duan 2020-08-24 10:20:41 UTC
I hit a similar issue when upgrading from 4.5.0-0.nightly-2020-08-23-191713 to 4.6.0-0.nightly-2020-08-23-185640.
Two workers went NotReady during the upgrade.

$ oc get node
NAME                                        STATUS     ROLES    AGE     VERSION
ip-10-0-55-19.us-east-2.compute.internal    NotReady   worker   6h44m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-58-5.us-east-2.compute.internal     Ready      worker   6h45m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-58-77.us-east-2.compute.internal    Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-60-223.us-east-2.compute.internal   Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-77-95.us-east-2.compute.internal    Ready      master   6h54m   v1.19.0-rc.2+3e083ac-dirty
ip-10-0-79-138.us-east-2.compute.internal   NotReady   worker   6h45m   v1.19.0-rc.2+3e083ac-dirty


After a manual reboot (with @Sunil's help), the nodes are back in the Ready state. The related coredumps are attached, and the OS version is pasted below.

ip-10-0-79-138.us-east-2.compute.internal   Ready    worker   7h29m   v1.19.0-rc.2+3e083ac-dirty   10.0.79.138   <none>        Red Hat Enterprise Linux CoreOS 46.82.202008231640-0 (Ootpa)   4.18.0-211.el8.x86_64   cri-o://1.19.0-87.rhaos4.6.git5f02fa2.el8-dev

Comment 3 Wei Duan 2020-08-24 10:23:34 UTC
Created attachment 1712329 [details]
0824-coresystemd

Comment 4 Seth Jennings 2020-08-28 20:55:04 UTC
There was a recent issue where FIPS enablement raced and could prevent boot:
https://bugzilla.redhat.com/show_bug.cgi?id=1862957

Fixed by
https://github.com/openshift/installer/pull/4066

*** This bug has been marked as a duplicate of bug 1862957 ***
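
Once an affected node is reachable again (for example after the manual reboot described in this report), it may help to confirm whether the kernel reports FIPS mode as enabled. The sketch below is an illustration rather than anything from the original thread: `fips_state` is a hypothetical helper name, and it relies on the Linux kernel exposing the flag at /proc/sys/crypto/fips_enabled. The optional path argument exists only so the function can be exercised against a scratch file.

```shell
# Hypothetical helper: prints "enabled" when the kernel FIPS flag reads 1,
# and "disabled" when it reads anything else or the file is unreadable.
fips_state() {
  # Default to the kernel-provided flag; a different path can be passed in
  # for testing.
  flag_file="${1:-/proc/sys/crypto/fips_enabled}"
  if [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]; then
    echo enabled
  else
    echo disabled
  fi
}
```

Run on the node itself, `fips_state` with no argument reads the kernel flag directly.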