Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-05-18-072155

How reproducible:
Not sure

Steps to Reproduce:
1. Enable the migration prepare state by setting the migration field on the Cluster Network Operator configuration object:

   oc patch Network.operator.openshift.io cluster --type='merge' \
     --patch '{ "spec": { "migration": {"networkType": "OVNKubernetes" } } }'

2. Wait for the MCO to update the machines in each config pool.

Actual results:
One node is stuck in "Ready,SchedulingDisabled".

$ oc get nodes
NAME                          STATUS                     ROLES    AGE   VERSION
cluster0-hhl96-master-0       Ready                      master   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-master-1       Ready,SchedulingDisabled   master   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-master-2       Ready                      master   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-worker-mvvqr   Ready                      worker   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-worker-tf97g   Ready                      worker   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-worker-vwnj6   Ready                      worker   22h   v1.21.0-rc.0+9d99e1c

I0520 08:24:51.089781 9274 daemon.go:330] evicting pod openshift-etcd/etcd-quorum-guard-56b9d5858-nqn69
E0520 08:24:51.099791 9274 daemon.go:330] error when evicting pods/"etcd-quorum-guard-56b9d5858-nqn69" -n "openshift-etcd" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0520 08:24:56.100528 9274 daemon.go:330] evicting pod openshift-etcd/etcd-quorum-guard-56b9d5858-nqn69
E0520 08:24:56.111067 9274 daemon.go:330] error when evicting pods/"etcd-quorum-guard-56b9d5858-nqn69" -n "openshift-etcd" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0520 08:25:01.111573 9274 daemon.go:330] evicting pod openshift-etcd/etcd-quorum-guard-56b9d5858-nqn69
E0520 08:25:01.119942 9274 daemon.go:330] error when evicting pods/"etcd-quorum-guard-56b9d5858-nqn69" -n "openshift-etcd" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Expected results:
Migrated to OVN successfully.

Additional info:
I found a systemd unit 'var-lib-etcd.mount' on the master node, which mounts the etcd directory as a ramdisk. I think that is why we lost the files after the node reboot. This systemd unit was injected by a MachineConfig '99-installer-ignition-master', which I don’t see on other platforms. @mburman Do you know why we have this MachineConfig for this cluster?
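For anyone who wants to check another node for the same condition, here is a minimal sketch of a shell helper that reports the filesystem type backing a mount point. The function name `mount_fstype` is hypothetical (it is not part of any OpenShift tooling), and it assumes the standard /proc/mounts column layout (device, mount point, filesystem type, options).

```shell
# Hypothetical helper: print the filesystem type of a mount point by reading
# a mounts table in /proc/mounts format (device mountpoint fstype options ...).
# Usage: mount_fstype <mountpoint> [mounts-file]
mount_fstype() {
    mountpoint="$1"
    table="${2:-/proc/mounts}"
    # Keep the last matching entry, since a later mount shadows an earlier one.
    # Exit non-zero when the path is not a mount point in the table.
    awk -v mp="$mountpoint" '
        $2 == mp { t = $3 }
        END { if (t != "") print t; else exit 1 }
    ' "$table"
}
```

Run on the node itself (for example from a shell obtained with `oc debug node/<node-name>` followed by `chroot /host`), `mount_fstype /var/lib/etcd` printing `tmpfs` would suggest that etcd data lives in memory and will not survive a reboot. A non-zero exit status means the path is not a separate mount point.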
Hi @pliu, regarding the etcd mounted on a ramdisk (tmpfs): this is actually an installation approach we adopted because we were having some hardware and storage issues, and we use a script to perform this injection (when generating the ign files). It seems there were some leftovers and the script ran in the last deployment. My bad for that, and thanks for finding it! Do you want us to deploy a new env without etcd mounted on tmpfs? Thanks
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438