Bug 1962525 - [Migration] SDN migration stuck on MCO on RHV cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-20 09:04 UTC by huirwang
Modified: 2021-07-27 23:09 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:09:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:09:55 UTC

Description huirwang 2021-05-20 09:04:14 UTC
Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-05-18-072155

How reproducible:
Not sure


Steps to Reproduce:
1. Enable the migration prepare state by patching the Cluster Network Operator configuration object:
oc patch Network.operator.openshift.io cluster --type='merge' \
  --patch '{ "spec": { "migration": {"networkType": "OVNKubernetes" } } }'

2. Wait for the MCO to update the machines in each machine config pool.
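The rollout in step 2 can be followed from the machine config pools. A sketch of standard `oc` invocations that could be used here (not taken from the original report):

```shell
# Watch the machine config pools; the rollout is done when every pool
# reports UPDATED=True and UPDATING=False.
oc get machineconfigpool -w

# Inspect the pool conditions if a pool appears stuck.
oc describe machineconfigpool master

# Cross-check node status; a node held mid-update typically shows
# "Ready,SchedulingDisabled" as in the output below.
oc get nodes
```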


Actual results:

One node is stuck in "Ready,SchedulingDisabled".

$ oc get nodes
NAME                          STATUS                     ROLES    AGE   VERSION
cluster0-hhl96-master-0       Ready                      master   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-master-1       Ready,SchedulingDisabled   master   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-master-2       Ready                      master   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-worker-mvvqr   Ready                      worker   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-worker-tf97g   Ready                      worker   22h   v1.21.0-rc.0+9d99e1c
cluster0-hhl96-worker-vwnj6   Ready                      worker   22h   v1.21.0-rc.0+9d99e1c

I0520 08:24:51.089781    9274 daemon.go:330] evicting pod openshift-etcd/etcd-quorum-guard-56b9d5858-nqn69
E0520 08:24:51.099791    9274 daemon.go:330] error when evicting pods/"etcd-quorum-guard-56b9d5858-nqn69" -n "openshift-etcd" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0520 08:24:56.100528    9274 daemon.go:330] evicting pod openshift-etcd/etcd-quorum-guard-56b9d5858-nqn69
E0520 08:24:56.111067    9274 daemon.go:330] error when evicting pods/"etcd-quorum-guard-56b9d5858-nqn69" -n "openshift-etcd" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0520 08:25:01.111573    9274 daemon.go:330] evicting pod openshift-etcd/etcd-quorum-guard-56b9d5858-nqn69
E0520 08:25:01.119942    9274 daemon.go:330] error when evicting pods/"etcd-quorum-guard-56b9d5858-nqn69" -n "openshift-etcd" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
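The eviction loop above is the machine-config daemon being blocked by the etcd-quorum-guard PodDisruptionBudget, which forbids evictions while the allowed-disruptions count is 0. A hedged sketch of commands to inspect the budget and the guard pods (standard `oc` usage, not from the report; the label selector is an assumption for illustration):

```shell
# Show the PodDisruptionBudget blocking the eviction; "ALLOWED DISRUPTIONS"
# of 0 means no quorum-guard pod may currently be evicted.
oc get poddisruptionbudget -n openshift-etcd

# Check whether all etcd-quorum-guard replicas are Ready; an unready
# replica keeps allowed disruptions at 0 and stalls the node drain.
# (Label selector assumed for illustration.)
oc get pods -n openshift-etcd -l app=etcd-quorum-guard -o wide
```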




Expected results:
The cluster migrates to OVNKubernetes successfully.


Additional info:

Comment 14 Peng Liu 2021-05-26 07:32:54 UTC
I found a systemd unit 'var-lib-etcd.mount' on the master node, which mounts the etcd directory as a ramdisk. I think that is why we lost the files after the node rebooted.
This systemd unit was injected by a MachineConfig '99-installer-ignition-master', which I don't see on other platforms.

@mburman Do you know why we have this MachineConfig for this cluster?
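For reference, a tmpfs mount of the etcd directory like the one described above would look roughly like this. This is a hypothetical sketch; the actual contents of 'var-lib-etcd.mount' from '99-installer-ignition-master' are not shown in this report:

```ini
# Hypothetical systemd mount unit (var-lib-etcd.mount).
# Unit name must match the mount path, so /var/lib/etcd -> var-lib-etcd.mount.
# tmpfs lives in RAM, so its contents do not survive a reboot, which
# would explain the etcd files lost when the node restarted.
[Unit]
Description=Mount etcd data directory as tmpfs
Before=local-fs.target

[Mount]
What=tmpfs
Where=/var/lib/etcd
Type=tmpfs

[Install]
WantedBy=local-fs.target
```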

Comment 15 Guilherme Santos 2021-05-26 07:57:45 UTC
Hi @pliu, regarding the etcd mounted in a ramdisk (tmpfs): this is actually part of our installation approach. We were having some issues with hardware and storage, so we use a script to perform this injection (when generating the ign files). It seems there were some leftovers and the script ran in the last deployment. My bad for that, and thanks for finding it!
Do you want us to deploy a new env without the etcd mounted in the tmpfs?
Thanks

Comment 21 errata-xmlrpc 2021-07-27 23:09:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

