Bug 2047257

Summary: [CP MIGRATION] Node drain failure during control plane node migration
Product: OpenShift Container Platform
Reporter: Jon Uriarte <juriarte>
Component: Cloud Compute
Assignee: Pierre Prinetti <pprinett>
Cloud Compute sub component: OpenStack Provider
QA Contact: Jon Uriarte <juriarte>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: m.andre, mfedosin, pprinett
Version: 4.10
Keywords: Triaged
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-08-10 10:44:20 UTC
Type: Bug

Description Jon Uriarte 2022-01-27 13:23:45 UTC
Description of problem:

Control plane node migration cannot be performed because the node drain step fails: oc adm drain refuses to delete the guard pods on the node, which are not managed by a controller.


Version-Release number of selected component (if applicable):
OCP 4.10.0-0.nightly-2022-01-25-023600
OSP 16.1.7

How reproducible: always


Steps to Reproduce:
1. Install OCP 4.10
2. Follow the control plane (CP) node migration procedure described here: https://github.com/openshift/installer/tree/master/docs/user/openstack#control-plane-node-migration

Actual results:

$ OS_CLOUD=overcloud ./cp_node_migration.sh ostest-kznkt-master-0                                                                                                                                                      
+ declare -r node_name=ostest-kznkt-master-0
+ declare server_id
++ openstack server list --all-projects -f value -c ID -c Name
++ grep ostest-kznkt-master-0
++ cut '-d ' -f1
+ server_id=6b5e7191-a2b8-41e7-9be0-d269ebc09e5c
+ readonly server_id
+ oc adm cordon ostest-kznkt-master-0
node/ostest-kznkt-master-0 cordoned
+ oc adm drain ostest-kznkt-master-0 --delete-emptydir-data --ignore-daemonsets
node/ostest-kznkt-master-0 already cordoned
error: unable to drain node "ostest-kznkt-master-0" due to error:cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): openshift-kube-apiserver/kube-apiserver-guard-ostest-kznkt-master-0, openshift-kube-controller-manager/kube-controller-manager-guard-ostest-kznkt-master-0, openshift-kube-scheduler/openshift-kube-scheduler-guard-ostest-kznkt-master-0, continuing command...                            
There are pending nodes to be drained:
 ostest-kznkt-master-0
cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): openshift-kube-apiserver/kube-apiserver-guard-ostest-kznkt-master-0, openshift-kube-controller-manager/kube-controller-manager-guard-ostest-kznkt-master-0, openshift-kube-scheduler/openshift-kube-scheduler-guard-ostest-kznkt-master-0
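
The error above names the pods that block the drain. As a minimal sketch (not part of the migration script; the function name unmanaged_pods is hypothetical), the namespace/pod pairs could be extracted from the message like this, assuming the message keeps the format shown in this report:

```python
import re
from typing import List, Tuple

# Text that precedes the pod list in the drain error shown in this report.
MARKER = "(use --force to override):"


def unmanaged_pods(drain_error: str) -> List[Tuple[str, str]]:
    """Return (namespace, pod) pairs listed after the marker in a drain error.

    Hypothetical helper: it only understands the message format seen in
    this bug report and returns an empty list for anything else.
    """
    _, found, tail = drain_error.partition(MARKER)
    if not found:
        return []
    pods = []
    for item in tail.split(","):
        item = item.strip()
        # Keep only "namespace/name" entries; this skips trailing text
        # such as "continuing command..." that follows the list.
        if re.fullmatch(r"[a-z0-9.-]+/[a-z0-9.-]+", item):
            namespace, name = item.split("/")
            pods.append((namespace, name))
    return pods
```

For the output in this report, the helper would return the three guard pods in openshift-kube-apiserver, openshift-kube-controller-manager and openshift-kube-scheduler. The etcd-quorum-guard pod does not appear in the error.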

Expected results: node successfully migrated


Additional info:

$ openstack server list --host compute-0.redhat.local --all
+--------------------------------------+-----------------------------+--------+-------------------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                                        | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+-------------------------------------------------+--------------------+--------+
| 60749007-a8be-4897-bd05-afdf6728b347 | ostest-kznkt-worker-0-tjmws | ACTIVE | ostest-kznkt-openshift=10.196.0.205             | ostest-kznkt-rhcos |        |
| cd93d6a1-827f-4d3e-b6ff-df3f3e3e1ed0 | ostest-kznkt-bootstrap      | ACTIVE | ostest-kznkt-openshift=10.196.1.20, 10.46.23.49 | ostest-kznkt-rhcos |        |
| 1b287a04-40f3-4ed0-a7c8-25800dd7d537 | ostest-kznkt-master-1       | ACTIVE | ostest-kznkt-openshift=10.196.3.84              | ostest-kznkt-rhcos |        |
+--------------------------------------+-----------------------------+--------+-------------------------------------------------+--------------------+--------+

$ openstack server list --host compute-1.redhat.local --all
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                            | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+
| 18fd7016-2521-4b81-9e4f-84321e6edbfd | ostest-kznkt-worker-0-hlkp2 | ACTIVE | ostest-kznkt-openshift=10.196.3.245 | ostest-kznkt-rhcos |        |
| eaf32211-cb78-482a-a349-ec6c92ef7370 | ostest-kznkt-worker-0-7dqh2 | ACTIVE | ostest-kznkt-openshift=10.196.0.94  | ostest-kznkt-rhcos |        |
| 68e49ac2-fdbe-4e7e-a1fc-3c87005802c6 | ostest-kznkt-master-2       | ACTIVE | ostest-kznkt-openshift=10.196.2.51  | ostest-kznkt-rhcos |        |
| 6b5e7191-a2b8-41e7-9be0-d269ebc09e5c | ostest-kznkt-master-0       | ACTIVE | ostest-kznkt-openshift=10.196.1.253 | ostest-kznkt-rhcos |        |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+--------+

$ oc get pods -o wide -A | grep master-0 | grep guard
openshift-etcd                                     etcd-quorum-guard-6d5548d4c4-jgzkh                           1/1     Running     0               22h     10.196.1.253   ostest-kznkt-master-0         <none>           <none>
openshift-kube-apiserver                           kube-apiserver-guard-ostest-kznkt-master-0                   1/1     Running     0               21h     10.128.0.44    ostest-kznkt-master-0         <none>           <none>
openshift-kube-controller-manager                  kube-controller-manager-guard-ostest-kznkt-master-0          1/1     Running     0               21h     10.128.0.42    ostest-kznkt-master-0         <none>           <none>
openshift-kube-scheduler                           openshift-kube-scheduler-guard-ostest-kznkt-master-0         1/1     Running     0               22h     10.128.0.34    ostest-kznkt-master-0         <none>           <none>

Comment 1 Pierre Prinetti 2022-01-28 12:05:13 UTC
This is probably worth checking again with a payload that contains the fix to Bug 2038481.

Comment 2 Pierre Prinetti 2022-01-28 18:41:16 UTC
Setting blocker- because it’s a potential bug in the docs.

Comment 5 Pierre Prinetti 2022-02-07 11:21:30 UTC
Do you mind testing the proposed patch in your environment before merge?

Comment 6 Jon Uriarte 2022-02-10 12:50:48 UTC
(In reply to Pierre Prinetti from comment #5)
> Do you mind testing the proposed patch in your environment before merge?

Tested; the patch looks good.

Comment 10 Jon Uriarte 2022-05-11 14:41:51 UTC
Verified in 4.11.0-0.nightly-2022-05-10-045003 on top of OSP 16.1.8.

Control plane node migration now completes correctly.

2022-05-11 13:20:23.559 |     "vm_per_compute": {
2022-05-11 13:20:23.562 |         "computehci-0.redhat.local": [
2022-05-11 13:20:23.564 |             "ostest-6pp4w-worker-0-7dnxv",
2022-05-11 13:20:23.567 |             "ostest-6pp4w-master-2"
2022-05-11 13:20:23.570 |         ],
2022-05-11 13:20:23.572 |         "computehci-1.redhat.local": [
2022-05-11 13:20:23.575 |             "ostest-6pp4w-worker-0-twnxn",
2022-05-11 13:20:23.577 |             "ostest-6pp4w-master-0"
2022-05-11 13:20:23.580 |         ],
2022-05-11 13:20:23.583 |         "computehci-2.redhat.local": [
2022-05-11 13:20:23.585 |             "ostest-6pp4w-worker-0-4tkgd",
2022-05-11 13:20:23.588 |             "ostest-6pp4w-master-1"
2022-05-11 13:20:23.590 |         ]
2022-05-11 13:20:23.593 |     }
...
2022-05-11 13:21:18.461 | Going to migrate 'ostest-6pp4w-master-0' OCP node from 'computehci-1.redhat.local' OSP compute
...
2022-05-11 13:27:41.527 |     "vm_per_compute_after": {
2022-05-11 13:27:41.529 |         "computehci-0.redhat.local": [
2022-05-11 13:27:41.532 |             "ostest-6pp4w-worker-0-7dnxv",
2022-05-11 13:27:41.534 |             "ostest-6pp4w-master-2"
2022-05-11 13:27:41.537 |         ],
2022-05-11 13:27:41.539 |         "computehci-1.redhat.local": [
2022-05-11 13:27:41.541 |             "ostest-6pp4w-worker-0-twnxn"
2022-05-11 13:27:41.544 |         ],
2022-05-11 13:27:41.546 |         "computehci-2.redhat.local": [
2022-05-11 13:27:41.548 |             "ostest-6pp4w-worker-0-4tkgd",
2022-05-11 13:27:41.551 |             "ostest-6pp4w-master-1",
2022-05-11 13:27:41.553 |             "ostest-6pp4w-master-0"
2022-05-11 13:27:41.555 |         ]
2022-05-11 13:27:41.558 |     }
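
The two placement maps above can be compared mechanically to confirm which server changed compute host. The following is a minimal sketch, not the verification job's actual code; the function name moved_servers and the constants BEFORE and AFTER are hypothetical, and the data is copied from the vm_per_compute and vm_per_compute_after log excerpts:

```python
from typing import Dict, List, Tuple

# Placement copied from the "vm_per_compute" log excerpt.
BEFORE: Dict[str, List[str]] = {
    "computehci-0.redhat.local": ["ostest-6pp4w-worker-0-7dnxv", "ostest-6pp4w-master-2"],
    "computehci-1.redhat.local": ["ostest-6pp4w-worker-0-twnxn", "ostest-6pp4w-master-0"],
    "computehci-2.redhat.local": ["ostest-6pp4w-worker-0-4tkgd", "ostest-6pp4w-master-1"],
}

# Placement copied from the "vm_per_compute_after" log excerpt.
AFTER: Dict[str, List[str]] = {
    "computehci-0.redhat.local": ["ostest-6pp4w-worker-0-7dnxv", "ostest-6pp4w-master-2"],
    "computehci-1.redhat.local": ["ostest-6pp4w-worker-0-twnxn"],
    "computehci-2.redhat.local": [
        "ostest-6pp4w-worker-0-4tkgd",
        "ostest-6pp4w-master-1",
        "ostest-6pp4w-master-0",
    ],
}


def moved_servers(
    before: Dict[str, List[str]], after: Dict[str, List[str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each server whose compute host changed to (old_host, new_host)."""
    # Invert both maps from host -> servers to server -> host.
    host_before = {vm: host for host, vms in before.items() for vm in vms}
    host_after = {vm: host for host, vms in after.items() for vm in vms}
    return {
        vm: (host_before[vm], host_after[vm])
        for vm in host_before
        if vm in host_after and host_before[vm] != host_after[vm]
    }
```

Calling moved_servers(BEFORE, AFTER) reports only ostest-6pp4w-master-0, moved from computehci-1.redhat.local to computehci-2.redhat.local, which matches the migration announced in the log.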

Comment 13 errata-xmlrpc 2022-08-10 10:44:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform
4.11.0 bug fix and security update), and where to find the updated files,
follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069