Description of problem:

We identified a very critical situation in an OpenShift 4.5.8 IPI VMware environment: the VMDKs backing the PVs attached to a node were deleted after deleting the machine. We ran some tests and used the three scenarios below to reproduce this behavior:

Scenario 1:
- We created a postgresql pod using a ReadWriteOnce (RWO) persistent disk (vSphere) and scheduled it on a specific node.
- We shut down the server abruptly to simulate a crash.
- At this point the application pods migrated, but the disk was not detached from the disconnected server. Following the documentation linked below, we removed the node by deleting its machine in order to release the disk.
- After the node was removed, we noticed that the postgresql application failed with a "not found" error for its volume.
- The PV and PVC remained Bound and showed no failure.
- Searching the datastore for the disk, we could not find the VMDK; it had been deleted together with the node.

https://docs.openshift.com/container-platform/4.5/support/troubleshooting/troubleshooting-storage-issues.html#storage-multi-attach-error_troubleshooting-storage-issues

Scenario 2:
- We recreated the postgresql pod with an RWO persistent disk and moved it to a specific node.
- We deleted the machine.
- The applications were drained and the disks backing the PVs were detached.
- The disk was attached to another node and the application started correctly.

Scenario 3:
- We recreated the postgresql pod with an RWO persistent disk and moved it to a specific node.
- To simulate a server that is still powered on but NotReady, we stopped the kubelet service on that node.
- Once the node went NotReady, we deleted the machine.
- After a few minutes the server was removed and the VMDKs were deleted with it, without being detached and reattached to another node.

This is a very critical situation, since the volume detach check is only performed when the node is fully functional, which is exactly the case where deleting the node is not necessary.

Version-Release number of selected component (if applicable):
OpenShift 4.5.8 IPI VMware - Production and Non-Production Clusters

Actual results:
When a machine is deleted while PV-backed VMDKs are still attached to its VM, the VMDKs are deleted together with the VM, even though the PV and PVC remain Bound.

Expected results:
PV-backed VMDKs should be detached from the VM and preserved in the datastore when the machine is deleted, so the volumes can be attached to another node.

Additional info:
- Found a few issues related to this:
https://github.com/kubernetes/kubernetes/issues/75738
https://github.com/kubernetes/enhancements/pull/719
https://github.com/vmware/vsphere-storage-for-kubernetes/issues/55
https://github.com/rancher/rancher/issues/24690
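For reference, a minimal client-go sketch of the kind of setup used in the scenarios above: an RWO claim on a vSphere-backed storage class and a postgresql pod pinned to one node. The namespace, node name, image, and storage class names here are placeholders, not the exact manifests we used.

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location and build a client.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// RWO claim on the vSphere-backed storage class ("thin" is a placeholder).
	sc := "thin"
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "postgresql-data", Namespace: "demo"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &sc,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("10Gi")},
			},
		},
	}
	if _, err := client.CoreV1().PersistentVolumeClaims("demo").Create(ctx, pvc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Pod pinned to one node, mounting the claim above.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "postgresql", Namespace: "demo"},
		Spec: corev1.PodSpec{
			NodeName: "worker-0", // the specific node used in the scenarios
			Containers: []corev1.Container{{
				Name:         "postgresql",
				Image:        "registry.redhat.io/rhel8/postgresql-12", // placeholder image
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/var/lib/pgsql/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "postgresql-data"},
				},
			}},
		},
	}
	if _, err := client.CoreV1().Pods("demo").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}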
Kumar, I was wondering if the following use case could be implemented in the machine-api-operator in order to avoid accidental PV removal during the machine deletion process:

- Before removing a VM on vSphere, check all VMDKs attached to the VM.
- If any VMDK comes from a PV, the machine-api tries to detach that volume from the VM (before VM deletion).
- If the VMDKs are detached successfully, the machine-api-operator continues to remove the VM.
- If any VMDK cannot be detached for some reason, the machine-api fails to remove the machine with an intuitive message such as "Failed to delete the VM due to failure to detach one or more VMDKs from a Persistent Volume: VMDK: <VMDK_NAME> - Error: <MSG FROM VMWARE>"

Do you think this is feasible? I would like to propose this use case for implementation. Do you know how to make a proposal for an enhancement in the machine-api-operator?

Thanks a lot! Appreciate that!

Regards,
Giovanni Fontana
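A rough sketch of the proposed check, assuming govmomi and a hypothetical isPersistentVolumeDisk() helper to recognize PV-backed VMDKs; this is only an illustration of the idea, not actual machine-api-operator code.

package vsphere

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

// detachPVDisks detaches every PV-backed VMDK from the VM while keeping the
// backing files, and returns an error if any detach fails so the caller can
// abort the VM deletion instead of destroying the disks together with it.
func detachPVDisks(ctx context.Context, vm *object.VirtualMachine) error {
	devices, err := vm.Device(ctx)
	if err != nil {
		return fmt.Errorf("failed to list devices of %s: %w", vm.Name(), err)
	}
	for _, dev := range devices.SelectByType((*types.VirtualDisk)(nil)) {
		disk := dev.(*types.VirtualDisk)
		backing, ok := disk.Backing.(*types.VirtualDiskFlatVer2BackingInfo)
		if !ok || !isPersistentVolumeDisk(backing.FileName) {
			continue // not a PV-backed VMDK, leave it to be removed with the VM
		}
		// keepFiles=true detaches the disk without deleting the .vmdk file.
		if err := vm.RemoveDevice(ctx, true, disk); err != nil {
			return fmt.Errorf("failed to delete the VM due to failure to detach "+
				"VMDK %q from a Persistent Volume: %v", backing.FileName, err)
		}
	}
	return nil
}

// isPersistentVolumeDisk is a placeholder: a real implementation would match
// the VMDK path against the cluster's PV working directory or the PV objects.
func isPersistentVolumeDisk(fileName string) bool {
	return false
}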
Did some testing today with a 4.6 nightly (but these parts have not changed for quite some time):

1. When a pod that uses a volume attached to a shut-down node is deleted, Kubernetes waits forever for kubelet to confirm the volume has been unmounted (to prevent data corruption).
2. When such a pod is deleted with force, "oc delete pod --force", the volume is detached after ~6 minutes and a new pod can start. So please delete pods with force and be patient.
3. "oc adm drain <node>" does not drain the nodes with force; it waits forever for such a pod to get deleted.

So, from the storage point of view (vSphere volume plugin / attach-detach controller), the system works as designed and prevents data corruption. At the overall OCP level, something (MCO?) could force-delete pods on nodes that are confirmed to be shut down in the cloud - MCO has / could have such knowledge, while Kubernetes/kube-controller-manager does not.
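For reference, the force delete in item 2 is equivalent to an API delete with a zero grace period. A minimal client-go sketch, with placeholder namespace and pod names:

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Grace period 0 is what makes this a force delete; the attach/detach
	// controller can then detach the volume (~6 minutes later) even though the
	// shut-down kubelet never confirms the unmount.
	grace := int64(0)
	opts := metav1.DeleteOptions{GracePeriodSeconds: &grace}
	if err := client.CoreV1().Pods("demo").Delete(context.Background(), "postgresql", opts); err != nil {
		panic(err)
	}
}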
I've been thinking about this and there is not much we can do on the storage side. The Kubernetes attach/detach controller does not know the status of the VM in vSphere, and it protects data from corruption, i.e. it never force-detaches volumes from nodes unless either the Node object is deleted or the Pod is force-deleted (Terminating is not enough, as kubelet must confirm the volumes are unmounted, and it is not running at that time).

There is already a bug to track documentation changes ("force delete pods before deleting VMs", #1884643).

Leaving it to the MCO team to judge whether they can initiate force-deleting pods from a node before removing it from vSphere, or force-detaching volumes from it. Still, users can go to the vSphere console directly and delete VMs manually, potentially losing their data. And there is nothing we can do about that.
(In reply to Jan Safranek from comment #13)
> I've been thinking about this and there is not much we can do on the storage
> side. The Kubernetes attach/detach controller does not know the status of
> the VM in vSphere, and it protects data from corruption, i.e. it never
> force-detaches volumes from nodes unless either the Node object is deleted
> or the Pod is force-deleted (Terminating is not enough, as kubelet must
> confirm the volumes are unmounted, and it is not running at that time).
>
> There is already a bug to track documentation changes ("force delete pods
> before deleting VMs", #1884643).
>
> Leaving it to the MCO team to judge whether they can initiate force-deleting
> pods from a node before removing it from vSphere, or force-detaching volumes
> from it. Still, users can go to the vSphere console directly and delete VMs
> manually, potentially losing their data. And there is nothing we can do
> about that.

The MCO wouldn't force-delete a pod, as we also want to avoid any data corruption, so until kube tells us it's gone, we keep looping the drain. I understand the corruption shouldn't be a problem, but the MCO won't grow the knowledge of force-deleting a pod (in the short term), and it really sounds like both the MCO and kube are doing the right thing here, while vSphere isn't.
Hello, did you get a chance to check the above comment made by Sudarshan?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438