2007397 – Unexpected killing of virt-launcher pod, can result in loss of data for hotplugged volumes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.8.10
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Alexander Wels
QA Contact:	Yan Du
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2013662 2021209

Reported:	2021-09-23 18:25 UTC by Alexander Wels
Modified:	2022-03-16 15:55 UTC
CC List:	4 users
Fixed In Version:	CNV v4.10.0-152
Doc Type:	Bug Fix
Doc Text:	If you hot-plug a virtual disk and then force delete the virt-launcher pod, you might lose data. This is due to a race condition that can cause the VM disk’s contents to be wiped from the persistent volume.
Clone Of:
Clones:	2013662 2021209
Environment:
Last Closed:	2022-03-16 15:55:26 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt kubevirt pull 6464	None	Merged	Resolve hotplug race between kubelet and virt-handler	2021-09-30 17:20:13 UTC
Github	kubevirt kubevirt pull 6479	None	Merged	[release-0.44] Resolve hotplug race between kubelet and virt-handler	2021-11-02 13:33:26 UTC
Github	kubevirt kubevirt pull 6480	None	Merged	[release-0.45] Resolve hotplug race between kubelet and virt-handler	2021-09-30 17:19:31 UTC
Red Hat Product Errata	RHSA-2022:0947	None	None	None	2022-03-16 15:55:58 UTC

Description Alexander Wels 2021-09-23 18:25:42 UTC

Description of problem:
When the virt-launcher pod is killed unexpectedly, there is a possibility that any hotplugged filesystem volumes will have their disk.img file removed from the backing storage.

Version-Release number of selected component (if applicable):

How reproducible:
Intermittent.

Steps to Reproduce:
1. Start a VM, create a volume, then hotplug that volume into the VM.
2. At this point there are 2 pods: the virt-launcher pod and the attachment pod (hpvolume-xyz).
3. Force delete the virt-launcher pod: kubectl delete pod virt-launcher-xyz-abcd --force --grace-period=0
4. This will terminate the virt-launcher pod and put the VMI into the failed state.
5. Check the contents of the volume you created in step 1. There should be a disk.img file, but in a small number of runs it is missing. In those runs the following happened:
- The virt-launcher pod is deleted. Because an emptyDir is used as the point where the disk.img file is mounted, there is now a race between the kubelet, which empties the emptyDir, and virt-handler, which notices the pod is gone and unmounts all the hotplugged volumes from the virt-launcher pod. If virt-handler runs first, there is no problem. If the kubelet runs first, it removes the contents of the emptyDir, including the bind-mounted disk.img files, and this also removes the files from the source volume.
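
The ordering dependency in the scenario above can be illustrated with a toy model. This is a minimal sketch, not KubeVirt or kubelet code: Volume, EmptyDir, kubelet_wipe, virt_handler_unmount, and simulate are hypothetical names, and the bind mount is modeled as a shared reference so that removing an entry through the emptyDir also removes it from the source volume, as the report describes.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical stand-in for the hotplugged volume's backing storage."""
    files: set = field(default_factory=lambda: {"disk.img"})


@dataclass
class EmptyDir:
    """Hypothetical stand-in for the emptyDir in the virt-launcher pod."""
    # Maps a name inside the emptyDir to the source volume it is bound to.
    bind_mounts: dict = field(default_factory=dict)


def virt_handler_unmount(empty_dir: EmptyDir) -> None:
    """Detach every bind mount; the source volumes are left untouched."""
    empty_dir.bind_mounts.clear()


def kubelet_wipe(empty_dir: EmptyDir) -> None:
    """Remove everything in the emptyDir, following any mounts still present."""
    for name, source in list(empty_dir.bind_mounts.items()):
        source.files.discard(name)  # the removal reaches the source volume
        del empty_dir.bind_mounts[name]


def simulate(kubelet_first: bool) -> bool:
    """Return True if disk.img survives the pod deletion."""
    volume = Volume()
    empty_dir = EmptyDir(bind_mounts={"disk.img": volume})
    steps = [kubelet_wipe, virt_handler_unmount]
    if not kubelet_first:
        steps.reverse()
    for step in steps:
        step(empty_dir)
    return "disk.img" in volume.files
```

In this model, simulate(kubelet_first=False) keeps the file and simulate(kubelet_first=True) loses it, which mirrors the two outcomes of the race. The model says nothing about how often each ordering occurs on a real node.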

Actual results:
A small percentage of the time, the disk.img file disappears because the kubelet wins the race and clears the emptyDir before virt-handler can unmount the volumes.
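
Whether the loss occurred can be checked by looking for the image file at the root of the volume. This is a minimal sketch, assuming the volume's filesystem is reachable at a local path; the function name and the volume_root parameter are hypothetical, and only the disk.img file name comes from this report.

```python
from pathlib import Path


def disk_image_present(volume_root: str, image_name: str = "disk.img") -> bool:
    """Return True if the image file exists as a regular file in volume_root."""
    return (Path(volume_root) / image_name).is_file()
```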

Expected results:
There is no race and virt-handler always unmounts first, or some other mechanism ensures that no data is ever lost.

Additional info:
It is unlikely that the upstream (U/S) Kubernetes community will accept patches that change the emptyDir behavior of blindly wiping its contents. We have to find a different solution.

Comment 1 Maya Rashish 2021-10-13 09:29:22 UTC

There's already a backport merged for v4.9, but the target release is v4.10. Do you want to duplicate this bug for v4.9?

Comment 2 Alexander Wels 2021-10-13 13:03:47 UTC

Yeah we need a duplication with a target of 4.9.1

Comment 3 Yan Du 2021-10-27 05:38:44 UTC

Ran the test 100 times and the issue could not be reproduced. Moving the bug to Verified.
CNV v4.10.0-218
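
Because the failure is intermittent, repeating the reproduction and counting losses is more informative than a single run. The loop below is a minimal sketch of that idea, not the actual QA test: count_losses and the trial callable are hypothetical, with trial expected to run one reproduction attempt and return True when disk.img survived.

```python
from typing import Callable


def count_losses(trial: Callable[[], bool], runs: int = 100) -> int:
    """Run trial() `runs` times and return how many times data was lost."""
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return sum(1 for _ in range(runs) if not trial())
```

A result of 0 over 100 runs is consistent with the fix working, but for a rare race it does not prove that the race is gone.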

Comment 4 Adam Litke 2021-11-02 13:37:06 UTC

This is in 4.9.0.  Updating target release.

Comment 5 Yan Du 2021-11-03 11:36:39 UTC

There is another bug for 4.9.0: #2013662

Comment 13 errata-xmlrpc 2022-03-16 15:55:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947
