Created attachment 1976227 [details]
vm with 2 disks

Description of problem:
Can't create a snapshot for a VM with a containerDisk and a *mounted* PVC disk

> $ oc get vmsnapshot --watch
> NAME                            SOURCEKIND       SOURCENAME           PHASE        READYTOUSE   CREATIONTIME   ERROR
> snapshot-uninterested-cheetah   VirtualMachine   vm-fedora-with-pvc   InProgress   false
> snapshot-uninterested-cheetah   VirtualMachine   vm-fedora-with-pvc   Failed       false

> $ oc get vmsnapshot snapshot-uninterested-cheetah -o json | jq .status.conditions
> [
> .
>   {
>     "lastProbeTime": null,
>     "lastTransitionTime": "2023-07-17T17:32:10Z",
>     "reason": "snapshot deadline exceeded",
>     "status": "True",
>     "type": "Failure"
>   }
> ]

With the second disk unmounted, the snapshot completed successfully:

> $ oc get vmsnapshot
> NAME                         SOURCEKIND       SOURCENAME           PHASE       READYTOUSE   CREATIONTIME   ERROR
> snapshot-recent-flyingfish   VirtualMachine   vm-fedora-with-pvc   Succeeded   true         8s

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. create a VM with a containerDisk (root) and a PVC (second)
2. go to the VM console
3. create an ext4 filesystem on the second disk - mkfs.ext4 /dev/vdb
4. mount the second disk -
   mkdir /mnt/test
   mount /dev/vdb /mnt/test
5. create a new file inside that folder with some text - e.g. vi /mnt/test/TEST_FILE
6. try to make a snapshot

Actual results:
failed to create snapshot

Expected results:
snapshot created successfully

Additional info:
For info: I see the same behavior with Fedora 38 and RHEL 9.2.
Summarizing offline chats:

The underlying issue is a failure in the guest agent freeze:

{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to freeze vmi","name":"vm-fedora-with-pvc","namespace":"test-clone","pos":"server.go:269","reason":"virError(Code=1, Domain=10, Message='internal error: unable to execute QEMU agent command 'guest-fsfreeze-freeze': failed to open /mnt/test: Permission denied')","timestamp":"2023-07-17T18:23:58.258646Z","uid":"04ce94f3-5f77-472a-9c61-21eb0f0fb41f"}

The corresponding bug for this scenario, along with its conclusion, is here:
https://bugzilla.redhat.com/show_bug.cgi?id=1747960#c35

Some comments on that bug suggest qemu-ga cannot do anything more than expose this (off by default) boolean:
https://bugzilla.redhat.com/show_bug.cgi?id=1747960#c20
https://bugzilla.redhat.com/show_bug.cgi?id=1747960#c22

So I am not sure there's anything we can do on the CNV side, but I am curious how this has not bugged other users before.
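
For anyone triaging a similar failure, here is a minimal sketch for pulling the offending mount point out of virt-launcher log lines like the one quoted above. The function name fsfreeze_denied_path is hypothetical, and the pattern only assumes the message wording shown in that log line; other libvirt/QEMU versions may phrase it differently.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print the path that
# qemu-ga failed to open during guest-fsfreeze-freeze.
# Lines without the "Permission denied" freeze signature print nothing.
fsfreeze_denied_path() {
    sed -n "s/.*'guest-fsfreeze-freeze': failed to open \(.*\): Permission denied.*/\1/p"
}
```

It could be fed from something like "oc logs <virt-launcher-pod> | fsfreeze_denied_path", where the pod name is a placeholder. For the log line above it prints /mnt/test.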
Thanks for the explanation Alex. I think single disk VMs are overwhelmingly the norm in the field. Also, I wonder if this would reproduce if the second disk is block and initialized with LVM. In any case, I think we should have a KCS article for this topic. Adding Jean-Francois: What do you think?
(In reply to Adam Litke from comment #3)
> I think single disk VMs are overwhelmingly the norm in the field.

From a quick look at the hotplug tests, this looks like a common pattern (minus taking a snapshot at the end), but yeah, I agree about single-disk VMs being the norm.
Whoops, messed up the needinfo.

Michael, I was about to ask if there is anything we can do from our side, like:
- Integrate this SELinux boolean in our golden images
- Change the boolean before calling freeze

Both seem risky to me, as this should be something that is consciously done by the VM owner.
I don't think we should change any VM settings.
Hi Jean-Francois, Will you be able to create a KCS for this prior to 4.14.0 GA?
The KCS is created and published: https://access.redhat.com/solutions/7041127
@jpeimer Please see the KCS article for QA. Thanks.
Verified on CNV 4.14.0 and 4.15.0.

Could reproduce the issue, tried the proposed solution, and it worked:

[fedora@vm-fedora-with-pvc ~]$ sudo setsebool -P virt_qemu_ga_read_nonsecurity_files on
[ 337.004824] SELinux: Class mctp_socket not defined in policy.
[ 337.007003] SELinux: the above unknown classes and permissions will be allowed
[ 337.013869] SELinux: Converting 309 SID table entries...
[ 337.060824] SELinux: policy capability network_peer_controls=1
[ 337.063419] SELinux: policy capability open_perms=1
[ 337.065175] SELinux: policy capability extended_socket_class=1
[ 337.067260] SELinux: policy capability always_check_network=0
[ 337.069448] SELinux: policy capability cgroup_seclabel=1
[ 337.071488] SELinux: policy capability nnp_nosuid_transition=1
[ 337.073628] SELinux: policy capability genfs_seclabel_symlinks=0

$ oc get vmsnapshot -A
NAMESPACE   NAME            SOURCEKIND       SOURCENAME           PHASE       READYTOUSE   CREATIONTIME   ERROR
default     my-vmsnapshot   VirtualMachine   vm-fedora-with-pvc   Succeeded   true         12s

$ oc get vmsnapshot my-vmsnapshot -o json | jq .status.snapshotVolumes
{
  "excludedVolumes": [
    "containerdisk",
    "cloudinitdisk"
  ],
  "includedVolumes": [
    "disk-0"
  ]
}
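
Since the workaround depends on a guest-side SELinux boolean, a read-only check before taking a snapshot may be useful. The following is a rough sketch, meant to run inside the guest; the function names sebool_is_on and check_freeze_bool are hypothetical, and it assumes getsebool prints its usual "name --> on" / "name --> off" line. It does not change any setting.

```shell
#!/bin/sh
BOOL=virt_qemu_ga_read_nonsecurity_files

# Return 0 when a getsebool output line (e.g. "name --> on") reports "on".
sebool_is_on() {
    case "$1" in
        *"--> on") return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether qemu-ga may open non-security files, which the freeze
# of the mounted second disk needed in this bug. Read-only check.
check_freeze_bool() {
    if ! state=$(getsebool "$BOOL" 2>/dev/null); then
        echo "cannot read $BOOL (getsebool missing or boolean not defined)" >&2
        return 2
    fi
    if sebool_is_on "$state"; then
        echo "$BOOL is on"
    else
        echo "$BOOL is off; see the KCS: sudo setsebool -P $BOOL on"
        return 1
    fi
}
```

Calling check_freeze_bool in the guest returns 0 when the boolean is on, 1 when it is off, and 2 when the state cannot be read.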
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:7704