Bug 2223411 - Unable to create snapshot for VM with mounted second disk (PVC)
Summary: Unable to create snapshot for VM with mounted second disk (PVC)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.14.1
Assignee: Adam Litke
QA Contact: Jenia Peimer
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-17 17:41 UTC by Denys Shchedrivyi
Modified: 2023-12-07 15:00 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-07 15:00:42 UTC
Target Upstream Version:
Embargoed:


Attachments
vm with 2 disks (1.95 KB, text/plain)
2023-07-17 17:41 UTC, Denys Shchedrivyi


Links
Red Hat Issue Tracker: CNV-31077 (last updated 2023-07-17 17:42:01 UTC)
Red Hat Product Errata: RHSA-2023:7704 (last updated 2023-12-07 15:00:43 UTC)

Description Denys Shchedrivyi 2023-07-17 17:41:41 UTC
Created attachment 1976227
vm with 2 disks

Description of problem:
 Creating a snapshot fails for a VM with a containerDisk and a *mounted* PVC disk.

> $ oc get vmsnapshot --watch
> NAME                            SOURCEKIND       SOURCENAME           PHASE        READYTOUSE   CREATIONTIME   ERROR
> snapshot-uninterested-cheetah   VirtualMachine   vm-fedora-with-pvc   InProgress   false                       
> snapshot-uninterested-cheetah   VirtualMachine   vm-fedora-with-pvc   Failed       false                       

> $ oc get vmsnapshot snapshot-uninterested-cheetah -o json | jq .status.conditions
>[
>.
>  {
>    "lastProbeTime": null,
>    "lastTransitionTime": "2023-07-17T17:32:10Z",
>    "reason": "snapshot deadline exceeded",
>    "status": "True",
>    "type": "Failure"
>  }
>]


 With the second disk unmounted, the snapshot completed successfully:

> $ oc get vmsnapshot
> NAME                            SOURCEKIND       SOURCENAME           PHASE       READYTOUSE   CREATIONTIME   ERROR
> snapshot-recent-flyingfish      VirtualMachine   vm-fedora-with-pvc   Succeeded   true         8s             



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a VM with a containerDisk (root) and a PVC (second disk).
2. Open the VM console.
3. Create an ext4 filesystem on the second disk:
mkfs.ext4 /dev/vdb

4. Mount the second disk:
mkdir /mnt/test
mount /dev/vdb /mnt/test

5. Create a new file with some text in that directory, e.g. vi /mnt/test/TEST_FILE
6. Try to take a snapshot.
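
Step 6 can be done from the command line. A minimal sketch, assuming the `snapshot.kubevirt.io/v1alpha1` API version is served on the cluster (the report does not say which version was used). The helper name `vmsnapshot_manifest` is hypothetical, and the snapshot and VM names are passed in as arguments:

```shell
#!/bin/sh
# Hypothetical helper: print a VirtualMachineSnapshot manifest on stdout.
#   $1 = name for the new snapshot
#   $2 = name of the VirtualMachine to snapshot
# Assumption: apiVersion snapshot.kubevirt.io/v1alpha1 (not stated in the report).
vmsnapshot_manifest() {
    cat <<EOF
apiVersion: snapshot.kubevirt.io/v1alpha1
kind: VirtualMachineSnapshot
metadata:
  name: $1
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: $2
EOF
}
```

For example, `vmsnapshot_manifest my-snap vm-fedora-with-pvc | oc create -f -` would submit the manifest, after which `oc get vmsnapshot --watch` shows the phase as in the description.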


Actual results:
 Snapshot creation fails.

Expected results:
 Snapshot is created successfully.

Additional info:

Comment 1 Denys Shchedrivyi 2023-07-17 21:33:48 UTC
FYI: the same behavior occurs with Fedora 38 and RHEL 9.2.

Comment 2 Alex Kalenyuk 2023-07-18 11:10:18 UTC
Summarizing offline chats:

The underlying issue is a failure in guest agent freeze:
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to freeze vmi","name":"vm-fedora-with-pvc","namespace":"test-clone","pos":"server.go:269","reason":"virError(Code=1, Domain=10, Message='internal error: unable to execute QEMU agent command 'guest-fsfreeze-freeze': failed to open /mnt/test: Permission denied')","timestamp":"2023-07-17T18:23:58.258646Z","uid":"04ce94f3-5f77-472a-9c61-21eb0f0fb41f"}
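
When triaging similar failures, the mount point that blocked the freeze can be pulled out of log lines like the one above. A minimal sketch, assuming the error text keeps exactly the wording quoted in this comment. The helper name `freeze_denied_path` is hypothetical and not part of any tool:

```shell
#!/bin/sh
# Hypothetical helper: read virt-launcher log lines on stdin and print the
# path that qemu-ga could not open during guest-fsfreeze-freeze.
# Assumption: the message format matches the log line quoted in this comment.
# Lines that do not match produce no output.
freeze_denied_path() {
    sed -n "s/.*guest-fsfreeze-freeze': failed to open \(.*\): Permission denied.*/\1/p"
}
```

Piping the virt-launcher pod log through it (for instance `oc logs <virt-launcher-pod> | freeze_denied_path`, where the pod name is a placeholder) would print one path per matching line.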

The corresponding bug for this scenario, with its conclusion, is here:
https://bugzilla.redhat.com/show_bug.cgi?id=1747960#c35
Some comments on the bug suggest qemu-ga cannot do anything more than expose this (off by default) boolean:
https://bugzilla.redhat.com/show_bug.cgi?id=1747960#c20
https://bugzilla.redhat.com/show_bug.cgi?id=1747960#c22

So I am not sure there is anything we can do on the CNV side,
but I am curious why this has not bothered other users before.

Comment 3 Adam Litke 2023-07-19 13:03:34 UTC
Thanks for the explanation Alex.  I think single disk VMs are overwhelmingly the norm in the field.  Also, I wonder if this would reproduce if the second disk is block and initialized with LVM.  In any case, I think we should have a KCS article for this topic.  Adding Jean-Francois:  What do you think?

Comment 4 Alex Kalenyuk 2023-07-20 09:57:56 UTC
(In reply to Adam Litke from comment #3)
> I think single disk VMs are overwhelmingly the norm in the field.

From a quick look at hotplug tests, this looks like a common pattern (minus taking a snapshot at the end),
but yeah I agree about single-disk VMs being the norm

Comment 5 Alex Kalenyuk 2023-07-20 10:00:50 UTC
Whoops, messed up the needinfo. Michael, I was about to ask if there is anything we can
do from our side, like:
- Integrate this SELinux boolean into our golden images
- Change the boolean before calling freeze
Both seem risky to me, as this should be something that is consciously done by the VM owner.

Comment 6 Michael Henriksen 2023-07-20 14:19:48 UTC
I don't think we should change any VM settings.

Comment 9 Adam Litke 2023-09-27 18:22:00 UTC
Hi Jean-Francois,

Will you be able to create a KCS for this prior to 4.14.0 GA?

Comment 10 Jean-Francois Saucier 2023-10-25 12:12:06 UTC
The KCS has been created and published: https://access.redhat.com/solutions/7041127

Comment 11 Adam Litke 2023-11-15 17:46:39 UTC
@jpeimer Please see the KCS article for QA.  Thanks.

Comment 12 Jenia Peimer 2023-11-15 22:09:45 UTC
Verified on CNV 4.14.0 and 4.15.0

Could reproduce the issue, tried the proposed solution, and it worked:

[fedora@vm-fedora-with-pvc ~]$ sudo setsebool -P virt_qemu_ga_read_nonsecurity_files on
[  337.004824] SELinux:  Class mctp_socket not defined in policy.
[  337.007003] SELinux: the above unknown classes and permissions will be allowed
[  337.013869] SELinux:  Converting 309 SID table entries...
[  337.060824] SELinux:  policy capability network_peer_controls=1
[  337.063419] SELinux:  policy capability open_perms=1
[  337.065175] SELinux:  policy capability extended_socket_class=1
[  337.067260] SELinux:  policy capability always_check_network=0
[  337.069448] SELinux:  policy capability cgroup_seclabel=1
[  337.071488] SELinux:  policy capability nnp_nosuid_transition=1
[  337.073628] SELinux:  policy capability genfs_seclabel_symlinks=0

$ oc get vmsnapshot -A 
NAMESPACE   NAME            SOURCEKIND       SOURCENAME           PHASE       READYTOUSE   CREATIONTIME   ERROR
default     my-vmsnapshot   VirtualMachine   vm-fedora-with-pvc   Succeeded   true         12s      

$ oc get vmsnapshot my-vmsnapshot -o json | jq .status.snapshotVolumes
{
  "excludedVolumes": [
    "containerdisk",
    "cloudinitdisk"
  ],
  "includedVolumes": [
    "disk-0"
  ]
}
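
Before retrying a snapshot, it may help to confirm inside the guest that the boolean actually took effect. A minimal sketch, assuming the usual `getsebool` output format `NAME --> on|off`. The helper name `qga_bool_is_on` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `getsebool virt_qemu_ga_read_nonsecurity_files`
# output on stdin and succeed (exit 0) only if the boolean is reported as on.
# Assumption: getsebool prints "NAME --> on" or "NAME --> off".
qga_bool_is_on() {
    grep -q '^virt_qemu_ga_read_nonsecurity_files --> on$'
}
```

Inside the guest this could be used as `getsebool virt_qemu_ga_read_nonsecurity_files | qga_bool_is_on || echo "boolean is off; freeze of mounted disks may be denied"`.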

Comment 19 errata-xmlrpc 2023-12-07 15:00:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7704

