Bug 2162252 - Got 'SyncVMI failed' when hotplugging an NFS disk to an NFS VM
Summary: Got 'SyncVMI failed' when hotplugging an NFS disk to an NFS VM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.12.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.13.1
Assignee: Alexander Wels
QA Contact: Yan Du
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-01-19 07:53 UTC by Yan Du
Modified: 2023-06-20 13:41 UTC (History)
CC List: 3 users

Fixed In Version: v4.13.1.rhel9-121
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-20 13:41:05 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
Github kubevirt/kubevirt pull 9591 (Merged): Improve mountinfo filtering for hotplugging NFS disks (last updated 2023-05-24 09:38:22 UTC)
Github kubevirt/kubevirt pull 9825 (Merged): [release-0.59] Improve mountinfo filtering for hotplugging NFS disks (last updated 2023-06-01 02:42:21 UTC)
Red Hat Issue Tracker CNV-24497 (last updated 2023-01-19 07:57:27 UTC)
Red Hat Product Errata RHEA-2023:3686 (last updated 2023-06-20 13:41:26 UTC)

Description Yan Du 2023-01-19 07:53:44 UTC
Description of problem:
Events:
  Type     Reason              Age                   From                       Message
  ----     ------              ----                  ----                       -------
  Normal   SuccessfulCreate    11m                   virtualmachine-controller  Created virtual machine pod virt-launcher-vm-fedora-c4lg4
  Normal   Created             11m                   virt-handler               VirtualMachineInstance defined.
  Normal   Started             11m                   virt-handler               VirtualMachineInstance started.
  Normal   SuccessfulCreate    11m                   virtualmachine-controller  Created attachment pod hp-volume-bgzzw
  Normal   SuccessfulCreate    11m (x6 over 11m)     virtualmachine-controller  Created hotplug attachment pod hp-volume-bgzzw, for volume blank-dv
  Normal   VolumeMountedToPod  11m                   virt-handler               Volume blank-dv has been mounted in virt-launcher pod
  Warning  SyncFailed          112s (x447 over 11m)  virt-handler               server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: unable to execute QEMU command 'device_add': Failed to get \"write\" lock')"


Version-Release number of selected component (if applicable):
CNV 4.12.0

How reproducible:
Always

Steps to Reproduce:
1. Import a DV (nfs) and create a VM
2. Create a blank DV (nfs)
3. Hotplug the disk to the VM
$ virtctl addvolume vm-fedora --volume-name=blank-dv
4. Describe the vmi (see the sketch below)
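
A minimal way to run step 4 (a sketch; the VM name is the one used above, the exact command is not from the report):

$ oc describe vmi vm-fedora    # Events and volumeStatus should show the hotplug progress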

Actual results:
The vmi shows the same error as in the description:
Volume blank-dv has been mounted in virt-launcher pod
  Warning  SyncFailed          112s (x447 over 11m)  virt-handler               server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: unable to execute QEMU command 'device_add': Failed to get \"write\" lock')"

and the volume status stays stuck in VolumeMountedToPod:

    volumeStatus:
    - hotplugVolume:
        attachPodName: hp-volume-bgzzw
        attachPodUID: 9c2e93b3-edac-48d9-bbf8-cf679ae9b8fd
      message: Volume blank-dv has been mounted in virt-launcher pod
      name: blank-dv
      persistentVolumeClaimInfo:
        accessModes:
        - ReadWriteOnce
        capacity:
          storage: 5Gi
        filesystemOverhead: "0.055"
        requests:
          storage: 1Gi
        volumeMode: Filesystem
      phase: MountedToPod
      reason: VolumeMountedToPod
      target: ""

Expected results:
The vmi's volumeStatus reaches VolumeReady and the hotplug completes without error.
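
A quick way to check for the expected state (the jsonpath is my sketch; the VM and volume names are the ones from this report):

$ oc get vmi vm-fedora -o jsonpath='{range .status.volumeStatus[*]}{.name}{"\t"}{.phase}{"\n"}{end}'
# blank-dv should show phase Ready (reason VolumeReady) rather than MountedToPod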


Additional info:

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv1
spec:
  source:
    http:
      url: http://url/fedora-images/Fedora-Cloud-Base-34-1.2.x86_64.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: nfs
    volumeMode: Filesystem
  contentType: kubevirt

---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-fedora
  name: vm-fedora
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-fedora
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: dv-disk
          - disk:
              bus: virtio
            name: cloudinitdisk
        resources:
          requests:
            memory: 1024Mi
      terminationGracePeriodSeconds: 0
      volumes:
      - name: dv-disk
        dataVolume:
          name: dv1
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
            runcmd:
            - echo 'printed from cloud-init userdata'
        name: cloudinitdisk

---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: blank-dv
spec:
  source:
    blank: {}
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi
    storageClassName: nfs
    volumeMode: Filesystem
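
Putting the manifests above together, the reproduction is roughly the following (the filenames are placeholders; the virtctl line is the same as in the steps):

$ oc create -f dv1.yaml         # NFS boot DataVolume
$ oc create -f vm-fedora.yaml   # VM booting from dv1
$ oc create -f blank-dv.yaml    # blank NFS DataVolume to hotplug
$ virtctl addvolume vm-fedora --volume-name=blank-dv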

Comment 1 Adam Litke 2023-03-28 21:37:01 UTC
Alexander PTAL

Comment 2 Alexander Wels 2023-04-05 13:41:41 UTC
Yes, it is broken. I suspect it is because of some SELinux labeling. I created a kubevirtci cluster with both local and NFS storage, then added a blank local volume and a blank NFS volume. When I look in the virt-launcher pod I see:

bash-5.1$ ls -alZ
total 5267412
drwxrwxrwx. 2 root root system_u:object_r:container_file_t:s0:c449,c794          64 Apr  5 13:35 .
drwxrwxrwx. 5 root root system_u:object_r:container_file_t:s0:c449,c794          96 Apr  5 13:25 ..
-rw-rw----. 1 qemu qemu system_u:object_r:container_file_t:s0:c449,c794  5073010688 Apr  5 13:33 volume-hotplug-local.img
-rw-rw----. 1 qemu qemu system_u:object_r:nfs_t:s0                      10146021376 Apr  5 13:36 volume-hotplug.img

Note that the SELinux label of the NFS volume is nfs_t instead of container_file_t. The local volume was added successfully, but the NFS volume produces exactly the error from this report.
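
For reference, the listing above can be reproduced with something along these lines (the pod name is the one from the reporter's events; the compute container name and the hotplug-disks path are assumptions about where the plugged images land):

$ oc exec virt-launcher-vm-fedora-c4lg4 -c compute -- ls -alZ /var/run/kubevirt/hotplug-disks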

Comment 3 Alexander Wels 2023-04-05 15:06:46 UTC
Can you show me the nfs storage class? I tried another NFS server (trident-nfs) and it worked, so I suspect it is simply a configuration issue in the NFS server.

Comment 4 Yan Du 2023-04-06 01:53:03 UTC
$ oc get sc nfs -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2023-04-03T14:03:00Z"
  name: nfs
  resourceVersion: "73762"
  uid: bd1d7758-2dc5-4e80-a439-81e4e95595b7
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate
 
 
$ oc get storageprofile nfs -o yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: StorageProfile
metadata:
  creationTimestamp: "2023-04-03T14:03:00Z"
  generation: 3
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.13.0
    cdi.kubevirt.io: ""
  name: nfs
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: CDI
    name: cdi-kubevirt-hyperconverged
    uid: 3bbcd4d7-1305-4cac-bc6a-f470672a6159
  resourceVersion: "74089"
  uid: 924b15d5-b4fe-415f-90cb-26d7f5e2bb4b
spec:
  claimPropertySets:
  - accessModes:
    - ReadWriteMany
    volumeMode: Filesystem
status:
  claimPropertySets:
  - accessModes:
    - ReadWriteMany
    volumeMode: Filesystem
  provisioner: kubernetes.io/no-provisioner
  storageClass: nfs

Comment 6 Alexander Wels 2023-04-06 13:28:52 UTC
So after some more debugging, the problem is not SELinux. I am able to hotplug NFS volumes if the boot volume is not NFS. For some reason, when virt-handler mounts the volume in the virt-launcher pod, it finds the wrong NFS disk, and that is why you are seeing the 'unable to get write lock' message: that image is already locked for the boot disk. I am investigating why that is happening.
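
The mountinfo entries being filtered can be eyeballed from inside the virt-launcher pod to see the boot-disk and hotplug NFS mounts side by side (the pod/container names and the exact command are a sketch, not taken from this report):

$ oc exec virt-launcher-vm-fedora-c4lg4 -c compute -- grep -i nfs /proc/1/mountinfo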

Comment 7 Yan Du 2023-06-08 03:13:27 UTC
Tested on CNV-v4.13.1.rhel9-123; the issue has been fixed.
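
The status below can be pulled with something along these lines (a sketch; same names as in the reproduction):

$ virtctl addvolume vm-fedora --volume-name=blank-dv
$ oc get vmi -o yaml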

    volumeStatus:
    - hotplugVolume:
        attachPodName: hp-volume-sdn94
        attachPodUID: e350f374-ed7d-4ea5-8687-2096d96dac5b
      message: Successfully attach hotplugged volume blank-dv to VM
      name: blank-dv
      persistentVolumeClaimInfo:
        accessModes:
        - ReadWriteOnce
        capacity:
          storage: 5Gi
        filesystemOverhead: "0.055"
        requests:
          storage: 1Gi
        volumeMode: Filesystem
      phase: Ready
      reason: VolumeReady
      target: sda
    - name: cloudinitdisk
      size: 1048576
      target: vdb
    - name: dv-disk
      persistentVolumeClaimInfo:
        accessModes:
        - ReadWriteOnce
        capacity:
          storage: 25Gi
        filesystemOverhead: "0.055"
        requests:
          storage: 10Gi
        volumeMode: Filesystem
      target: vda
kind: List
metadata:
  resourceVersion: ""

Comment 13 errata-xmlrpc 2023-06-20 13:41:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.13.1 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:3686

