Bug 2044348 - VM with ocs-storagecluster-cephfs sc keeps in CrashLoopBackOff
Summary: VM with ocs-storagecluster-cephfs sc keeps in CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Alexander Wels
QA Contact: Yan Du
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-24 12:35 UTC by Yan Du
Modified: 2022-03-16 16:07 UTC
CC List: 6 users

Fixed In Version: virt-launcher v4.10.0-216, CNV v4.10.0-686
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 16:06:49 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 6754 0 None Merged Only create 1MiB-aligned disk images 2022-02-17 12:42:13 UTC
Github kubevirt kubevirt pull 7247 0 None Merged [release-0.49] Only create 1MiB-aligned disk images 2022-02-17 22:11:47 UTC
Red Hat Product Errata RHSA-2022:0947 0 None None None 2022-03-16 16:07:04 UTC

Description Yan Du 2022-01-24 12:35:36 UTC
Description of problem:
VM with ocs-storagecluster-cephfs sc keeps in CrashLoopBackOff

Version-Release number of selected component (if applicable):
CNV v4.10.0-605

How reproducible:
Always

Steps to Reproduce:
1. Create DV with cephfs
---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: simple-dv-cephfs
spec:
  source:
      http:
         url: "http://.../files/cnv-tests/fedora-images/Fedora-Cloud-Base-34-1.2.x86_64.qcow2"
  pvc:
    storageClassName: ocs-storagecluster-cephfs
    volumeMode: Filesystem
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi

2. Create VM
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: simple-vm
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/domain: simple-vm
        kubevirt.io/vm: simple-vm
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: dv-disk
          - disk:
              bus: virtio
            name: cloudinitdisk
        resources:
          requests:
            memory: 2048M
      volumes:
      - dataVolume:
          name: simple-dv-cephfs
        name: dv-disk
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
        name: cloudinitdisk



Actual results:
VM keeps in CrashLoopBackOff

$ oc get dv
NAME               PHASE       PROGRESS   RESTARTS   AGE
simple-dv-cephfs   Succeeded   100.0%                3h15m

$ oc get vm
NAME        AGE     STATUS             READY
simple-vm   12m   CrashLoopBackOff   False


Events:
  Type     Reason            Age                From                       Message
  ----     ------            ----               ----                       -------
  Normal   SuccessfulCreate  42s                virtualmachine-controller  Created virtual machine pod virt-launcher-simple-vm-jk2cx
  Warning  SyncFailed        31s                virt-handler               server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-01-24T07:13:50.375272Z qemu-kvm: -device virtio-blk-pci-non-transitional,bus=pci.4,addr=0x0,drive=libvirt-2-format,id=ua-dv-disk,bootindex=1,write-cache=on,werror=stop,rerror=stop: Cannot get 'write' permission without 'resize': Image size is not a multiple of request alignment')"
  Warning  SyncFailed        30s                virt-handler               server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-01-24T07:13:51.469326Z qemu-kvm: -device virtio-blk-pci-non-transitional,bus=pci.4,addr=0x0,drive=libvirt-2-format,id=ua-dv-disk,bootindex=1,write-cache=on,werror=stop,rerror=stop: Cannot get 'write' permission without 'resize': Image size is not a multiple of request alignment')"


Expected results:
VM runs successfully


Additional info:
Workaround: the VM can run if filesystemOverhead is disabled in the HCO.
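A sketch of the workaround, assuming the HyperConverged CR exposes a `filesystemOverhead` field (the field name, value, and namespace/name below are illustrative; check the CRD of your CNV version before applying anything like this):

```yaml
# Illustrative only: set the global filesystem overhead to 0
# in the HyperConverged CR so no extra space is reserved.
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  filesystemOverhead:
    global: "0"
```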

Comment 1 Yan Du 2022-01-26 13:20:26 UTC
Is it possibly related to a missing backport for bug 1976730?

Comment 2 Alexander Wels 2022-01-26 13:35:51 UTC
If this is happening in 4.10, it is something else, because the fix for that bug should already be in 4.10.

Comment 3 Adam Litke 2022-01-31 13:27:52 UTC
Hey Niels, do you know what the cephfs block size is?  We currently use a 1M aligned size but qemu doesn't seem to like that in this particular scenario.

Comment 4 Alexander Wels 2022-02-04 15:45:24 UTC
It appears that the automatic resize code was not aligning properly. The linked KubeVirt PR will fix it.

Comment 5 Alexander Wels 2022-02-04 15:50:33 UTC
@alitke The problem is not the alignment in CDI; the problem is that the online resize in KubeVirt does not align. I did some testing: after the import from CDI, the virtual disk image size on a 10Gi PVC was 10146021376 bytes, which is 4k-aligned (actually 1MiB-aligned). Then I attempted to start the VM and it failed. I checked the virtual size of the image again, and this time it was 10146860544 bytes, which is not 4k-aligned (it looks like it is only 512-byte-aligned). I also found a PR that fixes exactly this in the online resize, which I have linked to this bugzilla.
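The arithmetic in this comment can be sketched as follows. This is a hypothetical helper, not KubeVirt's actual code: it rounds a size up to the next 1MiB boundary (the approach of the linked "Only create 1MiB-aligned disk images" PR) and checks the two sizes observed above.

```python
MIB = 1024 * 1024

def align_up(size: int, alignment: int = MIB) -> int:
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment

# Size after CDI import: 1MiB-aligned, so qemu accepts it.
after_import = 10146021376
assert after_import % MIB == 0

# Size after KubeVirt's online resize: only 512-byte-aligned, not
# 4k-aligned, which triggers qemu's
# "Image size is not a multiple of request alignment" error.
after_resize = 10146860544
assert after_resize % 512 == 0
assert after_resize % 4096 != 0

# Rounding the resized value up to 1MiB restores an aligned size.
assert align_up(after_resize) % MIB == 0
```

Aligning to 1MiB is a safe superset: any size that is a multiple of 1MiB is also a multiple of 4096 and 512, so it satisfies both 512-byte and 4k block devices.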

Comment 6 Adam Litke 2022-02-04 16:03:19 UTC
Peter, this should be a blocker as we may have problems creating VMs on any storage with a 4k block size.

Comment 7 Yan Du 2022-02-22 08:06:00 UTC
Verified on CNV v4.10.0-686; the issue can no longer be reproduced.

Comment 10 errata-xmlrpc 2022-03-16 16:06:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

