Bug 1976730
| Summary: | Disk is not usable due to incorrect size for proper alignment | |||
|---|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Germano Veit Michel <gveitmic> | |
| Component: | Storage | Assignee: | Alexander Wels <awels> | |
| Status: | CLOSED ERRATA | QA Contact: | Kevin Alon Goldblatt <kgoldbla> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 2.6.5 | CC: | awels, cnv-qe-bugs, dholler, fdeutsch, kgershon, mrashish, nrevo, tmicheli, yadu | |
| Target Milestone: | --- | |||
| Target Release: | 4.9.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | virt-cdi-operator:v4.9.0-14, CNV v4.9.0-86 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1990559 1990560 (view as bug list) | Environment: | ||
| Last Closed: | 2021-11-02 15:59:33 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1990559, 1990560 | |||
I ran into this issue with OCP 4.7 + ODF 4.7 + OCP Virt ?. I am using Ceph RBD from ODF/OCS and requested 50GiB.

(combined from similar events): server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: process exited while connecting to monitor: 2021-07-08T20:19:47.296918Z qemu-kvm: -device ide-hd,bus=ide.2,drive=libvirt-1-format,id=ua-rootdisk-install,bootindex=2,write-cache=on: Cannot get 'write' permission without 'resize': Image size is not a multiple of request alignment')"

Report from 02964811 mentions that adjusting the filesystemOverhead to zero did not work. Getting more data.

(In reply to Nate Revo from comment #2)
> I ran into this issue with OCP 4.7 + ODF 4.7 + OCP Virt ?. I am using Ceph
> RBD from ODF/OCS and requested 50GiB.
>
> (combined from similar events): server error. command SyncVMI failed:
> "LibvirtError(Code=1, Domain=10, Message='internal error: process exited
> while connecting to monitor: 2021-07-08T20:19:47.296918Z qemu-kvm: -device
> ide-hd,bus=ide.2,drive=libvirt-1-format,id=ua-rootdisk-install,bootindex=2,
> write-cache=on: Cannot get 'write' permission without 'resize': Image size
> is not a multiple of request alignment')"

Do you know what the IO size of the underlying storage is? It's not 512B, is it? You can try to recreate the disk manually as below, just to confirm it is kubevirt creating a disk with an incorrect size for your storage. Make sure at step 10 you use a size that aligns with the IO size of your storage (I think with Ceph it should be a multiple of 4194304 bytes).

~~~
1. Find the VM:

$ oc get vmi
NAME                   AGE   PHASE     IP             NODENAME
rhel7-blushing-aphid   39h   Running   10.129.2.47    openshift-worker-0
rhel7-usual-weasel     9s    Running   10.129.3.247   openshift-worker-0

2. Find the pod. It may fail quickly (~5m) as the VM fails to start, so we probably need to be quick.

$ oc get pods | grep weasel
virt-launcher-rhel7-usual-weasel-4llrq   1/1   Running   0   19s

3. Debug the pod (this creates a copy of it):

$ oc debug virt-launcher-rhel7-usual-weasel-4llrq

4. Go to the UI and stop/shut down the VM (the debug pod should still be alive after that; we do this to avoid simultaneous access to the disks and locking failures).

5. Back on the debug pod, enter this directory:

sh-4.4# cd /var/run/kubevirt-private/vmi-disks/

6. Stat and qemu-img all disks (each in a directory), and also collect mount output:

sh-4.4# for d in `ls`; do echo $d; qemu-img info $d/disk.img; stat $d/disk.img; done
sh-4.4# mount

7. Manually run qemu-io against each disk (replace the path below for each file); the error should reproduce on the problematic disk.

sh-4.4# qemu-io -f raw -t none /var/run/kubevirt-private/vmi-disks/disk-0/disk.img
qemu-io> quit

NOTE: we just want to know if the prompt opens. The above is a successful run against the disk; move on to the next one.

8. If it fails to open, continue with the rest of the plan below.

9. cd into the directory of the disk that failed to open above, and run qemu-img info:

sh-4.4# cd <directory of the failed disk, i.e. disk-0>
sh-4.4# qemu-img info disk.img

10. Delete the file and create a new one manually, using the size in bytes from the output of the command above, adjusted for alignment with the underlying storage.

sh-4.4# rm disk.img
sh-4.4# qemu-img create -f raw disk.img <ALIGNED SIZE HERE>
~~~

How reproducible:
Always
Steps to Reproduce:
1. Prepare 4K aligned fake block storage
# fallocate -l 10GiB /exports/iscsi
# losetup -f -b 4096 /exports/iscsi
2. Add it as an iscsi target
o- / ......................................................................................................................... [...]
o- backstores .............................................................................................................. [...]
| o- block .................................................................................................. [Storage Objects: 1]
| | o- iscsi ......................................................................... [/dev/loop0 (10.0GiB) write-thru activated]
| | o- alua ................................................................................................... [ALUA Groups: 1]
| | o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
o- iscsi ............................................................................................................ [Targets: 2]
| o- iqn.2003-01.org.linux-iscsi.fedora.x8664:sn.7898909edacd .......................................................... [TPGs: 1]
| | o- tpg1 ............................................................................................... [no-gen-acls, no-auth]
| | o- acls .......................................................................................................... [ACLs: 1]
| | | o- iqn.2003-01.org.linux-iscsi:shift .................................................................... [Mapped LUNs: 1]
| | | o- mapped_lun0 ................................................................................. [lun1 block/iscsi (rw)]
| | o- luns .......................................................................................................... [LUNs: 2]
| | | o- lun0 .................................................................... [block/iscsi (/dev/loop0) (default_tg_pt_gp)]
| | o- portals .................................................................................................... [Portals: 1]
3. Use it as a PV on OCP
kind: PersistentVolume
apiVersion: v1
metadata:
name: iscsi-4k
spec:
capacity:
storage: 10Gi
iscsi:
targetPortal: '192.168.200.254:3260'
iqn: 'iqn.2003-01.org.linux-iscsi.fedora.x8664:sn.7898909edacd'
lun: 0
iscsiInterface: default
fsType: ext4
initiatorName: 'iqn.2003-01.org.linux-iscsi:shift'
accessModes:
- ReadWriteOnce
4. Create VM disk with it
5. Run VM
2021-07-14T01:12:26.055417Z qemu-kvm: -device virtio-blk-pci-non-transitional,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-disk-0,write-cache=on: Cannot get 'write' permission without 'resize': Image size is not a multiple of request alignment
6. Inside the pod, we see the disk as sda; note it is 4K:
sh-4.4# lsblk -o name,phy-sec,log-sec,alignment,min-io,opt-io,mountpoint
NAME PHY-SEC LOG-SEC ALIGNMENT MIN-IO OPT-IO MOUNTPOINT
sda 4096 4096 0 4096 10485760 /run/kubevirt-private/vmi-disks/disk-0
7. Reproduce the error:
sh-4.4# qemu-io -f raw -t none /run/kubevirt-private/vmi-disks/disk-0/disk.img
qemu-io: can't open device /run/kubevirt-private/vmi-disks/disk-0/disk.img: Cannot get 'write' permission without 'resize': Image size is not a multiple of request alignment
8. Check size
sh-4.4# qemu-img info /run/kubevirt-private/vmi-disks/disk-0/disk.img
image: /run/kubevirt-private/vmi-disks/disk-0/disk.img
file format: raw
virtual size: 7.56 GiB (8117488128 bytes)
disk size: 4 KiB
9. The size does not fall on a storage alignment boundary:
8117488128 / 4096 = 1981808.625
10. Delete disk.img and create a new file with the correct size for alignment; the VM then runs.
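The "correct size" in step 10 is plain round-up arithmetic; a minimal sketch using the numbers from the qemu-img and lsblk output above (the round-up-to-block expression is a generic idiom, not code taken from kubevirt):

```shell
# Round the image's virtual size up to the next multiple of the storage
# block size; numbers taken from the qemu-img/lsblk output above.
size=8117488128                                   # virtual size in bytes
block=4096                                        # PHY-SEC reported by lsblk
aligned=$(( (size + block - 1) / block * block ))
echo "$aligned"                                   # 8117489664
echo $(( aligned % block ))                       # 0 -> multiple of 4096
```

Recreating the file with this size (qemu-img create -f raw disk.img 8117489664) should then pass qemu's alignment check.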
Can I get a copy of the ENTIRE virt-launcher pod? We have seen the resize message after other failures in certain cases, and I want to rule that out first.

ENTIRE virt-launcher pod logs, I mean.

Yup, attaching it now. By the way, the latest customer report suggests setting filesystemOverhead to zero worked for newly created disks.

Okay, looking at the logs, the first thing that happens is the error with the alignment, so no conclusion can be drawn other than that being the problem. As you noted, this is hard-coded, and we would want to somehow make this configurable, as there seems to be a large variety of block sizes. Or maybe we can look it up with fstat.

One more thing that might be the problem: is the VM using cloud-init? There is a bug that got fixed involving the alignment of the generated cloud-init image.

(In reply to Alexander Wels from comment #11)
> Okay, looking at the logs, the first thing that happens is the error with
> the alignment. So no other conclusion can be drawn than that is the problem.
> As you noted this is hard coded, and we would want to somehow make this
> configureable as there seems to be a large variety of block sizes. or maybe
> we can look it up with fstat.

Or just do the rounding based on an arbitrarily big block size (e.g. 32M) instead of 512, so that all smaller multiples are automatically aligned too.

Verified with the following code:
---------------------------------------------
oc version
Client Version: 4.9.0-202107292313.p0.git.1557476.assembly.stream-1557476
Server Version: 4.9.0-0.nightly-2021-08-04-025616
Kubernetes Version: v1.21.1+8268f88
[cnv-qe-jenkins@stg10-kevin-jdtkw-executor kev]$ oc get csv -n openshift-cnv
NAME DISPLAY VERSION REPLACES PHASE
kubevirt-hyperconverged-operator.v4.9.0 OpenShift Virtualization 4.9.0 kubevirt-hyperconverged-operator.v2.6.5 Succeeded
Verified with the following scenario:
---------------------------------------------
1. Connect to a host on the cluster:
oc debug node/stg10-kevin-jdtkw-worker-0-7f69w
sh-4.4# chroot /host/
2. Prepare 4K aligned fake block storage
fallocate -l 10GiB /var/iscsi
losetup -f -b 4096 /var/iscsi
losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE  DIO LOG-SEC
/dev/loop1         0      0         0  0 /var/iscsi   0    4096
3. Install the filesystem and mount the device:
mkfs.ext4 -b 4096 /var/iscsi
mkdir /var/hpvolumes/4k
mount /dev/loop1 /var/hpvolumes/4k/
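As a sanity check (a hypothetical helper, not part of the original verification steps), the mounted filesystem's block size can be read back via statfs; the mount point in the example is the one created above:

```shell
# Report the block size of the filesystem containing a given path, using
# stat's filesystem mode (statfs). For the ext4 volume formatted with
# -b 4096 above, this should print 4096.
fs_block_size() {
    stat -f -c %s "$1"
}
# Example: fs_block_size /var/hpvolumes/4k
```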
4. Label the device:
ls -ltrZ /var/hpvolumes/4k
total 16
drwx------. 2 root root system_u:object_r:unlabeled_t:s0 16384 Aug 19 15:30 lost+found
sudo chcon -t container_file_t -R /var/hpvolumes/4k
ls -ltrZ /var/hpvolumes/4k
total 16
drwx------. 2 root root system_u:object_r:container_file_t:s0 16384 Aug 19 15:30 lost+found
5. Create a special pv pointing to the loopback device created: (see yaml below)
6. Create the vm: (see yaml below)
oc create -f cirros-vm.yaml
virtualmachine.kubevirt.io/vm-cirros-dv-2 created
7. oc get dv -w
NAME PHASE PROGRESS RESTARTS AGE
cirros-dv-2 Succeeded 100.0% 9s
8. Start vm:
oc get vm
NAME AGE STATUS READY
vm-cirros-dv-2 28s Stopped False
virtctl start vm-cirros-dv-2
VM vm-cirros-dv-2 was scheduled to start
oc get vmi
NAME AGE PHASE IP NODENAME READY
vm-cirros-dv-2 17s Running 10.128.2.178 stg10-kevin-jdtkw-worker-0-7f69w True
[cnv-qe-jenkins@stg10-kevin-jdtkw-executor kev]$ virtctl console vm-cirros-dv-2
Successfully connected to vm-cirros-dv-2 console. The escape sequence is ^]
>>>>> VMI is running and vm can be accessed
Moving to verified!
Additional information:
---------------------------------------------------------------------------------
special pv yaml:
apiVersion: v1
kind: PersistentVolume
metadata:
name: test-pv-volume
labels:
type: local
spec:
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- stg10-kevin-jdtkw-worker-0-7f69w
storageClassName: test-sc
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/var/hpvolumes/4k"
vm yaml:
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
labels:
kubevirt.io/vm: vm-cirros-dv-2
name: vm-cirros-dv-2
spec:
dataVolumeTemplates:
- metadata:
name: cirros-dv-2
spec:
pvc:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100M
storageClassName: test-sc
source:
http:
url: "http://xxx/xxx/xxx/cirros-0.4.0-x86_64-disk.qcow2"
running: false
template:
metadata:
labels:
kubevirt.io/vm: vm-datavolume
spec:
domain:
devices:
disks:
- disk:
bus: virtio
name: datavolumevolume
machine:
type: ""
resources:
requests:
memory: 64M
terminationGracePeriodSeconds: 0
volumes:
- dataVolume:
name: cirros-dv-2
name: datavolumevolume
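Once the DataVolume import succeeds, the resulting disk image can be checked for alignment directly. This helper is hypothetical (the path in the example comes from the earlier debug steps, and 4096 is the block size used in this scenario):

```shell
# Report whether a file's size is a multiple of the given block size.
is_aligned() {
    size=$(stat -c %s "$1")    # file size in bytes
    if [ $(( size % $2 )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}
# Example: is_aligned /run/kubevirt-private/vmi-disks/disk-0/disk.img 4096
```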
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4104
Description of problem:

I think this is a bug, so raising one with what we have, as the customer has not replied with the additional info we requested for troubleshooting.

The customer is using IBM Spectrum Scale storage, with a 1MiB block size. If they create a new disk and attach it to a VM using that storage, the VM fails to start because the disk is not aligned.

{"component":"virt-launcher","level":"error","msg":"internal error: qemu unexpectedly closed the monitor: 2021-06-15T15:16:06.788431Z qemu-kvm: -device virtio-blk-pci-non-transitional,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-disk-0,write-cache=on: Cannot get 'write' permission without 'resize': Image size is not a multiple of request alignment","pos":"qemuProcessReportLogError:2111","subcomponent":"libvirt","thread":"290","timestamp":"2021-06-15T15:16:06.811000Z"}

Here: https://github.com/qemu/qemu/blob/8e6dad2028d01b7f9ec76cf3b83457fab57fa1eb/block.c#L2309

Our current theory is that the "fs overhead" code does a round down based on a hardcoded block size (512), here: https://github.com/kubevirt/containerized-data-importer/blob/2014ddecfd39ba69fe75970ce613f31c32d2b182/pkg/importer/data-processor.go#L364

So it reduces the size by the overhead percentage and then rounds it down to a multiple of 512. This may fail to set the file size correctly when the storage block size is not 512; for this customer it was 1MiB.

How reproducible: Unknown (I tried on NFS but it worked; not the right scenario though)
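The theory can be illustrated with shell arithmetic. This is a sketch under assumed numbers (a 10Gi request and a 5.5% filesystem overhead, CDI's default), not the actual CDI code: shrinking the request by the overhead and rounding down to the hard-coded 512 bytes yields a size that is fine for 512-byte storage but misaligned for a 1MiB block size.

```shell
# Assumed inputs: 10Gi requested, 5.5% filesystem overhead (CDI default).
request=$(( 10 * 1024 * 1024 * 1024 ))   # 10737418240 bytes
usable=$(( request * 945 / 1000 ))       # subtract ~5.5% overhead
rounded=$(( usable / 512 * 512 ))        # hard-coded 512-byte round down
echo $(( rounded % 512 ))      # 0      -> aligned for 512B-sector storage
echo $(( rounded % 1048576 ))  # 838656 -> misaligned for 1MiB blocks
```

Rounding to a generous common multiple instead (the 32M suggested in the comments above) would keep all smaller power-of-two block sizes aligned, since any multiple of 32MiB is also a multiple of 4KiB, 1MiB, and so on.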