Bug 1670993

Summary: Some VMIs using DataVolume fail to boot
Product: Container Native Virtualization (CNV)
Component: Storage
Version: 1.4
Target Release: 1.4
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Type: Bug
Reporter: Sergi Jimenez Romero <sjr>
Assignee: Alexander Wels <awels>
QA Contact: Natalie Gavrielov <ngavrilo>
CC: alitke, awels, cnv-qe-bugs, fsimonce, jparrill, ncredi, ngavrilo, sgordon, sjr, sreichar
Fixed In Version: virt-cdi-importer-container-v1.4.0-7
Last Closed: 2019-02-26 13:24:18 UTC

Description Sergi Jimenez Romero 2019-01-30 13:24:27 UTC
Description of problem:

Using a dataVolumeTemplate, I tried to create a CentOS-based VM that pulls its image from a public URL (see the CentOS VM definition below).

The import appears to finish properly and the DV shows events indicating it has synced, but when the VMI starts it gets stuck in GRUB with the message:

error: attempt to read or write outside of disk `hd0`.
error: attempt to read or write outside of disk `hd0`.
error: attempt to read or write outside of disk `hd0`.
...
Entering rescue mode...
grub rescue>

I then tried Fedora and Debian, with similar results. The only image that has worked for me is Cirros (see the Cirros VM definition below), even though its definition is identical except for the URL field.

CentOS VM: https://paste.fedoraproject.org/paste/3LrJVSiol4HQCY8q~xuCmw
Cirros VM: https://paste.fedoraproject.org/paste/fk~cwC4BLNw0M0nu2kjjCw
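
In case the paste links expire: both definitions follow the same layout as the Fedora YAML in comment 12 below, differing only in the source URL. A minimal sketch of the relevant dataVolumeTemplates fragment (the name and URL here are placeholders, not the exact values from the pastes):

  dataVolumeTemplates:
  - metadata:
      name: centos-dv
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: glusterfs-storage
      source:
        http:
          # placeholder; the actual public CentOS image URL was in the paste above
          url: https://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud.qcow2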


Version-Release number of selected component (if applicable):

Installed using: kubevirt-ansible-0.12.2-1.acde806.noarch.rpm
Running on: openshift v3.11.59

How reproducible:

Always

Steps to Reproduce:
1. Apply the VM definitions above.
2. Connect to the CentOS VMI using VNC (see the command sketch below).
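
A minimal command sketch for the reproduction, assuming the CentOS definition is saved as centos-vm.yaml and produces a VMI named vm-centos (both names are placeholders):

$ oc apply -f centos-vm.yaml   # creates the VM and its DataVolume
$ oc get dv                    # wait for the import to finish
$ virtctl vnc vm-centos        # the VNC console shows the grub rescue prompt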

Actual results:


error: attempt to read or write outside of disk `hd0`.
error: attempt to read or write outside of disk `hd0`.
error: attempt to read or write outside of disk `hd0`.
...
Entering rescue mode...
grub rescue>

Expected results:

The VMI boots.

Additional info:

Comment 1 Steve Reichard 2019-01-30 22:25:04 UTC
I saw the same failure trying to create a Fedora-based VM backed by a DV. My golden-image/clone-based Fedora 29 VM works.

Comment 2 Nelly Credi 2019-01-31 11:12:52 UTC
@Adam, can you please take a look?

Comment 3 Adam Litke 2019-01-31 12:37:00 UTC
Alexander, please investigate.

Comment 9 Federico Simoncelli 2019-02-11 10:05:25 UTC
Is the build ready? Should we move to ON_QA?

Comment 10 Adam Litke 2019-02-11 13:37:04 UTC
Yes, it is fixed in virt-cdi-importer-container-v1.4.0-7.

Comment 11 Natalie Gavrielov 2019-02-18 11:31:02 UTC
@Sergi, this sort of scenario usually works for me. Can you please supply the YAMLs you used so I can see what is different?

Comment 12 Steve Reichard 2019-02-18 13:36:55 UTC
Sergi is on PTO this week, but since I had seen the problem too, here is my YAML:

[root@spr-master 1.4]# cat f29vm-dv.yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  creationTimestamp: null
  labels:
    kubevirt.io/vm: f29vm-dv
  name: f29vm-dv
spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: f29dv
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: glusterfs-storage
      source:
        http:
          url: http://ftp.usf.edu/pub/fedora/linux/releases/29/Cloud/x86_64/images/Fedora-Cloud-Base-29-1.2.x86_64.qcow2
    status: {}
  running: true
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vm: f29vm-dv
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
        resources:
          requests:
            memory: 1024M
      terminationGracePeriodSeconds: 0
      volumes:
      - dataVolume:
          name: f29dv
        name: datavolumedisk1
[root@spr-master 1.4]#

Comment 13 Alexander Wels 2019-02-18 13:40:49 UTC
@Steve,

With the latest build you will find that the import fails with a message indicating your PVC is not large enough. That particular image has a virtual size of 4G, and you are only allocating 2G of storage for it.
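
For anyone hitting this, the virtual size can be checked before sizing the PVC with qemu-img (the same tool the importer invokes in its logs); a sketch, with the output abbreviated:

$ qemu-img info --output=json Fedora-Cloud-Base-29-1.2.x86_64.qcow2
{
    "virtual-size": 4294967296,    <-- must fit in the PVC
    "filename": "Fedora-Cloud-Base-29-1.2.x86_64.qcow2",
    "format": "qcow2",
    ...
}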

Comment 14 Natalie Gavrielov 2019-02-20 00:59:08 UTC
@Alexander, I tried the following:
On gluster storage, importing a RHEL image:
1. To a 6Gi PVC:
   A. qcow2 (1.4G), qcow2.xz (440M), qcow2.gz (618M)
      result: Virtual image size 10737418240 is larger than available size 6333411328, shrink not yet supported.
   B. raw (10G), raw.xz (441M), raw.gz (626M)
      result: write /data/disk.img: no space left on device
2. To a 16Gi PVC:
   A. qcow2, qcow2.xz, qcow2.gz (same sizes as 1.A)
      result: Available space less than requested size, resizing image to available space 14166634496
      (for qcow2 it was 15575592960 the other two were 14166634496)
      Note: VMIs that were created using those PVCs were operational.
   B. raw, raw.xz, raw.gz (same sizes as 1.B)
      result: Available space less than requested size, resizing image to available space 6224830464.
      (the resizing for raw: 6224830464, raw.xz: 6224523264, raw.gz: 6224547840)
      Note: It seems the VMIs created from these PVCs were not booting
      (the VM is running, but connecting with virtctl results in a black screen).

So for 2.B we still see the problem; is the fix expected to take care of such cases?

Comment 15 Alexander Wels 2019-02-20 01:19:48 UTC
We didn't create a fix for the 2.B cases. They are the same as the upload case (although raw should be a streaming case, and thus should work), where the requirement is 2x the actual (uncompressed) size plus the virtual size.
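
As a worked example of that rule using the numbers from comment 14: the raw RHEL image is 10G uncompressed with a 10G virtual size, so the requirement is roughly

  2 x 10G (actual, uncompressed) + 10G (virtual) = 30G

which is well above the 16Gi PVC used in 2.B, consistent with those imports failing.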

Comment 16 Natalie Gavrielov 2019-02-20 09:35:59 UTC
OK, I'm marking this issue as verified based on comment 14 and comment 15.
But we still need to figure out why the import of the raw image didn't work as expected.

Comment 17 Federico Simoncelli 2019-02-22 12:58:02 UTC
(In reply to Natalie Gavrielov from comment #16)
> But we still need to figure out why the import for raw image didn't work as
> expected.

Natalie, do you want to open another bug to track that?
(Please drop a reference here if you do, thank you.)

It seems worthy of a properly tracked investigation, a fix if needed, and verification.

Comment 19 errata-xmlrpc 2019-02-26 13:24:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0417

Comment 20 Natalie Gavrielov 2019-02-27 12:41:49 UTC
(In reply to Federico Simoncelli from comment #17)
> (In reply to Natalie Gavrielov from comment #16)
> > But we still need to figure out why the import for raw image didn't work as
> > expected.
> 
> Natalie do you want to open another bug to track that?
> (Please drop a reference here if you do so, thank you)
> 
> It seems worth of a proper tracked investigation / fix if needed and
> verification.

I'm unable to reproduce the issue described in comment 14 (2.B).
I created a 16Gi PVC on gluster, imported a 10G raw RHEL image, and this time I got what was expected to begin with:

$ oc logs -f importer-pvc-on-gluster-for-vmi-rhel-pt4m9
I0227 11:50:56.151603       1 importer.go:45] Starting importer
I0227 11:50:56.152822       1 importer.go:65] begin import process
I0227 11:50:56.153893       1 importer.go:91] begin import process
I0227 11:50:56.153923       1 dataStream.go:299] copying "http://cnv-executor-ngavrilo.example.com/rhel-guest-image-7.6-258.x86_64.raw" to "/data/disk.img"...
I0227 11:50:56.223657       1 util.go:38] begin import...
I0227 11:52:17.486813       1 prlimit.go:107] ExecWithLimits qemu-img, [info --output=json /data/disk.img]
W0227 11:52:18.622974       1 dataStream.go:349] Available space less than requested size, resizing image to available space 16962891776.                     <-- here
I0227 11:52:18.623579       1 dataStream.go:355] Expanding image size to: 16962891776                                                                         <-- here
I0227 11:52:18.623604       1 prlimit.go:107] ExecWithLimits qemu-img, [resize -f raw /data/disk.img 16962891776]
I0227 11:52:22.023738       1 importer.go:98] import complete

Created a VMI and it was running successfully (I was able to connect).

I decided to try out raw.xz and raw.gz again:

raw.xz:
$ oc logs -f importer-pvc-on-gluster-for-vmi-rhel-4bf99
Error from server (BadRequest): container "importer" in pod "importer-pvc-on-gluster-for-vmi-rhel-4bf99" is waiting to start: ContainerCreating
[cloud-user@cnv-executor-ngavrilo-master1 ~]$ oc logs -f importer-pvc-on-gluster-for-vmi-rhel-4bf99
I0227 12:12:48.480909       1 importer.go:45] Starting importer
I0227 12:12:48.481266       1 importer.go:65] begin import process
I0227 12:12:48.483015       1 importer.go:91] begin import process
I0227 12:12:48.483060       1 dataStream.go:299] copying "http://cnv-executor-ngavrilo.example.com/rhel-guest-image-7.6-258.x86_64.raw.xz" to "/data/disk.img"...
I0227 12:12:48.636975       1 util.go:38] begin import...
I0227 12:16:27.391886       1 prlimit.go:107] ExecWithLimits qemu-img, [info --output=json /data/disk.img]
W0227 12:16:28.516438       1 dataStream.go:349] Available space less than requested size, resizing image to available space 16962891776.                     <-- here
I0227 12:16:28.516926       1 dataStream.go:355] Expanding image size to: 16962891776                                                                         <-- here
I0227 12:16:28.516971       1 prlimit.go:107] ExecWithLimits qemu-img, [resize -f raw /data/disk.img 16962891776]
I0227 12:16:30.917404       1 importer.go:98] import complete

Created a VMI and it was running successfully (I was able to connect).

raw.gz:
$ oc logs -f importer-pvc-on-gluster-for-vmi-rhel-blrbc
I0227 12:27:45.794889       1 importer.go:45] Starting importer
I0227 12:27:45.796109       1 importer.go:65] begin import process
I0227 12:27:45.797188       1 importer.go:91] begin import process
I0227 12:27:45.797229       1 dataStream.go:299] copying "http://cnv-executor-ngavrilo.example.com/rhel-guest-image-7.6-258.x86_64.raw.gz" to "/data/disk.img"...
I0227 12:27:45.826682       1 util.go:38] begin import...
I0227 12:29:06.864527       1 prlimit.go:107] ExecWithLimits qemu-img, [info --output=json /data/disk.img]
W0227 12:29:08.796986       1 dataStream.go:349] Available space less than requested size, resizing image to available space 16962891776.                     <-- here
I0227 12:29:08.797506       1 dataStream.go:355] Expanding image size to: 16962891776                                                                         <-- here
I0227 12:29:08.797535       1 prlimit.go:107] ExecWithLimits qemu-img, [resize -f raw /data/disk.img 16962891776]
I0227 12:29:11.459718       1 importer.go:98] import complete

Note:
My environment was reprovisioned since comment 14, so it might be running a different build/release.
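
A sketch of how to confirm which importer build actually ran, using the pod name from the log above:

$ oc get pod importer-pvc-on-gluster-for-vmi-rhel-pt4m9 -o jsonpath='{.spec.containers[*].image}'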