Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1085338

Summary: [vdsm] disk resize fails with QImgError if the lv isn't extended fast enough.
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.4.0
Hardware: x86_64
OS: Unspecified
Severity: urgent
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Elad <ebenahar>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Elad <ebenahar>
CC: adahms, amureini, bazulay, ebenahar, gklein, iheim, kwolf, lpeer, scohen, tnisan, yeylon
Flags: amureini: Triaged+
Target Milestone: ---
Target Release: 3.4.0
Whiteboard: storage
Fixed In Version: vdsm-4.14.7-0.1.beta3.el6ev
Doc Type: Bug Fix
Type: Bug
Regression: ---
oVirt Team: Storage
Last Closed: 2014-06-09 13:30:14 UTC
Attachments: logs from engine and host

Description Elad 2014-04-08 12:04:31 UTC
Created attachment 884000 [details]
logs from engine and host

Description of problem:
Resizing a disk that resides on a block domain fails when the pool type is reported as file (NFS) by getStoragePoolInfo.


Version-Release number of selected component (if applicable):
rhevm-3.4.0-0.12.beta2.el6ev.noarch
vdsm-4.14.2-0.2.el6ev.x86_64
libvirt-0.10.2-29.el6_5.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.7.x86_64

How reproducible:
Not always

Steps to Reproduce:
On a shared DC with 2 storage domains of different types (file/block), make sure that the master domain is the file domain (NFS in my case). Check that the pool type is NFS:

# vdsClient -s 0 getStoragePoolInfo

1. Create a VM and attach a 1G disk located on the block domain
2. Resize the disk (extend it by an additional 5G)


Actual results:
Pool type is NFS:

[root@green-vdsa ~]# vdsClient -s 0 getStoragePoolInfo `vdsClient -s 0 getConnectedStoragePoolsList`
        name = dc1
        isoprefix = /rhev/data-center/mnt/wolf.qa.lab.tlv.redhat.com:_volumes_wolf_shared_shared__iso__domain1/7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6/images/11111111-1111-1111-1111-111111111111
        pool_status = connected
        lver = 0
        spm_id = 1
        master_uuid = 7efff64e-72bf-4369-9adf-8633fdef0a32
        version = 3
        domains = 7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6:Active,1c841bd5-e0eb-4dff-9c15-1271dab0ac9a:Active,ae6330a3-bfae-4aef-ab4b-71074a336285:Active,7efff64e-72bf-4369-9adf-8633fdef0a32:Active,abdeeae9-be09-40d5-ab28-48cc3e1b98ae:Active,e2a17289-66a0-4c3a-b5db-93518b3ff648:Attached
        type = NFS
        master_ver = 7
        7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6 = {'status': 'Active', 'diskfree': '7193745162240', 'isoprefix': '/rhev/data-center/mnt/wolf.qa.lab.tlv.redhat.com:_volumes_wolf_shared_shared__iso__domain1/7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6/images/11111111-1111-1111-1111-111111111111', 'alerts': [], 'disktotal': '7261350002688', 'version': 0}
        1c841bd5-e0eb-4dff-9c15-1271dab0ac9a = {'status': 'Active', 'diskfree': '42278584320', 'isoprefix': '', 'alerts': [], 'disktotal': '53284438016', 'version': 3}
        ae6330a3-bfae-4aef-ab4b-71074a336285 = {'status': 'Active', 'diskfree': '19327352832', 'isoprefix': '', 'alerts': [], 'disktotal': '52881784832', 'version': 3}
        7efff64e-72bf-4369-9adf-8633fdef0a32 = {'status': 'Active', 'diskfree': '681718054912', 'isoprefix': '', 'alerts': [], 'disktotal': '1056894353408', 'version': 3}
        abdeeae9-be09-40d5-ab28-48cc3e1b98ae = {'status': 'Active', 'diskfree': '681718054912', 'isoprefix': '', 'alerts': [], 'disktotal': '1056894353408', 'version': 3}
        e2a17289-66a0-4c3a-b5db-93518b3ff648 = {'status': 'Attached', 'isoprefix': '', 'alerts': []}



vdsm fails to resize the disk:

Thread-260379::ERROR::2014-04-08 14:48:05,824::dispatcher::70::Storage.Dispatcher.Protect::(run) ecode=1, stdout=[], stderr=['Error resizing image (28)'], message=None
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 62, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1159, in prepare
    raise self.error
QImgError: ecode=1, stdout=[], stderr=['Error resizing image (28)'], message=None

Expected results:
Disk resize should succeed even when the domain holding the disk is of a different type than the pool type

Additional info: logs from engine and host

Comment 2 Allon Mureinik 2014-04-22 12:01:50 UTC
Mixed pools were only introduced in 3.4, so this can't be a regression.

Comment 3 Elad 2014-04-22 12:04:45 UTC
Allon, the regression is in the disk resize itself

Comment 4 Allon Mureinik 2014-04-27 15:03:28 UTC
Elad, does this reproduce in a non-shared DC too?

Comment 5 Elad 2014-04-28 11:24:10 UTC
(In reply to Allon Mureinik from comment #4)
> Elad, does this reproduce in a non-shared DC too?

Works fine in a local DC for both preallocated and thin-provisioned disks

Comment 6 Federico Simoncelli 2014-04-28 12:46:35 UTC
The error is a qemu-img error (errno 28 = ENOSPC):

Thread-260379::DEBUG::2014-04-08 14:47:59,359::utils::556::root::(execCmd) '/usr/bin/qemu-img info -f qcow2 /rhev/data-center/mnt/blockSD/ae6330a3-bfae-4aef-ab4b-71074a336285/images/a32c5423-92cc-4978-a015-b8d782bd74ba/e43aee21-e097-4b3e-b164-c526035ddbc4' (cwd None)

Thread-260379::DEBUG::2014-04-08 14:47:59,557::utils::556::root::(execCmd) '/usr/bin/qemu-img resize -f qcow2 /rhev/data-center/mnt/blockSD/ae6330a3-bfae-4aef-ab4b-71074a336285/images/a32c5423-92cc-4978-a015-b8d782bd74ba/e43aee21-e097-4b3e-b164-c526035ddbc4 16106127360' (cwd None)
Thread-260379::DEBUG::2014-04-08 14:47:59,678::utils::576::root::(execCmd) FAILED: <err> = 'Error resizing image (28)\n'; <rc> = 1

Thread-260379::ERROR::2014-04-08 14:48:05,824::dispatcher::70::Storage.Dispatcher.Protect::(run) ecode=1, stdout=[], stderr=['Error resizing image (28)'], message=None
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 62, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1159, in prepare
    raise self.error
QImgError: ecode=1, stdout=[], stderr=['Error resizing image (28)'], message=None


As far as I can tell, the storage domain had not run out of free space, as the stats were:

{'info': {'name': 'dc1', 'isoprefix': '/rhev/data-center/mnt/wolf.qa.lab.tlv.redhat.com:_volumes_wolf_shared_shared__iso__domain1/7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6/images/11111111-1111-1111-1111-111111111111', 'pool_status': 'connected', 'lver': 0, 'domains': '7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6:Active,1c841bd5-e0eb-4dff-9c15-1271dab0ac9a:Active,ae6330a3-bfae-4aef-ab4b-71074a336285:Active,7efff64e-72bf-4369-9adf-8633fdef0a32:Active,abdeeae9-be09-40d5-ab28-48cc3e1b98ae:Active,e2a17289-66a0-4c3a-b5db-93518b3ff648:Attached', 'master_uuid': '7efff64e-72bf-4369-9adf-8633fdef0a32', 'version': '3', 'spm_id': 1, 'type': 'NFS', 'master_ver': 7}, 'dominfo': {'7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6': {'status': 'Active', 'diskfree': '7193745162240', 'isoprefix': '/rhev/data-center/mnt/wolf.qa.lab.tlv.redhat.com:_volumes_wolf_shared_shared__iso__domain1/7233a711-98e8-4c3c-bcfa-44c4bcc4f6c6/images/11111111-1111-1111-1111-111111111111', 'alerts': [], 'disktotal': '7261350002688', 'version': 0}, '1c841bd5-e0eb-4dff-9c15-1271dab0ac9a': {'status': 'Active', 'diskfree': '42278584320', 'isoprefix': '', 'alerts': [], 'disktotal': '53284438016', 'version': 3}, 'ae6330a3-bfae-4aef-ab4b-71074a336285': {'status': 'Active', 'diskfree': '19327352832', 'isoprefix': '', 'alerts': [], 'disktotal': '52881784832', 'version': 3}, '7efff64e-72bf-4369-9adf-8633fdef0a32': {'status': 'Active', 'diskfree': '681718054912', 'isoprefix': '', 'alerts': [], 'disktotal': '1056894353408', 'version': 3}, 'abdeeae9-be09-40d5-ab28-48cc3e1b98ae': {'status': 'Active', 'diskfree': '681718054912', 'isoprefix': '', 'alerts': [], 'disktotal': '1056894353408', 'version': 3}, 'e2a17289-66a0-4c3a-b5db-93518b3ff648': {'status': 'Attached', 'isoprefix': '', 'alerts': []}}}


 'ae6330a3-bfae-4aef-ab4b-71074a336285': {'alerts': [],
                                          'diskfree': '19327352832',
                                          'disktotal': '52881784832',
                                          'isoprefix': '',
                                          'status': 'Active',
                                          'version': 3},


Bottom line: the call to qemu-img resize failed with ENOSPC and I have no additional info on why.

Kevin any idea on why qemu-img resize may have failed? The command was:

qemu-img resize -f qcow2 /rhev/data-center/mnt/blockSD/ae6330a3-bfae-4aef-ab4b-71074a336285/images/a32c5423-92cc-4978-a015-b8d782bd74ba/e43aee21-e097-4b3e-b164-c526035ddbc4 16106127360
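[Editorial note: a quick sanity check of the errno mapping Federico mentions above, added here for reference and not part of the original comment, can be done from Python, the language vdsm itself is written in:]

```python
import errno
import os

# On Linux, errno 28 is ENOSPC ("No space left on device"), the same
# number qemu-img prints in its "Error resizing image (28)" message.
print(errno.ENOSPC)               # 28 on Linux
print(os.strerror(errno.ENOSPC))
```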

Comment 7 Federico Simoncelli 2014-04-28 14:37:10 UTC
It seems that there were some LV extension requests coming from the VM (watermark):

Thread-256590::INFO::2014-04-08 14:47:26,696::vm::2483::vm.Vm::(extendDrivesIfNeeded) vmId=`280e0ee1-ad2f-4234-9d7a-5908d020ce43`::Requesting extension for volume e43aee21-e097-4b3e-b164-c526035ddbc4 on domain ae6330a3-bfae-4aef-ab4b-71074a336285 (apparent: 1073741824, capacity: 10737418240, allocated: 1080167936, physical: 1073741824)
Thread-256590::DEBUG::2014-04-08 14:47:26,696::vm::2556::vm.Vm::(__extendDriveVolume) vmId=`280e0ee1-ad2f-4234-9d7a-5908d020ce43`::Requesting an extension for the volume: {'newSize': 2048, 'domainID': 'ae6330a3-bfae-4aef-ab4b-71074a336285', 'name': u'vda', 'volumeID': 'e43aee21-e097-4b3e-b164-c526035ddbc4', 'imageID': 'a32c5423-92cc-4978-a015-b8d782bd74ba', 'poolID': '009d5dc0-517f-4c87-a300-f10a6c1e1871'}
Thread-256590::INFO::2014-04-08 14:47:26,697::logUtils::44::dispatcher::(wrapper) Run and protect: sendExtendMsg(spUUID='009d5dc0-517f-4c87-a300-f10a6c1e1871', volDict={'newSize': 2048, 'domainID': 'ae6330a3-bfae-4aef-ab4b-71074a336285', 'name': u'vda', 'volumeID': 'e43aee21-e097-4b3e-b164-c526035ddbc4', 'imageID': 'a32c5423-92cc-4978-a015-b8d782bd74ba', 'poolID': '009d5dc0-517f-4c87-a300-f10a6c1e1871'}, newSize=2147483648, callbackFunc=<bound method Vm.__afterVolumeExtension of <vm.Vm object at 0x7f41941b8b50>>)

Thread-260360::DEBUG::2014-04-08 14:47:27,820::BindingXMLRPC::1062::vds::(wrapper) client [10.35.161.52]::call vmDestroy with ('280e0ee1-ad2f-4234-9d7a-5908d020ce43',) {} flowID [30cf70c4]

Relevant reported sizes: allocated: 1080167936, physical: 1073741824
It seems that there was no SPM to satisfy the requests (or it didn't get to it fast enough) but that's a different issue.

Anyway, the final situation that triggered the issue is that the LV was full and the qemu-img resize failed with ENOSPC:

Thread-260379::DEBUG::2014-04-08 14:47:59,557::utils::556::root::(execCmd) '/usr/bin/qemu-img resize -f qcow2 /rhev/data-center/mnt/blockSD/ae6330a3-bfae-4aef-ab4b-71074a336285/images/a32c5423-92cc-4978-a015-b8d782bd74ba/e43aee21-e097-4b3e-b164-c526035ddbc4 16106127360' (cwd None)
Thread-260379::DEBUG::2014-04-08 14:47:59,678::utils::576::root::(execCmd) FAILED: <err> = 'Error resizing image (28)\n'; <rc> = 1

Thread-260379::ERROR::2014-04-08 14:48:05,824::dispatcher::70::Storage.Dispatcher.Protect::(run) ecode=1, stdout=[], stderr=['Error resizing image (28)'], message=None
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 62, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1159, in prepare
    raise self.error
QImgError: ecode=1, stdout=[], stderr=['Error resizing image (28)'], message=None


A possible solution would be to catch the specific ENOSPC error, extend the LV, and then retry.
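[Editorial note: as an illustration only, with hypothetical helper names (in vdsm the actual LV extension goes through the SPM mailbox), the catch-and-retry idea proposed above could look roughly like this:]

```python
import errno
import subprocess

def is_enospc(stderr):
    # qemu-img reports the errno in parentheses,
    # e.g. "Error resizing image (28)"
    return "(%d)" % errno.ENOSPC in stderr

def resize_with_extend(path, new_size_bytes, extend_lv, retries=1):
    """Run 'qemu-img resize' and, on ENOSPC, ask the caller-supplied
    extend_lv() callback (hypothetical) to grow the LV, then retry."""
    for attempt in range(retries + 1):
        proc = subprocess.run(
            ["qemu-img", "resize", "-f", "qcow2", path, str(new_size_bytes)],
            capture_output=True, text=True)
        if proc.returncode == 0:
            return
        if is_enospc(proc.stderr) and attempt < retries:
            extend_lv(path)  # grow the underlying LV before retrying
            continue
        raise RuntimeError("qemu-img resize failed: %s" % proc.stderr.strip())
```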

Comment 8 Allon Mureinik 2014-04-29 06:57:33 UTC
(In reply to Elad from comment #5)
> (In reply to Allon Mureinik from comment #4)
> > Elad, does this reproduce in a non-shared DC too?
> 
> Works fine in local DC for both preallocated and for thin-provision disks
Can you please test with a non-mixed block domain? (e.g., iSCSI only)

Comment 9 Elad 2014-04-29 11:04:55 UTC
Tried several live resizes with all combinations (block and file, with thin and preallocated disks) on a non-shared DC.
All of them went fine.

Comment 10 Federico Simoncelli 2014-04-30 08:41:01 UTC
(In reply to Federico Simoncelli from comment #7)
> It seems that there were some LV extension requests coming from the VM
> (watermark):
> 
> Thread-256590::INFO::2014-04-08
> 14:47:26,696::vm::2483::vm.Vm::(extendDrivesIfNeeded)
> vmId=`280e0ee1-ad2f-4234-9d7a-5908d020ce43`::Requesting extension for volume
> e43aee21-e097-4b3e-b164-c526035ddbc4 on domain
> ae6330a3-bfae-4aef-ab4b-71074a336285 (apparent: 1073741824, capacity:
> 10737418240, allocated: 1080167936, physical: 1073741824)
> Thread-256590::DEBUG::2014-04-08
> 14:47:26,696::vm::2556::vm.Vm::(__extendDriveVolume)
> vmId=`280e0ee1-ad2f-4234-9d7a-5908d020ce43`::Requesting an extension for the
> volume: {'newSize': 2048, 'domainID':
> 'ae6330a3-bfae-4aef-ab4b-71074a336285', 'name': u'vda', 'volumeID':
> 'e43aee21-e097-4b3e-b164-c526035ddbc4', 'imageID':
> 'a32c5423-92cc-4978-a015-b8d782bd74ba', 'poolID':
> '009d5dc0-517f-4c87-a300-f10a6c1e1871'}
> Thread-256590::INFO::2014-04-08
> 14:47:26,697::logUtils::44::dispatcher::(wrapper) Run and protect:
> sendExtendMsg(spUUID='009d5dc0-517f-4c87-a300-f10a6c1e1871',
> volDict={'newSize': 2048, 'domainID':
> 'ae6330a3-bfae-4aef-ab4b-71074a336285', 'name': u'vda', 'volumeID':
> 'e43aee21-e097-4b3e-b164-c526035ddbc4', 'imageID':
> 'a32c5423-92cc-4978-a015-b8d782bd74ba', 'poolID':
> '009d5dc0-517f-4c87-a300-f10a6c1e1871'}, newSize=2147483648,
> callbackFunc=<bound method Vm.__afterVolumeExtension of <vm.Vm object at
> 0x7f41941b8b50>>)
> 
> Thread-260360::DEBUG::2014-04-08
> 14:47:27,820::BindingXMLRPC::1062::vds::(wrapper) client
> [10.35.161.52]::call vmDestroy with
> ('280e0ee1-ad2f-4234-9d7a-5908d020ce43',) {} flowID [30cf70c4]
> 
> Relevant reported sizes: allocated: 1080167936, physical: 1073741824
> It seems that there was no SPM to satisfy the requests (or it didn't get to
> it fast enough) but that's a different issue.

This was because of bug 1083476

Comment 11 Allon Mureinik 2014-04-30 13:48:55 UTC
As the underlying issue is bug 1083476, moving this one to ON_QA for verification (which may already have been done, according to comment 9).

Comment 12 Elad 2014-05-13 13:18:01 UTC
(In reply to Allon Mureinik from comment #11)
> As the underlying issue is bug 1083476, moving this one to ON_QA for
> verification (which may already have been done, according to comment 9).

Allon,
My test described in comment #9 was on a non-shared DC. 
The bug occurred while I used a shared DC, so I'm not sure why this bug was moved to ON_QA.

Comment 14 Allon Mureinik 2014-05-14 10:43:40 UTC
(In reply to Elad from comment #12)
> (In reply to Allon Mureinik from comment #11)
> > As the underlying issue is bug 1083476, moving this one to ON_QA for
> > verification (which may already have been done, according to comment 9).
> 
> Allon,
> My test described in comment #9 was on a non-shared DC. 
> The bug occurred while I used a shared DC, so I'm not sure why this bug was
> moved to ON_QA.

The original bug was described as:
> failure to resize a disk which resides on a block domain when the pool type is reported as file (NFS)

This, by definition, is a shared DC.

What am I missing?

Comment 15 Elad 2014-05-14 11:29:55 UTC
On a shared DC with an NFS domain as master and an iSCSI domain:
Created a VM with a disk residing on the iSCSI domain and resized it from 2G to 4G.
The resize succeeded; tried with both preallocated and thin-provisioned disks.

Verified using av9.1

Comment 16 errata-xmlrpc 2014-06-09 13:30:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html