Bug 1014942

Summary: [vdsm] Creating several disks at once fails CreateVolume with LVReplaceTagError sometimes
Product: Red Hat Enterprise Virtualization Manager Reporter: Gadi Ickowicz <gickowic>
Component: vdsm    Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA QA Contact: Gadi Ickowicz <gickowic>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.3.0    CC: abaron, acanan, amureini, bazulay, gickowic, iheim, lpeer, nlevinki, nsoffer, scohen, srevivo, yeylon
Target Milestone: ---    Flags: nsoffer: needinfo-
amureini: Triaged+
Target Release: 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: is22 Doc Type: Bug Fix
Doc Text:
The udev cache was used to obtain the device list, but it sometimes returned an incomplete list, which led to missing-volume errors. As a result, creating several disks in parallel could fail. LVM disabled the udev cache in RHEL 6.5, eliminating this error. However, on machines upgraded from earlier versions, lvm does not change the existing configuration after an upgrade; it is the responsibility of the system administrator to review the new lvm configuration and update the old configuration with the new values. The udev cache is now disabled in the vdsm lvm configuration, which ensures correct behavior on upgraded machines as well.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-21 16:17:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1019461, 1032811    
Attachments:
Description Flags
vdsm + engine logs
none
vdsm logs from failed run and successful run
none
var/log/messages for failed and successful run none

Description Gadi Ickowicz 2013-10-03 07:19:33 UTC
Created attachment 806897 [details]
vdsm + engine logs

Description of problem:
Sometimes, when creating several vm disks in parallel (for several vms), one of the disk creations fails with LVReplaceTagError. So far this has only been seen in automated tests.

9fc26478-80c6-4315-a4d0-386f9394f5c0::ERROR::2013-10-02 01:56:20,557::volume::505::Storage.Volume::(create) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/volume.py", line 466, in create
    srcVolUUID, imgPath, volPath)
  File "/usr/share/vdsm/storage/blockVolume.py", line 205, in _create
    addTags=mdTags)
  File "/usr/share/vdsm/storage/lvm.py", line 1198, in changeLVTags
    (lvname, ", ".join(addTags), ", ".join(delTags), err[-1]))
LogicalVolumeReplaceTagError: Replace Logical Volume tag error: ('lv: `48874cca-4e85-4beb-8afe-3d72a0d908cc/2c612fee-8612-4df1-8ff8-899ecd3b98df` add: `MD_6, PU_00000000-0000-0000-0000-000000000000, IU_3825fdcd-1290-4c88-9607-ac7397b4b211` del: `OVIRT_VOL_INITIALIZING` (  Skipping volume group 48874cca-4e85-4beb-8afe-3d72a0d908cc)',)
c26ef2d7-e6b3-4b96-8f67-c6128b6575f0::DEBUG::2013-10-02 01:56:20,559::lvm::439::OperationMutex::(_reloadlvs) Operation 'lvm reload operation' got the operation mutex
9fc26478-80c6-4315-a4d0-386f9394f5c0::DEBUG::2013-10-02 01:56:20,561::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource '48874cca-4e85-4beb-8afe-3d72a0d908cc_imageNS.3825fdcd-1290-4c88-9607-ac7397b4b211'
9fc26478-80c6-4315-a4d0-386f9394f5c0::DEBUG::2013-10-02 01:56:20,563::resourceManager::634::ResourceManager::(releaseResource) Released resource '48874cca-4e85-4beb-8afe-3d72a0d908cc_imageNS.3825fdcd-1290-4c88-9607-ac7397b4b211' (0 active users)
9fc26478-80c6-4315-a4d0-386f9394f5c0::DEBUG::2013-10-02 01:56:20,564::resourceManager::640::ResourceManager::(releaseResource) Resource '48874cca-4e85-4beb-8afe-3d72a0d908cc_imageNS.3825fdcd-1290-4c88-9607-ac7397b4b211' is free, finding out if anyone is waiting for it.
9fc26478-80c6-4315-a4d0-386f9394f5c0::DEBUG::2013-10-02 01:56:20,565::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource '48874cca-4e85-4beb-8afe-3d72a0d908cc_imageNS.3825fdcd-1290-4c88-9607-ac7397b4b211', Clearing records.
c26ef2d7-e6b3-4b96-8f67-c6128b6575f0::DEBUG::2013-10-02 01:56:20,566::lvm::308::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a|/dev/mapper/3514f0c54d5c017fe|/dev/mapper/3514f0c54d5c017ff|/dev/mapper/3514f0c54d5c01800|/dev/mapper/3514f0c54d5c0181b|/dev/mapper/3514f0c54d5c01832|/dev/mapper/3514f0c54d5c01833|/dev/mapper/3514f0c54d5c018ba|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 48874cca-4e85-4beb-8afe-3d72a0d908cc' (cwd None)
9fc26478-80c6-4315-a4d0-386f9394f5c0::ERROR::2013-10-02 01:56:20,574::task::850::TaskManager.Task::(_setError) Task=`9fc26478-80c6-4315-a4d0-386f9394f5c0`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1994, in createVolume
    desc=desc, srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/sd.py", line 411, in createVolume
    preallocate, diskType, volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 466, in create
    srcVolUUID, imgPath, volPath)
  File "/usr/share/vdsm/storage/blockVolume.py", line 205, in _create
    addTags=mdTags)
  File "/usr/share/vdsm/storage/lvm.py", line 1198, in changeLVTags
    (lvname, ", ".join(addTags), ", ".join(delTags), err[-1]))
LogicalVolumeReplaceTagError: Replace Logical Volume tag error: ('lv: `48874cca-4e85-4beb-8afe-3d72a0d908cc/2c612fee-8612-4df1-8ff8-899ecd3b98df` add: `MD_6, PU_00000000-0000-0000-0000-000000000000, IU_3825fdcd-1290-4c88-9607-ac7397b4b211` del: `OVIRT_VOL_INITIALIZING` (  Skipping volume group 48874cca-4e85-4beb-8afe-3d72a0d908cc)',)



Version-Release number of selected component (if applicable):
vdsm-4.12.0-170.el6ev.x86_64

How reproducible:
~50%

Steps to Reproduce:
Reproduced through automated tests
1. Try to create 6 vms in parallel, each with a 10GB thin provisioned disk

Actual results:
Sometimes one of the disk creations will fail

Expected results:
Disk should be created

Additional info:

Comment 1 Nir Soffer 2013-10-09 12:04:12 UTC
1. Can you share the code running in the automated tests that reproduce this?
2. Is it possible that a previous test was not finished before the new test began? Do you run the test with a fresh storage domain or run several tests on the same domain one after another?

Comment 2 Gadi Ickowicz 2013-10-09 12:47:52 UTC
Created attachment 809892 [details]
vdsm logs from failed run and successful run

(In reply to Nir Soffer from comment #1)
> 1. Can you share the code running in the automated tests that reproduce this?
> 2. Is it possible that a previous test was not finished before the new test
> began? Do you run the test with a fresh storage domain or run several tests
> on the same domain one after another?
A new storage domain is used each time.

Attaching an archive with a successful run and a failed run for comparison.

Comment 3 Nir Soffer 2013-10-09 14:25:53 UTC
Can you provide the /var/log/messages log for the duration of both a successful test and a failed test?

Comment 4 Gadi Ickowicz 2013-10-09 14:53:48 UTC
Created attachment 809991 [details]
var/log/messages for failed and successful run

Added /var/log/messages for both a successful and a failed run

Comment 5 Nir Soffer 2013-10-09 15:28:12 UTC
First, please complete the basic details:

RHEL version:
multipathd version:
lvm version:

Second, please run the same test with the previous release: IS 16.

Comment 6 Nir Soffer 2013-10-09 18:38:37 UTC
Please run the same test with build IS 15; we have a theory that the failure is caused by a change introduced in IS 16.

Comment 7 Aharon Canan 2013-10-10 13:33:59 UTC
Nir, 

Please try to ask for all the missing info you need at once.

As Gadi wrote, we added logs of a successful run and a failed run for comparison.
We can't rerun it as we have other things in our pipeline. I am sure you can recheck it without us (you asked to run using is15 and then asked for is16).

I agree that Gadi could attach the automation code, but you can understand the flow from the logs - he also described what the test does.

We will do our best to help, based on our priorities.

Comment 8 Nir Soffer 2013-10-10 19:19:16 UTC
(In reply to Aharon Canan from comment #7)

When your priorities allow and you run this test again, do enable lvm logging.

The lvm default log level is 0, which means no messages are logged even for critical errors. Please set the lvm log level to 7 (LOG_DEBUG) while running this test.

To change the lvm log level, edit /etc/lvm/lvm.conf, find the log section, and set level to 7:

log {
    ...
    level = 7
    ...
}
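
After editing the file, you can double-check the value lvm will actually use before rerunning the test. A minimal sketch, assuming the dumpconfig subcommand in this lvm2 build accepts a configuration node path:

# Print the effective log level as lvm sees it; expect "level=7" after the edit.
lvm dumpconfig log/level

lvm.conf is read by every lvm invocation, so no service restart is needed for the new level to take effect.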

Comment 9 Nir Soffer 2013-10-17 15:36:45 UTC
This bug is caused by an lvm regression in 6.5; see bug 1020401.

Comment 10 Nir Soffer 2013-10-23 16:01:33 UTC
A fix for lvm bug 1020401 is in QA now, and will be available in lvm2-2.02.100-7.el6.

Please test again when this version is available.

If you want to test now, you can edit /etc/lvm/lvm.conf and disable the obtain_device_list_from_udev option:

devices {
    obtain_device_list_from_udev = 0
}
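
For hosts that are already deployed, it may also help to check whether the fixed lvm2 build is installed and how the option is currently set. A minimal sketch using standard tooling:

# Installed lvm2 build; the fix lands in lvm2-2.02.100-7.el6 and later.
rpm -q lvm2

# Current (uncommented) setting in lvm.conf, if any.
grep -E '^[[:space:]]*obtain_device_list_from_udev' /etc/lvm/lvm.conf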

Comment 11 Nir Soffer 2013-11-05 16:26:19 UTC
The patch was merged upstream.

Comment 12 Gadi Ickowicz 2013-11-06 09:11:35 UTC
(In reply to Nir Soffer from comment #10)
> A fix for lvm bug 1020401 is in QA now, and will be available in
> lvm2-2.02.100-7.el6.
This package is installed on our hosts and so far I have not seen this error.
> 
> Please test again when this version is available.
> 
> If you want to test now, you can edit /etc/lvm/lvm.conf and disable
> obtain_device_list_from_udev option:
> 
> devices {
>     obtain_device_list_from_udev = 0
> }

Comment 13 Nir Soffer 2013-11-06 09:28:02 UTC
(In reply to Gadi Ickowicz from comment #12)
> (In reply to Nir Soffer from comment #10)
> > A fix for lvm bug 1020401 is in QA now, and will be available in
> > lvm2-2.02.100-7.el6.
> This package is installed on our hosts and so far I have not seen this error.

It depends on how you installed the host. If you do a clean install of 6.5, then this issue should be resolved with any vdsm build. But if the host was installed with an older release that includes lvm2 < 2.02.100-7, then your lvm.conf will not be updated with the new lvm defaults on upgrade, and the issue will still reproduce with current vdsm builds, but should be fixed with the next build containing the fix.

Do you test machines upgraded from 6.4 to 6.5?

Comment 14 Gadi Ickowicz 2013-11-06 09:55:53 UTC
(In reply to Nir Soffer from comment #13)
> (In reply to Gadi Ickowicz from comment #12)
> > (In reply to Nir Soffer from comment #10)
> > > A fix for lvm bug 1020401 is in QA now, and will be available in
> > > lvm2-2.02.100-7.el6.
> > This package is installed on our hosts and so far I have not seen this error.
> 
> It depends on how you installed the host. If you do a clean install of 6.5,
> then this issue should be resolved with any vdsm build. But if the host was
> installed with an older release that includes lvm2 < 2.02.100-7, then your
> lvm.conf will not be updated with the new lvm defaults on upgrade, and the
> issue will still reproduce with current vdsm builds, but should be fixed with
> the next build containing the fix.
> 
> Do you test machines upgraded from 6.4 to 6.5?

The machine is a clean install of 6.5

Comment 15 Nir Soffer 2013-11-06 10:21:02 UTC
Unless you want to test a machine upgraded from 6.4, you can close this bug.

Comment 16 Nir Soffer 2013-11-06 10:59:24 UTC
Now that we have included our own fix, we no longer depend on bug 1020401.
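
Conceptually, the vdsm fix applies the same knob to the configuration vdsm passes to lvm on the command line (see the --config string in the lvs command quoted in the description), so it no longer depends on what /etc/lvm/lvm.conf contains. A minimal sketch of such a devices section with the udev cache disabled; the surrounding values are illustrative, copied from the logged command, and the actual vdsm patch is not shown in this bug:

devices {
    # Do not build the device list from the udev cache; scan devices directly.
    # The incomplete udev list is what produced the "Skipping volume group"
    # errors above.
    obtain_device_list_from_udev = 0
    preferred_names = [ "^/dev/mapper/" ]
    ignore_suspended_devices = 1
    write_cache_state = 0
    disable_after_error_count = 3
}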

Comment 17 Gadi Ickowicz 2013-11-13 06:48:04 UTC
Verified on is22 - the error has not shown up in any nightly automation runs since is22.

Comment 18 Charlie 2013-11-28 00:34:14 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 19 errata-xmlrpc 2014-01-21 16:17:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html