Bug 973640 - LUN addition to iSCSI storage domain fails and web admin shows invalid state of the storage domain
Summary: LUN addition to iSCSI storage domain fails and web admin shows invalid state ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: ---
Target Release: 3.3.0
Assignee: Ayal Baron
QA Contact: Katarzyna Jachim
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-06-12 11:57 UTC by Petr Spacek
Modified: 2016-02-10 18:45 UTC
CC List: 15 users

Fixed In Version: is13
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-21 22:15:22 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments: used_lun_shown_as_unused.png (attachment 760115, see comment 1)


Links:
System ID: oVirt gerrit 17919 | Private: 0 | Priority: None | Status: None | Summary: None | Last Updated: Never

Description Petr Spacek 2013-06-12 11:57:15 UTC
Description of problem:
Adding a LUN to an existing iSCSI storage domain (according to https://access.redhat.com/site/solutions/17434 [Option 2]) failed, and the web admin portal shows nonsensical information about the affected storage domain.


Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.30.el6ev.x86_64
vdsm-4.10.2-22.0.el6ev.x86_64
(RHEV-M 3.2.0-11.30.el6ev and most up-to-date RHEL hypervisors.)


How reproducible:
Unknown - I don't have another cluster to destroy.


Steps to Reproduce/What I did:
Follow https://access.redhat.com/site/solutions/17434 [Option 2]:
To expand storage by adding a new LUN to an existing storage domain:
    Create a new LUN on the SAN
    Edit the Storage Domain
    Make sure the new LUN is discovered (In case of iSCSI, the Discovery and Login steps described in the Admin Guide should be followed)
    Add the LUN to the existing Storage Domain. This will make the Storage Domain grow by the size of the added LUN.


Actual results:
Messages in web admin portal were:
* Storage Domain vm1-belka was updated by admin@internal
* Data Center is being initialized, please wait for initialization to complete.
* Storage Pool Manager runs on Host laika (Address: laika...).
(^^^ SPM jumped to the other hypervisor. It was running on host "belka..." before the storage expansion.)
* Failed to extend Storage Domain vm1-belka.
(Timestamp for log correlation: 2013-Jun-12, 13:15)


Observation from web admin portal:
- Storage domain values 'Size' and 'Available' are larger.
- Storage domain properties in web admin do not show the new LUN as used: the LUN ID is still listed with an unchecked checkbox, see the attached image.


Observation from hypervisor:
- VG scan on one of the hypervisors shows this:
# vgs
  VG                                   #PV #LV #SN Attr   VSize   VFree  
  bd06c2cb-3246-4e1b-8d73-737cc165e7a5   2  33   0 wz--n- 478.75g 311.62g

There are two PVs but the web admin shows only one!

- pvdisplay shows the same information:
# pvdisplay 
  --- Physical volume ---
  PV Name               /dev/mapper/1belka-vm1
  VG Name               bd06c2cb-3246-4e1b-8d73-737cc165e7a5
  PV Size               200.00 GiB / not usable 384.00 MiB
  Allocatable           yes 
  PE Size               128.00 MiB
  Total PE              1597
  Free PE               260
  Allocated PE          1337
  PV UUID               jSYina-zLmW-om8z-q761-LuVC-UZT7-Pg9CSV
   
  --- Physical volume ---
  PV Name               /dev/mapper/1belka-vm2
  VG Name               bd06c2cb-3246-4e1b-8d73-737cc165e7a5
  PV Size               279.40 GiB / not usable 278.26 MiB
  Allocatable           yes 
  PE Size               128.00 MiB
  Total PE              2233
  Free PE               2233
  Allocated PE          0
  PV UUID               M9gWXH-yXvm-UgSt-TwVe-476y-nm5L-FHZcc4

Both PVs are used by VG bd06c2cb-3246-4e1b-8d73-737cc165e7a5.



Expected results:
Storage domain is extended properly :-)

Additional info:
Ping me if you want to access the machines directly.

Comment 1 Petr Spacek 2013-06-12 12:00:35 UTC
Created attachment 760115 [details]
Used LUN is shown as unused in Web Admin Portal

The red arrow in the picture used_lun_shown_as_unused.png points to the problematic LUN. I attempted to add this LUN to the storage domain, but Web Admin threw an error. The checkbox is unchecked, but LVM reports that the PV representing the LUN is in use.

Comment 4 Pavel Zhukov 2013-06-12 12:36:38 UTC
Looks like vdsm doesn't give us a traceback for the unexpected error. It's a pity.

Thread-109::ERROR::2013-06-12 13:15:34,522::task::850::TaskManager.Task::(_setError) Task=`921429b7-2ee8-4275-91d0-7f4c6e4b8086`::Unexpected error
Thread-109::DEBUG::2013-06-12 13:15:34,525::task::869::TaskManager.Task::(_run) Task=`921429b7-2ee8-4275-91d0-7f4c6e4b8086`::Task._run: 921429b7-2ee8-4275-91d0-7f4c6e4b8086 ('bd06c2cb-3246-4e1b-8d73-737cc165e7a5', 'bb2b1888-ac09-41a8-ba53-5b86aa3417a2', ['1belka-vm2'], False) {} failed - stopping task

Comment 5 Pavel Zhukov 2013-06-12 13:27:24 UTC
Human-readable dm names are used:

Thread-19::DEBUG::2013-06-12 13:15:41,400::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1belka-vm1|1belka-vm2|1laika-vm1|1laika-vm2|35000cca03ca52d20%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free bd06c2cb-3246-4e1b-8d73-737cc165e7a5' (cwd None)
Thread-19::DEBUG::2013-06-12 13:15:41,620::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '  Found duplicate PV M9gWXHyXvmUgStTwVe476ynm5LFHZcc4: using /dev/mapper/1belka-vm2 not /dev/mapper/35000cca03ca52d20p1\n'; <rc> = 0

Comment 6 Petr Spacek 2013-06-12 13:50:12 UTC
Interesting. These 'human-readable names' were produced by the RHEL 6 iSCSI initiator in its default configuration, without any modification to /etc/multipath.conf etc.

Comment 7 Humble Chirammal 2013-06-13 03:34:39 UTC
Pavel, can you attach an LC (log collector report) from this setup? Out of curiosity, does this device '35000cca03ca52d20' have a partition ('....not /dev/mapper/35000cca03ca52d20p1')?

Comment 8 Petr Spacek 2013-06-13 06:32:39 UTC
/dev/mapper/35000cca03ca52d20 is a physical disk and /dev/mapper/35000cca03ca52d20p1 was the only partition on that disk.

This partition was exported via /etc/tgtd/targets.conf as LUN 2 on iSCSI target belka.idm.lab.eng.brq.redhat.com:belka-vm1.

Excerpt from /etc/tgtd/targets.conf:
<target belka.idm.lab.eng.brq.redhat.com:belka-vm1>
    <backing-store /dev/vg_belka/lv_vm1>
        lun 1
        scsi_sn 7614adbc6e528292
        scsi_id belka-vm1
    </backing-store>
    <backing-store /dev/disk/by-id/scsi-35000cca03ca52d20-part1>
        lun 1
        scsi_sn 5000cca03ca52d2001
        scsi_id belka-vm2
    </backing-store>

    initiator-address 10.34.47.1
    initiator-address 10.34.47.2
</target>

AFAIK the same partition was accessible via /dev/mapper/35000cca03ca52d20p1 and /dev/mapper/1belka-vm2. Is this the root cause of the problem?

Comment 9 Petr Spacek 2013-06-13 10:43:38 UTC
Would it help if I force LVM to use only /dev/sd.* names?

For example, I can use this LVM filter (/etc/lvm/lvm.conf):
filter = [ "a|^/dev/sd.*$|", "r/.*/" ]

I.e. LVM will use iSCSI devices only via names /dev/sdb, /dev/sdc etc.
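
For illustration, that filter would go in the devices section of /etc/lvm/lvm.conf, roughly like this (a sketch of the suggestion above, not a recommended production setting):

devices {
    # accept plain SCSI device nodes only, reject everything else
    filter = [ "a|^/dev/sd.*$|", "r/.*/" ]
}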

Comment 12 Humble Chirammal 2013-06-13 12:58:46 UTC
If I am referring to the correct timestamp:

pvcreate was tried on 

/dev/mapper/1laika-vm1
/dev/mapper/1belka-vm2 
/dev/mapper/1belka-vm1 

when two of the PVs were already in use? (I am not sure why we want to try pvcreate like that):

physical volume "/dev/mapper/1laika-vm1" of volume group "c67449cb-e5ec-4342-9c56-af1a67cf5868" 

&&

physical volume "/dev/mapper/1belka-vm1" of volume group "bd06c2cb-3246-4e1b-8d73-737cc165e7a5" 



Thread-107::DEBUG::2013-06-12 13:02:35,326::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm pvcreate --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1belka-vm1|1belka-vm2|1laika-vm1|35000cca03ca52d20%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --test --metadatasize 128m --metadatacopies 2 --metadataignore y /dev/mapper/1laika-vm1 /dev/mapper/1belka-vm2 /dev/mapper/1belka-vm1' (cwd None)

Which produced a warning from LVM:


Thread-107::DEBUG::2013-06-12 13:02:35,537::misc::83::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '  TEST MODE: Metadata will NOT be updated and volumes will not be (de)activated.\n  Found duplicate PV pCGBIojdpiPc8tl7qwJZ2aqSPjY1Y3uw: using /dev/mapper/1belka-vm2 not /dev/mapper/35000cca03ca52d20p1\n  Can\'t initialize physical volume "/dev/mapper/1laika-vm1" of volume group "c67449cb-e5ec-4342-9c56-af1a67cf5868" without -ff\n  Can\'t initialize physical volume "/dev/mapper/1belka-vm1" of volume group "bd06c2cb-3246-4e1b-8d73-737cc165e7a5" without -ff\n'; <rc> = 5
Thread-107::DEBUG::2013-06-12 13:02:35,539::lvm::809::Storage.LVM::(testPVCreate) rc: 5, out: ['  Physical volume "/dev/mapper/1belka-vm2" successfully created'], err: ['  TEST MODE: Metadata will NOT be updated and volumes will not be (de)activated.', '  Found duplicate PV pCGBIojdpiPc8tl7qwJZ2aqSPjY1Y3uw: using /dev/mapper/1belka-vm2 not /dev/mapper/35000cca03ca52d20p1', '  Can\'t initialize physical volume "/dev/mapper/1laika-vm1" of volume group "c67449cb-e5ec-4342-9c56-af1a67cf5868" without -ff', '  Can\'t initialize physical volume "/dev/mapper/1belka-vm1" of volume group "bd06c2cb-3246-4e1b-8d73-737cc165e7a5" without -ff'], unusedDevs: set(['/dev/mapper/1belka-vm2']), usedDevs: set(['/dev/mapper/1belka-vm1', '/dev/mapper/1laika-vm1'])

However, it looks like vgextend worked fine:



Thread-109::DEBUG::2013-06-12 13:15:33,632::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgextend --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1belka-vm1|1belka-vm2|1laika-vm1|1laika-vm2|35000cca03ca52d20%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " bd06c2cb-3246-4e1b-8d73-737cc165e7a5 /dev/mapper/1belka-vm2' (cwd None)



Thread-109::DEBUG::2013-06-12 13:15:33,924::lvm::921::Storage.LVM::(extendVG) Cache after extending vg {'bd06c2cb-3246-4e1b-8d73-737cc165e7a5': Stub(name='bd06c2cb-3246-4e1b-8d73-737cc165e7a5', stale=True)}
Thread-109::DEBUG::2013-06-12 13:15:33,925::persistentDict::167::Storage.PersistentDict::(transaction) Starting transaction


Thread-109::INFO::2013-06-12 13:15:33,925::blockSD::779::Storage.StorageDomain::(readMetadataMapping) META MAPPING: {u'PV0': {u'uuid': u'jSYina-zLmW-om8z-q761-LuVC-UZT7-Pg9CSV', u'pestart': u'0', u'mapoffset': u'0', u'pecount': u'1597', 'guid': u'1belka-vm1'}}




Thread-109::DEBUG::2013-06-12 13:15:34,148::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm pvs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1belka-vm1|1belka-vm2|1laika-vm1|1laika-vm2|35000cca03ca52d20%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size /dev/mapper/1belka-vm2' (cwd None)

....

However, later I can see a rollback because of an AttributeError in persistentDict:



Thread-109::WARNING::2013-06-12 13:15:34,340::persistentDict::178::Storage.PersistentDict::(transaction) Error in transaction, rolling back changes
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/persistentDict.py", line 170, in transaction
    yield
  File "/usr/share/vdsm/storage/blockSD.py", line 669, in updateMapping
    mapping = self.getMetaDataMapping(self.sdUUID, self.readMetadataMapping())
  File "/usr/share/vdsm/storage/blockSD.py", line 620, in getMetaDataMapping
    pvlist = lvm.listPVNames(vgName)
  File "/usr/share/vdsm/storage/lvm.py", line 1262, in listPVNames
    vgPVs = [pv.name for pv in pvs if pv.vg_name == vgName]
AttributeError: 'Stub' object has no attribute 'vg_name'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


***

Thread-109::DEBUG::2013-06-12 13:15:34,521::lvm::402::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-109::ERROR::2013-06-12 13:15:34,522::task::850::TaskManager.Task::(_setError) Task=`921429b7-2ee8-4275-91d0-7f4c6e4b8086`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 684, in extendStorageDomain
    pool.extendSD(sdUUID, devlist, force)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 2018, in extendSD
    sdCache.produce(sdUUID).extend(devlist, force)
  File "/usr/share/vdsm/storage/blockSD.py", line 696, in extend
    self.updateMapping()
  File "/usr/share/vdsm/storage/blockSD.py", line 675, in updateMapping
    del self._metadata[key]
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/share/vdsm/storage/persistentDict.py", line 170, in transaction
    yield
  File "/usr/share/vdsm/storage/blockSD.py", line 669, in updateMapping
    mapping = self.getMetaDataMapping(self.sdUUID, self.readMetadataMapping())
  File "/usr/share/vdsm/storage/blockSD.py", line 620, in getMetaDataMapping
    pvlist = lvm.listPVNames(vgName)
  File "/usr/share/vdsm/storage/lvm.py", line 1262, in listPVNames
    vgPVs = [pv.name for pv in pvs if pv.vg_name == vgName]
AttributeError: 'Stub' object has no attribute 'vg_name'

Where PV_FIELDS are uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size.
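
To make the failure mode above concrete, here is a minimal, self-contained Python sketch -- simplified stand-ins, not the actual vdsm classes or cache -- of how a stale 'Stub' entry mixed into the PV list breaks a comprehension that expects full PV records:

# Illustrative sketch only; PV and Stub are simplified stand-ins.
from collections import namedtuple

# Fields mirror PV_FIELDS above.
PV = namedtuple('PV', 'uuid name size vg_name vg_uuid pe_start pe_count '
                      'pe_alloc_count mda_count dev_size')

class Stub(object):
    """Placeholder left in the cache after invalidation; carries no PV fields."""
    def __init__(self, name, stale=True):
        self.name = name
        self.stale = stale

def list_pv_names(vg_name, pvs):
    # Same pattern as the failing line: every entry is assumed to carry
    # .vg_name, but a Stub does not, so the comprehension raises.
    return [pv.name for pv in pvs if pv.vg_name == vg_name]

cache = [
    PV('jSYina-...', '/dev/mapper/1belka-vm1', 0, 'bd06c2cb-...', '',
       0, 1597, 1337, 0, 0),
    Stub('/dev/mapper/1belka-vm2'),  # the freshly added LUN, not reloaded yet
]

try:
    list_pv_names('bd06c2cb-...', cache)
except AttributeError as exc:
    print(exc)  # 'Stub' object has no attribute 'vg_name'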

Comment 13 Ayal Baron 2013-09-09 06:44:12 UTC
We introduced a change that updates the list of LUNs in the domain upon activation of the domain, so deactivating and reactivating the domain should solve the issue.
Note, however, that this only works for V2 domains and up.
This will not be fixed for older-version domains (which includes export domains).
The issue there is that although the LUNs would be updated, the metadata mapping would not.
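
For completeness, the deactivate/reactivate workaround can also be driven through the REST API. A minimal sketch, assuming the RHEV-M 3.x deactivate/activate actions on attached storage domains; the host name, credentials and UUIDs below are placeholders:

# Minimal sketch of the workaround above; BASE, AUTH, DC and SD are placeholders.
import requests

BASE = 'https://rhevm.example.com/api'
AUTH = ('admin@internal', 'password')
DC = 'DATACENTER-UUID'
SD = 'STORAGE-DOMAIN-UUID'

def sd_action(verb):
    url = '%s/datacenters/%s/storagedomains/%s/%s' % (BASE, DC, SD, verb)
    r = requests.post(url, data='<action/>', auth=AUTH,
                      headers={'Content-Type': 'application/xml'},
                      verify=False)  # lab setups often use self-signed certificates
    r.raise_for_status()

sd_action('deactivate')  # the domain goes to Maintenance
sd_action('activate')    # reactivation refreshes the LUN list (V2 domains and up)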

Comment 14 Aharon Canan 2013-09-09 15:09:03 UTC
Hi all, 

I am probably missing something here, but:
executing the same steps [Option 2] using is12 works fine, even without reactivating.

Everything was updated automatically and successfully.

[option2] - from https://access.redhat.com/site/solutions/17434
======
1. Create a new LUN on the SAN
2. Edit the Storage Domain
3. Make sure the new LUN is discovered (In case of iSCSI, the Discovery and Login steps described in the Admin Guide should be followed)
4. Add the LUN to the existing Storage Domain. This will make the Storage Domain grow by the size of the added LUN.


Please let me know if you think I did something wrong; otherwise I will mark this as verified.

Comment 15 Ayal Baron 2013-09-12 11:18:00 UTC
You could make getMetaDataMapping fail (after vgextend) so that you'd get the same end result.
The change that was introduced allows the user to get out of this situation.

Comment 16 Katarzyna Jachim 2013-09-27 11:33:14 UTC
Verified on is13 vs is12.

Exact steps to reproduce:
* create an iscsi non-master storage domain
* modify /usr/share/vdsm/storage/blockSD.py by adding the line:
   raise se.MetaDataMappingError("testing")
  right after 'pvlist = list(lvm.listPVNames(vgName))' in the getMetaDataMapping function
* restart vdsm
* try to extend the storage domain (it fails)
* revert changes in vdsm
* restart vdsm

In is12, even after restarting vdsm and deactivating and reactivating the storage domain, it was still in an incorrect state (it was extended, but RHEV-M showed the added LUN as not used by any domain; however, when I wanted to add it to a domain, I got a warning that the LUN is used by a VG).

In is13, after deactivating and reactivating the storage domain, the information shown by RHEV-M was correct (the LUN was shown as used by the storage domain we extended).

Comment 17 Itamar Heim 2014-01-21 22:15:22 UTC
Closing - RHEV 3.3 Released

Comment 18 Itamar Heim 2014-01-21 22:22:36 UTC
Closing - RHEV 3.3 Released

