Bug 918915

Summary: [vdsm] | Negative Flow | VDSM service is not operational (fails to commit storage actions) in case LVM operation disturbed
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: vdsm
Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED WONTFIX
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: low
Version: 3.2.0
CC: abaron, acathrow, bazulay, hateya, iheim, jkt, lpeer, pstehlik
Target Milestone: ---
Keywords: Triaged
Target Release: 3.3.0
Hardware: x86_64
OS: Unspecified
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-29 13:38:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: logs (flags: none)

Description Elad 2013-03-07 08:43:04 UTC
Created attachment 706421
logs

Description of problem:

When an LVM operation is interrupted and its process enters a zombie state, superVDSM gets stuck and does not recover when the operation continues. As a result, the vdsm service fails to complete any storage action, such as getDeviceList or getStoragePoolInfo, and the pool becomes non-operational.


Version-Release number of selected component (if applicable):
vdsm-4.10.2-10.0.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run one or more LVM operations on VDSM (like CreateVolume or DeleteVolume)
2. Run the following script:

while true; do kill -STOP `pgrep lvm` && sleep 10 && kill -CONT `pgrep lvm`; done
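
The pause-and-resume cycle from step 2 can be tried against a harmless process before pointing it at LVM. The sketch below is a hypothetical Python equivalent and not part of the original test: `pause_resume` and its parameters are illustrative names, and it targets a single PID for one cycle rather than looping over every process matched by `pgrep lvm`.

```python
import os
import signal
import subprocess
import sys
import time


def pause_resume(pid, pause_seconds):
    """Send SIGSTOP to pid, wait pause_seconds, then send SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(pause_seconds)
    finally:
        # Always resume the target, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Stand-in for an LVM command: a child process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(5)"])
    pause_resume(child.pid, 1)
    print("child still running:", child.poll() is None)
    child.kill()
    child.wait()
```

A process that was only stopped should keep running after SIGCONT, which is the behaviour the expected results below ask of superVDSM's callers.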

  
Actual results:
superVDSM gets stuck when the LVM operation is stopped and does not return to normal when the LVM operation continues.

Note that the lvm processes remained in a zombie state.
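
On Linux, the state letter in /proc/<pid>/stat shows whether a process is stopped or a zombie: `T` means stopped by a signal such as SIGSTOP, while `Z` means the process has exited but has not yet been reaped by its parent. The helper below is a minimal sketch with a hypothetical name (`proc_state`); it is not part of vdsm and assumes a Linux /proc filesystem.

```python
import os


def proc_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open("/proc/%d/stat" % pid) as stat_file:
        data = stat_file.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last closing parenthesis.
    return data.rsplit(")", 1)[1].split()[0]


if __name__ == "__main__":
    # Report the state of the current process as a quick smoke test.
    print(proc_state(os.getpid()))
```

Running this against the PIDs returned by `pgrep lvm` during the test would show which of the two states the processes are actually in.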

Expected results:
superVDSM should return to normal when the LVM operation continues.

Additional info:
See the attached logs.

Comment 1 Yaniv Bronhaim 2013-03-10 17:03:39 UTC
I don't see any malfunction in supervdsm in that log. It looks like vdsm never restarted in this log, and each operation that failed has a valid reason for the failure. Can you point me to the specific location where you think supervdsm gets stuck?

You get a repeated error about a wrong operator that vdsm receives. Could it be that you are sending the request incorrectly?

421ec672-0978-4d7b-8973-e9f83d9eb99a::WARNING::2013-03-06 15:41:59,867::task::633::TaskManager.Task::(_dump) Task._dump: object zeroImage_e1d7d91d-3132-4001-942f-9ec412c268b2: <bound method BlockStorageDomain.zeroImage of <storage.blockSD.BlockStorageDomain instance at 0x7f524c0bf8c0>> (args: ('ad3962a4-30b8-47b1-a3df-cf3bd852cb20', 'e1d7d91d-3132-4001-942f-9ec412c268b2', {'a1b84906-73b8-4a75-a598-7c254c19c274': ImgsPar(imgs=('e1d7d91d-3132-4001-942f-9ec412c268b2',), parent='00000000-0000-0000-0000-000000000000')}) kwargs: {}) skipping field runcmd
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 629, in _dump
    "character" % KEY_SEPERATOR)
ValueError: field and value cannot include = character

Do you use the engine to run this test?

About the error with CMD_LOWPRIO, I think it was fixed as part of bug 918469

And about:
Thread-111441::DEBUG::2013-03-06 15:07:55,971::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m session -r 11 -u' (cwd None)
Thread-111441::DEBUG::2013-03-06 15:07:56,505::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-111441::ERROR::2013-03-06 15:07:56,506::supervdsmServer::81::SuperVdsm.ServerCallback::(wrapper) Error in readSessionInfo
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer.py", line 145, in readSessionInfo
    return _readSessionInfo(sessionID)
  File "/usr/share/vdsm/storage/iscsi.py", line 83, in readSessionInfo
    raise OSError(errno.ENOENT, "No such session")
OSError: [Errno 2] No such session

This is probably because you killed the lvm processes and the iscsi session was not created. That sounds pretty reasonable.
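
The last line of that traceback is how Python renders an OSError built from an errno and a message, so a caller can recognise this case by checking `errno` instead of parsing the text. A minimal sketch follows; `read_session_info` here is a stand-in that always fails and `session_exists` is a hypothetical helper, so neither is the actual vdsm function.

```python
import errno


def read_session_info(session_id):
    """Stand-in that always fails the way the traceback above shows."""
    raise OSError(errno.ENOENT, "No such session")


def session_exists(session_id):
    """Return False for a missing session and re-raise any other OS error."""
    try:
        read_session_info(session_id)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return False
        raise
    return True


if __name__ == "__main__":
    print(session_exists(11))
```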

Comment 2 Barak 2013-03-24 16:47:15 UTC
Ayal,

Is the above scenario interesting?
Have we ever encountered such a phenomenon in the past?

Comment 3 Barak 2013-03-24 16:47:50 UTC
Elad,

Is this a new test?

Comment 4 Haim 2013-03-24 17:20:35 UTC
(In reply to comment #3)
> Elad,
> 
> Is this a new test?

We are always adding new tests, so to answer your question: yes, it's a new test.

Comment 5 Ayal Baron 2013-03-24 23:03:42 UTC
> 421ec672-0978-4d7b-8973-e9f83d9eb99a::WARNING::2013-03-06
> 15:41:59,867::task::633::TaskManager.Task::(_dump) Task._dump: object
> zeroImage_e1d7d91d-3132-4001-942f-9ec412c268b2: <bound method
> BlockStorageDomain.zeroImage of <storage.blockSD.BlockStorageDomain instance
> at 0x7f524c0bf8c0>> (args: ('ad3962a4-30b8-47b1-a3df-cf3bd852cb20',
> 'e1d7d91d-3132-4001-942f-9ec412c268b2',
> {'a1b84906-73b8-4a75-a598-7c254c19c274':
> ImgsPar(imgs=('e1d7d91d-3132-4001-942f-9ec412c268b2',),
> parent='00000000-0000-0000-0000-000000000000')}) kwargs: {}) skipping field
> runcmd
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 629, in _dump
>     "character" % KEY_SEPERATOR)
> ValueError: field and value cannot include = character

This is a logging issue that has been fixed and is irrelevant here.
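
For context, the quoted warning is consistent with a key=value serializer that refuses any field containing the separator: the dumped arguments include text such as `imgs=` and `parent=`, which would presumably be why the `runcmd` field is skipped. The following is a minimal sketch of that kind of check, assuming a plain `=` separator; `dump_field` is a hypothetical name and this is not the actual code from vdsm's task.py.

```python
KEY_SEPARATOR = "="


def dump_field(key, value):
    """Serialize one field as key=value, rejecting the separator in either part."""
    key, value = str(key), str(value)
    if KEY_SEPARATOR in key or KEY_SEPARATOR in value:
        raise ValueError(
            "field and value cannot include %s character" % KEY_SEPARATOR)
    return key + KEY_SEPARATOR + value


if __name__ == "__main__":
    print(dump_field("state", "running"))
    try:
        # A repr with keyword-style arguments contains the separator.
        dump_field("runcmd", "ImgsPar(imgs=(), parent=None)")
    except ValueError as exc:
        print("skipping field runcmd:", exc)
```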


(In reply to comment #2)
> Ayal,
> 
> Is the above scenario interesting?
> Have we ever encountered such a phenomenon in the past?

*If* supervdsm hangs then we should understand why (and see if it is worth fixing); otherwise I wouldn't bother with it.

Comment 6 Barak 2013-03-27 13:58:12 UTC
The above scenario is synthetic and is not likely to happen in a real deployment.
Hence moving it to 3.3.

Comment 7 Andrew Cathrow 2013-07-08 15:22:33 UTC
Same comment as #6. Is there any reason not to close this?
Moving to 3.4 for now.