908068 – vdsm: KeyError when calling supervdsm methods

Bug 908068 - vdsm: KeyError when calling supervdsm methods

Summary: vdsm: KeyError when calling supervdsm methods

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.2.0
Assignee:	Yaniv Bronhaim
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:	infra
Duplicates (2):	876561 1010030 (view as bug list)
Depends On:
Blocks:	908394 915537 1010030
TreeView+	depends on / blocked

Reported:	2013-02-05 19:08 UTC by Saggi Mizrahi
Modified:	2018-12-03 18:16 UTC (History)
CC List:	18 users (show)
Fixed In Version:	vdsm-4.10.2-7.0.el6ev.x86_64.rpm
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	908394 908395 1010030 (view as bug list)
Environment:
Last Closed:	2013-10-01 14:15:08 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
supervdsm.py (7.14 KB, text/x-python) 2013-08-20 19:40 UTC, Michael Everette	no flags	Details
new vdsm log (3.10 MB, text/x-log) 2013-08-28 22:19 UTC, Michael Everette	no flags	Details
working log (15.23 MB, text/x-log) 2013-09-02 09:11 UTC, Yaniv Bronhaim	no flags	Details
View All

Description Saggi Mizrahi 2013-02-05 19:08:52 UTC

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
Thread-1540::DEBUG::2013-02-05 18:21:30,500::supervdsm::179::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-1540::ERROR::2013-02-05 18:21:30,506::task::833::TaskManager.Task::(_setError) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2426, in createStorageDomain
    domVersion)
  File "/usr/share/vdsm/storage/nfsSD.py", line 80, in create
    version)
  File "/usr/share/vdsm/storage/nfsSD.py", line 48, in _preCreateValidation
    fileSD.validateDirAccess(domPath)
  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 67, in <lambda>
    **kwargs)
  File "<string>", line 2, in validateAccess
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '150ca90'
---------------------------------------------------------------------------
Thread-1540::DEBUG::2013-02-05 18:21:30,507::task::852::TaskManager.Task::(_run) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::Task._run: 5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582 (1, '442d2999-3e9b-4c3b-b491-6436a088c6bb', 'data', 'shual.eng.lab.tlv.redhat.com:/volumes/shual/integration/rhevm-32-integ-test-4', 1, '3') {} failed - stopping task
Thread-1540::DEBUG::2013-02-05 18:21:30,507::task::1177::TaskManager.Task::(stop) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::stopping in state preparing (force False)
Thread-1540::DEBUG::2013-02-05 18:21:30,507::task::957::TaskManager.Task::(_decref) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::ref 1 aborting True
Thread-1540::INFO::2013-02-05 18:21:30,508::task::1134::TaskManager.Task::(prepare) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::aborting: Task is aborted: u'\n---------------------------------------------------------------------------\nTraceback (most recent call last):\n  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client\n    obj, exposed, gettypeid = id_to_obj[ident]\nKeyError: \'150ca90\'\n---------------------------------------------------------------------------' - code 100
Thread-1540::DEBUG::2013-02-05 18:21:30,508::task::1139::TaskManager.Task::(prepare) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::Prepare: aborted: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '150ca90'
---------------------------------------------------------------------------

Expected results:
No Error

Additional info:

Comment 1 Saggi Mizrahi 2013-02-05 19:13:48 UTC

https://gerrit.eng.lab.tlv.redhat.com/6194

Comment 4 Yaniv Bronhaim 2013-02-06 13:16:11 UTC

Changed the bug link. The bug appears because _firstLaunch was always True due to wrong assignment. This causes svdsm to restart in each call and this leads to a race that the proxy call is sent to old instance of svdsm.

Comment 5 Yaniv Bronhaim 2013-02-06 14:44:28 UTC

The exception was raised when trying to import nfs storage domain. super vdsm  restarted in each proxy call. When trying to create the mount the result above is appeared.

Comment 7 Dafna Ron 2013-02-27 16:30:48 UTC

steps to reproduce:

**TEST FOR BOTH NFS AND ISCSI STORAGE**

1.create one domain in one host cluster
2. block connectivity to the storage from the host
3. look for the error in bug's description.

Comment 8 Elad 2013-03-03 08:36:47 UTC

Tested on SF8.

Comment 9 Haim 2013-03-17 20:04:18 UTC

*** Bug 876561 has been marked as a duplicate of this bug. ***

Comment 10 Itamar Heim 2013-06-11 09:14:26 UTC

3.2 has been released

Comment 11 Itamar Heim 2013-06-11 09:40:25 UTC

3.2 has been released

Comment 14 Michael Everette 2013-08-20 17:48:50 UTC

I will get that from customer and take a look.

Comment 15 Michael Everette 2013-08-20 19:39:59 UTC

From their supervdsm.py:

    def launch(self):
        self._firstLaunch = False
        self._start()

I have also attached it.

Steps they took: 

1) We clicked on the virtual machine.
2) On the lower pane selected the disks tab.
3) we clicked add
4) clicked the external radio button at the top (Direct LUN)
5) Switched the type from ISCSI to Fibre


We also tried this from the disks tab in the top general pane with the same results.  We had to continuously swap the radio button from Internal to External to prompt the rescan until the luns finally appeared (20-50 times).

Comment 16 Michael Everette 2013-08-20 19:40:56 UTC

Created attachment 788614 [details]
supervdsm.py

Comment 17 Yaniv Bronhaim 2013-08-21 14:57:13 UTC

Tried it with vdsm 3.1 (vdsm-4.10.2-0.el6_4.x86_64) ,
Switched between iscsi to Fibre, and didn't get the KeyError.

Is it 100% reproducible on the customer's environment ?? just by switching between iscsi and fiber?

Can you provide full vdsm.log please?

Thanks.

Comment 19 Michael Everette 2013-08-21 16:47:14 UTC

Yaniv,

Here is the vdsm log with the error. Let me know if you need anything else.

Comment 20 Yaniv Bronhaim 2013-08-22 15:03:43 UTC

Now I'm sure that you add the version before the fix I mentioned above.

Look at the log: each supervdsm call initiated new instance of supervdsm 
~: grep  "Launching Super Vdsm" vdsm.log.23 
Thread-4979::DEBUG::2013-08-17 16:01:15,098::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-4981::DEBUG::2013-08-17 16:01:20,868::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5022::DEBUG::2013-08-17 16:02:14,548::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5028::DEBUG::2013-08-17 16:02:20,735::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5022::DEBUG::2013-08-17 16:02:22,748::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5028::DEBUG::2013-08-17 16:02:26,142::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5047::DEBUG::2013-08-17 16:02:47,515::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5047::DEBUG::2013-08-17 16:02:54,859::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
 ....
and so on
$ grep  "Launching Super Vdsm" vdsm.log.23 | wc -l
333

this happened because of the "underscore" bug, without it each call to isRunning() returned false and we launched supervdsm again and again forever.

The KeyError raises when the manager instance is not set, this can be race between supervdsm call and the launching process. but it doesn't matter after the fix. I guess you updated the version later, or fixed it locally. I'm quite sure it won't raise again with the supervdsm.py version you attached here.

The multiply instances of supervdsm process are much worst than this exception print.. if this host still runs, check "ps aux | grep supervdsm | wc -l" - should be 3

Comment 21 Michael Everette 2013-08-22 17:17:46 UTC

Yaniv,

[root@ccrhdb01 ~]# ps aux | grep supervdsm | wc -l
147

Comment 22 Yaniv Bronhaim 2013-08-25 06:36:06 UTC

It means that I'm right. you used to run the old code , each call created new instance of supervdsm process. 

Now the code is fixed, but you still have dirty env. It doesn't hurt in any way, the KeyError should not appear at all, the left processes are there but don't do anything. You can kill them, or wait for the next full restart.

In any case, this specific issue is fixed already in vdsm 3.2

Comment 23 Michael Everette 2013-08-26 15:24:50 UTC

Yaniv,

Updated customer with what you thought. Here is his response.


This cant be the case.

This particular cluster only has one host so when we updated its software all guests were shut down and the server was in maintenance mode.

Consider the order of operations:

1) DB2 RHEV Cluster in 3.0 compatibility mode (only 1 host CCRHDB01)
2) DB2 RHEV Cluster guests all powered off
3) CCRHDB01 is put into maintenance mode
4) CCRHDB01 is updated via 'yum update', in which vdsm* packages are replaced with the version mentioned earlier in the ticket.
5) CCRHDB01 is rebooted and comes back online.
6) CCRHDB01 has its RHEV host properties changed to support v3.1 cluster capabilities
7) The storage domain updates to V3.
8) We attempt to add DirectLUN objects to the guests that run on CCRHDB01 and experience the bug.

Further the process count of supervdsm was taken almost a week after the bug was experienced.  The decreased number of processes could be from some sort of cleanup or expiration event since the bug was encountered.

Comment 24 Yaniv Bronhaim 2013-08-27 17:40:10 UTC

I don't understand it well. Is the exception still repeats? 
Does the customer know exactly when the exception raises? after any specific action someone performed to reproduce it and it happens again and again? any steps that I can flow? because we have only one attached log with that exception, 

from that I only figure the issues that I mentioned above, all related to the code we talked about.

If it continues and repeats in different events, can you share more logs with appearances of the KeyError exception? It will help to investigate the flow that caused it,
But if you don't have any, and didn't repeat, I guess something was changed there, maybe someone fixed the issue manually in the code, but as the code is right now, it shouldn't be raised..

Comment 25 Michael Everette 2013-08-28 17:04:25 UTC

Yaniv,

I have requested the customer to reproduce and collect a new log collector. Is there a particular log you would need other than the vdsm log?

Also as to how they come to this issue here are the steps they took: 

1) We clicked on the virtual machine.
2) On the lower pane selected the disks tab.
3) we clicked add
4) clicked the external radio button at the top (Direct LUN)
5) Switched the type from ISCSI to Fibre


They also tried this from the disks tab in the top general pane with the same results.  They had to continuously swap the radio button from Internal to External to prompt the rescan until the luns finally appeared (20-50 times).

Comment 26 Michael Everette 2013-08-28 22:19:19 UTC

Created attachment 791538 [details]
new vdsm log

Comment 27 Michael Everette 2013-08-28 22:21:06 UTC

Yaniv,

Got current fresh vdsm log.


$ grep  "Launching Super Vdsm" vdsm.log | wc -l
101

Let me know if you need anything else. Customer is getting heated as they have another DirectLUN migration this Saturday.

Comment 30 Yaniv Bronhaim 2013-09-01 15:55:13 UTC

Hey,

I'm trying to reproduce it here asap.. still couldn't. I encountered in few obstacles in the way that might be related to my setup. Trying to add more luns, because it sounds like if there is a bug, it must be related to the luns' scan.

The minute I'll have a reproducer and more knowledge about the error I'll send a reply.

Comment 31 Yaniv Bronhaim 2013-09-02 09:08:14 UTC

Hey,
I ran a setup as described in comment #25, added host, with vdsm-4.10.2-23.0.el6ev.x86_64.rpm), fc connection and few luns configured. Created a fiberchannel dc, added 3 sd and few vms.

Then went to the addDisk tab, 
As you described, switched between iscsi and fiberchannel types, and you are right, same as you say, it takes quite long each refresh.

If I switch between the type before I see the list, its just getting even slower.

I checked in vdsm log, everything is alright, it got rescan\refresh request each switch between the iscsi\fc. No KeyError or bunch of supervdsm processes.

All works properly. Many switches just slower it up because each scan request is handled separately and wait for the previous requests to finish.

I'm attaching the relevant log, you can see that in your long the exception raises each storage.iscsi.rescan call, in mine it doesn't.

Something is missing in your description or the vdsm version is incorrect.
From your log I can't see the vdsm version that ran on the machine in that specific time, so I used the one mentioned in comment #12. 

I'm still convinced the vdsm in that host ran with the code without the attached patch merged. Maybe it was fixed but vdsm was not restarted afterwards.

If you have more info let me know

Comment 32 Yaniv Bronhaim 2013-09-02 09:11:30 UTC

Created attachment 792771 [details]
working log

Comment 35 Yaniv Bronhaim 2013-09-11 16:20:06 UTC

Any new bug on it? can we close this one?

Comment 37 Michael Everette 2013-09-19 18:11:32 UTC

Yaniv, Finally got hold of customer. Here is the update. 


"We are not able to go on a fishing expedition on these servers.  As stated the server was patched and then rebooted, so the VDSMD that is loaded is the new version.

I would like to add that we are seeing the same kind of problem when trying to create a new storage domain on other clusters:

Thread-2142362::DEBUG::2013-09-18 14:05:22,589::supervdsm::179::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-2142362::ERROR::2013-09-18 14:05:22,594::task::850::TaskManager.Task::(_setError) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1806, in getDeviceList
    devices = self._getDeviceList(storageType)
  File "/usr/share/vdsm/storage/hsm.py", line 1838, in _getDeviceList
    for dev in multipath.pathListIter(guids):
  File "/usr/share/vdsm/storage/multipath.py", line 226, in pathListIter
    pathStatuses = devicemapper.getPathsStatus()
  File "/usr/share/vdsm/storage/devicemapper.py", line 197, in getPathsStatus
    return getProxy().getPathsStatus()
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 67, in <lambda>
    **kwargs)
  File "<string>", line 2, in getPathsStatus
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
---------------------------------------------------------------------------
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::869::TaskManager.Task::(_run) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Task._run: 9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad (2,) {} failed - stopping task
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::1194::TaskManager.Task::(stop) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::stopping in state preparing (force False)
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::974::TaskManager.Task::(_decref) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::ref 1 aborting True
Thread-2142362::INFO::2013-09-18 14:05:22,596::task::1151::TaskManager.Task::(prepare) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::aborting: Task is aborted: u'\n---------------------------------------------------------------------------\nTraceback (most recent call last):\n  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client\n    obj, exposed, gettypeid = id_to_obj[ident]\nKeyError: \'251bf50\'\n---------------------------------------------------------------------------' - code 100
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::1156::TaskManager.Task::(prepare) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Prepare: aborted:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
---------------------------------------------------------------------------
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::974::TaskManager.Task::(_decref) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::ref 0 aborting True
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::909::TaskManager.Task::(_doAbort) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Task._doAbort: force False
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::579::TaskManager.Task::(_updateState) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::moving from state preparing -> state aborting
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::534::TaskManager.Task::(__state_aborting) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::_aborting: recover policy none
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::579::TaskManager.Task::(_updateState) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::moving from state aborting -> state failed
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-2142362::ERROR::2013-09-18 14:05:22,597::dispatcher::69::Storage.Dispatcher.Protect::(run)
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1159, in prepare
    raise self.error
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
--------------------------------------------------------------------------"

I will open a new bug as per previous discussion.

Comment 39 Allon Mureinik 2013-10-16 08:45:36 UTC

*** Bug 1010030 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.