Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Thread-1540::DEBUG::2013-02-05 18:21:30,500::supervdsm::179::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-1540::ERROR::2013-02-05 18:21:30,506::task::833::TaskManager.Task::(_setError) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2426, in createStorageDomain
    domVersion)
  File "/usr/share/vdsm/storage/nfsSD.py", line 80, in create
    version)
  File "/usr/share/vdsm/storage/nfsSD.py", line 48, in _preCreateValidation
    fileSD.validateDirAccess(domPath)
  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 67, in <lambda>
    **kwargs)
  File "<string>", line 2, in validateAccess
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '150ca90'
---------------------------------------------------------------------------
Thread-1540::DEBUG::2013-02-05 18:21:30,507::task::852::TaskManager.Task::(_run) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::Task._run: 5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582 (1, '442d2999-3e9b-4c3b-b491-6436a088c6bb', 'data', 'shual.eng.lab.tlv.redhat.com:/volumes/shual/integration/rhevm-32-integ-test-4', 1, '3') {} failed - stopping task
Thread-1540::DEBUG::2013-02-05 18:21:30,507::task::1177::TaskManager.Task::(stop) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::stopping in state preparing (force False)
Thread-1540::DEBUG::2013-02-05 18:21:30,507::task::957::TaskManager.Task::(_decref) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::ref 1 aborting True
Thread-1540::INFO::2013-02-05 18:21:30,508::task::1134::TaskManager.Task::(prepare) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::aborting: Task is aborted: u'\n---------------------------------------------------------------------------\nTraceback (most recent call last):\n File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client\n obj, exposed, gettypeid = id_to_obj[ident]\nKeyError: \'150ca90\'\n---------------------------------------------------------------------------' - code 100
Thread-1540::DEBUG::2013-02-05 18:21:30,508::task::1139::TaskManager.Task::(prepare) Task=`5ab4c2ad-ff7f-4e82-a52b-e485cbbf9582`::Prepare: aborted:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '150ca90'
---------------------------------------------------------------------------

Expected results:
No error

Additional info:
https://gerrit.eng.lab.tlv.redhat.com/6194
Changed the bug link. The bug appears because _firstLaunch was always True due to a wrong assignment. This causes supervdsm to restart on each call, which leads to a race where the proxy call is sent to an old instance of supervdsm.
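For readers hitting this later, a minimal sketch of the failure mode. The launch(), isRunning() and _start() names are taken from the supervdsm.py fragments quoted later in this bug; the exact spelling and placement of the misnamed attribute is an assumption, only the effect matters:

    class SuperVdsmProxy(object):
        def __init__(self):
            self._firstLaunch = True

        def launch(self):
            # buggy variant (hypothetical spelling): the assignment misses the
            # leading underscore, so self._firstLaunch is never cleared
            self.firstLaunch = False
            self._start()

        # fixed variant, as in the attached supervdsm.py:
        #     def launch(self):
        #         self._firstLaunch = False
        #         self._start()

        def isRunning(self):
            # with _firstLaunch stuck at True this always reports "not running",
            # so every proxy call relaunches supervdsm, and a call can race with
            # the relaunch and reach a stale instance
            if self._firstLaunch:
                return False
            return True

        def _start(self):
            pass  # placeholder for forking the real supervdsm process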
The exception was raised when trying to import an NFS storage domain. Super VDSM restarted on each proxy call. When trying to create the mount, the result above appeared.
Steps to reproduce (**TEST FOR BOTH NFS AND ISCSI STORAGE**):
1. Create one domain in a one-host cluster.
2. Block connectivity to the storage from the host.
3. Look for the error in the bug's description.
Tested on SF8.
*** Bug 876561 has been marked as a duplicate of this bug. ***
3.2 has been released
I will get that from the customer and take a look.
From their supervdsm.py:

    def launch(self):
        self._firstLaunch = False
        self._start()

I have also attached it.

Steps they took:
1) We clicked on the virtual machine.
2) On the lower pane, selected the Disks tab.
3) We clicked Add.
4) Clicked the External radio button at the top (Direct LUN).
5) Switched the type from iSCSI to Fibre.

We also tried this from the Disks tab in the top General pane with the same results. We had to continuously swap the radio button from Internal to External to prompt the rescan until the LUNs finally appeared (20-50 times).
Created attachment 788614 [details] supervdsm.py
Tried it with vdsm 3.1 (vdsm-4.10.2-0.el6_4.x86_64), switched between iSCSI and Fibre, and didn't get the KeyError. Is it 100% reproducible on the customer's environment, just by switching between iSCSI and Fibre? Can you provide the full vdsm.log please? Thanks.
Yaniv, Here is the vdsm log with the error. Let me know if you need anything else.
Now I'm sure that you were running a version from before the fix I mentioned above. Look at the log: each supervdsm call initiated a new instance of supervdsm:

$ grep "Launching Super Vdsm" vdsm.log.23
Thread-4979::DEBUG::2013-08-17 16:01:15,098::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-4981::DEBUG::2013-08-17 16:01:20,868::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5022::DEBUG::2013-08-17 16:02:14,548::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5028::DEBUG::2013-08-17 16:02:20,735::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5022::DEBUG::2013-08-17 16:02:22,748::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5028::DEBUG::2013-08-17 16:02:26,142::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5047::DEBUG::2013-08-17 16:02:47,515::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
Thread-5047::DEBUG::2013-08-17 16:02:54,859::supervdsm::117::SuperVdsmProxy::(_start) Launching Super Vdsm
... and so on

$ grep "Launching Super Vdsm" vdsm.log.23 | wc -l
333

This happened because of the "underscore" bug: because of it, each call to isRunning() returned False, and we launched supervdsm again and again forever. The KeyError is raised when the manager instance is not set; this can be a race between a supervdsm call and the launching process, but it doesn't matter after the fix. I guess you updated the version later, or fixed it locally. I'm quite sure it won't be raised again with the supervdsm.py version you attached here.

The multiple instances of the supervdsm process are much worse than this exception printout. If this host is still running, check "ps aux | grep supervdsm | wc -l" - it should be 3.
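To illustrate the stale-proxy side of this, here is a self-contained sketch (not vdsm code) of how a proxy that refers to an object id from one manager instance produces exactly this RemoteError/KeyError when its call reaches a manager instance that never registered that id. The Thing class, the DemoManager name, the 'deadbeef' id and the 'demo' authkey are made up for the demo, and it is written against the Python 2.6 multiprocessing.managers shown in the tracebacks:

    from multiprocessing.managers import AutoProxy, BaseManager, Token

    class Thing(object):
        def ping(self):
            return 'pong'

    class DemoManager(BaseManager):
        pass

    DemoManager.register('Thing', callable=Thing)

    manager = DemoManager(authkey='demo')
    manager.start()

    good = manager.Thing()
    print good.ping()          # 'pong' -- the server knows this object id

    # Forge a proxy whose object id the server never registered. This mimics a
    # proxy created against one supervdsm instance whose call ends up reaching
    # a freshly relaunched instance.
    stale = AutoProxy(Token('Thing', manager.address, 'deadbeef'), 'pickle',
                      authkey='demo', exposed=('ping',), incref=False)
    try:
        stale.ping()
    except Exception, err:
        print type(err).__name__   # RemoteError
        print err                   # remote traceback ends with KeyError: 'deadbeef'
                                    # raised in managers.py serve_client()

    manager.shutdown()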
Yaniv,

[root@ccrhdb01 ~]# ps aux | grep supervdsm | wc -l
147
That means I'm right: you used to run the old code, and each call created a new instance of the supervdsm process. Now the code is fixed, but you still have a dirty environment. It doesn't hurt in any way; the KeyError should not appear at all, and the leftover processes are there but don't do anything. You can kill them, or wait for the next full restart. In any case, this specific issue is already fixed in vdsm 3.2.
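If cleanup is preferred over waiting for a restart, a rough sketch for listing the leftover processes before deciding what to kill (the 'supervdsm' filter string and the idea of comparing parent pids against the running vdsm are assumptions, not a documented vdsm cleanup procedure; verify on the host first):

    import subprocess

    # List supervdsm-related processes with their parent pid, so the stale
    # ones can be told apart from the instance owned by the running vdsm.
    ps = subprocess.Popen(['ps', '-eo', 'pid,ppid,cmd'],
                          stdout=subprocess.PIPE)
    for line in ps.communicate()[0].splitlines():
        if 'supervdsm' in line:
            print line

    # Once the stale pids are confirmed, they could be terminated, e.g.:
    #   import os, signal
    #   os.kill(stale_pid, signal.SIGTERM)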
Yaniv,

Updated the customer with what you thought. Here is his response:

This can't be the case. This particular cluster only has one host, so when we updated its software all guests were shut down and the server was in maintenance mode. Consider the order of operations:
1) DB2 RHEV cluster in 3.0 compatibility mode (only one host, CCRHDB01).
2) DB2 RHEV cluster guests all powered off.
3) CCRHDB01 is put into maintenance mode.
4) CCRHDB01 is updated via 'yum update', in which the vdsm* packages are replaced with the version mentioned earlier in the ticket.
5) CCRHDB01 is rebooted and comes back online.
6) CCRHDB01 has its RHEV host properties changed to support v3.1 cluster capabilities.
7) The storage domain updates to V3.
8) We attempt to add DirectLUN objects to the guests that run on CCRHDB01 and experience the bug.

Further, the process count of supervdsm was taken almost a week after the bug was experienced. The decreased number of processes could be from some sort of cleanup or expiration event since the bug was encountered.
I don't understand it well. Does the exception still repeat? Does the customer know exactly when the exception is raised? Is there a specific action someone performs to reproduce it, and does it happen again and again? Are there any steps I can follow? We have only one attached log with that exception, and from it I can only infer the issues I mentioned above, all related to the code we talked about. If it continues and repeats on different occasions, can you share more logs with appearances of the KeyError exception? It would help to investigate the flow that caused it. But if you don't have any, and it didn't repeat, I guess something was changed there; maybe someone fixed the issue manually in the code. As the code stands right now, it shouldn't be raised.
Yaniv,

I have asked the customer to reproduce the issue and run a new log collector. Is there a particular log you would need other than the vdsm log? Also, as to how they ran into this issue, here are the steps they took:
1) We clicked on the virtual machine.
2) On the lower pane, selected the Disks tab.
3) We clicked Add.
4) Clicked the External radio button at the top (Direct LUN).
5) Switched the type from iSCSI to Fibre.

They also tried this from the Disks tab in the top General pane with the same results. They had to continuously swap the radio button from Internal to External to prompt the rescan until the LUNs finally appeared (20-50 times).
Created attachment 791538 [details] new vdsm log
Yaniv,

Got a current, fresh vdsm log.

$ grep "Launching Super Vdsm" vdsm.log | wc -l
101

Let me know if you need anything else. The customer is getting heated, as they have another DirectLUN migration this Saturday.
Hey, I'm trying to reproduce it here ASAP, but still haven't managed to. I encountered a few obstacles along the way that might be related to my setup. I'm trying to add more LUNs, because it sounds like if there is a bug, it must be related to the LUN scan. The minute I have a reproducer and more knowledge about the error, I'll send a reply.
Hey, I ran a setup as described in comment #25: added a host (with vdsm-4.10.2-23.0.el6ev.x86_64.rpm), an FC connection, and a few LUNs configured. Created a Fibre Channel DC, added 3 SDs and a few VMs. Then I went to the Add Disk tab and, as you described, switched between the iSCSI and Fibre Channel types, and you are right, each refresh takes quite long. If I switch the type before I see the list, it just gets even slower. I checked the vdsm log and everything is alright; it got a rescan/refresh request on each switch between iSCSI/FC. No KeyError or bunch of supervdsm processes. Everything works properly. Many switches just slow it down because each scan request is handled separately and waits for the previous requests to finish. I'm attaching the relevant log. You can see that in your log the exception is raised on each storage.iscsi.rescan call; in mine it isn't. Something is missing in your description, or the vdsm version is incorrect. From your log I can't see the vdsm version that ran on the machine at that specific time, so I used the one mentioned in comment #12. I'm still convinced the vdsm on that host ran the code without the attached patch merged. Maybe it was fixed but vdsm was not restarted afterwards. If you have more info, let me know.
Created attachment 792771 [details] working log
Is there any new bug on it? Can we close this one?
Yaniv,

Finally got hold of the customer. Here is the update:

"We are not able to go on a fishing expedition on these servers. As stated, the server was patched and then rebooted, so the VDSMD that is loaded is the new version. I would like to add that we are seeing the same kind of problem when trying to create a new storage domain on other clusters:

Thread-2142362::DEBUG::2013-09-18 14:05:22,589::supervdsm::179::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-2142362::ERROR::2013-09-18 14:05:22,594::task::850::TaskManager.Task::(_setError) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1806, in getDeviceList
    devices = self._getDeviceList(storageType)
  File "/usr/share/vdsm/storage/hsm.py", line 1838, in _getDeviceList
    for dev in multipath.pathListIter(guids):
  File "/usr/share/vdsm/storage/multipath.py", line 226, in pathListIter
    pathStatuses = devicemapper.getPathsStatus()
  File "/usr/share/vdsm/storage/devicemapper.py", line 197, in getPathsStatus
    return getProxy().getPathsStatus()
  File "/usr/share/vdsm/supervdsm.py", line 76, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 67, in <lambda>
    **kwargs)
  File "<string>", line 2, in getPathsStatus
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
---------------------------------------------------------------------------
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::869::TaskManager.Task::(_run) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Task._run: 9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad (2,) {} failed - stopping task
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::1194::TaskManager.Task::(stop) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::stopping in state preparing (force False)
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::974::TaskManager.Task::(_decref) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::ref 1 aborting True
Thread-2142362::INFO::2013-09-18 14:05:22,596::task::1151::TaskManager.Task::(prepare) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::aborting: Task is aborted: u'\n---------------------------------------------------------------------------\nTraceback (most recent call last):\n File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client\n obj, exposed, gettypeid = id_to_obj[ident]\nKeyError: \'251bf50\'\n---------------------------------------------------------------------------' - code 100
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::1156::TaskManager.Task::(prepare) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Prepare: aborted:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
---------------------------------------------------------------------------
Thread-2142362::DEBUG::2013-09-18 14:05:22,596::task::974::TaskManager.Task::(_decref) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::ref 0 aborting True
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::909::TaskManager.Task::(_doAbort) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::Task._doAbort: force False
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::579::TaskManager.Task::(_updateState) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::moving from state preparing -> state aborting
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::534::TaskManager.Task::(__state_aborting) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::_aborting: recover policy none
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::task::579::TaskManager.Task::(_updateState) Task=`9e472f6b-5ba3-40d7-9bd8-ab3a9158d0ad`::moving from state aborting -> state failed
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-2142362::DEBUG::2013-09-18 14:05:22,597::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-2142362::ERROR::2013-09-18 14:05:22,597::dispatcher::69::Storage.Dispatcher.Protect::(run)
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1159, in prepare
    raise self.error
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '251bf50'
---------------------------------------------------------------------------"

I will open a new bug as per previous discussion.
*** Bug 1010030 has been marked as a duplicate of this bug. ***