Created attachment 960458 [details]
engine, vdsm and supervdsm logs

Description of problem:
When running automation tests, different jobs repeatedly fail with a "failed to connect to superVdsm" error message. I haven't managed to reproduce it manually yet. Logs are attached.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run tier1 tests
2.
3.

Actual results:
Tests fail with "failed to connect to superVdsm".

Expected results:
Tests should pass.

Additional info:
Dan, any clue? Seems like an error on bond removal in the supervdsm.log?
Which vdsm version is this?

The log shows a more infrastructural problem with supervdsm:

Thread-13::ERROR::2014-11-23 15:30:09,802::hsm::2433::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2430, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 328, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 222, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 71, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to validateAccess
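For context, this RuntimeError is what vdsm's supervdsm proxy raises when a call to the privileged helper process fails. Below is a minimal sketch of that behavior, assuming a per-function proxy over a connection to supervdsmServer (the names and the _DeadProxy stand-in are hypothetical, simplified for illustration; this is not the actual supervdsm.py code):

class _DeadProxy(object):
    # Stand-in for a supervdsm connection whose peer process has died.
    def call(self, name, *args, **kwargs):
        raise IOError("connection closed")

class ProxyCaller(object):
    def __init__(self, proxy, func_name):
        self._proxy = proxy
        self._funcName = func_name

    def __call__(self, *args, **kwargs):
        try:
            # Forward the call to the supervdsmServer process running as root.
            return self._proxy.call(self._funcName, *args, **kwargs)
        except Exception:
            # Any failure on the channel surfaces as the RuntimeError seen in
            # the traceback, naming whichever call was in flight.
            raise RuntimeError("Broken communication with supervdsm. "
                               "Failed call to %s" % self._funcName)

if __name__ == "__main__":
    validateAccess = ProxyCaller(_DeadProxy(), "validateAccess")
    validateAccess("/rhev/data-center/mnt/example")  # raises the RuntimeError

The point is that the storage call (validateAccess) is just the victim; the real failure is that the supervdsm side went away.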
The version is vdsm-4.16.7.4-1.el6ev.x86_64
Hi, is there any way you can reproduce with this additional logging? http://gerrit.ovirt.org/#/c/35475/
supervdsm is crashing all too often. Genady, can you attach /var/log/messages to verify whether segfaults show up there?

MainThread::DEBUG::2014-11-23 09:01:35,311::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-23 09:04:52,178::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-23 10:14:47,817::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-23 10:18:26,265::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm

I suspect that an el6.6 change in one underlying library caused this regression.
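Since each "Making sure I'm root - SuperVdsm" line marks a fresh supervdsmServer startup, counting those lines is a quick way to see how often the daemon is dying. A small sketch, assuming the log location /var/log/vdsm/supervdsm.log and the startup banner quoted above:

import re

STARTUP_MARKER = re.compile(r"Making sure I'm root - SuperVdsm")

def count_restarts(log_path="/var/log/vdsm/supervdsm.log"):
    # Each occurrence of the startup banner means the main() entry point ran
    # again, i.e. the previous supervdsm process exited or crashed.
    restarts = 0
    with open(log_path) as log:
        for line in log:
            if STARTUP_MARKER.search(line):
                restarts += 1
    return restarts

if __name__ == "__main__":
    print("supervdsm started %d times" % count_restarts())

Four startups within a couple of hours, as in the snippet above, is what makes the segfault theory worth checking in /var/log/messages.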
Genady, I'd appreciate it if you could try my http://gerrit.ovirt.org/#/c/35480/1/vdsm/supervdsmServer together with Saggi's patch - it would make coredumps available.
The problem is actually in a ctypes call in the ovirtnode module. This bug needs to be modified, and a new bug needs to be opened, to reflect that, but I don't know exactly what the process is.

This is the stack trace from the affected host (the numbered frames are the native frames that occur after the most recent Python call):

#0  api_entry (aug=0x6c1b5820) at augeas.c:398
#1  0x00007fd983be9721 in aug_set (aug=0x6c1b5820, path=0x7fd9782d65f4 "/augeas/save/copy_if_rename_fails", value=0x7fd993594054 "") at augeas.c:909
#2  0x00007fd98515bdac in ffi_call_unix64 () at src/x86/unix64.S:75
#3  0x00007fd98515bb34 in ffi_call (cif=0x7fd9819bf9c0, fn=0x7fd983be9700 <aug_set>, rvalue=<value optimized out>, avalue=<value optimized out>) at src/x86/ffi64.c:430
#4  0x00007fd98536f074 in _call_function_pointer (pProc=0x7fd983be9700 <aug_set>, argtuple=0x7fd9819bf8a0, flags=4353, argtypes=<value optimized out>, restype=0x1cbe6f0, checker=0x0) at /usr/src/debug/Python-2.6.6/Modules/_ctypes/callproc.c:816
#5  _CallProc (pProc=0x7fd983be9700 <aug_set>, argtuple=0x7fd9819bf8a0, flags=4353, argtypes=<value optimized out>, restype=0x1cbe6f0, checker=0x0) at /usr/src/debug/Python-2.6.6/Modules/_ctypes/callproc.c:1163
#6  0x00007fd9853683a2 in CFuncPtr_call (self=<value optimized out>, inargs=<value optimized out>, kwds=0x0) at /usr/src/debug/Python-2.6.6/Modules/_ctypes/_ctypes.c:3860
#7  0x00007fd993027c63 in PyObject_Call (func=0x7fd96c04e940, arg=<value optimized out>, kw=<value optimized out>) at Objects/abstract.c:2492

Python stack:

/usr/lib/python2.6/site-packages/augeas.py (129): set
/usr/lib/python2.6/site-packages/ovirtnode/ovirtfunctions.py (20): <module>
/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py (644): delete
/usr/share/vdsm/network/configurators/ifcfg.py (274): _removeFile
/usr/share/vdsm/network/configurators/ifcfg.py (228): _removeSourceRouteFile
/usr/share/vdsm/network/configurators/ifcfg.py (246): removeSourceRoute
/usr/share/vdsm/network/sourceroute.py (90): remove
/usr/share/vdsm/network/configurators/__init__.py (131): _removeSourceRoute
/usr/share/vdsm/network/configurators/ifcfg.py (159): removeBridge
/usr/share/vdsm/network/models.py (171): remove
/usr/share/vdsm/network/api.py (405): delNetwork
/usr/share/vdsm/network/api.py (206): wrapped
/usr/share/vdsm/network/api.py (569): setupNetworks
/usr/share/vdsm/supervdsmServer (222): setupNetworks
/usr/share/vdsm/supervdsmServer (99): wrapper
/usr/lib64/python2.6/multiprocessing/managers.py (199): serve_client
/usr/lib64/python2.6/multiprocessing/managers.py (391): accept_connection
/usr/lib64/python2.6/multiprocessing/managers.py (165): handle_request
/usr/lib64/python2.6/threading.py (481): run
/usr/lib64/python2.6/threading.py (513): __bootstrap_inner
/usr/lib64/python2.6/threading.py (490): __bootstrap
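To show where the native crash sits relative to the Python code, here is a rough sketch of how a call like augeas.py's set() reaches aug_set() through ctypes (a simplified, hypothetical illustration assuming libaugeas is installed; not the actual python-augeas source):

import ctypes
import ctypes.util

# The stack above is exactly this layering: Python -> ctypes (frames #6-#4)
# -> libffi (frames #3-#2) -> libaugeas (frames #1-#0).
_libaugeas = ctypes.cdll.LoadLibrary(ctypes.util.find_library("augeas"))

def aug_set(handle, path, value):
    # Once control enters libaugeas (aug_set/api_entry), a bad handle or bad
    # pointer segfaults the whole process - here, supervdsmServer itself -
    # which vdsm then reports as "Broken communication with supervdsm".
    return _libaugeas.aug_set(handle,
                              ctypes.c_char_p(path),
                              ctypes.c_char_p(value))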
The bug looks all too similar to the one solved by https://github.com/hercules-team/python-augeas/commit/1a83390bc240abe68b562d2429718e9c06c1e8df in upstream python-augeas.
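For what it's worth, if that upstream fix is about declaring proper ctypes argument/return types (the aug=0x6c1b5820 handle in frame #0 does look like it could be a truncated 64-bit pointer), the defensive pattern would look roughly like the sketch below. This is an assumption about the nature of the fix, not the literal content of that commit:

import ctypes
import ctypes.util

# Assumes libaugeas is installed; 'lib' and the declarations are illustrative.
lib = ctypes.cdll.LoadLibrary(ctypes.util.find_library("augeas"))

# Without explicit declarations ctypes treats every return value and argument
# as a C int, so a 64-bit pointer returned by aug_init() can be silently
# truncated to 32 bits before being passed back into aug_set().
lib.aug_init.restype = ctypes.c_void_p
lib.aug_init.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_uint]
lib.aug_set.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p]
lib.aug_close.argtypes = [ctypes.c_void_p]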
Hi Genadi, now that this bug has been moved to the ovirt-node component, could you please provide the ovirt-node/rhev-h/rhevm versions needed to reproduce this issue on the Virt QE side? We also need the reproduction steps. Since your description says "Didn't manage to reproduce it manually yet", Virt QE probably cannot verify this bug, as there is no RHEV automation environment on the Virt QE side; alternatively, you could give qa_ack+ from your side.

Thanks,
Ying
I could reproduce it in automation tests on a specific environment. That environment was recently upgraded, and since then I haven't been able to reproduce it. I am still trying to reproduce it; if I hit it again I'll update the BZ, otherwise I'll need to close this BZ if you don't see the reason for the failure in the code.
Hi Ying, Regarding your comment 9, please see bz 1167097.
Attach/Detach networks work as expected with this rhev-hypervisor7-7.0-20141126.0.iso. Please keep comments related to the RHEV-H issue in BZ 1167097.
Also reproduced this in a RHEV upgrade simulation: during an upgrade operation on one of the hosts, an attempt to add an NFS domain afterwards failed and the supervdsm error appeared. This issue is probably an XML-RPC bug; I couldn't reproduce it when executing the same steps with JSON-RPC. Also, after putting the host into maintenance and switching its RPC to JSON, the supervdsmServer error disappeared and the NFS domain creation succeeded.
(In reply to Ori Gofen from comment #15)
> Also reproduced this in a RHEV upgrade simulation: during an upgrade
> operation on one of the hosts, an attempt to add an NFS domain afterwards
> failed and the supervdsm error appeared. This issue is probably an XML-RPC
> bug; I couldn't reproduce it when executing the same steps with JSON-RPC.
> Also, after putting the host into maintenance and switching its RPC to
> JSON, the supervdsmServer error disappeared and the NFS domain creation
> succeeded.

Ori, was there also a RHEV-H host in use in that testing above?
Fabian, the ERROR happened on a vdsm-rhel6.5 machine. The DC contained 2 clusters, with 2 hosts in each cluster.

cluster 1 was OK!
 |
 -----> rhel6.5, vdsm 4.14
 |
 -----> RHEV Hypervisor - 6.5 - 20140930.1.el6, vdsm 4.14

cluster 2: one of the hosts was down, throwing the lines:

(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2430, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 328, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 222, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 71, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to validateAccess

cluster 2
 |
 -----> rhel6.5, vdsm 4.16
 |
 -----> rhel6.5, vdsm 4.16  <---- throwing ERRORS, 'non operational'

Anyhow, I'm updating my opinion (comment #15) about this bug. After further exploring, the bug is probably not related to the RPCs; it is much more low level. Though, somehow, on some occasions (not 100%), putting a host into maintenance and switching its RPC solves the issue. I also suggest (in case this bug reproduces itself) determining the exact flow that causes this ERROR before trying to fix anything, as there are too many 'moving parts' here (I have looked into some of the tests which caused this ERROR, and, well, let's just say I'd rather read War and Peace in Japanese).
(In reply to Ori Gofen from comment #17)
> Fabian, the ERROR happened on a vdsm-rhel6.5 machine. The DC contained 2
> clusters, with 2 hosts in each cluster.
>
> cluster 1 was OK!
>  |
>  -----> rhel6.5, vdsm 4.14
>  |
>  -----> RHEV Hypervisor - 6.5 - 20140930.1.el6, vdsm 4.14

If it's this bug - and we are not sure - then please retry with the build given in comment 12.
Not reproduced for the last couple of weeks
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0160.html