Bug 1167094 - Failed to connect to superVdsm when trying to perform setupNetworks action
Summary: Failed to connect to superVdsm when trying to perform setupNetworks action
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-node
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Fabian Deutsch
QA Contact: GenadiC
URL:
Whiteboard: node
Depends On:
Blocks: rhev35rcblocker rhev35gablocker
 
Reported: 2014-11-23 15:07 UTC by GenadiC
Modified: 2016-02-10 20:02 UTC (History)
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-11 21:06:16 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
engine, vdsm and supervdsm logs (2.08 MB, application/zip)
2014-11-23 15:07 UTC, GenadiC


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2015:0160 0 normal SHIPPED_LIVE ovirt-node bug fix and enhancement update 2015-02-12 01:34:52 UTC

Description GenadiC 2014-11-23 15:07:13 UTC
Created attachment 960458 [details]
engine, vdsm and supervdsm logs

Description of problem:
When running automation tests, different jobs repeatedly fail with a "Failed to connect to superVdsm" error message.
I haven't managed to reproduce it manually yet.
Logs are attached.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run tier1 tests
2.
3.

Actual results:
Tests fail with "Failed to connect to superVdsm"

Expected results:
Tests should pass

Additional info:

Comment 1 Lior Vernia 2014-11-23 15:52:21 UTC
Dan, any clue? Seems like an error on bond removal in the supervdsm.log?

Comment 2 Dan Kenigsberg 2014-11-23 15:54:50 UTC
Which vdsm version is this? The log shows a more infrastructural problem with supervdsm.

Thread-13::ERROR::2014-11-23 15:30:09,802::hsm::2433::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2430, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 328, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 222, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 71, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to validateAccess
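
For context, the "Broken communication with supervdsm" RuntimeError above is raised by the client-side proxy in /usr/share/vdsm/supervdsm.py when its call into the supervdsm process fails, so the real fault is in the supervdsm server, not in validateAccess itself. A minimal sketch of such a proxy wrapper (an illustration only, not vdsm's actual code; the class name is hypothetical) is:

# Hypothetical, simplified sketch of the client-side supervdsm call proxy
# implied by the traceback above: each public call is forwarded to the
# supervdsm process, and any failure of that channel is re-raised as the
# generic "Broken communication" RuntimeError.
class _SuperVdsmCall(object):
    def __init__(self, proxy, funcName):
        self._proxy = proxy          # remote supervdsm manager proxy
        self._funcName = funcName    # e.g. "validateAccess"

    def __call__(self, *args, **kwargs):
        try:
            return getattr(self._proxy, self._funcName)(*args, **kwargs)
        except Exception:
            raise RuntimeError(
                "Broken communication with supervdsm. Failed call to %s"
                % self._funcName)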

Comment 3 GenadiC 2014-11-24 07:14:24 UTC
The version is vdsm-4.16.7.4-1.el6ev.x86_64

Comment 4 Saggi Mizrahi 2014-11-24 13:35:27 UTC
Hi, is there any way you can reproduce with this additional logging?
http://gerrit.ovirt.org/#/c/35475/

Comment 5 Dan Kenigsberg 2014-11-24 15:09:19 UTC
supervdsm is crashing all too often. Genady, can you attach /var/log/messages to verify whether segfaults show up there?

MainThread::DEBUG::2014-11-23 09:01:35,311::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-23 09:04:52,178::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-23 10:14:47,817::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2014-11-23 10:18:26,265::supervdsmServer::423::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm

I suspect that an el6.6 change in one underlying library caused this regression.
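
A quick way to check for such crashes, assuming the kernel's segfault records land in /var/log/messages in the usual "kernel: ...: segfault at ..." format, is to scan the log for them. The script below is a hypothetical helper, not part of vdsm:

# Hypothetical helper: scan /var/log/messages (or a file given as argv[1])
# for kernel segfault records, to confirm whether the repeated supervdsm
# restarts seen above correspond to crashes of its python process.
import re
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
SEGFAULT = re.compile(r"kernel:.*segfault at")

with open(LOG) as log:
    for line in log:
        if SEGFAULT.search(line):
            print(line.rstrip())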

Comment 6 Dan Kenigsberg 2014-11-24 15:23:47 UTC
Genady, I'd appreciate it if you tried my http://gerrit.ovirt.org/#/c/35480/1/vdsm/supervdsmServer together with Saggi's patch - it would make coredumps available.

Comment 7 Saggi Mizrahi 2014-11-25 11:50:43 UTC
The problem is actually in a ctypes call in the ovirtnode module.
This bug needs to be modified and a new bug needs to be opened
to reflect that, but I don't know exactly what the process is.

This is the stack trace from the affected host:

(The numbered stack frames are the native frames that occur after the most recent
Python stack call)


#0  api_entry (aug=0x6c1b5820) at augeas.c:398
#1  0x00007fd983be9721 in aug_set (aug=0x6c1b5820, path=0x7fd9782d65f4 "/augeas/save/copy_if_rename_fails", value=0x7fd993594054 "") at augeas.c:909
#2  0x00007fd98515bdac in ffi_call_unix64 () at src/x86/unix64.S:75
#3  0x00007fd98515bb34 in ffi_call (cif=0x7fd9819bf9c0, fn=0x7fd983be9700 <aug_set>, rvalue=<value optimized out>, avalue=<value optimized out>) at src/x86/ffi64.c:430
#4  0x00007fd98536f074 in _call_function_pointer (pProc=0x7fd983be9700 <aug_set>, argtuple=0x7fd9819bf8a0, flags=4353, argtypes=<value optimized out>, restype=0x1cbe6f0, checker=0x0)
    at /usr/src/debug/Python-2.6.6/Modules/_ctypes/callproc.c:816
#5  _CallProc (pProc=0x7fd983be9700 <aug_set>, argtuple=0x7fd9819bf8a0, flags=4353, argtypes=<value optimized out>, restype=0x1cbe6f0, checker=0x0) at /usr/src/debug/Python-2.6.6/Modules/_ctypes/callproc.c:1163
#6  0x00007fd9853683a2 in CFuncPtr_call (self=<value optimized out>, inargs=<value optimized out>, kwds=0x0) at /usr/src/debug/Python-2.6.6/Modules/_ctypes/_ctypes.c:3860
#7  0x00007fd993027c63 in PyObject_Call (func=0x7fd96c04e940, arg=<value optimized out>, kw=<value optimized out>) at Objects/abstract.c:2492

/usr/lib/python2.6/site-packages/augeas.py (129): set
/usr/lib/python2.6/site-packages/ovirtnode/ovirtfunctions.py (20): <module>
/usr/lib/python2.6/site-packages/ovirtnode/ovirtfunctions.py (20): <module>
/usr/lib/python2.6/site-packages/ovirtnode/ovirtfunctions.py (20): <module>
/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py (644): delete
/usr/lib/python2.6/site-packages/ovirt/node/utils/fs/__init__.py (644): delete
/usr/share/vdsm/network/configurators/ifcfg.py (274): _removeFile
/usr/share/vdsm/network/configurators/ifcfg.py (274): _removeFile
/usr/share/vdsm/network/configurators/ifcfg.py (228): _removeSourceRouteFile
/usr/share/vdsm/network/configurators/ifcfg.py (228): _removeSourceRouteFile
/usr/share/vdsm/network/configurators/ifcfg.py (246): removeSourceRoute
/usr/share/vdsm/network/configurators/ifcfg.py (246): removeSourceRoute
/usr/share/vdsm/network/sourceroute.py (90): remove
/usr/share/vdsm/network/configurators/__init__.py (131): _removeSourceRoute
/usr/share/vdsm/network/configurators/ifcfg.py (159): removeBridge
/usr/share/vdsm/network/configurators/ifcfg.py (159): removeBridge
/usr/share/vdsm/network/models.py (171): remove
/usr/share/vdsm/network/api.py (405): delNetwork
/usr/share/vdsm/network/api.py (405): delNetwork
/usr/share/vdsm/network/api.py (206): wrapped
/usr/share/vdsm/network/api.py (206): wrapped
/usr/share/vdsm/network/api.py (569): setupNetworks
/usr/share/vdsm/network/api.py (569): setupNetworks
/usr/share/vdsm/supervdsmServer (222): setupNetworks
/usr/share/vdsm/supervdsmServer (222): setupNetworks
/usr/share/vdsm/supervdsmServer (99): wrapper
/usr/share/vdsm/supervdsmServer (99): wrapper
/usr/lib64/python2.6/multiprocessing/managers.py (199): serve_client
/usr/lib64/python2.6/multiprocessing/managers.py (391): accept_connection
/usr/lib64/python2.6/multiprocessing/managers.py (391): accept_connection
/usr/lib64/python2.6/multiprocessing/managers.py (165): handle_request
/usr/lib64/python2.6/multiprocessing/managers.py (165): handle_request
/usr/lib64/python2.6/threading.py (481): run
/usr/lib64/python2.6/threading.py (513): __bootstrap_inner
/usr/lib64/python2.6/threading.py (490): __bootstrap
/usr/lib64/python2.6/threading.py (490): __bootstrap
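
Frame #1 shows aug_set() being called on "/augeas/save/copy_if_rename_fails" through python-augeas' ctypes binding, reached while ovirtnode.ovirtfunctions is imported from a supervdsmServer worker thread. A minimal sketch of that call path (assuming the python-augeas package is installed; it is an illustration of the path, not a guaranteed reproducer of the crash) is:

# Sketch of the call path seen in the trace above: exercise the same
# aug_set() node via python-augeas, from a non-main thread, as
# supervdsmServer does when serving a setupNetworks request.
import threading
import augeas  # python-augeas ctypes binding over libaugeas

def touch_augeas():
    aug = augeas.Augeas()
    # Same node and value as in frame #1 of the native stack trace.
    aug.set("/augeas/save/copy_if_rename_fails", "")
    aug.close()

t = threading.Thread(target=touch_augeas)
t.start()
t.join()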

Comment 8 Dan Kenigsberg 2014-11-25 16:18:31 UTC
The bug looks all too similar to the one solved by https://github.com/hercules-team/python-augeas/commit/1a83390bc240abe68b562d2429718e9c06c1e8df in upstream python-augeas.

Comment 9 Ying Cui 2014-11-26 03:53:56 UTC
Hi Genadi, now that this bug has been moved to the ovirt-node component, could you please provide the ovirt-node/RHEV-H/RHEV-M versions needed to reproduce this issue on the Virt QE side? We also need to know the reproduction steps.
Also, given that your description says "Didn't manage to reproduce it manually yet", Virt QE probably cannot verify this bug since there is no RHEV automation environment on the Virt QE side, OR you can give qa_ack+ from your side.

Thanks
Ying

Comment 10 GenadiC 2014-11-26 07:09:33 UTC
I could reproduce it with the automation tests on a specific environment.
That environment was upgraded recently, and since then I haven't been able to reproduce it.
I am still trying to reproduce it; if I hit it again I'll update the BZ, otherwise I'll need to close this BZ if you don't see the reason for the failure in the code.

Comment 11 Michael Burman 2014-11-26 09:34:10 UTC
Hi Ying,

Regarding your comment 9, please see bz 1167097.

Comment 13 GenadiC 2014-11-27 12:26:40 UTC
Attach/Detach networks works as expected with rhev-hypervisor7-7.0-20141126.0.iso.
Please keep comments related to the RHEV-H issue in BZ 1167097.

Comment 15 Ori Gofen 2014-12-01 16:25:55 UTC
Also reproduced this in a RHEV upgrade simulation: during an upgrade operation on one of the hosts, an attempt to add an NFS domain afterwards failed and the supervdsm error appeared. This issue is probably an XML-RPC bug - I couldn't reproduce it when executing the same steps with JSON-RPC. Also, after putting the host into maintenance and switching its RPC to JSON, the supervdsmServer error disappeared and the NFS domain create operation was successful.

Comment 16 Fabian Deutsch 2014-12-04 07:55:27 UTC
(In reply to Ori Gofen from comment #15)
> Also reproduced this in a RHEV upgrade simulation: during an upgrade
> operation on one of the hosts, an attempt to add an NFS domain afterwards
> failed and the supervdsm error appeared. This issue is probably an XML-RPC
> bug - I couldn't reproduce it when executing the same steps with JSON-RPC.
> Also, after putting the host into maintenance and switching its RPC to
> JSON, the supervdsmServer error disappeared and the NFS domain create
> operation was successful.

Ori, was there also a RHEV-H host in use in that testing above?

Comment 17 Ori Gofen 2014-12-04 10:12:14 UTC
Fabian, the ERROR happened on a vdsm-rhel6.5 machine. The DC contained 2 clusters,
with 2 hosts in each cluster.

cluster 1 was OK!
   |
   -----> rhel6.5, vdsm4.14
   |
   -----> RHEV Hypervisor - 6.5 - 20140930.1.el6 4.14

In cluster 2, one of the hosts was down, throwing these lines:

(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2430, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 328, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 222, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 71, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to validateAccess

cluster 2 
   |
   -----> rhel6.5, vdsm4.16
   |
   -----> rhel6.5, vdsm4.16  <---- throwing ERRORS, 'non operational'
 
Anyhow, I'm updating my opinion (comment #15) about this bug. After further exploration, the bug is probably not related to the RPCs; it is much more low level. Though, somehow, on some occasions (not 100%), putting a host into maintenance and switching its RPC solves the issue.

I also suggest (in case this bug reproduces again) pinning down the exact flow that causes this ERROR before trying to fix anything, as there are too many 'moving parts' here (I have looked into some of the tests that caused this ERROR, and, well, let's just say I'd rather read War and Peace in Japanese).

Comment 18 Fabian Deutsch 2014-12-04 13:25:35 UTC
(In reply to Ori Gofen from comment #17)
> Fabian, the ERROR happened on a vdsm-rhel6.5 machine. The DC contained 2
> clusters, with 2 hosts in each cluster.
> 
> cluster 1 was OK!
>    |
>    -----> rhel6.5, vdsm4.14
>    |
>    -----> RHEV Hypervisor - 6.5 - 20140930.1.el6 4.14

If it's this bug - and we are not sure - then please retry with the build given in comment 12.

Comment 19 GenadiC 2014-12-10 10:45:22 UTC
Not reproduced for the last couple of weeks

Comment 21 errata-xmlrpc 2015-02-11 21:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0160.html

