Bug 1168217
| Field | Value |
|---|---|
| Summary | Error getting hardware information from an upgraded host |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 3.5.0 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | unspecified |
| Reporter | Allon Mureinik <amureini> |
| Assignee | Piotr Kliczewski <pkliczew> |
| QA Contact | sefi litmanovich <slitmano> |
| CC | acanan, amureini, bazulay, danken, ecohen, gklein, iheim, lpeer, lsurette, mburman, michal.skrivanek, nsednev, nsoffer, oourfali, pkliczew, scohen, tnisan, ybronhei, yeylon |
| Keywords | Regression, Triaged |
| Flags | pkliczew: needinfo+ |
| Target Milestone | --- |
| Target Release | 3.5.0 |
| Whiteboard | infra |
| Fixed In Version | v4.16.8.1 |
| Doc Type | Bug Fix |
| Clone Of | 1167085 |
| Type | Bug |
| oVirt Team | Infra |
| Bug Blocks | 1164308, 1164311 |
| Last Closed | 2015-02-16 13:40:51 UTC |
Description (Allon Mureinik, 2014-11-26 12:46:08 UTC)
Logs are available at https://bugzilla.redhat.com/show_bug.cgi?id=1167085 - not uploading them here in order to avoid duplicates.

Michael - how was the upgrade done (exactly)?

(In reply to Barak from comment #3)
> Michael - how was the upgrade done (exactly)?

The host, running under RHEVM 3.4, was put into maintenance, then its repo files were updated to 3.5, then it was upgraded to 3.5, and then the host was activated via the RHEVM 3.4 engine.

"Steps to Reproduce:
1. 3.4 engine with 3.4 host
2. put the host into maintenance and upgrade the host to 3.5
3. after the installation finishes successfully, try to activate the host"

I suppose that "yum update all -y" was used rather than "yum update vdsm -y"; both ways are correct and might be used by a customer.

By default xmlrpc has null values turned off. We need to check why we are trying to send None. Can you please provide vdsm logs so we can understand where the None value comes from?
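[Editor's note: for context on the xmlrpc remark above, a standalone Python 2 illustration (not vdsm code) of how xmlrpclib treats None when null support is off versus on:]

```python
import xmlrpclib  # Python 2 standard library, matching the stack in this bug

# Without allow_none, marshalling None raises a TypeError instead of
# silently producing a null value on the wire.
try:
    xmlrpclib.dumps((None,), methodresponse=True)
except TypeError as err:
    print(err)  # cannot marshal None unless allow_none is enabled

# With allow_none enabled, the same value serializes as a <nil/> element.
print(xmlrpclib.dumps((None,), methodresponse=True, allow_none=True))
```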
I checked the parent bug's logs and noticed that the vdsm log starts at 12:02:46,999, whereas the getHardwareInfo None value occurred at 10:20:03,922. Please provide the missing log.

I don't have a reproducer - I just forked the bug. Michael - can you help Piotr out here, please?

Found this in the parent bug's logs:

MiscBlockReadException: Internal block device read failure: 'name=/dev/e04c81c8-8d7e-4dab-b909-2d8443ff8863/metadata, offset=0, size=2048'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,371::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,373::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,374::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,375::__init__::375::IOProcess::(_processLogs) (3370) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,446::__init__::375::IOProcess::(_processLogs) (3370) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,456::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,460::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,461::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,462::__init__::375::IOProcess::(_processLogs) (3371) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,462::__init__::375::IOProcess::(_processLogs) (3371) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,463::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,464::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,465::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,465::__init__::375::IOProcess::(_processLogs) (3372) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,465::__init__::375::IOProcess::(_processLogs) (3372) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) (3373) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,468::__init__::375::IOProcess::(_processLogs) (3373) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,469::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,469::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,470::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,470::__init__::375::IOProcess::(_processLogs) (3374) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,470::__init__::375::IOProcess::(_processLogs) (3374) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,471::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,472::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,472::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,472::__init__::375::IOProcess::(_processLogs) (3375) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,473::__init__::375::IOProcess::(_processLogs) (3375) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,474::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,474::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,475::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,475::__init__::375::IOProcess::(_processLogs) (3376) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,475::__init__::375::IOProcess::(_processLogs) (3376) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,476::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,477::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,477::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,477::__init__::375::IOProcess::(_processLogs) (3377) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,533::__init__::375::IOProcess::(_processLogs) (3377) Queuing response
Thread-46::ERROR::2014-11-23 12:50:03,535::sdc::143::Storage.StorageDomainCache::(_findDomain) domain e04c81c8-8d7e-4dab-b909-2d8443ff8863 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e04c81c8-8d7e-4dab-b909-2d8443ff8863',)
Thread-46::ERROR::2014-11-23 12:50:03,535::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain e04c81c8-8d7e-4dab-b909-2d8443ff8863 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 232, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e04c81c8-8d7e-4dab-b909-2d8443ff8863',)
Thread-13::DEBUG::2014-11-23 12:50:03,752::BindingXMLRPC::1132::vds::(wrapper) client [10.35.161.37]::call getHardwareInfo with () {} flowID [dfd0a2b]
Thread-13::DEBUG::2014-11-23 12:50:03,753::BindingXMLRPC::1139::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Done', 'code': 0}, 'info': None}

As you can see in the last line, getHardwareInfo returns None. I looked up getHardwareInfo in the vdsm code and didn't see any issue related to the previous failure (which is storage related). I would start debugging with vdsClient on the malfunctioning host and go on from there.

While reviewing the logs with Dan, we noticed that the only call from the engine that got None was (from 1vdsm.log):

Thread-13::DEBUG::2014-11-23 12:50:03,753::BindingXMLRPC::1139::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Done', 'code': 0}, 'info': None}

The corresponding call from superVdsm:

MainProcess|Thread-13::DEBUG::2014-11-23 12:50:03,771::supervdsmServer::102::SuperVdsm.ServerCallback::(wrapper) call getHardwareInfo with () {}
MainProcess|Thread-13::DEBUG::2014-11-23 12:50:03,775::supervdsmServer::109::SuperVdsm.ServerCallback::(wrapper) return getHardwareInfo with {'systemProductName': 'Sun Fire X2200 M2', 'systemSerialNumber': '0643QBT010', 'systemFamily': 'To Be Filled By O.E.M.', 'systemVersion': 'Rev 50', 'systemUUID': '007ef314-a079-0010-a38a-0016367b3dfe', 'systemManufacturer': 'Sun Microsystems'}

which is weird. In addition, we see that getHardwareInfo is being called every 10 minutes ... I'm sure this should not be the case.

Good catch - I was too focused on the specific timing. Indeed it is weird, because in the API we call supervdsmserver, which logs the info you found in comment #10; next, the API returns the value to the xmlrpc binding, which logs the info you posted in comment #9. It looks like the value disappears when returned from supervdsmserver.
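[Editor's note: to make that layering concrete, here is a minimal sketch of the call path as described in this thread (BindingXMLRPC, then the API layer, then the supervdsm proxy). All function and class names below are illustrative stand-ins, not vdsm's actual code:]

```python
import logging

log = logging.getLogger("sketch")

class SupervdsmProxyStub(object):
    # Stand-in for the multiprocessing proxy through which the API layer
    # reaches supervdsmServer (the privileged helper process).
    def getHardwareInfo(self):
        # In the bug, supervdsmServer logged a full dict at this layer,
        # so a None seen above it points at the reply path in between.
        return {'systemManufacturer': 'Sun Microsystems'}

def api_getHardwareInfo(proxy):
    # API layer: wraps the proxy's result in the status envelope.
    info = proxy.getHardwareInfo()
    return {'status': {'code': 0, 'message': 'Done'}, 'info': info}

def binding_getHardwareInfo(proxy):
    # BindingXMLRPC layer: this is where the log showed 'info': None even
    # though supervdsm had produced a full dict one hop below.
    result = api_getHardwareInfo(proxy)
    log.debug("return getHardwareInfo with %r", result)
    return result

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    binding_getHardwareInfo(SupervdsmProxyStub())
```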
Alon, Piotr - see attached bugs from bz 1167085. I meant: attached logs from bz 1167085.

Is this one-time or recurring? Any workaround?

I accessed the test environment to investigate the issue but noticed that the affected host works OK. I followed the steps to reproduce the issue:

1. Installed rhevm 3.4.4-2.2.el6ev on my rhel65 using the 3.4 qe repo
2. Installed rhel 6.6 on a VM (the original test environment used it)
3. Configured the 3.4 qe repo and installed vdsm 4.14.18-3.el6ev from it
4. Installed the new host from the engine UI and moved it to maintenance
5. Added the 3.5 repo to my rhel 6.6 VM
6. Updated the host and noticed:

Updating : vdsm-4.16.7.5-1.el6ev.x86_64 11/17
warning: /etc/vdsm/vdsm.conf created as /etc/vdsm/vdsm.conf.rpmnew
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 224, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 210, in main
    except vdsm.tool.UsageError as e:
AttributeError: 'module' object has no attribute 'UsageError'

7. Activated the host from the engine UI and noticed no issues
8. Ran vdsClient:

[root@rhel66 ~]# vdsClient -s 0 getVdsHardwareInfo
	systemManufacturer = 'Bochs'
	systemProductName = 'Bochs'
	systemUUID = '533192b1-b059-4acf-8e39-8038b09f4138'

Please provide more information on how to reproduce the issue with getHardwareInfo returning None.

From the logs it looks like it recurs, but so far I have not been able to see it failing.

Please provide more information on how to reproduce the issue with getHardwareInfo returning None. If it doesn't reproduce, it's not a blocker. Please let us know ASAP.

Piotr, my steps to reproduce are written above. The original bz 1167085 was discovered during a big mixed upgrade (3.3 > 3.4 > 3.5) of the environment. In this environment we had mixed DCs, mixed clusters, and mixed storage domains. I can't provide you getHardwareInfo output or reproduce it; the environment is already upgraded. I have attached all relevant logs in the original bug.

OK, so maybe this issue is happening again with another host in our environment. I'm attaching all relevant logs; it seems to me like the same issue. For more details, please ssh to the host: alma03.qa.lab.tlv.redhat.com; upgraded engine: 10.35.161.37

Created attachment 963574 [details]
host and engine logs
I checked the logs and noticed a lot of connectivity issues. I found as well:

Failed to GetHardwareInfoVDS, error = Failed to read hardware information, code = 57

I found 3 occurrences of the above error. From the host's perspective getHardwareInfo looks good:

Thread-13::DEBUG::2014-11-30 09:30:05,029::BindingXMLRPC::1149::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Done', 'code': 0}, 'info': {'systemProductName': 'PowerEdge C6220 II', 'systemUUID': '4C4C4544-0059-4410-8053-B7C04F573032', 'systemSerialNumber': '7YDSW02', 'systemFamily': 'Server', 'systemManufacturer': 'Dell Inc.'}}

In the logs that you attached there is an issue, but it is a different one. I noticed that there are issues with connecting to supervdsm. Some of the calls are OK and some fail:

Thread-13::DEBUG::2014-12-02 09:30:02,251::supervdsm::77::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-13::ERROR::2014-12-02 09:30:02,273::API::1271::vds::(getHardwareInfo) failed to retrieve hardware info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1268, in getHardwareInfo
    hw = supervdsm.getProxy().getHardwareInfo()
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to getHardwareInfo
Thread-13::DEBUG::2014-12-02 09:30:02,293::BindingXMLRPC::1149::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Failed to read hardware information', 'code': 57}}

There is as well:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2430, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 328, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 222, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 71, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to validateAccess

I added some debug statements and noticed:

---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
    request = recv()
IOError: [Errno 4] Interrupted system call
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsm.py", line 51, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 49, in <lambda>
    **kwargs)
  File "<string>", line 2, in validateAccess
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
    request = recv()
IOError: [Errno 4] Interrupted system call
---------------------------------------------------------------------------

It seems to be a regression; I found the following topic about it on the users list [1]:

[Users] latest vdsm cannot read ib device speeds causing storage attach fail
[1] http://lists.ovirt.org/pipermail/users/2013-January/thread.html#11880

At this stage I think that we are dealing with 2 separate issues:
1. supervdsm returning None. Seems to be not reproducible.
2. The errno.EINTR issue, which is a continuation of an issue from 3.4. A python bug [1] was opened as a result of the conversation (comment #24). The issue was fixed in python 2.7 but not in python 2.6.
[1] http://bugs.python.org/issue17097
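[Editor's note: python 2.6 does not restart the interrupted recv() inside the multiprocessing manager's serve_client loop, which is what surfaces above as "Broken communication with supervdsm". A common workaround pattern for that era is to retry EINTR around such calls; this sketch is an assumed illustration, not necessarily the fix that was merged for this bug:]

```python
import errno

def retry_on_eintr(func, *args, **kwargs):
    # Restart a syscall-backed call when a signal interrupts it. Python
    # 2.6's multiprocessing manager lacks this retry around recv(), so
    # any signal delivered mid-read aborts the supervdsm call.
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError) as e:
            if e.errno != errno.EINTR:
                raise

# Example: wrapping the connection read that fails in the tracebacks above,
# where `conn` stands in for a multiprocessing connection object.
# data = retry_on_eintr(conn.recv)
```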
The revert is now merged. We should still, however, try to fix the original issue differently.

Verified with rhevm-3.5.0-0.28.el6ev.noarch. Installed a rhel 6.6 host with vdsm-4.16.8.1-5.el6ev.x86_64 and python 2.6.6. Created an NFS storage domain attached to the host, created a VM, and so on. Didn't experience any communication problem between vdsm and supervdsm. Please update me if there's any specific flow you want me to test with this env. I am attaching engine, vdsm, and supervdsm logs.

Created attachment 979246 [details]
engine + host logs
Thanks, I do not see any issues in the logs.