Bug 1168217
| Field | Value |
|---|---|
| Summary | Error getting hardware information from an upgraded host |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 3.5.0 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | unspecified |
| Reporter | Allon Mureinik <amureini> |
| Assignee | Piotr Kliczewski <pkliczew> |
| QA Contact | sefi litmanovich <slitmano> |
| CC | acanan, amureini, bazulay, danken, ecohen, gklein, iheim, lpeer, lsurette, mburman, michal.skrivanek, nsednev, nsoffer, oourfali, pkliczew, scohen, tnisan, ybronhei, yeylon |
| Keywords | Regression, Triaged |
| Flags | pkliczew: needinfo+ |
| Target Milestone | --- |
| Target Release | 3.5.0 |
| Whiteboard | infra |
| Fixed In Version | v4.16.8.1 |
| Doc Type | Bug Fix |
| Clone Of | 1167085 |
| Type | Bug |
| oVirt Team | Infra |
| Bug Blocks | 1164308, 1164311 |
| Last Closed | 2015-02-16 13:40:51 UTC |
Description (Allon Mureinik, 2014-11-26 12:46:08 UTC)
Logs are available at https://bugzilla.redhat.com/show_bug.cgi?id=1167085 - not uploading them here in order to avoid duplicates.

Michael - how was the upgrade done (exactly)?

(In reply to Barak from comment #3)
> Michael - how was the upgrade done (exactly)?

The host, running under RHEVM 3.4, was put into maintenance, then its repo files were updated to 3.5, then it was upgraded to 3.5, and then the host was activated via the RHEVM 3.4 engine.

"Steps to Reproduce:
1. 3.4 engine with 3.4 host
2. put the host into maintenance and upgrade the host to 3.5
3. after the installation finishes successfully, try to activate the host"

I suppose that "yum update all -y" was used rather than "yum update vdsm -y"; both ways are correct and might be used by a customer.

By default xmlrpc has null values turned off. We need to check why we are trying to send None. Can you please provide vdsm logs so we can understand where the None value comes from?
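[Editor's note: for context on the xmlrpc remark above, a standalone Python 2 illustration (not vdsm code) of how xmlrpclib treats None when null support is off versus on:]

```python
import xmlrpclib  # Python 2 standard library, matching the stack in this bug

# Without allow_none, marshalling None raises a TypeError instead of
# silently producing a null value on the wire.
try:
    xmlrpclib.dumps((None,), methodresponse=True)
except TypeError as err:
    print(err)  # cannot marshal None unless allow_none is enabled

# With allow_none enabled, the same value serializes as a <nil/> element.
print(xmlrpclib.dumps((None,), methodresponse=True, allow_none=True))
```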
I checked the parent bug's logs and noticed that the vdsm log starts at 12:02:46,999, whereas the getHardwareInfo None value occurred at 10:20:03,922. Please provide the missing log.

I don't have a reproducer - I just forked the bug. Michael - can you help Piotr out here, please?

Found this in the parent bug's logs:

MiscBlockReadException: Internal block device read failure: 'name=/dev/e04c81c8-8d7e-4dab-b909-2d8443ff8863/metadata, offset=0, size=2048'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,371::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,373::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,374::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,375::__init__::375::IOProcess::(_processLogs) (3370) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,446::__init__::375::IOProcess::(_processLogs) (3370) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,456::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,460::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,461::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,462::__init__::375::IOProcess::(_processLogs) (3371) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,462::__init__::375::IOProcess::(_processLogs) (3371) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,463::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,464::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,465::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,465::__init__::375::IOProcess::(_processLogs) (3372) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,465::__init__::375::IOProcess::(_processLogs) (3372) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,467::__init__::375::IOProcess::(_processLogs) (3373) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,468::__init__::375::IOProcess::(_processLogs) (3373) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,469::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,469::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,470::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,470::__init__::375::IOProcess::(_processLogs) (3374) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,470::__init__::375::IOProcess::(_processLogs) (3374) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,471::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,472::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,472::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,472::__init__::375::IOProcess::(_processLogs) (3375) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,473::__init__::375::IOProcess::(_processLogs) (3375) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,474::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,474::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,475::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,475::__init__::375::IOProcess::(_processLogs) (3376) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,475::__init__::375::IOProcess::(_processLogs) (3376) Queuing response
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,476::__init__::375::IOProcess::(_processLogs) Receiving request...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,477::__init__::375::IOProcess::(_processLogs) Queuing request in the thread pool...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,477::__init__::375::IOProcess::(_processLogs) Extracting request information...
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,477::__init__::375::IOProcess::(_processLogs) (3377) Got request for method 'glob'
ioprocess communication (13118)::DEBUG::2014-11-23 12:50:03,533::__init__::375::IOProcess::(_processLogs) (3377) Queuing response
Thread-46::ERROR::2014-11-23 12:50:03,535::sdc::143::Storage.StorageDomainCache::(_findDomain) domain e04c81c8-8d7e-4dab-b909-2d8443ff8863 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e04c81c8-8d7e-4dab-b909-2d8443ff8863',)
Thread-46::ERROR::2014-11-23 12:50:03,535::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain e04c81c8-8d7e-4dab-b909-2d8443ff8863 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 232, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e04c81c8-8d7e-4dab-b909-2d8443ff8863',)
Thread-13::DEBUG::2014-11-23 12:50:03,752::BindingXMLRPC::1132::vds::(wrapper) client [10.35.161.37]::call getHardwareInfo with () {} flowID [dfd0a2b]
Thread-13::DEBUG::2014-11-23 12:50:03,753::BindingXMLRPC::1139::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Done', 'code': 0}, 'info': None}

As you can see in the last line, getHardwareInfo returns None. I looked up getHardwareInfo in the vdsm code and didn't see any issue related to the previous failure (which is storage related). I would start debugging with vdsClient on the malfunctioning host and go on from there.

While reviewing the logs with Dan, we noticed that the only call from the engine that got None was (from 1vdsm.log):

Thread-13::DEBUG::2014-11-23 12:50:03,753::BindingXMLRPC::1139::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Done', 'code': 0}, 'info': None}

The corresponding call from superVdsm:

MainProcess|Thread-13::DEBUG::2014-11-23 12:50:03,771::supervdsmServer::102::SuperVdsm.ServerCallback::(wrapper) call getHardwareInfo with () {}
MainProcess|Thread-13::DEBUG::2014-11-23 12:50:03,775::supervdsmServer::109::SuperVdsm.ServerCallback::(wrapper) return getHardwareInfo with {'systemProductName': 'Sun Fire X2200 M2', 'systemSerialNumber': '0643QBT010', 'systemFamily': 'To Be Filled By O.E.M.', 'systemVersion': 'Rev 50', 'systemUUID': '007ef314-a079-0010-a38a-0016367b3dfe', 'systemManufacturer': 'Sun Microsystems'}

which is weird. In addition, we see that getHardwareInfo is being called every 10 minutes ... I'm sure this should not be the case.

Good catch - I was too focused on the specific timing. Indeed it is weird, because in the API we call supervdsmserver, which logs the info you found in comment #10; next, the API returns the value to the xmlrpc binding, which logs the info you posted in comment #9. It looks like the value disappears when returned from supervdsmserver.
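[Editor's note: to make that layering concrete, here is a minimal sketch of the call path as described in this thread (BindingXMLRPC, then the API layer, then the supervdsm proxy). All function and class names below are illustrative stand-ins, not vdsm's actual code:]

```python
import logging

log = logging.getLogger("sketch")

class SupervdsmProxyStub(object):
    # Stand-in for the multiprocessing proxy through which the API layer
    # reaches supervdsmServer (the privileged helper process).
    def getHardwareInfo(self):
        # In the bug, supervdsmServer logged a full dict at this layer,
        # so a None seen above it points at the reply path in between.
        return {'systemManufacturer': 'Sun Microsystems'}

def api_getHardwareInfo(proxy):
    # API layer: wraps the proxy's result in the status envelope.
    info = proxy.getHardwareInfo()
    return {'status': {'code': 0, 'message': 'Done'}, 'info': info}

def binding_getHardwareInfo(proxy):
    # BindingXMLRPC layer: this is where the log showed 'info': None even
    # though supervdsm had produced a full dict one hop below.
    result = api_getHardwareInfo(proxy)
    log.debug("return getHardwareInfo with %r", result)
    return result

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    binding_getHardwareInfo(SupervdsmProxyStub())
```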
Alon, Piotr - see attached bugs from bz 1167085. I meant: attached logs from bz 1167085.

Is this one-time or recurring? Any workaround?

I accessed the test environment to investigate the issue but noticed that the affected host works OK. I followed the steps to reproduce the issue:

1. Installed rhevm 3.4.4-2.2.el6ev on my rhel65 using the 3.4 qe repo
2. Installed rhel 6.6 on a VM (the original test environment used it)
3. Configured the 3.4 qe repo and installed vdsm 4.14.18-3.el6ev from it
4. Installed the new host from the engine UI and moved it to maintenance
5. Added the 3.5 repo to my rhel 6.6 VM
6. Updated the host and noticed:

Updating : vdsm-4.16.7.5-1.el6ev.x86_64 11/17
warning: /etc/vdsm/vdsm.conf created as /etc/vdsm/vdsm.conf.rpmnew
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 224, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 210, in main
    except vdsm.tool.UsageError as e:
AttributeError: 'module' object has no attribute 'UsageError'

7. Activated the host from the engine UI and noticed no issues
8. Ran vdsClient:

[root@rhel66 ~]# vdsClient -s 0 getVdsHardwareInfo
	systemManufacturer = 'Bochs'
	systemProductName = 'Bochs'
	systemUUID = '533192b1-b059-4acf-8e39-8038b09f4138'

Please provide more information on how to reproduce the issue with getHardwareInfo returning None.

From the logs it looks like it recurs, but so far I have not been able to see it failing.

Please provide more information on how to reproduce the issue with getHardwareInfo returning None. If it doesn't reproduce, it's not a blocker. Please let us know ASAP.

Piotr, my steps to reproduce are written above. The original bz 1167085 was discovered during a big mixed upgrade (3.3 > 3.4 > 3.5) of the environment. In this environment we had mixed DCs, mixed clusters, and mixed storage domains. I can't provide you getHardwareInfo output or reproduce it; the environment is already upgraded. I have attached all relevant logs in the original bug.

OK, so maybe this issue is happening again with another host in our environment. I'm attaching all relevant logs; it seems to me like the same issue. For more details, please ssh to the host: alma03.qa.lab.tlv.redhat.com; upgraded engine: 10.35.161.37

Created attachment 963574 [details]
host and engine logs
I checked the logs and noticed a lot of connectivity issues. I found as well:

Failed to GetHardwareInfoVDS, error = Failed to read hardware information, code = 57

I found 3 occurrences of the above error. From the host's perspective getHardwareInfo looks good:

Thread-13::DEBUG::2014-11-30 09:30:05,029::BindingXMLRPC::1149::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Done', 'code': 0}, 'info': {'systemProductName': 'PowerEdge C6220 II', 'systemUUID': '4C4C4544-0059-4410-8053-B7C04F573032', 'systemSerialNumber': '7YDSW02', 'systemFamily': 'Server', 'systemManufacturer': 'Dell Inc.'}}

In the logs that you attached there is an issue, but it is a different one. I noticed that there are issues with connecting to supervdsm. Some of the calls are OK and some fail:

Thread-13::DEBUG::2014-12-02 09:30:02,251::supervdsm::77::SuperVdsmProxy::(_connect) Trying to connect to Super Vdsm
Thread-13::ERROR::2014-12-02 09:30:02,273::API::1271::vds::(getHardwareInfo) failed to retrieve hardware info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1268, in getHardwareInfo
    hw = supervdsm.getProxy().getHardwareInfo()
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to getHardwareInfo
Thread-13::DEBUG::2014-12-02 09:30:02,293::BindingXMLRPC::1149::vds::(wrapper) return getHardwareInfo with {'status': {'message': 'Failed to read hardware information', 'code': 57}}

There is as well:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2430, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 328, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 222, in connect
    self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 71, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
    % self._funcName)
RuntimeError: Broken communication with supervdsm. Failed call to validateAccess

I added some debug statements and noticed:

---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
    request = recv()
IOError: [Errno 4] Interrupted system call
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsm.py", line 51, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 49, in <lambda>
    **kwargs)
  File "<string>", line 2, in validateAccess
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
    request = recv()
IOError: [Errno 4] Interrupted system call
---------------------------------------------------------------------------

It seems to be a regression; I found the following topic about it on the users list [1]:

[Users] latest vdsm cannot read ib device speeds causing storage attach fail
[1] http://lists.ovirt.org/pipermail/users/2013-January/thread.html#11880

At this stage I think that we are dealing with 2 separate issues:
1. supervdsm returning None. Seems to be not reproducible.
2. The errno.EINTR issue, which is a continuation of an issue from 3.4. A python bug [1] was opened as a result of the conversation (comment #24). The issue was fixed in python 2.7 but not in python 2.6.
[1] http://bugs.python.org/issue17097
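[Editor's note: python 2.6 does not restart the interrupted recv() inside the multiprocessing manager's serve_client loop, which is what surfaces above as "Broken communication with supervdsm". A common workaround pattern for that era is to retry EINTR around such calls; this sketch is an assumed illustration, not necessarily the fix that was merged for this bug:]

```python
import errno

def retry_on_eintr(func, *args, **kwargs):
    # Restart a syscall-backed call when a signal interrupts it. Python
    # 2.6's multiprocessing manager lacks this retry around recv(), so
    # any signal delivered mid-read aborts the supervdsm call.
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError) as e:
            if e.errno != errno.EINTR:
                raise

# Example: wrapping the connection read that fails in the tracebacks above,
# where `conn` stands in for a multiprocessing connection object.
# data = retry_on_eintr(conn.recv)
```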
The revert is now merged. We should still, however, try to fix the original issue differently.

Verified with rhevm-3.5.0-0.28.el6ev.noarch. Installed a rhel 6.6 host with vdsm-4.16.8.1-5.el6ev.x86_64 and python 2.6.6. Created an NFS storage domain attached to the host, created a VM, and so on. Didn't experience any communication problem between vdsm and supervdsm. Please update me if there's any specific flow you want me to test with this env. I am attaching engine, vdsm, and supervdsm logs.

Created attachment 979246 [details]
engine + host logs
Thanks, I do not see any issues in the logs.