Created attachment 978807 [details]
logs from vdsm and engine and db dump

Description of problem:
On a 3.4.0-1 setup, we've upgraded the hosts to 3.5 vt13.6. The setup seemed to be working fine; one host took SPM. After ~2 days, we observed that the DC was stuck in status 'contending' and that no host could take SPM.

Version-Release number of selected component (if applicable):
Upgraded from 3.4 av9.6 to 3.5 vt13.6

How reproducible:
Occurred on 2 setups; I'm not sure about the exact flow.

Steps to Reproduce:
1. Upgraded hosts to 3.5 vt13.6, without upgrading rhevm

Actual results:
spmStart fails with 'connection timed out':

2015-01-11 09:56:52,105 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-43) [421a2a76] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Connection timed out, code = 100

Tasks seemed to be stuck in vdsm in state 'finished'.
The DC got stuck in state 'contending'.

Expected results:
SPM start should succeed.

Additional info:
logs from vdsm and engine and db dump
Elad, did you restart the hosts after the upgrade?
Redirecting the needinfo to Gil, as he performed the upgrade.
I ran the following steps when upgrading the host:
1. Moved the host to maintenance
2. Added new repos for RHEL/RHEV
3. Ran "yum update" and upgraded both RHEL and RHEV packages (RHEL 6.5->6.6, RHEV 3.4->3.5)
4. Rebooted the host in order to load the new kernel
5. Activated the host
Liron, please have a look
The issue may be an infra problem accessing the storage using ioprocess, which causes spm start to fail. Regardless of that, there is an existing problem in the spmStart flow that should be fixed - it should be propagated to the engine that the problem was in accessing the master domain, so that reconstruct would be triggered (I'll clone this bug to a new one to handle that).

I can see in the logs that there is a problem accessing the storage from vdsm using ioprocess. It can be seen either by the domain monitoring or by the start spm - examples below. We can conclude that the problem isn't with the storage itself, as there are no exceptions in the sanlock/messages logs indicating that - so I suspect that the problem is somehow caused by ioprocess.

-----------------------------------------------------
Thread-18::ERROR::2015-01-11 09:41:31,765::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain eb51e96c-e4ae-4e41-a254-96959f76d60c monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 246, in _monitorDomain
    masterStats = self.domain.validateMaster()
  File "/usr/share/vdsm/storage/sd.py", line 587, in validateMaster
    if not self.oop.fileUtils.pathExists(pdir):
  File "/usr/share/vdsm/storage/outOfProcess.py", line 179, in pathExists
    return self._iop.pathExists(filename, writable)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 427, in pathExists
    if self.access(filename, check):
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 441, in access
    self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
-----------------------------------------------------
Thread-13::INFO::2015-01-11 09:41:30,916::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'3f68b117-6e66-40fd-b2bb-54c5f5372f0c': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '13.2', 'valid': False}, u'146f410f-f906-4d4a-bb29-75f2341a3612': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '55.9', 'valid': False}, u'eb51e96c-e4ae-4e41-a254-96959f76d60c': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000562465', 'lastCheck': '70.1', 'valid': True}}
-----------------------------------------------------

It can be seen that the lease is acquired and there are no exceptions in the sanlock log (consulted with nsoffer about that).

As this issue was reproduced on 2 hosts, adding needinfo? on the infra team. Nir, any insights you have would be great as well.

Elad, can you please try to reproduce it with RFH?
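[Editor's note] To make the traceback above easier to follow, here is a minimal sketch of the timeout semantics involved: a path check on a hung NFS mount can block indefinitely in the kernel, so it must be bounded by a deadline. The helper below is hypothetical (the real ioprocess client delegates the check to a separate helper process), but the ETIMEDOUT behavior matches the log:

```python
import errno
import os
import threading


class Timeout(Exception):
    """Raised when a storage check does not answer within the deadline."""


def path_exists(path, timeout=10.0):
    """Check a path on a possibly hung NFS mount without blocking forever.

    Hypothetical stand-in for oop's pathExists; the real client sends the
    request to a helper process, but the timeout semantics are the same.
    """
    result = []

    def check():
        # On a stale NFS handle this call can hang in the kernel.
        result.append(os.path.exists(path))

    worker = threading.Thread(target=check)
    worker.daemon = True  # do not keep the process alive on a hang
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # The mount did not answer in time - the same ETIMEDOUT that
        # ioprocess surfaces as "Connection timed out" in the traceback.
        raise Timeout(os.strerror(errno.ETIMEDOUT))
    return result[0]
```

Usage, with a made-up mount path: `path_exists("/rhev/data-center/mnt/example", timeout=10.0)` either returns promptly or raises Timeout, which is the error the engine then reports on HSMGetTaskStatusVDS.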
Elad, please try to reproduce using RFH and ioprocess. Thanks.
What about the following errors?

Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
AFAIK those errors are related to libvirt and not to the storage. This seems to be handled in https://bugzilla.redhat.com/show_bug.cgi?id=1181157. I'm adding needinfo? on mskivak just to take a look, and returning the needinfo? to yeela/sagi from the infra team meanwhile to look at the storage issue.
I'm not sure the absence of issues in the sanlock log means there is no storage issue. All we see in the logs is a timeout, which probably occurs due to storage issues. Let's wait for QE input on reproduction (either ioprocess or RFH), and continue from there.
Created attachment 979990 [details]
logs 14.1.15 (with ioprocess)

Reproduced again, using ioprocess. Logs attached.

2015-01-14 14:57:23,735 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) [72b5e35] Command SpmStartVDSCommand(HostName = green-vdsb, HostId = 9ec8bdde-a88b-42d4-bfa8-b805b0078839, storagePoolId = 00000002-0002-0002-0002-0000000003e1, prevId=-1, prevLVER=-1, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to SpmStartVDS, error = Resource timeout: ()
Since this seems severe and was recreated by QE, marking as an RC/GA blocker.
Going into "green-vdsb" and running df shows that it is stuck.

Also, browsing into /rhev/data-center/mnt/ and doing "ls -l" gets stuck as well.

Probably a stale NFS issue.
(In reply to Oved Ourfali from comment #12)
> Going into "green-vdsb" and running df shows that it is stuck.
> 
> Also, browsing into /rhev/data-center/mnt/ and doing "ls -l" gets stuck as
> well.
> 
> Probably a stale NFS issue.

Offhand, I tend to agree.
Liron, any supporting evidence? Do we know how we got there?
Oved, we are currently working on this host (Liron and Yeela are here). Also, there is a storage domain there which is located on an unreachable storage server.
After deep inspection done by Elad, Liron & Yeela, the issue seems to reside in ioprocess, hence moving to infra.
Right now it looks like this bug reproduces only when one of the storage domains in the setup is unavailable.

What could be causing this is that in 3.5 we have a single IOProcess handling all oop operations. This means that an unavailable SD will take up all available threads, and no operations can be performed for the other domains.

This can be resolved by using the fix for BZ#1126206.

We will try to reproduce with that fix tomorrow, and then we'll have a clear view of the right approach to take.
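[Editor's note] A minimal sketch of the failure mode described above, assuming a bounded worker pool shared by all domains (the names, pool size, and use of Python threads are hypothetical stand-ins for the single IOProcess instance):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# One pool shared by all storage domains, as in the pre-fix 3.5 layout
# (hypothetical stand-in for the single IOProcess; size chosen arbitrarily).
SHARED_POOL = ThreadPoolExecutor(max_workers=4)


def check_domain(sd_uuid, reachable):
    """Stand-in for an oop path check; blocks when the mount is stale."""
    if not reachable:
        time.sleep(30)  # a stale NFS mount blocks the call in the kernel
    return sd_uuid, True


# Domain monitoring keeps queueing checks against the unreachable domain,
# saturating every worker in the shared pool...
for _ in range(4):
    SHARED_POOL.submit(check_domain, "dead-domain", False)

# ...so a check against a perfectly healthy domain never gets a worker and
# the caller times out - the same symptom spmStart hit in the logs above.
future = SHARED_POOL.submit(check_domain, "healthy-domain", True)
try:
    print(future.result(timeout=2))
except TimeoutError:
    print("Connection timed out")
```

The healthy domain's check fails not because its storage is broken, but because every worker is parked on the dead domain - matching the observation that sanlock held the lease fine while oop calls timed out.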
(In reply to Yeela Kaplan from comment #16)
> Right now it looks like this bug reproduces only when one of the storage
> domains in the setup is unavailable.
> 
> What could be causing this is that in 3.5 we have a single IOProcess
> handling all oop operations. This means that an unavailable SD will take up
> all available threads, and no operations can be performed for the other
> domains.
> 
> This can be resolved by using the fix for BZ#1126206.
> 
> We will try to reproduce with that fix tomorrow, and then we'll have a
> clear view of the right approach to take.

Yeela, thanks for your help on this issue!
If you can confirm this assumption tomorrow, we need to see whether it's feasible to backport the fix to 3.5.0 given the pressing timeframe. If not, we should document that using RFH is a viable workaround.
The fix was merged upstream as part of the work on Bug 1126206 - [RFE] Use one instance of IOProcess per SD.
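[Editor's note] For reference, a sketch of the per-domain layout the fix moves to, using the same hypothetical pool-based stand-in as above: each SD gets its own pool, so a stale domain can only exhaust its own workers.

```python
from concurrent.futures import ThreadPoolExecutor


class OopRegistry:
    """Hypothetical stand-in for the fixed layout: one IOProcess-like
    worker pool per storage domain instead of one shared by all."""

    def __init__(self, workers_per_domain=4):
        self._workers = workers_per_domain
        self._pools = {}

    def pool_for(self, sd_uuid):
        # Lazily create an isolated pool per domain; a hung domain blocks
        # only its own workers, never another domain's checks.
        if sd_uuid not in self._pools:
            self._pools[sd_uuid] = ThreadPoolExecutor(
                max_workers=self._workers)
        return self._pools[sd_uuid]
```

With this layout, checks submitted via `pool_for("healthy-domain")` keep completing even while `pool_for("dead-domain")` is fully saturated, which is why spmStart no longer stalls once the fix is in.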
SpmStart succeeds after Reconstruct master completes. Tested with a stale NFS master domain; another domain in the DC takes master and is activated successfully.

Verified using RHEV 3.5 vt13.8:
vdsm-4.16.8.1-6.el7ev.x86_64
ioprocess-0.14.0-3.el7.x86_64
Bugs were moved by ERRATA to RELEASE PENDING but not closed, probably due to an errata error. Closing, as 3.5.0 is released.