Bug 1180899

Summary: [vdsm] spmStart fails with timeout, the DC cannot be initialized
Product: Red Hat Enterprise Virtualization Manager Reporter: Elad <ebenahar>
Component: vdsm Assignee: Yeela Kaplan <ykaplan>
Status: CLOSED ERRATA QA Contact: Elad <ebenahar>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.5.0 CC: amureini, bazulay, danken, ebenahar, ecohen, gklein, iheim, laravot, lpeer, lsurette, nsoffer, oourfali, pstehlik, scohen, smizrahi, tnisan, ybronhei, yeylon, ykaplan, ylavi
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: vdsm-4.16.8.1-6.el6ev Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-02-15 09:14:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1182563    
Bug Blocks: 1164308, 1164311    
Attachments:
Description Flags
logs from vdsm and engine and db dump
none
logs 14.1.15 (with ioprocess) none

Description Elad 2015-01-11 16:23:42 UTC
Created attachment 978807 [details]
logs from vdsm and engine and db dump

Description of problem:

On a 3.4.0-1 setup, we upgraded the hosts to 3.5 vt13.6. The setup seemed to be working fine; one host took SPM. After about two days, we observed that the DC was stuck in status 'contending' and that no host could take SPM.

Version-Release number of selected component (if applicable):
Upgraded from 3.4 av9.6 to 3.5 vt13.6

How reproducible:
Occurred on 2 setups
I'm not sure about the exact flow

Steps to Reproduce:
1. Upgraded hosts to 3.5 vt13.6, didn't upgrade rhevm


Actual results:
spmStart fails with 'connection timed out':

2015-01-11 09:56:52,105 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-43) [421a2a76] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException
: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Connection timed out, code = 100

Tasks seemed to be stuck in vdsm in state 'finished'
DC got stuck in state 'contending'

Expected results:
spmStart should succeed.

Additional info:
logs from vdsm and engine and db dump

Comment 1 Tal Nisan 2015-01-12 09:53:08 UTC
Elad, did you restart the hosts after the upgrade?

Comment 2 Elad 2015-01-12 10:05:33 UTC
redirecting the need-info to Gil, as he performed the upgrade

Comment 3 Gil Klein 2015-01-12 11:54:45 UTC
I ran the following steps when upgrading the host:
1. Moved the host to maintenance
2. Added new repos for RHEL/RHEV
3. Ran "yum update" and upgraded both RHEL and RHEV packages (RHEL 6.5->6.6, RHEV 3.4->3.5)
4. Rebooted the host in order to load the new kernel
5. Activated the host

Comment 4 Tal Nisan 2015-01-12 12:39:26 UTC
Liron, please have a look

Comment 5 Liron Aravot 2015-01-13 16:01:07 UTC
The issue may be an infra problem accessing the storage using ioprocess, which causes SPM start to fail.
Regardless of that, there is an existing problem in the spmStart flow that should be fixed: the fact that the failure was in accessing the master domain should be propagated to the engine so that a reconstruct would be triggered (I'll clone this bug to a new one to handle that).

I can see in the logs that there is a problem accessing the storage from vdsm using ioprocess. It shows up both in the domain monitoring and in the SPM start (examples below). We can conclude that the problem isn't with the storage itself, as there are no exceptions in the sanlock/messages logs indicating that, so I suspect that the problem is somehow caused by ioprocess.

-----------------------------------------------------
Thread-18::ERROR::2015-01-11 09:41:31,765::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain eb51e96c-e4ae-4e41-a254-96959f76d60c monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 246, in _monitorDomain
    masterStats = self.domain.validateMaster()
  File "/usr/share/vdsm/storage/sd.py", line 587, in validateMaster
    if not self.oop.fileUtils.pathExists(pdir):
  File "/usr/share/vdsm/storage/outOfProcess.py", line 179, in pathExists
    return self._iop.pathExists(filename, writable)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 427, in pathExists
    if self.access(filename, check):
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 441, in access
    self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

-----------------------------------------------------
Thread-13::INFO::2015-01-11 09:41:30,916::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'3f68b117-6e66-40fd-b2bb-54c5f5372f0c': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '13.2', 'valid': False}, u'146f410f-f906-4d4a-bb29-75f2341a3612': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '55.9', 'valid': False}, u'eb51e96c-e4ae-4e41-a254-96959f76d60c': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000562465', 'lastCheck': '70.1', 'valid': True}}
-----------------------------------------------------


It can be seen that the lease is acquired and there are no exceptions in the sanlock log (consulted with nsoffer about that).

As this issue was reproduced on 2 hosts, adding needinfo? on the infra team. Nir, if you have any insights on this, that would be great as well.
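The failure mode in the traceback above — a path check delegated to a helper with a client-side timeout, so a hung mount surfaces as ETIMEDOUT instead of blocking forever — can be sketched roughly as follows. This is a simplified thread-based stand-in, not the actual ioprocess implementation (which uses a separate helper process); the function name and timeout value are illustrative:

```python
import errno
import os
import threading


def path_exists_with_timeout(path, timeout=10.0):
    """Check a path in a worker thread so a hung (e.g. stale NFS) mount
    cannot block the caller forever; raise ETIMEDOUT after `timeout`.
    A simplified stand-in for vdsm's out-of-process path checks."""
    result = []

    def worker():
        # os.path.exists may hang indefinitely on a stale NFS mount;
        # doing it in a worker lets the caller give up after a deadline.
        result.append(os.path.exists(path))

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # The check did not come back in time -- report a timeout,
        # matching the "Connection timed out" seen in the traceback.
        raise OSError(errno.ETIMEDOUT, os.strerror(errno.ETIMEDOUT))
    return result[0]
```

On a healthy filesystem the check returns immediately; only a mount stuck in uninterruptible I/O would trip the timeout path.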

Elad, can you please try to reproduce it with

Comment 6 Liron Aravot 2015-01-13 16:05:28 UTC
Elad, please try to reproduce using RFH and ioprocess.
thanks.

Comment 7 Yeela Kaplan 2015-01-14 09:48:28 UTC
What about all of the following errors?:
Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present

Comment 8 Liron Aravot 2015-01-14 10:00:46 UTC
AFAIK those errors are related to libvirt and not to the storage. This seems to be handled in this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1181157
I'm adding needinfo? on mskivak just to take a look.

Returning the needinfo? to Yeela/Sagi from the infra team meanwhile to look at the storage issue.

Comment 9 Yeela Kaplan 2015-01-14 10:16:01 UTC
I'm not sure that the absence of issues in the sanlock log means there are no storage issues.

All we see in the logs is a timeout, which probably occurs due to storage issues.

Let's wait for QE input on reproduction (either ioprocess or RFH), and continue from there.

Comment 10 Elad 2015-01-14 13:01:30 UTC
Created attachment 979990 [details]
logs 14.1.15 (with ioprocess)

Reproduced again, using ioprocess. Logs attached.

2015-01-14 14:57:23,735 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) [72b5e35] Command SpmStartVDSCommand(HostName = green-vdsb, HostId = 9ec8bdde-a88b-42d4-bfa8-b805b0078839, storagePoolId = 00000002-0002-0002-0002-0000000003e1, prevId=-1, prevLVER=-1, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to SpmStartVDS, error = Resource timeout: ()

Comment 11 Yaniv Lavi 2015-01-14 13:11:09 UTC
Since this seems severe and was recreated by QE, marking as RC/GA blocker.

Comment 12 Oved Ourfali 2015-01-14 13:34:44 UTC
Going into "green-vdsb", and running df, shows that it is stuck.

Also, browsing into /rhev/data-center/mnt/ and doing "ls -l", gets stuck as well.
Probably a stale NFS issue.

Comment 13 Allon Mureinik 2015-01-14 14:00:33 UTC
(In reply to Oved Ourfali from comment #12)
> Going into "green-vdsb", and running df, shows that it is stuck.
> 
> Also, browsing into /rhev/data-center/mnt/ and doing "ls -l", gets stuck as
> well.
> Probably a stale NFS issue.
Offhand, I tend to agree.
Liron, any supporting evidence? Do we know how we got there?

Comment 14 Elad 2015-01-14 14:25:35 UTC
Oved, we are currently working on this host (Liron and Yeela are here). Also, there is a storage domain there which is located on an unreachable storage server.

Comment 15 Tal Nisan 2015-01-14 16:10:36 UTC
After deep inspection done by Elad, Liron & Yeela, the issue seems to reside in ioprocess, hence moving to infra.

Comment 16 Yeela Kaplan 2015-01-14 16:34:59 UTC
Right now it looks like this bug reproduces only when one of the storage domains in the setup is unavailable.

What could be causing this is that in 3.5 we have a single IOProcess to handle all oop operations.
This means that an unavailable SD will take up all available threads and no operations can be performed by other domains.

This can be resolved by using the fix for BZ#1126206.

We will try to reproduce and use this fix tomorrow and then we'll have a clear view of the right approach to take.
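The starvation described in comment 16 can be sketched with a generic thread-pool analogy (plain Python, not the actual ioprocess API; pool sizes and timeouts are illustrative): one shared pool serving every domain gets exhausted by the unreachable domain's hung calls, while one pool per domain isolates the failure, which is the approach of the BZ#1126206 fix:

```python
import concurrent.futures as cf
import threading

release = threading.Event()


def hung_io():
    """Stands in for an I/O call against an unreachable storage domain."""
    release.wait()


def healthy_io():
    """Stands in for an I/O call against a reachable storage domain."""
    return "ok"


# One shared pool for all domains (analogous to the single IOProcess in 3.5):
shared = cf.ThreadPoolExecutor(max_workers=2)
for _ in range(2):            # the stale domain's calls occupy every worker...
    shared.submit(hung_io)
try:                          # ...so the healthy domain's call starves
    shared.submit(healthy_io).result(timeout=0.5)
    starved = False
except cf.TimeoutError:
    starved = True

# One pool per domain (analogous to one IOProcess per SD, per BZ#1126206):
stale_pool = cf.ThreadPoolExecutor(max_workers=2)
healthy_pool = cf.ThreadPoolExecutor(max_workers=2)
for _ in range(2):
    stale_pool.submit(hung_io)    # the stale domain only exhausts its own pool
isolated = healthy_pool.submit(healthy_io).result(timeout=0.5)

release.set()                 # unblock the simulated hung calls so we exit
print(starved, isolated)      # True ok
```

With the shared pool the healthy domain's request times out exactly like the spmStart path-check did; with per-domain pools it completes even while the stale domain is wedged.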

Comment 17 Allon Mureinik 2015-01-14 17:32:49 UTC
(In reply to Yeela Kaplan from comment #16)
> Right now it looks like this bug reproduces only when one of the storage
> domains in the setup is unavailable.
> 
> What could be causing this is that in 3.5 we have a single IOProcess to
> handle all oop operations.
> This means that an unavailable SD will take up all available threads and no
> operations can be performed by other domains.
> 
> This can be resolved by using the fix for BZ#1126206.
> 
> We will try to reproduce and use this fix tomorrow and then we'll have a
> clear view of the right approach to take.

Yeela, thanks for your help on this issue!

If you can confirm this assumption tomorrow, we need to see if it's feasible to backport it to 3.5.0 given the pressing timeframe.
If not, we should document that using RFH is a viable workaround.

Comment 18 Barak 2015-01-15 14:52:25 UTC
The fix was merged upstream as a part of the work on Bug 1126206 - [RFE] Use one instance of IOProcess per SD.

Comment 20 Elad 2015-01-21 11:53:45 UTC
SpmStart succeeds after Reconstruct Master completed. Tested with a stale NFS master domain. Another domain in the DC took master and was activated successfully.

Verified using rhev 3.5 vt13.8 
vdsm-4.16.8.1-6.el7ev.x86_64
ioprocess-0.14.0-3.el7.x86_64

Comment 22 Eyal Edri 2015-02-15 09:14:29 UTC
Bugs were moved by ERRATA to RELEASE PENDING but not closed, probably due to an errata error.
Closing as 3.5.0 is released.