Bug 1180899

Summary: [vdsm] spmStart fails with timeout, the DC cannot be initialized
Product: Red Hat Enterprise Virtualization Manager Reporter: Elad <ebenahar>
Component: vdsm Assignee: Yeela Kaplan <ykaplan>
Status: CLOSED ERRATA QA Contact: Elad <ebenahar>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.5.0 CC: amureini, bazulay, danken, ebenahar, ecohen, gklein, iheim, laravot, lpeer, lsurette, nsoffer, oourfali, pstehlik, scohen, smizrahi, tnisan, ybronhei, yeylon, ykaplan, ylavi
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: vdsm-4.16.8.1-6.el6ev Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-02-15 09:14:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1182563    
Bug Blocks: 1164308, 1164311    
Attachments:
Description Flags
logs from vdsm and engine and db dump
none
logs 14.1.15 (with ioprocess) none

Description Elad 2015-01-11 16:23:42 UTC
Created attachment 978807 [details]
logs from vdsm and engine and db dump

Description of problem:

On a 3.4.0-1 setup, we upgraded the hosts to 3.5 vt13.6. The setup seemed to be working fine; one host took SPM. After about two days, we observed that the DC was stuck in status 'contending' and that no host could take SPM.

Version-Release number of selected component (if applicable):
Upgraded from 3.4 av9.6 to 3.5 vt13.6

How reproducible:
Occurred on 2 setups
I'm not sure about the exact flow

Steps to Reproduce:
1. Upgraded hosts to 3.5 vt13.6, didn't upgrade rhevm


Actual results:
spmStart fails with 'connection timed out':

2015-01-11 09:56:52,105 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-43) [421a2a76] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException
: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Connection timed out, code = 100

Tasks seemed to be stuck in vdsm in state 'finished'
DC got stuck in state 'contending'

Expected results:
spmStart should succeed.

Additional info:
logs from vdsm and engine and db dump

Comment 1 Tal Nisan 2015-01-12 09:53:08 UTC
Elad, did you restart the hosts after the upgrade?

Comment 2 Elad 2015-01-12 10:05:33 UTC
redirecting the need-info to Gil, as he performed the upgrade

Comment 3 Gil Klein 2015-01-12 11:54:45 UTC
I ran the following steps when upgrading the host:
1. Moved the host to maintenance
2. Added new repos for RHEL/RHEV
3. Ran "yum update" and upgraded both RHEL and RHEV packages (RHEL 6.5->6.6, RHEV 3.4->3.5)
4. Rebooted the host in order to load the new kernel
5. Activated the host

Comment 4 Tal Nisan 2015-01-12 12:39:26 UTC
Liron, please have a look

Comment 5 Liron Aravot 2015-01-13 16:01:07 UTC
The issue may be an infra problem accessing the storage using ioprocess, which causes SPM start to fail.
Regardless of that, there is an existing problem in the spmStart flow that should be fixed: the fact that the failure was in accessing the master domain should be propagated to the engine so that a reconstruct would be triggered (I'll clone this bug to a new one to handle that).

I can see in the logs that there is a problem accessing the storage from vdsm using ioprocess. It shows up both in the domain monitoring and in the SPM start (examples below). We can conclude that the problem isn't with the storage itself, as there are no exceptions in the sanlock/messages logs indicating that, so I suspect that the problem is somehow caused by ioprocess.

-----------------------------------------------------
Thread-18::ERROR::2015-01-11 09:41:31,765::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain eb51e96c-e4ae-4e41-a254-96959f76d60c monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 246, in _monitorDomain
    masterStats = self.domain.validateMaster()
  File "/usr/share/vdsm/storage/sd.py", line 587, in validateMaster
    if not self.oop.fileUtils.pathExists(pdir):
  File "/usr/share/vdsm/storage/outOfProcess.py", line 179, in pathExists
    return self._iop.pathExists(filename, writable)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 427, in pathExists
    if self.access(filename, check):
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 441, in access
    self.timeout)
  File "/usr/lib/python2.6/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

-----------------------------------------------------
Thread-13::INFO::2015-01-11 09:41:30,916::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'3f68b117-6e66-40fd-b2bb-54c5f5372f0c': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '13.2', 'valid': False}, u'146f410f-f906-4d4a-bb29-75f2341a3612': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '55.9', 'valid': False}, u'eb51e96c-e4ae-4e41-a254-96959f76d60c': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000562465', 'lastCheck': '70.1', 'valid': True}}
-----------------------------------------------------


It can be seen that the lease is acquired and there are no exceptions in the sanlock log (consulted with nsoffer about that).

As this issue was reproduced on 2 hosts, adding needinfo? on the infra team. Nir, if you have any insights on this, that would be great as well.
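The failure mode in the traceback above — a path check delegated to a helper with a client-side timeout, so a hung mount surfaces as ETIMEDOUT instead of blocking forever — can be sketched roughly as follows. This is a simplified thread-based stand-in, not the actual ioprocess implementation (which uses a separate helper process); the function name and timeout value are illustrative:

```python
import errno
import os
import threading


def path_exists_with_timeout(path, timeout=10.0):
    """Check a path in a worker thread so a hung (e.g. stale NFS) mount
    cannot block the caller forever; raise ETIMEDOUT after `timeout`.
    A simplified stand-in for vdsm's out-of-process path checks."""
    result = []

    def worker():
        # os.path.exists may hang indefinitely on a stale NFS mount;
        # doing it in a worker lets the caller give up after a deadline.
        result.append(os.path.exists(path))

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # The check did not come back in time -- report a timeout,
        # matching the "Connection timed out" seen in the traceback.
        raise OSError(errno.ETIMEDOUT, os.strerror(errno.ETIMEDOUT))
    return result[0]
```

On a healthy filesystem the check returns immediately; only a mount stuck in uninterruptible I/O would trip the timeout path.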

Elad, can you please try to reproduce it with

Comment 6 Liron Aravot 2015-01-13 16:05:28 UTC
Elad, please try to reproduce using RFH and ioprocess.
thanks.

Comment 7 Yeela Kaplan 2015-01-14 09:48:28 UTC
What about all of the following errors?:
Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present

Comment 8 Liron Aravot 2015-01-14 10:00:46 UTC
AFAIK those errors are related to libvirt and not to the storage. This seems to be handled in this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1181157
I'm adding needinfo? on mskivak just to take a look.

Returning the needinfo? to Yeela/Sagi from the infra team meanwhile to look at the storage issue.

Comment 9 Yeela Kaplan 2015-01-14 10:16:01 UTC
I'm not sure that the absence of issues in the sanlock log means there are no storage issues.

All we see in the logs is a timeout, which probably occurs due to storage issues.

Let's wait for QE input on reproduction (either ioprocess or RFH), and continue from there.

Comment 10 Elad 2015-01-14 13:01:30 UTC
Created attachment 979990 [details]
logs 14.1.15 (with ioprocess)

Reproduced again, using ioprocess. Logs attached.

2015-01-14 14:57:23,735 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) [72b5e35] Command SpmStartVDSCommand(HostName = green-vdsb, HostId = 9ec8bdde-a88b-42d4-bfa8-b805b0078839, storagePoolId = 00000002-0002-0002-0002-0000000003e1, prevId=-1, prevLVER=-1, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to SpmStartVDS, error = Resource timeout: ()

Comment 11 Yaniv Lavi 2015-01-14 13:11:09 UTC
Since this seems severe and was recreated by QE, marking as RC/GA blocker.

Comment 12 Oved Ourfali 2015-01-14 13:34:44 UTC
Going into "green-vdsb", and running df, shows that it is stuck.

Also, browsing into /rhev/data-center/mnt/ and doing "ls -l", gets stuck as well.
Probably a stale NFS issue.

Comment 13 Allon Mureinik 2015-01-14 14:00:33 UTC
(In reply to Oved Ourfali from comment #12)
> Going into "green-vdsb", and running df, shows that it is stuck.
> 
> Also, browsing into /rhev/data-center/mnt/ and doing "ls -l", gets stuck as
> well.
> Probably a stale NFS issue.
Offhand, I tend to agree.
Liron, any supporting evidence? Do we know how we got there?

Comment 14 Elad 2015-01-14 14:25:35 UTC
Oved, we are currently working on this host (Liron and Yeela are here). Also, there is a storage domain there which is located on an unreachable storage server.

Comment 15 Tal Nisan 2015-01-14 16:10:36 UTC
After deep inspection done by Elad, Liron & Yeela, the issue seems to reside in ioprocess, hence moving to infra.

Comment 16 Yeela Kaplan 2015-01-14 16:34:59 UTC
Right now it looks like this bug reproduces only when one of the storage domains in the setup is unavailable.

What could be causing this is that in 3.5 we have a single IOProcess to handle all oop operations.
This means that an unavailable SD will take up all available threads and no operations can be performed by other domains.

This can be resolved by using the fix for BZ#1126206.

We will try to reproduce and use this fix tomorrow and then we'll have a clear view of the right approach to take.
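The starvation described in comment 16 can be sketched with a generic thread-pool analogy (plain Python, not the actual ioprocess API; pool sizes and timeouts are illustrative): one shared pool serving every domain gets exhausted by the unreachable domain's hung calls, while one pool per domain isolates the failure, which is the approach of the BZ#1126206 fix:

```python
import concurrent.futures as cf
import threading

release = threading.Event()


def hung_io():
    """Stands in for an I/O call against an unreachable storage domain."""
    release.wait()


def healthy_io():
    """Stands in for an I/O call against a reachable storage domain."""
    return "ok"


# One shared pool for all domains (analogous to the single IOProcess in 3.5):
shared = cf.ThreadPoolExecutor(max_workers=2)
for _ in range(2):            # the stale domain's calls occupy every worker...
    shared.submit(hung_io)
try:                          # ...so the healthy domain's call starves
    shared.submit(healthy_io).result(timeout=0.5)
    starved = False
except cf.TimeoutError:
    starved = True

# One pool per domain (analogous to one IOProcess per SD, per BZ#1126206):
stale_pool = cf.ThreadPoolExecutor(max_workers=2)
healthy_pool = cf.ThreadPoolExecutor(max_workers=2)
for _ in range(2):
    stale_pool.submit(hung_io)    # the stale domain only exhausts its own pool
isolated = healthy_pool.submit(healthy_io).result(timeout=0.5)

release.set()                 # unblock the simulated hung calls so we exit
print(starved, isolated)      # True ok
```

With the shared pool the healthy domain's request times out exactly like the spmStart path-check did; with per-domain pools it completes even while the stale domain is wedged.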

Comment 17 Allon Mureinik 2015-01-14 17:32:49 UTC
(In reply to Yeela Kaplan from comment #16)
> Right now it looks like this bug reproduces only when one of the storage
> domains in the setup is unavailable.
> 
> What could be causing this is that in 3.5 we have a single IOProcess to
> handle all oop operations.
> This means that an unavailable SD will take up all available threads and no
> operations can be performed by other domains.
> 
> This can be resolved by using the fix for BZ#1126206.
> 
> We will try to reproduce and use this fix tomorrow and then we'll have a
> clear view of the right approach to take.

Yeela, thanks for your help on this issue!

If you can confirm this assumption tomorrow, we need to see if it's feasible to backport it to 3.5.0 given the pressing timeframe.
If not, we should document that using RFH is a viable workaround.

Comment 18 Barak 2015-01-15 14:52:25 UTC
The fix was merged upstream as a part of the work on Bug 1126206 - [RFE] Use one instance of IOProcess per SD.

Comment 20 Elad 2015-01-21 11:53:45 UTC
SpmStart succeeds after Reconstruct Master completed. Tested with a stale NFS master domain. Another domain in the DC took master and was activated successfully.

Verified using rhev 3.5 vt13.8 
vdsm-4.16.8.1-6.el7ev.x86_64
ioprocess-0.14.0-3.el7.x86_64

Comment 22 Eyal Edri 2015-02-15 09:14:29 UTC
Bugs were moved by ERRATA to RELEASE PENDING but not closed, probably due to an errata error.
Closing as 3.5.0 is released.