Bug 1149448 - [RHEL7] VDSM isn't fenced by sanlock after the connection to the master domain is blocked
Summary: [RHEL7] VDSM isn't fenced by sanlock after the connection to the master domai...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-webadmin-portal
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Idan Shaby
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2014-10-05 09:59 UTC by lkuchlan
Modified: 2016-02-10 16:59 UTC
CC List: 17 users

Fixed In Version: vt13.4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
image and logs (863.57 KB, application/x-gzip) - 2014-10-05 10:00 UTC, lkuchlan
logs (1.23 MB, application/x-gzip) - 2014-12-03 16:45 UTC, lkuchlan
logs and image (702.03 KB, application/x-gzip) - 2014-12-17 13:54 UTC, lkuchlan
selinux log (49.06 KB, application/x-gzip) - 2015-01-28 09:43 UTC, lkuchlan

Description lkuchlan 2014-10-05 09:59:44 UTC
Description of problem:

After the connection to the master domain is blocked, no other domain (iSCSI) becomes the new master domain.

When the master data domain (NFS) connection is blocked on the SPM, the available iSCSI domain does not become the new master data domain.


Version-Release number of selected component (if applicable):
3.5 vt3.1

How reproducible:
100%

Steps to Reproduce:
1. On a setup with a single vdsm host, create 1 NFS Data storage domain (Master)
2. Create 2 iSCSI domains
3. Block the connection between the vdsm host and the current master domain (NFS); see the sketch below
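
For step 3, one way to block the connection on the host (a hedged sketch; <nfs-server-ip> is a placeholder, not an address from this bug):

  # Drop all traffic from the vdsm host to the NFS server
  iptables -A OUTPUT -d <nfs-server-ip> -j DROP
  # To unblock afterwards, delete the same rule
  iptables -D OUTPUT -d <nfs-server-ip> -j DROP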

Actual results:
The oVirt engine does not migrate the master domain to one of the iSCSI domains, which are available and accessible

Expected results:
One of the available iSCSI domains should be selected as the new Master data domain

Comment 1 lkuchlan 2014-10-05 10:00:32 UTC
Created attachment 943988 [details]
image and logs

Comment 2 Liron Aravot 2014-11-05 13:32:44 UTC
lkuchlan, the other iSCSI domain in your setup (called 'iscsi') is in maintenance, so it's not available to become the new master anyway.

The main issue here seems to be that vdsm isn't being fenced by sanlock after the connection to the master domain is blocked. This leads to the engine/vdsm constantly failing with connection timeout errors instead of failing with a StoragePoolMasterNotFound exception, which would lead to a reconstruct.
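
For reference, sanlock's state on the host can be inspected directly (a hedged aside using the standard sanlock CLI; these commands are not taken from this bug's logs):

  # List the lockspaces and resources sanlock currently holds on this host
  sanlock client status
  # sanlock's own log records lease renewal failures and fencing decisions
  tail -n 50 /var/log/sanlock.log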

lkuchlan, can you please try to reproduce this and also attach the sanlock logs?
Nir, have you handled a similar bug recently?



repoStats/getStoragePoolInfo:

Thread-705::INFO::2014-10-02 13:10:01,033::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'ea54e2ae-e283-460f-97e2-77af7737de7e': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '69.2', 'valid': False}, 'f24f6d96-51eb-495f-b5ec-6bd1be521de5': {'code': 200, 'version': -1, 'acquired': False, 'delay': '0.00613357', 'lastCheck': '60.4', 'valid': False}, u'96267cb4-280f-43c6-b530-afbf8852fed9': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '61.2', 'valid': False}}



Thread-758::DEBUG::2014-10-02 13:13:23,677::task::993::Storage.TaskManager.Task::(_decref) Task=`a19c73d0-f572-40ce-be4c-c5549f35e727`::ref 1 aborting False
Thread-758::INFO::2014-10-02 13:13:23,730::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'name': 'No Description', 'isoprefix': '', 'pool_status': 'connected', 'lver': 47L, 'domains': '96267cb4-280f-43c6-b530-afbf8852fed9:Active,f24f6d96-51eb-495f-b5ec-6bd1be521de5:Active,847b74d3-f092-40cf-9bb8-19c7ae498e62:Attached,ea54e2ae-e283-460f-97e2-77af7737de7e:Active', 'master_uuid': 'f24f6d96-51eb-495f-b5ec-6bd1be521de5', 'version': '3', 'spm_id': 1, 'type': 'ISCSI', 'master_ver': 5}, 'dominfo': {'96267cb4-280f-43c6-b530-afbf8852fed9': {'status': 'Active', 'isoprefix': '', 'alerts': [], 'version': -1}, 'f24f6d96-51eb-495f-b5ec-6bd1be521de5': {'status': 'Active', 'diskfree': '37044092928', 'isoprefix': '', 'alerts': [], 'disktotal': '53284438016', 'version': -1}, '847b74d3-f092-40cf-9bb8-19c7ae498e62': {'status': 'Attached', 'isoprefix': '', 'alerts': []}, 'ea54e2ae-e283-460f-97e2-77af7737de7e': {'status': 'Active', 'isoprefix': '', 'alerts': [], 'version': -1}}}

Comment 3 Nir Soffer 2014-11-05 14:01:31 UTC
Looks like a duplicate of bug 1141658.

Please provide the versions of these packages:

- selinux-policy
- selinux-policy-targeted
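
For example, the installed versions can be listed with standard rpm usage:

  rpm -q selinux-policy selinux-policy-targeted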

Comment 4 Nir Soffer 2014-11-05 14:03:39 UTC
Additionally, attach these files:

- /var/log/messages
- /var/log/sanlock.log
- /var/log/audit/audit.log
  Attach the file covering the timeframe of this error; it may be one
  of the rotated files (e.g. audit.log.3.gz)
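
They can be collected into a single archive, for example (a sketch; file names as listed above):

  tar czf host-logs.tar.gz /var/log/messages /var/log/sanlock.log \
      /var/log/audit/audit.log*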

Comment 5 Elad 2014-11-05 14:59:59 UTC
Nir, I don't think it is a DUP of bug 1141658 because, as we can see in the attached screenshot and in vdsm.log, the reconstruct starts (and fails), but the host is not rebooted. As opposed to 1141658, in which the host is rebooted ~2 minutes after the connection to the master domain is lost; there, when the host comes back up, the connection to the storage is resumed and the DC becomes active because the iptables rules are cleaned.

Comment 6 lkuchlan 2014-11-05 16:08:26 UTC
Liron, I cannot reproduce it on RHEL 6.6

Comment 7 Allon Mureinik 2014-11-24 17:52:54 UTC
(In reply to lkuchlan from comment #6)
> Liron, I cannot reproduce it on RHEL 6.6
So what OS does this reproduce on? RHEL 7? RHEL 6.5?

Comment 8 lkuchlan 2014-11-29 18:02:31 UTC
(In reply to Allon Mureinik from comment #7)
> (In reply to lkuchlan from comment #6)
> > Liron, I cannot reproduce it on RHEL 6.6
> So what OS does this reproduce on? RHEL 7? RHEL 6.5?

The bug reproduces on RHEL 7

Comment 9 Allon Mureinik 2014-12-01 12:39:53 UTC
(In reply to Nir Soffer from comment #3)
> Looks like a duplicate of bug 1141658.
> 
> Please provide the version these packages:
> 
> - selinux-policy
> - selinux-policy-targeted

(In reply to Nir Soffer from comment #4)
> Additionally, attach these files:
> 
> - /var/log/messages
> - /var/log/sanlock.log
> - /var/log/audit/audit.log
>   Add the file showing the timeframe of this error, it may be one
>   of the rotated files (e.g. audit.log.3.gz)

These requests were somehow missed among all the comments here.
Liron, can you please provide this info?

Comment 10 lkuchlan 2014-12-03 16:45:38 UTC
Created attachment 964217 [details]
logs

Package versions:
libselinux-2.2.2-6.el7.x86_64
libselinux-ruby-2.2.2-6.el7.x86_64
selinux-policy-3.12.1-153.el7.noarch
libselinux-utils-2.2.2-6.el7.x86_64
selinux-policy-targeted-3.12.1-153.el7.noarch
libselinux-python-2.2.2-6.el7.x86_64

Please find the logs attached.

Comment 12 lkuchlan 2014-12-17 13:54:01 UTC
Created attachment 970134 [details]
logs and image

It still reproduces.
Please find the logs attached.

Comment 13 Yaniv Lavi 2015-01-15 09:54:27 UTC
Can you recreate this on RHEL 7.1?

Comment 17 lkuchlan 2015-01-26 15:20:48 UTC
(In reply to Yaniv Dary from comment #13)
> Can you recreate this on RHEL 7.1?

Hi Yaniv,
It does not reproduce on RHEL 7.1

Comment 18 Yaniv Lavi 2015-01-26 15:29:35 UTC
lkuchlan, can you please add the SELinux logs on the target host?
Bronce, can we bring this up with the RHEL PM to try to understand why this is happening and try to get the fix for the policy to 7.0.z?

Comment 19 Bronce McClain 2015-01-26 18:07:27 UTC
(In reply to Yaniv Dary from comment #18)
> lkuchlan, can you please add the SELinux logs on the target host?
> Bronce, can we bring this up with the RHEL PM to try to understand why this
> is happening and try to get the fix for the policy to 7.0.z?

Is there a rhel bz for this w/ the build that fixed it in 7.1?

Comment 20 Yaniv Lavi 2015-01-27 08:23:28 UTC
(In reply to Bronce McClain from comment #19)
> (In reply to Yaniv Dary from comment #18)
> > lkuchlan, can you please add the SELinux logs on the target host?
> > Bronce, can we bring this up with the RHEL PM to try to understand why this
> > is happening and try to get the fix for the policy to 7.0.z?
> 
> Is there a rhel bz for this w/ the build that fixed it in 7.1?

They made extensive changes to the SELinux policy, so we don't know the bug number.
lkuchlan, what 7.1 build did you test with?

Comment 21 Tal Nisan 2015-01-27 11:47:19 UTC
vt13.4 contains a newer sanlock requirement (selinux-policy-targeted >= 3.12.1-153.el7_0.13) than the one checked here (see bug 1141658).
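
The installed policy version can be confirmed with standard rpm usage, for example:

  rpm -q selinux-policy-targeted
  # expected: 3.12.1-153.el7_0.13 or newer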

We're unable to reproduce this in dev - can you please retry with the latest 3.5.0 on RHEL7, and confirm this is closed?


Thanks!

Comment 22 lkuchlan 2015-01-28 09:43:28 UTC
Created attachment 985054 [details]
selinux log

Tested using vdsm-4.16.8.1-6.el7ev.x86_64, rhevm-3.5.0-0.30.el6ev.noarch
Please find attached the selinux log on the target host
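
If helpful, AVC denials around the relevant timeframe can be extracted with the audit tools (a hedged example, assuming auditd is running):

  ausearch -m avc -ts today
  # or, without ausearch:
  grep 'avc:  denied' /var/log/audit/audit.log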

Comment 23 Nir Soffer 2015-02-15 11:41:41 UTC
(In reply to lkuchlan from comment #22)
> Created attachment 985054 [details]
> selinux log
> 
> Tested using vdsm-4.16.8.1-6.el7ev.x86_64, rhevm-3.5.0-0.30.el6ev.noarch
> Please find attached the selinux log on the target host

What are the results of this test? Did you reproduce this error or not?

If it cannot be reproduced, this should be moved to VERIFIED. The bug is still in ON_QA - do you need additional testing?

Comment 24 Nir Soffer 2015-02-15 11:43:35 UTC
Adding back needinfo for Bronce, added in comment 18.

Comment 25 lkuchlan 2015-02-15 12:59:23 UTC
Tested using vdsm-4.16.8.1-6.el7ev.x86_64, rhevm-3.5.0-0.30.el6ev.noarch

Comment 26 Allon Mureinik 2015-02-16 19:11:40 UTC
RHEV-M 3.5.0 has been released, closing this bug.


