Bug 1149448 - [RHEL7] VDSM isn't fenced by sanlock after the connection to the master domain is blocked
Summary: [RHEL7] VDSM isn't fenced by sanlock after the connection to the master domai...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-webadmin-portal
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Idan Shaby
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2014-10-05 09:59 UTC by lkuchlan
Modified: 2016-02-10 16:59 UTC
CC List: 17 users

Fixed In Version: vt13.4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
image and logs (863.57 KB, application/x-gzip) - 2014-10-05 10:00 UTC, lkuchlan
logs (1.23 MB, application/x-gzip) - 2014-12-03 16:45 UTC, lkuchlan
logs and image (702.03 KB, application/x-gzip) - 2014-12-17 13:54 UTC, lkuchlan
selinux log (49.06 KB, application/x-gzip) - 2015-01-28 09:43 UTC, lkuchlan

Description lkuchlan 2014-10-05 09:59:44 UTC
Description of problem:

After the connection to the master domain is blocked, no other domain (iSCSI) becomes the new master domain.

When the master data domain (NFS) connection is blocked on the SPM, the available iSCSI domain does not become the new master data domain.


Version-Release number of selected component (if applicable):
3.5 vt3.1

How reproducible:
100%

Steps to Reproduce:
1. On a setup with a single vdsm host, create 1 NFS Data storage domain (Master)
2. Create 2 iSCSI domains
3. Block the connection between the vdsm host and the current master domain (NFS); see the sketch below
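
For step 3, one way to block the connection on the host (a hedged sketch; <nfs-server-ip> is a placeholder, not an address from this bug):

  # Drop all traffic from the vdsm host to the NFS server
  iptables -A OUTPUT -d <nfs-server-ip> -j DROP
  # To unblock afterwards, delete the same rule
  iptables -D OUTPUT -d <nfs-server-ip> -j DROP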

Actual results:
The oVirt engine does not migrate the master domain to one of the iSCSI domains, which are available and accessible

Expected results:
One of the available iSCSI domains should be selected as the new Master data domain

Comment 1 lkuchlan 2014-10-05 10:00:32 UTC
Created attachment 943988 [details]
image and logs

Comment 2 Liron Aravot 2014-11-05 13:32:44 UTC
lkuchlan, the other iSCSI domain in your setup (called 'iscsi') is in maintenance, so it's not available to become the new master anyway.

The main issue here seems to be that vdsm isn't being fenced by sanlock after the connection to the master domain is blocked. This leads to the engine/vdsm constantly failing with connection timeout errors instead of failing with a StoragePoolMasterNotFound exception, which would lead to a reconstruct.
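
For reference, sanlock's state on the host can be inspected directly (a hedged aside using the standard sanlock CLI; these commands are not taken from this bug's logs):

  # List the lockspaces and resources sanlock currently holds on this host
  sanlock client status
  # sanlock's own log records lease renewal failures and fencing decisions
  tail -n 50 /var/log/sanlock.log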

lkuchlan, can you please try to reproduce this and also attach the sanlock logs?
Nir, have you handled a similar bug recently?



repoStats/getStoragePoolInfo:

Thread-705::INFO::2014-10-02 13:10:01,033::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'ea54e2ae-e283-460f-97e2-77af7737de7e': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '69.2', 'valid': False}, 'f24f6d96-51eb-495f-b5ec-6bd1be521de5': {'code': 200, 'version': -1, 'acquired': False, 'delay': '0.00613357', 'lastCheck': '60.4', 'valid': False}, u'96267cb4-280f-43c6-b530-afbf8852fed9': {'code': 358, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '61.2', 'valid': False}}



Thread-758::DEBUG::2014-10-02 13:13:23,677::task::993::Storage.TaskManager.Task::(_decref) Task=`a19c73d0-f572-40ce-be4c-c5549f35e727`::ref 1 aborting False
Thread-758::INFO::2014-10-02 13:13:23,730::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'name': 'No Description', 'isoprefix': '', 'pool_status': 'connected', 'lver': 47L, 'domains': '96267cb4-280f-43c6-b530-afbf8852fed9:Active,f24f6d96-51eb-495f-b5ec-6bd1be521de5:Active,847b74d3-f092-40cf-9bb8-19c7ae498e62:Attached,ea54e2ae-e283-460f-97e2-77af7737de7e:Active', 'master_uuid': 'f24f6d96-51eb-495f-b5ec-6bd1be521de5', 'version': '3', 'spm_id': 1, 'type': 'ISCSI', 'master_ver': 5}, 'dominfo': {'96267cb4-280f-43c6-b530-afbf8852fed9': {'status': 'Active', 'isoprefix': '', 'alerts': [], 'version': -1}, 'f24f6d96-51eb-495f-b5ec-6bd1be521de5': {'status': 'Active', 'diskfree': '37044092928', 'isoprefix': '', 'alerts': [], 'disktotal': '53284438016', 'version': -1}, '847b74d3-f092-40cf-9bb8-19c7ae498e62': {'status': 'Attached', 'isoprefix': '', 'alerts': []}, 'ea54e2ae-e283-460f-97e2-77af7737de7e': {'status': 'Active', 'isoprefix': '', 'alerts': [], 'version': -1}}}

Comment 3 Nir Soffer 2014-11-05 14:01:31 UTC
Looks like a duplicate of bug 1141658.

Please provide the versions of these packages:

- selinux-policy
- selinux-policy-targeted
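
For example, the installed versions can be listed with standard rpm usage:

  rpm -q selinux-policy selinux-policy-targeted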

Comment 4 Nir Soffer 2014-11-05 14:03:39 UTC
Additionally, attach these files:

- /var/log/messages
- /var/log/sanlock.log
- /var/log/audit/audit.log
  Attach the file covering the timeframe of this error; it may be one
  of the rotated files (e.g. audit.log.3.gz)
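
They can be collected into a single archive, for example (a sketch; file names as listed above):

  tar czf host-logs.tar.gz /var/log/messages /var/log/sanlock.log \
      /var/log/audit/audit.log*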

Comment 5 Elad 2014-11-05 14:59:59 UTC
Nir, I don't think it is a DUP of bug 1141658 because, as we can see in the attached screenshot and in vdsm.log, the reconstruct starts (and fails), but the host is not rebooted. As opposed to 1141658, in which the host is rebooted ~2 minutes after the connection to the master domain is lost; there, when the host comes back up, the connection to the storage is resumed and the DC becomes active because the iptables rules are cleaned.

Comment 6 lkuchlan 2014-11-05 16:08:26 UTC
Liron, I cannot reproduce it on RHEL 6.6

Comment 7 Allon Mureinik 2014-11-24 17:52:54 UTC
(In reply to lkuchlan from comment #6)
> Liron, I cannot reproduce it on RHEL 6.6
So what OS does this reproduce on? RHEL 7? RHEL 6.5?

Comment 8 lkuchlan 2014-11-29 18:02:31 UTC
(In reply to Allon Mureinik from comment #7)
> (In reply to lkuchlan from comment #6)
> > Liron, I cannot reproduce it on RHEL 6.6
> So what OS does this reproduce on? RHEL 7? RHEL 6.5?

The bug reproduces on RHEL 7

Comment 9 Allon Mureinik 2014-12-01 12:39:53 UTC
(In reply to Nir Soffer from comment #3)
> Looks like a duplicate of bug 1141658.
> 
> Please provide the version these packages:
> 
> - selinux-policy
> - selinux-policy-targeted

(In reply to Nir Soffer from comment #4)
> Additionally, attach these files:
> 
> - /var/log/messages
> - /var/log/sanlock.log
> - /var/log/audit/audit.log
>   Add the file showing the timeframe of this error, it may be one
>   of the rotated files (e.g. audit.log.3.gz)

These requests were somehow missed among all the comments here.
Liron, can you please provide this info?

Comment 10 lkuchlan 2014-12-03 16:45:38 UTC
Created attachment 964217 [details]
logs

Package versions:
libselinux-2.2.2-6.el7.x86_64
libselinux-ruby-2.2.2-6.el7.x86_64
selinux-policy-3.12.1-153.el7.noarch
libselinux-utils-2.2.2-6.el7.x86_64
selinux-policy-targeted-3.12.1-153.el7.noarch
libselinux-python-2.2.2-6.el7.x86_64

Please find the logs attached.

Comment 12 lkuchlan 2014-12-17 13:54:01 UTC
Created attachment 970134 [details]
logs and image

It still reproduces.
Please find the logs attached.

Comment 13 Yaniv Lavi 2015-01-15 09:54:27 UTC
Can you recreate this on RHEL 7.1?

Comment 17 lkuchlan 2015-01-26 15:20:48 UTC
(In reply to Yaniv Dary from comment #13)
> Can you recreate this on RHEL 7.1?

Hi Yaniv,
It does not reproduce on RHEL 7.1

Comment 18 Yaniv Lavi 2015-01-26 15:29:35 UTC
lkuchlan, can you please add the SELinux logs on the target host?
Bronce, can we bring this up with the RHEL PM to try to understand why this is happening and try to get the fix for the policy to 7.0.z?

Comment 19 Bronce McClain 2015-01-26 18:07:27 UTC
(In reply to Yaniv Dary from comment #18)
> lkuchlan, can you please add the SELinux logs on the target host?
> Bronce, can we bring this up with the RHEL PM to try to understand why this
> is happening and try to get the fix for the policy to 7.0.z?

Is there a rhel bz for this w/ the build that fixed it in 7.1?

Comment 20 Yaniv Lavi 2015-01-27 08:23:28 UTC
(In reply to Bronce McClain from comment #19)
> (In reply to Yaniv Dary from comment #18)
> > lkuchlan, can you please add the SELinux logs on the target host?
> > Bronce, can we bring this up with the RHEL PM to try to understand why this
> > is happening and try to get the fix for the policy to 7.0.z?
> 
> Is there a rhel bz for this w/ the build that fixed it in 7.1?

They made extensive changes to the SELinux policy, so we don't know the bug number.
lkuchlan, what 7.1 build did you test with?

Comment 21 Tal Nisan 2015-01-27 11:47:19 UTC
vt13.4 contains a newer sanlock requirement (selinux-policy-targeted >= 3.12.1-153.el7_0.13) than the one checked here (see bug 1141658).
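
The installed policy version can be confirmed with standard rpm usage, for example:

  rpm -q selinux-policy-targeted
  # expected: 3.12.1-153.el7_0.13 or newer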

We're unable to reproduce this in dev - can you please retry with the latest 3.5.0 on RHEL7, and confirm this is closed?


Thanks!

Comment 22 lkuchlan 2015-01-28 09:43:28 UTC
Created attachment 985054 [details]
selinux log

Tested using vdsm-4.16.8.1-6.el7ev.x86_64, rhevm-3.5.0-0.30.el6ev.noarch
Please find attached the selinux log on the target host
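
If helpful, AVC denials around the relevant timeframe can be extracted with the audit tools (a hedged example, assuming auditd is running):

  ausearch -m avc -ts today
  # or, without ausearch:
  grep 'avc:  denied' /var/log/audit/audit.log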

Comment 23 Nir Soffer 2015-02-15 11:41:41 UTC
(In reply to lkuchlan from comment #22)
> Created attachment 985054 [details]
> selinux log
> 
> Tested using vdsm-4.16.8.1-6.el7ev.x86_64, rhevm-3.5.0-0.30.el6ev.noarch
> Please find attached the selinux log on the target host

What are the results of this test? Did you reproduce this error or not?

If it cannot be reproduced, this should be moved to VERIFIED. The bug is still in ON_QA - do you need additional testing?

Comment 24 Nir Soffer 2015-02-15 11:43:35 UTC
Adding back needinfo for Bronce, added in comment 18.

Comment 25 lkuchlan 2015-02-15 12:59:23 UTC
Tested using vdsm-4.16.8.1-6.el7ev.x86_64, rhevm-3.5.0-0.30.el6ev.noarch

Comment 26 Allon Mureinik 2015-02-16 19:11:40 UTC
RHEV-M 3.5.0 has been released, closing this bug.


