Created attachment 934294 [details]
vdsm and engine logs

Description of problem:
After blocking the connection between the host and the storage domain, the domain status changed to invalid:

"Message: Invalid status on Data Center Global_DC_1. Setting Data Center status to Non Responsive (On host 10.35.160.49, Error: General Exception)."

After unblocking the connection, the expected result (by design) is that the domain status changes back to valid.

Version-Release number of selected component (if applicable):
vdsm-4.16.2-1.gite8cba75.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
Setup: 1 host, 1 SD
1. Block the connection between the host and the SD (see the sketch below)
2. After ~5 minutes the SD status changes to invalid
3. Unblock the connection

Actual results:
The SD's status does not change back to valid.

Expected results:
The SD's status should change back to valid.

Additional info:
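For reference, one common way to block and unblock the host-to-storage connection (a sketch; the report does not say which tool was used, and STORAGE_IP is a placeholder for the storage server's address):

    # block all outgoing traffic from the host to the storage server
    iptables -A OUTPUT -d STORAGE_IP -j DROP

    # ...wait for the domain to become invalid, then remove the rule
    iptables -D OUTPUT -d STORAGE_IP -j DROP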
Why is this a regression? Does the same test work with the previous version?
Please add /var/log/sanlock.log and /var/log/messages showing the relevant timeframe
It can be seen that connectStoragePool is issued at 18:01:34,905, an exclusive lock on Storage.HsmDomainMonitorLock is acquired, the thread gets stuck executing an lvm command, and all subsequent connectStoragePool requests fail on timeout (more than 3 minutes). That lvm command stays stuck for over an hour (until the end of the log), keeping the lock acquired. Attaching it here:

-----------------------------------------------------------------------
Thread-15::DEBUG::2014-09-03 18:01:34,985::lvm::365::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-15::DEBUG::2014-09-03 18:01:34,986::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a0300083a53e6c7833e411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 2b8925db-e572-47cc-931e-029a51d32311 (cwd None)
-----------------------------------------------------------------------

Please also take a look at the discussion in bug 1121420 (https://bugzilla.redhat.com/show_bug.cgi?id=1121420); in that case the lvm command also seems to be stuck for over an hour.
What we can see in the vdsm log:

1. The SPM lost access to storage
2. sanlock killed vdsm
3. vdsm restarted at 2014-09-03 17:56:03
4. connectStoragePool was issued at 2014-09-03 18:01
5. The last thing vdsm does is run an lvm command:

Thread-15::DEBUG::2014-09-03 18:01:34,986::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a0300083a53e6c7833e411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 2b8925db-e572-47cc-931e-029a51d32311 (cwd None)

Thread-15 never returns until the end of the log at 2014-09-03 19:10:35,670, and vdsm was not restarted until the end of the log. It looks like lvm got stuck for more than an hour.

Since you claim that this is reproducible, please work with Zdenek from lvm on understanding why lvm got stuck for such a long time.
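While the command hangs, a quick check is whether the lvm process is blocked in the kernel (a sketch, assuming standard procps tooling):

    # STAT 'D' means uninterruptible sleep; wchan shows the kernel function it is waiting in
    ps -eo pid,stat,wchan:32,cmd | grep '[l]vm vgs'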
Zdenek, can you look at comments 3 and 4? I think this is the first time we have seen an lvm command stuck for such a long time. The reporter claims that this is 100% reproducible, so maybe this is our chance to understand why lvm was stuck for so long. Please instruct us on how to collect additional information and create a reproducer for this.
Raz, please add information about the system:
- kernel
- lvm2
- device-mapper-multipath
lvm2 - lvm2-2.02.100-8.el6.x86_64
kernel - kernel-2.6.32-431.1.2.el6.x86_64
device-mapper-multipath - device-mapper-multipath-0.4.9-72.el6_5.2.x86_64
Raz, another question - I remember we saw some odd behavior with your VNX environment. Does this bug reproduce with a different storage machine?
Raz, can you try to reproduce this with RHEL 7?
Insight about the automation blocker - this blocks a specific test, not the entire automation.
Hi Nir,
It doesn't reproduce with RHEL 7 hosts.
I debugged this with Raz on RHEL 6.5 and RHEV 3.5 beta. Reproduced 3 times with the current system configuration.

The issue is lvm getting stuck for an unlimited time, with kernel tracebacks in /var/log/messages. We should open a bug for lvm/multipath/kernel to track this issue.

After setting replacement_timeout to 5 seconds in iscsid.conf, and adding multipath configuration for the storage device (DGC) that sets "no_path_retry fail" (see the sketch below), there was no issue and recovery was immediate.

So this is another bug depending on bug 880738.
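A device-specific multipath.conf stanza of the kind described above would look roughly like this (a sketch; the exact options used during the debugging session may have differed):

    devices {
        device {
            vendor        "DGC"
            product       ".*"
            no_path_retry fail
        }
    }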
Please test again on RHEL 6.5 using this setup:

1. Edit /etc/iscsi/iscsid.conf and set replacement_timeout to 5:

   node.session.timeo.replacement_timeout = 5

2. Restart iscsid:

   service iscsid restart

3. Edit /etc/multipath.conf and add this device in the "devices" section:

   devices {
       device {
           all_devs yes
           no_path_retry fail
       }
   }

   ("all_devs" is an undocumented configuration available only in RHEL 6.5.)

4. Reload multipathd:

   service multipathd reload
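A quick sanity check after the reload (a sketch using standard tooling):

    # recovery_tmo should now read 5 on the active iSCSI sessions
    cat /sys/class/iscsi_session/session*/recovery_tmo

    # the merged multipath configuration should include no_path_retry "fail"
    multipathd -k"show config" | grep -B 5 no_path_retry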
Created attachment 936474 [details]
new logs

Followed the steps, and it looks like the stuck autorecover loop no longer occurs. New logs attached.
(In reply to ratamir from comment #14)
> Created attachment 936474 [details]
> new logs
>
> Followed the steps, and it looks like the stuck autorecover loop no longer
> occurs. New logs attached.

How many times did you test this? We want to be sure that this is a reliable solution.
The underlying multipath issue won't make it for oVirt 3.5.0. This bug has been around since forever - pushing out to 3.5.1.
For a better guess about what is going on, it usually helps to get a '-vvvv' full debug trace of the lvm command.

From comment 12 saying it's stuck for an unlimited time, this may be a case where the tool is prompting for a 'y|n' answer. Unfortunately there are a number of cases (i.e. lvremove) in lvm2 where the VG lock is held during the prompt, effectively disabling usage of any other command in the meantime (we have a few BZs for this). To avoid waiting on a prompt, it's usually enough to fork/execute the lvm command with the stdin file descriptor closed - this gives an instant 'n'.

Of course there are possibly other reasons as well, but usually the easiest is to take a 'sysrq-t' stack trace of every single process - in the case of a prompt it's visible that the lvm process waits on input.

So please attach a full '-vvvv' trace of the stuck command and the sysrq-t state of all processes at that moment.
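For example (a sketch; the VG UUID is taken from the stuck command in comment 3, and ideally the same --config filter would be passed as well):

    # full debug trace; stdin from /dev/null makes any y/n prompt answer 'n' instantly
    lvm vgs -vvvv 2b8925db-e572-47cc-931e-029a51d32311 < /dev/null 2> /tmp/lvm-vvvv.log

    # dump kernel stack traces of all processes (output goes to /var/log/messages)
    echo t > /proc/sysrq-trigger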
Hi Allon,

This scenario was tested in 3.4 (av9.6) and the storage domain state changed back to valid:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Storage/view/3.4/job/3.4-storage_resume_guests_after_eio-iscsi/55/consoleFull
(search for 2014-06-23 07:20:18,807)

At 2014-06-23 07:20:40,415 the VM state changed to up again.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.
Given the fix for bug 880738, this bug should also be fixed; moving to MODIFIED.
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
Adding the TestOnly flag since this bug doesn't include a code change (no gerrit external tracker); it relies on bug 880738.
Storage domain auto recovery succeeds and completes quickly.

Tested according to the scenario in the description, using an FC domain.

Resuming the paths' connectivity:

[root@green-vdsb ~]# date && echo "running" > /sys/block/sde/device/state
Tue 8 Dec 08:11:58 IST 2015
[root@green-vdsb ~]# date && echo "running" > /sys/block/sdn/device/state
Tue 8 Dec 08:12:02 IST 2015

The master domain (6605112f-7f26-48b9-b64e-f661bfe18799) returns to Active after 3 minutes:

jsonrpc.Executor/5::DEBUG::2015-12-08 08:15:04,355::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.getInfo' in bridge with {'info': {'name': 'No Description', 'isoprefix': '', 'pool_status': 'connected', 'lver': 16L, 'domains': u'504c7595-7ca5-459c-be6f-44a3d7f1b5d7:Attached,6605112f-7f26-48b9-b64e-f661bfe18799:Active

Used:
vdsm-4.17.12-0.el7ev.noarch
lvm2-2.02.130-5.el7.x86_64
device-mapper-multipath-0.4.9-85.el7.x86_64
sanlock-3.2.4-1.el7.x86_64
rhevm-3.6.1.1-0.1.el6.noarch
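For completeness, the failure side of this test is typically done by offlining the same devices before resuming them (an assumption; only the resume commands appear above):

    echo offline > /sys/block/sde/device/state
    echo offline > /sys/block/sdn/device/state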
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.