Bug 1121420 - Activating host fails when a regular block domain is inactive
Summary: Activating host fails when a regular block domain is inactive
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Frontend.WebAdmin
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.1
Target Release: 3.6.1
Assignee: Tal Nisan
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On: 880738
Blocks: 1093924
 
Reported: 2014-07-20 16:27 UTC by Ori Gofen
Modified: 2016-03-10 13:34 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-13 14:39:54 UTC
oVirt Team: Storage
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
tnisan: devel_ack+
acanan: testing_ack+


Attachments (Terms of Use)
vdsm+engine logs (776.56 KB, application/x-gzip)
2014-07-20 16:27 UTC, Ori Gofen
vdsm+engine+kernel logs+ an image (419.35 KB, application/octet-stream)
2014-09-01 10:04 UTC, Ori Gofen
vdsm.log,engine.log, vdsb:var/log/messages,vdsc:var/log/messages,vdsb-dmesg,vdsc-dmesg (561.48 KB, application/octet-stream)
2014-09-01 12:06 UTC, Ori Gofen

Description Ori Gofen 2014-07-20 16:27:17 UTC
Created attachment 919405 [details]
vdsm+engine logs

Description of problem:
This bug is a special case of BZ #1093924 (read description)
When a regular block domain is inactive, activating a new host fails and the host becomes Unassigned.

Version-Release number of selected component (if applicable):
beta.1

How reproducible:
100%

Steps to Reproduce:
Setup: activate 1 host and 2 data storage domains: DomainA (master) and DomainBlock (regular block domain); 1 host remains inactive.

1. Block connectivity to DomainBlock from both hosts and wait for the domain to become Inactive.
2. Activate a second host.

Actual results:
Host becomes Unassigned (waited for 20 minutes).

Expected results:
Host should be activated.

Additional info:

Comment 1 Liron Aravot 2014-07-21 06:52:17 UTC
Ori,
Please specify which host you are referring to and when.
From what I see in the log:


-- host is being activated at 19:13

2014-07-20 19:13:59,096 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp--127.0.0.1-8702-9) [7b156f0d] Lock Acquired to object EngineLock [exclusiveLocks= key: 2399c3aa-f72c-4d27-9891-7658426d294c value: VDS
, sharedLocks= ]

-- status changed to unassigned for the activation

2014-07-20 19:13:59,170 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-8-thread-49) [7b156f0d] START, SetVdsStatusVDSCommand(HostName = vdsb, HostId = 2399c3aa-f72c-4d27-9891-7658426d294c, status=Unassigned, nonOperationalReason=NONE, stopSpmFailureLogged=false), log id: 1cd05725


-- InitVdsOnUp (activation flow starts):
2014-07-20 19:14:01,469 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (DefaultQuartzScheduler_Worker-48) [6b4833eb] Running command: InitVdsOnUpCommand internal: true. Entities affected :  ID: 00000002-0002-0002-0002-00000000025d Type: StoragePool

-- the host fails to connect to the storage pool and moves to Non Operational at 19:17

2014-07-20 19:17:01,738 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (DefaultQuartzScheduler_Worker-48) [21fda512] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 2399c3aa-f72c-4d27-9891-7658426d294c Type: VDS
2014-07-20 19:17:01,742 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler_Worker-48) [21fda512] START, SetVdsStatusVDSCommand(HostName = vdsb, HostId = 2399c3aa-f72c-4d27-9891-7658426d294c, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE, stopSpmFailureLogged=false), log id: 6935b7a6



So in that scenario, it took the host 3 minutes to move to Non Operational.
Please try to reproduce and verify that it's not a UI refresh problem; if it is, let's open a separate bug on that.

Regardless, the activation flow contains multiple vdsm calls that can each take up to 3 minutes, so activation can generally take some time.

thanks,
Liron.

Comment 2 Liron Aravot 2014-07-21 10:54:42 UTC
Regarding the timeout in the connectStoragePool operation: the operation gets stuck on an LVM command for more than 10 minutes.

-------------------------------------------------------------------
Thread-13::DEBUG::2014-07-20 19:14:06,871::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a030007aeed85291dbe311|/dev/mapper/360060160f4a030007beed85291dbe311|/dev/mapper/360060160f4a030007ceed85291dbe311|/dev/mapper/360060160f4a030007deed85291dbe311|/dev/mapper/360060160f4a030007eeed85291dbe311|/dev/mapper/360060160f4a03000fa65675991dbe311|/dev/mapper/360060160f4a03000fb65675991dbe311|/dev/mapper/360060160f4a03000fc65675991dbe311|/dev/mapper/360060160f4a03000fd65675991dbe311|/dev/mapper/360060160f4a03000fe65675991dbe311|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None)
Thread-233::DEBUG::2014-07-20 19:14:06,882::fileSD::259::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n345 bytes (345 B) copied, 0.000412624 s, 836 kB/s\n'; <rc> = 0


*** after 11 minutes ***

Thread-13::DEBUG::2014-07-20 19:25:08,458::lvm::288::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  /dev/mapper/360060160f4a03000fc65675991dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a03000fc65675991dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a03000fc65675991dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000fc65675991dbe311 was disabled\n  /dev/mapper/360060160f4a03000fb65675991dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a03000fb65675991dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a03000fb65675991dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000fb65675991dbe311 was disabled\n  /dev/mapper/360060160f4a03000fa65675991dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n  /dev/mapper/360060160f4a03000fa65675991dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n  /dev/mapper/360060160f4a03000fa65675991dbe311: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000fa65675991dbe311 was disabled\n  /dev/mapper/360060160f4a0
-------------------------------------------------------------------

This causes connectStoragePool requests to pile up and prevents the host from moving to "Up" status.

Nir, iirc we had a similar bug for that issue.
Ori, in addition to my last comment, please confirm the exact scenario you ran, if possible, to save us RCA time.

thanks.

Comment 3 Nir Soffer 2014-07-21 14:55:44 UTC
(In reply to Liron Aravot from comment #2)
> Regarding the timeout in the connectStoragePool operation, the operation get
> stuck on on LVM command for more than 10 minutes

This may be an LVM bug; it should not block for 10 minutes. However, it is possible that LVM is blocked in the kernel in an uninterruptible state, in which case LVM can do nothing about it.
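A quick generic check for that (an illustrative diagnostic, not taken from the logs here) is to list tasks stuck in uninterruptible sleep:

    # List processes whose state contains 'D' (uninterruptible sleep),
    # together with the kernel function they are waiting in.
    ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /D/'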

Zdenek, can you take a look at this lvm failure?

Comment 4 Nir Soffer 2014-07-21 15:08:18 UTC
Returning need info for Ori.

Comment 5 Nir Soffer 2014-07-21 15:10:26 UTC
Ori, how did you block connectivity to the storage server? Are you using iptables for this?

I would like to see results when you block the connectivity outside of the machine under test, for example, by configuring a switch to drop packets from the machine under test to the storage server.

Comment 6 Ori Gofen 2014-07-22 08:41:41 UTC
Yes, I used iptables. I'm currently trying to reproduce the bug in a case where the host has very low connectivity, as laravot (swords) suggested.
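Roughly along these lines (a sketch only; 192.0.2.10 is a placeholder, not the real storage server address):

    # Placeholder address - substitute the actual iSCSI target.
    STORAGE_IP=192.0.2.10
    # Drop outgoing traffic to the iSCSI portal (TCP port 3260).
    iptables -A OUTPUT -d "$STORAGE_IP" -p tcp --dport 3260 -j DROP
    # Remove the rule to restore connectivity afterwards.
    iptables -D OUTPUT -d "$STORAGE_IP" -p tcp --dport 3260 -j DROP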

Comment 7 Ori Gofen 2014-08-03 14:04:53 UTC
Steps to reproduce (corrected):

Setup: activate 1 host and 2 data storage domains: DomainA (master) and DomainBlock (regular block domain); 1 host remains inactive.

1. Block connectivity to DomainBlock from both hosts.
2. Put DomainBlock into maintenance and wait for it to become Inactive.
3. Activate a second host.

Comment 8 Allon Mureinik 2014-08-07 11:55:31 UTC
Returning the needinfo on Zdenek that was removed by mistake.
Zdenek, can you please take a look at comment 3? Thanks.

Comment 9 Zdenek Kabelac 2014-08-20 14:21:12 UTC
I'm not sure I can give much valid info here, but:

it looks like in comment 2 a set of devices is passed to the vgs command, and it takes an extremely long time to realize those devices are missing/failing.

I have no idea why the storage blocks on read errors for such a long time (probably a network timeout?), but lvm2 likely can't do anything about it: if you know a device is failing, don't pass that device to the lvm command as a PV.

If the error timeout is too long, you will probably need to tune networking options for iSCSI devices to detect read errors earlier.
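For iSCSI that would mean something like the following (a sketch only; the value is an example, and on oVirt hosts vdsm manages the iSCSI settings):

    # /etc/iscsi/iscsid.conf - how long the iSCSI layer queues I/O
    # before failing it upward; the default of 120s delays error detection.
    node.session.timeo.replacement_timeout = 5

    # Update an existing node record (the target IQN is a placeholder):
    iscsiadm -m node -T iqn.2014-07.example:target -o update \
        -n node.session.timeo.replacement_timeout -v 5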

Comment 10 Zdenek Kabelac 2014-08-26 09:19:05 UTC
I assume the easiest thing to check is to attach kernel debug output with the stack traces of all processes.

Just check 'kernel.sysrq' in sysctl,
and then 'echo t > /proc/sysrq-trigger'.

It should show us where the process is sleeping/waiting.
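I.e., roughly:

    # Check whether the sysrq interface is enabled.
    sysctl kernel.sysrq
    # Temporarily enable all sysrq functions if needed.
    echo 1 > /proc/sys/kernel/sysrq
    # Dump the stack trace of every task into the kernel ring buffer.
    echo t > /proc/sysrq-trigger
    # The traces end up in the kernel log (dmesg / /var/log/messages).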

Comment 11 Nir Soffer 2014-08-26 11:15:22 UTC
Ori, please reproduce this again and attach kernel debug output showing a stack trace of all processes, as Zdenek suggests in comment 10.

We want to understand why lvm is blocking for several minutes when access to the storage server is blocked.

Comment 12 Ori Gofen 2014-09-01 10:04:43 UTC
Created attachment 933291 [details]
vdsm+engine+kernel logs+ an image

Comment 13 Nir Soffer 2014-09-01 10:17:06 UTC
Ori, can you explain what the ker-xxx.log files are and how you created them?

These files contain no time information, so it is hard to relate the info in them to the other logs.

Comment 14 Ori Gofen 2014-09-01 11:12:37 UTC
These are the kernel debug outputs that you asked me to reproduce,
according to comment #10.

to get them: dmesg > kernel.log

Comment 15 Nir Soffer 2014-09-01 11:20:40 UTC
(In reply to Ori from comment #14)
> These are the kernel debug outputs that you asked me to reproduce,
> according to comment #10.

This is not what Zdenek or I asked for.

> 
> to get them: dmesg > kernel.log

This is less useful than /var/log/messages, since there is no time information in these messages and you cannot relate them to the other logs.

Please work with Zdenek on getting the stack traces he asked for.
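For the record, a sketch of getting timestamped kernel messages instead (dmesg -T needs a reasonably recent util-linux):

    # Human-readable timestamps (util-linux >= 2.20):
    dmesg -T > kernel.log
    # Or take kernel messages from syslog, which timestamps them as they arrive:
    grep kernel /var/log/messages > kernel.log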

Comment 16 Ori Gofen 2014-09-01 12:06:46 UTC
Created attachment 933321 [details]
vdsm.log,engine.log, vdsb:var/log/messages,vdsc:var/log/messages,vdsb-dmesg,vdsc-dmesg

As instructed by Zdenek, these are the files that need to be attached.

Comment 17 Zdenek Kabelac 2014-09-01 16:50:26 UTC
I assume there is already something strange going on with these lines:

Sep  1 12:12:39 camel-vdsc kernel: device-mapper: table: 253:2: multipath: error getting device
Sep  1 12:12:39 camel-vdsc kernel: device-mapper: ioctl: error adding target to table


It's unclear what automount is doing on such a system.

Also, the system is using sanlock, so maybe there are some timeouts?

IMHO the system itself doesn't look busy; it's just waiting.

Comment 18 Nir Soffer 2014-09-02 10:49:06 UTC
(In reply to Zdenek Kabelac from comment #17)
Zdenek, I think this is a duplicate of bug 880738 - lvm is stuck
for 11 minutes because the multipath devices are using "no_path_retry queue".
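For context, the setting in question looks roughly like this (an illustrative fragment only; on oVirt hosts /etc/multipath.conf is managed by vdsm):

    # /etc/multipath.conf (fragment)
    defaults {
        # "queue" holds I/O forever while all paths are down, which is
        # what keeps the lvm command blocked; "fail" (or a bounded retry
        # count) returns an error to the caller instead.
        no_path_retry    fail
    }

    # Reload the configuration after editing:
    service multipathd reload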

Comment 19 Allon Mureinik 2014-09-11 12:13:38 UTC
(In reply to Nir Soffer from comment #18)
> (In reply to Zdenek Kabelac from comment #17)
> Zdenek, I think this is a duplicate of bug 880738 - lvm is stuck
> for 11 minutes because the multipath devices are using "no_path_retry queue".
If so, it won't make oVirt 3.5.0 - pushing out.

Comment 20 Nir Soffer 2014-09-12 00:43:14 UTC
Adding dependency on 880738 - I'm not 100% sure about this, but we want to evaluate this bug again when we have a fix for that one.

Comment 21 Zdenek Kabelac 2014-11-25 17:57:47 UTC
(In reply to Nir Soffer from comment #18)
> (In reply to Zdenek Kabelac from comment #17)
> Zdenek, I think this is a duplicate of bug 880738 - lvm is stuck
> for 11 minutes because the multipath devices are using "no_path_retry queue".

I'm not a multipath expert, so I cannot really tell...

Comment 22 Red Hat Bugzilla Rules Engine 2015-10-19 10:50:23 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 23 Yaniv Lavi 2015-10-29 12:05:42 UTC
In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.

Comment 24 Red Hat Bugzilla Rules Engine 2015-11-27 04:38:32 UTC
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 25 Aharon Canan 2015-12-29 16:39:42 UTC
Setup:
2 hosts (one on maintenance) and 2 data storage domains

 1. Block connectivity (iptables) to non-master domain from both hosts
 2. Wait for non-master domain to become Inactive
 3. Activate the host 

Host is up.

vdsm-4.17.14-0.el7ev.noarch
rhevm-3.6.2-0.1.el6.noarch

Comment 26 Sandro Bonazzola 2016-01-13 14:39:54 UTC
oVirt 3.6.1 has been released; closing as CURRENTRELEASE.

