Created attachment 919405 [details]
vdsm+engine logs

Description of problem:
This bug is a special case of BZ #1093924 (see its description). When a block domain is Inactive, activating a new host fails and the host becomes Unassigned.

Version-Release number of selected component (if applicable):
beta.1

How reproducible:
100%

Steps to Reproduce:
Setup: activate 1 host and 2 data storage domains: DomainA (master) and DomainBlock (a regular block domain); 1 host remains inactive.
1. Block connectivity to DomainBlock from both hosts and wait for the domain to become Inactive.
2. Activate the second host.

Actual results:
The host becomes Unassigned (waited for 20 minutes).

Expected results:
The host should be activated.

Additional info:
Ori, please specify which host you are referring to and when. From what I see in the log:

-- the host is being activated at 19:13:
2014-07-20 19:13:59,096 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (ajp--127.0.0.1-8702-9) [7b156f0d] Lock Acquired to object EngineLock [exclusiveLocks= key: 2399c3aa-f72c-4d27-9891-7658426d294c value: VDS , sharedLocks= ]

-- status changed to Unassigned for the activation:
2014-07-20 19:13:59,170 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-8-thread-49) [7b156f0d] START, SetVdsStatusVDSCommand(HostName = vdsb, HostId = 2399c3aa-f72c-4d27-9891-7658426d294c, status=Unassigned, nonOperationalReason=NONE, stopSpmFailureLogged=false), log id: 1cd05725

-- InitVdsOnUp (activation flow starts):
2014-07-20 19:14:01,469 INFO [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (DefaultQuartzScheduler_Worker-48) [6b4833eb] Running command: InitVdsOnUpCommand internal: true. Entities affected : ID: 00000002-0002-0002-0002-00000000025d Type: StoragePool

-- the host fails to connect to the storage pool and moves to Non Operational at 19:17:
2014-07-20 19:17:01,738 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (DefaultQuartzScheduler_Worker-48) [21fda512] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 2399c3aa-f72c-4d27-9891-7658426d294c Type: VDS
2014-07-20 19:17:01,742 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler_Worker-48) [21fda512] START, SetVdsStatusVDSCommand(HostName = vdsb, HostId = 2399c3aa-f72c-4d27-9891-7658426d294c, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE, stopSpmFailureLogged=false), log id: 6935b7a6

So in that scenario it took the host 3 minutes to move to Non Operational. Please try to reproduce to verify that this is not a UI refresh problem; if it is, let's open a bug on that.
Regardless of that, since the activation flow issues multiple vdsm calls that can each take up to 3 minutes, activation can generally take some time.

thanks,
Liron.
Regarding the timeout in the connectStoragePool operation: the operation gets stuck on an LVM command for more than 10 minutes.
-------------------------------------------------------------------
Thread-13::DEBUG::2014-07-20 19:14:06,871::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a030007aeed85291dbe311|/dev/mapper/360060160f4a030007beed85291dbe311|/dev/mapper/360060160f4a030007ceed85291dbe311|/dev/mapper/360060160f4a030007deed85291dbe311|/dev/mapper/360060160f4a030007eeed85291dbe311|/dev/mapper/360060160f4a03000fa65675991dbe311|/dev/mapper/360060160f4a03000fb65675991dbe311|/dev/mapper/360060160f4a03000fc65675991dbe311|/dev/mapper/360060160f4a03000fd65675991dbe311|/dev/mapper/360060160f4a03000fe65675991dbe311|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None)

Thread-233::DEBUG::2014-07-20 19:14:06,882::fileSD::259::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n345 bytes (345 B) copied, 0.000412624 s, 836 kB/s\n'; <rc> = 0

*** after 11 minutes ***

Thread-13::DEBUG::2014-07-20 19:25:08,458::lvm::288::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ' /dev/mapper/360060160f4a03000fc65675991dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a03000fc65675991dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a03000fc65675991dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000fc65675991dbe311 was disabled\n /dev/mapper/360060160f4a03000fb65675991dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a03000fb65675991dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a03000fb65675991dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000fb65675991dbe311 was disabled\n /dev/mapper/360060160f4a03000fa65675991dbe311: read failed after 0 of 4096 at 32212189184: Input/output error\n /dev/mapper/360060160f4a03000fa65675991dbe311: read failed after 0 of 4096 at 32212246528: Input/output error\n /dev/mapper/360060160f4a03000fa65675991dbe311: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000fa65675991dbe311 was disabled\n /dev/mapper/360060160f4a0
-------------------------------------------------------------------

This causes connectStoragePool requests to pile up and prevents the host from moving to Up status.

Nir, IIRC we had a similar bug for that issue.

Ori, in addition to my last comment, please confirm the scenario that you ran, if possible, to save us RCA time. Thanks.
(In reply to Liron Aravot from comment #2)
> Regarding the timeout in the connectStoragePool operation, the operation
> gets stuck on an LVM command for more than 10 minutes

This may be an lvm bug; it should not block for 10 minutes. However, it is possible that lvm is blocked in the kernel in an uninterruptible state, in which case lvm can do nothing about it.

Zdenek, can you take a look at this lvm failure?
Returning need info for Ori.
Ori, how did you block connectivity to the storage server? Are you using iptables for this? I would like to see results when you block the connectivity outside of the machine under test, for example, by configuring a switch to drop packets from the machine under test to the storage server.
Yes, I used iptables. I'm currently trying to reproduce the bug in a case where the host has very low connectivity, as laravot (swords) suggested.
Steps to reproduce (corrected):
Setup: activate 1 host and 2 data storage domains: DomainA (master) and DomainBlock (a regular block domain); 1 host remains inactive.
1. Block connectivity to DomainBlock from both hosts.
2. Maintain DomainBlock and wait for it to become Inactive.
3. Activate the second host.
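For reference, the connectivity block in step 1 was done with iptables (see the answer to the needinfo below). A minimal sketch of what such a block might look like, assuming an iSCSI target at a hypothetical address 10.35.0.50 on the standard iSCSI port (neither value is taken from this report):

```shell
# Hypothetical storage server address - substitute the real iSCSI target IP.
STORAGE_IP=10.35.0.50

# Drop all outgoing traffic from this host to the storage server
# (3260 is the standard iSCSI port).
iptables -A OUTPUT -d "$STORAGE_IP" -p tcp --dport 3260 -j DROP

# To restore connectivity later, delete the same rule.
iptables -D OUTPUT -d "$STORAGE_IP" -p tcp --dport 3260 -j DROP
```

Note that blocking with iptables on the host itself behaves differently from a failure outside the host (e.g. a switch dropping packets), which is why comment 6 asks about the method used.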
Returning needinfo on Zdenek that was removed by mistake. Zdenek - can you please take a look at comment 3 ? Thanks.
I'm not sure I can give much valid info here, but it looks like in comment 2 a set of devices is passed to the vgs command, and it takes an extremely long time to realize those devices are missing/failing. I've no idea why the storage is blocking on read errors for such a long time (probably a network timeout?), but lvm2 likely can't do anything about this - if you know a device is failing, don't pass that device to the lvm command as a PV. If the error timeout is too long, you will probably need to tune the networking options for iSCSI devices to detect read errors earlier.
I assume the easiest thing to check is to attach the kernel debug output with a stack trace of all processes. Just check 'kernel.sysrq' in sysctl and run 'echo t > /proc/sysrq-trigger'. It should show us where the process is sleeping/waiting.
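The steps above can be sketched as a short sequence (a sketch, not a prescription; it needs root, and the dump lands in the kernel ring buffer and syslog):

```shell
# Check whether sysrq is enabled (a non-zero value allows at least some functions).
sysctl kernel.sysrq

# Enable all sysrq functions for this boot if needed.
sysctl -w kernel.sysrq=1

# 't' dumps the stack trace of every task into the kernel log.
echo t > /proc/sysrq-trigger

# Read the dump back; it also lands in /var/log/messages with timestamps.
dmesg | tail -n 200
```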
Ori, please reproduce this again and attach the kernel debug output showing a stack trace of all processes, as Zdenek suggests in comment 10. We want to understand why lvm blocks for several minutes when access to the storage server is blocked.
Created attachment 933291 [details] vdsm+engine+kernel logs+ an image
Ori, can you explain what the ker-xxx.log files are and how you created them? These files contain no time information, so it is hard to relate the info there to the other logs.
These are the kernel debug outputs you asked me to reproduce, per comment #10.

To get them: dmesg > kernel.log
(In reply to Ori from comment #14)
> This are the kernel debug output that you have asked me to reproduce,
> according to comment #10.

This is not what I or Zdenek asked for.

> to get them: dmesg > kernel.log

This is less useful than /var/log/messages, since there is no time information in these messages and you cannot relate them to the other logs. Please work with Zdenek on getting the stack traces he asked for.
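As a side note on correlating kernel output with vdsm.log and engine.log: plain dmesg output carries no wall-clock timestamps. A sketch of two ways to get timestamped kernel messages (assuming a util-linux dmesg recent enough to support the -T option):

```shell
# /var/log/messages already carries syslog timestamps for kernel lines,
# so it can be lined up against vdsm.log and engine.log.
grep 'kernel:' /var/log/messages > kernel-from-syslog.log

# Recent dmesg can also translate the ring buffer to human-readable times.
dmesg -T > kernel-with-times.log
```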
Created attachment 933321 [details]
vdsm.log, engine.log, vdsb:/var/log/messages, vdsc:/var/log/messages, vdsb-dmesg, vdsc-dmesg

As instructed by Zdenek, these are the files that need to be attached.
I assume there is already something strange going on with these lines:

Sep 1 12:12:39 camel-vdsc kernel: device-mapper: table: 253:2: multipath: error getting device
Sep 1 12:12:39 camel-vdsc kernel: device-mapper: ioctl: error adding target to table

It's unclear what automount is doing on such a system. The system also uses sanlock, so maybe there are some timeouts? IMHO the system itself doesn't look busy - it's just waiting.
(In reply to Zdenek Kabelac from comment #17)
Zdenek, I think that this is a duplicate of bug 880738 - lvm is stuck for 11 minutes because the multipath devices are using "no_path_retry queue".
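For context: "no_path_retry queue" is a multipath.conf setting; with it, I/O issued to a multipath device whose paths have all failed is queued indefinitely instead of failing, so any reader of the device (such as the vgs call in comment 2) blocks until a path returns. An illustrative fragment of the alternatives (values shown are for explanation only, not the configuration of this setup):

```
# /etc/multipath.conf (illustrative fragment)
defaults {
    # queue -> queue I/O forever while all paths are down (readers block)
    # fail  -> fail I/O immediately when all paths are down
    # N     -> retry for N polling intervals, then fail
    no_path_retry    fail
}
```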
(In reply to Nir Soffer from comment #18) > (In reply to Zdenek Kabelac from comment #17) > Zdenec, I think that this is a duplicate of bug 880738 - lvm is stuck > for 11 minutes because multipath devices is using "no_path_retry queue" If so, it won't make oVirt 3.5.0 - pushing out.
Adding dependency on 880738 - I'm not 100% sure about this, but we want to evaluate this bug again when we have a fix for that one.
(In reply to Nir Soffer from comment #18)
> (In reply to Zdenek Kabelac from comment #17)
> Zdenec, I think that this is a duplicate of bug 880738 - lvm is stuck
> for 11 minutes because multipath devices is using "no_path_retry queue"

I'm not a multipath expert, so I cannot really tell...
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Setup: 2 hosts (one in maintenance) and 2 data storage domains.
1. Block connectivity (iptables) to the non-master domain from both hosts.
2. Wait for the non-master domain to become Inactive.
3. Activate the host.

The host is Up.

Verified on:
vdsm-4.17.14-0.el7ev.noarch
rhevm-3.6.2-0.1.el6.noarch
oVirt 3.6.1 has been released, closing current release