Created attachment 934294 [details]
vdsm and engine logs

Description of problem:
After blocking the connection between the host and the storage domain, the domain status changed to invalid:

"Message: Invalid status on Data Center Global_DC_1. Setting Data Center status to Non Responsive (On host 10.35.160.49, Error: General Exception)."

After unblocking the connection, the expected result (by design) is that the domain status changes back to valid.

Version-Release number of selected component (if applicable):
vdsm-4.16.2-1.gite8cba75.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
Setup: 1 host, 1 SD
1. Block the connection between the host and the SD (see the sketch below)
2. After ~5 minutes the SD status changes to invalid
3. Unblock the connection

Actual results:
The SD's status does not change back to valid.

Expected results:
The SD's status should change back to valid.

Additional info:
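For reference, one common way to block and unblock the host-to-storage connection (a sketch; the report does not say which tool was used, and STORAGE_IP is a placeholder for the storage server's address):

    # block all outgoing traffic from the host to the storage server
    iptables -A OUTPUT -d STORAGE_IP -j DROP

    # ...wait for the domain to become invalid, then remove the rule
    iptables -D OUTPUT -d STORAGE_IP -j DROP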
Why is this a regression? Does the same test work with the previous version?
Please add /var/log/sanlock.log and /var/log/messages showing the relevant timeframe
It can be seen that connectStoragePool is issued at 18:01:34,905, an exclusive lock on Storage.HsmDomainMonitorLock is acquired, the thread gets stuck executing an lvm command, and all subsequent connectStoragePool requests fail on timeout (more than 3 minutes). That lvm command stays stuck for over an hour (until the end of the log), keeping the lock acquired. Attaching it here:

-----------------------------------------------------------------------
Thread-15::DEBUG::2014-09-03 18:01:34,985::lvm::365::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-15::DEBUG::2014-09-03 18:01:34,986::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a0300083a53e6c7833e411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 2b8925db-e572-47cc-931e-029a51d32311 (cwd None)
-----------------------------------------------------------------------

Please also take a look at the discussion in bug 1121420 (https://bugzilla.redhat.com/show_bug.cgi?id=1121420); in that case the lvm command also seems to be stuck for over an hour.
What we can see in the vdsm log:

1. The SPM lost access to storage
2. sanlock killed vdsm
3. vdsm restarted at 2014-09-03 17:56:03
4. connectStoragePool was issued at 2014-09-03 18:01
5. The last thing vdsm does is run an lvm command:

Thread-15::DEBUG::2014-09-03 18:01:34,986::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/360060160f4a0300083a53e6c7833e411|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 2b8925db-e572-47cc-931e-029a51d32311 (cwd None)

Thread-15 never returns until the end of the log at 2014-09-03 19:10:35,670, and vdsm was not restarted until the end of the log. It looks like lvm got stuck for more than an hour.

Since you claim that this is reproducible, please work with Zdenek from lvm on understanding why lvm got stuck for such a long time.
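While the command hangs, a quick check is whether the lvm process is blocked in the kernel (a sketch, assuming standard procps tooling):

    # STAT 'D' means uninterruptible sleep; wchan shows the kernel function it is waiting in
    ps -eo pid,stat,wchan:32,cmd | grep '[l]vm vgs'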
Zdenek, can you look at comments 3 and 4? I think this is the first time we have seen an lvm command stuck for such a long time. The reporter claims that this is 100% reproducible, so maybe this is our chance to understand why lvm was stuck for so long. Please instruct us on how to collect additional information and create a reproducer for this.
Raz, please add information about the system:
- kernel
- lvm2
- device-mapper-multipath
lvm2 - lvm2-2.02.100-8.el6.x86_64
kernel - kernel-2.6.32-431.1.2.el6.x86_64
device-mapper-multipath - device-mapper-multipath-0.4.9-72.el6_5.2.x86_64
Raz, another question - I remember we saw some odd behavior with your VNX environment. Does this bug reproduce with a different storage machine?
Raz, can you try to reproduce this with RHEL 7?
Insight about the automation blocker - this blocks a specific test, not the entire automation.
Hi Nir,
It doesn't reproduce with RHEL 7 hosts.
I debugged this with Raz on RHEL 6.5 and RHEV 3.5 beta. Reproduced 3 times with the current system configuration.

The issue is lvm getting stuck for an unlimited time, with kernel tracebacks in /var/log/messages. We should open a bug for lvm/multipath/kernel to track this issue.

After setting replacement_timeout to 5 seconds in iscsid.conf, and adding multipath configuration for the storage device (DGC) that sets "no_path_retry fail" (see the sketch below), there was no issue and recovery was immediate.

So this is another bug depending on bug 880738.
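A device-specific multipath.conf stanza of the kind described above would look roughly like this (a sketch; the exact options used during the debugging session may have differed):

    devices {
        device {
            vendor        "DGC"
            product       ".*"
            no_path_retry fail
        }
    }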
Please test again on RHEL 6.5 using this setup:

1. Edit /etc/iscsi/iscsid.conf and set replacement_timeout to 5:

   node.session.timeo.replacement_timeout = 5

2. Restart iscsid:

   service iscsid restart

3. Edit /etc/multipath.conf and add this device in the "devices" section:

   devices {
       device {
           all_devs yes
           no_path_retry fail
       }
   }

   ("all_devs" is an undocumented configuration available only in RHEL 6.5.)

4. Reload multipathd:

   service multipathd reload
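A quick sanity check after the reload (a sketch using standard tooling):

    # recovery_tmo should now read 5 on the active iSCSI sessions
    cat /sys/class/iscsi_session/session*/recovery_tmo

    # the merged multipath configuration should include no_path_retry "fail"
    multipathd -k"show config" | grep -B 5 no_path_retry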
Created attachment 936474 [details]
new logs

Followed the steps, and it looks like the stuck autorecover loop no longer occurs. New logs attached.
(In reply to ratamir from comment #14)
> Created attachment 936474 [details]
> new logs
>
> Followed the steps, and it looks like the stuck autorecover loop no longer
> occurs. New logs attached.

How many times did you test this? We want to be sure that this is a reliable solution.
The underlying multipath issue won't make it for oVirt 3.5.0. This bug has been around since forever - pushing out to 3.5.1.
For a better guess about what is going on, it usually helps to get a '-vvvv' full debug trace of the lvm command.

From comment 12 saying it's stuck for an unlimited time, this may be a case where the tool is prompting for a 'y|n' answer. Unfortunately there are a number of cases (i.e. lvremove) in lvm2 where the VG lock is held during the prompt, effectively disabling usage of any other command in the meantime (we have a few BZs for this). To avoid waiting on a prompt, it's usually enough to fork/execute the lvm command with the stdin file descriptor closed - this gives an instant 'n'.

Of course there are possibly other reasons as well, but usually the easiest is to take a 'sysrq-t' stack trace of every single process - in the case of a prompt it's visible that the lvm process waits on input.

So please attach a full '-vvvv' trace of the stuck command and the sysrq-t state of all processes at that moment.
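For example (a sketch; the VG UUID is taken from the stuck command in comment 3, and ideally the same --config filter would be passed as well):

    # full debug trace; stdin from /dev/null makes any y/n prompt answer 'n' instantly
    lvm vgs -vvvv 2b8925db-e572-47cc-931e-029a51d32311 < /dev/null 2> /tmp/lvm-vvvv.log

    # dump kernel stack traces of all processes (output goes to /var/log/messages)
    echo t > /proc/sysrq-trigger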
Hi Allon,

This scenario was tested in 3.4 (av9.6) and the storage domain state changed back to valid:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Storage/view/3.4/job/3.4-storage_resume_guests_after_eio-iscsi/55/consoleFull
(search for 2014-06-23 07:20:18,807)

At 2014-06-23 07:20:40,415 the VM state changed to up again.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
In oVirt, testing is done on a single stream by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.
Given the fix for bug 880738, this bug should also be fixed; moving to MODIFIED.
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
Adding the TestOnly flag since this bug doesn't include a code change (no gerrit external tracker); it relies on bug 880738.
Storage domain auto recovery succeeds and completes quickly.

Tested according to the scenario in the description, using an FC domain.

Resuming the paths' connectivity:

[root@green-vdsb ~]# date && echo "running" > /sys/block/sde/device/state
Tue 8 Dec 08:11:58 IST 2015
[root@green-vdsb ~]# date && echo "running" > /sys/block/sdn/device/state
Tue 8 Dec 08:12:02 IST 2015

The master domain (6605112f-7f26-48b9-b64e-f661bfe18799) returns to Active after 3 minutes:

jsonrpc.Executor/5::DEBUG::2015-12-08 08:15:04,355::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.getInfo' in bridge with {'info': {'name': 'No Description', 'isoprefix': '', 'pool_status': 'connected', 'lver': 16L, 'domains': u'504c7595-7ca5-459c-be6f-44a3d7f1b5d7:Attached,6605112f-7f26-48b9-b64e-f661bfe18799:Active

Used:
vdsm-4.17.12-0.el7ev.noarch
lvm2-2.02.130-5.el7.x86_64
device-mapper-multipath-0.4.9-85.el7.x86_64
sanlock-3.2.4-1.el7.x86_64
rhevm-3.6.1.1-0.1.el6.noarch
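For completeness, the failure side of this test is typically done by offlining the same devices before resuming them (an assumption; only the resume commands appear above):

    echo offline > /sys/block/sde/device/state
    echo offline > /sys/block/sdn/device/state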
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.