Bug 1286755 - Processes blocked in io_schedule+0x9d/0x130 after iscsid error, hypervisor non-operational for hours until reboot [NEEDINFO]
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: iscsi-initiator-utils
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
: pre-dev-freeze
: ---
Assigned To: Chris Leech
Storage QE
Depends On:
Blocks: 1420851
Reported: 2015-11-30 11:32 EST by Sarvesh Pandit
Modified: 2017-03-08 13:35 EST (History)
18 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
nsoffer: needinfo? (cleech)
sapandit: needinfo? (cleech)


Attachments
/var/log/messages from problem host (275.84 KB, application/x-xz)
2016-01-26 19:07 EST, Nir Soffer

Description Sarvesh Pandit 2015-11-30 11:32:20 EST
Description of problem:

We see the host <xxx> transition from the non-operational to the non-responsive state.
The non-responsive state occurs when the engine gets a "network exception" and can't communicate with the host.

The host became non-operational because the 'lastCheck' values returned by VDSM exceeded a pre-determined threshold.

It appears that the Storage Domain Monitor threads had stopped responding on the host. The problem seems to start when an ISCSI_ERR_CONN_FAILED error occurred.
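For illustration, the lastCheck mechanism mentioned above works roughly like this (hypothetical sketch, not engine or VDSM source; the threshold value is invented): VDSM reports, per storage domain, the seconds since the monitor thread last completed a check, and the engine treats a domain whose lastCheck exceeds the threshold as inaccessible from that host.

```python
# Illustrative sketch only - not engine or VDSM code. The threshold value
# below is hypothetical.

LAST_CHECK_THRESHOLD = 30  # seconds; invented value for illustration

def domain_accessible(now, last_successful_check,
                      threshold=LAST_CHECK_THRESHOLD):
    """Return False once the domain monitor has gone silent for too long."""
    last_check = now - last_successful_check  # the value VDSM reports
    return last_check <= threshold
```

A monitor thread stuck in uninterruptible I/O stops updating its last successful check, so lastCheck grows without bound and the domain is reported inaccessible.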

Version-Release number of selected component (if applicable):
vdsm-4.16.13.1-1.el7ev.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 3 Allon Mureinik 2015-12-01 11:07:06 EST
Tal, let's have someone take a look at this, please?
Comment 13 Sarvesh Pandit 2016-01-19 12:21:21 EST
Do we have any update on this bug?

Customer is asking for update.
Comment 21 Nir Soffer 2016-01-19 19:18:16 EST
In the sanlock log we see that the trouble started here for domain
0d3c977c-c5b0-4368-a9c0-28187e3ea9ae

2015-11-16 02:06:29+0000 1875048 [21600]: 0d3c977c aio timeout 0 0x7f40ec0008c0:0x7f40ec0008d0:0x7f41004de000 ioto 10 to_count 1
2015-11-16 02:06:29+0000 1875048 [21600]: s9 delta_renew read rv -202 offset 0 /dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/ids
2015-11-16 02:06:29+0000 1875048 [21600]: s9 renewal error -202 delta_length 10 last_success 1875018
2015-11-16 02:06:40+0000 1875059 [21600]: 0d3c977c aio timeout 0 0x7f40ec000910:0x7f40ec000920:0x7f41001db000 ioto 10 to_count 2
2015-11-16 02:06:40+0000 1875059 [21600]: s9 delta_renew read rv -202 offset 0 /dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/ids
2015-11-16 02:06:40+0000 1875059 [21600]: s9 renewal error -202 delta_length 10 last_success 1875018
2015-11-16 02:06:51+0000 1875070 [21600]: 0d3c977c aio timeout 0 0x7f40ec000960:0x7f40ec000970:0x7f41000d9000 ioto 10 to_count 3
2015-11-16 02:06:51+0000 1875070 [21600]: s9 delta_renew read rv -202 offset 0 /dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/ids
2015-11-16 02:06:51+0000 1875070 [21600]: s9 renewal error -202 delta_length 10 last_success 1875018
2015-11-16 02:06:59+0000 1875078 [5082]: s9 check_our_lease warning 60 last_success 1875018
2015-11-16 02:07:00+0000 1875079 [5082]: s9 check_our_lease warning 61 last_success 1875018
2015-11-16 02:07:01+0000 1875080 [5082]: s9 check_our_lease warning 62 last_success 1875018
2015-11-16 02:07:02+0000 1875081 [5082]: s9 check_our_lease warning 63 last_success 1875018
2015-11-16 02:07:02+0000 1875081 [21600]: 0d3c977c aio timeout 0 0x7f40ec0009b0:0x7f40ec0009c0:0x7f40f16fa000 ioto 10 to_count 4
2015-11-16 02:07:02+0000 1875081 [21600]: s9 delta_renew read rv -202 offset 0 /dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/ids
2015-11-16 02:07:02+0000 1875081 [21600]: s9 renewal error -202 delta_length 10 last_success 1875018
2015-11-16 02:07:03+0000 1875082 [5082]: s9 check_our_lease warning 64 last_success 1875018
2015-11-16 02:07:04+0000 1875083 [5082]: s9 check_our_lease warning 65 last_success 1875018
2015-11-16 02:07:05+0000 1875084 [5082]: s9 check_our_lease warning 66 last_success 1875018
2015-11-16 02:07:06+0000 1875085 [5082]: s9 check_our_lease warning 67 last_success 1875018
2015-11-16 02:07:07+0000 1875086 [5082]: s9 check_our_lease warning 68 last_success 1875018
2015-11-16 02:07:08+0000 1875087 [5082]: s9 check_our_lease warning 69 last_success 1875018
2015-11-16 02:07:09+0000 1875088 [5082]: s9 check_our_lease warning 70 last_success 1875018
2015-11-16 02:07:10+0000 1875089 [5082]: s9 check_our_lease warning 71 last_success 1875018
2015-11-16 02:07:11+0000 1875090 [5082]: s9 check_our_lease warning 72 last_success 1875018
2015-11-16 02:07:12+0000 1875091 [5082]: s9 check_our_lease warning 73 last_success 1875018
2015-11-16 02:07:13+0000 1875092 [5082]: s9 check_our_lease warning 74 last_success 1875018
2015-11-16 02:07:13+0000 1875092 [21600]: s9 delta_renew read rv -2 offset 0 /dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/ids
2015-11-16 02:07:13+0000 1875092 [21600]: s9 renewal error -2 delta_length 10 last_success 1875018
2015-11-16 02:07:14+0000 1875093 [5082]: s9 check_our_lease warning 75 last_success 1875018
2015-11-16 02:07:15+0000 1875094 [5082]: s9 check_our_lease warning 76 last_success 1875018
2015-11-16 02:07:16+0000 1875095 [5082]: s9 check_our_lease warning 77 last_success 1875018
2015-11-16 02:07:17+0000 1875096 [5082]: s9 check_our_lease warning 78 last_success 1875018
2015-11-16 02:07:18+0000 1875097 [5082]: s9 check_our_lease warning 79 last_success 1875018
2015-11-16 02:07:19+0000 1875098 [5082]: s9 check_our_lease failed 80
2015-11-16 02:07:19+0000 1875098 [5082]: s9 all pids clear
2015-11-16 02:07:24+0000 1875103 [21600]: s9 delta_renew read rv -2 offset 0 /dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/ids
2015-11-16 02:07:24+0000 1875103 [21600]: s9 renewal error -2 delta_length 10 last_success 1875018
2015-11-16 02:07:34+0000 1875113 [21600]: 0d3c977c close_task_aio 0 0x7f40ec0008c0 busy
...
2015-11-16 08:11:44+0000 1896963 [21600]: 0d3c977c close_task_aio 3 0x7f40ec0009b0 busy

Looks like sanlock could not access this domain for 6 (!) hours.

David, can you check the sanlock log attached to this bug?

What do you think can cause 6 hours of "close_task_aio 3 0x7f40ec0009b0 busy" logs?
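For reference, the timing in the log above follows sanlock's fixed lease timing: warnings start once the lease has gone 60 seconds without a successful renewal, and the lease is declared lost at 80. A rough reconstruction (illustrative Python, not sanlock's C source; the 60 and 80 second constants are read off this log):

```python
# Illustrative reconstruction of the timing visible in the log above.
# Renewals last succeeded at timestamp 1875018; check_our_lease warns once
# the lease age passes 60 seconds and declares failure at 80.

WARN_AGE = 60   # seconds without a successful renewal before warnings start
FAIL_AGE = 80   # seconds without a successful renewal before the lease is lost

def lease_state(now, last_success):
    age = now - last_success
    if age >= FAIL_AGE:
        return "failed"
    if age >= WARN_AGE:
        return "warning %d" % age
    return "ok"
```

Plugging in the timestamps from the log, 1875078 gives "warning 60" and 1875098 gives "failed", matching the check_our_lease lines above.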
Comment 24 Nir Soffer 2016-01-19 19:27:20 EST
Adding back needinfo for David for comment 21.
Comment 25 Nir Soffer 2016-01-19 19:51:32 EST
In vdsm log, we see:

1. Starting read delay check for domain 0d3c977c-c5b0-4368-a9c0-28187e3ea9ae

Thread-726588::DEBUG::2015-11-16 02:06:29,887::blockSD::596::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)

(This is the same domain that sanlock could not access for 6 hours)

2. We don't see this thread again in vdsm.log.1 until vdsm
   is terminated:

MainThread::DEBUG::2015-11-16 08:11:48,077::vdsm::58::vds::(sigtermHandler) Received signal 15

This means that the dd command never returned. Vdsm cannot do anything
about that, except report an increasing lastCheck value - expected behavior.

I'll check the rest of the log later.
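As an aside, a read-delay check built on dd could in principle be bounded with a timeout. This is a hypothetical sketch, NOT what the vdsm version in this bug does; note also that even killing dd would not fully help here, since a process stuck in uninterruptible D-state I/O cannot be reaped until the I/O completes:

```python
import subprocess

# Hypothetical sketch - not vdsm code. Bound the getReadDelay-style dd with
# a timeout so a stuck direct-IO read does not hold the monitor thread
# forever.

def read_delay_check(device, timeout=30, direct=True):
    """Read one 4 KiB block from device, as getReadDelay's dd does."""
    cmd = ["dd", "if=%s" % device, "of=/dev/null", "bs=4096", "count=1"]
    if direct:
        cmd.append("iflag=direct")
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout, check=True)
        return True
    except subprocess.TimeoutExpired:
        # The check is late; report failure instead of blocking forever.
        # (A child stuck in D state still cannot be reaped, though.)
        return False
    except (OSError, subprocess.CalledProcessError):
        return False
```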
Comment 26 David Teigland 2016-01-20 10:51:16 EST
When async io operations time out, sanlock will try indefinitely to clean up the operations using io_getevents().  I expected that all operations would eventually return an error at least.  But, it seems that there are cases where aio operations simply never return, and io_destroy() should be called to clear them.
I will make a change to this sanlock code so that it will not wait forever for the aio operations to return an error.  In the meantime, there should be no real harm from the current behavior.

(There's also a chance that the storage layers below sanlock are somehow configured to never return an error and wait or retry io forever.  If that's the case, then adjusting that would be another way to handle this issue.)
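The proposed change can be sketched abstractly (illustrative Python; sanlock itself is C, and the names here are stand-ins for io_getevents()/io_destroy()):

```python
import time

# Abstract sketch of the change described above - not sanlock source.
# Instead of retrying the event reap indefinitely for operations that never
# complete, bound the retries and then tear down the aio context.

MAX_CLOSE_ATTEMPTS = 10

def close_task_aio(reap_events, destroy_context, interval=0.0):
    """reap_events() returns how many submitted operations are still busy
    (stand-in for io_getevents); destroy_context() stands in for io_destroy."""
    for _ in range(MAX_CLOSE_ATTEMPTS):
        if reap_events() == 0:
            return "clean"            # every operation eventually returned
        time.sleep(interval)          # the current code loops here forever
    destroy_context()                 # forcibly discard operations stuck busy
    return "destroyed"
```

With a bound in place, the six hours of "close_task_aio ... busy" logging seen in this bug would instead end after a fixed number of attempts.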
Comment 29 Nir Soffer 2016-01-26 19:04:08 EST
Looking at /var/log/messages from sosreport-rhevh06.****-20151116122626

1. Connection error in iscsid

Nov 16 02:06:32 rhevh06 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 6169708449, last ping 6169713450, now 6169718464
Nov 16 02:06:32 rhevh06 kernel: connection1:0: detected conn error (1011)
Nov 16 02:06:32 rhevh06 iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Nov 16 02:06:35 rhevh06 iscsid: connection1:0 is operational after recovery (1 attempts)

2. Sanlock fails to renew the lease on domain 0d3c977c-c5b0-4368-a9c0-28187e3ea9ae

Nov 16 02:06:40 rhevh06 sanlock[5082]: 2015-11-16 02:06:40+0000 1875059 [21600]: s9 delta_renew read rv -202 offset 0 /dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/ids
Nov 16 02:06:40 rhevh06 sanlock[5082]: 2015-11-16 02:06:40+0000 1875059 [21600]: s9 renewal error -202 delta_length 10 last_success 1875018

3. Sanlock lost the lease after 80 seconds of failures

Nov 16 02:07:19 rhevh06 sanlock[5082]: 2015-11-16 02:07:19+0000 1875098 [5082]: s9 check_our_lease failed 80

4. Kernel complains about qemu-kvm and dd, blocked for 120 seconds:

Nov 16 02:09:42 rhevh06 kernel: INFO: task qemu-kvm:29441 blocked for more than 120 seconds.
Nov 16 02:09:42 rhevh06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 16 02:09:42 rhevh06 kernel: qemu-kvm        D ffff88207fc93680     0 29441      1 0x00000080
Nov 16 02:09:42 rhevh06 kernel: ffff882007d67a68 0000000000000082 ffff881d0c8dc440 ffff882007d67fd8
Nov 16 02:09:42 rhevh06 kernel: ffff882007d67fd8 ffff882007d67fd8 ffff881d0c8dc440 ffff88207fc93f48
Nov 16 02:09:42 rhevh06 kernel: ffff88200917e300 ffff881d0c8dc440 0000000000000000 0000000000000000
Nov 16 02:09:42 rhevh06 kernel: Call Trace:
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff8160954d>] io_schedule+0x9d/0x130
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81203e93>] do_blockdev_direct_IO+0xc03/0x2620
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81205905>] __blockdev_direct_IO+0x55/0x60
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff812000a7>] blkdev_direct_IO+0x57/0x60
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81158203>] generic_file_aio_read+0x6d3/0x750
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff8109ae60>] ? hrtimer_get_res+0x50/0x50
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff8120062c>] blkdev_aio_read+0x4c/0x70
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811c5d5d>] do_sync_read+0x8d/0xd0
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811c643c>] vfs_read+0x9c/0x170
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811c7102>] SyS_pread64+0x92/0xc0
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
Nov 16 02:09:42 rhevh06 kernel: INFO: task dd:29589 blocked for more than 120 seconds.
Nov 16 02:09:42 rhevh06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 16 02:09:42 rhevh06 kernel: dd              D ffff88103fcf3680     0 29589  21519 0x00000080
Nov 16 02:09:42 rhevh06 kernel: ffff880b90e9ba70 0000000000000086 ffff882006470000 ffff880b90e9bfd8
Nov 16 02:09:42 rhevh06 kernel: ffff880b90e9bfd8 ffff880b90e9bfd8 ffff882006470000 ffff88103fcf3f48
Nov 16 02:09:42 rhevh06 kernel: ffff88003543aa80 ffff882006470000 0000000000000000 0000000000000000
Nov 16 02:09:42 rhevh06 kernel: Call Trace:
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff8160954d>] io_schedule+0x9d/0x130
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81203e93>] do_blockdev_direct_IO+0xc03/0x2620
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff812baf20>] ? disk_map_sector_rcu+0x80/0x80
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81205905>] __blockdev_direct_IO+0x55/0x60
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff812000a7>] blkdev_direct_IO+0x57/0x60
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81158203>] generic_file_aio_read+0x6d3/0x750
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81164e2e>] ? lru_cache_add+0xe/0x10
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff8118ca51>] ? page_add_new_anon_rmap+0x91/0x130
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81182913>] ? handle_mm_fault+0x773/0xd60
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81600804>] ? __slab_free+0x10e/0x277
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff8120062c>] blkdev_aio_read+0x4c/0x70
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811c5d5d>] do_sync_read+0x8d/0xd0
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811c643c>] vfs_read+0x9c/0x170
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff811c6f68>] SyS_read+0x58/0xb0
Nov 16 02:09:42 rhevh06 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b

5. Kernel complains about lvm blocked for 120 seconds

Nov 16 02:11:42 rhevh06 kernel: INFO: task lvm:29690 blocked for more than 120 seconds.
Nov 16 02:11:42 rhevh06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 16 02:11:42 rhevh06 kernel: lvm             D ffff88103fc33680     0 29690  29689 0x00000080
Nov 16 02:11:42 rhevh06 kernel: ffff880f7a8cfa70 0000000000000086 ffff882006470b60 ffff880f7a8cffd8
Nov 16 02:11:42 rhevh06 kernel: ffff880f7a8cffd8 ffff880f7a8cffd8 ffff882006470b60 ffff88103fc33f48
Nov 16 02:11:42 rhevh06 kernel: ffff880f69142580 ffff882006470b60 0000000000000000 0000000000000000
Nov 16 02:11:42 rhevh06 kernel: Call Trace:
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff8160954d>] io_schedule+0x9d/0x130
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff81203e93>] do_blockdev_direct_IO+0xc03/0x2620
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff81205905>] __blockdev_direct_IO+0x55/0x60
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff812000a7>] blkdev_direct_IO+0x57/0x60
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff81158203>] generic_file_aio_read+0x6d3/0x750
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811e61ce>] ? mntput_no_expire+0x3e/0x120
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811e62d4>] ? mntput+0x24/0x40
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff8120062c>] blkdev_aio_read+0x4c/0x70
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811c5d5d>] do_sync_read+0x8d/0xd0
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811c643c>] vfs_read+0x9c/0x170
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff811c6f68>] SyS_read+0x58/0xb0
Nov 16 02:11:42 rhevh06 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b

6. Libvirt fails to perform the blockInfo call; it seems that some lock is
   acquired and never released

Nov 16 02:12:55 rhevh06 journal: Cannot start job (query, none) for domain Zabbix02; current job is (async nested, migration out) owned by (4997, 4997)
Nov 16 02:12:55 rhevh06 journal: Timed out during operation: cannot acquire state change lock
Nov 16 02:12:55 rhevh06 journal: vdsm vm.Vm ERROR vmId=`4cfff6f3-2f81-487f-b970-93f111239702`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x2c93570>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
  File "/usr/share/vdsm/virt/vm.py", line 292, in _highWrite
  File "/usr/share/vdsm/virt/vm.py", line 2537, in extendDrivesIfNeeded
  File "/usr/share/vdsm/virt/vm.py", line 2489, in _getExtendCandidates
  File "/usr/share/vdsm/virt/vm.py", line 697, in f
TimeoutError: Timed out during operation: cannot acquire state change lock

7. Last report about blocked process

Nov 16 02:15:42 rhevh06 kernel: INFO: task dd:29589 blocked for more than 120 seconds.
Nov 16 02:15:42 rhevh06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 16 02:15:42 rhevh06 kernel: dd              D ffff88103fcf3680     0 29589  21519 0x00000080
Nov 16 02:15:42 rhevh06 kernel: ffff880b90e9ba70 0000000000000086 ffff882006470000 ffff880b90e9bfd8
Nov 16 02:15:42 rhevh06 kernel: ffff880b90e9bfd8 ffff880b90e9bfd8 ffff882006470000 ffff88103fcf3f48
Nov 16 02:15:42 rhevh06 kernel: ffff88003543aa80 ffff882006470000 0000000000000000 0000000000000000
Nov 16 02:15:42 rhevh06 kernel: Call Trace:
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff8160954d>] io_schedule+0x9d/0x130
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff81203e93>] do_blockdev_direct_IO+0xc03/0x2620
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff812baf20>] ? disk_map_sector_rcu+0x80/0x80
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff81205905>] __blockdev_direct_IO+0x55/0x60
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff812000a7>] blkdev_direct_IO+0x57/0x60
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff811ffa50>] ? I_BDEV+0x10/0x10
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff81158203>] generic_file_aio_read+0x6d3/0x750
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff81164e2e>] ? lru_cache_add+0xe/0x10
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff8118ca51>] ? page_add_new_anon_rmap+0x91/0x130
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff81182913>] ? handle_mm_fault+0x773/0xd60
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff81600804>] ? __slab_free+0x10e/0x277
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff8120062c>] blkdev_aio_read+0x4c/0x70
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff811c5d5d>] do_sync_read+0x8d/0xd0
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff811c643c>] vfs_read+0x9c/0x170
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff811c6f68>] SyS_read+0x58/0xb0
Nov 16 02:15:42 rhevh06 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b

9. Last libvirt error

Nov 16 02:35:18 rhevh06 journal: Timed out during operation: cannot acquire state change lock
Nov 16 02:35:18 rhevh06 journal: vdsm vm.Vm ERROR vmId=`8b47d8c0-294b-4f9b-a78d-8bda7eebb045`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x2c93570>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
  File "/usr/share/vdsm/virt/vm.py", line 292, in _highWrite
  File "/usr/share/vdsm/virt/vm.py", line 2537, in extendDrivesIfNeeded
  File "/usr/share/vdsm/virt/vm.py", line 2489, in _getExtendCandidates
  File "/usr/share/vdsm/virt/vm.py", line 697, in f
TimeoutError: Timed out during operation: cannot acquire state change lock

10. Sanlock errors continue until the host is rebooted

Nov 16 08:11:44 rhevh06 sanlock[5082]: 2015-11-16 08:11:44+0000 1896963 [21600]: 0d3c977c close_task_aio 3 0x7f40ec0009b0 busy

11. Host is rebooted

Nov 16 08:16:50 rhevh06 kernel: Linux version 3.10.0-229.1.2.el7.x86_64 (mockbuild@x86-034.build.eng.bos.redhat.com) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Fri Mar 6 17:12:08 EST 2015


Looking again in vdsm log:

1. Starting read delay check for domain 0d3c977c-c5b0-4368-a9c0-28187e3ea9ae

Thread-726588::DEBUG::2015-11-16 02:06:29,887::blockSD::596::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/dev/0d3c977c-c5b0-4368-a9c0-28187e3ea9ae/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)

We don't see this thread again in this log until vdsm is terminated.

2. Domain monitor thread takes the storage refresh lock
   (part of storage domain monitor flow)

Thread-726589::DEBUG::2015-11-16 02:15:10,793::misc::741::Storage.SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-726589::DEBUG::2015-11-16 02:15:10,793::misc::743::Storage.SamplingMethod::(__call__) Got in to sampling method

3. Thread-726589 starts multipath
   (part of multipath rescan flow) 

Thread-726589::DEBUG::2015-11-16 02:15:41,856::multipath::131::Storage.Misc.excCmd::(rescan) /usr/bin/sudo -n /sbin/multipath (cwd None)

We don't see this thread again in this log until vdsm is terminated.

4. Thread-726593 tries to refresh domain bba27063-b640-4cb5-9c1e-fbc894924700
   (part of storage domain monitor flow)

Thread-726593::DEBUG::2015-11-16 02:15:12,045::misc::741::Storage.SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)

We don't see this thread again until vdsm is terminated,
because the storage refresh lock is held by Thread-726589.

Vdsm storage monitor threads are blocked because they are sharing the
storage refresh lock and multipath rescan lock.
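That failure mode can be shown in miniature (illustrative Python, not vdsm code): one monitor thread holds the shared refresh lock while stuck in a call that never returns, and every other monitor thread blocks behind it.

```python
import threading

# Illustrative only - not vdsm code. Every domain monitor funnels through
# one shared refresh lock, so a single holder stuck in a never-returning
# call (dd, multipath) blocks all the other monitors, and every domain's
# lastCheck climbs together.

refresh_lock = threading.Lock()
holding = threading.Event()         # signals that the stuck thread holds the lock
never_returns = threading.Event()   # stands in for the dd/multipath that hangs

def monitor_domain(results, domain, stuck_in_io=False, patience=0.2):
    # Real monitor threads block with no timeout; the timeout here only
    # lets the sketch demonstrate the "blocked" outcome and then exit.
    if not refresh_lock.acquire(timeout=patience):
        results[domain] = "blocked"   # its lastCheck keeps growing meanwhile
        return
    try:
        if stuck_in_io:
            holding.set()
            never_returns.wait(0.5)   # simulate the call that never returns
        results[domain] = "checked"
    finally:
        refresh_lock.release()
```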


Why were dd, lvm and multipath blocked for hours?
Looking at /etc/multipath.conf, we see a standard configuration.

defaults {
    polling_interval        5
    getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    no_path_retry           fail
    user_friendly_names     no
    flush_on_last_del       yes
    fast_io_fail_tmo        5
    dev_loss_tmo            30
    max_fds                 4096
}


Looking in sos_commands/multipath/multipath_-l

14f504e46494c45527163327a6b762d7a446b4f2d47615267 dm-7 OPNFILER,VIRTUAL-DISK    
size=2.5T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 9:0:0:1 sdd 8:48 active undef running
14f504e46494c45524c56646a79672d677077302d76754343 dm-5 OPNFILER,VIRTUAL-DISK    
size=2.5T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 8:0:0:0 sdb 8:16 active undef running
14f504e46494c455247725a756e662d507563542d785a547a dm-6 OPNFILER,VIRTUAL-DISK    
size=2.5T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 9:0:0:0 sdc 8:32 active undef running

Checking the multipath configuration on a RHEL 7.2 system and in upstream multipath,
there is no configuration for OPNFILER,VIRTUAL-DISK, so it should use the defaults
set in multipath.conf.


Summary:

1. The iSCSI connection failed, causing various processes to block for 6 hours.
   This is the root cause.

2. Vdsm assumes that scsi operations will not block for hours

3. Vdsm storage threads block for hours, because dd and multipath
   never return, and storage threads share the storage refresh and
   multipath rescan locks

4. Because some storage monitor threads are blocked, lastCheck values are
   high, causing the host to be non-operational

5. Engine does not fence non-operational hosts, only non-responsive hosts.
   The only way to recover from this state is to manually reboot the host.


Based on this, I think this bug should move to the kernel.

Chris, can you take a look?
Comment 30 Nir Soffer 2016-01-26 19:07 EST
Created attachment 1118641 [details]
/var/log/messages from problem host
Comment 33 Nir Soffer 2016-02-20 15:01:02 EST
Chris, can you take a look? See comment 29.
