Bug 1502601
Summary: | [Hyper-V][RHEL7.4] hang when thaw on microsoft hyper-v | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | nikhil kshirsagar <nkshirsa> | ||||
Component: | kernel | Assignee: | Cathy Avery <cavery> | ||||
kernel sub component: | Hyper-V | QA Contact: | xuli <xuli> | ||||
Status: | CLOSED ERRATA | Docs Contact: | |||||
Severity: | high | ||||||
Priority: | urgent | CC: | afox, ailan, akarvi, amaury.deganseman, anazmy, boyang, cavery, decui, dhoward, hhei, jjarvis, jopoulso, kys, ldu, leiwang, leo.white, loberman, longli, mgandhi, mitchell.doedens, mmilgram, mtessun, nkshirsa, nmurray, orendorf, paulf123, pbuchan, pdhamdhe, qzheng, rdutta, salmy, schnederle, stanislav.moravec, subhat, tom.soderlund, toneata, toracat, vkuznets, xialiu, xiaofwan, xuli, yacao | ||||
Version: | 7.4 | Keywords: | ZStream | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | kernel-3.10.0-832.el7 | Doc Type: | Bug Fix | ||||
Doc Text: |
Cause: Improper handling of some error codes in Hyper-V storvsc driver was in some cases causing system lockup after hot backup.
Consequence: It was possible to observe block subsystem getting stuck after performing hot backup of Hyper-V Virtual Machine.
Fix: Upstream patches fixing improper error codes handling were applied to RHEL kernel.
Result: Hot backup of RHEL Virtual Machines running on Hyper-V should not cause system lockup now.
|
Story Points: | --- | ||||
Clone Of: | |||||||
: | 1536978 (view as bug list) | Environment: | |||||
Last Closed: | 2018-04-10 22:32:38 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1420851, 1536978 | ||||||
Attachments: |
|
Comment 3
nikhil kshirsagar
2017-10-16 10:16:16 UTC
@long, This looks like your recent VSS backup bug. Is it a problem for both WS2012 and WS2016. Cathy Are the private comments visible to long ? Shall we make them public? @cathy, Does it repro with your patch on storvsc_remove_lun? I think that might fix it. Long (In reply to longli from comment #6) > @cathy, > > Does it repro with your patch on storvsc_remove_lun? I think that might fix > it. > > Long We haven't reproduced this yet so I'll let you know. Note that the storsvc or remove_scsi_device wasn't see in the hung task at all in the vmcore. All we had was this hung task triggering the panice, [31236.253319] scsi host2: scsi_prep_async_scan called twice [31236.265530] scsi host2: scsi_prep_async_scan called twice [31236.273435] Dev sde: unable to read RDB block 0 [31236.273455] sde: unable to read partition table [31236.276435] sd 2:0:1:0: [sde] Sector size 0 reported, assuming 512. [31236.276885] sd 2:0:1:0: [sde] Attached SCSI disk [31440.245172] INFO: task sshd:71717 blocked for more than 120 seconds. [31440.245239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [31440.245321] sshd D ffff880572f1bc00 0 71717 71694 0x00000080 [31440.245330] ffff8803472ff7c0 0000000000000086 ffff8805476d0fd0 ffff8803472fffd8 [31440.245336] ffff8803472fffd8 ffff8803472fffd8 ffff8805476d0fd0 ffff8803472ff910 [31440.245340] 7fffffffffffffff ffff8803472ff908 ffff8805476d0fd0 ffff880572f1bc00 [31440.245345] Call Trace: [31440.245358] [<ffffffff816a94e9>] schedule+0x29/0x70 [31440.245366] [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0 [31440.245376] [<ffffffff810ce54e>] ? dequeue_task_fair+0x41e/0x660 [31440.245381] [<ffffffff810c4593>] ? try_to_wake_up+0x183/0x340 [31440.245386] [<ffffffff816a989d>] wait_for_completion+0xfd/0x140 [31440.245390] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20 [31440.245398] [<ffffffff810a987d>] flush_work+0xfd/0x190 [31440.245403] [<ffffffff810a5df0>] ? move_linked_works+0x90/0x90 [31440.245411] [<ffffffff813fbf99>] tty_flush_to_ldisc+0x29/0x30 [31440.245416] [<ffffffff813f724f>] n_tty_poll+0x8f/0x1d0 [31440.245420] [<ffffffff813f30d2>] tty_poll+0x72/0x90 [31440.245427] [<ffffffff81216585>] do_select+0x3a5/0x7c0 [31440.245435] [<ffffffff815cfacf>] ? ip_finish_output+0x29f/0x780 [31440.245440] [<ffffffff815d02b3>] ? ip_output+0x73/0xe0 [31440.245446] [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150 [31440.245451] [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150 [31440.245456] [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150 [31440.245461] [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150 [31440.245466] [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20 [31440.245472] [<ffffffff8156e998>] ? release_sock+0x118/0x170 [31440.245477] [<ffffffff815db126>] ? tcp_sendmsg+0xd6/0xc30 [31440.245486] [<ffffffff816068b3>] ? inet_sendmsg+0x63/0xb0 [31440.245492] [<ffffffff812b50c3>] ? selinux_socket_sendmsg+0x23/0x30 [31440.245497] [<ffffffff81216b7b>] core_sys_select+0x1db/0x300 [31440.245505] [<ffffffff8120027d>] ? do_sync_write+0x8d/0xd0 [31440.245512] [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0 [31440.245518] [<ffffffff81216d5a>] SyS_select+0xba/0x110 [31440.245524] [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b [31440.245529] sending NMI to all CPUs: This does NOT reproduce for the customer with the RHEL 7.3 kernel (In reply to nikhil from comment #11) > This does NOT reproduce for the customer with the RHEL 7.3 kernel My patch addresses a problem originally reported on RHEL 7.2. Once we get this bug reproduced I can see if there is a change to RHEL 7.4 that accounts for the issue. So is downgrading the Kernel the only workaround to fixing this issue at the moment? Have reproduced this issue on the kernel 3.10.0-693.2.2.el7.x86_64 Gen2 VM with 2012R2 hyper-v. Reproduce steps: 1) execute the hot add/remove scsi by following script. This script continuously adds two disks, then removes two disks. 2) Do backup from host side by powershell during the script running. wbadmin start backup -backupTarget:Q:\ -hyperv:vmName 3) Observe that Online backup succeeded, but check the VM, it is hung and no IP address available, heartbeat shows as OK. After wait for a while, hung_task_timeout_secs about kworker pops up. =========================================================== $VMName = Read-Host "Specify the VMName to use" $ComputerName = "localhost" Date $maxnum = 2 $numdisk = 0 do { $DISK = "$DPATH\b1273723_$numdisk.vhdx" $EXIST = Test-Path $DISK if ( -not $EXIST ) { New-VHD -Path $DISK -Size 1024MB -ComputerName $ComputerName -Dynamic } $numdisk++ } while ($numdisk -lt $maxnum) while ( 1 -ne 0 ) { $DISK = "$DPATH\b1273723_0.vhdx" Add-VMHardDiskDrive -VMName $VMName -ComputerName $ComputerName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 2 -Path $DISK $DISK = "$DPATH\b1273723_1.vhdx" Add-VMHardDiskDrive -VMName $VMName -ComputerName $ComputerName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 3 -Path $DISK Remove-VMHardDiskDrive -vmName $VMName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 2 -ComputerName $ComputerName Remove-VMHardDiskDrive -vmName $VMName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 3 -ComputerName $ComputerName } ============================================================= After test the kernel build, I suspect kernel 3.10.0-622.el7 which induces this hang issue if do backup during scsi disk hot add/remove . (this hang issue reproduces on kernel 3.10.0-622.el7, but cannot reproduce on kernel 3.10.0-621.el7 by following comment 26 steps) Also kernel 3.10.0-622.el7 with patch: [scsi] storvsc: properly set residual data length on errors [1364282], which is related with Bug 1364282 - [Hyper-V][RHEL-7.2]Data corruption observed on RHEL 7.2 VM running on Hyper-V Win2k12R2 and https://bugzilla.redhat.com/show_bug.cgi?id=1500887. ([Hyper-V][RHEL-7.2] All MP paths failing to a device with heavy IO can cause the VM to hang. - found during verify 1364282) Thank you. (In reply to xuli from comment #32) > After test the kernel build, I suspect kernel 3.10.0-622.el7 which induces > this hang issue if do backup during scsi disk hot add/remove . (this hang > issue reproduces on kernel 3.10.0-622.el7, but cannot reproduce on kernel > 3.10.0-621.el7 by following comment 26 steps) > > Also kernel 3.10.0-622.el7 with patch: > [scsi] storvsc: properly set residual data length on errors [1364282], > which is related with > Bug 1364282 - [Hyper-V][RHEL-7.2]Data corruption observed on RHEL 7.2 VM > running on Hyper-V Win2k12R2 and > https://bugzilla.redhat.com/show_bug.cgi?id=1500887. ([Hyper-V][RHEL-7.2] > All MP paths failing to a device with heavy IO can cause the VM to hang. - > found during verify 1364282) > > Thank you. That was a regression that was introduced and fixed in 7.4. https://bugzilla.redhat.com/show_bug.cgi?id=1437552 The fix was available by kernel-3.10.0-650.el7 and it is in 3.10.0-693.2.2.el7.x86_64 Cathy This appears to be a problem that Microsoft is aware of and is currently working on. It would appear that this bug shares common traits with https://bugzilla.redhat.com/show_bug.cgi?id=1500887. We are working with MS to resolve the issue. Thanks, Cathy @long QE is seeing that this patch commit 40630f462824ee24bc00d692865c86c3828094e0 Author: Long Li <longli> Date: Wed Dec 14 18:46:03 2016 -0800 scsi: storvsc: properly set residual data length on errors is either causing or exacerbating this backup hang. commit f1c635b439a5c01776fe3a25b1e2dc546ea82e6f Author: Stephen Hemminger <stephen> Date: Tue Mar 7 09:15:53 2017 -0800 Hyper-V host emulation of SCSI for virtual DVD device reports SCSI version 0 (UNKNOWN) but is still capable of supporting REPORTLUN. Was added and verified as fix for the regression it caused https://bugzilla.redhat.com/show_bug.cgi?id=1437552 However it looks like commit 40630f462824ee24bc00d692865c86c3828094e0 may have other side effects. Is there a reason why the commit is causing this hang directly or has it simply changed the timing so the scsi_lun_remove problem is more obvious. Thanks, Cathy I think it's the timing that changes the sequence of scsi_lun_remove being issued. This patch is needed to make sure we report the correct returned data from host so the SCSI doesn't pick up uninitialized data. I'm concerned on the call trace on sshd, that is a I/O timeout. Does it repro with your patch? (In reply to longli from comment #40) > I think it's the timing that changes the sequence of scsi_lun_remove being > issued. This patch is needed to make sure we report the correct returned > data from host so the SCSI doesn't pick up uninitialized data. > > I'm concerned on the call trace on sshd, that is a I/O timeout. Does it > repro with your patch? With our combined patches meaning my dedicated work queue plus excessive lun removal and your excessive scan ( minus set_host_byte(scmnd, DID_BAD_TARGET) ) I've had pretty good luck testing. I sent you the excessive removal patch. @Long With my additional patch to limit storvsc_remove_lun calls we have made it past the old hang but are now hitting a new one. The build includes the 2 new patches that just made it upstream. [ 2280.017852] [<ffffffffaf25e5d4>] ? __pm_runtime_use_autosuspend+0x54/0x70 [ 2280.017855] [<ffffffffc03f52b5>] sd_probe_async+0x115/0x1e0 [sd_mod] [ 2280.017857] [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120 [ 2280.017859] [<ffffffffaeeab58a>] process_one_work+0x17a/0x440 [ 2280.017862] [<ffffffffaeeac256>] worker_thread+0x126/0x3c0 [ 2280.017864] [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0 [ 2280.017866] [<ffffffffaeeb352f>] kthread+0xcf/0xe0 [ 2280.017868] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 [ 2280.017870] [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90 [ 2280.017872] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 [ 2280.017874] INFO: task kworker/u128:1:14524 blocked for more than 120 seconds. [ 2280.018839] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2280.019771] kworker/u128:1 D ffff8debbfbeaf70 0 14524 2 0x00000080 [ 2280.019777] Workqueue: events_unbound async_run_entry_fn [ 2280.019778] Call Trace: [ 2280.019783] [<ffffffffaf4cda19>] schedule_preempt_disabled+0x29/0x70 [ 2280.019786] [<ffffffffaf4cbab7>] __mutex_lock_slowpath+0xc7/0x1d0 [ 2280.019788] [<ffffffffaf4caecf>] mutex_lock+0x1f/0x2f [ 2280.019791] [<ffffffffaf286976>] do_scan_async+0x36/0x160 [ 2280.019793] [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120 [ 2280.019796] [<ffffffffaeeab58a>] process_one_work+0x17a/0x440 [ 2280.019798] [<ffffffffaeeac256>] worker_thread+0x126/0x3c0 [ 2280.019801] [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0 [ 2280.019803] [<ffffffffaeeb352f>] kthread+0xcf/0xe0 [ 2280.019805] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 [ 2280.019808] [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90 [ 2280.019810] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 [ 2280.019813] INFO: task kworker/u128:5:14806 blocked for more than 120 seconds. [ 2280.020743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2280.021678] kworker/u128:5 D ffff8deb4342af70 0 14806 2 0x00000080 [ 2280.021687] Workqueue: storvsc_error_wq_0 storvsc_remove_lun [hv_storvsc] [ 2280.021688] Call Trace: [ 2280.021691] [<ffffffffaf4ccaf9>] schedule+0x29/0x70 [ 2280.021694] [<ffffffffaeebaa25>] async_synchronize_cookie_domain+0x85/0x150 [ 2280.021696] [<ffffffffaeeb44b0>] ? wake_up_atomic_t+0x30/0x30 [ 2280.021698] [<ffffffffaeebab08>] async_synchronize_full_domain+0x18/0x20 [ 2280.021701] [<ffffffffc03f2362>] sd_remove+0x42/0xb0 [sd_mod] [ 2280.021704] [<ffffffffaf252bbf>] __device_release_driver+0x7f/0xf0 [ 2280.021706] [<ffffffffaf252c53>] device_release_driver+0x23/0x30 [ 2280.021708] [<ffffffffaf252318>] bus_remove_device+0x108/0x180 [ 2280.021712] [<ffffffffaf24e295>] device_del+0x135/0x210 [ 2280.021714] [<ffffffffaf288813>] __scsi_remove_device+0xc3/0xd0 [ 2280.021716] [<ffffffffaf288846>] scsi_remove_device+0x26/0x40 [ 2280.021719] [<ffffffffc03d4420>] storvsc_remove_lun+0x40/0x60 [hv_storvsc] [ 2280.021722] [<ffffffffaeeab58a>] process_one_work+0x17a/0x440 [ 2280.021724] [<ffffffffaeeac256>] worker_thread+0x126/0x3c0 [ 2280.021726] [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0 [ 2280.021728] [<ffffffffaeeb352f>] kthread+0xcf/0xe0 [ 2280.021731] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 [ 2280.021733] [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90 [ 2280.021735] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 (In reply to Cathy Avery from comment #51) > (In reply to xuli from comment #44) > > I've added vmlinuz-4.14.0-rc8lun_work_item+ to the VM at 10.73.199.37. > > This is the upstream kernel with the same patches as > vmlinuz-3.10.0-772.el7_scsi_test3.x86_64 that you used to uncover yet > another hang. > > Could you see if this build hangs in the same way as well? > > Thanks! Hi Cathy, Notice that the test build vmlinuz-4.14.0-rc8lun_work_item+ gets the "hung_task_timeout_secs" when execute the same steps as before (need to keep running add/remove disk and only sleep 5 seconds during add/remove, then do backup, after backup successfully,still keep running add/remove disk, after wait for a while, it gets hung_task_timeout_secs at 10.73.199.37). 1143.040625] sdg: unable to read partition table [ 1143.040637] sdg: partition table beyond EOD, enabling native capacity [ 1143.041254] sd 0:0:1:2: [sdg] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB) [ 1143.041257] sd 0:0:1:2: [sdg] 4096-byte physical blocks [ 1143.041406] sd 0:0:1:2: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 1143.041411] sdg: detected capacity change from 512 to 1073741824 [ 1143.051274] sd 0:0:1:2: [sdg] Attached SCSI disk [ 1226.720046] INFO: task kworker/u128:0:5 blocked for more than 120 seconds. [ 1226.720689] Not tainted 4.14.0-rc8lun_work_item+ #3 [ 1226.721278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1226.721884] kworker/u128:0 D 0 5 2 0x80000000 [ 1226.721891] Workqueue: events_unbound async_run_entry_fn [ 1226.721893] Call Trace: [ 1226.721899] __schedule+0x2cc/0x7a0 [ 1226.721902] ? __add_to_page_cache_locked+0x101/0x160 [ 1226.721903] schedule+0x31/0x80 [ 1226.721906] io_schedule+0x11/0x40 [ 1226.721907] do_read_cache_page+0x42d/0x520 [ 1226.721910] ? blkdev_writepages+0x10/0x10 [ 1226.721913] ? __blk_put_request+0xbf/0x190 [ 1226.721914] ? page_cache_tree_insert+0xb0/0xb0 [ 1226.721916] read_cache_page+0x10/0x20 [ 1226.721919] read_dev_sector+0x28/0xa0 [ 1226.721921] read_lba+0x162/0x230 [ 1226.721923] efi_partition+0xfc/0x880 [ 1226.721925] ? string+0x63/0x70 [ 1226.721926] ? vsnprintf+0x255/0x490 [ 1226.721928] ? snprintf+0x45/0x70 [ 1226.721930] ? compare_gpts+0x280/0x280 [ 1226.721931] check_partition+0x138/0x210 [ 1226.721933] ? check_partition+0x138/0x210 [ 1226.721934] rescan_partitions+0xab/0x310 [ 1226.721937] ? down_write+0xd/0x40 [ 1226.721938] __blkdev_get+0x2d4/0x430 [ 1226.721940] blkdev_get+0x11f/0x300 [ 1226.721943] ? unlock_new_inode+0x43/0x60 [ 1226.721944] ? bdget+0x112/0x150 [ 1226.721946] device_add_disk+0x3c5/0x420 [ 1226.721950] ? update_autosuspend+0x4c/0x50 [ 1226.721951] ? __pm_runtime_use_autosuspend+0x57/0x70 [ 1226.721954] sd_probe_async+0x103/0x1d0 [ 1226.721956] async_run_entry_fn+0x33/0x160 [ 1226.721958] process_one_work+0x164/0x380 [ 1226.721960] worker_thread+0x49/0x410 [ 1226.721961] kthread+0x104/0x140 [ 1226.721963] ? process_one_work+0x380/0x380 [ 1226.721964] ? kthread_create_on_node+0x60/0x60 [ 1226.721965] ret_from_fork+0x25/0x30 Thank you so much. @Long Have you been seeing these types of hangs on WS 2016? So far we have only heard of them on WS 2012? Thanks, Cathy Yes I have seen similar hung on WS2016. I have fixed it with the DID_TARGET_FAILURE in the patch. But that was removed in the final patch. Can you give it a quick test to see if that fixes the problem? I'm attempting to repro and looking into it. (In reply to longli from comment #57) > Yes I have seen similar hung on WS2016. I have fixed it with the > DID_TARGET_FAILURE in the patch. But that was removed in the final patch. > Can you give it a quick test to see if that fixes the problem? > > I'm attempting to repro and looking into it. I have tested that patch before and it causes multipath to return IO errors on failover. Without the patch multipath fails over with no IO errors. Thats a problem. (In reply to longli from comment #57) > Yes I have seen similar hung on WS2016. I am testing the excessive lun removal patch which works well as we have pushed past that hang except that we have now hit another hang which xuli discovered in her testing. She hot adds and removes the luns while she is backing them up with wbadmin start backup -backupTarget:D:\ -hyperv:VMname. It looks like the hang is the result of an attempted lun removal slap bang in the middle of its probe. I am looking in to it. I don't believe it is the result of my patch but I can't push it upstream until I know for sure. Thanks, Cathy (In reply to Milan P. Gandhi from comment #60 OK this is the same crash that xuli just reported. This is the latest one uncovered in comment 54 and 59. It is easily reproduced. Long could you look at this as well? In the customers case they were running just the upstream patches. xuli reproduced it here with the upstream and the limiting lun removal patch. Thanks, Cathy (In reply to Milan P. Gandhi from comment #60) Hi, Do you have the messages file as well? Thanks, Cathy I'm looking at this. I think there may be another bug in upper layer SCSI. If you look at the SCSI probe code at scsi_probe_lun(), it calls scsi_execute_req but doesn't check the resid (the request length - the actual data transferred). At the time when it calls on a non-existent LUN, this call should fail and everything should recover. My patch with DID_TARGET_FAILURE deals with this situation (but unfortunately it breaks your FC tests), it makes this call fail by setting an error in host byte. I'm still struggling with replicating your FC tests, can you share the test setup details? Shoot me an email with the QA on your team, we can discuss how to repro it. @Cathy, Can you post the complete kernel trace on this hung? I want to see what mutex (or bit_wait_io?) sd_probe_async is waiting at the beginning the of trace. I don't have your 3rd patch. Please send it to me I'll try to patch it along upstream patches, and repro in our lab. (In reply to Cathy Avery from comment #47) > @Long > > With my additional patch to limit storvsc_remove_lun calls we have made it > past the old hang but are now hitting a new one. The build includes the 2 > new patches > that just made it upstream. > > [ 2280.017852] [<ffffffffaf25e5d4>] ? __pm_runtime_use_autosuspend+0x54/0x70 > [ 2280.017855] [<ffffffffc03f52b5>] sd_probe_async+0x115/0x1e0 [sd_mod] > [ 2280.017857] [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120 > [ 2280.017859] [<ffffffffaeeab58a>] process_one_work+0x17a/0x440 > [ 2280.017862] [<ffffffffaeeac256>] worker_thread+0x126/0x3c0 > [ 2280.017864] [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0 > [ 2280.017866] [<ffffffffaeeb352f>] kthread+0xcf/0xe0 > [ 2280.017868] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 > [ 2280.017870] [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90 > [ 2280.017872] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 > [ 2280.017874] INFO: task kworker/u128:1:14524 blocked for more than 120 > seconds. > [ 2280.018839] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [ 2280.019771] kworker/u128:1 D ffff8debbfbeaf70 0 14524 2 > 0x00000080 > [ 2280.019777] Workqueue: events_unbound async_run_entry_fn > [ 2280.019778] Call Trace: > [ 2280.019783] [<ffffffffaf4cda19>] schedule_preempt_disabled+0x29/0x70 > [ 2280.019786] [<ffffffffaf4cbab7>] __mutex_lock_slowpath+0xc7/0x1d0 > [ 2280.019788] [<ffffffffaf4caecf>] mutex_lock+0x1f/0x2f > [ 2280.019791] [<ffffffffaf286976>] do_scan_async+0x36/0x160 > [ 2280.019793] [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120 > [ 2280.019796] [<ffffffffaeeab58a>] process_one_work+0x17a/0x440 > [ 2280.019798] [<ffffffffaeeac256>] worker_thread+0x126/0x3c0 > [ 2280.019801] [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0 > [ 2280.019803] [<ffffffffaeeb352f>] kthread+0xcf/0xe0 > [ 2280.019805] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 > [ 2280.019808] [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90 > [ 2280.019810] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 > [ 2280.019813] INFO: task kworker/u128:5:14806 blocked for more than 120 > seconds. > [ 2280.020743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [ 2280.021678] kworker/u128:5 D ffff8deb4342af70 0 14806 2 > 0x00000080 > [ 2280.021687] Workqueue: storvsc_error_wq_0 storvsc_remove_lun [hv_storvsc] > [ 2280.021688] Call Trace: > [ 2280.021691] [<ffffffffaf4ccaf9>] schedule+0x29/0x70 > [ 2280.021694] [<ffffffffaeebaa25>] > async_synchronize_cookie_domain+0x85/0x150 > [ 2280.021696] [<ffffffffaeeb44b0>] ? wake_up_atomic_t+0x30/0x30 > [ 2280.021698] [<ffffffffaeebab08>] async_synchronize_full_domain+0x18/0x20 > [ 2280.021701] [<ffffffffc03f2362>] sd_remove+0x42/0xb0 [sd_mod] > [ 2280.021704] [<ffffffffaf252bbf>] __device_release_driver+0x7f/0xf0 > [ 2280.021706] [<ffffffffaf252c53>] device_release_driver+0x23/0x30 > [ 2280.021708] [<ffffffffaf252318>] bus_remove_device+0x108/0x180 > [ 2280.021712] [<ffffffffaf24e295>] device_del+0x135/0x210 > [ 2280.021714] [<ffffffffaf288813>] __scsi_remove_device+0xc3/0xd0 > [ 2280.021716] [<ffffffffaf288846>] scsi_remove_device+0x26/0x40 > [ 2280.021719] [<ffffffffc03d4420>] storvsc_remove_lun+0x40/0x60 > [hv_storvsc] > [ 2280.021722] [<ffffffffaeeab58a>] process_one_work+0x17a/0x440 > [ 2280.021724] [<ffffffffaeeac256>] worker_thread+0x126/0x3c0 > [ 2280.021726] [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0 > [ 2280.021728] [<ffffffffaeeb352f>] kthread+0xcf/0xe0 > [ 2280.021731] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 > [ 2280.021733] [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90 > [ 2280.021735] [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40 Created attachment 1353233 [details] dmesg logs from vmcore Hello, I had to make comment#60 private since it had customer sensitive data. I have extracted the full dmesg logs from this vmcore captured with test kernel-3.10.0-693.12.1.el7_BZ_1502601.x86_64. Please let me know if you would need any further information. Many thanks, Milan. (In reply to longli from comment #63) > I'm looking at this. > > I think there may be another bug in upper layer SCSI. If you look at the > SCSI probe code at scsi_probe_lun(), it calls scsi_execute_req but doesn't > check the resid (the request length - the actual data transferred). At the > time when it calls on a non-existent LUN, this call should fail and > everything should recover. My patch with DID_TARGET_FAILURE deals with this > situation (but unfortunately it breaks your FC tests), it makes this call > fail by setting an error in host byte. > > I'm still struggling with replicating your FC tests, can you share the test > setup details? Shoot me an email with the QA on your team, we can discuss > how to repro it. Yes I think this last hang with devices being probed and removed is the root of all the problems. The IO is probably stuck on a non functioning queue and that is what is causing the backup. Do you think that the resid patch is contributing to this? commit 40630f462824ee24bc00d692865c86c3828094e0 Author: Long Li <longli> Date: Wed Dec 14 18:46:03 2016 -0800 scsi: storvsc: properly set residual data length on errors It may be exposing yet another problem. We were seeing hangs with multipath before that patch went in. So I'm guessing we may have 2 issues in the scsi stack. Look at comment 26. xuli describes how she reproduces the problem. You may have to run the backup more than once. You should also put a 5 to 10 second sleep in the script between the hot disk adds and removes. I'll send you my patch so you can get to the probe hang without hitting the other hangs first. I think we need to return DID_TARGET_FAILURE for SRB_STATUS_INVALID_LUN (in storvsc_handle_error). If this error is not populated, the I/O will never finish because data_transfer_length is always 0 in these situations. The sd_probe_async call will get stuck forever for this I/O finish, while holding the async_domain scsi_sd_probe_domain. This will block the subsequent scsi_remove_device (which eventually calls into sd_remove, waiting for scsi_sd_probe_domain to clear), making the I/O impossible to fail on device removal. This creates a deadlock. Looking at scsi_io_completion, returning an error code (I think DID_TARGET_FAILURE is the best code, good_bytes will be zero in this case) is necessary to eventually finish this I/O. This is the only chance this I/O can be finished. Can you give it a try to see if it fixes this bug? I know it's causing problems on your multi-path FC tests, but that may be another issue I'm happy to work on it. (In reply to longli from comment #69) Yes this fix seems to work but you are asking me to replace one problem for another. Have you tested multipath failover with fc? The other issue I have is the IO does not always get stuck forever. In many cases it completes just fine. Why is that? There is a race going on that we have not addressed and my concern is the fix is a band-aid approach and is not correcting the real problem. (In reply to longli from comment #69) #define DID_TARGET_FAILURE 0x10 /* Permanent target failure, do not retry on * other paths */ This error looks to be incompatible with multipath. Hi folks. Not sure if this is an appropriate place to ask this sort of question. I certainly appreciate that this is an acknowledged bug that is being worked on but.... My thought is that there must be hundreds of thousands of installations that run on Hyper-V servers and leverage VSS. What are people doing to work around this in production? In our environment we are talking multiple daily outages with anything other than 4 months+ old kernels leaving several CVE issues open. Am I missing something or this not a huge problem? Thx To answer my own question, looks like being that the problem is in the SCSI stack (or whatever you programmer people call it :)) so switching the VM to use IDE instead appears to be an acceptable workaround. (In reply to paulf123 from comment #81) > To answer my own question, looks like being that the problem is in the SCSI > stack (or whatever you programmer people call it :)) so switching the VM to > use IDE instead appears to be an acceptable workaround. Further to this... the SCSI controller itself has to stay otherwise the VM will get paused during snapshot. An error about cache is still thrown but it it references an non existent drive so I believe this will be non-disruptive. (In reply to paulf123 from comment #82) > (In reply to paulf123 from comment #81) > > To answer my own question, looks like being that the problem is in the SCSI > > stack (or whatever you programmer people call it :)) so switching the VM to > > use IDE instead appears to be an acceptable workaround. > > Further to this... the SCSI controller itself has to stay otherwise the VM > will get paused during snapshot. > > An error about cache is still thrown but it it references an non existent > drive so I believe this will be non-disruptive. and nope.... same problem hung kworker @^#$%@$#! still don't understand how anyone is using current kernels and not having a huge deal this show stopper problem (In reply to paulf123 from comment #73) > What are people doing to work around this in production? If the problem is triggered by snapshot/checkpoint based backup, another remedy could be to use agent based backup instead until the problem is solved? This is a problem that exists in upstream kernels as well. This bug is a top priority and we are working closely with Microsoft towards a solution. When that happens which will hopefully be soon I will make a test kernel available. Thanks, Cathy (In reply to Tom Söderlund from comment #84) > (In reply to paulf123 from comment #73) > > What are people doing to work around this in production? > > If the problem is triggered by snapshot/checkpoint based backup, another > remedy could be to use agent based backup instead until the problem is > solved? Thanks yes that is what we will do. It's been so long I forgot all about LVM abilities :) *** Bug 1510709 has been marked as a duplicate of this bug. *** @nikhil @milan @afox Here is another test build for the customers to try. It has done well in our testing of VSS and multipath failover so far. Please inform them that this is a test build only and is not intended for production use. https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825 Thanks, Cathy *** Bug 1500887 has been marked as a duplicate of this bug. *** (In reply to Cathy Avery from comment #90) > @nikhil @milan @afox > > Here is another test build for the customers to try. It has done well in our > testing of VSS and multipath failover so far. > > Please inform them that this is a test build only and is not intended for > production use. > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825 > > Thanks, > > Cathy Hello Cathy, I have provided the test kernel to couple of customers. As soon as I receive any inputs from any of them, I shall report back the same here. Many thanks, Milan. (In reply to Cathy Avery from comment #90) > @nikhil @milan @afox > > Here is another test build for the customers to try. It has done well in our > testing of VSS and multipath failover so far. > > Please inform them that this is a test build only and is not intended for > production use. > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825 > > Thanks, > > Cathy Hi. This is internally accessible only so I cannot obtain for testing? Thanks. (In reply to paulf123 from comment #93) > (In reply to Cathy Avery from comment #90) > > @nikhil @milan @afox > > > > Here is another test build for the customers to try. It has done well in our > > testing of VSS and multipath failover so far. > > > > Please inform them that this is a test build only and is not intended for > > production use. > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825 > > > > Thanks, > > > > Cathy > > > Hi. This is internally accessible only so I cannot obtain for testing? > Thanks. I can post it to where you should be able to access it. Is x86_64 OK? (In reply to Cathy Avery from comment #94) > (In reply to paulf123 from comment #93) > > (In reply to Cathy Avery from comment #90) > > > @nikhil @milan @afox > > > > > > Here is another test build for the customers to try. It has done well in our > > > testing of VSS and multipath failover so far. > > > > > > Please inform them that this is a test build only and is not intended for > > > production use. > > > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825 > > > > > > Thanks, > > > > > > Cathy > > > > > > Hi. This is internally accessible only so I cannot obtain for testing? > > Thanks. > > I can post it to where you should be able to access it. Is x86_64 OK? Disregard the x86_64 question. @nikhil @milan @afox Could you make the brew rpm available to paulf123? I'm having trouble copying to my people account. Thanks! (In reply to Cathy Avery from comment #96) > @nikhil @milan @afox > > Could you make the brew rpm available to paulf123? > > I'm having trouble copying to my people account. > > Thanks! OK never mind I got it working. @paulf123 Here is the rpm to install http://people.redhat.com/cavery/kernel-3.10.0-693.12.1.el7_BZ_1502601_3.x86_64.rpm (In reply to Cathy Avery from comment #97) > (In reply to Cathy Avery from comment #96) > > @nikhil @milan @afox > > > > Could you make the brew rpm available to paulf123? > > > > I'm having trouble copying to my people account. > > > > Thanks! > > OK never mind I got it working. > > @paulf123 > > Here is the rpm to install > > http://people.redhat.com/cavery/kernel-3.10.0-693.12.1.el7_BZ_1502601_3. > x86_64.rpm Thanks. Still getting cache data failed on a non existing controller + device(sdb) but not every time VSS snapshot is invoked. Also has not hung kworker (yet) but this machine isn't doing much/no load unlike production boxes. Here are a few iterations of VSS snapshots. First error occurs Dec 8 23:37:16 Dec 8 23:20:49 crashycrashy chronyd[653]: Selected source 198.50.139.209 Dec 8 23:20:49 crashycrashy chronyd[653]: System clock wrong by 159.944110 seconds, adjustment started Dec 8 23:20:49 crashycrashy chronyd[653]: System clock was stepped by 159.944110 seconds Dec 8 23:20:49 crashycrashy systemd: Time has been changed Dec 8 23:21:11 crashycrashy systemd: Created slice User Slice of root. Dec 8 23:21:11 crashycrashy systemd: Starting User Slice of root. Dec 8 23:21:11 crashycrashy systemd: Started Session 1 of user root. Dec 8 23:21:11 crashycrashy systemd-logind: New session 1 of user root. Dec 8 23:21:11 crashycrashy systemd: Starting Session 1 of user root. Dec 8 23:23:00 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:09 crashycrashy chronyd[653]: Source 159.203.8.72 replaced with 206.108.0.132 Dec 8 23:23:19 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:21 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:21 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:25 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:25 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Dec 8 23:20:45 crashycrashy systemd: Time has been changed Dec 8 23:20:45 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded Dec 8 23:20:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:20:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:20:45 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:21:26 crashycrashy chronyd[653]: Backward time jump detected! Dec 8 23:21:26 crashycrashy chronyd[653]: Can't synchronise: no selectable sources Dec 8 23:21:32 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:21:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:21:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:21:45 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:22:42 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:22:59 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:01 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:01 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:05 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:23:05 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Dec 8 23:23:05 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded Dec 8 23:23:05 crashycrashy systemd: Time has been changed Dec 8 23:23:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:23:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:23:05 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:23:34 crashycrashy chronyd[653]: Selected source 198.50.139.209 Dec 8 23:23:34 crashycrashy chronyd[653]: System clock wrong by 159.958641 seconds, adjustment started Dec 8 23:23:53 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:24:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:24:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:24:05 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:28:35 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:29:52 crashycrashy chronyd[653]: Selected source 198.50.139.209 Dec 8 23:29:52 crashycrashy chronyd[653]: System clock wrong by 159.351515 seconds, adjustment started Dec 8 23:32:05 crashycrashy systemd: Created slice User Slice of vulnscan. Dec 8 23:32:05 crashycrashy systemd: Starting User Slice of vulnscan. Dec 8 23:32:05 crashycrashy systemd: Started Session 2 of user vulnscan. Dec 8 23:32:05 crashycrashy systemd-logind: New session 2 of user vulnscan. Dec 8 23:32:05 crashycrashy systemd: Starting Session 2 of user vulnscan. Dec 8 23:32:46 crashycrashy systemd: Starting Cleanup of Temporary Directories...Dec 8 23:32:46 crashycrashy systemd: Started Cleanup of Temporary Directories. Dec 8 23:34:08 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:34:28 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:34:30 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:34:30 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:34:34 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:34:34 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Dec 8 23:34:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:34:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:34:12 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:34:12 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded Dec 8 23:34:12 crashycrashy systemd: Time has been changed Dec 8 23:34:53 crashycrashy chronyd[653]: Backward time jump detected! Dec 8 23:34:53 crashycrashy chronyd[653]: Can't synchronise: no selectable sources Dec 8 23:35:03 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:35:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:35:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:35:12 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:36:54 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:37:02 crashycrashy chronyd[653]: Selected source 206.108.0.132 Dec 8 23:37:02 crashycrashy chronyd[653]: System clock wrong by 156.812611 seconds, adjustment started Dec 8 23:37:05 crashycrashy chronyd[653]: Selected source 149.56.27.12 Dec 8 23:37:14 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:37:16 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:37:17 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:37:21 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:37:21 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Dec 8 23:37:16 crashycrashy systemd: Time has been changed Dec 8 23:37:16 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded Dec 8 23:37:16 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:37:16 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:37:16 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:37:16 crashycrashy kernel: scsi 3:0:1:0: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 4 Dec 8 23:37:16 crashycrashy kernel: sd 3:0:1:0: Attached scsi generic sg2 type 0 Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available. Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available. Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Write Protect is off Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Asking for cache data failed Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Assuming drive cache: write through Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available. Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available. Dec 8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Attached SCSI disk Dec 8 23:38:08 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:38:09 crashycrashy chronyd[653]: Can't synchronise: no selectable sources Dec 8 23:38:17 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:38:17 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:38:17 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:39:46 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:40:06 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:40:09 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:40:09 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:40:13 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:40:14 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Dec 8 23:40:00 crashycrashy systemd: Time has been changed Dec 8 23:40:00 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded Dec 8 23:40:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:40:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:40:00 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:40:02 crashycrashy chronyd[653]: Backward time jump detected! Dec 8 23:40:46 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:41:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:41:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:41:00 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:42:11 crashycrashy chronyd[653]: Selected source 206.108.0.132 Dec 8 23:42:11 crashycrashy chronyd[653]: System clock wrong by 159.800448 seconds, adjustment started Dec 8 23:42:14 crashycrashy chronyd[653]: Selected source 149.56.27.12 Dec 8 23:43:35 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:43:55 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:43:57 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:43:58 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:44:02 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:44:02 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Dec 8 23:43:53 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:43:53 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:43:53 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Dec 8 23:43:53 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded Dec 8 23:43:54 crashycrashy systemd: Time has been changed Dec 8 23:44:23 crashycrashy chronyd[653]: Can't synchronise: no selectable sources Dec 8 23:44:40 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Dec 8 23:44:54 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Dec 8 23:44:54 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Dec 8 23:44:54 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Hi. Any update on this? Thanks! (In reply to paulf123 from comment #105) > Hi. Any update on this? Thanks! The fix has been pushed to 4.15 upstream. It will be part of RHEL 7.5 and most likely added to a RHEL 7.4 errata. Thanks for your help, Cathy Cathy is there an ETA on that roll out of 7.5? (In reply to Leo White from comment #107) > Cathy is there an ETA on that roll out of 7.5? Thanks Cathy or even when 4.15 will be avail kernel-ml? (In reply to paulf123 from comment #109) > (In reply to Leo White from comment #107) > > Cathy is there an ETA on that roll out of 7.5? > > Thanks Cathy or even when 4.15 will be avail kernel-ml? I'm not sure of the dates. I'll try to find out. For those looking for release dates I unfortunately don't have that info. It looks like the best option for status updates is to contact your Red Hat support person. I've tested ML 4.15 RC7 under heavy load. So far there are still seeing warnings about non existent controllers/drives per logs bellow and now there are some additional I/O errors being thrown (again sdc, sdd which are not actually present.) The good news is there doesn't appear to be any hanging of the OS which is great! Jan 9 20:00:09 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:00:11 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:00:11 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:00:25 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:00:25 ouch1 journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Jan 9 20:00:24 ouch1 systemd: Time has been changed Jan 9 20:00:24 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:00:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:00:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:00:24 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:00:24 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:00:24 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:00:24 ouch1 journal: Hyper-V VSS: VSS: op=THAW: succeeded Jan 9 20:00:24 ouch1 kernel: scsi 3:0:1:1: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 4 Jan 9 20:00:24 ouch1 kernel: sd 3:0:1:1: Attached scsi generic sg2 type 0 Jan 9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB) Jan 9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] 4096-byte physical blocks Jan 9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] Write Protect is off Jan 9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Jan 9 20:00:24 ouch1 kernel: sdc: sdc1 sdc2 Jan 9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] Attached SCSI disk Jan 9 20:00:24 ouch1 systemd: Starting LVM2 PV scan on device 8:34... Jan 9 20:00:24 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found. Jan 9 20:00:24 ouch1 lvm: WARNING: Autoactivation reading from disk instead of lvmetad. Jan 9 20:00:25 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY on /dev/sdc2 was already found on /dev/sda2. Jan 9 20:00:25 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY prefers device /dev/sda2 because device is used by LV. Jan 9 20:00:25 ouch1 lvm: 1 logical volume(s) in volume group "MYDATA" now active Jan 9 20:00:25 ouch1 lvm: 2 logical volume(s) in volume group "CL" now active Jan 9 20:00:25 ouch1 systemd: Started LVM2 PV scan on device 8:34. Jan 9 20:00:25 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronizing SCSI cache Jan 9 20:00:25 ouch1 systemd: Stopping LVM2 PV scan on device 8:34... Jan 9 20:00:25 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Jan 9 20:00:25 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found. Jan 9 20:00:25 ouch1 systemd: Stopped LVM2 PV scan on device 8:34. Jan 9 20:00:41 ouch1 chronyd[786]: Selected source 99.224.6.96 Jan 9 20:00:41 ouch1 chronyd[786]: System clock wrong by 156.713638 seconds, adjustment started Jan 9 20:00:43 ouch1 chronyd[786]: Selected source 149.56.121.19 Jan 9 20:00:56 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:01:01 ouch1 systemd: Started Session 32 of user root. Jan 9 20:01:01 ouch1 systemd: Starting Session 32 of user root. Jan 9 20:01:23 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:01:46 ouch1 chronyd[786]: Selected source 99.224.6.96 Jan 9 20:01:46 ouch1 chronyd[786]: Source 99.224.6.96 replaced with 144.217.181.221 Jan 9 20:01:47 ouch1 chronyd[786]: Selected source 149.56.121.19 Jan 9 20:01:49 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:04:44 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:05:04 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:05:15 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:05:16 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:05:16 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:05:16 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:05:21 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:05:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:05:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:05:24 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:05:56 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:06:16 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:07:10 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:07:29 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:07:58 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:08:24 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:08:26 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:08:26 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:08:41 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:08:41 ouch1 journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Jan 9 20:08:04 ouch1 systemd: Time has been changed Jan 9 20:08:04 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:08:04 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:08:04 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:08:04 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:08:04 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:08:04 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:08:04 ouch1 journal: Hyper-V VSS: VSS: op=THAW: succeeded Jan 9 20:08:05 ouch1 kernel: scsi 3:0:1:1: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 4 Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: Attached scsi generic sg2 type 0 Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB) Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] 4096-byte physical blocks Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Write Protect is off Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Jan 9 20:08:05 ouch1 kernel: sdc: sdc1 sdc2 Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Attached SCSI disk Jan 9 20:08:05 ouch1 systemd: Starting LVM2 PV scan on device 8:34... Jan 9 20:08:05 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found. Jan 9 20:08:05 ouch1 lvm: WARNING: Autoactivation reading from disk instead of lvmetad. Jan 9 20:08:05 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY on /dev/sdc2 was already found on /dev/sda2. Jan 9 20:08:05 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY prefers device /dev/sda2 because device is used by LV. Jan 9 20:08:05 ouch1 kernel: scsi 3:0:1:0: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 4 Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: Attached scsi generic sg3 type 0 Jan 9 20:08:05 ouch1 lvm: 1 logical volume(s) in volume group "MYDATA" now active Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available. Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available. Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] 0 512-byte logical blocks: (0 B/0 B) Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] 0-byte physical blocks Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Write Protect is off Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Asking for cache data failed Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Assuming drive cache: write through Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available. Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available. Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Attached SCSI disk Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronizing SCSI cache Jan 9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Jan 9 20:08:05 ouch1 lvm: 2 logical volume(s) in volume group "CL" now active Jan 9 20:08:05 ouch1 systemd: Stopped LVM2 PV scan on device 8:34. Jan 9 20:08:42 ouch1 chronyd[786]: Backward time jump detected! Jan 9 20:08:42 ouch1 chronyd[786]: Can't synchronise: no selectable sources Jan 9 20:09:17 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:09:18 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:09:18 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:09:18 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:09:18 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:09:18 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:09:18 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:10:02 ouch1 systemd: Started Session 33 of user root. Jan 9 20:10:02 ouch1 systemd: Starting Session 33 of user root. Jan 9 20:10:36 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:10:51 ouch1 chronyd[786]: Selected source 149.56.121.19 Jan 9 20:10:51 ouch1 chronyd[786]: System clock wrong by 154.429252 seconds, adjustment started Jan 9 20:11:01 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:11:03 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:11:03 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:11:18 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:11:13 ouch1 journal: Hyper-V VSS: VSS: op=FREEZE: succeeded Jan 9 20:11:13 ouch1 systemd: Time has been changed Jan 9 20:11:13 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:11:13 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:11:13 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:11:13 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:11:13 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:11:13 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:11:13 ouch1 journal: Hyper-V VSS: VSS: op=THAW: succeeded Jan 9 20:11:14 ouch1 kernel: scsi 3:0:1:1: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 4 Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: Attached scsi generic sg2 type 0 Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB) Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] 4096-byte physical blocks Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Write Protect is off Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Jan 9 20:11:14 ouch1 kernel: sdc: sdc1 sdc2 Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Attached SCSI disk Jan 9 20:11:14 ouch1 systemd: Starting LVM2 PV scan on device 8:34... Jan 9 20:11:14 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found. Jan 9 20:11:14 ouch1 lvm: WARNING: Autoactivation reading from disk instead of lvmetad. Jan 9 20:11:14 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY on /dev/sdc2 was already found on /dev/sda2. Jan 9 20:11:14 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY prefers device /dev/sda2 because device is used by LV. Jan 9 20:11:14 ouch1 lvm: 1 logical volume(s) in volume group "MYDATA" now active Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronizing SCSI cache Jan 9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Jan 9 20:11:14 ouch1 lvm: 2 logical volume(s) in volume group "CL" now active Jan 9 20:11:14 ouch1 systemd: Stopped LVM2 PV scan on device 8:34. Jan 9 20:11:55 ouch1 chronyd[786]: Selected source 158.69.125.231 Jan 9 20:11:59 ouch1 chronyd[786]: Can't synchronise: no selectable sources Jan 9 20:15:28 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:15:30 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:15:30 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:15:30 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:15:37 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current] Jan 9 20:15:37 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition Jan 9 20:15:37 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa Jan 9 20:17:02 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP Jan 9 20:17:18 ouch1 chronyd[786]: Selected source 149.56.121.19 Jan 9 20:17:18 ouch1 chronyd[786]: System clock wrong by 4.618358 seconds, adjustment started The I/O failure messages on /sdc and /sdd are expected. With the patch, storvsc may return I/O errors (when we figure out they are actually not present) when their partitions are being probed. Patch(es) committed on kernel repository and an interim kernel build is undergoing testing Patch(es) available on kernel-3.10.0-832.el7 Verified with kernel 3.10.0-843.el7 Tested with steps in comment #26 on WS2016/WS2012R2/WS2012, no crash/error observed. Backup is able to complete while continuously adding/removing disks to the vm. Tested FC multipath failover on WS2016/WS2012R2, no error observed, failover works correctly under io load. We have the same problem here. I've just updated from 3.10.0-693.11.1.el7.x86_64to 3.10.0-693.21.1.el7.x86_64. Does it include the patch ? Thanks Hi Amaury, (In reply to Amaury De Ganseman from comment #142) > We have the same problem here. I've just updated from > 3.10.0-693.11.1.el7.x86_64to 3.10.0-693.21.1.el7.x86_64. > > Does it include the patch ? Yes. This was fixed for RHEL-7.4.z in kernel-3.10.0-693.21.1.el7 - Errata RHSA-2018:0395: https://access.redhat.com/errata/RHSA-2018:0395 Marc Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1062 |