Bug 1502601

Summary:

[Hyper-V][RHEL7.4] hang when thaw on microsoft hyper-v

Product:

Red Hat Enterprise Linux 7

Reporter:

nikhil kshirsagar <nkshirsa>

Component:

kernel

Assignee:

Cathy Avery <cavery>

kernel sub component:

Hyper-V

QA Contact:

xuli <xuli>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

urgent

CC:

afox, ailan, akarvi, amaury.deganseman, anazmy, boyang, cavery, decui, dhoward, hhei, jjarvis, jopoulso, kys, ldu, leiwang, leo.white, loberman, longli, mgandhi, mitchell.doedens, mmilgram, mtessun, nkshirsa, nmurray, orendorf, paulf123, pbuchan, pdhamdhe, qzheng, rdutta, salmy, schnederle, stanislav.moravec, subhat, tom.soderlund, toneata, toracat, vkuznets, xialiu, xiaofwan, xuli, yacao

Version:

7.4

Keywords:

ZStream

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-3.10.0-832.el7

Doc Type:

Bug Fix

Doc Text:

Cause: Improper handling of some error codes in Hyper-V storvsc driver was in some cases causing system lockup after hot backup. Consequence: It was possible to observe block subsystem getting stuck after performing hot backup of Hyper-V Virtual Machine. Fix: Upstream patches fixing improper error codes handling were applied to RHEL kernel. Result: Hot backup of RHEL Virtual Machines running on Hyper-V should not cause system lockup now.

Story Points:

---

Clone Of:

Clones:

1536978 (view as bug list)

Environment:

Last Closed:

2018-04-10 22:32:38 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1420851, 1536978

Attachments:

Description	Flags
dmesg logs from vmcore	none

Comment 3 nikhil kshirsagar 2017-10-16 10:16:16 UTC

sde appears as cancelled device

crash> scsishow -d

===========================================================================================================================================================================
HOST      DRIVER
NAME      NAME                               Scsi_Host                shost_data               &.hostdata[0]           
---------------------------------------------------------------------------------------------------------------
host0     storvsc_host_t                     ffff8800f24eb000                        0         ffff8800f24eb740

DEV NAME          scsi_device             H:C:T:L          VENDOR/MODEL              DEVICE STATE             WRITABLE       IOREQ-CNT  IODONE-CNT               IOERR-CNT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sda               ffff8805e8600000        0:0:0:0          Msft     Virtual Disk     SDEV_RUNNING                    1             957         957  (  0)	         0
sr0               ffff8800f24eb800        0:0:0:1          Msft     Virtual DVD-ROM  SDEV_RUNNING                    1           15433       15433  (  0)	        29

===========================================================================================================================================================================
HOST      DRIVER
NAME      NAME                               Scsi_Host                shost_data               &.hostdata[0]           
---------------------------------------------------------------------------------------------------------------
host1     storvsc_host_t                     ffff8805ebdc7800                        0         ffff8805ebdc7f40

DEV NAME          scsi_device             H:C:T:L          VENDOR/MODEL              DEVICE STATE             WRITABLE       IOREQ-CNT  IODONE-CNT               IOERR-CNT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sdb               ffff8800f24ec000        1:0:0:0          Msft     Virtual Disk     SDEV_RUNNING                    1          152073      152073  (  0)	         0

===========================================================================================================================================================================
HOST      DRIVER
NAME      NAME                               Scsi_Host                shost_data               &.hostdata[0]           
---------------------------------------------------------------------------------------------------------------
host2     storvsc_host_t                     ffff8805e8602800                        0         ffff8805e8602f40

DEV NAME          scsi_device             H:C:T:L          VENDOR/MODEL              DEVICE STATE             WRITABLE       IOREQ-CNT  IODONE-CNT               IOERR-CNT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sdc               ffff8805e8603800        2:0:0:0          Msft     Virtual Disk     SDEV_RUNNING                    1           21423       21423  (  0)	         0
sdd               ffff8805e8604000        2:0:0:1          Msft     Virtual Disk     SDEV_RUNNING                    1             306         306  (  0)	         0
sde               ffff8800f24e4800        2:0:1:0          Msft     Virtual Disk     SDEV_CANCEL                     1             202         202  (  0)	         0                                            <------------- SDEV_CANCEL   
sdf               ffff8805ee7f5000        2:0:1:1          Msft     Virtual Disk     SDEV_RUNNING                    1         2044485     2044484  (  1)	         0

 ** Execution took   1.32s (real)   1.29s (CPU), Child processes:   0.04s
crash>

Comment 4 Cathy Avery 2017-10-16 13:47:53 UTC

@long,

This looks like your recent VSS backup bug. Is it a problem for both WS2012 and WS2016.

Cathy

Comment 5 nikhil kshirsagar 2017-10-16 14:16:04 UTC

Are the private comments visible to long ? Shall we make them public?

Comment 6 longli 2017-10-16 18:11:50 UTC

@cathy,

Does it repro with your patch on storvsc_remove_lun? I think that might fix it.

Long

Comment 7 Cathy Avery 2017-10-16 18:50:55 UTC

(In reply to longli from comment #6)
> @cathy,
> 
> Does it repro with your patch on storvsc_remove_lun? I think that might fix
> it.
> 
> Long

We haven't reproduced this yet so I'll let you know.

Comment 8 nikhil kshirsagar 2017-10-17 03:35:19 UTC

Note that the storsvc or remove_scsi_device wasn't see in the hung task at all in the vmcore. All we had was this hung task triggering the panice,

[31236.253319] scsi host2: scsi_prep_async_scan called twice
[31236.265530] scsi host2: scsi_prep_async_scan called twice
[31236.273435] Dev sde: unable to read RDB block 0
[31236.273455]  sde: unable to read partition table
[31236.276435] sd 2:0:1:0: [sde] Sector size 0 reported, assuming 512.
[31236.276885] sd 2:0:1:0: [sde] Attached SCSI disk
[31440.245172] INFO: task sshd:71717 blocked for more than 120 seconds.
[31440.245239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[31440.245321] sshd            D ffff880572f1bc00     0 71717  71694 0x00000080
[31440.245330]  ffff8803472ff7c0 0000000000000086 ffff8805476d0fd0 ffff8803472fffd8
[31440.245336]  ffff8803472fffd8 ffff8803472fffd8 ffff8805476d0fd0 ffff8803472ff910
[31440.245340]  7fffffffffffffff ffff8803472ff908 ffff8805476d0fd0 ffff880572f1bc00
[31440.245345] Call Trace:
[31440.245358]  [<ffffffff816a94e9>] schedule+0x29/0x70
[31440.245366]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
[31440.245376]  [<ffffffff810ce54e>] ? dequeue_task_fair+0x41e/0x660
[31440.245381]  [<ffffffff810c4593>] ? try_to_wake_up+0x183/0x340
[31440.245386]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
[31440.245390]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
[31440.245398]  [<ffffffff810a987d>] flush_work+0xfd/0x190
[31440.245403]  [<ffffffff810a5df0>] ? move_linked_works+0x90/0x90
[31440.245411]  [<ffffffff813fbf99>] tty_flush_to_ldisc+0x29/0x30
[31440.245416]  [<ffffffff813f724f>] n_tty_poll+0x8f/0x1d0
[31440.245420]  [<ffffffff813f30d2>] tty_poll+0x72/0x90
[31440.245427]  [<ffffffff81216585>] do_select+0x3a5/0x7c0
[31440.245435]  [<ffffffff815cfacf>] ? ip_finish_output+0x29f/0x780
[31440.245440]  [<ffffffff815d02b3>] ? ip_output+0x73/0xe0
[31440.245446]  [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150
[31440.245451]  [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150
[31440.245456]  [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150
[31440.245461]  [<ffffffff81215e50>] ? poll_select_copy_remaining+0x150/0x150
[31440.245466]  [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20
[31440.245472]  [<ffffffff8156e998>] ? release_sock+0x118/0x170
[31440.245477]  [<ffffffff815db126>] ? tcp_sendmsg+0xd6/0xc30
[31440.245486]  [<ffffffff816068b3>] ? inet_sendmsg+0x63/0xb0
[31440.245492]  [<ffffffff812b50c3>] ? selinux_socket_sendmsg+0x23/0x30
[31440.245497]  [<ffffffff81216b7b>] core_sys_select+0x1db/0x300
[31440.245505]  [<ffffffff8120027d>] ? do_sync_write+0x8d/0xd0
[31440.245512]  [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0
[31440.245518]  [<ffffffff81216d5a>] SyS_select+0xba/0x110
[31440.245524]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
[31440.245529] sending NMI to all CPUs:

Comment 11 nikhil kshirsagar 2017-10-17 09:52:12 UTC

This does NOT reproduce for the customer with the RHEL 7.3 kernel

Comment 12 Cathy Avery 2017-10-17 13:21:40 UTC

(In reply to nikhil from comment #11)
> This does NOT reproduce for the customer with the RHEL 7.3 kernel

My patch addresses a problem originally reported on RHEL 7.2. Once we get this bug reproduced I can see if there is a change to RHEL 7.4 that accounts for the issue.

Comment 13 Leo White 2017-10-18 17:56:51 UTC

So is downgrading the Kernel the only workaround to fixing this issue at the moment?

Comment 26 xuli 2017-10-20 07:12:43 UTC

Have reproduced this issue on the kernel 3.10.0-693.2.2.el7.x86_64 Gen2 VM with 2012R2 hyper-v.
Reproduce steps: 

1) execute the hot add/remove scsi by following script. This script continuously adds two disks, then removes two disks.

2) Do backup from host side by powershell during the script running.

wbadmin start backup -backupTarget:Q:\ -hyperv:vmName

3) Observe that Online backup succeeded, but check the VM, it is hung and no IP address available, heartbeat shows as OK. After wait for a while, hung_task_timeout_secs about kworker pops up.

===========================================================

$VMName = Read-Host "Specify the VMName to use"
$ComputerName = "localhost"

Date

$maxnum = 2
$numdisk = 0
do {
$DISK = "$DPATH\b1273723_$numdisk.vhdx"
$EXIST = Test-Path $DISK
if ( -not $EXIST )
{
New-VHD -Path $DISK -Size 1024MB -ComputerName $ComputerName -Dynamic
}
$numdisk++
} while ($numdisk -lt $maxnum)



while ( 1 -ne 0 )
{
$DISK = "$DPATH\b1273723_0.vhdx"
Add-VMHardDiskDrive -VMName $VMName -ComputerName $ComputerName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 2 -Path $DISK
$DISK = "$DPATH\b1273723_1.vhdx"
Add-VMHardDiskDrive -VMName $VMName -ComputerName $ComputerName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 3 -Path $DISK

Remove-VMHardDiskDrive -vmName $VMName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 2 -ComputerName $ComputerName
Remove-VMHardDiskDrive -vmName $VMName -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 3 -ComputerName $ComputerName
}
=============================================================

Comment 32 xuli 2017-10-23 08:44:59 UTC

After test the kernel build, I suspect kernel 3.10.0-622.el7 which induces this hang issue if do backup during scsi disk hot add/remove . (this hang issue reproduces on kernel 3.10.0-622.el7, but cannot reproduce on kernel 3.10.0-621.el7 by following comment 26 steps)

Also kernel 3.10.0-622.el7 with patch:
 [scsi] storvsc: properly set residual data length on errors [1364282], which is related with 
Bug 1364282 - [Hyper-V][RHEL-7.2]Data corruption observed on RHEL 7.2 VM running on Hyper-V Win2k12R2 and https://bugzilla.redhat.com/show_bug.cgi?id=1500887.  ([Hyper-V][RHEL-7.2] All MP paths failing to a device with heavy IO can cause the VM to hang. - found during verify 1364282) 

Thank you.

Comment 33 Cathy Avery 2017-10-23 11:33:31 UTC

(In reply to xuli from comment #32)
> After test the kernel build, I suspect kernel 3.10.0-622.el7 which induces
> this hang issue if do backup during scsi disk hot add/remove . (this hang
> issue reproduces on kernel 3.10.0-622.el7, but cannot reproduce on kernel
> 3.10.0-621.el7 by following comment 26 steps)
> 
> Also kernel 3.10.0-622.el7 with patch:
>  [scsi] storvsc: properly set residual data length on errors [1364282],
> which is related with 
> Bug 1364282 - [Hyper-V][RHEL-7.2]Data corruption observed on RHEL 7.2 VM
> running on Hyper-V Win2k12R2 and
> https://bugzilla.redhat.com/show_bug.cgi?id=1500887.  ([Hyper-V][RHEL-7.2]
> All MP paths failing to a device with heavy IO can cause the VM to hang. -
> found during verify 1364282) 
> 
> Thank you.

That was a regression that was introduced and fixed in 7.4. https://bugzilla.redhat.com/show_bug.cgi?id=1437552

The fix was available by kernel-3.10.0-650.el7 and it is in 3.10.0-693.2.2.el7.x86_64

Cathy

Comment 38 Cathy Avery 2017-10-27 20:23:20 UTC

This appears to be a problem that Microsoft is aware of and is currently working on. It would appear that this bug shares common traits with https://bugzilla.redhat.com/show_bug.cgi?id=1500887. We are working with MS to resolve the issue.

Thanks,

Cathy

Comment 39 Cathy Avery 2017-10-30 10:51:54 UTC

@long

QE is seeing that this patch

commit 40630f462824ee24bc00d692865c86c3828094e0
Author: Long Li <longli>
Date:   Wed Dec 14 18:46:03 2016 -0800

    scsi: storvsc: properly set residual data length on errors

is either causing or exacerbating this backup hang.

commit f1c635b439a5c01776fe3a25b1e2dc546ea82e6f
Author: Stephen Hemminger <stephen>
Date:   Tue Mar 7 09:15:53 2017 -0800

Hyper-V host emulation of SCSI for virtual DVD device reports SCSI
version 0 (UNKNOWN) but is still capable of supporting REPORTLUN.

Was added and verified as fix for the regression it caused
https://bugzilla.redhat.com/show_bug.cgi?id=1437552

However it looks like commit 40630f462824ee24bc00d692865c86c3828094e0 may have other side effects. Is there a reason why the commit is causing this hang directly or has it simply changed the timing so the scsi_lun_remove problem is more obvious.

Thanks,

Cathy

Comment 40 longli 2017-10-30 19:29:16 UTC

I think it's the timing that changes the sequence of scsi_lun_remove being issued. This patch is needed to make sure we report the correct returned data from host so the SCSI doesn't pick up uninitialized data.

I'm concerned on the call trace on sshd, that is a I/O timeout. Does it repro with your patch?

Comment 41 Cathy Avery 2017-10-30 19:57:57 UTC

(In reply to longli from comment #40)
> I think it's the timing that changes the sequence of scsi_lun_remove being
> issued. This patch is needed to make sure we report the correct returned
> data from host so the SCSI doesn't pick up uninitialized data.
> 
> I'm concerned on the call trace on sshd, that is a I/O timeout. Does it
> repro with your patch?

With our combined patches meaning my dedicated work queue plus excessive lun removal and your excessive scan ( minus set_host_byte(scmnd, DID_BAD_TARGET) ) I've had pretty good luck testing. I sent you the excessive removal patch.

Comment 47 Cathy Avery 2017-11-09 13:41:32 UTC

@Long

With my additional patch to limit storvsc_remove_lun calls we have made it past the old hang but are now hitting a new one. The build includes the 2 new patches
that just made it upstream.

[ 2280.017852]  [<ffffffffaf25e5d4>] ? __pm_runtime_use_autosuspend+0x54/0x70
[ 2280.017855]  [<ffffffffc03f52b5>] sd_probe_async+0x115/0x1e0 [sd_mod]
[ 2280.017857]  [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120
[ 2280.017859]  [<ffffffffaeeab58a>] process_one_work+0x17a/0x440
[ 2280.017862]  [<ffffffffaeeac256>] worker_thread+0x126/0x3c0
[ 2280.017864]  [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 2280.017866]  [<ffffffffaeeb352f>] kthread+0xcf/0xe0
[ 2280.017868]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
[ 2280.017870]  [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90
[ 2280.017872]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
[ 2280.017874] INFO: task kworker/u128:1:14524 blocked for more than 120 seconds.
[ 2280.018839] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2280.019771] kworker/u128:1  D ffff8debbfbeaf70     0 14524      2 0x00000080
[ 2280.019777] Workqueue: events_unbound async_run_entry_fn
[ 2280.019778] Call Trace:
[ 2280.019783]  [<ffffffffaf4cda19>] schedule_preempt_disabled+0x29/0x70
[ 2280.019786]  [<ffffffffaf4cbab7>] __mutex_lock_slowpath+0xc7/0x1d0
[ 2280.019788]  [<ffffffffaf4caecf>] mutex_lock+0x1f/0x2f
[ 2280.019791]  [<ffffffffaf286976>] do_scan_async+0x36/0x160
[ 2280.019793]  [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120
[ 2280.019796]  [<ffffffffaeeab58a>] process_one_work+0x17a/0x440
[ 2280.019798]  [<ffffffffaeeac256>] worker_thread+0x126/0x3c0
[ 2280.019801]  [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 2280.019803]  [<ffffffffaeeb352f>] kthread+0xcf/0xe0
[ 2280.019805]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
[ 2280.019808]  [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90
[ 2280.019810]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
[ 2280.019813] INFO: task kworker/u128:5:14806 blocked for more than 120 seconds.
[ 2280.020743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2280.021678] kworker/u128:5  D ffff8deb4342af70     0 14806      2 0x00000080
[ 2280.021687] Workqueue: storvsc_error_wq_0 storvsc_remove_lun [hv_storvsc]
[ 2280.021688] Call Trace:
[ 2280.021691]  [<ffffffffaf4ccaf9>] schedule+0x29/0x70
[ 2280.021694]  [<ffffffffaeebaa25>] async_synchronize_cookie_domain+0x85/0x150
[ 2280.021696]  [<ffffffffaeeb44b0>] ? wake_up_atomic_t+0x30/0x30
[ 2280.021698]  [<ffffffffaeebab08>] async_synchronize_full_domain+0x18/0x20
[ 2280.021701]  [<ffffffffc03f2362>] sd_remove+0x42/0xb0 [sd_mod]
[ 2280.021704]  [<ffffffffaf252bbf>] __device_release_driver+0x7f/0xf0
[ 2280.021706]  [<ffffffffaf252c53>] device_release_driver+0x23/0x30
[ 2280.021708]  [<ffffffffaf252318>] bus_remove_device+0x108/0x180
[ 2280.021712]  [<ffffffffaf24e295>] device_del+0x135/0x210
[ 2280.021714]  [<ffffffffaf288813>] __scsi_remove_device+0xc3/0xd0
[ 2280.021716]  [<ffffffffaf288846>] scsi_remove_device+0x26/0x40
[ 2280.021719]  [<ffffffffc03d4420>] storvsc_remove_lun+0x40/0x60 [hv_storvsc]
[ 2280.021722]  [<ffffffffaeeab58a>] process_one_work+0x17a/0x440
[ 2280.021724]  [<ffffffffaeeac256>] worker_thread+0x126/0x3c0
[ 2280.021726]  [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 2280.021728]  [<ffffffffaeeb352f>] kthread+0xcf/0xe0
[ 2280.021731]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
[ 2280.021733]  [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90
[ 2280.021735]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40

Comment 54 xuli 2017-11-13 05:51:05 UTC

(In reply to Cathy Avery from comment #51)
> (In reply to xuli from comment #44)
> 
> I've added vmlinuz-4.14.0-rc8lun_work_item+ to the VM at 10.73.199.37.
> 
> This is the upstream kernel with the same patches as
> vmlinuz-3.10.0-772.el7_scsi_test3.x86_64 that you used to uncover yet
> another hang.
> 
> Could you see if this build hangs in the same way as well?
> 
> Thanks!

Hi Cathy,

Notice that the test build vmlinuz-4.14.0-rc8lun_work_item+ gets the "hung_task_timeout_secs" when execute the same steps as before (need to keep running add/remove disk and only sleep 5 seconds during add/remove, then do backup, after backup successfully,still keep running add/remove disk, after wait for a while, it gets hung_task_timeout_secs at 10.73.199.37).

 1143.040625]  sdg: unable to read partition table
[ 1143.040637] sdg: partition table beyond EOD, enabling native capacity
[ 1143.041254] sd 0:0:1:2: [sdg] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[ 1143.041257] sd 0:0:1:2: [sdg] 4096-byte physical blocks
[ 1143.041406] sd 0:0:1:2: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1143.041411] sdg: detected capacity change from 512 to 1073741824
[ 1143.051274] sd 0:0:1:2: [sdg] Attached SCSI disk
[ 1226.720046] INFO: task kworker/u128:0:5 blocked for more than 120 seconds.
[ 1226.720689]       Not tainted 4.14.0-rc8lun_work_item+ #3
[ 1226.721278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1226.721884] kworker/u128:0  D    0     5      2 0x80000000
[ 1226.721891] Workqueue: events_unbound async_run_entry_fn
[ 1226.721893] Call Trace:
[ 1226.721899]  __schedule+0x2cc/0x7a0
[ 1226.721902]  ? __add_to_page_cache_locked+0x101/0x160
[ 1226.721903]  schedule+0x31/0x80
[ 1226.721906]  io_schedule+0x11/0x40
[ 1226.721907]  do_read_cache_page+0x42d/0x520
[ 1226.721910]  ? blkdev_writepages+0x10/0x10
[ 1226.721913]  ? __blk_put_request+0xbf/0x190
[ 1226.721914]  ? page_cache_tree_insert+0xb0/0xb0
[ 1226.721916]  read_cache_page+0x10/0x20
[ 1226.721919]  read_dev_sector+0x28/0xa0
[ 1226.721921]  read_lba+0x162/0x230
[ 1226.721923]  efi_partition+0xfc/0x880
[ 1226.721925]  ? string+0x63/0x70
[ 1226.721926]  ? vsnprintf+0x255/0x490
[ 1226.721928]  ? snprintf+0x45/0x70
[ 1226.721930]  ? compare_gpts+0x280/0x280
[ 1226.721931]  check_partition+0x138/0x210
[ 1226.721933]  ? check_partition+0x138/0x210
[ 1226.721934]  rescan_partitions+0xab/0x310
[ 1226.721937]  ? down_write+0xd/0x40
[ 1226.721938]  __blkdev_get+0x2d4/0x430
[ 1226.721940]  blkdev_get+0x11f/0x300
[ 1226.721943]  ? unlock_new_inode+0x43/0x60
[ 1226.721944]  ? bdget+0x112/0x150
[ 1226.721946]  device_add_disk+0x3c5/0x420
[ 1226.721950]  ? update_autosuspend+0x4c/0x50
[ 1226.721951]  ? __pm_runtime_use_autosuspend+0x57/0x70
[ 1226.721954]  sd_probe_async+0x103/0x1d0
[ 1226.721956]  async_run_entry_fn+0x33/0x160
[ 1226.721958]  process_one_work+0x164/0x380
[ 1226.721960]  worker_thread+0x49/0x410
[ 1226.721961]  kthread+0x104/0x140
[ 1226.721963]  ? process_one_work+0x380/0x380
[ 1226.721964]  ? kthread_create_on_node+0x60/0x60
[ 1226.721965]  ret_from_fork+0x25/0x30

Thank you so much.

Comment 56 Cathy Avery 2017-11-13 12:17:04 UTC

@Long

Have you been seeing these types of hangs on WS 2016? So far we have only heard of them on WS 2012?

Thanks,

Cathy

Comment 57 longli 2017-11-13 21:18:57 UTC

Yes I have seen similar hung on WS2016. I have fixed it with the DID_TARGET_FAILURE in the patch. But that was removed in the final patch. Can you give it a quick test to see if that fixes the problem?

I'm attempting to repro and looking into it.

Comment 58 Cathy Avery 2017-11-14 11:50:34 UTC

(In reply to longli from comment #57)
> Yes I have seen similar hung on WS2016. I have fixed it with the
> DID_TARGET_FAILURE in the patch. But that was removed in the final patch.
> Can you give it a quick test to see if that fixes the problem?
> 
> I'm attempting to repro and looking into it.

I have tested that patch before and it causes multipath to return IO errors on failover. Without the patch multipath fails over with no IO errors.

Thats a problem.

Comment 59 Cathy Avery 2017-11-14 12:08:37 UTC

(In reply to longli from comment #57)
> Yes I have seen similar hung on WS2016. 

I am testing the excessive lun removal patch which works well as we have pushed past that hang except that we have now hit another hang which xuli discovered in her testing. She hot adds and removes the luns while she is backing them up with wbadmin start backup -backupTarget:D:\ -hyperv:VMname.

It looks like the hang is the result of an attempted lun removal slap bang in the middle of its probe. I am looking in to it. I don't believe it is the result of my patch but I can't push it upstream until I know for sure.

Thanks,

Cathy

Comment 61 Cathy Avery 2017-11-15 15:08:19 UTC

(In reply to Milan P. Gandhi from comment #60

OK this is the same crash that xuli just reported. This is the latest one uncovered in comment 54 and 59. It is easily reproduced.

Long could you look at this as well? In the customers case they were running just the upstream patches. xuli reproduced it here with the upstream and the limiting lun removal patch.

Thanks,

Cathy

Comment 62 Cathy Avery 2017-11-15 18:06:51 UTC

(In reply to Milan P. Gandhi from comment #60)

Hi,

Do you have the messages file as well?

Thanks,

Cathy

Comment 63 longli 2017-11-15 19:59:44 UTC

I'm looking at this.

I think there may be another bug in upper layer SCSI. If you look at the SCSI probe code at scsi_probe_lun(), it calls scsi_execute_req but doesn't check the resid (the request length - the actual data transferred). At the time when it calls on a non-existent LUN, this call should fail and everything should recover. My patch with DID_TARGET_FAILURE deals with this situation (but unfortunately it breaks your FC tests), it makes this call fail by setting an error in host byte.

I'm still struggling with replicating your FC tests, can you share the test setup details? Shoot me an email with the QA on your team, we can discuss how to repro it.

Comment 64 longli 2017-11-16 04:22:14 UTC

@Cathy,

Can you post the complete kernel trace on this hung? I want to see what mutex (or bit_wait_io?) sd_probe_async is waiting at the beginning the of trace.

I don't have your 3rd patch. Please send it to me I'll try to patch it along upstream patches, and repro in our lab.

(In reply to Cathy Avery from comment #47)
> @Long
> 
> With my additional patch to limit storvsc_remove_lun calls we have made it
> past the old hang but are now hitting a new one. The build includes the 2
> new patches
> that just made it upstream.
> 
> [ 2280.017852]  [<ffffffffaf25e5d4>] ? __pm_runtime_use_autosuspend+0x54/0x70
> [ 2280.017855]  [<ffffffffc03f52b5>] sd_probe_async+0x115/0x1e0 [sd_mod]
> [ 2280.017857]  [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120
> [ 2280.017859]  [<ffffffffaeeab58a>] process_one_work+0x17a/0x440
> [ 2280.017862]  [<ffffffffaeeac256>] worker_thread+0x126/0x3c0
> [ 2280.017864]  [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0
> [ 2280.017866]  [<ffffffffaeeb352f>] kthread+0xcf/0xe0
> [ 2280.017868]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
> [ 2280.017870]  [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90
> [ 2280.017872]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
> [ 2280.017874] INFO: task kworker/u128:1:14524 blocked for more than 120
> seconds.
> [ 2280.018839] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2280.019771] kworker/u128:1  D ffff8debbfbeaf70     0 14524      2
> 0x00000080
> [ 2280.019777] Workqueue: events_unbound async_run_entry_fn
> [ 2280.019778] Call Trace:
> [ 2280.019783]  [<ffffffffaf4cda19>] schedule_preempt_disabled+0x29/0x70
> [ 2280.019786]  [<ffffffffaf4cbab7>] __mutex_lock_slowpath+0xc7/0x1d0
> [ 2280.019788]  [<ffffffffaf4caecf>] mutex_lock+0x1f/0x2f
> [ 2280.019791]  [<ffffffffaf286976>] do_scan_async+0x36/0x160
> [ 2280.019793]  [<ffffffffaeeba869>] async_run_entry_fn+0x39/0x120
> [ 2280.019796]  [<ffffffffaeeab58a>] process_one_work+0x17a/0x440
> [ 2280.019798]  [<ffffffffaeeac256>] worker_thread+0x126/0x3c0
> [ 2280.019801]  [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0
> [ 2280.019803]  [<ffffffffaeeb352f>] kthread+0xcf/0xe0
> [ 2280.019805]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
> [ 2280.019808]  [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90
> [ 2280.019810]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
> [ 2280.019813] INFO: task kworker/u128:5:14806 blocked for more than 120
> seconds.
> [ 2280.020743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2280.021678] kworker/u128:5  D ffff8deb4342af70     0 14806      2
> 0x00000080
> [ 2280.021687] Workqueue: storvsc_error_wq_0 storvsc_remove_lun [hv_storvsc]
> [ 2280.021688] Call Trace:
> [ 2280.021691]  [<ffffffffaf4ccaf9>] schedule+0x29/0x70
> [ 2280.021694]  [<ffffffffaeebaa25>]
> async_synchronize_cookie_domain+0x85/0x150
> [ 2280.021696]  [<ffffffffaeeb44b0>] ? wake_up_atomic_t+0x30/0x30
> [ 2280.021698]  [<ffffffffaeebab08>] async_synchronize_full_domain+0x18/0x20
> [ 2280.021701]  [<ffffffffc03f2362>] sd_remove+0x42/0xb0 [sd_mod]
> [ 2280.021704]  [<ffffffffaf252bbf>] __device_release_driver+0x7f/0xf0
> [ 2280.021706]  [<ffffffffaf252c53>] device_release_driver+0x23/0x30
> [ 2280.021708]  [<ffffffffaf252318>] bus_remove_device+0x108/0x180
> [ 2280.021712]  [<ffffffffaf24e295>] device_del+0x135/0x210
> [ 2280.021714]  [<ffffffffaf288813>] __scsi_remove_device+0xc3/0xd0
> [ 2280.021716]  [<ffffffffaf288846>] scsi_remove_device+0x26/0x40
> [ 2280.021719]  [<ffffffffc03d4420>] storvsc_remove_lun+0x40/0x60
> [hv_storvsc]
> [ 2280.021722]  [<ffffffffaeeab58a>] process_one_work+0x17a/0x440
> [ 2280.021724]  [<ffffffffaeeac256>] worker_thread+0x126/0x3c0
> [ 2280.021726]  [<ffffffffaeeac130>] ? manage_workers.isra.24+0x2a0/0x2a0
> [ 2280.021728]  [<ffffffffaeeb352f>] kthread+0xcf/0xe0
> [ 2280.021731]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40
> [ 2280.021733]  [<ffffffffaf4d8698>] ret_from_fork+0x58/0x90
> [ 2280.021735]  [<ffffffffaeeb3460>] ? insert_kthread_work+0x40/0x40

Comment 66 Milan P. Gandhi 2017-11-16 04:34:00 UTC

Created attachment 1353233 [details]
dmesg logs from vmcore

Hello,

I had to make comment#60 private since it had customer sensitive data. I have extracted the full dmesg logs from this vmcore captured with test kernel-3.10.0-693.12.1.el7_BZ_1502601.x86_64. Please let me know if you would need any further information.

Many thanks,

Milan.

Comment 68 Cathy Avery 2017-11-16 12:45:21 UTC

(In reply to longli from comment #63)
> I'm looking at this.
> 
> I think there may be another bug in upper layer SCSI. If you look at the
> SCSI probe code at scsi_probe_lun(), it calls scsi_execute_req but doesn't
> check the resid (the request length - the actual data transferred). At the
> time when it calls on a non-existent LUN, this call should fail and
> everything should recover. My patch with DID_TARGET_FAILURE deals with this
> situation (but unfortunately it breaks your FC tests), it makes this call
> fail by setting an error in host byte.
> 
> I'm still struggling with replicating your FC tests, can you share the test
> setup details? Shoot me an email with the QA on your team, we can discuss
> how to repro it.

Yes I think this last hang with devices being probed and removed is the root of all the problems. The IO is probably stuck on a non functioning queue and that is what is causing the backup. Do you think that the resid patch is contributing to this? 

commit 40630f462824ee24bc00d692865c86c3828094e0
Author: Long Li <longli>
Date:   Wed Dec 14 18:46:03 2016 -0800

    scsi: storvsc: properly set residual data length on errors

It may be exposing yet another problem. We were seeing hangs with multipath before that patch went in. So I'm guessing we may have 2 issues in the scsi stack.

Look at comment 26. xuli describes how she reproduces the problem. You may have to run the backup more than once. You should also put a 5 to 10 second sleep in the script between the hot disk adds and removes.  I'll send you my patch so you can get to the probe hang without hitting the other hangs first.

Comment 69 longli 2017-11-16 23:17:24 UTC

I think we need to return DID_TARGET_FAILURE for SRB_STATUS_INVALID_LUN (in storvsc_handle_error). If this error is not populated, the I/O will never finish because data_transfer_length is always 0 in these situations. The sd_probe_async call will get stuck forever for this I/O finish, while holding the async_domain scsi_sd_probe_domain. This will block the subsequent scsi_remove_device (which eventually calls into sd_remove, waiting for scsi_sd_probe_domain to clear), making the I/O impossible to fail on device removal. This creates a deadlock.

Looking at scsi_io_completion, returning an error code (I think DID_TARGET_FAILURE is the best code, good_bytes will be zero in this case) is necessary to eventually finish this I/O. This is the only chance this I/O can be finished.

Can you give it a try to see if it fixes this bug?

I know it's causing problems on your multi-path FC tests, but that may be another issue I'm happy to work on it.

Comment 71 Cathy Avery 2017-11-17 12:31:51 UTC

(In reply to longli from comment #69)


Yes this fix seems to work but you are asking me to replace one problem for another. Have you tested multipath failover with fc?

The other issue I have is the IO does not always get stuck forever. In many cases it completes just fine. Why is that? There is a race going on that we have not addressed and my concern is the fix is a band-aid approach and is not correcting the real problem.

Comment 72 Cathy Avery 2017-11-17 13:17:22 UTC

(In reply to longli from comment #69)

#define DID_TARGET_FAILURE 0x10 /* Permanent target failure, do not retry on
                                 * other paths */

This error looks to be incompatible with multipath.

Comment 73 paulf123 2017-11-17 21:35:25 UTC

Hi folks.

Not sure if this is an appropriate place to ask this sort of question. I certainly appreciate that this is an acknowledged bug that is being worked on but....

My thought is that there must be hundreds of thousands of installations that run on Hyper-V servers and leverage VSS. What are people doing to work around this in production? In our environment we are talking multiple daily outages with anything other than 4 months+ old kernels leaving several CVE issues open. Am I missing something or this not a huge problem?

Thx

Comment 81 paulf123 2017-11-23 21:47:22 UTC

To answer my own question, looks like being that the problem is in the SCSI stack (or whatever you programmer people call it :)) so switching the VM to use IDE instead appears to be an acceptable workaround.

Comment 82 paulf123 2017-11-23 22:04:51 UTC

(In reply to paulf123 from comment #81)
> To answer my own question, looks like being that the problem is in the SCSI
> stack (or whatever you programmer people call it :)) so switching the VM to
> use IDE instead appears to be an acceptable workaround.

Further to this... the SCSI controller itself has to stay otherwise the VM will get paused during snapshot.

An error about cache is still thrown but it it references an non existent drive so I believe this will be non-disruptive.

Comment 83 paulf123 2017-11-24 03:56:32 UTC

(In reply to paulf123 from comment #82)
> (In reply to paulf123 from comment #81)
> > To answer my own question, looks like being that the problem is in the SCSI
> > stack (or whatever you programmer people call it :)) so switching the VM to
> > use IDE instead appears to be an acceptable workaround.
> 
> Further to this... the SCSI controller itself has to stay otherwise the VM
> will get paused during snapshot.
> 
> An error about cache is still thrown but it it references an non existent
> drive so I believe this will be non-disruptive.

and nope.... same problem hung kworker @^#$%@$#!

still don't understand how anyone is using current kernels and not having a huge deal this show stopper problem

Comment 84 Tom Söderlund 2017-11-27 09:11:21 UTC

(In reply to paulf123 from comment #73)
> What are people doing to work around this in production?

If the problem is triggered by snapshot/checkpoint based backup, another remedy could be to use agent based backup instead until the problem is solved?

Comment 85 Cathy Avery 2017-11-27 14:32:26 UTC

This is a problem that exists in upstream kernels as well. This bug is a top priority and we are working closely with Microsoft towards a solution. When that happens which will hopefully be soon I will make a test kernel available.

Thanks,

Cathy

Comment 86 paulf123 2017-11-28 19:45:44 UTC

(In reply to Tom Söderlund from comment #84)
> (In reply to paulf123 from comment #73)
> > What are people doing to work around this in production?
> 
> If the problem is triggered by snapshot/checkpoint based backup, another
> remedy could be to use agent based backup instead until the problem is
> solved?

Thanks yes that is what we will do. It's been so long I forgot all about LVM abilities :)

Comment 87 Cathy Avery 2017-11-29 16:17:17 UTC

*** Bug 1510709 has been marked as a duplicate of this bug. ***

Comment 90 Cathy Avery 2017-12-06 15:38:42 UTC

@nikhil @milan @afox

Here is another test build for the customers to try. It has done well in our testing of VSS and multipath failover so far.

Please inform them that this is a test build only and is not intended for production use.

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825

Thanks,

Cathy

Comment 91 Cathy Avery 2017-12-06 19:13:53 UTC

*** Bug 1500887 has been marked as a duplicate of this bug. ***

Comment 92 Milan P. Gandhi 2017-12-07 05:39:53 UTC

(In reply to Cathy Avery from comment #90)
> @nikhil @milan @afox
> 
> Here is another test build for the customers to try. It has done well in our
> testing of VSS and multipath failover so far.
> 
> Please inform them that this is a test build only and is not intended for
> production use.
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825
> 
> Thanks,
> 
> Cathy

Hello Cathy,

I have provided the test kernel to couple of customers. As soon as I receive any inputs from any of them, I shall report back the same here.

Many thanks,

Milan.

Comment 93 paulf123 2017-12-07 22:13:38 UTC

(In reply to Cathy Avery from comment #90)
> @nikhil @milan @afox
> 
> Here is another test build for the customers to try. It has done well in our
> testing of VSS and multipath failover so far.
> 
> Please inform them that this is a test build only and is not intended for
> production use.
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825
> 
> Thanks,
> 
> Cathy


Hi. This is internally accessible only so I cannot obtain for testing? Thanks.

Comment 94 Cathy Avery 2017-12-08 11:31:51 UTC

(In reply to paulf123 from comment #93)
> (In reply to Cathy Avery from comment #90)
> > @nikhil @milan @afox
> > 
> > Here is another test build for the customers to try. It has done well in our
> > testing of VSS and multipath failover so far.
> > 
> > Please inform them that this is a test build only and is not intended for
> > production use.
> > 
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825
> > 
> > Thanks,
> > 
> > Cathy
> 
> 
> Hi. This is internally accessible only so I cannot obtain for testing?
> Thanks.

I can post it to where you should be able to access it. Is x86_64 OK?

Comment 95 Cathy Avery 2017-12-08 12:02:52 UTC

(In reply to Cathy Avery from comment #94)
> (In reply to paulf123 from comment #93)
> > (In reply to Cathy Avery from comment #90)
> > > @nikhil @milan @afox
> > > 
> > > Here is another test build for the customers to try. It has done well in our
> > > testing of VSS and multipath failover so far.
> > > 
> > > Please inform them that this is a test build only and is not intended for
> > > production use.
> > > 
> > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14687825
> > > 
> > > Thanks,
> > > 
> > > Cathy
> > 
> > 
> > Hi. This is internally accessible only so I cannot obtain for testing?
> > Thanks.
> 
> I can post it to where you should be able to access it. Is x86_64 OK?

Disregard the x86_64 question.

Comment 96 Cathy Avery 2017-12-08 14:15:10 UTC

@nikhil @milan @afox

Could you make the brew rpm available to paulf123?

I'm having trouble copying to my people account.

Thanks!

Comment 97 Cathy Avery 2017-12-08 14:27:07 UTC

(In reply to Cathy Avery from comment #96)
> @nikhil @milan @afox
> 
> Could you make the brew rpm available to paulf123?
> 
> I'm having trouble copying to my people account.
> 
> Thanks!

OK never mind I got it working.

@paulf123 

Here is the rpm to install

http://people.redhat.com/cavery/kernel-3.10.0-693.12.1.el7_BZ_1502601_3.x86_64.rpm

Comment 98 paulf123 2017-12-09 05:50:14 UTC

(In reply to Cathy Avery from comment #97)
> (In reply to Cathy Avery from comment #96)
> > @nikhil @milan @afox
> > 
> > Could you make the brew rpm available to paulf123?
> > 
> > I'm having trouble copying to my people account.
> > 
> > Thanks!
> 
> OK never mind I got it working.
> 
> @paulf123 
> 
> Here is the rpm to install
> 
> http://people.redhat.com/cavery/kernel-3.10.0-693.12.1.el7_BZ_1502601_3.
> x86_64.rpm


Thanks. Still getting cache data failed on a non existing controller + device(sdb) but not every time VSS snapshot is invoked. Also has not hung kworker (yet) but this machine isn't doing much/no load unlike production boxes.

Here are a few iterations of VSS snapshots. First error occurs Dec 8 23:37:16


Dec  8 23:20:49 crashycrashy chronyd[653]: Selected source 198.50.139.209
Dec  8 23:20:49 crashycrashy chronyd[653]: System clock wrong by 159.944110 seconds, adjustment started
Dec  8 23:20:49 crashycrashy chronyd[653]: System clock was stepped by 159.944110 seconds
Dec  8 23:20:49 crashycrashy systemd: Time has been changed
Dec  8 23:21:11 crashycrashy systemd: Created slice User Slice of root.
Dec  8 23:21:11 crashycrashy systemd: Starting User Slice of root.
Dec  8 23:21:11 crashycrashy systemd: Started Session 1 of user root.
Dec  8 23:21:11 crashycrashy systemd-logind: New session 1 of user root.
Dec  8 23:21:11 crashycrashy systemd: Starting Session 1 of user root.
Dec  8 23:23:00 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:09 crashycrashy chronyd[653]: Source 159.203.8.72 replaced with 206.108.0.132
Dec  8 23:23:19 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:21 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:21 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:25 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:25 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Dec  8 23:20:45 crashycrashy systemd: Time has been changed
Dec  8 23:20:45 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded
Dec  8 23:20:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:20:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:20:45 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:21:26 crashycrashy chronyd[653]: Backward time jump detected!
Dec  8 23:21:26 crashycrashy chronyd[653]: Can't synchronise: no selectable sources
Dec  8 23:21:32 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:21:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:21:45 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:21:45 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:22:42 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:22:59 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:01 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:01 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:05 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:23:05 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Dec  8 23:23:05 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded
Dec  8 23:23:05 crashycrashy systemd: Time has been changed
Dec  8 23:23:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:23:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:23:05 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:23:34 crashycrashy chronyd[653]: Selected source 198.50.139.209
Dec  8 23:23:34 crashycrashy chronyd[653]: System clock wrong by 159.958641 seconds, adjustment started
Dec  8 23:23:53 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:24:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:24:05 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:24:05 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:28:35 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:29:52 crashycrashy chronyd[653]: Selected source 198.50.139.209
Dec  8 23:29:52 crashycrashy chronyd[653]: System clock wrong by 159.351515 seconds, adjustment started
Dec  8 23:32:05 crashycrashy systemd: Created slice User Slice of vulnscan.
Dec  8 23:32:05 crashycrashy systemd: Starting User Slice of vulnscan.
Dec  8 23:32:05 crashycrashy systemd: Started Session 2 of user vulnscan.
Dec  8 23:32:05 crashycrashy systemd-logind: New session 2 of user vulnscan.
Dec  8 23:32:05 crashycrashy systemd: Starting Session 2 of user vulnscan.
Dec  8 23:32:46 crashycrashy systemd: Starting Cleanup of Temporary Directories...Dec  8 23:32:46 crashycrashy systemd: Started Cleanup of Temporary Directories.
Dec  8 23:34:08 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:34:28 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:34:30 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:34:30 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:34:34 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:34:34 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Dec  8 23:34:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:34:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:34:12 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:34:12 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded
Dec  8 23:34:12 crashycrashy systemd: Time has been changed
Dec  8 23:34:53 crashycrashy chronyd[653]: Backward time jump detected!
Dec  8 23:34:53 crashycrashy chronyd[653]: Can't synchronise: no selectable sources
Dec  8 23:35:03 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:35:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:35:12 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:35:12 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:36:54 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:37:02 crashycrashy chronyd[653]: Selected source 206.108.0.132
Dec  8 23:37:02 crashycrashy chronyd[653]: System clock wrong by 156.812611 seconds, adjustment started
Dec  8 23:37:05 crashycrashy chronyd[653]: Selected source 149.56.27.12
Dec  8 23:37:14 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:37:16 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:37:17 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:37:21 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:37:21 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Dec  8 23:37:16 crashycrashy systemd: Time has been changed
Dec  8 23:37:16 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded
Dec  8 23:37:16 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:37:16 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:37:16 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:37:16 crashycrashy kernel: scsi 3:0:1:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 4
Dec  8 23:37:16 crashycrashy kernel: sd 3:0:1:0: Attached scsi generic sg2 type 0
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available.
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available.
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Write Protect is off
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Asking for cache data failed
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Assuming drive cache: write through
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available.
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Sense not available.
Dec  8 23:37:17 crashycrashy kernel: sd 3:0:1:0: [sdb] Attached SCSI disk
Dec  8 23:38:08 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:38:09 crashycrashy chronyd[653]: Can't synchronise: no selectable sources
Dec  8 23:38:17 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:38:17 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:38:17 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:39:46 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:40:06 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:40:09 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:40:09 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:40:13 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:40:14 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Dec  8 23:40:00 crashycrashy systemd: Time has been changed
Dec  8 23:40:00 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded
Dec  8 23:40:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:40:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:40:00 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:40:02 crashycrashy chronyd[653]: Backward time jump detected!
Dec  8 23:40:46 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:41:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:41:00 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:41:00 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:42:11 crashycrashy chronyd[653]: Selected source 206.108.0.132
Dec  8 23:42:11 crashycrashy chronyd[653]: System clock wrong by 159.800448 seconds, adjustment started
Dec  8 23:42:14 crashycrashy chronyd[653]: Selected source 149.56.27.12
Dec  8 23:43:35 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:43:55 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:43:57 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:43:58 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:44:02 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:44:02 crashycrashy journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Dec  8 23:43:53 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:43:53 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:43:53 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Dec  8 23:43:53 crashycrashy journal: Hyper-V VSS: VSS: op=THAW: succeeded
Dec  8 23:43:54 crashycrashy systemd: Time has been changed
Dec  8 23:44:23 crashycrashy chronyd[653]: Can't synchronise: no selectable sources
Dec  8 23:44:40 crashycrashy journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Dec  8 23:44:54 crashycrashy kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Dec  8 23:44:54 crashycrashy kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Dec  8 23:44:54 crashycrashy kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa

Comment 105 paulf123 2018-01-04 05:41:08 UTC

Hi. Any update on this? Thanks!

Comment 106 Cathy Avery 2018-01-04 12:16:55 UTC

(In reply to paulf123 from comment #105)
> Hi. Any update on this? Thanks!

The fix has been pushed to 4.15 upstream. It will be part of RHEL 7.5 and most likely added to a RHEL 7.4 errata.

Thanks for your help,

Cathy

Comment 107 Leo White 2018-01-04 15:36:35 UTC

Cathy is there an ETA on that roll out of 7.5?

Comment 109 paulf123 2018-01-05 21:25:14 UTC

(In reply to Leo White from comment #107)
> Cathy is there an ETA on that roll out of 7.5?

Thanks Cathy or even when 4.15 will be avail kernel-ml?

Comment 110 Cathy Avery 2018-01-06 13:12:06 UTC

(In reply to paulf123 from comment #109)
> (In reply to Leo White from comment #107)
> > Cathy is there an ETA on that roll out of 7.5?
> 
> Thanks Cathy or even when 4.15 will be avail kernel-ml?

I'm not sure of the dates. I'll try to find out.

Comment 113 Cathy Avery 2018-01-09 14:42:47 UTC

For those looking for release dates I unfortunately don't have that info. It looks like the best option for status updates is to contact your Red Hat support person.

Comment 118 paulf123 2018-01-11 17:33:27 UTC

I've tested ML 4.15 RC7 under heavy load.

So far there are still seeing warnings about non existent controllers/drives per logs bellow and now there are some additional I/O errors being thrown (again sdc, sdd which are not actually present.) The good news is there doesn't appear to be any hanging of the OS which is great!

Jan  9 20:00:09 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:00:11 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:00:11 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:00:25 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:00:25 ouch1 journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Jan  9 20:00:24 ouch1 systemd: Time has been changed
Jan  9 20:00:24 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:00:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:00:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:00:24 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:00:24 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:00:24 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:00:24 ouch1 journal: Hyper-V VSS: VSS: op=THAW: succeeded
Jan  9 20:00:24 ouch1 kernel: scsi 3:0:1:1: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 4
Jan  9 20:00:24 ouch1 kernel: sd 3:0:1:1: Attached scsi generic sg2 type 0
Jan  9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
Jan  9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] 4096-byte physical blocks
Jan  9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] Write Protect is off
Jan  9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  9 20:00:24 ouch1 kernel: sdc: sdc1 sdc2
Jan  9 20:00:24 ouch1 kernel: sd 3:0:1:1: [sdc] Attached SCSI disk
Jan  9 20:00:24 ouch1 systemd: Starting LVM2 PV scan on device 8:34...
Jan  9 20:00:24 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found.
Jan  9 20:00:24 ouch1 lvm: WARNING: Autoactivation reading from disk instead of lvmetad.
Jan  9 20:00:25 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY on /dev/sdc2 was already found on /dev/sda2.
Jan  9 20:00:25 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY prefers device /dev/sda2 because device is used by LV.
Jan  9 20:00:25 ouch1 lvm: 1 logical volume(s) in volume group "MYDATA" now active
Jan  9 20:00:25 ouch1 lvm: 2 logical volume(s) in volume group "CL" now active
Jan  9 20:00:25 ouch1 systemd: Started LVM2 PV scan on device 8:34.
Jan  9 20:00:25 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronizing SCSI cache
Jan  9 20:00:25 ouch1 systemd: Stopping LVM2 PV scan on device 8:34...
Jan  9 20:00:25 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan  9 20:00:25 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found.
Jan  9 20:00:25 ouch1 systemd: Stopped LVM2 PV scan on device 8:34.
Jan  9 20:00:41 ouch1 chronyd[786]: Selected source 99.224.6.96
Jan  9 20:00:41 ouch1 chronyd[786]: System clock wrong by 156.713638 seconds, adjustment started
Jan  9 20:00:43 ouch1 chronyd[786]: Selected source 149.56.121.19
Jan  9 20:00:56 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:01:01 ouch1 systemd: Started Session 32 of user root.
Jan  9 20:01:01 ouch1 systemd: Starting Session 32 of user root.
Jan  9 20:01:23 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:01:46 ouch1 chronyd[786]: Selected source 99.224.6.96
Jan  9 20:01:46 ouch1 chronyd[786]: Source 99.224.6.96 replaced with 144.217.181.221
Jan  9 20:01:47 ouch1 chronyd[786]: Selected source 149.56.121.19
Jan  9 20:01:49 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:04:44 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:05:04 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:05:15 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:05:16 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:05:16 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:05:16 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:05:21 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:05:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:05:24 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:05:24 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:05:56 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:06:16 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:07:10 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:07:29 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:07:58 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:08:24 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:08:26 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:08:26 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:08:41 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:08:41 ouch1 journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Jan  9 20:08:04 ouch1 systemd: Time has been changed
Jan  9 20:08:04 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:08:04 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:08:04 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:08:04 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:08:04 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:08:04 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:08:04 ouch1 journal: Hyper-V VSS: VSS: op=THAW: succeeded
Jan  9 20:08:05 ouch1 kernel: scsi 3:0:1:1: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 4
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: Attached scsi generic sg2 type 0
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] 4096-byte physical blocks
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Write Protect is off
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  9 20:08:05 ouch1 kernel: sdc: sdc1 sdc2
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Attached SCSI disk
Jan  9 20:08:05 ouch1 systemd: Starting LVM2 PV scan on device 8:34...
Jan  9 20:08:05 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found.
Jan  9 20:08:05 ouch1 lvm: WARNING: Autoactivation reading from disk instead of lvmetad.
Jan  9 20:08:05 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY on /dev/sdc2 was already found on /dev/sda2.
Jan  9 20:08:05 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY prefers device /dev/sda2 because device is used by LV.
Jan  9 20:08:05 ouch1 kernel: scsi 3:0:1:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 4
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: Attached scsi generic sg3 type 0
Jan  9 20:08:05 ouch1 lvm: 1 logical volume(s) in volume group "MYDATA" now active
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available.
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available.
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] 0 512-byte logical blocks: (0 B/0 B)
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] 0-byte physical blocks
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Write Protect is off
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Asking for cache data failed
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Assuming drive cache: write through
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available.
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Sense not available.
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:0: [sdd] Attached SCSI disk
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronizing SCSI cache
Jan  9 20:08:05 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan  9 20:08:05 ouch1 lvm: 2 logical volume(s) in volume group "CL" now active
Jan  9 20:08:05 ouch1 systemd: Stopped LVM2 PV scan on device 8:34.
Jan  9 20:08:42 ouch1 chronyd[786]: Backward time jump detected!
Jan  9 20:08:42 ouch1 chronyd[786]: Can't synchronise: no selectable sources
Jan  9 20:09:17 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:09:18 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:09:18 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:09:18 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:09:18 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:09:18 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:09:18 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:10:02 ouch1 systemd: Started Session 33 of user root.
Jan  9 20:10:02 ouch1 systemd: Starting Session 33 of user root.
Jan  9 20:10:36 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:10:51 ouch1 chronyd[786]: Selected source 149.56.121.19
Jan  9 20:10:51 ouch1 chronyd[786]: System clock wrong by 154.429252 seconds, adjustment started
Jan  9 20:11:01 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:11:03 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:11:03 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:11:18 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:11:13 ouch1 journal: Hyper-V VSS: VSS: op=FREEZE: succeeded
Jan  9 20:11:13 ouch1 systemd: Time has been changed
Jan  9 20:11:13 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:11:13 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:11:13 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:11:13 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:11:13 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:11:13 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:11:13 ouch1 journal: Hyper-V VSS: VSS: op=THAW: succeeded
Jan  9 20:11:14 ouch1 kernel: scsi 3:0:1:1: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 4
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: Attached scsi generic sg2 type 0
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] 4096-byte physical blocks
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Write Protect is off
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  9 20:11:14 ouch1 kernel: sdc: sdc1 sdc2
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Attached SCSI disk
Jan  9 20:11:14 ouch1 systemd: Starting LVM2 PV scan on device 8:34...
Jan  9 20:11:14 ouch1 lvm: WARNING: Not using lvmetad because duplicate PVs were found.
Jan  9 20:11:14 ouch1 lvm: WARNING: Autoactivation reading from disk instead of lvmetad.
Jan  9 20:11:14 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY on /dev/sdc2 was already found on /dev/sda2.
Jan  9 20:11:14 ouch1 lvm: WARNING: PV jF7p3F-iVi1-AUOA-Zp0D-Meej-Rexs-9gR2yY prefers device /dev/sda2 because device is used by LV.
Jan  9 20:11:14 ouch1 lvm: 1 logical volume(s) in volume group "MYDATA" now active
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronizing SCSI cache
Jan  9 20:11:14 ouch1 kernel: sd 3:0:1:1: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan  9 20:11:14 ouch1 lvm: 2 logical volume(s) in volume group "CL" now active
Jan  9 20:11:14 ouch1 systemd: Stopped LVM2 PV scan on device 8:34.
Jan  9 20:11:55 ouch1 chronyd[786]: Selected source 158.69.125.231
Jan  9 20:11:59 ouch1 chronyd[786]: Can't synchronise: no selectable sources
Jan  9 20:15:28 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:15:30 ouch1 kernel: sd 3:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:15:30 ouch1 kernel: sd 3:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:15:30 ouch1 kernel: sd 3:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:15:37 ouch1 kernel: sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
Jan  9 20:15:37 ouch1 kernel: sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
Jan  9 20:15:37 ouch1 kernel: sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automa
Jan  9 20:17:02 ouch1 journal: Hyper-V VSS: VSS: op=CHECK HOT BACKUP
Jan  9 20:17:18 ouch1 chronyd[786]: Selected source 149.56.121.19
Jan  9 20:17:18 ouch1 chronyd[786]: System clock wrong by 4.618358 seconds, adjustment started

Comment 119 longli 2018-01-11 20:22:40 UTC

The I/O failure messages on /sdc and /sdd are expected. With the patch, storvsc may return I/O errors (when we figure out they are actually not present) when their partitions are being probed.

Comment 120 Bruno Meneguele 2018-01-19 17:50:27 UTC

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 123 Bruno Meneguele 2018-01-22 12:40:52 UTC

Patch(es) available on kernel-3.10.0-832.el7

Comment 136 Qiaowei Zheng 2018-02-09 05:15:51 UTC

Verified with kernel 3.10.0-843.el7

Tested with steps in comment #26 on WS2016/WS2012R2/WS2012, no crash/error observed. Backup is able to complete while continuously adding/removing disks to the vm.

Tested FC multipath failover on WS2016/WS2012R2, no error observed, failover works correctly under io load.

Comment 142 Amaury De Ganseman 2018-04-03 10:38:23 UTC

We have the same problem here. I've just updated from 3.10.0-693.11.1.el7.x86_64to 3.10.0-693.21.1.el7.x86_64.

Does it include the patch ? 


Thanks

Comment 143 Marc Milgram 2018-04-03 11:59:16 UTC

Hi Amaury,

(In reply to Amaury De Ganseman from comment #142)
> We have the same problem here. I've just updated from
> 3.10.0-693.11.1.el7.x86_64to 3.10.0-693.21.1.el7.x86_64.
> 
> Does it include the patch ? 

Yes.  This was fixed for RHEL-7.4.z in kernel-3.10.0-693.21.1.el7 - Errata RHSA-2018:0395:
https://access.redhat.com/errata/RHSA-2018:0395

Marc

Comment 144 errata-xmlrpc 2018-04-10 22:32:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1062