Bug 1518329 - Identify Controller failed (-4) resulting in nvme controller attempted reset. Reboot to recover.
Summary: Identify Controller failed (-4) resulting in nvme controller attempted reset. Reboot to recover.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: David Milburn
QA Contact: Zhang Yi
URL:
Whiteboard:
Depends On:
Blocks: 1477664 1532680
 
Reported: 2017-11-28 16:08 UTC by Nathaniel Weddle
Modified: 2021-06-10 13:43 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-19 09:44:06 UTC
Target Upstream Version:
Embargoed:



Description Nathaniel Weddle 2017-11-28 16:08:58 UTC
Description of problem:
   Identify Controller failed (-4) resulting in an attempted nvme controller reset.  A reboot is required to recover.

Nov 22 20:44:49 host1 kernel: nvme nvme1: async event result 00010300
Nov 22 20:44:49 host1 kernel: nvme 0000:85:00.0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Nov 22 20:45:52 host1 kernel: nvme nvme1: I/O 0 QID 0 timeout, disable controller
Nov 22 20:45:54 host1 kernel: nvme nvme1: Identify Controller failed (-4)
Nov 22 20:45:54 host1 kernel: nvme nvme1: Removing after probe failure status: -5
Nov 22 20:45:54 host1 kernel: nvme1n1: detected capacity change from 3820752101376 to 0
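
As a side note (my addition, not part of the original report): the -4 and -5 in those messages are negated errno values, per the usual kernel convention, so they decode as follows:

/* Sketch: map the negative codes from the log to errno names/strings. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Identify Controller failed (-4)"  -> -EINTR
     * "Removing after probe failure status: -5" -> -EIO */
    printf("-4 (EINTR=%d): %s\n", EINTR, strerror(EINTR));
    printf("-5 (EIO=%d):   %s\n", EIO, strerror(EIO));
    return 0;
}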


Version-Release number of selected component (if applicable):
   Booted kernel:  3.10.0-693.5.2.el7.x86_64
   Non-Volatile memory controller [0108]: HGST, Inc. Ultrastar SN100

How reproducible:
   The customer reproduced the issue using the same controller on a couple of different but similar systems.  The customer replaced the controller and has reproduced the issue once in the past week.


Steps to Reproduce:
1.  Unknown
2.
3.

Actual results:
Following AER and corrected PCIe bus errors, the controller is reported as down and the driver attempts a reset.  The device never appears to recover until the system is rebooted.

Expected results:

If the failure is related to correctable errors, the controller should not need to be reset, and it should recover if a reset does occur.

Additional info:
   I am uploading a tar of lspci and dmidecode output.  Unfortunately, the customer has been rebooting the system prior to collecting sosreports; I've asked the customer to collect the data before rebooting the system (or at least to provide dmesg).

Comment 2 David Milburn 2017-11-28 18:31:27 UTC
Hi Nathaniel,

The messages file (Comment #1) shows page allocation failures:

Nov 24 21:52:36 mrpcvmw12 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Nov 24 21:52:36 mrpcvmw12 kernel:  cache: kmalloc-8192, object size: 8192, buffer size: 8192, default order: 3, min order: 1
Nov 24 21:52:36 mrpcvmw12 kernel:  node 0: slabs: 136, objs: 364, free: 0
Nov 24 21:52:36 mrpcvmw12 kernel:  node 1: slabs: 135, objs: 540, free: 24
Nov 24 21:52:36 mrpcvmw12 kernel: kworker/47:1H: page allocation failure: order:1, mode:0x204020
Nov 24 21:52:36 mrpcvmw12 kernel: CPU: 47 PID: 69804 Comm: kworker/47:1H Tainted: G        W  OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
Nov 24 21:52:36 mrpcvmw12 kernel: Hardware name: Cisco Systems Inc UCSC-C240-M4SX/UCSC-C240-M4SX, BIOS C240M4.3.0.3a.0.0321172111 03/21/2017
Nov 24 21:52:36 mrpcvmw12 kernel: Workqueue: kblockd blk_mq_run_work_fn
Nov 24 21:52:36 mrpcvmw12 kernel: 0000000000204020 0000000090ca5da1 ffff883f11377868 ffffffff816a3e51
Nov 24 21:52:36 mrpcvmw12 kernel: ffff883f113778f8 ffffffff81188820 0000000000000046 ffff883f113778b8
Nov 24 21:52:36 mrpcvmw12 kernel: fffffffffffffffe 0020402000000000 ffff88407ffdb018 0000000090ca5da1
Nov 24 21:52:36 mrpcvmw12 kernel: Call Trace:
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff816a3e51>] dump_stack+0x19/0x1b
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff81188820>] warn_alloc_failed+0x110/0x180
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8169fe2a>] __alloc_pages_slowpath+0x6b6/0x724
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8116b93d>] ? irq_work_queue+0xd/0x80
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811d1078>] alloc_pages_current+0x98/0x110
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811dbdfc>] new_slab+0x2fc/0x310
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811dd68c>] ___slab_alloc+0x3ac/0x4f0
Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffffc0054f90>] ? nvme_queue_rq+0x1c0/0x870 [nvme]

Are these systems running under memory pressure? Thanks.

Comment 3 Nathaniel Weddle 2017-11-28 22:19:35 UTC
Hi David,

I did not believe so, as both direct and normal (on both nodes) indicate available memory at the time of the page allocation failure.  I reached out to the kernel team to take a look, and they verified that the key indicators of memory pressure are the "free" and "min" values from the zone information.  It looks like there is plenty of free memory.
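
For reference, here is a minimal sketch of that check (my own addition, not from the case data), assuming a Linux host where /proc/zoneinfo is readable; it compares each zone's free page count against its min watermark:

/*
 * Sketch: read /proc/zoneinfo and report each zone's free pages vs.
 * its "min" watermark. Free counts well above min suggest the failure
 * was not simple memory exhaustion.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    char line[256], zone[64] = "?";
    long free_pages = -1, min_pages = -1;
    int node = -1;

    if (!f) {
        perror("/proc/zoneinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Node %d, zone %63s", &node, zone) == 2) {
            free_pages = min_pages = -1;            /* new zone block */
        } else if (sscanf(line, " pages free %ld", &free_pages) == 1) {
            /* remembered until the "min" watermark line follows */
        } else if (sscanf(line, " min %ld", &min_pages) == 1) {
            printf("node %d zone %-8s free=%ld min=%ld %s\n",
                   node, zone, free_pages, min_pages,
                   free_pages > min_pages ? "(above watermark)"
                                          : "(at/below watermark)");
        }
    }
    fclose(f);
    return 0;
}

If free stays well above min, as reported here, the failure points more toward higher-order allocations (the log shows an order:1 request for the kmalloc-8192 cache, whose min order is 1) than toward plain memory exhaustion.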

Comment 4 David Milburn 2017-11-29 19:21:10 UTC
Hi,

Nov 22 20:44:49 host1 kernel: nvme nvme1: async event result 00010300
Nov 22 20:44:49 host1 kernel: nvme 0000:85:00.0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10

From the above, I noticed that the controller status register shows that the
Controller Fatal Status (CFS) bit has been set. According to the NVMe spec:

"This field is set to '1' when a fatal controller error occurred that could not
be communicated in the appropriate Completion Queue. This field is cleared to
'0' when a fatal controller error has not occurred. Refer to section 9.5".

Then section 9.5 says "This indicates to host software that a serious error
condition has occurred. When this condition occurs, host software should reset
and then re-initialize the controller."

So, it seems the NVMe driver's watchdog timer is following this protocol.
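
To make that concrete, here is a small decoder sketch for the CSTS value in the log (my addition; the bit layout follows the NVMe spec, and the macro names below are illustrative rather than the driver's exact definitions):

/* Sketch: decode the "CSTS=0x3" value from the log. */
#include <stdint.h>
#include <stdio.h>

#define NVME_CSTS_RDY     (1u << 0)   /* controller ready             */
#define NVME_CSTS_CFS     (1u << 1)   /* Controller Fatal Status      */
#define NVME_CSTS_SHST(x) (((x) >> 2) & 0x3)  /* shutdown status      */

int main(void)
{
    uint32_t csts = 0x3;   /* value reported in the log */

    printf("CSTS=0x%x: RDY=%d CFS=%d SHST=%u\n", csts,
           !!(csts & NVME_CSTS_RDY),
           !!(csts & NVME_CSTS_CFS),
           NVME_CSTS_SHST(csts));
    /* CFS=1 -> host software should reset and re-initialize the controller */
    return 0;
}

With CSTS=0x3 this prints RDY=1 CFS=1, i.e. the controller still reports ready but has flagged a fatal internal error, which is what triggers the reset path above.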

Cc'ing Keith to see if he has seen this failure on any systems.

Nov 22 20:45:52 host1 kernel: nvme nvme1: I/O 0 QID 0 timeout, disable controller
Nov 22 20:45:54 host1 kernel: nvme nvme1: Identify Controller failed (-4)
Nov 22 20:45:54 host1 kernel: nvme nvme1: Removing after probe failure status: -5
Nov 22 20:45:54 host1 kernel: nvme1n1: detected capacity change from 3820752101376 to 0

Comment 5 Keith Busch 2017-11-29 19:35:37 UTC
Yeah, preceding CSTS.CFS, the controller reported async event 10300, which translates to "persistent internal error", then the fatal status was observed. It looks like something has gotten this drive into a bad state.
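
For anyone decoding that value by hand, here is a small sketch (my addition, assuming the standard Asynchronous Event Request completion dword 0 layout from the NVMe spec):

/*
 * Sketch: decode "async event result 00010300".
 *   bits  2:0  event type        (0 = Error status)
 *   bits 15:8  event information (0x03 = Persistent Internal Error
 *                                 for the Error status type)
 *   bits 23:16 log page to read  (0x01 = Error Information log page)
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t result = 0x00010300;            /* value from the log */
    unsigned type     = result & 0x7;
    unsigned info     = (result >> 8) & 0xff;
    unsigned log_page = (result >> 16) & 0xff;

    printf("type=%u info=0x%02x log_page=0x%02x\n", type, info, log_page);
    /* Prints: type=0 info=0x03 log_page=0x01 -> an Error status event
     * reporting a persistent internal error, with details available in
     * the Error Information log page. */
    return 0;
}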

Following the device reset, the driver did not observe a completion to the first admin command when initializing the controller. When this recovery/initialization sequence fails like this, the driver unbinds from the device.

The description says the drive recovers after a power cycle. Maybe the vendor knows more about such conditions, or has a way to find out why the drive is behaving this way.

Comment 7 Nathaniel Weddle 2017-12-21 17:55:02 UTC
Hi, 

I've discussed this case with William and explained that we suspect this is a hardware/firmware issue and that the Cisco-certified vendor should be engaged.  I intended to make this bug public when I opened it, and I'm doing so now.

tyvm,
nfw

Comment 35 Nathaniel Weddle 2018-03-12 15:04:28 UTC
Hi,

Have we heard anything back from the HGST engineers?  Should the customer continue collecting hdm diagnostic tool output following a reboot?

Comment 38 jaearl 2018-03-14 14:54:22 UTC
We have a mutual customer who is consistently hitting this bug.  They still encountered it after loading the Red Hat 7.5 test kernel provided.  Has there been any traction with HGST?

Comment 41 David Milburn 2018-03-16 21:44:53 UTC
Hi,

Here are a few notes I received back from HGST.

1. It appears that the logs were collected after a lot of idle time since the
power was cycled. Please make sure to attach a log as soon as possible after
the drive fails.

2. They see signs of multiple power failures to the drive, which cause it to
shut down and enter recovery mode; recovery may take 10 seconds before the
drive comes back fully clean.

Do you have any idea why the drive would see a loss of power?

3. Even though the PCI errors were correctable, it doesn't look like the drive
is fully recovering from them. Thanks.

Comment 43 Red Hat Bugzilla Rules Engine 2018-03-19 09:44:06 UTC
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Comment 46 Nathaniel Weddle 2018-03-19 20:14:00 UTC
Hi David:

Marcia Staley-Olsen
PROGRAM MANAGER
mstaleyo
Tel: +1 408 525 2335

