Bug 1518329
| Summary: | Identify Controller failed (-4) resulting in nvme controller attempted reset. Reboot to recover. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nathaniel Weddle <nweddle> |
| Component: | kernel | Assignee: | David Milburn <dmilburn> |
| kernel sub component: | Platform Enablement | QA Contact: | Zhang Yi <yizhan> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | alex.wang, cww, dmilburn, glamb, jaearl, jmagrini, jpittman, jshortt, keith.busch, mstowe, nweddle, toneata, wlehman, yizhan |
| Version: | 7.4 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-03-19 09:44:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1477664, 1532680 | | |
Description
Nathaniel Weddle
2017-11-28 16:08:58 UTC
Hi Nathaniel,

The messages file (Comment#1) shows page allocation failures:

    Nov 24 21:52:36 mrpcvmw12 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x20)
    Nov 24 21:52:36 mrpcvmw12 kernel: cache: kmalloc-8192, object size: 8192, buffer size: 8192, default order: 3, min order: 1
    Nov 24 21:52:36 mrpcvmw12 kernel: node 0: slabs: 136, objs: 364, free: 0
    Nov 24 21:52:36 mrpcvmw12 kernel: node 1: slabs: 135, objs: 540, free: 24
    Nov 24 21:52:36 mrpcvmw12 kernel: kworker/47:1H: page allocation failure: order:1, mode:0x204020
    Nov 24 21:52:36 mrpcvmw12 kernel: CPU: 47 PID: 69804 Comm: kworker/47:1H Tainted: G W OE ------------ 3.10.0-693.5.2.el7.x86_64 #1
    Nov 24 21:52:36 mrpcvmw12 kernel: Hardware name: Cisco Systems Inc UCSC-C240-M4SX/UCSC-C240-M4SX, BIOS C240M4.3.0.3a.0.0321172111 03/21/2017
    Nov 24 21:52:36 mrpcvmw12 kernel: Workqueue: kblockd blk_mq_run_work_fn
    Nov 24 21:52:36 mrpcvmw12 kernel: 0000000000204020 0000000090ca5da1 ffff883f11377868 ffffffff816a3e51
    Nov 24 21:52:36 mrpcvmw12 kernel: ffff883f113778f8 ffffffff81188820 0000000000000046 ffff883f113778b8
    Nov 24 21:52:36 mrpcvmw12 kernel: fffffffffffffffe 0020402000000000 ffff88407ffdb018 0000000090ca5da1
    Nov 24 21:52:36 mrpcvmw12 kernel: Call Trace:
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff816a3e51>] dump_stack+0x19/0x1b
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff81188820>] warn_alloc_failed+0x110/0x180
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8169fe2a>] __alloc_pages_slowpath+0x6b6/0x724
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8116b93d>] ? irq_work_queue+0xd/0x80
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811d1078>] alloc_pages_current+0x98/0x110
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811dbdfc>] new_slab+0x2fc/0x310
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811dd68c>] ___slab_alloc+0x3ac/0x4f0
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffffc0054f90>] ? nvme_queue_rq+0x1c0/0x870 [nvme]

Are these systems running under memory pressure? Thanks.

Hi David,

I did not believe so, as both direct and normal (both nodes) indicated available memory at the time of the page allocation failure. I reached out to the kernel team to take a look, and they verified that the key indicators of memory pressure are the "free" and "min" values in the zone information. It looks like there is plenty of memory free.

Hi,

    Nov 22 20:44:49 host1 kernel: nvme nvme1: async event result 00010300
    Nov 22 20:44:49 host1 kernel: nvme 0000:85:00.0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10

From the above I noticed that the controller status register shows that the Controller Fatal Status (CFS) bit has been set. According to the NVMe spec: "This field is set to '1' when a fatal controller error occurred that could not be communicated in the appropriate Completion Queue. This field is cleared to '0' when a fatal controller error has not occurred. Refer to section 9.5". Section 9.5 then says: "This indicates to host software that a serious error condition has occurred. When this condition occurs, host software should reset and then re-initialize the controller." So it seems the NVMe driver's watchdog timer is following this protocol.

Cc'ing Keith to see if he has seen this failure on any systems.
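For reference, both values in that excerpt decode mechanically per the NVMe spec: CSTS is the Controller Status register (bit 0 = RDY, bit 1 = CFS), and dword 0 of an Asynchronous Event Request completion packs the event type, event information, and associated log page into its low three bytes. A minimal standalone sketch of that decoding, with the values hard-coded from the log above (constant names are illustrative, not taken from the driver):

```c
#include <stdint.h>
#include <stdio.h>

/* NVMe Controller Status (CSTS) register bits, per the NVMe spec. */
#define NVME_CSTS_RDY (1u << 0)  /* controller ready */
#define NVME_CSTS_CFS (1u << 1)  /* Controller Fatal Status */

int main(void)
{
    uint32_t csts = 0x3;         /* "controller is down; will reset: CSTS=0x3" */
    uint32_t aer  = 0x00010300;  /* "async event result 00010300" */

    printf("CSTS=0x%x: RDY=%d CFS=%d\n", csts,
           !!(csts & NVME_CSTS_RDY), !!(csts & NVME_CSTS_CFS));

    /* AER completion dword 0: bits 2:0 = event type,
     * bits 15:8 = event information, bits 23:16 = associated log page. */
    printf("AER=0x%08x: type=0x%x info=0x%x log_page=0x%x\n",
           aer, aer & 0x7, (aer >> 8) & 0xff, (aer >> 16) & 0xff);
    return 0;
}
```

Run against the logged values, this prints RDY=1 CFS=1 for CSTS=0x3, and type 0x0 (error status), info 0x03 (persistent internal error), log page 0x01 (Error Information) for the async event, which lines up with the reading in the next comment.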
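Going back to the memory-pressure question: a quick way to eyeball the "free" versus "min" comparison described above is to scrape /proc/zoneinfo. A rough standalone sketch (illustrative only, not a tool used in this case):

```c
#include <stdio.h>

/* Compare each zone's "pages free" against its "min" watermark in
 * /proc/zoneinfo; a zone nearing "min" is under memory pressure. */
int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    char line[256], zone[64] = "?";
    long free_pages = -1, min;

    if (!f) { perror("/proc/zoneinfo"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Node %*d, zone %63s", zone) == 1) {
            free_pages = -1;   /* new zone section starts */
        } else if (sscanf(line, " pages free %ld", &free_pages) == 1) {
            /* remembered until this zone's "min" line shows up */
        } else if (free_pages >= 0 && sscanf(line, " min %ld", &min) == 1) {
            printf("zone %-8s free=%-8ld min=%-8ld%s\n", zone, free_pages,
                   min, free_pages <= min ? " <-- under pressure" : "");
            free_pages = -1;   /* report each zone once */
        }
    }
    fclose(f);
    return 0;
}
```

Note also that the failed allocation in the trace is order:1, i.e. two physically contiguous pages, so it can fail under fragmentation even when plenty of pages are free overall; /proc/buddyinfo shows the per-order breakdown.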
    Nov 22 20:45:52 host1 kernel: nvme nvme1: I/O 0 QID 0 timeout, disable controller
    Nov 22 20:45:54 host1 kernel: nvme nvme1: Identify Controller failed (-4)
    Nov 22 20:45:54 host1 kernel: nvme nvme1: Removing after probe failure status: -5
    Nov 22 20:45:54 host1 kernel: nvme1n1: detected capacity change from 3820752101376 to 0

Yeah, preceding CSTS.CFS, the controller reported async event 10300, which translates to "persistent internal error"; then the fatal status was observed. It looks like something has gotten this drive into a bad state. Following the device reset, the driver did not observe a completion to the first admin command when initializing the controller. When this recovery/initialization sequence fails like this, the driver unbinds from the device. The description says the drive recovers after a power cycle. Maybe the vendor knows more about such conditions, or has a way to find out why the drive is behaving this way.

Hi,

I've discussed this case with William and explained that we suspect this is a hardware/firmware issue, and that the Cisco certified vendor should be engaged. I intended to make this bug public when I opened it, and I'm doing so now.

tyvm, nfw

Hi,

Have we heard anything back from the HGST engineers? Should the customer continue collecting hdm diagnostic tool output following reboot?

We have a mutual customer who is consistently hitting this bug. They still encountered it after loading the Red Hat 7.5 test kernel provided. Has there been any traction with HGST?

Hi,

Here are a few notes I received back from HGST:

1. It appears that the logs were collected after a lot of idle time since the power was cycled. Please make sure to attach a log as soon as possible after the drive fails.
2. They see signs of multiple power failures to the drive, which cause it to shut down and go into recovery mode; recovery may take 10 seconds to fully come back clean. Do you have any idea why the drive would see a loss of power?
3. Even though the PCI errors were correctable, it doesn't look like the drive is fully recovering from them.

Thanks.

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Hi David:

Marcia Staley-Olsen
PROGRAM MANAGER
mstaleyo
Tel: +1 408 525 2335