Bug 1518329
| Summary: | Identify Controller failed (-4) resulting in nvme controller attempted reset. Reboot to recover. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nathaniel Weddle <nweddle> |
| Component: | kernel | Assignee: | David Milburn <dmilburn> |
| kernel sub component: | Platform Enablement | QA Contact: | Zhang Yi <yizhan> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | alex.wang, cww, dmilburn, glamb, jaearl, jmagrini, jpittman, jshortt, keith.busch, mstowe, nweddle, toneata, wlehman, yizhan |
| Version: | 7.4 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-03-19 09:44:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1477664, 1532680 | | |
Description
Nathaniel Weddle
2017-11-28 16:08:58 UTC
Hi Nathaniel,

The messages file (Comment#1) shows page allocation failures:

    Nov 24 21:52:36 mrpcvmw12 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x20)
    Nov 24 21:52:36 mrpcvmw12 kernel: cache: kmalloc-8192, object size: 8192, buffer size: 8192, default order: 3, min order: 1
    Nov 24 21:52:36 mrpcvmw12 kernel: node 0: slabs: 136, objs: 364, free: 0
    Nov 24 21:52:36 mrpcvmw12 kernel: node 1: slabs: 135, objs: 540, free: 24
    Nov 24 21:52:36 mrpcvmw12 kernel: kworker/47:1H: page allocation failure: order:1, mode:0x204020
    Nov 24 21:52:36 mrpcvmw12 kernel: CPU: 47 PID: 69804 Comm: kworker/47:1H Tainted: G W OE ------------ 3.10.0-693.5.2.el7.x86_64 #1
    Nov 24 21:52:36 mrpcvmw12 kernel: Hardware name: Cisco Systems Inc UCSC-C240-M4SX/UCSC-C240-M4SX, BIOS C240M4.3.0.3a.0.0321172111 03/21/2017
    Nov 24 21:52:36 mrpcvmw12 kernel: Workqueue: kblockd blk_mq_run_work_fn
    Nov 24 21:52:36 mrpcvmw12 kernel: 0000000000204020 0000000090ca5da1 ffff883f11377868 ffffffff816a3e51
    Nov 24 21:52:36 mrpcvmw12 kernel: ffff883f113778f8 ffffffff81188820 0000000000000046 ffff883f113778b8
    Nov 24 21:52:36 mrpcvmw12 kernel: fffffffffffffffe 0020402000000000 ffff88407ffdb018 0000000090ca5da1
    Nov 24 21:52:36 mrpcvmw12 kernel: Call Trace:
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff816a3e51>] dump_stack+0x19/0x1b
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff81188820>] warn_alloc_failed+0x110/0x180
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8169fe2a>] __alloc_pages_slowpath+0x6b6/0x724
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8116b93d>] ? irq_work_queue+0xd/0x80
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811d1078>] alloc_pages_current+0x98/0x110
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811dbdfc>] new_slab+0x2fc/0x310
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffff811dd68c>] ___slab_alloc+0x3ac/0x4f0
    Nov 24 21:52:36 mrpcvmw12 kernel: [<ffffffffc0054f90>] ? nvme_queue_rq+0x1c0/0x870 [nvme]

Are these systems running under memory pressure? Thanks.

Hi David,

I did not believe so, as both direct and normal (both nodes) indicated available memory at the time of the page allocation failure. I reached out to the kernel team to take a look, and they verified that the key indicators of memory pressure are the "free" and "min" values in the zone information. It looks like there is plenty of memory free.

Hi,

    Nov 22 20:44:49 host1 kernel: nvme nvme1: async event result 00010300
    Nov 22 20:44:49 host1 kernel: nvme 0000:85:00.0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10

From the above I noticed that the controller status register shows that the Controller Fatal Status (CFS) bit has been set. According to the NVMe spec: "This field is set to '1' when a fatal controller error occurred that could not be communicated in the appropriate Completion Queue. This field is cleared to '0' when a fatal controller error has not occurred. Refer to section 9.5". Section 9.5 then says: "This indicates to host software that a serious error condition has occurred. When this condition occurs, host software should reset and then re-initialize the controller." So it seems the NVMe driver's watchdog timer is following this protocol.

Cc'ing Keith to see if he has seen this failure on any systems.
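For reference, both values in that excerpt decode mechanically per the NVMe spec: CSTS is the Controller Status register (bit 0 = RDY, bit 1 = CFS), and dword 0 of an Asynchronous Event Request completion packs the event type, event information, and associated log page into its low three bytes. A minimal standalone sketch of that decoding, with the values hard-coded from the log above (constant names are illustrative, not taken from the driver):

```c
#include <stdint.h>
#include <stdio.h>

/* NVMe Controller Status (CSTS) register bits, per the NVMe spec. */
#define NVME_CSTS_RDY (1u << 0)  /* controller ready */
#define NVME_CSTS_CFS (1u << 1)  /* Controller Fatal Status */

int main(void)
{
    uint32_t csts = 0x3;         /* "controller is down; will reset: CSTS=0x3" */
    uint32_t aer  = 0x00010300;  /* "async event result 00010300" */

    printf("CSTS=0x%x: RDY=%d CFS=%d\n", csts,
           !!(csts & NVME_CSTS_RDY), !!(csts & NVME_CSTS_CFS));

    /* AER completion dword 0: bits 2:0 = event type,
     * bits 15:8 = event information, bits 23:16 = associated log page. */
    printf("AER=0x%08x: type=0x%x info=0x%x log_page=0x%x\n",
           aer, aer & 0x7, (aer >> 8) & 0xff, (aer >> 16) & 0xff);
    return 0;
}
```

Run against the logged values, this prints RDY=1 CFS=1 for CSTS=0x3, and type 0x0 (error status), info 0x03 (persistent internal error), log page 0x01 (Error Information) for the async event, which lines up with the reading in the next comment.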
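Going back to the memory-pressure question: a quick way to eyeball the "free" versus "min" comparison described above is to scrape /proc/zoneinfo. A rough standalone sketch (illustrative only, not a tool used in this case):

```c
#include <stdio.h>

/* Compare each zone's "pages free" against its "min" watermark in
 * /proc/zoneinfo; a zone nearing "min" is under memory pressure. */
int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    char line[256], zone[64] = "?";
    long free_pages = -1, min;

    if (!f) { perror("/proc/zoneinfo"); return 1; }
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Node %*d, zone %63s", zone) == 1) {
            free_pages = -1;   /* new zone section starts */
        } else if (sscanf(line, " pages free %ld", &free_pages) == 1) {
            /* remembered until this zone's "min" line shows up */
        } else if (free_pages >= 0 && sscanf(line, " min %ld", &min) == 1) {
            printf("zone %-8s free=%-8ld min=%-8ld%s\n", zone, free_pages,
                   min, free_pages <= min ? " <-- under pressure" : "");
            free_pages = -1;   /* report each zone once */
        }
    }
    fclose(f);
    return 0;
}
```

Note also that the failed allocation in the trace is order:1, i.e. two physically contiguous pages, so it can fail under fragmentation even when plenty of pages are free overall; /proc/buddyinfo shows the per-order breakdown.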
    Nov 22 20:45:52 host1 kernel: nvme nvme1: I/O 0 QID 0 timeout, disable controller
    Nov 22 20:45:54 host1 kernel: nvme nvme1: Identify Controller failed (-4)
    Nov 22 20:45:54 host1 kernel: nvme nvme1: Removing after probe failure status: -5
    Nov 22 20:45:54 host1 kernel: nvme1n1: detected capacity change from 3820752101376 to 0

Yeah, preceding CSTS.CFS, the controller reported async event 10300, which translates to "persistent internal error"; then the fatal status was observed. It looks like something has gotten this drive into a bad state. Following the device reset, the driver did not observe a completion to the first admin command when initializing the controller. When this recovery/initialization sequence fails like this, the driver unbinds from the device. The description says the drive recovers after a power cycle. Maybe the vendor knows more about such conditions, or has a way to find out why the drive is behaving this way.

Hi,

I've discussed this case with William and explained that we suspect this is a hardware/firmware issue, and that the Cisco certified vendor should be engaged. I intended to make this bug public when I opened it, and I'm doing so now.

tyvm, nfw

Hi,

Have we heard anything back from the HGST engineers? Should the customer continue collecting hdm diagnostic tool output following reboot?

We have a mutual customer who is consistently hitting this bug. They still encountered it after loading the Red Hat 7.5 test kernel provided. Has there been any traction with HGST?

Hi,

Here are a few notes I received back from HGST:

1. It appears that the logs were collected after a lot of idle time since the power was cycled. Please make sure to attach a log as soon as possible after the drive fails.
2. They see signs of multiple power failures to the drive, which cause it to shut down and go into recovery mode; recovery may take 10 seconds to fully come back clean. Do you have any idea why the drive would see a loss of power?
3. Even though the PCI errors were correctable, it doesn't look like the drive is fully recovering from them.

Thanks.

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Hi David:

Marcia Staley-Olsen
PROGRAM MANAGER
mstaleyo
Tel: +1 408 525 2335