Bug 1685613 - fstrim (manually or by timer) crashes the NVME controller and freezes the system
Summary: fstrim (manually or by timer) crashes the NVME controller and freezes the system
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 28
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-03-05 16:24 UTC by Francesco Simula
Modified: 2019-05-29 08:20 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-28 23:44:57 UTC
Type: Bug
Embargoed:



Description Francesco Simula 2019-03-05 16:24:28 UTC
Description of problem:

On three identical systems (Dell Optiplex 3620 with Intel i7-6770 CPUs), each equipped with an M.2 NVMe SSD (Western Digital Black SSD, model WDS256G1X0C-00ENX0, firmware B35900WD), running 'sudo fstrim -va' manually or waiting for 'fstrim.timer' to trigger (on Monday nights) crashes the system. Since all three machines behave the same way, faulty hardware is almost certainly out of the equation. The crash is logged as:

[cut]
Feb 25 00:00:49 ronin systemd[1]: Starting Discard unused blocks...
Feb 25 00:00:49 ronin kernel: DMAR: DRHD: handling fault status reg 3
Feb 25 00:00:49 ronin kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr 0 [fault reason 06] PTE Read access is not set
Feb 25 00:01:20 ronin kernel: nvme nvme0: I/O 697 QID 5 timeout, aborting
Feb 25 00:01:26 ronin kernel: nvme nvme0: I/O 446 QID 2 timeout, aborting
Feb 25 00:01:38 ronin kernel: nvme nvme0: I/O 634 QID 4 timeout, aborting
Feb 25 00:01:38 ronin kernel: nvme nvme0: I/O 635 QID 4 timeout, aborting
Feb 25 00:01:38 ronin kernel: nvme nvme0: I/O 636 QID 4 timeout, aborting
Feb 25 00:01:50 ronin kernel: nvme nvme0: I/O 697 QID 5 timeout, reset controller
Feb 25 00:02:20 ronin kernel: nvme nvme0: I/O 4 QID 0 timeout, reset controller
Feb 25 00:03:13 ronin kernel: nvme nvme0: Device not ready; aborting reset
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:36 ronin kernel: nvme nvme0: Device not ready; aborting reset
Feb 25 00:03:36 ronin kernel: nvme nvme0: Removing after probe failure status: -19
Feb 25 00:03:36 ronin kernel: print_req_error: I/O error, dev nvme0n1, sector 157472648
Feb 25 00:03:36 ronin kernel: print_req_error: I/O error, dev nvme0n1, sector 1050704
[/cut]

and more badness...
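
To put a number on how long the controller takes to die after the first DMAR fault, the log excerpt above can be processed with a short script. This is only a sketch: the helper names are hypothetical, the year is an assumption (syslog-style lines do not carry one), and month-name parsing assumes an English/C locale.

```python
import re
from datetime import datetime

# Matches the "Mon DD HH:MM:SS host source: message" form used in the excerpt above.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) (\S+?): (.*)$")

def parse_line(line, year=2019):
    """Split one journal line into (timestamp, source, message), or None if it does not match.
    The year is an assumption because the log lines do not include it."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, _host, source, message = m.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, source, message

def fault_to_removal_seconds(lines):
    """Seconds between the first DMAR message and the NVMe controller removal, or None."""
    fault_at = removed_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, _source, message = parsed
        if fault_at is None and message.startswith("DMAR:"):
            fault_at = when
        if "Removing after probe failure" in message:
            removed_at = when
    if fault_at is None or removed_at is None:
        return None
    return (removed_at - fault_at).total_seconds()

# Three lines copied from the excerpt above.
SAMPLE = [
    "Feb 25 00:00:49 ronin kernel: DMAR: DRHD: handling fault status reg 3",
    "Feb 25 00:01:20 ronin kernel: nvme nvme0: I/O 697 QID 5 timeout, aborting",
    "Feb 25 00:03:36 ronin kernel: nvme nvme0: Removing after probe failure status: -19",
]
print(fault_to_removal_seconds(SAMPLE))  # 167.0
```

For the excerpt above this gives a little under three minutes between the first fault and the controller being removed.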

Version-Release number of selected component (if applicable):

fstrim from util-linux 2.32.1

How reproducible:

About 90 out of 100 trials (every now and then it works, but retrying immediately afterwards triggers the crash).

Steps to Reproduce:
1. Boot any of the PCs with the configuration mentioned above.
2. Log in over SSH.
3. Run 'sudo fstrim -va'.

Actual results:

The console stays silent for a while, then fstrim dies with 'fstrim: /: FITRIM ioctl failed: Input/output error'. After that, every attempt to issue any command results in 'input/output error' and a hard reset is necessary.

Expected results:

[cut]
[fsimula@ronin ~]$ sudo fstrim -va
/opt: 0 B (0 bytes) trimmed
/boot: 148,4 MiB (155566080 bytes) trimmed
/: 644,1 MiB (675352576 bytes) trimmed
[/cut]

Additional info:

'dmesg' always outputs the following messages:
- DMAR: DRHD: handling fault status reg 3
- DMAR: [DMA Read] Request device [02:00.0] fault addr 0 [fault reason 06] PTE Read access is not set

more or less simultaneously with the crash, [02:00.0] being the PCI address of the NVMe drive:

[cut]
[fsimula@ronin ~]$ sudo lspci -vs02:00.0
02:00.0 Non-Volatile memory controller: Sandisk Corp WD Black NVMe SSD (prog-if 02 [NVM Express])
	Subsystem: Marvell Technology Group Ltd. Device 1093
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
	Memory at ef000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [b0] MSI-X: Enable+ Count=19 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [148] Device Serial Number 03-4d-ff-7a-99-88-77-66
	Capabilities: [158] Power Budgeting <?>
	Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [178] Secondary PCI Express <?>
	Capabilities: [2b8] Latency Tolerance Reporting
	Capabilities: [2c0] L1 PM Substates
	Kernel driver in use: nvme
	Kernel modules: nvme
[/cut]

and after reading:
- "https://unix.stackexchange.com/questions/456628/nvme-fstrim-causing-crash-on-linux-disabling-with-systemctl-doesnt-help"
and
- "https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough"

I concluded that the problem was somehow linked to the IOMMU and VT-d.
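
The link between the DMAR fault and the NVMe drive rests on the address in the fault message matching the address printed by lspci. A minimal sketch of that check follows; the function names are hypothetical, and the regular expression assumes the fault message has the same layout as in the dmesg lines quoted above.

```python
import re

# Extracts the bus:device.function address from a DMAR fault line like the one quoted above.
DMAR_RE = re.compile(
    r"DMAR: \[DMA (Read|Write)\] Request device \[([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\]"
)

def dmar_fault_device(message):
    """Return the PCI address named in a DMAR fault message, or None if there is none."""
    m = DMAR_RE.search(message)
    return m.group(2) if m else None

def lspci_address(first_line):
    """The first line of 'lspci -vs <addr>' output starts with the PCI address."""
    return first_line.split(" ", 1)[0]

def fault_matches_device(dmar_message, lspci_first_line):
    """True when the faulting requester is the device described by the lspci line."""
    device = dmar_fault_device(dmar_message)
    return device is not None and device == lspci_address(lspci_first_line)

fault = ("DMAR: [DMA Read] Request device [02:00.0] fault addr 0 "
         "[fault reason 06] PTE Read access is not set")
lspci = ("02:00.0 Non-Volatile memory controller: "
         "Sandisk Corp WD Black NVMe SSD (prog-if 02 [NVM Express])")
print(dmar_fault_device(fault))            # 02:00.0
print(fault_matches_device(fault, lspci))  # True
```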

I decided to change the GRUB setting 'intel_iommu=on' to 'intel_iommu=pt'; guess what, this seems to prevent the crash.
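
To confirm which IOMMU options a machine actually booted with, the kernel command line can be inspected. The sketch below is an illustration only: the function name and the sample command line are hypothetical. It reports both 'intel_iommu' and 'iommu' because passthrough mode is commonly documented as 'iommu=pt'; how the kernel treats 'intel_iommu=pt' is not established in this report.

```python
def iommu_params(cmdline):
    """Return the IOMMU-related key=value parameters found on a kernel command line."""
    wanted = ("intel_iommu", "iommu")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = value
    return found

# Hypothetical command line; on a running Linux system the real one can be
# read from /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz root=/dev/mapper/fedora-root ro intel_iommu=pt rhgb quiet"
print(iommu_params(example))  # {'intel_iommu': 'pt'}
```

Comparing the result on a crashing and a non-crashing boot shows whether the setting changed as intended.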

Can someone clarify?

Comment 1 Karel Zak 2019-03-06 09:13:14 UTC
Sounds like kernel or HW problem. Re-assigning.

Comment 2 Ben Cotton 2019-05-02 19:18:04 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 28 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 3 Ben Cotton 2019-05-02 19:48:32 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 28 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 4 Ben Cotton 2019-05-28 23:44:57 UTC
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 5 Francesco Simula 2019-05-29 08:18:00 UTC
Since the problem is still present in Fedora 30, the bug was cloned here:
https://bugzilla.redhat.com/show_bug.cgi?id=1707443

Comment 6 Francesco Simula 2019-05-29 08:20:58 UTC
Sorry, wrong post; please disregard or, if possible, delete the post above this one.

