Description of problem:

On three identical systems (Dell Optiplex 3620) with Intel i7-6770 CPUs, all of them equipped with M.2 NVMe SSDs (Western Digital Black SSD model WDS256G1X0C-00ENX0 with B35900WD firmware), so that faulty hardware is almost certainly out of the equation, starting 'sudo fstrim -va' manually or waiting for the 'fstrim.timer' to trigger (on Monday nights) crashes the system with:

[cut]
Feb 25 00:00:49 ronin systemd[1]: Starting Discard unused blocks...
Feb 25 00:00:49 ronin kernel: DMAR: DRHD: handling fault status reg 3
Feb 25 00:00:49 ronin kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr 0 [fault reason 06] PTE Read access is not set
Feb 25 00:01:20 ronin kernel: nvme nvme0: I/O 697 QID 5 timeout, aborting
Feb 25 00:01:26 ronin kernel: nvme nvme0: I/O 446 QID 2 timeout, aborting
Feb 25 00:01:38 ronin kernel: nvme nvme0: I/O 634 QID 4 timeout, aborting
Feb 25 00:01:38 ronin kernel: nvme nvme0: I/O 635 QID 4 timeout, aborting
Feb 25 00:01:38 ronin kernel: nvme nvme0: I/O 636 QID 4 timeout, aborting
Feb 25 00:01:50 ronin kernel: nvme nvme0: I/O 697 QID 5 timeout, reset controller
Feb 25 00:02:20 ronin kernel: nvme nvme0: I/O 4 QID 0 timeout, reset controller
Feb 25 00:03:13 ronin kernel: nvme nvme0: Device not ready; aborting reset
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:13 ronin kernel: nvme nvme0: Abort status: 0x7
Feb 25 00:03:36 ronin kernel: nvme nvme0: Device not ready; aborting reset
Feb 25 00:03:36 ronin kernel: nvme nvme0: Removing after probe failure status: -19
Feb 25 00:03:36 ronin kernel: print_req_error: I/O error, dev nvme0n1, sector 157472648
Feb 25 00:03:36 ronin kernel: print_req_error: I/O error, dev nvme0n1, sector 1050704
[/cut]

and more badness...

Version-Release number of selected component (if applicable):

fstrim from util-linux 2.32.1

How reproducible:

About 90 out of 100 trials (every now and then it works, then retrying immediately afterwards triggers the crash).

Steps to Reproduce:
1. Boot any of the PCs with the configuration mentioned above.
2. SSH to the console.
3. Run 'sudo fstrim -va'.

Actual results:

The console stays silent for a while, then fstrim dies with 'fstrim: /: FITRIM ioctl failed: Input/output error' and every attempt to issue any command results in 'input/output error'; a hard reset is necessary.

Expected results:

[cut]
[fsimula@ronin ~]$ sudo fstrim -va
/opt: 0 B (0 bytes) trimmed
/boot: 148,4 MiB (155566080 bytes) trimmed
/: 644,1 MiB (675352576 bytes) trimmed
[/cut]

Additional info:

'dmesg' always outputs the following messages:

- DMAR: DRHD: handling fault status reg 3
- DMAR: [DMA Read] Request device [02:00.0] fault addr 0 [fault reason 06] PTE Read access is not set

more or less simultaneously with the crash, [02:00.0] being the PCI address of the NVMe drive:

[cut]
[fsimula@ronin ~]$ sudo lspci -vs02:00.0
02:00.0 Non-Volatile memory controller: Sandisk Corp WD Black NVMe SSD (prog-if 02 [NVM Express])
        Subsystem: Marvell Technology Group Ltd. Device 1093
        Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
        Memory at ef000000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=19 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Device Serial Number 03-4d-ff-7a-99-88-77-66
        Capabilities: [158] Power Budgeting <?>
        Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [178] Secondary PCI Express <?>
        Capabilities: [2b8] Latency Tolerance Reporting
        Capabilities: [2c0] L1 PM Substates
        Kernel driver in use: nvme
        Kernel modules: nvme
[/cut]

After reading:

- "https://unix.stackexchange.com/questions/456628/nvme-fstrim-causing-crash-on-linux-disabling-with-systemctl-doesnt-help" and
- "https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough"

I came to the conclusion that the problem was somehow linked to IOMMU and VT-d. I decided to change the 'intel_iommu=on' GRUB setting to 'intel_iommu=pt'; guess what, this seems to solve the crash. Can someone clarify?
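For anyone wanting to check whether their own crashes match this one, here is a rough sketch that pulls the DMAR fault lines out of a saved kernel log and reports the faulting PCI address, so it can be compared with the 'lspci' output. The regular expression is modeled only on the single fault line quoted in this report, and the function name 'find_dmar_faults' is made up for illustration; other kernel versions may format the message differently.

```python
import re

# Pattern modeled on the DMAR fault line quoted in this report; this is an
# assumption, other kernels may print the message in a different format.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+) \[fault reason (?P<reason>\d+)\]"
)


def find_dmar_faults(log_text):
    """Return one dict (op, dev, addr, reason) per DMAR fault line in log_text."""
    return [m.groupdict() for m in DMAR_FAULT.finditer(log_text)]
```

Feeding it the text of a saved 'dmesg' or journal dump returns a list of faults; a 'dev' value equal to the NVMe controller's PCI address (02:00.0 on these machines) would point at the same symptom described here.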
Sounds like a kernel or HW problem. Re-assigning.
This message is a reminder that Fedora 28 is nearing its end of life. On 2019-May-28 Fedora will stop maintaining and issuing updates for Fedora 28. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '28'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 28 reaches end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
Since the problem is still present in Fedora 30, the bug was cloned here: https://bugzilla.redhat.com/show_bug.cgi?id=1707443
Sorry, wrong post; please disregard or, if possible, delete the post above this one.