Bug 2013322 - Adaptec 71605H HBA Card (pm8001 driver) causes system hang when drive attached with kernel >= 5.13.16-200.fc34.x86_64
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 34
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2021-10-12 15:15 UTC by Daniel J. R. May
Modified: 2022-06-07 22:48 UTC (History)

Last Closed: 2022-06-07 22:48:38 UTC
Type: ---



Description Daniel J. R. May 2021-10-12 15:15:14 UTC
User-Agent:       Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0
Build Identifier: 

Hello,

I have two systems which I have been upgrading from Fedora 33. Both have Adaptec 71605H HBA Cards and I get the same behaviour in each system.

Both work fine with the initial Fedora 34 kernel, 5.11.12-300.fc34.x86_64, but both have problems after updating to a more recent 5.13 (F34) or 5.14 (F34/F35) kernel.

If I remove all the HDDs connected to the HBA card, the systems boot and work fine. When I connect an HDD to the HBA card, the system hangs after about 5 seconds and requires a hard reboot. If a drive is connected to the HBA at boot time, the system hangs during boot. Connecting the same disk directly to the motherboard works fine.

If I run `journalctl --follow` while connecting a disk to the HBA, I see the following output:

```
Oct 11 18:37:56 sulphur kernel: sas: phy-7:3 added to port-7:0, phy_mask:0x8 (50000d11074d0603)
Oct 11 18:37:56 sulphur kernel: sas: DOING DISCOVERY on port 0, pid:144
Oct 11 18:37:56 sulphur kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0
Oct 11 18:37:56 sulphur kernel: sas: ata7: end_device-7:0: dev error handler
Oct 11 18:38:56 sulphur kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Oct 11 18:38:56 sulphur kernel: rcu:         0-...0: (5 ticks this GP) idle=2b6/1/0x4000000000000000 softirq=48347/48348 fqs=14999 
Oct 11 18:38:56 sulphur kernel:         (detected by 3, t=60002 jiffies, g=92413, q=329)
Oct 11 18:38:56 sulphur kernel: Sending NMI from CPU 3 to CPUs 0:
Oct 11 18:38:56 sulphur kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Oct 11 18:38:56 sulphur kernel: Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr ns sunrpc vfat fat mlx4_ib ib_uverbs ib_core ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt at24 intel_pmc_bxt iTCO_vendor_support irqbypass rapl intel_cstate intel_uncore i2c_i801 mlx4_core intel_pch_thermal i2c_smbus lpc_ich pm80xx acpi_ipmi joydev ipmi_si libsas ie31200_edac ipmi_devintf ipmi_msghandler fuse zram ip_tables xfs ast drm_vram_helper drm_kms_helper cec drm_ttm_helper ttm drm crct10dif_pclmul mpt3sas crc32_pclmul crc32c_intel igb ghash_clmulni_intel dca i2c_algo_bit raid_class scsi_transport_sas video
Oct 11 18:38:56 sulphur kernel: CPU: 0 PID: 1175 Comm: kworker/u8:0 Tainted: G        W         5.14.10-300.fc35.x86_64 #1
Oct 11 18:38:56 sulphur kernel: Hardware name: Supermicro X10SL7-F/X10SL7-F, BIOS 2.00 04/24/2014
Oct 11 18:38:56 sulphur kernel: Workqueue: events_unbound async_run_entry_fn
Oct 11 18:38:56 sulphur kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1d0
Oct 11 18:38:56 sulphur kernel: Code: 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 11 01 00 00 85 c0 74 0e 8b 02 84 c0 74 08 f3 90 8b 02 <84> c0 75 f8 b8 01 00 00 00 66 89 02 c3 8b 37 b9 00 02 00 00 81 fe
Oct 11 18:38:56 sulphur kernel: RSP: 0018:ffffa79f415f3a68 EFLAGS: 00000002
Oct 11 18:38:56 sulphur kernel: RAX: 0000000000000101 RBX: ffff8b3ac2b50000 RCX: 0000000000000000
Oct 11 18:38:56 sulphur kernel: RDX: ffff8b3ac2b50038 RSI: 0000000000000000 RDI: ffff8b3ac2b50038
Oct 11 18:38:56 sulphur kernel: RBP: ffff8b3ae32c3f00 R08: 0000000000000001 R09: ffff8b3ae32c3f00
Oct 11 18:38:56 sulphur kernel: R10: 0000000074706db0 R11: 0000000000000001 R12: 0000000000000046
Oct 11 18:38:56 sulphur kernel: R13: 0000000000000000 R14: ffff8b3ac2b50038 R15: ffff8b3ad6400000
Oct 11 18:38:56 sulphur kernel: FS:  0000000000000000(0000) GS:ffff8b3ddfc00000(0000) knlGS:0000000000000000
Oct 11 18:38:56 sulphur kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 18:38:56 sulphur kernel: CR2: 00007f12d865fbb0 CR3: 0000000410e10004 CR4: 00000000001706f0
Oct 11 18:38:56 sulphur kernel: Call Trace:
Oct 11 18:38:56 sulphur kernel:  _raw_spin_lock_irqsave+0x32/0x40
Oct 11 18:38:56 sulphur kernel:  pm8001_task_exec.constprop.0+0x66/0x3f0 [pm80xx]
Oct 11 18:38:56 sulphur kernel:  ? kmem_cache_alloc+0x165/0x290
Oct 11 18:38:56 sulphur kernel:  sas_ata_qc_issue+0x17d/0x220 [libsas]
Oct 11 18:38:56 sulphur kernel:  ata_qc_issue+0xfe/0x1f0
Oct 11 18:38:56 sulphur kernel:  ata_exec_internal_sg+0x2b8/0x560
Oct 11 18:38:56 sulphur kernel:  ata_hpa_resize+0x15b/0x440
Oct 11 18:38:56 sulphur kernel:  ? ata_dev_blacklisted+0x68/0xc0
Oct 11 18:38:56 sulphur kernel:  ata_dev_configure+0x188/0xed0
Oct 11 18:38:56 sulphur kernel:  ? ata_dev_read_id+0x3ca/0x470
Oct 11 18:38:56 sulphur kernel:  ata_eh_recover+0x973/0x1340
Oct 11 18:38:56 sulphur kernel:  ? __irq_work_queue_local+0x48/0x50
Oct 11 18:38:56 sulphur kernel:  ? enqueue_entity+0x16a/0x780
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_sched_eh+0x60/0x60 [libsas]
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_prereset+0x50/0x50 [libsas]
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_sched_eh+0x60/0x60 [libsas]
Oct 11 18:38:56 sulphur kernel:  ? sas_ata_prereset+0x50/0x50 [libsas]
Oct 11 18:38:56 sulphur kernel:  ata_do_eh+0x71/0xf0
Oct 11 18:38:56 sulphur kernel:  ata_scsi_port_error_handler+0x3cf/0x8a0
Oct 11 18:38:56 sulphur kernel:  async_sas_ata_eh+0x44/0x7b [libsas]
Oct 11 18:38:56 sulphur kernel:  async_run_entry_fn+0x30/0x130
Oct 11 18:38:56 sulphur kernel:  process_one_work+0x1ec/0x390
Oct 11 18:38:56 sulphur kernel:  worker_thread+0x53/0x3e0
Oct 11 18:38:56 sulphur kernel:  ? process_one_work+0x390/0x390
Oct 11 18:38:56 sulphur kernel:  kthread+0x127/0x150
Oct 11 18:38:56 sulphur kernel:  ? set_kthread_struct+0x40/0x40
Oct 11 18:38:56 sulphur kernel:  ret_from_fork+0x22/0x30
Oct 11 18:38:56 sulphur kernel: NMI backtrace for cpu 0
Oct 11 18:38:56 sulphur kernel: CPU: 0 PID: 1175 Comm: kworker/u8:0 Tainted: G        W         5.14.10-300.fc35.x86_64 #1
Oct 11 18:38:56 sulphur kernel: Hardware name: Supermicro X10SL7-F/X10SL7-F, BIOS 2.00 04/24/2014
Oct 11 18:38:56 sulphur kernel: Workqueue: events_unbound async_run_entry_fn
Oct 11 18:38:56 sulphur kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1d0
Oct 11 18:38:56 sulphur kernel: Code: 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 11 01 00 00 85 c0 74 0e 8b 02 84 c0 74 08 f3 90 8b 02 <84> c0 75 f8 b8 01 00 00 00 66 89 02 c3 8b 37 b9 00 02 00 00 81 fe
Oct 11 18:38:56 sulphur kernel: RSP: 0018:ffffa79f415f3a68 EFLAGS: 00000002
Oct 11 18:38:56 sulphur kernel: RAX: 0000000000000101 RBX: ffff8b3ac2b50000 RCX: 0000000000000000
Oct 11 18:38:56 sulphur kernel: RDX: ffff8b3ac2b50038 RSI: 0000000000000000 RDI: ffff8b3ac2b50038
Oct 11 18:38:56 sulphur kernel: RBP: ffff8b3ae32c3f00 R08: 0000000000000001 R09: ffff8b3ae32c3f00
Oct 11 18:38:56 sulphur kernel: R10: 0000000074706db0 R11: 0000000000000001 R12: 0000000000000046
Oct 11 18:38:56 sulphur kernel: R13: 0000000000000000 R14: ffff8b3ac2b50038 R15: ffff8b3ad6400000
Oct 11 18:38:56 sulphur kernel: FS:  0000000000000000(0000) GS:ffff8b3ddfc00000(0000) knlGS:0000000000000000
Oct 11 18:38:56 sulphur kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 18:38:56 sulphur kernel: CR2: 00007f12d865fbb0 CR3: 0000000410e10004 CR4: 00000000001706f0
Oct 11 18:38:56 sulphur kernel: Call Trace:
Oct 11 18:38:56 sulphur kernel:  _raw_spin_lock_irqsave+0x32/
Oct 11 18:38:56 sulphur kernel: Lost 26 message(s)!
```
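To make a trace like the one above easier to read, the frames can be pulled out of the journal text with a short script. This is a minimal sketch, not an official tool: the function name `parse_call_trace`, the two regular expressions, and the `SAMPLE` list (a few lines copied from the log above) are all illustrative assumptions. It relies on the kernel convention that a leading `?` marks a frame the unwinder found on the stack but could not confirm as part of the call chain.

```python
import re

# "<month> <day> <time> <host> kernel: <message>" as printed by journalctl.
KERNEL_LINE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+kernel:\s?(.*)$")
# A stack frame such as " ? sas_ata_qc_issue+0x17d/0x220 [libsas]".
FRAME = re.compile(
    r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def parse_call_trace(lines):
    """Return (function, module, reliable) tuples for each call-trace frame.

    module is None for frames in the core kernel image; reliable is False
    for frames that the kernel printed with a leading '?'.
    """
    frames = []
    in_trace = False
    for line in lines:
        match = KERNEL_LINE.match(line)
        if match is None:
            continue
        message = match.group(1)
        if message.strip() == "Call Trace:":
            in_trace = True
            continue
        if in_trace:
            frame = FRAME.match(message)
            if frame is None:
                # Anything that is not a complete frame ends the current trace.
                in_trace = False
                continue
            frames.append((frame.group(2), frame.group(3), frame.group(1) is None))
    return frames


# Hypothetical sample: a handful of lines copied from the log above.
SAMPLE = [
    "Oct 11 18:38:56 sulphur kernel: Call Trace:",
    "Oct 11 18:38:56 sulphur kernel:  _raw_spin_lock_irqsave+0x32/0x40",
    "Oct 11 18:38:56 sulphur kernel:  pm8001_task_exec.constprop.0+0x66/0x3f0 [pm80xx]",
    "Oct 11 18:38:56 sulphur kernel:  ? kmem_cache_alloc+0x165/0x290",
    "Oct 11 18:38:56 sulphur kernel:  sas_ata_qc_issue+0x17d/0x220 [libsas]",
    "Oct 11 18:38:56 sulphur kernel: NMI backtrace for cpu 0",
]

frames = parse_call_trace(SAMPLE)
# Keep only the frames that belong to the pm80xx module.
driver_frames = [name for name, module, _reliable in frames if module == "pm80xx"]
```

Filtering on the module name shows which driver function was executing when the CPU got stuck spinning on the lock, which in this log is `pm8001_task_exec`.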


Reproducible: Always

Steps to Reproduce:
1. Have a system with an Adaptec 71605H HBA Card.
2. Have a system with kernel >= 5.13.16-200.fc34.x86_64.

Either:
3a. Boot with HDD attached to the HBA card. System hangs at boot when inspecting drives on HBA card. 

Or:
3b. Remove all HDDs attached to the HBA card.
4. Boot and log in. 
5. Attach a drive to the HBA card. System hangs.
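
The kernel condition in step 2 can be checked mechanically. The sketch below is only an illustration under stated assumptions: the names `upstream_version` and `classify` are hypothetical, the thresholds are simply the last working kernel (5.11.12) and the first failing kernel (5.13.16) mentioned in this report, and the Fedora build suffix (`-200.fc34.x86_64`) is ignored. Versions between the two thresholds were not tested here, so the actual regression point may lie anywhere in that range.

```python
import re

# Thresholds taken from this report only; they are observations, not a bisect result.
LAST_KNOWN_GOOD = (5, 11, 12)
FIRST_KNOWN_BAD = (5, 13, 16)


def upstream_version(release):
    """Extract (major, minor, patch) from a string like '5.13.16-200.fc34.x86_64'."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def classify(release):
    """Place a kernel release relative to the versions tested in this report."""
    version = upstream_version(release)
    if version >= FIRST_KNOWN_BAD:
        return "reported-bad"
    if version <= LAST_KNOWN_GOOD:
        return "reported-good"
    return "untested"
```

On a running system, the input string could come from `platform.release()` in the Python standard library, which returns the same value as `uname -r`.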

Actual Results:  
System hangs and requires hard reboot.

Expected Results:  
Normal operation.

My guess is that some of the changes introduced between the [5.11 version of the pm8001 driver](https://github.com/torvalds/linux/tree/v5.11/drivers/scsi/pm8001) and the [5.13 version](https://github.com/torvalds/linux/tree/v5.13/drivers/scsi/pm8001) caused this issue. However, I may well be wrong, and I am not sure how to go about testing this.
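
One possible way to test that guess is to list the commits that touched the driver directory between the two upstream tags. The following is a rough sketch, assuming a local clone of the mainline Linux repository; the helper names `pm8001_log_command` and `list_candidate_commits` are hypothetical. Note that the Fedora builds in this report are stable point releases (5.11.12, 5.13.16) with distribution patches, so the mainline tags `v5.11` and `v5.13` only approximate the range, and the cause could also lie outside this directory (for example in libsas or libata).

```python
import subprocess


def pm8001_log_command(good="v5.11", bad="v5.13", path="drivers/scsi/pm8001"):
    """Build the git command listing non-merge commits to `path` in good..bad."""
    return ["git", "log", "--oneline", "--no-merges", f"{good}..{bad}", "--", path]


def list_candidate_commits(repo_dir, **kwargs):
    """Run the command inside a kernel checkout and return one line per commit.

    repo_dir is assumed to be a clone that contains both tags.
    """
    result = subprocess.run(
        pm8001_log_command(**kwargs),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

If the list is short, each commit can be reviewed by hand. Otherwise `git bisect start v5.13 v5.11 -- drivers/scsi/pm8001` restricts a bisect to commits touching that path, at the cost of building and booting a kernel at each step.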

Are there any other tests or diagnostics I should run?

Comment 1 Ben Cotton 2022-05-12 15:55:06 UTC
This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 34 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 2 Ben Cotton 2022-06-07 22:48:38 UTC
Fedora Linux 34 entered end-of-life (EOL) status on 2022-06-07.

Fedora Linux 34 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release.

Thank you for reporting this bug and we are sorry it could not be fixed.

