Created attachment 1714271
Related messages extracted from the systemd journal

1. Please describe the problem:

With a 5.8 kernel, iscsid crashes roughly once every 4-5 boots and the iSCSI devices fail to come up, which in my case hangs the boot because of dependencies.

This is really an upstream kernel bug that I have reported a couple of times on lkml, here:
https://lkml.org/lkml/2020/7/28/1085
and here:
https://lkml.org/lkml/2020/8/31/459

It got no traction from the author of commit 1b66d253610c7 ("bpf: Add get{peer, sock}name attach types for sock_addr") or from the iscsi maintainers; maybe someone in Fedora can help nudge the process forward.

The iscsi_sw_tcp_conn_get_param function has this:

        spin_lock_bh(&conn->session->frwd_lock);
        if (!tcp_sw_conn || !tcp_sw_conn->sock) {
                spin_unlock_bh(&conn->session->frwd_lock);
                return -ENOTCONN;
        }
        if (param == ISCSI_PARAM_LOCAL_PORT)
                rc = kernel_getsockname(tcp_sw_conn->sock,
                                        (struct sockaddr *)&addr);
        else
                rc = kernel_getpeername(tcp_sw_conn->sock,
                                        (struct sockaddr *)&addr);
        spin_unlock_bh(&conn->session->frwd_lock);

.. so it's calling kernel_getpeername while holding a spinlock. Commit 1b66d253610c7 introduces BPF_CGROUP_RUN_SA_PROG_LOCK() within getpeername, which does lock_sock(sk); .... release_sock(sk);, and will try to acquire a mutex. My guess is that when the mutex is not contended this works out fine, but when it is contended and the task needs to sleep, the "scheduling while atomic" error occurs.

While that commit introduces the issue, calling into something like kernel_getpeername while holding a spinlock seems a bit optimistic, so maybe this needs fixing in the iscsi caller.

2. What is the Version-Release number of the kernel:

5.8.6-201.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This is an upstream kernel bug that appeared with the 5.8 kernel.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

Yes, it reproduces roughly once every 4-5 boots, on several systems. It has occurred with 5.8.6-201.fc32.x86_64 and also several times with custom kernels based on very recent mainline 5.9-rc kernels. As far as I can tell the problem still exists in the mainline kernel.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Not tested, but I assume the problem is present there as well.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

I sometimes do, but this was reproduced without them.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.

Related journal entries attached.
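To make the point from item 1 concrete, here is a minimal userspace sketch of one way a caller could avoid sleeping under a spinlock: take a reference to the socket while the spinlock is held, drop the spinlock, and only then make the call that may sleep. This is not kernel code and not any posted patch. Every name in it (fake_sock, fake_conn, fake_getpeername, fake_conn_get_port, atomic_depth) is hypothetical, the spinlock is modelled with a C11 atomic_flag, and lock_sock() is modelled with a pthread mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical: counts spinlocks held by this thread, like a tiny preempt_count. */
static _Thread_local int atomic_depth;

/* Hypothetical stand-in for a refcounted socket guarded by a sleeping lock. */
struct fake_sock {
	atomic_int refcnt;
	pthread_mutex_t lock;	/* plays the role of lock_sock()/release_sock() */
	int port;
};

/* Hypothetical stand-in for the connection and its frwd_lock. */
struct fake_conn {
	atomic_flag frwd_lock;	/* plays the role of the spinlock */
	struct fake_sock *sock;	/* NULL when not connected */
};

static void fake_sock_init(struct fake_sock *sk, int port)
{
	atomic_init(&sk->refcnt, 1);
	pthread_mutex_init(&sk->lock, NULL);
	sk->port = port;
}

static void fake_conn_init(struct fake_conn *conn, struct fake_sock *sk)
{
	atomic_flag_clear(&conn->frwd_lock);
	conn->sock = sk;
}

static void fake_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set(l))
		;	/* busy-wait, never sleeps */
	atomic_depth++;
}

static void fake_spin_unlock(atomic_flag *l)
{
	atomic_depth--;
	atomic_flag_clear(l);
}

/* May sleep, so it must never run with a spinlock held. */
static int fake_getpeername(struct fake_sock *sk, int *port)
{
	assert(atomic_depth == 0);	/* userspace analogue of "scheduling while atomic" */
	pthread_mutex_lock(&sk->lock);
	*port = sk->port;
	pthread_mutex_unlock(&sk->lock);
	return 0;
}

static int fake_conn_get_port(struct fake_conn *conn, int *port)
{
	struct fake_sock *sk;
	int rc;

	fake_spin_lock(&conn->frwd_lock);
	sk = conn->sock;
	if (!sk) {
		fake_spin_unlock(&conn->frwd_lock);
		return -ENOTCONN;
	}
	atomic_fetch_add(&sk->refcnt, 1);	/* pin the socket before unlocking */
	fake_spin_unlock(&conn->frwd_lock);

	rc = fake_getpeername(sk, port);	/* safe: no spinlock held here */
	atomic_fetch_sub(&sk->refcnt, 1);	/* drop the temporary reference */
	return rc;
}
```

A caller would initialise a fake_sock with fake_sock_init(), attach it with fake_conn_init(), and call fake_conn_get_port(). Moving the fake_getpeername() call back inside the locked region trips the assert, which is the userspace equivalent of the failure described above.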
I think I'm hitting this as well, and it can get much worse in an OpenStack environment with many iSCSI LUNs that come and go.
I have posted a patch for review by the Linux kernel SCSI subsystem maintainers.
https://marc.info/?l=linux-scsi&m=160126767316636&w=2
The patch looks fine to me. No issues starting iscsid over 20 reboots, where it would usually fail once every 4-5 starts.
To have a reference here: Mark's patch has been merged into the kernel mainline as commit bcf3a2953d36bbfb9bd44ccb3db0897d935cc485 and has also been queued for 5.8 stable, so it should be part of the next 5.8 point release.
This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.