1877345 – iscsid "scheduling while atomic" with 5.8 kernel

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	32
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-09 12:32 UTC by Marc Dionne
Modified:	2021-05-25 17:28 UTC
CC List:	20 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-05-25 17:28:26 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Related messages extracted from the systemd journal (10.11 KB, text/plain) 2020-09-09 12:32 UTC, Marc Dionne

Description Marc Dionne 2020-09-09 12:32:10 UTC

Created attachment 1714271
Related messages extracted from the systemd journal

1. Please describe the problem:

With a 5.8 kernel, iscsid crashes roughly once every 4-5 boots and the iSCSI devices fail to come up, which hangs the boot in my case because of dependencies.

This is really an upstream kernel bug that I have reported a couple of times on lkml, here: 
    https://lkml.org/lkml/2020/7/28/1085
and here:
    https://lkml.org/lkml/2020/8/31/459

It got no traction from the author of commit 1b66d253610c7 ("bpf: Add get{peer, sock}name attach types for sock_addr") or from the iSCSI maintainers; maybe someone in Fedora can help nudge the process forward.

The iscsi_sw_tcp_conn_get_param function has this:

                spin_lock_bh(&conn->session->frwd_lock);
                if (!tcp_sw_conn || !tcp_sw_conn->sock) {
                        spin_unlock_bh(&conn->session->frwd_lock);
                        return -ENOTCONN;
                }
                if (param == ISCSI_PARAM_LOCAL_PORT)
                        rc = kernel_getsockname(tcp_sw_conn->sock,
                                                (struct sockaddr *)&addr);
                else
                        rc = kernel_getpeername(tcp_sw_conn->sock,
                                                (struct sockaddr *)&addr);
                spin_unlock_bh(&conn->session->frwd_lock);

So it is calling kernel_getsockname or kernel_getpeername while holding a spinlock. Commit 1b66d253610c7 introduces BPF_CGROUP_RUN_SA_PROG_LOCK() within getpeername, which does lock_sock(sk); ... release_sock(sk); and will try to acquire a mutex. My guess is that when the mutex is not contended this works out fine, but when it is contended and there is a need to sleep, the "scheduling while atomic" error occurs.

While that commit introduces the issue, calling into something like kernel_getpeername while holding a spinlock seems a bit optimistic, so maybe this needs fixing in the iscsi caller.
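As a rough illustration of what fixing the caller could look like, here is a user-space sketch of one common pattern: take a reference to the object while the lock is held, drop the lock, and only then make the call that may sleep. This is not kernel code and not the patch that was eventually merged; every name prefixed with demo_ is hypothetical, the atomic_flag busy-wait lock merely stands in for frwd_lock, and demo_query_peer_port() stands in for a call such as kernel_getpeername().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket that carries a reference count. */
struct demo_sock {
    atomic_int refcnt;
    int peer_port;
};

/* Hypothetical stand-in for the connection; lock plays the role of frwd_lock. */
struct demo_conn {
    atomic_flag lock;
    struct demo_sock *sock;   /* NULL when not connected */
};

static void demo_spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* busy-wait: whoever holds this lock must never sleep */
}

static void demo_spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

static void demo_conn_init(struct demo_conn *c, struct demo_sock *s, int port)
{
    atomic_flag_clear(&c->lock);
    c->sock = s;
    if (s) {
        atomic_init(&s->refcnt, 1);
        s->peer_port = port;
    }
}

/* Stand-in for a call that may sleep, such as kernel_getpeername(). */
static int demo_query_peer_port(struct demo_sock *s)
{
    return s->peer_port;
}

static int demo_conn_get_peer_port(struct demo_conn *c, int *port)
{
    struct demo_sock *s;

    assert(port != NULL);

    demo_spin_lock(&c->lock);
    s = c->sock;
    if (!s) {
        demo_spin_unlock(&c->lock);
        return -ENOTCONN;
    }
    atomic_fetch_add(&s->refcnt, 1);   /* pin the socket before unlocking */
    demo_spin_unlock(&c->lock);

    *port = demo_query_peer_port(s);   /* the lock is no longer held here */

    atomic_fetch_sub(&s->refcnt, 1);   /* drop the temporary reference */
    return 0;
}
```

The point of the sketch is only the ordering: the spinlock protects the pointer lookup and the reference grab, while the potentially sleeping call happens after the unlock.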

2. What is the Version-Release number of the kernel:

5.8.6-201.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This is a kernel upstream bug that appeared with the 5.8 kernel.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Yes, it reproduces once out of every 4-5 boots, on several systems.  It has occurred with 5.8.6-201.fc32.x86_64 and also several times with custom kernels based on very recent mainline 5.9-rc kernels.  As far as I can tell the problem still exists in the mainline kernel.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Not tested, but I assume the problem is present there as well.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

I sometimes do, but this was reproduced without them.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Related journal entries attached.

Comment 1 Mark Mielke 2020-09-27 18:21:05 UTC

I think I'm hitting this as well, and it can get much worse in an OpenStack environment with many iSCSI LUNs that come and go.

Comment 2 Mark Mielke 2020-09-28 04:44:56 UTC

I have posted a patch for review by the Linux kernel SCSI subsystem maintainers.

Comment 3 Mark Mielke 2020-09-28 04:46:27 UTC

https://marc.info/?l=linux-scsi&m=160126767316636&w=2

Comment 4 Marc Dionne 2020-09-28 19:36:13 UTC

The patch looks fine to me.  No issues starting iscsid over 20 reboots, where it would usually fail roughly once in every 4-5 starts.

Comment 5 Marc Dionne 2020-10-05 11:47:07 UTC

To have a reference here, Mark's patch has been merged into the kernel mainline as commit bcf3a2953d36bbfb9bd44ccb3db0897d935cc485, and has also been queued for 5.8 stable, so it should be part of the next 5.8 point release.

Comment 6 Fedora Program Management 2021-04-29 16:56:42 UTC

This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 7 Ben Cotton 2021-05-25 17:28:26 UTC

Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
