Created attachment 1797883 [details]
logs and flamegraphs from good and bad systems

Description of problem:

When an NFS server becomes inaccessible, for example because network access
is blocked or the server is shut down, all NFS clients become overloaded.
The results on the client are:
- extreme load (up to 123 seen)
- delays of up to 40 seconds in vdsm (the entire process is blocked)
- extreme cpu usage (0 idle)
- multipath reports failed devices for all paths (blocked for many seconds?)
- ssh connections get stuck for 20-30 seconds

On an RHV system this causes all hosts in a cluster to lose access to all
storage domains (nfs mounts, iscsi luns) and become non-operational in a few
minutes - the entire cluster goes down. See bug 1975926 for details.

In top we see multiple "rpciod" kernel workers consuming up to 99% cpu time,
and from time to time processes trying to access the inaccessible mountpoint
(e.g. dd) show high cpu time while they are blocked on open() or read().

Here are example top outputs from an RHV system in this state:

top - 17:28:54 up 20:25,  3 users,  load average: 84.18, 28.47, 10.41
Tasks: 319 total,  32 running, 286 sleeping,   0 stopped,   1 zombie
%Cpu(s):  8.4 us, 82.0 sy,  0.0 ni,  0.0 id,  8.7 wa,  0.6 hi,  0.3 si,  0.0 st
MiB Mem :  31910.3 total,  13789.8 free,  14451.4 used,   3669.1 buff/cache
MiB Swap:  16116.0 total,  16116.0 free,      0.0 used.  16956.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  13404 vdsm       0 -20  411732   6716   4012 S  99.8   0.0   0:19.55 ioprocess
 577755 vdsm       0 -20  224304    960    896 D  99.0   0.0   0:19.85 dd
 577754 vdsm       0 -20  224304    972    900 D  98.8   0.0   0:19.83 dd
 577753 vdsm       0 -20  224304    960    896 D  98.5   0.0   0:19.79 dd
 577752 vdsm       0 -20  224304    920    852 D  98.2   0.0   0:19.76 dd
   3648 qemu      20   0   20.6g   7.0g  24308 D  63.4  22.6  75:30.31 qemu-kvm
 507741 root      20   0       0      0      0 R  40.5   0.0   0:20.60 kworker/u18:1+rpciod
 153691 root      20   0       0      0      0 R  37.3   0.0   0:56.77 kworker/u17:3+rpciod
 432365 root      20   0       0      0      0 R  37.3   0.0   0:28.12 kworker/u17:4+rpciod
   2420 vdsm       0 -20 5754764 146724  30652 S   2.9   0.4  53:15.09 vdsmd
     37 root      20   0       0      0      0 S   2.4   0.0   0:00.85 ksoftirqd/4

top - 17:29:10 up 20:25,  3 users,  load average: 103.47, 35.39, 12.96
Tasks: 328 total,  45 running, 281 sleeping,   0 stopped,   2 zombie
%Cpu(s):  2.9 us, 79.9 sy,  0.2 ni,  0.5 id, 15.9 wa,  0.6 hi,  0.0 si,  0.0 st
MiB Mem :  31910.3 total,  13611.4 free,  14460.5 used,   3838.4 buff/cache
MiB Swap:  16116.0 total,  16116.0 free,      0.0 used.  16947.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 432365 root      20   0       0      0      0 I  98.6   0.0   0:43.38 kworker/u17:4-rpciod
 153691 root      20   0       0      0      0 I  98.3   0.0   1:11.99 kworker/u17:3-rpciod
 577841 root      20   0       0      0      0 I  98.1   0.0   0:15.18 kworker/u17:5-rpciod
  92014 root      20   0       0      0      0 I  69.5   0.0   0:26.99 kworker/u17:1-rpciod
     36 root      rt   0       0      0      0 S  58.1   0.0   0:09.05 migration/4
 556197 root      20   0       0      0      0 R  52.2   0.0   0:28.29 kworker/u18:4+rpciod
 577854 root      20   0       0      0      0 R  52.2   0.0   0:08.11 kworker/u18:5+rpciod
 516383 root      20   0       0      0      0 R  52.1   0.0   0:08.99 kworker/u18:2+rpciod
 507741 root      20   0       0      0      0 R  51.0   0.0   0:28.50 kworker/u18:1+rpciod
  13404 vdsm       0 -20  411732   6716   4012 S  28.4   0.0   0:23.94 ioprocess
   3648 qemu      20   0   20.5g   7.0g  24308 S  21.9  22.6  75:33.70 qemu-kvm
 577745 root      20   0  386328  24600  20044 R   1.4   0.1   0:02.96 perf

I recorded perf events during one incident. In the attached flamegraph
lynx17/rhv-nfs-down/perf-kernel.svg we can see that multiple processes and
(rpciod?) kernel workers are blocked in native_queued_spin_lock_slowpath.
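The attached out.perf-folded and perf-kernel.svg files follow the standard
perf + FlameGraph workflow (scripts from
https://github.com/brendangregg/FlameGraph). The exact options used are not
recorded in this report, so the commands below are only an approximate sketch
(sample rate and duration are arbitrary):

    # sample stacks on all CPUs for 60 seconds while the mount is inaccessible
    perf record -F 99 -a -g -- sleep 60

    # fold the stacks and render the kernel flamegraph
    perf script | ./stackcollapse-perf.pl > out.perf-folded
    ./flamegraph.pl out.perf-folded > perf-kernel.svg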
Minimal reproducer without RHV:

1. Set up 2 hosts running RHEL 8.4

2. Set up an NFS server on one host with one export:

   echo '/export/test *(rw,async,anonuid=36,anongid=36,no_subtree_check)' >> /etc/exports
   exportfs -r
   firewall-cmd --add-service=nfs --permanent
   firewall-cmd --reload
   systemctl enable --now nfs-server

3. Mount the export on the second host (using the same mount options as RHV):

   mkdir /tmp/mnt
   chown vdsm:kvm /tmp/mnt
   mount -t nfs -o soft,nosharecache,timeo=100,retrans=3,nfsvers=4,minorversion=2 lynx21:/export/test /tmp/mnt

   (reproduced also without -o soft)

4. Run dd in a loop, reading 4k bytes from the mountpoint every 10 seconds
   (this simulates how RHV monitors storage). For quicker reproduction, run
   several dd processes (see the sketch after the Expected results section
   below). See the attached monitor.sh script:

   while true; do
       date
       dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
       echo
       sleep 10
   done

5. Stop the NFS server on the first host:

   systemctl stop nfs-server

6. Watch the client host.

Trouble starts after a minute or so. Load increases slowly, and every minute
or so we see one or more rpciod workers with high cpu usage, and/or dd.

In "perf top" we can see that most of the time is spent in
native_queued_spin_lock_slowpath:

   PerfTop:    9955 irqs/sec  kernel:98.5%  exact: 98.4% lost: 0/0 drop: 0/26214 [4000Hz cycles],  (all, 4 CPUs)
-------------------------------------------------------------------------------

    42.80%  [kernel]  [k] native_queued_spin_lock_slowpath
    17.39%  [sunrpc]  [k] xprt_reserve_xprt
    17.09%  [kernel]  [k] _raw_spin_lock
     4.58%  [sunrpc]  [k] xprt_prepare_transmit
     2.40%  [sunrpc]  [k] call_connect_status
     2.02%  [sunrpc]  [k] call_connect
     1.80%  [sunrpc]  [k] __rpc_execute
     1.78%  [sunrpc]  [k] rpc_sleep_on_timeout
     1.73%  [sunrpc]  [k] __rpc_sleep_on_priority_timeout
     1.66%  [kernel]  [k] nmi

See the complete output in lynx12/simulate-rhv-flow/perf-top.out.

In the attached flamegraph lynx12/simulate-rhv-flow/perf-kernel.svg we can see
that both dd and the (rpciod?) kernel worker are blocked in
native_queued_spin_lock_slowpath:

   native_queued_spin_lock_slowpath (25.34%)
   _raw_spin_lock (28.52%)
   xprt_prepare_transmit (35.75%)
   call_connect (35.75%)
   ...
   do_sys_open (37.47%)
   ...
   dd (37.55%)

   native_queued_spin_lock_slowpath (12.68%)
   _raw_spin_lock (15.14%)
   xprt_prepare_transmit (26.76%)
   call_connect (29%)
   __rpc_execute
   rpc_async_schedule
   ...
   kworker/u8:5-ev (36.90%)

I tried the same flow on an older setup we had in the lab:

   # uname -r
   4.18.0-240.5.1.el8_3.x86_64

The issue is not reproducible there. The system works as expected with an
inaccessible NFS server. See good/good.txt for details, and
good/storage-ge1-vdsm{1,2}-top.out for the expected condition in this case.

Version-Release number of selected component (if applicable):
bad:  4.18.0-305.8.1.el8_4.x86_64
good: 4.18.0-240.5.1.el8_3.x86_64

How reproducible:
100%

Actual results:
The inaccessible mount affects other mounts and multipath devices. The entire
cluster goes down.

Expected results:
Processes accessing the inaccessible mount are blocked without affecting other
processes using other mounts or multipath devices.
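As noted in step 4, running several monitors in parallel speeds up
reproduction. A minimal sketch based on the dd loop above (assumes the test
file /tmp/mnt/test already exists on the export; the count of 4 monitors is
arbitrary):

   for i in $(seq 4); do
       (
           # each monitor does a 4k direct read every 10 seconds
           while true; do
               dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
               sleep 10
           done
       ) &
   done
   wait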
Attached files:

attachment
├── good                      # kernel 4.18.0-240.5.1
│   ├── good.txt
│   ├── storage-ge1-vdsm1-top.out
│   └── storage-ge1-vdsm2-top.out
├── lynx12                    # kernel 4.18.0-305.8.1
│   ├── rhv-nfs-down          # reproduction with RHV
│   │   ├── out.perf-folded
│   │   ├── perf-kernel.svg
│   │   └── top.out
│   └── simulate-rhv-flow     # simulated RHV flow
│       ├── monitor.sh
│       ├── out.perf-folded
│       ├── perf-kernel.svg
│       └── perf-top.out
└── lynx17                    # kernel 4.18.0-305.8.1
    └── rhv-nfs-down          # reproduction with RHV
        ├── out.perf-folded
        ├── perf-kernel.svg
        └── top.out

Notes:
- lynx17 (bare metal) was the SPM host in RHV and was also running the hosted
  engine VM, so it shows higher load.
- lynx12 (bare metal) is an idle host that does nothing but storage monitoring.
- good/storage-ge1-vdsm{1,2} are RHV hosts (VMs) in an idle system. One of them
  was the SPM during the test.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4356