Bug 1979070

Summary: Inaccessible NFS server overloads clients (native_queued_spin_lock_slowpath contention?)
Product: Red Hat Enterprise Linux 8
Component: kernel
Sub component: NFS
Version: 8.4
Reporter: Nir Soffer <nsoffer>
Assignee: Scott Mayhew <smayhew>
QA Contact: JianHong Yin <jiyin>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Keywords: Triaged, ZStream
CC: acaringi, bcodding, jiyin, mgold, mtessun, nfs-maint, pelauter, rhandlin, smayhew, vjuranek, xzhou, ymankad, yoyang
Target Milestone: beta
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-4.18.0-323.el8
Cloned to: 1980613
Bug Blocks: 1975926, 1980613
Last Closed: 2021-11-09 19:23:41 UTC
Type: Bug

Attachments:
logs and flamegraphs from good and bad systems

Description Nir Soffer 2021-07-04 17:57:32 UTC
Created attachment 1797883 [details]
logs and flamegraphs from good and bad systems

Description of problem:

When the NFS server becomes inaccessible, for example because network
access is blocked or the server is shut down, all NFS clients become
overloaded.

The results on the client are (quick checks for these are sketched after
this list):
- extreme load (load average of up to 123 seen)
- delays of up to 40 seconds in vdsm (the entire process is blocked)
- extreme cpu usage (0% idle)
- multipath reports failed devices for all paths (blocked for many
  seconds?)
- ssh connections get stuck for 20-30 seconds
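
A quick way to watch for these symptoms on a client (a sketch; these
commands are not part of the original report):

    uptime           # watch the load average climb
    vmstat 5         # system cpu time and idle time
    multipath -ll    # path states; failed paths are reported here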

On a RHV system this causes all hosts in a cluster to lose access to all
storage domains (NFS mounts, iSCSI LUNs) and become non-operational
within a few minutes - the entire cluster goes down. See bug 1975926 for
details.

In top we see multiple "rpciod" kernel workers consuming up to 99% cpu
time, and from time to time processes trying to access the inaccessible
mountpoint (e.g. dd) show high cpu time while they are blocked on open()
or read().


Here is example top output from a RHV system in this state:

top - 17:28:54 up 20:25,  3 users,  load average: 84.18, 28.47, 10.41
Tasks: 319 total,  32 running, 286 sleeping,   0 stopped,   1 zombie
%Cpu(s):  8.4 us, 82.0 sy,  0.0 ni,  0.0 id,  8.7 wa,  0.6 hi,  0.3 si,  0.0 st
MiB Mem :  31910.3 total,  13789.8 free,  14451.4 used,   3669.1 buff/cache
MiB Swap:  16116.0 total,  16116.0 free,      0.0 used.  16956.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  13404 vdsm       0 -20  411732   6716   4012 S  99.8   0.0   0:19.55 ioprocess
 577755 vdsm       0 -20  224304    960    896 D  99.0   0.0   0:19.85 dd
 577754 vdsm       0 -20  224304    972    900 D  98.8   0.0   0:19.83 dd
 577753 vdsm       0 -20  224304    960    896 D  98.5   0.0   0:19.79 dd
 577752 vdsm       0 -20  224304    920    852 D  98.2   0.0   0:19.76 dd
   3648 qemu      20   0   20.6g   7.0g  24308 D  63.4  22.6  75:30.31 qemu-kvm
 507741 root      20   0       0      0      0 R  40.5   0.0   0:20.60 kworker/u18:1+rpciod
 153691 root      20   0       0      0      0 R  37.3   0.0   0:56.77 kworker/u17:3+rpciod
 432365 root      20   0       0      0      0 R  37.3   0.0   0:28.12 kworker/u17:4+rpciod
   2420 vdsm       0 -20 5754764 146724  30652 S   2.9   0.4  53:15.09 vdsmd
     37 root      20   0       0      0      0 S   2.4   0.0   0:00.85 ksoftirqd/4


top - 17:29:10 up 20:25,  3 users,  load average: 103.47, 35.39, 12.96
Tasks: 328 total,  45 running, 281 sleeping,   0 stopped,   2 zombie
%Cpu(s):  2.9 us, 79.9 sy,  0.2 ni,  0.5 id, 15.9 wa,  0.6 hi,  0.0 si,  0.0 st
MiB Mem :  31910.3 total,  13611.4 free,  14460.5 used,   3838.4 buff/cache
MiB Swap:  16116.0 total,  16116.0 free,      0.0 used.  16947.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 432365 root      20   0       0      0      0 I  98.6   0.0   0:43.38 kworker/u17:4-rpciod
 153691 root      20   0       0      0      0 I  98.3   0.0   1:11.99 kworker/u17:3-rpciod
 577841 root      20   0       0      0      0 I  98.1   0.0   0:15.18 kworker/u17:5-rpciod
  92014 root      20   0       0      0      0 I  69.5   0.0   0:26.99 kworker/u17:1-rpciod
     36 root      rt   0       0      0      0 S  58.1   0.0   0:09.05 migration/4
 556197 root      20   0       0      0      0 R  52.2   0.0   0:28.29 kworker/u18:4+rpciod
 577854 root      20   0       0      0      0 R  52.2   0.0   0:08.11 kworker/u18:5+rpciod
 516383 root      20   0       0      0      0 R  52.1   0.0   0:08.99 kworker/u18:2+rpciod
 507741 root      20   0       0      0      0 R  51.0   0.0   0:28.50 kworker/u18:1+rpciod
  13404 vdsm       0 -20  411732   6716   4012 S  28.4   0.0   0:23.94 ioprocess
   3648 qemu      20   0   20.5g   7.0g  24308 S  21.9  22.6  75:33.70 qemu-kvm
 577745 root      20   0  386328  24600  20044 R   1.4   0.1   0:02.96 perf


I recorded perf events during one incident. In the attached flamegraph
lynx17/rhv-nfs-down/perf-kernel.svg we can see that multiple processes
and (rpciod?) kernel workers are blocked in native_queued_spin_lock_slowpath.
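
For reference, a flow like the following can produce this kind of
flamegraph and the out.perf-folded files in the attachment (a sketch;
the sampling rate and duration are assumptions, and stackcollapse-perf.pl
and flamegraph.pl come from the FlameGraph scripts):

    perf record -F 99 -a -g -- sleep 30
    perf script | ./stackcollapse-perf.pl > out.perf-folded
    ./flamegraph.pl out.perf-folded > perf-kernel.svg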


Minimal reproducer without RHV:

1. Setup 2 hosts running RHEL 8.4

2. Setup NFS server on one host with one export

    echo '/export/test *(rw,async,anonuid=36,anongid=36,no_subtree_check)' >> /etc/exports
    exportfs -r
    firewall-cmd --add-service=nfs --permanent
    firewall-cmd --reload
    systemctl enable --now nfs-server
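
To confirm the export is visible from the client before mounting, one can
run (a sketch; lynx21 is the server name used in the mount command below):

    showmount -e lynx21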

3. Mount the export on the second host (using the same mount options as RHV):

    mkdir /tmp/mnt
    chown vdsm:kvm /tmp/mnt
    mount -t nfs -o soft,nosharecache,timeo=100,retrans=3,nfsvers=4,minorversion=2 lynx21:/export/test /tmp/mnt

(reproduced also without -o soft)
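
To confirm the effective mount options on the client, something like this
can be used (a sketch):

    nfsstat -m          # lists NFS mounts with their effective options
    findmnt /tmp/mnt    # alternative view of the mount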

4. Run dd in a loop, reading 4k bytes from the mountpoint every 10
  seconds (this simulates how RHV monitors storage). For quicker
  reproduction, run several dd processes (a parallel variant is sketched
  after the loop below). See the attached monitor.sh script:

    while true; do
        date
        dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
        echo
        sleep 10
    done
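
The parallel variant mentioned above could look like this (a sketch, not
the attached monitor.sh):

    for i in $(seq 4); do
        while true; do
            dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
            sleep 10
        done &
    done
    wait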

5. Stop the NFS server on the first host

    systemctl stop nfs-server
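
Instead of stopping the service, the server can also be made inaccessible
by blocking network access, as noted above; one way to do that on the
server (a sketch):

    firewall-cmd --remove-service=nfs    # runtime only; --reload restores it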

6. Watch the client host.

Trouble starts after a minute or so. The load increases slowly, and every
minute or so we see one or more rpciod workers, and/or dd processes, with
high cpu usage.
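
One way to watch the client while this happens (a sketch; the attachment
contains top.out and perf-top.out, but the exact invocations here are
assumptions):

    top -b -d 5 > top.out &    # record top output in batch mode
    perf top                   # live view of where kernel time is spent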

In "perf top" we can see that most of the time is spent in
native_queued_spin_lock_slowpath:

   PerfTop:    9955 irqs/sec  kernel:98.5%  exact: 98.4% lost: 0/0 drop: 0/26214 [4000Hz cycles],  (all, 4 CPUs)
-------------------------------------------------------------------------------

    42.80%  [kernel]          [k] native_queued_spin_lock_slowpath
    17.39%  [sunrpc]          [k] xprt_reserve_xprt
    17.09%  [kernel]          [k] _raw_spin_lock
     4.58%  [sunrpc]          [k] xprt_prepare_transmit
     2.40%  [sunrpc]          [k] call_connect_status
     2.02%  [sunrpc]          [k] call_connect
     1.80%  [sunrpc]          [k] __rpc_execute
     1.78%  [sunrpc]          [k] rpc_sleep_on_timeout
     1.73%  [sunrpc]          [k] __rpc_sleep_on_priority_timeout
     1.66%  [kernel]          [k] nmi

See complete output in lynx12/simulate-rhv-flow/perf-top.out

In the attached flamegraph lynx12/simulate-rhv-flow/perf-kernel.svg we
can see that both dd and the (rpciod?) kernel workers are blocked on
native_queued_spin_lock_slowpath:

    native_queued_spin_lock_slowpath (25.34%)
    _raw_spin_lock (28.52%)
    xprt_prepare_transmit (35.75%)
    call_connect (35.75%)
    ...
    do_sys_open (37.47%)
    ...
    dd (37.55%)


    native_queued_spin_lock_slowpath (12.68%)
    _raw_spin_lock (15.14%)
    xprt_prepare_transmit (26.76%)
    call_connect (29%)
    __rpc_execute
    rpc_async_schedule
    ...
    kworker/u8:5-ev (36.90%)


I tried the same flow on an older setup we had in the lab:
# uname -r
4.18.0-240.5.1.el8_3.x86_64

The issue is not reproducible there. The system works as expected with an
inaccessible NFS server. See good/good.txt for details, and
good/storage-ge1-vdsm{1,2}-top.out for the expected condition in this
case.


Version-Release number of selected component (if applicable):
bad: 4.18.0-305.8.1.el8_4.x86_64
good: 4.18.0-240.5.1.el8_3.x86_64


How reproducible:
100%

Actual results:
The inaccessible mount affects other mounts and multipath devices. The
entire cluster goes down.

Expected results:
Processes accessing the inaccessible mount are blocked without affecting
other processes using other mounts or multipath devices.


Attached files:

attachment
├── good                            # kernel 4.18.0-240.5.1
│   ├── good.txt
│   ├── storage-ge1-vdsm1-top.out
│   └── storage-ge1-vdsm2-top.out
├── lynx12                          # kernel 4.18.0-305.8.1
│   ├── rhv-nfs-down                # reproduction with RHV
│   │   ├── out.perf-folded
│   │   ├── perf-kernel.svg
│   │   └── top.out
│   └── simulate-rhv-flow           # simulated RHV flow
│       ├── monitor.sh
│       ├── out.perf-folded
│       ├── perf-kernel.svg
│       └── perf-top.out
└── lynx17                          # kernel 4.18.0-305.8.1
    └── rhv-nfs-down                # reproduction with RHV
        ├── out.perf-folded
        ├── perf-kernel.svg
        └── top.out

- lynx17 (bare metal) was the SPM host in RHV, and was also running the hosted
  engine VM, so it shows higher load.

- lynx12 (bare metal) is an idle host that does nothing but storage monitoring.

- good/storage-ge1-vdsm{1,2} are RHV hosts (vms) in an idle system. One of them
  was the SPM during the test.

Comment 29 errata-xmlrpc 2021-11-09 19:23:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4356