Created attachment 1797883
logs and flamegraphs from good and bad systems
Description of problem:
When the NFS server becomes inaccessible, for example by blocking network
access or shutting down the server, all NFS clients become overloaded.
The results on the client are:
- extreme load (up to 123 seen)
- delays of up to 40 seconds in vdsm (the entire process is blocked)
- extreme cpu usage (0 idle)
- multipath reports failed devices for all paths (blocked for many
seconds?)
- ssh connections get stuck for 20-30 seconds
On an RHV system this causes all hosts in a cluster to lose access to all
storage domains (NFS mounts, iSCSI LUNs) and become non-operational within
a few minutes - the entire cluster goes down. See bug 1975926 for
details.
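For reference, these symptoms can be checked on a client with standard tools
(a quick sketch, not taken from the attached logs; <other-host> is a
placeholder):
uptime                            # load average
multipath -ll | grep -c failed    # count of failed paths
time ssh <other-host> true        # ssh delay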
In top we see multiple "rpciod" kernel workers consuming up to 99% cpu
time, and from time to time processes trying to access the inaccessible
mountpoint (e.g. dd) show high cpu time while they are blocked on open()
or read().
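To confirm where such a blocked process is stuck, its kernel stack can be
dumped (as root; <pid> is a placeholder for the blocked dd process):
cat /proc/<pid>/stack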
Here are example top outputs from an RHV system in this state:
top - 17:28:54 up 20:25, 3 users, load average: 84.18, 28.47, 10.41
Tasks: 319 total, 32 running, 286 sleeping, 0 stopped, 1 zombie
%Cpu(s): 8.4 us, 82.0 sy, 0.0 ni, 0.0 id, 8.7 wa, 0.6 hi, 0.3 si, 0.0 st
MiB Mem : 31910.3 total, 13789.8 free, 14451.4 used, 3669.1 buff/cache
MiB Swap: 16116.0 total, 16116.0 free, 0.0 used. 16956.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13404 vdsm 0 -20 411732 6716 4012 S 99.8 0.0 0:19.55 ioprocess
577755 vdsm 0 -20 224304 960 896 D 99.0 0.0 0:19.85 dd
577754 vdsm 0 -20 224304 972 900 D 98.8 0.0 0:19.83 dd
577753 vdsm 0 -20 224304 960 896 D 98.5 0.0 0:19.79 dd
577752 vdsm 0 -20 224304 920 852 D 98.2 0.0 0:19.76 dd
3648 qemu 20 0 20.6g 7.0g 24308 D 63.4 22.6 75:30.31 qemu-kvm
507741 root 20 0 0 0 0 R 40.5 0.0 0:20.60 kworker/u18:1+rpciod
153691 root 20 0 0 0 0 R 37.3 0.0 0:56.77 kworker/u17:3+rpciod
432365 root 20 0 0 0 0 R 37.3 0.0 0:28.12 kworker/u17:4+rpciod
2420 vdsm 0 -20 5754764 146724 30652 S 2.9 0.4 53:15.09 vdsmd
37 root 20 0 0 0 0 S 2.4 0.0 0:00.85 ksoftirqd/4
top - 17:29:10 up 20:25, 3 users, load average: 103.47, 35.39, 12.96
Tasks: 328 total, 45 running, 281 sleeping, 0 stopped, 2 zombie
%Cpu(s): 2.9 us, 79.9 sy, 0.2 ni, 0.5 id, 15.9 wa, 0.6 hi, 0.0 si, 0.0 st
MiB Mem : 31910.3 total, 13611.4 free, 14460.5 used, 3838.4 buff/cache
MiB Swap: 16116.0 total, 16116.0 free, 0.0 used. 16947.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
432365 root 20 0 0 0 0 I 98.6 0.0 0:43.38 kworker/u17:4-rpciod
153691 root 20 0 0 0 0 I 98.3 0.0 1:11.99 kworker/u17:3-rpciod
577841 root 20 0 0 0 0 I 98.1 0.0 0:15.18 kworker/u17:5-rpciod
92014 root 20 0 0 0 0 I 69.5 0.0 0:26.99 kworker/u17:1-rpciod
36 root rt 0 0 0 0 S 58.1 0.0 0:09.05 migration/4
556197 root 20 0 0 0 0 R 52.2 0.0 0:28.29 kworker/u18:4+rpciod
577854 root 20 0 0 0 0 R 52.2 0.0 0:08.11 kworker/u18:5+rpciod
516383 root 20 0 0 0 0 R 52.1 0.0 0:08.99 kworker/u18:2+rpciod
507741 root 20 0 0 0 0 R 51.0 0.0 0:28.50 kworker/u18:1+rpciod
13404 vdsm 0 -20 411732 6716 4012 S 28.4 0.0 0:23.94 ioprocess
3648 qemu 20 0 20.5g 7.0g 24308 S 21.9 22.6 75:33.70 qemu-kvm
577745 root 20 0 386328 24600 20044 R 1.4 0.1 0:02.96 perf
I recorded perf events during one incident. In the attached flamegraph
lynx17/rhv-nfs-down/perf-kernel.svg we can see that multiple processes
and (rpciod?) kernel workers are blocked in native_queued_spin_lock_slowpath.
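The attached out.perf-folded and perf-kernel.svg names match the usual perf +
FlameGraph workflow, roughly like this (a sketch assuming the
stackcollapse-perf.pl and flamegraph.pl scripts from the FlameGraph project,
not necessarily the exact commands used here):
perf record -F 99 -a -g -- sleep 60
perf script > out.perf
stackcollapse-perf.pl out.perf > out.perf-folded
flamegraph.pl out.perf-folded > perf-kernel.svg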
Minimal reproducer without RHV:
1. Set up 2 hosts running RHEL 8.4
2. Set up an NFS server on one host with one export:
echo '/export/test *(rw,async,anonuid=36,anongid=36,no_subtree_check)' >> /etc/exports
exportfs -r
firewall-cmd --add-service=nfs --permanent
firewall-cmd --reload
systemctl enable --now nfs-server
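(Optional sanity check, not in the original steps: verify the export is
visible from the client, where lynx21 is the NFS server host used below.)
showmount -e lynx21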
3. Mount the export on the second host (using the same mount options as RHV):
mkdir /tmp/mnt
chown vdsm:kvm /tmp/mnt
mount -t nfs -o soft,nosharecache,timeo=100,retrans=3,nfsvers=4,minorversion=2 lynx21:/export/test /tmp/mnt
(also reproduced without -o soft)
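(Optional: verify the mount options actually in effect on the client.)
nfsstat -m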
4. Run dd in a loop, reading 4k bytes from the mountpoint every 10
seconds (this simulates how RHV monitors storage). For quicker reproduction,
run several dd processes (a parallel variant is sketched after the loop).
See the attached monitor.sh script.
while true; do
date
dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
echo
sleep 10
done
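A sketch for running several monitors in parallel for quicker reproduction
(the attached monitor.sh may differ):
for i in 1 2 3 4; do
    (
        while true; do
            dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
            sleep 10
        done
    ) &
done
wait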
5. Stop the NFS server on the first host:
systemctl stop nfs-server
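Alternatively, the server can be made inaccessible by blocking NFS traffic
instead of stopping the service, e.g. on the server (a sketch; adjust to the
local firewall setup):
iptables -I INPUT -p tcp --dport 2049 -j DROP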
6. Watch the client host.
Trouble starts after a minute or so. Load increases slowly, and every
minute or so we see one or more rpciod workers with high CPU usage,
and/or dd processes.
In "perf top" we can see that most of the time is spent in
native_queued_spin_lock_slowpath:
PerfTop: 9955 irqs/sec kernel:98.5% exact: 98.4% lost: 0/0 drop: 0/26214 [4000Hz cycles], (all, 4 CPUs)
-------------------------------------------------------------------------------
42.80% [kernel] [k] native_queued_spin_lock_slowpath
17.39% [sunrpc] [k] xprt_reserve_xprt
17.09% [kernel] [k] _raw_spin_lock
4.58% [sunrpc] [k] xprt_prepare_transmit
2.40% [sunrpc] [k] call_connect_status
2.02% [sunrpc] [k] call_connect
1.80% [sunrpc] [k] __rpc_execute
1.78% [sunrpc] [k] rpc_sleep_on_timeout
1.73% [sunrpc] [k] __rpc_sleep_on_priority_timeout
1.66% [kernel] [k] nmi
See complete output in lynx12/simulate-rhv-flow/perf-top.out
In the attached flamegraph lynx12/simulate-rhv-flow/perf-kernel.svg we
can see that both dd and an (rpciod?) kernel worker are blocked on
native_queued_spin_lock_slowpath:
native_queued_spin_lock_slowpath (25.34%)
_raw_spin_lock (28.52%)
xprt_prepare_transmit (35.75%)
call_connect (35.75%)
...
do_sys_open (37.47%)
...
dd (37.55%)
native_queued_spin_lock_slowpath (12.68%)
_raw_spin_lock (15.14%)
xprt_prepare_transmit (26.76%)
call_connect (29%)
__rpc_execute
rpc_async_schedule
...
kworker/u8:5-ev (36.90%)
I tried the same flow on an older setup we had in the lab:
# uname -r
4.18.0-240.5.1.el8_3.x86_64
The issue is not reproducible there. The system works as expected with an
inaccessible NFS server. See good/good.txt for details, and
good/storage-ge1-vdsm{1,2}-top.out for the expected condition in this
case.
Version-Release number of selected component (if applicable):
bad: 4.18.0-305.8.1.el8_4.x86_64
good: 4.18.0-240.5.1.el8_3.x86_64
How reproducible:
100%
Actual results:
The inaccessible mount affects other mounts and multipath devices. The
entire cluster goes down.
Expected results:
Processes accessing the inaccessible mount are blocked without affecting
other processes using other mounts or multipath devices.
Attached files:
attachment
├── good                      # kernel 4.18.0-240.5.1
│   ├── good.txt
│   ├── storage-ge1-vdsm1-top.out
│   └── storage-ge1-vdsm2-top.out
├── lynx12                    # kernel 4.18.0-305.8.1
│   ├── rhv-nfs-down          # reproduction with RHV
│   │   ├── out.perf-folded
│   │   ├── perf-kernel.svg
│   │   └── top.out
│   └── simulate-rhv-flow     # simulated RHV flow
│       ├── monitor.sh
│       ├── out.perf-folded
│       ├── perf-kernel.svg
│       └── perf-top.out
└── lynx17                    # kernel 4.18.0-305.8.1
    └── rhv-nfs-down          # reproduction with RHV
        ├── out.perf-folded
        ├── perf-kernel.svg
        └── top.out
- lynx17 (bare metal) was the SPM host in RHV, and was also running the hosted
engine VM, so it shows higher load.
- lynx12 (bare metal) is an idle host that does nothing but storage monitoring.
- good/storage-ge1-vdsm{1,2} are RHV hosts (VMs) in an idle system. One of them
was the SPM during the test.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: kernel security, bug fix, and
enhancement update), and where to find the updated files, follow the link
below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:4356