Created attachment 1797883 [details]
logs and flamegraphs from good and bad systems

Description of problem:

When an NFS server becomes inaccessible, for example because network access
is blocked or the server is shut down, all NFS clients become overloaded.
The results on the client are:
- extreme load (up to 123 seen)
- delays of up to 40 seconds in vdsm (the entire process is blocked)
- extreme cpu usage (0 idle)
- multipath reports failed devices for all paths (blocked for many seconds?)
- ssh connections get stuck for 20-30 seconds

On an RHV system this causes all hosts in a cluster to lose access to all
storage domains (nfs mounts, iscsi luns) and become non-operational in a few
minutes - the entire cluster goes down. See bug 1975926 for details.

In top we see multiple "rpciod" kernel workers consuming up to 99% cpu time,
and from time to time processes trying to access the inaccessible mountpoint
(e.g. dd) show high cpu time while they are blocked on open() or read().

Here are example top outputs from an RHV system in this state:

top - 17:28:54 up 20:25,  3 users,  load average: 84.18, 28.47, 10.41
Tasks: 319 total,  32 running, 286 sleeping,   0 stopped,   1 zombie
%Cpu(s):  8.4 us, 82.0 sy,  0.0 ni,  0.0 id,  8.7 wa,  0.6 hi,  0.3 si,  0.0 st
MiB Mem :  31910.3 total,  13789.8 free,  14451.4 used,   3669.1 buff/cache
MiB Swap:  16116.0 total,  16116.0 free,      0.0 used.  16956.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  13404 vdsm       0 -20  411732   6716   4012 S  99.8   0.0   0:19.55 ioprocess
 577755 vdsm       0 -20  224304    960    896 D  99.0   0.0   0:19.85 dd
 577754 vdsm       0 -20  224304    972    900 D  98.8   0.0   0:19.83 dd
 577753 vdsm       0 -20  224304    960    896 D  98.5   0.0   0:19.79 dd
 577752 vdsm       0 -20  224304    920    852 D  98.2   0.0   0:19.76 dd
   3648 qemu      20   0   20.6g   7.0g  24308 D  63.4  22.6  75:30.31 qemu-kvm
 507741 root      20   0       0      0      0 R  40.5   0.0   0:20.60 kworker/u18:1+rpciod
 153691 root      20   0       0      0      0 R  37.3   0.0   0:56.77 kworker/u17:3+rpciod
 432365 root      20   0       0      0      0 R  37.3   0.0   0:28.12 kworker/u17:4+rpciod
   2420 vdsm       0 -20 5754764 146724  30652 S   2.9   0.4  53:15.09 vdsmd
     37 root      20   0       0      0      0 S   2.4   0.0   0:00.85 ksoftirqd/4

top - 17:29:10 up 20:25,  3 users,  load average: 103.47, 35.39, 12.96
Tasks: 328 total,  45 running, 281 sleeping,   0 stopped,   2 zombie
%Cpu(s):  2.9 us, 79.9 sy,  0.2 ni,  0.5 id, 15.9 wa,  0.6 hi,  0.0 si,  0.0 st
MiB Mem :  31910.3 total,  13611.4 free,  14460.5 used,   3838.4 buff/cache
MiB Swap:  16116.0 total,  16116.0 free,      0.0 used.  16947.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 432365 root      20   0       0      0      0 I  98.6   0.0   0:43.38 kworker/u17:4-rpciod
 153691 root      20   0       0      0      0 I  98.3   0.0   1:11.99 kworker/u17:3-rpciod
 577841 root      20   0       0      0      0 I  98.1   0.0   0:15.18 kworker/u17:5-rpciod
  92014 root      20   0       0      0      0 I  69.5   0.0   0:26.99 kworker/u17:1-rpciod
     36 root      rt   0       0      0      0 S  58.1   0.0   0:09.05 migration/4
 556197 root      20   0       0      0      0 R  52.2   0.0   0:28.29 kworker/u18:4+rpciod
 577854 root      20   0       0      0      0 R  52.2   0.0   0:08.11 kworker/u18:5+rpciod
 516383 root      20   0       0      0      0 R  52.1   0.0   0:08.99 kworker/u18:2+rpciod
 507741 root      20   0       0      0      0 R  51.0   0.0   0:28.50 kworker/u18:1+rpciod
  13404 vdsm       0 -20  411732   6716   4012 S  28.4   0.0   0:23.94 ioprocess
   3648 qemu      20   0   20.5g   7.0g  24308 S  21.9  22.6  75:33.70 qemu-kvm
 577745 root      20   0  386328  24600  20044 R   1.4   0.1   0:02.96 perf

I recorded perf events during one incident. In the attached flamegraph
lynx17/rhv-nfs-down/perf-kernel.svg we can see that multiple processes and
(rpciod?) kernel workers are blocked in native_queued_spin_lock_slowpath.
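The attached out.perf-folded and perf-kernel.svg files follow the standard
perf + FlameGraph workflow (scripts from
https://github.com/brendangregg/FlameGraph). The exact options used are not
recorded in this report, so the commands below are only an approximate sketch
(sample rate and duration are arbitrary):

    # sample stacks on all CPUs for 60 seconds while the mount is inaccessible
    perf record -F 99 -a -g -- sleep 60

    # fold the stacks and render the kernel flamegraph
    perf script | ./stackcollapse-perf.pl > out.perf-folded
    ./flamegraph.pl out.perf-folded > perf-kernel.svg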
Minimal reproducer without RHV:

1. Set up 2 hosts running RHEL 8.4

2. Set up an NFS server on one host with one export:

   echo '/export/test *(rw,async,anonuid=36,anongid=36,no_subtree_check)' >> /etc/exports
   exportfs -r
   firewall-cmd --add-service=nfs --permanent
   firewall-cmd --reload
   systemctl enable --now nfs-server

3. Mount the export on the second host (using the same mount options as RHV):

   mkdir /tmp/mnt
   chown vdsm:kvm /tmp/mnt
   mount -t nfs -o soft,nosharecache,timeo=100,retrans=3,nfsvers=4,minorversion=2 lynx21:/export/test /tmp/mnt

   (reproduced also without -o soft)

4. Run dd in a loop, reading 4k bytes from the mountpoint every 10 seconds
   (this simulates how RHV monitors storage). For quicker reproduction, run
   several dd processes (see the sketch after the Expected results section
   below). See the attached monitor.sh script:

   while true; do
       date
       dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
       echo
       sleep 10
   done

5. Stop the NFS server on the first host:

   systemctl stop nfs-server

6. Watch the client host.

Trouble starts after a minute or so. Load increases slowly, and every minute
or so we see one or more rpciod workers with high cpu usage, and/or dd.

In "perf top" we can see that most of the time is spent in
native_queued_spin_lock_slowpath:

   PerfTop:    9955 irqs/sec  kernel:98.5%  exact: 98.4% lost: 0/0 drop: 0/26214 [4000Hz cycles],  (all, 4 CPUs)
-------------------------------------------------------------------------------

    42.80%  [kernel]  [k] native_queued_spin_lock_slowpath
    17.39%  [sunrpc]  [k] xprt_reserve_xprt
    17.09%  [kernel]  [k] _raw_spin_lock
     4.58%  [sunrpc]  [k] xprt_prepare_transmit
     2.40%  [sunrpc]  [k] call_connect_status
     2.02%  [sunrpc]  [k] call_connect
     1.80%  [sunrpc]  [k] __rpc_execute
     1.78%  [sunrpc]  [k] rpc_sleep_on_timeout
     1.73%  [sunrpc]  [k] __rpc_sleep_on_priority_timeout
     1.66%  [kernel]  [k] nmi

See the complete output in lynx12/simulate-rhv-flow/perf-top.out.

In the attached flamegraph lynx12/simulate-rhv-flow/perf-kernel.svg we can see
that both dd and the (rpciod?) kernel worker are blocked in
native_queued_spin_lock_slowpath:

   native_queued_spin_lock_slowpath (25.34%)
   _raw_spin_lock (28.52%)
   xprt_prepare_transmit (35.75%)
   call_connect (35.75%)
   ...
   do_sys_open (37.47%)
   ...
   dd (37.55%)

   native_queued_spin_lock_slowpath (12.68%)
   _raw_spin_lock (15.14%)
   xprt_prepare_transmit (26.76%)
   call_connect (29%)
   __rpc_execute
   rpc_async_schedule
   ...
   kworker/u8:5-ev (36.90%)

I tried the same flow on an older setup we had in the lab:

   # uname -r
   4.18.0-240.5.1.el8_3.x86_64

The issue is not reproducible there. The system works as expected with an
inaccessible NFS server. See good/good.txt for details, and
good/storage-ge1-vdsm{1,2}-top.out for the expected condition in this case.

Version-Release number of selected component (if applicable):
bad:  4.18.0-305.8.1.el8_4.x86_64
good: 4.18.0-240.5.1.el8_3.x86_64

How reproducible:
100%

Actual results:
The inaccessible mount affects other mounts and multipath devices. The entire
cluster goes down.

Expected results:
Processes accessing the inaccessible mount are blocked without affecting other
processes using other mounts or multipath devices.
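As noted in step 4, running several monitors in parallel speeds up
reproduction. A minimal sketch based on the dd loop above (assumes the test
file /tmp/mnt/test already exists on the export; the count of 4 monitors is
arbitrary):

   for i in $(seq 4); do
       (
           # each monitor does a 4k direct read every 10 seconds
           while true; do
               dd if=/tmp/mnt/test bs=4096 count=1 iflag=direct of=/dev/null
               sleep 10
           done
       ) &
   done
   wait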
Attached files:

attachment
├── good                      # kernel 4.18.0-240.5.1
│   ├── good.txt
│   ├── storage-ge1-vdsm1-top.out
│   └── storage-ge1-vdsm2-top.out
├── lynx12                    # kernel 4.18.0-305.8.1
│   ├── rhv-nfs-down          # reproduction with RHV
│   │   ├── out.perf-folded
│   │   ├── perf-kernel.svg
│   │   └── top.out
│   └── simulate-rhv-flow     # simulated RHV flow
│       ├── monitor.sh
│       ├── out.perf-folded
│       ├── perf-kernel.svg
│       └── perf-top.out
└── lynx17                    # kernel 4.18.0-305.8.1
    └── rhv-nfs-down          # reproduction with RHV
        ├── out.perf-folded
        ├── perf-kernel.svg
        └── top.out

Notes:
- lynx17 (bare metal) was the SPM host in RHV and was also running the hosted
  engine VM, so it shows higher load.
- lynx12 (bare metal) is an idle host that does nothing but storage monitoring.
- good/storage-ge1-vdsm{1,2} are RHV hosts (VMs) in an idle system. One of them
  was the SPM during the test.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4356