Bug 1975926
| Summary: | hosts become 'NonOperational' after disconnecting the storage server's network |
|---|---|
| Product: | [oVirt] ovirt-engine |
| Component: | BLL.Storage |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Version: | 4.4.7.3 |
| Target Milestone: | ovirt-4.4.8 |
| Target Release: | 4.4.8.3 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | michal <mgold> |
| Assignee: | Nir Soffer <nsoffer> |
| QA Contact: | michal <mgold> |
| CC: | aefrat, bugs, bzlotnik, dagur, eshenitz, lsvaty, mburman, michal.skrivanek, mlehrer, nsoffer, pelauter, sfishbai |
| Keywords: | Regression, ZStream |
| Flags: | dagur: needinfo+, pm-rhel: ovirt-4.4+, michal.skrivanek: blocker-, lsvaty: exception+ |
| Last Closed: | 2021-08-23 07:52:46 UTC |
| Type: | Bug |
| oVirt Team: | Storage |
| Bug Depends On: | 1979070 |
Michal, can you please describe the environment you are using? Is there only one host in it? If a host cannot see a storage domain while all other hosts can see it, the host becomes 'NonOperational' because the Engine assumes that there is a problem with the host. Please provide the engine logs as well.

I have an environment with 3 hosts and 4 available storage domains. I can't provide engine logs because the environment collapsed after this scenario.

Creating a storage server on a vm which is part of the setup is not a real-world use case and we don't care about it. If you used this flow in an automated test, the test is bad and should be removed from the test suite. Please show how you reproduce this with the storage server running on another host. You can use a vm, but not a vm managed by RHV. Usually these tests are done by blocking access to the storage server using iptables. I have also tested in the past by shutting down the storage server vm.

I reproduced this without hosted engine, with an NFS server on a separate host.
Environment tested:
- 2 hosts running RHEL 8.5 nightly (updated last week)
- 2 iscsi storage domains served by server "storage"
- 2 nfs storage domains served by server "storage"
- 1 nfs storage domain served by server "storage2"
- 1 ovirt vm with disk on iscsi storage domain on server "storage"
- 1 ovirt vm with disk on nfs storage domain on server "storage2"
- all hosts are vms on my laptop
1. Start top in batch mode on both hosts
2. Shut down the server "storage2"
3. After 1-2 minutes, both hosts become non-operational
4. Connect to the hosts using ssh; both show very high load (50-60)
5. Wait 10-15 minutes
6. Start up server "storage2"
7. After a few minutes the system goes back to its normal state
We expect that:
- the vm with a disk on server "storage2" will be paused
- the nfs storage domain on server "storage2" becomes inactive
- both hosts remain up
- the vm with a disk on server "storage" runs normally
In the top output on both hosts we see that the load average increased to very high values
while storage server "storage2" was down:
$ grep 'load average' host4-top.out
top - 19:14:45 up 1:00, 1 user, load average: 0.38, 14.22, 19.55
top - 19:14:55 up 1:00, 1 user, load average: 0.32, 13.75, 19.34
top - 19:15:05 up 1:01, 1 user, load average: 1.05, 13.47, 19.19
top - 19:15:15 up 1:01, 2 users, load average: 2.12, 13.29, 19.06
top - 19:15:25 up 1:01, 2 users, load average: 3.18, 13.14, 18.96
top - 19:15:35 up 1:01, 2 users, load average: 4.07, 13.01, 18.85
top - 19:16:02 up 1:02, 2 users, load average: 30.56, 18.24, 20.41
top - 19:16:25 up 1:02, 2 users, load average: 50.71, 23.72, 22.15
top - 19:16:35 up 1:02, 2 users, load average: 43.75, 23.12, 21.97
top - 19:16:59 up 1:03, 2 users, load average: 56.04, 27.45, 23.42
top - 19:17:25 up 1:03, 2 users, load average: 67.60, 32.21, 25.09
top - 19:17:35 up 1:03, 2 users, load average: 58.44, 31.41, 24.91
top - 19:17:45 up 1:03, 2 users, load average: 53.88, 31.31, 24.94
top - 19:17:55 up 1:03, 2 users, load average: 57.99, 32.93, 25.54
top - 19:18:25 up 1:04, 2 users, load average: 71.07, 38.24, 27.53
top - 19:18:35 up 1:04, 2 users, load average: 62.09, 37.39, 27.37
top - 19:18:45 up 1:04, 2 users, load average: 56.00, 36.90, 27.31
top - 19:18:55 up 1:04, 2 users, load average: 48.69, 35.97, 27.11
top - 19:19:05 up 1:05, 2 users, load average: 42.74, 35.11, 26.93
top - 19:19:15 up 1:05, 2 users, load average: 47.71, 36.43, 27.44
top - 19:19:25 up 1:05, 2 users, load average: 52.06, 37.73, 27.96
top - 19:19:35 up 1:05, 2 users, load average: 53.24, 38.46, 28.31
top - 19:19:45 up 1:05, 2 users, load average: 53.82, 39.07, 28.61
top - 19:19:55 up 1:05, 2 users, load average: 47.06, 38.11, 28.41
top - 19:20:05 up 1:06, 2 users, load average: 41.50, 37.22, 28.23
top - 19:20:30 up 1:06, 2 users, load average: 57.07, 41.13, 29.76
top - 19:20:52 up 1:06, 2 users, load average: 62.28, 43.18, 30.66
top - 19:21:02 up 1:07, 2 users, load average: 61.09, 43.59, 30.93
top - 19:21:12 up 1:07, 2 users, load average: 53.57, 42.55, 30.73
top - 19:21:22 up 1:07, 2 users, load average: 49.02, 41.94, 30.66
top - 19:21:32 up 1:07, 2 users, load average: 43.89, 41.08, 30.50
top - 19:21:42 up 1:07, 2 users, load average: 38.84, 40.09, 30.29
top - 19:21:52 up 1:07, 2 users, load average: 34.71, 39.16, 30.09
top - 19:22:02 up 1:08, 2 users, load average: 30.98, 38.22, 29.88
top - 19:22:12 up 1:08, 2 users, load average: 27.61, 37.26, 29.66
top - 19:22:22 up 1:08, 2 users, load average: 25.14, 36.41, 29.46
top - 19:22:32 up 1:08, 2 users, load average: 22.75, 35.52, 29.25
top - 19:22:43 up 1:08, 2 users, load average: 20.87, 34.70, 29.05
top - 19:22:53 up 1:08, 2 users, load average: 19.51, 33.95, 28.86
top - 19:23:03 up 1:09, 2 users, load average: 17.75, 33.10, 28.64
top - 19:23:13 up 1:09, 2 users, load average: 23.35, 33.79, 28.91
top - 19:23:23 up 1:09, 2 users, load average: 24.85, 33.79, 28.97
top - 19:23:33 up 1:09, 2 users, load average: 25.04, 33.53, 28.93
top - 19:23:43 up 1:09, 1 user, load average: 24.84, 33.22, 28.88
top - 19:23:53 up 1:09, 1 user, load average: 21.89, 32.30, 28.63
top - 19:24:03 up 1:10, 1 user, load average: 25.07, 32.63, 28.77
top - 19:24:13 up 1:10, 1 user, load average: 30.53, 33.55, 29.11
top - 19:24:23 up 1:10, 1 user, load average: 27.62, 32.82, 28.92
top - 19:24:33 up 1:10, 1 user, load average: 29.68, 33.08, 29.05
top - 19:24:43 up 1:10, 1 user, load average: 38.34, 34.82, 29.66
top - 19:24:53 up 1:10, 1 user, load average: 43.98, 36.16, 30.15
top - 19:25:03 up 1:11, 1 user, load average: 40.15, 35.59, 30.03
top - 19:25:13 up 1:11, 1 user, load average: 37.35, 35.20, 29.99
top - 19:25:23 up 1:11, 1 user, load average: 33.28, 34.40, 29.79
top - 19:25:33 up 1:11, 1 user, load average: 28.79, 33.40, 29.51
top - 19:25:43 up 1:11, 1 user, load average: 27.44, 32.96, 29.41
top - 19:25:53 up 1:11, 1 user, load average: 26.98, 32.68, 29.36
top - 19:26:03 up 1:12, 1 user, load average: 22.91, 31.62, 29.05
top - 19:26:13 up 1:12, 1 user, load average: 19.39, 30.58, 28.74
top - 19:26:23 up 1:12, 1 user, load average: 16.48, 29.59, 28.43
top - 19:26:33 up 1:12, 1 user, load average: 13.95, 28.61, 28.13
top - 19:26:43 up 1:12, 1 user, load average: 11.87, 27.69, 27.83
top - 19:26:53 up 1:12, 1 user, load average: 10.05, 26.77, 27.53
top - 19:27:03 up 1:13, 1 user, load average: 8.50, 25.89, 27.24
top - 19:27:13 up 1:13, 1 user, load average: 7.20, 25.04, 26.94
top - 19:27:23 up 1:13, 1 user, load average: 6.09, 24.21, 26.66
top - 19:27:33 up 1:13, 1 user, load average: 5.15, 23.42, 26.37
top - 19:27:43 up 1:13, 1 user, load average: 4.36, 22.64, 26.09
top - 19:27:53 up 1:13, 1 user, load average: 3.69, 21.90, 25.81
top - 19:28:03 up 1:14, 2 users, load average: 3.20, 21.19, 25.53
top - 19:28:13 up 1:14, 2 users, load average: 2.94, 20.54, 25.28
top - 19:28:23 up 1:14, 2 users, load average: 2.48, 19.87, 25.01
top - 19:28:33 up 1:14, 2 users, load average: 2.10, 19.21, 24.74
top - 19:28:43 up 1:14, 2 users, load average: 1.78, 18.58, 24.47
top - 19:28:53 up 1:14, 2 users, load average: 1.50, 17.97, 24.21
top - 19:29:03 up 1:15, 2 users, load average: 1.35, 17.39, 23.95
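The grep output above can be reduced further; for example, a small awk pipeline (a sketch, not part of the original report) extracts the peak 1-minute load average from the log:

```shell
# Two sample lines in the format logged by top -b (taken from this report),
# written to the assumed host4-top.out log file.
printf '%s\n' \
  'top - 19:16:02 up 1:02, 2 users, load average: 30.56, 18.24, 20.41' \
  'top - 19:18:25 up 1:04, 2 users, load average: 71.07, 38.24, 27.53' \
  > host4-top.out

# Reduce the log to the peak 1-minute load average: split each line at
# "load average: ", coerce the first value to a number, track the maximum.
grep 'load average' host4-top.out |
  awk -F'load average: ' '{ n = $2 + 0; if (n > max) max = n } END { print max }'
# prints 71.07 for the sample lines above
```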
Looking at the output when load was high:
top - 19:16:02 up 1:02, 2 users, load average: 30.56, 18.24, 20.41
Tasks: 254 total, 9 running, 245 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.7 us, 75.7 sy, 0.0 ni, 22.5 id, 0.4 wa, 0.2 hi, 0.0 si, 0.4 st
MiB Mem : 3736.1 total, 803.0 free, 2232.4 used, 700.7 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1182.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2773 root 20 0 0 0 0 R 74.5 0.0 3:51.06 kworker/u8:8-rpciod
977 root 0 -20 0 0 0 R 74.4 0.0 0:19.78 rpciod+rpciod
10277 root 20 0 0 0 0 R 74.4 0.0 3:15.33 kworker/u8:2+rpciod
4805 root 20 0 0 0 0 R 73.1 0.0 2:19.86 kworker/u8:0+rpciod
12284 root 20 0 0 0 0 R 6.9 0.0 0:41.71 kworker/u8:4-events_unbound
2067 vdsm 0 -20 2996704 125020 30068 S 3.0 3.3 0:55.80 vdsmd
5998 vdsm 0 -20 772196 7420 3976 S 1.6 0.2 0:22.89 ioprocess
8737 qemu 20 0 2798860 955116 24436 S 0.2 25.0 0:28.93 qemu-kvm
12247 vdsm 20 0 635404 32828 10444 S 0.2 0.9 0:02.06 momd
1154 openvsw+ 10 -10 67416 5764 3960 S 0.1 0.2 0:05.47 ovsdb-server
top - 19:16:25 up 1:02, 2 users, load average: 50.71, 23.72, 22.15
Tasks: 251 total, 19 running, 232 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 87.6 sy, 0.0 ni, 9.2 id, 2.1 wa, 0.3 hi, 0.0 si, 0.2 st
MiB Mem : 3736.1 total, 783.8 free, 2251.5 used, 700.8 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1163.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10277 root 20 0 0 0 0 D 84.8 0.0 3:35.19 kworker/u8:2+iscsi_eh
12284 root 20 0 0 0 0 I 84.8 0.0 1:01.57 kworker/u8:4-rpciod
15473 root 20 0 0 0 0 R 84.6 0.0 0:19.82 kworker/u8:1+rpciod
4805 root 20 0 0 0 0 R 84.3 0.0 2:39.60 kworker/u8:0+rpciod
977 root 0 -20 0 0 0 I 6.7 0.0 0:21.34 rpciod
15503 vdsm 0 -20 133796 29484 10308 R 1.3 0.8 0:00.31 50_openstacknet
1604 root 15 -5 1897600 71888 24316 S 0.4 1.9 0:03.60 supervdsmd
2067 vdsm 0 -20 2996704 125120 30068 S 0.3 3.3 0:55.87 vdsmd
12247 vdsm 20 0 635404 32832 10444 R 0.2 0.9 0:02.10 momd
8737 qemu 20 0 2798860 955116 24436 S 0.1 25.0 0:28.96 qemu-kvm
top - 19:16:59 up 1:03, 2 users, load average: 56.04, 27.45, 23.42
Tasks: 257 total, 33 running, 224 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.2 us, 74.7 sy, 0.0 ni, 22.4 id, 0.1 wa, 0.4 hi, 0.1 si, 1.2 st
MiB Mem : 3736.1 total, 798.3 free, 2236.6 used, 701.3 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1178.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15473 root 20 0 0 0 0 R 79.5 0.0 0:39.46 kworker/u8:1+rpciod
7174 qemu 20 0 3714220 747732 24256 R 72.2 19.5 2:38.50 qemu-kvm
2773 root 20 0 0 0 0 R 71.9 0.0 4:08.82 kworker/u8:8+rpciod
10277 root 20 0 0 0 0 R 70.2 0.0 3:52.52 kworker/u8:2+rpciod
1 root 20 0 254512 15196 9620 S 0.8 0.4 0:10.35 systemd
2067 vdsm 0 -20 3004900 125648 30068 S 0.6 3.3 0:56.40 vdsmd
1006 dbus 20 0 65076 5980 4780 S 0.5 0.2 0:05.87 dbus-daemon
803 root 20 0 117532 28748 27072 R 0.2 0.8 0:02.39 systemd-journal
1069 root 20 0 105828 10372 8148 S 0.2 0.3 0:02.13 systemd-logind
12247 vdsm 20 0 635404 32836 10444 R 0.2 0.9 0:02.21 momd
top - 19:17:25 up 1:03, 2 users, load average: 67.60, 32.21, 25.09
Tasks: 259 total, 17 running, 242 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.2 us, 84.9 sy, 0.0 ni, 12.0 id, 0.1 wa, 0.4 hi, 0.0 si, 0.4 st
MiB Mem : 3736.1 total, 797.4 free, 2236.8 used, 701.9 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1177.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10277 root 20 0 0 0 0 R 87.2 0.0 4:14.56 kworker/u8:2+rpciod
15473 root 20 0 0 0 0 I 78.6 0.0 0:59.34 kworker/u8:1-iscsi_q_7
12284 root 20 0 0 0 0 R 78.5 0.0 1:21.41 kworker/u8:4+rpciod
5998 vdsm 0 -20 764000 7436 3976 S 78.4 0.2 0:42.71 ioprocess
2067 vdsm 0 -20 3045880 125716 30068 S 8.2 3.3 0:58.47 vdsmd
2773 root 20 0 0 0 0 R 7.2 0.0 4:10.64 kworker/u8:8-rpciod
7174 qemu 20 0 3706024 747708 24256 S 7.1 19.5 2:40.29 qemu-kvm
12247 vdsm 20 0 635404 32848 10444 S 0.3 0.9 0:02.28 momd
1 root 20 0 254512 15196 9620 S 0.2 0.4 0:10.40 systemd
1006 dbus 20 0 65076 5980 4780 S 0.2 0.2 0:05.91 dbus-daemon
15408 nsoffer 20 0 65572 5108 4192 R 0.2 0.1 0:00.19 top
top - 19:17:45 up 1:03, 2 users, load average: 53.88, 31.31, 24.94
Tasks: 260 total, 14 running, 246 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.1 us, 27.2 sy, 0.0 ni, 70.0 id, 0.2 wa, 0.2 hi, 0.2 si, 0.1 st
MiB Mem : 3736.1 total, 785.2 free, 2241.6 used, 709.4 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1165.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16131 vdsm 0 -20 485436 4220 3724 S 41.1 0.1 0:04.11 ioprocess
5998 vdsm 0 -20 764000 7436 3976 S 34.0 0.2 0:46.15 ioprocess
10277 root 20 0 0 0 0 R 25.8 0.0 4:17.14 kworker/u8:2+rpciod
2067 vdsm 0 -20 3045880 126204 30068 S 1.6 3.3 0:59.11 vdsmd
1 root 20 0 254512 15196 9620 S 1.2 0.4 0:10.94 systemd
1006 dbus 20 0 65076 5980 4780 S 0.7 0.2 0:06.28 dbus-daemon
8737 qemu 20 0 2798860 956288 24436 S 0.6 25.0 0:29.18 qemu-kvm
15408 nsoffer 20 0 65572 5252 4192 S 0.4 0.1 0:00.28 top
803 root 20 0 125724 31760 30060 S 0.3 0.8 0:02.52 systemd-journal
12247 vdsm 20 0 637548 33068 10656 S 0.3 0.9 0:02.37 momd
top - 19:17:55 up 1:03, 2 users, load average: 57.99, 32.93, 25.54
Tasks: 260 total, 20 running, 240 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 73.9 sy, 0.0 ni, 0.0 id, 24.1 wa, 0.5 hi, 0.1 si, 1.4 st
MiB Mem : 3736.1 total, 784.8 free, 2242.0 used, 709.4 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1165.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10277 root 20 0 0 0 0 R 99.4 0.0 4:27.10 kworker/u8:2+rpciod
5998 vdsm 0 -20 764000 7436 3976 S 99.3 0.2 0:56.10 ioprocess
16131 vdsm 0 -20 485436 4220 3724 S 99.2 0.1 0:14.05 ioprocess
8737 qemu 20 0 2798860 956288 24436 S 0.6 25.0 0:29.24 qemu-kvm
15308 nsoffer 20 0 65520 5040 4152 R 0.2 0.1 0:00.11 top
909 root 20 0 479584 21248 10588 S 0.1 0.6 0:00.68 multipathd
1030 sanlock 20 0 914876 63188 32040 S 0.1 1.7 0:01.10 sanlock
1604 root 15 -5 2045064 72708 24316 S 0.1 1.9 0:04.15 supervdsmd
15408 nsoffer 20 0 65572 5252 4192 S 0.1 0.1 0:00.29 top
top - 19:18:25 up 1:04, 2 users, load average: 71.07, 38.24, 27.53
Tasks: 260 total, 12 running, 248 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 82.5 sy, 0.0 ni, 10.2 id, 5.7 wa, 0.4 hi, 0.1 si, 0.4 st
MiB Mem : 3736.1 total, 781.8 free, 2244.8 used, 709.6 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1162.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10277 root 20 0 0 0 0 R 65.1 0.0 4:33.64 kworker/u8:2+rpciod
5998 vdsm 0 -20 764000 7436 3976 S 63.2 0.2 1:02.45 ioprocess
2067 vdsm 0 -20 3045880 126212 30068 S 3.1 3.3 0:59.42 vdsmd
1 root 20 0 254512 15196 9620 S 1.4 0.4 0:11.08 systemd
1006 dbus 20 0 65076 5988 4780 S 1.1 0.2 0:06.39 dbus-daemon
8737 qemu 20 0 2798860 956288 24436 S 0.8 25.0 0:29.32 qemu-kvm
12247 vdsm 20 0 637548 33068 10656 S 0.8 0.9 0:02.45 momd
2773 root 20 0 0 0 0 R 0.6 0.0 4:10.70 kworker/u8:8+rpciod
1069 root 20 0 105828 10372 8148 S 0.4 0.3 0:02.31 systemd-logind
803 root 20 0 125724 32264 30548 S 0.3 0.8 0:02.55 systemd-journal
top - 19:18:35 up 1:04, 2 users, load average: 62.09, 37.39, 27.37
Tasks: 265 total, 7 running, 258 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.9 us, 23.8 sy, 0.0 ni, 45.5 id, 21.9 wa, 0.4 hi, 0.2 si, 0.3 st
MiB Mem : 3736.1 total, 733.1 free, 2290.7 used, 712.4 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1115.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10277 root 20 0 0 0 0 R 92.2 0.0 5:01.24 kworker/u8:2+rpciod
5998 vdsm 0 -20 764000 7436 3976 S 66.3 0.2 1:22.28 ioprocess
2773 root 20 0 0 0 0 I 66.2 0.0 4:30.52 kworker/u8:8-events_unbound
12284 root 20 0 0 0 0 I 66.1 0.0 1:41.22 kworker/u8:4-iscsi_q_7
2067 vdsm 0 -20 3046136 126440 30068 S 1.0 3.3 0:59.71 vdsmd
1604 root 15 -5 2045064 72816 24316 S 0.8 1.9 0:04.39 supervdsmd
1 root 20 0 254512 15196 9620 S 0.6 0.4 0:11.27 systemd
1006 dbus 20 0 65076 5988 4780 S 0.4 0.2 0:06.52 dbus-daemon
8737 qemu 20 0 2798860 956288 24436 S 0.3 25.0 0:29.40 qemu-kvm
17011 vdsm 0 -20 84260 17132 8400 S 0.3 0.4 0:00.08 ovirt_provider_
12247 vdsm 20 0 637548 33068 10656 S 0.2 0.9 0:02.52 momd
top - 19:18:45 up 1:04, 2 users, load average: 56.00, 36.90, 27.31
Tasks: 269 total, 9 running, 260 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 25.2 sy, 0.0 ni, 24.4 id, 49.0 wa, 0.5 hi, 0.1 si, 0.5 st
MiB Mem : 3736.1 total, 723.3 free, 2299.8 used, 713.0 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1105.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10277 root 20 0 0 0 0 R 99.6 0.0 5:11.22 kworker/u8:2+rpciod
2067 vdsm 0 -20 3046136 126444 30068 S 0.9 3.3 0:59.80 vdsmd
8737 qemu 20 0 2798860 956288 24436 S 0.8 25.0 0:29.48 qemu-kvm
12247 vdsm 20 0 637548 33072 10656 S 0.7 0.9 0:02.59 momd
15408 nsoffer 20 0 65572 5252 4192 S 0.4 0.1 0:00.41 top
1154 openvsw+ 10 -10 67416 5764 3960 S 0.3 0.2 0:05.66 ovsdb-server
1374 root 20 0 2091568 60032 39172 S 0.2 1.6 0:03.28 libvirtd
11 root 20 0 0 0 0 I 0.1 0.0 0:00.45 rcu_sched
1604 root 15 -5 2045064 72816 24316 S 0.1 1.9 0:04.40 supervdsmd
top - 19:19:15 up 1:05, 2 users, load average: 47.71, 36.43, 27.44
Tasks: 260 total, 16 running, 244 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 74.6 sy, 0.0 ni, 7.3 id, 16.9 wa, 0.4 hi, 0.1 si, 0.4 st
MiB Mem : 3736.1 total, 780.9 free, 2245.1 used, 710.2 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1162.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4805 root 20 0 0 0 0 R 99.5 0.0 2:49.69 kworker/u8:0+rpciod
12284 root 20 0 0 0 0 R 99.5 0.0 1:51.31 kworker/u8:4+rpciod
5998 vdsm 0 -20 764000 7440 3976 S 98.7 0.2 1:32.26 ioprocess
12247 vdsm 20 0 637548 33076 10656 S 0.6 0.9 0:02.69 momd
8737 qemu 20 0 2798860 956288 24436 S 0.4 25.0 0:29.61 qemu-kvm
15408 nsoffer 20 0 65704 5252 4192 S 0.4 0.1 0:00.55 top
top - 19:19:25 up 1:05, 2 users, load average: 52.06, 37.73, 27.96
Tasks: 264 total, 12 running, 252 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 74.5 sy, 0.0 ni, 0.0 id, 24.4 wa, 0.4 hi, 0.0 si, 0.3 st
MiB Mem : 3736.1 total, 747.2 free, 2278.7 used, 710.2 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1128.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12284 root 20 0 0 0 0 R 99.4 0.0 2:01.27 kworker/u8:4+rpciod
4805 root 20 0 0 0 0 I 98.1 0.0 2:59.52 kworker/u8:0-rpciod
5998 vdsm 0 -20 764000 7440 3976 S 98.0 0.2 1:42.08 ioprocess
2067 vdsm 0 -20 3037940 126488 30068 S 1.6 3.3 1:00.46 vdsmd
17027 vdsm 0 -20 11612 988 924 R 1.5 0.0 0:00.15 dd
8737 qemu 20 0 2798860 956288 24436 S 0.4 25.0 0:29.65 qemu-kvm
1154 openvsw+ 10 -10 67416 5764 3960 S 0.3 0.2 0:05.77 ovsdb-server
top - 19:19:35 up 1:05, 2 users, load average: 53.24, 38.46, 28.31
Tasks: 271 total, 10 running, 257 sleeping, 0 stopped, 4 zombie
%Cpu(s): 5.4 us, 58.6 sy, 0.0 ni, 23.1 id, 11.2 wa, 0.5 hi, 0.1 si, 1.2 st
MiB Mem : 3736.1 total, 740.1 free, 2285.1 used, 711.0 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1121.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17027 vdsm 0 -20 11612 988 924 D 78.9 0.0 0:08.06 dd
16744 vdsm 0 -20 559168 6444 3872 S 77.2 0.2 0:07.74 ioprocess
12284 root 20 0 0 0 0 R 75.0 0.0 2:08.79 kworker/u8:4+rpciod
2067 vdsm 0 -20 3054332 126488 30068 S 21.3 3.3 1:02.59 vdsmd
17542 root 20 0 100956 10540 8828 S 0.9 0.3 0:00.09 systemd
15408 nsoffer 20 0 65700 5260 4192 S 0.7 0.1 0:00.65 top
8737 qemu 20 0 2798860 956288 24436 S 0.5 25.0 0:29.70 qemu-kvm
top - 19:19:45 up 1:05, 2 users, load average: 53.82, 39.07, 28.61
Tasks: 271 total, 9 running, 258 sleeping, 0 stopped, 4 zombie
%Cpu(s): 0.2 us, 49.9 sy, 0.0 ni, 49.2 id, 0.0 wa, 0.4 hi, 0.1 si, 0.2 st
MiB Mem : 3736.1 total, 738.2 free, 2287.0 used, 711.0 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1119.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12284 root 20 0 0 0 0 R 99.5 0.0 2:18.76 kworker/u8:4+rpciod
16744 vdsm 0 -20 559168 6444 3872 S 99.5 0.2 0:17.71 ioprocess
12247 vdsm 20 0 637548 33076 10656 S 0.6 0.9 0:02.80 momd
15408 nsoffer 20 0 65700 5260 4192 S 0.4 0.1 0:00.69 top
8737 qemu 20 0 2798860 956288 24436 S 0.3 25.0 0:29.73 qemu-kvm
1285 root 20 0 394220 19136 16344 S 0.1 0.5 0:00.97 NetworkManager
top - 19:19:55 up 1:05, 2 users, load average: 47.06, 38.11, 28.41
Tasks: 260 total, 1 running, 259 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.0 us, 25.3 sy, 0.0 ni, 69.8 id, 0.2 wa, 0.3 hi, 0.2 si, 0.2 st
MiB Mem : 3736.1 total, 781.6 free, 2244.1 used, 710.5 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1163.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12284 root 20 0 0 0 0 I 69.2 0.0 2:25.69 kworker/u8:4-events_unbound
16744 vdsm 0 -20 559168 6444 3872 S 22.1 0.2 0:19.92 ioprocess
2067 vdsm 0 -20 3046136 126548 30068 S 2.7 3.3 1:02.86 vdsmd
1 root 20 0 254512 15196 9620 S 1.2 0.4 0:11.91 systemd
1006 dbus 20 0 65076 5988 4780 S 1.1 0.2 0:06.97 dbus-daemon
top - 19:20:30 up 1:06, 2 users, load average: 57.07, 41.13, 29.76
Tasks: 264 total, 22 running, 242 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.7 us, 79.9 sy, 0.0 ni, 11.9 id, 3.4 wa, 0.5 hi, 0.0 si, 0.6 st
MiB Mem : 3736.1 total, 767.2 free, 2258.0 used, 710.9 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1149.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16744 vdsm 0 -20 780364 8492 3872 S 159.4 0.2 0:59.47 ioprocess
4805 root 20 0 0 0 0 D 79.6 0.0 3:19.27 kworker/u8:0+iscsi_eh
12284 root 20 0 0 0 0 R 79.4 0.0 2:45.40 kworker/u8:4+rpciod
2067 vdsm 0 -20 3054332 126572 30068 S 16.1 3.3 1:07.27 vdsmd
8737 qemu 20 0 2798860 956288 24436 R 0.1 25.0 0:29.85 qemu-kvm
15408 nsoffer 20 0 65700 5260 4192 R 0.1 0.1 0:00.80 top
top - 19:20:52 up 1:06, 2 users, load average: 62.28, 43.18, 30.66
Tasks: 261 total, 24 running, 235 sleeping, 0 stopped, 2 zombie
%Cpu(s): 0.3 us, 71.6 sy, 0.0 ni, 22.1 id, 5.3 wa, 0.4 hi, 0.1 si, 0.2 st
MiB Mem : 3736.1 total, 767.6 free, 2257.5 used, 711.1 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1149.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17429 vdsm 0 -20 485436 6368 3808 S 153.7 0.2 0:34.71 ioprocess
4805 root 20 0 0 0 0 R 68.2 0.0 3:34.67 kworker/u8:0+rpciod
16744 vdsm 0 -20 780364 8492 3872 S 63.2 0.2 1:13.73 ioprocess
2067 vdsm 0 -20 3054332 126580 30068 S 0.8 3.3 1:07.45 vdsmd
8737 qemu 20 0 2798860 956288 24436 S 0.5 25.0 0:29.96 qemu-kvm
12284 root 20 0 0 0 0 I 0.4 0.0 2:45.48 kworker/u8:4-rpciod
12247 vdsm 20 0 637548 33076 10656 R 0.3 0.9 0:02.91 momd
top - 19:21:02 up 1:07, 2 users, load average: 61.09, 43.59, 30.93
Tasks: 258 total, 2 running, 256 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 36.3 sy, 0.0 ni, 41.9 id, 20.7 wa, 0.2 hi, 0.1 si, 0.3 st
MiB Mem : 3736.1 total, 780.6 free, 2244.5 used, 711.0 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1162.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16744 vdsm 0 -20 780364 8508 3872 S 47.8 0.2 1:18.52 ioprocess
4805 root 20 0 0 0 0 I 45.2 0.0 3:39.20 kworker/u8:0-flush-253:0
2067 vdsm 0 -20 3037940 126540 30068 S 1.4 3.3 1:07.59 vdsmd
8737 qemu 20 0 2798860 956288 24436 S 0.7 25.0 0:30.03 qemu-kvm
315 root 0 -20 0 0 0 I 0.5 0.0 0:00.17 kworker/2:1H-kblockd
12247 vdsm 20 0 637548 33076 10656 S 0.4 0.9 0:02.95 momd
1154 openvsw+ 10 -10 67416 5764 3960 S 0.3 0.2 0:05.87 ovsdb-server
top - 19:21:12 up 1:07, 2 users, load average: 53.57, 42.55, 30.73
Tasks: 259 total, 11 running, 248 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 28.9 sy, 0.0 ni, 69.3 id, 0.4 wa, 0.2 hi, 0.1 si, 0.3 st
MiB Mem : 3736.1 total, 757.0 free, 2267.3 used, 711.8 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1139.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16744 vdsm 0 -20 780364 8508 3872 S 112.9 0.2 1:29.83 ioprocess
2067 vdsm 0 -20 3037940 126576 30068 S 1.2 3.3 1:07.71 vdsmd
8737 qemu 20 0 2798860 956288 24436 S 0.5 25.0 0:30.08 qemu-kvm
1 root 20 0 254512 15196 9620 S 0.3 0.4 0:12.47 systemd
1154 openvsw+ 10 -10 67416 5764 3960 S 0.3 0.2 0:05.90 ovsdb-server
12247 vdsm 20 0 637548 33076 10656 S 0.3 0.9 0:02.98 momd
top - 19:21:22 up 1:07, 2 users, load average: 49.02, 41.94, 30.66
Tasks: 261 total, 13 running, 248 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 50.1 sy, 0.0 ni, 48.3 id, 0.0 wa, 0.4 hi, 0.1 si, 0.6 st
MiB Mem : 3736.1 total, 738.1 free, 2285.0 used, 713.0 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1121.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16744 vdsm 0 -20 780364 8508 3872 S 198.9 0.2 1:49.76 ioprocess
2067 vdsm 0 -20 3037940 126576 30068 S 1.8 3.3 1:07.89 vdsmd
8737 qemu 20 0 2798860 956288 24436 S 0.6 25.0 0:30.14 qemu-kvm
1 root 20 0 254512 15196 9620 S 0.2 0.4 0:12.49 systemd
1154 openvsw+ 10 -10 67416 5764 3960 S 0.2 0.2 0:05.92 ovsdb-server
15308 nsoffer 20 0 65520 5040 4152 R 0.2 0.1 0:00.27 top
top - 19:21:32 up 1:07, 2 users, load average: 43.89, 41.08, 30.50
Tasks: 257 total, 4 running, 253 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.9 us, 23.9 sy, 0.0 ni, 73.4 id, 0.0 wa, 0.3 hi, 0.2 si, 0.3 st
MiB Mem : 3736.1 total, 779.7 free, 2245.2 used, 711.2 buff/cache
MiB Swap: 2116.0 total, 2116.0 free, 0.0 used. 1161.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16744 vdsm 0 -20 780364 8508 3872 S 85.2 0.2 1:58.30 ioprocess
12284 root 20 0 0 0 0 R 2.4 0.0 2:45.72 kworker/u8:4+rpciod
2067 vdsm 0 -20 3029744 126560 30068 S 2.0 3.3 1:08.09 vdsmd
1 root 20 0 254512 15196 9620 S 0.9 0.4 0:12.58 systemd
1006 dbus 20 0 65076 5988 4780 S 0.7 0.2 0:07.43 dbus-daemon
8737 qemu 20 0 2798860 956288 24436 S 0.7 25.0 0:30.21 qemu-kvm
Looks like ioprocess is the reason for the high load, or maybe this is a kernel
issue and ioprocess is just the victim of it. I'll test this again without oVirt;
that will make it clear whether this is a kernel issue.

Avihai, do we have a bare metal setup I can use for testing? I test on VMs and this
is not a good environment for testing such issues.
We need:
- engine 4.4.7 (can be a vm)
- 2 bare metal hosts with latest RHEL 8.4 and RHV 4.4.7
- nfs or iscsi server for good storage (netapp used for tests?)
- a host for a temporary nfs server that will be blocked or shut down (can be a vm)
Nir, can you please set this bug as dependent on the kernel bug that was found?

Mordechai, we reproduced this bug on very old (2013) machines with 4 cores. We need to understand how this bug affects real servers: is this a critical issue that may affect users, or a minor issue that affects only our old testing environment? Can we try to reproduce this in a real environment in the scale lab? To reproduce this I need 2 hosts: one will function as the NFS server and the other as the NFS client. The NFS client should be a strong server of the kind likely to be used by users. The NFS server host can be anything, since we test the case when the NFS server is not accessible.

Nir, we don't have an environment that is currently up, as our lab went down earlier today. Adding needinfo on dagur; maybe other hosts in tlv can be used here.

Michal, an entire cluster going down because of one inaccessible storage sounds urgent to me, a blocker actually.

(In reply to Nir Soffer from comment #16)
> Michal, entire cluster goes down because of one inaccessible storage
> sounds urgent to me, a blocker actually.

It is serious of course, but it's a negative scenario, the system comes back on its own once the connection is restored, and there's no data loss. More importantly, we can't do anything about it (i.e. no urgent activity is required on the Dev side) until the kernel bug (which is Urgent) gets fixed. "blocker+" means we won't release a version with this bug. But here we are going to proceed with the release, since this is anyway already the current state, so blocking doesn't help anyone (and again, that consideration might be different for the RHEL bug).

Looks like there was confusion about the nature of this issue. The issue is on the NFS client side (RHV host kernel), not on the NFS server side. So the fixed kernel must be installed on all RHV hosts in the environment, not on the NFS server host.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

We have confidence the kernel fix 4.18.0-305.8.1.el8_4.bz1979070.test.x86_64 does fix the issue, but it was not officially delivered to RHV. We'll retest with official RHV and RHEL/kernel builds and verify then.

Can you verify this one? Please make sure kernel-4.18.0-305.11.1.el8_4 is present.

Verified on:
- rhv: 4.4.8.3-0.10el8ev
- os version on hosts: RHEL 8.4-4.elev
- kernel version: 4.18.0-305.12.1.el8_4.x86_64

Steps to Reproduce:
1. Create a vm from a template (any good template) on a hosted-engine environment - this vm will be the storage server
2. Install the nfs-utils package in the vm (after you get an ip and ssh to the vm)
3. Create a folder (nfs) and mount this folder
4. Define the exports file - edit /etc/exports with the nfs path and *(rw,sync,no_all_squash,root_squash)
5. Restart the nfs service
6. Go to the storage domain screen in the rhv ui - some storage domains are already running - and create a new storage domain, putting in the path the storage server vm you already created
7. Go to the storage vm and disconnect the network with `nmcli n off`
8. Look at the hosts screen in the rhv ui

Actual results: Hosts continue to work as expected, all other storage domains are running, and the environment continues to work. I got this warning in events: Storage Domain RHV_NFS (Data Center golden_env_mixed) was deactivated by system because it's not visible by any of the hosts.
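For reference, the exports line described in the reproduction steps would look like this in /etc/exports (a sketch: the path and options are the ones quoted in the steps, written without spaces between the options, since exportfs treats whitespace inside the client/options field as the start of a new export entry):

```
/nfs *(rw,sync,no_all_squash,root_squash)
```

After editing the file, `exportfs -ra` (or restarting the nfs-server service, as in the steps) makes the export visible to clients.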
Created attachment 1794030 [details] vdsm.log

Description of problem: hosts become 'NonOperational' after disconnecting the storage server's network.

Version-Release number of selected component (if applicable): 4.4.7

How reproducible: 100%

Steps to Reproduce:
1. Create a vm from a template (any good template) on a hosted-engine environment - this vm will be the storage server
2. Install the nfs-utils package in the vm (after you get an ip and ssh to the vm)
3. Create a folder (nfs) and mount this folder
4. Define the exports file - edit /etc/exports with the nfs path and *(rw,sync,no_all_squash,root_squash)
5. Restart the nfs service
6. Go to the storage domain screen in the rhv ui - some storage domains are already running - and create a new storage domain, putting in the path the storage server vm you already created
7. Go to the storage vm and disconnect the network with `nmcli n off`
8. Look at the hosts screen in the rhv ui

Actual results: hosts become 'NonOperational' after disconnecting the storage server's network. The environment collapses and the user needs to reprovision it; the hosts keep trying to connect to the disconnected storage server.

Expected results: the hosts should stay ok and connect to the other good storage domains after one nfs storage server is disconnected.

Additional info:
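As noted earlier in the discussion, an alternative to running `nmcli n off` inside the storage VM is to block access to the storage server from the host side with iptables. A minimal sketch (the address is a hypothetical placeholder; substitute the address of the server being taken offline):

```shell
# Hedged sketch: simulate the storage outage by dropping all outgoing
# traffic to the NFS server, instead of taking its network down.
STORAGE_ADDR=192.0.2.10   # hypothetical TEST-NET address, not from the report

block_storage()   { iptables -A OUTPUT -d "$STORAGE_ADDR" -j DROP; }
unblock_storage() { iptables -D OUTPUT -d "$STORAGE_ADDR" -j DROP; }
```

Call `block_storage` (as root) to start the outage and `unblock_storage` to restore access; this avoids depending on a storage server that is itself a VM managed by the environment under test.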