Bug 1684745
| Summary: | VM hangs on RHEL rt-kernel and OSP 13 | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Yichen Wang <yicwang> |
| Component: | kernel-rt | Assignee: | Daniel Bristot de Oliveira <daolivei> |
| kernel-rt sub component: | KVM | QA Contact: | Pei Zhang <pezhang> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | atheurer, bhu, chayang, chhudson, daolivei, dhoward, egallen, hhuang, ikulkarn, jianzzha, jraju, juzhang, knoel, krister, lcapitulino, lgoncalv, mmilgram, mtosatti, pagupta, pbonzini, peterx, pezhang, rkrcmar, rt-maint, snagar, virt-maint, vkuznets, williams, wlehman, yicwang |
| Version: | 7.5 | Keywords: | Regression, ZStream |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | 3.10.0-1019.rt56.978 | Doc Type: | Bug Fix |
| Doc Text: | VMs that use real-time priority for their vCPUs on the kernel-rt cannot boot: the VM starts but does not finish booting. The problem was a regression introduced by a patch that made the SynIC timer update happen on guest entry only. It was fixed by making the SynIC timer update conditional. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1687556 1688673 (view as bug list) | Environment: | |
| Last Closed: | 2019-08-06 12:36:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1655694, 1672377, 1687556, 1688673, 1707454 | | |
Description
Yichen Wang
2019-03-02 03:11:14 UTC
Yichen,

I'm under the impression that the VM's threads (vCPUs and/or iothreads) are deadlocking. You may be right that this could be a kernel issue, but before debugging the kernel we have to double-check the host configuration.

Are you using the realtime-virtual-host profile in the host and the realtime-virtual-guest profile in the guest? What are the contents of /etc/tuned/realtime-virtual-host-variables.conf in the host and /etc/tuned/realtime-virtual-guest-variables.conf in the guest?

Erwan, Andrew, can you help double-check the OSP configuration?

Adding Jianzhu to also help review the OSP configuration.

Yichen, for the flavor setting, rather than using hw:cpu_realtime_mask="^0", say if you have 8 vCPUs, can you use hw:cpu_realtime_mask="0-7" instead? Thanks

(In reply to Luiz Capitulino from comment #2)
> Yichen,
>
> I'm under the impression that the VM's threads (vCPUs and/or iothreads) are
> deadlocking. You may be right that this could be a kernel issue, but before
> debugging the kernel we have to double-check the host configuration.
>
> Are you using the realtime-virtual-host profile in the host and the
> realtime-virtual-guest profile in the guest? What are the contents of
> /etc/tuned/realtime-virtual-host-variables.conf in the host and
> /etc/tuned/realtime-virtual-guest-variables.conf in the guest?

We tried both with and without tuned, same result. When using tuned, we are using realtime-virtual-host, and the contents are just one line with "isolated_cores=1-19,21-39". Also, this is purely a host-level thing, so I don't think it has anything to do with the guest.

> Erwan, Andrew, can you help double-check the OSP configuration?

(In reply to jianzzha from comment #4)
> Yichen, for the flavor setting, rather than using hw:cpu_realtime_mask="^0",
> say if you have 8 vCPUs, can you use hw:cpu_realtime_mask="0-7" instead?
> Thanks

If I am understanding correctly, by doing 0-8 I am making all my vCPU worker threads RT. By doing this, I see the VM comes up fine! But why do I need all my VM cores to be RT? That is a bug, right?
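As a side note, the effect of the flavor mask can be seen directly in the libvirt domain that Nova generates: with hw:cpu_realtime_mask="^0", only vCPUs 1-8 get a fifo <vcpusched> entry and vCPU0 stays SCHED_OTHER. A minimal check on the compute host, assuming the instance name instance-0000002c taken from the XML dumped later in this bug:

# Show which vCPUs were given real-time (fifo) scheduling by Nova/libvirt.
# With hw:cpu_realtime_mask="^0" there is no <vcpusched> entry for vcpus='0'.
virsh dumpxml instance-0000002c | grep vcpusched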
(In reply to Yichen Wang from comment #5)
> (In reply to Luiz Capitulino from comment #2)
> > Yichen,
> >
> > I'm under the impression that the VM's threads (vCPUs and/or iothreads) are
> > deadlocking. You may be right that this could be a kernel issue, but before
> > debugging the kernel we have to double-check the host configuration.
> >
> > Are you using the realtime-virtual-host profile in the host and the
> > realtime-virtual-guest profile in the guest? What are the contents of
> > /etc/tuned/realtime-virtual-host-variables.conf in the host and
> > /etc/tuned/realtime-virtual-guest-variables.conf in the guest?
>
> We tried both with and without tuned, same result. When using tuned, we are
> using realtime-virtual-host, and the contents are just one line with
> "isolated_cores=1-19,21-39". Also, this is purely a host-level thing, so I
> don't think it has anything to do with the guest.

OK, the configuration looks correct. But there are two important points to mention. The first one is, even if the tuned profile didn't make an apparent difference, it is still required, as we discussed in bug 1678810. The second one is, when you get a guest hang as you did, we don't know right away whether it's a host or a guest issue. Actually, in this case it seems that it indeed was a host issue (more on this below).

> > Erwan, Andrew, can you help double-check the OSP configuration?
>
> (In reply to jianzzha from comment #4)
> > Yichen, for the flavor setting, rather than using hw:cpu_realtime_mask="^0",
> > say if you have 8 vCPUs, can you use hw:cpu_realtime_mask="0-7" instead?
> > Thanks
>
> If I am understanding correctly, by doing 0-8 I am making all my vCPU worker
> threads RT. By doing this, I see the VM comes up fine! But why do I need all
> my VM cores to be RT? That is a bug, right?

It is not a bug. All VM vCPU threads need to be RT so that they are scheduled to run when they have to. Even if the other vCPUs are only spinning, the housekeeping vCPU0 might hold kernel locks they need, so it has to be able to run when required or it will delay the other vCPU threads.

Btw, if the issue is now fixed, may we close this BZ?
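A quick way to verify on the host which vCPU threads actually ended up with real-time priority is to inspect the scheduling class of the qemu-kvm threads. This is only a sketch: it assumes a single qemu-kvm process on the host, and the TID passed to chrt is a placeholder.

# CLS shows the scheduling class per thread: TS = SCHED_OTHER, FF = SCHED_FIFO.
ps -T -p "$(pidof qemu-kvm)" -o tid,class,rtprio,psr,comm

# For a quick manual experiment, a vCPU thread that is still TS can be
# switched to fifo:1 by hand (12345 is a placeholder thread ID):
chrt -f -p 1 12345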
(In reply to Luiz Capitulino from comment #6)
> (In reply to Yichen Wang from comment #5)
> > (In reply to Luiz Capitulino from comment #2)
> > > Yichen,
> > >
> > > I'm under the impression that the VM's threads (vCPUs and/or iothreads)
> > > are deadlocking. You may be right that this could be a kernel issue, but
> > > before debugging the kernel we have to double-check the host configuration.
> > >
> > > Are you using the realtime-virtual-host profile in the host and the
> > > realtime-virtual-guest profile in the guest? What are the contents of
> > > /etc/tuned/realtime-virtual-host-variables.conf in the host and
> > > /etc/tuned/realtime-virtual-guest-variables.conf in the guest?
> >
> > We tried both with and without tuned, same result. When using tuned, we are
> > using realtime-virtual-host, and the contents are just one line with
> > "isolated_cores=1-19,21-39". Also, this is purely a host-level thing, so I
> > don't think it has anything to do with the guest.
>
> OK, the configuration looks correct. But there are two important points to
> mention. The first one is, even if the tuned profile didn't make an apparent
> difference, it is still required, as we discussed in bug 1678810. The second
> one is, when you get a guest hang as you did, we don't know right away
> whether it's a host or a guest issue. Actually, in this case it seems that it
> indeed was a host issue (more on this below).
>
> > > Erwan, Andrew, can you help double-check the OSP configuration?
> >
> > (In reply to jianzzha from comment #4)
> > > Yichen, for the flavor setting, rather than using
> > > hw:cpu_realtime_mask="^0", say if you have 8 vCPUs, can you use
> > > hw:cpu_realtime_mask="0-7" instead? Thanks
> >
> > If I am understanding correctly, by doing 0-8 I am making all my vCPU worker
> > threads RT. By doing this, I see the VM comes up fine! But why do I need all
> > my VM cores to be RT? That is a bug, right?
>
> It is not a bug. All VM vCPU threads need to be RT so that they are scheduled
> to run when they have to. Even if the other vCPUs are only spinning, the
> housekeeping vCPU0 might hold kernel locks they need, so it has to be able to
> run when required or it will delay the other vCPU threads.

OK, if the above is true, this is a behavior change, if you don't want to call it a regression or a bug, compared to an older working kernel-rt version. Also, if what you said is true, there is no point in having "hw:cpu_realtime_mask", as it will always be all vCPUs.

> Btw, if the issue is now fixed, may we close this BZ?

I would want to get better clarification on this, as this change is relatively new. We have to handle the existing VMs during infrastructure updates, and there will be potential breakage if existing VMs are already configured with "hw:cpu_realtime_mask=^0". I am still not very convinced that this is not a bug; please help me understand the logic behind why the new kernel-rt changed this behavior. Thanks very much!

Yichen,

As I explained, per the required KVM-RT configuration, all VM vCPUs must have real-time priority. By setting vCPU0 without real-time priority, you will run into spikes, since that vCPU may be holding a lock shared with other vCPUs, which in turn may cause the other vCPUs to spin on the lock. Also, if you can't reproduce the hang when all vCPUs have real-time priority, then there's no regression in KVM-RT, since that's the required configuration.

Having said that, I can try to confirm whether or not there's a behavior change for the non-KVM-RT configuration you seem to be using. Could you please provide the following information for the configuration that reproduces the bug:

1. Host's lscpu and a list of isolated cores in the host
2. The guest's XML file
3. The kernel version that works and the one that doesn't work

Also, for item 3, I assume you change the kernel version in the host, right?

(In reply to Luiz Capitulino from comment #8)
> Yichen,
>
> As I explained, per the required KVM-RT configuration, all VM vCPUs must
> have real-time priority. By setting vCPU0 without real-time priority, you
> will run into spikes, since that vCPU may be holding a lock shared with
> other vCPUs, which in turn may cause the other vCPUs to spin on the lock.
> Also, if you can't reproduce the hang when all vCPUs have real-time priority,
> then there's no regression in KVM-RT, since that's the required configuration.

Ok, what you said above is new to me; I don't find any documentation saying that all my VM cores have to be RT. If you Google it, people are all using the mask. Also, https://docs.openstack.org/nova/rocky/user/flavors.html makes no mention of this either. Same points:

(1) If all VM cores need to be RT, what is the point of hw:cpu_realtime_mask?
(2) Why does it work on the previous kernel version but not on the new one? Without proper documentation I would mark it as a behavior change or regression. This leads to my third point.
(3) For customers running VMs with hw:cpu_realtime_mask=^0: after a kernel update and a VM reboot, the VM will never come up. This is a blocker for people who want to update kernels. Unless we patch the libvirt XML, I don't see any better solution. Please advise.

> Having said that, I can try to confirm whether or not there's a behavior
> change for the non-KVM-RT configuration you seem to be using. Could you
> please provide the following information for the configuration that
> reproduces the bug:
>
> 1.
Host's lscpu and a list of isolated cores in the host [root@quincy-compute-2 ~]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 40 On-line CPU(s) list: 0-39 Thread(s) per core: 1 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz Stepping: 4 CPU MHz: 1601.000 CPU max MHz: 1601.0000 CPU min MHz: 1000.0000 BogoMIPS: 3200.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 28160K NUMA node0 CPU(s): 0-19 NUMA node1 CPU(s): 20-39 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d isolated cores are: 1-19, 21-39 > 2. The guest's XML file # virsh dumpxml 6 <domain type='kvm' id='6'> <name>instance-0000002c</name> <uuid>50210f2e-4ff9-4d95-ac56-655db6b84631</uuid> <metadata> <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="17.0.7-5.cisco.2.el7"/> <nova:name>myvm</nova:name> <nova:creationTime>2019-03-05 20:04:00</nova:creationTime> <nova:flavor name="m1.medium.noht"> <nova:memory>4096</nova:memory> <nova:disk>40</nova:disk> <nova:swap>0</nova:swap> <nova:ephemeral>0</nova:ephemeral> <nova:vcpus>9</nova:vcpus> </nova:flavor> <nova:owner> <nova:user uuid="a7bc657862cb40728afc943c8aa57e3e">admin</nova:user> <nova:project uuid="88316df5962845d58ac840f0e46869af">admin</nova:project> </nova:owner> <nova:root type="image" uuid="dcf6bdf2-cef6-487a-87f3-72657b86680b"/> </nova:instance> </metadata> <memory unit='KiB'>4194304</memory> <currentMemory unit='KiB'>4194304</currentMemory> <memoryBacking> <hugepages> <page size='1048576' unit='KiB' nodeset='0'/> </hugepages> <nosharepages/> <locked/> </memoryBacking> <vcpu placement='static'>9</vcpu> <cputune> <shares>9216</shares> <vcpupin vcpu='0' cpuset='15'/> <vcpupin vcpu='1' cpuset='2'/> <vcpupin vcpu='2' cpuset='8'/> <vcpupin vcpu='3' cpuset='3'/> <vcpupin vcpu='4' cpuset='9'/> <vcpupin vcpu='5' cpuset='4'/> <vcpupin vcpu='6' cpuset='10'/> <vcpupin vcpu='7' cpuset='5'/> <vcpupin vcpu='8' cpuset='16'/> <emulatorpin cpuset='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> <vcpusched vcpus='2' scheduler='fifo' priority='1'/> <vcpusched vcpus='3' scheduler='fifo' priority='1'/> <vcpusched vcpus='4' scheduler='fifo' priority='1'/> <vcpusched vcpus='5' scheduler='fifo' priority='1'/> <vcpusched vcpus='6' scheduler='fifo' priority='1'/> <vcpusched vcpus='7' scheduler='fifo' priority='1'/> <vcpusched vcpus='8' scheduler='fifo' priority='1'/> </cputune> <numatune> <memory mode='strict' nodeset='0'/> <memnode cellid='0' mode='strict' nodeset='0'/> </numatune> <resource> <partition>/machine</partition> </resource> <sysinfo type='smbios'> <system> <entry 
name='manufacturer'>Red Hat</entry> <entry name='product'>OpenStack Compute</entry> <entry name='version'>17.0.7-5.cisco.2.el7</entry> <entry name='serial'>329497fe-667a-11e8-88af-d8c49789492d</entry> <entry name='uuid'>50210f2e-4ff9-4d95-ac56-655db6b84631</entry> <entry name='family'>Virtual Machine</entry> </system> </sysinfo> <os> <type arch='x86_64' machine='pc-i440fx-rhel7.6.0'>hvm</type> <boot dev='hd'/> <smbios mode='sysinfo'/> </os> <features> <acpi/> <apic/> <pmu state='off'/> </features> <cpu mode='host-passthrough' check='none'> <topology sockets='9' cores='1' threads='1'/> <feature policy='require' name='tsc-deadline'/> <numa> <cell id='0' cpus='0-8' memory='4194304' unit='KiB' memAccess='shared'/> </numa> </cpu> <clock offset='utc'> <timer name='pit' tickpolicy='delay'/> <timer name='rtc' tickpolicy='catchup'/> <timer name='hpet' present='no'/> </clock> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>destroy</on_crash> <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2' cache='none'/> <source file='/var/lib/nova/instances/50210f2e-4ff9-4d95-ac56-655db6b84631/disk'/> <backingStore type='file' index='1'> <format type='raw'/> <source file='/var/lib/nova/instances/_base/39924454a2f83b902c560386f4aae35eca3a6575'/> <backingStore/> </backingStore> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> <controller type='usb' index='0' model='piix3-uhci'> <alias name='usb'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias name='pci.0'/> </controller> <interface type='bridge'> <mac address='fa:16:3e:dc:73:75'/> <source bridge='qbr5695ac7e-db'/> <target dev='tap5695ac7e-db'/> <model type='virtio'/> <driver name='vhost' rx_queue_size='1024'/> <mtu size='9000'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/2'/> <log file='/var/lib/nova/instances/50210f2e-4ff9-4d95-ac56-655db6b84631/console.log' append='off'/> <target type='isa-serial' port='0'> <model name='isa-serial'/> </target> <alias name='serial0'/> </serial> <console type='pty' tty='/dev/pts/2'> <source path='/dev/pts/2'/> <log file='/var/lib/nova/instances/50210f2e-4ff9-4d95-ac56-655db6b84631/console.log' append='off'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <input type='tablet' bus='usb'> <alias name='input0'/> <address type='usb' bus='0' port='1'/> </input> <input type='mouse' bus='ps2'> <alias name='input1'/> </input> <input type='keyboard' bus='ps2'> <alias name='input2'/> </input> <graphics type='vnc' port='5900' autoport='yes' listen='172.29.86.247' keymap='en-us'> <listen type='address' address='172.29.86.247'/> </graphics> <video> <model type='cirrus' vram='16384' heads='1' primary='yes'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='virtio'> <stats period='10'/> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </memballoon> </devices> <seclabel type='dynamic' model='dac' relabel='yes'> <label>+2012:+2012</label> <imagelabel>+2012:+2012</imagelabel> </seclabel> </domain> > 3. 
The kernel version that works and the one that doesn't work:

Working old version: 3.10.0-957.1.3.rt56.913.el7.x86_64
Non-working new version: 3.10.0-957.5.1.rt56.916.el7.x86_64

> Also, for item 3, I assume you change the kernel version in the host, right?

Yes

(In reply to Yichen Wang from comment #9)
> (In reply to Luiz Capitulino from comment #8)
> > Yichen,
> >
> > As I explained, per the required KVM-RT configuration, all VM vCPUs must
> > have real-time priority. By setting vCPU0 without real-time priority, you
> > will run into spikes, since that vCPU may be holding a lock shared with
> > other vCPUs, which in turn may cause the other vCPUs to spin on the lock.
> > Also, if you can't reproduce the hang when all vCPUs have real-time priority,
> > then there's no regression in KVM-RT, since that's the required configuration.
>
> Ok, what you said above is new to me; I don't find any documentation saying
> that all my VM cores have to be RT. If you Google it, people are all using
> the mask. Also, https://docs.openstack.org/nova/rocky/user/flavors.html makes
> no mention of this either.

KVM-RT is very new, so I wouldn't expect all details like this to be fully documented yet. Also, as it turns out, different kernel versions may require different tunings, since different kernels may have different interrupt sources. So not all tunings apply to all kernels equally.

> Same points:
> (1) If all VM cores need to be RT, what is the point of hw:cpu_realtime_mask?

I think this makes sense for full flexibility and not to tie the OpenStack configuration to a particular kernel version. But for more info, we'd need to talk to OpenStack developers (adding Erwan to the CC list).

> (2) Why does it work on the previous kernel version but not on the new one?
> Without proper documentation I would mark it as a behavior change or
> regression. This leads to my third point.
> (3) For customers running VMs with hw:cpu_realtime_mask=^0: after a kernel
> update and a VM reboot, the VM will never come up. This is a blocker for
> people who want to update kernels. Unless we patch the libvirt XML, I don't
> see any better solution. Please advise.

I'll try to reproduce this issue to see if there's a kernel regression or behavior change. But for KVM-RT, the required configuration is that all vCPU threads must have real-time priority.

I was able to reproduce this issue without OpenStack. Actually, there are two ways to reproduce it:

1. The simplest way is to just reboot a KVM-RT guest from within the guest. If the bug reproduces, and it reproduces 100% of the time for me, the guest will get stuck with most vCPU threads taking 3% userspace and 18% kernelspace.

2. In the guest XML, configure all vCPUs but vCPU0 to have fifo:1 priority (that is, vCPU0 will have SCHED_OTHER priority). Try to start the guest with virsh. virsh itself will hang and the guest will hang like item 1.

Here's all that I know so far:

o This is a regression. The first bad kernel is kernel-rt-3.10.0-957.3.1.rt56.914.el7.x86_64. The last good kernel is kernel-rt-3.10.0-957.2.1.rt56.913.el7.x86_64.

o The latest RHEL7.7 kernel 3.10.0-1014.rt56.972.el7.x86_64 is also affected.

o The non-RT kernel kernel-3.10.0-957.3.1.el7.x86_64 DOES NOT reproduce the issue, which makes me think this is an RT-kernel-only issue.

o When tracing, I observe that the vCPU threads are "spinning" on this lock from the signal code: tsk->sighand->siglock. Any code path leading to this spinlock will cause the thread to "spin", for example:

  sigprocmask()               /* from vcpu run code in KVM */
    __set_current_blocked()
      migrate_disable()
        pin_current_cpu()
          spin_lock_irq(&tsk->sighand->siglock)

I'm saying "spinning" because in the RT kernel, spinlock contention causes the thread to go to sleep for a while, wake up, see the lock still taken, and go back to sleep.
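For reference, a rough way to observe this state on a live host; 12345 is a placeholder for the TID of one of the stuck vCPU threads:

# The kernel stack of the stuck thread should show it sleeping on its way
# into spin_lock_irq(&tsk->sighand->siglock).
cat /proc/12345/stack

# Watching the thread repeatedly sleep and wake while contending for the
# lock (RT spinlocks sleep under contention):
trace-cmd record -P 12345 -e sched:sched_switch -e sched:sched_wakeup sleep 5
trace-cmd report | head -n 40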
The first bad kernel (kernel-rt-3.10.0-957.3.1.rt56.914.el7.x86_64) has a relatively big KVM update, mostly relating to HyperV. So I think I'll try to revert this series. Otherwise, I have two hypotheses for this issue:

1. There's something in the HyperV update "conflicting" with the RT kernel code (say, bad lock semantics)
2. There's a bad conflict resolution

PS: I'm clearing the NEEDINFO request for Erwan, since I think that's a bit irrelevant now

Yichen,

Thanks a lot for finding this one. It is a true bug and we're working on it.

I reverted the HyperV patches in the latest 7.6.z, and it works fine. Here is the brew build:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20482594

What are the next steps? Make a hotfix kernel, deliver it to the customer, and continue working on the real solution? -- Daniel

We are being delayed by this RPM signature issue, which still hasn't been resolved yet... I just installed the RPM anyway and gave it a shot. The original issue I saw on this thread has been fixed, and the VM is launching fine. However, I still want to keep this open, as we are seeing other abnormal behavior with this new version under wider testing. To better expose it, we configured our testbed with a mix of HT and non-HT deployments. On the HT-enabled nodes the issue is 100% reproducible (4/4); on the non-HT nodes it is 30% reproducible (1/3). Please find the "lscpu" output above for the hardware information.

[root@quincy-compute-4 ~]# top
top - 13:03:08 up 13:39, 1 user, load average: 149.30, 149.47, 149.73
Tasks: 1622 total, 9 running, 1613 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.7 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 39469430+total, 19495484+free, 19490510+used, 4834360 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 19361452+avail Mem

   PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 15856 root     39  19       0      0      0 S  13.9  0.0  77:30.36 kipmi0
   610 root     -4   0       0      0      0 S  11.7  0.0  74:54.14 ktimersoftd/60
    23 root     -3   0       0      0      0 S   4.5  0.0  14:18.30 ksoftirqd/1
   213 root     -4   0       0      0      0 S   4.5  0.0  33:38.03 ktimersoftd/20
226553 root     20   0  163648   3976   1592 R   1.9  0.0   0:00.38 top

On the non-HT testbed, top shows:

[root@quincy-control-2 ~]# top
top - 13:03:43 up 13:40, 1 user, load average: 73.13, 73.89, 73.84
Tasks: 1133 total, 2 running, 1131 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.3 us, 1.6 sy, 0.0 ni, 97.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19652601+total, 91847016 free, 96707744 used, 7971260 buff/cache
KiB Swap: 33554428 total, 33554428 free, 0 used. 92710192 avail Mem

   PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
183236 rabbitmq 20   0 3574340 107592   2804 S  28.3  0.1   0:02.50 beam.smp
 54714 rabbitmq 20   0 7835916 268384   4348 S  27.7  0.1 224:43.84 beam.smp
     4 root     -4   0       0      0      0 S   2.9  0.0  27:16.72 ktimersoftd/0
   213 root     -4   0       0      0      0 S   2.6  0.0  21:12.23 ktimersoftd/20
 71258 cloudpu+ 20   0  298596  83080   7696 S   2.0  0.0  14:13.75 cloudpulse-serv
 53655 mysql    20   0   15.6g 531176 146124 S   1.6  0.3  10:44.52 mysqld
183237 root     20   0  163116   3476   1592 R   1.6  0.0   0:00.20 top

You see the load average is about 150 with HT and 74 without HT.
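One note on interpreting these numbers: the Linux load average counts uninterruptible (D state) tasks as well as runnable ones, so it can be high even while the CPUs are mostly idle. A quick, illustrative way to see which threads are contributing:

# Count threads in R (runnable) or D (uninterruptible) state by command name.
ps -eLo state,comm | awk '$1 == "R" || $1 == "D"' | sort | uniq -c | sort -rn | head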
Given this is a 20 * 2 = 40-core system, if it were really behaving the way "top" says, it should be heavily overloaded. However, things are still fine; the top output is just scaring people away.

The issue is confirmed to be related to tuned, on both the new and the old kernel versions. I tried the realtime tuned profile: same thing. Then I tried realtime's parent profile, network-latency, and the issue is not seen. So clearly something is wrong when the realtime tuned profiles are applied. When it reproduces, it happens 100% of the time. I would like some help understanding how to debug this... Thanks very much!

Regards,
Yichen

(In reply to Yichen Wang from comment #54)
> We are being delayed by this RPM signature issue, which still hasn't been
> resolved yet... I just installed the RPM anyway and gave it a shot. The
> original issue I saw on this thread has been fixed, and the VM is launching
> fine. However, I still want to keep this open, as we are seeing other
> abnormal behavior with this new version under wider testing.

Yichen,

Would you mind opening a new BZ for this issue? It doesn't seem related to the vCPU hang issue, so it's better for us to track it in a different BZ.

Thanks!

(In reply to Luiz Capitulino from comment #61)
> (In reply to Yichen Wang from comment #54)
> > We are being delayed by this RPM signature issue, which still hasn't been
> > resolved yet... I just installed the RPM anyway and gave it a shot. The
> > original issue I saw on this thread has been fixed, and the VM is launching
> > fine. However, I still want to keep this open, as we are seeing other
> > abnormal behavior with this new version under wider testing.
>
> Yichen,
>
> Would you mind opening a new BZ for this issue? It doesn't seem related to
> the vCPU hang issue, so it's better for us to track it in a different BZ.
>
> Thanks!

No problem, Luiz. Please feel free to close this BZ. A new BZ has been opened for the tuned issue I am seeing:

https://bugzilla.redhat.com/show_bug.cgi?id=1694877

Thanks very much!

Regards,
Yichen

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2043