| Summary: | ksmd high CPU usage | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Ales Kozumplik <akozumpl> |
| Component: | kernel | Assignee: | Andrea Arcangeli <aarcange> |
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 6.1 | CC: | aarcange, alexey.pushkin, aph, areis, ari.tilli, athanasios.zorbas, balay, bjrosen, bloodhound, clasohm, cquike, ddick, dougsland, edgar.hoch, erik, fche, ffejes, fschwarz, gansalmon, gsgatlin, itamar, j.dodkins, jonathansteffan, jorton, jsmith.fedora, juzhang, jzeleny, kas, kernel-maint, kparal, kreucher, kronos, loganjerry, mdavis, me, mikey, mjw, mmccune, ojab, oliver.henshaw, paul, pcfe, plarsen, redhat2, rene.purcell, roland.friedwagner, ron, samuel-rhbugs, slavagt, soeren.grunewald, tadej.j, tiagomatos, tim, tjb, tmraz, vorgusa, wcohen, wtogami |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 541230 | Environment: | |
| Last Closed: | 2013-10-01 14:32:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 541230 | | |
| Bug Blocks: | | | |

Description by Ales Kozumplik, 2011-01-24 12:43:06 UTC:
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

Hello,
I am still seeing this with qemu-kvm-0.12.1.2-2.113.el6_0.6.x86_64.
The virtual machines running on the host do not do much, but ksmd is at around 30%.
top output:

```
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
   35 root  25   5     0    0    0 S 31.5  0.0  4061:41 ksmd
11732 qemu  20   0 1301m 593m  604 S  5.9 15.5  1362:31 qemu-kvm
11792 qemu  20   0 1342m 655m  548 S  5.9 17.1  1318:23 qemu-kvm
11763 qemu  20   0 1301m 685m  548 S  3.9 17.9  1324:49 qemu-kvm
22102 qemu  20   0 1381m 860m  996 S  3.9 22.5 78:54.88 qemu-kvm
 6403 qemu  20   0 1035m 488m 1592 S  2.0 12.7  2:43.69 qemu-kvm
    1 root  20   0 19244  436  272 S  0.0  0.0  0:00.83 init
    2 root  20   0     0    0    0 S  0.0  0.0  0:00.29 kthreadd
```
Is this bug assigned correctly? The Fedora bug this was cloned from is assigned to the virtualization maintainers, not to the kernel component.
Hello, I am seeing this problem also. I am using RHEL 6 with kernel 2.6.32-71.24.1.el6.x86_64. The version of qemu-kvm is qemu-kvm-0.12.1.2-2.113.el6_0.8.x86_64. When I first set up this system I had this problem with 4 VMs. I reduced the number of VMs to 2 and the problem went away, until I rebooted today. Now I see the problem with just 2 VMs idling (one RHEL 5, one RHEL 6, both 64-bit). ksmd is using about 25-30% of one CPU. I have another box (beefier CPU and more RAM) with more VMs that does not appear to have this problem. The machine with the high ksmd CPU percentage has 4 GB of RAM; one VM has 512 MB, the other has 1 GB. The machine not having problems has 8 GB of memory and 7 VMs on it. The VMs are a mix of Fedoras and RHELs of various versions; I am pretty sure all of those VMs have 512 MB.

Top on the machine with the high ksmd usage:

```
  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
   35 root     25   5     0    0    0 S 26.3  0.0 55:36.99 ksmd
 3178 gsgatlin 20   0 1016m 359m  16m S  5.6  9.2  5:01.71 firefox-bin
 2371 qemu     20   0 1309m 701m 1288 R  4.7 18.0 11:39.93 qemu-kvm
 2337 qemu     20   0  797m 422m 1304 S  1.3 10.8 19:38.93 qemu-kvm
 2824 gsgatlin 20   0  240m  24m 7184 S  1.0  0.6  2:18.42 compiz
 3525 root     20   0  602m  37m  10m S  1.0  1.0  1:29.96 python
 2482 root     20   0  239m 107m 9480 S  0.7  2.8  3:01.39 Xorg
 3353 gsgatlin 20   0  104m 5764 2148 S  0.7  0.1  1:15.99 plugin-containe
```

The ksmd load can easily be reduced by editing /etc/ksmtuned.conf. See comment 11 in https://bugzilla.redhat.com/show_bug.cgi?id=541230.

(In reply to comment #7)
> The ksmd load can easily be reduced by editing /etc/ksmtuned.conf
>
> See Comment 11 in https://bugzilla.redhat.com/show_bug.cgi?id=541230

Yes, thanks for pointing that out; I'll update my config. I still think this is weird: sometimes the high CPU usage doesn't start until after a few weeks, while the number of running machines stays constant.

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

2% does not sound excessive to me either: 400 minutes of ksmd runtime over 15 days is 1.8% of a single core. Not excessive in my view. However, if you have plenty of RAM available and you don't need KSM, you can shut it off (see the sketch below). Ideally the KSM scan would slow down more dynamically as the merging rate decreases (a sign that the scan is less worthwhile for the workload), so that if you don't need it, it uses even less than 2% of one core; but that is material not for RHEL 6, and it would probably be better implemented as a ksmtuned feature than in the kernel. We can't know automatically whether the admin will benefit from KSM, or whether there is plenty of RAM and KSM will never be needed, so it's up to the admin to shut it off if it is not needed. By default the above scan rate looks OK and not too aggressive.

I can't see any reference to 400 minutes over 15 days in a casual perusal of this bug report; maybe I'm missing it? Nevertheless, ksmd takes a high proportion of CPU time when it is running, so it is to be hoped that it's doing something useful with that CPU time. Bug #541230 comment #35 suggests that this isn't always the case.

There's no guarantee that the work done by ksmd is beneficial in any way. But the point is that we should spend little CPU on it, because even if it's not beneficial now, it could be later.
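If you decide KSM is not worth the CPU on a given host, shutting it off is simple. A minimal sketch for RHEL 6, assuming the stock ksm and ksmtuned init scripts shipped with the qemu-kvm packages:

```
# Stop the KSM tuning daemon and the KSM kernel thread for this boot
service ksmtuned stop
service ksm stop

# Keep both disabled across reboots
chkconfig ksmtuned off
chkconfig ksm off
```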
The 400 minutes over 15 days is deduced from these numbers:
```
[root@cobra03 ~]# uptime
 14:20:06 up 15 days, 1:52, 1 user, load average: 1.91, 2.51, 2.63
```
This is ksmd's line from top:

```
39 root 25 5 0 0 0 S 2.0 0.0 400:58.15 ksmd
```
uptime says 15 days. The cumulative CPU time ksmd has used on a single core is 400 minutes and 58.15 seconds, so:

400 / (15 * 24 * 60) ≈ 0.018, i.e. about 1.8% of a single core.
That does not look excessive to me, and it matches the "instant" 2% in the %CPU column here:

```
39 root 25 5 0 0 0 S 2.0 0.0 400:58.15 ksmd
                     ^^^
```
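The same arithmetic can be automated from /proc instead of eyeballing top. A minimal sketch, assuming a shell with standard procps and coreutils, and that ksmd's accumulated CPU time sits in the utime and stime fields (14 and 15) of /proc/<pid>/stat:

```
# Average share of one core used by ksmd since boot
pid=$(pgrep -x ksmd)                              # ksmd's PID (39 above)
ticks=$(awk '{print $14 + $15}' /proc/$pid/stat)  # utime + stime, in clock ticks
hz=$(getconf CLK_TCK)                             # clock ticks per second
up=$(awk '{print $1}' /proc/uptime)               # uptime in seconds
echo "scale=2; 100 * $ticks / $hz / $up" | bc     # percent of a single core
```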
The problem is that with KSM there is no exact rule for how much CPU you want to spend on it. But it is all tunable in /etc/ksmtuned.conf.
If you don't benefit from KSM and you have plenty of memory, you should simply disable ksmd. If you want to keep it but spend less CPU on it, the knobs in /etc/ksmtuned.conf can be lowered, as sketched below.
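As a concrete illustration, here is a hedged sketch of reduced-aggressiveness overrides for /etc/ksmtuned.conf. The parameter names are the ones the stock RHEL 6 ksmtuned script reads; the values are illustrative assumptions, not tested recommendations:

```
# /etc/ksmtuned.conf -- illustrative overrides; apply with: service ksmtuned restart

# Wake the tuning loop less often (default is 60 seconds)
KSM_MONITOR_INTERVAL=120

# Base value from which ksmd's sleep_millisecs is derived; larger means
# longer pauses between scan batches
KSM_SLEEP_MSEC=50

# Scan fewer pages per wakeup: smaller boost, lower floor and ceiling
KSM_NPAGES_BOOST=150
KSM_NPAGES_MIN=32
KSM_NPAGES_MAX=500

# Memory-pressure thresholds below which ksmtuned starts merging at all
KSM_THRES_COEF=20
KSM_THRES_CONST=2048
```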
"If you don't benefit from KSM ..." How does a user/admin know if they are benefitting from KSM? The ksmtuned.log file doesn't appear to help inform the cost/benefit question. KSM is a tradeoff between spending CPU to save memory, or not spending CPU and not saving memory. There is no hard rule in deciding if to turn KSM on and off. If you've many similar guests the amount of RAM saved can be massive with overall improved performance thanks to higher cache amounts available in the host or simply in allowing a more efficient consolidation. 2% of one core sounds a reasonable amount of CPU spent to save potentially dozen gigabytes of memory. It also depends on if the guests are mostly idle or if they're massively computing. ksmtuned has a reasonable default but admin can ovverride if he knows the virtual machine requirements. During CPU bound benchmarks for example KSM usually should be disabled to save a bit of CPU. With desktop virtualization it is very good idea to keep KSM enabled because of the low CPU load and similarity between the different virtual machine software (if they all run the same desktop OS at least). I believe that is understood. The question is how does an administrator know how well it's working? Can she do better than gross system-level stats like eyeballing /proc/meminfo? have a look at the files in /sys/kernel/mm/ksm/. at least for recent kernels the information you need should be there. i don't have my rhel machine handy to check. Thank you for the pointer. On my F18 machine with ksmtuned / ksm running, these were typical results: /sys/kernel/mm/ksm/full_scans:0 /sys/kernel/mm/ksm/pages_shared:0 /sys/kernel/mm/ksm/pages_sharing:0 /sys/kernel/mm/ksm/pages_to_scan:100 /sys/kernel/mm/ksm/pages_unshared:0 /sys/kernel/mm/ksm/pages_volatile:0 /sys/kernel/mm/ksm/run:0 /sys/kernel/mm/ksm/sleep_millisecs:20 With ksmtuned stopped but ksmd forced on more aggressively: /sys/kernel/mm/ksm/full_scans:4 /sys/kernel/mm/ksm/pages_shared:32128 /sys/kernel/mm/ksm/pages_sharing:261406 /sys/kernel/mm/ksm/pages_to_scan:10000 /sys/kernel/mm/ksm/pages_unshared:789171 /sys/kernel/mm/ksm/pages_volatile:2947 /sys/kernel/mm/ksm/run:1 /sys/kernel/mm/ksm/sleep_millisecs:20 Is there some documentation to guide a sysadmin to these files, and how to interpret the numbers? At that pages_to_scan rate, ksmd was consuming 80% cpu, in order to save some memory. How much per unit cpu-second for example? from the documentation in the linux kernel's Documentation/vm/ksm.txt pages_shared - how many shared pages are being used pages_sharing - how many more sites are sharing them i.e. how much saved pages_unshared - how many pages unique but repeatedly checked for merging pages_volatile - how many pages changing too fast to be placed in a tree full_scans - how many times all mergeable areas have been scanned A high ratio of pages_sharing to pages_shared indicates good sharing, but a high ratio of pages_unshared to pages_sharing indicates wasted effort. pages_volatile embraces several different kinds of activity, but a high proportion there would also indicate poor use of madvise MADV_MERGEABLE. your sample looks ok for my eyes (of course pages_to_scan is way too high). ksmtuned doesn't turn on ksm until the system is low on memory. This is to save CPU as well. Eventually stock F18 should also reach the same amount of "pages_shared" and "pages_sharing" levels that you achieved with the ksmd forced on aggressively. 
Based on the discussion above, it looks like the current behavior is acceptable and the tradeoffs are well understood. Closing as WONTFIX in RHEL 6. Please reopen if you still have concerns. Thanks.