Bug 806548
Summary: | Kernel 3.3.x: terrible slow boot / shutdown in VMWare | ||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Harald Reindl <h.reindl> | ||||||||||||||||||||||||||||||||
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||||||||||||||||||||||||||||||
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||||||||||||||||||||||||
Severity: | medium | Docs Contact: | |||||||||||||||||||||||||||||||||
Priority: | unspecified | ||||||||||||||||||||||||||||||||||
Version: | 16 | CC: | drjones, fullung, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, maurizio.antillon, mschmidt, Panos.Kavalagios, pascal.chapperon, sergio | ||||||||||||||||||||||||||||||||
Target Milestone: | --- | ||||||||||||||||||||||||||||||||||
Target Release: | --- | ||||||||||||||||||||||||||||||||||
Hardware: | x86_64 | ||||||||||||||||||||||||||||||||||
OS: | Linux | ||||||||||||||||||||||||||||||||||
Whiteboard: | |||||||||||||||||||||||||||||||||||
Fixed In Version: | kernel-2.6.43.5-2.fc15 | Doc Type: | Bug Fix | ||||||||||||||||||||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||||||||||||||||||||
Clone Of: | Environment: | ||||||||||||||||||||||||||||||||||
Last Closed: | 2012-05-04 22:50:39 UTC | Type: | --- | ||||||||||||||||||||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||||||||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||||||||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||||||||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||||||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||||||||||||||||||
Embargoed: | |||||||||||||||||||||||||||||||||||
Attachments: |
|
Description
Harald Reindl
2012-03-24 17:02:57 UTC
Please post the dmesg for that boot. There's nothing we can work with in this bug other than it reporting the vast majority of the time being spent in the initrd and userspace. "dmesg.txt" will be attached in next step maybe also related: https://bugzilla.redhat.com/show_bug.cgi?id=805317 this happened on this machine multiple times today never seen this before 3.3, so i currently hold back 3.2 for vSphere guests :-( __________________ i notice this degraded boot/shutdown times also on 3 "HP Compaq 8200 Elite CMT PC" physical machines where boot-time raised from 14 seconds to 21 seconds as example Created attachment 572829 [details]
dmesg output
Created attachment 572836 [details]
print.ktime=0
this is another virtual machine
seems print.ktime=0 makes also a big difference
interesting is that WITHOUT timestamps it is MUCH slower
i would expect the opposite
Startup finished in 1430ms (kernel) + 3817ms (initramfs) + 11708ms (userspace) = 16956ms
Startup finished in 1408ms (kernel) + 6755ms (initramfs) + 22005ms (userspace) = 30168ms
Created attachment 572837 [details]
print.ktime=1
here the other dmesg-output to see the difference of the last attachment
as showed in the two dmesg-outputs attached "printk.time=0" seems to make the start a lot slower shutdown is in both variants slow i can see the "unmount old..." messages often a second per earch of them Startup finished in 1430ms (kernel) + 3817ms (initramfs) + 11708ms (userspace) = 16956ms Startup finished in 1408ms (kernel) + 6755ms (initramfs) + 22005ms (userspace) = 30168ms These are all issues in VMWare guests? Is the host using the VMWare modules (it seems so)? Do you have this issue on real hardware or non-VMWare guests? Created attachment 572847 [details]
also slow with print.ktime=1
as said in my initial report "most degraded are virtual machines, but also physical ones" i have here phyiscal machines too which a boot time degraded from 11 seconds to 18 seconds by kernel 3.3 on F16 (In reply to comment #9) > as said in my initial report "most degraded are virtual machines, but also > physical ones" > > i have here phyiscal machines too which a boot time degraded from 11 seconds to > 18 seconds by kernel 3.3 on F16 Great. Can you attach a dmesg from a physical machine? Created attachment 572850 [details]
physical machine degraded from 8 to 17 seconds
attached the dmesg of a physical machine
it hangs around some seconds by set hostname / udev too
with 3.3.0-2.fc16 the whole boot-process did take around 8 seconds
Startup finished in 1056ms (kernel) + 2006ms (initramfs) + 14213ms (userspace) = 17277ms
__________________
[ 3.257595] systemd[1]: systemd 37 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +SYSVINIT +LIBCRYPTSETUP; fedora)
[ 3.271676] systemd[1]: Set hostname to <south.thelounge.net>.
[ 6.877056] EXT4-fs (md1): re-mounted. Opts: data=writeback,commit=45,barrier=1,delalloc,journal_async_commit,journal_checksum,inode_readahead_blks=128,noquota,nodiscard,acl,user_xattr
[ 7.157829] udevd[505]: starting version 173
Hello, exactly the same issue for me on my main laptop with kernel 3.3.0-4 or 3.3.0-8 (MSI GTX780 cpu i7-2670qm). boot time is now ~20s instead of 12s with kernel 3.2.10. Surprizingly, from time to time, the boot time reverts to 12s. After checking systemd-analyze plot, i saw that systemd-readahead-collect is running at almost each boot; IMHO, it explains degraded performances (for boot, but not for shutdown). It is kernel related, because this don't happen with kernel 3.2.10. However, F16 + kernel 3.3.0-4 is running fine on a Acer 1825ptz (cpu ulv SU4100). the performances are the same for kernel 3.2.10 and 3.3.0-4. Created attachment 574324 [details]
systemd-analyze plot
readahead-collect running during the boot
Created attachment 574326 [details]
systemd-analyse plot
This is a one of the few occurrences where kernel 3.3.0-8 has the same performance as 3.2.10.
I enabled debugging for systemd, but no errors or warnings, or suspicious things. Obviously something in the kernel 3.3.0-4 (or -8) broke the mechanism of systemd for some hardware, but I'm not convinced this is a kernel bug (rather a systemd bug ?) I will install a vanilla kernel to do a git bisect. (unless the cause of the issue comes from a Fedora patch ?) i fear there are race-conditions between udev/systemd/kernel introduced with 3.3 * my first bugreport against 3.3 was becuse no boot at all in a VM * after reinstall packages and some snakeoil magic it bootet * some days later the same problem again (udev segfault) * so this happens randomly witgout software changes * most of my boot time is generally udev * reference: https://bugzilla.redhat.com/show_bug.cgi?id=805317 maybe in virtual machines whatever race condition is triggered harder because much fewer "hardware" and so udev usually faster, leading normally to boots around 5 seconds but makes the current problem much more noisy if is 3-4 times slower and not 100% repdrodceuable would it not be a godd idea if kernel-packages ping udev/systemd-maintainers to look in this three components what may have introduced this? the fact that machines sometimes not boot at all with 3.3 currently forces me to let the last 3.2.x run in production what is from view of security suboptimal git bisect finished! it was a bit tedious, but I found the culprit; it's an upstream commit (see attachments below for details) : commit 7cb92499000e3c86dae653077b1465458a039ef6 Author: Paul E. McKenney <paul.mckenney> Date: Mon Nov 28 12:28:34 2011 -0800 rcu: Permit dyntick-idle with callbacks pending I was wrong to think that systemd was involved. The slow start and sluggish shut-down are perfectly correlated; and I think (perhaps wrongly) that systemd is not involved in umounting old root. Created attachment 575157 [details]
git bisect log
git bisect log
Created attachment 575160 [details]
git bisect visualize
It is hardware related, as one of my laptop run smoothly with this commit. How can I report my hardware to complete the bug submission (which tool) ? Another question : as it is a quite old patch, i think it was introduce in fedora kernel with RCU_FAST_NO_HZ ? (In reply to comment #20) > It is hardware related, as one of my laptop run smoothly with this commit. How > can I report my hardware to complete the bug submission (which tool) ? smoltSendProfile -p will print a profile to stdout. Thank you very much for doing the vanilla bisect. It is much appreciated. I've emailed Paul McKenney about this and will post a link to the upstream conversation when it shows up in an archive. (In reply to comment #21) > Another question : as it is a quite old patch, i think it was introduce in > fedora kernel with RCU_FAST_NO_HZ ? The patch you found in your bisect was merged into the mainline kernel with 3.3-rc1, so it's not that old. Created attachment 575381 [details]
smoltSendProfile -p
BTW, I tried 3.4.0-rc1 with and without CONFIG_RCU_FAST_NO_HZ.
with CONFIG_RCU_FAST_NO_HZ=y :
- random startup time from 12s to 28s (systemd-analyze)
- always sluggish shut-down (10s to 15s)
with # CONFIG_RCU_FAST_NO_HZ is not set :
- stable 7-8s startup time
- stable 5s shut-down time
these are exactly the differences i notice on most virtual machines thank you for debugging this i do not understand exactly what "CONFIG_RCU_FAST_NO_HZ" does but it is a candidate for "dislike" can this only be done as compile-flag? it would be really better as grub-param because in my case seeing slow down virtual machines drastically while they are running fine with NO_HZ it could be disabled and should only be a "soft preset" __________________________________ CONFIG_RCU_FAST_NO_HZ: Accelerate last non-dyntick-idle CPU's grace periods General informations The Linux kernel configuration item CONFIG_RCU_FAST_NO_HZ: prompt: Accelerate last non-dyntick-idle CPU's grace periods type: bool depends on: CONFIG_NO_HZ && CONFIG_SMP defined in init/Kconfig found in Linux kernels: 2.6.34–2.6.39, 3.0–3.3, 3.4-rc+HEAD Help text This option causes RCU to attempt to accelerate grace periods in order to allow CPUs to enter dynticks-idle state more quickly. On the other hand, this option increases the overhead of the dynticks-idle checking, particularly on systems with large numbers of CPUs. Say Y if energy efficiency is critically important, particularly if you have relatively few CPUs. Say N if you are unsure. Created attachment 579003 [details]
the same with 3.3.2-4.fc16.x86_64
is there any progress?
up to 70 seconds in a virtual machine which are booting normally mcuh faster than physical ones because minimum of hardware on a Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz with 16 GB RAM is really odd
[root@testserver:~]$ uname -r
3.3.2-4.fc16.x86_64
[root@testserver:~]$ systemd-analyze
Startup finished in 5937ms (kernel) + 922ms (initramfs) + 63799ms (userspace) = 70659ms
Created attachment 579125 [details]
hanging boot
see screenshot
this happens randomly booting vmware-guests with 3.3.x
you have to wait some minutes, each boot line longs forever
after the "emergency mode" you can wait a while and after CTRL+D boot will finish without any real problem
i have NEVER seen such a behavior before Kernel 3.3 and i am maintaining 25 Fedora productions and al ot of test-guests since 2008 and Fedora 9 on VMware ESXi 4.0/4.4, VMwareServer (until 2009) and VMwareWorkstation 6-8
Created attachment 579161 [details]
failed boot part 1
sometimes boot fails completly because udev seems to timeout / fail
as said: only with 3.3.x
this is part 1, another screen follows
Created attachment 579162 [details]
failed boot part 2
this is the second screen of a completly failing boot
afer waiting around a minute
Created attachment 580740 [details] rcu tracing during a slow boot kernel rcu tracing log during boot and associated "systemd-analyze plot" during boot for both with and without FAST_NO_HZ: # tar xjf report.tar.bz2 report/ report/without_FAST_NO_HZ/ report/without_FAST_NO_HZ/plot.svg report/without_FAST_NO_HZ/rcu_trace.txt report/with_FAST_NO_HZ/ report/with_FAST_NO_HZ/plot.svg report/with_FAST_NO_HZ/rcu_trace.txt report/kernel_config # grep ksoftirq report/without_FAST_NO_HZ/rcu_trace.txt |wc -l 3971 # grep ksoftirq report/with_FAST_NO_HZ/rcu_trace.txt |wc -l 74039 The kernel config is exactly the same for both kernel (apart FAST_NO_HZ). CAN WE PLEASE GET THIS FIXED? is see 3.3.3 building and change-log doe snot contain a fix for this http://koji.fedoraproject.org/koji/buildinfo?buildID=315834 Harald, I am sorry to say that, but: what are YOU currently doing to get this fixed ? WHAT should i do in your opinion? hack the fedora infrastructure and set "CONFIG_RCU_FAST_NO_HZ=n" what was introduced with the 3.3 kernels and making troubles since more than a month? Does booting with "nohz=off" help as a workaround? this is not a workaround because it disables tickless kernel completly which means on a host with 25 virtual machines you have a cpu-load about 1200 MHz (around 50 MHz per guest) instead 0 Mhz host-load from guests which are waiting for operations two days ago as anonymous forced a DDOS on our main server this would have killed all operations Harald, for now we have no certainty that your problem is exactly the same as mine. It's not so difficult to get fedora kernel sources and patch .config file (and then rebuild binary rpm). It would at lest qualify the problem on different hardware than mine. Michal, your workaround works well on my laptop with a standard fedora kernel (3.3.2-6.fc16). It can help users who don't use fedora 16 for a production server (IMHO, 99% of fedora users, as many system administrators would prefer RHEL6 or similar for production). > Harald, for now we have no certainty that your problem is exactly the
> same as mine.It's not so difficult to get fedora kernel sources and
> patch .config file (and then rebuild binary rpm).
i have not the ressources on my build-environment to rebuild a whole kernel
it is also not so difficult to provide a testing-kernel on koji with "CONFIG_RCU_FAST_NO_HZ=n" and you can expect that if one is available i will install it on some test-virtual-machines and reboot them 50 times
(In reply to comment #36) > Does booting with "nohz=off" help as a workaround? It doesn't work on my latest 4.1.14 virtualbox. Fedora 17 Alpha needs the same 20 minutes to boot in single user mode!?!?! I hope on beta the things are better. I haven't issued any "yum update" yet, since I am estimating that would take a week to complete. Maybe a "yum update kernel" that would take 1 day or to re-install Fedora 17 beta might be the best solution, given that the problem is solved on latest kernels. Usually, such kind of problems are charged to Virtualbox as bugs, but since it has also been identified on non-virtualised hardware, it looks as a kernel issue that needs attention. disabled for next build. (In reply to comment #41) > disabled for next build. Your choice, but: I tried f17 beta in a Virtualbox VM (host f16 with 3.3 or 3.4 kernel), and unlike Panos, I just experiment a little delay (~10s) during boot, same for shutdown. Panos, you must try f17 beta before reporting a bug here. For Harald, he will not be happy anyway, since he does not even ask himself if the host (kernel 3.3) can be responsible for the VM slowdown, and since he don't want to try anything (and probably he could be disappointed by FAST_NO_HZ=n). My opinion is that the problem must be solved now, not when it is buried under a ton of patches. (bug = kernel ? systemd ?). Much longer FAST_NO_HZ is enabled, and more likely we get a new bug's report with people willing to help. > For Harald, he will not be happy anyway, since he does not even ask
> himself if the host (kernel 3.3) can be responsible for the VM
> slowdown, and since he don't want to try anything (and probably
> he could be disappointed by FAST_NO_HZ=n)
you are missing the point that my host ist not only linux
my host is also VMware ESXi 4.1 (current patch level)
i am speaking here not about one or two virtual machines
my expierience is about 30 virtual machines on Linux and ESXi hosts
the problem arrived with the first 3.3 build long before it went to updates-testing, the host at home this time was NOT 3.3 and so yes it is pretty sure any change in 3.3
"CONFIG_RCU_FAST_NO_HZ=y" is NOT recommedned upstream for generic kernels
Hi , will you notify us when a kernel is available for testing on F17 ? (In reply to comment #44) > Hi , will you notify us when a kernel is available for testing on F17 ? Bodhi will leave a comment here when it has been pushed to the repo, as usual. kernel-3.3.4-3.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/kernel-3.3.4-3.fc17 hi , I test it in my vm virtualbox F17 and still stuck about 2 minutes to shutdown, /var/log/messages is clear between last message and first message of next boot is 2 m 10' . But disable selinux , fix it and reboot in 24 seconds May 3 06:00:50 rawhide rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="588" x-info="http://www.rsyslog.com"] exiting on signal 15. May 3 06:01:14 rawhide kernel: imklog 5.8.10, log source = /proc/kmsg started. (In reply to comment #47) > hi , I test it in my vm virtualbox F17 and still stuck about 2 minutes to > shutdown You're likely seeing bug 816842 or bug 739836. Package kernel-3.3.4-3.fc17: * should fix your issue, * was pushed to the Fedora 17 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing kernel-3.3.4-3.fc17' as soon as you are able to, then reboot. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2012-7212/kernel-3.3.4-3.fc17 then log in and leave karma (feedback). so, and now it is confirmed that "CONFIG_RCU_FAST_NO_HZ=y" is the reason for crawling and partly even failing boots due udev-timeouts and without CONFIG_RCU_FAST_NO_HZ boot times are around SIX TIMES faster!
> * Mon Apr 30 2012 Dave Jones <davej>
> - Disable CONFIG_RCU_FAST_NO_HZ for now. (rhbz 806548)
PLEASE not only "for now"
kernel upstream says clearly "if you are unsure say N"
this is not for GENERIC distro-kernels
__________________________________________________
the following differences are repeatable on several machines
for each boot and the crawling shutdown is with 3.3.4-3
also away and all is booting/shutdown/reboot like before
any Fedora 3.3.x kernel!
______________________
previous 3.3 kernels:
[root@buildserver64:~]$ systemd-analyze
Startup finished in 1468ms (kernel) + 2652ms (initramfs) + 35619ms (userspace) = 39740ms
______________________
[root@buildserver64:~]$ uname -r
3.3.4-3.fc16.x86_64
[root@buildserver64:~]$ systemd-analyze
Startup finished in 1462ms (kernel) + 1094ms (initramfs) + 4154ms (userspace) = 6711ms
kernel-3.3.4-3.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/kernel-3.3.4-3.fc16 (In reply to comment #48) > (In reply to comment #47) > > hi , I test it in my vm virtualbox F17 and still stuck about 2 minutes to > > shutdown > > You're likely seeing bug 816842 or bug 739836. Thanks for pointing this 2 bugs, I just test it on the host with kernel-3.3.4-3 for F16 and I think that is better. (In reply to comment #50) > so, and now it is confirmed that "CONFIG_RCU_FAST_NO_HZ=y" is the reason for > crawling It would more correct to say: A bug in v3.3's implementation of "CONFIG_RCU_FAST_NO_HZ=y" is the reason for crawling. Previous Fedora kernels had the option enabled without problems. Surely it can be enabled again once the bug gets fixed. Thanks to Pascal's efforts to help Paul McKenney debug it, it may be soon. Paul already has a patch: http://git.kernel.org/?p=linux/kernel/git/paulmck/linux-rcu.git;a=commit;h=d302c2d403e5f84fc225cb32a61fc4e0d06e4f4e kernel-3.3.4-3.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report. kernel-3.3.4-3.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report. kernel-2.6.43.5-2.fc15 has been submitted as an update for Fedora 15. https://admin.fedoraproject.org/updates/kernel-2.6.43.5-2.fc15 kernel-2.6.43.5-2.fc15 has been pushed to the Fedora 15 stable repository. If problems still persist, please make note of it in this bug report. *** Bug 824983 has been marked as a duplicate of this bug. *** |