Bug 1030719
Summary: | tmux segfaults on start in 32-bit Fedora 20 TC1 install images due to libevent-2.0.so.5 being corrupt, causing anaconda to fail to start | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Andre Robatino <robatino> |
Component: | pungi | Assignee: | Dennis Gilmore <dennis> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 20 | CC: | anaconda-maint-list, awilliam, dcantrell, dennis, dshea, g.kaviyarasu, johannbg, jonathan, lnykryn, mkolman, mkrizek, msekleta, plautrba, praiskup, psabata, rosset.filipe, samuel-rhbugs, satellitgo, strobert, sven, systemd-maint, twu, vanmeeuwen+fedora, vpavlin, zbyszek |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | AcceptedBlocker | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2013-11-28 01:21:53 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 980656 | ||
Attachments: |
Description
Andre Robatino
2013-11-15 02:36:41 UTC
Also seen in VirtualBox. nomodeset andbasic graphics boot tried all failed at "starting networkmanager" Almost the same behavior in KVM. 32-bit install images hang, 64-bit install images don't. Only difference is where the hang happens. In KVM, it hangs much earlier, at the line "dracut-initqueue[483]: mount: /dev/sr0 is write-protected, mounting read-only" which is just a few lines after the mediacheck (if the mediacheck is run). Setting to Accepted Blocker. Confirming, I see this as well in KVM. Moving off of distribution. Pushing to anaconda as a starting point; could end up in dracut or kernel. Can you boot with 'rd.debug' and see where it is hanging/cycling? Created attachment 824685 [details]
VirtualBox 4.3.2 hang after removing "quiet" and adding "rd.debug"
Created attachment 824686 [details]
KVM hang after removing "quiet" and adding "rd.debug"
For each of the above, I booted from Fedora-20-TC1-i386-DVD.iso. There were lots of lines visible starting with "dracut" while booting, but not when it hung, in both cases. Looks like maybe it's waiting for the network come up? Can you boot with systemd.log_level=debug systemd.log_target=console and include the output here? Created attachment 824688 [details]
KVM hang after adding "systemd.log_level=debug systemd.log_target=console", last page of output
The above attachment is the last page of output. I can scroll up and get screenshots of earlier output if you need it. After a long wait, there was some additional output. If you need multiple pages of output, please let me know so I can grab all of them before it changes. Yes, we'd want all the output. https://bugzilla.redhat.com/show_bug.cgi?id=1030804 may be another case of this. For the record, Andre set this to AcceptedBlocker correctly under the "Automatic Blockers" policy - https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers : "Certain types of bugs are considered automatic blockers. These bugs can be marked as AcceptedBlocker by any member of one of the stakeholder groups without formal review...Complete failure of any release-blocking TC/RC image to boot at all under any circumstance - "DOA" image (conditional failure is not an automatic blocker)" (In reply to Adam Williamson from comment #12) > Yes, we'd want all the output. Since this is 100% reproducible, and can be seen easily with the 32-bit netinst, I'd rather someone else who is more knowledgeable get the output for themselves, rather than trying to get it from me by remote control. I'm also looking at providing info on bug 1006386, which is a proposed blocker, and that's not 100% reproducible by other people, so I'd rather spend my time with that, sorry. *** Bug 1030804 has been marked as a duplicate of this bug. *** Confirm this bug with 32-bit netinst.iso in a KVM. Booting with systemd debug logging to serial console, it looks like anaconda is failing to start up: anaconda.target changed dead -> active Job anaconda.target/start finished, result=done About to execute: /usr/bin/tmux -u -f /usr/share/anaconda/tmux.conf start Forked /usr/bin/tmux as 883 anaconda.service changed dead -> start network.target changed dead -> active Job network.target/start finished, result=done (iscsi fails, irrelevant) Set up jobs progress timerfd. Received SIGCHLD from PID 883 (tmux). Got SIGCHLD for process 883 (tmux) Child 883 died (code=killed, status=11/SEGV) Child 883 belongs to anaconda.service anaconda.service: control process exited, code=killed status=11 anaconda.service got final SIGCHLD for state start anaconda.service changed start -> failed Job anaconda.service/start finished, result=failed Job anaconda-tmux/start finished, result=dependency Closed jobs progress timerfd. Unit anaconda.service entered failed state. So, it looks like tmux is crashing, perhaps? Not sure how to get more details about that. OK, so yeah, if I boot to emergency.target , start anaconda.target , switch to a console on VT2 and run: HOME=/root MALLOC_CHECK_=2 MALLOC_PERTURB_=204 /usr/bin/tmux -u -f /usr/share/anaconda/tmux.conf start i.e. basically recreating what anaconda.service does, I get an immediate segfault. That's clearly the problem. Now to figure out why... Just running a bare 'tmux' is enough to cause an instant segfault. journal has: tmux[959]: segfault at 4 ip b77a6174 sp bfe465e0 error 4 in ld-2.18.so[b779b000+1f000] Created attachment 824759 [details]
coredump from the crashed tmux
I managed to generate and extract a coredump from the crashing tmux. Need to run it through gdb or abrt to get a usable backtrace. Here it is.
Created attachment 824760 [details]
backtrace of the crash
Assuming I did it right, here's a backtrace for the crash. Generated by installing Final TC1 from the 32-bit desktop live, installing gdb and 'debuginfo-install tmux', and running gdb on the coredump and doing 'thread apply all bt full'.
Data-Point: I've installed F19-32bit, upgraded to F20 via fedup, did a yum upgrade (with updates-testing active) and tmux works fine. Yeah, it also works fine if you install the 32-bit desktop live and 'yum install tmux'. It's something about the environment on the TC1 32-bit non-live images that's zapping it. Seeing the _dl calls in the coredomp stacktrace, took a look around at the dynamic libs it was using. I noticed when I 'ldd /bin/tmux' from the FC20-TC!-DVD based install it doesn't show an entry for libpthread. whereas it does show a libpthread on the 64bit DVD (checked during the install heading to the console after the installer went graphical). out of curiosity what is the installer using tmux for? did some basic searching but didn't find any notes on it... not sure best way to see why libpthread is not being loaded right by the dynamic linker, but doing a strace on tmux inside the installer where it segfaults and in a normal FC20 install it looks like libpthread related. outside of the installer open's for libresolv,libc,libdl, then libpthread. inside of the installer right after the libdl.so.2 open is the segfault. /usr/lib/libevent-2.0.so.5 is corrupt. libevent is what uses libpthread, and the copy on the i386 TC1 netinst ISO is all zeroes after byte 326774. Replace it with the copy from libevent-2.0.21-3.fc20.i686.rpm and everything works fine. So I guess this should get reassigned to whatever component gets install image bugs but I don't know what that should be. Let's call it pungi and throw it at Dennis. (In reply to Steven Roberts from comment #25) > out of curiosity what is the installer using tmux for? did some basic > searching but didn't find any notes on it... We have number of TTYs for different uses - shell, logs & optionally TUI. Tmux is used to enable easy switching between the TTYs, scrollback, copy-paste, etc. Status reviewed at 2013-11-20 blocker review meeting: http://meetbot.fedoraproject.org/fedora-blocker-review/2013-11-20/f20-blocker-review.2013-11-20-17.00.log.txt . We still have no idea what caused this, but our current plan is to build TC2 and see if it magically goes away. FEEL THE PROFESSIONALISM Magically went away in 20 Final TC2. yeah, forgot to note I had answered my own question when I dove into debugging what was happening. I looked at the tmux conf file being used the it became clear what was going on. I like the idea. a number of times I have wished for better scrollback on the various install output virtual consoles. This was OK in TC2 and I think TC3, so let's close it. fired up the ISO based install for TC3 32-bit on a VMware VM. - installer kicked up fine. - I checked the tmux panes and confirmed I could switch between them. - I checked md5sum of libevent-2.0.so.5.1.9 of the copy during the installer and on a copy from a fc20 test box (started from fc19 and then upgraded via the yum process) and they match. |