Bug 1030719

Summary: tmux segfaults on start in 32-bit Fedora 20 TC1 install images due to libevent-2.0.so.5 being corrupt, causing anaconda to fail to start
Product: [Fedora] Fedora Reporter: Andre Robatino <robatino>
Component: pungiAssignee: Dennis Gilmore <dennis>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 20CC: anaconda-maint-list, awilliam, dcantrell, dennis, dshea, g.kaviyarasu, johannbg, jonathan, lnykryn, mkolman, mkrizek, msekleta, plautrba, praiskup, psabata, rosset.filipe, samuel-rhbugs, satellitgo, strobert, sven, systemd-maint, twu, vanmeeuwen+fedora, vpavlin, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard: AcceptedBlocker
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-28 01:21:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 980656    
Attachments:
Description Flags
VirtualBox 4.3.2 hang after removing "quiet" and adding "rd.debug"
none
KVM hang after removing "quiet" and adding "rd.debug"
none
KVM hang after adding "systemd.log_level=debug systemd.log_target=console", last page of output
none
coredump from the crashed tmux
none
backtrace of the crash none

Description Andre Robatino 2013-11-15 02:36:41 UTC
Description of problem:
On booting either the 32-bit 20 Final TC1 netinst or DVD, it hangs at "Starting Wait for Plymouth Boot Screen to Quit..." (or sometimes the following line "Starting Network Manager"). The 64-bit install images boot normally.

Version-Release number of selected component (if applicable):
32-bit 20 Final TC1 install images

How reproducible:
always

Additional info:
Tested in VirtualBox 4.3.2. If other people reproduce this, it's an automatic blocker under https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers : "Complete failure of any release-blocking TC/RC image to boot at all under any circumstance - "DOA" image (conditional failure is not an automatic blocker)".

Comment 1 satellitgo 2013-11-15 03:08:00 UTC
Also seen in VirtualBox. nomodeset andbasic graphics boot tried all failed at "starting networkmanager"

Comment 2 Andre Robatino 2013-11-15 05:01:29 UTC
Almost the same behavior in KVM. 32-bit install images hang, 64-bit install images don't. Only difference is where the hang happens. In KVM, it hangs much earlier, at the line "dracut-initqueue[483]: mount: /dev/sr0 is write-protected, mounting read-only" which is just a few lines after the mediacheck (if the mediacheck is run).
Setting to Accepted Blocker.

Comment 3 Martin Krizek 2013-11-15 11:27:47 UTC
Confirming, I see this as well in KVM.

Comment 4 Bill Nottingham 2013-11-15 18:38:20 UTC
Moving off of distribution. Pushing to anaconda as a starting point; could end up in dracut or kernel.

Can you boot with 'rd.debug' and see where it is hanging/cycling?

Comment 5 Andre Robatino 2013-11-15 19:49:22 UTC
Created attachment 824685 [details]
VirtualBox 4.3.2 hang after removing "quiet" and adding "rd.debug"

Comment 6 Andre Robatino 2013-11-15 19:50:17 UTC
Created attachment 824686 [details]
KVM hang after removing "quiet" and adding "rd.debug"

Comment 7 Andre Robatino 2013-11-15 19:51:56 UTC
For each of the above, I booted from Fedora-20-TC1-i386-DVD.iso. There were lots of lines visible starting with "dracut" while booting, but not when it hung, in both cases.

Comment 8 David Shea 2013-11-15 19:56:11 UTC
Looks like maybe it's waiting for the network come up? Can you boot with systemd.log_level=debug systemd.log_target=console and include the output here?

Comment 9 Andre Robatino 2013-11-15 20:09:58 UTC
Created attachment 824688 [details]
KVM hang after adding "systemd.log_level=debug systemd.log_target=console", last page of output

Comment 10 Andre Robatino 2013-11-15 20:11:06 UTC
The above attachment is the last page of output. I can scroll up and get screenshots of earlier output if you need it.

Comment 11 Andre Robatino 2013-11-15 20:13:38 UTC
After a long wait, there was some additional output. If you need multiple pages of output, please let me know so I can grab all of them before it changes.

Comment 12 Adam Williamson 2013-11-15 22:01:56 UTC
Yes, we'd want all the output.

Comment 13 Adam Williamson 2013-11-15 22:02:23 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1030804 may be another case of this.

Comment 14 Adam Williamson 2013-11-15 22:04:00 UTC
For the record, Andre set this to AcceptedBlocker correctly under the "Automatic Blockers" policy - https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers :

"Certain types of bugs are considered automatic blockers. These bugs can be marked as AcceptedBlocker by any member of one of the stakeholder groups without formal review...Complete failure of any release-blocking TC/RC image to boot at all under any circumstance - "DOA" image (conditional failure is not an automatic blocker)"

Comment 15 Andre Robatino 2013-11-15 22:09:20 UTC
(In reply to Adam Williamson from comment #12)
> Yes, we'd want all the output.

Since this is 100% reproducible, and can be seen easily with the 32-bit netinst, I'd rather someone else who is more knowledgeable get the output for themselves, rather than trying to get it from me by remote control. I'm also looking at providing info on bug 1006386, which is a proposed blocker, and that's not 100% reproducible by other people, so I'd rather spend my time with that, sorry.

Comment 16 Adam Williamson 2013-11-15 22:47:06 UTC
*** Bug 1030804 has been marked as a duplicate of this bug. ***

Comment 17 Adam Williamson 2013-11-15 23:27:14 UTC
Confirm this bug with 32-bit netinst.iso in a KVM.

Booting with systemd debug logging to serial console, it looks like anaconda is failing to start up:

anaconda.target changed dead -> active
Job anaconda.target/start finished, result=done
About to execute: /usr/bin/tmux -u -f /usr/share/anaconda/tmux.conf start
Forked /usr/bin/tmux as 883
anaconda.service changed dead -> start
network.target changed dead -> active
Job network.target/start finished, result=done
(iscsi fails, irrelevant)
Set up jobs progress timerfd.
Received SIGCHLD from PID 883 (tmux).
Got SIGCHLD for process 883 (tmux)
Child 883 died (code=killed, status=11/SEGV)
Child 883 belongs to anaconda.service
anaconda.service: control process exited, code=killed status=11
anaconda.service got final SIGCHLD for state start
anaconda.service changed start -> failed
Job anaconda.service/start finished, result=failed
Job anaconda-tmux/start finished, result=dependency
Closed jobs progress timerfd.
Unit anaconda.service entered failed state.

So, it looks like tmux is crashing, perhaps? Not sure how to get more details about that.

Comment 18 Adam Williamson 2013-11-15 23:40:44 UTC
OK, so yeah, if I boot to emergency.target , start anaconda.target , switch to a console on VT2 and run:

HOME=/root MALLOC_CHECK_=2 MALLOC_PERTURB_=204 /usr/bin/tmux -u -f /usr/share/anaconda/tmux.conf start

i.e. basically recreating what anaconda.service does, I get an immediate segfault. That's clearly the problem. Now to figure out why...

Comment 19 Adam Williamson 2013-11-15 23:41:13 UTC
Just running a bare 'tmux' is enough to cause an instant segfault.

Comment 20 Adam Williamson 2013-11-15 23:47:49 UTC
journal has:

tmux[959]: segfault at 4 ip b77a6174 sp bfe465e0 error 4 in ld-2.18.so[b779b000+1f000]

Comment 21 Adam Williamson 2013-11-16 00:07:11 UTC
Created attachment 824759 [details]
coredump from the crashed tmux

I managed to generate and extract a coredump from the crashing tmux. Need to run it through gdb or abrt to get a usable backtrace. Here it is.

Comment 22 Adam Williamson 2013-11-16 00:55:56 UTC
Created attachment 824760 [details]
backtrace of the crash

Assuming I did it right, here's a backtrace for the crash. Generated by installing Final TC1 from the 32-bit desktop live, installing gdb and 'debuginfo-install tmux', and running gdb on the coredump and doing 'thread apply all bt full'.

Comment 23 Sven Lankes 2013-11-16 11:53:14 UTC
Data-Point:

I've installed F19-32bit, upgraded to F20 via fedup, did a yum upgrade (with updates-testing active) and tmux works fine.

Comment 24 Adam Williamson 2013-11-16 16:50:38 UTC
Yeah, it also works fine if you install the 32-bit desktop live and 'yum install tmux'. It's something about the environment on the TC1 32-bit non-live images that's zapping it.

Comment 25 Steven Roberts 2013-11-16 21:29:03 UTC
Seeing the _dl calls in the coredomp stacktrace, took a look around at the dynamic libs it was using.  I noticed when I 'ldd /bin/tmux' from the FC20-TC!-DVD based install it doesn't show an entry for libpthread.  whereas it does show a libpthread on the 64bit DVD (checked during the install heading to the console after the installer went graphical).

out of curiosity what is the installer using tmux for?  did some basic searching but didn't find any notes on it...

Comment 26 Steven Roberts 2013-11-17 03:44:09 UTC
not sure best way to see why libpthread is not being loaded right by the dynamic linker, but doing a strace on tmux inside the installer where it segfaults and in a normal FC20 install it looks like libpthread related.

outside of the installer open's for libresolv,libc,libdl, then libpthread.
inside of the installer right after the libdl.so.2 open is the segfault.

Comment 27 David Shea 2013-11-18 19:05:25 UTC
/usr/lib/libevent-2.0.so.5 is corrupt. libevent is what uses libpthread, and the copy on the i386 TC1 netinst ISO is all zeroes after byte 326774. Replace it with the copy from libevent-2.0.21-3.fc20.i686.rpm and everything works fine.

So I guess this should get reassigned to whatever component gets install image bugs but I don't know what that should be.

Comment 28 Adam Williamson 2013-11-18 19:17:13 UTC
Let's call it pungi and throw it at Dennis.

Comment 29 Martin Kolman 2013-11-19 15:12:45 UTC
(In reply to Steven Roberts from comment #25) 
> out of curiosity what is the installer using tmux for?  did some basic
> searching but didn't find any notes on it...
We have number of TTYs for different uses - shell, logs & optionally TUI.
Tmux is used to enable easy switching between the TTYs, scrollback, copy-paste, etc.

Comment 30 Adam Williamson 2013-11-20 19:58:21 UTC
Status reviewed at 2013-11-20 blocker review meeting: http://meetbot.fedoraproject.org/fedora-blocker-review/2013-11-20/f20-blocker-review.2013-11-20-17.00.log.txt . We still have no idea what caused this, but our current plan is to build TC2 and see if it magically goes away. FEEL THE PROFESSIONALISM

Comment 31 Andre Robatino 2013-11-21 04:09:38 UTC
Magically went away in 20 Final TC2.

Comment 32 Steven Roberts 2013-11-21 07:45:07 UTC
yeah, forgot to note I had answered my own question when I dove into debugging what was happening.

I looked at the tmux conf file being used the it became clear what was going on.  I like the idea.  a number of times I have wished for better scrollback on the various install output virtual consoles.

Comment 33 Adam Williamson 2013-11-28 01:21:53 UTC
This was OK in TC2 and I think TC3, so let's close it.

Comment 34 Steven Roberts 2013-11-29 00:08:24 UTC
fired up the ISO based install for TC3 32-bit on a VMware VM.
- installer kicked up fine.
- I checked the tmux panes and confirmed I could switch between them.
- I checked md5sum of libevent-2.0.so.5.1.9 of the copy during the installer and on a copy from a fc20 test box (started from fc19 and then upgraded via the yum process) and they match.