482907 – f11alpha anaconda hanging when moving to timezone screen

Bug 482907 - f11alpha anaconda hanging when moving to timezone screen

Summary: f11alpha anaconda hanging when moving to timezone screen

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	anaconda
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Anaconda Maintenance Team
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	482950 (view as bug list)
Depends On:
Blocks:	F11Alpha, F11AlphaBlocker
TreeView+	depends on / blocked

Reported:	2009-01-28 19:13 UTC by Mark McLoughlin
Modified:	2009-01-30 20:48 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-01-30 20:48:11 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
1411.log (2.92 MB, text/plain) 2009-01-29 16:08 UTC, Mark McLoughlin	no flags	Details
View All

Description Mark McLoughlin 2009-01-28 19:13:09 UTC

Doing an "console=ttyS0 vnc" install using netinst.iso. x86_64 KVM guest on recent rawhide x86_64 host.

Click Next on the hostname screen and it hangs

Log just says "moving (1) to step timezone"

strace shows the parent process doing this over and over:

--- SIGSEGV (Segmentation fault) @ 0 (0) ---
rt_sigaction(SIGSEGV, {0x711f70, [], SA_RESTORER, 0x11f690}, {0x711f70, [], SA_RESTORER, 0x11f690}, 8) = 0
rt_sigreturn(0xb)     = 140591843010560

and a child process doing this over and over:

ppoll([{fd=0, events=POLLIN|POLLPRI|POLLERR|POLLHUP|0x400}], 1, {18446744073709551615, 18446744073709551615}, ~[HUP INT TERM RTMIN], 8) = -1 EINVAL (Invalid argument)
recvfrom(0, 0x7fff32b4e960, 8988, 64, 0x7fff32b4e7c0, 0x7fff32b4e7cc) = -1 EAGAIN (Resource temporarily unavailable)

Comment 1 Jesse Keating 2009-01-29 04:31:42 UTC

*** Bug 482950 has been marked as a duplicate of this bug. ***

Comment 2 Hans de Goede 2009-01-29 13:23:20 UTC

kvm is the keyword here, there are know issues with kvm and Fedora >= 10, try adding "clocksource=acpi_pm" to the kernel cmdline (pass -x "clocksource=acpi_pm" to virt-install)

Given that the known problems are clocksource related (and include hard hangs of the virtual machine), I think this has the same cause. Therefore I suggest closing this as duplicate of the kvm bug tracking this.

Comment 3 Mark McLoughlin 2009-01-29 14:52:22 UTC

(In reply to comment #2)
> kvm is the keyword here, there are know issues with kvm and Fedora >= 10, try
> adding "clocksource=acpi_pm" to the kernel cmdline (pass -x
> "clocksource=acpi_pm" to virt-install)
> 
> Given that the known problems are clocksource related (and include hard hangs
> of the virtual machine), I think this has the same cause.

I have just reproduced the issue with clocksource=acpi_pm, so it's not the same as bug #475598

Comment 4 Mark McLoughlin 2009-01-29 16:08:01 UTC

Created attachment 330370 [details]
1411.log

Okay, here's an strace log of the parent process from before the welcome screen

The log of the child process is just 8Mb of the poll() = EINVAL, recvfrom() = EAGAIN loop, so I'm not bothering attaching that

Comment 5 Will Woods 2009-01-29 16:28:34 UTC

I've reproduced this on two bare-metal machines (i386 and ppc64), with local graphics and VNC.

It's not KVM.

Comment 6 Mark McLoughlin 2009-01-29 18:28:07 UTC

FWIW, the log shows anaconda talking to dbus then trying to load anaconda.mo (it wants to print an error maybe?) and then segvs:

1233244607.898335 select(19, [18], [18], NULL, NULL) = 1 (out [18])
1233244607.898500 writev(18, [{"%\4\1\0"..., 4}, {NULL, 0}, {""..., 0}], 3) = 4
1233244607.898882 read(18, 0x2570cc4, 4096) = -1 EAGAIN (Resource temporarily unavailable)
1233244607.899114 gettimeofday({1233244607, 899150}, NULL) = 0
1233244607.900480 stat("/usr/share/locale/en_US.UTF8/LC_MESSAGES/anaconda.mo", 0x7fff3b5d52a0) = -1 ENOENT (No such file or directory)
1233244607.900738 stat("/usr/share/locale/en_US/LC_MESSAGES/anaconda.mo", 0x7fff3b5d52a0) = -1 ENOENT (No such file or directory)
1233244607.900962 stat("/usr/share/locale/en.UTF8/LC_MESSAGES/anaconda.mo", 0x7fff3b5d52a0) = -1 ENOENT (No such file or directory)
1233244607.901270 stat("/usr/share/locale/en/LC_MESSAGES/anaconda.mo", 0x7fff3b5d52a0) = -1 ENOENT (No such file or directory)
1233244607.903095 --- SIGSEGV (Segmentation fault) @ 0 (0) ---

Comment 7 Chris Lumens 2009-01-29 20:46:10 UTC

There's no such file as /usr/share/local/en*/LC_MESSAGES/anaconda.mo.  I wonder what happens if you were to try installation in another language.  The closest I can think of would be to boot up with lang=en_GB, unless you can speak another language.

Comment 8 Will Woods 2009-01-29 20:47:43 UTC

Reading through TimezoneWindow.__init__() and comparing that to the strace, I believe those lines searching for anaconda.mo correspond to:

        ics.setTitle(_("Time Zone Selection"))

which has been there, unchanged, since 2002. So that's probably not the problem. 

My guess would be that we're dying in whatever's supposed to happen after iw/timezone_gui.py's TimezoneWindow() object is instantiated.

I'm having trouble tracing how the dispatcher and things work, so I'm not sure where that's happening..

Comment 9 Eric Paris 2009-01-29 21:11:15 UTC

i686 install.  I have 2 different encrypted partitions, both have the same password.  The installer asks me for the pword for my first encrypted partition.  If I give it the password and hit enter, it ask for the password for the next encrypted partition (like it should).    If I give it the password and check the "This is a global password" check box, it hangs.  The screen stops redrawing, if I drop out to shell top tells me there are two anaconda python processes each spinning as much cpu time as they can.

With what little visability that I have it appears like the same results I saw on the tz issue.   (which I got around on another machine by using kickstart)

Comment 10 Chris Lumens 2009-01-29 21:17:50 UTC

Adding in a pdb.set_trace() and stepping through, I have found the problem block of code:

        self.utcCheckbox.set_active(asUTC)

        if not anaconda.isKickstart:
            self.utcCheckbox.set_active(not hasWindows(anaconda.id.bootloader))

The first set_active succeeds in the default interactive case, since asUTC is set to 0.  On the second set_active call, I have no Windows partitions so we try to set the checkbox to active.  That call then never returns.

In addition, I have verified that hasWindows(anaconda.id.bootloader) returns in a timely fashion.  So it seems that self.utcCheckbox.set_active(True) is leading to an endless stream of segfaults.

Comment 11 Hans de Goede 2009-01-30 07:55:04 UTC

(In reply to comment #10)
> Adding in a pdb.set_trace() and stepping through, I have found the problem
> block of code:
> 
>         self.utcCheckbox.set_active(asUTC)
> 
>         if not anaconda.isKickstart:
>             self.utcCheckbox.set_active(not hasWindows(anaconda.id.bootloader))
> 
> The first set_active succeeds in the default interactive case, since asUTC is
> set to 0.  On the second set_active call, I have no Windows partitions so we
> try to set the checkbox to active.  That call then never returns.
> 
> In addition, I have verified that hasWindows(anaconda.id.bootloader) returns in
> a timely fashion.  So it seems that self.utcCheckbox.set_active(True) is
> leading to an endless stream of segfaults.

Interesting, in comment #9, eparis claims a similar hang when trying to click a checkbox in the encrypted partition password dialog, so it seems we are hitting a pygtk or gtk bug when a checkbox gets set (be it from code or with the mouse).

Comment 12 Ray Strode [halfline] 2009-01-30 14:58:56 UTC

Hey guys, I think you missed a good theory that Mark pointed out in comment 6.

Note the crash happens after sending a message to the bus.  There are dbus apis for construction messages that are variadic and easy to screw up.  If you don't terminate them properly, for instance, you could hose your stack and end up with a random crash shortly after triggered by some unrelated action that uses the stack.

Is there any C glue code in anaconda that uses dbus?

Comment 13 Ray Strode [halfline] 2009-01-30 15:09:55 UTC

Actually, looking closer, Mark, are you sure fd 18 is dbus?  Looking at the trace, 7 seems to be the bus connection.  

The stuff going over the wire for 18 doesn't match the bus wireformat and libdbus doesn't ever pass 4 io_vec's to it's writev calls.

Would be useful to see what fd 18 is.

Comment 14 Chris Lumens 2009-01-30 15:19:26 UTC

it seems unlikely to me that this is dbus related, given comment #11 (which references comment #9).  Also, there's nothing going on when moving to the timezone screen that involves dbus.  The only dbus-aware parts of anaconda are the networking and disk discovery code.

Comment 15 Ray Strode [halfline] 2009-01-30 15:37:47 UTC

Still, much more likely some memory corruption or something than bug-in-GtkCheckButton

Comment 16 Chris Lumens 2009-01-30 20:48:11 UTC

Strangely, we are not seeing this bug in Rawhide trees composed after our initial Alpha freeze.  Whether this is due to the package set, the composition location, or some transient problem I do not know.  However, since it looks like we are going to move the alpha on to a tree without this bug, I'm going to go ahead and close it as RAWHIDE.

If we do continue to see it in more nightly trees, I guess we can go back to panicking.

Note You need to log in before you can comment on or make changes to this bug.