Bug 128003

Summary:	firstboot or rhgb hangs at gray screen
Product:	[Fedora] Fedora	Reporter:	Barry K. Nathan <barryn>
Component:	rhgb	Assignee:	Daniel Veillard <veillard>
Status:	CLOSED RAWHIDE	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	rawhide	CC:	barryn, bnocera, otaylor
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-09-16 07:22:37 UTC	Type:	---
Bug Blocks:	123268

Description Barry K. Nathan 2004-07-16 10:33:00 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040625

Description of problem:
FC-devel on 20040714 has firstboot freezing at a gray screen for some
reason. 20040715 has the same problem. 20040713 does not have the problem.

Right now I'm filing against distribution because I'm not sure which
package is guilty. However, I intend to find out within the next few
hours.

Version-Release number of selected component (if applicable):
fc-devel snapshots on 20040714 and 20040715

How reproducible:
Always

Steps to Reproduce:
Procedure A:

1. Install fc-devel from 20040714 or 20040715, with a Personal Desktop
install and the standard package selection.
2. After installation finishes, reboot.
3. Wait for firstboot to come up.

    
Procedure B:

1. Repeat procedure A, except using FC 3 test1 or FC-devel from the
last several days (but *before* 20040714).
2. At firstboot, press Control-Alt-Backspace to kill the X server.
3. Log in as root. (There are no unprivileged user accounts at this
point.)
4. Do any other system setup stuff you feel like doing.
5. Edit /etc/yum.conf so it points to a 20040714 snapshot of fc-devel
(I imagine 20040715 would work too but I only tested 20040715 with
fresh installs, not upgrades via yum).
6. Use "yum upgrade" to bring the system up to 20040714.
7. Restart.

Actual Results:  After RHGB finishes its business, the X server stops
and starts again. There is a gray screen, and perhaps a little disk
activity, but nothing else happens even after waiting an hour or more
on 1.3-1.4GHz machines.

Expected Results:  Firstboot should not stop at a gray screen and
should proceed as normal.

Additional info:

I'm reproducing this 100% of the time on both of the test machines
I've tried so far. Also, adding "selinux=0" to the kernel command line
does not make the problem go away.

If you kill the frozen firstboot, log in as root, and examine
/var/log/Xorg.0.log.old, then there are messages at the end of the log
that range from somewhat odd to totally bizarre, depending on the test
machine's hardware.

I erased the log messages from one of my test machines, but if it
doesn't take too long I may try to reproduce it again and I'll post
the messages to this bug. On the other test machine (using the "vesa"
driver on a VIA C3M266 motherboard's onboard video), there's this one
line at the end of the log:
AUDIT: Fri Jul 16 02:46:16 2004: 3645 X: client 4 rejected from local host
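For anyone wanting to pull such lines out of a saved log, here is a rough sketch. The pattern is modelled only on the single AUDIT line quoted above, so other X server versions may need a different one. The helper names (`parse_audit_line`, `audit_entries`) are made up for illustration, and reading the number before `X:` as the server's process ID is an assumption.

```python
import re
from datetime import datetime

# Pattern based on the one AUDIT line quoted in this report; not a general
# Xorg log grammar.
AUDIT_RE = re.compile(
    r"^AUDIT: (?P<when>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}): "
    r"(?P<pid>\d+) X: (?P<message>.*)$"
)


def parse_audit_line(line):
    """Return a dict for an AUDIT line, or None if the line does not match."""
    m = AUDIT_RE.match(line.strip())
    if m is None:
        return None
    # Collapse runs of spaces so single-digit days ("Jul  6") parse as well.
    stamp = " ".join(m.group("when").split())
    return {
        "when": datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"),
        "pid": int(m.group("pid")),  # assumed to be the X server's PID
        "message": m.group("message"),
    }


def audit_entries(path="/var/log/Xorg.0.log.old"):
    """Collect every AUDIT entry from the given log file."""
    with open(path, errors="replace") as log:
        return [e for e in map(parse_audit_line, log) if e is not None]
```

Calling `audit_entries()` as root after killing the frozen firstboot would list whatever AUDIT lines the previous X session left behind.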

Right now I'm using ext2 filesystems on my test boxes, FWIW. Yesterday
I also reproduced this bug on xfs, so the type of filesystem isn't
making a difference.

I just tested again with kernel 2.6.7-1.478 rather than 1.486. The
weird Xorg.0.log.old message is gone now, but the problem remains. So
maybe the messages are harmless and completely unrelated.

Comment 1 Barry K. Nathan 2004-07-16 10:42:16 UTC

Actually, AFAICT sometimes rhgb freezes with a gray screen when it
should quit, and sometimes it's firstboot that freezes right after it
starts up. (I could be misperceiving the whole situation, however!)

Anyway, it turns out that downgrading rhgb from 0.12.2-1 to 0.11.2-7
makes the problem go away...

Comment 2 Barry K. Nathan 2004-08-23 06:51:48 UTC

Still happening with 2004-08-22 rawhide snapshot, although there's a
traceback first (I'll attach that at some point in the next 48 hours).

Comment 3 Barry K. Nathan 2004-08-23 06:57:00 UTC

The traceback I'm getting is probably the same as bug 130567.

Comment 4 Barry K. Nathan 2004-08-23 07:26:37 UTC


*** This bug has been marked as a duplicate of 129532 ***

Comment 5 Barry K. Nathan 2004-08-23 07:40:31 UTC

My workaround of downgrading rhgb still works, but that doesn't affect
the traceback. So, the traceback I mentioned in comment #3 appears to
be unrelated to this bug.

Comment 6 Adrian Likins 2004-09-09 22:24:33 UTC

do we have a bracket of what versions of rhgb do and do not work?

Comment 7 Daniel Veillard 2004-09-09 22:42:55 UTC

No idea, I have a snapshot of rawhide from mid-July and so far
I have never been able to reproduce that problem. I have seen
an X server without any output on X86_64, but it wasn't specific
to the one running rhgb, and it came after a bunch of nightly update failures.

Doing a diff of the source between the version from 5 months ago and
last week's shows only:
  - a change in the gtk widget code to accommodate xinerama,
    which I doubt can produce that effect
  - a patch for close on exec of a socket
  - correctly unmounting the ramfs, as needed if exec'ing the X
    server failed
  - and a 0 -> NULL pointer cleanup fix.
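
For readers unfamiliar with the "close on exec" item in the list above: it means flagging a file descriptor so that it is closed automatically when the process exec's another program, instead of being inherited by it. rhgb itself is written in C and the actual patch is not shown in this report, so the following is only a Python illustration of the general technique; `set_close_on_exec` is a made-up helper name.

```python
import fcntl
import socket


def set_close_on_exec(sock):
    """Flag the socket's descriptor so it is closed across exec()."""
    fd = sock.fileno()
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    # Preserve any existing descriptor flags and add FD_CLOEXEC.
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
```

Without the flag, a program started via exec (here, presumably the X server) would keep an open copy of the parent's socket.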

I don't see how any of those can generate the stated result. And
I can't reproduce it to chase where this may occur.

Daniel

Comment 8 Daniel Veillard 2004-09-09 23:05:18 UTC

err ... I have no snapshot of rawhide from mid-july ...
Will try to reproduce this tomorrow and check

Comment 9 Barry K. Nathan 2004-09-10 04:51:05 UTC

Quoting from (my) comment #1:

> Anyway, it turns out that downgrading rhgb from 0.12.2-1 to 0.11.2-7
> makes the problem go away...

0.12.2-1 breaks, 0.11.2-7 works. Does that help?
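
If intermediate rhgb builds were available, the bracket could be narrowed with a plain binary search. The sketch below assumes the builds are ordered oldest to newest and that every build after the first broken one is also broken. Only the two endpoints, 0.11.2-7 (works) and 0.12.2-1 (breaks), come from this report; the function name `first_bad` and the `is_bad` callback are hypothetical, and the callback would in practice mean "install this build, reboot, see whether the gray screen appears".

```python
def first_bad(builds, is_bad):
    """Return the first build for which is_bad(build) is true.

    Assumes builds[0] is known good and builds[-1] is known bad.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi]
```

With only the two known versions in the list, this returns 0.12.2-1 immediately without calling `is_bad` at all.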

Comment 10 Barry K. Nathan 2004-09-10 04:54:25 UTC

BTW, 0.11.2-7 also happens to be the rhgb version from FC3 test1.

Comment 11 Daniel Veillard 2004-09-10 10:06:54 UTC

Okay, problem reproduced. From here it is now possible to find
which change messed things up and get a proper fix.

Daniel

Comment 12 Daniel Veillard 2004-09-10 13:00:42 UTC

The grey screen appears for 2 reasons in rawhide:
  - firstboot can simply crash, see #129532
  - the patch supplied to fix xinerama handling, #115209 (or more
    precisely the part of the patch affecting splash.h and splash.c),
    makes firstboot hang

In the latter case it's not obvious why, and the stack trace is not
very clear:
[Switching to Thread -151071648 (LWP 3006)]
0x00d99782 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0  0x00d99782 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0068f23d in poll () from /lib/tls/libc.so.6
#2  0x0018ff13 in g_main_context_acquire () from /usr/lib/libglib-2.0.so.0
#3  0x0019022f in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#4  0x00eba1de in gtk_main () from /usr/lib/libgtk-x11-2.0.so.0
#5  0x00282c16 in init_gtk ()
   from /usr/lib/python2.3/site-packages/gtk-2.0/gtk/_gtk.so
#6  0x00d1fcb4 in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#7  0x00d210ae in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#8  0x00cdce6e in PyFunction_SetClosure () from /usr/lib/libpython2.3.so.1.0
#9  0x00cc9617 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#10 0x00cd0dac in PyMethod_New () from /usr/lib/libpython2.3.so.1.0
#11 0x00cc9617 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#12 0x00d1b2a0 in PyEval_CallObjectWithKeywords ()
   from /usr/lib/libpython2.3.so.1.0
#13 0x00cccaa9 in PyInstance_New () from /usr/lib/libpython2.3.so.1.0
#14 0x00cc9617 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#15 0x00d1eccf in _PyEval_SliceIndex () from /usr/lib/libpython2.3.so.1.0
#16 0x00d210ae in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#17 0x00d21372 in PyEval_EvalCode () from /usr/lib/libpython2.3.so.1.0
#18 0x00d3a8b7 in PyErr_Display () from /usr/lib/libpython2.3.so.1.0
#19 0x00d3b9e2 in PyRun_SimpleFileExFlags () from /usr/lib/libpython2.3.so.1.0
#20 0x00d3ca34 in PyRun_AnyFileExFlags () from /usr/lib/libpython2.3.so.1.0
#21 0x00d4172e in Py_Main () from /usr/lib/libpython2.3.so.1.0
#22 0x080485b2 in main ()
(gdb)

  One simple fix is to just reverse that patch. A better way would
be to find exactly what in the patch makes the whole thing hang,
probably the window manager code of the patch. rhgb runs without a
window manager ... except when firstboot starts since firstboot
itself starts metacity.

Daniel
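
Read innermost-first, frames #1 to #4 of the trace in comment 12 show poll() being called from the GLib main loop under gtk_main(), which suggests the process is sitting idle waiting for events rather than crashing or spinning. The libraries appear to lack debug information, so some of the symbol names may only be approximate. A rough sketch for checking a pasted backtrace for that pattern follows; `parse_frames` and `blocked_in_main_loop` are made-up names, and the regular expression covers only the frame layout seen in this report.

```python
import re
from collections import namedtuple

Frame = namedtuple("Frame", "num func lib")

# Matches frames such as "#1  0x0068f23d in poll () from /lib/tls/libc.so.6".
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?:0x[0-9a-f]+ in )?(?P<func>\S+) \(.*?\)"
    r"(?:\s+from\s+(?P<lib>\S+))?"
)


def parse_frames(text):
    """Parse gdb 'where' output into Frame tuples, rejoining wrapped lines."""
    joined = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            joined.append(line)
        elif joined and line and not line.startswith(("(gdb)", "---")):
            # Continuation of the previous frame (e.g. a wrapped "from ..." part).
            joined[-1] += " " + line
    frames = []
    for line in joined:
        m = FRAME_RE.match(line)
        if m:
            frames.append(Frame(int(m.group("num")), m.group("func"), m.group("lib")))
    return frames


def blocked_in_main_loop(frames, depth=5):
    """True if the innermost frames show poll() under g_main_loop_run()."""
    names = [f.func for f in frames[:depth]]
    return "poll" in names and "g_main_loop_run" in names
```

A true result only says the main thread is waiting in the event loop; it does not say which event it is waiting for, which is the part that still needed tracking down in the xinerama patch.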

Comment 13 Daniel Veillard 2004-09-10 21:24:09 UTC

Okay, we now have a fix for this; it's in
http://people.redhat.com/veillard/testing/SRPMS/rhgb-0.12.5-1.src.rpm
With that and a quick fix to #129532 in firstboot (removing
mouse configuration), I have firstboot back on today's rawhide.
I need to push this into rawhide, maybe over the weekend.

Daniel

Comment 14 Daniel Veillard 2004-09-16 07:22:37 UTC

This has been pushed to Rawhide, so it should be fixed there.

  thanks,

Daniel