Bug 89431
Summary: Unable to restart mozilla after all windows are closed
Product: [Retired] Red Hat Linux
Component: kernel
Version: 9
Hardware: i386
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Reporter: Pablo Endres <pablo_endres>
Assignee: Ingo Molnar <mingo>
QA Contact: Brian Brock <bbrock>
CC: bugzilla, davide_bolcioni, mrsam, twaugh
Doc Type: Bug Fix
Last Closed: 2004-08-18 12:02:51 UTC

Description
Pablo Endres
2003-04-22 20:43:33 UTC
I have the same problem. If launched from Gnome terminal, I obtain:

    [2] Exit 3 mozilla

I also tried variations like:

    LC_ALL=POSIX mozilla

just in case my locale (it_IT) had troubles, to no avail. On the other hand, if I launch:

    /usr/lib/mozilla-1.2.1/mozilla-bin &

the browser starts up, but then Mail and News has serious trouble (I guess it is missing several environment variables). Hope this helps.

I can reproduce this bug too. When this happens, ps shows a mozilla-bin process with high CPU usage racked up (but not consuming any additional CPU cycles). What appears to be happening is that the previous mozilla process did not terminate. New mozilla processes try to contact the existing process to open a new window. The old process does not respond, and the new processes hang. Killing the mozilla-bin process allows new mozilla processes to run.

Hmmm. stracing the stuck mozilla-bin process shows that it's hung in futex():

      PID TTY          TIME CMD
     1951 ?        00:00:56 gnome-terminal
     1953 ?        00:00:00 gconfd-2
     1954 ?        00:00:00 gnome-pty-helpe
     1955 pts/3    00:00:00 bash
    14590 ?        00:00:53 mozilla-bin
    15164 pts/3    00:00:40 emacs
    18914 pts/3    00:00:07 xemacs
    21915 pts/3    00:00:00 ps
    $ strace -p 14590
    futex(0x80ebaf4, FUTEX_WAIT, 0, NULL

I see hangs from time to time as well and this is what people see as a symptom, sadly. On my system the real culprit is the JVM, which sometimes hangs on purpose on shutdown due to a segfault of some kind.

Just in case: on my system (see comment #1) I have upgraded to glibc-2.3.2-27.9, and mozilla does not hang indefinitely. After a long wait "mozilla the script", that is /usr/bin/mozilla, exits with $? = 3. We might be seeing two different bugs at work here: one is the futex() problem. My problem might be different because I usually keep four instances of mozilla on the same X display (processes, not windows) launched under different user accounts, running at the same time. Launching them works; when one is closed, however, that user account cannot start another.
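The manual check described above (look through ps output for a mozilla-bin process that stayed behind) can be sketched in a few lines. This is only an illustration: `find_processes` is a hypothetical helper, and it assumes the default four-column `PID TTY TIME CMD` layout shown in the ps listing above.

```python
def find_processes(ps_output, command):
    """Return the PIDs in `ps` text whose CMD column starts with `command`."""
    pids = []
    for line in ps_output.splitlines():
        # Assumed layout: PID TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 3)
        # Skip the header row and any line that does not start with a PID.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        if fields[3].split()[0] == command:
            pids.append(int(fields[0]))
    return pids


# The ps listing quoted in this report.
SAMPLE = """\
  PID TTY          TIME CMD
 1951 ?        00:00:56 gnome-terminal
 1953 ?        00:00:00 gconfd-2
 1954 ?        00:00:00 gnome-pty-helpe
 1955 pts/3    00:00:00 bash
14590 ?        00:00:53 mozilla-bin
15164 pts/3    00:00:40 emacs
18914 pts/3    00:00:07 xemacs
21915 pts/3    00:00:00 ps
"""

print(find_processes(SAMPLE, "mozilla-bin"))  # [14590]
```

A non-empty result after all browser windows have been closed matches the symptom reported here: the old process is still present, so new launches try to contact it instead of starting fresh.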
FWIW, this seems to happen *much* more often with a system that's current with rawhide.

On my system this happens, too. It's because the java plugin crashes on exit, spits out a ton of crap on the console and waits to be read or something.

I can easily reproduce this bug without loading the java plugin. An strace shows that the mozilla-bin process is stuck waiting on a futex. See comment #3

I have exactly the same problem (mozilla-1.2.1). I have noticed that the problem seems to be associated with another problem: mozilla suddenly not being able to get to sites, probably because something is wrong with DNS. To fix this problem, I close all mozilla windows, but then I can't restart mozilla. The fix is to kill -HUP the remaining mozilla-bin process. I am not certain if these are two separate bugs or connected in some way, but if I am having problems getting to sites, I always have problems restarting. The same is not true in reverse.

There is a known DNS issue, yes.

When I've used Mozilla and closed it, I see in 'ps' that the following ALWAYS remains behind:

    28591 ? D 0:15 /usr/lib/mozilla-1.2.1/mozilla-bin -UILocale en-US

Shortly afterwards the 'D' changes to an 'S'. If I try to start Mozilla again (usually clicking on the toolbar next to the Red Hat), nothing starts. In 'ps' I see the following:

    28591 ? S 0:15 /usr/lib/mozilla-1.2.1/mozilla-bin -UILocale en-US
    29236 ? S 0:00 /bin/sh /usr/bin/mozilla
    29247 ? S 0:00 /bin/sh /usr/bin/mozilla
    29248 ? S 0:00 /usr/lib/mozilla-1.2.1/mozilla-xremote-client ping()

If I click the Mozilla icon again, the last three lines repeat, but disappear after a short while. The only way to be able to restart Mozilla is to KILL the process in the first line. Then when I restart, everything works.

I have the following Mozilla RPMs installed:

    mozilla-nss-1.2.1-26
    mozilla-nspr-1.2.1-26
    mozilla-mail-1.2.1-26
    mozilla-psm-1.2.1-26
    mozilla-1.2.1-26

I am running RedHat 9.0 and have it fully up to date via RHN.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Regards,
Louis van Dyk

I think this is really related to bug #90211

I also think that the process apparently stuck in FUTEX_WAIT might be a red herring. Mozilla has multiple threads, though recent kernels hide all of them but one. Using the -m option for ps will show them all. In my case "ps -mfC mozilla-bin" shows me that there really are multiple threads. The one in FUTEX_WAIT really is waiting; it is another thread that is stuck. Trying to strace the stuck thread happens to make the problem disappear, which is strange because strace is not supposed to change the behaviour of the program. The bug might actually be in the kernel rather than mozilla. After using strace it is sometimes necessary to send a CONT signal to the straced process. I have successfully made the problem go away by using these two commands:

    strace $(ps -mC mozilla-bin -o pid=| sed -e 's|^| -p |')
    killall -CONT mozilla-bin

So far that hasn't failed to make the problem go away. In addition I have seen a different symptom that appeared at the same time as the DNS problem and was also "solved" by the above two commands, which makes me believe it is another symptom of the same problem: sometimes Mozilla ignores requests to open new windows. Attempting to open one more window will cause mozilla to open both requested windows at the same time. It continues like that forever, so I can only open two windows at a time, until I execute the commands from before.

Created attachment 96662 [details]
Commented output from Alt+SysRq+T
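For readers unfamiliar with the shell substitution in the workaround quoted earlier, the sketch below shows what it expands to: one `-p <id>` pair per thread, all passed to a single strace invocation. The function names are hypothetical, the code only builds the argument list and does not run strace, and skipping non-numeric tokens is an assumption made because some ps versions print a placeholder instead of an ID on per-thread rows.

```python
def parse_pid_column(ps_output):
    """Extract numeric IDs from the output of `ps ... -o pid=`."""
    # Assumption: anything that is not a plain number is a placeholder row.
    return [int(tok) for tok in ps_output.split() if tok.isdigit()]


def strace_attach_command(thread_ids):
    """Build an argv that attaches strace to every given thread ID."""
    argv = ["strace"]
    for tid in thread_ids:
        argv += ["-p", str(tid)]
    return argv


# Hypothetical ps output for a mozilla-bin with two visible threads.
ids = parse_pid_column("14590\n14601\n")
print(strace_attach_command(ids))
# ['strace', '-p', '14590', '-p', '14601']
```

The second command of the workaround (`killall -CONT mozilla-bin`) then resumes any thread that strace left stopped.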
I still don't know exactly how to reproduce the problem, but if you keep using mozilla for days, eventually it will show up. I have sometimes used the previously mentioned strace command many times to fix the same running instance of mozilla.

One of the times the problem appeared I did some further investigation. Using "ps -mC mozilla-bin -o wchan" I could see that all processes were sleeping in schedule_timeout. But that probably doesn't say much, as that is also the case when there are no problems. I think the schedule_timeout in the wchan output might be a bug: it really is on the call stack, but I think wchan should show the caller of schedule_timeout.

I tried using Alt+SysRq+T while there was a problem to see the kernel stack of each mozilla process. The output is attached. I stripped everything but mozilla threads and added a few comments about what happened when I tried to strace each individual process. The information from strace about the last thread might not apply, as the problem was gone before I got around to stracing the last process. I suspect the problem is really in the futex implementation in the kernel, maybe a race condition causing a thread to go to sleep when it really shouldn't.

The problem still exists with the most recent kernel version 2.4.20-27.9. In an attempt to reproduce the problem I ran a program that would attempt to use all physical memory and thereby force mozilla to swap out most of its pages. In one of five attempts I managed to reproduce the problem. Afterwards I noticed that the xmms that had been running during the experiment had frozen with symptoms similar to the problem with mozilla: a futex-related problem that disappears when one of the threads is traced.
Here is the output from strace on an xmms thread which brought xmms back to life:

    futex(0x8140608, FUTEX_WAKE, 1, ptrace: umoven: Input/output error {...}) = 0
    _llseek(8, 21164032, [21164032], SEEK_SET) = 0
    read(8, "F\373\205\364|\372I\363\363\371\203\362<\372B\362\222\372"..., 4096) = 4096
    nanosleep({0, 10000000}, NULL) = 0
    read(8, "\247\0051\366\30\5}\365q\4>\365\326\3\207\364\236\3\263"..., 4096) = 4096
    read(8, "0\366\4\355\251\365\354\354\272\364Y\354\3\364\372\353"..., 4096) = 4096
    close(8) = 0
    munmap(0x4037e000, 4096) = 0
    _exit(0) = ?

Seeing similar symptoms with two different applications supports my suspicion that this is not a mozilla problem but rather a race/deadlock condition in the futex code. Any suggestions about how to find out what is going wrong in the futex code will be appreciated.

Created attachment 96705 [details]
Testcase
Steps to Reproduce:
1. Start mozilla
2. Open 3-5 windows with different websites
3. Run attached program
4. Try to access a site that has not been visited recently
By running four instances of a program that each try to allocate more virtual
memory than is physically available, I can reliably reproduce the problem. Using
strace on one particular mozilla-bin thread will temporarily solve the problem.
Running the attached program again makes the problem reappear. If strace is left
running, the attached program cannot reproduce the problem. Attempting to close
all mozilla windows will leave a few mozilla threads behind, preventing a new
instance of mozilla from being started.
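The attached test case itself is not reproduced in this report. As a rough idea of what such a memory-pressure program does, here is a hypothetical stand-in (the name `touch_memory` and its interface are assumptions, not the attachment's code): it allocates a buffer and writes one byte into every page, so that the kernel has to back each page with real memory or swap.

```python
import mmap


def touch_memory(n_bytes, page_size=mmap.PAGESIZE):
    """Allocate n_bytes and write to every page; return (buffer, pages touched)."""
    buf = bytearray(n_bytes)
    touched = 0
    # One write per page is enough to make that page resident at least once.
    for offset in range(0, n_bytes, page_size):
        buf[offset] = 1
        touched += 1
    # The buffer is returned so the caller decides how long it stays allocated.
    return buf, touched


# Harmless demonstration with three pages.
buf, pages = touch_memory(3 * mmap.PAGESIZE)
print(pages)  # 3
```

To approximate the reproduction steps above, the size would have to exceed physical RAM and several copies would have to run at once, which will make the machine swap heavily. Only try that on a system you can afford to stall.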
It turned out that running the previously attached test program affects not only mozilla and xmms; named is affected as well. This is not a mozilla problem. It is a bug in the kernel or in a library used by these programs. The kernel stack traces in the case of named look similar to those from mozilla, except for one, which got me wondering: how can schedule_timeout pagefault?

    named S C0130CC0 4816 870 861 869 (NOTLB)
    Call Trace:
    [<c0130cc0>] handle_mm_fault [kernel] 0x120 (0xd2837ed8))
    [<c0125e7d>] schedule_timeout [kernel] 0xad (0xd2837eec))
    [<c0213fd6>] tcp_poll [kernel] 0x36 (0xd2837ef0))
    [<c01f040c>] sock_poll [kernel] 0x2c (0xd2837f10))
    [<c015807d>] do_select [kernel] 0x10d (0xd2837f24))
    [<c015851e>] sys_select [kernel] 0x34e (0xd2837f60))
    [<c010953f>] system_call [kernel] 0x33 (0xd2837fc0))

-> kernel
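When comparing many Alt+SysRq+T traces by hand, it can help to reduce each one to its list of frames. The sketch below parses the named trace quoted above. The regular expression and the `parse_call_trace` helper are assumptions based only on the line format visible in this report, not on any kernel-provided tool.

```python
import re

# Matches lines such as: [<c0130cc0>] handle_mm_fault [kernel] 0x120 (0x...)
FRAME = re.compile(r"\[<([0-9a-f]+)>\]\s+(\S+)\s+\[([^\]]+)\]\s+(0x[0-9a-f]+)")


def parse_call_trace(text):
    """Return (address, symbol, module, offset) tuples in the order listed."""
    return [
        (int(addr, 16), symbol, module, int(offset, 16))
        for addr, symbol, module, offset in FRAME.findall(text)
    ]


TRACE = """\
[<c0130cc0>] handle_mm_fault [kernel] 0x120 (0xd2837ed8))
[<c0125e7d>] schedule_timeout [kernel] 0xad (0xd2837eec))
[<c0213fd6>] tcp_poll [kernel] 0x36 (0xd2837ef0))
[<c01f040c>] sock_poll [kernel] 0x2c (0xd2837f10))
[<c015807d>] do_select [kernel] 0x10d (0xd2837f24))
[<c015851e>] sys_select [kernel] 0x34e (0xd2837f60))
[<c010953f>] system_call [kernel] 0x33 (0xd2837fc0))
"""

for addr, symbol, module, offset in parse_call_trace(TRACE):
    print(f"{symbol}+{offset:#x} [{module}]")
```

With the frames in this form, the traces from mozilla, xmms and named can be compared symbol by symbol to spot where they differ.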