Bug 89431

Summary: Unable to restart mozilla after all windows are closed
Product: [Retired] Red Hat Linux
Reporter: Pablo Endres <pablo_endres>
Component: kernel
Assignee: Ingo Molnar <mingo>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 9
CC: bugzilla, davide_bolcioni, mrsam, twaugh
Hardware: i386
OS: Linux
Last Closed: 2004-08-18 12:02:51 UTC
Attachments:
  Commented output from Alt+SysRq+T (flags: none)
  Testcase (flags: none)

Description Pablo Endres 2003-04-22 20:43:33 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
After browsing, I close all mozilla windows.  If I try to open a browser again
later on, it doesn't start.

There is no error message, and looking at what happens with strace I can't
really find anything interesting.

Version-Release number of selected component (if applicable):
mozilla-1.2.1-36

How reproducible:
Always

Steps to Reproduce:
1. Start the mozilla browser
2. Close all windows or simply quit.
3. Try to open mozilla again later on
    

Actual Results:  Nothing: no error messages, no browser.

Expected Results:  Browser should have started

Additional info:

Comment 1 Davide Bolcioni 2003-04-29 21:56:50 UTC
I have the same problem. If launched from Gnome terminal, I obtain:

[2] Exit 3 mozilla

I also tried variations like:

  LC_ALL=POSIX mozilla

just in case my locale (it_IT) had troubles, to no avail. On the other hand, if
I launch:

/usr/lib/mozilla-1.2.1/mozilla-bin &

the browser starts up, but then Mail and News has serious trouble (I guess it
is missing several environment variables).

Hope this helps.

Comment 2 Sam Varshavchik 2003-05-05 14:55:11 UTC
I can reproduce this bug too.  When this happens, ps shows a mozilla-bin process
with a lot of accumulated CPU time (but it is no longer consuming any CPU cycles).

What appears to be happening is that the previous mozilla process did not
terminate.  New mozilla processes try to contact the existing process, to open a
new window.  The old process does not respond, and new processes hang.

Killing the mozilla-bin process allows new mozilla processes to run.
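
The check for a leftover process described above can be sketched in Python. This is only an illustrative helper, not part of Mozilla or its launcher script. It assumes a Linux system that exposes `/proc/<pid>/comm`, which is newer than the kernel discussed in this report, and the name `find_pids_by_comm` and the `proc_root` parameter are hypothetical.

```python
import os


def find_pids_by_comm(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals `name`.

    Assumption: `proc_root` is a procfs-style tree. The parameter exists
    only so the helper can be pointed at a test directory.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited, or comm is missing or unreadable
        if comm == name:
            pids.append(int(entry))
    return sorted(pids)
```

Calling `find_pids_by_comm("mozilla-bin")` after closing every window would list any process that stayed behind. A non-empty result is the situation described in this comment.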



Comment 3 Sam Varshavchik 2003-05-05 17:35:54 UTC
Hmmm.

stracing the stuck mozilla-bin process shows that it's hung in futex()

  PID TTY          TIME CMD
 1951 ?        00:00:56 gnome-terminal
 1953 ?        00:00:00 gconfd-2
 1954 ?        00:00:00 gnome-pty-helpe
 1955 pts/3    00:00:00 bash
14590 ?        00:00:53 mozilla-bin
15164 pts/3    00:00:40 emacs
18914 pts/3    00:00:07 xemacs
21915 pts/3    00:00:00 ps

$ strace -p 14590
futex(0x80ebaf4, FUTEX_WAIT, 0, NULL



Comment 4 Christopher Blizzard 2003-05-05 17:39:34 UTC
I see hangs from time to time as well and this is what people see as a symptom,
sadly.  On my system the real culprit is the JVM which sometimes hangs on
purpose on shutdown due to a segfault of some kind.

Comment 5 Davide Bolcioni 2003-05-05 19:57:38 UTC
Just in case: on my system (see comment #1) I have upgraded to glibc-2.3.2-27.9,
and mozilla does not hang indefinitely - after a long wait "mozilla the script",
that is /usr/bin/mozilla, exits with $? = 3. We might be seeing two different
bugs at work here: one is the futex() problem. My problem might be different
because I usually keep four instances of mozilla on the same X display
(processes, not windows) launched under different user accounts, running at the
same time. Launching them works; when one is closed, however, that user account
cannot start another.


Comment 6 Tim Waugh 2003-06-04 08:51:59 UTC
FWIW, this seems to happen *much* more often with a system that's current with
rawhide.

Comment 7 Christopher Blizzard 2003-06-04 15:07:26 UTC
On my system this happens, too.  It's because the java plugin crashes on exit,
spits out a ton of crap on the console and waits to be read or something.

Comment 8 Sam Varshavchik 2003-06-04 16:26:33 UTC
I can easily reproduce this bug without loading the java plugin.

An strace shows that the mozilla-bin process is stuck waiting on a futex.

See comment #3



Comment 9 David Kaplan 2003-08-12 19:28:26 UTC
I have exactly the same problem (mozilla-1.2.1).  I have noticed that the
problem seems to be associated with another problem - mozilla suddenly not being
able to get to sites, probably because something is wrong with DNS.  To fix this
problem, I close all mozilla windows, but then I can't restart mozilla.  The fix
is to kill -HUP the remaining mozilla-bin process.

I am not certain if these are two separate bugs or connected in some way, but if
I am having problems getting to sites, I always have problems restarting.  The
same is not true in reverse.

Comment 10 Christopher Blizzard 2003-09-03 15:43:29 UTC
There is a known DNS issue, yes.

Comment 11 Louis van Dyk 2003-10-03 11:41:55 UTC
When I've used Mozilla and close it, I see in 'ps' that the following ALWAYS
remains behind:
28591 ?        D      0:15 /usr/lib/mozilla-1.2.1/mozilla-bin -UILocale en-US
Shortly afterwards the 'D' changes to an 'S'.

If I try to start Mozilla again (usually clicking on the toolbar next to the Red
Hat), nothing starts. In 'ps' I see the following:
28591 ?        S      0:15 /usr/lib/mozilla-1.2.1/mozilla-bin -UILocale en-US
29236 ?        S      0:00 /bin/sh /usr/bin/mozilla
29247 ?        S      0:00 /bin/sh /usr/bin/mozilla
29248 ?        S      0:00 /usr/lib/mozilla-1.2.1/mozilla-xremote-client ping()

If I click the Mozilla icon again, the last three lines repeat, but disappear
after a short while.

The only way to be able to restart Mozilla is to KILL the process in the first
line.  Then when I restart everything works.
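
The 'D' and 'S' letters in the ps output above are process states: 'D' is uninterruptible sleep and 'S' is interruptible sleep. As a rough sketch, the same letter can be read from the third field of a `/proc/<pid>/stat` line. The helper name `parse_stat_state` is made up for illustration, and the sample line in the usage note is hypothetical.

```python
# Human-readable names for the most common one-letter states.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}


def parse_stat_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")" characters, so split after the last ")".
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]
```

For example, `parse_stat_state("28591 (mozilla-bin) S 1 28591")` yields `"S"`, which `STATE_NAMES` maps to "interruptible sleep".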

I have the following Mozilla RPMs installed:
mozilla-nss-1.2.1-26
mozilla-nspr-1.2.1-26
mozilla-mail-1.2.1-26
mozilla-psm-1.2.1-26
mozilla-1.2.1-26

I am running RedHat 9.0 and have it fully up to date via RHN.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Regards,
Louis van Dyk


Comment 12 Kasper Dupont 2003-10-08 05:34:54 UTC
I think this is really related to bug #90211

I also think that the process apparently stuck in FUTEX_WAIT might be a red
herring. Mozilla has multiple threads, though recent kernels hide all of them
but one. Using the -m option for ps will show them all. In my case "ps -mfC
mozilla-bin" shows me that there really are multiple threads. The one
in FUTEX_WAIT really is waiting; it is another thread that is stuck. Trying to
strace the stuck thread happens to make the problem disappear, which is strange
because strace is not supposed to change the behaviour of the program. The bug
might actually be in the kernel rather than mozilla. After using strace it is
sometimes necessary to send a CONT signal to the straced process.

I have successfully made the problem go away by using these two commands:
  strace $(ps -mC mozilla-bin -o pid=| sed -e 's|^| -p |')
  killall -CONT mozilla-bin
So far that hasn't failed to make the problem go away.
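
The first command works by prefixing every thread ID with "-p", since strace accepts that option more than once. A minimal Python sketch of the same idea follows. It assumes a system where threads are listed under `/proc/<pid>/task` (not the case on every kernel of that era), and the function names and the `proc_root` parameter are hypothetical.

```python
import os


def list_thread_ids(pid, proc_root="/proc"):
    """Return the sorted thread IDs found under <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(t) for t in os.listdir(task_dir) if t.isdigit())


def build_strace_argv(tids):
    """Build an argv that attaches strace to every given thread ID.

    Mirrors the sed pipeline above, which turns each ID into "-p ID".
    """
    argv = ["strace"]
    for tid in tids:
        argv += ["-p", str(tid)]
    return argv
```

The resulting list could be handed to `subprocess.run`, followed by sending SIGCONT to the process as in the second command.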

In addition I have seen a different symptom that appeared at the same time as
the DNS problem and was also "solved" by the above two commands, which makes me
believe that it is another symptom of the same problem:

Sometimes Mozilla ignores requests to open new windows. Attempting to open one
more window will cause mozilla to open both requested windows at the same time.
It continues like that forever, so I can only open two windows at a time, until
I execute the commands from before.


Comment 13 Kasper Dupont 2003-12-21 22:50:02 UTC
Created attachment 96662 [details]
Commented output from Alt+SysRq+T

Comment 14 Kasper Dupont 2003-12-21 22:51:08 UTC
I still don't know exactly how to reproduce the problem, but if you
keep using mozilla for days, eventually it will show up. I have
sometimes used the previously mentioned strace command many times to
fix the same running instance of mozilla. One of the times the problem
appeared I did some further investigation. Using "ps -mC mozilla-bin
-o wchan" I could see all processes were sleeping in
schedule_timeout. But that probably doesn't say much, as that is also
the case when there are no problems. I think the schedule_timeout in
the wchan output might be a bug: it is really on the call stack, but I
think it should show the caller of schedule_timeout. I tried using
Alt+SysRq+T while there was a problem to see the kernel stack of each
mozilla process. The output is attached. I stripped everything but
mozilla threads and added a few comments about what happened when I
tried to strace each individual process. The information from strace
about the last thread might not apply as the problem was gone before I
got around to strace the last process. I suspect the problem is really
in the futex implementation in the kernel, maybe a race condition
causing a thread to go to sleep when it really shouldn't.
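
The per-thread wait channel inspected with "ps -o wchan" can also be read directly from procfs on systems that provide `/proc/<pid>/task/<tid>/wchan`. The sketch below assumes such a system; `wchan_by_thread` and its `proc_root` parameter are hypothetical names used only for illustration.

```python
import os


def wchan_by_thread(pid, proc_root="/proc"):
    """Map each thread ID of `pid` to the contents of its wchan file."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    result = {}
    for tid in os.listdir(task_dir):
        if not tid.isdigit():
            continue
        try:
            with open(os.path.join(task_dir, tid, "wchan")) as f:
                result[int(tid)] = f.read().strip()
        except OSError:
            continue  # thread exited, or wchan is missing or unreadable
    return result
```

If every thread reports the same wait channel, as described above, the output alone does not identify the stuck thread, and the kernel stack from Alt+SysRq+T is the more useful source.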

Comment 15 Kasper Dupont 2003-12-27 16:29:08 UTC
The problem still exists with the most recent kernel version
2.4.20-27.9. In an attempt to reproduce the problem I ran a program
that would attempt to use all physical memory and thereby force
mozilla to swap out most of its pages. In one of five attempts I
managed to reproduce the problem. Afterwards I noticed the xmms that
had been running during the experiment had frozen with symptoms
similar to the problem with mozilla: a futex-related problem that
disappears when one of the threads is traced. Here is the output from
strace on an xmms thread which got xmms back to life:

futex(0x8140608, FUTEX_WAKE, 1, ptrace: umoven: Input/output error
{...})  = 0
_llseek(8, 21164032, [21164032], SEEK_SET) = 0
read(8, "F\373\205\364|\372I\363\363\371\203\362<\372B\362\222\372"..., 4096) = 4096
nanosleep({0, 10000000}, NULL)          = 0
read(8, "\247\0051\366\30\5}\365q\4>\365\326\3\207\364\236\3\263"..., 4096) = 4096
read(8, "0\366\4\355\251\365\354\354\272\364Y\354\3\364\372\353"..., 4096) = 4096
close(8)                                = 0
munmap(0x4037e000, 4096)                = 0
_exit(0)                                = ?

Seeing similar symptoms with two different applications supports my
suspicion that this is not a mozilla problem but rather a race/deadlock
condition in the futex code. Any suggestions about how to find out
what is going wrong in the futex code will be appreciated.


Comment 16 Kasper Dupont 2003-12-27 18:48:57 UTC
Created attachment 96705 [details]
Testcase

Steps to Reproduce:
1. Start mozilla
2. Open 3-5 windows with different websites
3. Run attached program
4. Try to access a site that has not been visited recently

By running four instances of a program that each try to allocate more virtual
memory than is physically available, I can reliably reproduce the problem. Using
strace on one particular mozilla-bin thread will temporarily solve the problem.
By running the attached program again, the problem will appear again. If strace
is left running, the attached program cannot reproduce the problem. Attempting
to close all mozilla windows will leave a few mozilla threads behind, preventing a
new instance of mozilla from being started.
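
The attached testcase is not reproduced in this thread. As a loose illustration only, a program that creates memory pressure might allocate a large buffer and write to every page so the kernel has to back it with real memory. The Python sketch below is an assumption about the general technique, not the attachment itself; the function name and the 4096-byte page size are hypothetical choices, and running it with a size close to physical memory can make a machine unresponsive.

```python
def allocate_and_touch(n_bytes, page_size=4096):
    """Allocate n_bytes and write one byte into every page.

    Returns the buffer so the caller can keep it alive; the memory is
    released as soon as the last reference to it is dropped.
    Assumption: page_size matches the system page size.
    """
    buf = bytearray(n_bytes)
    for offset in range(0, n_bytes, page_size):
        buf[offset] = 1  # a write makes the page resident
    return buf
```

Holding several such buffers in separate processes, each sized so that the total exceeds physical memory, would force other programs to be swapped out in the way the steps above describe.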

Comment 17 Kasper Dupont 2003-12-27 21:39:01 UTC
It turned out that running the previously attached test program
affects not only mozilla and xmms, but also named. This is not
a mozilla problem. It is a bug in the kernel or a library used by
these programs. The kernel stack traces in the case of named look
similar to those from mozilla, except one, which got me wondering: how
can schedule_timeout pagefault?

named         S C0130CC0  4816   870    861                 869 (NOTLB)
Call Trace:   [<c0130cc0>] handle_mm_fault [kernel] 0x120 (0xd2837ed8))
[<c0125e7d>] schedule_timeout [kernel] 0xad (0xd2837eec))
[<c0213fd6>] tcp_poll [kernel] 0x36 (0xd2837ef0))
[<c01f040c>] sock_poll [kernel] 0x2c (0xd2837f10))
[<c015807d>] do_select [kernel] 0x10d (0xd2837f24))
[<c015851e>] sys_select [kernel] 0x34e (0xd2837f60))
[<c010953f>] system_call [kernel] 0x33 (0xd2837fc0))
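
When comparing several traces like the one above, it can help to reduce each to its list of addresses and symbols. The following Python sketch does that for the "[<address>] symbol" layout quoted in this comment. The regular expression is an assumption based only on these lines, and `parse_call_trace` is a hypothetical name.

```python
import re

# Matches frames such as "[<c0125e7d>] schedule_timeout".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)")


def parse_call_trace(text):
    """Return (address, symbol) pairs in the order they appear."""
    return [(int(addr, 16), sym) for addr, sym in FRAME_RE.findall(text)]
```

Applied to the named trace, the first two entries would be `handle_mm_fault` followed by `schedule_timeout`, which is the ordering that prompted the question above.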


Comment 18 Christopher Blizzard 2004-02-12 18:34:12 UTC
-> kernel