Bug 431162
Description
Andrew Farris
2008-02-01 04:10:37 UTC
If it's a new issue, then xulrunner/firefox are not likely to be the true culprit, because I've stuck to epiphany-2.21.5-0.1.svn7856.fc9 and xulrunner-1.9-0.beta2.11.nightly20080115.fc9.x86_64, which are already 2 weeks old. Nevertheless, I do encounter the same problem when trying to choose a component against which I want to file a new bug in Red Hat bugzilla.

I have only noticed this in the last few days; however, I do the vast majority of my access to bugzilla from my macbook, which is not running rawhide, so I can't say how new it is. But I have verified I can make epiphany, firefox, and galeon all crash on the same query.cgi page with very similar backtraces (though I lack most of the debugging symbols in them as you see above). I haven't had time to dig further, but also given the lack of interest upstream in unofficial build debugging... I'm just not making it a priority yet. I suspect it could have to do with the full rewrite of javascript handling they've been planning for ff3, though I don't know if that has been done yet.

From what I can tell this only happens when the component list attempts to load, but the rest of the page has been fetched correctly beforehand.

Thank you for taking the time to report this bug. Unfortunately,
the stack traces in the original bug report are not very useful in determining
the cause of the crash.
Please install firefox-debuginfo and xulrunner-debuginfo; in order to do this
you have to enable -debuginfo repository.
yum install --enablerepo=\*debuginfo {firefox,xulrunner}-debuginfo
Then run firefox with the -g parameter. That will start firefox running inside
the gdb debugger. Then use the run command and do whatever you did to make
firefox crash. When it happens, go back to gdb and run
(gdb) thread apply all backtrace
This usually produces many screens of text. Copy all of them into a text
editor and attach the file to the bug as an uncompressed attachment.
We will review this issue again once you've had a chance to attach this information.
Thanks in advance.
Created attachment 293951 [details]
gdb 'thread apply all bt' from hung firefox-3.0-0.beta2.15.nightly20080130.fc9.i386
Unfortunately there's a lot of unresolved references inside libxul.so in that backtrace. I've tried installing scads of debuginfo packages but it's not real helpful so far. I'll see what I can come up with late tonight.

Created attachment 293985 [details]
firefox backtrace unresponsive script
Ok here is something new. The backtrace attached has 3 backtraces with a brief
continued period in between them.
I was viewing this bug, and clicked 'update components' by the component select
dropdown. Firefox froze, as the javascript was executed. I stopped it in gdb
for bt, continued, bt'd, continued.. and waited. Firefox eventually displayed
the 'stop unresponsive script' dialog. The third bt is immediately after the
script stopped.
Firefox is still running as I post this after those backtraces, so the script
being unresponsive did not crash it fully this time.
Like Will said there are a lot of missing symbols in libxul. I've got glibc, nspr, firefox, and xulrunner debuginfo installed (those seemed to be involved in the earlier backtraces).

I just tried the query.cgi page and it also now shows the unresponsive script dialog, and the components list is populated... but it took 3 minutes to display the dialog, and meanwhile firefox is completely frozen.

OK, two things:
a) I really cannot reproduce this with firefox-3.0-0.beta2.15.nightly20080130.fc9.x86_64 and xulrunner-1.9-0.beta2.15.nightly20080130.fc9.x86_64 -- all these pages stop downloading stuff (on a pretty busy line) in around ten to fifteen seconds.
b) To be sure you have all debugging symbols, install yum-utils and then run (as root)
debuginfo-install firefox
The result is quite a large install, but it should really contain everything you can need.

debuginfo-install is what I used to get debuginfo for my backtrace. It should be obvious that it doesn't fetch all the needed debug files in this case.

Created attachment 294191 [details]
backtraces of firefox after page load (normal) and again after scripting crash (processor spinning)
Attached is a log of firefox -g after using debuginfo-install. Unfortunately,
as the gdb output shows there is a lot of missing info still. I did run (as gdb
suggests) one of the yum install commands in the top of the file, and it did
install a debuginfo package I did not have yet... I have only done one of those
so far.
What I did in the above attachment is:
1- start firefox -g, then load (by bookmark) my bugzilla front page
2- click on this bug, load the page fully
3- open wireshark to gather tcpdump (will attach next), start capturing packets
4- reload this page via ctrl-r
5- interrupt firefox after page finishes loading and backtrace (the first t a a
bt in the attachment)
6- wait 20 seconds or so
7- click 'update components' on the bug page
8- firefox locks up, I wait about 5-7 seconds and notice all traffic has
stopped moving (firefox is spinning the cpu 100%)
9- interrupt firefox and t a a bt
Created attachment 294192 [details]
wireshark output tcpdump showing traffic from bugzilla
Here is all traffic during load of this page (first 3 seconds, 250 packets),
and then when the update components script is run (packets 251-299 starting at
30s).
Firefox appears to complete the transfer of the component list, then no traffic
occurs for more than 5 minutes before the unresponsive script dialog appears.
At this time there is NO additional traffic, but after closing the dialog the
component list is populated.
Created attachment 294193 [details]
An error in gdb that occurred while debugging firefox, t a a bt produced only this first thread's info
This is probably not of much use, but here to look at anyway. There are only
two references found in this thread; js_Invoke () and JS_CallFunctionValue ()
Meant to remove needinfo and return this to assigned, not modified.

I can still reproduce this on every bugzilla query load with:
firefox-3.0-0.beta2.18.nightly20080210.fc9.i386
xulrunner-1.9-0.beta2.18.nightly20080210.fc9.i386
kernel-2.6.24.1-28.fc9.i686

FWIW, the query page does *eventually* come up, but it takes a while. On my machine, after about 3 minutes I get the "Stop unresponsive script" dialog, and after selecting "Continue" and waiting another ~30sec the page is usable.

(In reply to comment #13)
> Unfortunately, as the gdb output shows there is a lot of missing info still.
I am trying to investigate this problem, and for that we would need the output of this command attached:
rpm -qa --qf '%{name}-%{version}-%{release}.%{arch}\n'
Thank you.

Created attachment 295021 [details]
Popup from firefox....
After allowing firefox to stew for quite a while on the bugzilla query page, I got the attached popup warning:
A script on this page may be busy, or it may have stopped responding. You can stop the script now, or you can continue to see if the script will complete.
Script: https://bugzilla.redhat.com/js/product_query.js:53
....
Clicking Stop didn't appear to help much.

Created attachment 295039 [details]
rpm-qa-sorted-i686.txt
rpm -qa --qf '%{name}-%{version}-%{release}.%{arch}\n' | sort is attached for
two different machines (i686 on physical machine, and x86_64 on vmware guest)
both experiencing this issue. Both have had debuginfo-install firefox
xulrunner galeon epiphany run on them. All three browsers on both machines
show similar behavior on javascript content loads.
Created attachment 295040 [details]
rpm-qa-sorted-x86_64.txt
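One thing a list in this name-version-release.arch format makes easy to check is whether each -debuginfo package matches its binary package. The following is only a rough sketch of that check: `parsePackage` and `findDebuginfoMismatches` are hypothetical helper names, and it assumes the debuginfo package is named after the binary package (in practice it follows the source package name) and that only one version of each package is installed.

```javascript
// Split "name-version-release.arch" from the right, since the package
// name itself may contain hyphens but version and release do not.
function parsePackage(line) {
  const dot = line.lastIndexOf('.');
  const arch = line.slice(dot + 1);
  const rest = line.slice(0, dot);
  const relDash = rest.lastIndexOf('-');
  const verDash = rest.lastIndexOf('-', relDash - 1);
  return {
    name: rest.slice(0, verDash),
    version: rest.slice(verDash + 1, relDash),
    release: rest.slice(relDash + 1),
    arch: arch,
  };
}

// Report every foo-debuginfo package whose version or release differs
// from the installed foo package of the same architecture.
function findDebuginfoMismatches(lines) {
  const suffix = '-debuginfo';
  const pkgs = lines.map(parsePackage);
  const installed = new Map();
  for (const p of pkgs) {
    installed.set(p.name + '.' + p.arch, p);
  }
  const mismatches = [];
  for (const p of pkgs) {
    if (!p.name.endsWith(suffix)) continue;
    const baseName = p.name.slice(0, -suffix.length);
    const base = installed.get(baseName + '.' + p.arch);
    if (base && (base.version !== p.version || base.release !== p.release)) {
      mismatches.push({ binary: base, debuginfo: p });
    }
  }
  return mismatches;
}
```

Feeding it the sorted rpm -qa output split into lines would list any pairs that need a debuginfo update before the next backtrace is taken.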
By the way it is interesting that gmail and many other ajax and client side only javascript pages are behaving just fine. The demo sites at http://echo.nextapp.com/site/ for instance all work and they are serverside applications.

Created attachment 295053 [details]
Error console produced on bugzilla query page
Opening an error console while browsing to the query page produces the attachment.
Apparently, something is not agreeable about the JavaScript function
"updateSelect".
The error console is complaining about "sel" being null, the concurrent
"timeout" popup complains about the for loop about 13 lines later.
Here is the function. Line 39 is " sel.disabled = true"
function updateSelect(func, field, add_all) {
    var products = ''
    if ( field != 'product' ) {
        products = findProducts()
    }
    saved[field] = saved[field].length == 0 ? get_selection(field) :
        saved[field]
    var sel = document.getElementById(field);
    sel.disabled = true
    var callback = {
        success:function(o) {
            var result = eval(o.responseText);
            if (typeof(sel.options) != 'undefined') {
                sel.options.length = 0;
                var i = 0
                var j = 0
                if (add_all) {
                    sel.options[i] = new Option('All','')
                    i = 1
                }
                for (; j < result.length; i++) {
                    sel.options[i] = new Option(result[j],result[j])
                    j++
                }
            }
            sel.disabled = false
            restore_selection(field, saved[field])
            saved[field] = new Array()
        }
    }
    var url = urlbase + "ajax.cgi?ajax_func=" + encodeURIComponent(func) +
        "&ctype=json" + products
    var cObj = YAHOO.util.Connect.asyncRequest('GET', url, callback, null)
}
Any of this helpful?
That's interesting Tom, I do see that error in console about sel being null when
loading query.cgi for firefox 2.0.0.12 on OSX, and on rawhide with ff3. The
updateSelectAll() javascript function (in the onload="" for the page body) must
be calling that incorrectly.
However, it appears that the page is calling updateSelect correctly for the
'Update Components' link on the bug page (top of this page) anyway. That script
also causes the hang on rawhide, but I see no javascript console error on this
page in OSX or rawhide. There is virtually no delay in updating the components
on FF2 in OSX but there is the 3min hang on rawhide.
<td>
<select id="component" name="component">
<option value="xulrunner" selected="selected">xulrunner</option>
</select>
<span id="change_component">
<a title="Click here to select a different component"
href="javascript:updateSelect('getComponents','component');
emptyDiv('change_component');">
Update Components</a>
</span>
</td>
Hmm, updateSelectAll looks ok too, at least from my meager javascript
background. The component select has the right id, the function is called with
that id, the getElementById must just be returning null for some reason.
In the generated source of query.cgi:
<div id="selcomponent">
<select name="component" id="component" multiple="multiple" size="7"></select>
</div>
The script error where sel is null:
function updateSelectAll(add_all) {
updateSelect('getComponents', 'component', add_all)
updateSelect('getVersions', 'version', add_all)
updateSelect('getMilestones', 'target_milestone', add_all)
}
function updateSelect(func, field, add_all) {
var products = ''
if ( field != 'product' ) {
products = findProducts()
}
saved[field] = saved[field].length == 0 ? get_selection(field) : saved[field]
var sel = document.getElementById(field);
sel.disabled = true
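For what it's worth, the "sel is null" console error (as distinct from the hang) could be avoided with a guard before the element is touched. This is only a sketch of the idea, not the actual Bugzilla code: `lockSelect` is a hypothetical helper name, and `fakeDocument` is a stand-in for the browser's document so the snippet can run outside a page.

```javascript
// Hypothetical helper: look up the <select> and disable it, or return
// null when the page has no element with that id so the caller can
// skip the AJAX request instead of throwing on sel.disabled.
function lockSelect(doc, field) {
  var sel = doc.getElementById(field);
  if (!sel) {
    return null;
  }
  sel.disabled = true;
  return sel;
}

// Stand-in for the real document (assumption): only 'component' exists.
var fakeDocument = {
  elements: { component: { disabled: false } },
  getElementById: function (id) {
    return Object.prototype.hasOwnProperty.call(this.elements, id)
      ? this.elements[id]
      : null;
  }
};

var present = lockSelect(fakeDocument, 'component');        // disabled select
var missing = lockSelect(fakeDocument, 'target_milestone'); // null, no error
```

In updateSelect this would mean returning early when the lookup yields null, which only silences the console error; it says nothing about why the lookup fails on query.cgi.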
(In reply to comment #21)
> Created an attachment (id=295039) [edit]
> rpm-qa-sorted-i686.txt
The debuginfo rpm versions do not match the binary rpm versions, such as:
firefox-3.0-0.beta2.18.nightly20080210.fc9.i386
firefox-debuginfo-3.0-0.beta2.15.nightly20080130.fc9.i386
Possible workarounds:
yum --enablerepo='*-debuginfo' update
Follow all the lines printed by GDB:
yum --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/../...
It is in fact Bug 151598 but there would be YUM failures if it would get fixed.

Just to make it clear that this bug is not about a crash, I will make all the same bugs duplicates of this one.

Jan, I think it would be better to keep these problems with debuginfo in bug 432806 -- this bug was reproduced with the upstream binary version of firefox, so it might be CLOSED/UPSTREAM pretty soon.

*** Bug 426685 has been marked as a duplicate of this bug. ***
*** Bug 432291 has been marked as a duplicate of this bug. ***

The debuginfo may have been in sync when I obtained those backtraces; I had deliberately not updated it since, and the rpm lists were an afterthought. I'll try to get them in sync and see if the bt of thread 1 when it hangs is more useful.

Created attachment 295085 [details]
firefox-bt-matched-debuginfo-3.0-0.beta3.21.nightly20080214.txt
x86_64 backtrace while firefox is hanging
Also it is interesting to note, the 'update products', 'update versions'
scripts on this page do not cause firefox to hang, only the update components
script does.
Dave, does this seem like something which may actually be Bugzilla's problem?

(In reply to comment #0)
> The cpu spins and the application does not respond or redraw itself.
> It can be closed (gnome recognizes not responding, force quit dialog).
(In reply to comment #33)
> Dave, does this seem like something which may actually be Bugzilla's problem?
Matej, a foreign web page (Bugzilla) must have no right to block the browser responsiveness. Even if the (possibly malicious) web page would really like to. (->gecko/xulrunner Bug)

(In reply to comment #34)
> Matej, a foreign web page (Bugzilla) must have no right to block the browser
> responsiveness. Even if the (possibly malicious) web page would really like to.
> (->gecko/xulrunner Bug)
I am not trying now to blame Bugzilla for this bug (I totally agree with you); however, given the importance of this bug for keeping the whole bug filing and bug management process working, I would be eager to find out as soon as possible whether there is something to be done in Bugzilla to make it work again for everybody affected. Then we could happily return to the interesting issue of why bugzilla is able to bring firefox to its knees.

Still waiting on reply from Bugzilla folks.

For the BZ guys' info, I'm using FF3b3 on windows xp fine with the same network in front of the machine (same router, hardware firewall configured, nat, etc) on two machines. FF3 on rawhide does not work from either of the same machines. BZ is behaving fine for FF2 and safari (windows and osx) and IE7 on the same network. If the BZ javascript is to blame in this case I would think it is something very quirkcentric. ;)

Andrew, does FF2 on Linux work for you?

Well I don't have an F8 install in place now, but it does work on Ubuntu, and it used to work fine on F8. I'll see what I can do about checking on an F8 live cd.

Matej, the official build of ff2.0.0.12 works fine (on rawhide), using the tarball directly from upstream. Both the query.cgi page and the update components link on the bug work within 5-6 seconds with no noticeable cpu load.

And.. it's reproduced using the latest upstream FF3 nightly.
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9b4pre) Gecko/2008022004 Minefield/3.0b4pre
The javascript interpreter is just broken it seems.

I just also tried the latest upstream nightly on OSX and it works fine with bz.

Is it possible that the way js is allocating memory severely degrades the
performance? milestones & versions are typically only a few array items,
however the component list can be on the order of 1000's of items.
    var result = eval(o.responseText);
    if (typeof(sel.options) != 'undefined') {
        sel.options.length = 0;
        var i = 0
        var j = 0
        if (add_all) {
            sel.options[i] = new Option('All','')
            i = 1
        }
        for (; j < result.length; i++) {
            sel.options[i] = new Option(result[j],result[j])
            j++
        }
    }
To my mind js is going to have to allocate memory for a new 'Option' on each
loop iteration as well as allocating space for the array as it grows. Though
I'd bet it pre-allocates arrays to handle the general cases.
For instance, here are the component counts for some products
Product                      Components
Fedora                       6155
Fedora EPEL                  1583
Red Hat Raw Hide             1568
Red Hat Linux Beta           1349
Red Hat Enterprise Linux 5   1298
That is the top 5 products by component count. So it's probably no surprise
that you Fedora guys are hitting this.
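To get a feel for the allocation side of that question in isolation, the fill loop can be timed outside the browser. This is a minimal sketch under loud assumptions: `FakeOption` stands in for the browser's Option constructor, a plain array stands in for sel.options, and `fillOptions` and `timeFill` are hypothetical names. It therefore measures only the object and array allocation, not whatever the DOM does on each insertion.

```javascript
// Stand-in for the browser's Option constructor (assumption).
function FakeOption(text, value) {
  this.text = text;
  this.value = value;
}

// Same shape as the loop in updateSelect: optional 'All' entry first,
// then one new option per result item.
function fillOptions(options, result, addAll, OptionCtor) {
  options.length = 0;
  var i = 0;
  if (addAll) {
    options[i] = new OptionCtor('All', '');
    i = 1;
  }
  for (var j = 0; j < result.length; j++, i++) {
    options[i] = new OptionCtor(result[j], result[j]);
  }
  return options.length;
}

// Build `count` made-up component names and time a single fill.
function timeFill(count) {
  var result = [];
  for (var k = 0; k < count; k++) {
    result.push('component-' + k);
  }
  var options = [];
  var start = Date.now();
  fillOptions(options, result, true, FakeOption);
  return { count: options.length, ms: Date.now() - start };
}

// 6155 is the Fedora component count from the list above.
var fedoraRun = timeFill(6155);
console.log(fedoraRun.count + ' options in ' + fedoraRun.ms + ' ms');
```

Passing the real Option constructor and a real select's options collection to fillOptions inside a page would give the in-browser figure to compare against.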
FWIW it works ok for me with FF2 Mozilla/5.0 (X11; U; Linux i686; en-US;
rv:1.8.1.12) Gecko/20080208 Fedora/2.0.0.12-1.fc8 Firefox/2.0.0.12 of F8
Created attachment 295524 [details]
callgrind annotated output
I tried running callgrind on ff; here's the shortened output
firefox-3.0-0.beta3.1.el5
xulrunner-1.9-0.beta3.1.el5
Bug filed upstream as https://bugzilla.mozilla.org/show_bug.cgi?id=418845

Created attachment 295764 [details]
oprofile output for only the Option test link load
I've created a test script for this bug at (c=numberofoptions):
http://www.lordmorgul.net/pub/fedora/testing/ff3-bug431162.php?c=1000
This attached oprofile is for that load only (the page is already loaded so the static options are not included).

*** Bug 435188 has been marked as a duplicate of this bug. ***

The upstream bug result, if anyone goes searching for this bug: setting
GNOME_ACCESSIBILITY=0 in your environment when starting firefox seems to
drastically reduce the problem. A real solution to the issue will hopefully be
found soon.
> GNOME_ACCESSIBILITY=0 firefox
A more thorough fix for this (temporary workaround for Fedora) is to turn off the gnome assistive technologies option. I do not know how/why this got turned on (maybe it's been set as default on in Rawhide?) but it was turned on for both my machines. Setting the environment variable tells firefox (and all apps that respect the var) not to use AT, but turning it off also works.
System->Preferences->Personal->Assistive Technologies
Toggle the 'enable assistive technologies' checkbox, then restart firefox. It is not necessary to log out.
Matej: maybe we can find out if this is supposed to be on by default for installs? Keeping this from biting the F9 release is probably a good idea. Maybe until it is fixed upstream the Fedora firefox start script could set the environment var? All xulrunner apps actually need it.

Accessibility support has been turned on by default since gnome 2.17, at least.

Seems fixed for me with the 20080310 snapshot (Current rawhide). Huzzah!

Andrew, can you retest and confirm?

(In reply to comment #50)
> Seems fixed for me with the 20080310 snapshot (Current rawhide). Huzzah!
Most likely it is a false positive -- the upstream patch is still being reviewed (and yes the bug generated something over 70 comments -- yay!).

(In reply to comment #51)
>
> Most likely it is a false positive
D'oh! You're right. Mixed up my test machines - this one turned out to have accessibility turned off, which is a known workaround. Please disregard comment #50.

Greetings:
I just entered 43780, which appears to be related to this thread.
Thanks

Upstream fixed in their CVS.

Looks like this made it to rawhide today.
firefox-3.0-0.51.beta5rc2.fc9.i386
xulrunner-1.9-0.51.beta5rc2.fc9.i386
My test script [1] and 'update components' on the bug page both run in about 10 seconds max with accessibility turned on. Confirm and we'll rejoice.
[1] http://www.lordmorgul.net/pub/fedora/testing/ff3-bug431162.php?c=6000

Wonderful news!
This may have a regression in firefox-3.0-0.52.beta5 and xulrunner-1.9-0.52.beta5, or maybe the package before. I noticed the bugzilla query page running slower the last day or two, but after update today it seems like a big change. I have assistive technologies toggled on to keep an eye on this. My test page script is taking over a minute now for c=6000. Does anyone but me see this slowness again?

I don't see it.

Created attachment 303957 [details]
testing page
(In reply to comment #55)
> My test script [1] and 'update components' on the bug page both run in about 10
> seconds max with accessibility turned on. Confirm and we'll rejoice.
> [1] http://www.lordmorgul.net/pub/fedora/testing/ff3-bug431162.php?c=6000
Just for the sake of beauty of this test page and its preservation, I am attaching it here.