Bug 441643

Summary: mugshot extension crashes Firefox
Product: [Fedora] Fedora Reporter: Matěj Cepl <mcepl>
Component: mugshotAssignee: Owen Taylor <otaylor>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: low    
Version: rawhideCC: axel.thimm, gecko-bugs-nobody, jan.kratochvil, mcepl, mpg, otaylor, stransky, tagoh
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-04-28 20:44:00 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 441977    
Attachments:
Description Flags
session in gdb
none
my backtrace on gdb
none
Minimal testcase
none
A patch to mugshot, uses internal strings with nsIDocument
none
Patch I'm applying none

Description Matěj Cepl 2008-04-09 10:11:21 UTC
Description of problem:
Pretty consistently firefox crashes and only now I have caught mugshot FF
extension in flagranti (see attached backtrace).

Version-Release number of selected component (if applicable):
mugshot-1.1.93-2.fc9.x86_64
xulrunner-1.9-0.52.beta5.fc9.x86_64
firefox-3.0-0.52.beta5.fc9.x86_64

How reproducible:
more often than I would like to ;-) (not all the times, but makes FF unusable
for me)

Steps to Reproduce:
not a good pattern
  
Actual results:
crash

Expected results:
shouldn't

Additional info:

Comment 1 Matěj Cepl 2008-04-09 10:11:21 UTC
Created attachment 301771 [details]
session in gdb

Comment 2 Owen Taylor 2008-04-10 17:53:08 UTC
*** Bug 441892 has been marked as a duplicate of this bug. ***

Comment 3 Owen Taylor 2008-04-10 17:55:01 UTC
Colin: can you look into this?

Comment 4 Axel Thimm 2008-04-10 18:36:23 UTC
> Steps to Reproduce:
> not a good pattern

Until I disabled the mugshot extension the following URL would *always* crash
firefox.

Comment 5 Matěj Cepl 2008-04-10 20:36:14 UTC
(In reply to comment #4)
> Until I disabled the mugshot extension the following URL would *always* crash
> firefox.

What URL?

Comment 6 Axel Thimm 2008-04-11 16:08:07 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Until I disabled the mugshot extension the following URL would *always* crash
> > firefox.
> 
> What URL?

http://mugshot.org/visit?post=R43l8bh5QClq8M

Comment 7 Akira TAGOH 2008-04-15 11:39:26 UTC
Created attachment 302438 [details]
my backtrace on gdb

I can reproduce this bug as well, with
http://online.gnome.org/application?id=gnu-emacs say.

Comment 8 Owen Taylor 2008-04-15 13:29:44 UTC
Reproduced on x86_64 ... investigating.

Comment 9 Owen Taylor 2008-04-15 14:58:09 UTC
Having a lot of trouble getting past the XPCOM and figuring out what's really
going on - the section of our code that is crashing looks like:

/* void setListener (in hippoIControlListener listener); */
NS_IMETHODIMP hippoControl::SetWindow(nsIDOMWindow *window)
{
[...]
    nsCOMPtr<nsIWidget> widget = GetMainWidget(window);
[...]
}

static nsIWidget* GetMainWidget(nsIDOMWindow* aWindow)
{
        // get the native window for this instance
        nsCOMPtr<nsPIDOMWindow> window(do_QueryInterface(aWindow));
        NS_ENSURE_TRUE(window, nsnull);
        nsCOMPtr<nsIDocument> doc(do_QueryInterface(window->GetExtantDocument()));
        NS_ENSURE_TRUE(doc, nsnull);
>>>>    nsCOMPtr<nsISupports> container = doc->GetContainer();
        nsCOMPtr<nsIBaseWindow> baseWindow(do_QueryInterface(container));

The crash occurs in the line marked with >>>> If I trace inside there,
the crash occurs in the attempt to dereference the mDocumentContainer
weak reference filed of nsIDocument.

  /**
   * Get the container (docshell) for this document.
   */
  already_AddRefed<nsISupports> GetContainer() const
  {
    nsISupports* container = nsnull;
    if (mDocumentContainer)
>>>>   CallQueryReferent(mDocumentContainer.get(), &container);

    return container;
  }

None of the objects are obviously garbage when inspected in a debugger,
but I'm not really expert enough in XPCOM to get past the surface types
and look at the details of the implementation classes.

As far as we can tell this crash occurs on x86_64 and not on i386.
On x86_64 it is 100% reproducible.


Comment 10 Christopher Aillon 2008-04-17 19:27:17 UTC
valgrind shows:

==18620== Invalid read of size 8
==18620==    at 0x1D0E0947: hippoControl::SetWindow(nsIDOMWindow*) (in
/usr/lib64/mugshot/firefox/components/libhippofirefox.so)
==18620==    by 0x765EAC3: NS_InvokeByIndex_P (xptcinvoke_x86_64_linux.cpp:208)
==18620==    by 0x6E959FE: XPCWrappedNative::CallMethod(XPCCallContext&,
XPCWrappedNative::CallMode) (xpcwrappednative.cpp:2369)
==18620==    by 0x6E9E11A: XPC_WN_CallMethod(JSContext*, JSObject*, unsigned,
long*, long*) (xpcwrappednativejsops.cpp:1470)
==18620==    by 0x61A2F16: js_Invoke (jsinterp.c:1287)
==18620==    by 0x6196257: js_Interpret (jsinterp.c:4841)
==18620==    by 0x61A2F7E: js_Invoke (jsinterp.c:1303)
==18620==    by 0x61A3383: js_InvokeConstructor (jsinterp.c:1861)
==18620==    by 0x619A325: js_Interpret (jsinterp.c:3811)
==18620==    by 0x61A2F7E: js_Invoke (jsinterp.c:1303)
==18620==    by 0x6190BD2: fun_apply (jsfun.c:1650)
==18620==    by 0x619F788: js_Interpret (jsinterp.c:4824)
==18620==  Address 0x393500382d46546d is not stack'd, malloc'd or (recently) free'd


Comment 11 Christopher Aillon 2008-04-20 16:24:02 UTC
Building without optimization makes this crash dissapear...

Comment 12 Christopher Aillon 2008-04-20 16:27:03 UTC
So, eventually I was able to get a xulrunner built with --enable-debug that
exhibits the crash.  I see:

WARNING: recurring into frame construction:
'mPresContext->mLayoutPhaseCount[eLayoutPhase_FrameC] == 0', file
../../dist/include/layout/nsPresContext.h, line 958
WARNING: NS_ENSURE_TRUE(sgo || !hasHadScriptObject) failed: file
nsContentUtils.cpp, line 2594
--DOMWINDOW == 10 (0x1dd1240) [serial = 10] [Outer = 0x1ad7eb0]
++DOMWINDOW == 11 (0x1dd1240) [serial = 12] [Outer = 0x1ad7eb0]
CSS Error (http://online.gnome.org/css3/1207784006/site-positions.css :21.17):
Error in parsing value for property 'width'.  Declaration dropped.
WARNING: recurring into frame construction:
'mPresContext->mLayoutPhaseCount[eLayoutPhase_FrameC] == 0', file
../../dist/include/layout/nsPresContext.h, line 958
++DOMWINDOW == 12 (0x281d420) [serial = 13] [Outer = 0x1ad7eb0]
CSS Error (http://online.gnome.org/css3/1207784006/site-positions.css :21.17):
Error in parsing value for property 'width'.  Declaration dropped.
WARNING: recurring into frame construction:
'mPresContext->mLayoutPhaseCount[eLayoutPhase_FrameC] == 0', file
../../dist/include/layout/nsPresContext.h, line 958
** (firefox:21711): DEBUG: Connected to session bus
** (firefox:21711): DEBUG: unique name of client: :1.19
** (firefox:21711): DEBUG: adding rule
type='signal',sender=':1.19',path='/com/dumbhippo/listener',interface='com.dumbhippo.Listener',member='Connected'
** (firefox:21711): DEBUG: adding rule
type='signal',sender=':1.19',path='/com/dumbhippo/listener',interface='com.dumbhippo.Listener',member='Disconnected'
** (firefox:21711): DEBUG: Error from registerEndpoint(): XMPP connection not active

Program /usr/lib64/firefox-3.0b5/firefox (pid = 21711) received signal 11.


Comment 13 Martin Stransky 2008-04-21 15:01:49 UTC
It's here:

676       already_AddRefed<nsISupports> GetContainer() const
677       {
678         nsISupports* container = nsnull;
679         if (mDocumentContainer)
680           CallQueryReferent(mDocumentContainer.get(), &container);
681
682         return container;
683       }

(gdb) p mDocumentContainer
$119 = {<nsCOMPtr_base> = {mRawPtr = 0x1ad1888}, <No data fields>}

(gdb) p mDocumentContainer.mRawPtr
$120 = (nsISupports *) 0x1ad1888

(gdb) p* mDocumentContainer.mRawPtr
$121 = {_vptr.nsISupports = 0x393500382d465455}


0x393500382d465455 doesn't seem to be the right value here ;-) But why is the
mDocumentContainer broken?


#0  nsIDocument::GetContainer (this=0x18c9580) at
/usr/include/xulrunner-sdk-1.9pre/content/nsIDocument.h:679
#1  0x00002aaac26861e1 in GetMainWidget (aWindow=0x1598690) at
common-dist/firefox/src/hippoControl.cpp:167
#2  0x00002aaac26863d7 in hippoControl::SetWindow (this=0x9bbc70, window=0x1598690)
    at common-dist/firefox/src/hippoControl.cpp:201
#3  0x00002aaaac89807d in NS_InvokeByIndex_P (that=0x9bbc70, methodIndex=9,
paramCount=1, params=0x7fffe07f1720)
    at xptcinvoke_x86_64_linux.cpp:208
#4  0x00002aaaab95f6b6 in XPCWrappedNative::CallMethod (ccx=@0x7fffe07f1b80,
mode=XPCWrappedNative::CALL_METHOD)
    at xpcwrappednative.cpp:2369
#5  0x00002aaaab96e45c in XPC_WN_CallMethod (cx=0x1598a60, obj=0x76ed80, argc=1,
argv=0x17e9b58, vp=0x7fffe07f1d88)
    at xpcwrappednativejsops.cpp:1470
#6  0x00002aaaaab7866f in js_Invoke (cx=0x1598a60, argc=1, vp=0x17e9b48,
flags=2048) at jsinterp.c:1287
#7  0x00002aaaaab6a2c0 in js_Interpret (cx=0x1598a60) at jsinterp.c:4841
#8  0x00002aaaaab78781 in js_Invoke (cx=0x1598a60, argc=1, vp=0x17e9a28,
flags=1) at jsinterp.c:1303
#9  0x00002aaaaab79f10 in js_InvokeConstructor (cx=0x1598a60, vp=0x17e9a28,
argc=1) at jsinterp.c:1861
#10 0x00002aaaaab62e4e in js_Interpret (cx=0x1598a60) at jsinterp.c:3811
#11 0x00002aaaaab78781 in js_Invoke (cx=0x1598a60, argc=1, vp=0x17e97e0,
flags=2) at jsinterp.c:1303
#12 0x00002aaaaab78a9d in js_InternalInvoke (cx=0x1598a60, obj=0x1ae03c0,
fval=26271568, flags=0, argc=1, argv=0x17e97d8, 
    rval=0x7fffe07f3ad8) at jsinterp.c:1359

Comment 14 Marco Pesenti Gritti 2008-04-21 22:42:44 UTC
Created attachment 303226 [details]
Minimal testcase

Comment 15 Marco Pesenti Gritti 2008-04-21 23:09:41 UTC
The trace I'm getting here stops at SetWindow. I'm using a debug build and there
are no messages when clicking the button in the testcase I attached. Using
--disable-optimize doesn't seem to make any difference.

I can't dig down in the nsIDOMWindow passed to SetWindow. I don't know the xpcom
internals and what _vptr.nsISupports is about.

(gdb) f 5
#5  0x00007fb26ed56c56 in hippoControl::SetWindow (this=0x22dbf30, window=0x1b64c50)
    at /usr/include/xulrunner-sdk-1.9pre/stable/nsIWeakReferenceUtils.h:63
63	                                  reinterpret_cast<void**>(aDestination));

(gdb) p window
$3 = (nsIDOMWindow *) 0x1b64c50
(gdb) p* window
$4 = {<nsISupports> = {_vptr.nsISupports = 0x7fb281f92350}, <No data fields>}

Just before calling javascript mDocumentContainer seem to be fine:

(gdb) f 17
#17 0x00007fb280ddc384 in nsEventListenerManager::HandleEvent (this=0x2ca0660,
aPresContext=0x2c74e10, aEvent=0x7fff8a0a7250, aDOMEvent=0x7fff8a0a7030, 
    aCurrentTarget=0x2ca05a0, aFlags=6, aEventStatus=0x7fff8a0a7038) at
nsEventListenerManager.cpp:1184

(gdb) p aPresContext.mDocument.mRawPtr.mDocumentContainer.mRawPtr
$35 = (class nsIWeakReference *) 0x1f03b30
(gdb) p* aPresContext.mDocument.mRawPtr.mDocumentContainer.mRawPtr
$36 = {<nsISupports> = {_vptr.nsISupports = 0x7fb28204acb0}, <No data fields>}

Full trace:

#0  0x00000030acea6251 in nanosleep () from /lib64/libc.so.6
#1  0x00000030acea6077 in __sleep (seconds=<value optimized out>) at
../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fb28089fd3b in ah_crap_handler (signum=11) at nsSigHandlers.cpp:149
#3  0x00007fb2808b42c2 in nsProfileLock::FatalSignalHandler (signo=11) at
nsProfileLock.cpp:216
#4  <signal handler called>
#5  0x00007fb26ed56c56 in hippoControl::SetWindow (this=0x22dbf30, window=0x1b64c50)
    at /usr/include/xulrunner-sdk-1.9pre/stable/nsIWeakReferenceUtils.h:63
#6  0x00007fb28161232f in NS_InvokeByIndex_P (that=0x22dbf30, methodIndex=9,
paramCount=1, params=0x7fff8a0a5e60) at xptcinvoke_x86_64_linux.cpp:208
#7  0x00007fb2808fcf9c in XPCWrappedNative::CallMethod (ccx=@0x7fff8a0a6110,
mode=<value optimized out>) at xpcwrappednative.cpp:2369
#8  0x00007fb28090a0ee in XPC_WN_CallMethod (cx=0x1b650b0, obj=0x168c500,
argc=1, argv=0x2687cc0, vp=0x7fff8a0a62f8) at xpcwrappednativejsops.cpp:1470
#9  0x000000000067de86 in js_Invoke (cx=0x1b650b0, argc=1, vp=0x2687cb0,
flags=2048) at jsinterp.c:1287
#10 0x000000000066acd6 in js_Interpret (cx=0x1b650b0) at jsinterp.c:4841
#11 0x000000000067e2f8 in js_Invoke (cx=0x1b650b0, argc=1, vp=0x2687ba0,
flags=2) at jsinterp.c:1303
#12 0x000000000067e524 in js_InternalInvoke (cx=0x1b650b0, obj=0x1682480,
fval=23241024, flags=2, argc=1, argv=0x2687b98, rval=0x7fff8a0a6918)
    at jsinterp.c:1359
#13 0x0000000000625cc4 in JS_CallFunctionValue (cx=0x1b650b0, obj=0x1682480,
fval=23241024, argc=1, argv=0x2687b98, rval=0x7fff8a0a6918) at jsapi.c:5036
#14 0x00007fb280f6fb37 in nsJSContext::CallEventHandler (this=0x1b64f70,
aTarget=<value optimized out>, aScope=<value optimized out>, 
    aHandler=0x162a140, aargv=0x2c3fc70, arv=0x7fff8a0a6b80) at
nsJSEnvironment.cpp:1962
#15 0x00007fb280fd4357 in nsJSEventListener::HandleEvent (this=0x2ca0710,
aEvent=0x15dc9a0) at nsJSEventListener.cpp:248
#16 0x00007fb280ddbd73 in nsEventListenerManager::HandleEventSubType
(this=0x2ca0660, aListenerStruct=<value optimized out>, aListener=0x2ca0710, 
    aDOMEvent=0x15dc9a0, aCurrentTarget=0x2ca05a0, aPhaseFlags=<value optimized
out>) at nsEventListenerManager.cpp:1080
#17 0x00007fb280ddc384 in nsEventListenerManager::HandleEvent (this=0x2ca0660,
aPresContext=0x2c74e10, aEvent=0x7fff8a0a7250, aDOMEvent=0x7fff8a0a7030, 
    aCurrentTarget=0x2ca05a0, aFlags=6, aEventStatus=0x7fff8a0a7038) at
nsEventListenerManager.cpp:1184
#18 0x00007fb280e04782 in nsEventTargetChainItem::HandleEvent (this=<value
optimized out>, aVisitor=@0x7fff8a0a7020, aFlags=6)
    at nsEventDispatcher.cpp:206
#19 0x00007fb280e04c63 in nsEventTargetChainItem::HandleEventTargetChain
(this=0x23136d0, aVisitor=@0x7fff8a0a7020, aFlags=6, aCallback=0x7fff8a0a70e0)
    at nsEventDispatcher.cpp:264
#20 0x00007fb280e051b1 in nsEventDispatcher::Dispatch (aTarget=<value optimized
out>, aPresContext=0x2c74e10, aEvent=0x7fff8a0a7250, aDOMEvent=0x0, 
    aEventStatus=0x7fff8a0a77ac, aCallback=0x7fff8a0a70e0) at
nsEventDispatcher.cpp:479
#21 0x00007fb280b4a6ea in PresShell::HandleEventInternal (this=0x2c7a660,
aEvent=0x7fff8a0a7250, aView=<value optimized out>, aStatus=0x7fff8a0a77ac)
    at nsPresShell.cpp:5945
#22 0x00007fb280b4b227 in PresShell::HandleEventWithTarget (this=0x2c7a660,
aEvent=0x7fff8a0a7250, aFrame=<value optimized out>, 
    aContent=<value optimized out>, aStatus=0x7fff8a0a77ac) at nsPresShell.cpp:5850
#23 0x00007fb280de23eb in nsEventStateManager::CheckForAndDispatchClick
(this=<value optimized out>, aPresContext=<value optimized out>, 
    aEvent=0x7fff8a0a79d0, aStatus=0x7fff8a0a77ac) at nsEventStateManager.cpp:3320


Comment 16 Marco Pesenti Gritti 2008-04-22 00:44:30 UTC
I figured out how to dig through _vptr.nsISupports. I can't find anything wrong
there.

(gdb) f 5
#5  0x00007fe501b7cc50 in hippoControl::GetMainWidget (this=<value optimized
out>, aWindow=<value optimized out>)
    at /usr/include/xulrunner-sdk-1.9pre/stable/nsIWeakReferenceUtils.h:63
63	                                  reinterpret_cast<void**>(aDestination));

(gdb) p* (nsStandardURL*)((nsWebShell*)((nsWeakReference*)((nsHTMLDocument
*)doc).mDocumentContainer).mReferent).mCurrentURI
$29 = {<nsIFileURL> = {<nsIURL> = {<nsIURI> = {<nsISupports> = {
          _vptr.nsISupports = 0x7fe5139bfd10}, <No data fields>}, <No data
fields>}, <No data fields>}, <nsIStandardURL> = {<nsIMutable> = {<nsISupports> =
{_vptr.nsISupports = 0x7fe5139bff30}, <No data fields>}, <No data fields>},
<nsISerializable> = {<nsISupports> = {
      _vptr.nsISupports = 0x7fe5139bff70}, <No data fields>}, <nsIClassInfo> =
{<nsISupports> = {
      _vptr.nsISupports = 0x7fe5139bffa8}, <No data fields>}, mRefCnt = {mValue
= 11}, _mOwningThread = {mThread = 0x2103070}, 
  mSpec = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>},
mData = 0x311e6f8 "http://mugshot.org/visit?post=R43l8bh5QClq8M", 
      mLength = 44, mFlags = 5}, <No data fields>}, mDefaultPort = 80, mPort =
-1, mScheme = {mPos = 0, mLen = 4}, mAuthority = {mPos = 7, mLen = 11}, 
  mUsername = {mPos = 7, mLen = -1}, mPassword = {mPos = 7, mLen = -1}, mHost =
{mPos = 7, mLen = 11}, mPath = {mPos = 18, mLen = 26}, mFilepath = {
    mPos = 18, mLen = 6}, mDirectory = {mPos = 18, mLen = 1}, mBasename = {mPos
= 19, mLen = 5}, mExtension = {mPos = 19, mLen = -1}, mParam = {
    mPos = 18, mLen = -1}, mQuery = {mPos = 25, mLen = 19}, mRef = {mPos = 18,
mLen = -1}, 
  mOriginCharset = {<nsACString_internal> = {<nsCSubstring_base> = {<No data
fields>}, mData = 0x7fe5043bbd18 "", mLength = 0, 
      mFlags = 5}, <No data fields>}, mParser = {mRawPtr = 0x2263fa0}, mFile =
{mRawPtr = 0x0}, mHostA = 0x0, mHostEncoding = 1, mSpecEncoding = 1, 
  mURLType = 2, mMutable = 0, mSupportsFileURL = 0, static gIDN = 0x223d940,
static gCharsetMgr = 0x0, static gInitialized = 1, static gEscapeUTF8 = 1, 
  static gAlwaysEncodeInUTF8 = 1, static gEncodeQueryInUTF8 = 0}

(I had to make GetMainWidget a method, to be able to look at doc)


Comment 17 Marco Pesenti Gritti 2008-04-22 02:05:38 UTC
Verified that this is x86_64 specific.

Comment 18 Marco Pesenti Gritti 2008-04-22 16:07:38 UTC
I got a better trace by tweaking the plugin compilation flags:

#5  0x00007f30ce7aa091 in CallQueryReferent<nsIWeakReference, nsISupports>
(aSource=0x7f30d035a3c8, aDestination=0x7fffe955d658)
    at /usr/include/xulrunner-sdk-1.9pre/stable/nsIWeakReferenceUtils.h:63
#6  0x00007f30ce7aa0ec in nsIDocument::GetContainer (this=0x7f30d0398b20) at
/usr/include/xulrunner-sdk-1.9pre/content/nsIDocument.h:680
#7  0x00007f30ce7a82f2 in hippoControl::GetMainWidget (this=0x7f30d024a370,
aWindow=0x7f30d002d100) at ./../common/firefox/src/hippoControl.cpp:169
#8  0x00007f30ce7a85c7 in hippoControl::SetWindow (this=0x7f30d024a370,
window=0x7f30d002d100) at ./../common/firefox/src/hippoControl.cpp:203

And here is something interesting:

(gdb) p ((nsIDocument*)doc).mDocumentContainer.mRawPtr
$18 = (class nsIWeakReference *) 0x7f30d0155e70
(gdb) p doc.mRawPtr.mDocumentContainer.mRawPtr
$19 = (class nsIWeakReference *) 0x7f30d035a3c8
(gdb) p * $18
$20 = {<nsISupports> = {_vptr.nsISupports = 0x7f30e1501cb0}, <No data fields>}
(gdb) p * $19
$21 = {<nsISupports> = {_vptr.nsISupports = 0x393538382d4f5349}, <No data fields>}

$18 works, $19 is broken. $18 is what get passed to GetContainer() -> crash

Comment 19 Owen Taylor 2008-04-22 16:21:47 UTC
Would it be possible to:

 - Put a conditional breakpoint on nsEventListenerManager::HandleEvent
   (conditional so it doesn't trigger until you click the button in 
   your test case)
 - Find the window and document objects from there, make sure they
   are OK
 - Put hardware watchpoints on mDocumentContainer->mRawPtr and 
   mDocumentContainer->mRawPtr->mRefCnt
 
   [Using the typical  
    (gdb) p &foo->bar
    $43 = (unsigned int *)0x432341234123123
    (gdb) watch *$43
    trick, if you aren't familar with that]
And see where things go south?

Comment 20 Marco Pesenti Gritti 2008-04-22 18:12:33 UTC
I tried to do something like that.

* I added a breakpoint to nsEventStateManager::CheckForAndDispatchClick,
mDocumentContainer was fine there (I also verified it's fine down until the JS
methods).
* I set a watch on mDocumentContainer.mRawPtr. Firefox crashed and the mRawPtr
changed but gdb didn't break...

Log below. Is the watch setup correct?

Breakpoint 1, nsEventStateManager::CheckForAndDispatchClick
(this=0x7f9ef213b370, aPresContext=0x7f9ef213af00, aEvent=0x7fff05ea54e0, 
    aStatus=0x7fff05ea52bc) at nsEventStateManager.cpp:3278
3278	                                              nsEventStatus* aStatus)
Current language:  auto; currently c++
(gdb) p &(aPresContext.mDocument.mRawPtr.mDocumentContainer.mRawPtr)
$1 = (class nsIWeakReference **) 0x7f9ef050b3e8
(gdb) watch *$1
Hardware watchpoint 2: *$1
(gdb) continue
Continuing.

Breakpoint 1, nsEventStateManager::CheckForAndDispatchClick
(this=0x7f9ef213b370, aPresContext=<value optimized out>, aEvent=0x7fff05ea54e0, 
    aStatus=0x7fff05ea52bc) at nsEventStateManager.cpp:3278
3278	                                              nsEventStatus* aStatus)
(gdb) continue
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007f9eeb6c6091 in CallQueryReferent<nsIWeakReference, nsISupports>
(aSource=0x7f9ef06dcd48, aDestination=0x7fff05ea35a8)
    at /usr/include/xulrunner-sdk-1.9pre/stable/nsIWeakReferenceUtils.h:63
63	                                  reinterpret_cast<void**>(aDestination));


Comment 21 Owen Taylor 2008-04-22 18:23:11 UTC
That looks right to me.

- Have you checked that the document pointer is the same in the two locations?
  A sanity check: do p *$1 when you set the watch point, and then *$1 after
  you crash... if they are different, then the watch point didn't work,
  if they are the same, then that tells you something else.

- I haven't tried hardware watchpoints on x86_64 before. Might be worth
  testing with a small test program.


Comment 22 Marco Pesenti Gritti 2008-04-22 19:09:23 UTC
Before the crash:

(gdb) p aPresContext.mDocument.mRawPtr
$1 = (nsIDocument *) 0x7f8a3e0c36d0
(gdb) p &($1.mDocumentContainer.mRawPtr)
$2 = (class nsIWeakReference **) 0x7f8a3e0c3718
(gdb) p *$2
$3 = (class nsIWeakReference *) 0x7f8a3ec972c0

After the crash:

(gdb) p *$2
$4 = (class nsIWeakReference *) 0x7f8a3ec972c0

So nothing is changed, apparently. Though:

(gdb) p this
$5 = (const nsIDocument * const) 0x7f8a3e0c36d0
gdb) p $5.mDocumentContainer.mRawPtr
$6 = (class nsIWeakReference *) 0x7f8a3e83d888

Document is the same but document.mDocumentContainer is changed.

/me confused

Comment 23 Marco Pesenti Gritti 2008-04-22 19:33:03 UTC
Same thing as above if I watch the nsWeakRef instead of the raw pointer.

Comment 24 Owen Taylor 2008-04-22 19:52:57 UTC
Transcribing from IRC in case it inspires anybody on cc:

<owen> is ((char *)&this.mDocumentContainer  - (char *)this) the same in the
inside-gecko code, and in the inside-plugin code
<marcopg> with "this" you mean the document?
<owen> So, the theory I'm going on, is that something in the two compilation
environments was different, resulting in a different binary layout for
nsIDocument on x86_64, causing the inlined method to go kerfluey
<owen> marcopg: Yeah
<marcopg> yeah that's a theory that occurred to me too, I tried to make the
compilation envs more similar, but I may be very well still missing something


Comment 25 Jan Kratochvil 2008-04-22 21:02:06 UTC
The problem is that there is ABI difference between internal xulrunner/firefox
and external plugin mugshot:

-xulrunner/firefox
+mugshot plugin
--- /tmp/state-ok       2008-04-22 22:04:48.000000000 +0200
+++ /tmp/state-crashing 2008-04-22 22:04:16.000000000 +0200
@@ -1,12 +1,12 @@
 (gdb) ptype this
 type = const class nsIDocument : public nsINode {
   protected:
-    nsString mDocumentTitle;
+    nsString_external mDocumentTitle;
     nsCOMPtr<nsIURI> mDocumentURI;
     nsCOMPtr<nsIURI> mDocumentBaseURI;
     nsWeakPtr mDocumentLoadGroup;
     nsWeakPtr mDocumentContainer;

due to
#define nsString  nsString_external in nsStringAPI.h
the nsIDocument layout gets presentet differently outside but the same binary
structure is passed which breaks the external access:
(gdb) p sizeof ( nsString )
$7 = 16
(gdb) p sizeof ( nsString_external )
$8 = 24

It was crashing while accessing `mDocumentContainer' - its offset is dependent
on `sizeof nsString'.

The problem background was given by stransky.


Comment 26 Marco Pesenti Gritti 2008-04-23 01:04:10 UTC
On i386 they are the same size, which explains why the bug is x86_64 specific:

(gdb) p sizeof(nsString)
$1 = 12
(gdb) p sizeof(nsString_external)
$2 = 12
(gdb) p sizeof(nsCString)
$3 = 12
(gdb) p sizeof(nsCString_external)
$4 = 12

Looks like it has been fixed upstream a few months ago for i386 (it was
affecting other extensions):

https://bugzilla.mozilla.org/show_bug.cgi?id=390849

Perhaps we should just make the sizes match on x86_64 as well...

Comment 27 Martin Stransky 2008-04-23 07:15:23 UTC
Created attachment 303440 [details]
A patch to mugshot, uses internal strings with nsIDocument

I'm not sure if it's bug in mozilla. Mugshot mixes internal and external API's
together and I wonder how it even can work on any platform...

Comment 28 Marco Pesenti Gritti 2008-04-23 09:16:23 UTC
I think there is a bug in firefox since you can't use nsIDocument with external
strings on x86_64 because of the nsString size difference. It could be argued
that nsIDocument should not be used by plugins certainly, I'm defining it a bug
on the base of #390849.

For Fedora 9 using internal strings sounds like a reasonable solution. Because
of
http://developer.mozilla.org/en/docs/Migrating_from_Internal_Linkage_to_Frozen_Linkage
I was under the impression that the possibility was going away with 1.9, but
apparently that's not the case (or not yet).

Comment 29 Marco Pesenti Gritti 2008-04-23 09:31:46 UTC
Just for the record, I don't think mugshot is currently mixing external and
internal. It's using external only.

Comment 30 Marco Pesenti Gritti 2008-04-23 10:08:57 UTC
To recap my understanding of the issue, after having talked with stransky on irc:

* external strings + nsIDocument does not work on x86_64 because of the
internal/external size difference. This has been fixed by
https://bugzilla.mozilla.org/show_bug.cgi?id=390849 but it's still broken on x86_64.

* nsIDocument is theoretically an internal header, but it has been made to work
with internal linkage to ease the transition. See
https://bugzilla.mozilla.org/show_bug.cgi?id=315562

* Currently it's possible to use internal strings/linkage (stransky patch) for
the mugshot plugin because all the string symbols we are using are exported.
Obviously this might change in the future and we might have to switch back to
frozen strings (stransky run into this issue when porting epiphany/galeon).

* The correct long term solution is to get the functionality of nsIDocument we
need exposed in a public interface.

Comment 31 Christopher Aillon 2008-04-23 22:21:17 UTC
Marco is the size issue (comment 26) filed upstream yet?  We should definitely
do so if it isn't...

Comment 33 Marco Pesenti Gritti 2008-04-23 23:26:18 UTC
(In reply to comment #31)
> Marco is the size issue (comment 26) filed upstream yet?  We should definitely
> do so if it isn't...

https://bugzilla.mozilla.org/show_bug.cgi?id=430581

Comment 34 Owen Taylor 2008-04-23 23:46:14 UTC
Martin's patch seems like the best bet for now. I'll test it tomorrow morning
to make sure it works for me and apply it (with a comment.) Thanks everybody.

Comment 35 Owen Taylor 2008-04-24 20:57:04 UTC
For me the patch seems to prevent the extension from loading at all
(presumably because the use of internal symbols is causing dlopen() to
fail, though I don't know how to get firefox to print out the dlerror()
visibly.)

Since the page appearance with and without the extension is fairly
small, my guess is that when Jan's patch was tested that was also the
case. (The other possibility is that the visibility tricks that the 
Mozilla folks play aren't there on x86_64.)

With the extension properly loaded, if you are logged into online.gnome.org,
have the mugshot client running, and go to:
 
 http://online.gnome.org/application?id=mozilla-firefox

It should show you information about your installed version, not

 "No Package Information"

My best thought right now is

 #undef nsString
 #define nsString ProperlySizedDummyObject
 #include  "nsIDocument.h"
 #undef nsString
 #define nsString nsString_external

I'll see if I can make that work.

Comment 36 Owen Taylor 2008-04-24 21:24:00 UTC
Created attachment 303694 [details]
Patch I'm applying

Here's the patch I'm committing. Tested to work on i386. I'm hopeful that it
will fix the problem on x86_64.

Comment 37 Owen Taylor 2008-04-25 15:10:01 UTC
mugshot-1.1.95-1.fc9 built in koji and confirmed to work properly on x86_64.
I'll close this once it is pushed to Rawhide.