Bug 703249 - gnucash segfaults on startup
Summary: gnucash segfaults on startup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gnucash
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Bill Nottingham
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-05-09 18:07 UTC by Jonathan Corbet
Modified: 2014-03-17 03:27 UTC

Fixed In Version: gnucash-2.4.7-4.fc16
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-19 04:43:18 UTC
Type: ---
Embargoed:


Attachments
Crash traceback (4.18 KB, text/plain)
2011-05-09 18:07 UTC, Jonathan Corbet
patch to link against libgnutls (554 bytes, patch)
2011-10-09 04:47 UTC, Jonathan Kamens
slightly different patch (571 bytes, patch)
2011-10-11 16:14 UTC, Bill Nottingham


Links
System ID Private Priority Status Summary Last Updated
GNOME Bugzilla 661383 0 None None None Never

Description Jonathan Corbet 2011-05-09 18:07:20 UTC
Created attachment 497881
Crash traceback

Description of problem:

Running gnucash yields the splash screen, then a segfault.  The last message in the splash window is "gnucash/import-export/aqbanking".

Version-Release number of selected component (if applicable):

gnucash-2.4.5-2.fc16.x86_64

How reproducible: 100%


Steps to Reproduce:
1. Run gnucash
2. Sweep up the core dump
3.
  
Actual results:

Segmentation violation, crash, no application window

Expected results:

Normal gnucash window showing the dire state of my finances.

Additional info:

If you could fix my finances too that would be extra cool.

Comment 1 Bill Nottingham 2011-05-09 18:51:00 UTC
This crash appears to be somewhere in the gnutls/libgcrypt stack, and I note that gnutls differs between F-15 and F-16. Moving there for now.

Comment 2 Horas 2011-05-25 18:10:15 UTC
This also happens to me on Fedora 14, but only if there are open reports. If I delete .gnucash/, gnucash starts fine, but as soon as I open a report, gnucash crashes.
On Fedora 15 gnucash seems to be working fine.

Comment 3 Jonathan Corbet 2011-05-25 18:20:46 UTC
Removing .gnucash changes nothing for me; I still get a segfault on startup.

Comment 4 Bill Nottingham 2011-09-13 23:35:44 UTC
Does this still persist? gnucash works fine for me in F-16, although I haven't spun up a rawhide VM yet.

Comment 5 Jonathan Corbet 2011-09-14 19:00:35 UTC
As of yesterday's rawhide, yes, the problem is still there.

Comment 6 Bill Nottingham 2011-10-07 18:54:22 UTC
*** Bug 742202 has been marked as a duplicate of this bug. ***

Comment 7 Bill Nottingham 2011-10-07 18:56:08 UTC
From the duplicated bug:

"Tomas Mraz 2011-10-06 02:21:23 EDT

I suspect gnucash does something wrong at startup perhaps it tries to
initialize the gnutls multiple times simultaneously or something similar."

gnucash itself doesn't use gnutls. Moving to gwenhywfar.

Comment 8 Bill Nottingham 2011-10-07 19:54:11 UTC
So, some debugging:

The gnutls/gcrypt initialization is done from gwenhywfar, which is brought in by AQBanking.

This *is* actually done twice during gnucash setup. First, when gnucash scans for its modules, it dlopen()s the module, and checks it for symbols. This calls the initialization constructor in gwenhywfar. However, the module is then closed, calling the destructor.

It's then opened again when the module is fully initialized.

Reading the gwenhywfar code, it looks like it won't call the destructor for gnutls if it didn't think it initialized itself correctly.

Can anyone who's seeing this attach the output of "GWEN_LOGLEVEL=debug gnucash"?
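
For anyone following along, the scan-then-load sequence described above can be sketched as a toy model. This is only an illustration in Python: ToyModule, startup() and the symbol name toy_module_info are hypothetical stand-ins, not the gnucash module system or the gwenhywfar API. It just shows that the library constructor ends up running twice, with a destructor in between.

```python
# Toy model only: ToyModule and toy_module_info are hypothetical names,
# not part of gnucash, GModule or gwenhywfar.

class ToyModule:
    """A loadable module whose constructor/destructor run on open/close."""

    def __init__(self, name):
        self.name = name
        self.loaded = False
        self.events = []  # ordered record of what happened to the module

    def open(self):
        # Stands in for dlopen(): the library constructor runs here.
        self.loaded = True
        self.events.append("constructor")

    def close(self):
        # Stands in for dlclose(): the library destructor runs here.
        self.loaded = False
        self.events.append("destructor")

    def has_symbol(self, symbol):
        # Stands in for a symbol lookup on the opened module.
        self.events.append("lookup " + symbol)
        return True


def startup(module):
    # Phase 1: scan. Open the module only to inspect it, then close it.
    module.open()
    module.has_symbol("toy_module_info")
    module.close()
    # Phase 2: full initialization. The same module is opened again.
    module.open()
    return module.events


events = startup(ToyModule("libgncmod-aqbanking"))
print(events)
```

Any library whose constructor has side effects that the destructor does not fully undo is exposed to this open/close/open pattern.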

Comment 9 Bill Nottingham 2011-10-07 20:01:38 UTC
*** Bug 744310 has been marked as a duplicate of this bug. ***

Comment 10 Jonathan Corbet 2011-10-07 20:04:57 UTC
GWEN_LOGLEVEL=debug gives me:

7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  250: Initializing I18N module
6:2011/10/07 14-04-23:gwen(52702):i18n.c:  199: Real locale is [en_US.utf8]
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  254: Initializing InetAddr module
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  258: Initializing Socket module
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  262: Initializing Libloader module
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  266: Initializing Crypt3 module
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  270: Initializing Process module
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  274: Initializing Plugin module
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  278: Initializing DataBase IO module
6:2011/10/07 14-04-23:gwen(52702):plugin.c:  544: Plugin type "dbio" registered
6:2011/10/07 14-04-23:gwen(52702):dbio.c:  106: Adding plugin path [/usr/lib64/gwenhywfar/plugins/60/dbio]
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  282: Initializing ConfigMgr module
6:2011/10/07 14-04-23:gwen(52702):plugin.c:  544: Plugin type "configmgr" registered
6:2011/10/07 14-04-23:gwen(52702):configmgr.c:   80: Adding plugin path [/usr/lib64/gwenhywfar/plugins/60/configmgr]
7:2011/10/07 14-04-23:gwen(52702):gwenhywfar.c:  286: Initializing CryptToken2 module
6:2011/10/07 14-04-23:gwen(52702):plugin.c:  544: Plugin type "ct" registered
6:2011/10/07 14-04-23:gwen(52702):ctplugin.c:   65: Adding plugin path [/usr/lib64/gwenhywfar/plugins/60/ct]
6:2011/10/07 14-04-23:gwen(52702):plugin.c:  574: Plugin type "ct" unregistered
6:2011/10/07 14-04-23:gwen(52702):plugin.c:  574: Plugin type "configmgr" unregistered
6:2011/10/07 14-04-23:gwen(52702):plugin.c:  574: Plugin type "dbio" unregistered
Segmentation fault (core dumped)

Comment 11 Bill Nottingham 2011-10-07 20:19:20 UTC
That doesn't shed any light on it, alas.

I'm assuming if you break on gnutls_global_init(), it's only called twice - once from gnc_module_get_info very early, and then when it crashes?

Comment 12 Bill Nottingham 2011-10-07 20:24:57 UTC
Also, given that multiple people seem to be able to reproduce this (and I can't):

- Is there anything unusual in your setup? (weird environment variables, unusual authentication methods, etc)
- Are you using the online account access in GnuCash?

Comment 13 Jonathan Corbet 2011-10-07 20:29:16 UTC
Yep.  First call from gnc_module_system_refresh(), second from somewhere else in libgnc-module.so (I don't have all the debuginfo on the system, can fix that if it would help).

Comment 14 Jonathan Corbet 2011-10-07 20:30:55 UTC
I don't *think* I have anything that weird in my setup.  I have no strange auth methods and am not using online account access.  That said, something must clearly be different somewhere, since it doesn't hit everybody.

Comment 15 Bill Nottingham 2011-10-07 20:43:09 UTC
One other debugging aid may be installing the gnutls debuginfo and stepping through gnutls_global_deinit to see if something looks like it's going awry there.

Comment 16 Jonathan Kamens 2011-10-09 04:27:11 UTC
Here's what's happening:

* gnutls is loaded and initialized when libgncmod-aqbanking.so is loaded
* gnutls initializes libgcrypt
  * when initializing libgcrypt, gnutls passes in pointers to gnutls mutex
    callback functions
* during initialization, libgcrypt uses gnutls mutex management functions to
  create mutex
  * this creates private data about the mutex inside gnutls
* gnutls is unloaded when libgncmod-aqbanking.so is unloaded
* However, LIBGCRYPT IS NOT UNLOADED, because it is linked directly against
  gnucash rather than loaded dynamically with libgncmod-aqbanking.so
* gnutls is loaded and initialized again later when libgncmod-aqbanking.so is
  loaded again
* gnutls initializes libgcrypt again
* but libgcrypt was never loaded and so it thinks it's still initialized
* but the private data associated with the mutex that libgcrypt created and still
  has is no longer valid, because gnutls's private data was erased when it was
  unloaded
* bam, segfault when libgcrypt tries to use the mutex

Easiest fix: link gnutls directly against gnucash and call gnutls_global_init() before loading any modules.
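
The sequence above can be reproduced in a small toy model. This is a sketch in Python under loud assumptions: ToyGcrypt and ToyGnutls are hypothetical classes, not the real libgcrypt or gnutls APIs, and a KeyError stands in for the segfault. The long-lived object keeps a mutex handle whose backing data disappears when the short-lived one is unloaded, and the second initialization is a no-op.

```python
# Toy model only: ToyGcrypt/ToyGnutls are hypothetical stand-ins.
# A KeyError plays the role of the segfault.

class ToyGcrypt:
    """Stays in the process the whole time (never unloaded)."""

    def __init__(self):
        self.initialized = False
        self.callbacks = None
        self.mutex_handle = None

    def init(self, callbacks):
        if self.initialized:
            return  # already initialized: the old callbacks and handle are kept
        self.callbacks = callbacks
        self.mutex_handle = callbacks.mutex_create()
        self.initialized = True

    def use_mutex(self):
        self.callbacks.mutex_lock(self.mutex_handle)


class ToyGnutls:
    """Loaded and unloaded together with the module that pulls it in."""

    def __init__(self, gcrypt):
        self._mutexes = {}  # private data, valid only while "loaded"
        self._next = 0
        gcrypt.init(self)   # pass ourselves in as the mutex callbacks

    def mutex_create(self):
        self._next += 1
        self._mutexes[self._next] = "unlocked"
        return self._next

    def mutex_lock(self, handle):
        if handle not in self._mutexes:
            raise KeyError("stale mutex handle %r" % handle)
        self._mutexes[handle] = "locked"

    def unload(self):
        # Unloading wipes the private data the handle pointed at.
        self._mutexes.clear()


gcrypt = ToyGcrypt()
first = ToyGnutls(gcrypt)    # module scan: gnutls loaded, gcrypt initialized
gcrypt.use_mutex()           # works while the first load is alive
first.unload()               # module closed after the scan
second = ToyGnutls(gcrypt)   # real load: gcrypt.init() does nothing this time

try:
    gcrypt.use_mutex()
    crashed = False
except KeyError:
    crashed = True

print("crashed:", crashed)
```

In the real process the stale reference points into unmapped or reused memory rather than an emptied dictionary, so the failure shows up as a crash instead of a tidy exception.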

Comment 17 Jonathan Kamens 2011-10-09 04:28:17 UTC
Woops, in the third bullet from the end, I should have said, "libgcrypt was never UNloaded".

Comment 18 Jonathan Kamens 2011-10-09 04:47:05 UTC
Created attachment 527065
patch to link against libgnutls

Actually, it's even easier than that. You don't need to modify the source code to call gnutls_global_init. You just need to link against libgnutls when compiling gnucash so that it doesn't get unloaded when aqbanking gets unloaded. The attached patch does this.
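
Why linking helps can be sketched with a reference-counting toy model. The dynamic loader keeps a reference count per shared object and only unmaps it when the count drops to zero; a library the executable itself links against holds a reference for the life of the process. ToyLoader and scan_module() below are hypothetical names for illustration in Python, not the real loader interface.

```python
# Toy model only: ToyLoader is a hypothetical stand-in for the dynamic
# loader's per-library reference counting.

class ToyLoader:
    def __init__(self):
        self.refcount = {}
        self.unmapped = []  # libraries that were actually removed from memory

    def open(self, lib):
        self.refcount[lib] = self.refcount.get(lib, 0) + 1

    def close(self, lib):
        self.refcount[lib] -= 1
        if self.refcount[lib] == 0:
            del self.refcount[lib]
            self.unmapped.append(lib)


def scan_module(loader):
    # Opening the module also takes a reference on its dependency...
    loader.open("libgncmod-aqbanking")
    loader.open("libgnutls")
    # ...and closing the module drops both references again.
    loader.close("libgnutls")
    loader.close("libgncmod-aqbanking")


# Unpatched: gnutls is only referenced through the module.
unpatched = ToyLoader()
scan_module(unpatched)

# Patched: the executable links against gnutls, so one reference is
# taken at startup and never released.
patched = ToyLoader()
patched.open("libgnutls")
scan_module(patched)

print("unpatched unmapped:", unpatched.unmapped)
print("patched unmapped:  ", patched.unmapped)
```

With the extra reference in place, gnutls survives the scan and its private data stays valid for the second load.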

Comment 19 Jonathan Kamens 2011-10-09 04:48:54 UTC
Moving this ticket back to gnucash, since it's a gnucash shared-library loading/unloading thing that's causing the issue and a gnucash patch (attached to my last comment) is needed to fix it.

Comment 20 Andy Grimm 2011-10-10 02:25:27 UTC
Just adding a "me too" here.  I haven't done any serious debugging on this, but I was using gnucash on F15 with no problem, and it broke when I upgraded to F16 Alpha last month.  If there are any additional data points that I could give to help with this one, let me know.

Comment 21 Andy Grimm 2011-10-10 03:45:16 UTC
+1 to Jonathan's patch.  I applied it and rebuilt, and no more segfault for me.  Thanks!

Comment 22 Derek Atkins 2011-10-10 05:44:37 UTC
I'd question why GnuCash is linking against libgcrypt directly?

Can someone pass this patch (or at least this bug report) upstream?

Comment 23 Jonathan Kamens 2011-10-10 15:29:03 UTC
I doubt GC is linking against libgcrypt directly. It's linking against another shared library that links against libgcrypt.

Comment 24 Bill Nottingham 2011-10-10 16:14:57 UTC
(In reply to comment #22)
> I'd question why GnuCash is linking against libgcrypt directly?

LD_DEBUG=all shows initialization goes: libgncmod-gnome-utils -> libgnome-keyring -> libgcrypt, in a brief check here.

This does imply a more generic issue with the load-all-modules, unload-all-modules, load-all-modules-again method that could pop up again later. Of course, most libraries that these modules use don't have initialization side effects.

Comment 25 Bill Nottingham 2011-10-10 16:54:20 UTC
Reading this, it seems like gnutls_global_deinit() should call the inverse of gnutls_crypto_init() ... it doesn't. (Possibly because such a function doesn't exist.)

Comment 26 Jonathan Kamens 2011-10-10 18:08:18 UTC
(In reply to comment #25)
> Reading this, it seems like gnutls_global_deinit() should call the inverse of
> gnutls_crypto_init() ... it doesn't. (Possibly because such a function doesn't
> exist.)

Unfortunately it's not that simple. There actually appear to be significant architectural issues in the way that gcrypt and gnutls interact with each other.

For example, as noted previously, when gnutls initializes gcrypt, it passes in a set of callbacks inside gnutls for gcrypt to use. These callbacks are global to gcrypt, i.e., there's only one set of callbacks for all the things calling into gcrypt. But what if something else besides gnutls wants to initialize gcrypt with its own callbacks? It can't... they both can't exist in the same program at the same time. And this isn't even visible to the caller... If gnutls initializes gcrypt with its own callbacks, and then something else initializes gcrypt with its callbacks, the latter initialization will "succeed" and the caller won't know that his callbacks aren't actually going to be used.

Similarly, gnutls can't just uninitialize gcrypt, because there may be something other than gnutls using gcrypt, and if gcrypt is uninitialized that other code will break.

This could be avoided by requiring different instances of gcrypt to be instantiated with different static data for anyone who links against it, but I'm not even sure if that can be enforced by the shared library itself, as opposed to requiring that whoever is doing the linking explicitly request it. Furthermore, doing that would cause significant architectural issues of its own.

Bill, if you want to wade into these shark-infested waters to try and figure out just how gnutls and gcrypt are related to each other and how all this should be structured and implemented, I wish you the best of luck, but in the meantime, I hope you'll just patch gnucash so it doesn't keep crashing on people. :-)
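
To make the global-callback problem concrete, here is a toy model of the behavior as described in this comment. It is a sketch in Python only: ToyCallbackRegistry and set_callbacks() are hypothetical, and the "first caller wins, later callers still get success" rule is taken from the description above, not checked against the libgcrypt source.

```python
# Toy model only: ToyCallbackRegistry is hypothetical. The first-wins rule
# follows the description in this comment, not the real libgcrypt code.

class ToyCallbackRegistry:
    """One process-wide callback slot shared by every caller."""

    def __init__(self):
        self._callbacks = None

    def set_callbacks(self, callbacks):
        if self._callbacks is None:
            self._callbacks = callbacks
        # Success is reported whether or not the callbacks were installed.
        return 0

    def lock(self):
        return self._callbacks["lock"]()


registry = ToyCallbackRegistry()
rc_gnutls = registry.set_callbacks({"lock": lambda: "gnutls lock"})
rc_other = registry.set_callbacks({"lock": lambda: "other lock"})

# Both registrations report success, but only the first set is ever used.
result = registry.lock()
print(rc_gnutls, rc_other, result)
```

The second caller has no way to tell from the return value that its callbacks were ignored, which is the invisible failure mode described above.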

Comment 27 Bill Nottingham 2011-10-10 19:15:49 UTC
Oh, I can patch gnucash, it just seems like a hack, and also doesn't necessarily explain why a large number of people hit this reliably, and others (like me) don't hit it at all.

Comment 28 Jonathan Kamens 2011-10-10 19:23:05 UTC
(In reply to comment #27)
> Oh, I can patch gnucash, it just seems like a hack,

Yeah, but the stuff gnucash does to load / unload / load modules again is also a bit of a hack that certainly pushes the boundaries of the dynamic loading system, so think of it as one hack compensating for the breakage caused by another one. :-)

> and also doesn't
> necessarily explain why a large number of people hit this reliably, and others
> (like me) don't hit it at all.

It's memory-management-dependent, so it depends on exactly what executes when gnucash starts up and in what order. You may be running it on a different architecture, or with different perl modules, or with different versions of shared libraries, or with different gnucash preferences that cause load-time behavior to change, etc., etc.

Comment 29 Bill Nottingham 2011-10-10 19:52:04 UTC
(In reply to comment #28)
> > and also doesn't
> > necessarily explain why a large number of people hit this reliably, and others
> > (like me) don't hit it at all.
> 
> It's memory-management-dependent, so it depends on exactly what executes when
> gnucash starts up and in what order. You may be running it on a different
> architecture, or with different perl modules, or with different versions of
> shared libraries, or with different gnucash preferences that cause load-time
> behavior to change, etc., etc.

It shouldn't be, though - if the error is because libgcrypt is being brought in by gnucash itself via DSO dependencies, it's going to be linked in before module scanning, no matter what.

Comment 30 Jonathan Kamens 2011-10-10 19:59:55 UTC
(In reply to comment #29)
> It shouldn't be, though - if the error is because libgcrypt is being brought in
> by gnucash itself via DSO dependencies, it's going to be linked in before
> module scanning, no matter what.

When gcrypt is loaded into memory is not the issue.

The issue is that gcrypt caches references to data within gnutls when gcrypt is initialized by gnutls, and then gnutls is unloaded and those references become invalid.

Comment 31 Bill Nottingham 2011-10-11 15:39:35 UTC
From testing, it's prelink dependent. Prelink does ahead-of-time linking, which causes gnutls to always get the same address space, so things just happen to work. I suspect that everyone who's seeing this isn't running prelink....

Comment 32 Jonathan Kamens 2011-10-11 15:49:22 UTC
I am running prelink.

Comment 33 Bill Nottingham 2011-10-11 16:14:11 UTC
Created attachment 527492
slightly different patch

Here's what I'm intending to push and build. This changes GnuCash so that, on successfully scanning a module, it marks it as resident so it won't be dlclose()d/unloaded.

This solves the problem here, and should cover any non-gnutls cases that might pop up later. Given that gnc_module_unload() actually *doesn't* call g_module_close(), it seems to be in-line with what GnuCash is doing in the main module system.
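
The effect of marking a module resident can be sketched with a toy model. GLib does provide g_module_make_resident(), after which later g_module_close() calls on that module are ignored; whether the attached patch uses exactly that call is an assumption here, since the patch text is not shown in this thread. ToyResidentLoader and scan() below are hypothetical names in Python, not GnuCash code.

```python
# Toy model only: ToyResidentLoader and scan() are hypothetical stand-ins
# for the module loader and the gnucash module scan.

class ToyResidentLoader:
    def __init__(self):
        self.loaded = set()
        self.resident = set()
        self.destructor_calls = []  # modules whose destructors actually ran

    def open(self, name):
        self.loaded.add(name)

    def make_resident(self, name):
        # From now on, close() on this module is a no-op.
        self.resident.add(name)

    def close(self, name):
        if name in self.resident:
            return False
        self.loaded.discard(name)
        self.destructor_calls.append(name)
        return True


def scan(loader, name, mark_resident):
    loader.open(name)
    if mark_resident:
        loader.make_resident(name)  # the successful-scan case
    loader.close(name)


old = ToyResidentLoader()
scan(old, "libgncmod-aqbanking", mark_resident=False)

new = ToyResidentLoader()
scan(new, "libgncmod-aqbanking", mark_resident=True)

print("old destructors:", old.destructor_calls)
print("new destructors:", new.destructor_calls)
```

Because the module never goes away after the scan, none of its dependencies are unloaded either, which is why this covers libraries other than gnutls as well.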

Comment 34 Jonathan Kamens 2011-10-11 16:20:15 UTC
Your fix is obviously better than mine. Thanks for taking the time to come up with it!

Comment 35 Fedora Update System 2011-10-11 16:53:34 UTC
gnucash-2.4.7-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/gnucash-2.4.7-3.fc16

Comment 36 Bill Nottingham 2011-10-11 17:41:10 UTC
Jonathan - thanks for the help in tracking the problem down.

Comment 37 Jonathan Corbet 2011-10-12 18:48:24 UTC
Yay - gnucash works again!  Thanks.

There's just one other little problem: it reports that I spent all my money and can't afford to buy beer.  I'd really rather it showed my bank balance as being rather higher and that college tuition payment already made.  But I guess I should probably file a separate bug report for that one.

Comment 38 Fedora Update System 2011-10-13 18:12:10 UTC
Package gnucash-2.4.7-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing gnucash-2.4.7-3.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14258
then log in and leave karma (feedback).

Comment 39 Anton Arapov 2011-10-17 06:54:54 UTC
Package: gnucash-2.4.7-1.fc16
Architecture: x86_64
OS Release: Fedora release 16 (Verne)

Comment
-----
Start gnucash in a freshly installed F16beta.

Comment 40 Anton Arapov 2011-10-17 07:02:27 UTC
fyi, gnucash-2.4.7-4.fc16, works for me.

Comment 41 Fedora Update System 2011-10-19 04:43:18 UTC
gnucash-2.4.7-4.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

