Bug 1750575 - dnfdragora complains that dnf is locked by another process after updates (due to dnfdaemon crashing when attempting dbus operation after dnf.Base instance is closed)
Summary: dnfdragora complains that dnf is locked by another process after updates (due...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: 31
Hardware: All
OS: Linux
high
unspecified
Target Milestone: ---
Assignee: Packaging Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://fedoraproject.org/wiki/Common...
: 1761620 (view as bug list)
Depends On:
Blocks: F31FinalBlocker 1786127
TreeView+ depends on / blocked
 
Reported: 2019-09-09 22:47 UTC by Paul Whalen
Modified: 2020-01-10 08:30 UTC (History)
28 users (show)

Fixed In Version: rpm-4.15.0-2.fc31
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-17 21:38:22 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
dnfdaemon backtrace (133.93 KB, text/plain)
2019-09-09 22:52 UTC, Paul Whalen
no flags Details
better backtrace (with udev asserts enabled) (94.97 KB, text/plain)
2019-10-15 05:30 UTC, Adam Williamson
no flags Details

Description Paul Whalen 2019-09-09 22:47:45 UTC
Description of problem:
When attempting an update with dnfdragora, update finishes with a dialog:

"DNF is locked by another process. dnfdragora will exit."

This happens after the selected packages have finished upgrading.

Version-Release number of selected component (if applicable):

dnfdragora-1.1.1-3.fc31.noarch
python3-dnfdaemon-0.3.19-7.fc31.noarch


How reproducible:
Everytime

Steps to Reproduce:
1. Attempt to update the system using dnfdragora
2. At the end of updates, dialog box is displayed "DNF is locked by another process. dnfdragora will exit."


Additional info:

coredumpctl gdb 2444
           PID: 2444 (dnfdaemon-syste)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 6 (ABRT)
     Timestamp: Mon 2019-09-09 17:49:06 EDT (4min 13s ago)
  Command Line: /usr/bin/python3 /usr/share/dnfdaemon/dnfdaemon-system
    Executable: /usr/bin/python3.7
 Control Group: /system.slice/dnfdaemon.service
          Unit: dnfdaemon.service
         Slice: system.slice
       Boot ID: a5cce266acaf451a87c10eec4734b67a
    Machine ID: 69d3916002784cefaaf940ea46fea120
      Hostname: rpi3-1
       Storage: /var/lib/systemd/coredump/core.dnfdaemon-syste.0.a5cce266acaf451a87c10eec4734b67a.2444.1568065746000000000000.lz4
       Message: Process 2444 (dnfdaemon-syste) of user 0 dumped core.
                
                Stack trace of thread 2444:
                #0  0x00000000b6e8625c __GI_raise (libc.so.6)
                #1  0x00000000b6e6f8a0 __GI_abort (libc.so.6)
                #2  0x00000000b274ee3c _dbus_abort (libdbus-1.so.3)
                #3  0x00000000b2744bb4 _dbus_warn_check_failed (libdbus-1.so.3)
                #4  0x00000000b2745400 _dbus_warn_return_if_fail (libdbus-1.so.3)
                #5  0x00000000b27292c4 _dbus_connection_send_preallocated_and_unlock (libdbus-1.so.3)
                #6  0x00000000b27716bc Connection_send_message (_dbus_bindings.so)

Comment 1 Paul Whalen 2019-09-09 22:52:24 UTC
Created attachment 1613351 [details]
dnfdaemon backtrace

Comment 2 Paul Whalen 2019-09-09 23:13:48 UTC
The updates are applied successfully, but this message is confusing. This also affects all arches. 

Proposing as a blocker for Fedora 31 Beta. Criteria "The installed system must be able appropriately to install, remove, and update software with the default tool for the relevant software type in all release-blocking desktops (e.g. default graphical package manager)."

Comment 3 Adam Williamson 2019-09-12 18:55:55 UTC
Discussed at the Fedora 31 Beta Go/No-Go Meeting #1, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-meeting-1/2019-09-12/f31-beta-go_no_go-meeting.2019-09-12-17.00.txt . Rejected as a blocker as this does not seem to impede the actual functionality at all - the updates are installed, and subsequent use of dnfdragora still works. It's just an unfortunate message.

Comment 4 Adam Williamson 2019-09-30 18:44:01 UTC
Neal, any news here?

Comment 5 Paul Whalen 2019-10-02 20:38:05 UTC
Proposing as a blocker for final, criteria "The installed system must be able appropriately to install, remove, and update software with the default tool for the relevant software type in all release-blocking desktops (e.g. default graphical package manager).". As discussed previously the updates are applied successfully but this error message is confusing and affects many of the Desktop spins. Perhaps the criteria should be extended slightly for final to say "without errors".

Comment 6 Neal Gompa 2019-10-03 18:17:57 UTC
(In reply to Adam Williamson from comment #4)
> Neal, any news here?

I've not yet had a chance to do some digging here. I'm planning to investigate this weekend.

Comment 7 Neal Gompa 2019-10-05 20:57:00 UTC
I've spoken to Daniel Mach about this at the DNF community meeting yesterday. The symptoms appear to point to a problem in libdnf that they've already been investigating. As agreed, I'm reassigning this to libdnf.

Comment 8 Adam Williamson 2019-10-07 17:44:47 UTC
clearing RejectedBlocker (from Beta decision) so we can reconsider this for Final.

Comment 9 Geoffrey Marr 2019-10-08 06:29:47 UTC
Discussed during the 2019-10-07 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made as we did not have enough votes for a clear decision on this in meeting; we will ask for votes in-bug to get a clearer decision.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-10-07/f31-blocker-review.2019-10-07-16.02.txt

Comment 10 Daniel Mach 2019-10-08 11:40:18 UTC
The problem is easily reproducible on Fedora 31:
* dnf upgrade
* dnf downgrade bash
* dnfdragora --ncurses
* mark [x] bash and run upgrade
* error dialogue: "DNF is locked by another process. dnfdragora will exit."

Unfortunately I was unable to reproduce it on Fedora 30 with the *same* codebase.
I used the latest dnf and libdnf from rpmsoftwaremanagement/dnf-nightly COPR.

That makes me think that the problem *might* be elsewhere, but we need to spend more time on discovering the root cause.

Comment 11 Kamil Páral 2019-10-10 16:43:20 UTC
If I understand it correctly from comment 0, dnfdragora (in particular dnfdaemon-system) actually crashes at the end? In that case I believe we can apply
All applications that can be launched using the standard graphical mechanism of a release-blocking desktop after a default installation of that desktop must start successfully and withstand a basic functionality test. 
https://fedoraproject.org/wiki/Fedora_31_Final_Release_Criteria#Default_application_functionality
(even the error message itself can be considered violating the basic functionality)

I'm nervous about allowing a system updater to crash, especially when the root cause is unknown and we're not completely sure about the consequences. I'm rather +1 blocker here.

Comment 12 Adam Williamson 2019-10-10 17:01:36 UTC
dnfdragora is the GUI app, dnfdaemon is a sort of helper daemon that it uses. The user does not observe the GUI application crash. So I don't think that criterion really applies, unless I miss something.

Comment 13 Kevin Fenzi 2019-10-10 17:28:54 UTC
-1 blocker, +1 FE. The updates do finish. The error is messy and looks bad, but doesn't itself break anything.

Comment 14 Ben Cotton 2019-10-10 17:35:30 UTC
I was -1 blocker for beta since it was functional-but-ugly. I'm less okay with that for a final. I'm 0 blocker, +1 FE for now.

Comment 15 Mohan Boddu 2019-10-10 17:41:56 UTC
-1 Blocker but +1 FE since its doesn't break anything but just ugly.

Comment 16 František Zatloukal 2019-10-10 17:45:55 UTC
-1 Blocker, +1 FE

Comment 17 Adam Williamson 2019-10-10 17:57:58 UTC
So we have +1/-4 for blocker, +6 (counting me and kparal as +1) for FE. I kinda would like this to come up in a review meeting where someone can make the case for blocker status, though, so I'm gonna take an executive decision to accept as FE but leave blocker status open to the next meeting, with a presumption that we'll reject it unless the vote changes drastically.

Comment 18 cappellorosso 2019-10-12 12:54:03 UTC
(In reply to Adam Williamson from comment #12)
> The user does not observe the GUI application crash. So I don't think
> that criterion really applies, unless I miss something.

The GUI does crash after closing the error window or clicking ok.

Comment 19 Geoffrey Marr 2019-10-14 19:12:44 UTC
Discussed during the 2019-10-14 blocker review meeting: [0]

The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criterion:

"All applications that can be launched using the standard graphical mechanism of a release-blocking desktop after a default installation of that desktop must start successfully and withstand a basic functionality test"

Note, for dnfdragora, we hold that 'install some updates without printing a confusing error and quitting' is 'basic' functionality for an updater.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-10-14/f31-blocker-review.2019-10-14-16.01.txt

Comment 20 Adam Williamson 2019-10-15 00:04:38 UTC
So, Go/No-Go is on Thursday. This is the last blocker for which we really do not have a resolution at present. Daniel, can you please prioritize this? We really need it fixed ASAP, ideally tomorrow. Thanks a lot!

Comment 21 Adam Williamson 2019-10-15 00:57:20 UTC
Doing a bit of digging into this myself, I managed to at least hack up dnfdragora to log a traceback showing us what codepath we're in here:

2019-10-14 17:53:29,436 [dnfdragora](ERROR) dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36) 
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in <module>
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)

which means that, somehow, shortly after the update operation finishes, we wind up handling a yui.YEvent.WidgetEvent for the 'find button', and that calls self._searchPackages which calls `search()` from dnf_backend.py which is wrapped in the ExceptionHandler decorator, and that's where we're somehow hitting this locking issue.

Comment 22 Adam Williamson 2019-10-15 02:07:08 UTC
OK, did a bit more debugging. I think the error message is misleading us slightly here in two ways.

For a start, this isn't actually any kind of "dnf" lock, AFAICT. It's a lock state for dnfdaemon itself. All it is is an attribute of a DnfDaemonBase instance called `_lock`; basically when a new DnfDaemonBase is initialized it is set to None. Then if the daemon receives a DBus 'Lock' method call, the attribute is set to the id of whatever process sent the call. If the daemon receives a DBus 'Unlock' method call, the attribute is cleared. There's also a 'check_lock()' function that gets called (via 'working_start()') when most other DBus method calls arrive that checks the attribute is set to the ID of the sender, and errors out if not. That's really the whole implementation - AFAICT it's not actually wired up to any of dnf's *own* locks in any way at all, unless I'm missing something.

Secondly, the `check_lock()` function that produces the error "dnf is locked by another application" actually produces that error even if *the lock is not held at all*:

    def check_lock(self, sender):
        """
        Check that the current sender is owning the dnf lock
        :param sender:
        """
        if self._lock == sender:
            return True
        else:
            raise LockedError('dnf is locked by another application')

note that it doesn't do a boolean check on self._lock - it just raises that error if it's not equal to sender. So I hooked up some more logging around all this in dnfdaemon, and found some interesting stuff:

2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None

here are the main app logs from the same time:

2019-10-14 18:53:03,227 [dnfdragora.ui](INFO) dnfdragora started
2019-10-14 18:53:03,401 [dnfdaemon.client](DEBUG) org.baseurl.DnfSystem daemon loaded - version :  2
2019-10-14 18:53:03,401 [dnfdragora.dnf_backend](DEBUG) dnfdaemon api version (2)
2019-10-14 18:53:03,401 [dnfdragora.base](DEBUG) Lock the DNF root daemon
2019-10-14 18:53:07,555 [dnfdragora.base](INFO) Locked the DNF root daemon
2019-10-14 18:53:07,813 [dnfdragora.dnf_backend](DEBUG) get-packages : updates 
2019-10-14 18:53:08,524 [dnfdragora.misc](DEBUG) get_packages took 0.71 sec
2019-10-14 18:53:10,544 [dnfdragora.misc](DEBUG) _make_pkg_object_with_attr took 0.00 sec
2019-10-14 18:53:12,674 [dnfdragora.ui](DEBUG) adding: install rsync,0,3.1.3,10.fc31,x86_64,updates-testing
2019-10-14 18:53:12,680 [dnfdragora.dnf_backend](DEBUG) TransactionEvent : start-build
2019-10-14 18:53:12,735 [dnfdragora.dnf_backend](DEBUG) TransactionEvent : end-build
2019-10-14 18:53:13,563 [dnfdragora.dnf_backend](DEBUG) on_RepoMetaDataProgress (root): ('', 0.0)
2019-10-14 18:53:13,849 [dnfdragora.dnf_backend](DEBUG) on_RepoMetaDataProgress (root): ('', 0.06477154724818276)
2019-10-14 18:53:13,851 [dnfdragora.dnf_backend](DEBUG) on_RepoMetaDataProgress (root): ('', 0.23234683281412252)
2019-10-14 18:53:13,851 [dnfdragora.dnf_backend](DEBUG) on_RepoMetaDataProgress (root): ('', 0.5674974039460021)
2019-10-14 18:53:13,925 [dnfdragora.dnf_backend](DEBUG) on_RepoMetaDataProgress (root): ('', 0.7350726895119418)
2019-10-14 18:53:13,926 [dnfdragora.dnf_backend](DEBUG) on_RepoMetaDataProgress (root): ('', 1.0)
2019-10-14 18:53:13,953 [dnfdragora.dnf_backend](DEBUG) Downloaded : rsync-3.1.3-10.fc31.x86_64.rpm
2019-10-14 18:53:14,711 [dnfdragora.dnf_backend](DEBUG) on_RPMProgress : [rsync,0,3.1.3,10.fc31,x86_64,updates-testing]
2019-10-14 18:53:14,845 [dnfdragora.dnf_backend](DEBUG) on_RPMProgress : [rsync,0,3.1.3,9.fc31,x86_64,@System]
2019-10-14 18:53:15,894 [dnfdragora.dnf_backend](DEBUG) on_RPMProgress : [rsync,0,3.1.3,10.fc31,x86_64,updates-testing]
2019-10-14 18:53:15,894 [dnfdragora.dnf_backend](DEBUG) on_RPMProgress : [rsync,0,3.1.3,9.fc31,x86_64,@System]
2019-10-14 18:53:16,517 [dnfdragora](ERROR) dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in <module>
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)

Also note that there is no call to dnfdaemon's Unlock method - I wired up logging for that too, and it does *not* fire between the last call to check_lock() where self._lock came up as ":1.1835" (at 18:53:13) and the call where self._lock came up as "None" (at 18:53:16). No. As we can see, what happens instead is somehow *we get a whole new DnfDaemon instance* at 18:53:16.513900 and the `search()` runs almost instantly, *without* any call to `Lock()` happening in between...so we get the "dnf is locked by another application" error not because the new daemon instance is locked by another application but because the new daemon instance *isn't locked at all* (self._lock for the new DnfDaemon is None, because Lock has not happened).

I don't know *why* this is happening yet, though.

Comment 23 Adam Williamson 2019-10-15 02:11:26 UTC
heyyyy, okay, this is making a bit more sense now. What happens is dnfdaemon *dies* at 18:53:15 and gets automatically restarted by systemd:

Oct 14 18:53:03 adam.happyassassin.net systemd[1]: Starting Package management dnf daemon...
Oct 14 18:53:03 adam.happyassassin.net systemd[1]: Started Package management dnf daemon.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]:   D-Bus not built with -rdynamic so unable to print a backtrace
Oct 14 18:53:16 adam.happyassassin.net systemd[1]: dnfdaemon.service: Main process exited, code=dumped, status=6/ABRT
Oct 14 18:53:16 adam.happyassassin.net systemd[1]: dnfdaemon.service: Failed with result 'core-dump'.
Oct 14 18:53:16 adam.happyassassin.net systemd[1]: dnfdaemon.service: Consumed 7.153s CPU time.
Oct 14 18:53:16 adam.happyassassin.net systemd[1]: Starting Package management dnf daemon...
Oct 14 18:53:16 adam.happyassassin.net systemd[1]: Started Package management dnf daemon.

so dnfdragora winds up sending its Search dbus method call to the *new* dnfdaemon instance, which it has not yet locked, and then we see the error in dnfdragora. But really the problem here is what's causing the daemon to drop dead a second earlier in the first place. That's what we need to find out and fix.

Comment 24 Adam Williamson 2019-10-15 05:12:39 UTC
I rebuilt dbus with --enable-asserts (so we get a traceback) and also set dnfdaemon to -v -d, and that gets us this:

Oct 14 22:09:13 adam.happyassassin.net dnfdaemon-system[11383]: 22:09:13: setting max_err to : 1
Oct 14 22:09:14 adam.happyassassin.net dnfdaemon-system[11383]: 22:09:14: download status : None - rsync-3.1.3-10.fc31.x86_64.rpm
Oct 14 22:09:14 adam.happyassassin.net dnfdaemon-system[11383]: 22:09:14: checking signature for : rsync-3.1.3-10.fc31.x86_64, 0
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]: dbus[11383]: assertion failed "connection->generation == _dbus_current_generation" file "../../dbus/dbus-connection.c" line 1424 function _dbus_connection_ref_unlocked
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(_dbus_print_backtrace+0x36) [0x7fb1a41be8f6]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(_dbus_abort+0xf) [0x7fb1a41c20af]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(_dbus_real_assert+0x1f) [0x7fb1a41afa9f]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(_dbus_connection_ref_unlocked+0x67) [0x7fb1a4182307]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(+0x1c23f) [0x7fb1a418423f]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(+0x1c4d6) [0x7fb1a41844d6]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(+0x1c849) [0x7fb1a4184849]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(+0x1d455) [0x7fb1a4185455]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(+0x1d522) [0x7fb1a4185522]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/lib64/python3.7/site-packages/_dbus_bindings.so(+0xb17f) [0x7fb1a423317f]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyMethodDef_RawFastCallKeywords+0x25e) [0x7fb1b48b939e]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyMethodDescr_FastCallKeywords+0x4d) [0x7fb1b48eb40d]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x14059b) [0x7fb1b48eb59b]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0xbfd) [0x7fb1b4921fbd]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7fb1b48d9120]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallKeywords+0x2a2) [0x7fb1b48da122]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x1404bf) [0x7fb1b48eb4bf]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0xbfd) [0x7fb1b4921fbd]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7fb1b48d9120]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallKeywords+0x2a2) [0x7fb1b48da122]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x1404bf) [0x7fb1b48eb4bf]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0xbfd) [0x7fb1b4921fbd]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7fb1b48d9120]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallDict+0x2ee) [0x7fb1b48da7be]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x230c) [0x7fb1b49236cc]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7fb1b48d9120]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallDict+0x2ee) [0x7fb1b48da7be]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x230c) [0x7fb1b49236cc]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7fb1b48d9120]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallDict+0x2ee) [0x7fb1b48da7be]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyObject_Call_Prepend+0x66) [0x7fb1b48aeed6]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyObject_FastCallDict+0xbd) [0x7fb1b48aa9fd]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x10d07f) [0x7fb1b48b807f]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(PyObject_CallFunctionObjArgs+0x99) [0x7fb1b48b82b9]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/lib64/python3.7/site-packages/_dbus_bindings.so(+0xa81f) [0x7fb1a423281f]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/lib64/python3.7/site-packages/_dbus_bindings.so(+0xb45d) [0x7fb1a423345d]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(+0x3808a) [0x7fb1a41a008a]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libdbus-1.so.3(dbus_connection_dispatch+0xa77) [0x7fb1a4188be7]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/lib64/python3.7/site-packages/_dbus_glib_bindings.so(+0x2f09) [0x7fb1a4211f09]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libglib-2.0.so.0(g_main_context_dispatch+0x160) [0x7fb1a66d34a0]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libglib-2.0.so.0(+0x51830) [0x7fb1a66d3830]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libglib-2.0.so.0(g_main_loop_run+0x83) [0x7fb1a66d3b23]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7fb1a5d4aaa8]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libffi.so.6(ffi_call+0x1d4) [0x7fb1a5d4a2a4]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/lib64/python3.7/site-packages/gi/_gi.cpython-37m-x86_64-linux-gnu.so(+0x30c35) [0x7fb1a48a7c35]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/lib64/python3.7/site-packages/gi/_gi.cpython-37m-x86_64-linux-gnu.so(+0x3276c) [0x7fb1a48a976c]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/lib64/python3.7/site-packages/gi/_gi.cpython-37m-x86_64-linux-gnu.so(+0x2330b) [0x7fb1a489a30b]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyObject_FastCallKeywords+0x4bc) [0x7fb1b48ea39c]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x140649) [0x7fb1b48eb649]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x5592) [0x7fb1b4926952]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7fb1b48d9120]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallKeywords+0x2a2) [0x7fb1b48da122]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x1404bf) [0x7fb1b48eb4bf]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0xbfd) [0x7fb1b4921fbd]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallKeywords+0x102) [0x7fb1b48d9f82]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x1404bf) [0x7fb1b48eb4bf]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0xbfd) [0x7fb1b4921fbd]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallKeywords+0x102) [0x7fb1b48d9f82]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x1404bf) [0x7fb1b48eb4bf]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0xb58) [0x7fb1b4921f18]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7fb1b48d9120]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(PyEval_EvalCodeEx+0x39) [0x7fb1b48d9e79]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(PyEval_EvalCode+0x1b) [0x7fb1b4968b5b]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x1fea83) [0x7fb1b49a9a83]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(PyRun_FileExFlags+0x97) [0x7fb1b49a9fd7]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(PyRun_SimpleFileExFlags+0x19a) [0x7fb1b49b04ca]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(+0x207251) [0x7fb1b49b2251]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libpython3.7m.so.1.0(_Py_UnixMain+0x3c) [0x7fb1b49b23fc]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fb1b4b1d1a3]
Oct 14 22:09:16 adam.happyassassin.net dnfdaemon-system[11383]:   /usr/bin/python3(_start+0x2e) [0x55734166808e]

Still, there's two seconds between the last log message from dnfdaemon-system and when it actually blows up; per the dnfdragora logs I think it's got well past the signature check by that point, so I still don't know exactly what it's doing when it dies :/

Comment 25 Adam Williamson 2019-10-15 05:30:57 UTC
Created attachment 1625859 [details]
better backtrace (with udev asserts enabled)

I got a better backtrace with the new udev build, here it is. Based on this and some debug logging, I believe we crash in this bit of dnfdaemon/server/__init__.py DnfDaemonBase.run_transaction():

        self._can_quit = True
        self._reset_base()
        self.TransactionEvent('end-run', NONE)            <---- HERE
        result = json.dumps((rc, msgs))
        return result

we crash exactly during the `self.TransactionEvent('end-run', NONE)` call, I believe.

Comment 26 Adam Williamson 2019-10-15 06:29:18 UTC
So, after playing around with this a bit more, I don't yet have a convincing explanation for *why*, but I think what's happening here is that dnfdaemon will crash as soon as it tries to send any dbus signal after DnfDaemonBase._reset_base() is called. If I move the self._reset_base() call in run_transaction() to the end, the `self.TransactionEvent('end-run', NONE)` works OK, and we crash as soon as something *else* tries to send a signal. If I comment that out entirely, we don't crash until after an Unlock event has run - which is another thing that calls `_reset_base()`.

`_reset_base()` only does this:

    def _reset_base(self):
        """Close the current dnf.Base object."""
        if self._base:
            self._base.close()
            self._base = None

so, now the question is, why is doing that causing problems for subsequent dbus stuff?

Comment 27 Adam Williamson 2019-10-15 06:54:53 UTC
I think the next step is gonna have to be bisecting libdnf, but just to add one more nugget, it seems to be the `self._base.close()` that's the problem, not the `self._base = None`. Just commenting that out "fixes the bug", except of course it's wrong. We need to know why closing the dnf.Base object somehow breaks dnfdaemon's subsequent dbus interactions.

Comment 28 Adam Williamson 2019-10-15 22:17:52 UTC
Narrowed it down some more: the thing that seems to cause the problem is the call to `self._closeRpmDB()` in dnf `Base._finalize_base()`, which is called by `Base.close()`. If I change dnfdaemon's `_reset_base()` implementation to call `self._base._closeRpmDB()` instead of `self._base.close()`, that's enough to cause the problem to keep happening. If I instead replace the call to `self._base.close()` with, say, `self._base.reset(True, True, True)` (the other half of what `close()` does), or `self._base._store_persistent_data()` (another thing that `_finalize_base()` does), that's not enough to make the problem happen - but just calling `_closeRpmDB()` is enough to make it happen.

Comment 29 Adam Williamson 2019-10-15 22:56:25 UTC
OK, so have dnfdaemon's `_reset_base()` just do this:

    def _reset_base(self):
        if self._base:
            self._base._priv_ts.ts = None
            self._base = None

is enough to trigger the problem. That's one of the things that dnf.rpm.transaction.TransactionWrapper.close() does, and that is the thing that ultimately happens when `Base._closeRpmDB()` is called. It also calls `closeDB()` on the ts object and does `self.open = False`, but those aren't needed to trigger the bug - it's the setting of the wrapper's `ts` attribute to `None` that causes the problem, somehow.

Comment 30 Adam Williamson 2019-10-15 23:21:43 UTC
OOOOH! Big news.

I've found the culprit, I think: rpm-plugin-systemd-inhibit . This is an rpm plugin (rpm has plugins? who knew) which tries to inhibit systemd from shutting down or suspending the system while a transaction is occurring. It does this...via dbus. It is recommended by python3-dnf, so it is installed by default basically always.

If I remove that plugin, the bug doesn't happen. Install it again, the bug happens.

The code for the plugin has a cleanup function which calls dbus_shutdown():

static void systemd_inhibit_cleanup(rpmPlugin plugin)
{
    dbus_shutdown();
}

which, right now, is my suspect. My best theory as of right now: the Base's `_priv_ts.ts` is an rpm.TransactionSet instance. I think when it goes out of scope - when we drop our last/only reference to it, in dnf - rpm's code winds up calling the cleanup function, and the call to dbus_shutdown() happens. dbus_shutdown() is documented thus:

https://dbus.freedesktop.org/doc/api/html/group__DBusMemory.html#ga01912903e39428872920d861ef565bac

"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().

libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.

...

You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."

so, it's a pretty big cannon. And remember, we're all in a single process here - the dnfdaemon process. So I think this dbus_shutdown() call in the rpm plugin, which runs in that same process, winds up nerfing dnfdaemon's *own* dbus usage.

Comment 32 Fedora Update System 2019-10-16 00:58:37 UTC
FEDORA-2019-5caeccb98d has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-5caeccb98d

Comment 33 Panu Matilainen 2019-10-16 06:30:39 UTC
(In reply to Adam Williamson from comment #30)
> 
> You can't continue to use any D-Bus objects, such as connections, that were
> allocated prior to dbus_shutdown(). You can, however, start over; call
> dbus_threads_init() again, create new connections, and so forth."
> 
> so, it's a pretty big cannon. And remember, we're all in a single process
> here - the dnfdaemon process. So I think this dbus_shutdown() call in the
> rpm plugin, which runs in that same process, winds up nerfing dnfdaemon's
> *own* dbus usage.

Right, I only thought about shutting down the handle *we* have but I admit the thought of another dbus user in the same process never crossed my mind. So either we need to figure a way to only call shutdown if we're the only user, or revert the change. I doubt the former is possible at all.

Comment 35 Kamil Páral 2019-10-16 12:02:21 UTC
(In reply to Fedora Update System from comment #32)
> FEDORA-2019-5caeccb98d has been submitted as an update to Fedora 31.
> https://bodhi.fedoraproject.org/updates/FEDORA-2019-5caeccb98d

Fixes the problem in dnfdragora.

Comment 36 Adam Williamson 2019-10-16 14:47:09 UTC
"Right, I only thought about shutting down the handle *we* have but I admit the thought of another dbus user in the same process never crossed my mind."

Sure, it's a tricky problem to foresee - I'm sure if I was reviewing your PR I wouldn't have thought about it either! I hope I didn't come off as suggesting you should've foreseen this, I just like saying "j'accuse" :)

Comment 37 Panu Matilainen 2019-10-16 14:52:15 UTC
(In reply to Adam Williamson from comment #36)
> Sure, it's a tricky problem to foresee - I'm sure if I was reviewing your PR
> I wouldn't have thought about it either! I hope I didn't come off as
> suggesting you should've foreseen this, I just like saying "j'accuse" :)

Oh no, not at all. No worries :)

Comment 38 Fedora Update System 2019-10-16 15:00:43 UTC
rpm-4.15.0-2.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-5caeccb98d

Comment 39 Adam Williamson 2019-10-16 19:30:51 UTC
we don't need to track this as *both* an FE *and* a blocker. blocker beats FE.

Comment 40 Fedora Update System 2019-10-17 21:38:22 UTC
rpm-4.15.0-2.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

Comment 41 Woi 2019-10-27 10:18:02 UTC
Is it possible to have a build for FC30, too please?
Reasons are bug 1667903 and bug 1751196 and there might be quite some more: https://bugzilla.redhat.com/buglist.cgi?component=dnfdragora&list_id=10596445&product=Fedora

I also like to point to my upstream report:
https://github.com/manatools/dnfdragora/issues/140

Comment 42 Adam Williamson 2019-10-28 00:25:15 UTC
This bug has nothing to do with either of those reports, and does not affect F30. Note the bug here was actually in rpm, not dnfdragora, and we fixed it in rpm. F30's rpm does not have that bug - F30 uses RPM 4.14, the bug was introduced in 4.15.

Comment 43 Panu Matilainen 2020-01-10 08:30:15 UTC
*** Bug 1761620 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.