Created attachment 1223520 [details]
gnome-software stuck forever at its loading screen

Description of problem:
gnome-software hangs at load because the PackageKit daemon is crashing.

Version-Release number of selected component (if applicable):
Fedora 25

How reproducible:
100% reproducible for me. I was using gnome-software normally for two days, but starting today this is happening 100% of the time. Not sure what changed to trigger this.

Steps to Reproduce:
1. Check to make sure that the PackageKit service is running:

$ systemctl status packagekit.service -l
● packagekit.service - PackageKit Daemon
   Loaded: loaded (/usr/lib/systemd/system/packagekit.service; static; vendor preset: disabled)
   Active: active (running) since Wed 2016-11-23 22:14:04 MST; 3min 26s ago
 Main PID: 18290 (packagekitd)
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/packagekit.service
           └─18290 /usr/libexec/packagekitd

Nov 23 22:14:04 spectre systemd[1]: Starting PackageKit Daemon...
Nov 23 22:14:04 spectre PackageKit[18290]: daemon start
Nov 23 22:14:04 spectre systemd[1]: Started PackageKit Daemon.
Nov 23 22:14:04 spectre PackageKit[18290]: uid 1000 is trying to obtain org.freedesktop.packagekit.system-sources-ref
Nov 23 22:14:04 spectre PackageKit[18290]: uid 1000 obtained auth for org.freedesktop.packagekit.system-sources-refre
Nov 23 22:14:04 spectre packagekitd[18290]: BDB2053 Freeing read locks for locker 0x13: 18059/140196001058112
Nov 23 22:14:04 spectre packagekitd[18290]: BDB2053 Freeing read locks for locker 0x15: 18059/140196001058112
Nov 23 22:14:23 spectre PackageKit[18290]: refresh-cache transaction /6_ecabedbb from uid 1000 finished with success

Cool, looks like it's running.

2. Open gnome-software

Actual results:
gnome-software gets stuck at "Software catalog is being loaded". If you check on the PackageKit service, you'll see that it has crashed:

$ systemctl status packagekit.service -l
● packagekit.service - PackageKit Daemon
   Loaded: loaded (/usr/lib/systemd/system/packagekit.service; static; vendor preset: disabled)
   Active: failed (Result: signal) since Wed 2016-11-23 22:30:04 MST; 4s ago
  Process: 18604 ExecStart=/usr/libexec/packagekitd (code=killed, signal=ABRT)
 Main PID: 18604 (code=killed, signal=ABRT)

Nov 23 22:30:03 spectre packagekitd[18604]: 7fea25fdd000-7fea25fde000 r--p 00025000 fd:00 4203032
Nov 23 22:30:03 spectre packagekitd[18604]: 7fea25fde000-7fea25fdf000 rw-p 00026000 fd:00 4203032
Nov 23 22:30:03 spectre packagekitd[18604]: 7fea25fdf000-7fea25fe0000 rw-p 00000000 00:00 0
Nov 23 22:30:03 spectre packagekitd[18604]: 7ffeaef59000-7ffeaef7a000 rw-p 00000000 00:00 0
Nov 23 22:30:03 spectre packagekitd[18604]: 7ffeaefcb000-7ffeaefcd000 r--p 00000000 00:00 0
Nov 23 22:30:03 spectre packagekitd[18604]: 7ffeaefcd000-7ffeaefcf000 r-xp 00000000 00:00 0
Nov 23 22:30:03 spectre packagekitd[18604]: ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
Nov 23 22:30:04 spectre systemd[1]: packagekit.service: Main process exited, code=killed, status=6/ABRT
Nov 23 22:30:04 spectre systemd[1]: packagekit.service: Unit entered failed state.
Nov 23 22:30:04 spectre systemd[1]: packagekit.service: Failed with result 'signal'.

Expected results:
gnome-software should load properly and allow me to install software, and PackageKit should not crash.

Additional info:
Happy to collect more info, since this is 100% reproducible for me and seemingly not going away. Hardware is a 2016 HP Spectre x360 (Kaby Lake).
Created attachment 1224028 [details] Backtrace
Created attachment 1224057 [details] Debugged the crash with gdb
This same issue has also been reported here: https://bugs.freedesktop.org/show_bug.cgi?id=99083

I can confirm that the same thing happens to me from the CLI:

$ pkcon install Thunar
Resolving [=========================]
Testing changes [ ] (0%)
The daemon crashed mid-transaction!
Not fixed with PackageKit-1.1.5-0.1.20161221.fc25:

$ pkcon install Thunar
Resolving [=========================]
Testing changes [= ] (5%)
The daemon crashed mid-transaction!
Friends, I am also encountering the freeze of gnome-software. Can anyone suggest a solution?
Actually:

malloc(): memory corruption

Richard, any idea what caused this?
(In reply to Igor Gnatenko from comment #6)
> malloc(): memory corruption
> Richard, any idea what caused this?

Without the logs of packagekitd running under valgrind, no.
Created attachment 1239519 [details]
packagekitd running under valgrind

Here's the output of packagekitd running under valgrind, as requested. Looks like this is the issue:

valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 101856, hi = 109289737790854.
This is probably caused by your program erroneously writing past the end of a heap block and corrupting heap metadata. If you fix any invalid writes reported by Memcheck, this assertion failure will probably go away. Please try that before reporting this as a bug.

Not sure if "fix[ing] any invalid writes reported by Memcheck" is something I would do or something you would do.
It's the same for me. I'm reporting every PackageKit crash through ABRT because PackageKit is not usable at all. It worked for 3 days or so, and since then the PackageKit daemon crashes on every operation in GNOME Software (and in the shell too). I can't search for updates, or install or remove packages from repositories. For me it's not a big deal because I mostly use dnf. But for users who aren't at home in the shell it's a big problem. I freshly reinstalled my system (because removing the configs and uninstalling and reinstalling PackageKit + GNOME Software didn't help), but after some time the problems occurred again.
Richard, is there anything else you need from me on this? I'm happy to provide any information that would be useful. Is this not the right place for this bug? Should I file another one at freedesktop.org?
Looking at the valgrind log from comment #8, it looks like it's libsolv doing an out of bounds write in repo_write/traverse_dirs.
Can you please attach a core file?
Created attachment 1244511 [details] packagekitd coredump
That seems to be a coredump from 'vim'...
Thanks for looking at this, Michael. If it helps, there are a few ABRT filed bug reports with crashes in the same functions and they all have good backtraces, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1405832 and https://bugzilla.redhat.com/show_bug.cgi?id=1404468 (look at the "backtrace" attachment).
I need a core so I can access the data to reproduce this.
Looks like I attached the wrong core. Trying again...
The right one was too big to upload to this bug report, so you can download it from here: hommelscitadel.com/wp-content/uploads/2017/01/coredump
Doesn't work for me. 'file coredump' says "error reading (Invalid argument)" and gdb prints '"coredump" is not a core dump: File format not recognized'...
Created attachment 1245164 [details] packagekitd core Okay, looks like something got corrupted during the upload or something. I'm re-uploading it in an archive.
If I download and un-archive the core, it looks good to me now:

$ file '/home/nate/coredump'
/home/nate/coredump: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/libexec/packagekitd', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/libexec/packagekitd', platform: 'x86_64'

Thanks for your patience, Michael!
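For anyone else uploading a core to this bug, a rough pre-upload check along the lines of what `file` reports could look like the sketch below. The function name `is_elf_core` is made up for illustration; it only inspects the ELF header (the magic bytes and the `e_type` field, where `ET_CORE` is 4), so a file truncated after the header would still pass.

```shell
#!/bin/sh
# Illustrative helper, not part of any existing tool.
# is_elf_core FILE succeeds if FILE starts with the ELF magic and its
# e_type header field says ET_CORE (4).
is_elf_core() {
    f=$1
    # The first four bytes of any ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$f" 2>/dev/null | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || return 1
    # e_type is the 16-bit field at offset 16. Its byte order depends on
    # the file, so accept either encoding of the value 4.
    etype=$(od -An -tx1 -j16 -N2 "$f" 2>/dev/null | tr -d ' \n')
    [ "$etype" = "0400" ] || [ "$etype" = "0004" ]
}
```

Something like `is_elf_core ./coredump && echo "looks like a core"` would be a quick first check before and after uploading; `file` and gdb remain the real test.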
Ok, from looking at the core I found that the directory data for the "@System" repo is bad. There's a bogus NULL entry in it which trips the logic in the repository write code. There are not many places in the code where directories are added; I checked them all and I don't see how this could have happened.

So here's a little test: PackageKit should store a file called @System.solv somewhere in /var/cache. Do you still see coredumps if you remove this file so that PackageKit needs to rebuild its cache?
Removing /var/cache/PackageKit/25/hawkey/@System.solv fixed the crashes for me :)
Good to hear! Now I just need to find out what caused the bad data in the first place...
There's also a lot of temp-file-looking things in there:

$ find /var/cache/PackageKit/25/hawkey/ | grep -i System | wc -l
120

$ find /var/cache/PackageKit/25/hawkey/ | grep -i System | head -n 5
/var/cache/PackageKit/25/hawkey/@System.solv.NndZva
/var/cache/PackageKit/25/hawkey/@System.solv
/var/cache/PackageKit/25/hawkey/@System.solv.KWL0wF
/var/cache/PackageKit/25/hawkey/@System.solv.Y05sdi
/var/cache/PackageKit/25/hawkey/@System.solv.TtcMrX

I can also confirm that deleting /var/cache/PackageKit/25/hawkey/@System.solv resolves the issue of packagekitd coredumping. In addition to figuring out how the bad data got in there, perhaps packagekitd should better sanitize its inputs.
Those are leftovers from the crashes. When writing a cache file, it first writes into a tmp file and then renames it to the final destination. The writing part was where packagekit crashed.
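The write-then-rename pattern described here can be sketched in shell as follows. This only illustrates the general technique and is not PackageKit's or libsolv's actual code; the function name `atomic_write` and the `.XXXXXX` suffix are assumptions chosen to resemble the file names listed earlier.

```shell
#!/bin/sh
# Illustrative sketch only, not the real PackageKit/libsolv implementation.
# atomic_write DEST reads data from stdin, writes it to a temporary file
# next to DEST, and renames it into place only if the write succeeded.
atomic_write() {
    dest=$1
    # Create the temp file in the same directory as DEST so the final
    # rename stays on one filesystem.
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    if cat > "$tmp"; then
        # Readers see either the old file or the complete new one.
        mv -f "$tmp" "$dest"
    else
        # The write failed: drop the partial file and report the error.
        rm -f "$tmp"
        return 1
    fi
}
```

If the writing process is killed between the `mktemp` and the `mv` (for example by SIGABRT), neither cleanup branch runs and the temporary file stays behind, which would be consistent with the many `@System.solv.XXXXXX` files reported in this bug.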
mls, I wonder if there's a good workaround we could do in PackageKit to deal with existing installations that have broken @System.solv? Maybe unlink it every time packagekitd starts, just to be on the safe side? Or are the libsolv side fixes sufficient here? (I noticed https://github.com/openSUSE/libsolv/commit/3a8f2216aeec9126968ea3d99872f839548a6d65 which I guess is for this crash?)
Removing @System.solv fixes this on my multi-upgraded (23 -> 24 -> 25) system. Thanks for posting the work-around, I never would have figured that out on my own.
FWIW I'm running a fresh 25 install.
Ok, I've got a theory to prove. Nate, can you please create an attachment with the /var/cache/PackageKit/25/hawkey/@System.solv and /var/lib/rpm/Packages files?
Available at http://hommelscitadel.com/wp-content/uploads/2017/requested_files.tar.xz since it was too large to attach here. Note: the @System.solv I've included there is the one that works, not the bad broken one (it was deleted yesterday).
Thanks! The bad package leading to the corrupt entry seems to be gone now, though.

Anyway, I made a couple of changes to libsolv:

1) It will now reject solv files that have bad directory entries. This will make PackageKit rebuild the cache, so this problem should be gone. (commit 64ea54c31ec396531faac2a86b5c0c1c056b59b2)
2) I added a guard so that illegal directory entries can no longer be added. (commit 3a8f2216aeec9126968ea3d99872f839548a6d65)

I also cleaned up some of the code a bit, but that should not make a difference.

If somebody still has a @System.solv file that leads to a crash, could you please attach it to this bug? I'm still a little bit nervous because I couldn't find the real root cause of the illegal directory entry.

Thanks!
Awesome, thanks for the fixes, Michael! Do you have plans for an upstream libsolv release or should we backport patches?
I just released libsolv version 0.6.25 which contains those fixes.
libsolv-0.6.25-1.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2017-2810889d00
libsolv-0.6.25-1.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2017-8bfcc055a1
Excellent!
Thanks mls!
libsolv-0.6.25-1.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-8bfcc055a1
libsolv-0.6.25-1.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-2810889d00
libsolv-0.6.25-1.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.
libsolv-0.6.25-1.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
It looks like I'm seeing the same problem. libsolv-0.6.25-1.fc25 was supposed to cure it; my system is running libsolv-0.6.26-1.fc25 but the problem persists. Here's a brief bash log:

$ pkcon get-updates
Getting updates [=========================]
Querying [ == ]
The daemon crashed mid-transaction!

$ rpm -q libsolv
libsolv-0.6.26-1.fc25.x86_64
Jonathan, can you attach your /var/cache/PackageKit/25/hawkey/@System.solv file?
Created attachment 1259702 [details] /var/cache/PackageKit/25/hawkey/@System.solv As requested by Nate Graham in Comment #44 BTW: What kind of file is this? What is its function?
I'm re-opening this for investigation, since we now have a case in the wild with the supposedly-fixed libsolv.
Created attachment 1263236 [details] /var/cache/PackageKit/25/hawkey/@System.solv I'm experiencing the same issue as Jonathan, also with the exact same bash output. Attaching my @System.solv file, maybe it'll help?
Confirmed here. I started experiencing the problems when I was installing codecs from the "Add-ons" category. Now I can't install anything via the GUI. Like the previous posters, I get:

$ pkcon get-updates
Getting updates [=========================]
Querying [ == ]
The daemon crashed mid-transaction!

$ rpm -q libsolv
libsolv-0.6.26-1.fc25.x86_64

Anyone have any suggestions on how to get this working at least temporarily?
(In reply to Patrick R from comment #48)
> Anyone have any suggestions on how to get this working at least temporarily?

Hi Patrick,

You can work around this by removing any of these files:
/var/cache/PackageKit/25/hawkey/@System.solv*

*BUT* before you do that, please consider just moving them into /tmp and attaching them to this ticket for further investigation - I remember some of the devs were looking for examples of these files to inspect and find commonalities.
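A small sketch of the "move, don't delete" variant of this workaround is below. The function name `quarantine_solv` and its two arguments are made up for illustration and are not part of PackageKit; it simply moves every `@System.solv*` file out of a cache directory so the files can still be attached here afterwards.

```shell
#!/bin/sh
# Sketch of the workaround; the function name and arguments are
# illustrative, not part of PackageKit.
# quarantine_solv CACHE_DIR DEST_DIR moves every @System.solv* file out of
# CACHE_DIR into DEST_DIR and prints how many files were moved.
quarantine_solv() {
    cache_dir=$1
    dest_dir=$2
    moved=0
    mkdir -p "$dest_dir" || return 1
    for f in "$cache_dir"/@System.solv*; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        mv "$f" "$dest_dir"/ || return 1
        moved=$((moved + 1))
    done
    echo "$moved"
}
```

On an affected Fedora 25 system this would be run as root with the path from this bug, for example `quarantine_solv /var/cache/PackageKit/25/hawkey /tmp/solv-backup` (the destination directory is an arbitrary choice). Other cache files in the directory are left alone.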
Created attachment 1272693 [details] @System.solv
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.