Bug 1849250 - 'dnf upgrade' crashes with Bus error
Summary: 'dnf upgrade' crashes with Bus error
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: dnf
Version: 32
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Lukáš Hrázký
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-20 01:43 UTC by Kevin Donovan
Modified: 2020-06-24 10:50 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-06-24 10:50:07 UTC
Type: Bug
Embargoed:



Description Kevin Donovan 2020-06-20 01:43:32 UTC
Description of problem:
For about the last week, I cannot upgrade packages.  When I run 'dnf upgrade', I get an error message:

Bus error (core dumped)m Cisco)

Version-Release number of selected component (if applicable):
4.2.23

How reproducible:
It happens every time.

Steps to Reproduce:
1. Login as root.
2. Enter the command: 'dnf upgrade'.

Actual results:
Crashes, as described above.

Expected results:
Upgrade packages.

Additional info:
This is a strange error.  It appears that no one else has reported it, so there must be something wrong at my end, but I don't know what it would be.

It happens on Fedora 32.  I also have Fedora Silverblue (32), installed on a different disk partition.  I run a toolbox under Silverblue, and there's no problem with using dnf.  I compared the dnf (points to dnf-3) and libdnf files between the Silverblue toolbox and the Fedora 32 installations, and they are identical.

I indicated that the severity was high because dnf is important, but if it's just me, then I can of course re-install Fedora.  But I thought it was worth reporting.

Comment 1 Lukáš Hrázký 2020-06-22 12:01:49 UTC
Hello Kevin, we've had at least one report of a dbus error before, but this looks different: https://bugzilla.redhat.com/show_bug.cgi?id=1753439

First off, how do you know it's actually a dbus error? Are you getting more output than just the "Bus error (core dumped)m Cisco)"? How is Cisco related? Do you have any Cisco hardware or software?

Do you use any dnf plugins that would use dbus, like snapper or tracer? dnf doesn't use dbus itself. You can try using e.g. --disableplugin="*" to rule those out and if that works, you can use e.g. --disableplugin="*" --enableplugin="tracer" on the plugins one by one to find the faulty plugin.

From the error message it also seems that some process (the dbus one?) is dumping core. You can use coredumpctl to list and inspect the dumps; a traceback might be of some interest.

Comment 2 Kevin Donovan 2020-06-22 12:38:58 UTC
I don't know if it's a dbus error.  For some reason, I read the error message as 'dbus error' instead of 'bus error'.  I'm sorry for the confusion.

There's a simple explanation for the 'Cisco', and I have a little more information.  My directory /etc/yum.repos.d contains 15 .repo files.  The first of them, in alphabetical order, is fedora-cisco-openh264.repo.  It appears that this is the first repo which it checks for updates, and it almost immediately crashes with the error message.

Seven of the repositories are fedora repositories.  All of the fedora repositories trigger the bug.  The other repositories do not.  For example, if I try:

    dnf upgrade --repo=rpmfusion-free --refresh

then the command runs successfully.  When I tried it, there was nothing to upgrade, but it appeared to check for updates.
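The per-repository check above can be repeated for every repository in turn. The following is a minimal sketch, not something run in this report: the function names are hypothetical, and it assumes that refreshing metadata with `dnf makecache` is enough to trigger the crash. A shell reports a process killed by signal N as exit status 128+N, so a SIGBUS (signal 7 on x86_64 Linux) shows up as 135.

```sh
# describe_status STATUS: turn a shell exit status into a short label.
# Statuses above 128 mean "killed by signal (STATUS - 128)".
describe_status() (
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "killed by SIG$(kill -l "$((status - 128))")"
    else
        echo "exit $status"
    fi
)

# probe_repos REPO...: refresh each repository on its own and print
# one result line per repository id (hypothetical helper).
probe_repos() (
    for repo in "$@"; do
        dnf makecache --repo="$repo" --refresh >/dev/null 2>&1
        status=$?
        printf '%-32s %s\n' "$repo" "$(describe_status "$status")"
    done
)
```

For example, `probe_repos fedora updates rpmfusion-free` would print one line per repository id; the ids here are only examples and should be replaced with the ones from /etc/yum.repos.d.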

I thought the problem might be with corrupted .repo files, so I checked.  I compared all the fedora repo files against copies from a working installation, and the files were identical.  I also checked the key files.  More precisely, I looked at the three files in /etc/pki/rpm-gpg with filenames ending in 'fedora-32-primary'.  They also were identical to the files on a working installation.

The dnf search function appears to work properly.

Let me know if you need more information.

Thanks.

Kevin

Comment 3 Lukáš Hrázký 2020-06-22 13:45:15 UTC
Oh, I see, the error message has probably overwritten the name of the Cisco repository:

Fedora 32 openh264 (From Cisco)
Bus error (core dumped)m Cisco)
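The overwrite can be reproduced without a terminal. This is a small illustrative sketch (the `overlay` function is hypothetical) that prints what remains on a line when a shorter message is written over a longer one from column 0 without clearing the rest of the line:

```sh
# overlay OLD NEW: print the line a terminal shows after NEW is written
# over OLD starting at the first column.
overlay() (
    old=$1
    new=$2
    # Keep the tail of OLD that NEW is too short to cover.
    rest=$(printf '%s' "$old" | cut -c "$(( ${#new} + 1 ))-")
    printf '%s%s\n' "$new" "$rest"
)

overlay 'Fedora 32 openh264 (From Cisco)' 'Bus error (core dumped)'
# prints: Bus error (core dumped)m Cisco)
```

The 23-character error message covers only the first 23 characters of the 31-character repository name, which leaves the trailing "m Cisco)" visible.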

The Bus error is actually the process receiving the SIGBUS signal: https://en.wikipedia.org/wiki/Bus_error

Nothing to do with dbus (I'm changing the bug title).

Did you check your coredumpctl for core dumps? Please do, you should have several by now. You can:

coredumpctl list  # to list the core dumps

coredumpctl info -1  # to show information about the last coredump

coredumpctl debug -1  # to run gdb (debugger) on the last core dump, typing `bt full` in gdb will give you the backtrace; you may need to install debuginfo packages to get the symbols listed (instead of seeing only "??"), gdb on fedora automatically suggests which packages to install

coredumpctl dump -1 -o coredump  # to dump the coredump into a file
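If an interactive gdb session is inconvenient, the same backtrace can probably be captured in one go. The sketch below is an assumption-laden helper, not part of dnf or systemd: the function names are hypothetical, and it relies on `coredumpctl info` printing an "Executable:" field and on gdb's `-batch` and `-ex` options.

```sh
# executable_from_info: read `coredumpctl info` output on stdin and
# print the path from its "Executable:" field.
executable_from_info() {
    awk '$1 == "Executable:" { print $2; exit }'
}

# backtrace_last_core OUT: write a full backtrace of the most recent
# core dump to the file OUT (hypothetical helper).
backtrace_last_core() (
    out=$1
    dir=$(mktemp -d) || exit 1
    exe=$(coredumpctl info -1 | executable_from_info)
    if coredumpctl dump -1 -o "$dir/core"; then
        # -batch runs the given commands and exits without a prompt.
        gdb -batch -ex 'bt full' "$exe" "$dir/core" >"$out" 2>&1
        status=$?
    else
        status=1
    fi
    rm -rf "$dir"
    exit "$status"
)
```

Running `backtrace_last_core backtrace.txt` would then leave the output of `bt full` in backtrace.txt; as with the interactive session, debuginfo packages may be needed to get symbols instead of "??".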

You can send me the coredump file along with a complete list of packages you have installed on your system (`dnf repoquery --installed`) for me to examine to my email address. You can also attach it to the bug report, just note it contains the whole process memory dump, so if you have any secrets configured in dnf or potentially its dependencies (I can't think of anything sensitive besides potentially dnf configuration though), they may be stored in the dump.

Also, it would appear that the error started happening after you upgraded your system. You can use `dnf history` to examine what you installed and share it here. If it works, you can also try reverting the changes (using `dnf history undo` or manually installing the older versions), but it may be useful to keep the system in the broken state for a bit longer if you don't mind helping to debug the issue :)

Thanks!

Comment 4 Kevin Donovan 2020-06-22 15:08:49 UTC
Thanks for the update, especially the clear instructions on how to get the information you need.  I did not know anything about the coredumpctl command, so it was useful.

I did everything you requested.  I was going to attach it here, but I don't see how to attach a file to this reply.  I'll send the attachments to your redhat.com email account instead.

Kevin

Comment 5 Lukáš Hrázký 2020-06-23 13:41:23 UTC
Thanks for the data, Kevin. Inspecting the core dump shows that the problem happens when downloading a repository metalink. In the coredump you've sent, the URL is: https://mirrors.fedoraproject.org/metalink?repo=fedora-cisco-openh264-32&arch=x86_64

From what you've said, I assume if you disabled the fedora-cisco-openh264-32 repository, the error would happen on the next Fedora repository, which would also lead to this server?

This is the relevant top of the stack from the traceback:
#0  0x00007fe296ed9460 in nghttp2_session_client_new () at /lib64/libnghttp2.so.14
#1  0x00007fe297448491 in Curl_http2_setup () at /lib64/libcurl.so.4
#2  0x00007fe29744854a in Curl_http2_switched () at /lib64/libcurl.so.4
#3  0x00007fe29740c64f in Curl_http () at /lib64/libcurl.so.4
#4  0x00007fe29742ccf6 in multi_runsingle () at /lib64/libcurl.so.4
#5  0x00007fe29742daf1 in curl_multi_perform () at /lib64/libcurl.so.4
#6  0x00007fe2980350cc in lr_download () at /lib64/librepo.so.0
#7  0x00007fe298036b4f in lr_download_target () at /lib64/librepo.so.0
#8  0x00007fe298047958 in lr_yum_download_url () at /lib64/librepo.so.0
#9  0x00007fe298039522 in lr_yum_download_url_retry.constprop () at /lib64/librepo.so.0
#10 0x00007fe29803b336 in lr_handle_prepare_internal_mirrorlist () at /lib64/librepo.so.0
#11 0x00007fe29803c168 in lr_handle_perform () at /lib64/librepo.so.0
#12 0x00007fe2981aa5f7 in libdnf::Repo::Impl::lrHandlePerform(_LrHandle*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () at /lib64/libdnf.so.2
#13 0x00007fe2981ab9ac in libdnf::Repo::Impl::fetch(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<_LrHandle, std::default_delete<_LrHandle> >&&) () at /lib64/libdnf.so.2
#14 0x00007fe2981ad242 in libdnf::Repo::Impl::load() () at /lib64/libdnf.so.2
#15 0x00007fe2966f55a9 in _wrap_Repo_load () at /usr/lib64/python3.8/site-packages/libdnf/_repo.so

The bug will most likely be between libcurl and libnghttp2. Notably, the URL is being downloaded over HTTP/2, and I'm not sure how stable that is...

I haven't been able to reproduce the issue, can you also send me your whole repository configuration (contents of /etc/yum.repos.d) and your /etc/dnf/dnf.conf?

Thanks!

Comment 6 Kevin Donovan 2020-06-23 14:49:53 UTC
Thanks very much, Lukas.

I just confirmed that the error still occurs even if the file /etc/yum.repos.d/fedora-cisco-openh264.repo is moved out of the way, and I confirmed that it works properly with the rpmfusion-free repository.

This would be a serious bug if it were widespread, but as you say, no one else has reported anything.  If it looks like it would be too much work to figure it out, I can understand.  I could try some more to fix it myself, or even re-install the OS if necessary.

Comment 7 Lukáš Hrázký 2020-06-23 15:01:14 UTC
You're welcome, Kevin. Your configuration files are mostly standard and I wasn't able to reproduce the issue with them.

As I've said in an email, the transaction history you've sent me didn't show upgrades on any packages that should be relevant here.

Of note, the Bus error you're getting is actually an error raised by hardware when it can't access a physical memory address. It can be caused by a software bug as well, but I can't really imagine what bug that would be; a memory access violation usually raises a Segfault instead.

So this can be also caused by your hardware (unlikely) or kernel or some combination of these and other factors... I don't know much about these issues and it's very likely not related to dnf. Not sure I can help any further unfortunately.

Comment 8 Kevin Donovan 2020-06-23 15:10:53 UTC
That makes sense, especially since you can't reproduce the problem.

I'll see if I can find a fix, maybe even a dirty fix, and I'll let you know if I find anything.  Actually, I've learned quite a bit from the comments.

Comment 9 Kevin Donovan 2020-06-23 22:31:16 UTC
I finally figured it out and fixed the problem.

The file /usr/lib64/libnghttp2.so.14.20.0 was corrupted.  It's used when downloading files from a Fedora repository, but is apparently not used for the non-Fedora repositories.

I downloaded a good copy of the shared library from another machine.  The 'ls -l' command showed that the existing (bad) file had the same size as the good one, but when I tried to use the cmp command to compare the two files, I got an error message: Input/output error.  I renamed the bad file to libnghttp2.so.14.20.0-save and copied over the good one.  Now, all is good.
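The Input/output error from cmp suggests that part of the file could not be read from disk. Shared libraries are memory-mapped when loaded, and on Linux a failed read of a mapped page is typically reported to the process as SIGBUS, which would fit the crash seen here. The following is a minimal sketch of the two checks described above; the function names are hypothetical.

```sh
# read_check FILE: read every byte of FILE. A nonzero status means the
# file is missing or a read failed part-way (for example an I/O error).
read_check() {
    cat -- "$1" >/dev/null 2>&1
}

# same_content A B: succeed only if both files can be read in full and
# their contents are identical.
same_content() {
    cmp -s -- "$1" "$2"
}
```

Here `read_check /usr/lib64/libnghttp2.so.14.20.0` would be expected to fail on the damaged file, and `same_content` could compare it against the copy from another machine. On an RPM-based system, `rpm -qf /usr/lib64/libnghttp2.so.14.20.0` names the package that owns the file, and `rpm -V` on that package checks its installed files against the package database.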

I would never have been able to find this without your help, because I did not know how to use the proper tools, especially coredump and the various dnf options.  Hopefully, if something like this happens again, I'll be better prepared.

Thanks for your help.

Comment 10 Lukáš Hrázký 2020-06-24 10:50:07 UTC
Glad you've found out what the issue is, Kevin, and you're welcome. I hadn't considered a corrupted library as a possibility; that doesn't happen too often. Now I'll know for next time, too.

Closing this then.

