Bug 1761282 - dnf segfault after "Generating completion cache"; likely libdnf 0.35.5
Summary: dnf segfault after "Generating completion cache"; likely libdnf 0.35.5
Keywords:
Status: CLOSED DUPLICATE of bug 1758737
Alias: None
Product: Fedora
Classification: Fedora
Component: dnf
Version: 30
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Lukáš Hrázký
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-14 01:39 UTC by Ian Wienand
Modified: 2019-10-23 08:37 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-23 08:37:58 UTC
Type: Bug


Attachments
dnf backtrace with abort due to bad free() (165.45 KB, text/plain)
2019-10-14 01:40 UTC, Ian Wienand
dnf backtrace with segv (222.63 KB, text/plain)
2019-10-14 01:41 UTC, Ian Wienand
Valgrind logs from crashing dnf (179.50 KB, text/plain)
2019-10-15 01:08 UTC, Ian Wienand

Description Ian Wienand 2019-10-14 01:39:29 UTC
We have started seeing a segfault with dnf installs during or after "Generating completion cache":

---
2019-10-14 01:00:02.566 | Completion plugin: Generating completion cache...
2019-10-14 01:00:02.588 | /opt/dib/tmp/dib_build.ZqkLGLSg/hooks/root.d/08-yum-chroot: line 303: 18083 Segmentation fault      sudo -E chroot $TARGET_ROOT ${YUM} -y -v debuginfo-install libdnf dnf libsolv glibc python3
---

This started happening after the following updates (based on our package install logs; [1] is the last working build, [2] is a current failure):

GOOD VERSIONS                                  BAD VERSIONS
-------------                                  ------------
dnf.noarch 0:4.2.8-2.fc30                      | dnf.noarch 0:4.2.11-2.fc30                                   
dnf-plugins-core.noarch 0:4.0.9-1.fc30         | dnf-plugins-core.noarch 0:4.0.10-1.fc30                      
dnf-data.noarch 0:4.2.8-2.fc30                 | dnf-data.noarch 0:4.2.11-2.fc30                              
libdnf.x86_64 0:0.35.2-4.fc30                  | libdnf.x86_64 0:0.35.5-2.fc30                                
librepo.x86_64 0:1.10.5-1.fc30                 | librepo.x86_64 0:1.10.6-1.fc30                               
python3-dnf.noarch 0:4.2.8-2.fc30              | python3-dnf.noarch 0:4.2.11-2.fc30                           
python3-dnf-plugins-core.noarch 0:4.0.9-1.fc30 | python3-dnf-plugins-core.noarch 0:4.0.10-1.fc30              
python3-hawkey.x86_64 0:0.35.2-4.fc30          | python3-hawkey.x86_64 0:0.35.5-2.fc30                        
python3-libdnf.x86_64 0:0.35.2-4.fc30          | python3-libdnf.x86_64 0:0.35.5-2.fc30                        
zlib.x86_64 0:1.2.11-18.fc30                   | zlib.x86_64 0:1.2.11-18.fc30                           

This looks like bug #1758737.  I'm filing this as additional confirmation of the issue on F30.

The py-bt is:

---
2019-10-14 01:12:12.218 | #85 0x000055555555508e in _start ()
2019-10-14 01:12:12.292 | Traceback (most recent call first):
2019-10-14 01:12:12.292 |   <method-wrapper '__init__' of Sack object at 0x7ffff4d09360>
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/sack.py", line 35, in __init__
2019-10-14 01:12:12.292 |     super(Sack, self).__init__(*args, **kwargs)
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/sack.py", line 56, in _build_sack
2019-10-14 01:12:12.292 |     logdebug=base.conf.debuglevel > 2)
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/sack.py", line 61, in _rpmdb_sack
2019-10-14 01:12:12.292 |     sack = _build_sack(base)
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf-plugins/generate_completion_cache.py", line 85, in transaction
2019-10-14 01:12:12.292 |     inst_pkgs = dnf.sack._rpmdb_sack(self.base).query().installed()
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/plugin.py", line 104, in _caller
2019-10-14 01:12:12.293 |     getattr(plugin, method)()
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/plugin.py", line 164, in run_transaction
2019-10-14 01:12:12.293 |     self._caller('transaction')
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/base.py", line 871, in do_transaction
2019-10-14 01:12:12.293 |     self._plugins.run_transaction()
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/cli/cli.py", line 241, in do_transaction
2019-10-14 01:12:12.293 |     tid = super(BaseCli, self).do_transaction(display)
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 166, in resolving
2019-10-14 01:12:12.293 |     base.do_transaction(display=displays)
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 122, in cli_run
2019-10-14 01:12:12.294 |     ret = resolving(cli, base)
2019-10-14 01:12:12.294 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 98, in _main
2019-10-14 01:12:12.294 |     return cli_run(cli, base)
2019-10-14 01:12:12.294 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 65, in main
2019-10-14 01:12:12.294 |     return _main(base, args, cli_class, option_parser_class)
2019-10-14 01:12:12.294 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 191, in user_main
2019-10-14 01:12:12.294 |     errcode = main(args)
2019-10-14 01:12:12.294 |   File "/usr/bin/dnf", line 58, in <module>
2019-10-14 01:12:12.294 |     main.user_main(sys.argv[1:], exit_code=True)
---

The crash location seems to move depending on what debug libraries I have installed or what packages are being installed.  Before I had any debug libraries installed, I saw:

---
2019-10-14 00:10:41.251 | Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
2019-10-14 00:10:41.251 | 0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6
2019-10-14 00:10:41.254 | #0  0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6
2019-10-14 00:10:41.254 | #1  0x00007ffff7e889c6 in calloc () from /lib64/libc.so.6
2019-10-14 00:10:41.276 | #2  0x00007ffff6787b90 in solv_calloc () from /lib64/libsolv.so.1
2019-10-14 00:10:41.276 | #3  0x00007ffff674fa8f in pool_setarchpolicy () from /lib64/libsolv.so.1
2019-10-14 00:10:41.277 | #4  0x00007ffff6bfaf24 in dnf_sack_set_arch () from /lib64/libdnf.so.2
2019-10-14 00:10:41.299 | #5  0x00007ffff51623e5 in sack_init(_SackObject*, _object*, _object*) () from /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so
2019-10-14 00:10:41.299 | #6  0x00007ffff7bb3276 in ?? () from /lib64/libpython3.7m.so.1.0
...
---

I believe this strongly suggests that bug #1759204 is *not* a libsolv error, but rather a consequence of earlier memory errors somewhere within libdnf.

At one point, I captured an abort from an invalid free() (bt-abort.txt) that appeared to originate around:

---
#7  0x00007ffff6bfc3bc in dnf_sack_load_system_repo (sack=0x555555a0fb40, a_hrepo=0x0, flags=0, error=0x7fffffffb070) at /usr/src/debug/libdnf-0.35.5-2.fc30.x86_64/libdnf/dnf-sack.cpp:169
---

On other occasions, I see a SEGV that comes from a similar but different location:

---
2019-10-14 01:12:09.594 | #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:240
2019-10-14 01:12:09.594 | #1  0x00007ffff7e806c8 in __GI__IO_default_xsputn (n=<optimized out>, data=<optimized out>, f=<optimized out>) at genops.c:386
2019-10-14 01:12:09.603 | #2  __GI__IO_default_xsputn (f=f@entry=0x5555584c5370, data=data@entry=0x5555584a8940, n=n@entry=46) at genops.c:370
2019-10-14 01:12:09.607 | #3  0x00007ffff7e7e92b in _IO_new_file_xsputn (n=46, data=<optimized out>, f=<optimized out>) at fileops.c:1272
2019-10-14 01:12:09.609 | #4  _IO_new_file_xsputn (f=0x5555584c5370, data=<optimized out>, n=46) at fileops.c:1204
2019-10-14 01:12:09.648 | #5  0x00007ffff7e73389 in __GI__IO_fwrite (buf=0x5555584a8940, size=46, count=1, fp=0x5555584c5370) at libioP.h:904
2019-10-14 01:12:09.649 | #6  0x00007ffff5163bd7 in log_handler(char const*, GLogLevelFlags, char const*, void*) () from /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so
2019-10-14 01:12:09.650 | #7  0x00007ffff6a5e34b in g_logv () from /lib64/libglib-2.0.so.0
2019-10-14 01:12:09.651 | #8  0x00007ffff6a5e533 in g_log () from /lib64/libglib-2.0.so.0
2019-10-14 01:12:09.812 | #9  0x00007ffff6bfaef4 in dnf_sack_set_arch (sack=<optimized out>, value=<optimized out>, error=0x7fffffffaa00) at /usr/src/debug/libdnf-0.35.5-2.fc30.x86_64/libdnf/dnf-sack.cpp:824
---

Bug #1758737 says:

> The problem has been introduced in libdnf-0.35.5, so reverting to libdnf-0.35.3 should be a temporary solution until we fix it.

I will attempt to confirm this, but hopefully this is helpful for anyone seeing the same on F30.

[1] https://f921bf67a57bb9acb3a9-7f483fd0170ce51389d38b0c84b96364.ssl.cf2.rackcdn.com/687475/1/check/dib-nodepool-functional-openstack-fedora-30-src/2474d20/nodepool/builds/test-image-0000000001.log
[2] https://654d83e20ff6115f815e-89f45418e048e284a0aa8c1a9bb175a2.ssl.cf1.rackcdn.com/688036/1/check/dib-nodepool-functional-openstack-fedora-30-src/c27292c/nodepool/builds/test-image-0000000001.log
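
In case it helps anyone else hitting this on F30: pinning libdnf back to the last known-good build should be a workable stopgap until the fix lands. The commands below are only a sketch (untested here), using the versions from the table above:

---
# Sketch only (untested): downgrade libdnf and its Python bindings to the last
# known-good build listed above, then hold them back when updating.
dnf downgrade libdnf-0.35.2-4.fc30 python3-libdnf-0.35.2-4.fc30 python3-hawkey-0.35.2-4.fc30
dnf upgrade --exclude=libdnf --exclude=python3-libdnf --exclude=python3-hawkey
---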

Comment 1 Ian Wienand 2019-10-14 01:40:31 UTC
Created attachment 1625414 [details]
dnf backtrace with abort due to bad free()

Comment 2 Ian Wienand 2019-10-14 01:41:16 UTC
Created attachment 1625416 [details]
dnf backtrace with segv

Comment 3 Lukáš Hrázký 2019-10-14 12:56:02 UTC
Hello Ian, thanks for the information.

The third traceback is almost surely https://bugzilla.redhat.com/show_bug.cgi?id=1758737. The fix for that was merged and should be released this week. If you want to test it now, you can try our nightly repositories here: https://copr.fedorainfracloud.org/coprs/rpmsoftwaremanagement/dnf-nightly/, which should contain the fix.

IMO the second traceback you posted is not very similar to the third one - they do have the dnf_sack_set_arch() call in common, but one fails in the log_handler and the other one in allocating memory. We've gotten several bug reports lately failing in allocating memory. Presumably the cause would be heap corruption caused sometime before the SEGFAULT. The cause could be the log_handler problem as well, or something else.

If you were able to (semi-)consistently reproduce the second traceback, could you try running it in valgrind or possibly ElectricFence? That should discover erroneous memory writes.
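
Something along these lines should do it (only a sketch; adjust the paths and the dnf command to your chroot/build setup):

---
# Sketch; valgrind has to wrap the python3 interpreter since dnf is a script
valgrind --tool=memcheck --track-origins=yes --log-file=/tmp/valgrind-dnf.log \
    python3 /usr/bin/dnf -v -y install gettext
---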

Please report back once you get to try the fix! Thanks.

Comment 4 Ian Wienand 2019-10-15 00:52:42 UTC
(In reply to Lukáš Hrázký from comment #3)
> The third traceback is almost surely
> https://bugzilla.redhat.com/show_bug.cgi?id=1758737. The fix for that was
> merged and should be released this week. If you want to test it now, you can
> try our nightly repositories here:
> https://copr.fedorainfracloud.org/coprs/rpmsoftwaremanagement/dnf-nightly/
> which should contain the fix.

Thanks; I have tested with that [1] and we do not see the segfaults any more.  Just for completeness, the versions were:

---
2019-10-14 20:58:58.267 | ================================================================================
2019-10-14 20:58:58.267 |  Package                     Arch    Version                    Repository                                                         Size
2019-10-14 20:58:58.267 | ================================================================================
2019-10-14 20:58:58.267 | Upgrading:
2019-10-14 20:58:58.267 |  dnf                         noarch  4.2.12-0.34ge5b78300.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  388 k
2019-10-14 20:58:58.267 |  dnf-data                    noarch  4.2.12-0.34ge5b78300.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   30 k
2019-10-14 20:58:58.267 |  dnf-plugins-core            noarch  4.0.10-1g1f46091.fc30      copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   25 k
2019-10-14 20:58:58.267 |  libcomps                    x86_64  0.1.11-12g919c0e2.fc30     copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   73 k
2019-10-14 20:58:58.267 |  libdnf                      x86_64  0.35.6-0.23g0999bcc7.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  595 k
2019-10-14 20:58:58.267 |  librepo                     x86_64  1.10.7-0.5gc3e91b6.fc30    copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   88 k
2019-10-14 20:58:58.267 |  libsolv                     x86_64  0.7.6-5gf428e2bd.fc30      copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  384 k
2019-10-14 20:58:58.267 |  python3-dnf                 noarch  4.2.12-0.34ge5b78300.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  398 k
2019-10-14 20:58:58.267 |  python3-dnf-plugins-core    noarch  4.0.10-1g1f46091.fc30      copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  159 k
2019-10-14 20:58:58.267 |  python3-hawkey              x86_64  0.35.6-0.23g0999bcc7.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   83 k
2019-10-14 20:58:58.267 |  python3-libcomps            x86_64  0.1.11-12g919c0e2.fc30     copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   45 k
2019-10-14 20:58:58.267 |  python3-libdnf              x86_64  0.35.6-0.23g0999bcc7.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  699 k
2019-10-14 20:58:58.267 | Installing dependencies:
2019-10-14 20:58:58.267 |  dbus-libs                   x86_64  1:1.12.16-1.fc30           updates                                                            147 k
2019-10-14 20:58:58.267 |  libevent                    x86_64  2.1.8-7.fc30               updates                                                            231 k
2019-10-14 20:58:58.267 |  unbound-libs                x86_64  1.9.3-2.fc30               updates                                                            498 k
2019-10-14 20:58:58.267 | Installing weak dependencies:
2019-10-14 20:58:58.267 |  python3-unbound             x86_64  1.9.3-2.fc30               updates                                                            104 k
2019-10-14 20:58:58.267 |  rpm-plugin-systemd-inhibit  x86_64  4.14.2.1-5.fc30            updates                                                             19 k
2019-10-14 20:58:58.267 |  deltarpm                    x86_64  3.6-29.fc30                fedora                                                              77 k
---
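
(If anyone else wants to try the same packages, enabling the COPR repo directly should also work; the commands below are illustrative and need the copr plugin from dnf-plugins-core:)

---
# Illustrative; requires the copr plugin from dnf-plugins-core
dnf copr enable rpmsoftwaremanagement/dnf-nightly
dnf -y upgrade dnf libdnf python3-libdnf python3-hawkey
---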
 
> IMO the second traceback you posted is not very similar to the third one -
> they do have the dnf_sack_set_arch() call in common, but one fails in the
> log_handler and the other one in allocating memory. We've gotten several bug
> reports lately failing in allocating memory. Presumably the cause would be
> heap corruption caused sometime before the SEGFAULT. The cause could be the
> log_handler problem as well, or something else.
> 
> If you were able to (semi-)consistently reproduce the second traceback,
> could you try running it in valgrind or possibly ElectricFence? That should
> discover erroneous memory writes.

I have tried running this under valgrind in a number of circumstances.  I get some errors like

---
error: Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f30&arch=x86_64 [Resolving timed out after 30000 milliseconds] (https://mirrors.fedoraproject.org/metalink?repo=updates-released-f30&arch=x86_64).
---

which I think is just valgrind slowing things down, but it does seem to complete, so I don't think they're fatal.

But basically I don't trust the backtrace without symbols, which is the one that seems to be in libsolv.  Running the same command (dnf -v -y install gettext) under gdb with no symbol packages, I get:

---
2019-10-14 23:36:57.891 | Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
2019-10-14 23:36:57.891 | 0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6
2019-10-14 23:36:57.891 | #0  0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6
2019-10-14 23:36:57.891 | No symbol table info available.
2019-10-14 23:36:57.891 | #1  0x00007ffff7e889c6 in calloc () from /lib64/libc.so.6
2019-10-14 23:36:57.903 | No symbol table info available.
2019-10-14 23:36:57.903 | #2  0x00007ffff6787b90 in solv_calloc () from /lib64/libsolv.so.1
2019-10-14 23:36:57.903 | No symbol table info available.
2019-10-14 23:36:57.903 | #3  0x00007ffff674fa8f in pool_setarchpolicy () from /lib64/libsolv.so.1
2019-10-14 23:36:57.904 | No symbol table info available.
2019-10-14 23:36:57.904 | #4  0x00007ffff6bfaf24 in dnf_sack_set_arch () from /lib64/libdnf.so.2
2019-10-14 23:36:57.917 | No symbol table info available.
2019-10-14 23:36:57.917 | #5  0x00007ffff51623e5 in sack_init(_SackObject*, _object*, _object*) () from /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so
2019-10-14 23:36:57.917 | No symbol table info available.
---
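
(For reference, a backtrace like the one above can be captured non-interactively with something like the following; this is only a sketch, not the exact invocation used:)

---
# Sketch only: run dnf under gdb and dump a full backtrace when it crashes
gdb -batch -ex run -ex 'bt full' --args python3 /usr/bin/dnf -v -y install gettext
---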

Even when I run that under valgrind (without symbol packages), it comes up differently:

---
2019-10-14 23:44:37.137 | Completion plugin: Generating completion cache...
2019-10-14 23:44:37.188 | ==30450== Invalid read of size 4
2019-10-14 23:44:37.189 | ==30450==    at 0x48BD2D2: fwrite (in /usr/lib64/libc-2.29.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E6BD6: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x6FF434A: g_logv (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:44:37.189 | ==30450==    by 0x6FF4532: g_log (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:44:37.189 | ==30450==    by 0x6E61EF3: dnf_sack_set_arch (in /usr/lib64/libdnf.so.2)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E53E4: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B0C275: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B38C5D: PyObject_Call (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B82778: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B3707F: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B3871D: _PyFunction_FastCallDict (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B0C9C5: _PyObject_Call_Prepend (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==  Address 0x9891040 is 0 bytes inside a block of size 552 free'd
2019-10-14 23:44:37.189 | ==30450==    at 0x4839A0C: free (vg_replace_malloc.c:540)
2019-10-14 23:44:37.189 | ==30450==    by 0x48BC090: fclose@@GLIBC_2.2.5 (in /usr/lib64/libc-2.29.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E5FA2: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B6B2A8: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E23D6: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B37206: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B38081: _PyFunction_FastCallKeywords (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B4969E: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B8106C: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B3707F: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B38081: _PyFunction_FastCallKeywords (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B4969E: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==  Block was alloc'd at
2019-10-14 23:44:37.190 | ==30450==    at 0x483880B: malloc (vg_replace_malloc.c:309)
2019-10-14 23:44:37.190 | ==30450==    by 0x48BC93E: __fopen_internal (in /usr/lib64/libc-2.29.so)
2019-10-14 23:44:37.190 | ==30450==    by 0x99E545C: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B0C275: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B38C5D: PyObject_Call (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B82778: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B3707F: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B3871D: _PyFunction_FastCallDict (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B0C9C5: _PyObject_Call_Prepend (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B472E2: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B481C6: _PyObject_FastCallKeywords (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B49828: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450== 
---

When I do have all the symbols installed, it seems to be right in the log handlers:

---
2019-10-14 23:05:39.835 | Completion plugin: Generating completion cache...
2019-10-14 23:05:39.901 | ==20269== Invalid read of size 4
2019-10-14 23:05:39.901 | ==20269==    at 0x48BD2D2: fwrite (iofwrite.c:37)
2019-10-14 23:05:39.901 | ==20269==    by 0x99E6BD6: log_handler(char const*, GLogLevelFlags, char const*, void*) (sack-py.cpp:186)
2019-10-14 23:05:39.901 | ==20269==    by 0x6FF434A: g_logv (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:05:39.901 | ==20269==    by 0x6FF4532: g_log (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:05:39.901 | ==20269==    by 0x6E61EF3: dnf_sack_set_arch (dnf-sack.cpp:824)
2019-10-14 23:05:39.901 | ==20269==    by 0x99E53E4: sack_init(_SackObject*, _object*, _object*) (sack-py.cpp:265)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C275: UnknownInlinedFun (typeobject.c:5880)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C275: UnknownInlinedFun (descrobject.c:346)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C275: wrapper_call (descrobject.c:1146)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B38C5D: PyObject_Call (call.c:245)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B82778: UnknownInlinedFun (ceval.c:4645)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B82778: _PyEval_EvalFrameDefault (ceval.c:3191)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B3707F: _PyEval_EvalCodeWithName (ceval.c:3930)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B3871D: _PyFunction_FastCallDict (call.c:376)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C9C5: _PyObject_Call_Prepend (call.c:908)
2019-10-14 23:05:39.902 | ==20269==  Address 0x897b840 is 0 bytes inside a block of size 552 free'd
2019-10-14 23:05:39.903 | ==20269==    at 0x4839A0C: free (vg_replace_malloc.c:540)
2019-10-14 23:05:39.903 | ==20269==    by 0x48BC090: fclose@@GLIBC_2.2.5 (iofclose.c:77)
2019-10-14 23:05:39.903 | ==20269==    by 0x99E5FA2: sack_dealloc(_SackObject*) (sack-py.cpp:135)
2019-10-14 23:05:39.903 | ==20269==    by 0x4B6B2A8: subtype_dealloc (typeobject.c:1278)
2019-10-14 23:05:39.903 | ==20269==    by 0x99E23D6: query_dealloc(_QueryObject*) (query-py.cpp:211)
2019-10-14 23:05:39.903 | ==20269==    by 0x4B37206: UnknownInlinedFun (frameobject.c:470)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B37206: _PyEval_EvalCodeWithName (ceval.c:3946)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B38081: _PyFunction_FastCallKeywords (call.c:433)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B4969E: call_function (ceval.c:4616)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B8106C: _PyEval_EvalFrameDefault (ceval.c:3110)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B3707F: _PyEval_EvalCodeWithName (ceval.c:3930)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B38081: _PyFunction_FastCallKeywords (call.c:433)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B4969E: call_function (ceval.c:4616)
2019-10-14 23:05:39.904 | ==20269==  Block was alloc'd at
2019-10-14 23:05:39.904 | ==20269==    at 0x483880B: malloc (vg_replace_malloc.c:309)
2019-10-14 23:05:39.905 | ==20269==    by 0x48BC93E: __fopen_internal (iofopen.c:65)
2019-10-14 23:05:39.905 | ==20269==    by 0x99E545C: set_logfile (sack-py.cpp:199)
2019-10-14 23:05:39.905 | ==20269==    by 0x99E545C: sack_init(_SackObject*, _object*, _object*) (sack-py.cpp:276)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C275: UnknownInlinedFun (typeobject.c:5880)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C275: UnknownInlinedFun (descrobject.c:346)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C275: wrapper_call (descrobject.c:1146)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B38C5D: PyObject_Call (call.c:245)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B82778: UnknownInlinedFun (ceval.c:4645)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B82778: _PyEval_EvalFrameDefault (ceval.c:3191)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B3707F: _PyEval_EvalCodeWithName (ceval.c:3930)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B3871D: _PyFunction_FastCallDict (call.c:376)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C9C5: _PyObject_Call_Prepend (call.c:908)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B472E2: slot_tp_init (typeobject.c:6636)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B481C6: UnknownInlinedFun (typeobject.c:971)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B481C6: _PyObject_FastCallKeywords (call.c:199)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B49828: call_function (ceval.c:4619)
2019-10-14 23:05:39.906 | ==20269== 
---

I don't know if it's better unwind info or what, but to me it looks like the libsolv frames might be a red herring.

Unfortunately, I haven't been able to reproduce that abort from free() from the other trace again :/

[1] https://review.opendev.org/#/c/688580/

Comment 5 Ian Wienand 2019-10-15 01:08:14 UTC
Created attachment 1625726 [details]
Valgrind logs from crashing dnf

Comment 6 Lukáš Hrázký 2019-10-16 08:39:30 UTC
Thanks. To me the traceback without symbols seems to be the same as the last one with symbols.

As of now I've got the impression that the root cause of everything you've described is the log handler error. I'll leave this for a bit and then, if there are no more occurrences, I'll mark this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1758737.

Be careful with the nightly repo, though; it may be unstable, and right now it seems we have another issue there: https://bugzilla.redhat.com/show_bug.cgi?id=1761779

Comment 7 Ian Wienand 2019-10-17 00:30:11 UTC
(In reply to Lukáš Hrázký from comment #6)

> Thanks. To me the traceback without symbols seems to be the same as the last
> one with symbols.

Yeah, the difference is more in the very first one, the gdb backtrace (with no symbols installed) that seems to show it passing through libsolv; but then running the same thing under valgrind (even with no symbols) doesn't mention libsolv in the backtrace at all.  And then running under gdb with symbols shows a backtrace like valgrind ... so that's why I'm super suspicious that "gdb without symbols" -- which is where libsolv gets implicated -- is showing a valid backtrace.

I have also caught another error in a different test (python3-dnf-4.2.11-2.fc30) [1]

---
2019-10-16 05:41:27.645 | > Completion plugin: Generating completion cache...
2019-10-16 05:41:27.645 | > malloc(): unsorted double linked list corrupted
---

I haven't got backtraces, but it might be something others see too.

[1] https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_7e8/688849/2/check/dib-functests-bionic-python3/7e89693/logs/fedora-minimal_build-succeeds.FAIL.log
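
(If it shows up again, it may still be possible to pull a backtrace out after the fact, assuming systemd-coredump is capturing cores in that build environment; illustrative only:)

---
# Illustrative; assumes systemd-coredump is active in the build environment
coredumpctl list dnf
coredumpctl gdb        # open gdb on the most recent core dump
---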

Comment 8 Lukáš Hrázký 2019-10-17 09:04:17 UTC
(In reply to Ian Wienand from comment #7)
> Yeah, the difference is more in the very first one, the gdb backtrace (with
> no symbols installed) that seems to show it passing through libsolv; but
> then running the same thing under valgrind (even with no symbols)
> doesn't mention libsolv in the backtrace at all.  And then running under gdb
> with symbols shows a backtrace like valgrind ... so that's why I'm super
> suspicious that "gdb without symbols" -- which is where libsolv gets
> implicated -- is showing a valid backtrace.

I see, but the thing is the behavior is random. Since the g_log handler has a pointer to memory that we deallocated, various things can happen depending on where that memory is and what it gets used for next. In the most common case the crash is in the handler itself, because it finds gibberish in the memory. But the write can also somehow pass, do something nasty to the memory, and then something else will crash sometime later; in that case the tracebacks you get don't even point you to the place where the problem is, but somewhere totally unrelated.

That's what I think is happening here and it's not the fact that you installed debug symbols, but the random behavior.

> I have also caught another error in a different test
> (python3-dnf-4.2.11-2.fc30) [1]
> 
> ---
> 2019-10-16 05:41:27.645 | > Completion plugin: Generating completion cache...
> 2019-10-16 05:41:27.645 | > malloc(): unsorted double linked list corrupted
> ---
> 
> I haven't got backtraces, but might be something others see too
> 
> [1]
> https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/
> zuul_opendev_logs_7e8/688849/2/check/dib-functests-bionic-python3/7e89693/
> logs/fedora-minimal_build-succeeds.FAIL.log

Ok, that malloc error actually confirms my theory - it seems to be a rare case where malloc actually tells you its memory is corrupted instead of segfaulting. AFAICS the crash is with libdnf-0.35.5-2, which is the affected version.

Comment 9 Lukáš Hrázký 2019-10-23 08:37:58 UTC
At this stage I'm quite certain the crashes were caused by the log handler issue, so I'm marking this as a duplicate. The fix for it has been released. If the problems persist, feel free to reopen this.

*** This bug has been marked as a duplicate of bug 1758737 ***

