We have started seeing a segfault with dnf installs during or after "Generating completion cache"

---
2019-10-14 01:00:02.566 | Completion plugin: Generating completion cache...
2019-10-14 01:00:02.588 | /opt/dib/tmp/dib_build.ZqkLGLSg/hooks/root.d/08-yum-chroot: line 303: 18083 Segmentation fault      sudo -E chroot $TARGET_ROOT ${YUM} -y -v debuginfo-install libdnf dnf libsolv glibc python3
---

This started happening after the following updates (based on our package install logs; [1] is the last working build, [2] is a current failure):

GOOD VERSIONS                                    BAD VERSIONS
-------------                                    ------------
dnf.noarch 0:4.2.8-2.fc30                        | dnf.noarch 0:4.2.11-2.fc30
dnf-plugins-core.noarch 0:4.0.9-1.fc30           | dnf-plugins-core.noarch 0:4.0.10-1.fc30
dnf-data.noarch 0:4.2.8-2.fc30                   | dnf-data.noarch 0:4.2.11-2.fc30
libdnf.x86_64 0:0.35.2-4.fc30                    | libdnf.x86_64 0:0.35.5-2.fc30
librepo.x86_64 0:1.10.5-1.fc30                   | librepo.x86_64 0:1.10.6-1.fc30
python3-dnf.noarch 0:4.2.8-2.fc30                | python3-dnf.noarch 0:4.2.11-2.fc30
python3-dnf-plugins-core.noarch 0:4.0.9-1.fc30   | python3-dnf-plugins-core.noarch 0:4.0.10-1.fc30
python3-hawkey.x86_64 0:0.35.2-4.fc30            | python3-hawkey.x86_64 0:0.35.5-2.fc30
python3-libdnf.x86_64 0:0.35.2-4.fc30            | python3-libdnf.x86_64 0:0.35.5-2.fc30
zlib.x86_64 0:1.2.11-18.fc30                     | zlib.x86_64 0:1.2.11-18.fc30

This looks like #1758737. Filing this as additional confirmation for F30 issues.

The py-bt is:

---
2019-10-14 01:12:12.218 | #85 0x000055555555508e in _start ()
2019-10-14 01:12:12.292 | Traceback (most recent call first):
2019-10-14 01:12:12.292 |   <method-wrapper '__init__' of Sack object at 0x7ffff4d09360>
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/sack.py", line 35, in __init__
2019-10-14 01:12:12.292 |     super(Sack, self).__init__(*args, **kwargs)
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/sack.py", line 56, in _build_sack
2019-10-14 01:12:12.292 |     logdebug=base.conf.debuglevel > 2)
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/sack.py", line 61, in _rpmdb_sack
2019-10-14 01:12:12.292 |     sack = _build_sack(base)
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf-plugins/generate_completion_cache.py", line 85, in transaction
2019-10-14 01:12:12.292 |     inst_pkgs = dnf.sack._rpmdb_sack(self.base).query().installed()
2019-10-14 01:12:12.292 |   File "/usr/lib/python3.7/site-packages/dnf/plugin.py", line 104, in _caller
2019-10-14 01:12:12.293 |     getattr(plugin, method)()
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/plugin.py", line 164, in run_transaction
2019-10-14 01:12:12.293 |     self._caller('transaction')
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/base.py", line 871, in do_transaction
2019-10-14 01:12:12.293 |     self._plugins.run_transaction()
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/cli/cli.py", line 241, in do_transaction
2019-10-14 01:12:12.293 |     tid = super(BaseCli, self).do_transaction(display)
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 166, in resolving
2019-10-14 01:12:12.293 |     base.do_transaction(display=displays)
2019-10-14 01:12:12.293 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 122, in cli_run
2019-10-14 01:12:12.294 |     ret = resolving(cli, base)
2019-10-14 01:12:12.294 |   File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 98, in _main
2019-10-14 01:12:12.294 |     return cli_run(cli, base)
"/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 65, in main 2019-10-14 01:12:12.294 | return _main(base, args, cli_class, option_parser_class) 2019-10-14 01:12:12.294 | File "/usr/lib/python3.7/site-packages/dnf/cli/main.py", line 191, in user_main 2019-10-14 01:12:12.294 | errcode = main(args) 2019-10-14 01:12:12.294 | File "/usr/bin/dnf", line 58, in <module> 2019-10-14 01:12:12.294 | main.user_main(sys.argv[1:], exit_code=True) --- It seems to move depending on what debug libraries I have installed or what packages are being installed. Before I had debug libraries installed I saw --- 2019-10-14 00:10:41.251 | Thread 1 "python3" received signal SIGSEGV, Segmentation fault. 2019-10-14 00:10:41.251 | 0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6 2019-10-14 00:10:41.254 | #0 0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6 2019-10-14 00:10:41.254 | #1 0x00007ffff7e889c6 in calloc () from /lib64/libc.so.6 2019-10-14 00:10:41.276 | #2 0x00007ffff6787b90 in solv_calloc () from /lib64/libsolv.so.1 2019-10-14 00:10:41.276 | #3 0x00007ffff674fa8f in pool_setarchpolicy () from /lib64/libsolv.so.1 2019-10-14 00:10:41.277 | #4 0x00007ffff6bfaf24 in dnf_sack_set_arch () from /lib64/libdnf.so.2 2019-10-14 00:10:41.299 | #5 0x00007ffff51623e5 in sack_init(_SackObject*, _object*, _object*) () from /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so 2019-10-14 00:10:41.299 | #6 0x00007ffff7bb3276 in ?? () from /lib64/libpython3.7m.so.1.0 ... --- I believe this strongly suggests that #1759204 is *not* a libsolv error, but indeed some sort of consequence of preceding leaks within libdnf somewhere. At one point, I captured an abort from an invalid free() (bt-abort.txt) that appeared to originate around --- #7 0x00007ffff6bfc3bc in dnf_sack_load_system_repo (sack=0x555555a0fb40, a_hrepo=0x0, flags=0, error=0x7fffffffb070) at /usr/src/debug/libdnf-0.35.5-2.fc30.x86_64/libdnf/dnf-sack.cpp:169 --- On other occasions, seeing a SEGV that comes from a similar, but different location --- 2019-10-14 01:12:09.594 | #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:240 2019-10-14 01:12:09.594 | #1 0x00007ffff7e806c8 in __GI__IO_default_xsputn (n=<optimized out>, data=<optimized out>, f=<optimized out>) at genops.c:386 2019-10-14 01:12:09.603 | #2 __GI__IO_default_xsputn (f=f@entry=0x5555584c5370, data=data@entry=0x5555584a8940, n=n@entry=46) at genops.c:370 2019-10-14 01:12:09.607 | #3 0x00007ffff7e7e92b in _IO_new_file_xsputn (n=46, data=<optimized out>, f=<optimized out>) at fileops.c:1272 2019-10-14 01:12:09.609 | #4 _IO_new_file_xsputn (f=0x5555584c5370, data=<optimized out>, n=46) at fileops.c:1204 2019-10-14 01:12:09.648 | #5 0x00007ffff7e73389 in __GI__IO_fwrite (buf=0x5555584a8940, size=46, count=1, fp=0x5555584c5370) at libioP.h:904 2019-10-14 01:12:09.649 | #6 0x00007ffff5163bd7 in log_handler(char const*, GLogLevelFlags, char const*, void*) () from /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so 2019-10-14 01:12:09.650 | #7 0x00007ffff6a5e34b in g_logv () from /lib64/libglib-2.0.so.0 2019-10-14 01:12:09.651 | #8 0x00007ffff6a5e533 in g_log () from /lib64/libglib-2.0.so.0 2019-10-14 01:12:09.812 | #9 0x00007ffff6bfaef4 in dnf_sack_set_arch (sack=<optimized out>, value=<optimized out>, error=0x7fffffffaa00) at /usr/src/debug/libdnf-0.35.5-2.fc30.x86_64/libdnf/dnf-sack.cpp:824 --- Bug #1758737 says > The problem has been introduced in libdnf-0.35.5, so reverting to libdnf-0.35.3 should bet a temporary solution until we fix it. 
I will attempt to confirm this, but hopefully this is helpful for anyone seeing the same on F30.

[1] https://f921bf67a57bb9acb3a9-7f483fd0170ce51389d38b0c84b96364.ssl.cf2.rackcdn.com/687475/1/check/dib-nodepool-functional-openstack-fedora-30-src/2474d20/nodepool/builds/test-image-0000000001.log
[2] https://654d83e20ff6115f815e-89f45418e048e284a0aa8c1a9bb175a2.ssl.cf1.rackcdn.com/688036/1/check/dib-nodepool-functional-openstack-fedora-30-src/c27292c/nodepool/builds/test-image-0000000001.log
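For anyone who wants to poke at this outside a full image build, the following is a minimal sketch of the call path the completion-cache plugin takes, based on the py-bt above. This is my own illustration, not code from the build logs; it assumes the dnf 4.x Python API, needs to run as root inside the affected chroot, and may or may not trigger the crash outside a real transaction. It simply builds a second, rpmdb-only sack the same way generate_completion_cache.py does, which is where the segfault appears.

---
# Hypothetical standalone sketch of the crashing call path (assumes dnf 4.x API).
import dnf
import dnf.sack

base = dnf.Base()
base.fill_sack()     # first sack, as a normal dnf invocation would create

# What generate_completion_cache.py does in its transaction() hook: build a
# second, rpmdb-only sack and enumerate the installed packages from it.
rpmdb_sack = dnf.sack._rpmdb_sack(base)
installed = rpmdb_sack.query().installed()
print(len(list(installed)), "installed packages enumerated")
---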
Created attachment 1625414 [details]
dnf backtrace with abort due to bad free()
Created attachment 1625416 [details]
dnf backtrace with segv
Hello Ian, thanks for the information.

The third traceback is almost surely https://bugzilla.redhat.com/show_bug.cgi?id=1758737. The fix for that was merged and should be released this week. If you want to test it now, you can try our nightly repositories here, which should contain the fix: https://copr.fedorainfracloud.org/coprs/rpmsoftwaremanagement/dnf-nightly/

IMO the second traceback you posted is not very similar to the third one - they do have the dnf_sack_set_arch() call in common, but one fails in the log_handler and the other one while allocating memory. We've gotten several bug reports lately that fail while allocating memory. Presumably the cause would be heap corruption introduced sometime before the segfault. The cause could be the log_handler problem as well, or something else.

If you were able to (semi-)consistently reproduce the second traceback, could you try running it in valgrind or possibly ElectricFence? That should discover erroneous memory writes.

Please report back once you get to try the fix! Thanks.
(In reply to Lukáš Hrázký from comment #3)
> The third traceback is almost surely
> https://bugzilla.redhat.com/show_bug.cgi?id=1758737. The fix for that was
> merged and should be released this week. If you want to test it now, you can
> try our nightly repositories here, which should contain the fix:
> https://copr.fedorainfracloud.org/coprs/rpmsoftwaremanagement/dnf-nightly/

Thanks; I have tested with that [1] and we do not see the segfaults any more. Just for completeness, the versions were:

---
2019-10-14 20:58:58.267 | ================================================================================
2019-10-14 20:58:58.267 |  Package                     Arch    Version                    Repository                                                         Size
2019-10-14 20:58:58.267 | ================================================================================
2019-10-14 20:58:58.267 | Upgrading:
2019-10-14 20:58:58.267 |  dnf                         noarch  4.2.12-0.34ge5b78300.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  388 k
2019-10-14 20:58:58.267 |  dnf-data                    noarch  4.2.12-0.34ge5b78300.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   30 k
2019-10-14 20:58:58.267 |  dnf-plugins-core            noarch  4.0.10-1g1f46091.fc30      copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   25 k
2019-10-14 20:58:58.267 |  libcomps                    x86_64  0.1.11-12g919c0e2.fc30     copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   73 k
2019-10-14 20:58:58.267 |  libdnf                      x86_64  0.35.6-0.23g0999bcc7.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  595 k
2019-10-14 20:58:58.267 |  librepo                     x86_64  1.10.7-0.5gc3e91b6.fc30    copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   88 k
2019-10-14 20:58:58.267 |  libsolv                     x86_64  0.7.6-5gf428e2bd.fc30      copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  384 k
2019-10-14 20:58:58.267 |  python3-dnf                 noarch  4.2.12-0.34ge5b78300.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  398 k
2019-10-14 20:58:58.267 |  python3-dnf-plugins-core    noarch  4.0.10-1g1f46091.fc30      copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  159 k
2019-10-14 20:58:58.267 |  python3-hawkey              x86_64  0.35.6-0.23g0999bcc7.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   83 k
2019-10-14 20:58:58.267 |  python3-libcomps            x86_64  0.1.11-12g919c0e2.fc30     copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly   45 k
2019-10-14 20:58:58.267 |  python3-libdnf              x86_64  0.35.6-0.23g0999bcc7.fc30  copr:copr.fedorainfracloud.org:rpmsoftwaremanagement:dnf-nightly  699 k
2019-10-14 20:58:58.267 | Installing dependencies:
2019-10-14 20:58:58.267 |  dbus-libs                   x86_64  1:1.12.16-1.fc30           updates                                                            147 k
2019-10-14 20:58:58.267 |  libevent                    x86_64  2.1.8-7.fc30               updates                                                            231 k
2019-10-14 20:58:58.267 |  unbound-libs                x86_64  1.9.3-2.fc30               updates                                                            498 k
2019-10-14 20:58:58.267 | Installing weak dependencies:
2019-10-14 20:58:58.267 |  python3-unbound             x86_64  1.9.3-2.fc30               updates                                                            104 k
2019-10-14 20:58:58.267 |  rpm-plugin-systemd-inhibit  x86_64  4.14.2.1-5.fc30            updates                                                             19 k
2019-10-14 20:58:58.267 |  deltarpm                    x86_64  3.6-29.fc30                fedora                                                              77 k
---

> IMO the second traceback you posted is not very similar to the third one -
> they do have the dnf_sack_set_arch() call in common, but one fails in the
> log_handler and the other one while allocating memory. We've gotten several
> bug reports lately that fail while allocating memory. Presumably the cause
> would be heap corruption introduced sometime before the segfault. The cause
> could be the log_handler problem as well, or something else.
> 
> If you were able to (semi-)consistently reproduce the second traceback,
> could you try running it in valgrind or possibly ElectricFence? That should
> discover erroneous memory writes.

I have tried running this under valgrind in a number of circumstances. I get some errors like:

---
error: Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f30&arch=x86_64 [Resolving timed out after 30000 milliseconds] (https://mirrors.fedoraproject.org/metalink?repo=updates-released-f30&arch=x86_64).
---

which I think is just valgrind slowing things down; the runs do seem to complete, so I don't think these are fatal.

Basically, though, I don't trust the backtrace without symbols, which is the one that seems to point into libsolv. Running the same command (dnf -v -y install gettext) under gdb with no symbol packages I get:

---
2019-10-14 23:36:57.891 | Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
2019-10-14 23:36:57.891 | 0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6
2019-10-14 23:36:57.891 | #0  0x00007ffff7e8611d in _int_malloc () from /lib64/libc.so.6
2019-10-14 23:36:57.891 | No symbol table info available.
2019-10-14 23:36:57.891 | #1  0x00007ffff7e889c6 in calloc () from /lib64/libc.so.6
2019-10-14 23:36:57.903 | No symbol table info available.
2019-10-14 23:36:57.903 | #2  0x00007ffff6787b90 in solv_calloc () from /lib64/libsolv.so.1
2019-10-14 23:36:57.903 | No symbol table info available.
2019-10-14 23:36:57.903 | #3  0x00007ffff674fa8f in pool_setarchpolicy () from /lib64/libsolv.so.1
2019-10-14 23:36:57.904 | No symbol table info available.
2019-10-14 23:36:57.904 | #4  0x00007ffff6bfaf24 in dnf_sack_set_arch () from /lib64/libdnf.so.2
2019-10-14 23:36:57.917 | No symbol table info available.
2019-10-14 23:36:57.917 | #5  0x00007ffff51623e5 in sack_init(_SackObject*, _object*, _object*) () from /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so
2019-10-14 23:36:57.917 | No symbol table info available.
---

Even when I run that under valgrind (still without symbol packages), it comes up differently:

---
2019-10-14 23:44:37.137 | Completion plugin: Generating completion cache...
2019-10-14 23:44:37.188 | ==30450== Invalid read of size 4
2019-10-14 23:44:37.189 | ==30450==    at 0x48BD2D2: fwrite (in /usr/lib64/libc-2.29.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E6BD6: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x6FF434A: g_logv (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:44:37.189 | ==30450==    by 0x6FF4532: g_log (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:44:37.189 | ==30450==    by 0x6E61EF3: dnf_sack_set_arch (in /usr/lib64/libdnf.so.2)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E53E4: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B0C275: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B38C5D: PyObject_Call (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B82778: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B3707F: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B3871D: _PyFunction_FastCallDict (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B0C9C5: _PyObject_Call_Prepend (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==  Address 0x9891040 is 0 bytes inside a block of size 552 free'd
2019-10-14 23:44:37.189 | ==30450==    at 0x4839A0C: free (vg_replace_malloc.c:540)
2019-10-14 23:44:37.189 | ==30450==    by 0x48BC090: fclose@@GLIBC_2.2.5 (in /usr/lib64/libc-2.29.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E5FA2: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B6B2A8: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x99E23D6: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B37206: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.189 | ==30450==    by 0x4B38081: _PyFunction_FastCallKeywords (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B4969E: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B8106C: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B3707F: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B38081: _PyFunction_FastCallKeywords (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B4969E: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==  Block was alloc'd at
2019-10-14 23:44:37.190 | ==30450==    at 0x483880B: malloc (vg_replace_malloc.c:309)
2019-10-14 23:44:37.190 | ==30450==    by 0x48BC93E: __fopen_internal (in /usr/lib64/libc-2.29.so)
2019-10-14 23:44:37.190 | ==30450==    by 0x99E545C: ??? (in /usr/lib64/python3.7/site-packages/hawkey/_hawkey.so)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B0C275: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B38C5D: PyObject_Call (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B82778: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B3707F: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B3871D: _PyFunction_FastCallDict (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B0C9C5: _PyObject_Call_Prepend (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B472E2: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B481C6: _PyObject_FastCallKeywords (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==    by 0x4B49828: ??? (in /usr/lib64/libpython3.7m.so.1.0)
2019-10-14 23:44:37.190 | ==30450==
---

When I do have all the symbols, it seems to be right in the log handlers:

---
2019-10-14 23:05:39.835 | Completion plugin: Generating completion cache...
2019-10-14 23:05:39.901 | ==20269== Invalid read of size 4
2019-10-14 23:05:39.901 | ==20269==    at 0x48BD2D2: fwrite (iofwrite.c:37)
2019-10-14 23:05:39.901 | ==20269==    by 0x99E6BD6: log_handler(char const*, GLogLevelFlags, char const*, void*) (sack-py.cpp:186)
2019-10-14 23:05:39.901 | ==20269==    by 0x6FF434A: g_logv (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:05:39.901 | ==20269==    by 0x6FF4532: g_log (in /usr/lib64/libglib-2.0.so.0.6000.7)
2019-10-14 23:05:39.901 | ==20269==    by 0x6E61EF3: dnf_sack_set_arch (dnf-sack.cpp:824)
2019-10-14 23:05:39.901 | ==20269==    by 0x99E53E4: sack_init(_SackObject*, _object*, _object*) (sack-py.cpp:265)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C275: UnknownInlinedFun (typeobject.c:5880)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C275: UnknownInlinedFun (descrobject.c:346)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C275: wrapper_call (descrobject.c:1146)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B38C5D: PyObject_Call (call.c:245)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B82778: UnknownInlinedFun (ceval.c:4645)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B82778: _PyEval_EvalFrameDefault (ceval.c:3191)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B3707F: _PyEval_EvalCodeWithName (ceval.c:3930)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B3871D: _PyFunction_FastCallDict (call.c:376)
2019-10-14 23:05:39.902 | ==20269==    by 0x4B0C9C5: _PyObject_Call_Prepend (call.c:908)
2019-10-14 23:05:39.902 | ==20269==  Address 0x897b840 is 0 bytes inside a block of size 552 free'd
2019-10-14 23:05:39.903 | ==20269==    at 0x4839A0C: free (vg_replace_malloc.c:540)
2019-10-14 23:05:39.903 | ==20269==    by 0x48BC090: fclose@@GLIBC_2.2.5 (iofclose.c:77)
2019-10-14 23:05:39.903 | ==20269==    by 0x99E5FA2: sack_dealloc(_SackObject*) (sack-py.cpp:135)
2019-10-14 23:05:39.903 | ==20269==    by 0x4B6B2A8: subtype_dealloc (typeobject.c:1278)
2019-10-14 23:05:39.903 | ==20269==    by 0x99E23D6: query_dealloc(_QueryObject*) (query-py.cpp:211)
2019-10-14 23:05:39.903 | ==20269==    by 0x4B37206: UnknownInlinedFun (frameobject.c:470)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B37206: _PyEval_EvalCodeWithName (ceval.c:3946)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B38081: _PyFunction_FastCallKeywords (call.c:433)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B4969E: call_function (ceval.c:4616)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B8106C: _PyEval_EvalFrameDefault (ceval.c:3110)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B3707F: _PyEval_EvalCodeWithName (ceval.c:3930)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B38081: _PyFunction_FastCallKeywords (call.c:433)
2019-10-14 23:05:39.904 | ==20269==    by 0x4B4969E: call_function (ceval.c:4616)
2019-10-14 23:05:39.904 | ==20269==  Block was alloc'd at
2019-10-14 23:05:39.904 | ==20269==    at 0x483880B: malloc (vg_replace_malloc.c:309)
2019-10-14 23:05:39.905 | ==20269==    by 0x48BC93E: __fopen_internal (iofopen.c:65)
2019-10-14 23:05:39.905 | ==20269==    by 0x99E545C: set_logfile (sack-py.cpp:199)
2019-10-14 23:05:39.905 | ==20269==    by 0x99E545C: sack_init(_SackObject*, _object*, _object*) (sack-py.cpp:276)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C275: UnknownInlinedFun (typeobject.c:5880)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C275: UnknownInlinedFun (descrobject.c:346)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C275: wrapper_call (descrobject.c:1146)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B38C5D: PyObject_Call (call.c:245)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B82778: UnknownInlinedFun (ceval.c:4645)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B82778: _PyEval_EvalFrameDefault (ceval.c:3191)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B3707F: _PyEval_EvalCodeWithName (ceval.c:3930)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B3871D: _PyFunction_FastCallDict (call.c:376)
2019-10-14 23:05:39.905 | ==20269==    by 0x4B0C9C5: _PyObject_Call_Prepend (call.c:908)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B472E2: slot_tp_init (typeobject.c:6636)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B481C6: UnknownInlinedFun (typeobject.c:971)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B481C6: _PyObject_FastCallKeywords (call.c:199)
2019-10-14 23:05:39.906 | ==20269==    by 0x4B49828: call_function (ceval.c:4619)
2019-10-14 23:05:39.906 | ==20269==
---

I don't know if it's better unwind info or what, but to me it looks like being inside libsolv might be a red herring. Unfortunately, I haven't been able to reproduce the abort from free() from the other trace again :/

[1] https://review.opendev.org/#/c/688580/
Created attachment 1625726 [details]
Valgrind logs from crashing dnf
Thanks. To me the traceback without symbols seems to be the same as the last one with symbols.

As of now I've got the impression that the root cause of everything you've described is the log handler error. I'll leave this open for a bit and then, if there are no more occurrences, I'll mark this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1758737.

Be careful with the nightly repo though - it may be unstable, and right now it seems we have another issue there: https://bugzilla.redhat.com/show_bug.cgi?id=1761779
(In reply to Lukáš Hrázký from comment #6)
> Thanks. To me the traceback without symbols seems to be the same as the last
> one with symbols.

Yeah, the difference is more in the very first one, the gdb backtrace (with no symbols installed), which seems to show it passing through libsolv; but then running the same thing under valgrind (even with no symbols) doesn't mention libsolv in the backtrace at all. And then running under gdb with symbols shows a backtrace like valgrind's ... so that's why I'm super suspicious of whether "gdb without symbols" -- which is where libsolv gets implicated -- is showing a valid backtrace.

I have also caught another error in a different test (python3-dnf-4.2.11-2.fc30) [1]:

---
2019-10-16 05:41:27.645 | > Completion plugin: Generating completion cache...
2019-10-16 05:41:27.645 | > malloc(): unsorted double linked list corrupted
---

I haven't got backtraces, but it might be something others see too.

[1] https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_7e8/688849/2/check/dib-functests-bionic-python3/7e89693/logs/fedora-minimal_build-succeeds.FAIL.log
(In reply to Ian Wienand from comment #7)
> Yeah, the difference is more in the very first one, the gdb backtrace (with
> no symbols installed), which seems to show it passing through libsolv; but
> then running the same thing under valgrind (even with no symbols) doesn't
> mention libsolv in the backtrace at all. And then running under gdb with
> symbols shows a backtrace like valgrind's ... so that's why I'm super
> suspicious of whether "gdb without symbols" -- which is where libsolv gets
> implicated -- is showing a valid backtrace.

I see, but the thing is that the behavior is random. Since the g_log_handler has a pointer to memory that we deallocated, various things can happen depending on where that memory is and what it gets used for next. In the most common case the crash is in the handler itself, because it finds gibberish in the memory. But it can also somehow pass, do something nasty to the memory, and then something else will crash sometime later - in which case the tracebacks you get don't even point to the place where the problem is, but somewhere totally unrelated. That's what I think is happening here - it's not the fact that you installed debug symbols, but the random behavior.

> I have also caught another error in a different test (python3-dnf-4.2.11-2.fc30) [1]:
>
> ---
> 2019-10-16 05:41:27.645 | > Completion plugin: Generating completion cache...
> 2019-10-16 05:41:27.645 | > malloc(): unsorted double linked list corrupted
> ---
>
> I haven't got backtraces, but it might be something others see too.
>
> [1] https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_7e8/688849/2/check/dib-functests-bionic-python3/7e89693/logs/fedora-minimal_build-succeeds.FAIL.log

Ok, that malloc error actually confirms my theory - it seems to be a rare case where malloc actually tells you its memory is corrupted instead of segfaulting. AFAICS the crash is with libdnf-0.35.5-2, which is the affected version.
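To make the failure mode described above concrete, here is a rough Python analogue of the lifetime bug. This is illustrative, hypothetical code only - not the real hawkey implementation, which is the C++ in sack-py.cpp referenced in the valgrind output: a module-level log "handler" keeps a reference to a file owned by a Sack; destroying the sack closes the file but never replaces the handler's reference, and the next sack's early logging then goes through the stale reference. In Python this merely raises an exception; in the real code the equivalent write goes through a dangling FILE*, i.e. the heap corruption discussed above.

---
# Hypothetical Python analogue of the dangling log-handler bug; not the real code.
_log_target = None                  # stands in for the g_log handler's user_data

def log(msg):                       # stands in for g_log() -> log_handler()
    if _log_target is not None:
        _log_target.write(msg + "\n")

class Sack:
    def __init__(self, logfile):
        global _log_target
        log("setting architecture")      # like dnf_sack_set_arch(): logs via the OLD handler
        self._f = open(logfile, "w")
        _log_target = self._f            # like set_logfile(): handler target replaced only now

    def close(self):
        self._f.close()                  # like sack_dealloc(): file closed, handler untouched

first = Sack("/tmp/hawkey-1.log")
log("fine while the first sack is alive")
first.close()                            # first sack goes away; _log_target is now stale
try:
    second = Sack("/tmp/hawkey-2.log")   # its early log() call hits the closed file
except ValueError as e:
    # Python raises here; the C code instead writes through a freed FILE*.
    print("second sack init failed:", e)
---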
At this stage I'm quite certain the crashes were caused by the log handler issue, so I'm marking this as a duplicate. The fix for it has been released. If the problems persist, feel free to reopen this.

*** This bug has been marked as a duplicate of bug 1758737 ***