Bug 2278016 - Samba DLZ module crashes BIND on startup
Summary: Samba DLZ module crashes BIND on startup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: samba
Version: 40
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Petr Menšík
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-04-30 17:29 UTC by Rob Foehl
Modified: 2024-10-03 02:03 UTC (History)

Fixed In Version: samba-4.21.0-14.fc42
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-10-01 15:35:36 UTC
Type: ---
Embargoed:


Attachments
GDB-captured backtrace of failed named startup (6.25 KB, text/plain)
2024-04-30 17:31 UTC, Rob Foehl


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Gitlab | samba-team samba merge_requests 3810 | 0 | None | closed | lib:ldb: Don't use RTLD_DEEPBIND by default | 2024-10-01 11:53:09 UTC
Samba Project | 15643 | 0 | None | None | None | 2024-05-12 23:01:07 UTC

Description Rob Foehl 2024-04-30 17:29:45 UTC
Samba AD DC using DLZ DNS on Fedora 39 with current Samba 4.19.6 / BIND 9.18.24 packages upgraded to Fedora 40, Samba 4.20.0 / BIND 9.18.26.  named consistently crashes on startup when trying to load the Samba-provided DLZ module; stack trace attached.  No other indications of issues, including samba-tool dbcheck.  Samba itself seems to operate fine otherwise, including answering domain-related queries via LDAP.

Reproducible: Always

Comment 1 Rob Foehl 2024-04-30 17:31:01 UTC
Created attachment 2030350
GDB-captured backtrace of failed named startup

Comment 2 Guenther Deschner 2024-05-07 12:01:25 UTC
Thanks for reporting the crash, can you please report this to the Samba upstream? I think it needs to get addressed there first.

Comment 3 Rob Foehl 2024-05-07 19:39:37 UTC
I can try, but between the closed Bugzilla instance and the ridiculously broken SMTP setup, it sure seems like there's no available avenue to actually do that...

Comment 4 Pavel Lisý 2024-05-12 21:30:00 UTC
I can confirm this on a brand new installation of Samba AD using DLZ DNS on Fedora 40. On Fedora 39 with the same setup there are no such problems.

Samba deployed with the option --dns-backend=SAMBA_INTERNAL works fine too.

Comment 5 Rob Foehl 2024-05-12 23:01:07 UTC
Managed to get this reported upstream...  However, BIND 9.18.26 packages have since made it to F39, and it looks like the problem exists there as well, still running Samba 4.19.6.  From a different DC, with only BIND packages upgraded from 9.18.24 to 9.18.26 (lightly edited coredumpctl output):

Module dlz_bind9_18.so from rpm samba-4.19.6-1.fc39.x86_64
Module named from rpm bind-9.18.26-1.fc39.x86_64

Stack trace of thread 983:
#0  0x00007f516a146820 _int_free_merge_chunk (libc.so.6 + 0x9c820)
#1  0x00007f516a146b4a _int_free (libc.so.6 + 0x9cb4a)
#2  0x00007f516a1493de free (libc.so.6 + 0x9f3de)
#3  0x00007f515f1748d5 dsdb_schema_refresh (schema_load.so + 0x48d5)
#4  0x00007f5168e2e2da dsdb_get_schema (libldbsamba-samba4.so + 0x1b2da)
#5  0x00007f515f175024 schema_load_init (schema_load.so + 0x5024)
#6  0x00007f5168f9f08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#7  0x00007f5168f9f08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#8  0x00007f515f1ce352 rootdse_init (rootdse.so + 0xe352)
#9  0x00007f5168f9f08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#10 0x00007f515f1a0664 samba_dsdb_init (samba_dsdb.so + 0x4664)
#11 0x00007f5168f9f08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#12 0x00007f5168fa2ac1 ldb_load_modules (libldb.so.2 + 0x15ac1)
#13 0x00007f5168fa341f ldb_connect (libldb.so.2 + 0x1641f)
#14 0x00007f5168e2d269 samba_ldb_connect (libldbsamba-samba4.so + 0x1a269)
#15 0x00007f5168ff1198 samdb_connect_url (libsamdb.so.0 + 0xf198)
#16 0x00007f5169d4ef05 dlz_create (dlz_bind9_18.so + 0x7f05)
#17 0x0000561e83e5c921 dlopen_dlz_create (named + 0x22921)
#18 0x00007f516af3ecf9 dns_sdlzcreate (libdns-9.18.26.so + 0x13ecf9)
#19 0x00007f516ae627cb dns_dlzcreate (libdns-9.18.26.so + 0x627cb)
#20 0x0000561e83e702f0 configure_view.lto_priv.0 (named + 0x362f0)
#21 0x0000561e83e7e751 load_configuration (named + 0x44751)
#22 0x0000561e83e80897 run_server (named + 0x46897)
#23 0x00007f516b10efbf isc_task_run (libisc-9.18.26.so + 0x66fbf)
#24 0x00007f516b0ce4bc isc__nm_async_task (libisc-9.18.26.so + 0x264bc)
#25 0x00007f516b0d6739 process_netievent (libisc-9.18.26.so + 0x2e739)
#26 0x00007f516b0d6e57 process_queue (libisc-9.18.26.so + 0x2ee57)
#27 0x00007f516b0d707a async_cb (libisc-9.18.26.so + 0x2f07a)
#28 0x00007f516ad49f23 uv__async_io.part.0 (libuv.so.1 + 0xaf23)
#29 0x00007f516ad6862b uv__io_poll (libuv.so.1 + 0x2962b)
#30 0x00007f516ad4f708 uv_run (libuv.so.1 + 0x10708)
#31 0x00007f516b0d758d nm_thread (libisc-9.18.26.so + 0x2f58d)
#32 0x00007f516b11320a isc__trampoline_run (libisc-9.18.26.so + 0x6b20a)
#33 0x00007f516a138897 start_thread (libc.so.6 + 0x8e897)
#34 0x00007f516a1bfa5c __clone3 (libc.so.6 + 0x115a5c)

Since none of the relevant code on either side seems to have seen any significant changes recently, is this possibly just an unintentional ABI break or similar issue that a rebuild would fix?

Comment 6 Andreas Schneider 2024-05-14 11:28:36 UTC
Can you try with Samba 4.20.1, which has just been pushed to Fedora 40?

https://bodhi.fedoraproject.org/updates/FEDORA-2024-5e07d3d1ec

Comment 7 Rob Foehl 2024-05-15 05:02:31 UTC
No luck, same result:

Module dlz_bind9_18.so from rpm samba-4.20.1-1.fc40.x86_64
Module named from rpm bind-9.18.26-1.fc40.x86_64

Stack trace of thread 718:
#0  0x00007f287df5c290 _int_free_merge_chunk (libc.so.6 + 0xa4290)
#1  0x00007f287df5c5ba _int_free (libc.so.6 + 0xa45ba)
#2  0x00007f287df5edce free (libc.so.6 + 0xa6dce)
#3  0x00007f28708b13c5 dsdb_schema_refresh (schema_load.so + 0x43c5)
#4  0x00007f287b2fd16d dsdb_get_schema (libldbsamba-private-samba.so + 0x1816d)
#5  0x00007f28708b1b0c schema_load_init (schema_load.so + 0x4b0c)
#6  0x00007f287ce180aa ldb_module_init_chain (libldb.so.2 + 0x120aa)
#7  0x00007f287ce180aa ldb_module_init_chain (libldb.so.2 + 0x120aa)
#8  0x00007f287090924f rootdse_init (rootdse.so + 0xe24f)
#9  0x00007f287ce180aa ldb_module_init_chain (libldb.so.2 + 0x120aa)
#10 0x00007f28708dd26c samba_dsdb_init (samba_dsdb.so + 0x426c)
#11 0x00007f287ce180aa ldb_module_init_chain (libldb.so.2 + 0x120aa)
#12 0x00007f287ce1bb0d ldb_load_modules (libldb.so.2 + 0x15b0d)
#13 0x00007f287ce1c46f ldb_connect (libldb.so.2 + 0x1646f)
#14 0x00007f287b2fc1f9 samba_ldb_connect (libldbsamba-private-samba.so + 0x171f9)
#15 0x00007f287ce74c78 samdb_connect_url (libsamdb.so.0 + 0xfc78)
#16 0x00007f287d397df5 dlz_create (dlz_bind9_18.so + 0x7df5)
#17 0x0000563a64582923 dlopen_dlz_create (named + 0x20923)
#18 0x00007f287ed35e19 dns_sdlzcreate (libdns-9.18.26.so + 0x135e19)
#19 0x00007f287ec5c83b dns_dlzcreate (libdns-9.18.26.so + 0x5c83b)
#20 0x0000563a645961d0 configure_view.lto_priv.0 (named + 0x341d0)
#21 0x0000563a645a467f load_configuration (named + 0x4267f)
#22 0x0000563a645a6997 run_server (named + 0x44997)
#23 0x00007f287ef44eef isc_task_run (libisc-9.18.26.so + 0x62eef)
#24 0x00007f287ef054cc isc__nm_async_task (libisc-9.18.26.so + 0x234cc)
#25 0x00007f287ef0d6d9 process_netievent (libisc-9.18.26.so + 0x2b6d9)
#26 0x00007f287ef0ddf7 process_queue (libisc-9.18.26.so + 0x2bdf7)
#27 0x00007f287ef0e018 async_cb (libisc-9.18.26.so + 0x2c018)
#28 0x00007f287eb8df23 uv__async_io.part.0 (libuv.so.1 + 0xaf23)
#29 0x00007f287ebac57b uv__io_poll (libuv.so.1 + 0x2957b)
#30 0x00007f287eb93822 uv_run (libuv.so.1 + 0x10822)
#31 0x00007f287ef0e4fd nm_thread (libisc-9.18.26.so + 0x2c4fd)
#32 0x00007f287ef4917a isc__trampoline_run (libisc-9.18.26.so + 0x6717a)
#33 0x00007f287df4e1b7 start_thread (libc.so.6 + 0x961b7)
#34 0x00007f287dfd039c __clone3 (libc.so.6 + 0x11839c)

I guess that rules out a build issue.  For completeness, downgrading BIND to the 9.18.24-1.fc40 packages seems to work.

Comment 8 Petr Menšík 2024-05-16 17:28:42 UTC
Does this happen on a production or a test machine? If it is a test machine, would it be possible to attach a coredump without any privacy-sensitive information included?

Does it trigger an assertion failure in named, or does it just crash with a segmentation fault?

Comment 9 Petr Menšík 2024-05-16 18:50:49 UTC
I have failed to find any documented changes that relate to DLZ in a direct way. If there is indeed something wrong with the BIND version, I would guess it has to be in version 9.18.25 [1], which made changes to cache cleanups. Those might be somehow related. There were changes to lib/dns/rbtdb.c, but nothing from that file shows up in the coredump presented.

It would be great if anyone who has this configured could test the upcoming BIND 9.18.27 [2]. I expect the problem is still there, but just to be sure. Though again, no change seems to touch anything close to DLZ plugins.

1. https://downloads.isc.org/isc/bind9/9.18.27/doc/arm/html/notes.html#id4
2. https://src.fedoraproject.org/rpms/bind/pull-request/21

Comment 10 Petr Menšík 2024-05-16 18:57:29 UTC
Since Samba contains non-trivial code and there was a locking change at [1], could it be missing proper locking in some places? Maybe those places were protected until now but the lock move has broken that? Not sure, just guessing.

Though there is also a small change touching the sdlz.c file [2]. Could that be related perhaps?

1. https://gitlab.isc.org/isc-projects/bind9/commit/156a08e327f88b629dba7cab815a3d00ed9452b8
2. https://gitlab.isc.org/isc-projects/bind9/commit/ce3b343b0a7f5574565f8bb6f900390bb46fea1d

Comment 11 Rob Foehl 2024-05-16 18:57:59 UTC
Production, so no.  This particular environment is fairly small -- DCs are VMs with offline snapshots prior to testing, and rolling the whole mess back each time.

No assertion failure, or logging of any kind other than normal startup messages -- it just dies, in the same spot every time.  That being in libc free() and what looks like haphazard mixing of 3+ different allocators is already questionable.

I'd agree with the suspicions of seemingly-unrelated changes in 9.18.25, as at a glance it doesn't look like any of the code around DLZ in either package has been touched in a while.  I haven't had time to do any more in-depth exploration, though, and probably won't before next week at minimum.

Comment 12 Petr Menšík 2024-05-16 19:23:57 UTC
Can you at least produce a backtrace from the coredump again with debug information, including all threads?

(gdb) thread apply all bt

There might be some other activity in another thread that prematurely frees data before the Samba connection has finished. A thread collision is a likely cause of such weird behaviour, if it is not a change in Samba. But it is hard to guess from a single thread's backtrace, which does not help much.

Comment 13 Rob Foehl 2024-05-16 19:46:04 UTC
There isn't, all other threads are idle -- that doesn't rule out any of them having already freed something, but there's nothing of interest in the stack traces.

From the 9.18.24 -> 9.18.26 attempt on F39:

Stack trace of thread 7731:
#0  0x00007f4b46946820 _int_free_merge_chunk (libc.so.6 + 0x9c820)
#1  0x00007f4b46946b4a _int_free (libc.so.6 + 0x9cb4a)
#2  0x00007f4b469493de free (libc.so.6 + 0x9f3de)
#3  0x00007f4b3b9118d5 dsdb_schema_refresh (schema_load.so + 0x48d5)
#4  0x00007f4b4448c2da dsdb_get_schema (libldbsamba-samba4.so + 0x1b2da)
#5  0x00007f4b3b912024 schema_load_init (schema_load.so + 0x5024)
#6  0x00007f4b4573b08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#7  0x00007f4b4573b08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#8  0x00007f4b3b96b352 rootdse_init (rootdse.so + 0xe352)
#9  0x00007f4b4573b08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#10 0x00007f4b3b93d664 samba_dsdb_init (samba_dsdb.so + 0x4664)
#11 0x00007f4b4573b08a ldb_module_init_chain (libldb.so.2 + 0x1208a)
#12 0x00007f4b4573eac1 ldb_load_modules (libldb.so.2 + 0x15ac1)
#13 0x00007f4b4573f41f ldb_connect (libldb.so.2 + 0x1641f)
#14 0x00007f4b4448b269 samba_ldb_connect (libldbsamba-samba4.so + 0x1a269)
#15 0x00007f4b4577d198 samdb_connect_url (libsamdb.so.0 + 0xf198)
#16 0x00007f4b464e8f05 dlz_create (dlz_bind9_18.so + 0x7f05)
#17 0x000055adca125921 dlopen_dlz_create (named + 0x22921)
#18 0x00007f4b4773ecf9 dns_sdlzcreate (libdns-9.18.26.so + 0x13ecf9)
#19 0x00007f4b476627cb dns_dlzcreate (libdns-9.18.26.so + 0x627cb)
#20 0x000055adca1392f0 configure_view.lto_priv.0 (named + 0x362f0)
#21 0x000055adca147751 load_configuration (named + 0x44751)
#22 0x000055adca149897 run_server (named + 0x46897)
#23 0x00007f4b478a8fbf isc_task_run (libisc-9.18.26.so + 0x66fbf)
#24 0x00007f4b478684bc isc__nm_async_task (libisc-9.18.26.so + 0x264bc)
#25 0x00007f4b47870739 process_netievent (libisc-9.18.26.so + 0x2e739)
#26 0x00007f4b47870e57 process_queue (libisc-9.18.26.so + 0x2ee57)
#27 0x00007f4b4787107a async_cb (libisc-9.18.26.so + 0x2f07a)
#28 0x00007f4b474e3f23 uv__async_io.part.0 (libuv.so.1 + 0xaf23)
#29 0x00007f4b4750262b uv__io_poll (libuv.so.1 + 0x2962b)
#30 0x00007f4b474e9708 uv_run (libuv.so.1 + 0x10708)
#31 0x00007f4b4787158d nm_thread (libisc-9.18.26.so + 0x2f58d)
#32 0x00007f4b478ad20a isc__trampoline_run (libisc-9.18.26.so + 0x6b20a)
#33 0x00007f4b46938897 start_thread (libc.so.6 + 0x8e897)
#34 0x00007f4b469bfa5c __clone3 (libc.so.6 + 0x115a5c)

Stack trace of thread 7734:
#0  0x0000000000000000 n/a (n/a + 0x0)
#1  0x00007f4b478a6bc4 task_ready (libisc-9.18.26.so + 0x64bc4)
#2  0x00007f4b478a70a9 isc_task_sendtoanddetach (libisc-9.18.26.so + 0x650a9)
#3  0x00007f4b47886550 isc_app_ctxrun (libisc-9.18.26.so + 0x44550)
#4  0x00007f4b4788685c isc_app_run (libisc-9.18.26.so + 0x4485c)
#5  0x000055adca1215b5 main (named + 0x1e5b5)
#6  0x00007f4b468d214a __libc_start_call_main (libc.so.6 + 0x2814a)
#7  0x00007f4b468d220b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2820b)
#8  0x000055adca122235 _start (named + 0x1f235)

Stack trace of thread 7735:
#0  0x00007f4b46935169 __futex_abstimed_wait_common (libc.so.6 + 0x8b169)
#1  0x00007f4b46937b09 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8db09)
#2  0x00007f4b478716db nm_thread (libisc-9.18.26.so + 0x2f6db)
#3  0x00007f4b478ad20a isc__trampoline_run (libisc-9.18.26.so + 0x6b20a)
#4  0x00007f4b46938897 start_thread (libc.so.6 + 0x8e897)
#5  0x00007f4b469bfa5c __clone3 (libc.so.6 + 0x115a5c)

Stack trace of thread 7737:
#0  0x00007f4b46935169 __futex_abstimed_wait_common (libc.so.6 + 0x8b169)
#1  0x00007f4b46937b09 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8db09)
#2  0x00007f4b478716db nm_thread (libisc-9.18.26.so + 0x2f6db)
#3  0x00007f4b478ad20a isc__trampoline_run (libisc-9.18.26.so + 0x6b20a)
#4  0x00007f4b46938897 start_thread (libc.so.6 + 0x8e897)
#5  0x00007f4b469bfa5c __clone3 (libc.so.6 + 0x115a5c)

Stack trace of thread 7733:
#0  0x00007f4b46935169 __futex_abstimed_wait_common (libc.so.6 + 0x8b169)
#1  0x00007f4b46937b09 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8db09)
#2  0x00007f4b478716db nm_thread (libisc-9.18.26.so + 0x2f6db)
#3  0x00007f4b478ad20a isc__trampoline_run (libisc-9.18.26.so + 0x6b20a)
#4  0x00007f4b46938897 start_thread (libc.so.6 + 0x8e897)
#5  0x00007f4b469bfa5c __clone3 (libc.so.6 + 0x115a5c)

Stack trace of thread 7732:
#0  0x0000000000000000 n/a (n/a + 0x0)
#1  0x00007f4b478a6bc4 task_ready (libisc-9.18.26.so + 0x64bc4)
#2  0x00007f4b478a70a9 isc_task_sendtoanddetach (libisc-9.18.26.so + 0x650a9)
#3  0x00007f4b47886550 isc_app_ctxrun (libisc-9.18.26.so + 0x44550)
#4  0x00007f4b4788685c isc_app_run (libisc-9.18.26.so + 0x4485c)
#5  0x000055adca1215b5 main (named + 0x1e5b5)
#6  0x00007f4b468d214a __libc_start_call_main (libc.so.6 + 0x2814a)
#7  0x00007f4b468d220b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2820b)
#8  0x000055adca122235 _start (named + 0x1f235)

Stack trace of thread 7730:
#0  0x0000000000000000 n/a (n/a + 0x0)
#1  0x00007f4b478a6bc4 task_ready (libisc-9.18.26.so + 0x64bc4)
#2  0x00007f4b478a70a9 isc_task_sendtoanddetach (libisc-9.18.26.so + 0x650a9)
#3  0x00007f4b47886550 isc_app_ctxrun (libisc-9.18.26.so + 0x44550)
#4  0x00007f4b4788685c isc_app_run (libisc-9.18.26.so + 0x4485c)
#5  0x000055adca1215b5 main (named + 0x1e5b5)
#6  0x00007f4b468d214a __libc_start_call_main (libc.so.6 + 0x2814a)
#7  0x00007f4b468d220b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2820b)
#8  0x000055adca122235 _start (named + 0x1f235)

Stack trace of thread 7736:
#0  0x0000000000000000 n/a (n/a + 0x0)
#1  0x00007f4b478a6bc4 task_ready (libisc-9.18.26.so + 0x64bc4)
#2  0x00007f4b478a70a9 isc_task_sendtoanddetach (libisc-9.18.26.so + 0x650a9)
#3  0x00007f4b47886550 isc_app_ctxrun (libisc-9.18.26.so + 0x44550)
#4  0x00007f4b4788685c isc_app_run (libisc-9.18.26.so + 0x4485c)
#5  0x000055adca1215b5 main (named + 0x1e5b5)
#6  0x00007f4b468d214a __libc_start_call_main (libc.so.6 + 0x2814a)
#7  0x00007f4b468d220b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2820b)
#8  0x000055adca122235 _start (named + 0x1f235)

Stack trace of thread 7738:
#0  0x00007f4b46935169 __futex_abstimed_wait_common (libc.so.6 + 0x8b169)
#1  0x00007f4b46937e72 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x8de72)
#2  0x00007f4b47887ca8 isc_condition_waituntil (libisc-9.18.26.so + 0x45ca8)
#3  0x00007f4b478aab94 run (libisc-9.18.26.so + 0x68b94)
#4  0x00007f4b478ad20a isc__trampoline_run (libisc-9.18.26.so + 0x6b20a)
#5  0x00007f4b46938897 start_thread (libc.so.6 + 0x8e897)
#6  0x00007f4b469bfa5c __clone3 (libc.so.6 + 0x115a5c)

Stack trace of thread 7729:
#0  0x00007f4b468e9638 __sigtimedwait (libc.so.6 + 0x3f638)
#1  0x00007f4b468e8d04 sigwait (libc.so.6 + 0x3ed04)
#2  0x00007f4b47886550 isc_app_ctxrun (libisc-9.18.26.so + 0x44550)
#3  0x00007f4b4788685c isc_app_run (libisc-9.18.26.so + 0x4485c)
#4  0x000055adca1215b5 main (named + 0x1e5b5)
#5  0x00007f4b468d214a __libc_start_call_main (libc.so.6 + 0x2814a)
#6  0x00007f4b468d220b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2820b)
#7  0x000055adca122235 _start (named + 0x1f235)

Comment 14 Rob Foehl 2024-06-12 09:40:51 UTC
I got around to cloning one of the affected VMs, and built BIND out of the git tree for a bisection...  Many hours of staying awake longer than I should have later:

# git bisect good
4ad3c694f1c39b3c9070a243adc095434d8e9b43 is the first bad commit
commit 4ad3c694f1c39b3c9070a243adc095434d8e9b43
Merge: 2044384f6d 6d7674f8f2
Author: Michał Kępień <michal>
Date:   Wed Feb 14 13:35:19 2024 +0100

    Merge tag 'v9.18.24' into bind-9.18


Hm, not very specific...  Except, in that commit:

diff --cc CHANGES
index 163ab0d0e4,9bd4f51e7e..73f87ac6e9
--- a/CHANGES
+++ b/CHANGES

[...]

 +6328. [func]          Add workaround to enforce dynamic linker to pull
 +                      jemalloc earlier than libc to ensure all memory
 +                      allocations are done via jemalloc. [GL #4404]


Which would refer to https://gitlab.isc.org/isc-projects/bind9/-/issues/4404 and associated commits.

And now I'll refer to comment 11 and specifically the part calling out the questionable mixing of allocators, because it's late and I'm easily irritated at the amount of time wasted essentially repeating myself.

This ball belongs back in Samba's court: the DLZ module needs to catch up to the allocator change in 9.18, as it's only been working by accident up until this point.

Comment 15 Petr Menšík 2024-06-12 15:53:08 UTC
There is also bug #2277997 on the bind component, which involves loadable OpenSSL modules and quite strange behaviour of memory allocator functions. It is very likely caused by the same change, at least judging from your description.

It might help if the RTLD_DEEPBIND flag were used in the dlopen call, but I am not sure; this is not tested.

If my own debugging is correct, the strdup call ends up using libjemalloc's malloc, but the allocated memory is then handed to libc's free function, which does not recognize the pointer and raises a fatal error from free().

Potential fix for it might be omitting use of jemalloc from bind completely.
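
The mismatch described in this comment can be observed from inside a process by asking the dynamic linker which shared object supplies each allocator symbol. The following is a minimal sketch under stated assumptions, not code from BIND or Samba: the helper names `providing_object` and `same_provider` are hypothetical, and it relies only on the `dlsym` and `dladdr` calls. Passing `RTLD_DEFAULT` inspects the global search scope; passing a handle returned by `dlopen` inspects that module and its dependencies instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of BIND or Samba): return the path of the
 * shared object that defines `symbol` when looked up through `handle`,
 * or NULL if the symbol or its owning object cannot be determined.
 */
static const char *providing_object(void *handle, const char *symbol)
{
    Dl_info info;
    void *addr;

    assert(symbol != NULL);

    addr = dlsym(handle, symbol);
    if (addr == NULL) {
        return NULL;
    }
    if (dladdr(addr, &info) == 0 || info.dli_fname == NULL) {
        return NULL;
    }
    return info.dli_fname;
}

/*
 * Hypothetical helper: nonzero when both symbols resolve to the same
 * shared object in the given scope, e.g. "malloc" and "free".
 */
static int same_provider(void *handle, const char *a, const char *b)
{
    const char *pa = providing_object(handle, a);
    const char *pb = providing_object(handle, b);

    return pa != NULL && pb != NULL && strcmp(pa, pb) == 0;
}
```

In a healthy process, `same_provider(RTLD_DEFAULT, "malloc", "free")` is expected to be nonzero. Comparing the result for the global scope with the result for a module handle would show whether the two scopes disagree about the allocator. On glibc older than 2.34 the program may need to be linked with `-ldl`.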

Comment 16 Rob Foehl 2024-06-12 19:04:56 UTC
(In reply to Petr Menšík from comment #15)
> Potential fix for it might be omitting use of jemalloc from bind completely.

No.  That's avoiding it, not fixing it.  jemalloc isn't a fashion statement, it solves a real problem -- and the profusion of similar efforts suggests that there's more than one real problem to solve.  Crippling BIND because other parties can't be bothered to understand how memory allocators work is unacceptable.

RTLD_DEEPBIND isn't the answer, either -- that'd just enforce mixing, not prevent it.

Comment 17 Andreas Schneider 2024-06-17 11:10:38 UTC
diff --cc CHANGES
index 163ab0d0e4,9bd4f51e7e..73f87ac6e9
--- a/CHANGES
+++ b/CHANGES

[...]

 +6328. [func]          Add workaround to enforce dynamic linker to pull
 +                      jemalloc earlier than libc to ensure all memory
 +                      allocations are done via jemalloc. [GL #4404]


This sounds like a bad idea. It means all bind modules need to use and build with jemalloc. If they don't they will not work.


Samba is using talloc! If jemalloc is enforced by bind, the Samba bind dlz module simply wont work anymore.

Workaround: Use the internal Samba DNS server instead.
Fix:
a) Remove the bind dlz package from Fedora (use Samba internal DNS server).
b) Build bind without jemalloc.

Comment 18 Rob Foehl 2024-06-17 17:46:52 UTC
(In reply to Andreas Schneider from comment #17)
> Samba is using talloc! If jemalloc is enforced by bind, the Samba bind dlz
> module simply wont work anymore.

Yes, that's rather the point.  This mixing of allocators only ever worked by accident, and there's no reasonable expectation that it continue to do so.

> Workaround: Use the internal Samba DNS server instead.

Not possible.  Samba's DNS server fails in spectacular fashion when queried from a real resolver, thus making it impossible to properly delegate a Samba AD domain when using the internal server; it also locks all clients into proxying queries through that mess, breaking all manner of things in the process.

I get that AD administrators' non-understanding of how any of this works is the reason the "it's always DNS" meme persists, but come on.  Don't sit here suggesting that an actually functional AD domain be crippled because the majority don't work.

This is indeed the only reason I'm trying to run DLZ in the first place -- I'd very much prefer not, but the only alternative doesn't actually speak DNS.

> Fix:
> a) Remove the bind dlz package from Fedora (use Samba internal DNS server).

See above.

> b) Build bind without jemalloc.

See comment 16.

Comment 19 Alexander Bokovoy 2024-06-18 06:52:54 UTC
We may have a meeting with ISC folks in near future regarding modules API bind provides and a path forward with that. I'll add jemalloc issue to the list.
No specific timeframe though.

Comment 20 mike 2024-06-30 17:57:19 UTC
This issue has now also hit Debian.

I should have read these posts earlier, but here are my findings:

First of all, "other parties can't be bothered to understand how memory allocators work" is not something you can say about Samba, if only because they usually use talloc (which is not a malloc/free replacement but a different concept).

Second, this specific mismatch between malloc and free is not really obvious. It comes from the fact that BIND does not load the DLZ library with RTLD_DEEPBIND (I would suggest loading third-party libraries like DLZ modules with that flag, but that's just a personal opinion), but Samba then loads some modules, namely ldb modules, with that flag. This makes the ldb modules call glibc's free while the other parts use malloc from jemalloc.

There is a simple runtime fix: set the environment variable LDB_MODULES_DISABLE_DEEPBIND=1 when launching BIND. This disables RTLD_DEEPBIND when loading ldb modules, so everything uses jemalloc.
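
The effect of the environment variable can be illustrated with a small sketch of flag selection. This is not Samba's actual code: `plugin_dlopen_flags`, `flags_use_deepbind` and `demo_opt_out` are hypothetical helpers, and treating the mere presence of `LDB_MODULES_DISABLE_DEEPBIND` as the opt-out (rather than checking its value) is an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdbool.h>
#include <stdlib.h>

/* Variable name taken from this comment; presence-only check is assumed. */
#define DEEPBIND_OPT_OUT "LDB_MODULES_DISABLE_DEEPBIND"

/*
 * Hypothetical helper: choose dlopen() flags for a plugin. Deep binding
 * is requested only where the platform defines RTLD_DEEPBIND and the
 * opt-out variable is absent from the environment.
 */
static int plugin_dlopen_flags(void)
{
    int flags = RTLD_NOW;

#ifdef RTLD_DEEPBIND
    if (getenv(DEEPBIND_OPT_OUT) == NULL) {
        flags |= RTLD_DEEPBIND;
    }
#endif
    return flags;
}

/* Hypothetical helper: report whether a flag set requests deep binding. */
static bool flags_use_deepbind(int flags)
{
#ifdef RTLD_DEEPBIND
    return (flags & RTLD_DEEPBIND) != 0;
#else
    (void)flags;
    return false;
#endif
}

/* Demonstration: with the variable set, deep binding is not requested. */
static void demo_opt_out(void)
{
    setenv(DEEPBIND_OPT_OUT, "1", 1);
    assert(!flags_use_deepbind(plugin_dlopen_flags()));
    unsetenv(DEEPBIND_OPT_OUT);
}
```

Without deep binding, a module's references to `malloc` and `free` are resolved through the global scope first, so they end up at whichever allocator the host process already uses.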

Comment 21 Petr Špaček 2024-07-16 07:32:47 UTC
Honestly, we (ISC, the BIND developers) don't know what the best approach overall is.

We want to keep jemalloc because, as correctly assumed in comment 16, it solves real problems with memory fragmentation and poor performance that we saw on large production systems.

Ad Comment 17 and "sounds like a bad idea": Well, this workaround was added because _other_ BIND users reported odd behavior, see https://gitlab.isc.org/isc-projects/bind9/-/issues/4404

Any suggestions how to work around this are more than welcome!

Having said all that, in the long term we are looking at different methods of interfacing between BIND and external parties. One option might be extending the APIs available over network/sockets so that direct linking becomes unnecessary, but at this point that is just wild speculation. To do that, we would need current DLZ/dyndb API users to present their requirements and to cooperate so we could design the new interfaces properly.

Comment 22 Florian Weimer 2024-09-03 09:32:12 UTC
Could we get LD_DEBUG=all from the failing process?

The local search scope for the plugin should have jemalloc for libc, but that doesn't seem to happen for some reason.

Comment 23 Florian Weimer 2024-09-04 05:22:51 UTC
It turns out that Samba uses RTLD_DEEPBIND:

        dlopen_flags = RTLD_NOW;
#ifdef RTLD_DEEPBIND
        /*
         * use deepbind if possible, to avoid issues with different
         * system library variants, for example ldb modules may be linked
         * against Heimdal while the application may use MIT kerberos.
         *
         * See the dlopen manpage for details.
         *
         * One typical user is the bind_dlz module of Samba,
         * but symbol versioning might be enough...
         *
         * We need a way to disable this in order to allow the
         * ldb_*ldap modules to work with a preloaded socket wrapper.
         *
         * So in future we may remove this completely
         * or at least invert the default behavior.
        */
        if (deepbind_enabled) {
                dlopen_flags |= RTLD_DEEPBIND;
        }
#endif

This cannot work properly. Debian has applied a workaround: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074378#96

But Samba should stop using RTLD_DEEPBIND, really.

Comment 24 Fedora Update System 2024-10-01 13:39:11 UTC
FEDORA-2024-614f34751b (samba-4.21.0-14.fc42) has been submitted as an update to Fedora 42.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-614f34751b

Comment 25 Fedora Update System 2024-10-01 15:35:36 UTC
FEDORA-2024-614f34751b (samba-4.21.0-14.fc42) has been pushed to the Fedora 42 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 26 Fedora Update System 2024-10-02 09:55:20 UTC
FEDORA-2024-c3aaede54e (samba-4.21.0-14.fc41) has been submitted as an update to Fedora 41.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-c3aaede54e

Comment 27 Fedora Update System 2024-10-03 02:03:01 UTC
FEDORA-2024-c3aaede54e has been pushed to the Fedora 41 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-c3aaede54e`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-c3aaede54e

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

