RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 2101712 - named-pkcs11 crashing into dns-pkcs11 library red-black tree [bind rhel-8.8.0]
Summary: named-pkcs11 crashing into dns-pkcs11 library red-black tree [bind rhel-8.8.0]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: bind
Version: 8.5
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Petr Menšík
QA Contact: rhel-cs-infra-services-qe
URL:
Whiteboard:
Depends On:
Blocks: 2126877 2126912 2129844 2130272 2131784 2133889 2162795
TreeView+ depends on / blocked
 
Reported: 2022-06-28 08:15 UTC by Anton Bobrov
Modified: 2023-05-16 11:16 UTC (History)
10 users (show)

Fixed In Version: bind-9.11.36-7.el8
Doc Type: Bug Fix
Doc Text:
Cause: Unprotected parallel access from statistics module were found, which could in rare situation lead to assertion failure. Consequence: Named would terminate abnormally, not serving DNS requests anymore. Fix: Protect shared memory structures by proper locking to avoid race conditions. Result: named-pkcs11 service will not crash from time to time even when statistics queries are often processed.
Clone Of:
: 2126877 2126912 2130272 2131784 (view as bug list)
Environment:
Last Closed: 2023-05-16 09:09:40 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Fedora Pagure bind-dyndb-ldap pull-request 213 0 None None None 2022-08-04 19:14:48 UTC
Fedora Pagure bind-dyndb-ldap pull-request 214 0 None None None 2022-09-14 16:01:10 UTC
Internet Systems Consortium (ISC) isc-projects bind9 issues 3468 0 None opened Statistics XML rendering does not seem to protect against concurrent changes 2022-07-21 10:33:42 UTC
Internet Systems Consortium (ISC) isc-projects bind9 merge_requests 6637 0 None opened Increase view and zone usage counter on dns_zt_apply calls 2022-08-05 11:15:43 UTC
Internet Systems Consortium (ISC) isc-projects bind9 merge_requests 6695 0 None opened Draft: Resolve "Statistics XML rendering does not seem to protect against concurrent changes" 2022-08-25 11:10:02 UTC
Red Hat Issue Tracker RHELPLAN-126446 0 None None None 2022-06-28 08:24:15 UTC
Red Hat Product Errata RHSA-2023:3002 0 None None None 2023-05-16 09:10:14 UTC

Description Anton Bobrov 2022-06-28 08:15:43 UTC
Description of problem:

The customer has experienced named-pkcs11 crashes. The latest one has occurred on June 26 and the one before that on May 2 tho they have lost the core and the logs for the May crash. There does not seem to be any clear event or trigger for either crash. The last one we do have a core dump collected and it goes something like this

Core was generated by `/usr/sbin/named-pkcs11 -u named -c /etc/named.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f9820d2f513 in dns_rbtnodechain_next (chain=chain@entry=0x7f98188bcad0, name=name@entry=0x0, 
    origin=origin@entry=0x0) at ../../../lib/dns-pkcs11/rbt.c:3516
3516                    while (LEFT(current) != NULL)
[Current thread is 1 (Thread 0x7f98188bf700 (LWP 79415))]
(gdb) where
#0  0x00007f9820d2f513 in dns_rbtnodechain_next (chain=chain@entry=0x7f98188bcad0, name=name@entry=0x0, 
    origin=origin@entry=0x0) at ../../../lib/dns-pkcs11/rbt.c:3516
#1  0x00007f9820e2bb98 in dns_zt_apply2 (zt=<optimized out>, stop=stop@entry=true, sub=sub@entry=0x0, 
    action=action@entry=0x55cc88636e10 <zone_xmlrender>, uap=uap@entry=0x7f97f710e210)
    at ../../../lib/dns-pkcs11/zt.c:534
#2  0x00007f9820e2bf25 in dns_zt_apply (zt=<optimized out>, stop=stop@entry=true, 
    action=action@entry=0x55cc88636e10 <zone_xmlrender>, uap=uap@entry=0x7f97f710e210)
    at ../../../lib/dns-pkcs11/zt.c:496
#3  0x000055cc886365ec in generatexml (server=<optimized out>, flags=2, buflen=buflen@entry=0x7f98188bebbc, 
    buf=buf@entry=0x7f98188bebc0) at ../../../bin/named-pkcs11/statschannel.c:1842
#4  0x000055cc88636b55 in render_xml (flags=<optimized out>, arg=<optimized out>, retcode=0x7f97f558d518, 
    retmsg=0x7f97f558d520, mimetype=0x7f97f558d510, b=0x7f97f558d528, freecb=0x7f97f558d568, 
    freecb_args=0x7f97f558d570, headers=<optimized out>, querystring=<optimized out>, urlinfo=<optimized out>, 
    url=<optimized out>) at ../../../bin/named-pkcs11/statschannel.c:1995
#5  0x00007f9820a14040 in isc_httpd_recvdone (task=0x7f9821419970, ev=<optimized out>)
    at ../../../lib/isc-pkcs11/httpd.c:1015
#6  0x00007f9820a300ef in dispatch (manager=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1157
#7  run (uap=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1331
#8  0x00007f981de2617a in start_thread (arg=<optimized out>) at pthread_create.c:479
#9  0x00007f981d7eddc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p current
$1 = (dns_rbtnode_t *) 0xdededededededede
(gdb) p chain->end
$2 = (dns_rbtnode_t *) 0x7f97fc52a740
(gdb) p *chain->end
$3 = {magic = 87473920, is_root = 0, color = 0, find_callback = 0, attributes = 3, nsec = 2, namelen = 127, 
  offsetlen = 0, oldnamelen = 0, is_mmapped = 0, parent_is_relative = 1, left_is_relative = 1, 
  right_is_relative = 1, down_is_relative = 1, data_is_relative = 0, rpz = 1, hashval = 3739147998, 
  uppernode = 0xdededededededede, hashnext = 0xdededededededede, parent = 0xdededededededede, 
  left = 0xdededededededede, right = 0xdededededededede, down = 0xdededededededede, deadlink = {
    prev = 0xdededededededede, next = 0xdededededededede}, data = 0xdededededededede, locknum = 3739147998, 
  dirty = 0, wild = 1, references = {refs = -2387225703656530210}}
(gdb)

Where it trips is in 

if (DOWN(current) != NULL)

line before the crash since ISC_MEM_FILL seem to be defined in this build the current->down is 0xdead indicating freed memory and is obviously not NULL. Even tho not using 0xdead would have probably avoided this crash it is clearly not the root cause here so this needs to be investigated further and hopefully fixed. 

Version-Release number of selected component (if applicable):

bind-9.11.26-6.el8
bind-libs-9.11.26-6.el8
bind-pkcs11-9.11.26-6.el8
bind-pkcs11-libs-9.11.26-6.el8

How reproducible:

There is no reproducer for this unfortunately.

Comment 6 Petr Menšík 2022-07-14 12:01:51 UTC
Interesting. Adding bind-dyndb-ldap maintainers, because it is very likely connected. But I have not seen yet crash triggered by a statistics channel. That is certainly new. Not sure whether there are proper locks protecting zone structure.

But I would like to do some evaluation, before I pass it bind-dyndb-ldap component.

Comment 8 Petr Menšík 2022-07-21 09:35:13 UTC
Relevant backtraces from actual.coredump attached to case:

Thread 7 (Thread 0x7f98190c0700 (LWP 79414)):
#0  0x00007f981de2f75d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f981de28bed in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f9820e03b42 in dns_zone_maintenance (zone=zone@entry=0x7f97e5d167a0) at ../../../lib/dns-pkcs11/zone.c:6126
#3  0x00007f9820e145c0 in dns_zonemgr_forcemaint (zmgr=0x55cc8a4b2600) at ../../../lib/dns-pkcs11/zone.c:17007
#4  0x000055cc8861634b in view_loaded (arg=arg@entry=0x7f98214b0e08) at ../../../bin/named-pkcs11/server.c:8871
#5  0x00007f9820e2c188 in doneloading (zt=0x7f9800341150, zone=<optimized out>, task=<optimized out>) at ../../../lib/dns-pkcs11/zt.c:575
#6  0x00007f9820e276a0 in zone_asyncload (task=0x7f98214ca010, event=<optimized out>) at ../../../lib/dns-pkcs11/zone.c:2261
#7  0x00007f9820a300ef in dispatch (manager=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1157
#8  run (uap=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1331
#9  0x00007f981de2617a in start_thread () from /lib64/libpthread.so.0
#10 0x00007f981d7eddc3 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f98198c1700 (LWP 79413)):
#0  0x00007f981de2c3fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9820a29f8a in isc__rwlock_lock (type=isc_rwlocktype_write, rwl=0x55cc8a4b2670) at ../../../lib/isc-pkcs11/rwlock.c:398
#2  isc_rwlock_lock (rwl=rwl@entry=0x55cc8a4b2670, type=type@entry=isc_rwlocktype_write) at ../../../lib/isc-pkcs11/rwlock.c:426
#3  0x00007f9820e137ea in dns_zonemgr_releasezone (zone=0x7f97e5d167a0, zmgr=0x55cc8a4b2600) at ../../../lib/dns-pkcs11/zone.c:16946
#4  dns_zonemgr_releasezone (zmgr=0x55cc8a4b2600, zone=zone@entry=0x7f97e5d167a0) at ../../../lib/dns-pkcs11/zone.c:16939
#5  0x00007f9815e13d89 in delete_bind_zone (zt=zt@entry=0x7f9800341150, zonep=zonep@entry=0x7f98198bfea0) at zone_register.c:598
#6  0x00007f9815df64ed in empty_zone_unload (ezname=ezname@entry=0x7f98198c0070, zonetable=zonetable@entry=0x7f9800341150) at empty_zones.c:270
#7  0x00007f9815df662f in empty_zone_handle_conflicts (name=name@entry=0x7f98210c07c0 <root>, zonetable=0x7f9800341150, warn_only=false) at empty_zones.c:326
#8  0x00007f9815df7e43 in fwd_configure_zone (set=0x7f97fc3ea0e0, inst=inst@entry=0x7f980555a618, name=0x7f98210c07c0 <root>) at fwd.c:599
#9  0x00007f9815df81f4 in fwd_reconfig_global (inst=inst@entry=0x7f980555a618) at fwd.c:655
#10 0x00007f9815e013c0 in ldap_parse_serverconfigentry (inst=0x7f980555a618, entry=0x7f9812fbfd70) at ldap_helper.c:1535
#11 update_serverconfig (task=<optimized out>, event=<optimized out>) at ldap_helper.c:3844
#12 0x00007f9820a300ef in dispatch (manager=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1157
#13 run (uap=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1331
#14 0x00007f981de2617a in start_thread () from /lib64/libpthread.so.0
#15 0x00007f981d7eddc3 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f98188bf700 (LWP 79415)):
#0  0x00007f9820d2f513 in dns_rbtnodechain_next (chain=chain@entry=0x7f98188bcad0, name=name@entry=0x0, origin=origin@entry=0x0) at ../../../lib/dns-pkcs11/rbt.c:3516
#1  0x00007f9820e2bb98 in dns_zt_apply2 (zt=<optimized out>, stop=stop@entry=true, sub=sub@entry=0x0, action=action@entry=0x55cc88636e10 <zone_xmlrender>, uap=uap@entry=0x7f97f710e210) at ../../../lib/dns-pkcs11/zt.c:534
#2  0x00007f9820e2bf25 in dns_zt_apply (zt=<optimized out>, stop=stop@entry=true, action=action@entry=0x55cc88636e10 <zone_xmlrender>, uap=uap@entry=0x7f97f710e210) at ../../../lib/dns-pkcs11/zt.c:496
#3  0x000055cc886365ec in generatexml (server=<optimized out>, flags=2, buflen=buflen@entry=0x7f98188bebbc, buf=buf@entry=0x7f98188bebc0) at ../../../bin/named-pkcs11/statschannel.c:1842
#4  0x000055cc88636b55 in render_xml (flags=<optimized out>, arg=<optimized out>, retcode=0x7f97f558d518, retmsg=0x7f97f558d520, mimetype=0x7f97f558d510, b=0x7f97f558d528, freecb=0x7f97f558d568, freecb_args=0x7f97f558d570, 
    headers=<optimized out>, querystring=<optimized out>, urlinfo=<optimized out>, url=<optimized out>) at ../../../bin/named-pkcs11/statschannel.c:1995
#5  0x00007f9820a14040 in isc_httpd_recvdone (task=0x7f9821419970, ev=<optimized out>) at ../../../lib/isc-pkcs11/httpd.c:1015
#6  0x00007f9820a300ef in dispatch (manager=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1157
#7  run (uap=0x7f98214b3010) at ../../../lib/isc-pkcs11/task.c:1331
#8  0x00007f981de2617a in start_thread () from /lib64/libpthread.so.0
#9  0x00007f981d7eddc3 in clone () from /lib64/libc.so.6

Thread 1 is the crashing thread. I think thread 3 is responsible for modification of zt. Strange is zone manager seems to lock zone table and zone properly. But I haven't found any trace of zone locking or zone table locking in render_xml function from statschannel code. It might be a bug in BIND code, which just normal named zones do not hit. I am not sure, I would query upstream for any pointers. There has to be some kind of protection, but I don't see it.

Comment 9 Petr Menšík 2022-07-21 10:33:42 UTC
Created upstream issue. I am not sure they would be willing to help us, because not a small difference is caused by the different way of bind-dyndb-ldap plugin. It is possible upstream just do not have any protection when rendering XML zone statistics, because zone manager changes never happens the way bind-dyndb-ldap does them.

In 9.16.30, dns_zonemgr_releasezone is called only from zone_shutdown, which is called asynchronously when zone references reach zero. That would ensure statistics renderer does not enter such zone. But we call it also in different cases (Thread 3), where such protection does not exist. Because such action does not seem to hold any lock tested also by render_xml, we hit a race condition.

Comment 10 Petr Menšík 2022-08-04 19:14:48 UTC
I think this issue is inside bind-dyndb-ldap. It seems to me removing zone should rely on zone_shutdown asynchronous call, which is emitted once all users detach from the zone. That should ensure no other user can be using the zone when removed, thus invalidating its pointers.

If there is explicit reason for immediate call to zonerelease, it would have to be done different way.

Created PR 213 [1] with a proposed fix.

1. https://pagure.io/bind-dyndb-ldap/pull-request/213

Comment 11 Petr Menšík 2022-08-05 11:15:44 UTC
Proposed also few changes in named, which should help avoiding use of zone register when zones are removed:
https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6637

Comment 12 Petr Menšík 2022-08-23 20:49:34 UTC
Moving back to bind component. I think bind-dyndb-ldap change would be just a cosmetic one, not required to prevent this sort of crash.

But change in comment #11 is needed to preserve zone structure while it is being used by stats code. It would help both in XML and json statistics rendering. Even without bind-dyndb-ldap part, it would ensure zone is not freed. Normal bind does never call dns_zonemgr_releasezone until zone references reaches zero. Zonemgr even uses different usage counter irefs instead of erefs protected by atomics.

I think only bind change is mandatory. I am not sure why normal named does not hit problems with statistics. Perhaps some exclusive mode lock ensures statistics rendering would be finished before zones shutdown proceeds. But I think that should not be necessary, statistics should use more obvious way to keep reference to zone working until they are finished.

Comment 16 Petr Menšík 2022-08-25 11:10:02 UTC
Upstream has proposed only locking zone table as the only protection. I admit it might work, because zone eref counter is decremented during zone unmount from the table. That requires write lock, which cannot be granted when read lock is being held.

More in PR: https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6695

Comment 17 Petr Menšík 2022-09-14 10:06:56 UTC
I have checked the catz functions in named. They use always isc_task_beginexclusive() and isc_task_endexclusive() during modification of zone tables. Because it is similar to what bind-dyndb-ldap does, I think bind-dyndb-ldap should copy such behaviour.

Comment 19 Petr Menšík 2022-09-14 16:01:10 UTC
Adding also related bind-dyndb-ldap change [1], which should help also. I think both bind and bind-dyndb-ldap fixes should be included.

1. https://pagure.io/bind-dyndb-ldap/pull-request/214

Comment 23 Petr Menšík 2022-09-27 17:19:26 UTC
I would recommend workaround until this is fixed. Systemd is able to restart the service on ABRT termination, where it was not intended to stop. It will still make limited outage, but much less noticeable than if waiting for manual restart.

Run command:
systemctl edit named-pkcs11.service

Then inside editor, write line between comments:
[Service]
Restart=on-abnormal

Save and quit the editor and check

Double check the change is active:
systemctl show named-pkcs11.service -p Restart

From that point it would automatically restart named if it crashed. Because reported crashes do not happen often, it should work well enough until the fix arrives.

Comment 43 errata-xmlrpc 2023-05-16 09:09:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: bind security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3002


Note You need to log in before you can comment on or make changes to this bug.