Bug 1854473 - named-pkcs11 service crashes when a reload is performed.
Summary: named-pkcs11 service crashes when a reload is performed.
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: bind-dyndb-ldap
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Rafael Jeffman
QA Contact: ipa-qe
URL:
Whiteboard:
Duplicates: 1891410 (view as bug list)
Depends On:
Blocks: 2131184 1780662
 
Reported: 2020-07-07 14:14 UTC by Arya Rajendran
Modified: 2023-08-16 13:26 UTC (History)
CC: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FREEIPA-7215 0 None Waiting on Customer ROSA Cluster setup issues 2022-04-12 13:33:26 UTC
Red Hat Issue Tracker RHELPLAN-53499 0 None None None 2022-11-17 13:10:43 UTC

Comment 12 Ding-Yi Chen 2020-10-27 00:54:30 UTC
*** Bug 1891410 has been marked as a duplicate of this bug. ***

Comment 18 Petr Čech 2021-03-02 20:33:10 UTC
Thank you for taking your time and submitting this request for Red Hat Enterprise Linux 7.
Red Hat Enterprise Linux 7 is in Maintenance Support 2 Phase. This bug was reevaluated and will be postponed to RHEL 8.
Thank you for understanding.
Red Hat Enterprise Linux Identity Management Team

Comment 20 Giedrius Tuminauskas 2021-04-26 19:20:28 UTC
The problem still exists and happens occasionally, every 2-4 weeks.
I have created an issue directly on the ISC GitLab:
https://gitlab.isc.org/isc-projects/bind9/-/issues/2655

Comment 21 Mathieu Baudier 2021-08-05 08:22:43 UTC
We have the same problem, on both RHEL 8 and CentOS 8 Stream IPA instances.

It always seems to happen shortly after this step of the initialisation:
named-pkcs11[1370]: 9 master zones from LDAP instance 'ipa' loaded (9 zones defined, 0 inactive, 0 failed to load)

I am happy to help to test further, esp. on CentOS Streams.
Are there additional logs we should gather?

Our setup is a bit experimental (but not overly complex), so we may have run into a corner case.
We would really like to understand better the problem before going into production.

Comment 22 Mathieu Baudier 2021-08-05 12:12:18 UTC
On an up-to-date CentOS 8 Stream, this is now systematic, with each restart of the whole environment.
After a few restarts of IPA (systemctl restart ipa), at some point named is able to start without crashing.

I have installed
dnf install bind-pkcs11-*debuginfo bind-libs-debuginfo

and then generated a backtrace in the ABRT directory.

Would it be useful?
I can see the names of our internal networks in some places, so I am not too keen to upload it here.

Comment 23 Mathieu Baudier 2021-11-03 05:10:31 UTC
We finally solved this issue a few weeks ago and I waited a bit before reporting, in order to make sure that it was stable.

This was due to configuration issues on our side. The problem is that the crash makes the root cause difficult to analyse, and the disappearance of DNS of course snowballs into indirect issues everywhere.

As far as I understood our problems:

1) The main issue was clearly the query authorisations in one of our internal zones. I had forgotten to give access to the WAN addresses of a new replica which was in another datacenter. As soon as I fixed this, the nearly systematic crash on this replica stopped.

2) With that understanding, I reviewed all replicas in order to make their IP configuration more robust. We have two kinds of replicas: "main" replicas with their hostname pointing to public IPv4 and IPv6 addresses (+ an internal IPv6 ULA address on another interface), and "proxy" replicas with only an internal IPv6 ULA address (these replicas are the ones actually used as DNS by our internal services). I modified this configuration by adding public IPv4 and IPv6 addresses to the "proxy" replicas, so that they would have smoother access to the "main" ones. So, basically now all our replicas have public IPv4, public IPv6, and internal ULA IPv6; the difference is that the hostname points to the public addresses for the "main" ones, and to the internal ULA address for the "proxy" ones.

I don't know if the second point was directly responsible for these issues, but since then the intermittent crashes of named-pkcs11 that we had on all replicas have completely stopped. We have been using these settings for weeks now, not changing anything at the network level or with DNS query authorisations, but using IPA intensively (for internal DNS, Kerberos, PKI, etc.) and there has not been a single issue. The servers are just running flawlessly (both on RHEL 8 and CentOS Streams 8) until we reboot them after an upgrade.

I hope these details can help developers with analysing the issue, or other users with working around it.

Comment 24 Scott Serr 2021-11-09 17:07:53 UTC
(In reply to Mathieu Baudier from comment #23)

> 1) The main issue was clearly the query authorisations in one of our
> internal zones. I had forgotten to give access to the WAN addresses of a new
> replica which was in another datacenter. As soon as I fixed this, the nearly
> systematic crash on this replica stopped.

Mathieu,

Can you give more detail on "give access to the WAN address of a new replica"?

I don't understand where you made this change.  We also have this issue.

Thanks,
Scott

Comment 25 Mathieu Baudier 2021-11-10 05:34:15 UTC
> > 1) The main issue was clearly the query authorisations in one of our
> > internal zones. I had forgotten to give access to the WAN addresses of a new
> > replica which was in another datacenter. As soon as I fixed this, the nearly
> > systematic crash on this replica stopped.
> 
> Can you give more detail on "give access to the WAN address of a new
> replica"?
> 
> I don't understand where you made this change.  We also have this issue.

In our case the complexity comes from having a mix of:
- public (or more generally, WAN) replicas and internal replicas
- unrestricted DNS zones and restricted DNS zones

The hostnames of public replicas are in an unrestricted zone (say id.example.com), with public IPv4 and IPv6 records.
The hostnames of internal replicas are in a restricted zone (say internal.example.com), with ULA IPv6 records.

The internal DNS zone is protected with the "Allow query" entries.
(In the Web UI, you find it in the settings of the zone, #/e/dnszone/details/internal.example.com.)

From a functional point of view, we only have a single relevant "Allow query", which gives access to the whole internal IPv6 ULA network (say, fd11:5334:d95f::/48 ; in IPv4 that would be a private range like 192.168.0.0/24 etc.).

BUT, the subtlety is that during replication, all replicas need to be able to query all DNS zones from any other replica.

Between internal replicas there is no problem: they know each other via their internal ULA addresses, and thus their DNS queries are coming from the internal network.
But as soon as a public/WAN replica is involved, the other replicas (whether public or internal) will try to query its DNS via its IPA hostname, which resolves to the public IPv4 and IPv6 addresses. 

Therefore, for the restricted zones, one should also "Allow query" for the public/WAN IP addresses (or IP ranges) of all replicas, considered *as DNS clients*.
Between public replicas, this is basically their "main" IP addresses (to which their IPA hostname resolves) which should be allowed.
But for the internal ones, one needs to consider the non-ULA/non-internal addresses that they will use in order to access the public replicas, not the internal IPs their IPA hostname resolves to.

In our case, to simplify/clarify our configs, we now have for each replica:
- fixed public IPv4 and IPv6 addresses
- a fixed (stable-privacy) internal IPv6 ULA

The only difference between public/WAN replicas and internal replicas is whether their IPA hostname maps to public addresses in an unrestricted zone, or to an internal address in a restricted zone. In the restricted zone, we "Allow query" for all these public IPv4 and IPv6 addresses (or related IP ranges that we control) in addition to the internal network (and localhost, which is anyhow needed as soon as you restrict a zone).

(Also, for completeness: another difference is that the internal replicas can be used as a DNS forwarder from the internal network. This is configured elsewhere via bind config files on the instance itself. My understanding is that this is irrelevant for our current discussion.)

So, I don't know whether there are similarities with your context, but to summarize:
when configuring restricted zones, consider *from which* address each replica will talk to any other replica, and allow these addresses too.
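The check described above can be sketched as a small script (a minimal sketch with hypothetical addresses and replica names; only Python's standard ipaddress module is used): given the restricted zone's allow-query ACL and the source address each replica uses *as a DNS client* when querying another replica, list any replica whose outgoing address the ACL does not cover.

```python
import ipaddress

# Hypothetical allow-query ACL for the restricted zone internal.example.com:
# the internal ULA network plus localhost (needed once the zone is restricted).
allow_query = [
    ipaddress.ip_network("fd11:5334:d95f::/48"),  # internal IPv6 ULA range
    ipaddress.ip_network("::1/128"),              # localhost
]

# Hypothetical source addresses each replica uses as a DNS client when it
# queries a public/WAN replica. Note these are the outgoing public addresses,
# not necessarily the addresses the replica's own hostname resolves to.
replica_sources = {
    "proxy1.internal.example.com": ipaddress.ip_address("2001:db8:1::10"),
    "proxy2.internal.example.com": ipaddress.ip_address("fd11:5334:d95f::20"),
}

def missing_from_acl(sources, acl):
    """Return the replicas whose outgoing address no ACL entry covers."""
    return [name for name, addr in sources.items()
            if not any(addr in net for net in acl)]

print(missing_from_acl(replica_sources, allow_query))
# → ['proxy1.internal.example.com']
```

The fix from comment 23 then corresponds to adding the replicas' public addresses (or the ranges they fall in) to the "Allow query" list until this check comes back empty.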

Comment 33 Trivino 2022-05-10 07:37:50 UTC
Our wish would be to fix it in 8.7, but due to the non-trivial nature of this bug and our capacity, the target release is planned for 8.8.

Comment 34 Mathieu Baudier 2022-05-10 07:48:49 UTC
Thanks for your feedback!

As we started to have this problem again, our approach has been:
- ditch all CentOS Stream IPA instances and use only RHEL (in order to clarify the setup and avoid incompatibilities when new features arrive in CentOS Stream)
- remove IPA DNS from all internal IPA instances: regular internal VMs now access the DNS of the public IPA instances via their internal interface (so, as a "plain" DNS server), and use the internal IPA instances as their IPA server for Kerberos, CA, etc.

There has not been any problem for months. But the useful features that we have therefore lost are locations and autoconfiguration by DNS. This is not a big deal since our deployment is quite small and static, and we really had to move on.

I am happy to help you with testing when you get to that point!
I can either deploy CentOS Stream (on a separate dedicated domain) or rebuild RPMs from source.

Comment 35 Sam Morris 2022-11-17 13:05:44 UTC
I'm seeing this on RHEL 9.1 - let me know if you want logs, core files, a separate bug to track this for RHEL9, etc.

Comment 36 Rafael Jeffman 2022-11-18 12:42:14 UTC
@sam.uk, core files would be nice to have to compare to existing ones.

Comment 37 Sam Morris 2022-11-18 19:31:09 UTC
Do you mind if I mail them to you privately?

The stack of the crashing thread looks like:

                Stack trace of thread 78580:
                #0  0x00007f0bc303e54c __pthread_kill_implementation (libc.so.6 + 0xa154c)
                #1  0x00007f0bc2ff1ce6 raise (libc.so.6 + 0x54ce6)
                #2  0x00007f0bc2fc57f3 abort (libc.so.6 + 0x287f3)
                #3  0x000055f14f0015b5 assertion_failed.cold (named + 0x1c5b5)
                #4  0x00007f0bc3849420 isc_assertion_failed (libisc-9.16.23-RH.so + 0x1c420)
                #5  0x00007f0bc3a1e46a dns_rdataset_first (libdns-9.16.23-RH.so + 0x10146a)
                #6  0x00007f0bc3a2d396 fctx_try.lto_priv.0 (libdns-9.16.23-RH.so + 0x110396)
                #7  0x00007f0bc388419d isc_task_run (libisc-9.16.23-RH.so + 0x5719d)
                #8  0x00007f0bc386f2a9 process_netievent (libisc-9.16.23-RH.so + 0x422a9)
                #9  0x00007f0bc386f425 process_queue (libisc-9.16.23-RH.so + 0x42425)
                #10 0x00007f0bc386fc17 async_cb (libisc-9.16.23-RH.so + 0x42c17)
                #11 0x00007f0bc35e7b3d uv__async_io.part.0 (libuv.so.1 + 0xab3d)
                #12 0x00007f0bc360385e uv__io_poll.part.0 (libuv.so.1 + 0x2685e)
                #13 0x00007f0bc35ed5a8 uv_run (libuv.so.1 + 0x105a8)
                #14 0x00007f0bc386f4db nm_thread (libisc-9.16.23-RH.so + 0x424db)
                #15 0x00007f0bc3881f7a isc__trampoline_run (libisc-9.16.23-RH.so + 0x54f7a)
                #16 0x00007f0bc303c802 start_thread (libc.so.6 + 0x9f802)
                #17 0x00007f0bc2fdc450 __clone3 (libc.so.6 + 0x3f450)

Comment 38 Francis Augusto Medeiros-Logeay 2022-12-13 18:28:54 UTC
I also have some core dump files. Let me know where I can send them to (I'd rather not have them publicly available).

Best,

Francis

Comment 41 Chance Callahan 2023-05-20 19:27:09 UTC
@sam.uk @r_f 

Feel free to email them to ccallaha and I'll make sure they are added. If you can, please also include the output of "rpm -qa". Thanks!

Comment 43 James Petrini 2023-08-16 10:31:14 UTC
Hello, 

I have inherited the support case 02657163 which opened this BZ in 2020 against RHEL 7, and which is the first customer in the above table.

I have explained that it will not be fixed for RHEL 7 and that it won't be in RHEL 8.9.

They are asking if this fix will be released in version 8.10. The basic plan shows a RHEL 8.10 release date of mid-2024. Is this a fair estimate to share with the customer?

