Bug 1999321 - DNS often stops resolving properly after FreeIPA server upgrade to Fedora 35 or 36
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: freeipa
Version: 37
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: IPA Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://ask.fedoraproject.org/t/we-ar...
Depends On:
Blocks:
 
Reported: 2021-08-30 21:32 UTC by Adam Williamson
Modified: 2023-12-07 11:07 UTC
CC: 22 users

Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-07 11:07:53 UTC
Type: Bug
Embargoed:


Attachments
/var/log tarball from client after bad state reached (2.15 MB, application/octet-stream)
2021-08-30 21:38 UTC, Adam Williamson
/var/log tarball from server after bad state reached (5.95 MB, application/octet-stream)
2021-08-30 21:40 UTC, Adam Williamson
tarball of /var/named (server side) from a failed run (15.67 KB, application/octet-stream)
2021-09-03 18:44 UTC, Adam Williamson
Worker detail (230.92 KB, image/png)
2021-09-17 11:17 UTC, Lukas Ruzicka
Waiting workers (281.97 KB, image/png)
2021-09-17 14:39 UTC, Lukas Ruzicka
ipa-healthcheck log files (42.10 KB, application/zip)
2021-09-27 19:09 UTC, David Yaffe
Stack trace files from ipa-01 (24.72 KB, application/x-bzip)
2021-10-01 14:39 UTC, David Yaffe
stack trace files from ipa-02 (14.66 KB, application/x-bzip)
2021-10-01 14:40 UTC, David Yaffe


Links
Red Hat Issue Tracker FREEIPA-6883 (Closed): [Satellite] Error to install erratas (last updated 2022-06-19 01:13:41 UTC)

Description Adam Williamson 2021-08-30 21:32:45 UTC
openQA is commonly encountering an issue in tests of a scenario where a FreeIPA server is deployed on Fedora 33 or 34 and then upgraded to 35 or 36. It seems that after the upgrade, the server stops returning correct DNS responses: it will respond to a query, but the response is empty. Here's an example:

https://openqa.stg.fedoraproject.org/tests/1297421 (server test)
https://openqa.stg.fedoraproject.org/tests/1297422 (client test)

in that test, I rigged the client end to ping the server and run a DNS query (for fedoraproject.org) more or less once per second after it had finished enrolling as a client. While this is running, the server upgrades itself from 34 to 35. For the first few minutes, while the server is still running 34 and is just downloading packages for the upgrade, the ping is returned and the DNS query works correctly, giving the expected list of four IP addresses (note this is happening within Fedora infra, so the IPs returned are the internal ones):

https://openqa.stg.fedoraproject.org/tests/1297422#step/realmd_join_sssd/45

then, while the server is actually upgrading, neither pings nor DNS queries are replied to, as you'd expect:

https://openqa.stg.fedoraproject.org/tests/1297422#step/realmd_join_sssd/313

but after the server has upgraded and rebooted and is up again, we reach a state where the ping is returned and a DNS query is answered, but the answer is empty (no IP addresses given):

https://openqa.stg.fedoraproject.org/tests/1297422#step/realmd_join_sssd/457

the test fails shortly afterwards, when the client continues on with what it would normally do next, which is to download the packages from the update being tested; this fails because it cannot resolve bodhi.fedoraproject.org, due to this DNS problem.

This doesn't happen on every test, but if you look at the list at https://openqa.fedoraproject.org/tests/963222#next_previous , it happens about 50% of the time, so it's a significant problem.

I'll attach /var/log tarballs from both ends of a failed test. I don't see anything obvious in them, but maybe someone else will. We might need to tweak bind config to get more info, I guess.

Proposing as a Beta blocker as a conditional violation of Beta criterion "Additionally, a system deployed as a FreeIPA domain controller must: ... Serve DNS host records on port 53 (assuming it is deployed with DNS capability)" combined with the upgrade requirement "It must be possible to successfully complete a direct upgrade from a fully updated installation of each of the last two stable Fedora Server releases with the system configured as a FreeIPA domain controller or postgresql server as specified in the relevant criteria. The upgraded system must meet all relevant release criteria, including criteria relating to functionality of the server software."

Comment 1 Adam Williamson 2021-08-30 21:38:34 UTC
Created attachment 1819210 [details]
/var/log tarball from client after bad state reached

Comment 2 Adam Williamson 2021-08-30 21:40:15 UTC
Created attachment 1819212 [details]
/var/log tarball from server after bad state reached

Comment 3 Adam Williamson 2021-08-30 21:58:32 UTC
Hmm, we're actually supposed to be getting debug logs from bind via a config ab suggested. We write a service config which should cause bind to log at debug level to named.run:

https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/tests/role_deploy_domain_controller.pm#_30

and then upload named.run in the post-fail hook:

https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/lib/installedtest.pm#_112

but it's showing up as a 0-length file, in these tests at least. I'm not sure why. I think this *did* work when we first set it up.

Comment 4 Alexander Bokovoy 2021-08-31 07:44:44 UTC
There are no visible issues with named after upgrade:

Aug 27 20:18:08 ipa001 named[1042]: zone 2.16.172.in-addr.arpa/IN: loaded serial 1630109888
Aug 27 20:18:08 ipa001 named[1042]: zone test.openqa.fedoraproject.org/IN: loaded serial 1630109888

it loaded two zones from IPA LDAP as needed. What happens, however, is that a modification of the zones is detected but the result apparently ends up empty, and that causes a Python exception which is ignored:

...
Aug 27 20:18:52 ipa001 ipactl[729]: ipa: INFO: The ipactl command was successful
Aug 27 20:18:52 ipa001 ipactl[729]: Starting Directory Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting krb5kdc Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting kadmin Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting named Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting httpd Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting ipa-custodia Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting pki-tomcatd Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting ipa-otpd Service
Aug 27 20:18:52 ipa001 ipactl[729]: Starting ipa-dnskeysyncd Service
...
Aug 27 20:18:56 ipa001 ipa-dnskeysyncd[1541]: ipalib.plugable: DEBUG    importing all plugin modules in ipaserver.plugins...
...
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: DEBUG    Kerberos principal: ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG    Initializing principal ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org using keytab /etc/ipa/dnssec/ipa-dnskeysyncd.keytab
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG    using ccache /tmp/ipa-dnskeysyncd.ccache
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG    Attempt 1/5: success
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: DEBUG    LDAP URL: ldapi://%2Frun%2Fslapd-TEST-OPENQA-FEDORAPROJECT-ORG.socket/cn%3Ddns%2Cdc%3Dtest%2Cdc%3Dopenqa%2Cdc%3Dfedoraproject%2Cdc%3Dorg??sub?%28%7C%28objectClass%3DidnsZone%29%28obj
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: INFO     LDAP bind...
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: INFO     Commencing sync process
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG    Current cookie is: None (not received yet)
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG    Detected add of entry: idnsname=test.openqa.fedoraproject.org.,cn=dns,dc=test,dc=openqa,dc=fedoraproject,dc=org 85059d0f-0792-11ec-94a5-b25ee1ca36b3
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG    Detected add of entry: idnsname=2.16.172.in-addr.arpa.,cn=dns,dc=test,dc=openqa,dc=fedoraproject,dc=org 85059d13-0792-11ec-94a5-b25ee1ca36b3
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG    Detected add of entry: ipk11UniqueID=8969c66e-0792-11ec-9060-52540012010e,cn=keys,cn=sec,cn=dns,dc=test,dc=openqa,dc=fedoraproject,dc=org 85059d2a-0792-11ec-94a5-b25ee1ca36
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG    New cookie is: ipa001.test.openqa.fedoraproject.org:389#krbprincipalname=ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org.fedoraproject.org,cn=services,cn=a
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.keysyncer: INFO     Initial LDAP dump is done, sychronizing with ODS and BIND
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG    Starting external process
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG    args=['/usr/libexec/ipa/ipa-dnskeysync-replica']
...
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG    Process finished, return code=0
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG    stdout=
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG    stderr=ipalib.plugable: DEBUG    importing all plugin modules in ipaserver.plugins...
...
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    Kerberos principal: ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG    Initializing principal ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org using keytab /etc/ipa/dnssec/ipa-dnskeysyncd.keytab
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG    using ccache /tmp/ipa-dnskeysync-replica.ccache
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG    Attempt 1/5: success
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    Got TGT
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    Connecting to LDAP
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    Connected
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    master keys in local HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    master keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    new master keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: INFO     sync_metadata: keys in local HSM & LDAP: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    zone keys in local HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    zone keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG    new zone keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: INFO     sync_metadata: keys in local HSM & LDAP: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: Exception ignored in: <function LocalHSM.__del__ at 0x7ff4345be440>
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: Traceback (most recent call last):
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]:  File "/usr/lib/python3.10/site-packages/ipaserver/dnssec/localhsm.py", line 107, in __del__
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]:  File "/usr/lib/python3.10/site-packages/ipaserver/p11helper.py", line 939, in finalize
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: TypeError: 'NoneType' object is not callable
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.bindmgr: DEBUG    Key metadata in LDAP: {}
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.bindmgr: DEBUG    Zones modified but skipped during bindmgr.sync: set()

the exception happens in the following line:

    def finalize(self):
        """
        Finalize operations with pkcs11 library
        """
        if self.p11 == NULL:
            return

        #
        # Logout
        #
        rv = self.p11.C_Logout(self.session)
>>>>    check_return_value(rv, "log out")

        #
        # End session
        #
        rv = self.p11.C_CloseSession(self.session)
        check_return_value(rv, "close session")

        #
        # Finalize
        #
        self.p11.C_Finalize(NULL)

        self.p11_ptr[0] = NULL
        self.session_ptr[0] = 0
        self.module_handle = None

and then in LocalHSM.__del__() the instance's self.p11 is None, so self.p11.finalize() could not be called. But I cannot understand how the call to finalize() actually happened in the first place if self.p11 is None...
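The mechanism being puzzled over here can be demonstrated outside FreeIPA with a tiny sketch (hypothetical names, not the actual p11helper code): if a module-level function that a method relies on has been rebound to None, which CPython can do to module globals very late during interpreter shutdown, calling the method raises exactly this kind of TypeError:

```python
# Minimal illustration with made-up names (not the FreeIPA code):
# a method that calls a module-level function fails with
# "TypeError: 'NoneType' object is not callable" once that global
# has been cleared, as can happen during interpreter shutdown.

def check_return_value(rv, msg):
    if rv != 0:
        raise RuntimeError(msg)

class P11:
    def finalize(self):
        # the name is looked up in module globals at call time
        check_return_value(0, "log out")

errors = []
obj = P11()
obj.finalize()  # fine while the global is intact

# Simulate late-shutdown cleanup: the global is replaced with None.
check_return_value = None
try:
    obj.finalize()
except TypeError as exc:
    errors.append(str(exc))

print(errors)  # ["'NoneType' object is not callable"]
```

This doesn't prove that's what happens here, but it matches the shape of the ignored exception in the log.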

Comment 5 Adam Williamson 2021-08-31 21:43:24 UTC
So I agreed with ab that this is a very weird traceback, though I thought it was actually even weirder than he did (it can't be self.p11 that's None, because we're already *in that instance's finalize() method* in the traceback; ostensibly the only thing that could be None is `check_return_value`, because that's the only callable on the line where we encounter the error). Still, that's a function that clearly is defined, so it being None would be, uh, odd.

So I figured I'd debug it by logging status at every step of that `finalize()` method to a file, which involved adding this as line 934 in the file, right at the start of finalize():

        with open("/var/tmp/p11helper.log", "a") as logfh:

...and now, with that line, I get this error:

Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]: Exception ignored in: <function LocalHSM.__del__ at 0x7f16fbc963b0>
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]: Traceback (most recent call last):
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]:   File "/usr/lib/python3.10/site-packages/ipaserver/dnssec/localhsm.py", line 107, in __del__
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]:   File "/usr/lib/python3.10/site-packages/ipaserver/p11helper.py", line 934, in finalize
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]: NameError: name 'open' is not defined

there is no normal circumstance in Python where 'open' would not be defined. So I think what's actually happening here is that we're somehow reaching this code very, very late in the lifecycle of the entire process, while the interpreter itself is being shut down, and that's why we're getting these bizarre errors. I am not sure why this is happening. A possible hack would be to explicitly delete the `localhsm` object at the end of `localhsm.py`, rather than relying on garbage collection. I'll test that.

Comment 6 Adam Williamson 2021-08-31 23:13:41 UTC
I tried adding an explicit `del(localhsm)` but it didn't help, same traceback as above. :/

Comment 7 François Cami 2021-09-01 12:46:49 UTC
The "NameError: name 'open' is not defined" string led me to:
https://bugs.python.org/issue26789#msg380254

Quoting Victor:
"I pushed a change in Python 3.10 which reduces the risk of "NameError: name 'open' is not defined" when logging really late during the Python finalization.

It's a workaround on fragile finalization code. Please don't rely on it! You should release resources explicitly at exit. Don't rely on implicit __del__() finalizers."
+
"I chose to restrict this issue to the very specific case of NameError when calling logging.FileHandler.emit() late during the Python finalization.

If someone wants to enhance another function, please open a new issue."
+
"I will not backport my change since the Python finalization is really fragile and my logging fix rely on many enhancements made in the master that will not be backported to Python 3.9 and older on purpose. Changes on the finalization are also fragile and can introduce new subtile bugs."

Christian, could you please have a look?

Comment 8 Adam Williamson 2021-09-02 21:18:26 UTC
Looking at it again, I might not have added the del() call in the place where we need it to fix this issue. So I tweaked it and am trying again, we'll see how that goes.

Comment 9 Adam Williamson 2021-09-03 00:43:02 UTC
So I tried that three times and it worked, then a fourth time and it failed again :( The server logs from the fourth try don't seem to have the traceback, but I accidentally lost my hack on the server to upload the logs later, so I've restored that and am trying again; I'll confirm the state next time it fails. It's possible that the bug is not actually caused by these end-of-process failures at all?

Comment 10 Adam Williamson 2021-09-03 03:32:57 UTC
yeah, confirmed, my change makes the ignored exceptions go away, but the actual bug still happens :( Will attach the new server log tarball.

Comment 11 Adam Williamson 2021-09-03 03:33:58 UTC
Created attachment 1820191 [details]
server log tarball with python exception issue patched

Comment 12 Adam Williamson 2021-09-03 03:35:27 UTC
https://github.com/AdamWill/freeipa/commit/a02fa51e81773ae99593ac5d3def367871d63c99 is the change I made, BTW.

Comment 13 Alexander Bokovoy 2021-09-03 06:23:22 UTC
In the logs you uploaded I don't see any issues at all. bind did load the zones, ipa-dnskeysyncd shows no problems in operation, and so on. Perhaps the issue is somewhere else?
We should be generating two config snippets, one for NetworkManager and one for systemd-resolved, that point them to 127.0.0.1 (the IPA DNS server). In the messages I can see the former being used by NetworkManager:

Sep  2 21:15:40 ipa001 NetworkManager[717]: <info>  [1630631740.8781] Read config: /etc/NetworkManager/NetworkManager.conf (etc: zzz-ipa.conf)

and it looks like systemd-resolved at least knows to look into 127.0.0.1 for some zones:

Sep  2 21:15:41 ipa001 systemd-resolved[634]: Using degraded feature set UDP+EDNS0 instead of TLS+EDNS0 for DNS server 127.0.0.1.

so maybe they are working. However, the logs have no entries for any named queries.
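For context, the NetworkManager snippet mentioned above (zzz-ipa.conf) typically looks something like the following. Treat this as an illustration rather than the exact file; the contents vary between FreeIPA versions:

```ini
; /etc/NetworkManager/conf.d/zzz-ipa.conf (illustrative)
[main]
; let NetworkManager manage resolver configuration
dns=default

[global-dns-domain-*]
; send all DNS queries to the local IPA-managed BIND instance
servers=127.0.0.1
```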

Comment 14 Adam Williamson 2021-09-03 06:24:55 UTC
I suppose it might be interesting to know whether queries that the FreeIPA resolver can serve *itself* (i.e. hosts within the domain) still work, i.e. whether this only affects queries that should be forwarded... I'll look into that tomorrow. Any idea why the bind debug logging thing we set up doesn't seem to be working any more?

Comment 15 Alexander Bokovoy 2021-09-03 06:31:01 UTC
No idea. But you are not collecting /var/named/* and /var/log/named.log.

Comment 16 Adam Williamson 2021-09-03 15:35:40 UTC
No, because that was not what you told me to collect when we set this up :D We collect /var/named/data/named.run , which was supposed to be where the logging goes. It does *exist*, but it's a 0-length file.

Comment 17 Adam Williamson 2021-09-03 18:43:46 UTC
OK, so I tweaked things a bit more. I added a DNS query for the server's hostname to the cycle, and indeed, at the point where the server is no longer returning correct responses to a query for 'fedoraproject.org', it *does* return a correct response to a query for its own hostname (it returns its own IP address). So it seems the issue affects requests that need to be forwarded, but not ones the name server can resolve itself.

I also tweaked the server end to upload /var/named, and that does seem to have some interesting stuff in it. In particular there's a /var/named/data/query_errors.log, which has these juicy errors:

03-Sep-2021 13:55:31.360 info: client @0x7fa770067ed8 172.16.2.103#51317 (fedoraproject.org): query failed (broken trust chain) for fedoraproject.org/IN/A at ../../../lib/ns/query.c:7376
03-Sep-2021 13:55:47.495 info: client @0x7fa770067ed8 172.16.2.103#43512 (fedoraproject.org): query failed (broken trust chain) for fedoraproject.org/IN/A at ../../../lib/ns
(...etc...same error repeated many times)

a note on timing: 13:55 is exactly when the server comes back up after the upgrade, going by the timestamps on the queries on the client side. 13:45 is the last time it gets a (correct) response before the server reboots for the upgrade and goes down; 13:55 is the first time it gets a (wrong) response after the server is back up. So these errors definitely started showing up after the upgrade, when the responses started being wrong.

So, we might be looking at some kind of SSL certificate issue here or something? But it is only happening on upgrades; this bug is not happening in tests of a *fresh deployment* of FreeIPA on 35/Rawhide.

I'll attach the /var/named tarball.

Comment 18 Adam Williamson 2021-09-03 18:44:38 UTC
Created attachment 1820350 [details]
tarball of /var/named (server side) from a failed run

Comment 19 Alexander Bokovoy 2021-09-04 11:51:55 UTC
This error (broken trust chain) means DNSSEC validation is active but the trust chain cannot be validated.

From the tarballs' /var/named/data/dnssec.log:

03-Sep-2021 13:38:27.532 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out
03-Sep-2021 13:55:31.357 info: validating org/DS: no valid signature found
03-Sep-2021 13:55:31.359 info: validating org/DS: no valid signature found
03-Sep-2021 13:55:31.359 info: validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:55:38.915 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out
03-Sep-2021 13:55:47.495 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:56:02.577 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:56:17.568 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:56:32.727 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:56:47.771 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:57:02.861 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:57:17.995 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:57:33.096 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:03.203 info: validating org/DS: no valid signature found
03-Sep-2021 13:58:03.204 info: validating org/DS: no valid signature found
03-Sep-2021 13:58:03.204 info: validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:07.845 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:09.204 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out
03-Sep-2021 13:58:20.111 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS)
03-Sep-2021 13:58:20.113 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS)
03-Sep-2021 13:58:22.369 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:37.574 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:52.653 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:53.209 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:53.217 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:59:07.706 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:59:16.154 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS)
03-Sep-2021 13:59:16.155 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS)
03-Sep-2021 13:59:22.807 info:       validating org/DNSKEY: bad cache hit (org/DS)

This is what we see asked by the clients:
03-Sep-2021 13:59:16.151 info: client @0x7fed681d7058 127.0.0.1#47293 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN AAAA +E(0) (127.0.0.1)
03-Sep-2021 13:59:16.151 info: client @0x7fed70a97ff8 127.0.0.1#50497 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN A +E(0) (127.0.0.1)
03-Sep-2021 13:59:16.155 info: client @0x7fed681d7058 127.0.0.1#47293 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN AAAA +E(0) (127.0.0.1)
03-Sep-2021 13:59:16.155 info: client @0x7fed681d7058 127.0.0.1#47293 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN AAAA + (127.0.0.1)
03-Sep-2021 13:59:16.155 info: client @0x7fed70a97ff8 127.0.0.1#50497 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN A +E(0) (127.0.0.1)
03-Sep-2021 13:59:16.156 info: client @0x7fed70a97ff8 127.0.0.1#50497 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN A + (127.0.0.1)
03-Sep-2021 13:59:22.792 info: client @0x7fed71bed2b8 172.16.2.103#54345 (fedoraproject.org): query: fedoraproject.org IN A +E(0)K (172.16.2.100)
03-Sep-2021 13:59:28.022 info: client @0x7fed71bed2b8 172.16.2.103#46665 (ipa001.test.openqa.fedoraproject.org): query: ipa001.test.openqa.fedoraproject.org IN A +E(0)K (172.16.2.100)

Looking at the previous logs, I can see that you are installing the IPA server without disabling DNSSEC validation:

2021-09-03T00:51:09Z DEBUG Logging to /var/log/ipaserver-install.log
2021-09-03T00:51:09Z DEBUG ipa-server-install was invoked with arguments [] and options: {'unattended': True, 'ip_addresses': None, 'domain_name': 'test.openqa.fedoraproject.org', 'realm_name': 'TEST.OPENQA.FEDORAPROJECT.ORG', 'host_name': None, 'ca_cert
_files': None, 'domain_level': None, 'setup_adtrust': False, 'setup_kra': False, 'setup_dns': True, 'idstart': None, 'idmax': None, 'no_hbac_allow': False, 'no_pkinit': False, 'no_ui_redirect': False, 'dirsrv_config_file': None, 'skip_mem_check': False,
'dirsrv_cert_files': None, 'http_cert_files': None, 'pkinit_cert_files': None, 'dirsrv_cert_name': None, 'http_cert_name': None, 'pkinit_cert_name': None, 'mkhomedir': False, 'ntp_servers': None, 'ntp_pool': None, 'no_ntp': False, 'force_ntpd': False, 's
sh_trust_dns': False, 'no_ssh': False, 'no_sshd': False, 'no_dns_sshfp': False, 'external_ca': False, 'external_ca_type': None, 'external_ca_profile': None, 'external_cert_files': None, 'subject_base': None, 'ca_subject': None, 'ca_signing_algorithm': No
ne, 'pki_config_override': None, 'allow_zone_overlap': True, 'reverse_zones': ['2.16.172.in-addr.arpa'], 'no_reverse': False, 'auto_reverse': False, 'zonemgr': None, 'forwarders': None, 'no_forwarders': False, 'auto_forwarders': True, 'forward_policy': N
one, 'no_dnssec_validation': False, 'no_host_dns': False, 'enable_compat': False, 'netbios_name': None, 'no_msdcs': False, 'rid_base': None, 'secondary_rid_base': None, 'ignore_topology_disconnect': False, 'ignore_last_of_role': False, 'verbose': False,
'quiet': False, 'log_file': None, 'uninstall': False}

e.g. 'no_dnssec_validation': False, so named is validating DNSSEC responses for the queries and rejecting the ones that are not trusted.

I think this comes from the F34 installation and was kept over the upgrade. Then systemd-resolved as the system resolver takes effect, and perhaps it is also enforcing DNSSEC validation on its own?
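For anyone reproducing this: DNSSEC validation on the IPA-managed BIND can be turned off either at install time with ipa-server-install --no-dnssec-validation (which is what the 'no_dnssec_validation' option above corresponds to), or after the fact in the BIND options block. The fragment below is illustrative; on an IPA server the setting lives in /etc/named.conf:

```
options {
        /* ... existing options ... */
        dnssec-validation no;   /* default is "yes"; "auto" uses the built-in root trust anchor */
};
```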

Comment 20 Adam Williamson 2021-09-06 15:14:52 UTC
+4 in https://pagure.io/fedora-qa/blocker-review/issue/419 , marking accepted.

Comment 21 Adam Williamson 2021-09-06 15:15:23 UTC
and yes, we don't explicitly specify anything either way about dnssec when deploying the server. The default will be used. This is the same on all releases.

Comment 22 François Cami 2021-09-07 09:00:12 UTC
Adam,

Would it be possible to test without systemd-resolved please?
e.g. before ipa-server-install:

# rm -f /etc/resolv.conf
# cp /run/systemd/resolve/resolv.conf /etc/resolv.conf

Comment 23 Christian Heimes 2021-09-07 09:26:26 UTC
As Adam already said, problems like "NameError: name 'open' is not defined" can occur during shutdown of the Python interpreter, when a finalizer function is called late in the shutdown process. The Python interpreter is shut down in multiple stages. The final steps of the lifecycle cleanup phase are not deterministic and can lead to bugs like the NameError exception.

It is recommended to avoid __del__ and not to rely on reference counting with implicit cleanups. The "with" statement with explicit __enter__ and __exit__ scopes is better suited for resource management. Other alternatives are weak references with callbacks or atexit hooks. If you have to use __del__, then the __del__ methods and all methods called by __del__ cannot rely on global or module local variables.
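Christian's recommendation can be sketched as follows (a minimal illustration with a hypothetical Session wrapper, not the FreeIPA API): register the cleanup explicitly with weakref.finalize, which runs on collection or at interpreter exit, before the late teardown phase that breaks __del__:

```python
# Sketch of the advice above with a made-up Session wrapper:
# release resources explicitly instead of relying on __del__.
import weakref

class Session:
    def __init__(self):
        self.handle = {"open": True}   # stands in for a PKCS#11 session
        # weakref.finalize runs its callback when the object is
        # collected or at interpreter exit. Note: the callback must
        # not hold a reference to self, so we pass only the handle.
        self._finalizer = weakref.finalize(self, Session._close, self.handle)

    @staticmethod
    def _close(handle):
        handle["open"] = False

    def close(self):
        self._finalizer()   # idempotent: runs the cleanup at most once

# Explicit close:
s = Session()
h = s.handle
s.close()
assert h["open"] is False

# Cleanup also happens if the object is simply dropped (CPython
# refcounting collects it immediately here):
s2 = Session()
h2 = s2.handle
del s2
assert h2["open"] is False
```

The same pattern also covers the "atexit hooks" alternative, since weakref.finalize registers an exit-time fallback automatically.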

Comment 24 Adam Williamson 2021-09-09 04:37:41 UTC
fcami - I tried without resolved, by putting back the code we used to use to do that all the time before FreeIPA was fixed to work with resolved:

            script_run "systemctl stop systemd-resolved.service";
            script_run "systemctl disable systemd-resolved.service";
            script_run "rm -f /etc/resolv.conf";
            script_run "systemctl restart NetworkManager";

the failure still happens, see https://openqa.stg.fedoraproject.org/tests/1315922 . do you need logs from this case also?

Comment 25 Alexander Bokovoy 2021-09-09 07:05:29 UTC
We already had an issue with DNSSEC validation to unrelated domains last year with https://pagure.io/fedora-infrastructure/issue/9411. 

In this case the IPA server host where named is running cannot reach the root DNS servers to get the DNSSEC key for the .org domain and more:

this is from your /var/named tarball, lame-servers.log:

03-Sep-2021 13:55:31.358 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.34#53
03-Sep-2021 13:55:31.359 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'org/DNSKEY/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DS/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DNSKEY/IN': 10.3.163.34#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.33#53
03-Sep-2021 13:55:47.495 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:02.577 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:17.568 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:32.727 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:47.771 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:02.861 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:17.995 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:33.096 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53

this is from the same tarball, dnssec.log:
03-Sep-2021 13:58:09.204 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out
03-Sep-2021 13:58:20.111 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS)
03-Sep-2021 13:58:20.113 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS)
03-Sep-2021 13:58:22.369 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:37.574 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:52.653 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:53.209 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:58:53.217 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:59:07.706 info:       validating org/DNSKEY: bad cache hit (org/DS)
03-Sep-2021 13:59:16.154 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS)
03-Sep-2021 13:59:16.155 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS)
03-Sep-2021 13:59:22.807 info:       validating org/DNSKEY: bad cache hit (org/DS)

If I run 'dig +dnssec +crypto +trace 2.fedora.pool.ntp.org' from my side, here is what I get:


; <<>> DiG 9.16.20-RH <<>> +dnssec +crypto +trace 2.fedora.pool.ntp.org
;; global options: +cmd
.			83325	IN	NS	a.root-servers.net.
.			83325	IN	NS	g.root-servers.net.
.			83325	IN	NS	k.root-servers.net.
.			83325	IN	NS	d.root-servers.net.
.			83325	IN	NS	b.root-servers.net.
.			83325	IN	NS	l.root-servers.net.
.			83325	IN	NS	m.root-servers.net.
.			83325	IN	NS	i.root-servers.net.
.			83325	IN	NS	f.root-servers.net.
.			83325	IN	NS	j.root-servers.net.
.			83325	IN	NS	h.root-servers.net.
.			83325	IN	NS	e.root-servers.net.
.			83325	IN	NS	c.root-servers.net.
;; Received 239 bytes from 127.0.0.53#53(127.0.0.53) in 6 ms

org.			172800	IN	NS	a0.org.afilias-nst.info.
org.			172800	IN	NS	a2.org.afilias-nst.info.
org.			172800	IN	NS	b0.org.afilias-nst.org.
org.			172800	IN	NS	b2.org.afilias-nst.org.
org.			172800	IN	NS	c0.org.afilias-nst.info.
org.			172800	IN	NS	d0.org.afilias-nst.org.
org.			86400	IN	DS	26974 8 2 4FEDE294C53F438A158C41D39489CD78A86BEB0D8A0AEAFF14745C0D 16E1DE32
org.			86400	IN	RRSIG	DS 8 1 86400 20210922050000 20210909040000 26838 . KqQbJs2Tj/xEWKWRnB28vSTEzVZKt7vJMsVuXKZby8X/pzGLCudt6l2M jwT36YWNf312gx/s5f3gZj4LXJXr3BrViT8LvLIVVM0Dgv0O4M3VQxeD Sk9zQ0fzrGMUuDJlcmFeOk+CHAuQJ+pRbe4Kpq3Fa2H8jULN7kLXwWNk lbrk1wQerX2jpoqCmy5/vtE+uL71Aa8DqOqVGG9MCXS6Ir0xPlRCOiVc xQtyzIXYDRhFZJcIMJyc5IBEFN6hmQj4SkoM2x7Q/j9Xzl+RizauV2+q vCvh+bQ24WXxejkKunVXAmOOTVX8HXCbPtgDLd6HBsap1t4a24ExbarU pQzuzA==
;; Received 787 bytes from 192.5.5.241#53(f.root-servers.net) in 5 ms

ntp.org.		86400	IN	NS	ns1.everett.org.
ntp.org.		86400	IN	NS	dns2.udel.edu.
ntp.org.		86400	IN	NS	anyns.pch.net.
ntp.org.		86400	IN	NS	ns3.p20.dynect.net.
ntp.org.		86400	IN	NS	dns1.udel.edu.
ntp.org.		86400	IN	NS	ns2.p20.dynect.net.
ntp.org.		86400	IN	NS	ns4.p20.dynect.net.
ntp.org.		86400	IN	NS	ns1.p20.dynect.net.
1i870vj5h429vj9pci7ar6e9gki74tr7.org. 86400 IN NSEC3 1 1 10 332539EE7F95C32A 1I87R64GAJU4O91MHKBU7I9EKBS7K8UT NS SOA RRSIG DNSKEY NSEC3PARAM
1i870vj5h429vj9pci7ar6e9gki74tr7.org. 86400 IN RRSIG NSEC3 8 2 86400 20210930065745 20210909055745 39681 org. OoqLJlC/o4Qbm8u9acQBI8mfetfRSUayji2GL4mEXOD0JG/K2TxnOWt7 tAYop425ab/aaj0EcSPgXNe/V1GxIL7gKKKR3O4Uz9mAEwHS6e6CKuKK Ga5wpcsH8SSIW1VWWCCE1LLEC3Uan7hokhtkPOcMv3UZhaNfUi8rYdWx de4=
lganq7sb2am5074kbdd23tmhq8bib2rk.org. 86400 IN NSEC3 1 1 10 332539EE7F95C32A LGATR75FR5K31LDKDA447JHEH6AFAVAN NS DS RRSIG
lganq7sb2am5074kbdd23tmhq8bib2rk.org. 86400 IN RRSIG NSEC3 8 2 86400 20210922152219 20210901142219 39681 org. c+E4fT8qV31p34ejoYa6qAX7l7z3ScW6i/E4DDXcCLAFfbrSID6aQ+za IkJYSaEAjB99xrylMB94Q9VImYC3bJiqm3dH/Wdo9n46rM5u3zKGgOWW sbCG/2R1QObcsybFdcyh8AqVIm3CDpG8iv8geamQINIvj96IJL2XcOKl D2I=
;; Received 762 bytes from 199.249.120.1#53(b2.org.afilias-nst.org) in 11 ms

pool.ntp.org.		604800	IN	NS	i.ntpns.org.
pool.ntp.org.		604800	IN	NS	a.ntpns.org.
pool.ntp.org.		604800	IN	NS	b.ntpns.org.
pool.ntp.org.		604800	IN	NS	c.ntpns.org.
pool.ntp.org.		604800	IN	NS	d.ntpns.org.
pool.ntp.org.		604800	IN	NS	e.ntpns.org.
pool.ntp.org.		604800	IN	NS	f.ntpns.org.
pool.ntp.org.		604800	IN	NS	g.ntpns.org.
pool.ntp.org.		604800	IN	NS	h.ntpns.org.
;; Received 243 bytes from 128.175.13.16#53(dns1.udel.edu) in 113 ms

2.fedora.pool.ntp.org.	150	IN	A	37.228.129.2
2.fedora.pool.ntp.org.	150	IN	A	95.216.218.15
2.fedora.pool.ntp.org.	150	IN	A	62.241.198.251
2.fedora.pool.ntp.org.	150	IN	A	194.100.49.152
;; Received 226 bytes from 185.126.112.98#53(a.ntpns.org) in 39 ms


I think this is not an issue in FreeIPA at all, but rather an artifact of the infrastructure limitations you have in openqa.fedoraproject.org. I remember that Red Hat datacenters do not allow direct access to the root DNS servers; maybe that is what we are seeing here?

A workaround is to instruct IPA to disable DNSSEC validation at install time with the --no-dnssec-validation option to ipa-server-install.
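For reference, a sketch of that workaround. The option name is per ipa-server-install; the config-file path for an existing deployment is an assumption based on where current FreeIPA writes its BIND options snippet, so verify it locally:

```shell
# Fresh deployment: disable DNSSEC validation in the embedded BIND
# (other required ipa-server-install options elided).
ipa-server-install --setup-dns --no-dnssec-validation

# Already-installed server: the same toggle lives in the IPA-managed
# BIND options snippet (assumed path; verify on your system):
#   /etc/named/ipa-options-ext.conf:   dnssec-validation no;
# then restart the DNS server:
#   systemctl restart named
```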

Comment 26 Adam Williamson 2021-09-09 07:20:56 UTC
But that does not explain why this bug only happens on upgrade to F35 or F36. It does not happen on fresh deployment on F35 or F36, and it does not happen in any circumstance on F33 or F34 (including upgrades from F33 to F34).

Comment 27 Adam Williamson 2021-09-09 07:22:28 UTC
well...hmm. if the issue is restricted to fedoraproject.org, that could possibly explain it. it may happen to be the case that this is the only path where we actually wind up resolving a fedoraproject.org domain via the FreeIPA server's nameserver. in other cases we may happen to only do that before the FreeIPA server is acting as the nameserver. I'll have to look into that tomorrow.

Comment 28 Alexander Bokovoy 2021-09-09 07:23:07 UTC
Can you please try disabling DNSSEC validation in the test? That will show whether it is the reason the NTP server entries are not returned in the first place.
There is definitely a problem with contacting the root servers.

Comment 29 Alexander Bokovoy 2021-09-09 07:29:25 UTC
BTW, test in https://openqa.stg.fedoraproject.org/tests/1315922 is a client machine and it is still running systemd-resolved:

Sep  9 00:22:32 client003 sssd_be[3016]: Backend is offline
Sep  9 00:24:33 client003 sssd_be[3016]: Backend is online
Sep  9 00:30:35 client003 systemd-resolved[615]: Grace period over, resuming full feature set (UDP+EDNS0) for DNS server 172.16.2.100.

(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [sdap_kinit_send] (0x0400): [RID#7] Attempting kinit (default, host/client003.test.openqa.fedoraproject.org, TEST.OPENQA.FEDORAPROJECT.ORG, 86400)
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [sdap_kinit_next_kdc] (0x1000): [RID#7] Resolving next KDC for service IPA
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#7] Trying to resolve service 'IPA'
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [get_server_status] (0x1000): [RID#7] Status of server 'ipa001.test.openqa.fedoraproject.org' is 'name resolved'
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [get_port_status] (0x1000): [RID#7] Port status of port 0 for server 'ipa001.test.openqa.fedoraproject.org' is 'working'
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [fo_resolve_service_activate_timeout] (0x2000): [RID#7] Resolve timeout [dns_resolver_timeout] set to 6 seconds
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [get_server_status] (0x1000): [RID#7] Status of server 'ipa001.test.openqa.fedoraproject.org' is 'name resolved'
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [be_resolve_server_process] (0x1000): [RID#7] Saving the first resolved server
(2021-09-09  0:30:52): [be[test.openqa.fedoraproject.org]] [be_resolve_server_process] (0x0200): [RID#7] Found address for server ipa001.test.openqa.fedoraproject.org: [172.16.2.100] TTL 1200

and then access to kojipkgs.fedoraproject.org fails:

2021-09-09T00:30:35-0400 DDEBUG Extra commands: ['-y', '--releasever=35', 'system-upgrade', 'download']
2021-09-09T00:30:35-0400 DEBUG User-Agent: constructed: 'libdnf (Fedora 34; server; Linux.x86_64)'
2021-09-09T00:30:35-0400 DEBUG repo: downloading from remote: fedora
2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml).
2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml).
2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml).
2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml).
2021-09-09T00:30:36-0400 WARNING Errors during downloading metadata for repository 'fedora':
  - Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org]
2021-09-09T00:30:36-0400 DDEBUG Cleaning up.

If the cause is DNSSEC validation: we have no .org DNSSEC key to build the trust chain, so kojipkgs.fedoraproject.org would fail to resolve with DNSSEC validation enabled, just like 2.fedora.pool.ntp.org.

Comment 30 Christian Heimes 2021-09-09 08:10:38 UTC
The problem could be related to the tighter crypto policies and the deprecation of weak algorithms such as SHA-1:

- deprecate derived properties: tls_cipher, ssh_cipher, ssh_group, ike_protocol, sha1_in_dnssec

https://src.fedoraproject.org/rpms/crypto-policies/c/3e079198c139b10d51f7d213652961a63704e384?branch=f35

Does any of the upstream DNS servers use old signatures or weak keys?

Comment 31 Alexander Bokovoy 2021-09-09 08:58:16 UTC
See above, I copied the corresponding .org entries here:

org.			86400	IN	DS	26974 8 2 4FEDE294C53F438A158C41D39489CD78A86BEB0D8A0AEAFF14745C0D 16E1DE32
org.			86400	IN	RRSIG	DS 8 1 86400 20210922050000 20210909040000 26838 . KqQbJs2Tj/xEWKWRnB28vSTEzVZKt7vJMsVuXKZby8X/pzGLCudt6l2M jwT36YWNf312gx/s5f3gZj4LXJXr3BrViT8LvLIVVM0Dgv0O4M3VQxeD Sk9zQ0fzrGMUuDJlcmFeOk+CHAuQJ+pRbe4Kpq3Fa2H8jULN7kLXwWNk lbrk1wQerX2jpoqCmy5/vtE+uL71Aa8DqOqVGG9MCXS6Ir0xPlRCOiVc xQtyzIXYDRhFZJcIMJyc5IBEFN6hmQj4SkoM2x7Q/j9Xzl+RizauV2+q vCvh+bQ24WXxejkKunVXAmOOTVX8HXCbPtgDLd6HBsap1t4a24ExbarU pQzuzA==
;; Received 787 bytes from 192.5.5.241#53(f.root-servers.net) in 5 ms

fedoraproject.org.	86400	IN	DS	16207 5 2 A7C9BF5AFE374C9650ED678F3D36931A7DE9256B86A7BC34D6DEED7D 4E492E5E
fedoraproject.org.	86400	IN	DS	16207 5 1 8DD099791A2A110851FDE5D14F6C62ADC3DD7C18
fedoraproject.org.	86400	IN	RRSIG	DS 8 2 86400 20210922152219 20210901142219 39681 org. JUYGOiOY2fcqhZBv66vS1lSzrTP8cQpFX6fkXlxPfSTlWAlkLYCON650 GcJ5JWukiVM66Mt6t+FCGANcma5XbttoQlRQZrMHBxUhsF2uBfLSsENH tj/9OxFJ9Ai/vJe2tUtY12/e/Rdd9C1Kp7obsLsTyKZrnLqZgzDceagh Ek8=
;; Received 506 bytes from 199.19.54.1#53(b0.org.afilias-nst.org) in 274 ms


DS 26974 8 2 uses algorithm 8 (RSA/SHA-256) and digest type 2 (SHA-256),

and

RRSIG DS 8 1 also uses algorithm 8 (RSA/SHA-256); the 1 that follows is the label count, not a digest type.

DS 16207 5 2 uses a SHA-256 digest too, although its algorithm 5 is RSA/SHA-1.

I have no problems resolving these in an F35 container with the DEFAULT or FIPS crypto policies using dig. named should be using the same underlying library.
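The numeric fields in the DS records quoted above can be decoded mechanically; a minimal sketch, with number-to-name tables taken from the relevant subset of the IANA DNSSEC algorithm and digest registries:

```python
# Subset of the IANA DNSSEC registries relevant to the records above.
ALGORITHMS = {5: "RSA/SHA-1", 7: "RSASHA1-NSEC3-SHA1",
              8: "RSA/SHA-256", 10: "RSA/SHA-512"}
DIGESTS = {1: "SHA-1", 2: "SHA-256", 4: "SHA-384"}

def decode_ds(key_tag, algorithm, digest_type):
    """Summarize a DS record's numeric fields in human-readable form."""
    return (f"DS key tag {key_tag}: algorithm {algorithm} "
            f"({ALGORITHMS.get(algorithm, 'unknown')}), "
            f"digest type {digest_type} ({DIGESTS.get(digest_type, 'unknown')})")

print(decode_ds(26974, 8, 2))   # the org. DS record
print(decode_ds(16207, 5, 2))   # fedoraproject.org., SHA-256 digest
print(decode_ds(16207, 5, 1))   # fedoraproject.org., SHA-1 digest
```

Note that the fedoraproject.org DS records use signing algorithm 5 (RSA/SHA-1), which is exactly the kind of SHA-1 usage the crypto-policies change in comment 30 deprecates.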

Comment 32 Adam Williamson 2021-09-09 15:35:19 UTC
"BTW, test in https://openqa.stg.fedoraproject.org/tests/1315922 is a client machine and it is still running systemd-resolved:"

oh, yeah, sorry, I forgot we only disabled it for servers before. I'll have to extend that snippet to also apply on clients.

Comment 33 Adam Williamson 2021-09-09 21:52:02 UTC
hmm, so, I looked at whether it's the case that we only resolve fedoraproject.org domains after we start using the FreeIPA server as the DNS server in the upgrade case. That's not true; we definitely resolve fedoraproject.org domains after the server becomes the DNS server in non-upgrade cases too. So it is still significant that non-upgrade tests do not encounter the bug; we are able to resolve fedoraproject.org domains just fine, reliably, in the non-upgrade tests where server and client ends are both F35 or F36 all along.

But it did make me notice something else. One thing that's unique at the time we fail in the upgrade case is that when we fail, the server is upgraded but *the client is not*. So we have, temporarily, a Fedora 34 client using a Fedora 35 or Fedora 36 FreeIPA server as its DNS server. Could that be related, somehow? Some kind of crypto mismatch there, maybe?

Comment 34 Adam Williamson 2021-09-09 23:24:02 UTC
Tried with resolved disabled on the client too, still got the bug:

https://openqa.stg.fedoraproject.org/tests/1318733

Comment 35 Alexander Bokovoy 2021-09-10 18:07:37 UTC
Is this ever reproducible outside OpenQA and Fedora infrastructure?

So the idea is: we deploy F34, install IPA with DNS but do not enable DNSSEC for our own zones, then upgrade to F35; after the upgrade has happened (IPA upgraded as well), the local IPA DNS server starts rejecting Fedora hosts (e.g. bodhi.fedoraproject.org or the NTP pool addresses). Perhaps not just Fedora Project hosts but any .org hosts?

Is that the case? I am trying to find a way to build a reproducer outside Fedora infrastructure...

Comment 36 Adam Williamson 2021-09-10 18:28:47 UTC
"Is this ever reproducible outside OpenQA and Fedora infrastructure?"

I haven't tested, yet, because it's an awkward test to do manually and I have a dozen other blockers to juggle. I no longer have a pet local openQA deployment, I ditched all that hardware a while back. Lukas has a containerized one on his laptop but I dunno if it's capable of running the more complex tap tests. Lukas, is it?

Comment 37 Lukas Ruzicka 2021-09-13 16:30:55 UTC
(In reply to Adam Williamson from comment #36)
> "Is this ever reproducible outside OpenQA and Fedora infrastructure?"
> 
> I haven't tested, yet, because it's an awkward test to do manually and I
> have a dozen other blockers to juggle. I no longer have a pet local openQA
> deployment, I ditched all that hardware a while back. Lukas has a
> containerized one on his laptop but I dunno if it's capable of running the
> more complex tap tests. Lukas, is it?

Let me try to set it up for running tomorrow.

Comment 38 Adam Williamson 2021-09-15 22:42:58 UTC
So I can't say if it happens outside of Fedora infra (hopefully Lukas can tell us - Lukas, note you must run the test several times if it passes; the bug is intermittent, I wouldn't be sure it doesn't happen until you try at least 10 times), but I did just test that it affects more than just fedoraproject.org when it happens. I had the test also try to resolve two other domains with known-good DNSSEC configuration, per https://www.internetsociety.org/resources/deploy360/2013/dnssec-test-sites/ : internetsociety.org and dnssec-tools.org . When the bug happens, it also affects those - querying them returns no records. See https://openqa.stg.fedoraproject.org/tests/1330891#step/realmd_join_sssd/19 .

Comment 39 Lukas Ruzicka 2021-09-16 12:35:51 UTC
I am currently trying to get the tests to run in my openQA instance, but although I have 5 workers running, the three tests refuse to start.
I will keep you posted ...

Comment 40 Adam Williamson 2021-09-16 15:12:28 UTC
are your workers actually configured as tap workers? Do other tap tests run?

Comment 41 Lukas Ruzicka 2021-09-17 11:15:26 UTC
(In reply to Adam Williamson from comment #40)
> are your workers actually configured as tap workers? Do other tap tests run?
I am not sure. I am attaching the output of my workers for you to see. How do I know if they are tap workers?

Btw, I was able to install FreeIPA on Fedora 35, so I will try to install F34 too and check manually if the issue hits me.

Comment 42 Lukas Ruzicka 2021-09-17 11:16:18 UTC
Created attachment 1823870 [details]
List of workers

Comment 43 Lukas Ruzicka 2021-09-17 11:17:12 UTC
Created attachment 1823872 [details]
Worker detail

Comment 44 Lukas Ruzicka 2021-09-17 14:39:47 UTC
Created attachment 1823921 [details]
Waiting workers

Comment 45 Lukas Ruzicka 2021-09-17 14:41:03 UTC
The attachment showing waiting workers actually shows scheduled tests that refuse to start. So my workers might be set up incorrectly.

Comment 46 Lukas Ruzicka 2021-09-17 14:46:55 UTC
Also, I installed the FreeIPA server on Fedora 34 and the client on F35. I could use kinit to obtain the Kerberos ticket, and all the required info from "QA:Testcase realmd join sssd" could be obtained from the server. Then I started a ping on the client and upgraded the server to F35. The host stopped answering while the upgrade was in progress, but then started answering the ping again. Also, once it had fully started, I could again perform the steps from "QA:Testcase realmd join sssd" and obtain the ticket.

I was using VMs on my local machine. To me, it seemed to behave normally on the first run; however, I am not an experienced FreeIPA admin or even user. I could try more times if needed, but the setup process takes quite some time.

Comment 47 Alexander Bokovoy 2021-09-17 14:52:22 UTC
Thanks Lukas. I assume it is using integrated DNS server (e.g. you set up IPA server as ipa-server-install --setup-dns) and your client was pointed to IPA server to resolve DNS. This is the part of the test case that seems to cause an issue in OpenQA: after installing IPA server on F34 and upgrading it to F35, at some point after completing the upgrade DNS server on IPA server stops resolving .org domains because DNSSEC chain of trust cannot be verified for them. At least, that's what happened in OpenQA instance in Fedora infrastructure.

Comment 48 Adam Williamson 2021-09-17 16:13:46 UTC
right. to reproduce the issue you need to do this:

1. deploy F34 server
2. enrol an F34 client
3. configure the client to use the server as its DNS server
4. upgrade the server to F35
5. from the client, try and resolve fedoraproject.org or internetsociety.org or dnssec-tools.org.

and you need to do it multiple times, because the bug doesn't always happen. I can try and do this with VMs today, but I don't really have my networking set up properly to allow it ATM.
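Step 5 above can be scripted from the client end. A rough sketch, assuming the client sees the FreeIPA server at 172.16.2.100 (the address from the openQA logs in comment 29; adjust for any other deployment); when the bug hits, the queries return no records:

```shell
#!/bin/sh
# Probe the FreeIPA server's resolver for the failure mode described
# above: healthy resolution returns A records, the bug returns none.
SERVER=172.16.2.100   # assumed address of the upgraded FreeIPA server
for d in fedoraproject.org internetsociety.org dnssec-tools.org; do
    n=$(dig +short @"$SERVER" "$d" A | wc -l)
    if [ "$n" -eq 0 ]; then
        echo "FAIL: $d returned no A records via $SERVER"
    else
        echo "ok: $d -> $n record(s)"
    fi
done
```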

Lukas, we can try and debug the openQA stuff elsewhere, I guess, to avoid bugging others on this ticket.

Comment 49 Lukas Ruzicka 2021-09-20 14:35:37 UTC
(In reply to Adam Williamson from comment #48)
> right. to reproduce the issue you need to do this:
> 
> 1. deploy F34 server
> 2. enrol an F34 client
> 3. configure the client to use the server as its DNS server
> 4. upgrade the server to F35
> 5. from the client, try and resolve fedoraproject.org or internetsociety.org
> or dnssec-tools.org .
> 

So, today I performed about 5 runs of this test to see if I would hit it, but I did not.
I am doing this:

1. Set up the server VM to resolve through my host -> 192.168.122.1
2. Obtain the server VM IP address -> 192.168.122.32
3. Set up the client VM to resolve through the server.
4. Run the dig command for fedoraproject.org (for 34 and for upgrade to F35)

===

; <<>> DiG 9.16.20-RH <<>> fedoraproject.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13137
;; flags: qr rd ra; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1
 
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: d65c4143da6dfd910100000061489b5c3df50b2f24cd3322 (good)
;; QUESTION SECTION:
;fedoraproject.org.             IN      A
 
;; ANSWER SECTION:
fedoraproject.org.      0       IN      A       209.132.190.2
fedoraproject.org.      0       IN      A       152.19.134.198
fedoraproject.org.      0       IN      A       18.133.140.134
fedoraproject.org.      0       IN      A       38.145.60.20
fedoraproject.org.      0       IN      A       152.19.134.142
fedoraproject.org.      0       IN      A       18.185.136.17
fedoraproject.org.      0       IN      A       67.219.144.68
fedoraproject.org.      0       IN      A       185.141.165.254
fedoraproject.org.      0       IN      A       18.159.254.57
fedoraproject.org.      0       IN      A       85.236.55.6
fedoraproject.org.      0       IN      A       38.145.60.21
fedoraproject.org.      0       IN      A       140.211.169.206
 
;; Query time: 70 msec
;; SERVER: 192.168.122.32#53(192.168.122.32)
;; WHEN: Mon Sep 20 16:31:56 CEST 2021
;; MSG SIZE  rcvd: 266
 
It takes ages to reset the VM and perform the upgrade, so I could only do it about 5 times. The sixth run is going just now.

Comment 50 Adam Williamson 2021-09-20 14:59:49 UTC
thanks for testing. can you also dig internetsociety.org and dnssec-tools.org for confirmation? if this works like 8-10 times we can probably say it doesn't need to block beta, at least.

Comment 51 Adam Williamson 2021-09-20 21:05:15 UTC
So I tried a couple more things too. I wondered if this may depend on which DNS server bind winds up forwarding requests to - there are two candidates, as we configure the server end with two DNS servers before deploying FreeIPA - but that doesn't seem to be it; I hacked the test to always use the same server, ran it a few times, and got mixed results, which is not what I'd expect if the DNS server was the cause (it should either always pass or always fail).

I also enabled dnssec on one of the openQA servers itself, which is a Fedora 34 system in Fedora infra, and tested resolving fedoraproject.org and some test dnssec-enabled domains. It behaved as expected every time.

So, I'm running out of things to poke here, honestly. Anyone have any ideas on any debug we can do to figure out precisely *where* the resolution is breaking down when it does go wrong?

Comment 52 Lukas Ruzicka 2021-09-21 07:43:45 UTC
With the sixth upgrade attempt, I also dug internetsociety.org and dnssec-tools.org and both have been resolved fine:

 
; <<>> DiG 9.16.20-RH <<>> internetsociety.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44765
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
 
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 7afebc2664d4208401000000614988b9973e4c0c3ea2758c (good)
;; QUESTION SECTION:
;internetsociety.org.           IN      A
 
;; ANSWER SECTION:
internetsociety.org.    299     IN      A       104.18.16.166
internetsociety.org.    299     IN      A       104.18.17.166
 
;; Query time: 104 msec
;; SERVER: 192.168.122.32#53(192.168.122.32)
;; WHEN: Tue Sep 21 09:24:41 CEST 2021
;; MSG SIZE  rcvd: 108

and

 
; <<>> DiG 9.16.20-RH <<>> dnssec-tools.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10762
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
 
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 9b628b0e4963b00201000000614988f10d5650262b8e9cca (good)
;; QUESTION SECTION:
;dnssec-tools.org.              IN      A
 
;; ANSWER SECTION:
dnssec-tools.org.       290     IN      A       185.199.108.153
dnssec-tools.org.       290     IN      A       185.199.111.153
dnssec-tools.org.       290     IN      A       185.199.109.153
dnssec-tools.org.       290     IN      A       185.199.110.153
 
;; Query time: 2 msec
;; SERVER: 192.168.122.32#53(192.168.122.32)
;; WHEN: Tue Sep 21 09:25:37 CEST 2021
;; MSG SIZE  rcvd: 137

Comment 53 Alexander Bokovoy 2021-09-21 09:06:00 UTC
So, looking again at the tarball from comment 18, I can see we get wrong responses from the upstream DNS servers:

03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:dc3::35#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:7fe::53#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:500:2::c#53
03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53
03-Sep-2021 13:38:17.538 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53
03-Sep-2021 13:38:17.538 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:38:17.538 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:38:17.538 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:200::b#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:7fd::1#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:38:17.540 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53
03-Sep-2021 13:38:17.540 info: network unreachable resolving './NS/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.265 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
03-Sep-2021 13:54:51.265 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53
03-Sep-2021 13:54:51.265 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53
03-Sep-2021 13:54:51.265 info: network unreachable resolving './NS/IN': 2001:dc3::35#53
03-Sep-2021 13:54:51.265 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
03-Sep-2021 13:54:51.266 info: network unreachable resolving './NS/IN': 2001:7fd::1#53
03-Sep-2021 13:54:51.266 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53
03-Sep-2021 13:54:51.266 info: network unreachable resolving './NS/IN': 2001:7fe::53#53
03-Sep-2021 13:54:51.266 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
03-Sep-2021 13:54:51.266 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53
03-Sep-2021 13:54:51.266 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:54:51.266 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:54:51.267 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.267 info: network unreachable resolving './NS/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.267 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.269 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.270 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.270 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './NS/IN': 2001:7fe::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './NS/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './NS/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './NS/IN': 2001:7fd::1#53
03-Sep-2021 13:54:51.565 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.565 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:dc3::35#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:dc3::35#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:500:1::53#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:500:2::c#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './NS/IN': 2001:7fe::53#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './NS/IN': 2001:500:200::b#53
03-Sep-2021 13:55:28.930 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
03-Sep-2021 13:55:28.930 info: network unreachable resolving './NS/IN': 2001:7fd::1#53
03-Sep-2021 13:55:31.358 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.34#53
03-Sep-2021 13:55:31.359 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'org/DNSKEY/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DS/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DNSKEY/IN': 10.3.163.34#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.33#53
03-Sep-2021 13:55:47.495 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:02.577 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:17.568 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:32.727 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:47.771 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:02.861 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:17.995 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:33.096 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53

So 10.3.163.33 and 10.3.163.34 would be the cause of the problem in Fedora infrastructure, as none of the root servers are accessible in that environment.

I would strongly suggest this is not a bug in FreeIPA/BIND/bind-dyndb-ldap.

Comment 54 Adam Williamson 2021-09-22 19:07:13 UTC
It would just be nice to know how we can debug exactly where and how the trust chain is breaking down. Does anyone know how to do that? Then at least we could reassign it somewhere useful.

I think at this point it's reasonable to re-vote this for Beta blocker status, since it does seem likely to be fairly limited in scope. So, clearing AcceptedBlocker to allow for a re-vote.

I'm also going to try and come up with a workaround for openQA so I don't have to restart the tests all the damn time (I'm going to try something more fine-grained than just 'disable dnssec').

Comment 55 Adam Williamson 2021-09-22 21:28:25 UTC
Sigh, an annoying note from my workaround attempt: when this bug happens, it's still broken even after the client completes its upgrade. I tried to implement the workaround by having the client switch to a 'normal' DNS server for the upgrade and then back to the FreeIPA server after upgrading, but it still can't resolve fedoraproject.org even after the upgrade.

So, I'm just gonna have to turn off dnssec on the server end for this test, I think :(

Comment 56 Kevin Fenzi 2021-09-23 19:02:01 UTC
'delv -4 +rtrace +dnssec +short fedoraproject.org' may provide more info... 

Some more general info: fedoraproject.org is currently signed with both a sha1 and a sha384 key, but only the sha1 DS record is in .org, so only that one should be 'live'.

dnssec seems to work fine from the worker nodes themselves... so it's just in the installs/qemu that it's failing, I guess?

Comment 57 Ben Cotton 2021-09-23 19:20:44 UTC
Marking as a rejected blocker for Beta, per Adam.

Comment 58 Adam Williamson 2021-09-23 21:05:09 UTC
Bit more on the above - we re-voted this at the Go/No-Go meeting today:

https://meetbot-raw.fedoraproject.org/fedora-meeting/2021-09-23/f35-beta-go_no_go-meeting.2021-09-23-17.00.html

and rejected it due to how it seems to be quite situation-specific - nobody has yet reproduced it outside Fedora infra. There is a reasonable workaround, too - disable dnssec on the server.

Kevin, the specific failure case here is when we deploy a FreeIPA server and client on Fedora 34, then upgrade the server to Fedora 35. About half of the time, after that happens, lookups of dnssec-enabled domains from the client via the FreeIPA server's DNS server stop working - that is, the lookup completes, but the server returns no IP addresses. Upgrading the client to F35 doesn't fix things.

In tests where we deploy directly to Fedora 35, this is not happening. It's only happening in the upgrade test.

In this configuration the FreeIPA server is configured as a DNS server (using bind), forwarding any requests it doesn't resolve itself (so anything besides hosts enrolled in the domain) to the infra IP servers. If we disable dnssec validation on the FreeIPA server, the bug doesn't happen.
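For context, the forwarding setup described in this comment corresponds roughly to a named.conf fragment like the following (a sketch only; the forwarder addresses are the infra resolvers seen earlier in the logs, and the dnssec-validation line shows the workaround, not the normal configuration):

```
options {
    /* Forward anything named does not resolve itself (i.e. anything
     * outside the IPA-managed zones) to the infra DNS servers. */
    forwarders { 10.3.163.33; 10.3.163.34; };

    /* The workaround discussed here: disabling DNSSEC validation on
     * the FreeIPA server makes the broken-trust-chain failures go away. */
    dnssec-validation no;
};
```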

I'll try the command you suggested, thanks.

Comment 59 David Yaffe 2021-09-27 03:55:12 UTC
I don't know if this will help, but I am also seeing similar behavior after upgrading my network from Fedora 34 to Fedora 35 using dnf system-upgrade. I have 2 systems running as a master/replica. I noticed another behaviour: once I start having DNS troubles, IPA runs really slowly.

Let me know if you want any logs/configuration information, etc.

Output from delv after rebooting both IPA servers:

$ delv -4 +rtrace +dnssec +short fedoraproject.org
;; fetch: fedoraproject.org/A
;; fetch: fedoraproject.org/DNSKEY
;; fetch: fedoraproject.org/DS
;; fetch: org/DNSKEY
;; fetch: org/DS
;; fetch: ./DNSKEY
8.43.85.67
8.43.85.73
38.145.60.20
38.145.60.21
67.219.144.68
140.211.169.196
140.211.169.206
152.19.134.142
152.19.134.198
209.132.190.2

I noticed that the master IPA server will not respond to any DNS queries when this problem happens. My secondary server responds, but is unable to resolve queries for the local domain.

Comment 60 Adam Williamson 2021-09-27 06:08:46 UTC
Thanks David! When you say the master server "will not respond", do you mean that running a query against it literally gets no reply, or that it gets a reply with no IP addresses?

Comment 61 David Yaffe 2021-09-27 15:00:54 UTC
Last night ipa-01 would hang and time-out when I queried it using dig. ipa-02 would respond, but with no IP addresses.

Here are some tests from this morning. I get a reply with no IP addresses for internal zones, while querying the same nameserver for external zones works:

[david@ipa-01 ~]$ dig @ipa-01 ipa-01.evilduckie.lan a

; <<>> DiG 9.16.21-RH <<>> @ipa-01 ipa-01.evilduckie.lan a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13122
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2caa4738dba58e66010000006151dac477a23463242ae0db (good)
;; QUESTION SECTION:
;ipa-01.evilduckie.lan.         IN      A

;; AUTHORITY SECTION:
.                       5383    IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2021092700 1800 900 604800 86400

;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Mon Sep 27 08:52:52 CST 2021
;; MSG SIZE  rcvd: 153

[david@ipa-01 ~]$ dig @ipa-01 www.google.ca a

; <<>> DiG 9.16.21-RH <<>> @ipa-01 www.google.ca a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62272
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: d94f6c4a3b635102010000006151db1b2d7b4edba752984b (good)
;; QUESTION SECTION:
;www.google.ca.                 IN      A

;; ANSWER SECTION:
www.google.ca.          300     IN      A       142.251.41.67

;; Query time: 58 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Mon Sep 27 08:54:19 CST 2021
;; MSG SIZE  rcvd: 86

The secondary nameserver, using +short:
[david@ipa-01 ~]$ dig @ipa-02 ipa-01.evilduckie.lan a +short
[david@ipa-01 ~]$ dig @ipa-02 www.google.ca a +short
142.251.33.163

Also, 'sudo ipactl restart' takes a really long time...

Comment 62 Adam Williamson 2021-09-27 15:55:08 UTC
Hmm. This may not be the same issue, but I'm not really a DNS expert so it's hard to be sure :) Is there a difference in behaviour between external domains that are dnssec-enabled and those that are not? Try some of the sample public dnssec-enabled domains, like www.internetsociety.org ... see https://www.internetsociety.org/resources/deploy360/2013/dnssec-test-sites/ .

Comment 63 Alexander Bokovoy 2021-09-27 16:14:05 UTC
If it is an issue on the IPA DNS server side, then the logs will tell where a forwarded query breaks. You'd need to enable debug logs for bind.

There are two separate parts here:

 - DNSSEC validation -- applied for any DNS entry obtained during query resolution. FreeIPA enables DNSSEC validation by default, unless you provided '--no-dnssec-validation' to ipa-server-install/ipa-replica-install/ipa-dns-install

 - DNSSEC signing of DNS zones hosted by IPA itself. This is not enabled by default and only one IPA master can be DNSSEC master. The procedure requires an explicit act from the admin to enable DNSSEC signing.

If resolving external DNS zones through the IPA DNS server starts to return empty IP addresses, there might also be a few factors:

 - IPA DNS server applies DNSSEC validation and the trust chain for the requested record is not complete. This is something you'll see in named's logs, similar to comments #18 and #53.

 - your local resolver on the host applies DNSSEC validation and considers the trust chain broken. This might happen with systemd-resolved, for example.

Between Fedora 34 and Fedora 35, both bind and systemd-resolved changed substantially. Note that tools like 'dig' or 'delv' use bind's internal libraries, so any improvements in bind's DNSSEC validation code would also apply to those utilities. With 'delv' or 'dig' you can, at least, isolate which DNS component is 'broken' -- for example, your local resolver, the IPA DNS server, or something else.
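Enabling the bind debug logging mentioned above can be done with a logging stanza along these lines (a sketch; the channel name, file path, rotation settings, and debug level are assumptions to adjust):

```
logging {
    channel debug_log {
        file "/var/named/data/named_debug.log" versions 3 size 20m;
        severity debug 3;
        print-time yes;
        print-category yes;
    };
    /* resolver and dnssec are the categories that produce the
     * "broken trust chain" / "no valid RRSIG" messages seen here. */
    category resolver { debug_log; };
    category dnssec { debug_log; };
};
```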

Comment 64 David Yaffe 2021-09-27 16:38:14 UTC
I'm okay with DNS, but definitely not an expert in DNSSEC. delv seems to be working as expected for DNSSEC-enabled domains.

[david@ipa-01 ~]$ delv -4 +rtrace +dnssec internetsociety.org
;; fetch: internetsociety.org/A
;; fetch: internetsociety.org/DNSKEY
;; fetch: internetsociety.org/DS
;; fetch: org/DNSKEY
;; fetch: org/DS
;; fetch: ./DNSKEY
; fully validated
internetsociety.org.    172     IN      A       104.18.17.166
internetsociety.org.    172     IN      A       104.18.16.166
internetsociety.org.    172     IN      RRSIG   A 13 2 300 20210928172740 20210926152740 34505 internetsociety.org. or472O4CkqwJ15iXRYyXcb3Zj3t7aYmH3/3ure6o0SJ5k6CTlKdDpkJd IIy9SaWoPEs3FBb6ntcUphrnr0OOdQ==
[david@ipa-01 ~]$ delv -4 +rtrace +dnssec evilduckie.com
;; fetch: evilduckie.com/A
;; fetch: com/DS
;; fetch: ./DNSKEY
;; fetch: evilduckie.com/DS
;; fetch: com/DNSKEY
; unsigned answer
evilduckie.com.         78446   IN      A       142.165.61.118

I think this may also be tied into FreeIPA not running properly... on the same machines, I see errors when trying to use kinit:
[david@ipa-01 ~]$ kinit -V admin
Using existing cache: 1000
Using principal: admin
kinit: Generic error (see e-text) while getting initial credentials

I started down this rabbit hole after trying to use ipa-client-install on a third machine, and was getting DNS host not found errors.

Comment 65 François Cami 2021-09-27 16:46:17 UTC
David, you might want to install freeipa-healthcheck on your IPA cluster members and launch it. That will give you a better view of what's broken.

Comment 66 David Yaffe 2021-09-27 17:14:48 UTC
François, ipa-healthcheck runs about as well as ipactl; it hangs when run. I'm going to reboot the IPA cluster again. Once the cluster is up, I'll rerun ipa-healthcheck on both servers and post the results.

Comment 67 David Yaffe 2021-09-27 19:02:21 UTC
Output of ipa-healthcheck, ipactl status and dig against the local zone (evilduckie.lan) after a reboot of the cluster:

[david@ipa-01 ~]$ date
Mon 27 Sep 2021 11:54:10 AM CST
[david@ipa-01 ~]$ time sudo ipactl status
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful

real    0m1.048s
user    0m0.942s
sys     0m0.052s

[david@ipa-01 ~]$ dig @ipa-01 ipa-01.evilduckie.lan a

; <<>> DiG 9.16.21-RH <<>> @ipa-01 ipa-01.evilduckie.lan a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51983
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 0b76330806706d5a010000006152057eab84731e657ff490 (good)
;; QUESTION SECTION:
;ipa-01.evilduckie.lan.         IN      A

;; ANSWER SECTION:
ipa-01.evilduckie.lan.  1200    IN      A       10.0.0.2

;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Mon Sep 27 11:55:10 CST 2021
;; MSG SIZE  rcvd: 94

[david@ipa-01 ~]$ dig @ipa-01 ipa-02.evilduckie.lan a +short
10.0.0.3
[david@ipa-01 ~]$ dig @ipa-02 ipa-01.evilduckie.lan a +short
10.0.0.2

[david@ipa-02 ~]$ time sudo ipactl status
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful

real    0m1.500s
user    0m1.356s
sys     0m0.070s

[david@ipa-02 ~]$ dig @ipa-02 ipa-02.evilduckie.lan a

; <<>> DiG 9.16.21-RH <<>> @ipa-02 ipa-02.evilduckie.lan a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33791
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 24bc3a05b5a7242e01000000615205c628f417263b47fec7 (good)
;; QUESTION SECTION:
;ipa-02.evilduckie.lan.         IN      A

;; ANSWER SECTION:
ipa-02.evilduckie.lan.  1200    IN      A       10.0.0.3

;; Query time: 0 msec
;; SERVER: 10.0.0.3#53(10.0.0.3)
;; WHEN: Mon Sep 27 11:56:22 CST 2021
;; MSG SIZE  rcvd: 94

[david@ipa-02 ~]$ dig @ipa-02 ipa-01.evilduckie.lan a +short
10.0.0.2
[david@ipa-02 ~]$ dig @ipa-01 ipa-02.evilduckie.lan a +short
10.0.0.3

Everything has been up for ~1.5 hours, and is still running properly.

Comment 68 David Yaffe 2021-09-27 19:09:27 UTC
Created attachment 1826754 [details]
ipa-healthcheck log files

Comment 69 Rob Crittenden 2021-09-27 19:58:59 UTC
The ERROR in each log is a known issue. The warnings shouldn't be causing any issues.

Comment 70 David Yaffe 2021-09-30 22:02:48 UTC
The problem has returned after 3 days of uptime. I am unable to resolve internal domains; external domains seem to resolve fine.

[david@ipa-01 ~]$ uptime
 15:42:48 up 3 days,  4:21,  1 user,  load average: 0.13, 0.09, 0.09

ipactl status is now hanging, as is ipa-healthcheck:
[david@ipa-01 ~]$ time sudo ipactl status
^CCancelled.

real    2m9.740s
user    0m0.897s
sys     0m0.035s

[david@ipa-01 ~]$ time sudo ipa-healthcheck
^CTraceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/ldap/ldapobject.py", line 324, in _ldap_call
    result = func(*args,**kwargs)
ldap.SERVER_DOWN: {'result': -1, 'desc': "Can't contact LDAP server", 'errno': 4, 'ctrls': []}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ipa-healthcheck", line 33, in <module>
    sys.exit(load_entry_point('ipahealthcheck==0.9', 'console_scripts', 'ipa-healthcheck')())
  File "/usr/lib/python3.10/site-packages/ipahealthcheck/core/main.py", line 46, in main
    sys.exit(ipachecks.run_healthcheck())
  File "/usr/lib/python3.10/site-packages/ipahealthcheck/core/core.py", line 312, in run_healthcheck
    registry.initialize(framework, config, options)
  File "/usr/lib/python3.10/site-packages/ipahealthcheck/dogtag/plugin.py", line 30, in initialize
    api.Backend.ldap2.connect()
  File "/usr/lib/python3.10/site-packages/ipalib/backend.py", line 69, in connect
    conn = self.create_connection(*args, **kw)
  File "/usr/lib/python3.10/site-packages/ipaserver/plugins/ldap2.py", line 186, in create_connection
    client.external_bind(server_controls=serverctrls,
  File "/usr/lib/python3.10/site-packages/ipapython/ipaldap.py", line 1245, in external_bind
    self.conn.sasl_interactive_bind_s(
  File "/usr/lib64/python3.10/site-packages/ldap/ldapobject.py", line 476, in sasl_interactive_bind_s
    return self._ldap_call(self._l.sasl_interactive_bind_s,who,auth,RequestControlTuples(serverctrls),RequestControlTuples(clientctrls),sasl_flags)
  File "/usr/lib64/python3.10/site-packages/ldap/ldapobject.py", line 324, in _ldap_call
    result = func(*args,**kwargs)
KeyboardInterrupt


real    4m59.096s
user    0m0.949s
sys     0m0.037s

Running ipactl status on ipa-02 still responds properly:
[david@ipa-02 ~]$ time sudo ipactl status
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful

real    0m1.530s
user    0m1.342s
sys     0m0.084s

However ipa-healthcheck is hanging... using ctrl-c to stop takes a few moments/keypresses:
[david@ipa-02 ~]$ time sudo ipa-healthcheck
Internal server error 404 Client Error: Not Found for url: https://ipa-02.evilduckie.lan:443/ca/rest/certs/search?size=3
^CInternal error testing KRA clone.
^C^C^CTraceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/dbus/connection.py", line 606, in msg_reply_handler
    reply_handler(*message.get_args_list(**get_args_opts))
  File "/usr/lib64/python3.10/site-packages/dbus/proxies.py", line 396, in _introspect_reply_handler
    self._introspect_method_map = process_introspection_data(data)
  File "/usr/lib64/python3.10/site-packages/dbus/_expat_introspect_parser.py", line 85, in process_introspection_data
    return _Parser().parse(data)
  File "/usr/lib64/python3.10/site-packages/dbus/_expat_introspect_parser.py", line 44, in parse
    parser.Parse(data)
  File "/builddir/build/BUILD/Python-3.10.0rc2/Modules/pyexpat.c", line 416, in StartElement
  File "/usr/lib64/python3.10/site-packages/dbus/_expat_introspect_parser.py", line 47, in StartElementHandler
    def StartElementHandler(self, name, attributes):
KeyboardInterrupt
Invalid IP address fe80::1e69:7aff:fea3:352a for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::1e69:7aff:fea3:352a
Invalid IP address fe80::503e:77ff:feed:6425 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::503e:77ff:feed:6425
Invalid IP address fe80::1088:4ff:fefb:8e81 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::1088:4ff:fefb:8e81
Invalid IP address fe80::a0be:28ff:fef5:7841 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::a0be:28ff:fef5:7841
Invalid IP address fe80::502e:dcff:fe3d:704d for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::502e:dcff:fe3d:704d
Invalid IP address fe80::80f3:41ff:fe64:2716 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::80f3:41ff:fe64:2716
Invalid IP address fe80::e46c:2cff:fe4a:906c for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::e46c:2cff:fe4a:906c
Invalid IP address fe80::98e2:e6ff:fedc:dc9c for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::98e2:e6ff:fedc:dc9c
Invalid IP address fe80::c8f6:9aff:fe0f:948 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::c8f6:9aff:fe0f:948
[
  {
    "source": "pki.server.healthcheck.clones.connectivity_and_data",
    "check": "ClonesConnectivyAndDataCheck",
    "result": "ERROR",
    "uuid": "c926cd3a-22f9-4c6f-a1a2-b239d226798d",
    "when": "20210930215717Z",
    "duration": "0.745401",
    "kw": {
      "status": "ERROR:  pki-tomcat : Internal error testing CA clone. Host: ipa-02.evilduckie.lan Port: 443"
    }
  },
  {
    "source": "pki.server.healthcheck.clones.connectivity_and_data",
    "check": "ClonesConnectivyAndDataCheck",
    "result": "ERROR",
    "uuid": "7c31dbaf-8d35-4c78-8a02-5a3568270534",
    "when": "20210930220017Z",
    "duration": "180.863800",
    "kw": {
      "status": "ERROR:  pki-tomcat : Internal error testing KRA clone. Host: ipa-02.evilduckie.lan Port: 443"
    }
  },
  {
    "source": "ipahealthcheck.dogtag.ca",
    "check": "DogtagCertsConnectivityCheck",
    "result": "ERROR",
    "uuid": "c7e61086-bf06-49fb-9711-35640deb6155",
    "when": "20210930220018Z",
    "duration": "0.131784",
    "kw": {
      "msg": "Request for certificate failed, cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/1': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)"
    }
  },
  {
    "source": "ipahealthcheck.ds.replication",
    "check": "ReplicationCheck",
    "result": "CRITICAL",
    "uuid": "1313f02f-53d0-4270-ade4-66d889f03b8e",
    "when": "20210930220027Z",
    "duration": "29.537328",
    "kw": {
      "key": "DSREPLLE0001",
      "items": [
        "Replication",
        "Agreement"
      ],
      "msg": "The replication agreement (metoipa-01.evilduckie.lan) under \"dc=evilduckie,dc=lan\" is not in synchronization."
    }
  },
  {
    "source": "ipahealthcheck.ds.replication",
    "check": "ReplicationCheck",
    "result": "CRITICAL",
    "uuid": "24064baf-232f-49e4-9278-88678e772721",
    "when": "20210930220048Z",
    "duration": "29.537368",
    "kw": {
      "key": "DSREPLLE0001",
      "items": [
        "Replication",
        "Agreement"
      ],
      "msg": "The replication agreement (catoipa-01.evilduckie.lan) under \"o=ipaca\" is not in synchronization."
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "4f2d503a-7393-416b-bd35-c58e86d4a9b5",
    "when": "20210930220051Z",
    "duration": "0.338079",
    "kw": {
      "key": "20210815190458",
      "serial": 7,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/7': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "54931860-5da9-44b5-9a00-eee7f0f3e999",
    "when": "20210930220051Z",
    "duration": "0.399151",
    "kw": {
      "key": "20210815190459",
      "serial": 5,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/5': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "e3e68aaa-1d02-4328-bbd7-73c97e98ada4",
    "when": "20210930220051Z",
    "duration": "0.462746",
    "kw": {
      "key": "20210815190500",
      "serial": 2,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/2': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "294c3790-aacf-4f5a-8a15-d222cb383f54",
    "when": "20210930220051Z",
    "duration": "0.525786",
    "kw": {
      "key": "20210815190501",
      "serial": 4,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/4': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "3b0b599f-bdfd-4096-8d45-8edde6d03dcd",
    "when": "20210930220051Z",
    "duration": "0.588363",
    "kw": {
      "key": "20210815190502",
      "serial": 1,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/1': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "3ba597c7-56a1-4a3d-a4c4-2d6dc0832da6",
    "when": "20210930220051Z",
    "duration": "0.650719",
    "kw": {
      "key": "20210815190503",
      "serial": 17,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/17': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "24758298-db1e-4ed2-a216-3291101ed5d3",
    "when": "20210930220052Z",
    "duration": "0.713848",
    "kw": {
      "key": "20210815190729",
      "serial": 13,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/13': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "0a2fbf04-f242-4ee0-92c4-731165f0a633",
    "when": "20210930220052Z",
    "duration": "0.777655",
    "kw": {
      "key": "20210815190730",
      "serial": 11,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/11': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "e4e72806-e1b3-4e75-b0ee-373c03a1b665",
    "when": "20210930220052Z",
    "duration": "0.840747",
    "kw": {
      "key": "20210815190731",
      "serial": 12,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/12': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "36508ef3-d756-420e-9bce-a42e9c1ae853",
    "when": "20210930220052Z",
    "duration": "0.873792",
    "kw": {
      "key": "20210815190322",
      "serial": 16,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/16': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "fb2acec3-4fa1-42de-a18d-22c20628e3b8",
    "when": "20210930220052Z",
    "duration": "0.935477",
    "kw": {
      "key": "20210815190310",
      "serial": 15,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/15': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.certs",
    "check": "IPACertRevocation",
    "result": "ERROR",
    "uuid": "28071b3b-202e-4f2e-98ff-4ee1a11b2507",
    "when": "20210930220052Z",
    "duration": "0.969785",
    "kw": {
      "key": "20210815190524",
      "serial": 18,
      "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/18': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)",
      "msg": "Request for certificate serial number {serial} in request {key} failed: {error}"
    }
  },
  {
    "source": "ipahealthcheck.ipa.dna",
    "check": "IPADNARangeCheck",
    "result": "WARNING",
    "uuid": "ba67229c-59e2-4d04-a30b-3e6d0d8ff50d",
    "when": "20210930220052Z",
    "duration": "0.137597",
    "kw": {
      "range_start": 0,
      "range_max": 0,
      "next_start": 0,
      "next_max": 0,
      "msg": "No DNA range defined. If no masters define a range then users and groups cannot be created."
    }
  },
  {
    "source": "ipahealthcheck.ipa.idns",
    "check": "IPADNSSystemRecordsCheck",
    "result": "WARNING",
    "uuid": "929698e2-4e0b-4e52-b3e3-2cb0e0d4e297",
    "when": "20210930220052Z",
    "duration": "0.036222",
    "kw": {
      "msg": "Got {count} ipa-ca A records, expected {expected}",
      "count": 3,
      "expected": 2
    }
  },
  {
    "source": "ipahealthcheck.ipa.files",
    "check": "IPAFileCheck",
    "result": "WARNING",
    "uuid": "57296159-3ba1-43a9-afbe-69e4ce0cda81",
    "when": "20210930220052Z",
    "duration": "0.001286",
    "kw": {
      "key": "_var_log_kadmind.log_mode",
      "path": "/var/log/kadmind.log",
      "type": "mode",
      "expected": "0600",
      "got": "0640",
      "msg": "Permissions of /var/log/kadmind.log are too permissive: 0640 and should be 0600"
    }
  }
]

real    3m39.286s
user    0m6.412s
sys     0m0.592s

Comment 71 Rob Crittenden 2021-09-30 22:17:27 UTC
The tools' "hanging" is a result of the LDAP server being unresponsive. I'd recommend installing the 389-ds-base debuginfo and getting a stacktrace from the ns-slapd process.

Comment 72 David Yaffe 2021-10-01 04:22:05 UTC
Hi Rob, that's what I had noticed. I had also tried various LDAP queries against 389-DS, and got no response.

It's been a long time since I have done a stack trace. As you have suggested, I installed the debuginfo and symbols for 389-ds-base as well as gdb. gstack returned nothing:

[david@ipa-01 ~]$ pgrep ns-slapd
1320
[david@ipa-01 ~]$ gstack `pgrep ns-slapd`

Not sure where to go from here. I'm more than willing to follow instructions. :-)

Comment 73 Rob Crittenden 2021-10-01 13:03:44 UTC
From https://directory.fedoraproject.org/docs/389ds/FAQ/faq.html this is the preferred invocation:

gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1

Comment 74 David Yaffe 2021-10-01 14:39:17 UTC
Created attachment 1828068 [details]
Stack trace files from ipa-01

The first file has the debug symbols from 389-ds-base only. The second file has all of the missing debug symbols installed.

Comment 75 David Yaffe 2021-10-01 14:40:22 UTC
Created attachment 1828069 [details]
stack trace files from ipa-02

The first file has the debug symbols from 389-ds-base only. The second file has all of the missing debug symbols installed.

Comment 76 Alexander Bokovoy 2021-10-01 15:01:58 UTC
Are you using the compat tree in IPA? If not, maybe disabling it could help?

I see some lock-ups between changelog processing and slapi-nis (an entry was deleted, and that triggered a reaction in the compat tree: removal of a cached entry) on ipa-01. ipa-02 waits for SASL GSSAPI to complete when talking to ipa-01. If ipa-01 is stuck in lock-ups, it cannot respond back over the replication plugin, which kind of explains why it 'hangs'.

Thierry, could you please look into the logs from comments 74 and 75?

Comment 77 thierry bordaz 2021-10-01 15:55:01 UTC
The stack traces

        ipa-01-stacktrace
        shows a deadlock similar to https://bugzilla.redhat.com/show_bug.cgi?id=1751295#c3
        stacktrace.1633097467.txt.1 Deadlock between Threads 40/28
        stacktrace.1633098023.txt.1 Deadlock between Threads 40/28

        ipa-02-stacktrace
        shows a replication agreement (RA) waiting (poll) for an answer (sasl_bind) from the consumer. This long delay may block other RAs from binding.
        stacktrace.1633098323.txt.1 Possible long bind of RA (consumer not answering) Threads 10/11
        stacktrace.1633098705.txt.1 the same

Comment 78 David Yaffe 2021-10-01 17:20:36 UTC
Alexander, I'm not sure if I'm using compat mode. How can I confirm whether or not I am using compat?

Let me know what else you need me to do or test.

Comment 79 Alexander Bokovoy 2021-10-01 17:33:18 UTC
You can disable it with the ipa-compat-manage tool; check the man page first.
If you don't have non-Linux clients or very old RHEL (like RHEL 5) in use, and don't need to resolve users from a trusted Active Directory domain, chances are you're not using the compat tree plugin.
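A quick way to check before disabling anything is `ipa-compat-manage status` (run as root on the server; it prompts for the Directory Manager password). A minimal sketch for interpreting its output, assuming the "Plugin Enabled"/"Plugin Disabled" wording; the `compat_in_use` helper is ours, not part of FreeIPA:

```shell
# Hypothetical helper: interpret `ipa-compat-manage status` output.
# On a live server you would feed it the real output:
#   compat_in_use "$(ipa-compat-manage status)"
compat_in_use() {
  case "$1" in
    *Enabled*) echo yes ;;   # compat tree plugin is active
    *)         echo no  ;;   # plugin disabled (or status not recognized)
  esac
}

compat_in_use "Plugin Enabled"    # prints: yes
compat_in_use "Plugin Disabled"   # prints: no
```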

Comment 80 Adam Williamson 2021-10-01 19:42:45 UTC
If David's bug doesn't have anything to do with mine, can we maybe move the debugging of it elsewhere? At this point more than half of this bug report is about it.

Comment 81 David Yaffe 2021-10-01 21:50:36 UTC
Adam, I have the feeling that our bugs are related. However, I'm more than happy to move it elsewhere if that makes sense.

Alexander, I was unable to run the tool on ipa-01; it just hung. I was able to run it on ipa-02. I can reboot ipa-01, which will (temporarily) restore functionality, and then run 'sudo ipa-compat-manage disable'.

Comment 82 Alexander Bokovoy 2021-10-02 08:55:39 UTC
Yes, you need to restart ipa-01.

David, let's have a separate bug against slapi-nis in Fedora and add the same logs there.

Comment 83 David Yaffe 2021-10-05 23:11:53 UTC
New bug created: https://bugzilla.redhat.com/show_bug.cgi?id=2011086

Comment 84 Petr Menšík 2022-05-01 15:56:49 UTC
Based on comment #25, it seems to me there might be a common deficiency in our internal RH infrastructure. Some recursive servers provided by the network strip RRSIG records from their replies. If some of the servers forward these records correctly but others do not, resolution starts failing at random, which is hard to track down.

The solution is to remove all forwarders that do not have DNSSEC support enabled.

The broken servers can be filtered out like this (only forwarders that return RRSIG records are printed):

for NS in $(awk '"nameserver" == $1 { print $2 }' /etc/resolv.conf); do dig +dnssec @$NS . SOA | grep -q RRSIG && echo $NS; done
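
The one-liner above can be unpacked into something more readable; a sketch, where the function names are ours and the dig check needs network access plus dig from bind-utils:

```shell
# Print the nameserver addresses listed in a resolv.conf-style file.
nameservers() {
  awk '"nameserver" == $1 { print $2 }' "$1"
}

# Does the given resolver pass RRSIG records through for the root SOA?
# (Requires network access and dig from bind-utils.)
serves_rrsig() {
  dig +dnssec @"$1" . SOA | grep -q RRSIG
}

# On a live system, keep only the forwarders whose address gets printed:
#   for NS in $(nameservers /etc/resolv.conf); do
#     serves_rrsig "$NS" && echo "$NS"
#   done
```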

Of course, systemd-resolved with DNSSEC validation disabled is known to be broken in this regard; if it gets involved, it would break DNSSEC validation for sure.

Maybe freeipa-setup should check the system forwarders for DNSSEC support and refuse to continue if not all of them serve signatures. It might offer disabling DNSSEC validation as a workaround.

Comment 85 Alexander Bokovoy 2022-05-01 16:56:53 UTC
I think this particular situation is unique to Red Hat datacenters. I already suggested disabling DNSSEC validation in the tests, and we provide an option for that.
I am not sure it makes sense to probe extensively for this, as we might be dealing with an isolated installation that is just being bootstrapped before being moved to its proper location, for example. Admins have to handle that part themselves.

Comment 86 Adam Williamson 2022-05-02 06:19:08 UTC
For the record, disabling dnssec is what I wound up doing. More precisely, we disable dnssec if it's an upgrade test to a version higher than 34.

I guess once F34 goes EOL we can try turning it back on again, if it'll work OK for upgrades from F35 to later releases...

Comment 87 Adam Williamson 2022-09-20 21:00:48 UTC
So, I just tried turning this back on again (enabling dnssec on FreeIPA upgrade tests). Unfortunately, it still seems to be buggy. If we deploy the server pre-upgrade with dnssec enabled, about 60% of the time, the test fails with the client unable to resolve names after the server has upgraded. So I've gone back to having it disabled for the upgrade test :|

Comment 88 Ben Cotton 2022-11-29 17:03:53 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 89 Adam Williamson 2022-11-30 18:10:27 UTC
Per my comment above, this still seems to be valid (if perhaps specific to some quirk of RH infrastructure; hopefully not many other folks have the same quirk, whatever it is).

Comment 90 Aoife Moloney 2023-11-23 00:06:08 UTC
This message is a reminder that Fedora Linux 37 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 37 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 91 Adam Williamson 2023-11-23 20:20:26 UTC
Will test again soon to check whether this is still an issue.

Comment 92 Aoife Moloney 2023-12-05 21:01:54 UTC
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 93 Petr Menšík 2023-12-06 19:23:07 UTC
I would like this bug fixed properly. dnssec-validation no; should be enough, if DNSSEC is responsible for the breakage.
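For reference, that would be a small change in BIND's options block; a sketch assuming the stock /etc/named.conf layout (newer FreeIPA versions read local overrides from /etc/named/ipa-options-ext.conf, so paths may vary by version):

```
options {
        /* Stop validating DNSSEC signatures; forwarders that strip
           RRSIGs then no longer cause resolution failures. */
        dnssec-validation no;
};
```

BIND needs a restart afterwards (e.g. via ipactl restart; the exact service name varies by version).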

The problem is still present, right?

Comment 94 Adam Williamson 2023-12-06 19:31:26 UTC
sorry, I didn't get to re-testing yet (and a catastrophic browser incident lost all my tabs, which I use as my todo list, so I forgot I was supposed to be doing it. :>)

I would *guess* it is. I'll try and confirm soon.

Comment 95 Aoife Moloney 2023-12-07 11:07:53 UTC
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

