openQA is frequently hitting an issue in tests of a scenario where a FreeIPA server is deployed on Fedora 33 or 34 and then upgraded to 35 or 36. After the upgrade, the server stops returning correct DNS responses: it still responds to queries, but the responses are empty. Here's an example:

https://openqa.stg.fedoraproject.org/tests/1297421 (server test)
https://openqa.stg.fedoraproject.org/tests/1297422 (client test)

In that test, I rigged the client end to ping the server and run a DNS query (for fedoraproject.org) more or less once per second after it had finished enrolling as a client. During this time, the server upgrades itself from 34 to 35.

For the first few minutes, while the server is still running on 34 and is just downloading packages for the upgrade, the ping is returned and the DNS query works correctly, giving the expected list of four IP addresses (note this is happening within Fedora infra, so the IPs returned are the internal ones):

https://openqa.stg.fedoraproject.org/tests/1297422#step/realmd_join_sssd/45

Then, while the server is actually upgrading, neither pings nor DNS queries are replied to, as you'd expect:

https://openqa.stg.fedoraproject.org/tests/1297422#step/realmd_join_sssd/313

But after the server has upgraded, rebooted and come back up, we reach a state where the ping is returned and a DNS query is answered, but the answer is empty (no IP addresses given):

https://openqa.stg.fedoraproject.org/tests/1297422#step/realmd_join_sssd/457

The test fails shortly afterwards, when the client continues with what it would normally do next - downloading the packages from the update being tested. That fails because it cannot resolve bodhi.fedoraproject.org, due to this DNS problem.

This doesn't happen on every test, but if you look at the list at https://openqa.fedoraproject.org/tests/963222#next_previous , it happens about 50% of the time, so it's a significant problem.

I'll attach /var/log tarballs from both ends of a failed test. I don't see anything obvious in them, but maybe someone else will. We might need to tweak bind config to get more info, I guess.

Proposing as a Beta blocker as a conditional violation of the Beta criterion "Additionally, a system deployed as a FreeIPA domain controller must: ... Serve DNS host records on port 53 (assuming it is deployed with DNS capability)" combined with the upgrade requirement "It must be possible to successfully complete a direct upgrade from a fully updated installation of each of the last two stable Fedora Server releases with the system configured as a FreeIPA domain controller or postgresql server as specified in the relevant criteria. The upgraded system must meet all relevant release criteria, including criteria relating to functionality of the server software."
Created attachment 1819210 [details] /var/log tarball from client after bad state reached
Created attachment 1819212 [details] /var/log tarball from server after bad state reached
Hmm, we're actually supposed to be getting debug logs from bind via a config ab suggested. We write a service config which should cause bind to log at debug level to named.run: https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/tests/role_deploy_domain_controller.pm#_30 and then upload named.run in the post-fail hook: https://pagure.io/fedora-qa/os-autoinst-distri-fedora/blob/master/f/lib/installedtest.pm#_112 but it's showing up as a 0-length file, in these tests at least. I'm not sure why. I think this *did* used to work when we first set it up.
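For context, the debug logging setup just amounts to raising named's debug level and letting the stock default_debug channel (which points at data/named.run with "severity dynamic") do the rest. Roughly, something like the following - the sysconfig path is an assumption about how the Fedora named unit picks up extra options, not a copy of the actual test code:

# named only writes to /var/named/data/named.run at a non-zero debug level
# ("severity dynamic"), so if the option stops being applied the file stays 0-length
echo 'OPTIONS="-d 3"' >> /etc/sysconfig/named
systemctl restart named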
There are no visible issues with named after upgrade: Aug 27 20:18:08 ipa001 named[1042]: zone 2.16.172.in-addr.arpa/IN: loaded serial 1630109888 Aug 27 20:18:08 ipa001 named[1042]: zone test.openqa.fedoraproject.org/IN: loaded serial 1630109888 it loaded two zones from IPA LDAP as needed. What happens, however, is that a modification of zones is detected but apparently becomes empty and that causes a Python exception which is ignored: ... Aug 27 20:18:52 ipa001 ipactl[729]: ipa: INFO: The ipactl command was successful Aug 27 20:18:52 ipa001 ipactl[729]: Starting Directory Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting krb5kdc Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting kadmin Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting named Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting httpd Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting ipa-custodia Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting pki-tomcatd Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting ipa-otpd Service Aug 27 20:18:52 ipa001 ipactl[729]: Starting ipa-dnskeysyncd Service ... Aug 27 20:18:56 ipa001 ipa-dnskeysyncd[1541]: ipalib.plugable: DEBUG importing all plugin modules in ipaserver.plugins... ... Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: DEBUG Kerberos principal: ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG Initializing principal ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org using keytab /etc/ipa/dnssec/ipa-dnskeysyncd.keytab Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG using ccache /tmp/ipa-dnskeysyncd.ccache Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG Attempt 1/5: success Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: DEBUG LDAP URL: ldapi://%2Frun%2Fslapd-TEST-OPENQA-FEDORAPROJECT-ORG.socket/cn%3Ddns%2Cdc%3Dtest%2Cdc%3Dopenqa%2Cdc%3Dfedoraproject%2Cdc%3Dorg??sub?%28%7C%28objectClass%3DidnsZone%29%28obj Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: INFO LDAP bind... 
Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysyncd: INFO Commencing sync process Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG Current cookie is: None (not received yet) Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG Detected add of entry: idnsname=test.openqa.fedoraproject.org.,cn=dns,dc=test,dc=openqa,dc=fedoraproject,dc=org 85059d0f-0792-11ec-94a5-b25ee1ca36b3 Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG Detected add of entry: idnsname=2.16.172.in-addr.arpa.,cn=dns,dc=test,dc=openqa,dc=fedoraproject,dc=org 85059d13-0792-11ec-94a5-b25ee1ca36b3 Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG Detected add of entry: ipk11UniqueID=8969c66e-0792-11ec-9060-52540012010e,cn=keys,cn=sec,cn=dns,dc=test,dc=openqa,dc=fedoraproject,dc=org 85059d2a-0792-11ec-94a5-b25ee1ca36 Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.syncrepl: DEBUG New cookie is: ipa001.test.openqa.fedoraproject.org:389#krbprincipalname=ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org.fedoraproject.org,cn=services,cn=a Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.keysyncer: INFO Initial LDAP dump is done, sychronizing with ODS and BIND Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG Starting external process Aug 27 20:18:59 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG args=['/usr/libexec/ipa/ipa-dnskeysync-replica'] ... Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG Process finished, return code=0 Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG stdout= Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipapython.ipautil: DEBUG stderr=ipalib.plugable: DEBUG importing all plugin modules in ipaserver.plugins... ... 
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG Kerberos principal: ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG Initializing principal ipa-dnskeysyncd/ipa001.test.openqa.fedoraproject.org using keytab /etc/ipa/dnssec/ipa-dnskeysyncd.keytab
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG using ccache /tmp/ipa-dnskeysync-replica.ccache
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipalib.install.kinit: DEBUG Attempt 1/5: success
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG Got TGT
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG Connecting to LDAP
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG Connected
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG master keys in local HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG master keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG new master keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: INFO sync_metadata: keys in local HSM & LDAP: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG zone keys in local HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG zone keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: DEBUG new zone keys in LDAP HSM: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipa-dnskeysync-replica: INFO sync_metadata: keys in local HSM & LDAP: set()
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: Exception ignored in: <function LocalHSM.__del__ at 0x7ff4345be440>
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: Traceback (most recent call last):
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]:   File "/usr/lib/python3.10/site-packages/ipaserver/dnssec/localhsm.py", line 107, in __del__
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]:   File "/usr/lib/python3.10/site-packages/ipaserver/p11helper.py", line 939, in finalize
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: TypeError: 'NoneType' object is not callable
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.bindmgr: DEBUG Key metadata in LDAP: {}
Aug 27 20:19:02 ipa001 ipa-dnskeysyncd[1541]: ipaserver.dnssec.bindmgr: DEBUG Zones modified but skipped during bindmgr.sync: set()

the exception happens in the following line:

    def finalize(self):
        """
        Finalize operations with pkcs11 library
        """
        if self.p11 == NULL:
            return

        #
        # Logout
        #
        rv = self.p11.C_Logout(self.session)
>>>>    check_return_value(rv, "log out")

        #
        # End session
        #
        rv = self.p11.C_CloseSession(self.session)
        check_return_value(rv, "close session")

        #
        # Finalize
        #
        self.p11.C_Finalize(NULL)

        self.p11_ptr[0] = NULL
        self.session_ptr[0] = 0
        self.module_handle = None

and then in LocalHSM.__del__() self.p11 is None, so self.p11.finalize() could not be called. But I cannot understand how the call to finalize() actually happened in the first place if self.p11 is None...
So I agreed with ab that this is a very weird traceback, though I thought it was actually even weirder than he did (it can't be self.p11 that's None, because we're already *in that instance's finalize() method in the traceback*; ostensibly the only thing that could be None is `check_return_value`, because that's the only callable on the line where we encounter the error). Still, that's a function that clearly is defined, so it being None would be, uh, odd.

So I figured I'd debug it by logging status at every step of that `finalize()` method to a file, which involved adding this as line 934 in the file, right at the start of finalize():

with open("/var/tmp/p11helper.log", "a") as logfh:

...and now, with that line, I get this error:

Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]: Exception ignored in: <function LocalHSM.__del__ at 0x7f16fbc963b0>
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]: Traceback (most recent call last):
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]:   File "/usr/lib/python3.10/site-packages/ipaserver/dnssec/localhsm.py", line 107, in __del__
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]:   File "/usr/lib/python3.10/site-packages/ipaserver/p11helper.py", line 934, in finalize
Aug 31 14:12:11 ipa001.test.openqa.fedoraproject.org ipa-dnskeysyncd[1541]: NameError: name 'open' is not defined

There is no normal circumstance in Python where 'open' would not be defined. So I think what's actually happening here is that we're somehow winding up reaching this code very, very late in the lifecycle of the entire process, while the interpreter itself is being shut down, and that's why we're getting these bizarre errors. I am not sure why this is happening. A hack might be to explicitly delete the `localhsm` object at the end of `localhsm.py`, possibly, rather than relying on garbage collection. I'll test that.
I tried adding an explicit `del(localhsm)` but it didn't help, same traceback as above. :/
The "NameError: name 'open' is not defined" string led me to: https://bugs.python.org/issue26789#msg380254 Quoting Victor: "I pushed a change in Python 3.10 which reduces the risk of "NameError: name 'open' is not defined" when logging really late during the Python finalization. It's a workaround on fragile finalization code. Please don't rely on it! You should release resources explicitly at exit. Don't rely on implicit __del__() finalizers." + "I chose to restrict this issue to the very specific case of NameError when calling logging.FileHandler.emit() late during the Python finalization. If someone wants to enhance another function, please open a new issue." + "I will not backport my change since the Python finalization is really fragile and my logging fix rely on many enhancements made in the master that will not be backported to Python 3.9 and older on purpose. Changes on the finalization are also fragile and can introduce new subtile bugs." Christian, could you please have a look?
Looking at it again, I might not have added the del() call in the place where we need it to fix this issue. So I tweaked it and am trying again, we'll see how that goes.
So I tried that three times and it worked, then a fourth time and it failed again :( The server logs from the fourth try don't seem to have the traceback, but I accidentally lost my hack on the server to upload the logs later, so I've restored that and am trying it again, will confirm the state next time it fails. It's possible that the bug is not actually caused by these end-of-process failures at all?
yeah, confirmed, my change makes the ignored exceptions go away, but the actual bug still happens :( Will attach the new server log tarball.
Created attachment 1820191 [details] server log tarball with python exception issue patched
https://github.com/AdamWill/freeipa/commit/a02fa51e81773ae99593ac5d3def367871d63c99 is the change I made, BTW.
In the logs you uploaded I don't see any issues at all. bind did load the zones, ipa-dnskeysyncd shows no problems in operation, and so on. Perhaps the issue is somewhere else?

We should be generating two config snippets, one for NetworkManager and one for systemd-resolved, that point them to 127.0.0.1 (the IPA DNS server). In the messages I see the former being used by NetworkManager:

Sep 2 21:15:40 ipa001 NetworkManager[717]: <info> [1630631740.8781] Read config: /etc/NetworkManager/NetworkManager.conf (etc: zzz-ipa.conf)

and it looks like systemd-resolved at least knows to look into 127.0.0.1 for some zones:

Sep 2 21:15:41 ipa001 systemd-resolved[634]: Using degraded feature set UDP+EDNS0 instead of TLS+EDNS0 for DNS server 127.0.0.1.

so maybe they are working. However, the logs have no entries for any named queries.
I suppose it might be interesting to know if queries that the FreeIPA resolver can serve *itself* work (i.e. hosts within the domain), i.e. if this only affects queries that should be forwarded...I'll look into that tomorrow. any idea why the bind debug logging thing we set up doesn't seem to be working any more?
No idea. But you are not collecting /var/named/* and /var/log/named.log.
No, because that was not what you told me to collect when we set this up :D We collect /var/named/data/named.run , which was supposed to be where the logging goes. It does *exist*, but it's a 0-length file.
OK, so I tweaked things a bit more. I added a DNS query for the server's hostname to the cycle, and indeed, at the point where the server is no longer returning correct responses to a query for 'fedoraproject.org', it *does* return a correct response to a query for its own hostname (it returns its own IP address). So it seems the issue affects requests that need to be forwarded, but not ones the name server can resolve itself.

I also tweaked the server end to upload /var/named , and that does seem to have some interesting stuff in it. Particularly there's a /var/named/data/query_errors.log , which has these juicy errors:

03-Sep-2021 13:55:31.360 info: client @0x7fa770067ed8 172.16.2.103#51317 (fedoraproject.org): query failed (broken trust chain) for fedoraproject.org/IN/A at ../../../lib/ns/query.c:7376
03-Sep-2021 13:55:47.495 info: client @0x7fa770067ed8 172.16.2.103#43512 (fedoraproject.org): query failed (broken trust chain) for fedoraproject.org/IN/A at ../../../lib/ns
(...etc...same error repeated many times)

A note on timing: 13:55 is exactly when the server comes back up after the upgrade, going by the timestamps on the queries on the client side. 13:45 is the last time it gets a (correct) response before the server reboots for the upgrade and goes down, 13:55 is the first time it gets a (wrong) response after the server is back up. So these errors definitely started showing up after the upgrade, when the responses started being wrong.

So, we might be looking at some kind of SSL certificate issue here or something? But it is only happening on upgrades, this bug is not happening in tests of *fresh deployment* of FreeIPA on 35/Rawhide. I'll attach the /var/named tarball.
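For reference, the client-side check boils down to something like this (a sketch, not the literal test code; 172.16.2.100 is the FreeIPA server's address in this setup):

# a name the FreeIPA server is authoritative for - this keeps resolving after the upgrade
dig +short @172.16.2.100 ipa001.test.openqa.fedoraproject.org

# a name it has to forward - this completes (NOERROR) but comes back with no
# A records once the bug has triggered
dig +short @172.16.2.100 fedoraproject.org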
Created attachment 1820350 [details] tarball of /var/named (server side) from a failed run
This error (broken trust chain) means we have an active DNSSEC validation but non-trusted trust chain. From the tarballs' /var/named/data/dnssec.log: 03-Sep-2021 13:38:27.532 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out 03-Sep-2021 13:55:31.357 info: validating org/DS: no valid signature found 03-Sep-2021 13:55:31.359 info: validating org/DS: no valid signature found 03-Sep-2021 13:55:31.359 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:55:38.915 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out 03-Sep-2021 13:55:47.495 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:56:02.577 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:56:17.568 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:56:32.727 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:56:47.771 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:57:02.861 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:57:17.995 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:57:33.096 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:03.203 info: validating org/DS: no valid signature found 03-Sep-2021 13:58:03.204 info: validating org/DS: no valid signature found 03-Sep-2021 13:58:03.204 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:07.845 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:09.204 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out 03-Sep-2021 13:58:20.111 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS) 03-Sep-2021 13:58:20.113 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS) 03-Sep-2021 13:58:22.369 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:37.574 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:52.653 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:53.209 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:53.217 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:59:07.706 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:59:16.154 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS) 03-Sep-2021 13:59:16.155 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS) 03-Sep-2021 13:59:22.807 info: validating org/DNSKEY: bad cache hit (org/DS) This is what we see asked by the clients: 03-Sep-2021 13:59:16.151 info: client @0x7fed681d7058 127.0.0.1#47293 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN AAAA +E(0) (127.0.0.1) 03-Sep-2021 13:59:16.151 info: client @0x7fed70a97ff8 127.0.0.1#50497 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN A +E(0) (127.0.0.1) 03-Sep-2021 13:59:16.155 info: client @0x7fed681d7058 127.0.0.1#47293 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN AAAA +E(0) (127.0.0.1) 03-Sep-2021 13:59:16.155 info: client @0x7fed681d7058 127.0.0.1#47293 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN AAAA + (127.0.0.1) 03-Sep-2021 13:59:16.155 info: client @0x7fed70a97ff8 127.0.0.1#50497 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN A +E(0) (127.0.0.1) 03-Sep-2021 13:59:16.156 info: client @0x7fed70a97ff8 127.0.0.1#50497 (2.fedora.pool.ntp.org): query: 2.fedora.pool.ntp.org IN A + (127.0.0.1) 03-Sep-2021 13:59:22.792 info: client @0x7fed71bed2b8 172.16.2.103#54345 (fedoraproject.org): query: fedoraproject.org IN A +E(0)K 
(172.16.2.100)
03-Sep-2021 13:59:28.022 info: client @0x7fed71bed2b8 172.16.2.103#46665 (ipa001.test.openqa.fedoraproject.org): query: ipa001.test.openqa.fedoraproject.org IN A +E(0)K (172.16.2.100)

Looking at the previous logs, I can see that you are installing the IPA server without disabling DNSSEC validation:

2021-09-03T00:51:09Z DEBUG Logging to /var/log/ipaserver-install.log
2021-09-03T00:51:09Z DEBUG ipa-server-install was invoked with arguments [] and options: {'unattended': True, 'ip_addresses': None, 'domain_name': 'test.openqa.fedoraproject.org', 'realm_name': 'TEST.OPENQA.FEDORAPROJECT.ORG', 'host_name': None, 'ca_cert_files': None, 'domain_level': None, 'setup_adtrust': False, 'setup_kra': False, 'setup_dns': True, 'idstart': None, 'idmax': None, 'no_hbac_allow': False, 'no_pkinit': False, 'no_ui_redirect': False, 'dirsrv_config_file': None, 'skip_mem_check': False, 'dirsrv_cert_files': None, 'http_cert_files': None, 'pkinit_cert_files': None, 'dirsrv_cert_name': None, 'http_cert_name': None, 'pkinit_cert_name': None, 'mkhomedir': False, 'ntp_servers': None, 'ntp_pool': None, 'no_ntp': False, 'force_ntpd': False, 'ssh_trust_dns': False, 'no_ssh': False, 'no_sshd': False, 'no_dns_sshfp': False, 'external_ca': False, 'external_ca_type': None, 'external_ca_profile': None, 'external_cert_files': None, 'subject_base': None, 'ca_subject': None, 'ca_signing_algorithm': None, 'pki_config_override': None, 'allow_zone_overlap': True, 'reverse_zones': ['2.16.172.in-addr.arpa'], 'no_reverse': False, 'auto_reverse': False, 'zonemgr': None, 'forwarders': None, 'no_forwarders': False, 'auto_forwarders': True, 'forward_policy': None, 'no_dnssec_validation': False, 'no_host_dns': False, 'enable_compat': False, 'netbios_name': None, 'no_msdcs': False, 'rid_base': None, 'secondary_rid_base': None, 'ignore_topology_disconnect': False, 'ignore_last_of_role': False, 'verbose': False, 'quiet': False, 'log_file': None, 'uninstall': False}

e.g. 'no_dnssec_validation': False, so named is validating DNSSEC responses for the queries and rejecting the ones that are not trusted. I think this comes from the F34 installation and was kept over the upgrade. Then systemd-resolved as the system resolver is taking effect and perhaps also enforcing DNSSEC validation on its own?
+4 in https://pagure.io/fedora-qa/blocker-review/issue/419 , marking accepted.
and yes, we don't explicitly specify anything either way about dnssec when deploying the server. The default will be used. This is the same on all releases.
Adam,

Would it be possible to test without systemd-resolved please? e.g. before ipa-server-install:

# rm -f /etc/resolv.conf
# cp /run/systemd/resolve/resolv.conf /etc/resolv.conf
As Adam already said, problems like "NameError: name 'open' is not defined" can occur during shutdown of the Python interpreter, when a finalizer function is called late in the shutdown process. The Python interpreter is shut down in multiple stages. The final steps of the lifecycle cleanup phase are not deterministic and can lead to bugs like the NameError exception. It is recommended to avoid __del__ and not to rely on reference counting with implicit cleanups. The "with" statement with explicit __enter__ and __exit__ scopes is better suited for resource management. Other alternatives are weak references with callbacks or atexit hooks. If you have to use __del__, then the __del__ methods and all methods called by __del__ cannot rely on global or module local variables.
fcami - I tried without resolved, by putting back the code we used to use to do that all the time before FreeIPA was fixed to work with resolved:

script_run "systemctl stop systemd-resolved.service";
script_run "systemctl disable systemd-resolved.service";
script_run "rm -f /etc/resolv.conf";
script_run "systemctl restart NetworkManager";

The failure still happens, see https://openqa.stg.fedoraproject.org/tests/1315922 . Do you need logs from this case also?
We already had an issue with DNSSEC validation to unrelated domains last year with https://pagure.io/fedora-infrastructure/issue/9411. In this case the IPA server host where named is running cannot reach root DNS servers to get DNSSEC key for .org domain and more: this is from your /var/named tarball, lame-servers.log: 03-Sep-2021 13:55:31.358 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.34#53 03-Sep-2021 13:55:31.359 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.33#53 03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'org/DNSKEY/IN': 10.3.163.33#53 03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DS/IN': 10.3.163.33#53 03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DNSKEY/IN': 10.3.163.34#53 03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.33#53 03-Sep-2021 13:55:47.495 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 03-Sep-2021 13:56:02.577 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 03-Sep-2021 13:56:17.568 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 03-Sep-2021 13:56:32.727 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 03-Sep-2021 13:56:47.771 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 03-Sep-2021 13:57:02.861 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 03-Sep-2021 13:57:17.995 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 03-Sep-2021 13:57:33.096 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53 this is from the same tarball, dnssec.log: 03-Sep-2021 13:58:09.204 warning: managed-keys-zone: Unable to fetch DNSKEY set '.': timed out 03-Sep-2021 13:58:20.111 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS) 03-Sep-2021 13:58:20.113 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS) 03-Sep-2021 13:58:22.369 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:37.574 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:52.653 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:53.209 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:58:53.217 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:59:07.706 info: validating org/DNSKEY: bad cache hit (org/DS) 03-Sep-2021 13:59:16.154 info: validating 2.fedora.pool.ntp.org/AAAA: bad cache hit (org/DS) 03-Sep-2021 13:59:16.155 info: validating 2.fedora.pool.ntp.org/A: bad cache hit (org/DS) 03-Sep-2021 13:59:22.807 info: validating org/DNSKEY: bad cache hit (org/DS) If I'd do from my side 'dig +dnssec +crypto +trace 2.fedora.pool.ntp.org', here is what I get: ; <<>> DiG 9.16.20-RH <<>> +dnssec +crypto +trace 2.fedora.pool.ntp.org ;; global options: +cmd . 83325 IN NS a.root-servers.net. . 83325 IN NS g.root-servers.net. . 83325 IN NS k.root-servers.net. . 83325 IN NS d.root-servers.net. . 83325 IN NS b.root-servers.net. . 83325 IN NS l.root-servers.net. . 83325 IN NS m.root-servers.net. . 83325 IN NS i.root-servers.net. . 83325 IN NS f.root-servers.net. . 83325 IN NS j.root-servers.net. . 83325 IN NS h.root-servers.net. . 83325 IN NS e.root-servers.net. . 83325 IN NS c.root-servers.net. ;; Received 239 bytes from 127.0.0.53#53(127.0.0.53) in 6 ms org. 172800 IN NS a0.org.afilias-nst.info. org. 172800 IN NS a2.org.afilias-nst.info. org. 
172800 IN NS b0.org.afilias-nst.org. org. 172800 IN NS b2.org.afilias-nst.org. org. 172800 IN NS c0.org.afilias-nst.info. org. 172800 IN NS d0.org.afilias-nst.org. org. 86400 IN DS 26974 8 2 4FEDE294C53F438A158C41D39489CD78A86BEB0D8A0AEAFF14745C0D 16E1DE32 org. 86400 IN RRSIG DS 8 1 86400 20210922050000 20210909040000 26838 . KqQbJs2Tj/xEWKWRnB28vSTEzVZKt7vJMsVuXKZby8X/pzGLCudt6l2M jwT36YWNf312gx/s5f3gZj4LXJXr3BrViT8LvLIVVM0Dgv0O4M3VQxeD Sk9zQ0fzrGMUuDJlcmFeOk+CHAuQJ+pRbe4Kpq3Fa2H8jULN7kLXwWNk lbrk1wQerX2jpoqCmy5/vtE+uL71Aa8DqOqVGG9MCXS6Ir0xPlRCOiVc xQtyzIXYDRhFZJcIMJyc5IBEFN6hmQj4SkoM2x7Q/j9Xzl+RizauV2+q vCvh+bQ24WXxejkKunVXAmOOTVX8HXCbPtgDLd6HBsap1t4a24ExbarU pQzuzA== ;; Received 787 bytes from 192.5.5.241#53(f.root-servers.net) in 5 ms ntp.org. 86400 IN NS ns1.everett.org. ntp.org. 86400 IN NS dns2.udel.edu. ntp.org. 86400 IN NS anyns.pch.net. ntp.org. 86400 IN NS ns3.p20.dynect.net. ntp.org. 86400 IN NS dns1.udel.edu. ntp.org. 86400 IN NS ns2.p20.dynect.net. ntp.org. 86400 IN NS ns4.p20.dynect.net. ntp.org. 86400 IN NS ns1.p20.dynect.net. 1i870vj5h429vj9pci7ar6e9gki74tr7.org. 86400 IN NSEC3 1 1 10 332539EE7F95C32A 1I87R64GAJU4O91MHKBU7I9EKBS7K8UT NS SOA RRSIG DNSKEY NSEC3PARAM 1i870vj5h429vj9pci7ar6e9gki74tr7.org. 86400 IN RRSIG NSEC3 8 2 86400 20210930065745 20210909055745 39681 org. OoqLJlC/o4Qbm8u9acQBI8mfetfRSUayji2GL4mEXOD0JG/K2TxnOWt7 tAYop425ab/aaj0EcSPgXNe/V1GxIL7gKKKR3O4Uz9mAEwHS6e6CKuKK Ga5wpcsH8SSIW1VWWCCE1LLEC3Uan7hokhtkPOcMv3UZhaNfUi8rYdWx de4= lganq7sb2am5074kbdd23tmhq8bib2rk.org. 86400 IN NSEC3 1 1 10 332539EE7F95C32A LGATR75FR5K31LDKDA447JHEH6AFAVAN NS DS RRSIG lganq7sb2am5074kbdd23tmhq8bib2rk.org. 86400 IN RRSIG NSEC3 8 2 86400 20210922152219 20210901142219 39681 org. c+E4fT8qV31p34ejoYa6qAX7l7z3ScW6i/E4DDXcCLAFfbrSID6aQ+za IkJYSaEAjB99xrylMB94Q9VImYC3bJiqm3dH/Wdo9n46rM5u3zKGgOWW sbCG/2R1QObcsybFdcyh8AqVIm3CDpG8iv8geamQINIvj96IJL2XcOKl D2I= ;; Received 762 bytes from 199.249.120.1#53(b2.org.afilias-nst.org) in 11 ms pool.ntp.org. 604800 IN NS i.ntpns.org. pool.ntp.org. 604800 IN NS a.ntpns.org. pool.ntp.org. 604800 IN NS b.ntpns.org. pool.ntp.org. 604800 IN NS c.ntpns.org. pool.ntp.org. 604800 IN NS d.ntpns.org. pool.ntp.org. 604800 IN NS e.ntpns.org. pool.ntp.org. 604800 IN NS f.ntpns.org. pool.ntp.org. 604800 IN NS g.ntpns.org. pool.ntp.org. 604800 IN NS h.ntpns.org. ;; Received 243 bytes from 128.175.13.16#53(dns1.udel.edu) in 113 ms 2.fedora.pool.ntp.org. 150 IN A 37.228.129.2 2.fedora.pool.ntp.org. 150 IN A 95.216.218.15 2.fedora.pool.ntp.org. 150 IN A 62.241.198.251 2.fedora.pool.ntp.org. 150 IN A 194.100.49.152 ;; Received 226 bytes from 185.126.112.98#53(a.ntpns.org) in 39 ms I think this is not an issue in FreeIPA at all but rather a play of infrastructure limitations you have in the openqa.fedoraproject.org. I remember that Red Hat datacenters are not allowing to access root DNS servers directly, may be this is what we see here? A solution is to instruct IPA to disable DNSSEC validation at install time with --no-dnssec-validation option to ipa-server-install.
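For the openQA deployment that would mean adding the option to the existing ipa-server-install invocation, roughly like this (only a sketch - the other options shown are just the ones visible in the install log above, not the full command line the test uses):

# disable DNSSEC validation in the embedded BIND at install time
ipa-server-install -U --setup-dns --auto-forwarders --allow-zone-overlap \
    --no-dnssec-validation [...other options as already used by the test...]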
But that does not explain why this bug only happens on upgrade to F35 or F36. It does not happen on fresh deployment on F35 or F36, and it does not happen in any circumstance on F33 or F34 (including upgrades from F33 to F34).
well...hmm. if the issue is restricted to fedoraproject.org, that could possibly explain it. it may happen to be the case that this is the only path where we actually wind up resolving a fedoraproject.org domain via the FreeIPA server's nameserver. in other cases we may happen to only do that before the FreeIPA server is acting as the nameserver. I'll have to look into that tomorrow.
Can you please try to disable DNSSEC validation in the test? We'll see if that's the reason to return the NTP server entries first. There is definitely a problem with contacting the root servers.
BTW, test in https://openqa.stg.fedoraproject.org/tests/1315922 is a client machine and it is still running systemd-resolved: Sep 9 00:22:32 client003 sssd_be[3016]: Backend is offline Sep 9 00:24:33 client003 sssd_be[3016]: Backend is online Sep 9 00:30:35 client003 systemd-resolved[615]: Grace period over, resuming full feature set (UDP+EDNS0) for DNS server 172.16.2.100. (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [sdap_kinit_send] (0x0400): [RID#7] Attempting kinit (default, host/client003.test.openqa.fedoraproject.org, TEST.OPENQA.FEDORAPROJECT.ORG, 86400) (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [sdap_kinit_next_kdc] (0x1000): [RID#7] Resolving next KDC for service IPA (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [fo_resolve_service_send] (0x0100): [RID#7] Trying to resolve service 'IPA' (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [get_server_status] (0x1000): [RID#7] Status of server 'ipa001.test.openqa.fedoraproject.org' is 'name resolved' (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [get_port_status] (0x1000): [RID#7] Port status of port 0 for server 'ipa001.test.openqa.fedoraproject.org' is 'working' (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [fo_resolve_service_activate_timeout] (0x2000): [RID#7] Resolve timeout [dns_resolver_timeout] set to 6 seconds (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [get_server_status] (0x1000): [RID#7] Status of server 'ipa001.test.openqa.fedoraproject.org' is 'name resolved' (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [be_resolve_server_process] (0x1000): [RID#7] Saving the first resolved server (2021-09-09 0:30:52): [be[test.openqa.fedoraproject.org]] [be_resolve_server_process] (0x0200): [RID#7] Found address for server ipa001.test.openqa.fedoraproject.org: [172.16.2.100] TTL 1200 and then access to kojipkgs.fedoraproject.org fails: 2021-09-09T00:30:35-0400 DDEBUG Extra commands: ['-y', '--releasever=35', 'system-upgrade', 'download'] 2021-09-09T00:30:35-0400 DEBUG User-Agent: constructed: 'libdnf (Fedora 34; server; Linux.x86_64)' 2021-09-09T00:30:35-0400 DEBUG repo: downloading from remote: fedora 2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml). 2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml). 2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml). 
2021-09-09T00:30:35-0400 DEBUG error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] (https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml). 2021-09-09T00:30:36-0400 WARNING Errors during downloading metadata for repository 'fedora': - Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/compose/branched/Fedora-35-20210908.n.0/compose/Everything/x86_64/os/repodata/repomd.xml [Could not resolve host: kojipkgs.fedoraproject.org] 2021-09-09T00:30:36-0400 DDEBUG Cleaning up. If that is DNSSEC validation, we don't have .org DNSSEC key to build the trust chain to validate, thus kojipkgs.fedoraproject.org would fail to resolve with DNSSEC validation enabled similar to 2.fedora.pool.ntp.org
The problem could be related to tighter crypto policies and the deprecation of weak algorithms like SHA-1:

- deprecate derived properties: tls_cipher, ssh_cipher, ssh_group, ike_protocol, sha1_in_dnssec
https://src.fedoraproject.org/rpms/crypto-policies/c/3e079198c139b10d51f7d213652961a63704e384?branch=f35

Do any of the upstream DNS servers use old signatures or weak keys?
See above, I copied the corresponding .org entries here: org. 86400 IN DS 26974 8 2 4FEDE294C53F438A158C41D39489CD78A86BEB0D8A0AEAFF14745C0D 16E1DE32 org. 86400 IN RRSIG DS 8 1 86400 20210922050000 20210909040000 26838 . KqQbJs2Tj/xEWKWRnB28vSTEzVZKt7vJMsVuXKZby8X/pzGLCudt6l2M jwT36YWNf312gx/s5f3gZj4LXJXr3BrViT8LvLIVVM0Dgv0O4M3VQxeD Sk9zQ0fzrGMUuDJlcmFeOk+CHAuQJ+pRbe4Kpq3Fa2H8jULN7kLXwWNk lbrk1wQerX2jpoqCmy5/vtE+uL71Aa8DqOqVGG9MCXS6Ir0xPlRCOiVc xQtyzIXYDRhFZJcIMJyc5IBEFN6hmQj4SkoM2x7Q/j9Xzl+RizauV2+q vCvh+bQ24WXxejkKunVXAmOOTVX8HXCbPtgDLd6HBsap1t4a24ExbarU pQzuzA== ;; Received 787 bytes from 192.5.5.241#53(f.root-servers.net) in 5 ms fedoraproject.org. 86400 IN DS 16207 5 2 A7C9BF5AFE374C9650ED678F3D36931A7DE9256B86A7BC34D6DEED7D 4E492E5E fedoraproject.org. 86400 IN DS 16207 5 1 8DD099791A2A110851FDE5D14F6C62ADC3DD7C18 fedoraproject.org. 86400 IN RRSIG DS 8 2 86400 20210922152219 20210901142219 39681 org. JUYGOiOY2fcqhZBv66vS1lSzrTP8cQpFX6fkXlxPfSTlWAlkLYCON650 GcJ5JWukiVM66Mt6t+FCGANcma5XbttoQlRQZrMHBxUhsF2uBfLSsENH tj/9OxFJ9Ai/vJe2tUtY12/e/Rdd9C1Kp7obsLsTyKZrnLqZgzDceagh Ek8= ;; Received 506 bytes from 199.19.54.1#53(b0.org.afilias-nst.org) in 274 ms DS 26974 8 2 is RSA/SHA-256 algorithm and RSA-SHA256 digest but RRSIG DS 8 1 is RSA/SHA-256 algorithm and RSA-SHA1 digest DS 16207 5 2 is using SHA-256 digest too. I have no problems resolving these in a F35 container with the DEFAULT or FIPS crypto policies using dig. NAMED should be using the same underlying library.
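If someone wants to double-check the crypto-policy angle on the affected VM, a quick sanity check would be something like this (a sketch - I'm assuming crypto-policies still generates a BIND back-end at this path and that the stock named.conf includes it):

# which policy is active system-wide
update-crypto-policies --show

# what crypto-policies actually feeds to BIND (disabled algorithms / DS digest
# types); SHA-1 showing up as disabled here would point at the policy
cat /etc/crypto-policies/back-ends/bind.config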
"BTW, test in https://openqa.stg.fedoraproject.org/tests/1315922 is a client machine and it is still running systemd-resolved:" oh, yeah, sorry, I forgot we only disabled it for servers before. I'll have to extend that snippet to also apply on clients.
hmm, so, I looked at whether it's the case that we only resolve fedoraproject.org domains after we start using the FreeIPA server as the DNS server in the upgrade case. That's not true; we definitely resolve fedoraproject.org domains after the server becomes the DNS server in non-upgrade cases too. So it is still significant that non-upgrade tests do not encounter the bug; we are able to resolve fedoraproject.org domains just fine, reliably, in the non-upgrade tests where server and client ends are both F35 or F36 all along. But it did make me notice something else. One thing that's unique at the time we fail in the upgrade case is that when we fail, the server is upgraded but *the client is not*. So we have, temporarily, a Fedora 34 client using a Fedora 35 or Fedora 36 FreeIPA server as its DNS server. Could that be related, somehow? Some kind of crypto mismatch there, maybe?
Tried with resolved disabled on the client too, still got the bug: https://openqa.stg.fedoraproject.org/tests/1318733
Is this ever reproducible outside OpenQA and Fedora infrastructure? So the idea is that we do F34, then install IPA with DNS but don't enable DNSSEC for its own zones, then upgrade to F35, and after the upgrade has happened (IPA upgraded as well), the local IPA DNS server starts rejecting queries for Fedora hosts (e.g. bodhi.fedoraproject.org or NTP pool addresses). Perhaps not just Fedora Project hosts but any .org hosts? Is that the case? I am trying to find a way to build a reproducer outside Fedora infrastructure...
"Is this ever reproducible outside OpenQA and Fedora infrastructure?" I haven't tested, yet, because it's an awkward test to do manually and I have a dozen other blockers to juggle. I no longer have a pet local openQA deployment, I ditched all that hardware a while back. Lukas has a containerized one on his laptop but I dunno if it's capable of running the more complex tap tests. Lukas, is it?
(In reply to Adam Williamson from comment #36)
> "Is this ever reproducible outside OpenQA and Fedora infrastructure?"
>
> I haven't tested, yet, because it's an awkward test to do manually and I
> have a dozen other blockers to juggle. I no longer have a pet local openQA
> deployment, I ditched all that hardware a while back. Lukas has a
> containerized one on his laptop but I dunno if it's capable of running the
> more complex tap tests. Lukas, is it?

Let me try to set it up for running tomorrow.
So I can't say if it happens outside of Fedora infra (hopefully Lukas can tell us - Lukas, note you must run the test several times if it passes; the bug is intermittent, I wouldn't be sure it doesn't happen until you try at least 10 times), but I did just test that it affects more than just fedoraproject.org when it happens. I had the test also try to resolve two other domains with known-good DNSSEC configuration, per https://www.internetsociety.org/resources/deploy360/2013/dnssec-test-sites/ : internetsociety.org and dnssec-tools.org . When the bug happens, it also affects those - querying them returns no records. See https://openqa.stg.fedoraproject.org/tests/1330891#step/realmd_join_sssd/19 .
I am currently trying to make the tests run in my OpenQA, but although I have 5 workers running, the three tests refuse to start. I will keep you posted ...
are your workers actually configured as tap workers? Do other tap tests run?
(In reply to Adam Williamson from comment #40)
> are your workers actually configured as tap workers? Do other tap tests run?

I am not sure. I am attaching the output of my workers for you to see. How do I know if they are tap workers?

Btw, I was able to install FreeIPA on Fedora 35, so I will try to install F34 too and check manually if the issue hits me.
Created attachment 1823870 [details] List of workers
Created attachment 1823872 [details] Worker detail
Created attachment 1823921 [details] Waiting workers
What the attachment shows as waiting workers are really scheduled tests that refuse to start. So my workers might be set up incorrectly.
Also, I installed the FreeIPA server on Fedora 34 and the client on F35. I could use kinit to obtain the kerberos ticket, and all required info from "QA:Testcase realmd join sssd" could be obtained from the server. Then I started ping on the client and upgraded the server to F35. The host stopped answering when the upgrade was in progress, but then started answering the ping again. Also, when it fully started, I could again perform the steps from "QA:Testcase realmd join sssd" and obtained the ticket. I was using VMs on my local machine. To me, it seemed that it behaved normally on the first run; however, I am not an experienced FreeIPA admin or even user. I could try more times if needed, but the setup process takes quite some time.
Thanks Lukas. I assume it is using integrated DNS server (e.g. you set up IPA server as ipa-server-install --setup-dns) and your client was pointed to IPA server to resolve DNS. This is the part of the test case that seems to cause an issue in OpenQA: after installing IPA server on F34 and upgrading it to F35, at some point after completing the upgrade DNS server on IPA server stops resolving .org domains because DNSSEC chain of trust cannot be verified for them. At least, that's what happened in OpenQA instance in Fedora infrastructure.
Right, to reproduce the issue you need to do this:

1. deploy an F34 server
2. enrol an F34 client
3. configure the client to use the server as its DNS server
4. upgrade the server to F35
5. from the client, try and resolve fedoraproject.org or internetsociety.org or dnssec-tools.org

and you need to do it multiple times, because the bug doesn't always happen. I can try and do this with VMs today, but I don't really have my networking set up properly to allow it ATM.

Lukas, we can try and debug the openQA stuff elsewhere, I guess, to avoid bugging others on this ticket.
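Steps 3 and 5 on the client boil down to something like this (a rough sketch, not the test code; substitute your own server address, and keep systemd-resolved out of the way as discussed above):

# step 3: point the client at the FreeIPA server for DNS
echo "nameserver <IPA server IP>" > /etc/resolv.conf

# step 5: each of these should return A records; when the bug hits,
# the queries complete but come back with no addresses
dig +short fedoraproject.org
dig +short internetsociety.org
dig +short dnssec-tools.org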
(In reply to Adam Williamson from comment #48)
> right. to reproduce the issue you need to do this:
>
> 1. deploy F34 server
> 2. enrol an F34 client
> 3. configure the client to use the server as its DNS server
> 4. upgrade the server to F35
> 5. from the client, try and resolve fedoraproject.org or internetsociety.org
> or dnssec-tools.org .

So, today I performed like 5 runs of this test to see if I hit it, but I could not. I am doing this:

1. Set up the server VM to resolve through my host -> 192.168.122.1
2. Obtain the server VM IP address -> 192.168.122.32
3. Set up the client VM to resolve through the server.
4. Run the dig command for fedoraproject.org (for 34 and for upgrade to F35)

===
; <<>> DiG 9.16.20-RH <<>> fedoraproject.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13137
;; flags: qr rd ra; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: d65c4143da6dfd910100000061489b5c3df50b2f24cd3322 (good)
;; QUESTION SECTION:
;fedoraproject.org. IN A

;; ANSWER SECTION:
fedoraproject.org. 0 IN A 209.132.190.2
fedoraproject.org. 0 IN A 152.19.134.198
fedoraproject.org. 0 IN A 18.133.140.134
fedoraproject.org. 0 IN A 38.145.60.20
fedoraproject.org. 0 IN A 152.19.134.142
fedoraproject.org. 0 IN A 18.185.136.17
fedoraproject.org. 0 IN A 67.219.144.68
fedoraproject.org. 0 IN A 185.141.165.254
fedoraproject.org. 0 IN A 18.159.254.57
fedoraproject.org. 0 IN A 85.236.55.6
fedoraproject.org. 0 IN A 38.145.60.21
fedoraproject.org. 0 IN A 140.211.169.206

;; Query time: 70 msec
;; SERVER: 192.168.122.32#53(192.168.122.32)
;; WHEN: Mon Sep 20 16:31:56 CEST 2021
;; MSG SIZE rcvd: 266

It takes ages to reset the VM and perform the upgrade, so I could only do it like 5 times. The sixth time is running just now.
thanks for testing. can you also dig internetsociety.org and dnssec-tools.org for confirmation? if this works like 8-10 times we can probably say it doesn't need to block beta, at least.
So I tried a couple more things too. I wondered if this may depend on which DNS server bind winds up forwarding requests to - there are two candidates, as we configure the server end with two DNS servers before deploying FreeIPA - but that doesn't seem to be it; I hacked the test to always use the same server, ran it a few times, and got mixed results, which is not what I'd expect if the DNS server was the cause (it should either always pass or always fail). I also enabled dnssec on one of the openQA servers itself, which is a Fedora 34 system in Fedora infra, and tested resolving fedoraproject.org and some test dnssec-enabled domains. It behaved as expected every time. So, I'm running out of things to poke here, honestly. Anyone have any ideas on any debug we can do to figure out precisely *where* the resolution is breaking down when it does go wrong?
With the sixth upgrade attempt, I also dug internetsociety.org and dnssec-tools.org and both have been resolved fine:

; <<>> DiG 9.16.20-RH <<>> internetsociety.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44765
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 7afebc2664d4208401000000614988b9973e4c0c3ea2758c (good)
;; QUESTION SECTION:
;internetsociety.org. IN A

;; ANSWER SECTION:
internetsociety.org. 299 IN A 104.18.16.166
internetsociety.org. 299 IN A 104.18.17.166

;; Query time: 104 msec
;; SERVER: 192.168.122.32#53(192.168.122.32)
;; WHEN: Tue Sep 21 09:24:41 CEST 2021
;; MSG SIZE rcvd: 108

and

; <<>> DiG 9.16.20-RH <<>> dnssec-tools.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10762
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 9b628b0e4963b00201000000614988f10d5650262b8e9cca (good)
;; QUESTION SECTION:
;dnssec-tools.org. IN A

;; ANSWER SECTION:
dnssec-tools.org. 290 IN A 185.199.108.153
dnssec-tools.org. 290 IN A 185.199.111.153
dnssec-tools.org. 290 IN A 185.199.109.153
dnssec-tools.org. 290 IN A 185.199.110.153

;; Query time: 2 msec
;; SERVER: 192.168.122.32#53(192.168.122.32)
;; WHEN: Tue Sep 21 09:25:37 CEST 2021
;; MSG SIZE rcvd: 137
So, looking again into the tarball of comment 18, I can see we get the wrong responses from the upstream DNS servers: 03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:dc3::35#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:7fe::53#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './NS/IN': 2001:500:2::c#53 03-Sep-2021 13:38:17.537 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53 03-Sep-2021 13:38:17.538 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53 03-Sep-2021 13:38:17.538 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53 03-Sep-2021 13:38:17.538 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53 03-Sep-2021 13:38:17.538 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:200::b#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:7fd::1#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53 03-Sep-2021 13:38:17.539 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53 03-Sep-2021 13:38:17.540 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53 03-Sep-2021 13:38:17.540 info: network unreachable resolving './NS/IN': 2001:500:1::53#53 03-Sep-2021 13:54:51.265 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53 03-Sep-2021 13:54:51.265 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53 03-Sep-2021 13:54:51.265 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53 03-Sep-2021 13:54:51.265 info: network unreachable resolving './NS/IN': 2001:dc3::35#53 03-Sep-2021 13:54:51.265 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53 03-Sep-2021 13:54:51.266 info: network unreachable resolving './NS/IN': 2001:7fd::1#53 03-Sep-2021 13:54:51.266 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53 03-Sep-2021 13:54:51.266 info: network unreachable resolving './NS/IN': 2001:7fe::53#53 03-Sep-2021 13:54:51.266 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53 03-Sep-2021 13:54:51.266 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53 03-Sep-2021 13:54:51.266 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53 03-Sep-2021 
13:54:51.266 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:54:51.267 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.267 info: network unreachable resolving './NS/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.267 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.268 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.269 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.270 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.270 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './NS/IN': 2001:7fe::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './NS/IN': 2001:500:1::53#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53
03-Sep-2021 13:54:51.563 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './NS/IN': 2001:500:200::b#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
03-Sep-2021 13:54:51.564 info: network unreachable resolving './NS/IN': 2001:7fd::1#53
03-Sep-2021 13:54:51.565 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.565 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:dc3::35#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:500:2::c#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
03-Sep-2021 13:54:51.566 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:dc3::35#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:dc3::35#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:500:2f::f#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:500:2f::f#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:503:ba3e::2:30#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './DNSKEY/IN': 2001:500:1::53#53
03-Sep-2021 13:55:28.922 info: network unreachable resolving './NS/IN': 2001:500:1::53#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './NS/IN': 2001:500:9f::42#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './DNSKEY/IN': 2001:500:a8::e#53
03-Sep-2021 13:55:28.923 info: network unreachable resolving './NS/IN': 2001:500:a8::e#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:500:2::c#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:500:2::c#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:503:c27::2:30#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:500:12::d0d#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './DNSKEY/IN': 2001:500:2d::d#53
03-Sep-2021 13:55:28.928 info: network unreachable resolving './NS/IN': 2001:500:2d::d#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './DNSKEY/IN': 2001:7fe::53#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './NS/IN': 2001:7fe::53#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './DNSKEY/IN': 2001:500:200::b#53
03-Sep-2021 13:55:28.929 info: network unreachable resolving './NS/IN': 2001:500:200::b#53
03-Sep-2021 13:55:28.930 info: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
03-Sep-2021 13:55:28.930 info: network unreachable resolving './NS/IN': 2001:7fd::1#53
03-Sep-2021 13:55:31.358 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.34#53
03-Sep-2021 13:55:31.359 info: no valid RRSIG resolving 'org/DS/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'org/DNSKEY/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DS/IN': 10.3.163.33#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/DNSKEY/IN': 10.3.163.34#53
03-Sep-2021 13:55:31.359 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.33#53
03-Sep-2021 13:55:47.495 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:02.577 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:17.568 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:32.727 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:56:47.771 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:02.861 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:17.995 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53
03-Sep-2021 13:57:33.096 info: broken trust chain resolving 'fedoraproject.org/A/IN': 10.3.163.34#53

So 10.3.163.33 and 10.3.163.34 would be the cause of the problem in Fedora infrastructure, as none of the root servers are accessible in that environment. I would strongly suggest this is not a bug in FreeIPA/BIND/bind-dyndb-ldap.
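Since the named log points straight at the forwarders, one quick cross-check (a sketch only, not something the tests currently run; the 10.3.163.x addresses are simply the forwarders named in the log above, and dig is assumed to be available on the server) is to ask each forwarder directly whether it hands back DNSSEC data at all:

  # Query each forwarder with the DO bit set (+dnssec). A DNSSEC-capable
  # forwarder should include RRSIG records alongside the A records; an
  # answer with no RRSIGs means the forwarder is stripping DNSSEC data,
  # and any validating resolver behind it will then see a broken trust chain.
  for fwd in 10.3.163.33 10.3.163.34; do
      echo "== $fwd =="
      dig +dnssec +noall +answer fedoraproject.org A @"$fwd"
  done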
It would just be nice to know how we can debug exactly where/how the trust chain is breaking down. Does anyone know how to do that? Then at least we could reassign it somewhere useful. I think at this point it's reasonable to re-vote this for Beta blocker status, since it seems likely to be fairly limited in scope. So, clearing AcceptedBlocker to allow for a re-vote. I'm also going to try to come up with a workaround for openQA so I don't have to restart the tests all the damn time (I'm going to try something more fine-grained than just 'disable dnssec').
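One way to get that level of detail (a sketch only; the log path is an example and must be writable by named, with a matching SELinux context) is to give bind's dnssec logging category its own channel at a high debug level, so named records each validation step and the exact point where the chain fails:

  logging {
      channel dnssec_log {
          // example path; adjust to wherever named may write
          file "/var/named/data/dnssec.log" versions 3 size 20m;
          severity debug 3;
          print-time yes;
      };
      category dnssec { dnssec_log; };
  };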
sigh, an annoying note from my workaround attempt: when this bug happens, it's still broken even after the client completes its upgrade. I tried to implement the workaround by having the client switch to the 'normal' DNS server for the upgrade and then back to the FreeIPA server as its DNS server afterwards, but it still can't resolve fedoraproject.org even after the upgrade. So, I'm just gonna have to turn off dnssec on the server end for this test, I think :(
'delv -4 +rtrace +dnssec +short fedoraproject.org' may provide more info... Some more general info: fedoraproject.org is currently signed with both sha1 and sha384 keys, but only the sha1 DS record is in .org, so only that one should be 'live'. dnssec seems to work fine from the worker nodes themselves... so it's just in the installs/qemu that it's failing, I guess?
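Along the same lines, delv can be pointed at a specific server and asked to print its validation steps, which should show exactly which record fails to validate. A sketch, assuming the delv build in use supports +vtrace, and substituting your own addresses where needed:

  # Validate through the FreeIPA/bind instance on the server itself
  delv @127.0.0.1 +rtrace +vtrace +dnssec fedoraproject.org A

  # Validate against an upstream forwarder directly, bypassing the IPA server
  delv @10.3.163.33 +rtrace +vtrace +dnssec fedoraproject.org A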
Marking as a rejected blocker for Beta, per Adam.
Bit more on the above - we re-voted this at the Go/No-Go meeting today: https://meetbot-raw.fedoraproject.org/fedora-meeting/2021-09-23/f35-beta-go_no_go-meeting.2021-09-23-17.00.html and rejected it because it seems to be quite situation-specific - nobody has yet reproduced it outside Fedora infra. There is a reasonable workaround, too - disable dnssec on the server. Kevin, the specific failure case here is when we deploy a FreeIPA server and client on Fedora 34, then upgrade the server to Fedora 35. About half of the time, after that happens, lookups of dnssec-enabled domains from the client via the FreeIPA server's DNS server stop working - that is, the lookup completes, but the server returns no IP addresses. Upgrading the client to F35 doesn't fix things. In tests where we deploy directly to Fedora 35, this is not happening. It's only happening in the upgrade test. In this configuration the FreeIPA server is configured as a DNS server (using bind), forwarding any requests it doesn't resolve itself (so anything besides hosts enrolled in the domain) to the infra DNS servers. If we disable dnssec validation on the FreeIPA server, the bug doesn't happen. I'll try the command you suggested, thanks.
I don't know if this will help, but I am also seeing similar behavior after upgrading my network from Fedora 34 to Fedora 35 using dnf system-upgrade. I have 2 systems running as a master/replica. I noticed another behaviour: once I start having DNS troubles, IPA runs really slowly. Let me know if you want any logs/configuration information, etc. Output from delv after rebooting both IPA servers:

$ delv -4 +rtrace +dnssec +short fedoraproject.org
;; fetch: fedoraproject.org/A
;; fetch: fedoraproject.org/DNSKEY
;; fetch: fedoraproject.org/DS
;; fetch: org/DNSKEY
;; fetch: org/DS
;; fetch: ./DNSKEY
8.43.85.67
8.43.85.73
38.145.60.20
38.145.60.21
67.219.144.68
140.211.169.196
140.211.169.206
152.19.134.142
152.19.134.198
209.132.190.2

I noticed that the master IPA server will not respond to any DNS queries when this problem happens. My secondary server responds, but is unable to resolve queries for the local domain.
Thanks David! When you say the master server "will not respond", do you mean that running a query against it literally gets no reply, or that it gets a reply with no IP addresses?
Last night ipa-01 would hang and time out when I queried it using dig. ipa-02 would respond, but with no IP addresses. Here are some tests from this morning. I get a reply with no IP addresses for internal zones, while queries against the same nameserver for external zones work:

[david@ipa-01 ~]$ dig @ipa-01 ipa-01.evilduckie.lan a

; <<>> DiG 9.16.21-RH <<>> @ipa-01 ipa-01.evilduckie.lan a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13122
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2caa4738dba58e66010000006151dac477a23463242ae0db (good)
;; QUESTION SECTION:
;ipa-01.evilduckie.lan. IN A

;; AUTHORITY SECTION:
. 5383 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2021092700 1800 900 604800 86400

;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Mon Sep 27 08:52:52 CST 2021
;; MSG SIZE rcvd: 153

[david@ipa-01 ~]$ dig @ipa-01 www.google.ca a

; <<>> DiG 9.16.21-RH <<>> @ipa-01 www.google.ca a
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62272
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: d94f6c4a3b635102010000006151db1b2d7b4edba752984b (good)
;; QUESTION SECTION:
;www.google.ca. IN A

;; ANSWER SECTION:
www.google.ca. 300 IN A 142.251.41.67

;; Query time: 58 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Mon Sep 27 08:54:19 CST 2021
;; MSG SIZE rcvd: 86

The secondary nameserver, using +short..

[david@ipa-01 ~]$ dig @ipa-02 ipa-01.evilduckie.lan a +short
[david@ipa-01 ~]$ dig @ipa-02 www.google.ca a +short
142.251.33.163

Also, 'sudo ipactl restart' takes a really long time...
Hmm. This may not be the same issue, but I'm not really a DNS expert so it's hard to be sure :) Is there a difference in behaviour between external domains that are dnssec-enabled and those that are not? Try some of the sample public dnssec-enabled domains, like www.internetsociety.org ... see https://www.internetsociety.org/resources/deploy360/2013/dnssec-test-sites/ .
If it is an issue on the IPA DNS server side, then the logs will tell where a forwarded query breaks. You'd need to enable debug logs for bind. There are two separate parts here:

- DNSSEC validation -- applied to any DNS entry obtained during query resolution. FreeIPA enables DNSSEC validation by default, unless you provided '--no-dnssec-validation' to ipa-server-install/ipa-replica-install/ipa-dns-install.

- DNSSEC signing of DNS zones hosted by IPA itself. This is not enabled by default, and only one IPA master can be the DNSSEC master. The procedure requires an explicit act from the admin to enable DNSSEC signing.

If resolving external DNS zones through the IPA DNS server starts to return empty IP addresses, there might be a few factors:

- The IPA DNS server applies DNSSEC validation and the trust chain for the requested record is not complete. This is something you'll see in named's logs, similar to comments #18 and #53.

- Your local resolver on the host applies DNSSEC validation and sees a broken trust chain. This might happen with systemd-resolved, for example.

Between Fedora 34 and Fedora 35 both bind and systemd-resolved changed substantially. Note that tools like 'dig' or 'delv' use bind's internal libraries, so any improvements in bind's code regarding DNSSEC validation would also apply to those utilities. With 'delv' or 'dig' you can, at least, isolate which DNS component is 'broken' -- for example, your local resolver, the IPA DNS server, or something else.
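As a concrete version of that isolation step (a sketch; the addresses are placeholders, and 127.0.0.53 only applies if systemd-resolved is the local stub resolver), the same query can be run against each hop in turn:

  # 1. The local stub resolver (systemd-resolved listens on 127.0.0.53 when enabled)
  dig +dnssec fedoraproject.org A @127.0.0.53

  # 2. The FreeIPA/bind DNS server itself (replace with the IPA server's address)
  dig +dnssec fedoraproject.org A @10.0.0.2

  # 3. The upstream forwarder the IPA server uses (from its named forwarders config)
  dig +dnssec fedoraproject.org A @192.0.2.53

  # An empty ANSWER section or SERVFAIL at only one of these hops shows which
  # component is dropping, or failing to validate, the DNSSEC data.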
I'm okay with DNS, but definitely not an expert in DNSSEC. delv seems to be working as expected for DNSSEC enabled domains. [david@ipa-01 ~]$ delv -4 +rtrace +dnssec internetsociety.org ;; fetch: internetsociety.org/A ;; fetch: internetsociety.org/DNSKEY ;; fetch: internetsociety.org/DS ;; fetch: org/DNSKEY ;; fetch: org/DS ;; fetch: ./DNSKEY ; fully validated internetsociety.org. 172 IN A 104.18.17.166 internetsociety.org. 172 IN A 104.18.16.166 internetsociety.org. 172 IN RRSIG A 13 2 300 20210928172740 20210926152740 34505 internetsociety.org. or472O4CkqwJ15iXRYyXcb3Zj3t7aYmH3/3ure6o0SJ5k6CTlKdDpkJd IIy9SaWoPEs3FBb6ntcUphrnr0OOdQ== [david@ipa-01 ~]$ delv -4 +rtrace +dnssec evilduckie.com ;; fetch: evilduckie.com/A ;; fetch: com/DS ;; fetch: ./DNSKEY ;; fetch: evilduckie.com/DS ;; fetch: com/DNSKEY ; unsigned answer evilduckie.com. 78446 IN A 142.165.61.118 I think this may be also tied into FreeIPA not running properly... on the same machines, I see errors when trying to use kinit: [david@ipa-01 ~]$ kinit -V admin Using existing cache: 1000 Using principal: admin kinit: Generic error (see e-text) while getting initial credentials I started down this rabbit hole after trying to use ipa-client-install on a third machine, and was getting DNS host not found errors.
David, you might want to install freeipa-healthcheck on your IPA cluster members and launch it. That will give you a better view of what's broken.
François, ipa-healthcheck runs about as well as ipactrl, it hangs when run. I'm going to reboot the IPA cluster again. Once the cluster is up, I'll rerun ipa-healthcheck on both servers and post the results.
Output of ipa-healthcheck, ipactl status and dig against the local zone (evilduckie.lan) after a reboot of the cluster: [david@ipa-01 ~]$ date Mon 27 Sep 2021 11:54:10 AM CST [david@ipa-01 ~]$ time sudo ipactl status Directory Service: RUNNING krb5kdc Service: RUNNING kadmin Service: RUNNING named Service: RUNNING httpd Service: RUNNING ipa-custodia Service: RUNNING pki-tomcatd Service: RUNNING ipa-otpd Service: RUNNING ipa-dnskeysyncd Service: RUNNING ipa: INFO: The ipactl command was successful real 0m1.048s user 0m0.942s sys 0m0.052s [david@ipa-01 ~]$ dig @ipa-01 ipa-01.evilduckie.lan a ; <<>> DiG 9.16.21-RH <<>> @ipa-01 ipa-01.evilduckie.lan a ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51983 ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 1232 ; COOKIE: 0b76330806706d5a010000006152057eab84731e657ff490 (good) ;; QUESTION SECTION: ;ipa-01.evilduckie.lan. IN A ;; ANSWER SECTION: ipa-01.evilduckie.lan. 1200 IN A 10.0.0.2 ;; Query time: 0 msec ;; SERVER: 10.0.0.2#53(10.0.0.2) ;; WHEN: Mon Sep 27 11:55:10 CST 2021 ;; MSG SIZE rcvd: 94 [david@ipa-01 ~]$ dig @ipa-01 ipa-02.evilduckie.lan a +short 10.0.0.3 [david@ipa-01 ~]$ dig @ipa-02 ipa-01.evilduckie.lan a +short 10.0.0.2 [david@ipa-02 ~]$ time sudo ipactl status Directory Service: RUNNING krb5kdc Service: RUNNING kadmin Service: RUNNING named Service: RUNNING httpd Service: RUNNING ipa-custodia Service: RUNNING pki-tomcatd Service: RUNNING ipa-otpd Service: RUNNING ipa-dnskeysyncd Service: RUNNING ipa: INFO: The ipactl command was successful real 0m1.500s user 0m1.356s sys 0m0.070s [david@ipa-02 ~]$ dig @ipa-02 ipa-02.evilduckie.lan a ; <<>> DiG 9.16.21-RH <<>> @ipa-02 ipa-02.evilduckie.lan a ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33791 ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 1232 ; COOKIE: 24bc3a05b5a7242e01000000615205c628f417263b47fec7 (good) ;; QUESTION SECTION: ;ipa-02.evilduckie.lan. IN A ;; ANSWER SECTION: ipa-02.evilduckie.lan. 1200 IN A 10.0.0.3 ;; Query time: 0 msec ;; SERVER: 10.0.0.3#53(10.0.0.3) ;; WHEN: Mon Sep 27 11:56:22 CST 2021 ;; MSG SIZE rcvd: 94 [david@ipa-02 ~]$ dig @ipa-02 ipa-01.evilduckie.lan a +short 10.0.0.2 [david@ipa-02 ~]$ dig @ipa-01 ipa-02.evilduckie.lan a +short 10.0.0.3 Everything has been up for ~1.5 hours, and is still running properly.
Created attachment 1826754 [details] ipa-healthcheck log files
The ERROR in each log is a known issue. The warnings shouldn't be causing any issues.
The problem has returned, after 3 days of uptime. I am unable to resolve internal domains, external domains seem to resolve fine. [david@ipa-01 ~]$ uptime 15:42:48 up 3 days, 4:21, 1 user, load average: 0.13, 0.09, 0.09 ipactl status is now hanging, as is ipa-healthcheck: [david@ipa-01 ~]$ time sudo ipactl status ^CCancelled. real 2m9.740s user 0m0.897s sys 0m0.035s [david@ipa-01 ~]$ time sudo ipa-healthcheck ^CTraceback (most recent call last): File "/usr/lib64/python3.10/site-packages/ldap/ldapobject.py", line 324, in _ldap_call result = func(*args,**kwargs) ldap.SERVER_DOWN: {'result': -1, 'desc': "Can't contact LDAP server", 'errno': 4, 'ctrls': []} During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/bin/ipa-healthcheck", line 33, in <module> sys.exit(load_entry_point('ipahealthcheck==0.9', 'console_scripts', 'ipa-healthcheck')()) File "/usr/lib/python3.10/site-packages/ipahealthcheck/core/main.py", line 46, in main sys.exit(ipachecks.run_healthcheck()) File "/usr/lib/python3.10/site-packages/ipahealthcheck/core/core.py", line 312, in run_healthcheck registry.initialize(framework, config, options) File "/usr/lib/python3.10/site-packages/ipahealthcheck/dogtag/plugin.py", line 30, in initialize api.Backend.ldap2.connect() File "/usr/lib/python3.10/site-packages/ipalib/backend.py", line 69, in connect conn = self.create_connection(*args, **kw) File "/usr/lib/python3.10/site-packages/ipaserver/plugins/ldap2.py", line 186, in create_connection client.external_bind(server_controls=serverctrls, File "/usr/lib/python3.10/site-packages/ipapython/ipaldap.py", line 1245, in external_bind self.conn.sasl_interactive_bind_s( File "/usr/lib64/python3.10/site-packages/ldap/ldapobject.py", line 476, in sasl_interactive_bind_s return self._ldap_call(self._l.sasl_interactive_bind_s,who,auth,RequestControlTuples(serverctrls),RequestControlTuples(clientctrls),sasl_flags) File "/usr/lib64/python3.10/site-packages/ldap/ldapobject.py", line 324, in _ldap_call result = func(*args,**kwargs) KeyboardInterrupt real 4m59.096s user 0m0.949s sys 0m0.037s Running ipactl status on ipa-02 still responds properly: [david@ipa-02 ~]$ time sudo ipactl status Directory Service: RUNNING krb5kdc Service: RUNNING kadmin Service: RUNNING named Service: RUNNING httpd Service: RUNNING ipa-custodia Service: RUNNING pki-tomcatd Service: RUNNING ipa-otpd Service: RUNNING ipa-dnskeysyncd Service: RUNNING ipa: INFO: The ipactl command was successful real 0m1.530s user 0m1.342s sys 0m0.084s However ipa-healthcheck is hanging... using ctrl-c to stop takes a few moments/keypresses.. [david@ipa-02 ~]$ time sudo ipa-healthcheck Internal server error 404 Client Error: Not Found for url: https://ipa-02.evilduckie.lan:443/ca/rest/certs/search?size=3 ^CInternal error testing KRA clone. 
^C^C^CTraceback (most recent call last): File "/usr/lib64/python3.10/site-packages/dbus/connection.py", line 606, in msg_reply_handler reply_handler(*message.get_args_list(**get_args_opts)) File "/usr/lib64/python3.10/site-packages/dbus/proxies.py", line 396, in _introspect_reply_handler self._introspect_method_map = process_introspection_data(data) File "/usr/lib64/python3.10/site-packages/dbus/_expat_introspect_parser.py", line 85, in process_introspection_data return _Parser().parse(data) File "/usr/lib64/python3.10/site-packages/dbus/_expat_introspect_parser.py", line 44, in parse parser.Parse(data) File "/builddir/build/BUILD/Python-3.10.0rc2/Modules/pyexpat.c", line 416, in StartElement File "/usr/lib64/python3.10/site-packages/dbus/_expat_introspect_parser.py", line 47, in StartElementHandler def StartElementHandler(self, name, attributes): KeyboardInterrupt Invalid IP address fe80::1e69:7aff:fea3:352a for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::1e69:7aff:fea3:352a Invalid IP address fe80::503e:77ff:feed:6425 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::503e:77ff:feed:6425 Invalid IP address fe80::1088:4ff:fefb:8e81 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::1088:4ff:fefb:8e81 Invalid IP address fe80::a0be:28ff:fef5:7841 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::a0be:28ff:fef5:7841 Invalid IP address fe80::502e:dcff:fe3d:704d for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::502e:dcff:fe3d:704d Invalid IP address fe80::80f3:41ff:fe64:2716 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::80f3:41ff:fe64:2716 Invalid IP address fe80::e46c:2cff:fe4a:906c for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::e46c:2cff:fe4a:906c Invalid IP address fe80::98e2:e6ff:fedc:dc9c for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::98e2:e6ff:fedc:dc9c Invalid IP address fe80::c8f6:9aff:fe0f:948 for ipa-02.evilduckie.lan.: cannot use link-local IP address fe80::c8f6:9aff:fe0f:948 [ { "source": "pki.server.healthcheck.clones.connectivity_and_data", "check": "ClonesConnectivyAndDataCheck", "result": "ERROR", "uuid": "c926cd3a-22f9-4c6f-a1a2-b239d226798d", "when": "20210930215717Z", "duration": "0.745401", "kw": { "status": "ERROR: pki-tomcat : Internal error testing CA clone. Host: ipa-02.evilduckie.lan Port: 443" } }, { "source": "pki.server.healthcheck.clones.connectivity_and_data", "check": "ClonesConnectivyAndDataCheck", "result": "ERROR", "uuid": "7c31dbaf-8d35-4c78-8a02-5a3568270534", "when": "20210930220017Z", "duration": "180.863800", "kw": { "status": "ERROR: pki-tomcat : Internal error testing KRA clone. 
Host: ipa-02.evilduckie.lan Port: 443" } }, { "source": "ipahealthcheck.dogtag.ca", "check": "DogtagCertsConnectivityCheck", "result": "ERROR", "uuid": "c7e61086-bf06-49fb-9711-35640deb6155", "when": "20210930220018Z", "duration": "0.131784", "kw": { "msg": "Request for certificate failed, cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/1': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)" } }, { "source": "ipahealthcheck.ds.replication", "check": "ReplicationCheck", "result": "CRITICAL", "uuid": "1313f02f-53d0-4270-ade4-66d889f03b8e", "when": "20210930220027Z", "duration": "29.537328", "kw": { "key": "DSREPLLE0001", "items": [ "Replication", "Agreement" ], "msg": "The replication agreement (metoipa-01.evilduckie.lan) under \"dc=evilduckie,dc=lan\" is not in synchronization." } }, { "source": "ipahealthcheck.ds.replication", "check": "ReplicationCheck", "result": "CRITICAL", "uuid": "24064baf-232f-49e4-9278-88678e772721", "when": "20210930220048Z", "duration": "29.537368", "kw": { "key": "DSREPLLE0001", "items": [ "Replication", "Agreement" ], "msg": "The replication agreement (catoipa-01.evilduckie.lan) under \"o=ipaca\" is not in synchronization." } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "4f2d503a-7393-416b-bd35-c58e86d4a9b5", "when": "20210930220051Z", "duration": "0.338079", "kw": { "key": "20210815190458", "serial": 7, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/7': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "54931860-5da9-44b5-9a00-eee7f0f3e999", "when": "20210930220051Z", "duration": "0.399151", "kw": { "key": "20210815190459", "serial": 5, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/5': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "e3e68aaa-1d02-4328-bbd7-73c97e98ada4", "when": "20210930220051Z", "duration": "0.462746", "kw": { "key": "20210815190500", "serial": 2, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/2': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "294c3790-aacf-4f5a-8a15-d222cb383f54", "when": "20210930220051Z", "duration": "0.525786", "kw": { "key": "20210815190501", "serial": 4, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/4': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "3b0b599f-bdfd-4096-8d45-8edde6d03dcd", "when": "20210930220051Z", "duration": "0.588363", "kw": { "key": "20210815190502", "serial": 1, "error": "cannot connect to 
'https://ipa-02.evilduckie.lan:443/ca/rest/certs/1': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "3ba597c7-56a1-4a3d-a4c4-2d6dc0832da6", "when": "20210930220051Z", "duration": "0.650719", "kw": { "key": "20210815190503", "serial": 17, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/17': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "24758298-db1e-4ed2-a216-3291101ed5d3", "when": "20210930220052Z", "duration": "0.713848", "kw": { "key": "20210815190729", "serial": 13, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/13': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "0a2fbf04-f242-4ee0-92c4-731165f0a633", "when": "20210930220052Z", "duration": "0.777655", "kw": { "key": "20210815190730", "serial": 11, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/11': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "e4e72806-e1b3-4e75-b0ee-373c03a1b665", "when": "20210930220052Z", "duration": "0.840747", "kw": { "key": "20210815190731", "serial": 12, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/12': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "36508ef3-d756-420e-9bce-a42e9c1ae853", "when": "20210930220052Z", "duration": "0.873792", "kw": { "key": "20210815190322", "serial": 16, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/16': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "fb2acec3-4fa1-42de-a18d-22c20628e3b8", "when": "20210930220052Z", "duration": "0.935477", "kw": { "key": "20210815190310", "serial": 15, "error": "cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/15': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.certs", "check": "IPACertRevocation", "result": "ERROR", "uuid": "28071b3b-202e-4f2e-98ff-4ee1a11b2507", "when": "20210930220052Z", "duration": "0.969785", "kw": { "key": "20210815190524", "serial": 18, "error": 
"cannot connect to 'https://ipa-02.evilduckie.lan:443/ca/rest/certs/18': [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)", "msg": "Request for certificate serial number {serial} in request {key} failed: {error}" } }, { "source": "ipahealthcheck.ipa.dna", "check": "IPADNARangeCheck", "result": "WARNING", "uuid": "ba67229c-59e2-4d04-a30b-3e6d0d8ff50d", "when": "20210930220052Z", "duration": "0.137597", "kw": { "range_start": 0, "range_max": 0, "next_start": 0, "next_max": 0, "msg": "No DNA range defined. If no masters define a range then users and groups cannot be created." } }, { "source": "ipahealthcheck.ipa.idns", "check": "IPADNSSystemRecordsCheck", "result": "WARNING", "uuid": "929698e2-4e0b-4e52-b3e3-2cb0e0d4e297", "when": "20210930220052Z", "duration": "0.036222", "kw": { "msg": "Got {count} ipa-ca A records, expected {expected}", "count": 3, "expected": 2 } }, { "source": "ipahealthcheck.ipa.files", "check": "IPAFileCheck", "result": "WARNING", "uuid": "57296159-3ba1-43a9-afbe-69e4ce0cda81", "when": "20210930220052Z", "duration": "0.001286", "kw": { "key": "_var_log_kadmind.log_mode", "path": "/var/log/kadmind.log", "type": "mode", "expected": "0600", "got": "0640", "msg": "Permissions of /var/log/kadmind.log are too permissive: 0640 and should be 0600" } } ] real 3m39.286s user 0m6.412s sys 0m0.592s
The tools "hanging" is a result of the LDAP server being unresponsive. I'd recommend installing the 389-ds-base debuginfo and get a stacktrace from the ns-slapd process.
Hi Rob, that's what I had noticed. I had also tried various LDAP queries against 389-DS and had no response. It's been a long time since I have done a stack trace. As you suggested, I installed the debuginfo and symbols for 389-ds-base as well as gdb. gstack returned nothing:

[david@ipa-01 ~]$ pgrep ns-slapd
1320
[david@ipa-01 ~]$ gstack `pgrep ns-slapd`

Not sure where to go from here. I'm more than willing to follow instructions. :-)
From https://directory.fedoraproject.org/docs/389ds/FAQ/faq.html this is the preferred invocation:

gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1
Created attachment 1828068 [details] Stack trace files from ipa-01 The first file has the debug symbols from 389-ds-base only. The second file has all of the missing debug symbols installed.
Created attachment 1828069 [details] stack trace files from ipa-02 The first file has the debug symbols from 389-ds-base only. The second file has all of the missing debug symbols installed.
Are you using the compat tree in IPA? If not, maybe disabling it could help? I see some lock-ups between changelog processing and slapi-nis (an entry was deleted, and that triggered a reaction in the compat tree to remove a cached entry) on ipa-01. ipa-02 waits for SASL GSSAPI to complete when talking to ipa-01. If ipa-01 is stuck in lock-ups, it cannot respond back over the replication plugin, and that kind of explains why it 'hangs'. Thierry, could you please look into the logs from comments 74 and 75?
The stack traces:

ipa-01-stacktrace is showing a deadlock similar to https://bugzilla.redhat.com/show_bug.cgi?id=1751295#c3
stacktrace.1633097467.txt.1 - Deadlock between Threads 40/28
stacktrace.1633098023.txt.1 - Deadlock between Threads 40/28

ipa-02-stacktrace is showing an RA waiting (poll) for an answer (sasl_bind) from the consumer. This long delay may block other RAs from binding.
stacktrace.1633098323.txt.1 - Possible long bind of RA (consumer not answering), Threads 10/11
stacktrace.1633098705.txt.1 - the same
Alexander, I'm not sure if I'm using compat mode. How can I confirm whether or not I am using compat? Let me know what else you need me to do or test.
You can disable it with the ipa-compat-manage tool; check the man page first. If you don't have non-Linux clients, very old RHEL (like RHEL 5), or a need to resolve users from a trusted Active Directory domain, chances are you're not using the compat tree plugin.
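For the "am I even using it" question, a sketch of one way to check and then disable the Schema Compatibility (slapi-nis) plugin; the plugin DN and Directory Manager bind shown here are the standard 389-ds locations, but verify them against your own deployment first:

  # See whether the compat plugin is enabled in 389-ds
  ldapsearch -H ldap://localhost -D "cn=Directory Manager" -W \
      -b "cn=Schema Compatibility,cn=plugins,cn=config" nsslapd-pluginEnabled

  # If nothing depends on it, disable it and restart IPA so the change takes effect
  sudo ipa-compat-manage disable
  sudo ipactl restart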
If David's bug doesn't have anything to do with mine, can we maybe move the debugging of it elsewhere? At this point more than half of this bug report is about it.
Adam, I have the feeling that our bugs are related. However, I'm more than happy to move it elsewhere if that makes sense. Alexander, I was unable to run the tool on ipa-01; it just hung. I was able to run it on ipa-02. I can reboot ipa-01, which will (temporarily) restore functionality, and then run 'sudo ipa-compat-manage disable'.
Yes, you need to restart ipa-01. David, let's have a separate bug against slapi-nis in Fedora and add the same logs there.
New bug created: https://bugzilla.redhat.com/show_bug.cgi?id=2011086
Based on comment #25, it seems to me there might be a common deficiency in our internal RH infrastructure. Some recursive servers provided by the network strip RRSIG records from their replies. If only some of the servers forward these records correctly, but others do not, you get random failures, which are kind of hard to track down. The solution is to remove all forwarders which do not have dnssec support enabled. Broken servers can be filtered out with this (it prints only the forwarders that do return RRSIGs):

for NS in $(awk '"nameserver" == $1 { print $2 }' /etc/resolv.conf); do dig +dnssec @$NS SOA . | grep -q RRSIG && echo $NS; done

Of course, systemd-resolved with DNSSEC validation disabled is known to be broken. If it gets involved, it would break DNSSEC validation for sure. Maybe freeipa-setup should check system forwarders for DNSSEC support and refuse to continue if not all of them serve signatures. It might offer disabling DNSSEC validation as a workaround.
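The same check, spelled out with comments (assumes dig and awk are available and that /etc/resolv.conf lists the forwarders actually in use):

  #!/bin/sh
  # Report which resolvers return DNSSEC signatures and which do not.
  # Any resolver flagged here strips RRSIGs and will break validation
  # for anything that forwards to it.
  for NS in $(awk '"nameserver" == $1 { print $2 }' /etc/resolv.conf); do
      if dig +dnssec @"$NS" SOA . | grep -q RRSIG; then
          echo "$NS returns RRSIGs (DNSSEC ok)"
      else
          echo "$NS does NOT return RRSIGs" >&2
      fi
  done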
I think this particular situation is unique to Red Hat datacenters. I already suggested disabling DNSSEC validation in the tests, and we provide an option for that. I am not sure it makes sense to probe extensively for these, as we might be dealing with an isolated installation that is just being bootstrapped before being moved into a proper location, for example. Admins have to handle that part themselves.
For the record, disabling dnssec is what I wound up doing. More precisely, we disable dnssec if it's an upgrade test to a version higher than 34. I guess once F34 goes EOL we can try turning it back on again, if it'll work OK for upgrades from F35 to later releases...
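For anyone wanting the same workaround outside openQA: at deployment time the switch is the '--no-dnssec-validation' option mentioned above for ipa-server-install/ipa-replica-install/ipa-dns-install. A sketch, with the other deployment options elided; check --help on your version:

  # Deploy the FreeIPA DNS role with DNSSEC validation of forwarded queries turned off
  ipa-server-install --setup-dns --no-dnssec-validation ...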
So, I just tried turning this back on again (enabling dnssec on FreeIPA upgrade tests). Unfortunately, it still seems to be buggy. If we deploy the server pre-upgrade with dnssec enabled, about 60% of the time, the test fails with the client unable to resolve names after the server has upgraded. So I've gone back to having it disabled for the upgrade test :|
This message is a reminder that Fedora Linux 35 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '35'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 35 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
per my comment above, this still seems to be valid (if perhaps specific to some quirk of RH infrastructure; hopefully not many other folks have the same quirk, whatever it is).
This message is a reminder that Fedora Linux 37 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '37'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 37 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Will test again soon to see if this is still an issue.
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05. Fedora Linux 37 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.
I would like this bug fixed properly. 'dnssec-validation no;' should be enough, if DNSSEC is responsible for the breakage. The problem is still present, right?
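For reference, that directive belongs in named's options on the server. Recent FreeIPA versions ship an include file meant for local overrides; the exact path can differ by version, so treat this as a sketch:

  # /etc/named/ipa-options-ext.conf (included from the options block of named.conf)
  dnssec-validation no;

  # then restart the DNS service so it takes effect
  systemctl restart named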
sorry, I didn't get to re-testing yet (and a catastrophic browser incident lost all my tabs, which I use as my todo list, so I forgot I was supposed to be doing it. :>) I would *guess* it is. I'll try and confirm soon.
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05. Fedora Linux 37 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.