Description of problem:
During a fresh installation on a bare-metal platform, the monitoring cluster operator fails and becomes degraded. Further troubleshooting shows that the alertmanagers are not in a ready state (5/6).

Logs from the alertmanager:
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:226 build_context="(go=go1.17.5, user=root@df86d88450ef, date=20220409-10:25:31)"

The alertmanager-main pods fail to start because the startup probe times out; this looks related to BZ 2037073. We tried to manually increase the timers in the startup probe, but it was not possible.

Version-Release number of selected component (if applicable):
OCP 4.10.10

How reproducible:
OCP IPI bare-metal install on HPE ProLiant BL460c Gen10. The customer tried several times to redeploy, always with the same outcome.

Actual results:
CMO is not being deployed.

Expected results:
CMO deploys without errors.

Additional info:
- The customer is deploying OCP 4.10 IPI on a bare-metal disconnected cluster
- The cluster has 3 nodes with schedulable masters
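For completeness, the not-ready state and the startup-probe failures show up on the pods themselves; something along these lines was used to check (standard oc commands, pod names as in a default deployment):

```
# Pod status and startup-probe failure events for the alertmanager pods
oc -n openshift-monitoring get pods | grep alertmanager
oc -n openshift-monitoring describe pod alertmanager-main-0
```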
The logs of the CoreDNS pod on the node running alertmanager-main-0 show lots of resolution timeouts:

2022-05-03T08:40:57.189917578Z [INFO] 10.130.0.40:52110 - 33955 "AAAA IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003229945s
2022-05-03T08:40:57.189917578Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. AAAA: read udp 10.130.0.15:50909->10.59.65.234:53: i/o timeout
2022-05-03T08:40:57.189983488Z [INFO] 10.130.0.40:39136 - 20044 "A IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003144467s
2022-05-03T08:40:57.189983488Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. A: read udp 10.130.0.15:59046->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189591491Z [INFO] 10.130.0.40:40007 - 5672 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.00194194s
2022-05-03T08:41:02.189591491Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:50118->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189696944Z [INFO] 10.130.0.40:51633 - 58405 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.002144225s
2022-05-03T08:41:02.189696944Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:47068->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:51623 - 50084 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003063336s
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:44552 - 25129 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003075525s
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:56363->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:36628->10.59.65.234:53: i/o timeout

Would it be possible to increase the log level [1] to see if it provides more information?

[1] https://docs.openshift.com/container-platform/4.10/networking/dns-operator.html#nw-dns-loglevel_dns-operator
It would also be worth having a copy of /etc/resolv.conf from the alertmanager-main-0 pod.
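Something along these lines should collect both pieces of information (the log-level knob is the one documented in [1]; the container name is assumed to be "alertmanager"):

```
# Raise the CoreDNS log level (see [1])
oc patch dnses.operator.openshift.io/default --type=merge \
  --patch '{"spec":{"logLevel":"Debug"}}'

# resolv.conf as seen by the alertmanager container
oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- cat /etc/resolv.conf
```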
It probably makes sense to increase the failureThreshold to 10 so the startup probe would allow for up to 120s.
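For reference, the current probe settings can be inspected directly on the pod; the total startup window is roughly failureThreshold x periodSeconds (container name assumed to be "alertmanager"):

```
# Dump the startup probe of the alertmanager container
oc -n openshift-monitoring get pod alertmanager-main-0 \
  -o jsonpath='{.spec.containers[?(@.name=="alertmanager")].startupProbe}'
```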
I still suspect that there's something not quite right with the DNS configuration of this cluster.
Transferring to the DNS component for investigation. Could you help us understand what happens at the DNS level? It seems that the Alertmanager pods fail to resolve alertmanager-main-0.alertmanager-operated/alertmanager-main-1.alertmanager-operated from time to time, but it's not clear to me what the reason is (bug or misconfiguration?).
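In the meantime, it would help to see the cluster DNS configuration, so we can tell which upstream resolvers CoreDNS forwards to (e.g. 10.59.65.234 from the logs above). Something like:

```
# DNS operator configuration (forwarding, log level, ...)
oc get dns.operator/default -o yaml

# Rendered Corefile used by the CoreDNS pods
oc -n openshift-dns get configmap/dns-default -o yaml
```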
Hi All, it's been quite a while since this was moved to the DNS component; any chance we can get someone to take a look?
Hi All, Checking in again. Could we get a status update for this ticket so we can report back to the customer please? Thanks in advance :]
> I see that when alertmanager starts it probes many DNS names,

This is the expected behaviour in k8s/OpenShift if a pod tries to reach a service by only the service name. The Alertmanager deployment, as you can see from its YAML manifest, reaches out to alertmanager-main-0 and alertmanager-main-1 to form a quorum.

> The only relevant query should be "alertmanager-main-1.alertmanager-operated.openshift-monitoring.svc.cluster.local." which is usually answered with NOERROR when the alertmanager is already running, all the others are looking like tentatives to find the correct one.

My **wild** guess (it can only be confirmed with a packet capture, since timing is important) is that the initial query for `alertmanager-main-0/1` may be sent to all resolvers (including the upstream ones), and the first one to answer with a SERVFAIL may be the upstream resolver: the firewall (sandboxed env) could be terminating the connection, resulting in SERVFAIL. This may then be cached in CoreDNS [1] because of

```
cache {
    denial XXXX 30
}
```

Does the situation improve if you don't cache, or if you reduce the TTL? Are you able to packet capture the alertmanager pods?

[1]: https://github.com/coredns/coredns/issues/5074
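If a capture is possible, a rough sketch along these lines should be enough (the node name is derived from the pod; this assumes tcpdump is available on the RHCOS host, otherwise go through toolbox):

```
# Find the node hosting alertmanager-main-0 and capture DNS traffic (port 53)
# on it for a couple of minutes; the pcap ends up in /tmp on the host
NODE=$(oc -n openshift-monitoring get pod alertmanager-main-0 -o jsonpath='{.spec.nodeName}')
oc debug node/"$NODE" -- chroot /host timeout 120 tcpdump -nn -i any port 53 -w /tmp/dns.pcap
```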
We have no HPE ProLiant BL460c Gen10; tested with an ARM bare-metal disconnected cluster (versioned-installer-arm_bm-disconnected-ci).
Install payload: 4.12.0-0.nightly-arm64-2022-07-24-180825
Bound a PV to the alertmanager pod.
No regression found.
I eventually had the chance to check with the customer in a remote session: the upstream DNS (based on dnsmasq) was looping on itself due to a wrong configuration. By default, dnsmasq relies on /etc/resolv.conf to forward queries for hostnames it does not know, and /etc/resolv.conf was pointing back to dnsmasq itself, so that the bastion host could resolve cluster-related queries.

We put the "no-resolv" option in the dnsmasq configuration and the alertmanager pods were eventually able to come up.

I still believe that we need to be more tolerant towards unresponsive DNS servers, especially if the queries are supposed to get a negative answer.

Thanks for your patience and collaboration.

---
Francesco Cristini
Software Maintenance Engineer
OpenShift ESS Support
Red Hat EMEA
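For anyone hitting the same symptom: the dnsmasq side of the fix boils down to something like the snippet below (the upstream IP is a placeholder, not the customer's actual configuration; the explicit server= line is only needed if the bastion still has to resolve external names).

```
# Stop dnsmasq from forwarding via /etc/resolv.conf (which pointed back at
# dnsmasq itself) and, if needed, point it at an explicit upstream instead
cat >> /etc/dnsmasq.conf <<'EOF'
no-resolv
server=203.0.113.53
EOF
systemctl restart dnsmasq
```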
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399