Description of problem:
During a fresh installation on a bare-metal platform, the monitoring cluster operator fails and becomes degraded. Further troubleshooting shows that the alertmanager pods are not in a ready state (5/6).
Logs from the alertmanager:
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:226 build_context="(go=go1.17.5, user=root@df86d88450ef, date=20220409-10:25:31)"
The alertmanager-main pods are failing to start due to a startup probe timeout; it seems related to BZ 2037073.
We tried to manually increase the startup probe timeouts, but this was not possible.
Version-Release number of selected component (if applicable):
OCP IPI bare-metal install on HPE ProLiant BL460c Gen10. The customer tried several times to redeploy, always with the same outcome.
Actual results: CMO is not being deployed
Expected results: CMO deploys without errors
- The customer is deploying OCP 4.10 IPI on a bare-metal disconnected cluster
- The cluster is 3 nodes with schedulable masters
The logs of the CoreDNS pod on the node running alertmanager-main-0 show lots of resolution timeouts:
2022-05-03T08:40:57.189917578Z [INFO] 10.130.0.40:52110 - 33955 "AAAA IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003229945s
2022-05-03T08:40:57.189917578Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. AAAA: read udp 10.130.0.15:50909->10.59.65.234:53: i/o timeout
2022-05-03T08:40:57.189983488Z [INFO] 10.130.0.40:39136 - 20044 "A IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003144467s
2022-05-03T08:40:57.189983488Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. A: read udp 10.130.0.15:59046->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189591491Z [INFO] 10.130.0.40:40007 - 5672 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.00194194s
2022-05-03T08:41:02.189591491Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:50118->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189696944Z [INFO] 10.130.0.40:51633 - 58405 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.002144225s
2022-05-03T08:41:02.189696944Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:47068->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:51623 - 50084 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003063336s
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:44552 - 25129 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003075525s
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:56363->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:36628->10.59.65.234:53: i/o timeout
Would it be possible to increase the log level to see if it provides more information?
It would also be worth having a copy of /etc/resolv.conf from the alertmanager-main-0 pod.
It probably makes sense to increase the failureThreshold to 10 so the startup probe would allow for up to 120s.
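For reference, the suggested change would land in the alertmanager container's startup probe, which looks roughly like the sketch below (the probe command, port, and timing values are illustrative; the actual defaults are managed by CMO, which is also why manual edits tend not to stick):

```yaml
# Sketch of a startupProbe on the alertmanager container (values illustrative).
# The probe tolerates roughly failureThreshold x periodSeconds of startup time
# before the kubelet restarts the container.
startupProbe:
  httpGet:
    path: /-/ready
    port: 9093
  failureThreshold: 10
  periodSeconds: 12
  timeoutSeconds: 3
```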
I still suspect that there's something not quite right with the DNS configuration of this cluster.
Transferring to the DNS component for investigation.
Could you help us understand what happens at the DNS level? It seems that the Alertmanager pods fail to resolve alertmanager-main-0.alertmanager-operated/alertmanager-main-1.alertmanager-operated from time to time but it's not clear to me what is the reason (bug or misconfiguration?).
It's been quite a while since this was transferred to the DNS component; any chance we can get someone to take a look?
Checking in again. Could we get a status update for this ticket so we can report back to the customer please?
Thanks in advance :]
> I see that when alertmanager starts it probes many DNS names,
This is the expected behaviour in Kubernetes/OpenShift when a pod tries to reach a service by only its short service name. As you can see from its YAML manifest, the Alertmanager deployment reaches out to alertmanager-main-0 and alertmanager-main-1 to form a quorum.
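The many query variants come from the pod's resolver search path. A pod's /etc/resolv.conf typically looks like the sketch below (nameserver address and search domains are illustrative); with ndots:5, a short name such as alertmanager-main-0.alertmanager-operated is tried against every search domain before being sent verbatim, and a trailing domain like `local` inherited from the node would explain the `.local.` queries in the CoreDNS logs above:

```text
# Typical /etc/resolv.conf inside a pod (values illustrative):
search openshift-monitoring.svc.cluster.local svc.cluster.local cluster.local local
nameserver 172.30.0.10
options ndots:5
```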
> The only relevant query should be "alertmanager-main-1.alertmanager-operated.openshift-monitoring.svc.cluster.local." which is usually answered with NOERROR when the alertmanager is already running, all the others are looking like tentatives to find the correct one.
My **wild** guess (it can only be confirmed with a packet capture, since timing is
important) is that the initial query for `alertmanager-main-0/1` may be sent to
all resolvers (including upstream), and the first one to answer with a SERVFAIL
may be the upstream resolver: the firewall in this sandboxed environment could be
terminating the connection, resulting in SERVFAIL. This may then be cached in
CoreDNS because of `denial XXXX 30`.
Does the situation improve if you disable caching or reduce the TTL?
Are you able to capture packets on the alertmanager pods?
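For context, negative answers are kept by the CoreDNS cache plugin, configured in the Corefile with a stanza roughly like the one below (capacity and TTL values are illustrative); a spurious negative answer can therefore be served from cache for up to the denial TTL, which would match the `denial XXXX 30` reference above:

```text
# CoreDNS cache plugin syntax (values illustrative):
#   cache [TTL] { denial CAPACITY [TTL] }
cache 900 {
    denial 9984 30
}
```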
We have no HPE ProLiant BL460c Gen10 hardware, so we tested with an ARM bare-metal cluster (versioned-installer-arm_bm-disconnected-ci).
Install payload: 4.12.0-0.nightly-arm64-2022-07-24-180825
Bound a PV to the alertmanager pod.
No regression found.
I eventually had the chance to check with the customer in a remote session: the upstream DNS (based on dnsmasq) was looping on itself due to a wrong configuration.
By default, dnsmasq relies on /etc/resolv.conf to forward queries for unknown hostnames. /etc/resolv.conf was pointing back to dnsmasq itself, so that the bastion host could resolve cluster-related queries.
We added the "no-resolv" option to the dnsmasq configuration, and the alertmanager pods were eventually able to come up.
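The loop and its fix, sketched as configuration (file contents and the upstream address are illustrative, not the customer's actual values):

```text
# /etc/resolv.conf on the bastion pointed back at dnsmasq itself,
# so queries for unknown names were forwarded in a loop:
#   nameserver 127.0.0.1
#
# Fix in /etc/dnsmasq.conf: stop reading /etc/resolv.conf and
# forward explicitly to a real upstream resolver instead.
no-resolv
server=192.0.2.53   # example upstream resolver address
```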
I still believe that we need to be more tolerant of unresponsive DNS servers, especially when the queries are expected to get a negative answer.
Thanks for your patience and collaboration.
Software Maintenance Engineer
OpenShift ESS Support
Red Hat EMEA
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.