Bug 2083226
| Summary: | alertmanager-main pods failing to start due to startupprobe timeout | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Francesco Cristini <fcristin> |
| Component: | Monitoring | Assignee: | Sunil Thaha <sthaha> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | medium | Docs Contact: | Brian Burt <bburt> |
| Priority: | high | | |
| Version: | 4.10 | CC: | anpicker, bbennett, bburt, dacarpen, hongyli, mmasters, rauferna, rfredette, spasquie, sthaha, vjaypurk, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Before this update, Alertmanager pod startup might time out because of slow DNS resolution, and the Alertmanager pods would not start. With this release, the timeout value has been increased to seven minutes, which prevents pod startup from timing out. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2083226[*BZ#2083226*]) | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 2108439 (view as bug list) | Environment: | |
| Last Closed: | 2023-01-17 19:48:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2108439, 2109731 | | |
**Description** (Francesco Cristini, 2022-05-09 13:52:49 UTC)
The logs of the CoreDNS pod on the node running alertmanager-main-0 show lots of resolution timeouts:

```
2022-05-03T08:40:57.189917578Z [INFO] 10.130.0.40:52110 - 33955 "AAAA IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003229945s
2022-05-03T08:40:57.189917578Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. AAAA: read udp 10.130.0.15:50909->10.59.65.234:53: i/o timeout
2022-05-03T08:40:57.189983488Z [INFO] 10.130.0.40:39136 - 20044 "A IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003144467s
2022-05-03T08:40:57.189983488Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. A: read udp 10.130.0.15:59046->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189591491Z [INFO] 10.130.0.40:40007 - 5672 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.00194194s
2022-05-03T08:41:02.189591491Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:50118->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189696944Z [INFO] 10.130.0.40:51633 - 58405 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.002144225s
2022-05-03T08:41:02.189696944Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:47068->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:51623 - 50084 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003063336s
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:44552 - 25129 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003075525s
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:56363->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:36628->10.59.65.234:53: i/o timeout
```

---

Would it be possible to increase the log level [1] to see if it provides more information?

[1] https://docs.openshift.com/container-platform/4.10/networking/dns-operator.html#nw-dns-loglevel_dns-operator

It would also be worth having a copy of /etc/resolv.conf from the alertmanager-main-0 pod.

---

It probably makes sense to increase the failureThreshold to 10 so the startup probe would allow for up to 120s. I still suspect that there is something not quite right with the DNS configuration of this cluster. Transferring to the DNS component for investigation. Could you help us understand what happens at the DNS level? It seems that the Alertmanager pods fail to resolve alertmanager-main-0.alertmanager-operated / alertmanager-main-1.alertmanager-operated from time to time, but it is not clear to me why (bug or misconfiguration?).

---

Hi All, it's been quite a while since the DNS designation; any chance we can get someone to take a look?

---

Hi All, checking in again. Could we get a status update for this ticket so we can report back to the customer, please? Thanks in advance :]

---

> I see that when alertmanager starts it probes many DNS names,

This is the expected behaviour in k8s/OpenShift when a pod tries to reach a service by the service name alone. The Alertmanager deployment, as you can see from its YAML manifest, reaches out to alertmanager-main-0 and alertmanager-main-1 to form a quorum.

> The only relevant query should be "alertmanager-main-1.alertmanager-operated.openshift-monitoring.svc.cluster.local." which is usually answered with NOERROR when the alertmanager is already running, all the others are looking like tentatives to find the correct one.
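The many probe queries follow from the resolver's search-list expansion: a pod's /etc/resolv.conf lists several search domains, and a short name such as `alertmanager-main-0.alertmanager-operated` is tried against each of them before being queried as an absolute name. A minimal sketch of that expansion follows; the search list shown is the usual one for a pod in the `openshift-monitoring` namespace and is an assumption, not output captured from this cluster:

```python
# Sketch of glibc-style search-list expansion as done by a pod's stub resolver.
# SEARCH_DOMAINS is the typical list for a pod in the openshift-monitoring
# namespace (an assumption, not taken from this cluster's resolv.conf).
SEARCH_DOMAINS = [
    "openshift-monitoring.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]
NDOTS = 5  # kubelet's default ndots option in pod resolv.conf

def candidate_names(name: str) -> list[str]:
    """Return the FQDNs the resolver would try, in order."""
    if name.endswith("."):
        # Already fully qualified: queried as-is, no expansion.
        return [name]
    expanded = [f"{name}.{domain}." for domain in SEARCH_DOMAINS]
    absolute = [name + "."]
    # Names with fewer dots than ndots go through the search list first.
    if name.count(".") < NDOTS:
        return expanded + absolute
    return absolute + expanded

for fqdn in candidate_names("alertmanager-main-0.alertmanager-operated"):
    print(fqdn)
```

Every candidate that misses gets an NXDOMAIN from CoreDNS, but any name that falls outside the cluster domain (the absolute `alertmanager-main-0.alertmanager-operated.` lookup, or expansions through a node-level search domain such as the `local` suffix visible in the logs) is forwarded upstream, which is consistent with the i/o timeouts toward 10.59.65.234:53 above.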
My **wild** guess (it can only be confirmed with a packet capture, since timing is important) is that the initial query for `alertmanager-main-0/1` may be sent to all resolvers (including upstream) and the first one to answer with a SERVFAIL may be the upstream resolver: the firewall (sandboxed env) could be terminating the connection, resulting in SERVFAIL. This may then be cached in CoreDNS because of

```
cache {
    denial XXXX 30
}
```

Does the situation improve if you don't cache, or if you reduce the TTL? Are you able to packet capture the alertmanager pods?

[1]: https://github.com/coredns/coredns/issues/5074

---

We have no HPE ProLiant BL460c Gen10; tested with an ARM bare-metal cluster (versioned-installer-arm_bm-disconnected-ci). Installed payload 4.12.0-0.nightly-arm64-2022-07-24-180825, bound a PV to the alertmanager pod. No regression found.

---

I eventually had the chance to check with the customer in a remote session: the upstream DNS (based on dnsmasq) was looping on itself due to a wrong configuration. dnsmasq by default relies on /etc/resolv.conf to forward queries for unknown hostnames, and /etc/resolv.conf was pointing to itself so that the bastion host could resolve cluster-related queries. We put the "no-resolv" option in the dnsmasq configuration and the alertmanager pods were eventually able to come up. I still believe that we need to be more tolerant towards unresponsive DNS servers, especially when the queries are expected to get a negative answer. Thanks for your patience and collaboration.

Francesco Cristini
Software Maintenance Engineer
OpenShift ESS Support
Red Hat EMEA

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
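The resolution loop described above, and the "no-resolv" fix, can be illustrated with a minimal dnsmasq configuration. This is an illustrative sketch, not the customer's actual file; the forwarder and record addresses are placeholders:

```
# /etc/dnsmasq.conf on the bastion host (illustrative sketch)

# Without no-resolv, dnsmasq forwards names it is not authoritative for
# to the servers listed in /etc/resolv.conf. If that file points at the
# bastion itself (so local tools can resolve cluster names), every miss
# loops straight back into dnsmasq until the client query times out.

# Ignore /etc/resolv.conf entirely.
no-resolv

# Forward unknown names to an explicit upstream resolver (placeholder IP).
server=192.0.2.53

# Cluster-specific records stay authoritative here, e.g. (placeholder):
address=/apps.example.ocp.lan/192.0.2.10
```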
https://access.redhat.com/errata/RHSA-2022:7399