Bug 2083226

Summary: alertmanager-main pods failing to start due to startupprobe timeout
Product: OpenShift Container Platform
Reporter: Francesco Cristini <fcristin>
Component: Monitoring
Assignee: Sunil Thaha <sthaha>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Docs Contact: Brian Burt <bburt>
Priority: high
Version: 4.10
CC: anpicker, bbennett, bburt, dacarpen, hongyli, mmasters, rauferna, rfredette, spasquie, sthaha, vjaypurk, wking
Target Milestone: ---
Target Release: 4.12.0
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Before this update, Alertmanager pod startup might time out because of slow DNS resolution, and the Alertmanager pods would not start. With this release, the timeout value has been increased to seven minutes, which prevents pod startup from timing out. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2083226[*BZ#2083226*])
Story Points: ---
Clone Of:
Clones: 2108439
Environment:
Last Closed: 2023-01-17 19:48:48 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2108439, 2109731    

Description Francesco Cristini 2022-05-09 13:52:49 UTC
Description of problem:
During a fresh installation on a BareMetal platform, the monitoring cluster operator fails and becomes degraded. Further troubleshooting shows that the Alertmanager pods are not in a ready state (5/6).

Logs from the alertmanager:

level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:226 build_context="(go=go1.17.5, user=root@df86d88450ef, date=20220409-10:25:31)"

The alertmanager-main pods fail to start because the startup probe times out; this seems related to BZ 2037073.
We tried to manually increase the startup probe timers, but it was not possible.
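
For reference, the effective probe settings can be read from the generated StatefulSet. A hedged example (resource and container names as they usually appear in openshift-monitoring; adjust if yours differ):

```
oc -n openshift-monitoring get statefulset alertmanager-main \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="alertmanager")].startupProbe}'
```

Manual edits to that StatefulSet are expected to be reverted, since it is reconciled by the operators managing the monitoring stack, which would explain why the change did not stick.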


Version-Release number of selected component (if applicable):
OCP 4.10.10

How reproducible:
OCP IPI bare-metal install on HPE ProLiant BL460c Gen10; the customer tried to redeploy several times, always with the same outcome.

Actual results:
The cluster monitoring operator (CMO) does not deploy successfully and reports Degraded

Expected results:
CMO deploys without errors

Additional info:
- The customer is deploying OCP 4.10 IPI on a disconnected bare-metal cluster
- The cluster has 3 nodes with schedulable masters

Comment 1 Simon Pasquier 2022-05-09 14:32:54 UTC
The logs of the CoreDNS pod on the node running alertmanager-main-0 show lots of resolution timeouts:
 
2022-05-03T08:40:57.189917578Z [INFO] 10.130.0.40:52110 - 33955 "AAAA IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003229945s
2022-05-03T08:40:57.189917578Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. AAAA: read udp 10.130.0.15:50909->10.59.65.234:53: i/o timeout
2022-05-03T08:40:57.189983488Z [INFO] 10.130.0.40:39136 - 20044 "A IN alertmanager-main-0.alertmanager-operated.local. udp 76 false 512" - - 0 6.003144467s
2022-05-03T08:40:57.189983488Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated.local. A: read udp 10.130.0.15:59046->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189591491Z [INFO] 10.130.0.40:40007 - 5672 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.00194194s
2022-05-03T08:41:02.189591491Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:50118->10.59.65.234:53: i/o timeout
2022-05-03T08:41:02.189696944Z [INFO] 10.130.0.40:51633 - 58405 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.002144225s
2022-05-03T08:41:02.189696944Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:47068->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:51623 - 50084 "A IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003063336s
2022-05-03T08:41:07.191932216Z [INFO] 10.130.0.40:44552 - 25129 "AAAA IN alertmanager-main-0.alertmanager-operated. udp 70 false 512" - - 0 6.003075525s
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. A: read udp 10.130.0.15:56363->10.59.65.234:53: i/o timeout
2022-05-03T08:41:07.191932216Z [ERROR] plugin/errors: 2 alertmanager-main-0.alertmanager-operated. AAAA: read udp 10.130.0.15:36628->10.59.65.234:53: i/o timeout


Would it be possible to increase the log level [1] to see if it provides more information?

[1] https://docs.openshift.com/container-platform/4.10/networking/dns-operator.html#nw-dns-loglevel_dns-operator
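
For convenience, a hedged example of raising the DNS log level through the DNS operator, following the documentation above (field names as documented for OCP 4.10):

```
oc patch dnses.operator.openshift.io/default --type=merge -p '{"spec":{"logLevel":"Debug"}}'
```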

Comment 2 Simon Pasquier 2022-05-09 14:42:41 UTC
Also, it would be worth having a copy of /etc/resolv.conf from the alertmanager-main-0 pod.
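
For example, something along these lines should do (container name assumed to be "alertmanager"):

```
oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- cat /etc/resolv.conf
```

The search domains and ndots option in that file determine how the short peer names get expanded, which is what produces the various suffixed queries visible in the CoreDNS log above.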

Comment 9 Simon Pasquier 2022-05-17 08:15:24 UTC
It probably makes sense to increase the failureThreshold to 10 so the startup probe would allow for up to 120s.
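
For illustration only, the relevant part of the alertmanager container spec would then look roughly like this (the probe handler and timing values are assumptions, not the exact shipped manifest; failureThreshold is the field being discussed):

```
startupProbe:
  # probe handler omitted; unchanged
  periodSeconds: 10      # assumed
  timeoutSeconds: 3      # assumed
  failureThreshold: 10   # raised so that startup gets roughly two minutes before the kubelet gives up
```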

Comment 11 Simon Pasquier 2022-05-17 12:39:44 UTC
I still suspect that there's something not quite right with the DNS configuration of this cluster.

Comment 13 Simon Pasquier 2022-05-17 15:52:24 UTC
Transferring to the DNS component for investigation.

Could you help us understand what happens at the DNS level? It seems that the Alertmanager pods fail to resolve alertmanager-main-0.alertmanager-operated/alertmanager-main-1.alertmanager-operated from time to time, but it's not clear to me what the reason is (bug or misconfiguration?).

Comment 16 Darren Carpenter 2022-06-21 14:27:19 UTC
Hi All,

It's been quite a while since this was moved to the DNS component; any chance we can get someone to take a look?

Comment 22 Darren Carpenter 2022-07-08 13:43:18 UTC
Hi All,

Checking in again. Could we get a status update for this ticket so we can report back to the customer please?

Thanks in advance :]

Comment 28 Sunil Thaha 2022-07-19 00:29:18 UTC
> I see that when alertmanager starts it probes many DNS names, 

This is the expected behaviour in Kubernetes/OpenShift when a pod tries to reach a
service by its short name only. As you can see from its YAML manifest, the
Alertmanager deployment reaches out to alertmanager-main-0 and alertmanager-main-1
to form a quorum.
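
For illustration, the generated StatefulSet passes the peers to Alertmanager roughly like this (a sketch; the exact arguments in the manifest may differ by version):

```
containers:
- name: alertmanager
  args:
  - --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
  - --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
```

Because each --cluster.peer value is a short pod DNS name rather than an FQDN, the resolver walks the search domains, which is exactly the query pattern visible in the CoreDNS logs above.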


> The only relevant query should be "alertmanager-main-1.alertmanager-operated.openshift-monitoring.svc.cluster.local." which is usually answered with NOERROR when the alertmanager is already running, all the others are looking like tentatives to find the correct one.


My **wild** guess (which can only be confirmed with a packet capture, since timing is
important) is that the initial query for `alertmanager-main-0/1` may be sent to
all resolvers, including the upstream one, and the first to answer with a SERVFAIL
may be the upstream resolver: the firewall (sandboxed environment) could be terminating the
connection, resulting in SERVFAIL. This may be cached in CoreDNS [1] because of:

```
cache {
  denial XXXX 30
}
 
```
Does the situation improve if you don't cache or reduce the TTL?
Are you able to packet capture the alertmanager pods? 


[1]: https://github.com/coredns/coredns/issues/5074
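
To make the two questions above concrete, here is a hypothetical Corefile cache stanza with the negative-cache TTL cut down (capacity and TTL values are illustrative, not taken from this cluster):

```
cache {
  denial 9984 5
}
```

and a packet capture of DNS traffic could be taken from the node running the pod, assuming tcpdump is present on the RHCOS host (replace <node-name> accordingly):

```
oc debug node/<node-name> -- chroot /host tcpdump -nn -i any port 53
```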

Comment 30 hongyan li 2022-07-25 06:27:36 UTC
We have no HPE ProLiant BL460c Gen10, so we tested with an ARM bare-metal cluster (versioned-installer-arm_bm-disconnected-ci)
Installed payload 4.12.0-0.nightly-arm64-2022-07-24-180825
Bound a PV to the Alertmanager pod
No regression found.

Comment 33 Francesco Cristini 2022-08-17 10:15:04 UTC
I eventually had the chance to check with the customer in a remote session; the upstream DNS (based on dnsmasq) was looping back on itself due to a misconfiguration.
By default dnsmasq relies on /etc/resolv.conf to decide where to forward queries for unknown hostnames, and /etc/resolv.conf was pointing back at dnsmasq itself so that the bastion host could resolve cluster-related queries.
We added the "no-resolv" option to the dnsmasq configuration, and the alertmanager pods were eventually able to come up.
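
For anyone hitting the same loop, a hedged sketch of the relevant dnsmasq options (the domain and server addresses are placeholders, not the customer's values):

```
# /etc/dnsmasq.conf (excerpt)
# Stop reading /etc/resolv.conf, which pointed back at dnsmasq itself
no-resolv
# Forward the lab/cluster domain to its authoritative server (placeholder values)
server=/example.lab/192.0.2.10
# Explicit upstream for everything else (placeholder)
server=198.51.100.53
```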

I still believe that we need to be more tolerant towards unresponsive DNS servers, especially if the queries are supposed to get a negative answer.


Thanks for your patience and collaboration.

---
Francesco Cristini
Software Maintenance Engineer
OpenShift ESS Support
Red Hat EMEA

Comment 37 errata-xmlrpc 2023-01-17 19:48:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399