Description of problem:

After patching OpenShift Container Platform from 4.7.21 to 4.7.36, the cluster-baremetal-operator pod is in a crash loop. All other components were patched successfully and are working fine.

Version-Release number of selected component (if applicable): unknown

Steps to Reproduce:
1. Upgrade the cluster from 4.7.21 to 4.7.36

Actual results:
cluster-baremetal-operator is in a crash loop.

Expected results:
cluster-baremetal-operator should be patched successfully.

Additional info:

1. All the cluster operators are working fine.

2. Logs from the cluster-baremetal-operator pod:
~~~
2021-11-15T16:51:08.778392619Z I1115 16:51:08.778349 1 listener.go:44] controller-runtime/metrics "msg"="metrics server is starting to listen" "addr"=":8080"
2021-11-15T16:51:08.824513290Z E1115 16:51:08.824456 1 main.go:115] "unable to create controller" err="could not lookupIP for internal APIServer: api-int.rtl.svs.smgmt-aws.uk: lookup api-int.rtl.svs.smgmt-aws.uk on 172.30.0.10:53: no such host" controller="Provisioning"
~~~

3. Logs inside the DNS pod:
~~~
2021-11-09T17:27:06.489952120Z smgmt-aws.uk.:5353
2021-11-09T17:27:06.489952120Z svs.smgmt-aws.uk.:5353
2021-11-09T17:27:06.489952120Z .:5353
2021-11-09T17:27:06.489952120Z [INFO] plugin/reload: Running configuration MD5 = 51a0c0101dc401af5eea13c1d130523c
2021-11-09T17:27:06.489952120Z CoreDNS-1.6.6
2021-11-09T17:27:06.489952120Z linux/amd64, go1.15.14,
2021-11-09T17:43:35.842198439Z [INFO] SIGTERM: Shutting down servers then terminating
2021-11-09T17:43:35.842198439Z [INFO] plugin/health: Going into lameduck mode for 20s
2021-11-09T17:45:52.383445500Z smgmt-aws.uk.:5353
2021-11-09T17:45:52.383445500Z svs.smgmt-aws.uk.:5353
2021-11-09T17:45:52.383445500Z .:5353
2021-11-09T17:45:52.386866376Z [INFO] plugin/reload: Running configuration MD5 = 51a0c0101dc401af5eea13c1d130523c
2021-11-09T17:45:52.386921496Z CoreDNS-1.6.6
2021-11-09T17:45:52.386954808Z linux/amd64, go1.15.14,
2021-11-11T07:00:17.472340651Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:57245->10.243.128.198:53: i/o timeout
2021-11-11T07:00:17.472340651Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:54326->10.243.128.198:53: i/o timeout
2021-11-11T07:00:20.466010759Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:51315->10.243.128.198:53: i/o timeout
2021-11-11T07:00:20.466010759Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:59420->10.243.130.164:53: i/o timeout
2021-11-11T07:00:22.466855314Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:54289->10.243.128.198:53: i/o timeout
2021-11-11T07:00:22.466855314Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:42726->10.243.128.198:53: i/o timeout
2021-11-11T07:01:13.454256588Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:33721->10.243.128.198:53: i/o timeout
2021-11-11T07:01:13.454596514Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:40906->10.243.130.164:53: i/o timeout
2021-11-11T07:01:15.454843984Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:33932->10.243.130.164:53: i/o timeout
2021-11-11T07:01:15.455044807Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:49511->10.243.130.164:53: i/o timeout
2021-11-11T07:01:17.456036512Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:50806->10.243.128.198:53: i/o timeout
2021-11-11T07:01:17.456036512Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:58326->10.243.128.198:53: i/o timeout
2021-11-11T07:01:19.456675648Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:44534->10.243.130.164:53: i/o timeout
2021-11-11T07:01:19.456675648Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:50198->10.243.128.198:53: i/o timeout
2021-11-11T07:02:13.454348755Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:50855->10.243.130.164:53: i/o timeout
2021-11-11T07:02:13.454435404Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:36019->10.243.128.198:53: i/o timeout
2021-11-11T07:02:15.455053969Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:43976->10.243.128.198:53: i/o timeout
2021-11-11T07:02:15.455137599Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:37707->10.243.128.198:53: i/o timeout
2021-11-11T07:02:17.455980340Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:38124->10.243.128.198:53: i/o timeout
2021-11-11T07:02:17.455980340Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:48358->10.243.130.164:53: i/o timeout
~~~
Note: the failed queries alternate between two forms of the API server name (with and without the `svs.smgmt-aws.uk` search domain from the pod's resolv.conf appended) and between two upstream server IPs (10.243.128.198 and 10.243.130.164), all of which time out.

4. dig output from the DNS pods:
~~~
sh-4.4# cat /etc/resolv.conf
search svs.smgmt-aws.uk
nameserver 172.16.24.2

sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6238
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk. IN A

;; ANSWER SECTION:
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.38
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.144
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.25.98

;; Query time: 1 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:11 UTC 2021
;; MSG SIZE rcvd: 105

sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 60445
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. IN A

;; AUTHORITY SECTION:
svs.smgmt-aws.uk. 900 IN SOA ns-58.awsdns-07.com. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

;; Query time: 10 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:26 UTC 2021
;; MSG SIZE rcvd: 154

sh-4.4# dig 10.243.130.164

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> 10.243.130.164
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18429
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;10.243.130.164. IN A

;; ANSWER SECTION:
10.243.130.164. 0 IN A 10.243.130.164

;; Query time: 0 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:33 UTC 2021
;; MSG SIZE rcvd: 59

sh-4.4# dig 10.243.128.198

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> 10.243.128.198
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35091
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;10.243.128.198. IN A

;; ANSWER SECTION:
10.243.128.198. 0 IN A 10.243.128.198

;; Query time: 0 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:57 UTC 2021
;; MSG SIZE rcvd: 59
~~~

5. dig output from other pods running on the same node:
~~~
$ oc rsh sdn-nclxw
Defaulting container name to sdn.
Use 'oc describe pod/sdn-nclxw -n openshift-sdn' to see all of the containers in this pod.
sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38033
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk. IN A

;; ANSWER SECTION:
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.144
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.38
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.25.98

;; Query time: 2 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Fri Nov 19 09:58:58 UTC 2021
;; MSG SIZE rcvd: 105

########################################################################################################################################

$ oc project openshift-machine-config-operator
Now using project "openshift-machine-config-operator" on server "https://api.rtl.svs.smgmt-aws.uk:6443".
$ oc rsh machine-config-server-jgmfb
sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34174
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk. IN A

;; ANSWER SECTION:
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.144
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.25.98
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.38

;; Query time: 2 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Fri Nov 19 10:17:53 UTC 2021
;; MSG SIZE rcvd: 105
~~~
CBO tries to look up the API server's IP (https://github.com/openshift/cluster-baremetal-operator/blob/f73e5fcb432e4b847cddec5ce8570f8c5c32e902/controllers/provisioning_controller.go#L428) in order to determine whether it is a v4 or v6 address, but the DNS failures prevent this call from succeeding. Not sure whether the SDN team or the mDNS team needs to take a look; assigning to the SDN team to take a look first.
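For context, the failing check is roughly equivalent to the following Go sketch. This is illustrative only, not the actual CBO code: the URL value is this cluster's api-int endpoint, and the variable names are hypothetical; the v4/v6 classification uses the standard net package idiom.
~~~
package main

import (
	"fmt"
	"net"
	"net/url"
)

func main() {
	// Hypothetical stand-in for the internal API server URL; CBO derives
	// the real value from the cluster's Infrastructure config.
	apiServerInternalURL := "https://api-int.rtl.svs.smgmt-aws.uk:6443"

	u, err := url.Parse(apiServerInternalURL)
	if err != nil {
		panic(err)
	}

	// This is the call that fails in the operator log above: it resolves
	// through the cluster DNS service (172.30.0.10:53) and returns
	// "no such host" when the upstream queries time out.
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		fmt.Printf("could not lookupIP for internal APIServer: %v\n", err)
		return
	}

	// CBO only needs the address family: To4() returning nil means the
	// address is IPv6; otherwise it is IPv4.
	for _, ip := range ips {
		if ip.To4() == nil {
			fmt.Printf("%s -> IPv6\n", ip)
		} else {
			fmt.Printf("%s -> IPv4\n", ip)
		}
	}
}
~~~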
Moving to the DNS team, as this seems like a DNS resolution problem. If you think it's a networking issue, please feel free to send it back.
(In reply to sdasu from comment #1)
> CBO tries to look up the API server's IP
> (https://github.com/openshift/cluster-baremetal-operator/blob/f73e5fcb432e4b847cddec5ce8570f8c5c32e902/controllers/provisioning_controller.go#L428)
> in order to determine whether it is a v4 or v6 address, but the DNS
> failures prevent this call from succeeding.

CBO should disable itself on AWS, so I wonder if we should reorder these checks so that we only look up the IP and do the networkStack check on platforms where CBO is actually enabled? That said, it would be good to clarify why CBO is failing to resolve the api-int endpoint, e.g. whether the issue impacts other pods as well.
To determine the cluster's network stack, CBO queries DNS for the IP address of the api-int server, and this DNS failure causes that call to fail. CBO has been updated to skip the lookup on unsupported platforms, and AWS happens to be one of them (see the sketch below). This still does not address the underlying DNS issue.
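The change amounts to gating the DNS-dependent lookup behind the platform check, along these lines. This is a minimal sketch with hypothetical helper names (`baremetalPlatformEnabled`, `detectNetworkStack`), not the actual patch:
~~~
package main

import (
	"fmt"
	"net"
)

// baremetalPlatformEnabled is a hypothetical stand-in for CBO's platform
// check; provisioning is only supported on a small set of platforms.
func baremetalPlatformEnabled(platform string) bool {
	return platform == "BareMetal" // illustrative; the real list lives in CBO
}

// detectNetworkStack is a hypothetical stand-in for the DNS-dependent check
// that was failing: resolve api-int and classify the address family.
func detectNetworkStack(host string) (string, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return "", fmt.Errorf("could not lookupIP for internal APIServer: %w", err)
	}
	if ips[0].To4() == nil {
		return "IPv6", nil
	}
	return "IPv4", nil
}

func main() {
	platform := "AWS"

	// Platform gate first: on unsupported platforms such as AWS, return
	// before touching DNS at all, so a broken resolver can no longer
	// crash-loop the operator.
	if !baremetalPlatformEnabled(platform) {
		fmt.Printf("provisioning not supported on %s; skipping network stack detection\n", platform)
		return
	}

	// Only reached on supported platforms, where the lookup is needed.
	stack, err := detectNetworkStack("api-int.rtl.svs.smgmt-aws.uk")
	if err != nil {
		panic(err) // previously this path ran even on AWS
	}
	fmt.Println("network stack:", stack)
}
~~~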
The DNS team needs to take a look at the DNS logs to figure out the root cause of the failure. I have taken care of improving the behavior of CBO when this occurs.
Can we get a must-gather from a setup where this failure occurs?
Thanks for the must-gather. Passing it along to the DNS team to take a look at the DNS errors. CBO changes are complete at this time.
Not a blocker, as this doesn't appear to be a regression. Assigning to Chad to investigate.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056