Description of problem:

After patching OpenShift Container Platform from 4.7.21 to 4.7.36, the cluster-baremetal-operator pod is in a crash loop. All other components were patched successfully and are working fine.

Version-Release number of selected component (if applicable): unknown

Steps to Reproduce:
1. Upgrade the cluster from 4.7.21 to 4.7.36

Actual results:
cluster-baremetal-operator is in a crash loop.

Expected results:
cluster-baremetal-operator should be patched successfully.

Additional info:

1. All the cluster operators are working fine.

2. Logs from the cluster-baremetal-operator pod:
~~~
2021-11-15T16:51:08.778392619Z I1115 16:51:08.778349 1 listener.go:44] controller-runtime/metrics "msg"="metrics server is starting to listen" "addr"=":8080"
2021-11-15T16:51:08.824513290Z E1115 16:51:08.824456 1 main.go:115] "unable to create controller" err="could not lookupIP for internal APIServer: api-int.rtl.svs.smgmt-aws.uk: lookup api-int.rtl.svs.smgmt-aws.uk on 172.30.0.10:53: no such host" controller="Provisioning"
~~~

3. Logs inside the DNS pod:
~~~
2021-11-09T17:27:06.489952120Z smgmt-aws.uk.:5353
2021-11-09T17:27:06.489952120Z svs.smgmt-aws.uk.:5353
2021-11-09T17:27:06.489952120Z .:5353
2021-11-09T17:27:06.489952120Z [INFO] plugin/reload: Running configuration MD5 = 51a0c0101dc401af5eea13c1d130523c
2021-11-09T17:27:06.489952120Z CoreDNS-1.6.6
2021-11-09T17:27:06.489952120Z linux/amd64, go1.15.14,
2021-11-09T17:43:35.842198439Z [INFO] SIGTERM: Shutting down servers then terminating
2021-11-09T17:43:35.842198439Z [INFO] plugin/health: Going into lameduck mode for 20s
2021-11-09T17:45:52.383445500Z smgmt-aws.uk.:5353
2021-11-09T17:45:52.383445500Z svs.smgmt-aws.uk.:5353
2021-11-09T17:45:52.383445500Z .:5353
2021-11-09T17:45:52.386866376Z [INFO] plugin/reload: Running configuration MD5 = 51a0c0101dc401af5eea13c1d130523c
2021-11-09T17:45:52.386921496Z CoreDNS-1.6.6
2021-11-09T17:45:52.386954808Z linux/amd64, go1.15.14,
2021-11-11T07:00:17.472340651Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:57245->10.243.128.198:53: i/o timeout
2021-11-11T07:00:17.472340651Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:54326->10.243.128.198:53: i/o timeout
2021-11-11T07:00:20.466010759Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:51315->10.243.128.198:53: i/o timeout
2021-11-11T07:00:20.466010759Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:59420->10.243.130.164:53: i/o timeout
2021-11-11T07:00:22.466855314Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:54289->10.243.128.198:53: i/o timeout
2021-11-11T07:00:22.466855314Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:42726->10.243.128.198:53: i/o timeout
2021-11-11T07:01:13.454256588Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:33721->10.243.128.198:53: i/o timeout
2021-11-11T07:01:13.454596514Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:40906->10.243.130.164:53: i/o timeout
2021-11-11T07:01:15.454843984Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:33932->10.243.130.164:53: i/o timeout
2021-11-11T07:01:15.455044807Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:49511->10.243.130.164:53: i/o timeout
2021-11-11T07:01:17.456036512Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:50806->10.243.128.198:53: i/o timeout
2021-11-11T07:01:17.456036512Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:58326->10.243.128.198:53: i/o timeout
2021-11-11T07:01:19.456675648Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:44534->10.243.130.164:53: i/o timeout
2021-11-11T07:01:19.456675648Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:50198->10.243.128.198:53: i/o timeout
2021-11-11T07:02:13.454348755Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:50855->10.243.130.164:53: i/o timeout
2021-11-11T07:02:13.454435404Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:36019->10.243.128.198:53: i/o timeout
2021-11-11T07:02:15.455053969Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. A: read udp 10.128.2.4:43976->10.243.128.198:53: i/o timeout
2021-11-11T07:02:15.455137599Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:37707->10.243.128.198:53: i/o timeout
2021-11-11T07:02:17.455980340Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. A: read udp 10.128.2.4:38124->10.243.128.198:53: i/o timeout
2021-11-11T07:02:17.455980340Z [ERROR] plugin/errors: 2 api.rtl.svs.smgmt-aws.uk. AAAA: read udp 10.128.2.4:48358->10.243.130.164:53: i/o timeout
~~~
Note: the failed queries alternate between two forms of the API server name (with and without the `svs.smgmt-aws.uk` search domain from the pod's resolv.conf appended) and between two upstream server IPs (10.243.128.198 and 10.243.130.164), all of which time out.

4. dig output from the DNS pods:
~~~
sh-4.4# cat /etc/resolv.conf
search svs.smgmt-aws.uk
nameserver 172.16.24.2

sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6238
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk. IN A

;; ANSWER SECTION:
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.38
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.144
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.25.98

;; Query time: 1 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:11 UTC 2021
;; MSG SIZE rcvd: 105

sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 60445
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk.svs.smgmt-aws.uk. IN A

;; AUTHORITY SECTION:
svs.smgmt-aws.uk. 900 IN SOA ns-58.awsdns-07.com. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

;; Query time: 10 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:26 UTC 2021
;; MSG SIZE rcvd: 154

sh-4.4# dig 10.243.130.164

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> 10.243.130.164
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18429
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;10.243.130.164. IN A

;; ANSWER SECTION:
10.243.130.164. 0 IN A 10.243.130.164

;; Query time: 0 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:33 UTC 2021
;; MSG SIZE rcvd: 59

sh-4.4# dig 10.243.128.198

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> 10.243.128.198
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35091
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;10.243.128.198. IN A

;; ANSWER SECTION:
10.243.128.198. 0 IN A 10.243.128.198

;; Query time: 0 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Wed Nov 17 09:54:57 UTC 2021
;; MSG SIZE rcvd: 59
~~~

5. dig output from other pods running on the same node:
~~~
$ oc rsh sdn-nclxw
Defaulting container name to sdn.
Use 'oc describe pod/sdn-nclxw -n openshift-sdn' to see all of the containers in this pod.
sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38033
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk. IN A

;; ANSWER SECTION:
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.144
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.38
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.25.98

;; Query time: 2 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Fri Nov 19 09:58:58 UTC 2021
;; MSG SIZE rcvd: 105

########################################################################################################################################

$ oc project openshift-machine-config-operator
Now using project "openshift-machine-config-operator" on server "https://api.rtl.svs.smgmt-aws.uk:6443".
$ oc rsh machine-config-server-jgmfb
sh-4.4# dig api-int.rtl.svs.smgmt-aws.uk

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> api-int.rtl.svs.smgmt-aws.uk
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34174
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api-int.rtl.svs.smgmt-aws.uk. IN A

;; ANSWER SECTION:
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.144
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.25.98
api-int.rtl.svs.smgmt-aws.uk. 60 IN A 172.16.24.38

;; Query time: 2 msec
;; SERVER: 172.16.24.2#53(172.16.24.2)
;; WHEN: Fri Nov 19 10:17:53 UTC 2021
;; MSG SIZE rcvd: 105
~~~
CBO tries to look up the API server's IP (https://github.com/openshift/cluster-baremetal-operator/blob/f73e5fcb432e4b847cddec5ce8570f8c5c32e902/controllers/provisioning_controller.go#L428) in order to determine whether it is a v4 or v6 address, but the DNS failures prevent this call from succeeding. Not sure whether the SDN team or the mDNS team needs to take a look; assigning to the SDN team to take a look first.
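For context, the failing check is roughly equivalent to the following Go sketch. This is illustrative only, not the actual CBO code: the URL value is this cluster's api-int endpoint, and the variable names are hypothetical; the v4/v6 classification uses the standard net package idiom.
~~~
package main

import (
	"fmt"
	"net"
	"net/url"
)

func main() {
	// Hypothetical stand-in for the internal API server URL; CBO derives
	// the real value from the cluster's Infrastructure config.
	apiServerInternalURL := "https://api-int.rtl.svs.smgmt-aws.uk:6443"

	u, err := url.Parse(apiServerInternalURL)
	if err != nil {
		panic(err)
	}

	// This is the call that fails in the operator log above: it resolves
	// through the cluster DNS service (172.30.0.10:53) and returns
	// "no such host" when the upstream queries time out.
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		fmt.Printf("could not lookupIP for internal APIServer: %v\n", err)
		return
	}

	// CBO only needs the address family: To4() returning nil means the
	// address is IPv6; otherwise it is IPv4.
	for _, ip := range ips {
		if ip.To4() == nil {
			fmt.Printf("%s -> IPv6\n", ip)
		} else {
			fmt.Printf("%s -> IPv4\n", ip)
		}
	}
}
~~~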
Moving to the DNS team, as this seems like a DNS resolution problem. If you think it's a networking issue, please feel free to send it back.
(In reply to sdasu from comment #1)
> CBO tries to look up the API server's IP
> (https://github.com/openshift/cluster-baremetal-operator/blob/f73e5fcb432e4b847cddec5ce8570f8c5c32e902/controllers/provisioning_controller.go#L428)
> in order to determine whether it is a v4 or v6 address, but the DNS
> failures prevent this call from succeeding.

CBO should disable itself on AWS, so I wonder if we should reorder these checks so that we only look up the IP and do the networkStack check on platforms where CBO is actually enabled? That said, it would be good to clarify why CBO is failing to resolve the api-int endpoint, e.g. whether the issue impacts other pods as well.
To determine the cluster's network stack, CBO queries DNS for the IP address of the api-int server, and this DNS failure causes that call to fail. CBO has been updated to skip the lookup on unsupported platforms, and AWS happens to be one of them (see the sketch below). This still does not address the underlying DNS issue.
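The change amounts to gating the DNS-dependent lookup behind the platform check, along these lines. This is a minimal sketch with hypothetical helper names (`baremetalPlatformEnabled`, `detectNetworkStack`), not the actual patch:
~~~
package main

import (
	"fmt"
	"net"
)

// baremetalPlatformEnabled is a hypothetical stand-in for CBO's platform
// check; provisioning is only supported on a small set of platforms.
func baremetalPlatformEnabled(platform string) bool {
	return platform == "BareMetal" // illustrative; the real list lives in CBO
}

// detectNetworkStack is a hypothetical stand-in for the DNS-dependent check
// that was failing: resolve api-int and classify the address family.
func detectNetworkStack(host string) (string, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return "", fmt.Errorf("could not lookupIP for internal APIServer: %w", err)
	}
	if ips[0].To4() == nil {
		return "IPv6", nil
	}
	return "IPv4", nil
}

func main() {
	platform := "AWS"

	// Platform gate first: on unsupported platforms such as AWS, return
	// before touching DNS at all, so a broken resolver can no longer
	// crash-loop the operator.
	if !baremetalPlatformEnabled(platform) {
		fmt.Printf("provisioning not supported on %s; skipping network stack detection\n", platform)
		return
	}

	// Only reached on supported platforms, where the lookup is needed.
	stack, err := detectNetworkStack("api-int.rtl.svs.smgmt-aws.uk")
	if err != nil {
		panic(err) // previously this path ran even on AWS
	}
	fmt.Println("network stack:", stack)
}
~~~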
The DNS team needs to take a look at the DNS logs to figure out the root cause of the failure. I have taken care of improving the behavior of CBO when this occurs.
Can we get a must-gather from a setup where this failure occurs?
Thanks for the must-gather. Passing it along to the DNS team to take a look at the DNS errors. CBO changes are complete at this time.
Not a blocker, as this doesn't appear to be a regression. Assigning to Chad to investigate.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056