Bug 2062307 - canary timeouts leave ingress operator degraded
Summary: canary timeouts leave ingress operator degraded
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: aos-network-edge-staff
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-09 14:22 UTC by David Dreeggors
Modified: 2022-08-04 22:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:39:31 UTC
Target Upstream Version:
Embargoed:



Description David Dreeggors 2022-03-09 14:22:51 UTC
Description of problem:

The 4.9.22 and 4.9.23 installer never completes because the ingress cluster operator is in a degraded state due to a canary timeout. I can install 4.8.4 and any prior version in this environment with no issues, but upgrading to 4.8.32 hits the same issue and the upgrade fails as well.



Version-Release number of selected component (if applicable):

Observed succeeding on 4.8.4
Observed failing on 4.8.32, 4.9.22, and 4.9.23


How reproducible:

Every install 


Steps to Reproduce:
1. Install or upgrade to 4.8.32 or above


Actual results:

Install eventually fails with all nodes up and running, but some cluster operators are degraded due to ingress canary timeouts.

```
[ddreggors@provisioner ~]$ oc get co|awk '/NAME/||$3~/False/||$4~/True/||$5~/True/'
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.9.23    False       True          False      17h     DeploymentAvailable: 0 replicas available for console deployment...
ingress                                    4.9.23    True        False         True       16h     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
```


Expected results:

Install completes and ingress is healthy with no canary timeouts

Additional info:

I have rsh'ed into the ingress controller pod and run curl tests from there against the URL that is failing. The curl request returns `200` (OK) but seems to take about `12s` while negotiating TLS.
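
For reference, a minimal sketch of how that in-pod test can be reproduced (a sketch only; the pod and container names are taken from the operator logs below, and it assumes curl is available in the image):

```
# The canary check is performed by the ingress-operator pod in openshift-ingress-operator
oc -n openshift-ingress-operator rsh -c ingress-operator ingress-operator-bbffddb96-dxl6q
# Inside the pod, time the same request the canary controller makes:
time curl -k -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/
```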


```
[ddreggors@provisioner ~]$ oc logs ingress-operator-bbffddb96-dxl6q ingress-operator 2>&1|tail -n 5
2022-03-09T14:18:39.632Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
2022-03-09T14:19:02.997Z	ERROR	operator.canary_controller	wait/wait.go:155	error performing canary route check	{"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2022-03-09T14:19:39.633Z	INFO	operator.ingress_controller	controller/controller.go:298	reconciling	{"request": "openshift-ingress-operator/default"}
2022-03-09T14:19:39.965Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
2022-03-09T14:20:13.058Z	ERROR	operator.canary_controller	wait/wait.go:155	error performing canary route check	{"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
```



```
sh-4.4$ time curl -k -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/
*   Trying 192.168.4.121...
* TCP_NODELAY set
* Connected to canary-openshift-ingress-canary.apps.ocp4.teklocal.net (192.168.4.121) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=*.apps.ocp4.teklocal.net
*  start date: Mar  8 21:10:52 2022 GMT
*  expire date: Mar  7 21:10:53 2024 GMT
*  issuer: CN=ingress-operator@1646773672
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/1.1
> Host: canary-openshift-ingress-canary.apps.ocp4.teklocal.net
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/1.1 200 OK
< x-request-port: 8080
< date: Wed, 09 Mar 2022 14:19:26 GMT
< content-length: 22
< content-type: text/plain; charset=utf-8
< set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=d140d4c745044807153297df093c6a57; path=/; HttpOnly; Secure; SameSite=None
< cache-control: private
< 
Healthcheck requested
* Connection #0 to host canary-openshift-ingress-canary.apps.ocp4.teklocal.net left intact

real	0m13.216s
user	0m0.023s
sys	0m0.025s
```

Comment 1 David Dreeggors 2022-03-09 14:28:10 UTC
I have checked the following and all seems ok (see the sketch of example commands after this list):

1. DNS entries for API and ingress are present
2. There are no duplicate IPs 
3. All worker nodes are up and ready
4. All nodes are on the same network/vlan in vsphere
5. Curl and ping tests from pods show no routing/connectivity issues that I can see
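
For reference, a sketch of the kinds of commands behind such checks (illustrative only; the API hostname is assumed from the cluster domain, everything else reuses values shown elsewhere in this bug):

```
dig +short api.ocp4.teklocal.net                                     # API DNS record (assumed hostname)
dig +short canary-openshift-ingress-canary.apps.ocp4.teklocal.net    # a name under the *.apps wildcard record
oc get nodes -o wide                                                 # node readiness and assigned IPs
```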

Comment 2 David Dreeggors 2022-03-09 14:33:51 UTC
Possibly related to https://access.redhat.com/solutions/5891131, but that was fixed in later versions, so maybe a regression?

Comment 3 Riccardo Ravaioli 2022-03-09 14:56:38 UTC
Moving to the routing component since it affects ingress. Please feel free to reassign to us if it's on openshift-sdn or CNO.

Comment 4 Miciah Dashiel Butler Masters 2022-03-09 19:51:47 UTC
Setting blocker- as this is most likely a configuration issue and shouldn't block the next z-stream release.  

Looks like you're using vSphere.  Are you using OVN or openshift-sdn?  Are you using FIPS or any other non-default configuration option?  

Does the Curl output show where the delay is?  Timestamps might help; for example, if you have the moreutils package installed, you can use `curl -k -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/ |& ts` to add timestamps to Curl's output.

Otherwise, a packet capture might be needed to diagnose the issue.
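
If it comes to that, one possible way to gather a capture is from a node debug pod (a sketch only; the node name, packet count, and output path are illustrative):

```
# Capture DNS traffic on a node of interest (node name illustrative)
oc debug node/ocp4-p4n85-worker-56jvx -- chroot /host \
  tcpdump -i any -nn -s0 -c 500 -w /var/tmp/canary-dns.pcap port 53
# The capture file ends up at /var/tmp/canary-dns.pcap on that node
```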

Comment 5 David Dreeggors 2022-03-09 20:31:10 UTC
(In reply to Miciah Dashiel Butler Masters from comment #4)
> Setting blocker- as this is most likely a configuration issue and shouldn't
> block the next z-stream release.  
> 
> Looks like you're using vSphere. 

Yes vSphere

> Are you using OVN or openshift-sdn?  

I have tried both at different times/install attempts but I am currently back on default (SDN) in latest install attempt

> Are you using FIPS or any other non-default configuration option?  


This is a vanilla install-config created by `openshift-install create install-config`, so there is no non-default configuration.


> 
> Does the Curl output show where the delay is?  Timestamps might help; for
> example, if you have the moreutils package installed, you can use `curl -k
> -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/ |& ts` to
> add timestamps to Curl's output.  
> 
> Otherwise, a packet capture might be needed to diagnose the issue.


I cannot run the curl command as you gave it because the `ts` command is not installed in the controller pod. The curl command I can run from this pod is posted in the description.

Comment 6 David Dreeggors 2022-03-09 20:47:49 UTC
(In reply to Miciah Dashiel Butler Masters from comment #4)

Given the lack of installed tools in the pod, the best I can offer for timestamps is the following...

I start with a `date` command, then follow with a timed curl command using `--trace-time` to prepend timestamps:


```
sh-4.4$ date; time curl -k -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/ --trace-time
Wed Mar  9 20:43:17 UTC 2022
20:43:30.093187 *   Trying 192.168.4.121...
20:43:30.093445 * TCP_NODELAY set
20:43:30.094541 * Connected to canary-openshift-ingress-canary.apps.ocp4.teklocal.net (192.168.4.121) port 443 (#0)
20:43:30.096582 * ALPN, offering h2
20:43:30.098084 * ALPN, offering http/1.1
20:43:30.116902 * successfully set certificate verify locations:
20:43:30.118931 *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
20:43:30.120037 * TLSv1.3 (OUT), TLS handshake, Client hello (1):
20:43:30.131469 * TLSv1.3 (IN), TLS handshake, Server hello (2):
20:43:30.136428 * TLSv1.3 (IN), TLS handshake, [no content] (0):
20:43:30.136808 * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
20:43:30.136955 * TLSv1.3 (IN), TLS handshake, [no content] (0):
20:43:30.137159 * TLSv1.3 (IN), TLS handshake, Certificate (11):
20:43:30.138263 * TLSv1.3 (IN), TLS handshake, [no content] (0):
20:43:30.138784 * TLSv1.3 (IN), TLS handshake, CERT verify (15):
20:43:30.139270 * TLSv1.3 (IN), TLS handshake, [no content] (0):
20:43:30.139808 * TLSv1.3 (IN), TLS handshake, Finished (20):
20:43:30.139963 * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
20:43:30.140168 * TLSv1.3 (OUT), TLS handshake, [no content] (0):
20:43:30.140341 * TLSv1.3 (OUT), TLS handshake, Finished (20):
20:43:30.140574 * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
20:43:30.140658 * ALPN, server did not agree to a protocol
20:43:30.140773 * Server certificate:
20:43:30.140879 *  subject: CN=*.apps.ocp4.teklocal.net
20:43:30.141391 *  start date: Mar  8 21:10:52 2022 GMT
20:43:30.141500 *  expire date: Mar  7 21:10:53 2024 GMT
20:43:30.141615 *  issuer: CN=ingress-operator@1646773672
20:43:30.141790 *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
20:43:30.142042 * TLSv1.3 (OUT), TLS app data, [no content] (0):
20:43:30.142230 > GET / HTTP/1.1
20:43:30.142230 > Host: canary-openshift-ingress-canary.apps.ocp4.teklocal.net
20:43:30.142230 > User-Agent: curl/7.61.1
20:43:30.142230 > Accept: */*
20:43:30.142230 > 
20:43:30.142664 * TLSv1.3 (IN), TLS handshake, [no content] (0):
20:43:30.148289 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
20:43:30.150040 * TLSv1.3 (IN), TLS handshake, [no content] (0):
20:43:30.150164 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
20:43:30.150927 * TLSv1.3 (IN), TLS app data, [no content] (0):
20:43:30.151284 < HTTP/1.1 200 OK
20:43:30.151443 < x-request-port: 8080
20:43:30.151511 < date: Wed, 09 Mar 2022 20:43:30 GMT
20:43:30.151597 < content-length: 22
20:43:30.151681 < content-type: text/plain; charset=utf-8
20:43:30.152110 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=1391f1a9dbcafe076e97d09351a46977; path=/; HttpOnly; Secure; SameSite=None
20:43:30.152435 < cache-control: private
20:43:30.152605 < 
Healthcheck requested
20:43:30.152736 * Connection #0 to host canary-openshift-ingress-canary.apps.ocp4.teklocal.net left intact

real	0m12.437s
user	0m0.025s
sys	0m0.027s

```

Comment 7 Miciah Dashiel Butler Masters 2022-03-09 23:56:03 UTC
It seems like there is no significant delay from the point where Curl prints the IP address to the completion of the request.  The delay could be caused by DNS resolution; a slow upstream resolver could cause significant delays, especially with long search paths that are typical in Kubernetes.  DNS caching should mitigate this though (on vSphere, entries should be cached for up to 30 seconds).
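
For illustration, a pod's resolver configuration typically looks something like this (the search list and options below are assumptions, not taken from this cluster); with ndots:5, a name like the canary route is first tried against every search domain before being looked up as written, so one slow or failing upstream is hit several times per request:

```
sh-4.4$ cat /etc/resolv.conf
search openshift-ingress-operator.svc.cluster.local svc.cluster.local cluster.local teklocal.net
nameserver 172.30.0.10
options ndots:5
```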

Could you try the following?

   curl -k -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/ -w 'dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n'

It might also be useful to get the strace output:

    strace -ffto /tmp/strace.out curl -k -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/

Then gather the /tmp/strace.out file, and we can try to determine the cause of the delay there.

Comment 8 David Dreeggors 2022-03-10 00:24:22 UTC
(In reply to Miciah Dashiel Butler Masters from comment #7)
> It seems like there is no significant delay from the point where Curl prints
> the IP address to the completion of the request.  The delay could be caused by
> DNS resolution; a slow upstream resolver could cause significant delays,
> especially with long search paths that are typical in Kubernetes.  DNS
> caching should mitigate this though (on vSphere, entries should be cached
> for up to 30 seconds).  
> 
> Could you try the following?
> 
>    curl -k -v
> https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/ -w
> 'dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect:
> %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer:
> %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n'
> 
> It might also be useful to get the strace output:
> 
>     strace -ffto /tmp/strace.out curl -k -v
> https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/
> 
> Then gather the /tmp/strace.out file, and we can try to determine the cause
> of the delay there.


OK, so DNS is the holdup, but the lookup is not going through my local DNS; it appears to be using the internal resolvers, and they are NOT resolving the hostname...


```
sh-4.4$ curl -k -v https://canary-openshift-ingress-canary.apps.ocp4.teklocal.net/ -w 'dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n'
*   Trying 192.168.4.121...
* TCP_NODELAY set
* Connected to canary-openshift-ingress-canary.apps.ocp4.teklocal.net (192.168.4.121) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=*.apps.ocp4.teklocal.net
*  start date: Mar  8 21:10:52 2022 GMT
*  expire date: Mar  7 21:10:53 2024 GMT
*  issuer: CN=ingress-operator@1646773672
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/1.1
> Host: canary-openshift-ingress-canary.apps.ocp4.teklocal.net
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/1.1 200 OK
< x-request-port: 8080
< date: Thu, 10 Mar 2022 00:18:08 GMT
< content-length: 22
< content-type: text/plain; charset=utf-8
< set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=1391f1a9dbcafe076e97d09351a46977; path=/; HttpOnly; Secure; SameSite=None
< cache-control: private
< 
Healthcheck requested
* Connection #0 to host canary-openshift-ingress-canary.apps.ocp4.teklocal.net left intact
dnslookup: 12.128738 | connect: 12.130423 | appconnect: 12.219248 | pretransfer: 12.221369 | starttransfer: 12.229251 | total: 12.231538 | size: 22
```



USING CLUSTER DNS
```
sh-4.4$ nslookup canary-openshift-ingress-canary.apps.ocp4.teklocal.net
;; Truncated, retrying in TCP mode.
Server:		172.30.0.10
Address:	172.30.0.10#53

** server can't find canary-openshift-ingress-canary.apps.ocp4.teklocal.net.teklocal.net: SERVFAIL

sh-4.4$ 
```



USING MY LOCAL DNSMASQ TO RESOLVE
```
sh-4.4$ nslookup canary-openshift-ingress-canary.apps.ocp4.teklocal.net 192.168.4.3
Server:		192.168.4.3
Address:	192.168.4.3#53

Name:	canary-openshift-ingress-canary.apps.ocp4.teklocal.net
Address: 192.168.4.121

sh-4.4$ 
```

Comment 9 David Dreeggors 2022-03-10 00:37:16 UTC
DNS lookups with timing:


Non-cluster dnsmasq for local network
```
sh-4.4$ time nslookup canary-openshift-ingress-canary.apps.ocp4.teklocal.net 192.168.4.3 
Server:		192.168.4.3
Address:	192.168.4.3#53

Name:	canary-openshift-ingress-canary.apps.ocp4.teklocal.net
Address: 192.168.4.121


real	0m0.229s
user	0m0.009s
sys	0m0.015s
```



Cluster DNS

```
sh-4.4$ time nslookup canary-openshift-ingress-canary.apps.ocp4.teklocal.net            
;; Truncated, retrying in TCP mode.
Server:		172.30.0.10
Address:	172.30.0.10#53

** server can't find canary-openshift-ingress-canary.apps.ocp4.teklocal.net.teklocal.net: SERVFAIL


real	0m6.311s
user	0m0.006s
sys	0m0.022s
```

When using the default cluster DNS resolver, we see a 6s timeout and a failure to resolve the host.
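
A quick follow-up check (a sketch) is to repeat the lookup with a trailing dot, which marks the name as fully qualified and keeps the resolver from appending the .teklocal.net search domain seen in the SERVFAIL above:

```
sh-4.4$ time nslookup canary-openshift-ingress-canary.apps.ocp4.teklocal.net. 172.30.0.10
```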

Comment 10 David Dreeggors 2022-03-10 00:49:27 UTC
Also, if this helps....


```
[ddreggors@provisioner ~]$ oc get svc -n openshift-dns dns-default -o wide
NAME          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
dns-default   ClusterIP   172.30.0.10   <none>        53/UDP,53/TCP,9154/TCP   27h   dns.operator.openshift.io/daemonset-dns=default
```


```
[ddreggors@provisioner ~]$ oc get pod -n openshift-dns -o wide -l dns.operator.openshift.io/daemonset-dns=default
NAME                READY   STATUS    RESTARTS   AGE   IP            NODE                      NOMINATED NODE   READINESS GATES
dns-default-cgpc9   2/2     Running   0          27h   10.129.0.5    ocp4-p4n85-master-0       <none>           <none>
dns-default-fxtg9   2/2     Running   0          26h   10.128.2.6    ocp4-p4n85-worker-96ntk   <none>           <none>
dns-default-l6frg   2/2     Running   0          27h   10.128.0.41   ocp4-p4n85-master-1       <none>           <none>
dns-default-msmsx   2/2     Running   0          27h   10.131.0.8    ocp4-p4n85-worker-56jvx   <none>           <none>
dns-default-nq5k5   2/2     Running   0          27h   10.130.0.6    ocp4-p4n85-master-2       <none>           <none>

```

Comment 11 Hongan Li 2022-03-10 10:40:29 UTC
Could you help get the output of this command: `oc -n openshift-vsphere-infra get pod`? Thanks.

Comment 12 David Dreeggors 2022-03-10 13:14:26 UTC
(In reply to Hongan Li from comment #11)
> Could you help to get the output of this command `oc -n
> openshift-vsphere-infra get pod` ? thanks


```
[ddreggors@provisioner ~]$ oc -n openshift-vsphere-infra get pod
NAME                                 READY   STATUS    RESTARTS      AGE
coredns-ocp4-p4n85-master-0          2/2     Running   0             40h
coredns-ocp4-p4n85-master-1          2/2     Running   0             40h
coredns-ocp4-p4n85-master-2          2/2     Running   0             40h
coredns-ocp4-p4n85-worker-56jvx      2/2     Running   0             39h
coredns-ocp4-p4n85-worker-96ntk      2/2     Running   0             39h
haproxy-ocp4-p4n85-master-0          2/2     Running   0             40h
haproxy-ocp4-p4n85-master-1          2/2     Running   0             40h
haproxy-ocp4-p4n85-master-2          2/2     Running   0             40h
keepalived-ocp4-p4n85-master-0       2/2     Running   0             40h
keepalived-ocp4-p4n85-master-1       2/2     Running   0             40h
keepalived-ocp4-p4n85-master-2       2/2     Running   1 (40h ago)   40h
keepalived-ocp4-p4n85-worker-56jvx   2/2     Running   0             39h
keepalived-ocp4-p4n85-worker-96ntk   2/2     Running   0             39h
```

Comment 13 David Dreeggors 2022-03-10 13:49:44 UTC
(In reply to Hongan Li from comment #11)
> Could you help to get the output of this command `oc -n
> openshift-vsphere-infra get pod` ? thanks

However, the logs for the pods in openshift-vsphere-infra are not so nice; there are tons of messages like this:

```
[ddreggors@provisioner ~]$ oc logs -n openshift-vsphere-infra coredns-ocp4-p4n85-worker-56jvx coredns|tail -n 10
[ERROR] plugin/errors: 2 oauth-openshift.apps.ocp4.teklocal.net.teklocal.net. A: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 vcsa.teklocal.net.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 vcsa.teklocal.net.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 vcsa.teklocal.net.ocp4.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 vcsa.teklocal.net.ocp4.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 vcsa.teklocal.net.ocp4.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 console-openshift-console.apps.ocp4.teklocal.net.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 oauth-openshift.apps.ocp4.teklocal.net.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 oauth-openshift.apps.ocp4.teklocal.net.teklocal.net. A: dial tcp 192.168.4.3:53: connect: connection refused
[ERROR] plugin/errors: 2 console-openshift-console.apps.ocp4.teklocal.net.teklocal.net. AAAA: dial tcp 192.168.4.3:53: connect: connection refused
```

Comment 14 David Dreeggors 2022-03-10 16:11:05 UTC
I think I have resolved the issue....


Seeing the connection-refused errors above from coredns, I looked back at the "Troubleshooting OpenShift Container Platform 4: DNS" solution:

https://access.redhat.com/solutions/3804501

There is a test in that solution that really brought my issue to light:


"
7. Verify that both TCP and UDP requests from the coredns container to the upstream DNS server are possible. Both TCP and UDP connections to the upstream DNS server are required for CoreDNS to function correctly:


# dig @<UPSTREAM-DNS-IP> redhat.com -p 5353 +tcp +short 
# dig @<UPSTREAM-DNS-IP> redhat.com -p 5353 +notcp +short 

"



When I tested UDP, all was fine; however, TCP was an issue:
```
sh-4.4# dig @192.168.4.3 redhat.com -p 53 +notcp +short
209.132.183.105

sh-4.4# dig @192.168.4.3 redhat.com -p 53 +tcp +short
;; Connection to 192.168.4.3#53(192.168.4.3) for redhat.com failed: connection refused.
```

I then looked at my DNS server, and even though it was configured correctly, it had stopped listening on TCP port 53:

```
[root@dev-mini ~]# ss -anpl |grep :53|grep dnsmasq                                                                                                                                             
udp   UNCONN 0      0                                                           192.168.4.3:53               0.0.0.0:*    users:(("dnsmasq",pid=63918,fd=8))                                                                                                                                             
udp   UNCONN 0      0                                                             127.0.0.1:53               0.0.0.0:*    users:(("dnsmasq",pid=63918,fd=10))                                                                                                                                          
tcp   LISTEN 0      32                                                            127.0.0.1:53               0.0.0.0:*    users:(("dnsmasq",pid=63918,fd=11))
```
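
A minimal sketch of the remedial step, assuming dnsmasq runs as a systemd service on that host:

```
[root@dev-mini ~]# systemctl restart dnsmasq
[root@dev-mini ~]# ss -anpl |grep :53|grep dnsmasq    # verify the TCP listener on port 53 is back
```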


After a restart, dnsmasq is listening on TCP 53 again and all errors in the cluster are resolved:

```
[ddreggors@provisioner ~]$ oc get co ingress console authentication
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress          4.9.23    True        False         False      59m     
console          4.9.23    True        False         False      9m4s    
authentication   4.9.23    True        False         False      9m2s 
```

Comment 15 David Dreeggors 2022-03-10 16:39:31 UTC
Closing as NOTABUG

