Bug 1991067 - github.com cannot be resolved inside pods when the cluster is running on OpenStack.
Summary: github.com can not be resolved inside pods where cluster is running on openst...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Miheer Salunke
QA Contact: Hongan Li
URL:
Whiteboard:
: 1963081 1995114 (view as bug list)
Depends On:
Blocks: 2009210
TreeView+ depends on / blocked
 
Reported: 2021-08-07 00:54 UTC by Johnny Liu
Modified: 2022-08-04 22:39 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Name resolution of github.com from an application pod failed when the forward section of the Corefile had an upstream DNS server that provided a DNS response larger than 512 bytes. Consequence: KNI CoreDNS did not resolve hostnames when the nameservers in its forwarders provided a DNS response larger than 512 bytes. Fix: Set 512 as the bufsize for KNI CoreDNS to avoid this issue. The limit for UDP DNS messages is 512 bytes; well-behaved DNS servers are supposed to truncate the message and set the truncated bit (see RFC 1035 sections 4.2.1 and 2.3.4: https://datatracker.ietf.org/doc/html/rfc1035#section-4.2.1 https://datatracker.ietf.org/doc/html/rfc1035#section-2.3.4). CoreDNS will compress messages that exceed 512 bytes unless the client allows a larger maximum size by sending the corresponding EDNS0 option in the request. Result: Name resolution of github.com from application pods succeeds even when the forward section of the Corefile has an upstream DNS server that provides a DNS response larger than 512 bytes.
Clone Of:
Environment:
Last Closed: 2022-03-12 04:37:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2730 0 None None None 2021-08-26 15:31:15 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:37:46 UTC

Description Johnny Liu 2021-08-07 00:54:37 UTC
Description of problem:
After cluster setup, I deployed an app on it; the app failed because it could not clone its repo from github.com: DNS resolution failed.

From the dns-default pod log, I saw the following error messages:
.:5353
[INFO] plugin/reload: Running configuration MD5 = eb791f1fb4e1f964e4a7377f6b122c87
CoreDNS-1.8.1
linux/amd64, go1.16.6, 
[ERROR] plugin/errors: 2 github.com. A: read udp 10.128.2.5:47440->192.168.2.224:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 10.128.2.5:45955->192.168.2.224:53: i/o timeout

Tracking it down a bit, I found that only github.com cannot be resolved; google.com works.
[root@preserve-jialiu-ansible ~]# oc -n openshift-dns rsh dns-default-jd927
Defaulted container "dns" out of: dns, kube-rbac-proxy
sh-4.4# cat /etc/resolv.conf 
search jialiu49.0806-9hl.qe.rhcloud.com
nameserver 192.168.3.100
nameserver 10.11.142.1
sh-4.4# 
======> 192.168.3.100 is the host IP of the node the dns-default pod is running on
sh-4.4# dig @192.168.3.100 google.com github.com   

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @192.168.3.100 google.com github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33428
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 9

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: e2b664558e61c250 (echoed)
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		30	IN	A	172.217.13.78

;; AUTHORITY SECTION:
google.com.		30	IN	NS	ns1.google.com.
google.com.		30	IN	NS	ns2.google.com.
google.com.		30	IN	NS	ns4.google.com.
google.com.		30	IN	NS	ns3.google.com.

;; ADDITIONAL SECTION:
ns2.google.com.		30	IN	A	216.239.34.10
ns1.google.com.		30	IN	A	216.239.32.10
ns3.google.com.		30	IN	A	216.239.36.10
ns4.google.com.		30	IN	A	216.239.38.10
ns2.google.com.		30	IN	AAAA	2001:4860:4802:34::a
ns1.google.com.		30	IN	AAAA	2001:4860:4802:32::a
ns3.google.com.		30	IN	AAAA	2001:4860:4802:36::a
ns4.google.com.		30	IN	AAAA	2001:4860:4802:38::a

;; Query time: 15 msec
;; SERVER: 192.168.3.100#53(192.168.3.100)
;; WHEN: Sat Aug 07 00:50:00 UTC 2021
;; MSG SIZE  rcvd: 517

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 53040
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: e2b664558e61c250 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1005 msec
;; SERVER: 192.168.3.100#53(192.168.3.100)
;; WHEN: Sat Aug 07 00:50:06 UTC 2021
;; MSG SIZE  rcvd: 51

sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# 
=========> 10.11.142.1 is the Red Hat network DNS server
sh-4.4# dig @10.11.142.1 google.com github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @10.11.142.1 google.com github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16536
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 9

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1200
; COOKIE: 77940b87a9159b541adbb543610dd8e016428d0d760d9b94 (good)
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		260	IN	A	172.217.13.78

;; AUTHORITY SECTION:
google.com.		7354	IN	NS	ns3.google.com.
google.com.		7354	IN	NS	ns2.google.com.
google.com.		7354	IN	NS	ns4.google.com.
google.com.		7354	IN	NS	ns1.google.com.

;; ADDITIONAL SECTION:
ns2.google.com.		47474	IN	A	216.239.34.10
ns1.google.com.		47474	IN	A	216.239.32.10
ns3.google.com.		47474	IN	A	216.239.36.10
ns4.google.com.		47474	IN	A	216.239.38.10
ns2.google.com.		7354	IN	AAAA	2001:4860:4802:34::a
ns1.google.com.		7354	IN	AAAA	2001:4860:4802:32::a
ns3.google.com.		7354	IN	AAAA	2001:4860:4802:36::a
ns4.google.com.		7354	IN	AAAA	2001:4860:4802:38::a

;; Query time: 4 msec
;; SERVER: 10.11.142.1#53(10.11.142.1)
;; WHEN: Sat Aug 07 00:50:40 UTC 2021
;; MSG SIZE  rcvd: 331

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23445
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 17

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1200
; COOKIE: 77940b87a9159b5469b9e701610dd8e000083c80e1286682 (good)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		29	IN	A	140.82.114.3

;; AUTHORITY SECTION:
github.com.		747	IN	NS	ns-520.awsdns-01.net.
github.com.		747	IN	NS	ns-1707.awsdns-21.co.uk.
github.com.		747	IN	NS	ns-1283.awsdns-32.org.
github.com.		747	IN	NS	ns-421.awsdns-52.com.
github.com.		747	IN	NS	dns2.p08.nsone.net.
github.com.		747	IN	NS	dns3.p08.nsone.net.
github.com.		747	IN	NS	dns4.p08.nsone.net.
github.com.		747	IN	NS	dns1.p08.nsone.net.

;; ADDITIONAL SECTION:
dns1.p08.nsone.net.	7425	IN	A	198.51.44.8
dns2.p08.nsone.net.	7425	IN	A	198.51.45.8
dns3.p08.nsone.net.	7425	IN	A	198.51.44.72
dns4.p08.nsone.net.	7425	IN	A	198.51.45.72
ns-1283.awsdns-32.org.	7409	IN	A	205.251.197.3
ns-1707.awsdns-21.co.uk. 7425	IN	A	205.251.198.171
ns-421.awsdns-52.com.	151073	IN	A	205.251.193.165
ns-520.awsdns-01.net.	7362	IN	A	205.251.194.8
dns1.p08.nsone.net.	7425	IN	AAAA	2620:4d:4000:6259:7:8:0:1
dns2.p08.nsone.net.	7425	IN	AAAA	2a00:edc0:6259:7:8::2
dns3.p08.nsone.net.	7425	IN	AAAA	2620:4d:4000:6259:7:8:0:3
dns4.p08.nsone.net.	7425	IN	AAAA	2a00:edc0:6259:7:8::4
ns-1283.awsdns-32.org.	7409	IN	AAAA	2600:9000:5305:300::1
ns-1707.awsdns-21.co.uk. 7425	IN	AAAA	2600:9000:5306:ab00::1
ns-421.awsdns-52.com.	151073	IN	AAAA	2600:9000:5301:a500::1
ns-520.awsdns-01.net.	7362	IN	AAAA	2600:9000:5302:800::1

;; Query time: 1 msec
;; SERVER: 10.11.142.1#53(10.11.142.1)
;; WHEN: Sat Aug 07 00:50:40 UTC 2021
;; MSG SIZE  rcvd: 658



OpenShift release version:
4.9.0-0.nightly-2021-08-06-060446

Cluster Platform:
OpenStack (PSI)

How reproducible:
Always

Steps to Reproduce (in detail):
1.
2.
3.


Actual results:


Expected results:


Impact of the problem:


Additional info:



** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report.  You may also mark the bug private if you wish.

Comment 1 Miciah Dashiel Butler Masters 2021-08-10 16:10:43 UTC
This looks like an intermittent DNS issue outside of OpenShift.  Are you able to reproduce the problem again?  

Your `dig @192.168.3.100 google.com github.com` command is going to the KNI CoreDNS.  Can you test `dig @172.30.0.10 google.com github.com` to check whether the CoreDNS that the DNS operator manages can resolve github.com?
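
For example, from inside a dns-default pod (dig is available there; the addresses below are the ones already shown in this report), the two resolvers can be compared directly:

dig @192.168.3.100 github.com    # KNI CoreDNS listening on the node IP
dig @172.30.0.10 github.com      # cluster DNS service backed by the operator-managed CoreDNS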

Comment 2 Johnny Liu 2021-08-10 17:39:18 UTC
> This looks like an intermittent DNS issue outside of OpenShift.  Are you able to reproduce the problem again?
I can always reproduce this issue. If this were an intermittent DNS issue outside of OpenShift, why would github.com always fail to be resolved?


> Can you test `dig @172.30.0.10 google.com github.com` to check whether the CoreDNS that the DNS operator manages can resolve github.com?
[root@preserve-jialiu-ansible ~]# oc -n openshift-dns rsh dns-default-2hlxc
Defaulted container "dns" out of: dns, kube-rbac-proxy
sh-4.4# cat /etc/resolv.conf 
search xxia0810osp.0810-wg8.qe.rhcloud.com
nameserver 192.168.2.199
nameserver 10.11.142.1
sh-4.4# dig @192.168.2.199 google.com github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @192.168.2.199 google.com github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43900
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 9

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 2f3b19fd41edf90a (echoed)
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		30	IN	A	172.217.13.238

;; AUTHORITY SECTION:
google.com.		30	IN	NS	ns3.google.com.
google.com.		30	IN	NS	ns1.google.com.
google.com.		30	IN	NS	ns4.google.com.
google.com.		30	IN	NS	ns2.google.com.

;; ADDITIONAL SECTION:
ns2.google.com.		30	IN	A	216.239.34.10
ns1.google.com.		30	IN	A	216.239.32.10
ns3.google.com.		30	IN	A	216.239.36.10
ns4.google.com.		30	IN	A	216.239.38.10
ns2.google.com.		30	IN	AAAA	2001:4860:4802:34::a
ns1.google.com.		30	IN	AAAA	2001:4860:4802:32::a
ns3.google.com.		30	IN	AAAA	2001:4860:4802:36::a
ns4.google.com.		30	IN	AAAA	2001:4860:4802:38::a

;; Query time: 38 msec
;; SERVER: 192.168.2.199#53(192.168.2.199)
;; WHEN: Tue Aug 10 17:36:50 UTC 2021
;; MSG SIZE  rcvd: 517

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 45289
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 2f3b19fd41edf90a (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1010 msec
;; SERVER: 192.168.2.199#53(192.168.2.199)
;; WHEN: Tue Aug 10 17:36:56 UTC 2021
;; MSG SIZE  rcvd: 51

sh-4.4# dig @172.30.0.10 google.com github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @172.30.0.10 google.com github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56480
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 9

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: bc4b853a09fb3aba (echoed)
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		30	IN	A	172.217.13.238

;; AUTHORITY SECTION:
google.com.		30	IN	NS	ns1.google.com.
google.com.		30	IN	NS	ns4.google.com.
google.com.		30	IN	NS	ns3.google.com.
google.com.		30	IN	NS	ns2.google.com.

;; ADDITIONAL SECTION:
ns2.google.com.		30	IN	A	216.239.34.10
ns1.google.com.		30	IN	A	216.239.32.10
ns3.google.com.		30	IN	A	216.239.36.10
ns4.google.com.		30	IN	A	216.239.38.10
ns2.google.com.		30	IN	AAAA	2001:4860:4802:34::a
ns1.google.com.		30	IN	AAAA	2001:4860:4802:32::a
ns3.google.com.		30	IN	AAAA	2001:4860:4802:36::a
ns4.google.com.		30	IN	AAAA	2001:4860:4802:38::a

;; Query time: 8 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Tue Aug 10 17:37:24 UTC 2021
;; MSG SIZE  rcvd: 315

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 28827
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: bc4b853a09fb3aba (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1018 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Tue Aug 10 17:37:30 UTC 2021
;; MSG SIZE  rcvd: 51

sh-4.4#

Comment 3 Miheer Salunke 2021-08-11 03:43:19 UTC
Hi,

I am not able to reproduce this issue in my OpenStack environment.

[stack@standalone ~]$ oc -n openshift-ingress rsh router-default-794ffc5d68-tvl69
sh-4.4$ 
sh-4.4$ 
sh-4.4$ 
sh-4.4$ 
sh-4.4$ 
sh-4.4$ dig github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62998
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: ab4b35edb9d96e30 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		30	IN	A	140.82.112.4

;; Query time: 5 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Wed Aug 11 03:38:04 UTC 2021
;; MSG SIZE  rcvd: 77

sh-4.4$ 
sh-4.4$ 
sh-4.4$ curl -Ivk https://github.com
* Rebuilt URL to: https://github.com/
*   Trying 140.82.112.4...
* TCP_NODELAY set
* Connected to github.com (140.82.112.4) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=github.com
*  start date: Mar 25 00:00:00 2021 GMT
*  expire date: Mar 30 23:59:59 2022 GMT
*  issuer: C=US; O=DigiCert, Inc.; CN=DigiCert High Assurance TLS Hybrid ECC SHA256 2020 CA1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x55693cd3d780)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> HEAD / HTTP/2
> Host: github.com
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 200 
HTTP/2 200 
< server: GitHub.com
server: GitHub.com
< date: Wed, 11 Aug 2021 03:38:31 GMT
date: Wed, 11 Aug 2021 03:38:31 GMT
< content-type: text/html; charset=utf-8
content-type: text/html; charset=utf-8
< vary: X-PJAX, Accept-Language, Accept-Encoding, Accept, X-Requested-With
vary: X-PJAX, Accept-Language, Accept-Encoding, Accept, X-Requested-With
< permissions-policy: interest-cohort=()
permissions-policy: interest-cohort=()
< content-language: en-US
content-language: en-US
< etag: W/"baf0076cfd7ed7d5f2efaad9e11cb5d5"
etag: W/"baf0076cfd7ed7d5f2efaad9e11cb5d5"
< cache-control: max-age=0, private, must-revalidate
cache-control: max-age=0, private, must-revalidate
< strict-transport-security: max-age=31536000; includeSubdomains; preload
strict-transport-security: max-age=31536000; includeSubdomains; preload
< x-frame-options: deny
x-frame-options: deny
< x-content-type-options: nosniff
x-content-type-options: nosniff
< x-xss-protection: 0
x-xss-protection: 0
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
< content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com logx.optimizely.com/v1/events translator.github.com wss://alive.github.com github.githubassets.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com render-temp.githubusercontent.com viewscreen.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com secured-user-images.githubusercontent.com/ *.githubusercontent.com customer-stories-feed.github.com spotlights-feed.github.com; manifest-src 'self'; media-src github.githubassets.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; worker-src github.com/socket-worker-3f088aa2.js gist.github.com/socket-worker-3f088aa2.js
content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com logx.optimizely.com/v1/events translator.github.com wss://alive.github.com github.githubassets.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com render-temp.githubusercontent.com viewscreen.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com secured-user-images.githubusercontent.com/ *.githubusercontent.com customer-stories-feed.github.com spotlights-feed.github.com; manifest-src 'self'; media-src github.githubassets.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; worker-src github.com/socket-worker-3f088aa2.js gist.github.com/socket-worker-3f088aa2.js
< set-cookie: _gh_sess=K%2FpSZff8jvfVSIvC1yF1SYM%2F0z06p376jYqp4N1fa3EnMPEv0QHUhmgDu7I7NbUKAFJ0CBDJxZeomc5rQ9gBamz2eNFa6vFrzbkjRzXA8FLAT70h7%2FCDmFpuZCsf9w%2BipsVlEqRGdZJRZKlrDLf4WxIKnNL7Tl2SUUSsPd%2FkCeWuPJwqPehbv0DQMlUTmX37rcP4AfnyEzEt8Wk94Ro9Jav%2BAz6%2BQFgfxTZM4I%2BtzVSETHKFtV4BE4vQ03Ql0%2BTMZ6QYie0GXEch%2FSkI0CxoJQ%3D%3D--5gyMChQPyi0lHY8n--6FOfgfqsBOrSGnua3mP5tw%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax
set-cookie: _gh_sess=K%2FpSZff8jvfVSIvC1yF1SYM%2F0z06p376jYqp4N1fa3EnMPEv0QHUhmgDu7I7NbUKAFJ0CBDJxZeomc5rQ9gBamz2eNFa6vFrzbkjRzXA8FLAT70h7%2FCDmFpuZCsf9w%2BipsVlEqRGdZJRZKlrDLf4WxIKnNL7Tl2SUUSsPd%2FkCeWuPJwqPehbv0DQMlUTmX37rcP4AfnyEzEt8Wk94Ro9Jav%2BAz6%2BQFgfxTZM4I%2BtzVSETHKFtV4BE4vQ03Ql0%2BTMZ6QYie0GXEch%2FSkI0CxoJQ%3D%3D--5gyMChQPyi0lHY8n--6FOfgfqsBOrSGnua3mP5tw%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax
< set-cookie: _octo=GH1.1.1605265127.1628653117; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 03:38:37 GMT; Secure; SameSite=Lax
set-cookie: _octo=GH1.1.1605265127.1628653117; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 03:38:37 GMT; Secure; SameSite=Lax
< set-cookie: logged_in=no; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 03:38:37 GMT; HttpOnly; Secure; SameSite=Lax
set-cookie: logged_in=no; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 03:38:37 GMT; HttpOnly; Secure; SameSite=Lax
< accept-ranges: bytes
accept-ranges: bytes
< x-github-request-id: B876:7A26:2E65A9:490C5E:6113463D
x-github-request-id: B876:7A26:2E65A9:490C5E:6113463D

< 
* Connection #0 to host github.com left intact
sh-4.4$ 




It looks like the upstream DNS in your environment is not configured to look up github.com, or maybe github.com is blacklisted in your environment.


Is 10.11.142.1 the upstream server?

Comment 4 Johnny Liu 2021-08-11 06:17:52 UTC
> Is 10.11.142.1 the upstream server?

Yes.

> It looks like the upstream DNS in your environment is not configured to look up github.com, or maybe github.com is blacklisted in your environment.
You can check comment 0: when I ran dig against 10.11.142.1, github.com was resolved correctly, so I do not think github.com is blacklisted in the upstream DNS server.

BTW, could you show me your /etc/resolv.conf in the *dns-default* pod under the openshift-dns namespace?

Comment 5 Pavol Pitonak 2021-08-11 06:27:13 UTC
I checked whether github.com is resolvable from the node directly:

$ oc debug node/ppitonak48dp-h6hqq-worker-0-mkfkh
# chroot /host
# host google.com
google.com has address 172.217.13.78
google.com has IPv6 address 2607:f8b0:4004:808::200e
google.com mail is handled by 50 alt4.aspmx.l.google.com.
google.com mail is handled by 10 aspmx.l.google.com.
google.com mail is handled by 20 alt1.aspmx.l.google.com.
google.com mail is handled by 30 alt2.aspmx.l.google.com.
google.com mail is handled by 40 alt3.aspmx.l.google.com.
# host github.com
github.com has address 140.82.112.4
github.com mail is handled by 5 alt2.aspmx.l.google.com.
github.com mail is handled by 5 alt1.aspmx.l.google.com.
github.com mail is handled by 1 aspmx.l.google.com.
github.com mail is handled by 10 alt3.aspmx.l.google.com.
github.com mail is handled by 10 alt4.aspmx.l.google.com.

Then I tried to run a container directly on the node; that worked as well.
# podman run -it --rm registry.access.redhat.com/ubi8/ubi /bin/sh -c 'dnf install -y bind-utils && set -x && host google.com && host github.com'

When I create a pod on the cluster, github.com is not accessible. Another problematic domain is fedoraproject.org (so e.g. you cannot run "dnf update" inside a Fedora container).
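
A quick in-cluster version of the same check is to run it in a throwaway pod (rough sketch; the pod name is arbitrary and the image is the same UBI image used above):

oc run dns-test --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- /bin/sh -c 'dnf install -y bind-utils && host google.com && host github.com'
oc logs -f dns-test    # google.com resolves; github.com is expected to fail as described above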

Comment 7 Miheer Salunke 2021-08-11 08:22:26 UTC
(In reply to Johnny Liu from comment #4)
> > Is 10.11.142.1 the upstream server?
> 
> Yes.
> 
> > It looks like the upstream DNS in your environment is not configured to look up github.com, or maybe github.com is blacklisted in your environment.
> You can check comment 0: when I ran dig against 10.11.142.1,
> github.com was resolved correctly, so I do not think github.com is blacklisted
> in the upstream DNS server.
> 
> BTW, could you show me your /etc/resolv.conf in the *dns-default* pod
> under the openshift-dns namespace?

[stack@standalone ~]$ oc -n openshift-ingress rsh router-default-794ffc5d68-tvl69
sh-4.4$ 
sh-4.4$ 
sh-4.4$ 
sh-4.4$ cat /etc/resolv.conf 
search openshift-ingress.svc.cluster.local svc.cluster.local cluster.local shiftstack ostest.shiftstack.com
nameserver 172.30.0.10
options ndots:5
sh-4.4$ 


[stack@standalone ~]$ oc -n openshift-dns rsh dns-default-6d9rh
Defaulted container "dns" out of: dns, kube-rbac-proxy
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# cat /etc/resolv.conf 
search shiftstack ostest.shiftstack.com
nameserver 10.0.3.78
nameserver 10.11.5.19
nameserver 10.10.160.2
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# 

[stack@standalone ~]$ oc get pods -o wide -n openshift-dns
NAME                  READY   STATUS    RESTARTS   AGE    IP               NODE                          NOMINATED NODE   READINESS GATES
dns-default-6d9rh     2/2     Running   0          2d3h   10.128.108.233   ostest-2gb7w-master-2         <none>           <none>


[stack@standalone ~]$ oc debug node/ostest-2gb7w-master-2
Starting pod/ostest-2gb7w-master-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.3.78
If you don't see a command prompt, try pressing enter.
sh-4.4# 
sh-4.4# 
sh-4.4# cat /etc/resolv.conf 
search shiftstack ostest.shiftstack.com
nameserver 10.0.3.78
nameserver 10.11.5.19
nameserver 10.10.160.2
sh-4.4# 

So the DNS pod's resolv.conf and the resolv.conf of the node it is running on are the same.

Pods other than the DNS pods will have the cluster DNS service IP in their resolv.conf.

Comment 8 Miheer Salunke 2021-08-11 08:43:12 UTC
I was not able to reproduce this issue in the OpenStack environment.

This time I ran dig and curl against github.com from the DNS pod:

[stack@standalone ~]$ oc -n openshift-dns rsh dns-default-6d9rh
Defaulted container "dns" out of: dns, kube-rbac-proxy
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# curl -Ivk https://github.com
* Rebuilt URL to: https://github.com/
*   Trying 140.82.113.4...
* TCP_NODELAY set
* Connected to github.com (140.82.113.4) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=github.com
*  start date: Mar 25 00:00:00 2021 GMT
*  expire date: Mar 30 23:59:59 2022 GMT
*  issuer: C=US; O=DigiCert, Inc.; CN=DigiCert High Assurance TLS Hybrid ECC SHA256 2020 CA1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x55cb9aed0740)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> HEAD / HTTP/2
> Host: github.com
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 200 
HTTP/2 200 
< server: GitHub.com
server: GitHub.com
< date: Wed, 11 Aug 2021 08:38:35 GMT
date: Wed, 11 Aug 2021 08:38:35 GMT
< content-type: text/html; charset=utf-8
content-type: text/html; charset=utf-8
< vary: X-PJAX, Accept-Language, Accept-Encoding, Accept, X-Requested-With
vary: X-PJAX, Accept-Language, Accept-Encoding, Accept, X-Requested-With
< permissions-policy: interest-cohort=()
permissions-policy: interest-cohort=()
< content-language: en-US
content-language: en-US
< etag: W/"68b2abaad33f8c442839ea3ac20464ca"
etag: W/"68b2abaad33f8c442839ea3ac20464ca"
< cache-control: max-age=0, private, must-revalidate
cache-control: max-age=0, private, must-revalidate
< strict-transport-security: max-age=31536000; includeSubdomains; preload
strict-transport-security: max-age=31536000; includeSubdomains; preload
< x-frame-options: deny
x-frame-options: deny
< x-content-type-options: nosniff
x-content-type-options: nosniff
< x-xss-protection: 0
x-xss-protection: 0
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
< content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com logx.optimizely.com/v1/events translator.github.com wss://alive.github.com github.githubassets.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com render-temp.githubusercontent.com viewscreen.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com secured-user-images.githubusercontent.com/ *.githubusercontent.com customer-stories-feed.github.com spotlights-feed.github.com; manifest-src 'self'; media-src github.githubassets.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; worker-src github.com/socket-worker-3f088aa2.js gist.github.com/socket-worker-3f088aa2.js
content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com logx.optimizely.com/v1/events translator.github.com wss://alive.github.com github.githubassets.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com render-temp.githubusercontent.com viewscreen.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com secured-user-images.githubusercontent.com/ *.githubusercontent.com customer-stories-feed.github.com spotlights-feed.github.com; manifest-src 'self'; media-src github.githubassets.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; worker-src github.com/socket-worker-3f088aa2.js gist.github.com/socket-worker-3f088aa2.js
< set-cookie: _gh_sess=FT%2FIbbZl4SkMWnZ3AYViFhyg6XwM1JW%2BytptIypo5scVWSCFwTsPp6xqdRXjozAVU2%2FksmobW8Uea0Zz0l4nQHp1znBTZUNlKqBre106x584YHvxHWwo%2BYnEkFScfaJoyqaTjc7NAKEhYYBcSLbRQX%2FVGXbV030J6ouxXSNyr%2Ft8%2FRd%2BuPJCjYPb%2BEkABXmH3uytCBKXhbfNw9UowEN7LXhWZaWSatynCaEeUM5npQIK3hRQCXJIOpI3nGRSYIOQT2J8GVay6sCwdWXp%2Bg%2FqVA%3D%3D--%2BNhhV6BKOBBZEtt6--G86GFM7BV4kwl%2BTU17Tp%2Fw%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax
set-cookie: _gh_sess=FT%2FIbbZl4SkMWnZ3AYViFhyg6XwM1JW%2BytptIypo5scVWSCFwTsPp6xqdRXjozAVU2%2FksmobW8Uea0Zz0l4nQHp1znBTZUNlKqBre106x584YHvxHWwo%2BYnEkFScfaJoyqaTjc7NAKEhYYBcSLbRQX%2FVGXbV030J6ouxXSNyr%2Ft8%2FRd%2BuPJCjYPb%2BEkABXmH3uytCBKXhbfNw9UowEN7LXhWZaWSatynCaEeUM5npQIK3hRQCXJIOpI3nGRSYIOQT2J8GVay6sCwdWXp%2Bg%2FqVA%3D%3D--%2BNhhV6BKOBBZEtt6--G86GFM7BV4kwl%2BTU17Tp%2Fw%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax
< set-cookie: _octo=GH1.1.1945855013.1628671120; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 08:38:40 GMT; Secure; SameSite=Lax
set-cookie: _octo=GH1.1.1945855013.1628671120; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 08:38:40 GMT; Secure; SameSite=Lax
< set-cookie: logged_in=no; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 08:38:40 GMT; HttpOnly; Secure; SameSite=Lax
set-cookie: logged_in=no; Path=/; Domain=github.com; Expires=Thu, 11 Aug 2022 08:38:40 GMT; HttpOnly; Secure; SameSite=Lax
< accept-ranges: bytes
accept-ranges: bytes
< x-github-request-id: 8CC8:51D8:41B46F:7A1E71:61138C90
x-github-request-id: 8CC8:51D8:41B46F:7A1E71:61138C90

< 
* Connection #0 to host github.com left intact
sh-4.4# dig github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11725
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 6a7cc27e316f94e9 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		17	IN	A	140.82.113.4

;; Query time: 3 msec
;; SERVER: 10.0.3.78#53(10.0.3.78)
;; WHEN: Wed Aug 11 08:38:52 UTC 2021
;; MSG SIZE  rcvd: 77

sh-4.4# cat /etc/resolv.conf 
search shiftstack ostest.shiftstack.com
nameserver 10.0.3.78
nameserver 10.11.5.19
nameserver 10.10.160.2
sh-4.4# exit
exit
[stack@standalone ~]$ oc -n openshift-dns get dns-default-6d9rh -o yaml
error: the server doesn't have a resource type "dns-default-6d9rh"
[stack@standalone ~]$ oc -n openshift-dns rsh dns-default-6d9rh
Defaulted container "dns" out of: dns, kube-rbac-proxy
sh-4.4# exit
exit
[stack@standalone ~]$ oc get pods dns-default-6d9rh -o yaml
Error from server (NotFound): pods "dns-default-6d9rh" not found
[stack@standalone ~]$ oc get pods dns-default-6d9rh -o yaml -n openshift-dns
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "kuryr",
          "interface": "eth0",
          "ips": [
              "10.128.108.233"
          ],
          "mac": "fa:16:3e:f8:ba:a8",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "kuryr",
          "interface": "eth0",
          "ips": [
              "10.128.108.233"
          ],
          "mac": "fa:16:3e:f8:ba:a8",
          "default": true,
          "dns": {}
      }]
    workload.openshift.io/warning: only single-node clusters support workload partitioning
  creationTimestamp: "2021-08-09T04:29:32Z"
  finalizers:
  - kuryr.openstack.org/pod-finalizer
  generateName: dns-default-
  labels:
    controller-revision-hash: 77d7d5b487
    dns.operator.openshift.io/daemonset-dns: default
    pod-template-generation: "1"
  name: dns-default-6d9rh
  namespace: openshift-dns
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: dns-default
    uid: 8a3b9781-6cf0-486f-8237-92ae1f4ebb87
  resourceVersion: "21842"
  uid: fe7bbdd4-81cd-4842-af82-76f7a31fb846
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - ostest-2gb7w-master-2
  containers:
  - args:
    - -conf
    - /etc/coredns/Corefile
    command:
    - coredns
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8e2babcf2d5085a5a5d4cf646af4e9c173957bd00f4c32a75e2a886ddf0a9931
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: dns
    ports:
    - containerPort: 5353
      name: dns
      protocol: UDP
    - containerPort: 5353
      name: dns-tcp
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready
        port: 8181
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      requests:
        cpu: 50m
        memory: 70Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/coredns
      name: config-volume
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-mfq5t
      readOnly: true
  - args:
    - --logtostderr
    - --secure-listen-address=:9154
    - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
    - --upstream=http://127.0.0.1:9153/
    - --tls-cert-file=/etc/tls/private/tls.crt
    - --tls-private-key-file=/etc/tls/private/tls.key
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb
    imagePullPolicy: IfNotPresent
    name: kube-rbac-proxy
    ports:
    - containerPort: 9154
      name: metrics
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/tls/private
      name: metrics-tls
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-mfq5t
      readOnly: true
  dnsPolicy: Default
  enableServiceLinks: true
  nodeName: ostest-2gb7w-master-2
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: dns
  serviceAccountName: dns
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - configMap:
      defaultMode: 420
      items:
      - key: Corefile
        path: Corefile
      name: dns-default
    name: config-volume
  - name: metrics-tls
    secret:
      defaultMode: 420
      secretName: dns-default-metrics-tls
  - name: kube-api-access-mfq5t
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-08-09T04:29:32Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-08-09T04:33:53Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-08-09T04:33:53Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-08-09T04:29:32Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://395bc767624898461eb2b68875c57d60d0e6766fae38c407a01f0d51687bef4f
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8e2babcf2d5085a5a5d4cf646af4e9c173957bd00f4c32a75e2a886ddf0a9931
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8e2babcf2d5085a5a5d4cf646af4e9c173957bd00f4c32a75e2a886ddf0a9931
    lastState: {}
    name: dns
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-08-09T04:33:43Z"
  - containerID: cri-o://a1011aba97f608ddcce7631da21d20798546c38589ef1d292fef0ac7154596f8
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb
    lastState: {}
    name: kube-rbac-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-08-09T04:33:44Z"
  hostIP: 10.0.3.78
  phase: Running
  podIP: 10.128.108.233
  podIPs:
  - ip: 10.128.108.233
  qosClass: Burstable
  startTime: "2021-08-09T04:29:32Z"
[stack@standalone ~]$

Comment 9 Miheer Salunke 2021-08-11 08:58:09 UTC
(In reply to Johnny Liu from comment #4)
> > Is 10.11.142.1 the upstream server?
> 
> Yes.
> 
> > It looks like the upstream DNS in your environment is not configured to look up github.com, or maybe github.com is blacklisted in your environment.
> You can check comment 0: when I ran dig against 10.11.142.1,
> github.com was resolved correctly, so I do not think github.com is blacklisted
> in the upstream DNS server.
> 
> BTW, could you show me your /etc/resolv.conf in the *dns-default* pod
> under the openshift-dns namespace?


So the resolv.conf of the DNS pod and the resolv.conf of the node it is running on are the same.

Pods other than the DNS pods will have the cluster DNS service IP in their resolv.conf.

I tried digging with all nameservers present in /etc/resolv.conf. All worked fine.


[stack@standalone ~]$ oc -n openshift-dns rsh dns-default-6d9rh
Defaulted container "dns" out of: dns, kube-rbac-proxy
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# cat /etc/resolv.conf 
search shiftstack ostest.shiftstack.com
nameserver 10.0.3.78
nameserver 10.11.5.19
nameserver 10.10.160.2
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# dig @10.0.3.78 github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @10.0.3.78 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58781
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: bfe4452b7110e912 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		30	IN	A	140.82.112.3

;; Query time: 3 msec
;; SERVER: 10.0.3.78#53(10.0.3.78)
;; WHEN: Wed Aug 11 08:56:08 UTC 2021
;; MSG SIZE  rcvd: 77

sh-4.4# dig @10.11.5.19 github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @10.11.5.19 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12272
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
; COOKIE: c6dee88b9f02db8aa82bdeb4611390bb4a9362b1b5794dc4 (good)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		38	IN	A	140.82.112.3

;; Query time: 1 msec
;; SERVER: 10.11.5.19#53(10.11.5.19)
;; WHEN: Wed Aug 11 08:56:26 UTC 2021
;; MSG SIZE  rcvd: 83

sh-4.4# 
sh-4.4# 
sh-4.4# dig @10.10.160.2 github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @10.10.160.2 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28777
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 17

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		60	IN	A	140.82.112.3

;; AUTHORITY SECTION:
github.com.		170	IN	NS	ns-421.awsdns-52.com.
github.com.		170	IN	NS	ns-1283.awsdns-32.org.
github.com.		170	IN	NS	dns2.p08.nsone.net.
github.com.		170	IN	NS	ns-520.awsdns-01.net.
github.com.		170	IN	NS	dns4.p08.nsone.net.
github.com.		170	IN	NS	dns3.p08.nsone.net.
github.com.		170	IN	NS	ns-1707.awsdns-21.co.uk.
github.com.		170	IN	NS	dns1.p08.nsone.net.

;; ADDITIONAL SECTION:
dns1.p08.nsone.net.	20202	IN	A	198.51.44.8
dns1.p08.nsone.net.	20202	IN	AAAA	2620:4d:4000:6259:7:8:0:1
dns2.p08.nsone.net.	20202	IN	A	198.51.45.8
dns2.p08.nsone.net.	20202	IN	AAAA	2a00:edc0:6259:7:8::2
ns-421.awsdns-52.com.	25187	IN	A	205.251.193.165
ns-421.awsdns-52.com.	25187	IN	AAAA	2600:9000:5301:a500::1
ns-1283.awsdns-32.org.	20174	IN	A	205.251.197.3
ns-1283.awsdns-32.org.	20174	IN	AAAA	2600:9000:5305:300::1
dns3.p08.nsone.net.	20202	IN	A	198.51.44.72
dns3.p08.nsone.net.	20202	IN	AAAA	2620:4d:4000:6259:7:8:0:3
ns-520.awsdns-01.net.	20219	IN	A	205.251.194.8
ns-520.awsdns-01.net.	20219	IN	AAAA	2600:9000:5302:800::1
ns-1707.awsdns-21.co.uk. 20219	IN	A	205.251.198.171
ns-1707.awsdns-21.co.uk. 20219	IN	AAAA	2600:9000:5306:ab00::1
dns4.p08.nsone.net.	20202	IN	A	198.51.45.72
dns4.p08.nsone.net.	20202	IN	AAAA	2a00:edc0:6259:7:8::4

;; Query time: 11 msec
;; SERVER: 10.10.160.2#53(10.10.160.2)
;; WHEN: Wed Aug 11 08:56:44 UTC 2021
;; MSG SIZE  rcvd: 630

sh-4.4#

Comment 10 Miheer Salunke 2021-08-11 09:15:55 UTC
As you mentioned 

sh-4.4# dig @192.168.3.100 google.com github.com   

does not work, where 192.168.3.100 is the IP of the node on which the coredns pod is running.

oc get nodes -o wide | grep 192.168.3.100


Find the node, and then run:

oc debug node/<node name>

Then run:

sh-4.4# netstat -tunlp | grep 53
tcp6       0      0 :::53                   :::*                    LISTEN      2141/coredns        
tcp6       0      0 :::9537                 :::*                    LISTEN      1510/crio           
udp6       0      0 :::53                   :::*                                2141/coredns        
sh-4.4# exit

Does it show the above output?

Comment 11 Johnny Liu 2021-08-11 09:46:42 UTC
Here is my new reproducer:

[root@preserve-jialiu-ansible ~]# oc debug node/xxia0811osp-kglss-master-0
Starting pod/xxia0811osp-kglss-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.2.211
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /etc/resolv.conf 
# Generated by KNI resolv prepender NM dispatcher script
search   xxia0811osp.0811-atx.qe.rhcloud.com
nameserver 192.168.2.211
nameserver 10.11.142.1

[root@preserve-jialiu-ansible ~]# oc -n openshift-dns rsh dns-default-459lm
Defaulted container "dns" out of: dns, kube-rbac-proxy
sh-4.4# cat /etc/resolv.conf 
search xxia0811osp.0811-atx.qe.rhcloud.com
nameserver 192.168.2.211
nameserver 10.11.142.1

The resolv.conf of the DNS pod and the resolv.conf of the node it is running on are the same.

Comparing this with your /etc/resolv.conf: why is there no "# Generated by KNI resolv prepender NM dispatcher script" comment in yours?


> Does it show the above output ?
[root@preserve-jialiu-ansible ~]# oc debug node/xxia0811osp-kglss-master-0
Starting pod/xxia0811osp-kglss-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.2.211
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /etc/resolv.conf 
# Generated by KNI resolv prepender NM dispatcher script
search   xxia0811osp.0811-atx.qe.rhcloud.com
nameserver 192.168.2.211
nameserver 10.11.142.1
sh-4.4# dig @192.168.2.211 google.com github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @192.168.2.211 google.com github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62444
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 9

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: c49d5a50d2d74f8e (echoed)
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		30	IN	A	142.251.33.206

;; AUTHORITY SECTION:
google.com.		30	IN	NS	ns3.google.com.
google.com.		30	IN	NS	ns1.google.com.
google.com.		30	IN	NS	ns2.google.com.
google.com.		30	IN	NS	ns4.google.com.

;; ADDITIONAL SECTION:
ns2.google.com.		30	IN	A	216.239.34.10
ns1.google.com.		30	IN	A	216.239.32.10
ns3.google.com.		30	IN	A	216.239.36.10
ns4.google.com.		30	IN	A	216.239.38.10
ns2.google.com.		30	IN	AAAA	2001:4860:4802:34::a
ns1.google.com.		30	IN	AAAA	2001:4860:4802:32::a
ns3.google.com.		30	IN	AAAA	2001:4860:4802:36::a
ns4.google.com.		30	IN	AAAA	2001:4860:4802:38::a

;; Query time: 12 msec
;; SERVER: 192.168.2.211#53(192.168.2.211)
;; WHEN: Wed Aug 11 09:43:19 UTC 2021
;; MSG SIZE  rcvd: 517

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 44277
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: c49d5a50d2d74f8e (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1003 msec
;; SERVER: 192.168.2.211#53(192.168.2.211)
;; WHEN: Wed Aug 11 09:43:25 UTC 2021
;; MSG SIZE  rcvd: 51

sh-4.4# netstat -tunlp | grep 53
tcp6       0      0 :::53                   :::*                    LISTEN      2557/coredns        
tcp6       0      0 :::9537                 :::*                    LISTEN      1743/crio           
udp6       0      0 :::53                   :::*                                2557/coredns   


I did not see anything suspicious.

And here is my ipi-on-osp cluster's install-config.yaml:
08-11 10:15:20.761  ---
08-11 10:15:20.761  apiVersion: v1
08-11 10:15:20.761  controlPlane:
08-11 10:15:20.761    architecture: amd64
08-11 10:15:20.761    hyperthreading: Enabled
08-11 10:15:20.761    name: master
08-11 10:15:20.761    platform: {}
08-11 10:15:20.761    replicas: 3
08-11 10:15:20.761  compute:
08-11 10:15:20.761  - architecture: amd64
08-11 10:15:20.761    hyperthreading: Enabled
08-11 10:15:20.761    name: worker
08-11 10:15:20.761    platform:
08-11 10:15:20.761      openstack:
08-11 10:15:20.761        type: ci.m1.large
08-11 10:15:20.761    replicas: 3
08-11 10:15:20.761  metadata:
08-11 10:15:20.761    name: xxia0811osp
08-11 10:15:20.761  platform:
08-11 10:15:20.761    openstack:
08-11 10:15:20.761      cloud: openstack
08-11 10:15:20.761      computeFlavor: ci.m1.xlarge
08-11 10:15:20.761      region: regionOne
08-11 10:15:20.762      trunkSupport: '1'
08-11 10:15:20.762      octaviaSupport: '0'
08-11 10:15:20.762      apiFloatingIP: 10.0.100.162
08-11 10:15:20.762      ingressFloatingIP: 10.0.100.31
08-11 10:15:20.762      externalNetwork: provider_net_cci_8
08-11 10:15:20.762  pullSecret: HIDDEN
08-11 10:15:20.762  networking:
08-11 10:15:20.762    clusterNetwork:
08-11 10:15:20.762    - cidr: 10.128.0.0/14
08-11 10:15:20.762      hostPrefix: 23
08-11 10:15:20.762    serviceNetwork:
08-11 10:15:20.762    - 172.30.0.0/16
08-11 10:15:20.762    machineNetwork:
08-11 10:15:20.762    - cidr: 192.168.0.0/18
08-11 10:15:20.762    networkType: OpenShiftSDN
08-11 10:15:20.762  publish: External

Comment 12 Hongan Li 2021-08-11 11:20:59 UTC
Hello, I think I found a way to work around/fix it.

Since this is IPI on OpenStack, we can see some coredns pods running in the openshift-openstack-infra namespace:
$ oc -n openshift-openstack-infra get pod -l app=openstack-infra-mdns
NAME                                       READY   STATUS    RESTARTS   AGE
coredns-xxia0811osp-kglss-master-0         2/2     Running   0          8h
coredns-xxia0811osp-kglss-master-1         2/2     Running   0          8h
coredns-xxia0811osp-kglss-master-2         2/2     Running   0          8h
coredns-xxia0811osp-kglss-worker-0-srm2f   2/2     Running   0          8h
coredns-xxia0811osp-kglss-worker-0-vp7nl   2/2     Running   0          8h
coredns-xxia0811osp-kglss-worker-0-w4g84   2/2     Running   0          8h

and after adding "bufsize 512" to the Corefile, the issues is dismissed.

sh-4.4# cat /etc/coredns/Corefile 
. {
    errors
    bufsize 512                                          #### <--- newly added parameter
    log
    health :18080
    forward . 10.11.142.1 {
        policy sequential
    }
    cache 30
    reload
    template IN A xxia0811osp.0811-atx.qe.rhcloud.com {
        match .*.apps.xxia0811osp.0811-atx.qe.rhcloud.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.0.7"
        fallthrough
    }
----<snip>-----


We can also get some useful logs from one of the pods:

$ oc -n openshift-openstack-infra logs coredns-xxia0811osp-kglss-worker-0-w4g84 -c coredns
.:53
[INFO] plugin/reload: Running configuration MD5 = 450aff8beef871d5f8b12fcd139aa9fb
CoreDNS-1.8.1
linux/amd64, go1.16.6, 
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = 5a6a27bb68cb52664b2a8b536462708b
[INFO] Reloading complete
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.168:46453->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.168:44383->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.168:43294->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.168:39006->10.11.142.1:53: i/o timeout

----------------< below are the logs after adding the parameter "bufsize 512":
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = 3cf503e80d2e0375cf7e949358997856
[INFO] Reloading complete
[INFO] 10.129.2.4:56291 - 59146 "A IN github.com. udp 51 true 512" NOERROR qr,rd,ra 54 0.003689755s
[INFO] 10.129.2.4:40069 - 65062 "A IN github.com. udp 51 true 512" NOERROR qr,rd,ra 54 0.003927262s
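
For reference, the effect of clamping the buffer can also be observed with plain dig against the upstream (a hedged example; +bufsize and +ignore are standard dig options):

dig +bufsize=512 github.com @10.11.142.1            # advertise a 512-byte EDNS0 buffer; a well-behaved upstream truncates oversized answers and dig retries over TCP
dig +bufsize=512 +ignore github.com @10.11.142.1    # keep the truncated UDP reply so the TC flag is visible in the header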

Comment 13 Miheer Salunke 2021-08-11 15:53:08 UTC
This https://bugzilla.redhat.com/show_bug.cgi?id=1970140#c0 seems to be a related issue where an explanation is also given.

Hongli, it seems you have already worked on this issue earlier. Thanks for pointing out this setting!

The reason this was working on my system is that I was on the latest z-stream version of OCP 4.7.

In the latest z-stream releases, starting from version 4.6, we have added 512 as the bufsize:

https://bugzilla.redhat.com/show_bug.cgi?id=1970140#c2


master branch - 512 setting 
https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/controller_dns_configmap.go


The fix setting bufsize to 512 was added here: https://github.com/openshift/cluster-dns-operator/pull/266/commits/1eed44376164dc1dcc6bf405c6a0daa2d29761ab
The explanation of why that was done is given in that PR.

This is a regression caused by the fix merged for Bug 1949361. It was fixed by the above PR.

Customers running workloads that use Go's built-in DNS resolver, such as Grafana Loki, needed to resolve DNS records exceeding 512 bytes, so the size was increased to a greater value, i.e. 1232.
https://bugzilla.redhat.com/show_bug.cgi?id=1949361#c19
 
This bug is a regression caused by the fix for Bug 1949361, which merged into 4.7.11 and 4.6.30.

Primitive DNS resolvers that cannot accept UDP DNS messages longer than 512 bytes were affected by the fix in https://bugzilla.redhat.com/show_bug.cgi?id=1949361#c19.
Note that DNS resolvers that retry lookups over TCP (such as dig) are not affected by this bug.

So bufsize was set back to 512 for bug https://bugzilla.redhat.com/show_bug.cgi?id=1970140#c2:
https://github.com/openshift/cluster-dns-operator/pull/266/commits/1eed44376164dc1dcc6bf405c6a0daa2d29761ab 

The permanent fix for this is to upgrade to the latest z-stream version of OCP 4.6 or 4.7, as this issue was introduced in 4.7.11 and 4.6.30.

I will be closing this bug, as we know the issue and the fix for it is already merged in the latest z-stream releases of 4.6 and 4.7.
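
As a quick (hedged) check of whether a given cluster already carries that fix, the rendered Corefile published by the DNS operator can be inspected with the same resources shown elsewhere in this bug:

oc get clusterversion                                                   # confirm the z-stream level
oc -n openshift-dns get configmap/dns-default -o yaml | grep bufsize    # fixed releases show "bufsize 512"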

Comment 14 Miheer Salunke 2021-08-11 15:54:40 UTC
The errata are mentioned in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1970140

Comment 15 Hongan Li 2021-08-12 01:26:43 UTC
Hi Miheer, 

I guess you misunderstood my comments; actually there are two kinds of coredns running in this cluster (IPI on OpenStack).

For the coredns container running inside the dns pods in the openshift-dns namespace, yes, you're right, it has been fixed, and it consumes the configmap in the openshift-dns namespace; see

$ oc -n openshift-dns get cm/dns-default -oyaml
apiVersion: v1
data:
  Corefile: |
    .:5353 {
        bufsize 512
        errors
        health {
            lameduck 20s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }

So, as you said, it already has bufsize 512 added.

But please read my comment 12: that is a different Corefile, for the coredns pods running in the openshift-openstack-infra namespace.
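
For reference, that host-level Corefile can be inspected directly from a node (a sketch; substitute a real node name):

oc debug node/<node-name> -- chroot /host cat /etc/coredns/Corefile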

Comment 16 Hongan Li 2021-08-12 01:35:42 UTC
And we can reproduce the issue with the latest 4.8 and 4.9 builds:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-08-09-135211   True        False         14h     Cluster version is 4.8.0-0.nightly-2021-08-09-135211

$ oc get infrastructures.config.openshift.io cluster -ojson | jq '.status'
{
  "apiServerInternalURI": "https://api-int.hongli-osp.0811-z9z.qe.rhcloud.com:6443",
  "apiServerURL": "https://api.hongli-osp.0811-z9z.qe.rhcloud.com:6443",
  "controlPlaneTopology": "HighlyAvailable",
  "etcdDiscoveryDomain": "",
  "infrastructureName": "hongli-osp-tb46w",
  "infrastructureTopology": "HighlyAvailable",
  "platform": "OpenStack",
  "platformStatus": {
    "openstack": {
      "apiServerInternalIP": "192.168.0.5",
      "ingressIP": "192.168.0.7"
    },
    "type": "OpenStack"
  }
}


################ pods running in openshift-dns namespace

$ oc -n openshift-dns get pod -owide
NAME                  READY   STATUS    RESTARTS   AGE   IP              NODE                              NOMINATED NODE   READINESS GATES
dns-default-2225b     2/2     Running   0          15h   10.128.2.8      hongli-osp-tb46w-worker-0-tc88z   <none>           <none>
dns-default-7pjgb     2/2     Running   0          15h   10.129.0.6      hongli-osp-tb46w-master-2         <none>           <none>
dns-default-g2gcb     2/2     Running   0          15h   10.130.0.5      hongli-osp-tb46w-master-1         <none>           <none>
dns-default-k2d88     2/2     Running   0          15h   10.128.0.47     hongli-osp-tb46w-master-0         <none>           <none>
dns-default-rbfl8     2/2     Running   0          15h   10.131.0.4      hongli-osp-tb46w-worker-0-fbbtc   <none>           <none>
dns-default-rv5jl     2/2     Running   0          14h   10.129.2.5      hongli-osp-tb46w-worker-0-rpk97   <none>           <none>
node-resolver-dfj66   1/1     Running   0          15h   192.168.2.232   hongli-osp-tb46w-master-2         <none>           <none>
node-resolver-fls7z   1/1     Running   0          15h   192.168.0.59    hongli-osp-tb46w-worker-0-tc88z   <none>           <none>
node-resolver-pphdd   1/1     Running   0          14h   192.168.3.13    hongli-osp-tb46w-worker-0-rpk97   <none>           <none>
node-resolver-qhsmc   1/1     Running   0          15h   192.168.1.104   hongli-osp-tb46w-master-0         <none>           <none>
node-resolver-qjfdk   1/1     Running   0          15h   192.168.1.169   hongli-osp-tb46w-master-1         <none>           <none>
node-resolver-tqtq9   1/1     Running   0          15h   192.168.0.123   hongli-osp-tb46w-worker-0-fbbtc   <none>           <none>


###################  pods running in openshift-openstack-infra namespace

$ oc -n openshift-openstack-infra get pod -l app=openstack-infra-mdns -owide
NAME                                      READY   STATUS    RESTARTS   AGE   IP              NODE                              NOMINATED NODE   READINESS GATES
coredns-hongli-osp-tb46w-master-0         2/2     Running   0          15h   192.168.1.104   hongli-osp-tb46w-master-0         <none>           <none>
coredns-hongli-osp-tb46w-master-1         2/2     Running   0          15h   192.168.1.169   hongli-osp-tb46w-master-1         <none>           <none>
coredns-hongli-osp-tb46w-master-2         2/2     Running   0          15h   192.168.2.232   hongli-osp-tb46w-master-2         <none>           <none>
coredns-hongli-osp-tb46w-worker-0-fbbtc   2/2     Running   0          15h   192.168.0.123   hongli-osp-tb46w-worker-0-fbbtc   <none>           <none>
coredns-hongli-osp-tb46w-worker-0-rpk97   2/2     Running   0          14h   192.168.3.13    hongli-osp-tb46w-worker-0-rpk97   <none>           <none>
coredns-hongli-osp-tb46w-worker-0-tc88z   2/2     Running   0          15h   192.168.0.59    hongli-osp-tb46w-worker-0-tc88z   <none>           <none>

Comment 18 Hongan Li 2021-08-12 04:35:02 UTC
Tried with an old 4.8 build and can still reproduce the issue:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-08-04-231543   True        False         43m     Cluster version is 4.8.0-0.nightly-2021-08-04-231543

$ oc get infrastructures.config.openshift.io cluster -ojson | jq '.status'
{
  "apiServerInternalURI": "https://api-int.hongli-old.0812-ybq.qe.rhcloud.com:6443",
  "apiServerURL": "https://api.hongli-old.0812-ybq.qe.rhcloud.com:6443",
  "controlPlaneTopology": "HighlyAvailable",
  "etcdDiscoveryDomain": "",
  "infrastructureName": "hongli-old-ssbw5",
  "infrastructureTopology": "HighlyAvailable",
  "platform": "OpenStack",
  "platformStatus": {
    "openstack": {
      "apiServerInternalIP": "192.168.0.5",
      "ingressIP": "192.168.0.7"
    },
    "type": "OpenStack"
  }
}

$ oc rsh centos-pod 
sh-4.4# dig github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27746
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: 250b3f88eb22da54 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1012 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Thu Aug 12 04:33:26 UTC 2021
;; MSG SIZE  rcvd: 51

sh-4.4#

Comment 19 Hongan Li 2021-08-12 04:44:25 UTC
And here are the logs from pod coredns-hongli-old-ssbw5-worker-0-bf9j4:

$ oc -n openshift-openstack-infra logs coredns-hongli-old-ssbw5-worker-0-bf9j4 -c coredns
.:53
[INFO] plugin/reload: Running configuration MD5 = 77dd189913861de66e1bc76c34cb92c1
CoreDNS-1.8.1
linux/amd64, go1.16.6, 
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = 3d993e5061ddb59e0a614c38a61ae220
[INFO] Reloading complete
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.146:51972->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.146:55730->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.146:54157->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.146:44817->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.146:58787->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.146:38345->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 github.com. A: read udp 192.168.3.146:54367->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 fedoraproject.org. A: read udp 192.168.3.146:44536->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 fedoraproject.org. A: read udp 192.168.3.146:43504->10.11.142.1:53: i/o timeout
[ERROR] plugin/errors: 2 fedoraproject.org. A: read udp 192.168.3.146:46024->10.11.142.1:53: i/o timeout

Comment 20 Miheer Salunke 2021-08-12 05:12:12 UTC
I reinstalled the cluster and am still not able to reproduce the issue.

I am comparing the QE environment and my environment now.

#oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-08-09-135211   True        False         26s     Cluster version is 4.8.0-0.nightly-2021-08-09-135211


#oc get network cluster -ojson | jq '.status.networkType'
"OpenShiftSDN"



[stack@standalone ~]$ 
[stack@standalone ~]$ oc -n openshift-openstack-infra  get pods
NAME                                     READY   STATUS    RESTARTS   AGE
coredns-ostest-2fr2n-master-0            2/2     Running   0          31m
coredns-ostest-2fr2n-master-1            2/2     Running   0          31m
coredns-ostest-2fr2n-master-2            2/2     Running   0          31m
coredns-ostest-2fr2n-worker-0-92sh7      2/2     Running   0          14m
coredns-ostest-2fr2n-worker-0-jgq9v      2/2     Running   0          17m
coredns-ostest-2fr2n-worker-0-pfwlx      2/2     Running   0          18m
haproxy-ostest-2fr2n-master-0            2/2     Running   0          31m
haproxy-ostest-2fr2n-master-1            2/2     Running   0          31m
haproxy-ostest-2fr2n-master-2            2/2     Running   0          31m
keepalived-ostest-2fr2n-master-0         2/2     Running   0          31m
keepalived-ostest-2fr2n-master-1         2/2     Running   0          31m
keepalived-ostest-2fr2n-master-2         2/2     Running   0          31m
keepalived-ostest-2fr2n-worker-0-92sh7   2/2     Running   0          14m
keepalived-ostest-2fr2n-worker-0-jgq9v   2/2     Running   0          17m
keepalived-ostest-2fr2n-worker-0-pfwlx   2/2     Running   0          18m
[stack@standalone ~]$ 
[stack@standalone ~]$ 
[stack@standalone ~]$ 
[stack@standalone ~]$ oc -n openshift-openstack-infra  rsh coredns-ostest-2fr2n-master-0
Defaulted container "coredns" out of: coredns, coredns-monitor, render-config-coredns (init)
sh-4.4# dig github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37492
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: e1edfba25e215f99 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		14	IN	A	140.82.114.4

;; Query time: 2 msec
;; SERVER: 10.0.1.233#53(10.0.1.233)
;; WHEN: Thu Aug 12 04:51:16 UTC 2021
;; MSG SIZE  rcvd: 77

sh-4.4# exit
exit
[stack@standalone ~]$ 
[stack@standalone ~]$ 
[stack@standalone ~]$ 
[stack@standalone ~]$ oc -n openshift-openstack-infra  get pods -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP           NODE                          NOMINATED NODE   READINESS GATES
coredns-ostest-2fr2n-master-0            2/2     Running   0          36m   10.0.1.233   ostest-2fr2n-master-0         <none>           <none>
coredns-ostest-2fr2n-master-1            2/2     Running   0          36m   10.0.0.139   ostest-2fr2n-master-1         <none>           <none>
coredns-ostest-2fr2n-master-2            2/2     Running   0          36m   10.0.2.135   ostest-2fr2n-master-2         <none>           <none>
coredns-ostest-2fr2n-worker-0-92sh7      2/2     Running   0          19m   10.0.3.197   ostest-2fr2n-worker-0-92sh7   <none>           <none>
coredns-ostest-2fr2n-worker-0-jgq9v      2/2     Running   0          22m   10.0.0.76    ostest-2fr2n-worker-0-jgq9v   <none>           <none>
coredns-ostest-2fr2n-worker-0-pfwlx      2/2     Running   0          23m   10.0.0.191   ostest-2fr2n-worker-0-pfwlx   <none>           <none>
haproxy-ostest-2fr2n-master-0            2/2     Running   0          36m   10.0.1.233   ostest-2fr2n-master-0         <none>           <none>
haproxy-ostest-2fr2n-master-1            2/2     Running   0          36m   10.0.0.139   ostest-2fr2n-master-1         <none>           <none>
haproxy-ostest-2fr2n-master-2            2/2     Running   0          36m   10.0.2.135   ostest-2fr2n-master-2         <none>           <none>
keepalived-ostest-2fr2n-master-0         2/2     Running   0          36m   10.0.1.233   ostest-2fr2n-master-0         <none>           <none>
keepalived-ostest-2fr2n-master-1         2/2     Running   0          36m   10.0.0.139   ostest-2fr2n-master-1         <none>           <none>
keepalived-ostest-2fr2n-master-2         2/2     Running   0          36m   10.0.2.135   ostest-2fr2n-master-2         <none>           <none>
keepalived-ostest-2fr2n-worker-0-92sh7   2/2     Running   0          19m   10.0.3.197   ostest-2fr2n-worker-0-92sh7   <none>           <none>
keepalived-ostest-2fr2n-worker-0-jgq9v   2/2     Running   0          21m   10.0.0.76    ostest-2fr2n-worker-0-jgq9v   <none>           <none>
keepalived-ostest-2fr2n-worker-0-pfwlx   2/2     Running   0          23m   10.0.0.191   ostest-2fr2n-worker-0-pfwlx   <none>           <none>
[stack@standalone ~]$ oc debug node/ostest-2fr2n-master-0
Starting pod/ostest-2fr2n-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.1.233
If you don't see a command prompt, try pressing enter.
sh-4.4#

sh-4.4# chroot /host
sh-4.4# cat /etc/coredns/Corefile 
. {
    errors
    health :18080
    forward . 10.11.5.19 10.10.160.2 10.5.30.160 {
        policy sequential
    }
    cache 30
    reload
    template IN A ostest.shiftstack.com {
        match .*.apps.ostest.shiftstack.com
        answer "{{ .Name }} 60 in {{ .Type }} 10.0.0.7"
        fallthrough
    }
    template IN AAAA ostest.shiftstack.com {
        match .*.apps.ostest.shiftstack.com
        fallthrough
    }
    template IN A ostest.shiftstack.com {
        match api.ostest.shiftstack.com
        answer "{{ .Name }} 60 in {{ .Type }} 10.0.0.5"
        fallthrough
    }
    template IN AAAA ostest.shiftstack.com {
        match api.ostest.shiftstack.com
        fallthrough
    }
    template IN A ostest.shiftstack.com {
        match api-int.ostest.shiftstack.com
        answer "{{ .Name }} 60 in {{ .Type }} 10.0.0.5"
        fallthrough
    }
    template IN AAAA ostest.shiftstack.com {
        match api-int.ostest.shiftstack.com
        fallthrough
    }
    hosts {
        10.0.1.233 ostest-2fr2n-master-0 ostest-2fr2n-master-0.ostest.shiftstack.com
        10.0.0.139 ostest-2fr2n-master-1 ostest-2fr2n-master-1.ostest.shiftstack.com
        10.0.2.135 ostest-2fr2n-master-2 ostest-2fr2n-master-2.ostest.shiftstack.com
        10.0.3.197 ostest-2fr2n-worker-0-92sh7 ostest-2fr2n-worker-0-92sh7.ostest.shiftstack.com
        10.0.0.76 ostest-2fr2n-worker-0-jgq9v ostest-2fr2n-worker-0-jgq9v.ostest.shiftstack.com
        10.0.0.191 ostest-2fr2n-worker-0-pfwlx ostest-2fr2n-worker-0-pfwlx.ostest.shiftstack.com
        fallthrough
    }
}
sh-4.4#

Comment 21 Miheer Salunke 2021-08-12 05:47:06 UTC
In your env it seems to be working. I don't see the bufsize setting added.


[miheer@localhost ~]$ vi qe.kubeconfig
[miheer@localhost ~]$ export KUBECONFIG=qe.kubeconfig 
[miheer@localhost ~]$ oc whoami
system:admin
[miheer@localhost ~]$ 
[miheer@localhost ~]$ 
[miheer@localhost ~]$ 
[miheer@localhost ~]$ oc get pods -n openshift-openstack-infra
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-hongli-osp-tb46w-master-0            2/2     Running   0          19h
coredns-hongli-osp-tb46w-master-1            2/2     Running   0          19h
coredns-hongli-osp-tb46w-master-2            2/2     Running   0          19h
coredns-hongli-osp-tb46w-worker-0-fbbtc      2/2     Running   0          19h
coredns-hongli-osp-tb46w-worker-0-rpk97      2/2     Running   0          18h
coredns-hongli-osp-tb46w-worker-0-tc88z      2/2     Running   0          19h
haproxy-hongli-osp-tb46w-master-0            2/2     Running   0          19h
haproxy-hongli-osp-tb46w-master-1            2/2     Running   0          19h
keepalived-hongli-osp-tb46w-master-0         2/2     Running   0          19h
keepalived-hongli-osp-tb46w-master-1         2/2     Running   0          19h
keepalived-hongli-osp-tb46w-master-2         2/2     Running   0          19h
keepalived-hongli-osp-tb46w-worker-0-fbbtc   2/2     Running   0          19h
keepalived-hongli-osp-tb46w-worker-0-rpk97   2/2     Running   0          18h
keepalived-hongli-osp-tb46w-worker-0-tc88z   2/2     Running   0          19h
[miheer@localhost ~]$ 
[miheer@localhost ~]$ 
[miheer@localhost ~]$ 
[miheer@localhost ~]$ 
[miheer@localhost ~]$ oc -n openshift-openstack-infra  rsh coredns-hongli-osp-tb46w-master-0 
Defaulted container "coredns" out of: coredns, coredns-monitor, render-config-coredns (init)
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# dig github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50069
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 16

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1200
; COOKIE: e962ec93d352c77b9ce204126114b2c0b7274f37f921bf9d (good)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		9	IN	A	140.82.113.3

;; AUTHORITY SECTION:
github.com.		300	IN	NS	dns3.p08.nsone.net.
github.com.		300	IN	NS	ns-421.awsdns-52.com.
github.com.		300	IN	NS	dns1.p08.nsone.net.
github.com.		300	IN	NS	ns-520.awsdns-01.net.
github.com.		300	IN	NS	ns-1283.awsdns-32.org.
github.com.		300	IN	NS	ns-1707.awsdns-21.co.uk.
github.com.		300	IN	NS	dns2.p08.nsone.net.
github.com.		300	IN	NS	dns4.p08.nsone.net.

;; ADDITIONAL SECTION:
dns1.p08.nsone.net.	69955	IN	A	198.51.44.8
dns2.p08.nsone.net.	69955	IN	A	198.51.45.8
dns3.p08.nsone.net.	69955	IN	A	198.51.44.72
dns4.p08.nsone.net.	69955	IN	A	198.51.45.72
ns-1283.awsdns-32.org.	156356	IN	A	205.251.197.3
ns-1707.awsdns-21.co.uk. 156356	IN	A	205.251.198.171
ns-421.awsdns-52.com.	171256	IN	A	205.251.193.165
ns-520.awsdns-01.net.	156355	IN	A	205.251.194.8
dns1.p08.nsone.net.	69955	IN	AAAA	2620:4d:4000:6259:7:8:0:1
dns2.p08.nsone.net.	69955	IN	AAAA	2a00:edc0:6259:7:8::2
dns3.p08.nsone.net.	69955	IN	AAAA	2620:4d:4000:6259:7:8:0:3
dns4.p08.nsone.net.	69955	IN	AAAA	2a00:edc0:6259:7:8::4
ns-1283.awsdns-32.org.	156356	IN	AAAA	2600:9000:5305:300::1
ns-1707.awsdns-21.co.uk. 156356	IN	AAAA	2600:9000:5306:ab00::1
ns-520.awsdns-01.net.	156355	IN	AAAA	2600:9000:5302:800::1

;; Query time: 1 msec
;; SERVER: 10.11.142.1#53(10.11.142.1)
;; WHEN: Thu Aug 12 05:33:52 UTC 2021
;; MSG SIZE  rcvd: 630

sh-4.4# exi
sh: exi: command not found
sh-4.4# exit
exit
command terminated with exit code 127
[miheer@localhost ~]$ 
[miheer@localhost ~]$ 
[miheer@localhost ~]$ oc get pods -n openshift-openstack-infra -w
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-hongli-osp-tb46w-master-0            2/2     Running   0          19h
coredns-hongli-osp-tb46w-master-1            2/2     Running   0          19h
coredns-hongli-osp-tb46w-master-2            2/2     Running   0          19h
coredns-hongli-osp-tb46w-worker-0-fbbtc      2/2     Running   0          19h
coredns-hongli-osp-tb46w-worker-0-rpk97      2/2     Running   0          18h
coredns-hongli-osp-tb46w-worker-0-tc88z      2/2     Running   0          19h
haproxy-hongli-osp-tb46w-master-0            2/2     Running   0          19h
haproxy-hongli-osp-tb46w-master-1            2/2     Running   0          19h
keepalived-hongli-osp-tb46w-master-0         2/2     Running   0          19h
keepalived-hongli-osp-tb46w-master-1         2/2     Running   0          19h
keepalived-hongli-osp-tb46w-master-2         2/2     Running   0          19h
keepalived-hongli-osp-tb46w-worker-0-fbbtc   2/2     Running   0          19h
keepalived-hongli-osp-tb46w-worker-0-rpk97   2/2     Running   0          18h
keepalived-hongli-osp-tb46w-worker-0-tc88z   2/2     Running   0          19h
^C[miheer@localhost ~]$ oc get pods -n openshift-openstack-infra -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP              NODE                              NOMINATED NODE   READINESS GATES
coredns-hongli-osp-tb46w-master-0            2/2     Running   0          19h   192.168.1.104   hongli-osp-tb46w-master-0         <none>           <none>
coredns-hongli-osp-tb46w-master-1            2/2     Running   0          19h   192.168.1.169   hongli-osp-tb46w-master-1         <none>           <none>
coredns-hongli-osp-tb46w-master-2            2/2     Running   0          19h   192.168.2.232   hongli-osp-tb46w-master-2         <none>           <none>
coredns-hongli-osp-tb46w-worker-0-fbbtc      2/2     Running   0          19h   192.168.0.123   hongli-osp-tb46w-worker-0-fbbtc   <none>           <none>
coredns-hongli-osp-tb46w-worker-0-rpk97      2/2     Running   0          18h   192.168.3.13    hongli-osp-tb46w-worker-0-rpk97   <none>           <none>
coredns-hongli-osp-tb46w-worker-0-tc88z      2/2     Running   0          19h   192.168.0.59    hongli-osp-tb46w-worker-0-tc88z   <none>           <none>
haproxy-hongli-osp-tb46w-master-0            2/2     Running   0          19h   192.168.1.104   hongli-osp-tb46w-master-0         <none>           <none>
haproxy-hongli-osp-tb46w-master-1            2/2     Running   0          19h   192.168.1.169   hongli-osp-tb46w-master-1         <none>           <none>
keepalived-hongli-osp-tb46w-master-0         2/2     Running   0          19h   192.168.1.104   hongli-osp-tb46w-master-0         <none>           <none>
keepalived-hongli-osp-tb46w-master-1         2/2     Running   0          19h   192.168.1.169   hongli-osp-tb46w-master-1         <none>           <none>
keepalived-hongli-osp-tb46w-master-2         2/2     Running   0          19h   192.168.2.232   hongli-osp-tb46w-master-2         <none>           <none>
keepalived-hongli-osp-tb46w-worker-0-fbbtc   2/2     Running   0          19h   192.168.0.123   hongli-osp-tb46w-worker-0-fbbtc   <none>           <none>
keepalived-hongli-osp-tb46w-worker-0-rpk97   2/2     Running   0          18h   192.168.3.13    hongli-osp-tb46w-worker-0-rpk97   <none>           <none>
keepalived-hongli-osp-tb46w-worker-0-tc88z   2/2     Running   0          19h   192.168.0.59    hongli-osp-tb46w-worker-0-tc88z   <none>           <none>
[miheer@localhost ~]$ oc debug node/coredns-hongli-osp-tb46w-master-0 
Error from server (NotFound): nodes "coredns-hongli-osp-tb46w-master-0" not found
[miheer@localhost ~]$ oc debug node coredns-hongli-osp-tb46w-master-0
Error from server (NotFound): pods "node" not found
[miheer@localhost ~]$ oc debug node/coredns-hongli-osp-tb46w-master-0
Error from server (NotFound): nodes "coredns-hongli-osp-tb46w-master-0" not found
[miheer@localhost ~]$ 
[miheer@localhost ~]$ 
[miheer@localhost ~]$ oc debug node/hongli-osp-tb46w-master-0 
Starting pod/hongli-osp-tb46w-master-0-debug ...
To use host binaries, run `chroot /host`



Pod IP: 192.168.1.104
If you don't see a command prompt, try pressing enter.

sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# chroot /host
sh-4.4# 
sh-4.4# 
sh-4.4# cat /etc/co
conntrackd/                    console-login-helper-messages/ containers/                    coredns/                       
sh-4.4# cat /etc/co
conntrackd/                    console-login-helper-messages/ containers/                    coredns/                       
sh-4.4# cat /etc/co
conntrackd/                    console-login-helper-messages/ containers/                    coredns/                       
sh-4.4# cat /etc/coredns/Corefile 
bin/     boot/    dev/     etc/     home/    lib/     lib64/   media/   mnt/     opt/     ostree/  proc/    root/    run/     sbin/    srv/     sys/     sysroot/ tmp/     usr/     var/     
sh-4.4# cat /etc/coredns/Corefile 
. {
    errors
    health :18080
    forward . 10.11.142.1 {
        policy sequential
    }
    cache 30
    reload
    template IN A hongli-osp.0811-z9z.qe.rhcloud.com {
        match .*.apps.hongli-osp.0811-z9z.qe.rhcloud.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.0.7"
        fallthrough
    }
    template IN AAAA hongli-osp.0811-z9z.qe.rhcloud.com {
        match .*.apps.hongli-osp.0811-z9z.qe.rhcloud.com
        fallthrough
    }
    template IN A hongli-osp.0811-z9z.qe.rhcloud.com {
        match api.hongli-osp.0811-z9z.qe.rhcloud.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.0.5"
        fallthrough
    }
    template IN AAAA hongli-osp.0811-z9z.qe.rhcloud.com {
        match api.hongli-osp.0811-z9z.qe.rhcloud.com
        fallthrough
    }
    template IN A hongli-osp.0811-z9z.qe.rhcloud.com {
        match api-int.hongli-osp.0811-z9z.qe.rhcloud.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.0.5"
        fallthrough
    }
    template IN AAAA hongli-osp.0811-z9z.qe.rhcloud.com {
        match api-int.hongli-osp.0811-z9z.qe.rhcloud.com
        fallthrough
    }
    hosts {
        192.168.1.104 hongli-osp-tb46w-master-0 hongli-osp-tb46w-master-0.hongli-osp.0811-z9z.qe.rhcloud.com
        192.168.1.169 hongli-osp-tb46w-master-1 hongli-osp-tb46w-master-1.hongli-osp.0811-z9z.qe.rhcloud.com
        192.168.2.232 hongli-osp-tb46w-master-2 hongli-osp-tb46w-master-2.hongli-osp.0811-z9z.qe.rhcloud.com
        192.168.0.123 hongli-osp-tb46w-worker-0-fbbtc hongli-osp-tb46w-worker-0-fbbtc.hongli-osp.0811-z9z.qe.rhcloud.com
        192.168.3.13 hongli-osp-tb46w-worker-0-rpk97 hongli-osp-tb46w-worker-0-rpk97.hongli-osp.0811-z9z.qe.rhcloud.com
        192.168.0.59 hongli-osp-tb46w-worker-0-tc88z hongli-osp-tb46w-worker-0-tc88z.hongli-osp.0811-z9z.qe.rhcloud.com
        fallthrough
    }
}
sh-4.4#

Comment 22 Miheer Salunke 2021-08-12 05:57:00 UTC
[stack@standalone ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-08-09-135211   True        False         68m     Cluster version is 4.8.0-0.nightly-2021-08-09-135211
[stack@standalone ~]$ 


[stack@standalone ~]$ oc new-app django-psql-example
--> Deploying template "openshift/django-psql-example" to project test-dig

     Django + PostgreSQL (Ephemeral)
     ---------
     An example Django application with a PostgreSQL database. For more information about using this template, including OpenShift considerations, see https://github.com/sclorg/django-ex/blob/master/README.md.
     
     WARNING: Any data stored will be lost upon pod destruction. Only use this template for testing.

     The following service(s) have been created in your project: django-psql-example, postgresql.
     
     For more information about using this template, including OpenShift considerations, see https://github.com/sclorg/django-ex/blob/master/README.md.

     * With parameters:
        * Name=django-psql-example
        * Namespace=openshift
        * Version of Python Image=3.8-ubi8
        * Version of PostgreSQL Image=12-el8
        * Memory Limit=512Mi
        * Memory Limit (PostgreSQL)=512Mi
        * Git Repository URL=https://github.com/sclorg/django-ex.git
        * Git Reference=
        * Context Directory=
        * Application Hostname=
        * GitHub Webhook Secret=4Y7e0esiTuIsA3gLvgIG0QDLdfMi2yCKPcNBXX6N # generated
        * Database Service Name=postgresql
        * Database Engine=postgresql
        * Database Name=default
        * Database Username=django
        * Database User Password=LUUmGyHpBWOgciFT # generated
        * Application Configuration File Path=
        * Django Secret Key=_YtvYl6jFtBeGxhpF9pKKdOYo0huybyIjldJeh_cWc3Dvm2eok # generated
        * Custom PyPi Index URL=

--> Creating resources ...
    secret "django-psql-example" created
    service "django-psql-example" created
    route.route.openshift.io "django-psql-example" created
    imagestream.image.openshift.io "django-psql-example" created
    buildconfig.build.openshift.io "django-psql-example" created
    deploymentconfig.apps.openshift.io "django-psql-example" created
    service "postgresql" created
    deploymentconfig.apps.openshift.io "postgresql" created
--> Success
    Access your application via route 'django-psql-example-test-dig.apps.ostest.shiftstack.com' 
    Build scheduled, use 'oc logs -f buildconfig/django-psql-example' to track its progress.
    Run 'oc status' to view your app.
[stack@standalone ~]$ oc get pods -w
NAME                          READY   STATUS              RESTARTS   AGE
django-psql-example-1-build   0/1     Init:0/2            0          6s
postgresql-1-6dlsp            0/1     ContainerCreating   0          3s
postgresql-1-deploy           1/1     Running             0          6s
django-psql-example-1-build   0/1     Init:0/2            0          12s
django-psql-example-1-build   0/1     Init:1/2            0          13s
django-psql-example-1-build   0/1     PodInitializing     0          14s
django-psql-example-1-build   1/1     Running             0          15s
postgresql-1-6dlsp            0/1     Running             0          27s
^C[stack@standalone ~]$ 
[stack@standalone ~]$ 
[stack@standalone ~]$ 
[stack@standalone ~]$ oc get pods -w
NAME                          READY   STATUS      RESTARTS   AGE
django-psql-example-1-build   0/1     Error       0          74s
postgresql-1-6dlsp            1/1     Running     0          71s
postgresql-1-deploy           0/1     Completed   0          74s
^C[stack@standalone ~]$ oc rsh postgresql-1-6dlsp
sh-4.4$ dig github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26594
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: db6c8bc02c151119 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		27	IN	A	140.82.113.4

;; Query time: 4 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Thu Aug 12 05:55:22 UTC 2021
;; MSG SIZE  rcvd: 77

sh-4.4$

Comment 23 Miheer Salunke 2021-08-12 06:03:33 UTC
In your env, it does fail from the pod:

[miheer@localhost ~]$ oc -n hongli1  rsh centos-pod
sh-4.4# dig github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40868
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: 7157f3417ff09c28 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1004 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Thu Aug 12 05:59:12 UTC 2021
;; MSG SIZE  rcvd: 51

sh-4.4# 
sh-4.4# 
sh-4.4# exit
exit
[miheer@localhost ~]$

Comment 24 Miheer Salunke 2021-08-12 08:49:22 UTC
As per our discussion, we figured out that the issue was with the external DNS.

When I asked you to add the IP of the external DNS from my env to your forwarders in the Corefile of the KNI coredns, your queries to github.com worked.

From the dig output, the difference I see between your env's DNS and my env's DNS is that your external nameserver does not have recursion available.
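
A quick way to check whether a given resolver offers recursion is to query it directly and look at the header flags (a generic check; replace the IP with the nameserver in question):

dig @10.11.142.1 github.com | grep ';; flags'    # "ra" in the flags means recursion is available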
 
Your env

[miheer@localhost ~]$ oc rsh postgresql-1-pnstq
sh-4.4$ dig github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 10896
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1   =============== ra flag is missing
;; WARNING: recursion requested but not available ============================warning message also there

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: e51c276df6774144 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1004 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Thu Aug 12 06:06:58 UTC 2021
;; MSG SIZE  rcvd: 51

sh-4.4$ cat /etc/resolv.conf 
search test-dig.svc.cluster.local svc.cluster.local cluster.local hongli-osp.0811-z9z.qe.rhcloud.com
nameserver 172.30.0.10
options ndots:5
sh-4.4$ [miheer@localhost ~]$ 



My env

[stack@standalone ~]$ oc rsh postgresql-1-6dlsp
sh-4.4$ dig github.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26594
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1  =================================see the ra flag

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: db6c8bc02c151119 (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		27	IN	A	140.82.113.4

;; Query time: 4 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Thu Aug 12 05:55:22 UTC 2021
;; MSG SIZE  rcvd: 77

sh-4.4$ exit
exit

I recommend you check the external nameserver config with the concerned team.

Comment 25 Miheer Salunke 2021-08-12 13:05:50 UTC
Please capture a tcpdump from the node where the KNI coredns pod is running.

Run the following to capture a tcpdump on the node:

[stack@standalone ~]$ oc debug node/ostest-fxm7v-master-0 --image=rhel7/rhel-tools
Starting pod/ostest-fxm7v-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.3.140
If you don't see a command prompt, try pressing enter.
sh-4.2# ip a                   
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1442 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:df:66:7c brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.140/16 brd 10.0.255.255 scope global noprefixroute dynamic ens3
       valid_lft 40796sec preferred_lft 40796sec
    inet6 fe80::4c01:a788:ba5d:2b6d/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
sh-4.2# 


sh-4.2# tcpdump -i ens3 -nn port 53 or port 5353 -w dns-node.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C112 packets captured
122 packets received by filter
0 packets dropped by kernel


oc rsh to pod then run dig github.com

Then on the node ->
sh-4.2# mkdir -p /host/var/tmp/tcpdump

sh-4.2# cp dns-node.pcap  /host/var/tmp/tcpdump

sh-4.2# tshark -n -r dns-node.pcap

SSH to the node and tar the files:

tar cvJf /var/tmp/tcpdump.tar.xz /var/tmp/tcpdump/

Provide us the tar file.

Once we see that we are not getting a response or an answer from the external DNS, then definitely something is happening at the external DNS.

Comment 26 Johnny Liu 2021-08-12 20:40:17 UTC
We are using the Red Hat shared OpenStack deployment - PSI OpenStack - to install the OCP cluster.

We just ran a simple IPI install (the install-config.yaml was mentioned in comment 11). The cluster has only 1 external DNS, which is set to 10.11.142.1 (actually I have no idea where it comes from; `host 10.11.142.1` ---> ns01.util.rdu2.redhat.com).

And from the above testing, if I directly run `dig @10.11.142.1 github.com`, everything goes well (the ra flag is not missing). So I think the external DNS has no issue.


Per my understanding, the data flow is like this:
dns query from user app pod --> 172.30.0.10 --> dns-default in the openshift-dns namespace --> node-ip:53 (the 1st nameserver) --> coredns in the openshift-openstack-infra namespace --> external DNS

So the above testing is executed using `dig @node-ip github.com`; this issue can always be reproduced.
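
To isolate which hop in that chain fails, each resolver can be queried directly (a sketch using addresses from this environment; adjust node-ip and the upstream IP as needed):

dig @172.30.0.10 github.com     # cluster DNS service (dns-default), reachable from inside a pod
dig @<node-ip> github.com       # KNI coredns listening on the node
dig @10.11.142.1 github.com     # external/upstream DNS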

I also updated the external DNS from 10.11.142.1 to 10.11.5.19 (the one in your env); after that, `dig @node-ip github.com` also worked in my env.

So let us run the dig command against these two external DNS servers:

[root@preserve-jialiu-ansible ~]# dig @10.11.142.1 github.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @10.11.142.1 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2784
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 16

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1200
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		26	IN	A	140.82.112.3

;; AUTHORITY SECTION:
github.com.		252	IN	NS	dns2.p08.nsone.net.
github.com.		252	IN	NS	ns-1707.awsdns-21.co.uk.
github.com.		252	IN	NS	ns-520.awsdns-01.net.
github.com.		252	IN	NS	dns4.p08.nsone.net.
github.com.		252	IN	NS	dns1.p08.nsone.net.
github.com.		252	IN	NS	ns-421.awsdns-52.com.
github.com.		252	IN	NS	ns-1283.awsdns-32.org.
github.com.		252	IN	NS	dns3.p08.nsone.net.

;; ADDITIONAL SECTION:
dns1.p08.nsone.net.	16805	IN	A	198.51.44.8
dns2.p08.nsone.net.	16805	IN	A	198.51.45.8
dns3.p08.nsone.net.	16805	IN	A	198.51.44.72
dns4.p08.nsone.net.	16805	IN	A	198.51.45.72
ns-1283.awsdns-32.org.	103206	IN	A	205.251.197.3
ns-1707.awsdns-21.co.uk. 103206	IN	A	205.251.198.171
ns-421.awsdns-52.com.	134703	IN	A	205.251.193.165
ns-520.awsdns-01.net.	103205	IN	A	205.251.194.8
dns1.p08.nsone.net.	16805	IN	AAAA	2620:4d:4000:6259:7:8:0:1
dns2.p08.nsone.net.	16805	IN	AAAA	2a00:edc0:6259:7:8::2
dns3.p08.nsone.net.	16805	IN	AAAA	2620:4d:4000:6259:7:8:0:3
dns4.p08.nsone.net.	16805	IN	AAAA	2a00:edc0:6259:7:8::4
ns-1283.awsdns-32.org.	103206	IN	AAAA	2600:9000:5305:300::1
ns-1707.awsdns-21.co.uk. 103206	IN	AAAA	2600:9000:5306:ab00::1
ns-520.awsdns-01.net.	103205	IN	AAAA	2600:9000:5302:800::1

;; Query time: 1 msec
;; SERVER: 10.11.142.1#53(10.11.142.1)
;; WHEN: Thu Aug 12 16:18:17 EDT 2021
;; MSG SIZE  rcvd: 602



[root@preserve-jialiu-ansible ~]# dig @10.11.5.19 github.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @10.11.5.19 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56632
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		19	IN	A	140.82.112.4

;; Query time: 1 msec
;; SERVER: 10.11.5.19#53(10.11.5.19)
;; WHEN: Thu Aug 12 16:18:32 EDT 2021
;; MSG SIZE  rcvd: 55

Obviously the two DNS servers respond with different messages. I am not an expert on DNS, but I suspect that if the response message from 10.11.142.1 is too long, coredns fails it due to some configuration and does not forward the message to the client?
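
One way to test that suspicion directly against the upstream server is to advertise a small EDNS buffer and compare with a TCP query (a sketch using standard dig options):

dig +bufsize=512 @10.11.142.1 github.com    # a well-behaved server should truncate the reply and set the TC bit
dig +tcp @10.11.142.1 github.com            # the full answer should come back over TCP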



In comment 21, the test was done in the coredns pod, whose /etc/resolv.conf has two nameservers (the same context shared with the host), but /etc/resolv.conf in a common user app pod has only 1 nameserver, 172.30.0.10, so the query is handled in a different way. From the output, I can see the answering DNS server is 10.11.142.1, not the node IP; per the resolv.conf man page, I think that is because the 1st nameserver (the node IP) returned SERVFAIL, so the query then tried the 2nd nameserver (10.11.142.1) and succeeded.


For comment 25, I could not run the `tshark -n -r dns-node.pcap` step because the `tshark` command is not found in the container, so I have no way to provide the tar log file.

Comment 27 Miheer Salunke 2021-08-13 03:28:31 UTC
Hi Johnny,

It is tcpdump, not tshark. The tshark command I wrote later was from my env, so please ignore it. However, I did share the tcpdump command in comment 25.

dig from the KNI coredns (host network) works fine even without adding bufsize.

dig from an app pod (container network) fails, and works only if bufsize 512 is added to the Corefile of the KNI coredns.

This might be an MTU issue. I will need to verify this with tcpdumps.

Can you or Hongan Li give access to an env?

A) Please capture a tcpdump from the node where the KNI coredns pod is running.

Run the following to capture a tcpdump on the node:

[stack@standalone ~]$ oc debug node/ostest-fxm7v-master-0 --image=rhel7/rhel-tools
Starting pod/ostest-fxm7v-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.3.140
If you don't see a command prompt, try pressing enter.
sh-4.2# ip a                   
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1442 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:df:66:7c brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.140/16 brd 10.0.255.255 scope global noprefixroute dynamic ens3
       valid_lft 40796sec preferred_lft 40796sec
    inet6 fe80::4c01:a788:ba5d:2b6d/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
sh-4.2# 


sh-4.2# tcpdump -i ens3 -nn port 53 or port 5353 -w dns-node.pcap
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C112 packets captured
122 packets received by filter
0 packets dropped by kernel


B) You also need to capture a tcpdump on the pod:

1. First, start the tcpdump capture on the node, as per my earlier comment.

2. Then run tcpdump from the pod in the container network:
 
 oc debug node/<your node name>  --image=rhel7/rhel-tools

3. sh-4.4# crictl ps | grep <pod>

4. sh-4.4# crictl inspect  <id>

5. At the end you will see
  "pid": 5436,
  "sandboxId": "ad8043d0d403586154b5146e4f5ec7967247b342ae25d5392fd45bd09b357bf8"
}

That is your pid

Now start tcpdump using nsenter with the DNS pod's pid, capturing port 5353 or 53.
Port 5353 will show packets to and from CoreDNS; capturing on port 53 as well will also show packets that CoreDNS forwards to upstream resolvers.

sh-4.2# nsenter -t 5436 -n -- tcpdump -i any -nn 'port 53 or port 5353' -w dns-pod.pcap &
 
 5436 is the pid. Replace it with yours.

sh-4.2# POD_TCPDUMP_ID=$!

Once you see the issue, run the following:

sh-4.2#  kill $POD_TCPDUMP_ID

sh-4.2# mkdir -p /host/var/tmp/tcpdump

sh-4.2# cp dns-node.pcap  /host/var/tmp/tcpdump

sh-4.2# cp dns-pod.pcap  /host/var/tmp/tcpdump


SSH to the node and tar the files:

tar cvJf /var/tmp/tcpdump.tar.xz /var/tmp/tcpdump/

Provide us the tar file.


Thanks and regards,
Miheer

Comment 28 Miheer Salunke 2021-08-13 03:35:05 UTC
If it turns out to be an MTU issue, we can suggest that the KNI folks add bufsize 512 to https://github.com/openshift/machine-config-operator/blob/master/templates/common/on-prem/files/coredns-corefile.yaml.

Comment 38 Miheer Salunke 2021-08-24 16:00:35 UTC
10.128.2.2: coredns managed by the DNS operator.
192.168.1.65: KNI coredns running on the host network.
10.11.142.1: upstream nameserver.

From the following, I do see some issue happening at the KNI CoreDNS level, as it is not returning the response to the coredns pod.

dns-pod-coredns.pcap


udp.stream eq 10  

20	2021-08-16 20:33:18.167408	10.128.2.2	192.168.1.65	DNS	95	Standard query 0xbb55 A github.com OPT
22	2021-08-16 20:33:18.167408	10.128.2.2	192.168.1.65	DNS	95	Standard query 0xbb55 A github.com OPT
171	2021-08-16 20:33:24.171437	192.168.1.65	10.128.2.2	DNS	95	Standard query response 0xbb55 Server failure A github.com OPT
172	2021-08-16 20:33:24.171460	192.168.1.65	10.128.2.2	DNS	95	Standard query response 0xbb55 Server failure A github.com OPT



I also see the upstream nameserver sending a DNS response to the KNI coredns:

udp.stream eq 14


27	2021-08-16 20:33:18.167851	192.168.1.65	10.11.142.1	DNS	61	Standard query 0x4f90 NS <Root>
30	2021-08-16 20:33:18.169323	10.11.142.1	192.168.1.65	DNS	536	Standard query response 0x4f90 NS <Root> NS l.root-servers.net NS a.root-servers.net NS h.root-servers.net NS i.root-servers.net NS b.root-servers.net NS e.root-servers.net NS j.root-servers.net NS f.root-servers.net NS g.root-servers.net NS c.root-servers.net NS k.root-servers.net NS d.root-servers.net NS m.root-servers.net A 202.12.27.33 A 199.9.14.201 A 192.33.4.12 A 199.7.91.13 A 192.203.230.10 A 192.5.5.241 A 192.112.36.4 A 198.97.190.53 A 198.41.0.4 A 192.36.148.17 A 192.58.128.30 A 193.0.14.129 A 199.7.83.42 AAAA 2001:dc3::35 AAAA 2001:500:200::b
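
(For reference, summaries like the two streams above can be extracted from the capture with a display filter, e.g. on a workstation with tshark installed; the field names used here are standard Wireshark DNS fields:)

tshark -r dns-pod-coredns.pcap -Y 'udp.stream == 10 && dns' -T fields -e frame.number -e ip.src -e ip.dst -e dns.qry.name -e dns.flags.rcode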


And after adding "bufsize 512" to the Corefile, the issue goes away.

sh-4.4# cat /etc/coredns/Corefile 
. {
    errors
    bufsize 512                                          #### <--- new added parameter
    log
    health :18080
    forward . 10.11.142.1 {
        policy sequential
    }
    cache 30
    reload
    template IN A test1.test2.test3.rhcloud.com {
        match .*.apps.test1.test2.test3.rhcloud.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.0.x"
        fallthrough
    }
----<snip>-----



Having said this, when the forwarder 10.11.142.1 was replaced with a different one, 10.11.5.19, that also resolved the issue without needing the bufsize setting.

In summary, the problem seems to be in the KNI coredns sending a response to the CoreDNS pods. The KNI coredns pods do seem to be getting a DNS response from the upstream nameserver 10.11.142.1.

As mentioned earlier, this might be related to MTU. The difference in the dig to github.com is the response size: for 10.11.142.1 it is 602 and for 10.11.5.19 it is 55. I will dig into this again tomorrow and get back to you.

 dig @10.11.142.1 github.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @10.11.142.1 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2784
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 16

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1200
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		26	IN	A	140.82.112.3

;; AUTHORITY SECTION:
github.com.		252	IN	NS	dns2.p08.nsone.net.
github.com.		252	IN	NS	ns-1707.awsdns-21.co.uk.
github.com.		252	IN	NS	ns-520.awsdns-01.net.
github.com.		252	IN	NS	dns4.p08.nsone.net.
github.com.		252	IN	NS	dns1.p08.nsone.net.
github.com.		252	IN	NS	ns-421.awsdns-52.com.
github.com.		252	IN	NS	ns-1283.awsdns-32.org.
github.com.		252	IN	NS	dns3.p08.nsone.net.

;; ADDITIONAL SECTION:
dns1.p08.nsone.net.	16805	IN	A	198.51.44.8
dns2.p08.nsone.net.	16805	IN	A	198.51.45.8
dns3.p08.nsone.net.	16805	IN	A	198.51.44.72
dns4.p08.nsone.net.	16805	IN	A	198.51.45.72
ns-1283.awsdns-32.org.	103206	IN	A	205.251.197.3
ns-1707.awsdns-21.co.uk. 103206	IN	A	205.251.198.171
ns-421.awsdns-52.com.	134703	IN	A	205.251.193.165
ns-520.awsdns-01.net.	103205	IN	A	205.251.194.8
dns1.p08.nsone.net.	16805	IN	AAAA	2620:4d:4000:6259:7:8:0:1
dns2.p08.nsone.net.	16805	IN	AAAA	2a00:edc0:6259:7:8::2
dns3.p08.nsone.net.	16805	IN	AAAA	2620:4d:4000:6259:7:8:0:3
dns4.p08.nsone.net.	16805	IN	AAAA	2a00:edc0:6259:7:8::4
ns-1283.awsdns-32.org.	103206	IN	AAAA	2600:9000:5305:300::1
ns-1707.awsdns-21.co.uk. 103206	IN	AAAA	2600:9000:5306:ab00::1
ns-520.awsdns-01.net.	103205	IN	AAAA	2600:9000:5302:800::1

;; Query time: 1 msec
;; SERVER: 10.11.142.1#53(10.11.142.1)
;; WHEN: Thu Aug 12 16:18:17 EDT 2021
;; MSG SIZE  rcvd: 602



 dig @10.11.5.19 github.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> @10.11.5.19 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56632
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		19	IN	A	140.82.112.4

;; Query time: 1 msec
;; SERVER: 10.11.5.19#53(10.11.5.19)
;; WHEN: Thu Aug 12 16:18:32 EDT 2021
;; MSG SIZE  rcvd: 55


But again, as per comment 34, this worked in the earlier versions of OCP 4.6/4.7.

So there might be some code-level changes that happened on the KNI coredns side.

Comment 39 Miheer Salunke 2021-08-24 16:07:42 UTC
*** Bug 1995114 has been marked as a duplicate of this bug. ***

Comment 40 Miheer Salunke 2021-08-24 22:45:45 UTC
The limit for UDP DNS messages is 512 bytes. Well-behaved DNS servers are supposed to truncate the message and set the truncated bit. See RFC 1035 section 4.2.1.

https://datatracker.ietf.org/doc/html/rfc1035#section-4.2.1

https://datatracker.ietf.org/doc/html/rfc1035#section-2.3.4

The difference in the dig to github.com is the response size: for 10.11.142.1 it is 602 (> 512) and for 10.11.5.19 it is 55 (< 512).

CoreDNS will compress messages that exceed 512 bytes, unless the client allows a larger maximum size by sending the corresponding edns0 option in the request.

dig in particular sends a buffer size > 512 by default. I think the exact number depends on the dig version or perhaps the environment; on my OCP nodes it defaults to 4096, which I think is most common.
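
To check what EDNS buffer size a particular dig build advertises, the outgoing query can be printed alongside the answer (a quick check, not specific to this environment):

dig +qr github.com | grep 'udp:'    # the "udp:" value in the query's OPT pseudosection is the advertised buffer size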

[miheer@localhost ~]$ oc debug node/mykrbid-vcd8j-worker-0-hlkmf
Starting pod/mykrbid-vcd8j-worker-0-hlkmf-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.0.94
If you don't see a command prompt, try pressing enter.
sh-4.4#
sh-4.4# sysctl -a | grep  rmem
net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.ipv4.tcp_rmem = 4096	87380	6291456
net.ipv4.udp_rmem_min = 4096

So, we should set the bufsize to 512 for the KNI coredns to avoid this issue.

For now, as a workaround, you will need to add "bufsize 512" to the Corefile on the OCP node:

[miheer@localhost ~]$ #oc debug node/hongli-osp-tb46w-master-0 --image=rhel7/rhel-tools
[miheer@localhost ~]$ oc debug node/mykrbid-vcd8j-worker-0-hlkmf
Starting pod/mykrbid-vcd8j-worker-0-hlkmf-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.0.94
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host/  
sh-4.4# vi /etc/coredns/Corefile 


sh-4.4# cat /etc/coredns/Corefile 
. {
    errors
    bufsize 512                                          #### <--- new added parameter
    log
    health :18080
    forward . 10.11.142.1 {
        policy sequential
    }
    cache 30
    reload
    template IN A test1.test2.test3.rhcloud.com {
        match .*.apps.test1.test2.test3.rhcloud.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.0.x"
        fallthrough
    }
----<snip>-----

I will add a long term fix in https://github.com/openshift/machine-config-operator/blob/master/templates/common/on-prem/files/coredns-corefile.yaml and send the PR accordingly.

Comment 41 Miheer Salunke 2021-08-24 22:58:00 UTC
Added a PR: https://github.com/openshift/machine-config-operator/pull/2730

Comment 42 Martin André 2021-08-25 12:32:08 UTC
*** Bug 1963081 has been marked as a duplicate of this bug. ***

Comment 43 Pavol Pitonak 2021-08-26 06:11:55 UTC
I can confirm that the workaround from comment #40 worked on a 4.8.7 cluster.

Comment 44 Miciah Dashiel Butler Masters 2021-09-03 16:13:51 UTC
Today is code freeze, and the PR is still blocked on approval, so I'm resetting the target release.  We'll fix this in 4.10.0 and backport the fix to 4.9.z.

Comment 45 Pavol Pitonak 2021-09-29 08:03:22 UTC
The workaround from comment #40 worked fine for 4 weeks. We didn't perform any cluster update on purpose, but it was still reverted today (by the cluster itself, not by humans). Is there any way to make this configuration permanent?

Comment 48 Hongan Li 2021-09-30 10:08:49 UTC
Verified with 4.10.0-0.nightly-2021-09-30-041351 on IPI on OpenStack and passed

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-30-041351   True        False         137m    Cluster version is 4.10.0-0.nightly-2021-09-30-041351

$ oc debug node/hongli-iosp-n5w95-worker-0-7h7sc
To use host binaries, run `chroot /host`

Pod IP: 192.168.3.96
If you don't see a command prompt, try pressing enter.

sh-4.4# chroot /host
sh-4.4# cat /etc/coredns/Corefile 
. {
    errors
    bufsize 512
    health :18080
    forward . 10.11.142.1 {
        policy sequential
    }
<-----snip----->


$ oc rsh centos-pod
sh-4.4# 
sh-4.4# dig github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41782
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: 09c6be442d8224ad (echoed)
;; QUESTION SECTION:
;github.com.			IN	A

;; ANSWER SECTION:
github.com.		30	IN	A	140.82.114.3

;; Query time: 106 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Thu Sep 30 08:07:18 UTC 2021
;; MSG SIZE  rcvd: 77

sh-4.4# dig fedoraproject.org
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> fedoraproject.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24729
;; flags: qr rd ra; QUERY: 1, ANSWER: 10, AUTHORITY: 4, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: 1cc7d822640aa3f1 (echoed)
;; QUESTION SECTION:
;fedoraproject.org.		IN	A

;; ANSWER SECTION:
fedoraproject.org.	10	IN	A	8.43.85.67
fedoraproject.org.	10	IN	A	67.219.144.68
fedoraproject.org.	10	IN	A	209.132.190.2
fedoraproject.org.	10	IN	A	8.43.85.73
fedoraproject.org.	10	IN	A	152.19.134.198
fedoraproject.org.	10	IN	A	152.19.134.142
fedoraproject.org.	10	IN	A	38.145.60.20
fedoraproject.org.	10	IN	A	38.145.60.21
fedoraproject.org.	10	IN	A	140.211.169.196
fedoraproject.org.	10	IN	A	140.211.169.206

;; AUTHORITY SECTION:
fedoraproject.org.	10	IN	NS	ns-iad02.fedoraproject.org.
fedoraproject.org.	10	IN	NS	ns05.fedoraproject.org.
fedoraproject.org.	10	IN	NS	ns-iad01.fedoraproject.org.
fedoraproject.org.	10	IN	NS	ns02.fedoraproject.org.

;; ADDITIONAL SECTION:
ns-iad02.fedoraproject.org. 10	IN	A	38.145.60.14
ns-iad01.fedoraproject.org. 10	IN	A	38.145.60.13
ns02.fedoraproject.org.	10	IN	A	152.19.134.139
ns05.fedoraproject.org.	10	IN	A	85.236.55.10
ns02.fedoraproject.org.	10	IN	AAAA	2610:28:3090:3001:dead:beef:cafe:fed5
ns05.fedoraproject.org.	10	IN	AAAA	2001:4178:2:1269:dead:beef:cafe:fed5

;; Query time: 36 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Thu Sep 30 08:10:03 UTC 2021
;; MSG SIZE  rcvd: 868

sh-4.4#

Comment 51 errata-xmlrpc 2022-03-12 04:37:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

