Bug 1733867

Summary: [IPI] [OSP] All requests relying on the front proxy fail because node names cannot be resolved within the cluster
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Reporter: weiwei jiang <wjiang>
Assignee: Tomas Sedovic <tsedovic>
QA Contact: weiwei jiang <wjiang>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: adam.kaplan, eduen, egarcia, ncredi, ppitonak, scuppett, tsedovic, wewang, wzheng, xiuwang, yanpzhan
Version: 4.2.0
Keywords: TestBlocker
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: DFG:OSasInfra
Type: Bug
Last Closed: 2019-10-16 06:33:51 UTC
Bug Blocks: 1733892

Description weiwei jiang 2019-07-29 05:42:09 UTC
Description of problem:
As a user, I want to run oc logs/exec/debug/rsh/cp against a pod scheduled on a worker node, but all of these operations fail.

# oc -n openshift-image-registry cp node-ca-wqd7t:/etc/hosts /tmp/hosts
Error from server: error dialing backend: dial tcp: lookup wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host  
#oc -n openshift-image-registry logs -f image-registry-6bbb45b7fc-vck66
Error from server: Get https://wjosp0729a-s2xz4-worker-cqj2d:10250/containerLogs/openshift-image-registry/image-registry-6bbb45b7fc-vck66/registry?follow=true: dial tcp: lookup wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host
#oc -n openshift-image-registry debug image-registry-6bbb45b7fc-vck66
Starting pod/image-registry-6bbb45b7fc-vck66-debug ...
Pod IP: 10.128.2.14
If you don't see a command prompt, try pressing enter.

Removing debug pod ...
Error from server: error dialing backend: dial tcp: lookup wjosp0729a-s2xz4-worker-776fw on 192.168.128.7:53: no such host

# oc -n openshift-image-registry exec  node-ca-wqd7t -- /bin/bash
Error from server: error dialing backend: dial tcp: lookup wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host


Version-Release number of the following components:
4.2.0-0.nightly-2019-07-28-222114


How reproducible:
Always

Steps to Reproduce:
1. Install a cluster on OpenStack (OSP) using the IPI workflow.
2. Run oc logs/exec/debug/rsh/cp against a pod scheduled on a worker node.

Actual results:
The cp/logs/debug/exec commands all fail with the same "no such host" errors shown in the description above.


Expected results:
All of these operations should succeed.

Additional info:
Pods scheduled on master nodes do not hit this issue.
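
To confirm the missing record directly, you can query the resolver named in the errors. A minimal diagnostic sketch (the resolver IP and the worker node name are taken from the error output above; dig prints an empty answer when no record exists):

$ dig +short @192.168.128.7 wjosp0729a-s2xz4-worker-cqj2d
(empty answer: there is no A record for the worker's node name)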

Comment 1 Eric Duen 2019-07-30 15:44:39 UTC
This should be resolved once we complete the migration away from the Service VM. Assigning to Tomas to follow up.

Comment 3 weiwei jiang 2019-08-16 07:39:03 UTC
Checked with 4.2.0-0.nightly-2019-08-15-232721, still not fixed.

➜ ✗ oc run h --image=aosqe/hello-openshift --replicas=6                                                    
kubectl run --generator=deploymentconfig/v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deploymentconfig.apps.openshift.io/h created         
➜ ✗ oc get pods -o wide                                
NAME         READY   STATUS      RESTARTS   AGE     IP            NODE                            NOMINATED NODE   READINESS GATES
h-1-2kd8f    1/1     Running     0          3m27s   10.128.2.13   wjosp0816d-p6jqc-worker-g87lp   <none>           <none>
h-1-8xdcr    1/1     Running     0          3m27s   10.129.2.8    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-brnkf    1/1     Running     0          3m27s   10.129.2.6    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-deploy   0/1     Completed   0          3m35s   10.129.2.4    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-n9lh7    1/1     Running     0          3m27s   10.131.0.14   wjosp0816d-p6jqc-worker-79z2m   <none>           <none>
h-1-sglg8    1/1     Running     0          3m27s   10.129.2.5    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-zd6hp    1/1     Running     0          3m27s   10.129.2.7    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
➜ ✗ oc rsh h-1-sglg8     
Error from server: error dialing backend: dial tcp: lookup wjosp0816d-p6jqc-worker-np44n on 192.168.0.6:53: no such host

Comment 4 weiwei jiang 2019-08-19 03:19:04 UTC
This blocks every request that needs to fetch data from a worker node through openshift-kube-apiserver acting as a reverse proxy (see the diagnostic sketch after the list), including:
oc debug
oc exec
oc rsh
oc logs
oc rsync
oc proxy
oc attach 
oc cp
oc port-forward

The same features in the web console are affected as well.
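
The error messages show the apiserver dialing the node's bare hostname on port 10250, which suggests it is falling back to the node's Hostname-type address. A quick way to inspect which addresses a node advertises (a diagnostic sketch; the node name is taken from the output in comment 3):

$ oc get node wjosp0816d-p6jqc-worker-np44n -o jsonpath='{.status.addresses}'

If the apiserver ends up dialing a Hostname-type address, it has to resolve that name through DNS, which is exactly the lookup failing here.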

Comment 5 egarcia 2019-08-19 20:47:33 UTC
So, I ran this from master, and I think I understand what is wrong here. In our current implementation, we expect the user to add certain entries to their DNS. To make this work, you must attach a floating IP to the `ingress-port`. Please see here for more info: https://github.com/openshift/installer/tree/master/docs/user/openstack#using-floating-ips. Try this and let me know if it fixes your problem.
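
Roughly what the linked doc describes (a sketch, not the exact steps; the external network, port name, FIP, and DNS names are placeholders for your environment):

$ openstack floating ip create <external network>
$ openstack floating ip set --port <cluster-id>-ingress-port <ingress FIP>

together with a wildcard DNS record along the lines of:

*.apps.<cluster name>.<base domain>.  IN  A  <ingress FIP>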

Comment 6 weiwei jiang 2019-08-21 04:20:43 UTC
Checked with 4.2.0-0.nightly-2019-08-20-043744. We already have the wildcard DNS record for the ingress port in place and it works, but we still hit this issue.

➜  ~ oc get pods -o wide 
NAME         READY   STATUS      RESTARTS   AGE     IP            NODE                                 NOMINATED NODE   READINESS GATES
h-1-2mxfd    1/1     Running     0          2m49s   10.130.2.18   preserve-groupg-4cf4r-worker-thlsr   <none>           <none>
h-1-54vkj    1/1     Running     0          2m49s   10.130.2.16   preserve-groupg-4cf4r-worker-thlsr   <none>           <none>
h-1-deploy   0/1     Completed   0          2m52s   10.128.2.18   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>
h-1-lqdlm    1/1     Running     0          2m49s   10.128.2.21   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>
h-1-pbphl    1/1     Running     0          2m49s   10.130.2.17   preserve-groupg-4cf4r-worker-thlsr   <none>           <none>
h-1-rd68j    1/1     Running     0          2m49s   10.128.2.20   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>
h-1-smqd7    1/1     Running     0          2m49s   10.128.2.19   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>
➜  ~ oc logs h-1-54vkj     
Error from server: Get https://preserve-groupg-4cf4r-worker-thlsr:10250/containerLogs/default/h-1-54vkj/h: dial tcp: lookup preserve-groupg-4cf4r-worker-thlsr on 192.168.0.6:53: no such host


The issue here is that openshift-kube-apiserver cannot resolve worker node names to IP addresses.
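
Since kube-apiserver runs host-networked on the masters, resolution can be checked from a master host. A sketch (the master node name is a placeholder; oc debug against a master works because master-hosted pods are unaffected):

$ oc debug node/<a-master-node>
sh-4.2# chroot /host
sh-4.2# getent hosts preserve-groupg-4cf4r-worker-thlsr || echo "no such host"
no such host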

Comment 7 XiuJuan Wang 2019-08-28 03:27:11 UTC
Running start-build with the --from-* parameters to create a binary build fails too:

+ oc start-build openshift-jee-sample-docker --from-file=target/ROOT.war -n u3gjn
Uploading file "target/ROOT.war" as binary input for the build ...
.
Uploading finished
Error from server (InternalError): Internal error occurred: error dialing backend: dial tcp: lookup preserve-groupg-4cf4r-worker1-x25qw on 192.168.0.6:53: no such host

Comment 12 Tomas Sedovic 2019-09-02 13:07:36 UTC
@weiwei the commit (8343c018c7b99525d2d13299533b7267ca317c48) is now in the `release-4.2` branch but it looks like a nightly with that commit has not been built yet (the latest one I checked was https://openshift-release-artifacts.svc.ci.openshift.org/4.2.0-0.nightly-2019-09-01-224700/ and the commit is not there).
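
One way to check whether a given nightly contains the installer commit (a sketch; the release image pullspec is a placeholder): oc adm release info --commits prints the source commit each component of the payload was built from.

$ oc adm release info --commits <release image pullspec> | grep installer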

Comment 13 weiwei jiang 2019-09-03 06:38:00 UTC
Verified on 4.2.0-0.nightly-2019-09-02-172410.

➜  ~ oc get pods -o wide
NAME         READY   STATUS      RESTARTS   AGE   IP            NODE                             NOMINATED NODE   READINESS GATES
h-1-4flbc    1/1     Running     0          12m   10.131.0.14   share-0903a-rngb2-worker-ghhlq   <none>           <none>
h-1-89rtq    1/1     Running     0          12m   10.128.2.12   share-0903a-rngb2-worker-x45bg   <none>           <none>
h-1-8wr45    1/1     Running     0          12m   10.129.2.10   share-0903a-rngb2-worker-h8z46   <none>           <none>
h-1-d8xvp    1/1     Running     0          12m   10.129.2.9    share-0903a-rngb2-worker-h8z46   <none>           <none>
h-1-deploy   0/1     Completed   0          12m   10.129.2.8    share-0903a-rngb2-worker-h8z46   <none>           <none>
h-1-qrqfc    1/1     Running     0          12m   10.131.0.15   share-0903a-rngb2-worker-ghhlq   <none>           <none>
h-1-wcx7d    1/1     Running     0          12m   10.128.2.13   share-0903a-rngb2-worker-x45bg   <none>           <none>

➜  ~ oc logs -f h-1-d8xvp
serving on 8081
serving on 8888

➜  ~ oc debug pods/h-1-d8xvp
Starting pod/h-1-d8xvp-debug ...
Pod IP: 10.129.2.11
If you don't see a command prompt, try pressing enter.
/ # ls
bin              dev              etc              hello            hello-openshift  home             proc             root             run              sys              tmp              usr              var

➜  ~ oc exec -it h-1-d8xvp /bin/sh
/ # id
uid=0(root) gid=0(root) groups=10(wheel)

➜  ~ oc rsh h-1-d8xvp
/ # id
uid=0(root) gid=0(root) groups=10(wheel)

Comment 15 weiwei jiang 2019-09-04 02:27:07 UTC
Verified on 4.2.0-0.nightly-2019-09-02-172410.

Comment 16 errata-xmlrpc 2019-10-16 06:33:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922