Description of problem:
As a user, I want to run oc logs/exec/debug/rsh/cp against a pod on a worker node, but all of these commands fail (see "Actual results" below).

Version-Release number of the following components:
4.2.0-0.nightly-2019-07-28-222114

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster using IPI on OpenStack (OSP).
2. Run oc logs/exec/debug/rsh/cp against a pod on a worker node.
Actual results:

# oc -n openshift-image-registry cp node-ca-wqd7t:/etc/hosts /tmp/hosts
Error from server: error dialing backend: dial tcp: lookup wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host

# oc -n openshift-image-registry logs -f image-registry-6bbb45b7fc-vck66
Error from server: Get https://wjosp0729a-s2xz4-worker-cqj2d:10250/containerLogs/openshift-image-registry/image-registry-6bbb45b7fc-vck66/registry?follow=true: dial tcp: lookup wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host

# oc -n openshift-image-registry debug image-registry-6bbb45b7fc-vck66
Starting pod/image-registry-6bbb45b7fc-vck66-debug ...
Pod IP: 10.128.2.14
If you don't see a command prompt, try pressing enter.
Removing debug pod ...
Error from server: error dialing backend: dial tcp: lookup wjosp0729a-s2xz4-worker-776fw on 192.168.128.7:53: no such host

# oc -n openshift-image-registry exec node-ca-wqd7t -- /bin/bash
Error from server: error dialing backend: dial tcp: lookup wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host

Expected results:
All of these operations should work.

Additional info:
Pods on master nodes are not affected by this issue.
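Every failure above ends with the same DNS error, so when triaging many such reports it can help to pull out the node name that failed to resolve and the resolver that was asked. The sketch below is a hypothetical helper (`parse_dns_error` and `DnsFailure` are illustrative names, not part of oc) that assumes the error text keeps the `lookup <host> on <resolver>:<port>: no such host` shape seen here; IPv6 resolver addresses are not handled.

```python
import re
from typing import NamedTuple, Optional

# Matches the DNS failure text from the errors above, e.g.
# "dial tcp: lookup wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host"
# Assumption: the resolver is an IPv4 address.
_LOOKUP_ERROR = re.compile(
    r"lookup (?P<host>\S+) on (?P<resolver>[\d.]+):(?P<port>\d+): no such host"
)


class DnsFailure(NamedTuple):
    host: str      # the node name that could not be resolved
    resolver: str  # the DNS server that was queried
    port: int      # the DNS server port (53 in every error above)


def parse_dns_error(message: str) -> Optional[DnsFailure]:
    """Return the unresolved host and the queried resolver, or None if absent."""
    match = _LOOKUP_ERROR.search(message)
    if match is None:
        return None
    return DnsFailure(match["host"], match["resolver"], int(match["port"]))


if __name__ == "__main__":
    err = ("Error from server: error dialing backend: dial tcp: lookup "
           "wjosp0729a-s2xz4-worker-cqj2d on 192.168.128.7:53: no such host")
    print(parse_dns_error(err))
    # DnsFailure(host='wjosp0729a-s2xz4-worker-cqj2d', resolver='192.168.128.7', port=53)
```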
This should be resolved once we complete the migration away from the Service VM. Assigning to Tomas to follow up.
Checked with 4.2.0-0.nightly-2019-08-15-232721; still not fixed.

➜ ✗ oc run h --image=aosqe/hello-openshift --replicas=6
kubectl run --generator=deploymentconfig/v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deploymentconfig.apps.openshift.io/h created

➜ ✗ oc get pods -o wide
NAME         READY   STATUS      RESTARTS   AGE     IP            NODE                            NOMINATED NODE   READINESS GATES
h-1-2kd8f    1/1     Running     0          3m27s   10.128.2.13   wjosp0816d-p6jqc-worker-g87lp   <none>           <none>
h-1-8xdcr    1/1     Running     0          3m27s   10.129.2.8    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-brnkf    1/1     Running     0          3m27s   10.129.2.6    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-deploy   0/1     Completed   0          3m35s   10.129.2.4    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-n9lh7    1/1     Running     0          3m27s   10.131.0.14   wjosp0816d-p6jqc-worker-79z2m   <none>           <none>
h-1-sglg8    1/1     Running     0          3m27s   10.129.2.5    wjosp0816d-p6jqc-worker-np44n   <none>           <none>
h-1-zd6hp    1/1     Running     0          3m27s   10.129.2.7    wjosp0816d-p6jqc-worker-np44n   <none>           <none>

➜ ✗ oc rsh h-1-sglg8
Error from server: error dialing backend: dial tcp: lookup wjosp0816d-p6jqc-worker-np44n on 192.168.0.6:53: no such host
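To see which worker node names the API server would have to resolve for the pods in a listing like the one above, the wide output can be grouped by its NODE column. This is a minimal sketch with a hypothetical helper (`pods_by_node`); it assumes the column order shown above and that every field before NODE is a single whitespace-free token, which holds for this output but may not for every oc/kubectl version.

```python
from collections import defaultdict
from typing import Dict, List


def pods_by_node(wide_output: str) -> Dict[str, List[str]]:
    """Group pod names by node from `oc get pods -o wide` text.

    Assumes the column order NAME READY STATUS RESTARTS AGE IP NODE ...,
    so NODE is the 7th whitespace-separated field of each data row.
    The header row is skipped because some header names contain spaces.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    rows = wide_output.strip().splitlines()[1:]  # drop the header line
    for row in rows:
        fields = row.split()
        if len(fields) >= 7:
            groups[fields[6]].append(fields[0])
    return dict(groups)
```

Fed the listing above, it would group five pods (including `h-1-sglg8`, the one `oc rsh` failed on) under `wjosp0816d-p6jqc-worker-np44n`, the same name that appears in the DNS error.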
This only blocks requests that need to fetch data from worker nodes through openshift-kube-apiserver acting as a reverse proxy, for example oc debug, oc exec, oc rsh, oc logs, oc rsync, oc proxy, oc attach, oc cp, and oc port-forward, as well as the same features in the web console.
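The `oc logs` error in the description shows what the proxied request looks like: the API server dials `https://<node name>:10250/containerLogs/<namespace>/<pod>/<container>`, so the node name itself is the host that must resolve. The sketch below rebuilds that URL from its parts to make the dependency explicit. `kubelet_container_logs_url` is a hypothetical helper, and the URL shape is taken from the error text in this report rather than from any API server code.

```python
from urllib.parse import urlencode, urlsplit


def kubelet_container_logs_url(node: str, namespace: str, pod: str,
                               container: str, follow: bool = False,
                               port: int = 10250) -> str:
    """Build a kubelet containerLogs URL in the shape seen in the errors above.

    The node name is used verbatim as the URL host, so whoever dials this URL
    (openshift-kube-apiserver, in this report) must be able to resolve it.
    """
    url = f"https://{node}:{port}/containerLogs/{namespace}/{pod}/{container}"
    if follow:
        url += "?" + urlencode({"follow": "true"})
    return url


if __name__ == "__main__":
    url = kubelet_container_logs_url(
        "wjosp0729a-s2xz4-worker-cqj2d", "openshift-image-registry",
        "image-registry-6bbb45b7fc-vck66", "registry", follow=True)
    print(url)
    # The host that has to resolve before the request can even be sent:
    print("host to resolve:", urlsplit(url).hostname)
```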
So, I ran this from master, and I think I understand what is wrong here. In our current implementation, we expect the user to add certain entries to their DNS. For this to work, you must attach a floating IP to the `ingress-port`. See https://github.com/openshift/installer/tree/master/docs/user/openstack#using-floating-ips for more information. Please try this and let me know whether it fixes your problem.
Checked with 4.2.0-0.nightly-2019-08-20-043744. We already have wildcard DNS configured for the ingress-port and it works, but we still hit this issue.

➜ ~ oc get pods -o wide
NAME         READY   STATUS      RESTARTS   AGE     IP            NODE                                 NOMINATED NODE   READINESS GATES
h-1-2mxfd    1/1     Running     0          2m49s   10.130.2.18   preserve-groupg-4cf4r-worker-thlsr   <none>           <none>
h-1-54vkj    1/1     Running     0          2m49s   10.130.2.16   preserve-groupg-4cf4r-worker-thlsr   <none>           <none>
h-1-deploy   0/1     Completed   0          2m52s   10.128.2.18   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>
h-1-lqdlm    1/1     Running     0          2m49s   10.128.2.21   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>
h-1-pbphl    1/1     Running     0          2m49s   10.130.2.17   preserve-groupg-4cf4r-worker-thlsr   <none>           <none>
h-1-rd68j    1/1     Running     0          2m49s   10.128.2.20   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>
h-1-smqd7    1/1     Running     0          2m49s   10.128.2.19   preserve-groupg-4cf4r-worker-lp9lk   <none>           <none>

➜ ~ oc logs h-1-54vkj
Error from server: Get https://preserve-groupg-4cf4r-worker-thlsr:10250/containerLogs/default/h-1-54vkj/h: dial tcp: lookup preserve-groupg-4cf4r-worker-thlsr on 192.168.0.6:53: no such host

The issue here is that openshift-kube-apiserver cannot resolve the worker node names to IP addresses.
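One way to confirm the "cannot resolve worker names" diagnosis outside the API server is to ask the system resolver for each node name directly. The sketch below is a hypothetical helper (`resolve_nodes`) built on Python's `socket.getaddrinfo`; it only reflects the API server's view if it runs on a host that queries the same DNS server (192.168.0.6 in the errors above), which is an assumption about where you run it.

```python
import socket
from typing import Dict, Iterable, Optional


def resolve_nodes(names: Iterable[str], port: int = 10250) -> Dict[str, Optional[str]]:
    """Map each node name to its first resolved address, or None if lookup fails.

    Uses the local system resolver; run it from a host that uses the same
    DNS server as the API server to reproduce the API server's view.
    """
    results: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            # Same condition the API server reports as "no such host".
            results[name] = None
        else:
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address string.
            results[name] = infos[0][4][0]
    return results
```

On an affected cluster, calling `resolve_nodes(["preserve-groupg-4cf4r-worker-thlsr", "preserve-groupg-4cf4r-worker-lp9lk"])` from such a host would be expected to map both names to `None`, while a master node name should resolve.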
Creating a binary build with oc start-build and the --from-* parameters also fails:

+ oc start-build openshift-jee-sample-docker --from-file=target/ROOT.war -n u3gjn
Uploading file "target/ROOT.war" as binary input for the build ...
.
Uploading finished
Error from server (InternalError): Internal error occurred: error dialing backend: dial tcp: lookup preserve-groupg-4cf4r-worker1-x25qw on 192.168.0.6:53: no such host
@weiwei the commit (8343c018c7b99525d2d13299533b7267ca317c48) is now in the `release-4.2` branch but it looks like a nightly with that commit has not been built yet (the latest one I checked was https://openshift-release-artifacts.svc.ci.openshift.org/4.2.0-0.nightly-2019-09-01-224700/ and the commit is not there).
Verified on 4.2.0-0.nightly-2019-09-02-172410.

➜ ~ oc get pods -o wide
NAME         READY   STATUS      RESTARTS   AGE   IP            NODE                             NOMINATED NODE   READINESS GATES
h-1-4flbc    1/1     Running     0          12m   10.131.0.14   share-0903a-rngb2-worker-ghhlq   <none>           <none>
h-1-89rtq    1/1     Running     0          12m   10.128.2.12   share-0903a-rngb2-worker-x45bg   <none>           <none>
h-1-8wr45    1/1     Running     0          12m   10.129.2.10   share-0903a-rngb2-worker-h8z46   <none>           <none>
h-1-d8xvp    1/1     Running     0          12m   10.129.2.9    share-0903a-rngb2-worker-h8z46   <none>           <none>
h-1-deploy   0/1     Completed   0          12m   10.129.2.8    share-0903a-rngb2-worker-h8z46   <none>           <none>
h-1-qrqfc    1/1     Running     0          12m   10.131.0.15   share-0903a-rngb2-worker-ghhlq   <none>           <none>
h-1-wcx7d    1/1     Running     0          12m   10.128.2.13   share-0903a-rngb2-worker-x45bg   <none>           <none>

➜ ~ oc logs -f h-1-d8xvp
serving on 8081
serving on 8888

➜ ~ oc debug pods/h-1-d8xvp
Starting pod/h-1-d8xvp-debug ...
Pod IP: 10.129.2.11
If you don't see a command prompt, try pressing enter.
/ # ls
bin dev etc hello hello-openshift home proc root run sys tmp usr var

➜ ~ oc exec -it h-1-d8xvp /bin/sh
/ # id
uid=0(root) gid=0(root) groups=10(wheel)

➜ ~ oc rsh h-1-d8xvp
/ # id
uid=0(root) gid=0(root) groups=10(wheel)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922