Description of problem:
> You can't use oc rsh to get into a running pod.

Version-Release number of selected component (if applicable):
> $ oc get clusterversion -o jsonpath='{.items[].status.current.version}{"\n"}'
> 4.0.0-9

How reproducible: 100%

Steps to Reproduce:
1. $ oc rsh -t openshift-tests-1-rrgd6

Actual results:
> Error from server: error dialing backend: dial tcp: lookup ip-10-0-143-162 on
> 10.0.0.2:53: no such host

Expected results:
> You should get a tty shell in the pod that is running on the cluster.

Additional info:
> The issue seems to be in how the node/pod is referenced (using an internal IP and non-addressable node name).
Can you provide --loglevel=8 output? Are you able to do `oc exec` against that same pod?
If I just do the following:

> $ oc create -f https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json
> $ oc rsh hello-openshift --loglevel=8

```
I0114 12:12:16.349519   27643 loader.go:359] Config loaded from file /home/erich/Repositories/no_git/ocp_aws_install/ocp-aws-setup/test_cluster/auth/kubeconfig
I0114 12:12:16.350463   27643 loader.go:359] Config loaded from file /home/erich/Repositories/no_git/ocp_aws_install/ocp-aws-setup/test_cluster/auth/kubeconfig
I0114 12:12:16.353123   27643 loader.go:359] Config loaded from file /home/erich/Repositories/no_git/ocp_aws_install/ocp-aws-setup/test_cluster/auth/kubeconfig
I0114 12:12:16.354033   27643 loader.go:359] Config loaded from file /home/erich/Repositories/no_git/ocp_aws_install/ocp-aws-setup/test_cluster/auth/kubeconfig
I0114 12:12:16.354970   27643 loader.go:359] Config loaded from file /home/erich/Repositories/no_git/ocp_aws_install/ocp-aws-setup/test_cluster/auth/kubeconfig
I0114 12:12:16.355218   27643 round_trippers.go:383] GET https://neverland-api.rhcee.support:6443/api/v1/namespaces/testing/pods/hello-openshift
I0114 12:12:16.355227   27643 round_trippers.go:390] Request Headers:
I0114 12:12:16.355232   27643 round_trippers.go:393]     Accept: application/json, */*
I0114 12:12:16.355236   27643 round_trippers.go:393]     User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0114 12:12:16.355246   27643 round_trippers.go:393]     Authorization: Bearer MSoG8U6m-rbysZOWTO0oRfPBAn2CVAt-KXlLndmiZLY
I0114 12:12:16.528694   27643 round_trippers.go:408] Response Status: 200 OK in 173 milliseconds
I0114 12:12:16.528766   27643 round_trippers.go:411] Response Headers:
I0114 12:12:16.528783   27643 round_trippers.go:414]     Date: Mon, 14 Jan 2019 17:12:16 GMT
I0114 12:12:16.528798   27643 round_trippers.go:414]     Audit-Id: 4dd75800-d77e-44c7-90f7-de7d00d90a46
I0114 12:12:16.528819   27643 round_trippers.go:414]     Cache-Control: no-store
I0114 12:12:16.528840   27643 round_trippers.go:414]     Content-Type: application/json
I0114 12:12:16.528860   27643 round_trippers.go:414]     Content-Length: 2295
I0114 12:12:16.529005   27643 request.go:897] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"hello-openshift","namespace":"testing","selfLink":"/api/v1/namespaces/testing/pods/hello-openshift","uid":"fd45e5e4-181e-11e9-b816-0e1fda3dc818","resourceVersion":"8035117","creationTimestamp":"2019-01-14T17:08:18Z","labels":{"name":"hello-openshift"},"annotations":{"openshift.io/scc":"anyuid"}},"spec":{"volumes":[{"name":"tmp","emptyDir":{}},{"name":"default-token-rpfdm","secret":{"secretName":"default-token-rpfdm","defaultMode":420}}],"containers":[{"name":"hello-openshift","image":"openshift/hello-openshift","ports":[{"containerPort":8080,"protocol":"TCP"}],"resources":{},"volumeMounts":[{"name":"tmp","mountPath":"/tmp"},{"name":"default-token-rpfdm","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent","securityContext":{"capabilities":{"drop":["MKNOD"]},"privileged":false}}],"restartPolicy":"Always","terminationGrace [truncated 1271 chars]
I0114 12:12:16.549286   27643 round_trippers.go:383] POST https://neverland-api.rhcee.support:6443/api/v1/namespaces/testing/pods/hello-openshift/exec?command=%2Fbin%2Fsh&command=-c&command=TERM%3D%22xterm-256color%22+%2Fbin%2Fsh&container=hello-openshift&container=hello-openshift&stdin=true&stdout=true&tty=true
I0114 12:12:16.549342   27643 round_trippers.go:390] Request Headers:
I0114 12:12:16.549359   27643 round_trippers.go:393]     X-Stream-Protocol-Version: v4.channel.k8s.io
I0114 12:12:16.549372   27643 round_trippers.go:393]     X-Stream-Protocol-Version: v3.channel.k8s.io
I0114 12:12:16.549387   27643 round_trippers.go:393]     X-Stream-Protocol-Version: v2.channel.k8s.io
I0114 12:12:16.549401   27643 round_trippers.go:393]     X-Stream-Protocol-Version: channel.k8s.io
I0114 12:12:16.549415   27643 round_trippers.go:393]     User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0114 12:12:16.549432   27643 round_trippers.go:393]     Authorization: Bearer MSoG8U6m-rbysZOWTO0oRfPBAn2CVAt-KXlLndmiZLY
I0114 12:12:16.680122   27643 round_trippers.go:408] Response Status: 500 Internal Server Error in 130 milliseconds
I0114 12:12:16.680189   27643 round_trippers.go:411] Response Headers:
I0114 12:12:16.680213   27643 round_trippers.go:414]     Date: Mon, 14 Jan 2019 17:12:16 GMT
I0114 12:12:16.680237   27643 round_trippers.go:414]     Content-Length: 177
I0114 12:12:16.680290   27643 round_trippers.go:414]     Audit-Id: 8788ac7c-80f2-4abf-a715-cb718dd24fa6
I0114 12:12:16.680315   27643 round_trippers.go:414]     Cache-Control: no-store
I0114 12:12:16.680337   27643 round_trippers.go:414]     Content-Type: application/json
I0114 12:12:16.681791   27643 helpers.go:201] server response object: [{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "error dialing backend: dial tcp: lookup ip-10-0-149-151 on 10.0.0.2:53: no such host",
  "code": 500
}]
F0114 12:12:16.681860   27643 helpers.go:119] Error from server: error dialing backend: dial tcp: lookup ip-10-0-149-151 on 10.0.0.2:53: no such host
```

I also get something similar with exec:

> $ oc exec -t hello-openshift /bin/bash
> Error from server: error dialing backend: dial tcp: lookup ip-10-0-149-151 on 10.0.0.2:53: no such host
This is not a CLI issue, but rather an api-server issue when proxying a request to the node.
Cannot replicate with a 4.0 cluster. Additionally, using the hello-pod example from the bug description, you cannot rsh into it even without the "backend dial" error, because the container image has no /bin/sh binary:

```
$ oc rsh hello-openshif
exec failed: container_linux.go:336: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory"
command terminated with exit code 1
```

Can you confirm that this error does not happen with other pods?
This seems like a networking (DNS) issue in how the system is set up. If we look at the logs of the API server after trying to rsh to a pod, I get a message like:

> Error from server: Get https://ip-10-0-10-238:10250/containerLogs/openshift-apiserver/apiserver-z9wdp/openshift-apiserver: dial tcp: lookup ip-10-0-10-238 on 10.0.0.2:53: no such host

That's not the error for the rsh itself, but (given the log-collection loop I am running) it explains why some of my requests for logs fail in the same way rsh does.

- command: $ for p in $(oc get pods -n openshift-apiserver -o name); do oc logs $p -n openshift-apiserver; done

If you attempt to debug this by running dig (or some other host-lookup tool) on the node/master where the API server is running, you can see that the lookup for the short hostname of the node where the pod lives fails:

> $ dig ip-10-0-10-238 +short
>

If you instead look up the FQDN, however:

> $ dig ip-10-0-10-238.ec2.internal +short
> 10.0.10.238

it resolves. The core issue seems to be that the API server, or the node's hostname, is presenting the short name and not the FQDN. As a result, I am moving this to networking to review.
Created attachment 1521857 [details]
nodes.yaml

@dan, this is what you asked for on Slack, to test the theory that this might be an install issue in how the nodes are registered.
So:

```
metadata:
  name: ip-10-0-10-238.ec2.internal
status:
  addresses:
  - address: ip-10-0-10-238
    type: InternalDNS
  - address: ec2-3-91-38-228.compute-1.amazonaws.com
    type: ExternalDNS
  - address: ip-10-0-10-238
    type: Hostname
```

So the node is registering itself with a fully-qualified "name", but then shortening that for the "InternalDNS" and "Hostname" address records. (I'm not sure which one of those two is the relevant one here.) So that seems more like the node knows its correct FQDN, but kubelet node registration is breaking it.
OK, metadata.name and InternalDNS come from different parts of the AWS config, and Hostname comes from the output of "hostname" on the node, unless we're overriding it, which we're apparently not (Eric confirms that the node returns an unqualified name for "hostname").

I believe the Hostname NodeAddress, when present, is used in preference to the InternalDNS/ExternalDNS ones when doing "oc exec". So the master is trying to connect to the Hostname that the node registered itself with, but it can't resolve that name as-is. So I think the bug is that the node's hostname should be its FQDN, but it isn't.

Update from Eric:

```
$ hostnamectl
   Static hostname: localhost.localdomain
Transient hostname: ip-10-0-10-238
...
```

So we're not explicitly setting the hostname at all; it's just being figured out from DHCP or something?
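To make the failure mode concrete, here is a small sketch (hypothetical helper, not the actual kubelet/apiserver code) of the address-preference behavior described above, where a registered "Hostname" address wins over the DNS address types and so the short, unresolvable name gets dialed:

```python
# Illustrative priority order only; the real selection logic lives in
# Kubernetes and is configurable on the apiserver side.
ADDRESS_PRIORITY = ["Hostname", "InternalDNS", "ExternalDNS", "InternalIP"]

def pick_node_address(addresses):
    """Return the registered address of the most-preferred type, or None."""
    for addr_type in ADDRESS_PRIORITY:
        for addr in addresses:
            if addr["type"] == addr_type:
                return addr["address"]
    return None

# The addresses this node registered (from the attached nodes.yaml):
node_addresses = [
    {"address": "ip-10-0-10-238", "type": "InternalDNS"},
    {"address": "ec2-3-91-38-228.compute-1.amazonaws.com", "type": "ExternalDNS"},
    {"address": "ip-10-0-10-238", "type": "Hostname"},
]

# The unqualified Hostname wins, which the master then cannot resolve
# because its resolv.conf has no matching "search" domain.
print(pick_node_address(node_addresses))  # -> ip-10-0-10-238
```

Under this assumed ordering, fixing either the registered Hostname (make it the FQDN) or the master's resolver search path would unblock the proxy dial.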
Yeah, I doubt that we're setting the hostname; it comes from the instance metadata. Back on CoreOS Container Linux, we got this from the metadata service. I think this is a bug in RHCOS's implementation of Ignition + instance metadata. Specifically, the FQDN needs to be set, not just the short hostname.
There seems to be a difference between the CEE cluster's DHCP server and the devcluster DHCP server.

CEE:

```
Jan 21 14:30:44 ip-10-0-151-88 NetworkManager[4110]: <info>  [1548081044.3147] dhcp4 (eth0):   address 10.0.151.88
Jan 21 14:30:44 ip-10-0-151-88 NetworkManager[4110]: <info>  [1548081044.3148] dhcp4 (eth0):   plen 20 (255.255.240.0)
Jan 21 14:30:44 ip-10-0-151-88 NetworkManager[4110]: <info>  [1548081044.3148] dhcp4 (eth0):   gateway 10.0.144.1
Jan 21 14:30:44 ip-10-0-151-88 NetworkManager[4110]: <info>  [1548081044.3148] dhcp4 (eth0):   lease time 3600
Jan 21 14:30:44 ip-10-0-151-88 NetworkManager[4110]: <info>  [1548081044.3148] dhcp4 (eth0):   hostname 'ip-10-0-151-88'
Jan 21 14:30:44 ip-10-0-151-88 NetworkManager[4110]: <info>  [1548081044.3148] dhcp4 (eth0):   nameserver '10.0.0.2'
```

devcluster:

```
Jan 21 16:05:57 localhost.localdomain NetworkManager[4121]: <info>  [1548086757.7872] dhcp4 (eth0):   address 10.0.17.190
Jan 21 16:05:57 localhost.localdomain NetworkManager[4121]: <info>  [1548086757.7872] dhcp4 (eth0):   plen 20 (255.255.240.0)
Jan 21 16:05:57 localhost.localdomain NetworkManager[4121]: <info>  [1548086757.7873] dhcp4 (eth0):   gateway 10.0.16.1
Jan 21 16:05:57 localhost.localdomain NetworkManager[4121]: <info>  [1548086757.7873] dhcp4 (eth0):   lease time 3600
Jan 21 16:05:57 localhost.localdomain NetworkManager[4121]: <info>  [1548086757.7873] dhcp4 (eth0):   hostname 'ip-10-0-17-190'
Jan 21 16:05:57 localhost.localdomain NetworkManager[4121]: <info>  [1548086757.7873] dhcp4 (eth0):   nameserver '10.0.0.2'
Jan 21 16:05:57 localhost.localdomain NetworkManager[4121]: <info>  [1548086757.7873] dhcp4 (eth0):   domain name 'us-east-2.compute.internal'
```

The "domain name" value in the devcluster DHCP response results in an /etc/resolv.conf with a "search" line:

```
[core@ip-10-0-17-190 ~]$ more /etc/resolv.conf
# Generated by NetworkManager
search us-east-2.compute.internal
nameserver 10.0.0.2
```

while in the CEE cluster they get no "search" line and so can't resolve the short hostnames:

```
$ ansible -b -u core --private-key ~/.ssh/id_rsa -i ec2-3-89-79-40.compute-1.amazonaws.com,ec2-54-145-193-52.compute-1.amazonaws.com,ec2-3-91-38-228.compute-1.amazonaws.com all -m raw -a "cat /etc/resolv.conf"
ec2-3-89-79-40.compute-1.amazonaws.com | CHANGED | rc=0 >>
# Generated by NetworkManager
nameserver 10.0.0.2
Shared connection to ec2-3-89-79-40.compute-1.amazonaws.com closed.
```
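The missing "search" line is exactly why the short names fail: a resolver only expands an unqualified name against the search domains listed in /etc/resolv.conf. A rough sketch of that expansion (a simplified illustration, ignoring ndots and other glibc details):

```python
# Simplified model of resolver search-list expansion: short names are
# tried against each "search" domain; names containing a dot, or any
# name when no search list exists, are tried as-is.

def candidate_names(hostname, resolv_conf):
    """Return the FQDN candidates a resolver would try for `hostname`."""
    search_domains = []
    for line in resolv_conf.splitlines():
        parts = line.split()
        if parts and parts[0] == "search":
            search_domains = parts[1:]
    if "." in hostname or not search_domains:
        return [hostname]
    return [f"{hostname}.{d}" for d in search_domains]

devcluster = "# Generated by NetworkManager\nsearch us-east-2.compute.internal\nnameserver 10.0.0.2\n"
cee = "# Generated by NetworkManager\nnameserver 10.0.0.2\n"

print(candidate_names("ip-10-0-17-190", devcluster))
# -> ['ip-10-0-17-190.us-east-2.compute.internal']  (resolvable FQDN)
print(candidate_names("ip-10-0-10-238", cee))
# -> ['ip-10-0-10-238']  (bare short name; NXDOMAIN, as in the bug)
```

So on devcluster the short hostname is silently qualified into a resolvable FQDN, while on the CEE cluster it is sent to the nameserver unqualified and fails.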
(In reply to Dan Winship from comment #12)
> There seems to be a difference between the CEE cluster's DHCP server and
> the devcluster DHCP server.

When the installer creates a new VPC, it leaves most settings at their defaults [1]. That includes enabling DHCP [2] but disabling DHCP hostnames [3]. If there are DHCP differences, we may be inheriting them from the AWS account (I'm not familiar with the VPC DHCP flow).

[1]: https://github.com/openshift/installer/blob/v0.10.0/data/data/aws/vpc/vpc.tf#L6-L14
[2]: https://www.terraform.io/docs/providers/aws/r/vpc.html#enable_dns_support
[3]: https://www.terraform.io/docs/providers/aws/r/vpc.html#enable_dns_hostnames
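One way to stop inheriting account-level DHCP behavior would be for the installer to attach explicit DHCP options to the VPC. A rough sketch of what that might look like in Terraform (illustrative only, not necessarily the exact change in any PR; the resource and vpc names are hypothetical, and the domain suffix is region-dependent, matching the `domain_name`/`domain_name_servers` values visible in the install debug output later in this bug):

```hcl
# Explicit DHCP options so nodes always receive a DHCP "domain name",
# which gives them a resolv.conf "search" line regardless of whatever
# DHCP options the AWS account happens to default to.
resource "aws_vpc_dhcp_options" "main" {
  domain_name         = "us-east-2.compute.internal"  # region-dependent
  domain_name_servers = ["AmazonProvidedDNS"]
}

resource "aws_vpc_dhcp_options_association" "main" {
  vpc_id          = "${aws_vpc.new_vpc.id}"  # hypothetical VPC resource name
  dhcp_options_id = "${aws_vpc_dhcp_options.main.id}"
}
```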
I've floated a potential fix in https://github.com/openshift/installer/pull/1116, if folks want to give that a spin.
This PR was merged, updating the bug to put this into the release pipeline.
First, QE cannot reproduce this bug using our own AWS account, but I can confirm that the PR takes effect now.

```
# ./openshift-install version
./openshift-install v0.11.0
```

Trigger an install, and the following messages appear:

```
time="2019-01-28T01:56:56-05:00" level=debug msg="module.vpc.aws_vpc_dhcp_options.main: Creating..."
time="2019-01-28T01:56:56-05:00" level=debug msg="  domain_name:                          \"\" => \"us-east-2.compute.internal\""
time="2019-01-28T01:56:56-05:00" level=debug msg="  domain_name_servers.#:                \"\" => \"1\""
time="2019-01-28T01:56:56-05:00" level=debug msg="  domain_name_servers.0:                \"\" => \"AmazonProvidedDNS\""
time="2019-01-28T01:56:56-05:00" level=debug msg="  owner_id:                             \"\" => \"<computed>\""
time="2019-01-28T01:56:56-05:00" level=debug msg="  tags.%:                               \"\" => \"2\""
time="2019-01-28T01:56:56-05:00" level=debug msg="  tags.kubernetes.io/cluster/qe-jialiu2: \"\" => \"owned\""
time="2019-01-28T01:56:56-05:00" level=debug msg="  tags.openshiftClusterID:              \"\" => \"af991f71-8a30-4f01-a9d0-94888d7de7ff\""
```

On the node:

```
# cat /etc/resolv.conf
# Generated by NetworkManager
search us-east-2.compute.internal
nameserver 10.0.0.2
```

```
# oc rsh hello-openshift
exec failed: container_linux.go:336: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory"
command terminated with exit code 1
# oc exec -t hello-openshift /bin/bash
exec failed: container_linux.go:336: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"
command terminated with exit code 1
```

The error from comment 2 is not reproduced any more.