Bug 1679625 - oc debug node/<node-name> lands the user on a random node
Summary: oc debug node/<node-name> lands the user on a random node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Maciej Szulik
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-02-21 15:24 UTC by Mike Fiedler
Modified: 2019-06-04 10:44 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The node name (NodeName) was being cleared from the debug pod spec. Consequence: Invoking the debug node command would land the user on a random node. Fix: Preserve the node name in the generated pod spec. Result: oc debug node now debugs the specified node.
Clone Of:
Environment:
Last Closed: 2019-06-04 10:44:19 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:44:27 UTC

Description Mike Fiedler 2019-02-21 15:24:53 UTC
Description of problem:

oc debug node/<node name> does not always land the user on the specified node.

# oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-129.us-east-2.compute.internal   Ready    worker   20h   v1.12.4+6a9f178753
ip-10-0-131-167.us-east-2.compute.internal   Ready    worker   20h   v1.12.4+6a9f178753
ip-10-0-131-241.us-east-2.compute.internal   Ready    worker   37h   v1.12.4+6a9f178753
ip-10-0-131-9.us-east-2.compute.internal     Ready    worker   20h   v1.12.4+6a9f178753
ip-10-0-134-110.us-east-2.compute.internal   Ready    worker   20h   v1.12.4+6a9f178753
ip-10-0-135-244.us-east-2.compute.internal   Ready    worker   20h   v1.12.4+6a9f178753
ip-10-0-136-237.us-east-2.compute.internal   Ready    worker   20h   v1.12.4+6a9f178753
ip-10-0-142-32.us-east-2.compute.internal    Ready    master   37h   v1.12.4+6a9f178753
ip-10-0-143-3.us-east-2.compute.internal     Ready    worker   20h   v1.12.4+6a9f178753
ip-10-0-144-144.us-east-2.compute.internal   Ready    worker   37h   v1.12.4+6a9f178753
ip-10-0-146-21.us-east-2.compute.internal    Ready    master   37h   v1.12.4+6a9f178753
ip-10-0-164-160.us-east-2.compute.internal   Ready    worker   37h   v1.12.4+6a9f178753
ip-10-0-175-24.us-east-2.compute.internal    Ready    master   37h   v1.12.4+6a9f178753

-----------------------------------
# oc debug node/ip-10-0-131-167.us-east-2.compute.internal
Starting pod/ip-10-0-131-167us-east-2computeinternal-debug ...
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# hostname
ip-10-0-128-129

The user is not on ip-10-0-131-167.

Version-Release number of selected component (if applicable): 4.0.0-0.nightly-2019-02-19-195128


How reproducible: Usually; occasionally you get lucky and land on the right node.
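The mismatch in the transcript above can be checked mechanically: the node FQDN should begin with the short hostname reported inside the debug pod after `chroot /host`. A minimal sketch (the two values are hard-coded from the transcript above; in practice they would be captured from `oc get nodes` and from `hostname` inside the pod):

```shell
# Node the user asked to debug (from `oc debug node/<name>`):
requested="ip-10-0-131-167.us-east-2.compute.internal"
# Short hostname reported inside the debug pod after `chroot /host`:
actual="ip-10-0-128-129"

# On AWS, the internal FQDN starts with the short hostname, so a
# correct landing means $requested begins with "$actual.".
case "$requested" in
  "$actual".*) echo "OK: landed on the requested node" ;;
  *)           echo "MISMATCH: asked for $requested, landed on $actual" ;;
esac
```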

Comment 1 Maciej Szulik 2019-02-21 15:42:50 UTC
Fix in https://github.com/openshift/origin/pull/22086

Comment 2 Maciej Szulik 2019-02-26 11:40:23 UTC
Fix merged, moving to QA.

Comment 3 Hongkai Liu 2019-02-26 14:35:15 UTC
It works for me.

# oc get clusterversion version 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-26-104314   True        False         78m     Cluster version is 4.0.0-0.nightly-2019-02-26-104314


# oc get node --no-headers | awk '{print $1}' | while read i; do echo "node: $i"; oc debug node/$i -- curl -s http://169.254.169.254/latest/meta-data/local-hostname; done
node: ip-10-0-135-209.us-east-2.compute.internal
Starting pod/ip-10-0-135-209us-east-2computeinternal-debug ...
ip-10-0-150-120.us-east-2.compute.internal
Removing debug pod ...
node: ip-10-0-139-178.us-east-2.compute.internal
Starting pod/ip-10-0-139-178us-east-2computeinternal-debug ...
ip-10-0-169-105.us-east-2.compute.internal
Removing debug pod ...
node: ip-10-0-148-160.us-east-2.compute.internal
Starting pod/ip-10-0-148-160us-east-2computeinternal-debug ...
ip-10-0-135-209.us-east-2.compute.internal
Removing debug pod ...
node: ip-10-0-150-120.us-east-2.compute.internal
Starting pod/ip-10-0-150-120us-east-2computeinternal-debug ...
ip-10-0-169-105.us-east-2.compute.internal
Removing debug pod ...
node: ip-10-0-164-68.us-east-2.compute.internal
Starting pod/ip-10-0-164-68us-east-2computeinternal-debug ...
ip-10-0-150-120.us-east-2.compute.internal
Removing debug pod ...
node: ip-10-0-169-105.us-east-2.compute.internal
Starting pod/ip-10-0-169-105us-east-2computeinternal-debug ...
ip-10-0-169-105.us-east-2.compute.internal
Removing debug pod ...

Comment 4 Xingxing Xia 2019-02-27 10:53:07 UTC
(In reply to Hongkai Liu from comment #3)
> node: ip-10-0-135-209.us-east-2.compute.internal
> Starting pod/ip-10-0-135-209us-east-2computeinternal-debug ...
> ip-10-0-150-120.us-east-2.compute.internal
> Removing debug pod ...
Thanks, but I don't understand how the output "node: ip-10-0-135-209" followed by "ip-10-0-150-120" verifies that the user lands on node ip-10-0-135-209.
I would appreciate an explanation.

Comment 5 Xingxing Xia 2019-02-27 10:54:26 UTC
Verified with oc v4.0.5-1 using the chroot method from comment 0:
oc debug no/ip-172-31-168-62.ap-northeast-1.compute.internal                                                    
Starting pod/ip-172-31-168-62ap-northeast-1computeinternal-debug ...                                                                   
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# hostname
ip-172-31-168-62

PS:
Some additional detail:
Per the PR, the fix is on the oc client side, so the oc client version determines whether the bug is fixed.
$ git tag --sort=taggerdate --contains 6686bde # The PR's commit
v4.0.3-1
v4.0.4-1
v4.0.5-1
Thus the fix is present in versions >= v4.0.3-1.

Running `debug no/ip-172-31-168-62.ap-northeast-1.compute.internal -o yaml` with oc v4.0.2 and oc v4.0.5 respectively:
the oc v4.0.5 output includes `nodeName: ip-172-31-168-62.ap-northeast-1.compute.internal` in the pod spec;
the oc v4.0.2 output does not, which was the cause of the bug.
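That `-o yaml` check reduces to grepping for `nodeName:` in the rendered pod spec. A sketch against a canned spec fragment (standing in for live `oc debug ... -o yaml` output, since the check itself does not need a cluster):

```shell
# A fixed oc client renders the debug pod with spec.nodeName set; the
# buggy client omits it. grep -c prints the number of matching lines.
cat <<'EOF' | grep -c 'nodeName:'
spec:
  nodeName: ip-172-31-168-62.ap-northeast-1.compute.internal
  containers: []
EOF
```

A count of 1 corresponds to the fixed client (oc >= v4.0.3-1); 0 would reproduce the v4.0.2 output.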

Comment 6 Hongkai Liu 2019-02-27 12:14:16 UTC
> Thanks, but I don't understand why the ouput "node: ip-10-0-135-209" &
> "ip-10-0-150-120" can verify user lands on node ip-10-0-135-209.
> Appreciate much if explained.
You are absolutely right. My fault; I checked the wrong lines.
I will do more testing today. Thank you for pointing that out.

Comment 7 Hongkai Liu 2019-02-27 13:15:30 UTC
@Xingxing, thanks again for catching my error.

The following should do the job instead.

### old version of oc client
# oc version
oc v4.0.0-0.179.0
kubernetes v1.12.4+30bae8e6c3
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api.qe-hongkliu01.qe.devcluster.openshift.com:6443
kubernetes v1.12.4+0cbcfc5afe
root@ip-172-31-31-218: ~/bin # oc get clusterversion version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-27-074704   True        False         11m     Cluster version is 4.0.0-0.nightly-2019-02-27-074704

# oc get node --no-headers | awk '{print $1}' | while read i; do echo "node: $i"; oc debug node/$i -- curl -s http://169.254.169.254/latest/meta-data/local-hostname; done
node: ip-172-31-130-62.us-east-2.compute.internal
Starting pod/ip-172-31-130-62us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-139-240.us-east-2.compute.internal
Starting pod/ip-172-31-139-240us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-147-207.us-east-2.compute.internal
Starting pod/ip-172-31-147-207us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-158-32.us-east-2.compute.internal
Starting pod/ip-172-31-158-32us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-164-151.us-east-2.compute.internal
Starting pod/ip-172-31-164-151us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-166-8.us-east-2.compute.internal
Starting pod/ip-172-31-166-8us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...


### new version of oc client against the same cluster
# oc version
Client Version: version.Info{Major:"4", Minor:"0+", GitVersion:"v4.0.5", GitCommit:"0cbcfc5afe", GitTreeState:"", BuildDate:"2019-02-27T02:13:46Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.4+0cbcfc5afe", GitCommit:"0cbcfc5afe", GitTreeState:"clean", BuildDate:"2019-02-27T02:12:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

# oc get node --no-headers | awk '{print $1}' | while read i; do echo "node: $i"; oc debug node/$i -- curl -s http://169.254.169.254/latest/meta-data/local-hostname; done
node: ip-172-31-130-62.us-east-2.compute.internal
Starting pod/ip-172-31-130-62us-east-2computeinternal-debug ...
ip-172-31-130-62.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-139-240.us-east-2.compute.internal
Starting pod/ip-172-31-139-240us-east-2computeinternal-debug ...
ip-172-31-139-240.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-147-207.us-east-2.compute.internal
Starting pod/ip-172-31-147-207us-east-2computeinternal-debug ...
ip-172-31-147-207.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-158-32.us-east-2.compute.internal
Starting pod/ip-172-31-158-32us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-164-151.us-east-2.compute.internal
Starting pod/ip-172-31-164-151us-east-2computeinternal-debug ...
ip-172-31-164-151.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-166-8.us-east-2.compute.internal
Starting pod/ip-172-31-166-8us-east-2computeinternal-debug ...
ip-172-31-166-8.us-east-2.compute.internal
Removing debug pod ...


Now every node name matches the local-hostname answer from 169.254.169.254 (the EC2 instance metadata service).
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
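The eyeball comparison of the loop output can also be automated: pair each "node: <name>" line with the metadata answer on the following data line, and flag mismatches. A sketch over a trimmed sample of the transcripts above (awk program plus sample input only; point it at real loop output to use it):

```shell
# Pair "node: <name>" with the next non-status line (the metadata
# service's local-hostname answer) and report OK/BAD per node.
awk '
  /^node: /              { want = $2; next }   # remember requested node
  /^(Starting|Removing)/ { next }              # skip oc debug status lines
  NF { printf "%s %s\n", ($1 == want ? "OK" : "BAD"), want }
' <<'EOF'
node: ip-172-31-130-62.us-east-2.compute.internal
Starting pod/ip-172-31-130-62us-east-2computeinternal-debug ...
ip-172-31-130-62.us-east-2.compute.internal
Removing debug pod ...
node: ip-172-31-139-240.us-east-2.compute.internal
Starting pod/ip-172-31-139-240us-east-2computeinternal-debug ...
ip-172-31-158-32.us-east-2.compute.internal
Removing debug pod ...
EOF
```

The sample yields one OK line and one BAD line; the second pair reuses the pre-fix transcript, where the metadata answer disagrees with the requested node.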

BTW, @Xingxing, where do you get the oc CLI builds?
Here is what I got:
https://mirror.openshift.com/pub/openshift-v3/clients/

Only these are available:
[DIR]	4.0.2/	2019-02-26 12:14	-	 
[DIR]	4.0.5/	2019-02-27 07:02	-

Comment 13 errata-xmlrpc 2019-06-04 10:44:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

