Bug 2015119

Summary: Getting error while using `oc debug -T node/NODENAME`
Product: OpenShift Container Platform Reporter: schugh
Component: oc Assignee: Maciej Szulik <maszulik>
oc sub component: oc QA Contact: zhou ying <yinzhou>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, calfonso, cgaynor, kiyyappa, kurathod, maszulik, mfojtik, pkhilare
Version: 4.7   
Target Milestone: ---   
Target Release: 4.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-11-22 07:19:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description schugh 2021-10-18 12:32:25 UTC
Description of problem:
- Observed intermittent errors while running the `oc debug -T node/NODE_NAME` command in a loop

Version-Release number of selected component (if applicable):
- Checked in 4.7 and 4.8

How reproducible:
- Random (Not always)

Steps to Reproduce:
- for i in {1..50}; do oc get nodes -o name | xargs -n 1 -i sh -c 'oc debug  -T {} -- chroot /host uptime';sleep 10; done

Actual results:
- The following errors appear intermittently on random nodes (not on specific nodes):
[1] error: unable to upgrade connection: container container-00 not found in pod worker-2<LAB>-debug_<NS>
[2] error: Internal error occurred: error attaching to container: container is not created or running

Expected results:
- The output of the command, in this case `uptime`
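
To quantify how often the errors occur, the combined output of the reproduction loop can be piped through a small counter. This is only a sketch: `count_attach_errors` is a hypothetical helper name, not part of `oc`, and the patterns are taken from the two error messages quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc): count the lines on stdin that match
# either of the two error messages quoted in this report.
count_attach_errors() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the helper's own exit status at 0.
  grep -c -E \
    'unable to upgrade connection: container .* not found in pod|error attaching to container: container is not created or running' \
    || true
}
```

For example, the loop from the steps above could be run with `2>&1 | count_attach_errors` appended after `done`, so that stderr is counted together with stdout.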

Comment 1 Maciej Szulik 2021-10-18 13:17:15 UTC
Can you provide more detailed output from those cases where this breaks?

Comment 20 Colum Gaynor 2022-10-01 11:40:57 UTC
@Maciej Szulik <maszulik>

See the original support case description of the issue and its effect on the end customer (Nokia NOM), copied below:

What problem/issue/behavior are you having trouble with?  What do you expect to see?
We had a requirement to launch a pod, execute the provided curl command, print the result, and exit (terminating the pod upon exit).
So we used the "oc run" command for this purpose, and the command looks like this:
oc run -it --rm --image=image-registry.openshift-image-registry.svc:5000/cal-shared-product/nmcal-helper-utils:v1.0 nmcal-helper-utils-123 -n cal-shared-product --restart=Never -- /bin/sh -c "<CURL_COMMAND>"

It works; however, we frequently see an error message printed during this operation, even though the command execution completes successfully.
The error message is (two slight variants):

 Error attaching, falling back to logs: Internal error occurred: error attaching to container: container is not created or running
 Error attaching, falling back to logs: unable to upgrade connection: container nmcal-helper-utils-123 not found in pod nmcal-helper-utils-123_cal-shared-product

Expectation:
When the operation succeeds, why does it throw an error?
This will create issues for us while processing the result. [CG: The bug creates issues for Nokia NOM's automation scripts.]

I have also attached files containing the logs (at log levels 7 and 8) for both the successful and the failing scenarios.

What is the business impact? Please also provide timeframe information.
Even though the command execution is successful, the error present in the output causes problems for our result processing.

Colum Gaynor - Senior partner Success Manager, Global Account
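
For scripts that parse this output, one possible workaround is to drop the notice line before processing the result. This is a minimal sketch, not something proposed in this bug: `strip_attach_notice` is a hypothetical helper name, and it assumes the notice appears on a line of its own in the captured output, which the case description suggests but does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical filter (not part of oc): remove the
# "Error attaching, falling back to logs: ..." notice from captured output
# and pass every other line through unchanged.
strip_attach_notice() {
  # grep -v exits non-zero when nothing is left to print, so "|| true"
  # keeps the filter from failing on output that contained only the notice.
  grep -v -E '^[[:space:]]*Error attaching, falling back to logs:' || true
}
```

The filter would be appended to the `oc run` invocation as `2>&1 | strip_attach_notice`. Whether the notice is written to stdout or stderr is not stated in the case, so both streams are merged here.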

Comment 28 Maciej Szulik 2022-10-19 11:18:04 UTC
I'm working on backports, PRs will be landing today.

Comment 30 Maciej Szulik 2022-10-19 11:28:05 UTC
As soon as https://github.com/openshift/oc/pull/1270 merges this should be available in 4.10

Comment 32 Colum Gaynor 2022-10-22 13:35:19 UTC
@Maciej Szulik <maszulik> ----> THANK YOU VERY MUCH. This made my week !

Colum Gaynor - Senior Partner Success Manager, Nokia Global Account

Comment 36 zhou ying 2022-10-27 04:24:05 UTC
With the merged PR, I can still reproduce this issue:

[root@localhost oc]# oc version  --client -oyaml 
clientVersion:
  buildDate: "2022-10-25T04:39:50Z"
  compiler: gc
  gitCommit: 8df677dc147fe8297d90c4757154469a931bdb90
  gitTreeState: clean
  gitVersion: 4.10.0-202210250416.p0.g8df677d.assembly.stream-8df677d
  goVersion: go1.17.12
  major: ""
  minor: ""
  platform: linux/amd64
releaseClientVersion: 4.10.39

[root@localhost oc]# git log
commit 8df677dc147fe8297d90c4757154469a931bdb90 (HEAD -> release-4.10, origin/release-4.10)
Merge: 442535c4d 39057a282
Author: OpenShift Merge Robot <openshift-merge-robot.github.com>
Date:   Thu Oct 20 09:20:56 2022 -0400

    Merge pull request #1270 from soltysh/bug2015119
    
    Bug 2015119: bump(k8s.io/kubectl) to pick up k/k#110764

for i in {1..50}; do oc get nodes -o name | xargs -n 1 -i sh -c 'oc debug  -T {} -- chroot /host uptime';sleep 10; done
xargs: warning: options --max-args and --replace/-I/-i are mutually exclusive, ignoring previous --max-args value
Starting pod/ip-10-0-131-116us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
 03:54:22 up  1:22,  0 users,  load average: 1.75, 1.76, 1.27

....
Removing debug pod ...
error: unable to upgrade connection: container container-00 not found in pod ip-10-0-203-69us-east-2computeinternal-debug_default
Starting pod/ip-10-0-219-219us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
 04:02:22 up  1:25,  0 users,  load average: 0.24, 0.29, 0.23

Removing debug pod ...
xargs: warning: options --max-args and --replace/-I/-i are mutually exclusive, ignoring previous --max-args value
Starting pod/ip-10-0-131-116us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
 04:02:37 up  1:30,  0 users,  load average: 0.86, 1.11, 1.13


Removing debug pod ...
Starting pod/ip-10-0-150-56us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.150.56
If you don't see a command prompt, try pressing enter.

Removing debug pod ...
error: unable to upgrade connection: container container-00 not found in pod ip-10-0-150-56us-east-2computeinternal-debug_default
Starting pod/ip-10-0-174-131us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
 04:06:36 up  1:29,  0 users,  load average: 0.00, 0.03, 0.05

Removing debug pod ...
Starting pod/ip-10-0-190-1us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
 04:06:40 up  1:35,  0 users,  load average: 1.41, 1.13, 0.97

Comment 38 Maciej Szulik 2022-11-09 09:30:50 UTC
The fix in this bug was only meant to change error #2 from the initial description, i.e.:

Error attaching, falling back to logs...

from an error to a warning.

The other error is correct: it explicitly points out that we started creating the connection before the container was available.

Based on the above, moving back to QA.
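
Since the remaining error reflects a race between the connection attempt and container startup, callers that run idempotent commands (such as `uptime`) could simply retry. The wrapper below is a rough sketch under stated assumptions: `retry_on_attach_race` and the `RETRY_DELAY` variable are hypothetical names, and it assumes the command exits non-zero when it prints the "unable to upgrade connection" message, which this report does not show explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of oc): rerun a command while it fails with
# the "unable to upgrade connection" message, up to MAX_ATTEMPTS times.
# Usage: retry_on_attach_race MAX_ATTEMPTS COMMAND [ARGS...]
retry_on_attach_race() {
  local max_attempts=$1
  shift
  local attempt output status
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    # Capture stdout and stderr together so the error text can be inspected.
    output=$("$@" 2>&1) && status=0 || status=$?
    # Stop on success, or on any failure that is not the attach race.
    if [[ $status -eq 0 || $output != *"unable to upgrade connection"* ]]; then
      printf '%s\n' "$output"
      return "$status"
    fi
    # Assumed back-off between attempts; defaults to 1 second.
    sleep "${RETRY_DELAY:-1}"
  done
  # All attempts hit the race: report the last output and exit status.
  printf '%s\n' "$output"
  return "$status"
}
```

A call might look like `retry_on_attach_race 3 oc debug -T node/NODE_NAME -- chroot /host uptime`. Retrying is only appropriate when running the command twice has no side effects.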

Comment 40 zhou ying 2022-11-10 05:35:03 UTC
Checked with:
oc version --client
Client Version: 4.10.41

The 'Error attaching' message no longer appears.
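
A script could check whether the local client is at least the version verified here before relying on the changed behaviour. The helper below is a sketch: `version_at_least` is a hypothetical name, it relies on `sort -V` (available in GNU coreutils), and comparing against 4.10.41 is only meaningful within the 4.10.z stream discussed in this bug.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc): succeed when dotted version $1 is
# greater than or equal to dotted version $2.
version_at_least() {
  # sort -V orders version strings numerically per component; if the smaller
  # of the two is the required version, the actual version is at least that.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

The client version can be extracted from the `oc version --client` output shown above, for example with `awk '/^Client Version:/ {print $3}'`, and passed as the first argument with `4.10.41` as the second.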

Comment 43 errata-xmlrpc 2022-11-22 07:19:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.42 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8496