Bug 1802639 - oc debug node/<node_name> does not work reliably
Summary: oc debug node/<node_name> does not work reliably
Keywords:
Status: CLOSED DUPLICATE of bug 1810136
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.z
Hardware: s390x
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard: multi-arch
Depends On:
Blocks:
 
Reported: 2020-02-13 15:25 UTC by wvoesch
Modified: 2020-03-26 15:36 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-26 15:36:12 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
output of dmesg from within the debug shell (18.39 KB, text/plain)
2020-03-17 16:11 UTC, wvoesch
no flags

Description wvoesch 2020-02-13 15:25:00 UTC
Description of problem:
oc debug node/<node_name> does not work reliably: after the debug pod is created, the shell in its container cannot always be accessed. The pod is removed with the error message “error: watch closed before UntilWithoutRetry timeout“, even though the affected nodes can still be pinged and reached via SSH.

This was tested in a cluster with 53 nodes in total. The test looped over all nodes several times. 
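
For reference, a reproduction loop along these lines can be used (a minimal sketch; the exact test script is not preserved in this report, and the chroot command run on each node is only illustrative):

for node in $(oc get nodes -o name); do
    oc debug "$node" -- chroot /host uname -a
done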

How reproducible:
Overall failure rate: 6.67% (19 / 285)

Failure rates in more detail:
Nodes 1 and 2:
100 % (6 / 6)
These nodes were marked as NotReady.
It was possible to ping the nodes.
Direct SSH into the nodes is possible (see the command sketch after the per-node breakdown).
After testing all nodes in cycle mode, I tried again several times with failure rates of 100 % (7/7) and 86 % (6/7).


Node 3:
83 % (5 / 6)
This node was marked as Ready.
It was possible to ping the node.
Direct SSH into the node is possible.
After testing all nodes in cycle mode, I tried again several times with a failure rate of 86 % (6/7).


Node 4: 
33 % (2 / 6)
This node was marked as Ready.
It was possible to ping the node.
Direct SSH into the node is possible.
After testing all nodes in cycle mode, I tried again several times with a failure rate of 100 % (7/7).


Nodes 5 – 53:
0 % failure rate
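
For reference, the reachability checks mentioned above can be done roughly as follows (the core user is the standard RHCOS login; the node address is a placeholder, and the exact commands used were not preserved):

# ping -c 3 <node_ip>
# ssh core@<node_ip>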

Actual results:
# oc debug node/<node-name>
Starting pod/<node-name-debug> ...
To use host binaries, run `chroot /host`

Removing debug pod ...
error: watch closed before UntilWithoutRetry timeout

Expected results:
# oc debug node/<node-name>
Starting pod/<node-name-debug>  ...
To use host binaries, run `chroot /host`
Pod IP: <pod-ip>
If you don't see a command prompt, try pressing enter.
sh-4.2#

Comment 1 Holger Wolf 2020-03-16 14:59:27 UTC
According to the new process, I will re-assign the bug to an OpenShift component.

Comment 2 Ryan Phillips 2020-03-16 18:31:10 UTC
This might be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1802687 and the fix https://github.com/openshift/origin/pull/24568 . Which version of RHCOS are you running?
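
(For reference, a node's RHCOS image can be read from the node object without a debug pod, e.g. with a jsonpath query along these lines, or from the OS-IMAGE column of # oc get nodes -o wide:

# oc get node <node_name> -o jsonpath='{.status.nodeInfo.osImage}'
)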

Comment 3 wvoesch 2020-03-17 16:11:23 UTC
Created attachment 1670841 [details]
output of dmesg from within the debug shell

Comment 4 wvoesch 2020-03-17 16:12:51 UTC
The RHCOS version used was:

rhcos-4.2.18-s390x

I retested, and the issue remains with a newer OCP version. It happened in 4 out of 17 tries.

new version:
Server Version: 4.2.20
Kubernetes Version: v1.14.6+681f635


I don't think it is related to the other bug, as the node's memory usage is only 13%. See the output of # oc describe no <node_name> below.

Furthermore, I attached the output of # dmesg from within the debug container. 

# oc describe no <node_name>
Name:               <node_name>
Roles:              worker
Labels:             beta.kubernetes.io/arch=s390x
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=s390x
                    kubernetes.io/hostname=<node_name>
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-0ebe48e122b82ec1a9d8d2b798b15e74
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-0ebe48e122b82ec1a9d8d2b798b15e74
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 14 Feb 2020 18:28:02 +0100
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 17 Mar 2020 16:45:04 +0100   Tue, 10 Mar 2020 13:11:07 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 17 Mar 2020 16:45:04 +0100   Tue, 10 Mar 2020 13:11:07 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 17 Mar 2020 16:45:04 +0100   Tue, 10 Mar 2020 13:11:07 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 17 Mar 2020 16:45:04 +0100   Tue, 10 Mar 2020 13:11:27 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  <node_ip>
  Hostname:    <node_name>
Capacity:
 cpu:            2
 hugepages-1Mi:  0
 memory:         8238524Ki
 pods:           250
Allocatable:
 cpu:            1500m
 hugepages-1Mi:  0
 memory:         7624124Ki
 pods:           250
System Info:
 Machine ID:                              91da5bf09f4e4ad49938293b20967eab
 System UUID:                             91da5bf09f4e4ad49938293b20967eab
 Boot ID:                                 b13a4075-dfed-4cfe-9e66-ca88e88716a1
 Kernel Version:                          4.18.0-147.el8.s390x
 OS Image:                                Red Hat Enterprise Linux CoreOS 42s390x.81.20200217.0 (Ootpa)
 Operating System:                        linux
 Architecture:                            s390x
 Container Runtime Version:               cri-o://1.14.12-10.dev.rhaos4.2.git313d784.el8
 Kubelet Version:                         v1.14.6+47933cbcc
 Kube-Proxy Version:                      v1.14.6+47933cbcc
Non-terminated Pods:                      (9 in total)
  Namespace                               Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                 ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator  tuned-wpnt2                          10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         7d3h
  openshift-dns                           dns-default-j6zn5                    110m (7%)     0 (0%)      70Mi (0%)        512Mi (6%)     7d4h
  openshift-image-registry                node-ca-qvbvp                        10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         7d4h
  openshift-machine-config-operator       machine-config-daemon-nlh8v          20m (1%)      0 (0%)      50Mi (0%)        0 (0%)         7d3h
  openshift-monitoring                    node-exporter-jpdht                  10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         7d4h
  openshift-monitoring                    telemeter-client-59f7947dc6-txcv9    10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         7d2h
  openshift-multus                        multus-hlv5g                         10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         7d4h
  openshift-sdn                           ovs-ml8dk                            200m (13%)    0 (0%)      400Mi (5%)       0 (0%)         7d4h
  openshift-sdn                           sdn-4rfrr                            100m (6%)     0 (0%)      200Mi (2%)       0 (0%)         7d4h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                480m (32%)   0 (0%)
  memory             970Mi (13%)  512Mi (6%)
  ephemeral-storage  0 (0%)       0 (0%)
Events:              <none>

Comment 5 Ryan Phillips 2020-03-26 15:36:12 UTC
This is usually due to an overloaded node (memory or CPU). There is a fix being propagated through the releases in BZ1810136.
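
For triage, node memory and CPU load can be checked with, for example (assuming the cluster metrics pipeline is available; shown as an illustration, not part of the original comment):

# oc adm top nodes
# oc describe node <node_name>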

*** This bug has been marked as a duplicate of bug 1810136 ***

