Bug 1941815 - From the web console the terminal can no longer connect after leaving and returning to the terminal view
Summary: From the web console the terminal can no longer connect after leaving a...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: s390x
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Kir Kolyshkin
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks: ocp-48-z-tracker
 
Reported: 2021-03-22 20:26 UTC by jhusta
Modified: 2021-07-27 22:55 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:55:03 UTC
Target Upstream Version:
Embargoed:


Attachments
screen shot of broken terminal (117.50 KB, application/pdf), 2021-03-23 12:58 UTC, jhusta
Updated Screen shot including worker node (143.42 KB, application/pdf), 2021-03-23 20:00 UTC, jhusta
screen recording (10.42 MB, application/zip), 2021-03-24 09:54 UTC, Yadan Pei


Links:
Red Hat Product Errata RHSA-2021:2438, last updated 2021-07-27 22:55:16 UTC

Description jhusta 2021-03-22 20:26:33 UTC
Description of problem: In the web console, after connecting to a pod's terminal, if I switch over to the Logs tab, or leave and come back to the Terminal tab, it shows the terminal as closed. I click Reconnect and it does not reconnect. I have tried this with different pods and get the same result every time.


Version-Release number of selected component (if applicable):
OCP 4.8 build 4.8.0-0.nightly-s390x-2021-03-22-155743 


How reproducible:
Always (tried with several different pods)

Steps to Reproduce:
1. Create a deployment that runs stress-ng with an I/O workload.
2. After the container is created, open Pods -> pod details and check the Logs tab.
3. Switch to the Terminal tab and run iostat.
4. Leave the terminal view (go to another project or back to the Logs tab), then return to the Terminal tab. The failure occurs at that point.

Actual results:
The terminal disconnects and can no longer connect.

Expected results:
Switching back to the Terminal tab should reconnect to the terminal.

Additional info:
Error message when I try to go back to the terminal tab:
ERRO[0000] exec failed: container_linux.go:367: starting container process caused: open /dev/pts/4294967296: no such file or directory 
command terminated with non-zero exit code: exit status 1
The terminal connection has closed.
connecting to openshift-apiserver

My workload gives the same output:
ERRO[0000] exec failed: container_linux.go:367: starting container process caused: open /dev/pts/4294967296: no such file or directory 
command terminated with non-zero exit code: exit status 1
The terminal connection has closed.
connecting to iomixstress1

I am not sure which logs to provide so please let me know what additional info you will need.

Comment 1 Jakub Hadvig 2021-03-23 08:56:02 UTC
Johanna, could you please attach at least a screenshot of the error? I'm not sure whether the error appears in the console or is printed in the terminal.
I checked for `container_linux.go` in our codebase and there is no such file, so I have a feeling the error comes from Kubernetes, OpenShift, or the pod itself.

Comment 2 jhusta 2021-03-23 12:58:44 UTC
Created attachment 1765518 [details]
screen shot of broken terminal

Comment 3 jhusta 2021-03-23 20:00:02 UTC
Created attachment 1765708 [details]
Updated Screen shot including worker node

This screenshot includes the message shown when trying to connect to a worker node's terminal. In this case it did not work even on the first try.

Comment 4 Yadan Pei 2021-03-24 09:54:31 UTC
Created attachment 1765878 [details]
screen recording

Comment 5 Yadan Pei 2021-03-24 09:59:31 UTC
I tried on 4.8.0-0.nightly-2021-03-22-104536; re-visiting the pod terminal works for me. See the attached screen recording.

Comment 6 jhusta 2021-03-24 14:06:00 UTC
Thanks Yadan. I tried the same pod on my KVM environment, which is running build 4.8.0-0.nightly-s390x-2021-03-22-155743 (newer), so I would expect the same results. Unfortunately, I hit the issue as before. I am running on a z15; not sure if that matters. What are you running on? Are there any specific logs I can look at to see why the terminal keeps breaking for me? Thank you for your help.

Comment 7 Stefan Orth 2021-03-25 14:57:41 UTC
I hit the same issue on a z14 with z/VM, also on version 4.8.0-0.nightly-s390x-2021-03-22-155743. I was able to access the terminal once, right after the pod was created. After some time, I got the error mentioned above.

Comment 8 Stefan Orth 2021-03-25 16:02:28 UTC
Same result with oc rsh or oc exec:

Directly after pod deployment:

[root@m3558001 4_8]# oc create -f 4_8_LSO_POD_worker001_MP.yaml
pod/so-test02 created
[root@m3558001 4_8]# oc rsh so-test02
sh-4.2# 

After a couple of minutes, the session is terminated and I am not able to log in again:


[root@m3558001 4_8]# oc rsh so-test02
ERRO[0000] exec failed: container_linux.go:367: starting container process caused: open /dev/pts/4294967296: no such file or directory 
command terminated with exit code 1

Comment 9 Yadan Pei 2021-03-26 01:42:26 UTC
This seems to be a general pod issue. I would like to check the pod status with a couple of commands to confirm why the container stopped:

* oc describe pod so-test02
* oc logs -f so-test02

Comment 10 Stefan Orth 2021-03-26 09:53:23 UTC
[root@m3558001 4_8]# oc describe pod so-test02
Name:         so-test02
Namespace:    default
Priority:     0
Node:         worker-001.m3558001.lnxne.boe/10.107.1.56
Start Time:   Fri, 26 Mar 2021 10:30:50 +0100
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.36"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.128.2.36"
                    ],
                    "default": true,
                    "dns": {}
                }]
Status:       Running
IP:           10.128.2.36
IPs:
  IP:  10.128.2.36
Containers:
  solsotest02:
    Container ID:   cri-o://6bd322324327be641668ae7c8e0a89ce0eae0dd5ad6437fbce98bd0962129b12
    Image:          sys-loz-test-team-docker-local.artifactory.swg-devops.com/s390x_blank_base_image:3.0
    Image ID:       sys-loz-test-team-docker-local.artifactory.swg-devops.com/s390x_blank_base_image@sha256:62403430158da217a0056ac0b4f8dad258d06fba132bc59a0f96aaebbff69106
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 26 Mar 2021 10:30:53 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /lsodata from localpvcso2 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dlhjm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  localpvcso2:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  localstorage-mp
    ReadOnly:   false
  default-token-dlhjm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dlhjm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       17m   default-scheduler  Successfully assigned default/so-test02 to worker-001.m3558001.lnxne.boe
  Normal  AddedInterface  17m   multus             Add eth0 [10.128.2.36/23]
  Normal  Pulled          17m   kubelet            Container image "sys-loz-test-team-docker-local.artifactory.swg-devops.com/s390x_blank_base_image:3.0" already present on machine
  Normal  Created         17m   kubelet            Created container solsotest02
  Normal  Started         17m   kubelet            Started container solsotest02




No logs available for the pod.

Comment 11 Jakub Hadvig 2021-03-29 07:40:31 UTC
Moving to the Node team since this does not look like a Console issue, but rather a general one.

Comment 12 Holger Wolf 2021-03-29 12:25:35 UTC
Raising the severity of the bug since a base function is currently not working.

Comment 13 Peter Hunt 2021-04-01 16:20:39 UTC
This looks suspiciously like https://github.com/moby/moby/issues/36467,
which was supposed to be fixed by https://github.com/opencontainers/runc/pull/1727/

Kir, can you PTAL? It looks like a runc regression.

Comment 14 Dan Li 2021-04-06 12:04:57 UTC
Hi Kir and Node team, this bug blocks a base function, as discovered by the multi-arch s390x team. Can we consider setting the "Blocker+" flag on it?

Comment 15 jhusta 2021-04-06 13:50:52 UTC
Hi, this is also blocking workload development on 4.8. I had to go back to 4.7 to work around the issue. Anything we can do to move this one along would be great! Thank you.

Comment 16 Stefan Orth 2021-04-06 15:06:18 UTC
It is also blocking me from creating a pod and doing some testing with block storage and other things like iSCSI, which I have to check inside the pod.

Comment 17 Kir Kolyshkin 2021-04-06 17:10:07 UTC
> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: open /dev/pts/4294967296: no such file or directory 

The number 4294967296 is -1. It seems that someone passes -1 to runc, and instead of treating it as an error, runc fails to open it.

So the bug is at an upper level (but I will take a look at how runc interprets it -- maybe it needs to provide a more sensible error).
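
For illustration only (a rough Go sketch, not runc's actual code), the kind of up-front check I mean, so a bogus pts number surfaces as a descriptive error instead of a bare ENOENT from open(2); openPts here is a hypothetical helper:

package main

import (
	"fmt"
	"os"
)

// openPts is a hypothetical helper: it rejects obviously bogus pts numbers
// before calling open(2), so the caller sees a clear message instead of
// "open /dev/pts/4294967296: no such file or directory".
func openPts(n uint64) (*os.File, error) {
	// No kernel hands out pts indexes anywhere near 2^32; a value like that
	// almost certainly means the layer that allocated the terminal passed a
	// corrupted number down to us.
	if n > 1<<20 {
		return nil, fmt.Errorf("implausible pts number %d: likely a corrupted value from the caller", n)
	}
	path := fmt.Sprintf("/dev/pts/%d", n)
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return nil, fmt.Errorf("opening console %s: %w", path, err)
	}
	return f, nil
}

func main() {
	if _, err := openPts(4294967296); err != nil {
		fmt.Println(err) // implausible pts number 4294967296: ...
	}
}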

Comment 18 Kir Kolyshkin 2021-04-06 18:01:08 UTC
> The number 4294967296 is -1.

I was wrong; it is 1 shifted left by 32 bits.

(gdb) p /x 4294967296
$1 = 0x100000000
(gdb) p 1ULL << 32
$2 = 4294967296

Peter's analysis is correct. This is a regression in containerd/console, which was once fixed by https://github.com/containerd/console/pull/20
and then broken again (most probably by https://github.com/containerd/console/commit/f1b333f2c5050f2c71fcf782caa0b7ccb540bfcb).

Proposed fix: https://github.com/containerd/console/pull/51
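
For illustration, a minimal Go sketch of that class of bug (this is not the actual containerd/console code): TIOCGPTN writes a 32-bit pts index, and if the result is read into a 64-bit word, a big-endian machine such as s390x ends up with the value in the upper half, i.e. n << 32:

package main

import (
	"fmt"
	"os"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// Open a pty master; TIOCGPTN asks the kernel for the slave's pts index.
	master, err := os.OpenFile("/dev/ptmx", os.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer master.Close()

	// Correct: the kernel writes exactly 32 bits, so read into a uint32.
	var ok uint32
	if _, _, errno := unix.Syscall(unix.SYS_IOCTL, master.Fd(), unix.TIOCGPTN,
		uintptr(unsafe.Pointer(&ok))); errno != 0 {
		panic(errno)
	}

	// Wrong on big-endian: reading into a 64-bit word leaves the 32-bit result
	// in the upper half, so pts 1 shows up as 4294967296 (1 << 32).
	var bad uint64
	if _, _, errno := unix.Syscall(unix.SYS_IOCTL, master.Fd(), unix.TIOCGPTN,
		uintptr(unsafe.Pointer(&bad))); errno != 0 {
		panic(errno)
	}

	fmt.Printf("uint32 read: /dev/pts/%d\n", ok)
	fmt.Printf("uint64 read: /dev/pts/%d (garbage on s390x)\n", bad)
}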

Comment 19 Kir Kolyshkin 2021-04-06 18:54:46 UTC
Filed runc issue: https://github.com/opencontainers/runc/issues/2896

Hope we'll be able to fix this in time for rc94.

Comment 20 Kir Kolyshkin 2021-04-06 19:55:28 UTC
@jhusta is there a way for me to get access to an s390 environment? I'd like to run some tests. No need to have anything installed; just bare Linux is fine.

Comment 21 jhusta 2021-04-06 20:17:05 UTC
Maybe reach out to Prashanth Sundararaman from Red Hat support and see if there are Red Hat-internal s390x resources you can get access to?

Comment 22 Dan Li 2021-04-06 20:36:13 UTC
Cc'ing Prashanth from multi-arch for awareness per Comment 21 and Comment 20

Comment 23 Dan Li 2021-04-07 14:13:33 UTC
Cc'ing Doug Slavens from our multi-arch team for availability of s390x environment for Kir

Comment 24 Prashanth Sundararaman 2021-04-07 14:33:20 UTC
I was able to test Kir's fix on my s390x environment with the Go tests, and it works fine.

Comment 25 Kir Kolyshkin 2021-04-15 21:14:31 UTC
This is fixed by https://github.com/opencontainers/runc/pull/2898, which is merged; the fix will be available in runc rc94.

I have backported it to the rhaos-4.8 branch: https://github.com/projectatomic/runc/pull/46

Comment 26 Stefan Orth 2021-04-19 15:44:49 UTC
I tried it on:

Client Version: 4.8.0-0.nightly-s390x-2021-04-19-052408
Server Version: 4.8.0-0.nightly-s390x-2021-04-19-052408
Kubernetes Version: v1.21.0-rc.0+2993be8

[core@worker-01 ~]$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="48.84.202104171019-0"

and it looks fine to me. When I "exit" the terminal it reconnects right away, and jumping off and back onto the Terminal tab also works fine.

Comment 28 jhusta 2021-04-21 20:34:23 UTC
Server Version: 4.8.0-0.nightly-s390x-2021-04-21-170513
Kubernetes Version: v1.21.0-rc.0+3ced7a9

I tested from the CLI and the console, and the terminal is now working with no issues. Thank you.

Comment 29 MinLi 2021-04-23 09:47:07 UTC
Verified on version 4.8.0-0.nightly-2021-04-22-225832

Comment 32 errata-xmlrpc 2021-07-27 22:55:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

