Bug 1549259
| Summary: | [CRI-O] Cannot run oc cp or oc exec against some pods. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jason Montleon <jmontleo> |
| Component: | Containers | Assignee: | Mrunal Patel <mpatel> |
| Status: | CLOSED ERRATA | QA Contact: | weiwei jiang <wjiang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.9.0 | CC: | amurdaca, anli, aos-bugs, chezhang, decarr, dma, jmatthew, jmontleo, jokerman, mfojtik, mmccomas, sjenning, wmeng, xxia, zitang |
| Target Milestone: | --- | | |
| Target Release: | 3.9.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1549019 | Environment: | |
| Last Closed: | 2018-08-09 22:13:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1549683 | | |
| Bug Blocks: | 1544606, 1549019 | | |
Description
Jason Montleon
2018-02-26 20:11:14 UTC
Just to make sure I am understanding correctly: is `oc cp ...` failing on all of the pods, or just the ones you were unable to `oc exec` into without using "-it"? Does `oc cp` just hang, or does it also give you an error? Also, could you provide --loglevel 8 output if possible?

Juan, the same pods were hanging with oc cp and oc exec without -it. It looks like you were able to get onto the QE host, but let me know if there's anything further I can provide.

Some observations; I'm not sure if they're related to the issue:

I see that there is a connection upgrade happening using SPDY when I do exec without -it and cp:

    I0227 09:58:03.299332  124987 round_trippers.go:442] Response Headers:
    I0227 09:58:03.299337  124987 round_trippers.go:445]     Connection: Upgrade
    I0227 09:58:03.299341  124987 round_trippers.go:445]     Upgrade: SPDY/3.1
    I0227 09:58:03.299345  124987 round_trippers.go:445]     X-Stream-Protocol-Version: v4.channel.k8s.io
    I0227 09:58:03.299349  124987 round_trippers.go:445]     Date: Tue, 27 Feb 2018 14:58:03 GMT

The connection always seems to hang after this on the affected pods.

Also, if I oc describe the origin-web-console pod I see:

    Warning  Unhealthy  3m (x384 over 1h)  kubelet, 172.16.120.58  Liveness probe errored: rpc error: code = Unknown desc = command error: exec failed: cannot allocate tty if runc will detach without setting console socket, stdout: , stderr: , exit code -1

I can see the same error message from runc if I do -d with --tty=true:

    # runc exec -d --tty=true e96650303da6b8444919cc0c8dd6e377c9aa5cba80ad1eae978de561fa845d9d ps
    exec failed: cannot allocate tty if runc will detach without setting console socket

(In reply to Jason Montleon from comment #8)
> Some observations; I'm not sure if they're related to the issue:
>
> I see that there is a connection upgrade happening using SPDY when I do exec
> without -it and cp:
>
> I0227 09:58:03.299332  124987 round_trippers.go:442] Response Headers:
> I0227 09:58:03.299337  124987 round_trippers.go:445]     Connection: Upgrade
> I0227 09:58:03.299341  124987 round_trippers.go:445]     Upgrade: SPDY/3.1
> I0227 09:58:03.299345  124987 round_trippers.go:445]     X-Stream-Protocol-Version: v4.channel.k8s.io
> I0227 09:58:03.299349  124987 round_trippers.go:445]     Date: Tue, 27 Feb 2018 14:58:03 GMT
>
> The connection always seems to hang after this on the affected pods.

Yeah, I noticed that the reason it kept hanging there was that two goroutines blocked, waiting to copy remote stdout and stderr. Not entirely sure why this is happening yet.

> Also, if I oc describe the origin-web-console I see:
>
> Warning Unhealthy 3m (x384 over 1h) kubelet, 172.16.120.58 Liveness
> probe errored: rpc error: code = Unknown desc = command error: exec failed:
> cannot allocate tty if runc will detach without setting console socket
> , stdout: , stderr: , exit code -1

Thanks for providing this info. I did find this doc [1] and related issues [2][3].

1. https://github.com/opencontainers/runc/blob/v1.0.0-rc4/README.md#running-containers
2. https://github.com/opencontainers/runc/issues/1580
3. https://github.com/opencontainers/runc/issues/1202

> I can see the same error message from runc if I do -d with --tty=true:
>
> # runc exec -d --tty=true
> e96650303da6b8444919cc0c8dd6e377c9aa5cba80ad1eae978de561fa845d9d ps
> exec failed: cannot allocate tty if runc will detach without setting console
> socket

Thanks, tagging Derek for his input on this.
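For context on the "cannot allocate tty" error above: runc refuses to allocate a terminal for a detached exec unless the caller provides a console socket to receive the pty master, which is what the runc README [1] describes. The sketch below is a hypothetical illustration of that requirement, not a command taken from this bug; it assumes runc's bundled recvtty helper (contrib/cmd/recvtty) has been built, and the socket path and container ID are placeholders.

```sh
# Reproducing the error: a detached exec with a TTY but no console socket fails.
runc exec -d --tty=true <container-id> ps
# => exec failed: cannot allocate tty if runc will detach without setting console socket

# Hypothetical sketch: start a receiver for the pty master first, then point runc at it.
recvtty /tmp/console.sock &
runc exec -d --tty=true --console-socket /tmp/console.sock <container-id> ps
```

This is why the kubelet's exec-based liveness probe against the origin-web-console pod hits the same error path as a manual runc exec with -d and a TTY.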
Looks like a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1549683

Seth, is that relevant only to the errors I saw on the origin-web-console pod, or is it likely to be causing the issues exec'ing into the other pods as well?

Jason, see comment 3, https://bugzilla.redhat.com/show_bug.cgi?id=1549683#c3, which has the same error:

    rpc error: code = Unknown desc = command error: exec failed: cannot allocate tty if runc will detach without setting console socket

The PR for the linked related issue from comment 13 has merged. Marking as ON_QA.

Upstream PR: https://github.com/kubernetes-incubator/cri-o/pull/1386

Hit this issue on crio version 1.9.7 when running oc exec against my logging-es pod.

Yesterday I checked Anping Li's env and double-confirmed the result. In project "logging", all pods had the hanging issue when I ran oc cp and oc exec without -it (pods in other projects, e.g. "default", did not). Today I rebuilt the env with the latest crio 1.9.8, which includes cri-o/pull/1386 above, but the issue still exists in some pods:

    # oc exec -n logging logging-es-data-master-fbhpmbzx-1-zcdtq date   # the first time it DOES NOT hang, but later it always hangs
    Defaulting container name to elasticsearch.
    Use 'oc describe pod/logging-es-data-master-fbhpmbzx-1-zcdtq -n logging' to see all of the containers in this pod.
    ^C
    # oc cp pod.yaml logging-es-data-master-fbhpmbzx-1-zcdtq:/tmp/c.yaml
    ^C

The other pod replica, however, DOES NOT hang:

    # oc exec -n logging logging-es-data-master-63xl9a5p-1-6mg4n date
    Defaulting container name to elasticsearch.
    Use 'oc describe pod/logging-es-data-master-63xl9a5p-1-6mg4n -n logging' to see all of the containers in this pod.
    Mon Mar 5 13:08:10 UTC 2018
    #

Could the oc cp issues be caused by this overlay/tar problem? It states that this most likely affects overlay and overlay2, and the QE hosts are configured with 'DOCKER_STORAGE_OPTIONS="--storage-driver overlay2 "':

https://github.com/moby/moby/issues/19647

I'm wondering if the terminal issues start after a cp fails.

*** Bug 1550644 has been marked as a duplicate of this bug. ***

@xxia thanks for testing the patch from BZ1549683. I am marking this bug as dependent on https://bugzilla.redhat.com/show_bug.cgi?id=1549683. Moving to the Containers team.

BZ#1550644 was marked a duplicate of this, but per comment 10 on that bug the problem is reproducible on QE hosts using docker as well. I don't see how, then, it can be caused by the issue in BZ#1549683 alone; a crio bug shouldn't impact docker environments.

I have created https://github.com/kubernetes-incubator/cri-o/pull/1443 as a fix for this. It will be available in cri-o 1.9.10.

QE's feedback on https://bugzilla.redhat.com/show_bug.cgi?id=1549019 suggests 1.9.10 in the latest puddle fixed the issue.

Checked with cri-o://1.9.14 on:

    # oc version
    oc v3.9.38
    kubernetes v1.9.1+a0ce1bc657
    features: Basic-Auth GSSAPI Kerberos SPNEGO

    Server https://ip-172-18-8-232.ec2.internal:8443
    openshift v3.9.38
    kubernetes v1.9.1+a0ce1bc657

I cannot reproduce this issue, so moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2335
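For anyone re-verifying the fix on a similar environment, a minimal check sketch is below. The node and pod names are placeholders, the `rpm -q cri-o` line assumes an RPM-based install, and the pod.yaml source file is arbitrary; the point is that an exec without -it and a cp should return instead of hanging once the node runs cri-o 1.9.10 or later.

```sh
# Confirm the node is running a cri-o build that contains the fix (>= 1.9.10).
rpm -q cri-o                                                  # assumes an RPM-based install
oc describe node <node-name> | grep "Container Runtime Version"

# With the current project set to "logging", neither command should hang.
oc exec -n logging <es-pod-name> date
oc cp pod.yaml <es-pod-name>:/tmp/c.yaml
```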