QE has a 3.9.0 environment set up with crio where it seems that I cannot oc cp or oc exec without using -it.

This works:
# oc exec -n pg-test postgresql-9.6-dev-1-d785k -- /bin/bash -c "pg_dumpall -f /tmp/db.dump"

This does not:
# oc exec -n post-96-dev postgresql-9.6-dev-1-n8br8 -- /bin/bash -c "pg_dumpall -f /tmp/db.dump"

Those are the same command run against different pods. This works, though, with -it added:
# oc exec -it -n post-96-dev postgresql-9.6-dev-1-n8br8 -- /bin/bash -c "pg_dumpall -f /tmp/db.dump"

I tried -i and -t individually and neither appeared to make a difference on its own.

oc cp commands hang as well. I can't do:
# oc cp -n post-96-dev postgresql-9.6-dev-1-n8br8:/tmp/db.dump ./db.dump

Usually it hangs without output. Occasionally it gives an error:
error: archive/tar: invalid tar header

I was trying to dig around the logs and get verbose output to help diagnose the issue, but I didn't come up with much. Please let me know if there is somewhere/some way I can provide better information.
Just to make sure I am understanding correctly, is `oc cp ...` failing on all of the pods, or just the ones you were unable to `oc exec` into without using "-it"? Does `oc cp` just hang, or does it also give you an error?
Also, could you provide --loglevel 8 output if possible?
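For example, something along these lines should capture the client-side request/response trace (pod and namespace names taken from comment 0; the verbose output goes to stderr, so it is redirected to a file here):

# oc --loglevel 8 exec -n post-96-dev postgresql-9.6-dev-1-n8br8 -- /bin/bash -c "pg_dumpall -f /tmp/db.dump" 2> oc-exec-loglevel8.log
# oc --loglevel 8 cp -n post-96-dev postgresql-9.6-dev-1-n8br8:/tmp/db.dump ./db.dump 2> oc-cp-loglevel8.log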
Juan, The same pods were hanging with oc cp and oc exec without -it. It looks like you were able to get onto the qe host, but let me know if there's anything further that I can provide.
Some observations. I'm not sure if they're related to the issue:

I see that there is a connection upgrade happening using SPDY when I do exec without -it, and also with cp:

I0227 09:58:03.299332 124987 round_trippers.go:442] Response Headers:
I0227 09:58:03.299337 124987 round_trippers.go:445] Connection: Upgrade
I0227 09:58:03.299341 124987 round_trippers.go:445] Upgrade: SPDY/3.1
I0227 09:58:03.299345 124987 round_trippers.go:445] X-Stream-Protocol-Version: v4.channel.k8s.io
I0227 09:58:03.299349 124987 round_trippers.go:445] Date: Tue, 27 Feb 2018 14:58:03 GMT

The connection always seems to hang after this on the affected pods.

Also, if I oc describe the origin-web-console pod I see:

Warning  Unhealthy  3m (x384 over 1h)  kubelet, 172.16.120.58  Liveness probe errored: rpc error: code = Unknown desc = command error: exec failed: cannot allocate tty if runc will detach without setting console socket, stdout: , stderr: , exit code -1

I can see the same error message from runc if I do -d with --tty=true:

# runc exec -d --tty=true e96650303da6b8444919cc0c8dd6e377c9aa5cba80ad1eae978de561fa845d9d ps
exec failed: cannot allocate tty if runc will detach without setting console socket
(In reply to Jason Montleon from comment #8)
> Some observations. I'm not sure if they're related to the issue:
>
> I see that there is a connection upgrade happening using SPDY when I do exec
> without -it and cp:
>
> I0227 09:58:03.299332 124987 round_trippers.go:442] Response Headers:
> I0227 09:58:03.299337 124987 round_trippers.go:445] Connection: Upgrade
> I0227 09:58:03.299341 124987 round_trippers.go:445] Upgrade: SPDY/3.1
> I0227 09:58:03.299345 124987 round_trippers.go:445] X-Stream-Protocol-Version: v4.channel.k8s.io
> I0227 09:58:03.299349 124987 round_trippers.go:445] Date: Tue, 27 Feb 2018 14:58:03 GMT
>
> The connection always seems to hang after this on the affected pods.

Yeah, I noticed that the reason it kept hanging there was that two goroutines were blocked waiting to copy the remote stdout and stderr. I'm not entirely sure why this is happening yet.

> Also, if I oc describe the origin-web-console I see:
>
> Warning  Unhealthy  3m (x384 over 1h)  kubelet, 172.16.120.58  Liveness
> probe errored: rpc error: code = Unknown desc = command error: exec failed:
> cannot allocate tty if runc will detach without setting console socket,
> stdout: , stderr: , exit code -1

Thanks for providing this info. I did find this doc [1] and related issues [2][3].

1. https://github.com/opencontainers/runc/blob/v1.0.0-rc4/README.md#running-containers
2. https://github.com/opencontainers/runc/issues/1580
3. https://github.com/opencontainers/runc/issues/1202

> I can see the same error message from runc if I do -d with --tty=true:
>
> # runc exec -d --tty=true e96650303da6b8444919cc0c8dd6e377c9aa5cba80ad1eae978de561fa845d9d ps
> exec failed: cannot allocate tty if runc will detach without setting console socket

Thanks, tagging Derek for his input on this.
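For reference, the runc README above describes the console-socket requirement: a detached exec that allocates a tty needs an external process to receive the console master over a unix socket, for example the recvtty helper shipped in runc's contrib/cmd/recvtty. A rough sketch of what that looks like (container ID copied from comment 8; this assumes recvtty has been built from the runc source tree, since it is not normally installed on the host):

# ./recvtty /tmp/console.sock &
# runc exec -d --tty --console-socket /tmp/console.sock e96650303da6b8444919cc0c8dd6e377c9aa5cba80ad1eae978de561fa845d9d ps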
Looks like a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1549683
Seth, is that relevant only to the errors I saw on the origin-web-console pod, or is it likely to be causing the issues exec'ing into the other pods as well?
Jason, see comment 3: https://bugzilla.redhat.com/show_bug.cgi?id=1549683#c3

It has the same error:

rpc error: code = Unknown desc = command error: exec failed: cannot allocate tty if runc will detach without setting console socket
The PR for the related issue linked from comment 13 has merged. Marking as ON_QA.

Upstream PR: https://github.com/kubernetes-incubator/cri-o/pull/1386
Hit this issue on crio version 1.9.7 when running oc exec against my logging-es pod.
Yesterday I checked Anping Li's env and confirmed the result. In the "logging" project, all pods had the hanging issue when I ran oc cp and oc exec without -it (pods in other projects, e.g. "default", did not).

Today I rebuilt the env with the latest crio 1.9.8, which includes the above cri-o/pull/1386, but the issue still exists for some pods:

# oc exec -n logging logging-es-data-master-fbhpmbzx-1-zcdtq date   # the first time it does NOT hang, but later it always hangs
Defaulting container name to elasticsearch.
Use 'oc describe pod/logging-es-data-master-fbhpmbzx-1-zcdtq -n logging' to see all of the containers in this pod.
^C
# oc cp pod.yaml logging-es-data-master-fbhpmbzx-1-zcdtq:/tmp/c.yaml
^C

The other pod replica, however, does NOT hang:

# oc exec -n logging logging-es-data-master-63xl9a5p-1-6mg4n date
Defaulting container name to elasticsearch.
Use 'oc describe pod/logging-es-data-master-63xl9a5p-1-6mg4n -n logging' to see all of the containers in this pod.
Mon Mar 5 13:08:10 UTC 2018
#
Could the oc cp issues be caused by this overlay/tar problem? It states there that this most likely affects overlay and overlay2, and the QE hosts are configured with 'DOCKER_STORAGE_OPTIONS="--storage-driver overlay2 "':

https://github.com/moby/moby/issues/19647

I'm wondering if the terminal issues start after a cp fails.
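For what it's worth, oc cp is essentially a tar stream sent over the same exec channel, so a manual equivalent like the one below might help separate the exec hang from the storage-driver/tar question (pod and file names taken from comment 0; this is a sketch of the idea, not the exact command oc runs internally):

# oc exec -n post-96-dev postgresql-9.6-dev-1-n8br8 -- tar cf - -C /tmp db.dump | tar xf - -C .

If this also hangs or produces a corrupt archive, that would point at the stream itself rather than oc cp's tar handling.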
*** Bug 1550644 has been marked as a duplicate of this bug. ***
@xxia thanks for testing the patch from BZ1549683. I am marking this bug as dependent on https://bugzilla.redhat.com/show_bug.cgi?id=1549683. Moving to Container team.
BZ#1550644 was marked as a duplicate of this bug, but per comment 10 on that bug this is reproducible on QE hosts using docker as well. I don't see how it can then be caused by the issue in BZ#1549683 alone; a crio bug shouldn't impact docker environments, should it?
I have created https://github.com/kubernetes-incubator/cri-o/pull/1443 as a fix for this. It will be available in cri-o 1.9.10.
QE's feedback on https://bugzilla.redhat.com/show_bug.cgi?id=1549019 suggests 1.9.10 in the latest puddle fixed the issue.
Checked with cri-o://1.9.14 on:

# oc version
oc v3.9.38
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-8-232.ec2.internal:8443
openshift v3.9.38
kubernetes v1.9.1+a0ce1bc657

I cannot reproduce this issue, so moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2335