Bug 1549259
| Summary: | [CRI-O] Cannot run oc cp or oc exec against some pods. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jason Montleon <jmontleo> |
| Component: | Containers | Assignee: | Mrunal Patel <mpatel> |
| Status: | CLOSED ERRATA | QA Contact: | weiwei jiang <wjiang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.9.0 | CC: | amurdaca, anli, aos-bugs, chezhang, decarr, dma, jmatthew, jmontleo, jokerman, mfojtik, mmccomas, sjenning, wmeng, xxia, zitang |
| Target Milestone: | --- | | |
| Target Release: | 3.9.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1549019 | Environment: | |
| Last Closed: | 2018-08-09 22:13:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1549683 | | |
| Bug Blocks: | 1544606, 1549019 | | |
Description
Jason Montleon
2018-02-26 20:11:14 UTC
Just to make sure I am understanding correctly: is `oc cp ...` failing on all of the pods, or just the ones you were unable to `oc exec` into without using "-it"? Does `oc cp` just hang, or does it also give you an error? Also, could you provide --loglevel 8 output if possible?

Juan, the same pods were hanging with oc cp and oc exec without -it. It looks like you were able to get onto the QE host, but let me know if there's anything further I can provide.

Some observations; I'm not sure if they're related to the issue:

I see that there is a connection upgrade happening using SPDY when I do exec without -it and cp:

    I0227 09:58:03.299332  124987 round_trippers.go:442] Response Headers:
    I0227 09:58:03.299337  124987 round_trippers.go:445]     Connection: Upgrade
    I0227 09:58:03.299341  124987 round_trippers.go:445]     Upgrade: SPDY/3.1
    I0227 09:58:03.299345  124987 round_trippers.go:445]     X-Stream-Protocol-Version: v4.channel.k8s.io
    I0227 09:58:03.299349  124987 round_trippers.go:445]     Date: Tue, 27 Feb 2018 14:58:03 GMT

The connection always seems to hang after this on the affected pods.

Also, if I oc describe the origin-web-console pod I see:

    Warning  Unhealthy  3m (x384 over 1h)  kubelet, 172.16.120.58  Liveness probe errored: rpc error: code = Unknown desc = command error: exec failed: cannot allocate tty if runc will detach without setting console socket, stdout: , stderr: , exit code -1

I can see the same error message from runc if I do -d with --tty=true:

    # runc exec -d --tty=true e96650303da6b8444919cc0c8dd6e377c9aa5cba80ad1eae978de561fa845d9d ps
    exec failed: cannot allocate tty if runc will detach without setting console socket

(In reply to Jason Montleon from comment #8)
> Some observations; I'm not sure if they're related to the issue:
>
> I see that there is a connection upgrade happening using SPDY when I do exec
> without -it and cp:
>
> I0227 09:58:03.299332  124987 round_trippers.go:442] Response Headers:
> I0227 09:58:03.299337  124987 round_trippers.go:445]     Connection: Upgrade
> I0227 09:58:03.299341  124987 round_trippers.go:445]     Upgrade: SPDY/3.1
> I0227 09:58:03.299345  124987 round_trippers.go:445]     X-Stream-Protocol-Version: v4.channel.k8s.io
> I0227 09:58:03.299349  124987 round_trippers.go:445]     Date: Tue, 27 Feb 2018 14:58:03 GMT
>
> The connection always seems to hang after this on the affected pods.

Yeah, I noticed that the reason it kept hanging there was that two goroutines blocked, waiting to copy remote stdout and stderr. Not entirely sure why this is happening yet.

> Also, if I oc describe the origin-web-console I see:
>
> Warning Unhealthy 3m (x384 over 1h) kubelet, 172.16.120.58 Liveness
> probe errored: rpc error: code = Unknown desc = command error: exec failed:
> cannot allocate tty if runc will detach without setting console socket
> , stdout: , stderr: , exit code -1

Thanks for providing this info. I did find this doc [1] and related issues [2][3].

1. https://github.com/opencontainers/runc/blob/v1.0.0-rc4/README.md#running-containers
2. https://github.com/opencontainers/runc/issues/1580
3. https://github.com/opencontainers/runc/issues/1202

> I can see the same error message from runc if I do -d with --tty=true:
>
> # runc exec -d --tty=true
> e96650303da6b8444919cc0c8dd6e377c9aa5cba80ad1eae978de561fa845d9d ps
> exec failed: cannot allocate tty if runc will detach without setting console
> socket

Thanks, tagging Derek for his input on this.
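For context on the "cannot allocate tty" error above: runc refuses to allocate a terminal for a detached exec unless the caller provides a console socket to receive the pty master, which is what the runc README [1] describes. The sketch below is a hypothetical illustration of that requirement, not a command taken from this bug; it assumes runc's bundled recvtty helper (contrib/cmd/recvtty) has been built, and the socket path and container ID are placeholders.

```sh
# Reproducing the error: a detached exec with a TTY but no console socket fails.
runc exec -d --tty=true <container-id> ps
# => exec failed: cannot allocate tty if runc will detach without setting console socket

# Hypothetical sketch: start a receiver for the pty master first, then point runc at it.
recvtty /tmp/console.sock &
runc exec -d --tty=true --console-socket /tmp/console.sock <container-id> ps
```

This is why the kubelet's exec-based liveness probe against the origin-web-console pod hits the same error path as a manual runc exec with -d and a TTY.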
Looks like a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1549683

Seth, is that relevant only to the errors I saw on the origin-web-console pod, or is it likely to be causing the issues exec'ing into the other pods as well?

Jason, see comment 3, https://bugzilla.redhat.com/show_bug.cgi?id=1549683#c3, which has the same error:

    rpc error: code = Unknown desc = command error: exec failed: cannot allocate tty if runc will detach without setting console socket

The PR for the linked related issue from comment 13 has merged. Marking as ON_QA.

Upstream PR: https://github.com/kubernetes-incubator/cri-o/pull/1386

Hit this issue on crio version 1.9.7 when running oc exec against my logging-es pod.

Yesterday I checked Anping Li's env and double-confirmed the result. In project "logging", all pods had the hanging issue when I ran oc cp and oc exec without -it (pods in other projects, e.g. "default", did not). Today I rebuilt the env with the latest crio 1.9.8, which includes cri-o/pull/1386 above, but the issue still exists in some pods:

    # oc exec -n logging logging-es-data-master-fbhpmbzx-1-zcdtq date   # the first time it DOES NOT hang, but later it always hangs
    Defaulting container name to elasticsearch.
    Use 'oc describe pod/logging-es-data-master-fbhpmbzx-1-zcdtq -n logging' to see all of the containers in this pod.
    ^C
    # oc cp pod.yaml logging-es-data-master-fbhpmbzx-1-zcdtq:/tmp/c.yaml
    ^C

The other pod replica, however, DOES NOT hang:

    # oc exec -n logging logging-es-data-master-63xl9a5p-1-6mg4n date
    Defaulting container name to elasticsearch.
    Use 'oc describe pod/logging-es-data-master-63xl9a5p-1-6mg4n -n logging' to see all of the containers in this pod.
    Mon Mar 5 13:08:10 UTC 2018
    #

Could the oc cp issues be caused by this overlay/tar problem? It states that this most likely affects overlay and overlay2, and the QE hosts are configured with 'DOCKER_STORAGE_OPTIONS="--storage-driver overlay2 "':

https://github.com/moby/moby/issues/19647

I'm wondering if the terminal issues start after a cp fails.

*** Bug 1550644 has been marked as a duplicate of this bug. ***

@xxia thanks for testing the patch from BZ1549683. I am marking this bug as dependent on https://bugzilla.redhat.com/show_bug.cgi?id=1549683. Moving to the Containers team.

BZ#1550644 was marked a duplicate of this, but per comment 10 on that bug the problem is reproducible on QE hosts using docker as well. I don't see how, then, it can be caused by the issue in BZ#1549683 alone; a crio bug shouldn't impact docker environments.

I have created https://github.com/kubernetes-incubator/cri-o/pull/1443 as a fix for this. It will be available in cri-o 1.9.10.

QE's feedback on https://bugzilla.redhat.com/show_bug.cgi?id=1549019 suggests 1.9.10 in the latest puddle fixed the issue.

Checked with cri-o://1.9.14 on:

    # oc version
    oc v3.9.38
    kubernetes v1.9.1+a0ce1bc657
    features: Basic-Auth GSSAPI Kerberos SPNEGO

    Server https://ip-172-18-8-232.ec2.internal:8443
    openshift v3.9.38
    kubernetes v1.9.1+a0ce1bc657

I cannot reproduce this issue, so moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2335
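For anyone re-verifying the fix on a similar environment, a minimal check sketch is below. The node and pod names are placeholders, the `rpm -q cri-o` line assumes an RPM-based install, and the pod.yaml source file is arbitrary; the point is that an exec without -it and a cp should return instead of hanging once the node runs cri-o 1.9.10 or later.

```sh
# Confirm the node is running a cri-o build that contains the fix (>= 1.9.10).
rpm -q cri-o                                                  # assumes an RPM-based install
oc describe node <node-name> | grep "Container Runtime Version"

# With the current project set to "logging", neither command should hang.
oc exec -n logging <es-pod-name> date
oc cp pod.yaml <es-pod-name>:/tmp/c.yaml
```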