Somewhere around 25% of the time oc cp fails to copy

Description of problem:
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
^C
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
^C

Version-Release number of selected component (if applicable):
atomic-openshift-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-clients-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-docker-excluder-3.9.1-1.git.0.82b8f99.el7.noarch
atomic-openshift-excluder-3.9.1-1.git.0.82b8f99.el7.noarch
atomic-openshift-master-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-node-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-sdn-ovs-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-registries-1.22.1-1.gitd36c015.el7.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Uncertain. QE seems able to reliably reproduce it in their environments.

Actual results:
Somewhere around 25% of the time oc cp fails to copy.

Expected results:
oc cp works near 100% of the time.

Additional info:
This is a small database dump in the examples above. When cp fails to return, it looks like the file is partially transferred. rsync appears to work fine; I ran it 30+ times without a hang.
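For reference, the rsync-based copy that kept working was along these lines (a sketch rather than the exact command I ran; oc rsync operates on directories, so the dump's parent directory is synced instead of the single file):

# sync the local /tmp/db directory into the pod's /tmp/db (assumed path, mirroring the cp example above)
# oc rsync -n msyql-down4 /tmp/db/ mysql-5.6-prod-1-qqmt4:/tmp/db/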
I cannot reproduce this locally. Does this happen with only certain pods? All pods? Is there an error message, or does the command just hang? Can I get --loglevel=8 output for the command? `oc logs` for the pods as well please. Is there an environment I can use to reproduce this?
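For example, something along these lines should capture what I'm after (a sketch; adjust namespace/pod to your environment, the output file names are just placeholders, and I'm assuming the verbose client output goes to stderr):

$ oc cp --loglevel=8 -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump 2> oc-cp-loglevel8.txt
$ oc logs -n msyql-down4 mysql-5.6-prod-1-qqmt4 > pod.log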
It seems to be any pod in the environment. I tried:

oc cp /tmp/db/db.dump -n default docker-registry-1-zrt6s:/tmp/db.dump

and got the same intermittent failures. Other DB pods that were provisioned are doing the same.
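To put a rough number on the failure rate I've been looping it roughly like this (a sketch; the 120-second timeout is just an arbitrary cutoff for declaring a run hung):

$ for i in $(seq 1 20)
do
  # kill any run that takes longer than two minutes and record the outcome
  timeout 120 oc cp /tmp/db/db.dump -n default docker-registry-1-zrt6s:/tmp/db.dump \
    && echo "run $i: completed" || echo "run $i: hung or failed"
done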
I believe they're seeing this in a docker environment.
Sorry, hit enter too soon; I didn't mean to cancel the needinfo. I am trying to get them to provide a host where I can reproduce this, to confirm it happens with docker. If that's true, I agree it looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1549259, but I don't see how it can be the same bug if the fix there is in cri-o.
I am seeing this with a system using docker as well. I'm not sure if this is the same issue as BZ#1549259. It looks like it was suggested that that one was due to a cri-o bug, and if that's the case I don't see how they can be related. However, it is possible they are the same and it's not a cri-o (or not just a cri-o) issue, as it seems I now can't exec into the pod that oc cp occasionally hangs with.

'oc exec -it -n default docker-registry-1-j2t8j /bin/bash' works now, but 'oc exec -n default docker-registry-1-j2t8j /bin/bash' does not.

I do see several tar and bash processes running inside, from trying to cp and exec:

$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
1000000+       1       0  0 08:42 ?        00:00:15 /usr/bin/dockerregistry /etc/registry/config.yml
1000000+      24       0  0 13:25 ?        00:00:00 tar xf - -C /tmp
1000000+      64       0  0 13:25 ?        00:00:00 tar xf - -C /tmp
1000000+      68       0  0 13:29 ?        00:00:00 /bin/bash
1000000+      72       0  0 13:30 ?        00:00:00 /bin/bash
1000000+      76       0  0 13:30 ?        00:00:00 /bin/bash
1000000+      84       0  0 13:30 ?        00:00:00 tar xf - -C /tmp
1000000+     104       0  0 13:30 ?        00:00:00 /bin/bash
1000000+     108       0  0 13:30 ?        00:00:00 /bin/bash
1000000+     112       0  0 13:30 ?        00:00:00 /bin/bash
1000000+     116       0  0 13:31 ?        00:00:00 /bin/bash
1000000+     120       0  0 13:32 ?        00:00:00 /bin/bash
1000000+     124       0  0 13:33 ?        00:00:00 /bin/bash
1000000+     131     124  0 13:33 ?        00:00:00 ps -ef
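As far as I understand, oc cp streams a tar archive over an exec session into the pod, which lines up with the leftover 'tar xf - -C /tmp' processes above. A rough sketch of the equivalent manual pipe (not the exact internal invocation oc uses; the paths just mirror the earlier cp example):

$ tar cf - -C /tmp/db db.dump | oc exec -i -n default docker-registry-1-j2t8j -- tar xf - -C /tmp

If the exec stream stalls or never closes, the tar on the pod side just sits there waiting for more input, which would explain both the hangs and the partially transferred files.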
(In reply to Jason Montleon from comment #10)
> 'oc exec -it -n default docker-registry-1-j2t8j /bin/bash' works now, but
> 'oc exec -n default docker-registry-1-j2t8j /bin/bash' does not.

Based on this information, this bug does appear to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1549259. Closing this one as a duplicate. Thanks.

*** This bug has been marked as a duplicate of bug 1549259 ***
Origin PR: https://github.com/openshift/origin/pull/18883
Same PR as bug 1552670. Waiting for the PR to land in OCP, and then will check.
First, tried to confirm a way to reproduce the issue: on an old-version OCP 3.9.1 env with docker 1.12, repeated command [1]; it did not hang. Then upgraded docker 1.12 to 1.13, restarted docker/master/node, and tried the command again; it hung often enough.

Second, installed a new-version OCP 3.9.4 env with docker 1.13 and tried the same command; it did not hang.

So moving to VERIFIED.

[1] the tried steps:
$ oc new-app mysql-ephemeral
# wait for the pod to be running
$ for i in $(seq 1 500)
do
  echo "$i testing ... `date '+%H:%M:%S'`"
  oc cp pod.yaml mysql-1-b9nb7:/tmp/pod-$i.yaml
  echo "$i ended `date '+%H:%M:%S'`"  # check timestamp difference
  echo "-----------------"
done