Bug 1550644

Summary:	oc cp intermittently fails to finish copying files
Product:	OpenShift Container Platform	Reporter:	Jason Montleon <jmontleo>
Component:	oc	Assignee:	Juan Vallejo <jvallejo>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Xingxing Xia <xxia>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.9.0	CC:	aos-bugs, chezhang, jmatthew, jmontleo, jokerman, mmccomas, smunilla, wmeng, zitang
Target Milestone:	---	Keywords:	Reopened
Target Release:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	OCP 3.9.4	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1544606	Environment:
Last Closed:	2018-06-18 17:42:40 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1552670
Bug Blocks:	1544606

Description Jason Montleon 2018-03-01 16:56:34 UTC

Somewhere around 25% of the time oc cp fails to copy

Description of problem:
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
^C  
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
# oc cp -n msyql-down4 /tmp/db/db.dump mysql-5.6-prod-1-qqmt4:/tmp/db.dump
^C

Version-Release number of selected component (if applicable):
atomic-openshift-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-clients-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-docker-excluder-3.9.1-1.git.0.82b8f99.el7.noarch
atomic-openshift-excluder-3.9.1-1.git.0.82b8f99.el7.noarch
atomic-openshift-master-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-node-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-openshift-sdn-ovs-3.9.1-1.git.0.82b8f99.el7.x86_64
atomic-registries-1.22.1-1.gitd36c015.el7.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Uncertain. QE seems able to reliably reproduce it in their environments

Actual results:
Somewhere around 25% of the time oc cp fails to copy

Expected results:
oc cp works near 100% of the time

Additional info:
The file in the examples above is a small database dump. When cp fails to return, the file appears to be only partially transferred.

rsync appears to work fine. I ran it 30 plus times without a hang.

Comment 1 Juan Vallejo 2018-03-01 19:03:25 UTC

I cannot reproduce this locally.
Does this happen with only certain pods? All pods?
Is there an error message, or does the command just hang?

Can I get --loglevel=8 output for the command?
`oc logs` for the pods as well please.

Is there an environment I can use to reproduce this?

Comment 3 Jason Montleon 2018-03-01 19:12:43 UTC

It seems to be any pod in the environment. I tried:

oc cp /tmp/db/db.dump -n default  docker-registry-1-zrt6s:/tmp/db.dump

And get the same intermittent failures. Other DB pods that were provisioned are doing the same.

Comment 7 Jason Montleon 2018-03-02 15:45:24 UTC

I believe they're seeing this in a docker environment.

Comment 8 Jason Montleon 2018-03-02 15:47:22 UTC

Sorry, hit enter too soon and I didn't mean to cancel needinfo. 

I am trying to get them to provide a host where I can reproduce this to confirm it's with docker.

If true I agree it looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1549259, but I don't think it can be the same, if the fix is in cri-o.

Comment 10 Jason Montleon 2018-03-05 13:38:04 UTC

I am seeing this with a system using docker as well.

I'm not sure if this is the same issue as BZ#1549259. It was suggested that that one was due to a cri-o bug, and if that's the case I don't see how they can be related.

However, it is possible they are the same and it's not a cri-o (or not just a cri-o) issue, as it seems I now can't exec into the pod that oc cp occasionally hangs with.

'oc exec -it -n default docker-registry-1-j2t8j /bin/bash' works now but 'oc exec -n default docker-registry-1-j2t8j /bin/bash' does not.

I do see several tar and bash processes running inside, from trying to cp and exec.
$ ps -ef
UID         PID   PPID  C STIME TTY          TIME CMD
1000000+      1      0  0 08:42 ?        00:00:15 /usr/bin/dockerregistry /etc/registry/config.yml
1000000+     24      0  0 13:25 ?        00:00:00 tar xf - -C /tmp
1000000+     64      0  0 13:25 ?        00:00:00 tar xf - -C /tmp
1000000+     68      0  0 13:29 ?        00:00:00 /bin/bash
1000000+     72      0  0 13:30 ?        00:00:00 /bin/bash
1000000+     76      0  0 13:30 ?        00:00:00 /bin/bash
1000000+     84      0  0 13:30 ?        00:00:00 tar xf - -C /tmp
1000000+    104      0  0 13:30 ?        00:00:00 /bin/bash
1000000+    108      0  0 13:30 ?        00:00:00 /bin/bash
1000000+    112      0  0 13:30 ?        00:00:00 /bin/bash
1000000+    116      0  0 13:31 ?        00:00:00 /bin/bash
1000000+    120      0  0 13:32 ?        00:00:00 /bin/bash
1000000+    124      0  0 13:33 ?        00:00:00 /bin/bash
1000000+    131    124  0 13:33 ?        00:00:00 ps -ef
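
To track how many tar processes are left behind after repeated attempts, the ps output can be filtered. A rough sketch, assuming the `ps -ef` output from inside the pod is piped in or saved to a file; `count_stale_tar` is a hypothetical helper name, not part of oc:

```shell
# count_stale_tar: read `ps -ef` output on stdin and print how many
# "tar xf - -C ..." processes (the receiving side of oc cp) are listed.
count_stale_tar() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # the count is 0, so the status is ignored here.
  grep -c 'tar xf - -C' || true
}
```

Fed the listing above, this would report the three leftover tar processes.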

Comment 11 Juan Vallejo 2018-03-05 16:32:25 UTC

(In reply to Jason Montleon from comment #10)
> 'oc exec -it -n default docker-registry-1-j2t8j /bin/bash' works now but 'oc
> exec -n default docker-registry-1-j2t8j /bin/bash' does not.


Based on this information, this bug does appear to be a duplicate of
https://bugzilla.redhat.com/show_bug.cgi?id=1549259

Closing this one as a duplicate.
Thanks

*** This bug has been marked as a duplicate of bug 1549259 ***

Comment 12 Juan Vallejo 2018-03-07 21:12:32 UTC

Origin PR: https://github.com/openshift/origin/pull/18883

Comment 13 Xingxing Xia 2018-03-08 09:57:15 UTC

This is the same PR as for bug 1552670. Waiting for the PR to land in OCP, then will check.

Comment 14 Xingxing Xia 2018-03-09 09:23:52 UTC

First, tried to confirm a way to reproduce the issue:
  On an existing older OCP 3.9.1 environment with docker 1.12, repeated the command in [1]; the exec command did not hang.
  Then upgraded docker from 1.12 to 1.13, restarted docker/master/node, and tried the command again; it hung often enough.

Second, installed a new OCP 3.9.4 environment with docker 1.13 and tried the same command; exec didn't hang. So moving to VERIFIED.

[1] Steps tried:
$ oc new-app mysql-ephemeral # wait pod running

$ for i in $(seq 1 500)
do
  echo "$i testing ... $(date '+%H:%M:%S')"
  oc cp pod.yaml mysql-1-b9nb7:/tmp/pod-$i.yaml
  echo "$i ended       $(date '+%H:%M:%S')"  # check timestamp difference
  echo "-----------------"
done
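
The loop in [1] relies on reading timestamp gaps by eye. A minimal sketch of a wrapper that flags a hung attempt automatically, assuming GNU coreutils `timeout` is available on the client; `run_with_deadline` is a hypothetical helper name, not part of oc:

```shell
# run_with_deadline SECONDS CMD [ARGS...]
# Runs CMD under a time limit and prints one word:
#   ok     - CMD exited 0 within the limit
#   hung   - CMD was still running at the limit and was killed by timeout
#   failed - CMD exited non-zero on its own
run_with_deadline() {
  local limit=$1
  shift
  timeout "$limit" "$@" >/dev/null 2>&1
  case $? in
    0)   echo ok ;;
    124) echo hung ;;    # timeout exits 124 when the limit is reached
    *)   echo failed ;;
  esac
}
```

Inside the loop, something like `run_with_deadline 30 oc cp pod.yaml mysql-1-b9nb7:/tmp/pod-$i.yaml` would then print one result per iteration; the 30-second limit is an arbitrary choice for a small file.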