Description of problem:
Fail to push images on env with ha-registry

Version-Release number of selected component (if applicable):
puddle[2015-11-02.1]

How reproducible:
Always

Steps to Reproduce:
1. Install env with ha-registry and shared nfs storage
   # oc scale --replicas=2 rc/docker-registry-2
2. Do a sti-build
   # oc start-build nodejs-example
3. Check the build logs
   # oc build-logs nodejs-example-9

Actual results:
I1103 00:54:52.729793 1 sti.go:213] Using provided push secret for pushing 172.30.28.101:5000/xiama/nodejs-example:latest image
I1103 00:54:52.729816 1 sti.go:217] Pushing 172.30.28.101:5000/xiama/nodejs-example:latest image ...
I1103 00:54:54.030085 1 sti.go:222] Registry server Address:
I1103 00:54:54.030115 1 sti.go:223] Registry server User Name: serviceaccount
I1103 00:54:54.030125 1 sti.go:224] Registry server Email: serviceaccount
I1103 00:54:54.030133 1 sti.go:229] Registry server Password: <<non-empty>>
F1103 00:54:54.030142 1 builder.go:59] Build error: Failed to push image. Response from registry is: digest invalid: provided digest did not match uploaded content

Fail to push the image to registry

Expected results:
Push images successfully

Additional info:
When QE set the replicas to 1, the sti-build can push the images successfully
Have you set up the storage [1] for the registry and made it available on all your nodes hosting the registry?

[1] https://docs.openshift.com/enterprise/3.0/install_config/install/docker_registry.html#storage-for-the-registry

There are reports of similar behaviour with the upstream distribution: https://github.com/docker/distribution/issues/1013

Long story short - this may happen on less consistent distributed storage. There's an attempt to address it: https://github.com/docker/distribution/pull/1141

I need to know details about your storage setup in order to debug it. But even so, it's rather an upstream (docker/distribution) issue.
I wrongly assumed that the upstream patch applies to this case, but it's related only to Swift. Interesting observation: the push works with the older docker-1.7.1-115.el7.x86_64. I'll try upstream 1.8.2 Docker as well to see whether internal patches are at fault.
Created attachment 1089731 [details]
Log of the first registry instance for a push with docker 1.7.1-115

Log of registry instance #1 when the following command is executed with docker-1.7.1-115.el7.x86_64:

docker -D push 172.30.177.244:5000/joe/hello-world
The push refers to a repository [172.30.177.244:5000/joe/hello-world] (len: 1)
af340544ed62: Image already exists
535020c3e8ad: Image successfully pushed
Digest: sha256:729c7f8b8ee41e952083865694b85bc9b38830d48b98b1f92ce7cf3b658a8aba
Created attachment 1089735 [details]
Log of the second registry instance for a push with docker 1.7.1-115

Log of registry instance #2 when the following command is executed with docker-1.7.1-115.el7.x86_64:

docker -D push 172.30.177.244:5000/joe/hello-world
The push refers to a repository [172.30.177.244:5000/joe/hello-world] (len: 1)
af340544ed62: Image already exists
535020c3e8ad: Image successfully pushed
Digest: sha256:729c7f8b8ee41e952083865694b85bc9b38830d48b98b1f92ce7cf3b658a8aba
Created attachment 1089737 [details]
Log of the first registry instance for a push with docker 1.8.2-7

Log of registry instance #1 when the following command is executed with docker-1.8.2-7.el7.x86_64:

docker push 172.30.177.244:5000/joe/hello-world-from-node2
The push refers to a repository [172.30.177.244:5000/joe/hello-world-from-node2] (len: 1)
975b84d108f1: Pushing 1.024 kB
digest invalid: provided digest did not match uploaded content
Created attachment 1089738 [details]
Log of the second registry instance for a push with docker 1.8.2-7

Log of registry instance #2 when the following command is executed with docker-1.8.2-7.el7.x86_64:

docker push 172.30.177.244:5000/joe/hello-world-from-node2
The push refers to a repository [172.30.177.244:5000/joe/hello-world-from-node2] (len: 1)
975b84d108f1: Pushing 1.024 kB
digest invalid: provided digest did not match uploaded content
Upstream docker 1.8.2 results in the same error.
This is weird... here are some snippets of the errors when pushing with 1.8.2:

level=error msg="canonical digest does match provided digest" canonical=sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef http.request.uri="[SNIP]&digest=sha256%3A5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef"

level=error msg="An error occured" err.code=DIGEST_INVALID err.detail="invalid digest for referenced layer: sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef, content does not match digest" err.message="provided digest did not match uploaded content"

It sure seems to me like Docker is telling the registry "the digest is sha256%3A5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef" (%3A = ":"), and the registry is computing it as "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef", which looks identical to me. I wonder if there's some URL query parsing bug somewhere?
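A quick standalone check (Python, not part of the registry code) confirms that the two strings only differ by URL percent-encoding, so the "%3A" artifact alone cannot explain the failure:

```python
from urllib.parse import unquote

# Digest as it appears percent-encoded in the request URI vs. the form
# the registry logs; values copied from the error messages above.
provided = "sha256%3A5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef"
canonical = "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef"

# After URL decoding, the two strings are byte-for-byte identical,
# so query parsing is unlikely to be the culprit.
print(unquote(provided) == canonical)  # → True
```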
If you run 1 replica, don't use NFS, with Docker 1.8.2, does the push succeed?
I couldn't reproduce it with one replica, nor with 2 replicas running on the same node sharing a host directory. Now it looks more NFS-related.
Reproducible even with docker-distribution 2.1.0 (with OSO patches).
Here I summarize our testing results:

2 replicas on different nodes + NFS:    FAIL
1 replica + NFS:                        FAIL
1 replica + host dir:                   PASS
2 replicas on the same node + host dir: PASS
This bug is now blocking testing, so raising its priority.
Just verified that with 1 replica + NFS it fails, but not always, and only with larger layers (e.g. with the registry.access.redhat.com/openshift3/mongodb-24-rhel7 image). I wasn't able to reproduce it in my local environment, though.
The "%3A" artifact in a digest isn't an issue. The error reporting has two bugs: the first is returning the given digest (the one coming with the request) as the canonical one [1]; the second is the formatting of that same digest in the error message, which produces the artifact. In fact, the two digests being compared really do differ in their hex strings:

level=debug msg="(*layerWriter).validateLayer: checking cannonical_digest == expected_digest (sha256:5e7d48da6780e1bf2b20f0b5d1ca35037d2ec33ced2f7d0f1cab4dc9cd4a6497 == sha256:6e0917773a3ba1fa40c8fdd50fe9d38e9fedfdd96a444637d6bf7f090b34ca71)"
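For reference, the validation the registry performs conceptually boils down to recomputing the digest over the uploaded bytes and comparing it with what the client claimed. A minimal sketch (the `verify_blob` helper is hypothetical, not the actual docker/distribution code):

```python
import hashlib

def verify_blob(content: bytes, expected_digest: str) -> bool:
    """Recompute the canonical sha256 digest of the uploaded content and
    compare it with the digest the client provided ("sha256:<hex>")."""
    canonical = "sha256:" + hashlib.sha256(content).hexdigest()
    return canonical == expected_digest

blob = b"example layer bytes"
ok = "sha256:" + hashlib.sha256(blob).hexdigest()
print(verify_blob(blob, ok))                     # → True
print(verify_blob(blob, "sha256:" + "0" * 64))   # → False (the DIGEST_INVALID case)
```

If the bytes read back from storage differ from the bytes the client sent (e.g. due to stale NFS caches), the canonical digest changes and this check fails even though the client's digest was correct.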
Created attachment 1090104 [details]
single replica with NFS mount and additional debug statements

Now it seems more like a timing issue. After adding debug statements, I can reproduce even with hostPath:

docker push 172.30.87.121:5000/jialiu/mongodb-24-rhel7
The push refers to a repository [172.30.87.121:5000/jialiu/mongodb-24-rhel7] (len: 1)
d17602c1d664: Pushing [==================================================>] 234.4 MB
digest invalid: provided digest did not match uploaded content
Comparing 2 layer blobs from a successful and an unsuccessful push. Both belong to the first layer of the image registry.access.redhat.com/openshift3/mongodb-24-rhel7, with a size of 234434560 bytes. They differ in 4 blocks, each 49 bytes long. The blocks start at offsets:
1. 15076559 (0xe60ccf)
2. 76979915 (0x4969ecb)
3. 83012204 (0x4f2aa6c)
4. 161540983 (0x9a0eb77)
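For anyone who wants to repeat this comparison, the differing runs can be located with a few lines of standalone Python (the `diff_runs` helper is hypothetical and assumes both blobs fit in memory; for 234 MB files a chunked variant would be preferable):

```python
def diff_runs(a: bytes, b: bytes):
    """Return (offset, length) pairs for each contiguous run of
    differing bytes between two equally sized blobs."""
    runs, start = [], None
    for i in range(len(a)):
        same = a[i] == b[i]
        if not same and start is None:
            start = i                      # a differing run begins
        elif same and start is not None:
            runs.append((start, i - start))  # the run just ended
            start = None
    if start is not None:
        runs.append((start, len(a) - start))
    return runs

# Tiny demonstration: corrupt 4 bytes at offset 10 of a 100-byte blob.
good = bytes(100)
bad = good[:10] + b"\xff" * 4 + good[14:]
print(diff_runs(good, bad))  # → [(10, 4)]
```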
I'd just like to mention that there's something wrong with the network on those testing machines. I just got this when copying data from openshift-124 to openshift-114:

scp -r root.0.89:/var/tmp/docker-registry/docker/registry/v2/blobs/sha256/6e/6e0917773a3ba1fa40c8fdd50fe9d38e9fedfdd96a444637d6bf7f090b34ca71/data .
Warning: Permanently added '192.168.0.89' (ECDSA) to the list of known hosts.
data 87% 195MB 70.3MB/s 00:00 ETA
Corrupted MAC on input.
Disconnecting: Packet corrupt
lost connection
I switched to my local environment because I believe the QE test environment is broken. Locally I could reproduce with 2 replicas sharing NFS storage.

NOTE: The blob uploaded during a failed push does NOT differ from the correct layer blob resulting from a successful push. So the problem is just with computing the digest. I'll upload logs with extended debug messages for both replicas.
Created attachment 1090317 [details]
Log of the 1st replica from the local setup with a failed push

The push command was:

docker -D push 172.30.228.121:5000/joe/mongodb-24-rhel7
The push refers to a repository [172.30.228.121:5000/joe/mongodb-24-rhel7] (len: 1)
d17602c1d664: Pushing [==================================================>] 234.4 MB
digest invalid: provided digest did not match uploaded content
Created attachment 1090318 [details]
Log of the 2nd replica from the local setup with a failed push

The push command was:

docker -D push 172.30.228.121:5000/joe/mongodb-24-rhel7
The push refers to a repository [172.30.228.121:5000/joe/mongodb-24-rhel7] (len: 1)
d17602c1d664: Pushing [==================================================>] 234.4 MB
digest invalid: provided digest did not match uploaded content
I came up with a fix that solves the issue on my setup. I haven't tested it in QE's environment: https://github.com/openshift/origin/pull/5749

The problem really is the NFS storage. After an upload of a large layer (250M), it takes around 40 seconds for os.Stat() to succeed on the data blob file on my VMs, and it takes a few more seconds for the file to become visible to the other replica. So even when the layer push succeeds, its fetch may fail if the other replica is asked.
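The idea behind such a fix can be sketched as polling until the blob file becomes visible on the local mount before serving or validating it (a standalone Python sketch with a hypothetical `wait_for_blob` helper; the actual PR is Go code inside the registry, so this only illustrates the approach):

```python
import os
import time

def wait_for_blob(path: str, timeout: float = 40.0, interval: float = 0.5) -> bool:
    """Poll until the blob file becomes visible on this replica's NFS
    mount, giving up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.stat(path)      # succeeds once the file is visible here
            return True
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                return False   # still not visible; report blob unknown
            time.sleep(interval)
```

Usage would look like `wait_for_blob("/registry/docker/registry/v2/blobs/sha256/6e/6e09.../data")` (path shown for illustration) before answering a HEAD/GET for the blob. This masks the NFS visibility delay at the cost of slower responses for genuinely missing blobs.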
So it seems the NFS issue won't be resolved in 3.1. Our recommendation to customers will be to use ClientIP session affinity in the registry's service configuration, which causes requests from a particular docker daemon to be handled by the same registry replica.

Johnny, Ma, could you please re-test with this setting applied? Specifically:

oadm registry --credentials=/etc/origin/master/openshift-registry.kubeconfig --images='registry.access.redhat.com/openshift3/ose-${component}:${version}' --replicas=2
oc get -o yaml svc docker-registry | sed 's/\(sessionAffinity:\s*\).*/\1ClientIP/' | oc replace -f -
# any other setup needed

And if successful, could you please start using it for the tests currently blocked on this one?
Marking as UpcomingRelease. Michal, we should document the known issue with NFS and the possible workaround with ClientIP for the release notes while we continue to work on this.
Some additional (possibly repeated) information:
- pushes fail with both Docker 1.7.1 and 1.8.2
- errors are either of the form "blob upload unknown" or "digest invalid: provided digest did not match uploaded content"
- I set up my client mounts with the 'noac' option and was unable to get a push to fail after several hundred iterations. My inter-host latency, however, was minimal, as the 3 VMs in question were all on the same laptop.
Upstream issue: https://github.com/docker/distribution/issues/1176
From both the client side and the server, please do a network trace of the NFS traffic.

On the server side:
# yum install wireshark
# tshark -w /tmp/server.pcap host <client_ip>
# bzip2 /tmp/server.pcap

On the client side:
# yum install wireshark
# tshark -w /tmp/client.pcap host <server_ip>
# bzip2 /tmp/client.pcap
Created attachment 1092481 [details] server pcap
Created attachment 1092482 [details] client1 pcap
Created attachment 1092483 [details] client2 pcap
I've attached the packet captures from the NFS server, node1, and node2. During these captures, I attempted to push an image to the load balancer sitting in front of the 2 registry backends. I think I tried to push 10 or 12 times. Each time I got the same error:

[root@ose3-node1 haproxy]# docker push localhost:5010/test1:centos7
The push refers to a repository [localhost:5010/test1] (len: 1)
ce20c473cd8a: Pushing 1.024 kB
blob upload unknown

(Note: these captures were with haproxy and the registry:2.2.0 image)
From https://github.com/docker/distribution/issues/1176#issuecomment-155558615: "Looks like NFS isn't flushing the writes from one instance to another. Read-after-write consistency is a requirement of the backend."
I did a few experiments with client mount options, tested on 3 VMs sharing a single NFS export of a host:

/var/shared 192.168.122.0/24(rw,sync,all_squash)

Two nodes were running the upstream docker registry (v2.2.0) balanced by haproxy with a round-robin balance algorithm. I ran

docker -D push 192.168.122.101:5010/joe/mongodb-24-rhel7

100 times from the 3rd VM after each change to the NFS mount options on the nodes. Storage was wiped before each subsequent run. The only option modified was `actimeo`, which according to the NFS(5) man page:

> Using actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the same value.

It's the number of seconds before a client's cached file/directory attributes are invalidated. All the other NFS options were left at their defaults:

rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.101,local_lock=none,addr=192.168.122.1

Here are my findings:

actimeo [s]    success rate [%]
0              100
3              100
4               90
5                5
10               0

In other words, invalidating the client cache after 3 seconds was enough to get to a 100% success rate on a low-latency network. I went further and found that setting just `acdirmin` and `acdirmax` to 3 while keeping the default values of `acregmin` and `acregmax` reduced the success rate to 97%, so these two options are the most critical. On the other hand, setting just `acregmin` and `acregmax` didn't help at all.

Steve, can I do anything else to help you?
There's an upstream issue to address NFS mount option specification: https://github.com/kubernetes/kubernetes/issues/17226
Erik, I run the registry instances like this:

docker run --rm --name upstream-registry -v /mnt/shared/registry:/var/lib/registry:rw docker.io/registry:2.2.0

outside of OpenShift. With OpenShift, the easiest way is to use a hostPath volume:

oc volume deploymentconfigs/docker-registry --add --overwrite --name=registry-storage -t hostPath -m /registry --path /mnt/shared/registry

Or you could also "remount" an existing persistent NFS volume:

mount /var/lib/origin/openshift.local.volumes/pods/5164de5d-846e-11e5-8629-525400045043/volumes/kubernetes.io~nfs/registry-storage -o remount,noac
Steve, are you looking into this? Is there anything I can help with?
(In reply to Michal Minar from comment #44)
> Steve,
>
> are you looking into this? Is there anything I can help with?

I'm not ignoring it.. ;-) but would you mind setting up a couple of beaker machines or VMs where I can reproduce this problem? That would definitely help me get started.
1 replica + host dir: sometimes fails
Could you please provide more info around your issue with hostDir? Is it a hostDir that happens to be stored on an NFS server? Or just a regular directory on the host? What error messages are you seeing, and what's in the logs? It's possible you're running into a different issue.
Hi, it is just a regular directory on the host; OpenShift was installed as the containerized version on Atomic Host. See the error below:

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:16.101s
[INFO] Finished at: Thu Jan 14 03:05:55 EST 2016
[INFO] Final Memory: 14M/93M
[INFO] ------------------------------------------------------------------------
[WARNING] The requested profile "openshift" could not be activated because it does not exist.
Copying all WAR artifacts from /home/jboss/source/target directory into /opt/webserver/webapps for later deployment...
'/home/jboss/source/target/websocket-chat.war' -> '/opt/webserver/webapps/websocket-chat.war'
I0114 03:05:55.835371 1 docker.go:481] Container wait returns with 0 and <nil>
I0114 03:05:55.865325 1 docker.go:488] Container exited
I0114 03:05:55.910049 1 docker.go:571] Invoking postExecution function
I0114 03:05:55.910129 1 sti.go:270] No .sti/environment provided (no environment file found in application sources)
I0114 03:05:56.294482 1 docker.go:606] Committing container with config: {Hostname: Domainname: User:185 Memory:0 MemorySwap:0 CPUShares:0 CPUSet: AttachStdin:false AttachStdout:false AttachStderr:false PortSpecs:[] ExposedPorts:map[] Tty:false OpenStdin:false StdinOnce:false Env:[OPENSHIFT_BUILD_SOURCE=https://github.com/jboss-openshift/openshift-quickstarts.git OPENSHIFT_BUILD_REFERENCE=1.2 OPENSHIFT_BUILD_NAME=openshift-quickstarts-1 OPENSHIFT_BUILD_NAMESPACE=ovuu3] Cmd:[/usr/local/s2i/run] DNS:[] Image: Volumes:map[] VolumeDriver: VolumesFrom: WorkingDir: MacAddress: Entrypoint:[] NetworkDisabled:false SecurityOpts:[] OnBuild:[] Mounts:[] Labels:map[Authoritative_Registry:registry.access.redhat.com BZComponent:jboss-webserver-3-webserver30-tomcat8-openshift-docker Release:7 io.openshift.build.source-context-dir:tomcat-websocket-chat io.openshift.build.source-location:https://github.com/jboss-openshift/openshift-quickstarts.git org.jboss.deployments-dir:/opt/webserver/webapps vcs-type:git Build_Host:rcm-img-docker01.build.eng.bos.redhat.com Name:jboss-webserver-3/webserver30-tomcat8-openshift io.openshift.build.image:rcm-img-docker01.build.eng.bos.redhat.com:5001/jboss-webserver-3/webserver30-tomcat8-openshift:latest io.openshift.build.commit.message:set pom versions of org.openshift quickstarts to 1.2.0.Final io.openshift.build.commit.author:David Ward <dward> Architecture:x86_64 io.openshift.expose-services:8080:http io.k8s.description:Platform for building and running web applications on JBoss Web Server 3.0 - Tomcat v8 io.openshift.build.commit.id:dd6ef49437a8b9aec08523e69166854cc11a0805 Version:1.1 Vendor:Red Hat, Inc. vcs-ref:6db374ff8ce77187745cdc0b09d62991c7820c89 architecture:x86_64 io.openshift.s2i.scripts-url:image:///usr/local/s2i io.k8s.display-name:172.30.115.226:5000/ovuu3/openshift-quickstarts:latest io.openshift.tags:builder,java,tomcat8 build-date:2015-12-10T19:36:07.840739Z io.openshift.build.commit.date:Tue Dec 15 13:22:35 2015 -0500 io.openshift.build.commit.ref:1.2]}
I0114 03:06:36.423794 1 sti.go:315] Successfully built 172.30.115.226:5000/ovuu3/openshift-quickstarts:latest
I0114 03:06:36.498740 1 cleanup.go:23] Removing temporary directory /tmp/s2i-build217028334
I0114 03:06:36.498779 1 fs.go:117] Removing directory '/tmp/s2i-build217028334'
I0114 03:06:36.524868 1 sti.go:214] Using provided push secret for pushing 172.30.115.226:5000/ovuu3/openshift-quickstarts:latest image
I0114 03:06:36.524908 1 sti.go:218] Pushing 172.30.115.226:5000/ovuu3/openshift-quickstarts:latest image ...
I0114 03:07:45.126836 1 sti.go:223] Registry server Address:
I0114 03:07:45.126964 1 sti.go:224] Registry server User Name: serviceaccount
I0114 03:07:45.127180 1 sti.go:225] Registry server Email: serviceaccount
I0114 03:07:45.127202 1 sti.go:230] Registry server Password: <<non-empty>>
F0114 03:07:45.127349 1 builder.go:185] Error: build error: Failed to push image. Response from registry is: digest invalid: provided digest did not match uploaded content
Wang, can you please post a version of the registry you're using? `oc -n default status | grep registry` And the version of docker daemon?
I just deployed a new HA registry (two separate nodes running the registry) onto an OSE 3.1 setup using Fedora 23 as the NFS server and was getting the exact same error:

"Response from registry is: digest invalid: provided digest did not match uploaded content"

using this registry version:
registry.access.redhat.com/openshift3/ose-docker-registry:v3.1.0.4

I have had some success with changing the NFS server options to add no_wdelay:

-- /etc/exports
/mnt/docker-registry *(rw,sync,no_root_squash,no_wdelay)

STI builds now push OK.
(In reply to Michal Minar from comment #52)
> Wang, can you please post a version of the registry you're using?
> `oc -n default status | grep registry`
> And the version of docker daemon?

[root@openshift-135 ~]# oc -n default status | grep registry
svc/docker-registry - 172.30.160.148:5000
dc/docker-registry deploys registry.access.redhat.com/openshift3/ose-docker-registry:v3.1.1.4
dc/router deploys registry.access.redhat.com/openshift3/ose-haproxy-router:v3.1.1.4
[root@openshift-135 ~]# docker version
Client:
 Version:         1.8.2-el7
 API version:     1.20
 Package Version: docker-1.8.2-10.el7.x86_64
 Go version:      go1.4.2
 Git commit:      a01dc02/1.8.2
 Built:
 OS/Arch:         linux/amd64
Server:
 Version:         1.8.2-el7
 API version:     1.20
 Package Version:
 Go version:      go1.4.2
 Git commit:      a01dc02/1.8.2
 Built:
 OS/Arch:         linux/amd64
After adding the "no_wdelay" option for the NFS server as mentioned in comment 54, the sti build succeeds in pushing data to the docker-registry.
I can confirm the "no_wdelay" option worked in my environment as well. I have a RHEL 7 OSE environment with an HA load balancer, but my NFS storage was on a CentOS 6.7 server.

[root@master00 ~]# oc -n default status | grep registry
svc/docker-registry - 172.50.225.185:5000
dc/docker-registry deploys registry.access.redhat.com/openshift3/ose-docker-registry:v3.1.1.6

I was receiving the following in one of my registries while attempting to deploy TicketMonster with EAP 1.2:

time="2016-03-10T09:00:56.671348073-05:00" level=error msg="response completed with error" err.code="BLOB_UPLOAD_INVALID" err.detail="Invalid token" err.message="blob upload invalid" go.version=go1.4.2 http.request.host="172.50.225.185:5000" http.request.id=e4066c94-950d-4306-89de-57a1ac573f72 http.request.method=PUT http.request.remoteaddr="10.5.0.1:34874"
Origin documentation PR: https://github.com/openshift/openshift-docs/pull/1908
Documented in PR https://github.com/openshift/openshift-docs/pull/1935
The PR looks good to QE, so moving this to VERIFIED.