Description of problem:
Push to the registry fails after an OCP upgrade with "error parsing HTTP 403 response body: unexpected end of JSON input".

Version-Release number of selected component (if applicable):
3.11

How reproducible:
Always

Steps to Reproduce:
1. docker push docker-registry-default.company.com/chat/nginx:p0-green

Actual results:
The push refers to a repository [docker-registry-default.company.com/chat/nginx]
12fdf55172df: Pushing [==================================================>] 3.584 kB
002a63507c1c: Pushing [================>                                  ] 19.27 MB/56.76 MB
1c95c77433e8: Pushing [===========>                                       ] 16.3 MB/69.2 MB
error parsing HTTP 403 response body: unexpected end of JSON input: ""

Expected results:
The push should have succeeded in pushing all layers.

Additional info:
- The OpenShift platform is running on Azure.
- The docker registry is using an Azure Storage Account's blob storage as its /registry volume.
- We had no problem pushing the image before.
- Recently, we upgraded OpenShift from v3.11.xx to v3.11.135 (the latest release).
- oc adm diagnostics does not show any issue.
- The SOS report messages do not show anything relevant.
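One way to see the raw 403 body that the docker client fails to parse as JSON is to hit the registry route directly with curl. This is only a sketch, assuming the standard Docker Registry v2 API and the route hostname from this report; -k is used only because the client host may not trust the route certificate:

  # Ping the v2 API; a well-formed JSON error body is expected on failure.
  # An empty or HTML body would explain "unexpected end of JSON input".
  curl -sk -o body.out -w 'HTTP %{http_code}, %{size_download} bytes\n' \
    https://docker-registry-default.company.com/v2/
  cat body.out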
oc adm diagnostics
[Note] Determining if client configuration exists for client/cluster diagnostics
[Note] No client configuration specified; skipping client and cluster diagnostics.
[Note] Performing systemd discovery

[Note] Running diagnostic: AnalyzeLogs
       Description: Check for recent problems in systemd service logs

Info:  Checking journalctl logs for 'atomic-openshift-node' service
Info:  Checking journalctl logs for 'docker' service

[Note] Running diagnostic: UnitStatus
       Description: Check status for related systemd units

[Note] Summary of diagnostics execution (version v3.11.135):
[Note] Completed with no errors or warnings seen.
The SOS report of the node hosting the registry pod is attached to the case.
The registry pod log points to "unknown blob" errors, but it is not clear why. No file access issues are reported in audit.log. The registry errors eventually lead to "client disconnected during blob PATCH" and unexpected EOF, but the root cause appears to be the unknown blob. I have asked the customer to create a new project, do a build, and try a push. Waiting for results.
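The "unknown blob" errors can be cross-checked against the registry HTTP API. A sketch, assuming the standard Docker Registry v2 blobs endpoint, a valid bearer token in $TOKEN, and a layer digest taken from the registry log (<digest> is a placeholder):

  # HEAD request for a layer the registry reports as unknown:
  # 200 means the blob is reachable in the backing storage, 404 means it is missing.
  curl -skI -H "Authorization: Bearer $TOKEN" \
    https://docker-registry-default.company.com/v2/chat/nginx/blobs/sha256:<digest>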
1.13.1
Customer also reported the following. Since images cannot be pushed to prod, they are blocked.
- Pushing the rhscl/nginx image works from the admin console (by creating a new project and building the example program).
- Pushing the rhscl/nginx image works from the command line (docker push) on one of the master nodes.
- Pushing the rhscl/nginx image from another RHEL server that is not one of the OpenShift nodes fails with HTTP 403. However, we can pull it from registry.redhat.io.
- We have set insecure-registries in the /etc/docker/daemon.json file on that RHEL server (see the sketch below).
- We have used this setup for a long time to sync OpenShift images between the test and production environments. Not sure exactly when, but it stopped working recently. We suspect (just a guess) it happened after we upgraded from v3.11.xx (should be 4x) to v3.11.135.
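The insecure-registries entry the customer mentions typically looks like the following. This is only a sketch, assuming the registry route hostname from this report; the actual file on the customer's RHEL server was not shared:

  /etc/docker/daemon.json:
  {
    "insecure-registries": ["docker-registry-default.company.com"]
  }

  # pick up the change
  systemctl restart docker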
Attached the router logs and a screenshot of the newly created routes.
@Ricardo, the command did not work in OCP 3.11. Is there an equivalent one for 3.11?
@Ricardo Customer reported that the router logs are not showing anything; he checked all three pods. I will ask for docker -D output while you are checking on #19. I might ask for a WebEx to verify the router logs. Is there anything else you need if I do get a WebEx?

"Yes, have tried on all router pods, but only see the health check messages."

Below are some of the logs (only the health check messages are visible):
=======================================================
I0910 14:53:24.129851       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0910 14:53:29.105896       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0910 15:02:52.975065       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0910 15:12:39.087256       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
I0910 15:17:22.325293       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0910 15:17:49.847037       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0910 15:20:25.177671       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0910 15:30:22.262110       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
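Since docker -D output will be requested, a sketch of what that capture could look like on the failing RHEL server, using the image tag from the original report (redirecting stderr is just one way to keep the output):

  # -D enables docker client debug output; the HTTP exchange with the registry
  # (including the 403 response) is written to stderr.
  docker -D push docker-registry-default.company.com/chat/nginx:p0-green 2> docker-push-debug.log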
Created attachment 1614611 [details] haproxy-config
Created attachment 1614612 [details] master-config
Created attachment 1614829 [details] production router logs
Created attachment 1614830 [details] Test router logs
@Ricardo Attached the router debug logs for both the production and test environments, for both the success and failure scenarios. The failure case is when they try to push from outside the network. I did not ask what the outside host is, but it is a Linux host running in Azure that they use to push into production. I think the customer tried to push from a RHEL server in the cloud and saw the same issue. The URL is the same in both the success and failure cases (you can see it in the logs).
Around midnight on 9/12, early 9/12 our time. The files in the prod tar file had a timestamp of 9/13 00:22. I believe he uploaded the files right after testing the cases, because I shared the procedure only at 10 PM on 9/12.
Created attachment 1615562 [details] docker registry debug logs
Added the docker registry logs with debug enabled.
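For reference, one way to turn on debug logging for the registry in OCP 3.11 is to override the log level through the deployment config environment. This is only a sketch; it assumes the default docker-registry deployment config in the default project, and the customer may have enabled debug differently:

  # docker/distribution honors REGISTRY_* environment overrides;
  # REGISTRY_LOG_LEVEL maps to log.level in the registry config.
  oc set env dc/docker-registry -n default REGISTRY_LOG_LEVEL=debug
  oc logs -f dc/docker-registry -n default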
@Ricardo Customer is asking if there is any workaround for the situation. They pull from the test environment and push into prod (and vice versa), and they can no longer wait, as this is going to impact production plans. I suggested using docker save to write the image to a tar file (after the pull), transferring it to the right host, running docker load to restore the image, and then doing the push, but they are not OK with that.
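A sketch of that suggested workaround, using the image name from this report. <prod-push-host> is a placeholder, and the same registry hostname is shown on both sides only for illustration; if the test and prod registries have different hostnames, a docker tag step would be needed before the final push:

  # On the host that can pull from the test registry:
  docker pull docker-registry-default.company.com/chat/nginx:p0-green
  docker save -o nginx-p0-green.tar docker-registry-default.company.com/chat/nginx:p0-green
  scp nginx-p0-green.tar <prod-push-host>:/tmp/

  # On the host that can push to the prod registry:
  docker load -i /tmp/nginx-p0-green.tar
  docker push docker-registry-default.company.com/chat/nginx:p0-green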
@Ricardo Customer closed the case. It seems they identified the issue as related to Azure blob storage. They did not give details, but the blob storage access may have been limited to the virtual network only. Thanks for your help, and I am closing this ticket.
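For future reference, the storage account firewall setting the customer likely hit can be inspected with the Azure CLI. This is only a sketch with placeholder names; a default action of "Deny" means only the listed virtual networks/IPs can reach the blob storage that backs /registry:

  # Show the network rules on the storage account backing the registry
  az storage account show \
    --name <storage-account> --resource-group <resource-group> \
    --query networkRuleSet

  # Example: allow access from outside the listed virtual networks again
  az storage account update \
    --name <storage-account> --resource-group <resource-group> \
    --default-action Allow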