Bug 1749333

Summary: Push to registry failing after OCP upgrade with error parsing HTTP 403 response body: unexpected end of JSON input:
Product: OpenShift Container Platform
Reporter: Anand Paladugu <apaladug>
Component: Image Registry
Assignee: Ricardo Maraschini <rmarasch>
Status: CLOSED NOTABUG
QA Contact: Wenjing Zheng <wzheng>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: adam.kaplan, aos-bugs, hgomes, jolee, rmarasch
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-17 23:13:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
haproxy-config (flags: none)
master-config (flags: none)
production router logs (flags: none)
Test router logs (flags: none)
docker registry debug logs (flags: none)

Description Anand Paladugu 2019-09-05 11:51:12 UTC
Description of problem:

Push to registry failing after OCP upgrade with error parsing HTTP 403 response body: unexpected end of JSON input:

Version-Release number of selected component (if applicable):

3.11

How reproducible:

always

Steps to Reproduce:
1. docker push docker-registry-default.company.com/chat/nginx:p0-green
2.
3.

Actual results:

The push refers to a repository [docker-registry-default.company.com/chat/nginx]
12fdf55172df: Pushing [==================================================>] 3.584 kB
002a63507c1c: Pushing [================>                                  ] 19.27 MB/56.76 MB
1c95c77433e8: Pushing [===========>                                       ]  16.3 MB/69.2 MB
error parsing HTTP 403 response body: unexpected end of JSON input: ""


Expected results:

Push should have succeeded in pushing all layers.


Additional info:

- The OpenShift platform is running on Azure.
 - The docker registry is using an Azure Storage Account's blob storage as its /registry backend (see the config sketch after this list).
 - We had no problems pushing images before.
 - Recently, we upgraded our OpenShift from v3.11.xx to v3.11.135 (the latest release).

 - oc adm diagnostics does not show any issues
 - The SOS report does not show any relevant messages
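
For reference, the integrated registry's Azure backend is configured in the registry's config.yml roughly as sketched below (the account name, key, and container here are placeholders, not values from this case):

version: 0.1
storage:
  azure:
    accountname: <storage-account-name>
    accountkey: <storage-account-key>
    container: registry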

Comment 1 Anand Paladugu 2019-09-05 11:53:34 UTC
oc adm diagnostics
[Note] Determining if client configuration exists for client/cluster diagnostics
[Note] No client configuration specified; skipping client and cluster diagnostics.
[Note] Performing systemd discovery

[Note] Running diagnostic: AnalyzeLogs
       Description: Check for recent problems in systemd service logs

Info:  Checking journalctl logs for 'atomic-openshift-node' service
Info:  Checking journalctl logs for 'docker' service

[Note] Running diagnostic: UnitStatus
       Description: Check status for related systemd units

[Note] Summary of diagnostics execution (version v3.11.135):
[Note] Completed with no errors or warnings seen.

Comment 3 Anand Paladugu 2019-09-05 11:58:15 UTC
The SOS report of the node hosting the registry pod is attached to the case.

Comment 4 Anand Paladugu 2019-09-05 12:49:01 UTC
The registry pod log points to "unknown blob" errors, but it's not clear why. No file access issues are reported in audit.log.

The registry errors eventually lead to "client disconnected during blob PATCH" and unexpected EOF, but the root cause appears to be the unknown blob.

I have asked the customer to create a new project, do a build, and try a push. Waiting for results.

Comment 6 Anand Paladugu 2019-09-05 17:57:58 UTC
1.13.1

Comment 7 Anand Paladugu 2019-09-06 12:18:20 UTC
The customer also reported the following. Since images cannot be pushed to prod, they are blocked.

 - we tested pushing the rhscl/nginx image from the admin console (by creating a new project and building the example program), and it worked

 - we tested pushing the rhscl/nginx image from the command line (docker push) on one of the master nodes, and it worked

 - we got HTTP 403 when pushing the rhscl/nginx image from another RHEL server that is not one of the OpenShift nodes; however, we can pull it from registry.redhat.io

 - we have set insecure-registries in the /etc/docker/daemon.json file on that RHEL server (see the sketch after this list)

 - we have used this setting for a long time to sync OpenShift images between the test and production environments.
   Not sure when, but it stopped working recently. We suspect (just a guess) that it happened after we upgraded from v3.11.xx (should be 4x) to v3.11.135.
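
For reference, a minimal sketch of that setting on the external RHEL server (the hostname matches the registry route used in the push above):

/etc/docker/daemon.json:

{
  "insecure-registries": ["docker-registry-default.company.com"]
}

Then restart docker for the setting to take effect:

systemctl restart docker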

Comment 12 Anand Paladugu 2019-09-09 16:56:56 UTC
Attached are the router logs and a screenshot of the newly created routes.

Comment 20 Anand Paladugu 2019-09-11 12:01:35 UTC
Ricardo,

The command did not work in OCP 3.11. Is there an equivalent one for 3.11?

Comment 22 Anand Paladugu 2019-09-11 12:44:15 UTC
@Ricardo

The customer reported that the router logs are not showing anything; he checked all three pods. I will ask for docker -D output while you are checking on #19. I might also ask for a WebEx to verify the router logs. Is there anything else you need if I do get a WebEx?


"Yes, have tried on all router pods, but only see the health check messages."

Below are some of the logs that I can see (I can only see the health check logs):
=======================================================
I0910 14:53:24.129851       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0910 14:53:29.105896       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0910 15:02:52.975065       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0910 15:12:39.087256       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
I0910 15:17:22.325293       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0910 15:17:49.847037       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W0910 15:20:25.177671       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0910 15:30:22.262110       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.

Comment 25 Anand Paladugu 2019-09-12 17:45:46 UTC
Created attachment 1614611 [details]
haproxy-config

Comment 26 Anand Paladugu 2019-09-12 17:51:56 UTC
Created attachment 1614612 [details]
master-config

Comment 29 Anand Paladugu 2019-09-13 12:11:02 UTC
Created attachment 1614829 [details]
production router logs

Comment 30 Anand Paladugu 2019-09-13 12:11:59 UTC
Created attachment 1614830 [details]
Test router logs

Comment 31 Anand Paladugu 2019-09-13 12:18:40 UTC
@Ricardo

Attached are the router debug logs for both the production and test environments, for both the success and failure scenarios.

The failure case is when they try to push from outside the network. I did not ask what the outside host is, but it's a Linux host that they use to push into the production environment running in Azure. I think the customer also tried to push from a RHEL server in the cloud and saw the same issue.

The URL is the same in both the success and failure cases (you can see it in the logs).

Comment 33 Anand Paladugu 2019-09-13 18:29:59 UTC
Around midnight on 9/12, early 9/12 our time.

The files in the prod tar file have a timestamp of 9/13 00:22.

I believe he uploaded the files right after testing the cases because I shared the process only at 10 PM on 9/12.

Comment 35 Anand Paladugu 2019-09-16 16:22:45 UTC
Created attachment 1615562 [details]
docker registry debug logs

Comment 36 Anand Paladugu 2019-09-16 16:23:19 UTC
Added the docker registry logs with debug enabled.

Comment 37 Anand Paladugu 2019-09-16 18:50:17 UTC
@Ricardo

The customer is asking if there is any workaround for the situation. They pull from the test env and push into prod (and vice versa), and they can no longer wait, as it's going to impact production plans.

I suggested using docker save to write the image to a tar file (after a pull), transferring it to the right host, running docker load to restore the image, and then doing the push (see the sketch below), but they are not OK with it.
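
For reference, the suggested workaround would look roughly like this (image and tar file names are illustrative; retag with docker tag first if the source and destination registry hostnames differ):

# on a host that can pull the image from the source registry
docker save -o nginx-p0-green.tar docker-registry-default.company.com/chat/nginx:p0-green

# copy the tar file to a host that can reach the production registry, then
docker load -i nginx-p0-green.tar
docker push docker-registry-default.company.com/chat/nginx:p0-green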

Comment 38 Anand Paladugu 2019-09-17 23:12:53 UTC
@Ricardo

The customer closed the case. It seems they identified the issue as being related to Azure blob storage. They did not specify, but blob storage access might have been limited to the virtual network only.
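
For reference, if that is what happened, the storage account's firewall default action can be checked with the Azure CLI along these lines (account and resource group names are placeholders):

az storage account show \
  --name <storage-account> \
  --resource-group <resource-group> \
  --query networkRuleSet.defaultAction

A default action of "Deny" with only the cluster's virtual network allowed would be consistent with the kind of restriction described above.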

Thanks for your help. I am closing this ticket.