Bug 1749333
| Summary: | Push to registry failing after OCP upgrade with error parsing HTTP 403 response body: unexpected end of JSON input: | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anand Paladugu <apaladug> |
| Component: | Image Registry | Assignee: | Ricardo Maraschini <rmarasch> |
| Status: | CLOSED NOTABUG | QA Contact: | Wenjing Zheng <wzheng> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | adam.kaplan, aos-bugs, hgomes, jolee, rmarasch |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-17 23:13:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Anand Paladugu
2019-09-05 11:51:12 UTC
oc adm diagnostics
[Note] Determining if client configuration exists for client/cluster diagnostics
[Note] No client configuration specified; skipping client and cluster diagnostics.
[Note] Performing systemd discovery
[Note] Running diagnostic: AnalyzeLogs
Description: Check for recent problems in systemd service logs
Info: Checking journalctl logs for 'atomic-openshift-node' service
Info: Checking journalctl logs for 'docker' service
[Note] Running diagnostic: UnitStatus
Description: Check status for related systemd units
[Note] Summary of diagnostics execution (version v3.11.135):
[Note] Completed with no errors or warnings seen.
An SOS report of the node hosting the registry pod is attached to the case. The registry pod log points to "unknown blob" errors, but it's not clear why. No file access issues are reported in audit.log. The registry errors eventually lead to "client disconnected during blob PATCH" and unexpected EOF, but the root cause seems to be the unknown blob. I have asked the customer to create a new project, do a build and try a push. Waiting for results.

1.13.1

Customer also reported the below. Since images cannot be pushed to prod, they are blocked.

- We tested pushing the rhscl/nginx image from the admin console (by creating a new project and building the example program): OK.
- We tested pushing the rhscl/nginx image from the command line (docker push) on one of the master nodes: OK.
- We got HTTP 403 when pushing the rhscl/nginx image from another RHEL server that is not one of the OpenShift nodes. However, we can pull it from registry.redhat.io.
- We have set insecure-registries in the /etc/docker/daemon.json file on the RHEL server (a minimal sketch of this setting and the external push flow follows the attachment list below).
- We have used this setup for a long time to sync OpenShift images between the test and production environments. We are not sure when the setting stopped working, but we suspect (just a guess) that it happened after we upgraded from v3.11.xx (should be 4x) to v3.11.135.

Attached the router logs and a screenshot of the new routes created.

@Ricardo: the command did not work in OCP 3.11. Is there an equivalent one for 3.11?

@Ricardo: The customer reported that the router logs are not showing anything; he checked all three pods. I will ask for docker -D output while you are checking on #19. I might ask for a WebEx to confirm the router logs. Is there anything else you need if I do get to do a WebEx?

"Yes, have tried on all router pods, but only see the health check messages." Below are some of the logs that I can see (can only see the health check log):

=======================================================
I0910 14:53:24.129851 1 router.go:481] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I0910 14:53:29.105896 1 router.go:481] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
W0910 15:02:52.975065 1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0910 15:12:39.087256 1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
I0910 15:17:22.325293 1 router.go:481] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I0910 15:17:49.847037 1 router.go:481] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
W0910 15:20:25.177671 1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0910 15:30:22.262110 1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.

Created attachment 1614611 [details]
haproxy-config
Created attachment 1614612 [details]
master-config
Created attachment 1614829 [details]
production router logs
Created attachment 1614830 [details]
Test router logs
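For reference, the insecure-registries setting and the external push flow described above look roughly like the sketch below. This is an illustration only: the registry hostname, project name, and image tag are placeholders, not values taken from this case, and it assumes oc is logged in on the external host.

```bash
# Placeholder route for the integrated registry; the real hostname used by
# the customer is not recorded in this bug.
REGISTRY=docker-registry-default.apps.example.com

# /etc/docker/daemon.json on the external RHEL host, marking the registry
# as insecure (merge with any keys already present in the file), then
# restart docker so the setting takes effect.
cat > /etc/docker/daemon.json <<'EOF'
{
  "insecure-registries": ["docker-registry-default.apps.example.com"]
}
EOF
systemctl restart docker

# Push flow from outside the cluster: log in with an OpenShift token,
# tag a local image into a project, and push. The customer reported
# HTTP 403 at this step when pushing from an external host.
docker login -u "$(oc whoami)" -p "$(oc whoami -t)" "$REGISTRY"
docker tag rhscl/nginx "$REGISTRY/myproject/nginx:latest"
docker push "$REGISTRY/myproject/nginx:latest"
```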
@Ricardo: Attached the router debug logs for both the production and test environments, for both the success and failure scenarios. The failure case is when they try to push from outside the network. I did not ask what the outside host is, but it's a Linux host running in Azure that they use to push into production. I think the customer tried to push from a RHEL server in the cloud and saw the same issue. The URL is the same in both the success and failure cases (you can see it in the logs). Around midnight on 9/12, early 9/12 our time. The files in the prod tar file had a timestamp of 9/13 00:22; I believe he uploaded the files right after testing the cases, because I shared the process only at 10 PM on 9/12.

Created attachment 1615562 [details]
docker registry debug logs
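The attachment above contains registry logs captured with debug logging enabled. The exact steps used here are not recorded in the bug; a minimal sketch of one way to do it on an OCP 3.11 integrated registry, assuming the default docker-registry deployment config in the default namespace, is:

```bash
# Assumption: the integrated registry runs as the "docker-registry"
# DeploymentConfig in the "default" namespace (the 3.11 default layout).
# The registry is based on docker/distribution, which accepts environment
# overrides for its config, so the log level can be raised via an env var.
oc -n default set env dc/docker-registry REGISTRY_LOG_LEVEL=debug

# Wait for the new registry pod to roll out, then capture its logs.
oc -n default rollout status dc/docker-registry
oc -n default logs -f dc/docker-registry > docker-registry-debug.log
```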
Added the docker registry logs with debug enabled.

@Ricardo: The customer is asking if there is any workaround for the situation. They pull from the test environment and push into prod (and vice versa), and they can no longer wait, as this is going to impact production plans. I suggested using docker save to write the image to a tar file (after a pull), transferring it to the right host, loading the image back, and then doing a push (sketched below), but they are not OK with it.

@Ricardo: The customer closed the case. It seems they have identified the issue as related to Azure blob storage. They did not specify, but the blob storage access might have been limited to the virtual network only. Thanks for your help; I am closing this ticket.
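A sketch of the suggested offline-transfer workaround, with placeholder registry hostnames and image names (docker load is the usual counterpart of docker save):

```bash
# On a host that can pull from the test environment's registry:
docker pull test-registry.example.com/myproject/myimage:latest
docker save -o myimage.tar test-registry.example.com/myproject/myimage:latest

# Copy myimage.tar to a host that can reach the production registry, then
# load it, retag it for the production registry, and push.
docker load -i myimage.tar
docker tag test-registry.example.com/myproject/myimage:latest \
           prod-registry.example.com/myproject/myimage:latest
docker push prod-registry.example.com/myproject/myimage:latest
```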