Bug 2015459 - [azure][openstack]When image registry configure an invalid proxy, registry pods are CrashLoopBackOff
Summary: [azure][openstack]When image registry configure an invalid proxy, registry po...
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.11.0
Assignee: Oleg Bulatov
QA Contact: wewang
Depends On:
Reported: 2021-10-19 09:14 UTC by wewang
Modified: 2022-08-10 10:38 UTC (History)
3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2022-08-10 10:38:21 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift cluster-image-registry-operator pull 732 0 None Merged Bug 2015459: Enable health check for storage driver 2022-03-16 17:56:36 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:38:47 UTC

Description wewang 2021-10-19 09:14:35 UTC
Description of problem:
When an invalid proxy is added to config.imageregistry/cluster, the registry pods go into CrashLoopBackOff and co/image-registry is degraded.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up a cluster with a global proxy
[wewang@localhost]$  oc get proxy.config -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Proxy
  metadata:
    creationTimestamp: "2021-10-19T02:49:51Z"
    generation: 1
    name: cluster
    resourceVersion: "568"
    uid: daf12f74-5b4e-44b6-b6f1-0643e79a8990
  spec:
    httpProxy: http://proxy-user1:xxxRZV4DY4PXJbxJK@10.0.xx.xx:31xx
    httpsProxy: http://proxy-user1:xxxx8qRZV4DY4PXJbxJK@10.0.xx.xx:31xx
    noProxy: test.no-proxy.com
    trustedCA:
      name: ""
  status:
    httpProxy: http://proxy-user1:xxxxxRZV4DY4PXJbxJK@10.0.xxx.xx:31xx
    httpsProxy: http://proxy-user1:xxRZV4DY4PXJbxJK@10.0.xx.xx:31xx
    noProxy: .cluster.local,.svc,,10.xx.0.0/14,xx0.0.1,xx.xx.169.254,xxx.xx0.0.0/16,xx-int.wewang-reprod.xx.azure.xxxluster.openshift.com,localhost,test.no-proxy.com

2. Add an invalid proxy to config.image/cluster
[wewang@localhost]$  oc get config.image -oyaml
  spec:
    managementState: Managed
    observedConfig: null
    operatorLogLevel: Normal
    proxy:
      http: http://test:3128
      https: http://test:3128
      noProxy: test.no-proxy.com
    replicas: 2
    requests:
      read:
        maxWaitInQueue: 0s
      write:
        maxWaitInQueue: 0s
    rolloutStrategy: RollingUpdate
    storage:
      azure:
        accountName: imageregistrywewangkxwzj
        cloudName: AzurePublicCloud
        container: wewang-reprod-b6mhg-image-registry-vamlnqokmkldqfvmgtelldhkurj
      managementState: Managed
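For reference, the invalid-proxy stanza in step 2 is the spec.proxy section of the operator config; one way to apply it is a merge patch like the sketch below (the `oc patch` invocation and patch-file shape are assumptions, not copied from this report; the values are the ones used above):

```yaml
# invalid-proxy-patch.yaml -- merged into the operator config with e.g.
#   oc patch configs.imageregistry.operator.openshift.io/cluster \
#     --type=merge --patch-file=invalid-proxy-patch.yaml
# (command shape is an assumption; verify flags against your oc version)
spec:
  proxy:
    http: http://test:3128
    https: http://test:3128
    noProxy: test.no-proxy.com
```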
3. Check the image registry pods 
[wewang@localhost ~]$ oc get pods -n openshift-image-registry
NAME                                               READY   STATUS             RESTARTS        AGE
cluster-image-registry-operator-5998498858-jsc77   1/1     Running            1 (3h47m ago)   3h58m
image-registry-7b557bd6fb-v6n84                    1/1     Running            0               3h45m
image-registry-8554cf844-j7lgd                     0/1     CrashLoopBackOff   6 (116s ago)    10m
image-registry-8554cf844-xrkkc                     0/1     CrashLoopBackOff   6 (100s ago)    10m

Here's the log: http://pastebin.test.redhat.com/1002261

4. Check the image registry operator
[wewang@localhost]$ oc get co/image-registry 
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.9.0-0.nightly-2021-10-18-182325   True        True          True       3h53m   Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-8554cf844" has timed out progressing.

5. Must-gather info: 

Actual results:
1. Pods are not running

Expected results:
1. Pods are running

Additional info:
This issue was only reproduced on Azure and OpenStack proxy clusters; proxy clusters on other platforms (AWS, vSphere, GCP, and bare metal) did not hit it.

Comment 1 Oleg Bulatov 2021-10-19 10:40:42 UTC
The registry pods cannot reach storage when an invalid proxy is set, so they should become unhealthy and be killed. That's exactly what happens on your cluster. I'd say it's a bug that it doesn't happen on AWS/GCP. It's ok for the registry to stay alive when it doesn't use HTTP connections and uses a regular file system (i.e. PVC).

Comment 2 wewang 2021-11-23 07:47:01 UTC
Verified in version:
Tested on Azure and OpenStack clusters: when an invalid proxy is set in config.image, the registry pods are running.

Comment 3 Oleg Bulatov 2021-11-23 07:57:08 UTC
That's not how it should work. The pod should be unhealthy (and eventually be killed) when an invalid proxy is set.

Comment 4 Oleg Bulatov 2021-11-23 10:25:57 UTC
On my local cluster:

$ oc get config.imageregistry/cluster -o json | jq .spec.proxy
  "http": "http://localhost",
  "https": "http://localhost"

$ oc -n openshift-image-registry get pods -l docker-registry=default
NAME                              READY   STATUS    RESTARTS      AGE
image-registry-6b674466bf-8kp5j   0/1     Running   2 (71s ago)   4m14s
image-registry-6b674466bf-vxqjp   0/1     Running   2 (67s ago)   4m14s

The pod starts to crash (see restarts).

Comment 8 XiuJuan Wang 2022-03-17 10:09:00 UTC
Verified on 4.11.0-0.nightly-2022-03-17-024314
Image registry pods report crashes (restarts) when an invalid proxy is added:
  Warning  ProbeError  10s (x2 over 20s)  kubelet  Readiness probe error: HTTP probe failed with statuscode: 503
body: {"errors":[{"code":"UNAVAILABLE","message":"service unavailable","detail":"health check failed: please see /debug/health"}]}
  Warning  Unhealthy   10s (x2 over 20s)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  ProbeError  10s (x2 over 20s)  kubelet  Liveness probe error: HTTP probe failed with statuscode: 503
body: {"errors":[{"code":"UNAVAILABLE","message":"service unavailable","detail":"health check failed: please see /debug/health"}]}
  Warning  Unhealthy  10s (x2 over 20s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
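The events above come from kubelet HTTP probes against the registry's health endpoint. A probe stanza of roughly this shape produces them; the path, port, scheme, and timings below are assumptions for illustration, not copied from the operator's actual deployment:

```yaml
# Illustrative liveness/readiness probes on the image-registry container
# (values are assumptions -- check the real deployment with
#  `oc -n openshift-image-registry get deploy image-registry -o yaml`):
livenessProbe:
  httpGet:
    path: /healthz
    port: 5000
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 5000
    scheme: HTTPS
  periodSeconds: 10
```

A failing readiness probe removes the pod from the service endpoints; repeated liveness failures trigger the restarts seen in comments 4 and 8.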

Comment 10 errata-xmlrpc 2022-08-10 10:38:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

