Bug 2015459

Summary: [azure][openstack] When the image registry is configured with an invalid proxy, registry pods go into CrashLoopBackOff
Product: OpenShift Container Platform
Reporter: wewang <wewang>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: wewang <wewang>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.9
CC: aos-bugs, wking, xiuwang
Target Milestone: ---
Keywords: Regression
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:38:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description wewang 2021-10-19 09:14:35 UTC
Description of problem:
When an invalid proxy is added to config.imageregistry/cluster, the registry pods go into CrashLoopBackOff and co/image-registry becomes Degraded.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-10-18-182325

How reproducible:
Always

Steps to Reproduce:
1. Set up a cluster with a global proxy
[wewang@localhost]$  oc get proxy.config -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Proxy
  metadata:
    creationTimestamp: "2021-10-19T02:49:51Z"
    generation: 1
    name: cluster
    resourceVersion: "568"
    uid: daf12f74-5b4e-44b6-b6f1-0643e79a8990
  spec:
    httpProxy: http://proxy-user1:xxxRZV4DY4PXJbxJK@10.0.xx.xx:31xx
    httpsProxy: http://proxy-user1:xxxx8qRZV4DY4PXJbxJK@10.0.xx.xx:31xx
    noProxy: test.no-proxy.com
    trustedCA:
      name: ""
  status:
    httpProxy: http://proxy-user1:xxxxxRZV4DY4PXJbxJK@10.0.xxx.xx:31xx
    httpsProxy: http://proxy-user1:xxRZV4DY4PXJbxJK@10.0.xx.xx:31xx
    noProxy: .cluster.local,.svc,10.0.0.0/xx,10.xx.0.0/14,xx0.0.1,xx.xx.169.254,xxx.xx0.0.0/16,xx-int.wewang-reprod.xx.azure.xxxluster.openshift.com,localhost,test.no-proxy.com

2. Add an invalid proxy to config.image/cluster
[wewang@localhost]$  oc get config.image -oyaml
```
    managementState: Managed
    observedConfig: null
    operatorLogLevel: Normal
    proxy:
      http: http://test:3128
      https: http://test:3128
      noProxy: test.no-proxy.com
    replicas: 2
    requests:
      read:
        maxWaitInQueue: 0s
      write:
        maxWaitInQueue: 0s
    rolloutStrategy: RollingUpdate
    storage:
      azure:
        accountName: imageregistrywewangkxwzj
        cloudName: AzurePublicCloud
        container: wewang-reprod-b6mhg-image-registry-vamlnqokmkldqfvmgtelldhkurj
      managementState: Managed
```
3. Check the image registry pods 
[wewang@localhost ~]$ oc get pods -n openshift-image-registry
NAME                                               READY   STATUS             RESTARTS        AGE
cluster-image-registry-operator-5998498858-jsc77   1/1     Running            1 (3h47m ago)   3h58m
image-registry-7b557bd6fb-v6n84                    1/1     Running            0               3h45m
image-registry-8554cf844-j7lgd                     0/1     CrashLoopBackOff   6 (116s ago)    10m
image-registry-8554cf844-xrkkc                     0/1     CrashLoopBackOff   6 (100s ago)    10m

Here is the pod log: http://pastebin.test.redhat.com/1002261

4. Check the image registry operator
[wewang@localhost]$ oc get co/image-registry 
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.9.0-0.nightly-2021-10-18-182325   True        True          True       3h53m   Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-8554cf844" has timed out progressing.

5. Must-gather info: 
http://virt-openshift-05.lab.eng.nay.redhat.com/wewang/image-registry/
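
A minimal sketch for spotting the failing pods from step 3 without reading the table by eye. It assumes the default `oc get pods` table layout shown above, where STATUS is the third column; `crashlooping_pods` is a hypothetical helper name, not part of `oc`.

```shell
# List pods whose STATUS column reads CrashLoopBackOff.
# Reads `oc get pods` table output on stdin and prints one pod name per line.
# crashlooping_pods is a hypothetical helper name, not part of oc.
crashlooping_pods() {
  # NR > 1 skips the header row; column 3 is STATUS in the default table layout.
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Example (needs a logged-in cluster session):
#   oc get pods -n openshift-image-registry | crashlooping_pods
```

Run against the output in step 3, this should print only the two pods from the new ReplicaSet, since the older pod is still Running.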

Actual results:
The new registry pods are in CrashLoopBackOff and never become ready.

Expected results:
The registry pods are running.


Additional info:
This issue was only reproduced on Azure and OpenStack clusters with a proxy. Proxy clusters on other platforms (AWS, vSphere, GCP, and bare metal) did not hit the issue.

Comment 1 Oleg Bulatov 2021-10-19 10:40:42 UTC
The registry pods cannot reach storage when an invalid proxy is set, so they should become unhealthy and be killed. That's exactly what happens on your cluster. I'd say it's a bug that it doesn't happen on AWS/GCP. It's ok for the registry to stay alive when it doesn't use HTTP connections and uses a regular file system (i.e. PVC).

Comment 2 wewang 2021-11-23 07:47:01 UTC
Verified in version:
4.10.0-0.ci.test-2021-11-23-070259-ci-ln-jintmht-latest
Tested on Azure and OpenStack clusters: when an invalid proxy is set in config.image, the registry pods are running.

Comment 3 Oleg Bulatov 2021-11-23 07:57:08 UTC
That's not how it should work. The pod should be unhealthy (and eventually be killed) when an invalid proxy is set.

Comment 4 Oleg Bulatov 2021-11-23 10:25:57 UTC
On my local cluster:

$ oc get config.imageregistry/cluster -o json | jq .spec.proxy
{
  "http": "http://localhost",
  "https": "http://localhost"
}

$ oc -n openshift-image-registry get pods -l docker-registry=default
NAME                              READY   STATUS    RESTARTS      AGE
image-registry-6b674466bf-8kp5j   0/1     Running   2 (71s ago)   4m14s
image-registry-6b674466bf-vxqjp   0/1     Running   2 (67s ago)   4m14s

The pods start to crash (see the RESTARTS column).
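
A rough sketch of how that check could be scripted, assuming the default `oc get pods` table layout (READY in column 2, the restart count as the first token of RESTARTS in column 4). `restarting_pods` is a hypothetical helper name, not part of `oc`.

```shell
# Print "<pod> <restart count>" for pods that are not fully ready and have
# restarted at least once. Reads `oc get pods` table output on stdin.
# restarting_pods is a hypothetical helper name, not part of oc.
restarting_pods() {
  # Column 2 is READY ("ready/total") and column 4 is the restart count;
  # the "(71s ago)" suffix lands in later fields and is ignored.
  awk 'NR > 1 {
    split($2, r, "/")
    if (r[1] != r[2] && $4 + 0 > 0) print $1, $4
  }'
}

# Example (needs a logged-in cluster session):
#   oc -n openshift-image-registry get pods -l docker-registry=default | restarting_pods
```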

Comment 8 XiuJuan Wang 2022-03-17 10:09:00 UTC
Verified on 4.11.0-0.nightly-2022-03-17-024314
The image registry pods now crash (restart) when an invalid proxy is added:
 
  Warning  ProbeError  10s (x2 over 20s)  kubelet  Readiness probe error: HTTP probe failed with statuscode: 503
body: {"errors":[{"code":"UNAVAILABLE","message":"service unavailable","detail":"health check failed: please see /debug/health"}]}
  Warning  Unhealthy   10s (x2 over 20s)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  ProbeError  10s (x2 over 20s)  kubelet  Liveness probe error: HTTP probe failed with statuscode: 503
body: {"errors":[{"code":"UNAVAILABLE","message":"service unavailable","detail":"health check failed: please see /debug/health"}]}
  Warning  Unhealthy   10s (x2 over 20s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
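
A small sketch for pulling one field out of the probe response body in the events above, for example to confirm that the 503 carries the UNAVAILABLE code. It is a naive `sed` match that assumes the flat single-error body shown here; a real JSON parser would be more robust. `probe_error_field` is a hypothetical helper name.

```shell
# Extract one string field from a probe "body:" line read on stdin.
# $1 is the JSON key to extract (for example: code, message, detail).
# probe_error_field is a hypothetical helper name; this is a plain text
# match, not a JSON parser, and assumes the value contains no escaped quotes.
probe_error_field() {
  sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}

# Example: paste the event text (or pipe `oc describe pod` output) into it:
#   oc -n openshift-image-registry describe pod <pod-name> | probe_error_field code
```

Lines that do not contain the requested key are dropped, so the helper can be fed the full event list.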

Comment 10 errata-xmlrpc 2022-08-10 10:38:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069