Bug 1936006 - Image Registry pod enters CrashLoopBackOff state for extended periods of time after node reboot
Keywords:
Status: CLOSED DUPLICATE of bug 1893956
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-05 22:23 UTC by Tyler Lisowski
Modified: 2021-03-08 12:27 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-08 12:27:28 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Description Tyler Lisowski 2021-03-05 22:23:15 UTC
Description of problem:
When the node that an image registry pod is on gets rebooted, the image-registry pod enters CrashLoopBackOff for extended periods of time. It's very similar to what was reported in this mailing thread:
https://lists.openshift.redhat.com/openshift-archives/users/2020-June/msg00002.html

We can reproduce this consistently in IBM Cloud ROKS but do not believe it's restricted to ROKS environments. I have seen this with registry deployments that use PVCs, S3 object storage, and regular emptyDir volumes.

When it occurs, we can resolve it immediately with the following actions (the same two edits, expressed as `kubectl patch` commands, are sketched after the code block below):
1. `kubectl edit configs.imageregistry.operator.openshift.io cluster` and set spec.managementState to Unmanaged
2. `kubectl edit deploy -n openshift-image-registry image-registry` and explicitly add in
```
command:
- /usr/bin/dockerregistry
```
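
For scripting, a non-interactive sketch of the same two edits (assumption: the registry container is the first container in the deployment's pod template):
```
# Take the registry out of operator management so manual edits stick.
kubectl patch configs.imageregistry.operator.openshift.io cluster \
  --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# Explicitly set the container command, bypassing the update-ca-trust wrapper.
kubectl patch deploy -n openshift-image-registry image-registry --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["/usr/bin/dockerregistry"]}]'
```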

And then on the re-rollout the image registry works immediately. If those changes are reverted (the state set back to Managed), it proceeds to start crashing again when it goes back to the default image. We replicated this on the following registry image:
```
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e089b9443226e55bd013c54096d4e1eb7e99b4c8dc8da75c92a4c7227ac56484
```


Version-Release number of selected component (if applicable):
4.6.17
4.5.31

How reproducible:
Consistently reproducible; the outages last for random periods of time (sometimes 30 minutes, sometimes 4+ hours).

Steps to Reproduce:
1. Find the node the registry pod is on:
```
NAME                                               READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6874f46dbc-xgltq   2/2     Running   0          3d10h   172.17.116.83   10.240.0.29   <none>           <none>
image-registry-8686b45c4-6rsn5                     1/1     Running   0          3d10h   172.17.116.84   10.240.0.29   <none>           <none>
```

2. Reboot the node:
```
ibmcloud cs worker reboot --cluster bsr0toow08i9rajtg1a0 --worker kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe
Reboot worker? [kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe] [y/N]> y
Processing kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe...
Processing on kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe complete.
OK
```

3. Wait for all workloads on the node to stabilize. The image-registry pod, however, never does:
```
kubectl get pods --all-namespaces -o wide | grep 10.240.0.29
calico-system                                      calico-kube-controllers-5c4474bfbf-xvflk                  1/1     Running            1          3d10h   172.17.116.84    10.240.0.29   <none>           <none>
calico-system                                      calico-node-sdls4                                         1/1     Running            1          3d9h    10.240.0.29      10.240.0.29   <none>           <none>
calico-system                                      calico-typha-6ffbdc5dfd-bfrwd                             1/1     Running            1          3d10h   10.240.0.29      10.240.0.29   <none>           <none>
default                                            prometheus-pushgateway-1597420487-5f6d5c89f9-p77rf        1/1     Running            1          3d10h   172.17.116.125   10.240.0.29   <none>           <none>
kube-system                                        ibm-keepalived-watcher-7hckg                              1/1     Running            1          3d9h    10.240.0.29      10.240.0.29   <none>           <none>
kube-system                                        ibm-kubelet-monitor-65qv2                                 1/1     Running            1          3d9h    172.17.116.88    10.240.0.29   <none>           <none>
kube-system                                        ibm-master-proxy-static-10.240.0.29                       2/2     Running            2          3d9h    10.240.0.29      10.240.0.29   <none>           <none>
kube-system                                        ibm-vpc-block-csi-controller-0                            4/4     Running            4          3d10h   172.17.116.127   10.240.0.29   <none>           <none>
kube-system                                        ibm-vpc-block-csi-node-zrtdk                              3/3     Running            3          3d9h    172.17.116.124   10.240.0.29   <none>           <none>
openshift-cluster-node-tuning-operator             tuned-479b2                                               1/1     Running            1          3d9h    10.240.0.29      10.240.0.29   <none>           <none>
openshift-console                                  downloads-6f7bfc965b-hgn54                                1/1     Running            3          3d10h   172.17.116.113   10.240.0.29   <none>           <none>
openshift-console                                  downloads-6f7bfc965b-zc8nc                                1/1     Running            3          3d10h   172.17.116.101   10.240.0.29   <none>           <none>
openshift-dns                                      dns-default-gg7nc                                         3/3     Running            3          3d9h    172.17.116.78    10.240.0.29   <none>           <none>
openshift-image-registry                           image-registry-8686b45c4-6rsn5                            0/1     CrashLoopBackOff   4          3d10h   172.17.116.110   10.240.0.29   <none>           <none>
openshift-image-registry                           node-ca-mxn9x                                             1/1     Running            1          3d9h    172.17.116.92    10.240.0.29   <none>           <none>
openshift-kube-proxy                               openshift-kube-proxy-cz6qq                                1/1     Running            1          3d9h    10.240.0.29      10.240.0.29   <none>           <none>
openshift-kube-storage-version-migrator            migrator-6948c84b78-v5vsg                                 1/1     Running            1          3d10h   172.17.116.67    10.240.0.29   <none>           <none>
openshift-monitoring                               alertmanager-main-0                                       5/5     Running            5          3d10h   172.17.116.120   10.240.0.29   <none>           <none>
openshift-monitoring                               alertmanager-main-1                                       5/5     Running            5          3d10h   172.17.116.123   10.240.0.29   <none>           <none>
openshift-monitoring                               alertmanager-main-2                                       5/5     Running            5          3d10h   172.17.116.107   10.240.0.29   <none>           <none>
openshift-monitoring                               grafana-5d566dbbcd-5qlmj                                  2/2     Running            2          3d10h   172.17.116.121   10.240.0.29   <none>           <none>
openshift-monitoring                               kube-state-metrics-766c75fd5b-v58md                       3/3     Running            3          3d10h   172.17.116.111   10.240.0.29   <none>           <none>
openshift-monitoring                               node-exporter-pf76w                                       2/2     Running            2          3d9h    10.240.0.29      10.240.0.29   <none>           <none>
openshift-monitoring                               openshift-state-metrics-5cf8467749-pjtsx                  3/3     Running            3          3d10h   172.17.116.118   10.240.0.29   <none>           <none>
openshift-monitoring                               prometheus-adapter-8467686677-bc6pv                       1/1     Running            1          3d10h   172.17.116.70    10.240.0.29   <none>           <none>
openshift-monitoring                               prometheus-adapter-8467686677-vd2wg                       1/1     Running            1          3d10h   172.17.116.97    10.240.0.29   <none>           <none>
openshift-monitoring                               prometheus-k8s-0                                          7/7     Running            8          3d10h   172.17.116.126   10.240.0.29   <none>           <none>
openshift-monitoring                               prometheus-k8s-1                                          7/7     Running            8          3d10h   172.17.116.122   10.240.0.29   <none>           <none>
openshift-monitoring                               telemeter-client-56b555bd99-5nt74                         3/3     Running            3          3d10h   172.17.116.72    10.240.0.29   <none>           <none>
openshift-monitoring                               thanos-querier-5cc764f9c9-brzf4                           4/4     Running            4          3d10h   172.17.116.115   10.240.0.29   <none>           <none>
openshift-monitoring                               thanos-querier-5cc764f9c9-jbrmr                           4/4     Running            4          3d10h   172.17.116.80    10.240.0.29   <none>           <none>
openshift-multus                                   multus-4c55c                                              1/1     Running            1          3d9h    10.240.0.29      10.240.0.29   <none>           <none>
openshift-multus                                   multus-admission-controller-rmhmz                         2/2     Running            0          3m3s    172.17.116.119   10.240.0.29   <none>           <none>
openshift-roks-metrics                             metrics-6cb57bdd7f-vmv8p                                  1/1     Running            1          3d10h   172.17.116.109   10.240.0.29   <none>           <none>
openshift-roks-metrics                             push-gateway-85bbbdd967-84fc7                             1/1     Running            1          3d10h   172.17.116.87    10.240.0.29   <none>           <none>
tigera-operator                                    tigera-operator-576f8c7f86-wjrvm                          1/1     Running            1          3d10h   10.240.0.29      10.240.0.29   <none>           <none>
```

Even killing the pod so that a new one is created does not help:
```
kubectl get pods -n openshift-image-registry
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2     Running            0          5m17s   172.17.124.33    10.240.0.30   <none>           <none>
image-registry-8686b45c4-9jvb2                     0/1     CrashLoopBackOff   2          50s     172.17.116.103   10.240.0.29   <none>           <none>
node-ca-9gpft                                      1/1     Running            0          3d9h    172.17.124.1     10.240.0.30   <none>           <none>
node-ca-mxn9x                                      1/1     Running            1          3d9h    172.17.116.92    10.240.0.29   <none>           <none>
node-ca-w9spn                                      1/1     Running            0          3d9h    172.17.65.130    10.240.0.31   <none>           <none>
```
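
(The kill step above is presumably a plain pod delete, with the pod name taken from the earlier listing; the ReplicaSet recreates the pod, which crash-loops again on the same node:)
```
kubectl delete pod -n openshift-image-registry image-registry-8686b45c4-6rsn5
```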

Then, in this same state, if I apply the workaround, it starts working, as shown below:

```
tylerlisowski$ kubectl edit configs.imageregistry.operator.openshift.io cluster
config.imageregistry.operator.openshift.io/cluster edited

Tylers-MacBook-Pro:Desktop tylerlisowski$ kubectl edit deploy -n openshift-image-registry image-registry
deployment.apps/image-registry edited

Tylers-MacBook-Pro:Desktop tylerlisowski$ kubectl get pods -n openshift-image-registry
NAME                                               READY   STATUS         RESTARTS   AGE
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2     Running        0          8m24s
image-registry-5f5bb4d8b6-zpmgl                    1/1     Running        0          13s
image-registry-8686b45c4-9jvb2                     0/1     Terminating    5          3m57s
```

Note how the new pod stays running:
```
Tylers-MacBook-Pro:Desktop tylerlisowski$ kubectl get pods -n openshift-image-registry
NAME                                               READY   STATUS         RESTARTS   AGE
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2     Running        0          9m23s
image-registry-5f5bb4d8b6-zpmgl                    1/1     Running        0          72s
node-ca-9gpft                                      1/1     Running        0          3d10h
node-ca-mxn9x                                      1/1     Running        1          3d10h
node-ca-w9spn                                      1/1     Running        0          3d10h
```

When it fails on 4.5, it prints the same logs as the ones outlined in the mailing thread. When it fails on 4.6, it gives no log output.

Expected results:
The image registry is expected to recover autonomously after a node reboot.


Actual results:
The image registry does not recover in some instances, and in all instances it has an extended period of downtime.


Additional info:

Comment 1 Tyler Lisowski 2021-03-05 23:19:29 UTC
I believe it's related to how the args are run (`\u0026\u0026` below is JSON-escaped `&&`):
```
        "args": [
          "sh",
          "-c",
          "update-ca-trust \u0026\u0026 exec \"$@\"",
          "arg0",
          "/usr/bin/dockerregistry"
        ],
```

It fails before it is ever able to make it to running the /usr/bin/dockerregistry binary.
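
For context on why the wrapper kills the pod: with `sh -c`, the word after the script becomes `$0` (hence the `arg0` placeholder) and the remaining words become the positional parameters, so `exec "$@"` replaces the shell with the registry binary, but only if update-ca-trust succeeds. A minimal sketch of the pattern:
```
# "arg0" is bound to $0; "$@" expands to /usr/bin/dockerregistry.
# Because of &&, the exec is never reached when update-ca-trust fails,
# and the container exits with update-ca-trust's non-zero status instead.
sh -c 'update-ca-trust && exec "$@"' arg0 /usr/bin/dockerregistry
```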

Comment 2 Tyler Lisowski 2021-03-06 01:14:26 UTC
On the 4.5 cluster these are the logs:
```
$ kubectl get pods -n openshift-image-registry -o wide
NAME                                               READY   STATUS             RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2     Running            0          3h      172.17.124.33   10.240.0.30   <none>           <none>
image-registry-8686b45c4-25tvh                     0/1     CrashLoopBackOff   22         96m     172.17.65.154   10.240.0.31   <none>           <none>
node-ca-9gpft                                      1/1     Running            0          3d12h   172.17.124.1    10.240.0.30   <none>           <none>
node-ca-mxn9x                                      1/1     Running            2          3d12h   172.17.116.90   10.240.0.29   <none>           <none>
node-ca-w9spn                                      1/1     Running            1          3d12h   172.17.65.157   10.240.0.31   <none>           <none>
$ kubectl logs -n openshift-image-registry image-registry-8686b45c4-25tvh
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/pem/email-ca-bundle.pem: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/pem/objsign-ca-bundle.pem: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/java/cacerts: Operation not permitted
```

Here is the full pod YAML:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 172.17.65.154/32
    cni.projectcalico.org/podIPs: 172.17.65.154/32
    imageregistry.operator.openshift.io/dependencies-checksum: sha256:cda51f80fe4fdf0f2ee580c0cbbf9e20804b382aefea320c0ab79c255c9bc5bb
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.17.65.154"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.17.65.154"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  creationTimestamp: "2021-03-05T23:34:42Z"
  generateName: image-registry-8686b45c4-
  labels:
    docker-registry: default
    pod-template-hash: 8686b45c4
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:imageregistry.operator.openshift.io/dependencies-checksum: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:docker-registry: {}
          f:pod-template-hash: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"77120223-8302-4bcb-bc49-786229dc8f50"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:affinity:
          .: {}
          f:podAntiAffinity:
            .: {}
            f:preferredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"registry"}:
            .: {}
            f:env:
              .: {}
              k:{"name":"REGISTRY_HTTP_ADDR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_NET"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_SECRET"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_TLS_CERTIFICATE"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_TLS_KEY"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_LOG_LEVEL"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_OPENSHIFT_METRICS_ENABLED"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_OPENSHIFT_QUOTA_ENABLED"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_OPENSHIFT_SERVER_ADDR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_CACHE_BLOBDESCRIPTOR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_DELETE_ENABLED"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_ACCESSKEY"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:secretKeyRef:
                    .: {}
                    f:key: {}
                    f:name: {}
              k:{"name":"REGISTRY_STORAGE_S3_BUCKET"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_ENCRYPT"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_REGION"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_REGIONENDPOINT"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_SECRETKEY"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:secretKeyRef:
                    .: {}
                    f:key: {}
                    f:name: {}
              k:{"name":"REGISTRY_STORAGE_S3_USEDUALSTACK"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_VIRTUALHOSTEDSTYLE"}:
                .: {}
                f:name: {}
                f:value: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":5000,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/etc/pki/ca-trust/source/anchors"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/etc/secrets"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/usr/share/pki/ca-trust-source"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/lib/kubelet/"}:
                .: {}
                f:mountPath: {}
                f:name: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector:
          .: {}
          f:kubernetes.io/os: {}
        f:priorityClassName: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext:
          .: {}
          f:fsGroup: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"installation-pull-secrets"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:items: {}
              f:optional: {}
              f:secretName: {}
          k:{"name":"registry-certificates"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:name: {}
            f:name: {}
          k:{"name":"registry-tls"}:
            .: {}
            f:name: {}
            f:projected:
              .: {}
              f:defaultMode: {}
              f:sources: {}
          k:{"name":"trusted-ca"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:items: {}
              f:name: {}
              f:optional: {}
            f:name: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-03-05T23:34:42Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cni.projectcalico.org/podIP: {}
          f:cni.projectcalico.org/podIPs: {}
    manager: calico
    operation: Update
    time: "2021-03-05T23:39:05Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:k8s.v1.cni.cncf.io/network-status: {}
          f:k8s.v1.cni.cncf.io/networks-status: {}
    manager: multus
    operation: Update
    time: "2021-03-05T23:39:05Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"172.17.65.154"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2021-03-06T01:12:06Z"
  name: image-registry-8686b45c4-25tvh
  namespace: openshift-image-registry
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: image-registry-8686b45c4
    uid: 77120223-8302-4bcb-bc49-786229dc8f50
  resourceVersion: "87898787"
  selfLink: /api/v1/namespaces/openshift-image-registry/pods/image-registry-8686b45c4-25tvh
  uid: 061213a4-5251-400e-bfaa-f2df25c7fc27
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          namespaces:
          - openshift-image-registry
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - env:
    - name: REGISTRY_STORAGE_S3_REGIONENDPOINT
      value: https://s3.direct.us.cloud-object-storage.appdomain.cloud
    - name: REGISTRY_STORAGE
      value: s3
    - name: REGISTRY_STORAGE_S3_BUCKET
      value: roks-bsr0toow08i9rajtg1a0-cdbf
    - name: REGISTRY_STORAGE_S3_REGION
      value: us-standard
    - name: REGISTRY_STORAGE_S3_ENCRYPT
      value: "false"
    - name: REGISTRY_STORAGE_S3_VIRTUALHOSTEDSTYLE
      value: "false"
    - name: REGISTRY_STORAGE_S3_USEDUALSTACK
      value: "true"
    - name: REGISTRY_STORAGE_S3_ACCESSKEY
      valueFrom:
        secretKeyRef:
          key: REGISTRY_STORAGE_S3_ACCESSKEY
          name: image-registry-private-configuration
    - name: REGISTRY_STORAGE_S3_SECRETKEY
      valueFrom:
        secretKeyRef:
          key: REGISTRY_STORAGE_S3_SECRETKEY
          name: image-registry-private-configuration
    - name: REGISTRY_HTTP_ADDR
      value: :5000
    - name: REGISTRY_HTTP_NET
      value: tcp
    - name: REGISTRY_HTTP_SECRET
      value: XXXXXXXXXXXX
    - name: REGISTRY_LOG_LEVEL
      value: info
    - name: REGISTRY_OPENSHIFT_QUOTA_ENABLED
      value: "true"
    - name: REGISTRY_STORAGE_CACHE_BLOBDESCRIPTOR
      value: inmemory
    - name: REGISTRY_STORAGE_DELETE_ENABLED
      value: "true"
    - name: REGISTRY_OPENSHIFT_METRICS_ENABLED
      value: "true"
    - name: REGISTRY_OPENSHIFT_SERVER_ADDR
      value: image-registry.openshift-image-registry.svc:5000
    - name: REGISTRY_HTTP_TLS_CERTIFICATE
      value: /etc/secrets/tls.crt
    - name: REGISTRY_HTTP_TLS_KEY
      value: /etc/secrets/tls.key
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70e1e34f06beb3cff39f0c7fa795904e97516264d54fa98268c1e7c0fbc14dfe
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 5000
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry
    ports:
    - containerPort: 5000
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 5000
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000430000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/secrets
      name: registry-tls
    - mountPath: /etc/pki/ca-trust/source/anchors
      name: registry-certificates
    - mountPath: /usr/share/pki/ca-trust-source
      name: trusted-ca
    - mountPath: /var/lib/kubelet/
      name: installation-pull-secrets
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: registry-token-qb65z
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: registry-dockercfg-lnm8c
  nodeName: 10.240.0.31
  nodeSelector:
    kubernetes.io/os: linux
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000430000
    seLinuxOptions:
      level: s0:c21,c5
  serviceAccount: registry
  serviceAccountName: registry
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: registry-tls
    projected:
      defaultMode: 420
      sources:
      - secret:
          name: image-registry-tls
  - configMap:
      defaultMode: 420
      name: image-registry-certificates
    name: registry-certificates
  - configMap:
      defaultMode: 420
      items:
      - key: ca-bundle.crt
        path: anchors/ca-bundle.crt
      name: trusted-ca
      optional: true
    name: trusted-ca
  - name: installation-pull-secrets
    secret:
      defaultMode: 420
      items:
      - key: .dockerconfigjson
        path: config.json
      optional: true
      secretName: installation-pull-secrets
  - name: registry-token-qb65z
    secret:
      defaultMode: 420
      secretName: registry-token-qb65z
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:34:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:35:42Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:35:42Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:34:42Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://dfeb9108c9912329a8e1b26a391f47b51b67912904f36aa8d30d8863e413cb18
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70e1e34f06beb3cff39f0c7fa795904e97516264d54fa98268c1e7c0fbc14dfe
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70e1e34f06beb3cff39f0c7fa795904e97516264d54fa98268c1e7c0fbc14dfe
    lastState:
      terminated:
        containerID: cri-o://dfeb9108c9912329a8e1b26a391f47b51b67912904f36aa8d30d8863e413cb18
        exitCode: 1
        finishedAt: "2021-03-06T01:11:56Z"
        reason: Error
        startedAt: "2021-03-06T01:11:55Z"
    name: registry
    ready: false
    restartCount: 23
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=registry pod=image-registry-8686b45c4-25tvh_openshift-image-registry(061213a4-5251-400e-bfaa-f2df25c7fc27)
        reason: CrashLoopBackOff
  hostIP: 10.240.0.31
  phase: Running
  podIP: 172.17.65.154
  podIPs:
  - ip: 172.17.65.154
  qosClass: Burstable
  startTime: "2021-03-05T23:34:42Z"
```

Comment 3 Tyler Lisowski 2021-03-06 02:05:35 UTC
I can confirm that on the pods that are failing, update-ca-trust returns exit code 1, while on the pods that are passing it returns successfully.

What I have found so far: on pods that succeed, the directory looks like this:
```
bash-4.2$ ls -la /usr/share/pki/
total 16
drwxr-xr-x. 4 root root       4096 Dec 17 14:37 .
drwxr-xr-x. 1 root root       4096 Jan 30 04:37 ..
drwxr-xr-x. 2 root root       4096 Dec 17 14:37 ca-trust-legacy
drwxrwsrwt. 3 root 1000400000 4096 Mar  5 08:21 ca-trust-source
bash-4.2$ 
```

The group on ca-trust-source matches the id the container runs as:
```
bash-4.2$ id
uid=1000400000(1000400000) gid=0(root) groups=0(root),1000400000
```

In the ones that fail, /etc/pki/ca-trust is owned entirely by root:
```
bash-4.2$ ls -la /etc/pki/ca-trust/
total 36
drwxr-xr-x. 1 root root 4096 Dec 17 14:37 .
drwxr-xr-x. 1 root root 4096 Dec 17 14:39 ..
-rw-r--r--. 1 root root  166 Jun  9  2020 README
-rw-r--r--. 1 root root  980 Jun  9  2020 ca-legacy.conf
drwxrwxrwt. 1 root root 4096 Dec 17 14:37 extracted
drwxr-xr-x. 4 root root 4096 Dec 17 14:37 source
```
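
A quick way to compare the two (assumption: run from a shell inside the registry container; `stat -c` is coreutils):
```
# The container's uid...
id -u
# ...versus the owner, group, and mode of the extracted trust files.
stat -c '%U %G %a %n' /etc/pki/ca-trust/extracted/pem/*
```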

Comment 4 Tyler Lisowski 2021-03-06 02:40:43 UTC
I was able to find the difference between a pod that fails with this and a pod that succeeds.

A pod that fails has the following:
```
bash-4.4$ ls -la /etc/pki/ca-trust/extracted/pem/
total 2380
drwxrwxrwt. 1 root       root   4096 Mar  6 02:34 .
drwxrwxrwt. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    898 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 163655 Jan 19 18:23 email-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:30 email-ca-bundle.pem.S6evtC
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 email-ca-bundle.pem.SIYtSC
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 email-ca-bundle.pem.b3Llun
-rw-rw-rw-. 1 root       root      0 Jan 19 18:23 objsign-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:30 objsign-ca-bundle.pem.KK0j8G
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 objsign-ca-bundle.pem.YSgY0s
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 objsign-ca-bundle.pem.fElyrI
-rw-rw-rw-. 1 root       root 216090 Jan 19 18:23 tls-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 tls-ca-bundle.pem.70Maux
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:30 tls-ca-bundle.pem.iOy7Vw
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 tls-ca-bundle.pem.z3qMTt
```

What causes it to fail is specifically the files owned by root:root (i.e. tls-ca-bundle.pem, objsign-ca-bundle.pem, and email-ca-bundle.pem). update-ca-trust appears to write each new bundle to a temporary file (the .S6evtC-style names above) and then rename it over the original; with the sticky bit set on the directory (drwxrwxrwt), the container's non-root uid cannot replace or remove root-owned files, which matches the "Operation not permitted" errors.

Comment 5 Tyler Lisowski 2021-03-06 02:41:40 UTC
A pod that is successful has the following:
```
bash-4.4$ ls -la etc/pki/ca-trust/extracted/pem
total 680
drwxrwxrwx. 1 root       root   4096 Mar  6 02:38 .
drwxrwxrwx. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    898 Jun 22  2020 README
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:38 email-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:38 objsign-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:38 tls-ca-bundle.pem

```

Note that there are no pem files owned by root:root; ownership all matches the id the container runs as.

Comment 6 Tyler Lisowski 2021-03-06 02:44:29 UTC
In the case of the failing pod, it will not let me remove the stale bundles either:
```
bash-4.4$ ls -l /etc/pki/ca-trust/extracted/pem/
total 384
-rw-rw-rw-. 1 root root    898 Jun 22  2020 README
-rw-rw-rw-. 1 root root 163655 Jan 19 18:23 email-ca-bundle.pem
-rw-rw-rw-. 1 root root      0 Jan 19 18:23 objsign-ca-bundle.pem
-rw-rw-rw-. 1 root root 216090 Jan 19 18:23 tls-ca-bundle.pem
bash-4.4$ rm /etc/pki/ca-trust/extracted/pem/email-ca-bundle.pem 
rm: cannot remove '/etc/pki/ca-trust/extracted/pem/email-ca-bundle.pem': Operation not permitted
bash-4.4$ rm /etc/pki/ca-trust/extracted/pem/objsign-ca-bundle.pem 
rm: cannot remove '/etc/pki/ca-trust/extracted/pem/objsign-ca-bundle.pem': Operation not permitted
bash-4.4$ 
```
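
This is consistent with sticky-bit semantics: in a mode-1777 directory (drwxrwxrwt, like the extracted directories here), only a file's owner, the directory's owner, or root may unlink or rename a file, regardless of the file's own mode bits. A hypothetical demonstration on any Linux box:
```
sudo mkdir -m 1777 /tmp/sticky                               # drwxrwxrwt, root-owned, like .../extracted/pem
sudo sh -c 'touch /tmp/sticky/f && chmod 666 /tmp/sticky/f'  # root-owned, world-writable file
rm /tmp/sticky/f                                             # as an unprivileged user:
# rm: cannot remove '/tmp/sticky/f': Operation not permitted
```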

Comment 7 Tyler Lisowski 2021-03-06 02:56:57 UTC
The same goes for all the other directories under the extracted directory as well:
```
bash-4.4$ ls -l /etc/pki/ca-trust/extracted/edk2/
total 324
-rw-rw-rw-. 1 root       root    566 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 156842 Jan 19 18:23 cacerts.bin
-r--r--r--. 1 1000270000 root 157739 Mar  6 02:55 cacerts.bin.7pScu8
bash-4.4$ 

bash-4.4$ ls -la /etc/pki/ca-trust/extracted/java/
total 340
drwxrwxrwt. 1 root       root   4096 Mar  6 02:55 .
drwxrwxrwt. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    726 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 157499 Jan 19 18:23 cacerts
-r--r--r--. 1 1000270000 root 158440 Mar  6 02:55 cacerts.WcK3S6
bash-4.4$ 



bash-4.4$ ls -la /etc/pki/ca-trust/extracted/openssl/
total 496
drwxrwxrwt. 1 root       root   4096 Mar  6 02:55 .
drwxrwxrwt. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    787 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 249827 Jan 19 18:23 ca-bundle.trust.crt
-r--r--r--. 1 1000270000 root 225471 Mar  6 02:55 ca-bundle.trust.crt.dqvnfJ
bash-4.4$ 
```

Comment 8 Tyler Lisowski 2021-03-06 03:03:27 UTC
One potential suggestion is to ignore the exit code of update-ca-trust, or to move the update-ca-trust step somewhere where its failure doesn't kill the pod. The first option might look like the sketch below.
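
A sketch of what ignoring the exit code could look like in the deployment's container args (hypothetical, not a shipped fix; `|| true` masks the update-ca-trust failure so the registry still starts):
```
args:
- sh
- -c
- update-ca-trust || true; exec "$@"
- arg0
- /usr/bin/dockerregistry
```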

Comment 9 Oleg Bulatov 2021-03-08 12:27:28 UTC
This is fixed by BZ 1893956: an emptyDir is mounted into /etc/pki/ca-trust/extracted, which should always be writable.

*** This bug has been marked as a duplicate of bug 1893956 ***

