Description of problem:

When the node an image-registry pod is running on gets rebooted, the image-registry pod enters CrashLoopBackOff for extended periods of time. It is very similar to what was reported in this mailing thread: https://lists.openshift.redhat.com/openshift-archives/users/2020-June/msg00002.html

We can reproduce this consistently in IBM Cloud ROKS, but we do not believe it is restricted to ROKS environments. I have seen this with registry deployments that use PVCs, S3 object storage, and plain emptyDirs.

When it occurs, we are able to resolve it immediately with the following actions:

1. `kubectl edit configs.imageregistry.operator.openshift.io cluster` and set spec.managementState to Unmanaged
2. `kubectl edit deploy -n openshift-image-registry image-registry` and explicitly add:

```
command:
- /usr/bin/dockerregistry
```

On the re-rollout the image registry then works immediately. If those changes are reverted (the state set back to Managed), it starts crashing again once the deployment reverts to the default entrypoint.

We replicated this on the following registry image:

```
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e089b9443226e55bd013c54096d4e1eb7e99b4c8dc8da75c92a4c7227ac56484
```

Version-Release number of selected component (if applicable):

4.6.17
4.5.31

How reproducible:

Consistently reproducible. The outages last for random periods of time (sometimes 4+ hours, sometimes 30 minutes).

Steps to Reproduce:

1. Find the node the registry pod is on:

```
NAME                                               READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6874f46dbc-xgltq   2/2     Running   0          3d10h   172.17.116.83   10.240.0.29   <none>           <none>
image-registry-8686b45c4-6rsn5                     1/1     Running   0          3d10h   172.17.116.84   10.240.0.29   <none>           <none>
```

2. Reboot the node:

```
ibmcloud cs worker reboot --cluster bsr0toow08i9rajtg1a0 --worker kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe
Reboot worker? [kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe] [y/N]> y
Processing kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe...
Processing on kube-bsr0toow08i9rajtg1a0-pvgmetricsc-default-000083fe complete.
OK
```

3. Wait for all workloads on the node to stabilize.
However the image-registry pod never will:

```
kubectl get pods --all-namespaces -o wide | grep 10.240.0.29
calico-system                             calico-kube-controllers-5c4474bfbf-xvflk             1/1   Running            1   3d10h   172.17.116.84    10.240.0.29   <none>   <none>
calico-system                             calico-node-sdls4                                    1/1   Running            1   3d9h    10.240.0.29      10.240.0.29   <none>   <none>
calico-system                             calico-typha-6ffbdc5dfd-bfrwd                        1/1   Running            1   3d10h   10.240.0.29      10.240.0.29   <none>   <none>
default                                   prometheus-pushgateway-1597420487-5f6d5c89f9-p77rf   1/1   Running            1   3d10h   172.17.116.125   10.240.0.29   <none>   <none>
kube-system                               ibm-keepalived-watcher-7hckg                         1/1   Running            1   3d9h    10.240.0.29      10.240.0.29   <none>   <none>
kube-system                               ibm-kubelet-monitor-65qv2                            1/1   Running            1   3d9h    172.17.116.88    10.240.0.29   <none>   <none>
kube-system                               ibm-master-proxy-static-10.240.0.29                  2/2   Running            2   3d9h    10.240.0.29      10.240.0.29   <none>   <none>
kube-system                               ibm-vpc-block-csi-controller-0                       4/4   Running            4   3d10h   172.17.116.127   10.240.0.29   <none>   <none>
kube-system                               ibm-vpc-block-csi-node-zrtdk                         3/3   Running            3   3d9h    172.17.116.124   10.240.0.29   <none>   <none>
openshift-cluster-node-tuning-operator    tuned-479b2                                          1/1   Running            1   3d9h    10.240.0.29      10.240.0.29   <none>   <none>
openshift-console                         downloads-6f7bfc965b-hgn54                           1/1   Running            3   3d10h   172.17.116.113   10.240.0.29   <none>   <none>
openshift-console                         downloads-6f7bfc965b-zc8nc                           1/1   Running            3   3d10h   172.17.116.101   10.240.0.29   <none>   <none>
openshift-dns                             dns-default-gg7nc                                    3/3   Running            3   3d9h    172.17.116.78    10.240.0.29   <none>   <none>
openshift-image-registry                  image-registry-8686b45c4-6rsn5                       0/1   CrashLoopBackOff   4   3d10h   172.17.116.110   10.240.0.29   <none>   <none>
openshift-image-registry                  node-ca-mxn9x                                        1/1   Running            1   3d9h    172.17.116.92    10.240.0.29   <none>   <none>
openshift-kube-proxy                      openshift-kube-proxy-cz6qq                           1/1   Running            1   3d9h    10.240.0.29      10.240.0.29   <none>   <none>
openshift-kube-storage-version-migrator   migrator-6948c84b78-v5vsg                            1/1   Running            1   3d10h   172.17.116.67    10.240.0.29   <none>   <none>
openshift-monitoring                      alertmanager-main-0                                  5/5   Running            5   3d10h   172.17.116.120   10.240.0.29   <none>   <none>
openshift-monitoring                      alertmanager-main-1                                  5/5   Running            5   3d10h   172.17.116.123   10.240.0.29   <none>   <none>
openshift-monitoring                      alertmanager-main-2                                  5/5   Running            5   3d10h   172.17.116.107   10.240.0.29   <none>   <none>
openshift-monitoring                      grafana-5d566dbbcd-5qlmj                             2/2   Running            2   3d10h   172.17.116.121   10.240.0.29   <none>   <none>
openshift-monitoring                      kube-state-metrics-766c75fd5b-v58md                  3/3   Running            3   3d10h   172.17.116.111   10.240.0.29   <none>   <none>
openshift-monitoring                      node-exporter-pf76w                                  2/2   Running            2   3d9h    10.240.0.29      10.240.0.29   <none>   <none>
openshift-monitoring                      openshift-state-metrics-5cf8467749-pjtsx             3/3   Running            3   3d10h   172.17.116.118   10.240.0.29   <none>   <none>
openshift-monitoring                      prometheus-adapter-8467686677-bc6pv                  1/1   Running            1   3d10h   172.17.116.70    10.240.0.29   <none>   <none>
openshift-monitoring                      prometheus-adapter-8467686677-vd2wg                  1/1   Running            1   3d10h   172.17.116.97    10.240.0.29   <none>   <none>
openshift-monitoring                      prometheus-k8s-0                                     7/7   Running            8   3d10h   172.17.116.126   10.240.0.29   <none>   <none>
openshift-monitoring                      prometheus-k8s-1                                     7/7   Running            8   3d10h   172.17.116.122   10.240.0.29   <none>   <none>
openshift-monitoring                      telemeter-client-56b555bd99-5nt74                    3/3   Running            3   3d10h   172.17.116.72    10.240.0.29   <none>   <none>
openshift-monitoring                      thanos-querier-5cc764f9c9-brzf4                      4/4   Running            4   3d10h   172.17.116.115   10.240.0.29   <none>   <none>
openshift-monitoring                      thanos-querier-5cc764f9c9-jbrmr                      4/4   Running            4   3d10h   172.17.116.80    10.240.0.29   <none>   <none>
openshift-multus                          multus-4c55c                                         1/1   Running            1   3d9h    10.240.0.29      10.240.0.29   <none>   <none>
openshift-multus                          multus-admission-controller-rmhmz                    2/2   Running            0   3m3s    172.17.116.119   10.240.0.29   <none>   <none>
openshift-roks-metrics                    metrics-6cb57bdd7f-vmv8p                             1/1   Running            1   3d10h   172.17.116.109   10.240.0.29   <none>   <none>
openshift-roks-metrics                    push-gateway-85bbbdd967-84fc7                        1/1   Running            1   3d10h   172.17.116.87    10.240.0.29   <none>   <none>
tigera-operator                           tigera-operator-576f8c7f86-wjrvm                     1/1   Running            1   3d10h   10.240.0.29      10.240.0.29   <none>   <none>
```

Even killing the pod so a new one is created does not work:

```
kubectl get pods -n openshift-image-registry
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2   Running            0   5m17s   172.17.124.33    10.240.0.30   <none>   <none>
image-registry-8686b45c4-9jvb2                     0/1   CrashLoopBackOff   2   50s     172.17.116.103   10.240.0.29   <none>   <none>
node-ca-9gpft                                      1/1   Running            0   3d9h    172.17.124.1     10.240.0.30   <none>   <none>
node-ca-mxn9x                                      1/1   Running            1   3d9h    172.17.116.92    10.240.0.29   <none>   <none>
node-ca-w9spn                                      1/1   Running            0   3d9h    172.17.65.130    10.240.0.31   <none>   <none>
```

Then, from this same state, applying the workaround gets it running, as shown below:

```
tylerlisowski$ kubectl edit configs.imageregistry.operator.openshift.io cluster
config.imageregistry.operator.openshift.io/cluster edited
tylerlisowski$
Tylers-MacBook-Pro:Desktop tylerlisowski$ kubectl edit deploy -n openshift-image-registry image-registry
deployment.apps/image-registry edited
Tylers-MacBook-Pro:Desktop tylerlisowski$
Tylers-MacBook-Pro:Desktop tylerlisowski$ kubectl get pods -n openshift-image-registry
NAME                                               READY   STATUS        RESTARTS   AGE
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2     Running       0          8m24s
image-registry-5f5bb4d8b6-zpmgl                    1/1     Running       0          13s
image-registry-8686b45c4-9jvb2                     0/1     Terminating   5          3m57s
```

Note how the new pod stays running:

```
Tylers-MacBook-Pro:Desktop tylerlisowski$ kubectl get pods -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2     Running   0          9m23s
image-registry-5f5bb4d8b6-zpmgl                    1/1     Running   0          72s
node-ca-9gpft                                      1/1     Running   0          3d10h
node-ca-mxn9x                                      1/1     Running   1          3d10h
node-ca-w9spn                                      1/1     Running   0          3d10h
```
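For reference, the same two edits can be scripted instead of done interactively. This is an illustrative sketch of equivalent commands, not the exact commands run above:

```
# 1. Take the registry deployment out of operator management.
kubectl patch configs.imageregistry.operator.openshift.io cluster \
  --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# 2. Invoke the registry binary directly, bypassing the update-ca-trust wrapper.
kubectl patch deployment image-registry -n openshift-image-registry --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["/usr/bin/dockerregistry"]}]'
```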
When it's failing on 4.5, it prints the same logs as the ones outlined in the mailing thread. When it fails on 4.6, it gives no log output.

Expected results:

The image registry recovers autonomously after a reboot of a node.

Actual results:

The image registry does not recover in some instances, and in all instances has an extended period of downtime.

Additional info:

I believe it's related to how the args are run:

```
"args": [
  "sh",
  "-c",
  "update-ca-trust \u0026\u0026 exec \"$@\"",
  "arg0",
  "/usr/bin/dockerregistry"
],
```

(The `\u0026\u0026` is JSON escaping for `&&`.) It fails before it is ever able to reach the /usr/bin/dockerregistry binary.
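Expanded, this entrypoint is the standard `sh -c` wrapper pattern: `$0` is bound to `arg0` and `"$@"` to `/usr/bin/dockerregistry`, so on success the shell becomes `exec /usr/bin/dockerregistry`, and on failure the `&&` short-circuits and the shell exits with update-ca-trust's exit code before the registry ever starts. A minimal demonstration, with `false` standing in for a failing update-ca-trust:

```
# Same wrapper shape as the container entrypoint; `false` plays update-ca-trust.
sh -c 'false && exec "$@"' arg0 /usr/bin/dockerregistry
echo "exit code: $?"   # prints: exit code: 1 -- exec "$@" never runs
```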
On the 4.5 cluster these are the logs:

```
$ kubectl get pods -n openshift-image-registry -o wide
NAME                                               READY   STATUS             RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6874f46dbc-gnrlc   2/2     Running            0          3h      172.17.124.33   10.240.0.30   <none>           <none>
image-registry-8686b45c4-25tvh                     0/1     CrashLoopBackOff   22         96m     172.17.65.154   10.240.0.31   <none>           <none>
node-ca-9gpft                                      1/1     Running            0          3d12h   172.17.124.1    10.240.0.30   <none>           <none>
node-ca-mxn9x                                      1/1     Running            2          3d12h   172.17.116.90   10.240.0.29   <none>           <none>
node-ca-w9spn                                      1/1     Running            1          3d12h   172.17.65.157   10.240.0.31   <none>           <none>

$ kubectl logs -n openshift-image-registry image-registry-8686b45c4-25tvh
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/pem/email-ca-bundle.pem: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/pem/objsign-ca-bundle.pem: Operation not permitted
p11-kit: couldn't complete writing file: /etc/pki/ca-trust/extracted/java/cacerts: Operation not permitted
```

Here is the full pod yaml:

```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 172.17.65.154/32
    cni.projectcalico.org/podIPs: 172.17.65.154/32
    imageregistry.operator.openshift.io/dependencies-checksum: sha256:cda51f80fe4fdf0f2ee580c0cbbf9e20804b382aefea320c0ab79c255c9bc5bb
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.17.65.154"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.17.65.154"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  creationTimestamp: "2021-03-05T23:34:42Z"
  generateName: image-registry-8686b45c4-
  labels:
    docker-registry: default
    pod-template-hash: 8686b45c4
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:imageregistry.operator.openshift.io/dependencies-checksum: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:docker-registry: {}
          f:pod-template-hash: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"77120223-8302-4bcb-bc49-786229dc8f50"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:affinity:
          .: {}
          f:podAntiAffinity:
            .: {}
            f:preferredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"registry"}:
            .: {}
            f:env:
              .: {}
              k:{"name":"REGISTRY_HTTP_ADDR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_NET"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_SECRET"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_TLS_CERTIFICATE"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_HTTP_TLS_KEY"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_LOG_LEVEL"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_OPENSHIFT_METRICS_ENABLED"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_OPENSHIFT_QUOTA_ENABLED"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_OPENSHIFT_SERVER_ADDR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_CACHE_BLOBDESCRIPTOR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_DELETE_ENABLED"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_ACCESSKEY"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:secretKeyRef:
                    .: {}
                    f:key: {}
                    f:name: {}
              k:{"name":"REGISTRY_STORAGE_S3_BUCKET"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_ENCRYPT"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_REGION"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_REGIONENDPOINT"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_SECRETKEY"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:secretKeyRef:
                    .: {}
                    f:key: {}
                    f:name: {}
              k:{"name":"REGISTRY_STORAGE_S3_USEDUALSTACK"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"REGISTRY_STORAGE_S3_VIRTUALHOSTEDSTYLE"}:
                .: {}
                f:name: {}
                f:value: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":5000,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/etc/pki/ca-trust/source/anchors"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/etc/secrets"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/usr/share/pki/ca-trust-source"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/lib/kubelet/"}:
                .: {}
                f:mountPath: {}
                f:name: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector:
          .: {}
          f:kubernetes.io/os: {}
        f:priorityClassName: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext:
          .: {}
          f:fsGroup: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"installation-pull-secrets"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:items: {}
              f:optional: {}
              f:secretName: {}
          k:{"name":"registry-certificates"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:name: {}
            f:name: {}
          k:{"name":"registry-tls"}:
            .: {}
            f:name: {}
            f:projected:
              .: {}
              f:defaultMode: {}
              f:sources: {}
          k:{"name":"trusted-ca"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:items: {}
              f:name: {}
              f:optional: {}
            f:name: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-03-05T23:34:42Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cni.projectcalico.org/podIP: {}
          f:cni.projectcalico.org/podIPs: {}
    manager: calico
    operation: Update
    time: "2021-03-05T23:39:05Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:k8s.v1.cni.cncf.io/network-status: {}
          f:k8s.v1.cni.cncf.io/networks-status: {}
    manager: multus
    operation: Update
    time: "2021-03-05T23:39:05Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"172.17.65.154"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2021-03-06T01:12:06Z"
  name: image-registry-8686b45c4-25tvh
  namespace: openshift-image-registry
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: image-registry-8686b45c4
    uid: 77120223-8302-4bcb-bc49-786229dc8f50
  resourceVersion: "87898787"
  selfLink: /api/v1/namespaces/openshift-image-registry/pods/image-registry-8686b45c4-25tvh
  uid: 061213a4-5251-400e-bfaa-f2df25c7fc27
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          namespaces:
          - openshift-image-registry
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - env:
    - name: REGISTRY_STORAGE_S3_REGIONENDPOINT
      value: https://s3.direct.us.cloud-object-storage.appdomain.cloud
    - name: REGISTRY_STORAGE
      value: s3
    - name: REGISTRY_STORAGE_S3_BUCKET
      value: roks-bsr0toow08i9rajtg1a0-cdbf
    - name: REGISTRY_STORAGE_S3_REGION
      value: us-standard
    - name: REGISTRY_STORAGE_S3_ENCRYPT
      value: "false"
    - name: REGISTRY_STORAGE_S3_VIRTUALHOSTEDSTYLE
      value: "false"
    - name: REGISTRY_STORAGE_S3_USEDUALSTACK
      value: "true"
    - name: REGISTRY_STORAGE_S3_ACCESSKEY
      valueFrom:
        secretKeyRef:
          key: REGISTRY_STORAGE_S3_ACCESSKEY
          name: image-registry-private-configuration
    - name: REGISTRY_STORAGE_S3_SECRETKEY
      valueFrom:
        secretKeyRef:
          key: REGISTRY_STORAGE_S3_SECRETKEY
          name: image-registry-private-configuration
    - name: REGISTRY_HTTP_ADDR
      value: :5000
    - name: REGISTRY_HTTP_NET
      value: tcp
    - name: REGISTRY_HTTP_SECRET
      value: XXXXXXXXXXXX
    - name: REGISTRY_LOG_LEVEL
      value: info
    - name: REGISTRY_OPENSHIFT_QUOTA_ENABLED
      value: "true"
    - name: REGISTRY_STORAGE_CACHE_BLOBDESCRIPTOR
      value: inmemory
    - name: REGISTRY_STORAGE_DELETE_ENABLED
      value: "true"
    - name: REGISTRY_OPENSHIFT_METRICS_ENABLED
      value: "true"
    - name: REGISTRY_OPENSHIFT_SERVER_ADDR
      value: image-registry.openshift-image-registry.svc:5000
    - name: REGISTRY_HTTP_TLS_CERTIFICATE
      value: /etc/secrets/tls.crt
    - name: REGISTRY_HTTP_TLS_KEY
      value: /etc/secrets/tls.key
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70e1e34f06beb3cff39f0c7fa795904e97516264d54fa98268c1e7c0fbc14dfe
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 5000
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry
    ports:
    - containerPort: 5000
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 5000
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000430000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/secrets
      name: registry-tls
    - mountPath: /etc/pki/ca-trust/source/anchors
      name: registry-certificates
    - mountPath: /usr/share/pki/ca-trust-source
      name: trusted-ca
    - mountPath: /var/lib/kubelet/
      name: installation-pull-secrets
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: registry-token-qb65z
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: registry-dockercfg-lnm8c
  nodeName: 10.240.0.31
  nodeSelector:
    kubernetes.io/os: linux
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000430000
    seLinuxOptions:
      level: s0:c21,c5
  serviceAccount: registry
  serviceAccountName: registry
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: registry-tls
    projected:
      defaultMode: 420
      sources:
      - secret:
          name: image-registry-tls
  - configMap:
      defaultMode: 420
      name: image-registry-certificates
    name: registry-certificates
  - configMap:
      defaultMode: 420
      items:
      - key: ca-bundle.crt
        path: anchors/ca-bundle.crt
      name: trusted-ca
      optional: true
    name: trusted-ca
  - name: installation-pull-secrets
    secret:
      defaultMode: 420
      items:
      - key: .dockerconfigjson
        path: config.json
      optional: true
      secretName: installation-pull-secrets
  - name: registry-token-qb65z
    secret:
      defaultMode: 420
      secretName: registry-token-qb65z
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:34:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:35:42Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:35:42Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-03-05T23:34:42Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://dfeb9108c9912329a8e1b26a391f47b51b67912904f36aa8d30d8863e413cb18
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70e1e34f06beb3cff39f0c7fa795904e97516264d54fa98268c1e7c0fbc14dfe
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70e1e34f06beb3cff39f0c7fa795904e97516264d54fa98268c1e7c0fbc14dfe
    lastState:
      terminated:
        containerID: cri-o://dfeb9108c9912329a8e1b26a391f47b51b67912904f36aa8d30d8863e413cb18
        exitCode: 1
        finishedAt: "2021-03-06T01:11:56Z"
        reason: Error
        startedAt: "2021-03-06T01:11:55Z"
    name: registry
    ready: false
    restartCount: 23
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=registry pod=image-registry-8686b45c4-25tvh_openshift-image-registry(061213a4-5251-400e-bfaa-f2df25c7fc27)
        reason: CrashLoopBackOff
  hostIP: 10.240.0.31
  phase: Running
  podIP: 172.17.65.154
  podIPs:
  - ip: 172.17.65.154
  qosClass: Burstable
  startTime: "2021-03-05T23:34:42Z"
```
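Pulling out the parts of that spec that matter for the failure: the container runs as an arbitrary project UID with an fsGroup, only the trust *source* directories are volume-mounted, and /etc/pki/ca-trust/extracted, which update-ca-trust rewrites, is not a volume at all, so its contents come straight from the container image:

```
# Excerpts from the pod spec above (not a complete manifest):
# container-level securityContext
securityContext:
  runAsUser: 1000430000        # random project UID, not root
# container volumeMounts -- note: nothing mounted at /etc/pki/ca-trust/extracted
volumeMounts:
- mountPath: /etc/pki/ca-trust/source/anchors
  name: registry-certificates
- mountPath: /usr/share/pki/ca-trust-source
  name: trusted-ca
# pod-level securityContext
securityContext:
  fsGroup: 1000430000
```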
I can confirm that on pods that are failing, update-ca-trust returns exit code 1, and on pods that are passing, it returns successfully.

What I have found so far on pods that are successful:

```
bash-4.2$ ls -la /usr/share/pki/
total 16
drwxr-xr-x. 4 root root       4096 Dec 17 14:37 .
drwxr-xr-x. 1 root root       4096 Jan 30 04:37 ..
drwxr-xr-x. 2 root root       4096 Dec 17 14:37 ca-trust-legacy
drwxrwsrwt. 3 root 1000400000 4096 Mar  5 08:21 ca-trust-source
bash-4.2$
```

The group owner of ca-trust-source matches the UID the container runs as:

```
bash-4.2$ id
uid=1000400000(1000400000) gid=0(root) groups=0(root),1000400000
```

On pods that fail, the layout looks like this:

```
bash-4.2$ ls -la /etc/pki/ca-trust/
total 36
drwxr-xr-x. 1 root root 4096 Dec 17 14:37 .
drwxr-xr-x. 1 root root 4096 Dec 17 14:39 ..
-rw-r--r--. 1 root root  166 Jun  9  2020 README
-rw-r--r--. 1 root root  980 Jun  9  2020 ca-legacy.conf
drwxrwxrwt. 1 root root 4096 Dec 17 14:37 extracted
drwxr-xr-x. 4 root root 4096 Dec 17 14:37 source
```
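A quick way to confirm the exit-code difference without an interactive shell is to run the extraction by hand and echo its status. A sketch, with the failing pod name from above standing in for whichever pod you are checking:

```
# Run the trust extraction manually inside the registry pod and report
# its exit code together with the ownership of the trust directories.
kubectl exec -n openshift-image-registry image-registry-8686b45c4-25tvh -- \
  sh -c 'id; update-ca-trust; echo "update-ca-trust exit: $?"; ls -la /usr/share/pki/ /etc/pki/ca-trust/'
```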
I was able to find the difference between a pod that fails with this and a pod that is successful. A pod that fails has the following:

```
bash-4.4$ ls -la /etc/pki/ca-trust/extracted/pem/
total 2380
drwxrwxrwt. 1 root       root   4096 Mar  6 02:34 .
drwxrwxrwt. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    898 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 163655 Jan 19 18:23 email-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:30 email-ca-bundle.pem.S6evtC
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 email-ca-bundle.pem.SIYtSC
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 email-ca-bundle.pem.b3Llun
-rw-rw-rw-. 1 root       root      0 Jan 19 18:23 objsign-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:30 objsign-ca-bundle.pem.KK0j8G
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 objsign-ca-bundle.pem.YSgY0s
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 objsign-ca-bundle.pem.fElyrI
-rw-rw-rw-. 1 root       root 216090 Jan 19 18:23 tls-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 tls-ca-bundle.pem.70Maux
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:30 tls-ca-bundle.pem.iOy7Vw
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:34 tls-ca-bundle.pem.z3qMTt
```

(The files with random suffixes appear to be the temporary files p11-kit writes before renaming them over the target.) What causes the failure is specifically the files owned by root:root (tls-ca-bundle.pem, objsign-ca-bundle.pem, and email-ca-bundle.pem).
A pod that is successful has the following:

```
bash-4.4$ ls -la etc/pki/ca-trust/extracted/pem
total 680
drwxrwxrwx. 1 root       root   4096 Mar  6 02:38 .
drwxrwxrwx. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    898 Jun 22  2020 README
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:38 email-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:38 objsign-ca-bundle.pem
-r--r--r--. 1 1000270000 root 217378 Mar  6 02:38 tls-ca-bundle.pem
```

Note that there are no .pem files owned by root:root; everything matches the UID the container runs as.
In the case of the failing pod, it will not allow me to remove the stale bundles either:

```
bash-4.4$ ls -l /etc/pki/ca-trust/extracted/pem/
total 384
-rw-rw-rw-. 1 root root    898 Jun 22  2020 README
-rw-rw-rw-. 1 root root 163655 Jan 19 18:23 email-ca-bundle.pem
-rw-rw-rw-. 1 root root      0 Jan 19 18:23 objsign-ca-bundle.pem
-rw-rw-rw-. 1 root root 216090 Jan 19 18:23 tls-ca-bundle.pem
bash-4.4$ rm /etc/pki/ca-trust/extracted/pem/email-ca-bundle.pem
rm: cannot remove '/etc/pki/ca-trust/extracted/pem/email-ca-bundle.pem': Operation not permitted
bash-4.4$ rm /etc/pki/ca-trust/extracted/pem/objsign-ca-bundle.pem
rm: cannot remove '/etc/pki/ca-trust/extracted/pem/objsign-ca-bundle.pem': Operation not permitted
bash-4.4$
```
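This matches the sticky bit visible in the listings above (`drwxrwxrwt`): in a world-writable sticky directory, only the file's owner, the directory's owner, or root can delete or rename an entry, so a container running as UID 1000270000 can neither remove nor replace the root-owned bundles. The behavior is easy to reproduce outside the registry; a standalone demo, nothing registry-specific:

```
# As a regular (non-root) user with sudo available:
mkdir /tmp/sticky-demo && chmod 1777 /tmp/sticky-demo   # world-writable + sticky, like extracted/pem
sudo touch /tmp/sticky-demo/root-owned.pem              # create a root-owned file inside it
rm /tmp/sticky-demo/root-owned.pem
# rm: cannot remove '/tmp/sticky-demo/root-owned.pem': Operation not permitted
```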
The same goes for all the other directories under the extracted directory as well:

```
README  cacerts.bin  cacerts.bin.7pScu8
bash-4.4$ ls -l /etc/pki/ca-trust/extracted/edk2/
total 324
-rw-rw-rw-. 1 root       root    566 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 156842 Jan 19 18:23 cacerts.bin
-r--r--r--. 1 1000270000 root 157739 Mar  6 02:55 cacerts.bin.7pScu8
bash-4.4$
bash-4.4$ ls -la /etc/pki/ca-trust/extracted/java/
total 340
drwxrwxrwt. 1 root       root   4096 Mar  6 02:55 .
drwxrwxrwt. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    726 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 157499 Jan 19 18:23 cacerts
-r--r--r--. 1 1000270000 root 158440 Mar  6 02:55 cacerts.WcK3S6
bash-4.4$
bash-4.4$ ls -la /etc/pki/ca-trust/extracted/openssl/
total 496
drwxrwxrwt. 1 root       root   4096 Mar  6 02:55 .
drwxrwxrwt. 1 root       root   4096 Jan 19 18:22 ..
-rw-rw-rw-. 1 root       root    787 Jun 22  2020 README
-rw-rw-rw-. 1 root       root 249827 Jan 19 18:23 ca-bundle.trust.crt
-r--r--r--. 1 1000270000 root 225471 Mar  6 02:55 ca-bundle.trust.crt.dqvnfJ
bash-4.4$
```
Two potential suggestions: ignore the exit code of update-ca-trust, or move the update-ca-trust step somewhere its failure does not kill the pod (see the sketch below).
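A minimal sketch of what the first suggestion could look like in the entrypoint wrapper; illustrative only, not the actual fix:

```
# Tolerate a failed trust extraction instead of letting && abort the container:
sh -c 'update-ca-trust || echo "update-ca-trust failed, continuing" >&2; exec "$@"' arg0 /usr/bin/dockerregistry
```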
This is fixed by BZ 1893956: an emptyDir is mounted at /etc/pki/ca-trust/extracted, so it should always be writable.

*** This bug has been marked as a duplicate of bug 1893956 ***