Bug 2101880
| Summary: | [cloud-credential-operator] container has runAsNonRoot and image will run as root | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> |
| Component: | Cloud Credential Operator | Assignee: | Nobody <nobody> |
| Status: | CLOSED ERRATA | QA Contact: | Shivanthi <lamarach> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.5 | CC: | abutcher, bleanhar, jshu, sdodson, slaznick, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2102633 2110629 (view as bug list) | Environment: | |
| Last Closed: | 2023-01-17 19:50:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2102633, 2102834 | | |
Description
Hongkai Liu, 2022-06-28 15:41:54 UTC
Collecting more context, so folks hitting this issue are more likely to be able to find it when searching Bugzilla:

```
$ oc --as system:admin adm upgrade
info: An upgrade is in progress. Unable to apply 4.11.0-fc.3: the workload openshift-cloud-credential-operator/cloud-credential-operator has not yet successfully rolled out

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Channel: candidate-4.11 (available channels: candidate-4.11)
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.

$ oc --as system:admin -n openshift-cloud-credential-operator get pods
NAME                                         READY   STATUS                       RESTARTS   AGE
cloud-credential-operator-5d79d8fd6d-d9q5z   1/2     CreateContainerConfigError   0          62m

$ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-5d79d8fd6d-d9q5z | jq '.status.containerStatuses[] | select(.ready != true)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
  "imageID": "",
  "lastState": {},
  "name": "cloud-credential-operator",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image will run as root (pod: \"cloud-credential-operator-5d79d8fd6d-d9q5z_openshift-cloud-credential-operator(25954b29-621e-4d2d-9864-b013226f22fa)\", container: cloud-credential-operator)",
      "reason": "CreateContainerConfigError"
    }
  }
}

$ oc --as system:admin -n openshift-cloud-credential-operator get events | grep -v Normal
LAST SEEN   TYPE      REASON   OBJECT                                           MESSAGE
108m        Warning   Failed   pod/cloud-credential-operator-5d79d8fd6d-57mbh   Error: container has runAsNonRoot and image will run as root (pod: "cloud-credential-operator-5d79d8fd6d-57mbh_openshift-cloud-credential-operator(a815836e-27fd-489d-989c-df8ba2612c98)", container: cloud-credential-operator)
61m         Warning   Failed   pod/cloud-credential-operator-5d79d8fd6d-d9q5z   Error: container has runAsNonRoot and image will run as root (pod: "cloud-credential-operator-5d79d8fd6d-d9q5z_openshift-cloud-credential-operator(25954b29-621e-4d2d-9864-b013226f22fa)", container: cloud-credential-operator)
```

Ah, the in-cluster pod has:

```
$ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-5d79d8fd6d-d9q5z | jq .spec.securityContext
{
  "runAsNonRoot": true
}
```

while the manifest calls for [1]:

```yaml
securityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
```

so we aren't setting seccompProfile.
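(As an aside on the error string itself, not specific to this bug: `runAsNonRoot: true` makes the kubelet verify at container-create time that the container will not run as UID 0. If the image metadata leaves `USER` unset, so it defaults to root, and no `runAsUser` is supplied, the kubelet fails the pod with exactly this CreateContainerConfigError. A minimal sketch that reproduces the message, assuming a stock UBI image and a cluster where no admission plugin injects a `runAsUser`, e.g. vanilla Kubernetes or a namespace exempt from SCC admission; this pod is hypothetical and not part of this bug:)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot-repro
spec:
  securityContext:
    runAsNonRoot: true          # kubelet must prove the container is non-root
  containers:
  - name: demo
    # ubi8 leaves USER unset, so the image would run as root
    image: registry.access.redhat.com/ubi8/ubi:latest
    command: ["sleep", "infinity"]
# Expected status: CreateContainerConfigError,
# "container has runAsNonRoot and image will run as root".
# Setting spec.securityContext.runAsUser to a non-zero UID (e.g. 65534)
# satisfies the check, which is the mitigation attempted below.
```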
The deployment unsurprisingly matches the pod:

```
$ oc --as system:admin -n openshift-cloud-credential-operator get -o json deployment cloud-credential-operator | jq .spec.template.spec.securityContext
{
  "runAsNonRoot": true
}
```

Hmm, no managedFields on the deployment?

```
$ oc --as system:admin -n openshift-cloud-credential-operator get --show-managed-fields -o json deployment cloud-credential-operator | jq -c '.metadata | keys'
["annotations","creationTimestamp","generation","labels","name","namespace","ownerReferences","resourceVersion","uid"]
```

[1]: https://github.com/openshift/cloud-credential-operator/blob/c34501f3d66f132d21a2a9620a886c4b144e1571/manifests/03-deployment.yaml#L28-L31

Hongkai deleted the stuck Deployment:

```
$ oc --context build02 delete deploy -n openshift-cloud-credential-operator cloud-credential-operator --as system:admin
```

The CVO created a replacement with seccompProfile:

```
$ oc --as system:admin -n openshift-cloud-credential-operator get -o json deployment cloud-credential-operator | jq .spec.template.spec.securityContext
{
  "runAsNonRoot": true,
  "seccompProfile": {
    "type": "RuntimeDefault"
  }
}
```

But the pods are still failing the same way:

```
$ oc --as system:admin -n openshift-cloud-credential-operator get pods
NAME                                         READY   STATUS                       RESTARTS   AGE
cloud-credential-operator-6cfcffdf6d-8k6c4   1/2     CreateContainerConfigError   0          82m

$ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-6cfcffdf6d-8k6c4 | jq '.status.containerStatuses[] | select(.ready != true)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
  "imageID": "",
  "lastState": {},
  "name": "cloud-credential-operator",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image will run as root (pod: \"cloud-credential-operator-6cfcffdf6d-8k6c4_openshift-cloud-credential-operator(c20dd218-2df6-482f-85dc-563ba67e6d1b)\", container: cloud-credential-operator)",
      "reason": "CreateContainerConfigError"
    }
  }
}
```

Mitigation attempt 2: tell the CVO to stop caring about this deployment, then force a non-root UID:

```
$ oc --as system:admin patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-cloud-credential-operator", "name": "cloud-credential-operator", "unmanaged": true}]}]'
$ oc --as system:admin -n openshift-cloud-credential-operator patch deployment cloud-credential-operator --type json -p '[{"op": "add", "path": "/spec/template/spec/securityContext/runAsUser", "value": 65534}]'
```

Which resolved 'container has runAsNonRoot and image will run as root', but now we're failing a different way:

```
$ oc --as system:admin -n openshift-cloud-credential-operator get pods
NAME                                         READY   STATUS             RESTARTS       AGE
cloud-credential-operator-5f7b5cbcf6-b9xgj   1/2     CrashLoopBackOff   13 (43s ago)   42m

$ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-5f7b5cbcf6-b9xgj | jq '.status.containerStatuses[] | select(.ready != true)'
{
  "containerID": "cri-o://c07a99be7f584be7e96601681c9a847b83546d0350176d6c55f65e2ef3166f8d",
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
  "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://c07a99be7f584be7e96601681c9a847b83546d0350176d6c55f65e2ef3166f8d",
      "exitCode": 1,
      "finishedAt": "2022-06-28T21:49:05Z",
      "message": "Copying system trust bundle\ncp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied\n",
      "reason": "Error",
      "startedAt": "2022-06-28T21:49:05Z"
    }
  },
  "name": "cloud-credential-operator",
  "ready": false,
  "restartCount": 14,
  "started": false,
  "state": {
    "waiting": {
      "message": "back-off 5m0s restarting failed container=cloud-credential-operator pod=cloud-credential-operator-5f7b5cbcf6-b9xgj_openshift-cloud-credential-operator(43ccf74b-b353-433e-92da-09de79a21352)",
      "reason": "CrashLoopBackOff"
    }
  }
}
```

Unclear what sort of permissions are needed for the operator to be able to dance around the system trust bundle.
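(The `cp` failure itself is plain Unix permission behavior rather than anything Kubernetes-specific: replacing the file requires either write access to the file or write access to its directory in order to unlink and recreate it, and UID 65534 has neither on the root-owned paths baked into the image. A hypothetical shell session on any Linux host, output illustrative, showing the same failure mode:)

```
$ ls -ld /etc/pki/ca-trust/extracted/pem
drwxr-xr-x. 2 root root 4096 Jun 28  2022 /etc/pki/ca-trust/extracted/pem
$ sudo -u nobody cp -f /tmp/bundle.pem /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied
```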
"2022-06-28T21:49:05Z" } }, "name": "cloud-credential-operator", "ready": false, "restartCount": 14, "started": false, "state": { "waiting": { "message": "back-off 5m0s restarting failed container=cloud-credential-operator pod=cloud-credential-operator-5f7b5cbcf6-b9xgj_openshift-cloud-credential-operator(43ccf74b-b353-433e-92da-09de79a21352)", "reason": "CrashLoopBackOff" } } } Unclear what sort of permissions are needed for the operator to be able to dance around the system trust bundle. Comparing with 4.10.18 to 4.11.0-fc.3 CI [1]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1540112324029845504/artifacts/launch/gather-extra/artifacts/namespaces.json | jq '.items[] | select(.metadata.name == "openshift-cloud-credential-operator")' >ci.json $ oc --as system:admin get -o json namespace openshift-cloud-credential-operator | jq . >build02.json $ diff -u ci.json build02.json --- ci.json 2022-06-28 15:01:42.114750313 -0700 +++ build02.json 2022-06-28 15:01:53.923749535 -0700 @@ -6,17 +6,18 @@ "include.release.openshift.io/ibm-cloud-managed": "true", "include.release.openshift.io/self-managed-high-availability": "true", "openshift.io/node-selector": "", - "openshift.io/sa.scc.mcs": "s0:c11,c10", - "openshift.io/sa.scc.supplemental-groups": "1000130000/10000", - "openshift.io/sa.scc.uid-range": "1000130000/10000", + "openshift.io/sa.scc.mcs": "s0:c18,c17", + "openshift.io/sa.scc.supplemental-groups": "1000340000/10000", + "openshift.io/sa.scc.uid-range": "1000340000/10000", "workload.openshift.io/allowed": "management" }, - "creationTimestamp": "2022-06-23T23:28:09Z", + "creationTimestamp": "2020-05-21T19:29:42Z", "labels": { "controller-tools.k8s.io": "1.0", "kubernetes.io/metadata.name": "openshift-cloud-credential-operator", - "olm.operatorgroup.uid/0035e0c5-6505-4d02-90b4-f4ddd056fd22": "", - "openshift.io/cluster-monitoring": "true" + "olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a": "", + "openshift.io/cluster-monitoring": "true", + "openshift.io/run-level": "1" }, ... Ahh, the cloud-credentials operator dropped openshift.io/run-level back in 4.5 with bug 1806892. But as discussed in [2], the cluster-version operator allows (and does not stomp) labels beyond what are contained in the current manifest. Clearing the overrides to put the CVO back in command of the deployment: $ oc --as system:admin patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]' Drop the obsolete label: $ oc --as system:admin label namespace openshift-cloud-credential-operator openshift.io/run-level- And delete the deployment in case the label change only has creation-time effects (this step might not be needed?): $ oc --as system:admin -n openshift-cloud-credential-operator delete deployment cloud-credential-operator A bit afterwards, things seem better: $ oc --as system:admin -n openshift-cloud-credential-operator get pods NAME READY STATUS RESTARTS AGE cloud-credential-operator-6cfcffdf6d-rhbmm 2/2 Running 0 5m41s So the issue is the stale openshift.io/run-level label for clusters which were born in 4.4 or earlier, and which are updating to 4.11. 
Recommended mitigation: update your namespace manifest [1] to set an explicit:

```yaml
openshift.io/run-level: ""
```

like the cluster-version operator does [2] since bug 2020107. See [3] for more context on how the CVO handles label reconciliation for manifests.

[1]: https://github.com/openshift/cloud-credential-operator/blob/c34501f3d66f132d21a2a9620a886c4b144e1571/manifests/00-namespace.yaml#L9
[2]: https://github.com/openshift/cluster-version-operator/blob/b81272d8a78466596c92b7f88896fc3565feb335/install/0000_00_cluster-version-operator_00_namespace.yaml#L12
[3]: https://issues.redhat.com/browse/OTA-330
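(Concretely, the change is a one-line label addition to the operator's namespace manifest. A sketch of the relevant fragment, abbreviated and assuming the shape of the manifest linked in [1]; only the run-level line is the actual change:)

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cloud-credential-operator
  labels:
    openshift.io/cluster-monitoring: "true"
    # Explicit empty value: the CVO reconciles labels that are present in
    # the manifest, so this stomps any stale run-level left over from 4.4.
    openshift.io/run-level: ""
```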
Native validation for this one would be a bit of a pain: installing 4.4, updating through 4.5, 4.6, 4.7, 4.8, 4.9, and 4.10, and then to 4.11 to reproduce the hang. Cheaper validation is probably:

1. Install a 4.10 cluster.
2. Manually set the label, as if the cluster had been born in 4.4 or earlier:

```
$ oc label namespace openshift-cloud-credential-operator openshift.io/run-level=1
```

3. Request an update to 4.11. With 4.11.0-fc.3 as the target, we expect the update to stick on the cloud-credential operator deployment, with CreateContainerConfigError and 'container has runAsNonRoot and image will run as root'. With a patched 4.11 as the target, we expect the update to complete without issues.

Wow! That's impressive work, Trevor. Should we push for this to be included in the 4.11 fc build that the other build clusters will upgrade to?

Moving back to assigned; going to use this BZ as a base for all the 4.12 PRs. 2102834 was cloned in the CCO PR for porting the fix to 4.11.0. Need to use it to track the backporting for all the above PRs?

Verified with version 4.12.0-0.nightly-2022-07-05-083442, following the suggested validation approach:

1. Install an OCP cluster with version 4.10.0-0.nightly-2022-06-08-150219.
2. Apply the labels:

```
jianpingshu@jshu-mac bin % oc label namespace openshift-cloud-credential-operator openshift.io/run-level=1
namespace/openshift-cloud-credential-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-apiserver-operator openshift.io/run-level=1
namespace/openshift-apiserver-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-api openshift.io/run-level=1
namespace/openshift-machine-api labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-service-ca-operator openshift.io/run-level=1
namespace/openshift-service-ca-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1
error: 'openshift.io/run-level' already has a value (), and --overwrite is false
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1 --overwrite
namespace/openshift-machine-config-operator labeled
```

3. Upgrade to 4.12:

```
jianpingshu@jshu-mac bin % oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-05-083442 --allow-explicit-upgrade --force
```

4. The upgrade is successful and run-level is unset for the namespaces:

```
jianpingshu@jshu-mac bin % oc get clusterversion -w
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-05-083442   True        False         9m54s   Cluster version is 4.12.0-0.nightly-2022-07-05-083442
jianpingshu@jshu-mac bin % oc get namespace openshift-cloud-credential-operator -o yaml | grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-apiserver-operator -o yaml | grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-machine-api -o yaml | grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-service-ca-operator -o yaml | grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-machine-config-operator -o yaml | grep run-level
    openshift.io/run-level: ""
```

Bugzilla 2110501 reverted openshift machine-api-operator pull 1031, so this was re-verified with 4.12.0-0.ci-2022-07-26-053821 (which includes the change for Bugzilla 2110501):

1. Install an OCP cluster with version 4.10.0-0.nightly-2022-07-25-110002.
2. Apply the labels:

```
jianpingshu@jshu-mac bin % oc label namespace openshift-cloud-credential-operator openshift.io/run-level=1
namespace/openshift-cloud-credential-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-apiserver-operator openshift.io/run-level=1
namespace/openshift-apiserver-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-api openshift.io/run-level=1
namespace/openshift-machine-api labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-service-ca-operator openshift.io/run-level=1
namespace/openshift-service-ca-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1
error: 'openshift.io/run-level' already has a value (), and --overwrite is false
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1 --overwrite
namespace/openshift-machine-config-operator labeled
```

3. Upgrade to 4.12:

```
jianpingshu@jshu-mac bin % oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.12.0-0.ci-2022-07-26-053821 --allow-explicit-upgrade --force
```

4. The upgrade is successful and run-level is unset for the namespaces (except for openshift-machine-api):

```
jianpingshu@jshu-mac ~ % oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci-2022-07-26-053821   True        False         9m56s   Cluster version is 4.12.0-0.ci-2022-07-26-053821
jianpingshu@jshu-mac bin % oc get namespace openshift-cloud-credential-operator -o yaml | grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-apiserver-operator -o yaml | grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac ~ % oc get namespace openshift-machine-api -o yaml | grep run-level
    openshift.io/run-level: "1"
jianpingshu@jshu-mac bin % oc get namespace openshift-service-ca-operator -o yaml | grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-machine-config-operator -o yaml | grep run-level
    openshift.io/run-level: ""
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:7399

Errata shipped here and (via [1]) in 4.11.0. Clearing NEEDINFO.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2102834#c7