Bug 2101880 - [cloud-credential-operator]container has runAsNonRoot and image will run as root
Summary: [cloud-credential-operator]container has runAsNonRoot and image will run as root
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Nobody
QA Contact: Shivanthi
URL:
Whiteboard:
Depends On:
Blocks: 2102633 2102834
 
Reported: 2022-06-28 15:41 UTC by Hongkai Liu
Modified: 2023-03-14 04:47 UTC (History)
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2102633 2110629 (view as bug list)
Environment:
Last Closed: 2023-01-17 19:50:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cloud-credential-operator pull 472 0 None Merged Bug 2101880: manifests/00-namespace: Set empty openshift.io/run-level 2022-06-30 20:42:17 UTC
Github openshift cluster-openshift-apiserver-operator pull 497 0 None Merged Bug 2101880: operator NS manifest: Set empty openshift.io/run-level 2022-07-05 07:16:24 UTC
Github openshift machine-api-operator pull 1031 0 None Merged Bug 2101880: operator NS manifest: Set empty openshift.io/run-level 2022-07-05 07:16:25 UTC
Github openshift machine-config-operator pull 3217 0 None Merged Bug 2101880: NS manifest: Set empty openshift.io/run-level 2022-07-05 07:16:26 UTC
Github openshift service-ca-operator pull 194 0 None Merged Bug 2101880: operator NS manifest: Set empty openshift.io/run-level 2022-07-05 07:16:28 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:51:07 UTC

Description Hongkai Liu 2022-06-28 15:41:54 UTC
Description of problem:
This happened while upgrading the CI cluster build02 from 4.10.18 to 4.11.0-fc.3.

version   4.10.18   True        True          88m     Working towards 4.11.0-fc.3: 637 of 802 done (79% complete), waiting on cloud-credential

oc --context build02 get pod -n openshift-cloud-credential-operator cloud-credential-operator-5d79d8fd6d-d9q5z
NAME                                         READY   STATUS                       RESTARTS   AGE
cloud-credential-operator-5d79d8fd6d-d9q5z   1/2     CreateContainerConfigError   0          14m


Comment 2 W. Trevor King 2022-06-28 16:27:45 UTC
Collecting more context, so folks hitting this issue are more likely to be able to find it searching Bugzilla:

$ oc --as system:admin adm upgrade
info: An upgrade is in progress. Unable to apply 4.11.0-fc.3: the workload openshift-cloud-credential-operator/cloud-credential-operator has not yet successfully rolled out

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Channel: candidate-4.11 (available channels: candidate-4.11)
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.

$ oc --as system:admin -n openshift-cloud-credential-operator get pods
NAME                                         READY   STATUS                       RESTARTS   AGE
cloud-credential-operator-5d79d8fd6d-d9q5z   1/2     CreateContainerConfigError   0          62m

$ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-5d79d8fd6d-d9q5z | jq '.status.containerStatuses[] | select(.ready != true)'
{
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
  "imageID": "",
  "lastState": {},
  "name": "cloud-credential-operator",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "container has runAsNonRoot and image will run as root (pod: \"cloud-credential-operator-5d79d8fd6d-d9q5z_openshift-cloud-credential-operator(25954b29-621e-4d2d-9864-b013226f22fa)\", container: cloud-credential-operator)",
      "reason": "CreateContainerConfigError"
    }
  }
}

$ oc --as system:admin -n openshift-cloud-credential-operator get events | grep -v Normal
LAST SEEN   TYPE      REASON              OBJECT                                            MESSAGE
108m        Warning   Failed              pod/cloud-credential-operator-5d79d8fd6d-57mbh    Error: container has runAsNonRoot and image will run as root (pod: "cloud-credential-operator-5d79d8fd6d-57mbh_openshift-cloud-credential-operator(a815836e-27fd-489d-989c-df8ba2612c98)", container: cloud-credential-operator)
61m         Warning   Failed              pod/cloud-credential-operator-5d79d8fd6d-d9q5z    Error: container has runAsNonRoot and image will run as root (pod: "cloud-credential-operator-5d79d8fd6d-d9q5z_openshift-cloud-credential-operator(25954b29-621e-4d2d-9864-b013226f22fa)", container: cloud-credential-operator)
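
For context, the kubelet check that produces this error can be sketched roughly as follows (a simplified Python model of the semantics, not the actual kubelet code; the function name and return strings are illustrative):

```python
def check_run_as_non_root(pod_ctx, container_ctx, image_user):
    """Simplified model of the kubelet's runAsNonRoot verification.

    Container-level securityContext fields override pod-level ones.
    If neither sets an explicit runAsUser, the image's USER directive
    decides, and an unset or root USER fails the check.
    """
    ctx = {**(pod_ctx or {}), **(container_ctx or {})}
    if not ctx.get("runAsNonRoot"):
        return "ok"  # no non-root constraint to enforce
    uid = ctx.get("runAsUser")
    if uid is not None:
        # An explicit non-zero UID satisfies the constraint outright.
        return "ok" if uid != 0 else "error: runAsUser is root"
    # No explicit UID anywhere: fall back to the image's USER.
    if image_user in (None, "", "0", "root"):
        return "error: container has runAsNonRoot and image will run as root"
    return "ok"

# The failing pod: runAsNonRoot is set, no runAsUser was injected,
# and the image defaults to root.
print(check_run_as_non_root({"runAsNonRoot": True}, None, None))
# An explicit non-root runAsUser clears this particular error.
print(check_run_as_non_root({"runAsNonRoot": True}, {"runAsUser": 65534}, None))
```

Normally SCC admission injects a runAsUser drawn from the namespace's openshift.io/sa.scc.uid-range annotation, so this check passes even for images whose USER is root; the error indicates no UID was injected here.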

Comment 3 W. Trevor King 2022-06-28 17:01:31 UTC
Ah, the in-cluster pod has:

$ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-5d79d8fd6d-d9q5z | jq .spec.securityContext
{
  "runAsNonRoot": true
}

while the manifest calls for [1]:

      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault

so we aren't setting seccompProfile.  The deployment unsurprisingly matches the pod:

$ oc --as system:admin -n openshift-cloud-credential-operator get -o json deployment cloud-credential-operator | jq .spec.template.spec.securityContext
{
  "runAsNonRoot": true
}

Hmm, no managedFields on the deployment?

$ oc --as system:admin -n openshift-cloud-credential-operator get --show-managed-fields -o json deployment cloud-credential-operator | jq -c '.metadata | keys'
["annotations","creationTimestamp","generation","labels","name","namespace","ownerReferences","resourceVersion","uid"]

[1]: https://github.com/openshift/cloud-credential-operator/blob/c34501f3d66f132d21a2a9620a886c4b144e1571/manifests/03-deployment.yaml#L28-L31

Comment 4 W. Trevor King 2022-06-28 18:53:04 UTC
Hongkai deleted the stuck Deployment:

  $ oc --context build02 delete deploy -n openshift-cloud-credential-operator cloud-credential-operator --as system:admin

The CVO created a replacement with seccompProfile:

  $ oc --as system:admin -n openshift-cloud-credential-operator get -o json deployment cloud-credential-operator | jq .spec.template.spec.securityContext
  {
    "runAsNonRoot": true,
    "seccompProfile": {
      "type": "RuntimeDefault"
    }
  }

But the pods are still failing the same way:

  $ oc --as system:admin -n openshift-cloud-credential-operator get pods
  NAME                                         READY   STATUS                       RESTARTS   AGE
  cloud-credential-operator-6cfcffdf6d-8k6c4   1/2     CreateContainerConfigError   0          82m
  $ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-6cfcffdf6d-8k6c4 | jq '.status.containerStatuses[] | select(.ready != true)'
  {
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
    "imageID": "",
    "lastState": {},
    "name": "cloud-credential-operator",
    "ready": false,
    "restartCount": 0,
    "started": false,
    "state": {
      "waiting": {
        "message": "container has runAsNonRoot and image will run as root (pod: \"cloud-credential-operator-6cfcffdf6d-8k6c4_openshift-cloud-credential-operator(c20dd218-2df6-482f-85dc-563ba67e6d1b)\", container: cloud-credential-operator)",
        "reason": "CreateContainerConfigError"
      }
    }
  }

Comment 5 W. Trevor King 2022-06-28 21:51:18 UTC
Mitigation attempt 2:

Tell the CVO to stop caring about this deployment:

  $ oc --as system:admin patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-cloud-credential-operator", "name": "cloud-credential-operator", "unmanaged": true}]}]'
  $ oc --as system:admin -n openshift-cloud-credential-operator patch deployment cloud-credential-operator --type json -p '[{"op": "add", "path": "/spec/template/spec/securityContext/runAsUser", "value": 65534}]'

That resolved 'container has runAsNonRoot and image will run as root', but now we're failing in a different way:

  $ oc --as system:admin -n openshift-cloud-credential-operator get pods
  NAME                                         READY   STATUS             RESTARTS       AGE
  cloud-credential-operator-5f7b5cbcf6-b9xgj   1/2     CrashLoopBackOff   13 (43s ago)   42m
  $ oc --as system:admin -n openshift-cloud-credential-operator get -o json pod cloud-credential-operator-5f7b5cbcf6-b9xgj | jq '.status.containerStatuses[] | select(.ready != true)'
  {
    "containerID": "cri-o://c07a99be7f584be7e96601681c9a847b83546d0350176d6c55f65e2ef3166f8d",
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
    "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0cf4b01c2f5e29fc55c22580d487721d423283c52997c8ff8e344e2f8b251305",
    "lastState": {
      "terminated": {
        "containerID": "cri-o://c07a99be7f584be7e96601681c9a847b83546d0350176d6c55f65e2ef3166f8d",
        "exitCode": 1,
        "finishedAt": "2022-06-28T21:49:05Z",
        "message": "Copying system trust bundle\ncp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied\n",
        "reason": "Error",
        "startedAt": "2022-06-28T21:49:05Z"
      }
    },
    "name": "cloud-credential-operator",
    "ready": false,
    "restartCount": 14,
    "started": false,
    "state": {
      "waiting": {
        "message": "back-off 5m0s restarting failed container=cloud-credential-operator pod=cloud-credential-operator-5f7b5cbcf6-b9xgj_openshift-cloud-credential-operator(43ccf74b-b353-433e-92da-09de79a21352)",
        "reason": "CrashLoopBackOff"
      }
    }
  }

Unclear what sort of permissions are needed for the operator to be able to dance around the system trust bundle.

Comment 6 W. Trevor King 2022-06-28 22:22:19 UTC
Comparing with 4.10.18 to 4.11.0-fc.3 CI [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1540112324029845504/artifacts/launch/gather-extra/artifacts/namespaces.json | jq '.items[] | select(.metadata.name == "openshift-cloud-credential-operator")' >ci.json
$ oc --as system:admin get -o json namespace openshift-cloud-credential-operator | jq . >build02.json
$ diff -u ci.json build02.json 
--- ci.json     2022-06-28 15:01:42.114750313 -0700
+++ build02.json        2022-06-28 15:01:53.923749535 -0700
@@ -6,17 +6,18 @@
       "include.release.openshift.io/ibm-cloud-managed": "true",
       "include.release.openshift.io/self-managed-high-availability": "true",
       "openshift.io/node-selector": "",
-      "openshift.io/sa.scc.mcs": "s0:c11,c10",
-      "openshift.io/sa.scc.supplemental-groups": "1000130000/10000",
-      "openshift.io/sa.scc.uid-range": "1000130000/10000",
+      "openshift.io/sa.scc.mcs": "s0:c18,c17",
+      "openshift.io/sa.scc.supplemental-groups": "1000340000/10000",
+      "openshift.io/sa.scc.uid-range": "1000340000/10000",
       "workload.openshift.io/allowed": "management"
     },
-    "creationTimestamp": "2022-06-23T23:28:09Z",
+    "creationTimestamp": "2020-05-21T19:29:42Z",
     "labels": {
       "controller-tools.k8s.io": "1.0",
       "kubernetes.io/metadata.name": "openshift-cloud-credential-operator",
-      "olm.operatorgroup.uid/0035e0c5-6505-4d02-90b4-f4ddd056fd22": "",
-      "openshift.io/cluster-monitoring": "true"
+      "olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a": "",
+      "openshift.io/cluster-monitoring": "true",
+      "openshift.io/run-level": "1"
     },
...

Ahh, the cloud-credential operator dropped openshift.io/run-level back in 4.5 with bug 1806892.  But as discussed in [2], the cluster-version operator allows (and does not stomp) labels beyond those contained in the current manifest.  Clearing the overrides to put the CVO back in command of the deployment:

  $ oc --as system:admin patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'

Drop the obsolete label:

  $ oc --as system:admin label namespace openshift-cloud-credential-operator openshift.io/run-level-

And delete the deployment in case the label change only has creation-time effects (this step might not be needed?):

  $ oc --as system:admin -n openshift-cloud-credential-operator delete deployment cloud-credential-operator

A bit afterwards, things seem better:

  $ oc --as system:admin -n openshift-cloud-credential-operator get pods
  NAME                                         READY   STATUS    RESTARTS   AGE
  cloud-credential-operator-6cfcffdf6d-rhbmm   2/2     Running   0          5m41s

So the issue is the stale openshift.io/run-level label for clusters which were born in 4.4 or earlier, and which are updating to 4.11.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1540112324029845504
[2]: https://issues.redhat.com/browse/OTA-330
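
The CVO label behavior described above can be sketched as follows (an illustrative Python model of the reconciliation semantics, not the actual CVO code):

```python
def reconcile_labels(manifest_labels, cluster_labels):
    """Model of how the CVO reconciles namespace labels:
    every key in the manifest is forced to the manifest's value
    (an empty string counts), but keys absent from the manifest
    are left untouched rather than stomped.
    """
    result = dict(cluster_labels)
    result.update(manifest_labels)
    return result

# A stale run-level from a 4.4-era install survives when the
# manifest simply omits the key...
print(reconcile_labels({}, {"openshift.io/run-level": "1"}))
# ...but an explicit empty value in the manifest clears it:
print(reconcile_labels({"openshift.io/run-level": ""},
                       {"openshift.io/run-level": "1"}))
```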

Comment 7 W. Trevor King 2022-06-28 22:33:30 UTC
Recommended mitigation: update your namespace manifest [1] to set an explicit:

  openshift.io/run-level: ""

like the cluster-version operator does [2] since bug 2020107.  See [3] for more context on how the CVO handles label reconciliation for manifests.

[1]: https://github.com/openshift/cloud-credential-operator/blob/c34501f3d66f132d21a2a9620a886c4b144e1571/manifests/00-namespace.yaml#L9
[2]: https://github.com/openshift/cluster-version-operator/blob/b81272d8a78466596c92b7f88896fc3565feb335/install/0000_00_cluster-version-operator_00_namespace.yaml#L12
[3]: https://issues.redhat.com/browse/OTA-330
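
The resulting manifest change amounts to (an abbreviated sketch of [1]; only the relevant label is shown):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cloud-credential-operator
  labels:
    # Explicit empty value, so the CVO reconciles away a stale
    # run-level left over from a 4.4-or-earlier install.
    openshift.io/run-level: ""
```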

Comment 8 W. Trevor King 2022-06-28 22:46:39 UTC
Native validation for this one would be a bit of a pain: installing 4.4, then updating through 4.5, 4.6, 4.7, 4.8, 4.9, and 4.10 to 4.11 to reproduce the hang.  Cheaper validation is probably:

1. Install a 4.10 cluster.
2. Manually set the label, as if the cluster had been born in 4.4 or earlier:
     $ oc label namespace openshift-cloud-credential-operator openshift.io/run-level=1
3. Request an update to 4.11.

With 4.11.0-fc.3 as the target, we expect the update to stick on the cloud-credential operator deployment, with CreateContainerConfigError and 'container has runAsNonRoot and image will run as root'.

With a patched 4.11 as the target, we expect the update to complete without issues.

Comment 9 Brenton Leanhardt 2022-06-29 12:03:26 UTC
Wow, that's impressive work, Trevor!  Should we push for this to be included in the 4.11 fc build that the other build clusters will upgrade to?

Comment 11 Standa Laznicka 2022-07-01 07:27:58 UTC
Moving back to ASSIGNED; going to use this BZ as the base for all the 4.12 PRs.

Comment 13 Jianping SHu 2022-07-05 11:50:43 UTC
Bug 2102834 was cloned from the CCO PR to port the fix to 4.11.0.
Should we use it to track the backporting of all of the PRs above?

Comment 15 Jianping SHu 2022-07-05 15:30:39 UTC
Verified with version 4.12.0-0.nightly-2022-07-05-083442, following the suggested validation approach:
1. Install an OCP cluster with version 4.10.0-0.nightly-2022-06-08-150219
2. Apply the labels:
jianpingshu@jshu-mac bin % oc label namespace openshift-cloud-credential-operator openshift.io/run-level=1
namespace/openshift-cloud-credential-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-apiserver-operator  openshift.io/run-level=1
namespace/openshift-apiserver-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-api openshift.io/run-level=1
namespace/openshift-machine-api labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-service-ca-operator openshift.io/run-level=1
namespace/openshift-service-ca-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1
error: 'openshift.io/run-level' already has a value (), and --overwrite is false
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1 --overwrite
namespace/openshift-machine-config-operator labeled
3. Upgrade to 4.12
jianpingshu@jshu-mac bin % oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-05-083442 --allow-explicit-upgrade --force
4. The upgrade succeeds, and run-level is unset (empty) for the namespaces:
jianpingshu@jshu-mac bin % oc get clusterversion -w
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-05-083442   True        False         9m54s   Cluster version is 4.12.0-0.nightly-2022-07-05-083442

jianpingshu@jshu-mac bin % oc get namespace openshift-cloud-credential-operator -o yaml |grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-apiserver-operator -o yaml |grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-machine-api -o yaml |grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-service-ca-operator -o yaml |grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-machine-config-operator -o yaml |grep run-level
    openshift.io/run-level: ""

Comment 16 Jianping SHu 2022-07-27 05:27:19 UTC
Bug 2110501 reverted openshift machine-api-operator pull 1031.

Verified with 4.12.0-0.ci-2022-07-26-053821 (which includes the change for bug 2110501):
1. Install an OCP cluster with version 4.10.0-0.nightly-2022-07-25-110002
2. Apply the labels:
jianpingshu@jshu-mac bin % oc label namespace openshift-cloud-credential-operator openshift.io/run-level=1
namespace/openshift-cloud-credential-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-apiserver-operator  openshift.io/run-level=1
namespace/openshift-apiserver-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-api openshift.io/run-level=1
namespace/openshift-machine-api labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-service-ca-operator openshift.io/run-level=1
namespace/openshift-service-ca-operator labeled
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1
error: 'openshift.io/run-level' already has a value (), and --overwrite is false
jianpingshu@jshu-mac bin % oc label namespace openshift-machine-config-operator openshift.io/run-level=1 --overwrite
namespace/openshift-machine-config-operator labeled
3. Upgrade to 4.12
jianpingshu@jshu-mac bin % oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.12.0-0.ci-2022-07-26-053821 --allow-explicit-upgrade --force
4. The upgrade succeeds, and run-level is unset for the namespaces (except openshift-machine-api, due to the revert):
jianpingshu@jshu-mac ~ % oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.ci-2022-07-26-053821   True        False         9m56s   Cluster version is 4.12.0-0.ci-2022-07-26-053821

jianpingshu@jshu-mac bin % oc get namespace openshift-cloud-credential-operator -o yaml |grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-apiserver-operator -o yaml |grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac ~ % oc get namespace openshift-machine-api -o yaml |grep run-level
    openshift.io/run-level: "1"
jianpingshu@jshu-mac bin % oc get namespace openshift-service-ca-operator -o yaml |grep run-level
    openshift.io/run-level: ""
jianpingshu@jshu-mac bin % oc get namespace openshift-machine-config-operator -o yaml |grep run-level
    openshift.io/run-level: ""

Comment 20 errata-xmlrpc 2023-01-17 19:50:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Comment 21 W. Trevor King 2023-03-14 04:47:55 UTC
Errata shipped here and (via [1]) in 4.11.0.  Clearing NEEDINFO.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2102834#c7

