Seen in a 4.10.18 to 4.11.0-fc.3 update, the 'worker' pool wasn't completing its update:

$ oc --as system:admin get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-12902db670ee8e73aa13a8df2782ddc6   True      False      False      3              3                   3                     0                      2y38d
worker   rendered-worker-4504e2ba786e399f01bb8e68c54cc177   False     True       True       24             22                  23                    1                      2y38d

Checking on node state:

$ oc --as system:admin get -o json nodes | jq -r '.items[].metadata.annotations["machineconfiguration.openshift.io/state"]' | sort | uniq -c
      1 Degraded
     29 Done

Getting the name of the degraded node:

$ oc --as system:admin get -o json nodes | jq -r '.items[].metadata | select(.annotations["machineconfiguration.openshift.io/state"] == "Degraded").name'
build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal

Reason for the degradation:

$ oc --as system:admin get -o json node build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal | jq -r '.metadata.annotations["machineconfiguration.openshift.io/reason"]'
unexpected on-disk state validating against rendered-worker-44f2c74623e4d3bbe9557a9e82102c01: mode mismatch for file: "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"; expected: -rw-r--r--/420/0644; received: ----------/0/0

Hopping onto that node to confirm and adjust:

$ oc --as system:admin debug node/build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/build0-gstfj-w-b-infra-wlvr9copenshift-ci-build-farminternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.90
If you don't see a command prompt, try pressing enter.
sh-4.4# ls -lh /host/etc/crio/crio.conf.d
total 12K
-rw-r--r--. 1 root root 2.7K Jun 28 22:38 00-default
----------. 1 root root   46 Jun 28 22:38 01-ctrcfg-pidsLimit
-rw-r--r--. 1 root root   67 Jun 28 22:38 01-mc-seccompUseDefault
sh-4.4# file /host/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
/host/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit: ASCII text
sh-4.4# cat /host/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
[crio]
  [crio.runtime]
    pids_limit = 8096
sh-4.4# chmod 0644 /host/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
sh-4.4# ls -lh /host/etc/crio/crio.conf.d
total 12K
-rw-r--r--. 1 root root 2.7K Jun 28 22:38 00-default
-rw-r--r--. 1 root root   46 Jun 28 22:38 01-ctrcfg-pidsLimit
-rw-r--r--. 1 root root   67 Jun 28 22:38 01-mc-seccompUseDefault
sh-4.4# exit
Removing debug pod ...
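Aside: the "420/0644" in that reason string is one mode written two ways; Ignition and the MCO record file modes as decimal integers, so 420 decimal is 0644 octal. A quick shell sanity check for anyone eyeballing the mode mismatches below:

$ printf '%04o\n' 420
0644
$ echo $((8#644))
420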
That unstuck the update:

$ oc --as system:admin get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-12902db670ee8e73aa13a8df2782ddc6   True      False      False      3              3                   3                     0                      2y38d
worker   rendered-worker-44f2c74623e4d3bbe9557a9e82102c01   True      False      False      32             32                  32                    0                      2y38d
$ oc --as system:admin adm upgrade
Cluster version is 4.11.0-fc.3
...

Not clear what happened to corrupt the mode. Checking node logs, entries for that file look like its siblings:

$ oc --as system:admin adm node-logs build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal | gzip >node.logs.gz
$ zgrep -2 01-ctrcfg-pidsLimit node.logs.gz | sed 's/^/ /'
 Jun 28 22:38:52.241334 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: Reloading.
 Jun 28 22:38:52.936870 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/etc/kubernetes/kubelet.conf"
 Jun 28 22:38:52.936870 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(22): [started] writing file "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
 Jun 28 22:38:52.936870 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(22): [finished] writing file "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
 Jun 28 22:38:52.936870 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(23): [started] writing file "/etc/crio/crio.conf.d/01-mc-seccompUseDefault"
 Jun 28 22:38:52.936870 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(23): [finished] writing file "/etc/crio/crio.conf.d/01-mc-seccompUseDefault"
 --
 Jun 28 22:38:56.543634 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: Reloading.
 Jun 28 22:38:56.543646 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/etc/kubernetes/kubelet.conf"
 Jun 28 22:38:56.543658 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(22): [started] writing file "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
 Jun 28 22:38:56.543670 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(22): [finished] writing file "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
 Jun 28 22:38:56.543681 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(23): [started] writing file "/etc/crio/crio.conf.d/01-mc-seccompUseDefault"
 Jun 28 22:38:56.543693 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(23): [finished] writing file "/etc/crio/crio.conf.d/01-mc-seccompUseDefault"
 --
 Jun 28 22:38:58.520605 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal restorecon[1607]: Relabeled /etc/crio/crio.conf.d from system_u:object_r:unlabeled_t:s0 to system_u:object_r:container_config_t:s0
 Jun 28 22:38:58.520605 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal restorecon[1607]: Relabeled /etc/crio/crio.conf.d/00-default from system_u:object_r:unlabeled_t:s0 to system_u:object_r:container_config_t:s0
 Jun 28 22:38:58.520605 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal restorecon[1607]: Relabeled /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit from system_u:object_r:unlabeled_t:s0 to system_u:object_r:container_config_t:s0
 Jun 28 22:38:58.520605 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal restorecon[1607]: Relabeled /etc/crio/crio.conf.d/01-mc-seccompUseDefault from system_u:object_r:unlabeled_t:s0 to system_u:object_r:container_config_t:s0
 Jun 28 22:38:58.520605 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal restorecon[1607]: Relabeled /etc/containers/policy.json from system_u:object_r:unlabeled_t:s0 to system_u:object_r:etc_t:s0
Hmm, this is suspicious:

$ zgrep -A1 'process 1999' node.logs.gz
Jun 28 22:41:15.894942 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Found left-over process 1999 (rpm-ostree) in control group while starting unit. Ignoring.
Jun 28 22:41:15.894958 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

What's the deal with 1999?

$ zgrep 'rpm-ostree' node.logs.gz | grep -A1 '[[]1999]' | tail -n6
Jun 28 22:40:56.695627 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal rpm-ostree[1999]: client(id:machine-config-operator dbus:1.48 unit:crio-e942a63a45ecbd88571cc4f6803a571d6a66367aece0d1af5d07014a5dfe9d2d.scope uid:0) vanished; remaining=0
Jun 28 22:40:56.695663 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal rpm-ostree[1999]: In idle state; will auto-exit in 63 seconds
Jun 28 22:40:57.807054 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal rpm-ostree[1999]: client(id:machine-config-operator dbus:1.55 unit:crio-e942a63a45ecbd88571cc4f6803a571d6a66367aece0d1af5d07014a5dfe9d2d.scope uid:0) added; new total=1
Jun 28 22:40:57.817972 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal rpm-ostree[1999]: client(id:machine-config-operator dbus:1.55 unit:crio-e942a63a45ecbd88571cc4f6803a571d6a66367aece0d1af5d07014a5dfe9d2d.scope uid:0) vanished; remaining=0
Jun 28 22:40:57.817995 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal rpm-ostree[1999]: In idle state; will auto-exit in 62 seconds
Jun 28 22:40:59.790481 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Start request repeated too quickly.

I'm not sure what "Start request repeated too quickly" means, but perhaps this is part of the wonky mode...
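For what it's worth, "Start request repeated too quickly" is systemd's start rate-limiting: the unit exceeded its StartLimitBurst within StartLimitIntervalSec, so systemd refused further starts for a while. If anyone wants to check the thresholds on the node, something like this from a debug shell should show them (I haven't captured the output here):

sh-4.4# systemctl show rpm-ostreed.service -p StartLimitBurst -p StartLimitIntervalUSec

That failure is probably a symptom of the left-over-process churn above rather than a cause of the mode corruption.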
More PID 1999 sadness:

$ zgrep 'rpm-ostree' node.logs.gz | grep -A1 ' 1999'
Jun 28 22:41:08.101919 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID which is currently not known
Jun 28 22:41:15.894942 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Found left-over process 1999 (rpm-ostree) in control group while starting unit. Ignoring.
Jun 28 22:41:15.896417 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: Starting rpm-ostree System Management Daemon...
--
Jun 28 22:41:18.109270 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID 4990
Jun 28 22:41:28.109373 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID 4990
Jun 28 22:41:38.109519 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID 4990
Jun 28 22:41:48.101491 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID 4990
Jun 28 22:41:58.108844 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID 4990
Jun 28 22:42:00.101055 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID 4990
Jun 28 22:42:00.101121 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: rpm-ostreed.service: Got notification message from PID 1999, but reception only permitted for main PID 4990
Jun 28 22:42:00.101714 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: Started rpm-ostree System Management Daemon.
Interesting, I don't think we've seen this bug before. A few notes:

1. It seems this same config on the other nodes didn't run into the problem, given that most other nodes updated successfully.
2. "unexpected on-disk state validating against rendered-worker-44f2c74623e4d3bbe9557a9e82102c01" seems like it's validating against the NEW config post-reboot, meaning it was the MCD that wrote the file, not Ignition. Is that correct? Or is that an old config that the node booted with and never updated?
3. That file is managed by the MCO via the containerruntimeconfig controller, and it looks like there is no mode specified, which means it would normally have defaulted to 420 like all the other files.

I don't think rpm-ostreed is directly involved in this process, but that would have to be confirmed with the RHCOS team. Nothing jumps out as "this is the MCO's doing" at the moment, but I'm also not sure how we could have gotten here.
Same timestamps on:

----------. 1 root root 46 Jun 28 22:38 01-ctrcfg-pidsLimit

and:

Jun 28 22:38:52.936870 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : files: createFilesystemsFiles: createFiles: op(22): [finished] writing file "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"

suggest this is an RHCOS thing to me. But yeah, I didn't see anything that makes me completely sure.
Did the MCD at any point attempt to perform an update on the files of that node? Or was this caught before the update happened? It sounds like the latter based on your most recent comment, but I am not certain.
In this cluster today:

$ oc --as system:admin get -l node-role.kubernetes.io/infra= -o json nodes | jq -r '.items[].metadata | .annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .annotations["machineconfiguration.openshift.io/desiredConfig"] + " " + .annotations["machineconfiguration.openshift.io/state"] + " " + .name' | sort
rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 Done build0-gstfj-w-b-infra-5ck6v.c.openshift-ci-build-farm.internal
rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 Done build0-gstfj-w-b-infra-l69dd.c.openshift-ci-build-farm.internal
rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 Done build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal

And that 44f... MachineConfig was given as the worker target in comment 0's MachineConfigPool list and in the 'unexpected on-disk state validating against rendered-worker-44f2c74623e4d3bbe9557a9e82102c01' message. The MachineConfigPool list also included the single degraded node, and 23 of 24 nodes in the 'worker' pool were updated. I think that means this sticking happened after the node was asked to update. But to get at pivot timing, using the gathered node logs:

$ zgrep 44f2c74623e4d3bbe9557a9e82102c01 node.logs.gz
Jun 28 22:39:13.624843 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal machine-config-daemon[2302]: I0628 22:39:13.624799 2302 update.go:513] Checking Reconcilable for config mco-empty-mc to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01
Jun 28 22:39:13.625826 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal machine-config-daemon[2302]: I0628 22:39:13.625786 2302 update.go:1977] Starting update from mco-empty-mc to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Jun 28 22:39:13.632052 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal root[2342]: machine-config-daemon[2302]: Starting update from mco-empty-mc to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Jun 28 22:40:08.564733 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal logger[2454]: rendered-worker-44f2c74623e4d3bbe9557a9e82102c01
Jun 28 22:40:08.569757 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal machine-config-daemon[2302]: I0628 22:40:08.569726 2302 update.go:1977] initiating reboot: Completing firstboot provisioning to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01
Jun 28 22:40:08.572129 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal root[2456]: machine-config-daemon[2302]: initiating reboot: Completing firstboot provisioning to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01
Jun 28 22:40:08.585843 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: Started machine-config-daemon: Completing firstboot provisioning to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01.
Jun 28 22:40:08.620927 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: Stopping machine-config-daemon: Completing firstboot provisioning to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01...
Jun 28 22:40:08.732459 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal systemd[1]: Stopped machine-config-daemon: Completing firstboot provisioning to rendered-worker-44f2c74623e4d3bbe9557a9e82102c01.
Jun 29 04:57:45.156007 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal logger[322343]: rendered-worker-44f2c74623e4d3bbe9557a9e82102c01
Jun 29 04:57:55.188890 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal root[322477]: machine-config-daemon[1962]: Update completed for config rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 and node has been successfully uncordoned

"Starting update from mco-empty-mc" is a bit weird. Ah, it was a new node being born:

$ oc --as system:admin get -o json node build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal | jq -r .metadata.creationTimestamp
2022-06-28T22:40:51Z

Which explains why Ignition was so active.

$ zgrep 'ignition.*https' node.logs.gz
Jun 28 22:38:48.766270 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[911]: GET https://api-int.build02.gcp.ci.openshift.org:22623/config/worker: attempt #1
Jun 28 22:38:48.884632 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[911]: fetched referenced config at https://api-int.build02.gcp.ci.openshift.org:22623/config/worker with SHA512: 935bd4c87f722517dbe23c1fd4310949bf418aae5b389dffbc746cfd11fcb630724cb6bc71f2385e7886c7dc00b8b239651740670bce704f321a8533e1f07845

I dunno how to map that hash to the rendered MachineConfigs... It doesn't seem to match what's currently being rendered:

$ oc --as system:admin debug --no-tty -q node/build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal -- curl -s https://api-int.build02.gcp.ci.openshift.org:22623/config/worker | sha512sum
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  -
The other two infra machines are quite a bit older:

$ for NODE in $(oc --as system:admin get -l node-role.kubernetes.io/infra= -o jsonpath='{.items[*].metadata.name}' nodes); do echo "${NODE}"; oc --as system:admin get -o jsonpath='{.metadata.creationTimestamp}{"\n"}' node "${NODE}"; oc --as system:admin adm node-logs "${NODE}" | grep 'ignition.*fetched referenced config'; done
build0-gstfj-w-b-infra-5ck6v.c.openshift-ci-build-farm.internal
2022-03-28T10:06:20Z
build0-gstfj-w-b-infra-l69dd.c.openshift-ci-build-farm.internal
2022-01-17T12:55:54Z
build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal
2022-06-28T22:40:51Z
Jun 28 22:38:48.884632 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[911]: fetched referenced config at https://api-int.build02.gcp.ci.openshift.org:22623/config/worker with SHA512: 935bd4c87f722517dbe23c1fd4310949bf418aae5b389dffbc746cfd11fcb630724cb6bc71f2385e7886c7dc00b8b239651740670bce704f321a8533e1f07845

Broadening the net to include all 'worker' nodes:

$ for NODE in $(oc --as system:admin get -l node-role.kubernetes.io/worker= -o jsonpath='{.items[*].metadata.name}' nodes); do oc --as system:admin adm node-logs "${NODE}" | grep 'ignition.*fetched referenced config'; done | sort
Jun 28 22:38:48.884632 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[911]: fetched referenced config at https://api-int.build02.gcp.ci.openshift.org:22623/config/worker with SHA512: 935bd4c87f722517dbe23c1fd4310949bf418aae5b389dffbc746cfd11fcb630724cb6bc71f2385e7886c7dc00b8b239651740670bce704f321a8533e1f07845
Jun 28 22:43:01.138271 build0-gstfj-w-b-98mtn.c.openshift-ci-build-farm.internal ignition[852]: fetched referenced config at https://api-int.build02.gcp.ci.openshift.org:22623/config/worker with SHA512: d8bc4ab38515a0727a51f492d6c8ec71dd45b9ccd61e071a216fe1ad8bff3965af591cf04b7139edbc93d783851372ec8b110e0616f2afbc990eb400e2bfcbb4
Jun 29 00:03:48.153096 localhost ignition[847]: fetched referenced config at https://api-int.build02.gcp.ci.openshift.org:22623/config/worker with SHA512: d8bc4ab38515a0727a51f492d6c8ec71dd45b9ccd61e071a216fe1ad8bff3965af591cf04b7139edbc93d783851372ec8b110e0616f2afbc990eb400e2bfcbb4
...
Jun 29 21:20:45.979121 localhost ignition[852]: fetched referenced config at https://api-int.build02.gcp.ci.openshift.org:22623/config/worker with SHA512: d8bc4ab38515a0727a51f492d6c8ec71dd45b9ccd61e071a216fe1ad8bff3965af591cf04b7139edbc93d783851372ec8b110e0616f2afbc990eb400e2bfcbb4

So this node is the oldest one to remember its Ignition config, and right after this node, folks started seeing d8bc... It's a bit weird that this node knew its hostname by the time it made that request; most of the other nodes are still using 'localhost' at this point. Timestamps are also weird, since 21:20 is still a few hours into the future if we're talking UTC:

$ date --utc
Wed 29 Jun 2022 09:40:49 PM UTC
The first served Ignition content is always preserved on the node, at /etc/mcs-machine-config-content.json. If you want to check contents, that's probably the best place. That file should never get updated again after the MCO takes over.

Could you find the MCD logs of the problematic node, the rendered MC it should be on, and the original content in that file^
$ oc --as system:admin debug --no-tty -q node/build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal -- cat /host/etc/mcs-machine-config-content.json >mcs-machine-config-content.json
$ sha512sum mcs-machine-config-content.json
8afe3d5d2ad4568e362473051cdc7c220f60e0d7d221db678c2352af7d6d81289e4e40046a71908e96e03f15603f490b7fec8e86952bff1c1e4fa06f808ea304  mcs-machine-config-content.json

Doesn't match the 'SHA512: 935b...' from the node logs. Oh, because this is the initial MachineConfig, not the initial Ignition content:

$ jq -r .metadata.name mcs-machine-config-content.json
rendered-worker-44f2c74623e4d3bbe9557a9e82102c01
$ jq '.spec.config.storage.files[] | select(.path == "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit")' mcs-machine-config-content.json
{
  "contents": {
    "compression": "",
    "source": "data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIHBpZHNfbGltaXQgPSA4MDk2Cg=="
  },
  "overwrite": true,
  "path": "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
}

There are some other files that include explicit mode requests:

$ jq -r '.spec.config.storage.files[].mode' mcs-machine-config-content.json | sort | uniq -c
      2 384
     23 420
      1 484
      5 493
      1 null

Hmm, maybe this file is the only one with 'null'? From the still-live, in-cluster MachineConfig:

$ oc --as system:admin get -o json machineconfig rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 | jq '.spec.config.storage.files[] | select(.mode == null)'
{
  "contents": {
    "compression": "",
    "source": "data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIHBpZHNfbGltaXQgPSA4MDk2Cg=="
  },
  "overwrite": true,
  "path": "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
}

My guess is that Ignition treats that as 000, and the MCO wants to fall back to 420, and the MCO is mad that Ignition doesn't agree on the default.
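As a cross-check, decoding that stanza's data URL reproduces exactly the on-disk contents from comment 0, so only the mode was lost, not the payload. Stripping the data: prefix before base64-decoding:

$ jq -r '.spec.config.storage.files[] | select(.path == "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit").contents.source' mcs-machine-config-content.json | sed 's|^data:text/plain;charset=utf-8;base64,||' | base64 -d
[crio]
  [crio.runtime]
    pids_limit = 8096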
Auditing for 'null':

$ oc --as system:admin get -o json machineconfigs | jq -r '.items[] | .metadata as $m | .spec.config.storage | (.files // [])[] | select(.mode == null) | $m.creationTimestamp + " " + $m.name + " " + .path'
2020-10-20T21:25:54Z 99-worker-generated-containerruntime /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
2020-10-26T17:22:49Z 99-worker-generated-kubelet /etc/node-sizing-enabled.env
2020-10-26T17:22:49Z 99-worker-generated-kubelet /etc/kubernetes/kubelet.conf
2022-06-28T22:34:21Z rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
$ oc --as system:admin get --show-managed-fields -o json machineconfig 99-worker-generated-containerruntime | jq '.metadata | {annotations, creationTimestamp, generation, managedFields: ([.managedFields[] | {manager}])}'
{
  "annotations": {
    "machineconfiguration.openshift.io/generated-by-controller-version": "37b741601f9b7ff9b2e1870102cc2970b24e1835"
  },
  "creationTimestamp": "2020-10-20T21:25:54Z",
  "generation": 4,
  "managedFields": [
    {
      "manager": "machine-config-controller"
    }
  ]
}

Dropping into audit logs:

$ oc --as system:admin adm must-gather --dest-dir=audit-logs -- /usr/bin/gather_audit_logs
$ zgrep -h '"verb":"[^"]*".*"name":"99-worker-generated-containerruntime"' audit-logs/quay-io-*/audit_logs/kube-apiserver/*.gz | jq -r .verb | sort | uniq -c
      3 get

So I'm not fast enough to figure out when the last touch happened. Checking the covered range:

$ zgrep -h 'stageTimestamp' audit-logs/quay-io-*/audit_logs/kube-apiserver/*.gz | jq .stageTimestamp | sort | head -n2
"2022-06-29T16:29:59.025155Z"
"2022-06-29T16:29:59.026341Z"

which is indeed well after the 28th, when this node was coming up.
Digging into one of the other paths with a null, to understand why that's not making it into the rendered config:

$ oc --as system:admin get -o json machineconfigs | jq -r '.items[] | .metadata as $m | .spec.config.storage | (.files // [])[] | select(.path == "/etc/node-sizing-enabled.env") | $m.creationTimestamp + " " + $m.name + " " + (.mode | tostring)' | sort
2020-05-21T19:36:27Z 00-master 420
2020-05-21T19:36:27Z 00-worker 420
2020-10-26T17:22:49Z 99-worker-generated-kubelet null
2021-06-09T19:27:30Z rendered-master-7905996052725f8e4e7d8c39f4a4e89d 420
2021-06-09T19:27:30Z rendered-worker-ab6e05f3836865910b2d57f7ff099947 420
...so many other rendered-* MachineConfigs, do we not prune?...
2022-06-28T22:34:21Z rendered-master-12902db670ee8e73aa13a8df2782ddc6 420
2022-06-28T22:34:21Z rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 420

Diffing:

$ get_node_sizing_enabled() {
>   NAME="${1}"
>   oc --as system:admin get -o json machineconfig "${NAME}" | jq -r '.spec.config.storage | (.files // [])[] | select(.path == "/etc/node-sizing-enabled.env")'
> }
$ diff -u <(get_node_sizing_enabled 00-worker) <(get_node_sizing_enabled rendered-worker-44f2c74623e4d3bbe9557a9e82102c01)
--- /dev/fd/63	2022-06-29 15:42:38.732172513 -0700
+++ /dev/fd/62	2022-06-29 15:42:38.734172513 -0700
@@ -1,6 +1,7 @@
 {
   "contents": {
-    "source": "data:,NODE_SIZING_ENABLED%3Dfalse%0ASYSTEM_RESERVED_MEMORY%3D1Gi%0ASYSTEM_RESERVED_CPU%3D500m"
+    "compression": "",
+    "source": "data:text/plain;charset=utf-8;base64,Tk9ERV9TSVpJTkdfRU5BQkxFRD1mYWxzZQpTWVNURU1fUkVTRVJWRURfTUVNT1JZPTNHaQpTWVNURU1fUkVTRVJWRURfQ1BVPTEwMDBtCg=="
   },
   "mode": 420,
   "overwrite": true,
$ diff -u <(get_node_sizing_enabled 99-worker-generated-kubelet) <(get_node_sizing_enabled rendered-worker-44f2c74623e4d3bbe9557a9e82102c01)
--- /dev/fd/63	2022-06-29 15:43:03.654170872 -0700
+++ /dev/fd/62	2022-06-29 15:43:03.656170872 -0700
@@ -3,6 +3,7 @@
     "compression": "",
     "source": "data:text/plain;charset=utf-8;base64,Tk9ERV9TSVpJTkdfRU5BQkxFRD1mYWxzZQpTWVNURU1fUkVTRVJWRURfTUVNT1JZPTNHaQpTWVNURU1fUkVTRVJWRURfQ1BVPTEwMDBtCg=="
   },
+  "mode": 420,
   "overwrite": true,
   "path": "/etc/node-sizing-enabled.env"
 }

Is there some kind of union going on?
Fitting the suspicious 99-worker-generated-containerruntime into the cluster's update history:

$ (oc --as system:admin get -o json clusterversion version | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'; echo '2020-10-20T21:25:54Z --- 99-worker-generated-containerruntime') | sort
2020-07-29T15:01:15Z 2020-07-29T15:35:42Z Completed 4.5.4
2020-08-27T17:41:55Z 2020-08-27T18:21:45Z Completed 4.5.6
2020-09-25T09:30:46Z 2020-09-25T10:17:32Z Completed 4.5.12
2020-10-20T17:53:50Z 2020-10-20T18:23:58Z Completed 4.5.14
2020-10-20T18:30:58Z 2020-10-21T15:18:00Z Completed 4.6.0-rc.4
2020-10-20T21:25:54Z --- 99-worker-generated-containerruntime
2020-10-23T14:01:15Z 2020-10-23T14:31:36Z Completed 4.6.1
2020-11-17T17:19:57Z 2020-11-17T19:04:40Z Completed 4.6.4
...
2022-06-06T12:43:55Z 2022-06-06T13:51:27Z Completed 4.10.17
2022-06-17T14:26:40Z 2022-06-17T15:01:31Z Completed 4.10.18
2022-06-28T14:12:13Z 2022-06-29T04:59:16Z Completed 4.11.0-fc.3

So possibly it is something the cluster picked up on that 4.6.0-rc.4, the MCO was fixed by GA, and we never worried about recovering the rc.4 behavior because clusters that touched it were tainted into not being supported?
> Hmm, maybe this file is the only one with 'null'? From the still-live, in-cluster MachineConfig:
>
> $ oc --as system:admin get -o json machineconfig rendered-worker-44f2c74623e4d3bbe9557a9e82102c01 | jq '.spec.config.storage.files[] | select(.mode == null)'
> {
>   "contents": {
>     "compression": "",
>     "source": "data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIHBpZHNfbGltaXQgPSA4MDk2Cg=="
>   },
>   "overwrite": true,
>   "path": "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
> }
>
> My guess is that Ignition treats that as 000, and the MCO wants to fall back to 420, and the MCO is mad that Ignition doesn't agree on the default.

That is very odd; we never ran into this issue in the past. I don't know what the Ignition default would be here. I think we should check with the firstboot team.

> ...so many other rendered-* MachineConfigs, do we not prune?...

Nope. There's a card somewhere for that.

> Is there some kind of union going on?

Nope; assuming I read that correctly and it's the same file, the latter overwrites completely.

> Fitting the suspicious 99-worker-generated-containerruntime into the cluster's update history:

The controller should be regenerating that MC every time the MCO updates. Does it have a generated-by-controller-version annotation? If so, it's still managed by the MCO and continuously updated; otherwise the MCO would have thrown an error. If it doesn't exist, that's problematic.
I think our first and best lead here is to figure out what Ignition's default behaviour is when a file doesn't specify a mode, and continue from there. And secondly: why did other nodes not fail on this, and why didn't we see this issue until the current update?
4.6 GAed with 4.6.1 [1]. The MCO at that point was:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 | grep machine-config-operator
  machine-config-operator  https://github.com/openshift/machine-config-operator  6896f6bc491fde7e8314631652f70841c7f9f31d

By 4.6.3 it was:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.3-x86_64 | grep machine-config-operator
  machine-config-operator  https://github.com/openshift/machine-config-operator  4f29aba0ccd1d7c04f839f5c9ceaa3dd63a04960

And back at 4.6.0-rc.4 it was:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.6.0-rc.4-x86_64 | grep machine-config-operator
  machine-config-operator  https://github.com/openshift/machine-config-operator  48d52f385642cbecf5c95e0ac4b0ec8c37664fe7

Checking the history:

$ git --no-pager log --oneline --first-parent 48d52f385642c..4f29aba0ccd
4f29aba0c Merge pull request #2164 from vrutkovs/4.6-ovs_installed_workaround
6896f6bc4 Merge pull request #2168 from cgwalters/encap-fix-version-4.6

But no hits in either for '[Mm]ode'. Your comment 15 follow-ups make sense to me.

[1]: https://github.com/openshift/cincinnati-graph-data/blob/4c92f3ac16b161cf42fdae5f538f93b1741f761d/internal-channels/stable.yaml#L44
$ git clone --depth 1 --branch main https://github.com/coreos/ignition.git
$ cd ignition
$ git --no-pager log --oneline -1
7652b83 (grafted, HEAD -> main, origin/main, origin/HEAD) Merge pull request #1409 from bgilbert/notes
$ git --no-pager grep '_mode_.*file' docs
docs/configuration-v3_0.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420). Setuid/setgid/sticky bits are not supported. If not specified, the permission mode for files defaults to 0644 or the existing file's permissions if `overwrite` is false, `contents.source` is unspecified, and a file already exists at the path.
docs/configuration-v3_1.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420). Setuid/setgid/sticky bits are not supported. If not specified, the permission mode for files defaults to 0644 or the existing file's permissions if `overwrite` is false, `contents.source` is unspecified, and a file already exists at the path.
docs/configuration-v3_2.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420). Setuid/setgid/sticky bits are not supported. If not specified, the permission mode for files defaults to 0644 or the existing file's permissions if `overwrite` is false, `contents.source` is unspecified, and a file already exists at the path.
docs/configuration-v3_3.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420). Setuid/setgid/sticky bits are not supported. If not specified, the permission mode for files defaults to 0644 or the existing file's permissions if `overwrite` is false, `contents.source` is unspecified, and a file already exists at the path.
docs/configuration-v3_4_experimental.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420). Setuid/setgid/sticky bits are supported. If not specified, the permission mode for files defaults to 0644 or the existing file's permissions if `overwrite` is false, `contents.source` is unspecified, and a file already exists at the path.

So consistently 420 in v3. Not sure if I need to worry about v2?
Here's v2 just in case, always nominally 420 there too:

$ git clone --depth 1 --branch spec2x https://github.com/coreos/ignition.git
$ cd ignition
$ git --no-pager log --oneline -1
4db553d (grafted, HEAD -> spec2x, origin/spec2x) Merge pull request #1067 from bgilbert/spec2y
$ git --no-pager grep '_mode_.*file' doc
doc/configuration-v2_0.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420).
doc/configuration-v2_1.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420).
doc/configuration-v2_2.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420).
doc/configuration-v2_3.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420).
doc/configuration-v2_4.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420).
doc/configuration-v2_5-experimental.md: * **_mode_** (integer): the file's permission mode. Note that the mode must be properly specified as a **decimal** value (i.e. 0644 -> 420).
oh, wait. There is no "If not specified..." clause in the v2 spec...
Back to our cluster:

$ oc --as system:admin get -o json machineconfigs | jq -r '.items[].spec.config.ignition.version' | sort | uniq -c
     33 2.2.0
     27 3.1.0
    136 3.2.0
$ oc --as system:admin get -o json machineconfigs | jq -r '.items[] | .spec.config.ignition.version + " " + .metadata.name' | grep -v rendered-
3.2.0 00-master
3.2.0 00-worker
3.2.0 01-master-container-runtime
3.2.0 01-master-kubelet
3.2.0 01-worker-container-runtime
3.2.0 01-worker-kubelet
3.2.0 99-master-generated-crio-seccomp-use-default
3.2.0 99-master-generated-registries
2.2.0 99-master-ssh
3.2.0 99-worker-generated-containerruntime
3.2.0 99-worker-generated-crio-seccomp-use-default
3.2.0 99-worker-generated-kubelet
3.2.0 99-worker-generated-registries
2.2.0 99-worker-ssh

So the 99-worker-generated-containerruntime that seems to be introducing the problem is 3.2.0, where the null mode (nominally) has an explicit default.
And back to our node logs for the Ignition process that wrote the file at 22:38:52.936870:

$ zgrep 'ignition[[]1210]' node.logs.gz | head -n1
Jun 28 22:38:51.655160 build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal ignition[1210]: INFO : Ignition 0.35.0

Looks like that's 2020-01-21 code [1], so the Ignition service is probably down-converting on the fly.

[1]: https://github.com/coreos/ignition/tree/v0.35.0
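To make the down-conversion concrete: a spec-3 client applies the documented 0644 default to a file entry with no mode, but once the config is translated to spec 2.2 for this Ignition/0.35.0 client, a missing mode has no documented fallback. A reconstruction of roughly what our stanza would look like in a spec-2 rendering (illustrative only, not captured from the server):

{
  "filesystem": "root",
  "path": "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit",
  "contents": {
    "source": "data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIHBpZHNfbGltaXQgPSA4MDk2Cg=="
  },
  "overwrite": true
}

With no mode and no spec-2 default, the 0.35.0 client writing 0000 permissions would at least be consistent with what we see on disk.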
heh, here's where comment 7's cf83... hash comes from:

$ sha512sum </dev/null
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  -

Because:

$ oc --as system:admin debug -q node/build0-gstfj-w-b-infra-wlvr9.c.openshift-ci-build-farm.internal -- curl https://api-int.build02.gcp.ci.openshift.org:22623/config/worker
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to api-int.build02.gcp.ci.openshift.org port 22623: Connection refused

I'll go read up about how access to the Ignition service is supposed to work.
$ oc --as system:admin -n openshift-machine-api get -o json machines | jq -r '.items[].spec.providerSpec.value.disks[].image' | sort | uniq -c
      6 build0-gstfj-rhcos-image
     39 projects/rhcos-cloud/global/images/rhcos-46-82-202011260640-0-gcp-x86-64

Hmmm...
That's from [1]. I'll go talk to the test-platform folks to get some context on the custom boot image.

[1]: https://github.com/openshift/release/blame/7e61829f682e1574513f78c3e94537836d824ab3/clusters/build-clusters/02_cluster/openshift-machine-api/build0-gstfj-w-b-infra_machineset.yaml#L38
Ok, [1] has fixed that MachineSet:

$ oc --as system:admin -n openshift-machine-api get -o json machinesets | jq -r '.items[].spec.template.spec.providerSpec.value.disks[].image' | sort | uniq -c
     10 projects/rhcos-cloud/global/images/rhcos-46-82-202011260640-0-gcp-x86-64

And checking the Ignition servers now, there's no sign of older clients:

$ for POD in $(oc --as system:admin -n openshift-machine-config-operator get -o jsonpath="{.items[*].metadata.name}" -l k8s-app=machine-config-server pods); do oc --as system:admin -n openshift-machine-config-operator logs "${POD}" | grep -o 'User-Agent.*'; done | sort | uniq -c
    866 User-Agent:"Ignition/2.6.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.1.0, */*;q=0.1"

But these logs only go back so far:

$ for POD in $(oc --as system:admin -n openshift-machine-config-operator get -o jsonpath="{.items[*].metadata.name}" -l k8s-app=machine-config-server pods); do oc --as system:admin -n openshift-machine-config-operator logs "${POD}" | grep 'User-Agent.*' | head -n1; done
I0628 23:05:42.511656       1 api.go:109] Pool worker requested by address:"10.0.32.9:51888" User-Agent:"Ignition/2.6.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.1.0, */*;q=0.1"
I0628 23:08:20.013154       1 api.go:109] Pool worker requested by address:"10.0.32.43:59892" User-Agent:"Ignition/2.6.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.1.0, */*;q=0.1"
I0628 22:43:00.883926       1 api.go:109] Pool worker requested by address:"10.0.32.124:54844" User-Agent:"Ignition/2.6.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.1.0, */*;q=0.1"

so I don't have the entries from the Jun 28 22:38 UTC Ignition run that triggered this bug.

[1]: https://github.com/openshift/release/pull/30001
Setting up a test MachineSet to reproduce the break:

$ curl -s https://raw.githubusercontent.com/openshift/release/master/clusters/build-clusters/02_cluster/openshift-machine-api/build0-gstfj-w-b-infra_machineset.yaml | sed 's/\(name: build0-\).*/\1test/;s/image: .*/image: build0-gstfj-rhcos-image/' >machineset.yaml
$ oc --as system:admin apply -f machineset.yaml
$ oc --as system:admin -n openshift-machine-api get machineset build0-test
NAME          DESIRED   CURRENT   READY   AVAILABLE   AGE
build0-test   1         1                             71s

And there's that ancient client:

$ for POD in $(oc --as system:admin -n openshift-machine-config-operator get -o jsonpath="{.items[*].metadata.name}" -l k8s-app=machine-config-server pods); do oc --as system:admin -n openshift-machine-config-operator logs "${POD}" | grep -o 'User-Agent.*'; done | sort | uniq -c
      1 User-Agent:"Ignition/0.35.0" Accept-Header: "application/vnd.coreos.ignition+json; version=2.2.0, application/vnd.coreos.ignition+json; version=1; q=0.5, */*; q=0.1"
    868 User-Agent:"Ignition/2.6.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.1.0, */*;q=0.1"

Machine comes up fine:

$ oc --as system:admin -n openshift-machine-api get machines | grep build0-test-
build0-test-k9pcq   Running   n1-standard-16   us-east1   us-east1-b   3m31s

Node is ready:

$ oc --as system:admin -n openshift-machine-api get nodes | grep build0-test
build0-test-k9pcq.c.openshift-ci-build-farm.internal   Ready   infra,worker   101s   v1.24.0+284d62a

And the mode is 0:

$ oc --as system:admin debug -q node/build0-test-k9pcq.c.openshift-ci-build-farm.internal -- ls -l /host/etc/crio/crio.conf.d
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
total 12
-rw-r--r--. 1 root root 2672 Jun 30 21:59 00-default
----------. 1 root root   46 Jun 30 21:59 01-ctrcfg-pidsLimit
-rw-r--r--. 1 root root   67 Jun 30 21:59 01-mc-seccompUseDefault

Deleting the test MachineSet:

$ oc --as system:admin -n openshift-machine-api delete machineset build0-test

So it's definitely an unset Ignition version=2.2.0 mode vs. this old Ignition/0.35.0 client. Still unclear to me why the mode is unset in the three 99-worker-generated-* MachineConfigs that comment 11 was poking at.
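In hindsight, before deleting the test machine it would have been cheap to sweep the node for any other zero-mode files the old client wrote, reusing the same debug pattern. Something like this hypothetical check, where, given the earlier null-mode audit, presumably only the 01-ctrcfg-pidsLimit drop-in would turn up:

$ oc --as system:admin debug -q node/build0-test-k9pcq.c.openshift-ci-build-farm.internal -- chroot /host find /etc -maxdepth 3 -type f -perm 0000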
Testing with cluster bot 'launch 4.11.0-rc.0':

$ oc adm upgrade
Cluster version is 4.11.0-rc.0
$ cat crc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  containerRuntimeConfig:
    pidsLimit: 8096
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
$ oc apply -f crc.yaml
$ oc get -o json machineconfigs | jq -r '.items[] | .metadata as $m | .spec.config.storage | (.files // [])[] | select(.mode == null) | $m.creationTimestamp + " " + $m.name + " " + .path'
2022-06-30T23:10:08Z 99-worker-generated-containerruntime /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
2022-06-30T23:10:13Z rendered-worker-0f5cf0eea59f3e588caf37269a3ec44a /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit

So I dunno where in the code things are getting lost, but I would consider any changes that avoid null modes in ContainerRuntimeConfig-to-MachineConfig conversion (and possibly also KubeletConfig-to-MachineConfig conversion [1]?) as sufficient to close the bug.

[1]: https://docs.openshift.com/container-platform/4.10/post_installation_configuration/machine-configuration-tasks.html#configuring-machines-with-custom-resources
And auditing our minor releases for exposure:

* 4.5... [4.5]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.5-e2e-aws-serial/1537875703121317888/artifacts/e2e-aws-serial/ipi-install-install/artifacts/log-bundle-20220617201122.tar | tar xz --strip-components=1
$ grep -ro 'User-Agent.*Ignition.*'
...no hits, so not sure what 4.5 was running...

* 4.6.59 [4.6]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.6-e2e-aws-serial/1534496288538628096/artifacts/e2e-aws-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-server-xcf27_machine-config-server.log | grep -o 'User-Agent.*' | sort | uniq -c
      4 User-Agent:"Ignition/2.6.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.1.0, */*;q=0.1"

* 4.8.46 [4.8]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-serial/1542186763156459520/artifacts/e2e-aws-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-server-2pclf_machine-config-server.log | grep -o 'User-Agent.*' | sort | uniq -c
      3 User-Agent:"Ignition/2.9.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.2.0, */*;q=0.1"

* 4.10.21 [4.10]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-serial/1542174109675294720/artifacts/e2e-aws-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-server-8gth4_machine-config-server.log | grep -o 'User-Agent.*' | sort | uniq -c
      3 User-Agent:"Ignition/2.13.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.3.0, */*;q=0.1"

* 4.11.0-rc.0 [4.11]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1542165038209437696/artifacts/e2e-aws-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-server-c85gf_machine-config-server.log | grep -o 'User-Agent.*' | sort | uniq -c
      3 User-Agent:"Ignition/2.13.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.3.0, */*;q=0.1"

So I'm not entirely clear on what's happening with 4.5-and-earlier boot-image Ignition clients, but 4.6 and later (at least by their current z-stream tips) are all looking for version=3.*, so they have the comment 17 defaulting and won't be bothered by the nulls.

[4.5]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#periodic-ci-openshift-release-master-nightly-4.5-e2e-aws-serial
[4.6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.6-e2e-aws-serial/1534496288538628096
[4.8]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-serial/1542186763156459520
[4.10]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-serial/1542174109675294720
[4.11]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1542165038209437696
I think if we want to fundamentally fix this, it may require some fixes to 4.5 and earlier. Like Trevor mentions, I think we can apply some fixes for the new versions to make sure this doesn't happen, but this issue has (I think) always been around and should be relatively rare. Could I ask why this is marked as a servicedeliveryblocker?
I don't think it's worth dusting off the 4.5.z build pipelines to publish changes there. But shipping something in 4.12, and possibly 4.11.z, that avoids the null mode in ContainerRuntimeConfig-to-MachineConfig conversion (and possibly also KubeletConfig-to-MachineConfig conversion, per comment 27), would protect folks with born-in-4.5-or-earlier clusters who are updating to 4.11. I'm not actually clear on why these clusters did not have trouble on 4.10...
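For admins wanting a quick pre-update exposure check on such a born-in-4.old cluster, the null-mode jq audit from earlier in this thread composes into a one-line gate (a sketch; it branches on whether any managed file still lacks a mode):

$ oc get -o json machineconfigs | jq -e '[.items[].spec.config.storage | (.files // [])[] | select(.mode == null)] | length == 0' >/dev/null && echo 'no null modes' || echo 'null modes present; exposed on 4.11 with v2 boot images'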
Ok, back to cluster bot with 'launch 4.10.18', and otherwise following comment 27:

$ oc adm upgrade
Cluster version is 4.10.18
...
$ cat crc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  containerRuntimeConfig:
    pidsLimit: 8096
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
$ oc apply -f crc.yaml
$ oc get -o json machineconfigs | jq -r '.items[] | .metadata as $m | .spec.config.storage | (.files // [])[] | select(.mode == null) | $m.creationTimestamp + " " + $m.name + " " + .path'
...no hits...
$ oc get machineconfigs | grep containerruntime
99-worker-generated-containerruntime   7152adb176e7ff4bf6c1d3a2e7b0aae4fd2794b6   3.2.0   42s
$ oc get -o json machineconfig 99-worker-generated-containerruntime | jq '.spec.config.storage.files[]'
{
  "contents": {
    "source": "data:text/plain,%5Bcrio%5D%0A%20%20%5Bcrio.runtime%5D%0A%20%20%20%20pids_limit%20%3D%208096%0A"
  },
  "mode": 420,
  "overwrite": true,
  "path": "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
}

So that explains why this wasn't an issue for born-in-4.old clusters on 4.10. I'm not clear on exactly what changed between 4.10 and 4.11, but the 4.11 FCs and RCs seem to be the first to set the null mode for these generated MachineConfigs.
Diffing 4.10 and 4.11, I suspect the regression (for folks with boot images that use v2 Ignition clients) came in with [1]. [1]: https://github.com/openshift/machine-config-operator/pull/3128/commits/da5184f3045bbcaa22e5acf04bea707eb714b062#diff-b2ef9db0d76c6145ec1bb5310134037bae64572ab47464d3223745c2a410d3e3L111
Following comment 27 and comment 31 to get some pre-merge verification on mco#3226, I asked cluster bot for a cluster with 'launch 4.12.0-0.nightly-2022-07-05-191634,openshift/machine-config-operator#3226' [1]. And then:

$ oc adm upgrade
Cluster version is 0.0.1-0.test-2022-07-05-233035-ci-ln-imcnbfb-latest
...
$ cat crc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  containerRuntimeConfig:
    pidsLimit: 8096
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
$ oc apply -f crc.yaml
$ oc get -o json machineconfigs | jq -r '.items[] | .metadata as $m | .spec.config.storage | (.files // [])[] | select(.mode == null) | $m.creationTimestamp + " " + $m.name + " " + .path'
...no hits...
$ oc get machineconfigs | grep containerruntime
99-worker-generated-containerruntime   bc6a7da4fd747ab95d2ac9592c4682a32d7ff954   3.2.0   17s
$ oc get -o json machineconfig 99-worker-generated-containerruntime | jq '.spec.config.storage.files[]'
{
  "contents": {
    "compression": "",
    "source": "data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIHBpZHNfbGltaXQgPSA4MDk2Cg=="
  },
  "mode": 420,
  "overwrite": true,
  "path": "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"
}

So hooray, that looks great. Hunting around for when the default boot images changed to Ignition v3, I've found [2], which is new in 4.6.

$ git clone https://github.com/openshift/installer
$ cd installer
$ git log --oneline origin/release-4.6 | grep 'Migrate to ignition config spec v3'
b9028eea48 *: Migrate to ignition config spec v3.1
$ git log --oneline origin/release-4.5 | grep 'Migrate to ignition config spec v3'
...no hits...

My cluster bot cluster happens to be GCP, so grabbing the 4.5 boot image:

$ git cat-file -p origin/release-4.5:data/data/rhcos-amd64.json | jq -r .gcp.image
rhcos-45-82-202008010929-0-gcp-x86-64

So following comment 26 in an attempt to create a reproducing MachineSet:

$ oc -n openshift-machine-api get machinesets
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-imcnbfb-72292-ql4sr-worker-a   1         1         1       1           20m
ci-ln-imcnbfb-72292-ql4sr-worker-b   1         1         1       1           20m
ci-ln-imcnbfb-72292-ql4sr-worker-c   1         1         1       1           20m
ci-ln-imcnbfb-72292-ql4sr-worker-f   0         0                             20m
$ oc -n openshift-machine-api get -o yaml machineset ci-ln-imcnbfb-72292-ql4sr-worker-b >machineset.yaml
$ emacs machineset.yaml  # remove status and other mutable stuff. Change name and labels to old-image. Slot in rhcos-45-82-202008010929-0-gcp-x86-64
$ oc apply -f machineset.yaml

But no luck:

$ oc -n openshift-machine-api get machines | grep old-image
old-image-4m6h5   Failed   43s
$ oc -n openshift-machine-api get -o json machine old-image-4m6h5 | jq '.status.providerStatus.conditions[]'
{
  "lastTransitionTime": "2022-07-06T00:03:01Z",
  "message": "googleapi: Error 403: Required 'compute.images.useReadOnly' permission for 'projects/rhcos-cloud/global/images/rhcos-45-82-202008010929-0-gcp-x86-64', forbidden",
  "reason": "MachineCreationFailed",
  "status": "False",
  "type": "MachineCreated"
}

So I'd need to dig deeper into GCP MachineSets to figure out how to request a MachineSet with that old 4.5 boot image.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1544461264191754240
[2]: https://github.com/openshift/installer/pull/3871
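A note on that 403, in case anyone retries this: the cluster's machine-api credentials lack compute.images.useReadOnly on the rhcos-cloud project. One untested workaround would be copying the boot image into the cluster's own project and pointing the test MachineSet at the copy; the rhcos-45-copy name here is arbitrary:

$ gcloud compute images create rhcos-45-copy --source-image=rhcos-45-82-202008010929-0-gcp-x86-64 --source-image-project=rhcos-cloud

I haven't tried this here.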
I'm going to guess "high" severity so I can set a target release without the bots clearing it off, because I like support for these born-in-4.old clusters to be boring and stable, but feel free to adjust as you see fit.
This hit an internal cluster during an upgrade to 4.11 and blocked the upgrade, hence I labeled it as a ServiceDeliveryBlocker, as there might be clusters from external customers suffering the same problem. I can't tell why it didn't happen during the upgrade to 4.10; probably something changed on the cluster.
Comment 32 links the code that regressed the behavior vs. Ignition v2 clients between 4.10 and 4.11, which is why folks didn't bump into this until updating to 4.11.
Thanks for the fix, Trevor! I think we should backport this ASAP and mark it as a blocker for 4.11.0, given this will probably be an issue for anyone specifying that ContainerRuntimeConfig.
Verified on 4.12.0-0.nightly-2022-07-06-221008.

1. Install OCP 4.5.41, which has an old boot image (Ignition spec version 2).

2. Create a ContainerRuntimeConfig to change pidsLimit:

$ cat ctrcfg-pidlimit.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  containerRuntimeConfig:
    pidsLimit: 8096
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""

3. Upgrade the cluster from 4.5.41 to 4.11.0-fc.3:

$ oc get clusterversion -o yaml|yq -y '.items[].status.history'
- completionTime: '2022-07-08T15:23:33Z'
  image: quay.io/openshift-release-dev/ocp-release:4.11.0-fc.3-x86_64
  startedTime: '2022-07-08T14:06:27Z'
  state: Completed
  verified: false
  version: 4.11.0-fc.3
- completionTime: '2022-07-08T11:06:19Z'
  image: quay.io/openshift-release-dev/ocp-release:4.10.18-x86_64
  startedTime: '2022-07-08T09:50:57Z'
  state: Completed
  verified: false
  version: 4.10.18
- completionTime: '2022-07-08T09:41:12Z'
  image: quay.io/openshift-release-dev/ocp-release:4.9.42-x86_64
  startedTime: '2022-07-08T08:09:47Z'
  state: Completed
  verified: false
  version: 4.9.42
- completionTime: '2022-07-08T08:08:32Z'
  image: quay.io/openshift-release-dev/ocp-release:4.8.46-x86_64
  startedTime: '2022-07-08T06:45:39Z'
  state: Completed
  verified: false
  version: 4.8.46
- completionTime: '2022-07-08T06:37:00Z'
  image: quay.io/openshift-release-dev/ocp-release:4.7.54-x86_64
  startedTime: '2022-07-08T05:22:22Z'
  state: Completed
  verified: false
  version: 4.7.54
- completionTime: '2022-07-08T05:05:42Z'
  image: quay.io/openshift-release-dev/ocp-release:4.6.59-x86_64
  startedTime: '2022-07-08T04:03:14Z'
  state: Completed
  verified: false
  version: 4.6.59
- completionTime: '2022-07-08T03:44:39Z'
  image: quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
  startedTime: '2022-07-08T03:17:19Z'
  state: Completed
  verified: false
  version: 4.5.41

4. Scale up the machineset to provision a new node:

$ oc scale --replicas=2 machineset rioliu-070802-sc6nb-worker-us-east-2a -n openshift-machine-api
machineset.machine.openshift.io/rioliu-070802-sc6nb-worker-us-east-2a scaled
5. new node is degraded due to mode mismatch issue

$ oc get node/ip-10-0-147-62.us-east-2.compute.internal -o yaml | yq -y '.metadata.annotations'
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-02ce464059c75b4fe","ifaddr":{"ipv4":"10.0.128.0/19"},"capacity":{"ipv4":9,"ipv6":10}}]'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-03a76df8c6f58451b"}'
machine.openshift.io/machine: openshift-machine-api/rioliu-070802-sc6nb-worker-us-east-2a-95jkc
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-103a240b977aedb4265c8501c0f23c1a
machineconfiguration.openshift.io/desiredConfig: rendered-worker-103a240b977aedb4265c8501c0f23c1a
machineconfiguration.openshift.io/reason: 'unexpected on-disk state validating against rendered-worker-103a240b977aedb4265c8501c0f23c1a: mode mismatch for file: "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"; expected: -rw-r--r--/420/0644; received: ----------/0/0'
machineconfiguration.openshift.io/state: Degraded
volumes.kubernetes.io/controller-managed-attach-detach: 'true'

$ oc debug node/ip-10-0-147-62.us-east-2.compute.internal -- chroot /host stat /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/ip-10-0-147-62us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
  File: /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
  Size: 46          Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 176164947   Links: 1
Access: (0000/----------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:container_config_t:s0
Access: 2022-07-08 15:55:26.207531504 +0000
Modify: 2022-07-08 15:51:42.826000000 +0000
Change: 2022-07-08 15:54:19.140076330 +0000
 Birth: 2022-07-08 15:54:19.140076330 +0000
Removing debug pod ...

6. scale down machineset to remove the degraded node

$ oc scale --replicas=1 machineset rioliu-070802-sc6nb-worker-us-east-2a -n openshift-machine-api
machineset.machine.openshift.io/rioliu-070802-sc6nb-worker-us-east-2a scaled
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-166.us-east-2.compute.internal   Ready    worker   13h   v1.24.0+284d62a
ip-10-0-133-203.us-east-2.compute.internal   Ready    master   13h   v1.24.0+284d62a
ip-10-0-160-136.us-east-2.compute.internal   Ready    master   13h   v1.24.0+284d62a
ip-10-0-170-121.us-east-2.compute.internal   Ready    worker   13h   v1.24.0+284d62a
ip-10-0-202-25.us-east-2.compute.internal    Ready    master   13h   v1.24.0+284d62a
ip-10-0-204-123.us-east-2.compute.internal   Ready    worker   13h   v1.24.0+284d62a
7. upgrade cluster to 4.12.0-0.nightly-2022-07-06-221008

$ oc get clusterversion -o yaml|yq -y '.items[].status.history[0]'
completionTime: '2022-07-08T18:13:53Z'
image: registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-07-06-221008
startedTime: '2022-07-08T17:03:41Z'
state: Completed
verified: false
version: 4.12.0-0.nightly-2022-07-06-221008

8. scale up machineset to provision new node

$ oc scale --replicas=2 machineset rioliu-070802-sc6nb-worker-us-east-2a -n openshift-machine-api
machineset.machine.openshift.io/rioliu-070802-sc6nb-worker-us-east-2a scaled

9. check new node status: it is not degraded, no config drift found

$ oc get node ip-10-0-139-88.us-east-2.compute.internal -o yaml| yq -y '.metadata.annotations'
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-0da5c4a13816be886","ifaddr":{"ipv4":"10.0.128.0/19"},"capacity":{"ipv4":9,"ipv6":10}}]'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0984649b2743f74ac"}'
machine.openshift.io/machine: openshift-machine-api/rioliu-070802-sc6nb-worker-us-east-2a-kkh4c
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-5b907435666b678ea836542b5fd330c2
machineconfiguration.openshift.io/desiredConfig: rendered-worker-5b907435666b678ea836542b5fd330c2
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-5b907435666b678ea836542b5fd330c2
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-5b907435666b678ea836542b5fd330c2
machineconfiguration.openshift.io/reason: ''
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: 'true'

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-14580ffdad42a4a97812d4472fdb6272   True      False      False      3              3                   3                     0                      15h
worker   rendered-worker-5b907435666b678ea836542b5fd330c2   True      False      False      4              4                   4                     0                      15h
10. check file mode in mc and new provisioned node

$ oc get mc/99-worker-generated-containerruntime -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  annotations:
    machineconfiguration.openshift.io/generated-by-controller-version: 5d54e097506ec8bdd76f3969cc9f382ae1efea3c
  creationTimestamp: "2022-07-08T04:51:05Z"
  generation: 5
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-generated-containerruntime
  ownerReferences:
  - apiVersion: machineconfiguration.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: ContainerRuntimeConfig
    name: set-pids-limit
    uid: 5259a782-0251-4746-89ab-f5637bd3d2ce
  resourceVersion: "460887"
  uid: a314dc86-272b-4aa3-a989-172144b71e30
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          compression: ""
          source: data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIHBpZHNfbGltaXQgPSA4MDk2Cg==
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
  extensions: null
  fips: false
  kernelArguments: null
  kernelType: ""
  osImageURL: ""

$ oc debug node/ip-10-0-139-88.us-east-2.compute.internal -- chroot /host stat /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/ip-10-0-139-88us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
  File: /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
  Size: 46          Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 54569073    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:container_config_t:s0
Access: 2022-07-08 18:22:31.484067929 +0000
Modify: 2022-07-08 18:18:41.156000000 +0000
Change: 2022-07-08 18:21:20.230737254 +0000
 Birth: 2022-07-08 18:21:20.230737254 +0000
Removing debug pod ...
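As one more sanity check on the contents (not just the mode), the base64 payload in that MachineConfig's data URL decodes back to the expected CRI-O drop-in. A small sketch; the files[0] index assumes, as shown above, that this MachineConfig carries a single file:

$ # Strip the data-URL prefix and decode the embedded file contents.
$ oc get -o json machineconfig 99-worker-generated-containerruntime \
    | jq -r '.spec.config.storage.files[0].contents.source' \
    | sed 's|^data:text/plain;charset=utf-8;base64,||' \
    | base64 -d
[crio]
  [crio.runtime]
    pids_limit = 8096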
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399