Bug 2104561

Summary: 4.10 to 4.11 update: Degraded node: unexpected on-disk state: mode mismatch for file: "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"; expected: -rw-r--r--/420/0644; received: ----------/0/0
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Machine Config Operator
Assignee: MCO Bug Bot <mco-triage>
Machine Config Operator sub component: Machine Config Operator
QA Contact: Rio Liu <rioliu>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: hongkliu, jerzhang, mbargenq, mdewald, mkrejci, skumari, travi, wking
Version: 4.11
Keywords: Regression, ServiceDeliveryBlocker, Upgrades
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 11:20:25 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2102004
Bug Blocks:

Comment 1 Yu Qi Zhang 2022-07-06 17:52:46 UTC
Marking as blocker+ since it will be an upgrade blocker for some cases (old bootimages).

Comment 3 Sinny Kumari 2022-07-07 08:38:14 UTC
We decided not to mark this as a 4.11 blocker because it doesn't block upgrades or new 4.11 cluster installs. With this bug, upgrades may be slower for machines that take longer to reboot. The fix should land soon in 4.11.z.

Comment 4 Sinny Kumari 2022-07-07 08:42:29 UTC
Sorry for the noise (this happens when you have multiple bugs open in different tabs :/); please ignore my last comment, comment#3. I accidentally replied on the wrong bug; comment#3 was intended for bug https://bugzilla.redhat.com/show_bug.cgi?id=2104687 .

Comment 5 Rio Liu 2022-07-11 11:07:32 UTC
Verified on 4.11.0-0.nightly-2022-07-08-182347

1. Install OCP 4.5.41, which has an old boot image. The Ignition version is 2.
2. Create a ContainerRuntimeConfig to change pidsLimit

$ cat ctrcfg-pidlimit.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  containerRuntimeConfig:
    pidsLimit: 8096
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""

$ oc create -f ctrcfg-pidlimit.yaml
containerruntimeconfig.machineconfiguration.openshift.io/set-pids-limit created

$ oc get ctrcfg
NAME             AGE
set-pids-limit   6m38s

3. Upgrade the cluster from 4.5.41 to 4.11.0-0.nightly-2022-07-08-182347

$ oc get clusterversion -o yaml|yq -y '.items[].status.history'
- completionTime: '2022-07-11T10:37:32Z'
  image: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-08-182347
  startedTime: '2022-07-11T09:23:37Z'
  state: Completed
  verified: false
  version: 4.11.0-0.nightly-2022-07-08-182347
- completionTime: '2022-07-11T09:08:39Z'
  image: quay.io/openshift-release-dev/ocp-release:4.10.18-x86_64
  startedTime: '2022-07-11T07:53:23Z'
  state: Completed
  verified: false
  version: 4.10.18
- completionTime: '2022-07-11T07:48:53Z'
  image: quay.io/openshift-release-dev/ocp-release:4.9.42-x86_64
  startedTime: '2022-07-11T06:16:18Z'
  state: Completed
  verified: false
  version: 4.9.42
- completionTime: '2022-07-11T06:08:18Z'
  image: quay.io/openshift-release-dev/ocp-release:4.8.46-x86_64
  startedTime: '2022-07-11T04:50:35Z'
  state: Completed
  verified: false
  version: 4.8.46
- completionTime: '2022-07-11T04:29:54Z'
  image: quay.io/openshift-release-dev/ocp-release:4.7.54-x86_64
  startedTime: '2022-07-11T03:19:28Z'
  state: Completed
  verified: false
  version: 4.7.54
- completionTime: '2022-07-11T03:12:58Z'
  image: quay.io/openshift-release-dev/ocp-release:4.6.59-x86_64
  startedTime: '2022-07-11T02:05:56Z'
  state: Completed
  verified: false
  version: 4.6.59
- completionTime: '2022-07-11T01:49:11Z'
  image: quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
  startedTime: '2022-07-11T01:20:58Z'
  state: Completed
  verified: false
  version: 4.5.41
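
The history above can be condensed into a single upgrade path. The following is a minimal sketch, not part of the original verification: the function name `upgrade_path` is hypothetical, and it assumes the history is supplied on stdin in the same YAML layout that the `yq` command above prints (newest entry first, one `state:` and one `version:` line per entry).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads ClusterVersion history (YAML, newest first) on
# stdin, fails if any entry is not in state Completed, and otherwise prints
# the versions oldest-first as a single "a -> b -> c" line.
upgrade_path() {
  awk '
    $1 == "state:" && $2 != "Completed" { bad = 1 }   # e.g. a Partial update
    $1 == "version:" { v[++n] = $2 }                  # collect in printed order
    END {
      if (bad || n == 0) exit 1
      for (i = n; i >= 1; i--) printf "%s%s", v[i], (i > 1 ? " -> " : "\n")
    }'
}
```

Piping the output of the `oc get clusterversion -o yaml|yq -y '.items[].status.history'` command into `upgrade_path` would be one way to use it; a non-zero exit status means an entry was not Completed or no entries were found.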

4. Scale up a MachineSet to provision a new node

$ oc scale --replicas=2 machineset rioliu-071101-9kdz4-worker-us-east-2a -n openshift-machine-api
machineset.machine.openshift.io/rioliu-071101-9kdz4-worker-us-east-2a scaled

5. Check the status of the newly provisioned node

$ oc get node/ip-10-0-150-187.us-east-2.compute.internal -o yaml | yq -y '.metadata.annotations'
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-052e2dcc1c60c26f5","ifaddr":{"ipv4":"10.0.128.0/19"},"capacity":{"ipv4":9,"ipv6":10}}]'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0699401c438fa8d45"}'
machine.openshift.io/machine: openshift-machine-api/rioliu-071101-9kdz4-worker-us-east-2a-k4vhd
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/desiredConfig: rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/reason: ''
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: 'true'

The node is not degraded.
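
The annotation check in this step could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the procedure actually used here: the function name `mc_node_settled` is hypothetical, and it expects the annotations on stdin as `key: value` lines, in the form printed by the `yq` command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads node annotations as "key: value" lines on stdin.
# Succeeds only if the machine-config state is Done and the currentConfig
# annotation equals the desiredConfig annotation.
mc_node_settled() {
  local prefix="machineconfiguration.openshift.io"
  local line key value current="" desired="" state=""
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    key="${line%%: *}"     # text before the first ": "
    value="${line#*: }"    # text after the first ": "
    case "$key" in
      "$prefix/currentConfig") current="$value" ;;
      "$prefix/desiredConfig") desired="$value" ;;
      "$prefix/state")         state="$value" ;;
    esac
  done
  [ "$state" = "Done" ] && [ -n "$current" ] && [ "$current" = "$desired" ]
}
```

The annotations printed above could be piped straight into `mc_node_settled`; a degraded node would typically report a different state value and a non-empty `reason`, so the function would return non-zero.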

6. Check the CRI-O drop-in file mode on the newly provisioned node

$ oc debug node/ip-10-0-150-187.us-east-2.compute.internal -- chroot /host stat /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit 2>&1 | grep -v Warning
Starting pod/ip-10-0-150-187us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
  File: /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
  Size: 46        	Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 201327778   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:container_config_t:s0
Access: 2022-07-11 10:45:56.235725651 +0000
Modify: 2022-07-11 10:41:53.763000000 +0000
Change: 2022-07-11 10:44:42.678450201 +0000
 Birth: 2022-07-11 10:44:42.678450201 +0000

Removing debug pod ...
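
The bug summary reports the expected mode in three forms: symbolic (-rw-r--r--), decimal (420) and octal (0644). The sketch below shows how the decimal and octal forms relate and how a file's mode could be compared against an expected value. The function names `mode_dec_to_octal` and `check_file_mode` are hypothetical, and the check assumes GNU `stat`, where `-c '%a'` prints the access rights in octal.

```shell
#!/usr/bin/env bash
# Convert a decimal file mode (for example 420) to the octal form
# shown by stat and ls (for example 644).
mode_dec_to_octal() {
  printf '%o\n' "$1"
}

# Succeeds if the permission bits of file $1 equal the octal mode $2
# (given with or without a leading zero). Returns 2 if stat fails.
check_file_mode() {
  local actual
  actual="$(stat -c '%a' "$1")" || return 2
  # A leading 0 makes shell arithmetic read both values as octal.
  [ "$((0$actual))" -eq "$((0$2))" ]
}

mode_dec_to_octal 420   # prints 644
```

Run on the node, `check_file_mode /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit 0644` would succeed for the stat output above and would fail for the mode 0 state described in the bug summary.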


7. Check the MCD logs on the newly provisioned node

$ oc get pod -n openshift-machine-config-operator --field-selector spec.nodeName=ip-10-0-150-187.us-east-2.compute.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-m5lfs   2/2     Running   0          18m

$ oc logs -n openshift-machine-config-operator machine-config-daemon-m5lfs
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0711 10:46:23.440799    1690 start.go:112] Version: v4.11.0-202207070244.p0.g35d7962.assembly.stream-dirty (35d79621a58766190071f95415f0bef74ee204a7)
I0711 10:46:23.498245    1690 start.go:125] Calling chroot("/rootfs")
I0711 10:46:23.500288    1690 update.go:1962] Running: systemctl start rpm-ostreed
I0711 10:46:23.952275    1690 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0711 10:46:24.032407    1690 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0711 10:46:24.092063    1690 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31fbe3e17b35802c87ba7eb2fa862aa0f7542294c50dd9ad0f1445a1a5378603 (411.86.202207072134-0)
I0711 10:46:24.206562    1690 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0711 10:46:24.208420    1690 start.go:189] overriding kubernetes api to https://api-int.rioliu-071101.qe.devcluster.openshift.com:6443
I0711 10:46:24.219064    1690 metrics.go:106] Registering Prometheus metrics
I0711 10:46:24.219218    1690 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0711 10:46:24.222043    1690 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
I0711 10:46:24.222280    1690 update.go:1977] Starting to manage node: ip-10-0-150-187.us-east-2.compute.internal
I0711 10:46:24.239918    1690 rpm-ostree.go:324] Running captured: rpm-ostree status
I0711 10:46:24.307917    1690 daemon.go:1220] State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31fbe3e17b35802c87ba7eb2fa862aa0f7542294c50dd9ad0f1445a1a5378603
              CustomOrigin: Managed by machine-config-operator
                   Version: 411.86.202207072134-0 (2022-07-07T21:37:37Z)

  f9d88d07921009f524c39773d0935a7d1642a02bd37e0d621696bf4f766a0540
                   Version: 45.82.202008010929-0 (2020-08-01T09:33:23Z)
I0711 10:46:24.308694    1690 coreos.go:95] CoreOS aleph version: mtime=2020-08-01 09:35:48.964 +0000 UTC build=45.82.202008010929-0 imgid=rhcos-45.82.202008010929-0-qemu.x86_64.qcow2
No /etc/.ignition-result.json found
I0711 10:46:24.308859    1690 rpm-ostree.go:324] Running captured: journalctl --list-boots
I0711 10:46:24.315966    1690 daemon.go:1229] journalctl --list-boots:
-1 eca1ef0ce5a84f2c807475727a7a1025 Mon 2022-07-11 10:41:32 UTC—Mon 2022-07-11 10:44:54 UTC
 0 236ab8645b25460f8dc6d09e717dd1d6 Mon 2022-07-11 10:45:41 UTC—Mon 2022-07-11 10:46:24 UTC
I0711 10:46:24.315989    1690 rpm-ostree.go:324] Running captured: systemctl list-units --state=failed --no-legend
I0711 10:46:24.325287    1690 daemon.go:1244] systemd service state: OK
I0711 10:46:24.325310    1690 daemon.go:909] Starting MachineConfigDaemon
I0711 10:46:24.325415    1690 daemon.go:916] Enabling Kubelet Healthz Monitor
I0711 10:46:25.266810    1690 daemon.go:451] Node ip-10-0-150-187.us-east-2.compute.internal is not labeled node-role.kubernetes.io/master
I0711 10:46:25.267093    1690 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node ip-10-0-150-187.us-east-2.compute.internal: map[cloud.network.openshift.io/egress-ipconfig:[{"interface":"eni-052e2dcc1c60c26f5","ifaddr":{"ipv4":"10.0.128.0/19"},"capacity":{"ipv4":9,"ipv6":10}}] machine.openshift.io/machine:openshift-machine-api/rioliu-071101-9kdz4-worker-us-east-2a-k4vhd machineconfiguration.openshift.io/controlPlaneTopology:HighlyAvailable volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I0711 10:46:25.267785    1690 node.go:45] Setting initial node config: rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:25.295177    1690 daemon.go:1137] In bootstrap mode
I0711 10:46:25.295262    1690 daemon.go:1165] Current+desired config: rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:25.295304    1690 daemon.go:1175] state: Done
I0711 10:46:25.304709    1690 daemon.go:1425] No bootstrap pivot required; unlinking bootstrap node annotations
I0711 10:46:25.304836    1690 daemon.go:1463] Validating against pending config rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:25.304906    1690 rpm-ostree.go:324] Running captured: rpm-ostree kargs
I0711 10:46:25.431638    1690 daemon.go:1481] Validated on-disk state
I0711 10:46:25.467763    1690 daemon.go:1532] Completing pending config rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:35.492777    1690 update.go:1977] Update completed for config rendered-worker-db73a634550d8f3c8185739695540ab1 and node has been successfully uncordoned
I0711 10:46:35.499143    1690 daemon.go:1548] In desired config rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:35.530577    1690 config_drift_monitor.go:246] Config Drift Monitor started
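
A log like the one above can be scanned automatically for the failure signature from the bug summary ("unexpected on-disk state") and for the success marker seen here ("Validated on-disk state"). This is a minimal sketch: the function name `mcd_log_verdict` and its three output words are hypothetical, and matching on these two strings is an assumption based only on the messages quoted in this bug.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads machine-config-daemon log text on stdin and prints
#   degraded  - if the "unexpected on-disk state" error from this bug appears
#   validated - if "Validated on-disk state" appears and the error does not
#   unknown   - otherwise
mcd_log_verdict() {
  local log
  log="$(cat)"
  if printf '%s\n' "$log" | grep -q 'unexpected on-disk state'; then
    echo degraded
  elif printf '%s\n' "$log" | grep -q 'Validated on-disk state'; then
    echo validated
  else
    echo unknown
  fi
}
```

For example, the output of `oc logs -n openshift-machine-config-operator machine-config-daemon-m5lfs -c machine-config-daemon` could be piped into `mcd_log_verdict`.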

Comment 7 errata-xmlrpc 2022-08-10 11:20:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069