Bug 2104561
Summary: 4.10 to 4.11 update: Degraded node: unexpected on-disk state: mode mismatch for file: "/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit"; expected: -rw-r--r--/420/0644; received: ----------/0/0

Product: OpenShift Container Platform
Component: Machine Config Operator
Machine Config Operator sub component: Machine Config Operator
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Assignee: MCO Bug Bot <mco-triage>
QA Contact: Rio Liu <rioliu>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: hongkliu, jerzhang, mbargenq, mdewald, mkrejci, skumari, travi, wking
Version: 4.11
Target Release: 4.11.0
Keywords: Regression, ServiceDeliveryBlocker, Upgrades
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-08-10 11:20:25 UTC
Bug Depends On: 2102004
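The three values in the summary's mode string are one and the same permission in three notations: symbolic (`-rw-r--r--`), decimal (`420`), and octal (`0644`). Ignition configs carry file modes as decimal integers, which is why `420` shows up alongside `0644`. A quick sanity check of the equivalence (illustration only, not tied to any OpenShift code):

```python
import stat

# Ignition stores "mode" as a decimal integer: 420 decimal == 0644 octal.
expected_decimal = 420
assert expected_decimal == 0o644

# stat.filemode renders permission bits the way ls/stat do.
# S_IFREG marks a regular file, matching the leading "-".
assert stat.filemode(stat.S_IFREG | 0o644) == "-rw-r--r--"

# The degraded node instead saw mode 0 on the drop-in file:
assert stat.filemode(stat.S_IFREG | 0) == "----------"
```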
Comment 1
Yu Qi Zhang
2022-07-06 17:52:46 UTC
We decided not to mark this as a 4.11 blocker because it does not block upgrades or new 4.11 cluster installs. With this bug, upgrades may be slower because affected machines take longer to reboot. The fix should land soon in 4.11.z.

Sorry for the noise (happens when you have multiple bugs opened in different tabs :/), please ignore my last comment#3. I accidentally replied to the wrong bug; comment#3 was intended for bug https://bugzilla.redhat.com/show_bug.cgi?id=2104687.

Verified on 4.11.0-0.nightly-2022-07-08-182347:

1. Install OCP 4.5.41, which ships an old boot image (Ignition spec version 2).

2. Create a ContainerRuntimeConfig to change pidsLimit:

```
$ cat ctrcfg-pidlimit.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  containerRuntimeConfig:
    pidsLimit: 8096
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""

$ oc create -f ctrcfg-pidlimit.yaml
containerruntimeconfig.machineconfiguration.openshift.io/set-pids-limit created

$ oc get ctrcfg
NAME             AGE
set-pids-limit   6m38s
```

3. Upgrade the cluster from 4.5.41 to 4.11.0-0.nightly-2022-07-08-182347:

```
$ oc get clusterversion -o yaml | yq -y '.items[].status.history'
- completionTime: '2022-07-11T10:37:32Z'
  image: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-08-182347
  startedTime: '2022-07-11T09:23:37Z'
  state: Completed
  verified: false
  version: 4.11.0-0.nightly-2022-07-08-182347
- completionTime: '2022-07-11T09:08:39Z'
  image: quay.io/openshift-release-dev/ocp-release:4.10.18-x86_64
  startedTime: '2022-07-11T07:53:23Z'
  state: Completed
  verified: false
  version: 4.10.18
- completionTime: '2022-07-11T07:48:53Z'
  image: quay.io/openshift-release-dev/ocp-release:4.9.42-x86_64
  startedTime: '2022-07-11T06:16:18Z'
  state: Completed
  verified: false
  version: 4.9.42
- completionTime: '2022-07-11T06:08:18Z'
  image: quay.io/openshift-release-dev/ocp-release:4.8.46-x86_64
  startedTime: '2022-07-11T04:50:35Z'
  state: Completed
  verified: false
  version: 4.8.46
- completionTime: '2022-07-11T04:29:54Z'
  image: quay.io/openshift-release-dev/ocp-release:4.7.54-x86_64
  startedTime: '2022-07-11T03:19:28Z'
  state: Completed
  verified: false
  version: 4.7.54
- completionTime: '2022-07-11T03:12:58Z'
  image: quay.io/openshift-release-dev/ocp-release:4.6.59-x86_64
  startedTime: '2022-07-11T02:05:56Z'
  state: Completed
  verified: false
  version: 4.6.59
- completionTime: '2022-07-11T01:49:11Z'
  image: quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
  startedTime: '2022-07-11T01:20:58Z'
  state: Completed
  verified: false
  version: 4.5.41
```

4. Scale up a machineset to provision a new node:

```
$ oc scale --replicas=2 machineset rioliu-071101-9kdz4-worker-us-east-2a -n openshift-machine-api
machineset.machine.openshift.io/rioliu-071101-9kdz4-worker-us-east-2a scaled
```

5. Check the status of the newly provisioned node:

```
$ oc get node/ip-10-0-150-187.us-east-2.compute.internal -o yaml | yq -y '.metadata.annotations'
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-052e2dcc1c60c26f5","ifaddr":{"ipv4":"10.0.128.0/19"},"capacity":{"ipv4":9,"ipv6":10}}]'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0699401c438fa8d45"}'
machine.openshift.io/machine: openshift-machine-api/rioliu-071101-9kdz4-worker-us-east-2a-k4vhd
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/desiredConfig: rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-db73a634550d8f3c8185739695540ab1
machineconfiguration.openshift.io/reason: ''
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: 'true'
```

No degradation was found.

6. Check the CRI-O drop-in file mode on the newly provisioned node:

```
$ oc debug node/ip-10-0-150-187.us-east-2.compute.internal -- chroot /host stat /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit 2>&1 | grep -v Warning
Starting pod/ip-10-0-150-187us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
  File: /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
  Size: 46         Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 201327778   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:container_config_t:s0
Access: 2022-07-11 10:45:56.235725651 +0000
Modify: 2022-07-11 10:41:53.763000000 +0000
Change: 2022-07-11 10:44:42.678450201 +0000
 Birth: 2022-07-11 10:44:42.678450201 +0000

Removing debug pod ...
```

7. Check the MCD logs on the newly provisioned node:

```
$ oc get pod -n openshift-machine-config-operator --field-selector spec.nodeName=ip-10-0-150-187.us-east-2.compute.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-m5lfs   2/2     Running   0          18m

$ oc logs -n openshift-machine-config-operator machine-config-daemon-m5lfs
Defaulted container "machine-config-daemon" out of: machine-config-daemon, oauth-proxy
I0711 10:46:23.440799    1690 start.go:112] Version: v4.11.0-202207070244.p0.g35d7962.assembly.stream-dirty (35d79621a58766190071f95415f0bef74ee204a7)
I0711 10:46:23.498245    1690 start.go:125] Calling chroot("/rootfs")
I0711 10:46:23.500288    1690 update.go:1962] Running: systemctl start rpm-ostreed
I0711 10:46:23.952275    1690 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0711 10:46:24.032407    1690 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0711 10:46:24.092063    1690 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31fbe3e17b35802c87ba7eb2fa862aa0f7542294c50dd9ad0f1445a1a5378603 (411.86.202207072134-0)
I0711 10:46:24.206562    1690 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0711 10:46:24.208420    1690 start.go:189] overriding kubernetes api to https://api-int.rioliu-071101.qe.devcluster.openshift.com:6443
I0711 10:46:24.219064    1690 metrics.go:106] Registering Prometheus metrics
I0711 10:46:24.219218    1690 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0711 10:46:24.222043    1690 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
I0711 10:46:24.222280    1690 update.go:1977] Starting to manage node: ip-10-0-150-187.us-east-2.compute.internal
I0711 10:46:24.239918    1690 rpm-ostree.go:324] Running captured: rpm-ostree status
I0711 10:46:24.307917    1690 daemon.go:1220] State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31fbe3e17b35802c87ba7eb2fa862aa0f7542294c50dd9ad0f1445a1a5378603
    CustomOrigin: Managed by machine-config-operator
    Version: 411.86.202207072134-0 (2022-07-07T21:37:37Z)

  f9d88d07921009f524c39773d0935a7d1642a02bd37e0d621696bf4f766a0540
    Version: 45.82.202008010929-0 (2020-08-01T09:33:23Z)
I0711 10:46:24.308694    1690 coreos.go:95] CoreOS aleph version: mtime=2020-08-01 09:35:48.964 +0000 UTC build=45.82.202008010929-0 imgid=rhcos-45.82.202008010929-0-qemu.x86_64.qcow2
No /etc/.ignition-result.json found
I0711 10:46:24.308859    1690 rpm-ostree.go:324] Running captured: journalctl --list-boots
I0711 10:46:24.315966    1690 daemon.go:1229] journalctl --list-boots:
-1 eca1ef0ce5a84f2c807475727a7a1025 Mon 2022-07-11 10:41:32 UTC—Mon 2022-07-11 10:44:54 UTC
 0 236ab8645b25460f8dc6d09e717dd1d6 Mon 2022-07-11 10:45:41 UTC—Mon 2022-07-11 10:46:24 UTC
I0711 10:46:24.315989    1690 rpm-ostree.go:324] Running captured: systemctl list-units --state=failed --no-legend
I0711 10:46:24.325287    1690 daemon.go:1244] systemd service state: OK
I0711 10:46:24.325310    1690 daemon.go:909] Starting MachineConfigDaemon
I0711 10:46:24.325415    1690 daemon.go:916] Enabling Kubelet Healthz Monitor
I0711 10:46:25.266810    1690 daemon.go:451] Node ip-10-0-150-187.us-east-2.compute.internal is not labeled node-role.kubernetes.io/master
I0711 10:46:25.267093    1690 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node ip-10-0-150-187.us-east-2.compute.internal: map[cloud.network.openshift.io/egress-ipconfig:[{"interface":"eni-052e2dcc1c60c26f5","ifaddr":{"ipv4":"10.0.128.0/19"},"capacity":{"ipv4":9,"ipv6":10}}] machine.openshift.io/machine:openshift-machine-api/rioliu-071101-9kdz4-worker-us-east-2a-k4vhd machineconfiguration.openshift.io/controlPlaneTopology:HighlyAvailable volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I0711 10:46:25.267785    1690 node.go:45] Setting initial node config: rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:25.295177    1690 daemon.go:1137] In bootstrap mode
I0711 10:46:25.295262    1690 daemon.go:1165] Current+desired config: rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:25.295304    1690 daemon.go:1175] state: Done
I0711 10:46:25.304709    1690 daemon.go:1425] No bootstrap pivot required; unlinking bootstrap node annotations
I0711 10:46:25.304836    1690 daemon.go:1463] Validating against pending config rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:25.304906    1690 rpm-ostree.go:324] Running captured: rpm-ostree kargs
I0711 10:46:25.431638    1690 daemon.go:1481] Validated on-disk state
I0711 10:46:25.467763    1690 daemon.go:1532] Completing pending config rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:35.492777    1690 update.go:1977] Update completed for config rendered-worker-db73a634550d8f3c8185739695540ab1 and node has been successfully uncordoned
I0711 10:46:35.499143    1690 daemon.go:1548] In desired config rendered-worker-db73a634550d8f3c8185739695540ab1
I0711 10:46:35.530577    1690 config_drift_monitor.go:246] Config Drift Monitor started
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
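The "Validating against pending config ... Validated on-disk state" log lines refer to the MCD comparing each managed file's actual on-disk state against what the rendered config expects. A minimal sketch of that kind of mode check, assuming a hypothetical `check_mode` helper; this is an illustration in Python, not the machine-config-daemon's actual Go implementation:

```python
import os
import stat
import tempfile

def check_mode(path: str, expected_mode: int) -> str:
    """Compare a file's permission bits against the expected mode.

    Hypothetical helper; returns a message mirroring the degraded-node
    error format from the bug summary on mismatch, "ok" otherwise.
    """
    actual = stat.S_IMODE(os.stat(path).st_mode)  # strip file-type bits
    if actual != expected_mode:
        return (f"mode mismatch for file: {path!r}; "
                f"expected: {oct(expected_mode)}; received: {oct(actual)}")
    return "ok"

if __name__ == "__main__":
    # A temp file stands in for /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    os.chmod(path, 0o644)
    print(check_mode(path, 0o644))   # healthy node: "ok"
    os.chmod(path, 0)                # reproduce the bad on-disk state
    print(check_mode(path, 0o644))   # degraded node: "mode mismatch ..."
    os.chmod(path, 0o644)
    os.unlink(path)
```

On a mismatch the real daemon marks the node Degraded rather than silently repairing the file, which is exactly the symptom this bug tracked.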