Bug 1916169

Summary: A reboot while the MCO is applying changes leaves the node in an undesirable state and the MCP looks fine (UPDATED=true)

Product: OpenShift Container Platform
Component: Machine Config Operator
Machine Config Operator sub component: Machine Config Operator
Version: 4.7
Target Release: 4.10.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Reopened
Reporter: Sabina Aledort <saledort>
Assignee: mkenigsb
QA Contact: Rio Liu <rioliu>
Flags: mkenigsb: needinfo- (set twice)
CC: alukiano, aos-bugs, bbreard, dblack, dollierp, fromani, imcleod, jlebon, jligon, keyoung, markmc, mcornea, miabbott, mkarg, mkenigsb, mkrejci, nstielau, rioliu, sbkirova, skumari, smilner, tmulquee, vgrinber, walters, william.caban, ykashtan
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
* Before this update, the Machine Config Operator (MCO) stored pending configuration changes to disk before it applied them to {op-system-first}. If a power loss interrupted the MCO while it was applying the configuration, the MCO treated the configuration as applied and did not validate the changes. If the configuration contained invalid changes, applying them failed. With this update, the MCO saves a configuration to disk only after applying it. This way, if power is lost while the MCO is applying the configuration, it reapplies the configuration after it restarts. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1916169[BZ#1916169])
Type: Bug
Last Closed: 2022-03-10 16:02:37 UTC
Description
Sabina Aledort
2021-01-14 11:43:49 UTC
I looked at the machine-config-daemon log, and I see that for some reason the MCD decided that the node has the desired config when in reality it did not apply the kernel arguments at all.

```
I0114 05:10:26.414913 7452 update.go:1852] Starting update from rendered-worker-duprofile-fddde77e6dd45d93e36ecf425f9f2e3a to rendered-worker-duprofile-0469ef4eb76db5f6c67f60132d80d452: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false} <- start of the update
...
A lot of errors regarding "network is unreachable"
...
W0114 05:13:40.994793 7452 run.go:44] nice failed: running nice -- ionice -c 3 podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14 failed: Error: unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: (Mirrors also failed: [dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com:5000/ocp4/openshift4@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: error pinging docker registry dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com:5000: Get "https://dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com:5000/v2/": dial tcp: lookup dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com on 10.11.5.19:53: dial udp 10.11.5.19:53: connect: network is unreachable]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: error pinging docker registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.11.5.19:53: dial udp 10.11.5.19:53: connect: network is unreachable : exit status 125; retrying...
...
I0114 05:18:30.769401 10199 start.go:108] Version: v4.7.0-202101130853.p0-dirty (69ac8b941b0f29d3cfdfced35aded406d75bc84a) <- last error, then a jump back to startup
...
I0114 05:18:32.172888 10199 daemon.go:401] Node cnfdd7.clus2.t5g.lab.eng.bos.redhat.com is not labeled node-role.kubernetes.io/master
I0114 05:18:32.177594 10199 daemon.go:816] Current config: rendered-worker-duprofile-fddde77e6dd45d93e36ecf425f9f2e3a
I0114 05:18:32.177604 10199 daemon.go:817] Desired config: rendered-worker-duprofile-0469ef4eb76db5f6c67f60132d80d452
I0114 05:18:32.184604 10199 update.go:1852] Disk currentConfig rendered-worker-duprofile-0469ef4eb76db5f6c67f60132d80d452 overrides node's currentConfig annotation rendered-worker-duprofile-fddde77e6dd45d93e36ecf425f9f2e3a
I0114 05:18:32.186708 10199 daemon.go:1102] Validating against current config rendered-worker-duprofile-0469ef4eb76db5f6c67f60132d80d452
I0114 05:18:32.206530 10199 daemon.go:1110] Validated on-disk state
I0114 05:18:32.206541 10199 daemon.go:1181] In desired config rendered-worker-duprofile-0469ef4eb76db5f6c67f60132d80d452 <- the MCD decided that the machine has the desired state and updated the current machine config, when in reality it never updated the kernel arguments
```

On a little further analysis, this issue appears specifically in our automation of the DU profile, which is using the PAO. We are not seeing this issue in other automated test cases, so we may need to use our environment to correct the issue.

Can you please attach a must-gather? I *think* I know the problem, but would need more information.
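To make the suspected mechanism concrete, here is a minimal, hypothetical Go sketch of the write-before-apply ordering visible in the log above. The names `applyUpdate`, `validateOnDiskState`, and the state file path are illustrative stand-ins, not the MCD's actual code:

```
package main

import (
	"fmt"
	"os"
)

// stateFile stands in for the MCD's on-disk currentConfig record.
const stateFile = "/tmp/currentconfig"

// applyUpdate models the pre-fix ordering: the pending config is persisted
// to disk *before* the OS changes (kargs, kernel-rt, ...) are applied. If
// the node reboots between the two steps, the on-disk state already claims
// the new config even though nothing was applied.
func applyUpdate(desired string, applyOS func() error) error {
	if err := os.WriteFile(stateFile, []byte(desired), 0o644); err != nil {
		return err
	}
	// <-- a power loss or hard reset here leaves stateFile == desired
	return applyOS()
}

// validateOnDiskState models the post-reboot check seen in the log: the disk
// currentConfig overrides the node's currentConfig annotation, so the daemon
// concludes it is "In desired config".
func validateOnDiskState(desired string) bool {
	b, err := os.ReadFile(stateFile)
	return err == nil && string(b) == desired
}

func main() {
	// Simulate the interruption: the OS update never completes.
	_ = applyUpdate("rendered-worker-duprofile-0469ef4e", func() error {
		return fmt.Errorf("node rebooted mid-apply")
	})
	// After the "reboot", validation trusts the on-disk config anyway.
	fmt.Println(validateOnDiskState("rendered-worker-duprofile-0469ef4e")) // true
}
```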
Ben,

The CI run is here: https://master-jenkins-csb-cnfqe.cloud.paas.psi.redhat.com/view/CNF%20Dev/job/ran-tests/44/

A must-gather is here: https://master-jenkins-csb-cnfqe.cloud.paas.psi.redhat.com/view/CNF%20Dev/job/ran-tests/44/artifact/must-gather.tar.gz

/KenY

It seems that the original problem is what Artyom mentioned in comment#1. Summarizing the problem after looking at the must-gather log and the log of the machine-config-daemon-gk49g pod running on node cnfdd7.clus2.t5g.lab.eng.bos.redhat.com:

- The worker was initially running fine, and then the image registry and pull-secret were updated. <--- Applied successfully

```
2021-01-14T02:51:04.726315851Z I0114 02:51:04.726275 7452 update.go:1852] Starting update from rendered-worker-11d9e183829cdce30de4b92da89fe047 to rendered-worker-fddde77e6dd45d93e36ecf425f9f2e3a: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
2021-01-14T02:51:04.759054365Z I0114 02:51:04.759025 7452 update.go:512] File diff: detected change to /etc/containers/registries.conf
2021-01-14T02:51:04.759121811Z I0114 02:51:04.759113 7452 update.go:512] File diff: detected change to /var/lib/kubelet/config.json
...
```

- A new MachineConfigPool, worker-duprofile, was created, and the node cnfdd7.clus2.t5g.lab.eng.bos.redhat.com was added to that pool. <--- Applied successfully

- MachineConfig 50-nto-worker-duprofile.yaml was applied, which adds kargs, files, and units and switches to the RT kernel. <---- Problem started here

```
2021-01-14T05:10:26.361347085Z I0114 05:10:26.361303 7452 update.go:596] Checking Reconcilable for config rendered-worker-duprofile-fddde77e6dd45d93e36ecf425f9f2e3a to rendered-worker-duprofile-0469ef4eb76db5f6c67f60132d80d452
2021-01-14T05:10:26.414948492Z I0114 05:10:26.414913 7452 update.go:1852] Starting update from rendered-worker-duprofile-fddde77e6dd45d93e36ecf425f9f2e3a to rendered-worker-duprofile-0469ef4eb76db5f6c67f60132d80d452: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
2021-01-14T05:10:26.442420037Z I0114 05:10:26.442397 7452 update.go:1852] Update prepared; beginning drain
...
2021-01-14T05:12:10.153351159Z I0114 05:12:10.153328 7452 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-662841183 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14
2021-01-14T05:12:10.379123537Z error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.11.5.19:53: dial udp 10.11.5.19:53: connect: network is unreachable
2021-01-14T05:12:10.382602694Z W0114 05:12:10.382581 7452 run.go:44] nice failed: running nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-662841183 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14 failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.11.5.19:53: dial udp 10.11.5.19:53: connect: network is unreachable
2021-01-14T05:12:10.382602694Z : exit status 1; retrying...
...
```
```
2021-01-14T05:12:25.537532667Z I0114 05:12:25.537525 7452 update.go:371] Falling back to using podman cp to fetch OS image content
2021-01-14T05:12:25.537559798Z I0114 05:12:25.537551 7452 run.go:18] Running: nice -- ionice -c 3 podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14
2021-01-14T05:12:25.627480952Z Error: unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: (Mirrors also failed: [dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com:5000/ocp4/openshift4@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: error pinging docker registry dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com:5000: Get "https://dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com:5000/v2/": dial tcp: lookup dhcp19-17-175.clus2.t5g.lab.eng.bos.redhat.com on 10.11.5.19:53: dial udp 10.11.5.19:53: connect: network is unreachable]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14: error pinging docker registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.11.5.19:53: dial udp 10.11.5.19:53: connect: network is unreachable
...
```

Due to the network being unreachable, the MCO could not download the machine-os-content that it uses later on to switch to the RT kernel. This looks to me like either the pull-secret config change or the image registry change might have caused the network to become unreachable. @Sabina can you please check?

To add to the issue, we are seeing this error message:

```
I0113 03:29:17.566348 9261 update.go:445] Rolling back applied changes to OS due to error: error running rpm-ostree kargs --delete=skew_tick=1 --delete=nohz=on --delete=rcu_nocbs=1,3,5,7,9-51 --delete=tuned.non_isolcpus=000000ff,ffffffff,fff00000,00000155 --delete=intel_pstate=disable --delete=nosoftlockup --delete=tsc=nowatchdog --delete=intel_iommu=on --delete=iommu=pt --delete=isolcpus=managed_irq,1,3,5,7,9-51 --delete=systemd.cpu_affinity=0,2,4,6,8,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 --delete=default_hugepagesz=1G --delete=nosmt --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1,3,5,7,9-51 --append=tuned.non_isolcpus=000000ff,ffffffff,fff00000,00000155 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1,3,5,7,9-51 --append=systemd.cpu_affinity=0,2,4,6,8,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 --append=default_hugepagesz=1G --append=nosmt: error: No key 'skew_tick' found : exit status 1
```

We are seeing this error because the MCO keeps retrying to move to the desired config. Note from comment#7 that the MCO (rpm-ostree) never successfully applied any kargs, and hence during recovery `rpm-ostree kargs --delete=skew_tick=1 ...` failed, because rpm-ostree had never applied those kargs in the first place. This is happening because the MCO today does not check for a kargs diff between the old and new MachineConfig before updating kargs.
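Conceptually, the guard that avoids this rollback failure is a kargs diff check before invoking rpm-ostree. A hedged Go sketch of that idea (function names and the `run` callback are illustrative, not the actual MCO change, which landed later in the linked PR):

```
package main

import (
	"fmt"
	"reflect"
)

// updateKernelArguments shells out to rpm-ostree only when the old and new
// rendered configs actually disagree on kernel arguments. Without the guard,
// a retry after a failed update tries to --delete kargs that rpm-ostree
// never applied, which fails with "No key 'skew_tick' found".
func updateKernelArguments(oldKargs, newKargs []string, run func(args ...string) error) error {
	if reflect.DeepEqual(oldKargs, newKargs) {
		return nil // no kargs diff: skip rpm-ostree entirely
	}
	args := []string{"kargs"}
	for _, k := range oldKargs {
		args = append(args, "--delete="+k)
	}
	for _, k := range newKargs {
		args = append(args, "--append="+k)
	}
	return run(args...) // e.g. exec.Command("rpm-ostree", args...).Run()
}

func main() {
	invoked := false
	run := func(args ...string) error { invoked = true; return nil }
	// Identical kargs on both sides: rpm-ostree must not be invoked.
	_ = updateKernelArguments([]string{"nosmt"}, []string{"nosmt"}, run)
	fmt.Println("rpm-ostree invoked:", invoked) // false
}
```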
We will fix this, but please note that the actual cause of the failure lies in comment#7.

I was able to pull the image manually on the affected node.

```
[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS                     ROLES                     AGE   VERSION
cnfdd5.clus2.t5g.lab.eng.bos.redhat.com          Ready                      worker,worker-cnf         16h   v1.20.0+d9c52cc
cnfdd6.clus2.t5g.lab.eng.bos.redhat.com          Ready                      worker,worker-cnf         16h   v1.20.0+d9c52cc
cnfdd7.clus2.t5g.lab.eng.bos.redhat.com          Ready,SchedulingDisabled   worker,worker-duprofile   16h   v1.20.0+d9c52cc
dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com   Ready                      master,virtual            17h   v1.20.0+d9c52cc
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready                      master,virtual            17h   v1.20.0+d9c52cc
dhcp19-17-117.clus2.t5g.lab.eng.bos.redhat.com   Ready                      master,virtual            17h   v1.20.0+d9c52cc

[root@cnfdd5-installer ~]# ssh core.t5g.lab.eng.bos.redhat.com
Red Hat Enterprise Linux CoreOS 47.83.202101171239-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
Last login: Wed Jan 20 16:49:49 2021 from 10.19.17.175
[core@cnfdd7 ~]$ sudo -i
[root@cnfdd7 ~]# podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c41d0f9f751623c8dc2d1c02a944642a93b80f4aa80c364218c97b8bfb3c14
a117ce8f9053528c60b17c8cafcf16aaa5dcb71e7187b00dafdb3ccca39b91c6
[root@cnfdd7 ~]#
```

```
[root@cnfdd5-installer ~]# oc describe node cnfdd7.clus2.t5g.lab.eng.bos.redhat.com
Name:               cnfdd7.clus2.t5g.lab.eng.bos.redhat.com
Roles:              worker,worker-duprofile
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=cnfdd7.clus2.t5g.lab.eng.bos.redhat.com
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node-role.kubernetes.io/worker-duprofile=
                    node.openshift.io/os_id=rhcos
                    ptp/slave=
Annotations:        k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_cnfdd7.clus2.t5g.lab.eng.bos.redhat.com","mac-address":"0c:42:a1:55:e4:8e","ip-addresses...
                    k8s.ovn.org/node-chassis-id: aca6e191-5798-4313-ab86-45a77cade266
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.10.219"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: ca:3e:43:82:4f:3f
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.19.16.104/23","ipv6":"2620:52:0:1310:5c4f:1483:696b:da9b/64"}
                    k8s.ovn.org/node-subnets: {"default":"10.133.2.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/cnfdd5-9v4v4-worker-0-tnqz4
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-duprofile-ca50fb1e10457688976170bcb20389b8
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-duprofile-e5cca08848ad42fb183fccdb4b27420d
                    machineconfiguration.openshift.io/reason: error running rpm-ostree kargs --delete=skew_tick=1 --delete=nohz=on --delete=rcu_nocbs=1,3,5,7,9-51 --delete=tuned.non_isolcpus=000000ff,...
                      : exit status 1
                    machineconfiguration.openshift.io/state: Degraded
                    sriovnetwork.openshift.io/state: Idle
                    volumes.kubernetes.io/controller-managed-attach-detach: true
```

Maybe it's because the image extract command doesn't wait for NetworkManager to finish?
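On the "doesn't wait for NetworkManager" hypothesis: one way a client can sidestep that race is to poll for registry reachability before the first pull attempt. A speculative Go sketch only, assuming a simple TCP probe is an acceptable readiness signal; this is not what the MCD actually does, and the host and timeouts are arbitrary:

```
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForRegistry polls a TCP dial to the registry until it succeeds or the
// deadline passes, instead of letting the first pull race DNS/NM setup.
func waitForRegistry(addr string, deadline time.Duration) error {
	stop := time.Now().Add(deadline)
	for {
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(stop) {
			return fmt.Errorf("registry %s unreachable: %w", addr, err)
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	if err := waitForRegistry("quay.io:443", 30*time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("network up; safe to start the image pull")
}
```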
```
I0113 03:15:58.010936 8404 update.go:1470] Writing systemd unit dropin "10-ovs-vswitchd-restart.conf"
I0113 03:15:58.249746 8404 update.go:1459] Preset systemd unit ovs-vswitchd.service
I0113 03:15:58.249762 8404 update.go:1470] Writing systemd unit dropin "10-ovsdb-restart.conf"
I0113 03:15:58.250067 8404 update.go:1470] Writing systemd unit dropin "10-mco-default-env.conf"
I0113 03:15:58.255736 8404 update.go:1542] Could not reset unit preset for pivot.service, skipping. (Error msg: error running preset on unit: Failed to preset unit: Unit file pivot.service does not exist. )
I0113 03:15:58.255748 8404 update.go:1470] Writing systemd unit dropin "mco-disabled.conf"
I0113 03:15:58.260998 8404 update.go:1542] Could not reset unit preset for zincati.service, skipping. (Error msg: error running preset on unit: Failed to preset unit: Unit file zincati.service does not exist. )
I0113 03:15:58.261010 8404 update.go:1505] Writing systemd unit "stalld.service"
I0113 03:15:58.261196 8404 update.go:1505] Writing systemd unit "update-rps@.service"
I0113 03:15:58.498474 8404 update.go:1459] Preset systemd unit update-rps@.service
I0113 03:15:58.498495 8404 update.go:1505] Writing systemd unit "hugepages-allocation-1048576kB-NUMA0.service"
I0113 03:15:58.747437 8404 update.go:1437] Enabled systemd units: [etc-NetworkManager-system\x2dconnections\x2dmerged.mount kubelet.service machine-config-daemon-firstboot.service machine-config-daemon-pull.service node-valid-hostname.service nodeip-configuration.service openvswitch.service ovs-configuration.service ovsdb-server.service stalld.service hugepages-allocation-1048576kB-NUMA0.service]
I0113 03:15:58.747461 8404 update.go:1288] Deleting stale data
I0113 03:15:58.772154 8404 update.go:1683] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0113 03:15:58.798307 8404 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-893969747 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:41b5ef878bdd94890a904535618abb95e190f011a0f3bb21b3f9c36baa74c288
I0113 03:16:06.563475 8404 daemon.go:600] Got SIGTERM, but actively updating
W0113 03:16:06.570524 8404 run.go:44] nice failed: running nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-893969747 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:41b5ef878bdd94890a904535618abb95e190f011a0f3bb21b3f9c36baa74c288 failed: : signal: terminated; retrying...
W0113 03:16:10.810028 8404 daemon.go:645] Got an error from auxiliary tools: kubelet health check has failed 1 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
I0113 03:16:11.570645 8404 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-893969747 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:41b5ef878bdd94890a904535618abb95e190f011a0f3bb21b3f9c36baa74c288
error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:41b5ef878bdd94890a904535618abb95e190f011a0f3bb21b3f9c36baa74c288: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.11.5.19:53: dial udp 10.11.5.19:53: connect: network is unreachable
```

When we extract the OSContainer image, RHCOS is fully running along with the basic systemd services that are part of boot.
The NM-related logs that you are seeing in machine-config-daemon:

```
I0113 03:15:58.747437 8404 update.go:1437] Enabled systemd units: [etc-NetworkManager-system\x2dconnections\x2dmerged.mount kubelet.service machine-config-daemon-firstboot.service machine-config-daemon-pull.service node-valid-hostname.service nodeip-configuration.service openvswitch.service ovs-configuration.service ovsdb-server.service stalld.service hugepages-allocation-1048576kB-NUMA0.service]
```

This basically just makes sure that the NM service will start when the node reboots; it doesn't impact the active boot.

Adding here an offline discussion that I had with Sabina: Sabina provided me access to a cluster where the next CI run occurred. Looking at the machine-config-daemon pod logs, I can confirm that the MCO was able to successfully switch to the RT kernel on the node in the worker-duprofile pool. Adding the relevant logs here:

```
$ oc logs -f -n openshift-machine-config-operator -c machine-config-daemon machine-config-daemon-2rr6m
...
I0121 12:14:50.270445 6850 update.go:1852] Starting update from rendered-worker-243559e29aa74a48d04e1cb5430ec055 to rendered-worker-duprofile-131d78f004c218a04fca898f5c2d5d99: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
...
I0121 12:14:52.477586 6850 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-129809413 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8672a3ce64788e4138bec0d2cafe701d96b55c35482314d393b52dd414e635a4
I0121 12:15:05.593487 6850 update.go:1852] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1,3,5,7,9-51 --append=tuned.non_isolcpus=00000155 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1,3,5,7,9-51 --append=systemd.cpu_affinity=0,2,4,6,8 --append=default_hugepagesz=1G --append=nosmt]
I0121 12:15:05.595787 6850 rpm-ostree.go:261] Running captured: rpm-ostree kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1,3,5,7,9-51 --append=tuned.non_isolcpus=00000155 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1,3,5,7,9-51 --append=systemd.cpu_affinity=0,2,4,6,8 --append=default_hugepagesz=1G --append=nosmt
I0121 12:15:08.299351 6850 update.go:1852] Initiating switch from kernel default to realtime
I0121 12:15:08.301828 6850 update.go:1852] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I0121 12:15:08.303455 6850 rpm-ostree.go:261] Running captured: rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm
I0121 12:16:07.246397 6850 update.go:1852] Rebooting node
I0121 12:16:07.248542 6850 update.go:1852] initiating reboot: Node will reboot into config rendered-worker-duprofile-131d78f004c218a04fca898f5c2d5d99
...
```

Also, I have fixed in the MCO the problem that surfaced in comment#8: https://github.com/openshift/machine-config-operator/pull/2349
Now the MCO will not reapply kargs when it doesn't detect any change in kargs.

I think the problems here all surface from that kernel switch. We have multiple open BZs on kernel-rt that manifest on that node, such as https://bugzilla.redhat.com/show_bug.cgi?id=1886109 (note that it is verified and attached to a RHEL errata, but hasn't hit OCP yet), and I think these can be the root cause of the network problems.

Here's a nice and simple reproducer:

```
apiVersion: v1
kind: Namespace
metadata:
  labels:
    openshift.io/run-level: "1"
  name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: "4.6"
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-mh-dpdk-site-1-fqdn-worker1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  isRdma: false
  nicSelector:
    pfNames:
    - ens7f1
  nodeSelector:
    node-role.kubernetes.io/worker-duprofile: ""
  numVfs: 4
  priority: 10
  resourceName: mh_u_site_1_fqdn_worker1
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-duprofile
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values:
      - worker-duprofile
      - worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-duprofile: ""
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: reproducer-example
  labels:
    machineconfiguration.openshift.io/role: worker-duprofile
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    storage: {}
  fips: false
  kernelArguments:
  - nosmt
  kernelType: realtime
  osImageURL: ""
```

To use it: this needs a node with an Intel SR-IOV capable NIC. Make sure to update the SriovNetworkNodePolicy with that NIC name, then:

1. oc apply -f reproducer.yaml  # it is expected to fail on missing CRDs
2. Wait for the cluster to settle and sriov-network-operator to become operational.
3. Apply worker-duprofile to the node.
4. oc apply -f reproducer.yaml  # again, to apply the missing CRs
5. You can inspect the sriov daemon and machine-config-daemon on that node to see what is happening.

And that is it.

Yeah, Vitaly is right! An easier, even simpler reproducer would be without SR-IOV, just:

1. Create the MCP.
2. Label the node.
3. Apply the MC above (or similar, preferably something that will take long enough to run).
4. Reboot the node externally while that is applying.

Ben, I believe this issue should be documented in the 4.7 Release Notes. I am thinking something like this:

Cause: If an asynchronous reboot, such as a power interruption, coincides with changes in the Machine Config policy, the node can be left in an undetermined state. The Machine Config Operator believes that the updated policy has been applied when it actually has not.

Consequence: The node is left in an indeterminate state.

Workaround (if any): If the reboot is caused by applying MCO configuration as well as other configuration that requires a restart of the server, this can be avoided. First apply all MCO configuration and wait for the nodes to settle. Then apply the secondary configuration.
There is no known workaround for power interruptions.

Result: If the configuration of the MCO and the other configuration (such as SR-IOV on an Intel NIC) is completed sequentially, the node will provision correctly.

What do you think? By the way, the deadline for identifying bugs for the Release Notes is close of business tomorrow.

Regards, Ken Y

*** Bug 1928266 has been marked as a duplicate of this bug. ***

This shouldn't be happening now, as this was fixed as part of SR-IOV operator PR https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/93. The SR-IOV operator now coordinates with the MCO before making changes to a node. See bz#1921321

@skumari I tend to disagree. This has nothing to do with SRIOV, as the reboot can simply be the result of a power loss (or any other kind of interruption). The MCD should be resilient to these kinds of failures on its own.

Thanks, Yuval, for pointing out the power-outage situation; I missed some of the comments while going through them during bug refinement. I tried to reproduce the problem by applying the RT kernel MachineConfig and rebooting the node in the middle of the kernel-switching operation:

1. Couldn't reproduce this exact bug when the reboot was initiated manually by the user via rsh/ssh into the node.
2. Can reproduce the problem with a hard reset of the node externally (on GCP, a reset from the GCE UI).

Reopening this bug; we will try to fix it when we have some free cycles.

Created attachment 1853109 [details]
reproducer script
Added a script to reproduce. The final command should output a kernel version containing "rt", but it does not.
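The script itself is an attachment, but the verification step it describes can be sketched. A hypothetical Go equivalent of checking `uname -r` for an rt kernel (Linux-only; this is not the attached script):

```
package main

import (
	"fmt"
	"strings"
	"syscall"
)

// kernelRelease is the Go equivalent of `uname -r` (Linux-only).
func kernelRelease() (string, error) {
	var uts syscall.Utsname
	if err := syscall.Uname(&uts); err != nil {
		return "", err
	}
	out := make([]byte, 0, len(uts.Release))
	for _, c := range uts.Release {
		if c == 0 {
			break
		}
		out = append(out, byte(c))
	}
	return string(out), nil
}

func main() {
	rel, err := kernelRelease()
	if err != nil {
		fmt.Println("uname failed:", err)
		return
	}
	// On a node that really switched to kernel-rt, the release string
	// contains "rt"; after hitting this bug, it does not.
	fmt.Printf("kernel %s, realtime: %v\n", rel, strings.Contains(rel, "rt"))
}
```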
Final text signed off by @mkenigsb:

Previously, the MCO stored pending configuration changes to disk before OS changes were applied. As a result, in situations such as power loss, the MCO assumed on restart that the OS changes had already been applied, and validation skipped over changes such as kargs and kernel-rt. This was remedied by storing configuration changes to disk only after the OS changes were applied. Now, if power is lost during the configuration application, the MCO knows it must reattempt the configuration application on restart.

The text from Tony looks good! It's not currently entered in the bug's Doc Text field, though.

This should be fixed more comprehensively by https://github.com/openshift/enhancements/pull/1032 (which will effectively implement https://github.com/openshift/machine-config-operator/issues/1190).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
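To close the loop on the shipped fix: the ordering from the sketch in the description is simply inverted, so the config is persisted only after the OS changes succeed. A minimal, hypothetical Go sketch mirroring the final doc text (illustrative names only, not the MCO's actual code):

```
package main

import (
	"fmt"
	"os"
)

// stateFile stands in for the MCD's on-disk currentConfig record.
const stateFile = "/tmp/currentconfig"

// applyUpdateFixed models the post-fix ordering: apply first, persist second.
// An interruption mid-apply leaves the old config on disk, so on restart the
// daemon sees it is not yet in the desired config and reattempts the update.
func applyUpdateFixed(desired string, applyOS func() error) error {
	if err := applyOS(); err != nil {
		return err // stateFile still holds the previous config
	}
	return os.WriteFile(stateFile, []byte(desired), 0o644)
}

func main() {
	err := applyUpdateFixed("rendered-worker-duprofile-0469ef4e", func() error {
		return fmt.Errorf("node rebooted mid-apply")
	})
	fmt.Println(err) // update failed; on-disk config unchanged, so it is retried
}
```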