Gathered data URL: shorturl.at/pvE79

Hello,

The customer upgraded their cluster and the update is stuck with a random worker node in a degraded state. According to the latest must-gather, both the current and desired config are set to the previous MachineConfig.

~~~
2021-04-15T13:51:56.801540297Z I0415 13:51:56.801482 2217 daemon.go:808] Current+desired config: rendered-worker-1d0866ca1672b3108f63cb78ee21e9a5
2021-04-15T13:51:56.808645810Z I0415 13:51:56.808603 2217 update.go:1676] Disk currentConfig rendered-worker-6884f95ab6f02681824be2fe6d68d049 overrides node's currentConfig annotation rendered-worker-1d0866ca1672b3108f63cb78ee21e9a5
2021-04-15T13:51:56.812541252Z I0415 13:51:56.812493 2217 daemon.go:1085] Validating against current config rendered-worker-6884f95ab6f02681824be2fe6d68d049
2021-04-15T13:51:56.812643566Z E0415 13:51:56.812629 2217 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-6884f95ab6f02681824be2fe6d68d049: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26"
~~~

The MCD is still complaining about the osImageURL.
--------------------------------------------------------------------
We followed the KCS https://access.redhat.com/solutions/5598401 and got:

sh-4.4# export HTTP_PROXY='http://http-proxy.cezdata.corp:8080'
sh-4.4# export HTTPS_PROXY='http://http-proxy.cezdata.corp:8080'
sh-4.4# /run/bin/machine-config-daemon pivot quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
I0415 15:18:00.786311 85973 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-367690970 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
I0415 15:18:45.750981 85973 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I0415 15:18:45.840305 85973 rpm-ostree.go:179] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
I0415 15:18:48.104897 85973 rpm-ostree.go:211] Pivoting to: 46.82.202103050041-0 (d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06)
I0415 15:18:48.104923 85973 rpm-ostree.go:243] Executing rebase from repo path /run/mco-machine-os-content/os-content-367690970/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a and checksum d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06
I0415 15:18:48.104935 85973 rpm-ostree.go:261] Running captured: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-367690970/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description Managed by machine-config-operator
error: No enabled repositories
error: with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-367690970/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url
pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description Managed by machine-config-operator:
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d8cb906...done
: exit status 1
--------------------------------------------------------------------
This looks like a bug in rpm-ostree, or in the information that is passed to rpm-ostree.

Additional remarks/observations:
- The workaround is to use the --cache-only (-C) option for rpm-ostree rebase. With this option the repository check is skipped altogether, which should always be acceptable for a CoreOS upgrade in this situation; in our case /etc/yum.repos.d is empty on all nodes anyway.
- Since the actual image is downloaded in the step prior to rpm-ostree rebase, we think this should be a safe option.
- Once the upgrade is "unstuck", any subsequent upgrade or downgrade, even of the same OS image, runs automatically without this error. In other words, the problem is gone after the first successful rpm-ostree rebase command.
- This situation happens randomly on other clusters as well and may not be bound to these particular versions. Sometimes it does not occur at all.
- All worker nodes on the affected clusters are created by Ansible, using the same Ansible role, on the same hardware; the only differences are IP address, node name, etc.
- Once this error occurs, it persists across multiple attempts until the workaround (-C) is applied.
- We think the condition in rpm-ostree should be investigated; it may not work for a particular set of inputs.
- Alternatively, you may consider adding the -C option to the /run/bin/machine-config-daemon script to circumvent the issue (the assumption being that local repos are not actually defined).
---------------------------------
Finally got it working:

sh-4.4# rpm-ostree rebase -C --experimental /run/mco-machine-os-content/os-content-093177341/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description "Managed by machine-config-operator"
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d8cb906... done
Resolving dependencies... done
Checking out packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment...
done
Freed: 52.0 MB (pkgcache branches: 0)
Upgraded:
  cri-o 1.19.1-7.rhaos4.6.git6377f68.el8 -> 1.19.1-10.rhaos4.6.git27e71a0.el8
  criu 3.12-9.module+el8.2.0+9347+d4fa9cbb -> 3.12-9.module+el8.2.0+9938+46853747
  fuse-overlayfs 0.7.2-5.module+el8.2.0+9347+d4fa9cbb -> 0.7.2-5.module+el8.2.0+9938+46853747
  grub2-common 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-efi-x64 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-pc 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-pc-modules 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools-extra 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools-minimal 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  microcode_ctl 4:20191115-4.20201112.1.el8_2 -> 4:20191115-4.20210216.1.el8_2
  openshift-hyperkube 4.6.0-202102120001.p0.git.94272.41c57c9.el8 -> 4.6.0-202102261508.p0.git.94279.27a8e35.el8
  runc 1.0.0-82.rhaos4.6.git086e841.el8 -> 1.0.0-83.rhaos4.6.git8c2e7c8.el8
  slirp4netns 0.4.2-3.git21fdece.module+el8.2.0+9347+d4fa9cbb -> 0.4.2-3.git21fdece.module+el8.2.0+9938+46853747
Added:
  bootupd-0.2.4-3.el8.x86_64
Run "systemctl reboot" to start a reboot

The customer made it work by using the --cache-only (-C) switch, since the image had already been downloaded in the previous step. This circumvents a check on line 797 (see below).

What I found was this: https://github.com/coreos/rpm-ostree/issues/780

That led him to look into the rpm-ostree code. The problematic test seems to be here, on line 803:
https://github.com/coreos/rpm-ostree/blob/master/src/libpriv/rpmostree-core.cxx

It seems this test may sometimes fail even when it is not supposed to:

if (g_variant_dict_lookup (self->spec->dict, "packages", "^a&s", &pkgs) && g_strv_length (pkgs) > 0)

After doing this, the infra node was upgraded, rebooted, downgraded to the previous version because of the patch we tried, then upgraded back to the version it is supposed to be on. Maybe the rpm-ostree maintainers should check this.

Unfortunately, with the problem fixed (even the subsequent upgrade to the problematic version went without problems), we can no longer reproduce it, so hopefully the must-gather will contain the necessary information as to why the condition above failed when it clearly is not supposed to.
---------------------------------------------------
Could you please help us check this possible bug and its root cause, which may reoccur at any time?
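For anyone triaging a similar node, here is a minimal sketch of how one might confirm the two inputs that check looks at, i.e. whether the origin requests layered packages and whether any repositories are enabled. The paths and JSON field names below are assumptions that may differ between RHCOS/rpm-ostree versions; this is not taken from the must-gather.

~~~
# Hedged sketch -- paths/field names are assumptions, verify on the affected node.
ls -l /etc/yum.repos.d/                                                  # reported empty on these nodes
rpm-ostree status --json | jq '.deployments[0]["requested-packages"]'   # layered packages in the origin
# If packages are requested but no repository is enabled, "No enabled repositories"
# is the expected outcome unless --cache-only is passed.
~~~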
Given the issue is with rpm-ostree, moving this over to the CoreOS team, thanks!
Are any extensions in use here? Can you paste the output of `rpm-ostree status`?
I think the issue here is that the KCS doesn't account for extensions. Normally the MCD writes out the `coreos-extensions.repo` file before calling rpm-ostree. Using `--cache-only` is not correct because there's no guarantee that the cached versions of the extensions match the versions shipped with the OS image we're rebasing to.

So... I think we need to teach `/run/bin/machine-config-daemon pivot` about extensions (or lower some of the extensions handling down into the code shared between the two paths)?
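To make the missing piece concrete, here is a rough sketch of the kind of repo file being referred to, written by hand. The repo id comes from the comment above, but the baseurl layout and options are illustrative assumptions rather than the MCO's actual output.

~~~
# Hypothetical recreation of the repo file the MCD would normally write before the
# rebase; the baseurl is a placeholder for wherever the extracted os-content keeps
# its extensions RPMs -- adjust to the real extraction path on the node.
cat > /etc/yum.repos.d/coreos-extensions.repo <<'EOF'
[coreos-extensions]
name=CoreOS extensions
baseurl=file:///run/mco-machine-os-content/os-content-<id>/extensions
enabled=1
gpgcheck=0
EOF
~~~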
Or adapt the KCS to drop the extensions first before calling pivot, but I'm not sure the MCD would immediately know to re-add the extensions from the MCs once the node is back on the right OSTree commit.
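A minimal sketch of what "drop the extensions first" could look like on the node; this is illustrative only (the MCD owns this state on a managed node), and the package names simply match the LayeredPackages reported later in this bug.

~~~
# Illustrative only -- prefer the supported MCO paths on a managed node.
rpm-ostree uninstall kernel-devel kernel-headers   # stage a deployment without the layered extensions
systemctl reboot
# ...then retry the pivot per the KCS and let the MCO re-apply the extensions MachineConfig.
~~~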
(In reply to Colin Walters from comment #2)
> Are any extensions in use here?
>
> Can you paste the output of `rpm-ostree status`?

Hello,

sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
           LayeredPackages: kernel-devel kernel-headers
(In reply to Colin Walters from comment #2)
> Are any extensions in use here?
>
> Can you paste the output of `rpm-ostree status`?

Hello,

sh-4.4# rpm-ostree status -vvv
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
                BaseCommit: d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06
                            |- art-rhaos-4.6 (2021-03-04T23:40:59Z)
                            |- rhel8-fast-datapath (2021-02-15T16:31:03Z)
                            |- rhel8-baseos (2021-03-02T19:20:28Z)
                            `- rhel8-appstream (2021-03-03T10:25:13Z)
                    Commit: b203b0e1db8ebd37e2295757354d88ff1ed52b20ab5b7994c48e359180f275b3
                            `- coreos-extensions (2021-03-05T01:10:40Z)
                    Staged: no
                 StateRoot: rhcos
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
                BaseCommit: 2e9900b43c58f52ff4197ef675cb69c0a29b04393bd34e7b7427648002a653d2
                            |- art-rhaos-4.6 (2021-02-12T03:10:48Z)
                            |- rhel8-fast-datapath (2021-02-11T14:49:34Z)
                            |- rhel8-baseos (2021-02-10T09:40:05Z)
                            `- rhel8-appstream (2021-02-11T21:54:21Z)
                    Commit: b45a17d7875e10f0a9d331681cf171b13a11ad8821ba8e0f25660ea893697652
                            `- coreos-extensions (2021-02-12T04:09:32Z)
                 StateRoot: rhcos
           LayeredPackages: kernel-devel kernel-headers
Hello, the customer confirms that this problem happened again today when upgrading a different cluster from 4.6.21 to 4.6.26.

Current state:

sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
           LayeredPackages: kernel-devel kernel-headers

Thanks in advance for your help.
OK, so basically we hit some other, unknown bug causing the "unexpected osImageURL", and then the workaround hit another bug. A few bugfixes related to the former have landed, at least for 4.8, I believe around kubelet shutdown ordering. But we really need to prevent this situation from occurring in the first place. (Lowering the whole concept of ostree-container and extensions into rpm-ostree would help here too.)

> So... I think we need to teach `/run/bin/machine-config-daemon pivot` about extensions (or lower some of the extensions handling down into the code shared between the two paths)?

Right, though I think this situation is better handled by using the "force it" operation we already have:
https://github.com/openshift/machine-config-operator/pull/2265

Then the MCD will run through creating the repo and performing the update.
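For reference, the "force it" operation above is driven by a sentinel file the MCD watches for; a minimal sketch, assuming the forcefile path introduced in that PR is unchanged in the affected release:

~~~
# Hedged sketch: ask the MCD to revalidate and reapply the update itself (recreating
# the extensions repo along the way) instead of calling rpm-ostree by hand.
# Path taken from openshift/machine-config-operator#2265; verify for your release.
touch /run/machine-config-daemon-force
~~~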
OK, in a quick test it unfortunately seems that using the force flag doesn't work, because the MCD decides "should I do an OS update" by diffing the two MCs, so as Jerry mentioned it only helps when currentConfig != desiredConfig. We should change that, though, I think.
There's not much we can do directly in RHCOS to help this right now.
To unwrap things a bit here, there are a few things being discussed:

1. What caused the MCD's attempt to update the OS to fail in the first place. There should generally be an error message in the MCD logs; if you scroll up to the previous boot's update attempt it should be there, unless the pod has been deleted and restarted. If we wish to dig into this further, we can do that here.

2. Why the MCD fails this in the first place but does not report it as such. I think we should add some retries/better error reporting here, because it's not very apparent what caused the mismatch. This is a bug in the MCO.

3. What caused the KCS https://access.redhat.com/solutions/5598401 to fail. Maybe update the KCS to account for extensions.

Which should this track? @mchebbi, which do you want to focus on?
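As a starting point for item 1, one way the earlier failure could be pulled out of the daemon logs; this is a sketch with placeholder pod/node names, and `--previous` only helps if the container restarted in place rather than the pod being recreated.

~~~
# Hedged sketch for item 1; pod and node names are placeholders.
oc -n openshift-machine-config-operator logs <mcd-pod-on-that-node> -c machine-config-daemon --previous
# If the pod itself was recreated, rotated container logs may still exist on the node:
oc debug node/<node> -- chroot /host \
  ls /var/log/pods/openshift-machine-config-operator_machine-config-daemon-*/machine-config-daemon/
~~~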
Hello. The customer says this has happened again: they upgraded a cluster and it occurred twice. The cluster consists of 12 RHEV VM-based worker nodes (6 app workers, 6 infra workers) and 5 bare-metal worker nodes. The bug occurred on two VM-based app worker nodes, w05 and w06. Uploading the requested data.
Hi Team,

The customer is still facing the same issue while running RHOCP 4.9.21. He was able to reproduce the issue by adding a new node to his cluster. The customer has provided a fresh must-gather, where we can see:

~~~
$ oc logs -n openshift-machine-config-operator machine-config-daemon-n8tfj -c machine-config-daemon | tail -4
2022-03-23T00:22:37.328931681Z I0323 00:22:37.328881 2142 daemon.go:799] Current+desired config: rendered-worker-9b6ac8cab4985745e3f4bd09356c4823
2022-03-23T00:22:37.335779638Z I0323 00:22:37.335739 2142 daemon.go:1092] Validating against current config rendered-worker-9b6ac8cab4985745e3f4bd09356c4823
2022-03-23T00:22:37.335896200Z E0323 00:22:37.335887 2142 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-9b6ac8cab4985745e3f4bd09356c4823: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2a3cc8834dd0b9c8d63fff4e3f4", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31cda6faf934c68faf2c3c9484f604b98b373248bc613b2a0ee793bd712da631"
~~~

He has also uploaded the strace of the following commands:

~~~
sh-4.4# /usr/bin/strace -fvttTyyx -s 4096 -o /tmp/strace-rpm-ostree_rebase.txt /usr/bin/rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-068608363/srv/repo:d4bb9737c139799bd9d855853a42f5bc83cfa5acdd6ff9d0f0d90ceee9a36ae --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2a3cc8834dd0b98d63fff4e3f4 --custom-origin-description "Manual trigger"
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d4bb973... done
error: No enabled repositories
~~~

~~~
sh-4.4# /usr/bin/strace -fvttTyyx -s 4096 -o /tmp/strace-rpm-ostree_rebase--cache-only.txt /usr/bin/rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-068608363/srv/repo:d4bb9737c5139799bd9d855853a42f5bc83cfa5acdd6ff9d0f0d90ceee9a36ae --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2cc8834dd0b9c8d63fff4e3f4 --custom-origin-description "Manual trigger" --cache-only
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d4bb973... done
Resolving dependencies... done
Checking out packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Upgraded:
  NetworkManager 1:1.30.0-10.el8_4 -> 1:1.30.0-13.el8_4
  NetworkManager-cloud-setup 1:1.30.0-10.el8_4 -> 1:1.30.0-13.el8_4
  [...]
Run "systemctl reboot" to start a reboot
~~~

I hope this will help to diagnose the issue.

Additional context: the customer is still able to apply the desired config using the '-C' option when running the rpm-ostree command, but this means he has to closely monitor the worker updates to ensure that none of them is stuck deploying the latest CoreOS version. They really want this issue to be fixed.
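On the "closely monitor the worker updates" point, a small sketch of how the rollout could be watched from the CLI instead of per node; the annotation key is the usual MCO node-state annotation, but the exact names should be verified against the cluster version.

~~~
# Hedged sketch: spot pools/nodes stuck during a MachineConfig rollout.
oc get mcp                          # a non-zero DEGRADEDMACHINECOUNT flags the stuck pool
oc get nodes -o json | jq -r '.items[]
  | [.metadata.name, (.metadata.annotations["machineconfiguration.openshift.io/state"] // "n/a")]
  | @tsv'                           # per-node MCD state: Done / Working / Degraded
~~~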
Hi,

Sorry we never got back to this; we have had later bugs that cover some of the use cases, e.g. https://issues.redhat.com/browse/OCPBUGS-2269. I can't pinpoint the error from what we have here, and since Bugzilla is being deprecated and this is an older issue, I am going to go ahead and close it. If this comes up again, please open a Jira bug against RHCOS with the corresponding rpm-ostree logs.

Thanks again, and sorry for the delays.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days