Bug 1955977
Summary: | rpm-ostree ==> error: No enabled repositories | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | mchebbi <mchebbi> |
Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Jian Zhang <jiazha> |
Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | aos-bugs, chrzhang, cshepher, dahernan, dornelas, fminafra, jerzhang, jlebon, jligon, mkrejci, mrussell, nstielau, smilner, stanislav.polasek, vaclav.miller, vlours, walters |
Version: | 4.6.z | Keywords: | Reopened |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-12-16 00:52:38 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
mchebbi@redhat.com
2021-05-01 21:13:51 UTC
Given the issue is for rpm-ostree, moving over to the coreos team, thanks! Are any extensions in use here? Can you paste the output of `rpm-ostree status`? I think the issue here is that that KCS doesn't account for extensions. Normally the MCD writes out the `coreos-extensions.repo` file before calling rpm-ostree. Using `--cache-only` is not correct because there's no guarantee that the version of extensions cached matches the versions which were shipped with the OS image we're rebasing to. So... I think we need to teach the `/run/bin/machine-config-daemon pivot` about extensions (or lower down some of the extensions handling into the shared code between the two paths)? Or adapt the KCS to have it drop the extensions first before calling pivot, but I'm not sure if the MCD will immediately know to re-add the extensions in the MCs once the node is back on the right OSTree commit. (In reply to Colin Walters from comment #2) > Are any extensions in use here? > > Can you paste the output of `rpm-ostree status`? Hello, sh-4.4# rpm-ostree status State: idle Deployments: * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a CustomOrigin: Managed by machine-config-operator Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z) LayeredPackages: kernel-devel kernel-headers pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26 CustomOrigin: Managed by machine-config-operator Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z) LayeredPackages: kernel-devel kernel-headers (In reply to Colin Walters from comment #2) > Are any extensions in use here? > > Can you paste the output of `rpm-ostree status`? hello, sh-4.4# rpm-ostree status -vvv State: idle AutomaticUpdates: disabled Deployments: * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a CustomOrigin: Managed by machine-config-operator Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z) BaseCommit: d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 |- art-rhaos-4.6 (2021-03-04T23:40:59Z) |- rhel8-fast-datapath (2021-02-15T16:31:03Z) |- rhel8-baseos (2021-03-02T19:20:28Z) `- rhel8-appstream (2021-03-03T10:25:13Z) Commit: b203b0e1db8ebd37e2295757354d88ff1ed52b20ab5b7994c48e359180f275b3 `- coreos-extensions (2021-03-05T01:10:40Z) Staged: no StateRoot: rhcos LayeredPackages: kernel-devel kernel-headers pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26 CustomOrigin: Managed by machine-config-operator Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z) BaseCommit: 2e9900b43c58f52ff4197ef675cb69c0a29b04393bd34e7b7427648002a653d2 |- art-rhaos-4.6 (2021-02-12T03:10:48Z) |- rhel8-fast-datapath (2021-02-11T14:49:34Z) |- rhel8-baseos (2021-02-10T09:40:05Z) `- rhel8-appstream (2021-02-11T21:54:21Z) Commit: b45a17d7875e10f0a9d331681cf171b13a11ad8821ba8e0f25660ea893697652 `- coreos-extensions (2021-02-12T04:09:32Z) StateRoot: rhcos LayeredPackages: kernel-devel kernel-headers hello, customer confirm that this problem happened again today when upgrading a diferrent cluster from 4.6.21 to 4.6.26. Current state: sh-4.4# rpm-ostree status State: idle Deployments: * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a CustomOrigin: Managed by machine-config-operator Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z) LayeredPackages: kernel-devel kernel-headers pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26 CustomOrigin: Managed by machine-config-operator Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z) LayeredPackages: kernel-devel kernel-headers thanks in advance for your help. OK so basically we hit some other unknown bug causing the "unexpected osimageurl", then the workaround hit another bug. There have been a few bugfixes related to the former that landed at least for 4.8 I believe around kubelet shutdown ordering. But we really need to prevent this situation from occuring in the first place. (Lowering the whole concept of ostree-container and extensions into rpm-ostree would help here too) > So... I think we need to teach the `/run/bin/machine-config-daemon pivot` about extensions (or lower down some of the extensions handling into the shared code between the two paths)? Right, though I think this situation is better handled by using the "force it" operation we already have: https://github.com/openshift/machine-config-operator/pull/2265 Then the MCD will run through creating the repo and performing the update. OK in a quick test, unfortunately it seems using the force flag doesn't work, because the MCD is checking for "should I do an os update" by diffing the two MCs, so as Jerry mentioned it only works when currentConfig != desiredConfig. We should change that though I think. There's not much we can do directly in RHCOS to help this right now. To unwrap things a bit here, there's a few things being discussed: 1. what caused the MCD's attempt to fail to update the os in the first place There actually should be an error message in the MCD logs (generally speaking) , if you scroll up to the previous boot's update attempt it should be there, unless the pod has been deleted and restarted. If we wish to dig into this further, we can do this here 2. why the MCD fails this in the first place, but does not report it as such I think we should add some retries/better error reporting here, because it's not very apparent what caused the mismatch. This is a bug on the MCO 3. what caused the KCS https://access.redhat.com/solutions/5598401 to fail maybe update the KCS to account for extensions Which should this track? @mchebbi which do you want to focus on Hello. Customer says this has happened again, they upgraded a cluster and it occurred twice. The cluster consists of 12 RHEV VM-based worker nodes (6 app workers, 6 infra workers) and 5 bare-metal worker nodes. This bug occured on two VM-based app worker nodes, w05 and w06. Uploading the requested data. Hi Team, The customer is still facing the same issue while running RHOCP 4.9.21. He was able to reproduce the issue, by adding a new node to his cluster. The customer has provided a fresh must-gather, where we can see: ~~~ $ oc logs -n openshift-machine-config-operator machine-config-daemon-n8tfj -c machine-config-daemon | tail -4 2022-03-23T00:22:37.328931681Z I0323 00:22:37.328881 2142 daemon.go:799] Current+desired config: rendered-worker-9b6ac8cab4985745e3f4bd09356c4823 2022-03-23T00:22:37.335779638Z I0323 00:22:37.335739 2142 daemon.go:1092] Validating against current config rendered-worker-9b6ac8cab4985745e3f4bd09356c4823 2022-03-23T00:22:37.335896200Z E0323 00:22:37.335887 2142 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-9b6ac8cab4985745e3f4bd09356c4823: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2a3cc8834dd0b9c8d63fff4e3f4", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31cda6faf934c68faf2c3c9484f604b98b373248bc613b2a0ee793bd712da631" ~~~ He has also uploaded the strace of the following commands: ~~~ sh-4.4# /usr/bin/strace -fvttTyyx -s 4096 -o /tmp/strace-rpm-ostree_rebase.txt /usr/bin/rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-068608363/srv/repo:d4bb9737c139799bd9d855853a42f5bc83cfa5acdd6ff9d0f0d90ceee9a36ae --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2a3cc8834dd0b98d63fff4e3f4 --custom-origin-description "Manual trigger" 0 metadata, 0 content objects imported; 0 bytes content written Checking out tree d4bb973... done error: No enabled repositories ~~~ ~~~ sh-4.4# /usr/bibin/strace -fvttTyyx -s 4096 -o /tmp/strace-rpm-ostree_rebase--cache-onlyxt /usr/bin/rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-068608363/srv/repo:d4bb9737c5139799bd9d855853a42f5bc83cfa5acdd6ff9d0f0d90ceee9a36ae --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2cc8834dd0b9c8d63fff4e3f4 --custom-origin-description "Manual trigger" --cache-only 0 metadata, 0 content objects imported; 0 bytes content written Checking out tree d4bb973... done Resolving dependencies... done Checking out packages... done Running pre scripts... done Running post scripts... done Running posttrans scripts... done Writing rpmdb... done Writing OSTree commit... done Staging deployment... done Upgraded: NetworkManager 1:1.30.0-10.el8_4 -> 1:1.30.0-13.el8_4 NetworkManager-cloud-setup 1:1.30.0-10.el8_4 -> 1:1.30.0-13.el8_4 [...] Run "systemctl reboot" to start a reboot ~~~ I hope this will help to diag the issue. Additional context: the CU is still able to apply the desired config using the '-C' option when running the rpm-ostree command. But this means that he has to closely monitor the worker updates, to ensure that none of them is stuck deploying the latest coreos version. They really want this issue to be fixed. Hi, Sorry we never got back to this,we have had later bugs that cover for some use cases, e.g. https://issues.redhat.com/browse/OCPBUGS-2269 I can't seem to pinpoint the error from what we have here, and since Bugzilla is being deprecated + this is an older issue, I am going to go ahead and close this. If this comes up again, please open a Jira bug against RHCOS with corresponding rpm-ostree logs. Thanks again and sorry for the delays. The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days |