Bug 1955977

Summary: rpm-ostree ==> error: No enabled repositories
Product: OpenShift Container Platform
Reporter: mchebbi <mchebbi>
Component: Machine Config Operator
Assignee: Yu Qi Zhang <jerzhang>
Sub component: Machine Config Operator
QA Contact: Jian Zhang <jiazha>
Status: CLOSED INSUFFICIENT_DATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, chrzhang, cshepher, dahernan, dornelas, fminafra, jerzhang, jlebon, jligon, mkrejci, mrussell, nstielau, smilner, stanislav.polasek, vaclav.miller, vlours, walters
Version: 4.6.z
Keywords: Reopened
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-12-16 00:52:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description mchebbi@redhat.com 2021-05-01 21:13:51 UTC
gathered data URL : shorturl.at/pvE79

Hello,
The customer upgraded their cluster; the update is stuck with a random worker node in a degraded state.

According to the latest must-gather, the current and desired config are both set to the previous MachineConfig.

~~~
2021-04-15T13:51:56.801540297Z I0415 13:51:56.801482    2217 daemon.go:808] Current+desired config: rendered-worker-1d0866ca1672b3108f63cb78ee21e9a5
2021-04-15T13:51:56.808645810Z I0415 13:51:56.808603    2217 update.go:1676] Disk currentConfig rendered-worker-6884f95ab6f02681824be2fe6d68d049 overrides node's currentConfig annotation rendered-worker-1d0866ca1672b3108f63cb78ee21e9a5
2021-04-15T13:51:56.812541252Z I0415 13:51:56.812493    2217 daemon.go:1085] Validating against current config rendered-worker-6884f95ab6f02681824be2fe6d68d049
2021-04-15T13:51:56.812643566Z E0415 13:51:56.812629    2217 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-6884f95ab6f02681824be2fe6d68d049: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26"
~~~
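The failing check in the log above can be restated as a minimal sketch (this is an illustration in Python, not the actual MCO code; the function name is ours and the digests are shortened):

```python
# Minimal sketch (not actual MCO code) of the validation in the log above:
# the daemon compares the osImageURL expected by the rendered MachineConfig
# with what is actually on disk, and marks the node Degraded on a mismatch.
def validate_os_image(expected_url: str, on_disk_url: str):
    """Return an error string on mismatch, or None if the node is consistent."""
    if on_disk_url != expected_url:
        return ('unexpected on-disk state: expected target osImageURL '
                f'"{expected_url}", have "{on_disk_url}"')
    return None

# Shortened digests, for illustration only.
expected = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa..."
on_disk = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dce..."
print(validate_os_image(expected, on_disk))
```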

The MCD is still complaining about the osImageURL.
--------------------------------------------------------------------
We followed the KCS https://access.redhat.com/solutions/5598401 and got:
 
sh-4.4# export HTTP_PROXY='http://http-proxy.cezdata.corp:8080'
sh-4.4# export HTTPS_PROXY='http://http-proxy.cezdata.corp:8080'
sh-4.4# /run/bin/machine-config-daemon pivot quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
I0415 15:18:00.786311   85973 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-367690970 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
I0415 15:18:45.750981   85973 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I0415 15:18:45.840305   85973 rpm-ostree.go:179] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
I0415 15:18:48.104897   85973 rpm-ostree.go:211] Pivoting to: 46.82.202103050041-0 (d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06)
I0415 15:18:48.104923   85973 rpm-ostree.go:243] Executing rebase from repo path /run/mco-machine-os-content/os-content-367690970/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a and checksum d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06
I0415 15:18:48.104935   85973 rpm-ostree.go:261] Running captured: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-367690970/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description Managed by machine-config-operator
error: No enabled repositories
error: with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-367690970/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description Managed by machine-config-operator: 0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d8cb906...done
: exit status 1

--------------------------------------------------------------------
This looks like a bug in rpm-ostree, or in the information passed to rpm-ostree. Additional remarks/observations:

- The workaround is to use the --cache-only (-C) flag for rpm-ostree rebase. With this flag the repository check is skipped altogether, which should always be appropriate for a CoreOS upgrade in this situation. In our case /etc/yum.repos.d is empty on all nodes anyway.
- Since the actual image is downloaded in the step prior to rpm-ostree rebase, we think this should be a safe option.
- Once the upgrade is "unstuck", any subsequent upgrade or downgrade, even of the same OS image, runs automatically without this error. The problem is gone after the first successful rpm-ostree rebase command.
- This situation happens randomly on other clusters as well and may not be bound to these particular versions. Sometimes it does not occur at all.
- All worker nodes on the affected clusters are created by Ansible, using the same Ansible role, running on the same hardware; the only differences are IP address, node name, etc.
- Once this error occurs, it persists across multiple attempts until the workaround (-C) is applied.
- We think the condition in rpm-ostree should be investigated; it may misbehave for a particular set of inputs.
- Alternatively, you may consider adding the -C option to the /run/bin/machine-config-daemon script to circumvent the issue (the assumption is that local repos are not actually defined).
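The "local repos are not actually defined" assumption can be verified directly on a node by checking whether any repo file under /etc/yum.repos.d enables a repository. A small sketch (the path and `enabled=1` convention are standard yum/dnf behavior, not something taken from this bug):

```python
import configparser
import glob
import os

def enabled_repos(repo_dir="/etc/yum.repos.d"):
    """Return the ids of repos with enabled=1 in *.repo files under repo_dir."""
    enabled = []
    for path in glob.glob(os.path.join(repo_dir, "*.repo")):
        cp = configparser.ConfigParser()
        cp.read(path)
        for section in cp.sections():
            if cp.get(section, "enabled", fallback="0") == "1":
                enabled.append(section)
    return enabled
```

If this returns an empty list, rpm-ostree has nothing to resolve layered packages from, which is consistent with the "No enabled repositories" error seen above.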

---------------------------------

Finally got it working:

sh-4.4# rpm-ostree rebase -C --experimental /run/mco-machine-os-content/os-content-093177341/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description "Managed by machine-config-operator"
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d8cb906... done
Resolving dependencies... done
Checking out packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Freed: 52.0 MB (pkgcache branches: 0)
Upgraded:
  cri-o 1.19.1-7.rhaos4.6.git6377f68.el8 -> 1.19.1-10.rhaos4.6.git27e71a0.el8
  criu 3.12-9.module+el8.2.0+9347+d4fa9cbb -> 3.12-9.module+el8.2.0+9938+46853747
  fuse-overlayfs 0.7.2-5.module+el8.2.0+9347+d4fa9cbb -> 0.7.2-5.module+el8.2.0+9938+46853747
  grub2-common 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-efi-x64 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-pc 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-pc-modules 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools-extra 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools-minimal 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  microcode_ctl 4:20191115-4.20201112.1.el8_2 -> 4:20191115-4.20210216.1.el8_2
  openshift-hyperkube 4.6.0-202102120001.p0.git.94272.41c57c9.el8 -> 4.6.0-202102261508.p0.git.94279.27a8e35.el8
  runc 1.0.0-82.rhaos4.6.git086e841.el8 -> 1.0.0-83.rhaos4.6.git8c2e7c8.el8
  slirp4netns 0.4.2-3.git21fdece.module+el8.2.0+9347+d4fa9cbb -> 0.4.2-3.git21fdece.module+el8.2.0+9938+46853747
Added:
  bootupd-0.2.4-3.el8.x86_64
Run "systemctl reboot" to start a reboot

The customer made it work by using the --cache-only (-C) switch, since the image was already downloaded in the previous step.

This circumvents a check in rpm-ostree (see below). What I found was this:
https://github.com/coreos/rpm-ostree/issues/780

which made him look into the rpm-ostree code.

The problematic test seems to be here, around line 803:
https://github.com/coreos/rpm-ostree/blob/master/src/libpriv/rpmostree-core.cxx

It seems this test may sometimes fail even when it is not supposed to:

if (g_variant_dict_lookup (self->spec->dict, "packages", "^a&s", &pkgs) &&
    g_strv_length (pkgs) > 0)
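Read naively, that condition asks "does the origin request any layered packages?"; if it does and no repositories are enabled (and --cache-only was not given), the rebase fails with "No enabled repositories". A Python restatement of that decision (our simplification for illustration, not rpm-ostree's actual control flow):

```python
def rebase_requires_repos(layered_packages, cache_only=False):
    """Simplified restatement of the check: layered packages require an
    enabled repository, unless --cache-only (-C) tells rpm-ostree to use
    already-cached RPMs instead."""
    return bool(layered_packages) and not cache_only

# The nodes in this bug layer kernel-devel and kernel-headers, so without
# -C an enabled repo (e.g. coreos-extensions) is required.
print(rebase_requires_repos(["kernel-devel", "kernel-headers"]))        # True
print(rebase_requires_repos(["kernel-devel", "kernel-headers"], True))  # False
```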

After doing this, the infra node was upgraded, rebooted, downgraded to the previous version because of the patch we tried, then upgraded back to the version it is supposed to be on.

Maybe the rpm-ostree maintainers should check this. Unfortunately, with the problem fixed (even a subsequent upgrade to the problematic version went without problems), we can no longer reproduce it, so hopefully the must-gather will contain the information needed to explain why the condition above failed when it clearly was not supposed to.


---------------------------------------------------
Could you please help us investigate this possible bug and its root cause, which may reoccur at any time?

Comment 1 Yu Qi Zhang 2021-05-04 00:30:00 UTC
Given the issue is for rpm-ostree, moving over to the coreos team, thanks!

Comment 2 Colin Walters 2021-05-04 17:13:33 UTC
Are any extensions in use here?

Can you paste the output of `rpm-ostree status`?

Comment 3 Jonathan Lebon 2021-05-04 17:36:30 UTC
I think the issue here is that that KCS doesn't account for extensions. Normally the MCD writes out the `coreos-extensions.repo` file before calling rpm-ostree.

Using `--cache-only` is not correct because there's no guarantee that the version of extensions cached matches the versions which were shipped with the OS image we're rebasing to.

So... I think we need to teach the `/run/bin/machine-config-daemon pivot` about extensions (or lower down some of the extensions handling into the shared code between the two paths)?
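The extensions handling Jonathan refers to is, in essence, the MCD dropping a local repo file pointing at the extracted machine-os-content before invoking rpm-ostree. A hedged sketch of what such a file might look like (the file name, repo id, and field values here are our assumptions for illustration, not copied from the MCD source):

```python
import os

def write_extensions_repo(repo_dir, os_content_dir):
    """Sketch: write a yum repo file pointing at the extracted
    machine-os-content so rpm-ostree has an enabled repository to
    resolve layered packages from. File name and layout are
    illustrative assumptions, not the MCD's actual code."""
    os.makedirs(repo_dir, exist_ok=True)
    path = os.path.join(repo_dir, "coreos-extensions.repo")
    with open(path, "w") as f:
        f.write(
            "[coreos-extensions]\n"
            "name=CoreOS extensions\n"
            f"baseurl=file://{os_content_dir}/extensions\n"
            "enabled=1\n"
            "gpgcheck=0\n"
        )
    return path
```

This also illustrates why the manual `pivot` path in the KCS hits "No enabled repositories": nothing in that path writes such a file before the rebase.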

Comment 4 Jonathan Lebon 2021-05-04 19:36:03 UTC
Or adapt the KCS to have it drop the extensions first before calling pivot, but I'm not sure if the MCD will immediately know to re-add the extensions in the MCs once the node is back on the right OSTree commit.

Comment 5 mchebbi@redhat.com 2021-05-06 10:50:37 UTC
(In reply to Colin Walters from comment #2)
> Are any extensions in use here?
> 
> Can you paste the output of `rpm-ostree status`?

Hello,
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
           LayeredPackages: kernel-devel kernel-headers

Comment 6 mchebbi@redhat.com 2021-05-06 10:51:52 UTC
(In reply to Colin Walters from comment #2)
> Are any extensions in use here?
> 
> Can you paste the output of `rpm-ostree status`?

hello,

sh-4.4# rpm-ostree status -vvv
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
                BaseCommit: d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06
                            |- art-rhaos-4.6 (2021-03-04T23:40:59Z)
                            |- rhel8-fast-datapath (2021-02-15T16:31:03Z)
                            |- rhel8-baseos (2021-03-02T19:20:28Z)
                            `- rhel8-appstream (2021-03-03T10:25:13Z)
                    Commit: b203b0e1db8ebd37e2295757354d88ff1ed52b20ab5b7994c48e359180f275b3
                            `- coreos-extensions (2021-03-05T01:10:40Z)
                    Staged: no
                 StateRoot: rhcos
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
                BaseCommit: 2e9900b43c58f52ff4197ef675cb69c0a29b04393bd34e7b7427648002a653d2
                            |- art-rhaos-4.6 (2021-02-12T03:10:48Z)
                            |- rhel8-fast-datapath (2021-02-11T14:49:34Z)
                            |- rhel8-baseos (2021-02-10T09:40:05Z)
                            `- rhel8-appstream (2021-02-11T21:54:21Z)
                    Commit: b45a17d7875e10f0a9d331681cf171b13a11ad8821ba8e0f25660ea893697652
                            `- coreos-extensions (2021-02-12T04:09:32Z)
                 StateRoot: rhcos
           LayeredPackages: kernel-devel kernel-headers

Comment 7 mchebbi@redhat.com 2021-05-06 10:53:10 UTC
hello,
The customer confirms that this problem happened again today when upgrading a different cluster from 4.6.21 to 4.6.26.

Current state:

sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
           LayeredPackages: kernel-devel kernel-headers

thanks in advance for your help.

Comment 8 Colin Walters 2021-05-27 13:56:14 UTC
OK, so basically we hit some other unknown bug causing the "unexpected osImageURL", then the workaround hit another bug.

There have been a few bugfixes related to the former that landed, at least for 4.8 I believe, around kubelet shutdown ordering.
But we really need to prevent this situation from occurring in the first place.

(Lowering the whole concept of ostree-container and extensions into rpm-ostree would help here too)


> So... I think we need to teach the `/run/bin/machine-config-daemon pivot` about extensions (or lower down some of the extensions handling into the shared code between the two paths)?

Right, though I think this situation is better handled by using the "force it" operation we already have:
https://github.com/openshift/machine-config-operator/pull/2265

Then the MCD will run through creating the repo and performing the update.

Comment 9 Colin Walters 2021-05-27 15:54:37 UTC
OK, in a quick test, it unfortunately seems that using the force flag doesn't work, because the MCD checks "should I do an OS update" by diffing the two MCs, so as Jerry mentioned it only works when currentConfig != desiredConfig.

I think we should change that, though.
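The limitation described here can be restated as a tiny decision function (our simplification; the field name is illustrative, not the MCO's actual struct):

```python
def will_update_os(current_os_image, desired_os_image, force=False):
    """Sketch of the gating Comment 9 describes: the OS-update path is
    only entered when the two rendered configs differ, so the force flag
    (intentionally unused below) does not help when
    currentConfig == desiredConfig."""
    return current_os_image != desired_os_image

# A degraded node whose annotations already agree: no update runs,
# even with force.
print(will_update_os("sha256:b69dce...", "sha256:b69dce...", force=True))  # False
```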

Comment 10 Colin Walters 2021-06-02 18:34:00 UTC
There's not much we can do directly in RHCOS to help this right now.

Comment 11 Yu Qi Zhang 2021-06-03 01:47:22 UTC
To unwrap things a bit here, there are a few things being discussed:

1. What caused the MCD's attempt to update the OS to fail in the first place?

There actually should be an error message in the MCD logs (generally speaking); if you scroll up to the previous boot's update attempt it should be there, unless the pod has been deleted and restarted. If we wish to dig into this further, we can do that here.

2. Why does the MCD fail this in the first place, but not report it as such?

I think we should add some retries/better error reporting here, because it's not very apparent what caused the mismatch. This is a bug in the MCO.

3. What caused the KCS https://access.redhat.com/solutions/5598401 to fail?

Maybe update the KCS to account for extensions.

Which of these should this bug track? @mchebbi, which do you want to focus on?

Comment 15 cshepher 2021-09-09 00:08:05 UTC
Hello. The customer says this has happened again: they upgraded a cluster and it occurred twice. The cluster consists of 12 RHEV VM-based worker nodes (6 app workers, 6 infra workers) and 5 bare-metal worker nodes. This bug occurred on two VM-based app worker nodes, w05 and w06.

Uploading the requested data.

Comment 19 Vincent Lours 2022-03-24 05:48:02 UTC
Hi Team,

The customer is still facing the same issue while running RHOCP 4.9.21.
He was able to reproduce the issue, by adding a new node to his cluster.

The customer has provided a fresh must-gather, where we can see:
~~~
$ oc logs -n openshift-machine-config-operator machine-config-daemon-n8tfj -c machine-config-daemon | tail -4
2022-03-23T00:22:37.328931681Z I0323 00:22:37.328881    2142 daemon.go:799] Current+desired config: rendered-worker-9b6ac8cab4985745e3f4bd09356c4823
2022-03-23T00:22:37.335779638Z I0323 00:22:37.335739    2142 daemon.go:1092] Validating against current config rendered-worker-9b6ac8cab4985745e3f4bd09356c4823
2022-03-23T00:22:37.335896200Z E0323 00:22:37.335887    2142 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-9b6ac8cab4985745e3f4bd09356c4823: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2a3cc8834dd0b9c8d63fff4e3f4", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31cda6faf934c68faf2c3c9484f604b98b373248bc613b2a0ee793bd712da631"
~~~

He has also uploaded the strace of the following commands:
~~~
sh-4.4# /usr/bin/strace -fvttTyyx -s 4096 -o /tmp/strace-rpm-ostree_rebase.txt /usr/bin/rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-068608363/srv/repo:d4bb9737c139799bd9d855853a42f5bc83cfa5acdd6ff9d0f0d90ceee9a36ae --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2a3cc8834dd0b98d63fff4e3f4 --custom-origin-description "Manual trigger"
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d4bb973... done
error: No enabled repositories
~~~

~~~
sh-4.4# /usr/bin/strace -fvttTyyx -s 4096 -o /tmp/strace-rpm-ostree_rebase--cache-only.txt /usr/bin/rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-068608363/srv/repo:d4bb9737c5139799bd9d855853a42f5bc83cfa5acdd6ff9d0f0d90ceee9a36ae --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08d376c06b2acc8ad4abf4527aabd24922c4e2cc8834dd0b9c8d63fff4e3f4 --custom-origin-description "Manual trigger" --cache-only
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d4bb973... done
Resolving dependencies... done
Checking out packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Upgraded:
  NetworkManager 1:1.30.0-10.el8_4 -> 1:1.30.0-13.el8_4
  NetworkManager-cloud-setup 1:1.30.0-10.el8_4 -> 1:1.30.0-13.el8_4
[...]
Run "systemctl reboot" to start a reboot
~~~

I hope this will help to diag the issue.

Additional context: the customer is still able to apply the desired config by using the '-C' option when running the rpm-ostree command.
But this means they have to closely monitor the worker updates to ensure that none of them gets stuck deploying the latest CoreOS version.
They really want this issue to be fixed.

Comment 25 Yu Qi Zhang 2022-12-16 00:52:38 UTC
Hi,

Sorry we never got back to this; we have had later bugs that cover some of the use cases, e.g. https://issues.redhat.com/browse/OCPBUGS-2269

I can't seem to pinpoint the error from what we have here, and since Bugzilla is being deprecated + this is an older issue, I am going to go ahead and close this. If this comes up again, please open a Jira bug against RHCOS with corresponding rpm-ostree logs.

Thanks again and sorry for the delays.

Comment 26 Red Hat Bugzilla 2023-09-18 00:26:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days