Bug 1955977 - rpm-ostree ==> error: No enabled repositories
Summary: rpm-ostree ==> error: No enabled repositories
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-05-01 21:13 UTC by mchebbi@redhat.com
Modified: 2021-05-06 10:53 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)

Description mchebbi@redhat.com 2021-05-01 21:13:51 UTC
gathered data URL : shorturl.at/pvE79

Hello,
Customer did an upgrade of his cluster, update is stuck with a random worker node in degraded state.

According to the latest must-gather current and desired config are set to the previous MachineConfig.

~~
2021-04-15T13:51:56.801540297Z I0415 13:51:56.801482    2217 daemon.go:808] Current+desired config: rendered-worker-1d0866ca1672b3108f63cb78ee21e9a5
2021-04-15T13:51:56.808645810Z I0415 13:51:56.808603    2217 update.go:1676] Disk currentConfig rendered-worker-6884f95ab6f02681824be2fe6d68d049 overrides node's currentConfig annotation rendered-worker-1d0866ca1672b3108f63cb78ee21e9a5
2021-04-15T13:51:56.812541252Z I0415 13:51:56.812493    2217 daemon.go:1085] Validating against current config rendered-worker-6884f95ab6f02681824be2fe6d68d049
2021-04-15T13:51:56.812643566Z E0415 13:51:56.812629    2217 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-6884f95ab6f02681824be2fe6d68d049: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26"
~~~

The MCD still complaining about the osImageURL
--------------------------------------------------------------------
we follow the KCS: https://access.redhat.com/solutions/5598401 and we get
 
sh-4.4# export HTTP_PROXY='http://http-proxy.cezdata.corp:8080'
sh-4.4# export HTTPS_PROXY='http://http-proxy.cezdata.corp:8080'
sh-4.4# /run/bin/machine-config-daemon pivot quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
I0415 15:18:00.786311   85973 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-367690970 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
I0415 15:18:45.750981   85973 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I0415 15:18:45.840305   85973 rpm-ostree.go:179] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
I0415 15:18:48.104897   85973 rpm-ostree.go:211] Pivoting to: 46.82.202103050041-0 (d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06)
I0415 15:18:48.104923   85973 rpm-ostree.go:243] Executing rebase from repo path /run/mco-machine-os-content/os-content-367690970/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a and checksum d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06
I0415 15:18:48.104935   85973 rpm-ostree.go:261] Running captured: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-367690970/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description Managed by machine-config-operator
error: No enabled repositories
error: with stdout output: : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-367690970/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description Managed by machine-config-operator: 0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d8cb906...done
: exit status 1

--------------------------------------------------------------------
this looks like a bug in rpm-ostree or in information that is passed to rpm-ostree. Additional remarks/observation:

- the workaround is done by using the --cache-only (-C) directive for rpm-ostree rebase. If using this directive, a repo check is skipped altogether, which should be always the case in case of CoreOS upgrade in this situation. In our case /etc/yam.d.repos is empty on all nodes anyway.
- as the actual image is downloaded in the step prior to rpm-ostree rebase, we think this should be a safe option
- once the upgrade is "unstuck", any subsequent upgrade or downgrade, even of the same OS image, runs automatically without this error. This means this problem is gone with first successful rpm-ostree rebase command.
- this situation happens randomly on other clusters as well, may not be bound to these particular versions. Sometimes it does not occur at all.
- all worker nodes on affected clusters are created by Ansible, same Ansible role, running on the same hardware, the only difference is IP address, node name, etc.
- once this error occurs, it persists across multiple attemps, until the workaround (-C) is applied
- we think the condition in rpm-ostree should be investigated, it may not work in a particular set of inputs
- alternatively, you may consider adding -C option to the /run/bin/machine-config-daemon script to circumvent the issue (the assumption is that local repos are not actually defined)

---------------------------------

finally got it working.

sh-4.4# rpm-ostree rebase -C --experimental /run/mco-machine-os-content/os-content-093177341/srv/repo:d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a --custom-origin-description "Managed by machine-config-operator"
0 metadata, 0 content objects imported; 0 bytes content written
Checking out tree d8cb906... done
Resolving dependencies... done
Checking out packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Freed: 52.0 MB (pkgcache branches: 0)
Upgraded:
  cri-o 1.19.1-7.rhaos4.6.git6377f68.el8 -> 1.19.1-10.rhaos4.6.git27e71a0.el8
  criu 3.12-9.module+el8.2.0+9347+d4fa9cbb -> 3.12-9.module+el8.2.0+9938+46853747
  fuse-overlayfs 0.7.2-5.module+el8.2.0+9347+d4fa9cbb -> 0.7.2-5.module+el8.2.0+9938+46853747
  grub2-common 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-efi-x64 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-pc 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-pc-modules 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools-extra 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  grub2-tools-minimal 1:2.02-87.el8_2.2 -> 1:2.02-87.el8_2.3
  microcode_ctl 4:20191115-4.20201112.1.el8_2 -> 4:20191115-4.20210216.1.el8_2
  openshift-hyperkube 4.6.0-202102120001.p0.git.94272.41c57c9.el8 -> 4.6.0-202102261508.p0.git.94279.27a8e35.el8
  runc 1.0.0-82.rhaos4.6.git086e841.el8 -> 1.0.0-83.rhaos4.6.git8c2e7c8.el8
  slirp4netns 0.4.2-3.git21fdece.module+el8.2.0+9347+d4fa9cbb -> 0.4.2-3.git21fdece.module+el8.2.0+9938+46853747
Added:
  bootupd-0.2.4-3.el8.x86_64
Run "systemctl reboot" to start a reboot

customer made it work by using the --cache-only (-C) switch, as he downloaded the image in the previous step. 

This circumvents a check on line 797 (see below).
What I found was this:
https://github.com/coreos/rpm-ostree/issues/780

Which made him look into the code of rpm-ostree.

The problematic tests seems to be here, on line 803:
https://github.com/coreos/rpm-ostree/blob/master/src/libpriv/rpmostree-core.cxx

It seems this test may sometimes fail even if it is not supposed to:

if (g_variant_dict_lookup (self->spec->dict, "packages", "^a&s", &pkgs) &&
          g_strv_length (pkgs) > 0)

After doing this, the infra node was upgraded, rebooted, downgraded to previous version because ot the patch we tried, then upgraded back to the version it is supposed to be.

Maybe rpm-ostree maintainers should check this. Unfortunately, with the problem fixed (and even subsequent upgrade to the problematic version went without problems), we can no longer reproduce, so hopefully must-gather will contain necessary information as to why the condition above failed even if it is clearly not supposed to.


---------------------------------------------------
Could you please help us check this eventual bug and it's the root cause which may reoccur any time?

Comment 1 Yu Qi Zhang 2021-05-04 00:30:00 UTC
Given the issue is for rpm-ostree, moving over to the coreos team, thanks!

Comment 2 Colin Walters 2021-05-04 17:13:33 UTC
Are any extensions in use here?

Can you paste the output of `rpm-ostree status`?

Comment 3 Jonathan Lebon 2021-05-04 17:36:30 UTC
I think the issue here is that that KCS doesn't account for extensions. Normally the MCD writes out the `coreos-extensions.repo` file before calling rpm-ostree.

Using `--cache-only` is not correct because there's no guarantee that the version of extensions cached matches the versions which were shipped with the OS image we're rebasing to.

So... I think we need to teach the `/run/bin/machine-config-daemon pivot` about extensions (or lower down some of the extensions handling into the shared code between the two paths)?

Comment 4 Jonathan Lebon 2021-05-04 19:36:03 UTC
Or adapt the KCS to have it drop the extensions first before calling pivot, but I'm not sure if the MCD will immediately know to re-add the extensions in the MCs once the node is back on the right OSTree commit.

Comment 5 mchebbi@redhat.com 2021-05-06 10:50:37 UTC
(In reply to Colin Walters from comment #2)
> Are any extensions in use here?
> 
> Can you paste the output of `rpm-ostree status`?

Hello,
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
           LayeredPackages: kernel-devel kernel-headers

Comment 6 mchebbi@redhat.com 2021-05-06 10:51:52 UTC
(In reply to Colin Walters from comment #2)
> Are any extensions in use here?
> 
> Can you paste the output of `rpm-ostree status`?

hello,

sh-4.4# rpm-ostree status -vvv
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
                BaseCommit: d8cb9069afa2946131e033b3da6ad81a49c32642cbec3277a8cc9a1d7c022b06
                            |- art-rhaos-4.6 (2021-03-04T23:40:59Z)
                            |- rhel8-fast-datapath (2021-02-15T16:31:03Z)
                            |- rhel8-baseos (2021-03-02T19:20:28Z)
                            `- rhel8-appstream (2021-03-03T10:25:13Z)
                    Commit: b203b0e1db8ebd37e2295757354d88ff1ed52b20ab5b7994c48e359180f275b3
                            `- coreos-extensions (2021-03-05T01:10:40Z)
                    Staged: no
                 StateRoot: rhcos
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
                BaseCommit: 2e9900b43c58f52ff4197ef675cb69c0a29b04393bd34e7b7427648002a653d2
                            |- art-rhaos-4.6 (2021-02-12T03:10:48Z)
                            |- rhel8-fast-datapath (2021-02-11T14:49:34Z)
                            |- rhel8-baseos (2021-02-10T09:40:05Z)
                            `- rhel8-appstream (2021-02-11T21:54:21Z)
                    Commit: b45a17d7875e10f0a9d331681cf171b13a11ad8821ba8e0f25660ea893697652
                            `- coreos-extensions (2021-02-12T04:09:32Z)
                 StateRoot: rhcos
           LayeredPackages: kernel-devel kernel-headers

Comment 7 mchebbi@redhat.com 2021-05-06 10:53:10 UTC
hello,
customer confirm that this problem happened again today when upgrading a diferrent cluster from 4.6.21 to 4.6.26.

Current state:

sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3786aa3c566dac65955db229dc20f9f19106046ebe33561bd60cce5538544e6a
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202103050041-0 (2021-03-05T00:45:01Z)
           LayeredPackages: kernel-devel kernel-headers

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b69dcece911957dbd617e1a5ca97f7ab0d8f1feab5870f6d6fd9221143f0fc26
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202102120340-0 (2021-02-12T03:44:11Z)
           LayeredPackages: kernel-devel kernel-headers

thanks in advance for your help.


Note You need to log in before you can comment on or make changes to this bug.