2075080 – OSP17 yum repo file gets overwritten causing undercloud deploy fail

Summary: OSP17 yum repo file gets overwritten causing undercloud deploy fail

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	17.0 (Wallaby)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	Alpha
Target Release:	17.0
Assignee:	Cédric Jeanneret
QA Contact:	David Rosenfeld
Docs Contact:
URL:
Whiteboard:
Depends On:	2058540 2095316 2097694
Blocks:

Reported:	2022-04-13 14:38 UTC by David Rosenfeld
Modified:	2022-09-21 12:21 UTC
CC List:	5 users
Fixed In Version:	openstack-tripleo-heat-templates-14.3.1-0.20220628111342.7c969c5.el9ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-21 12:20:43 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenStack gerrit	845353	0	None	master: MERGED	tripleo-heat-templates: Workaround for subscription-manager (Ia66fb6933b8e7b8289e80fab5508c7ed7f828f01)	2022-06-28 14:35:51 UTC
OpenStack gerrit	845674	0	None	stable/wallaby: MERGED	tripleo-heat-templates: Workaround for subscription-manager (Ia66fb6933b8e7b8289e80fab5508c7ed7f828f01)	2022-06-28 14:35:56 UTC
Red Hat Bugzilla	2097694	1	unspecified	CLOSED	Allow mounting -v /run:/run without leaking .containerenv file to the host	2023-05-30 06:39:35 UTC
Red Hat Issue Tracker	OSP-14665	0	None	None	None	2022-04-13 14:50:32 UTC
Red Hat Product Errata	RHEA-2022:6543	0	None	None	None	2022-09-21 12:21:12 UTC

Description David Rosenfeld 2022-04-13 14:38:22 UTC

Description of problem: Deploying the undercloud with a Satellite server as the RPM source fails. When the undercloud deploy starts, yum updates succeed. However, midway through the deployment the undercloud can no longer install RPMs.

This is seen in undercloud deploy log:
Error: Execution of '/bin/dnf -d 0 -e 1 -y install net-snmp' returned 1: Error: Failed to download metadata for repo 'Default_Organization_OSP17_0_RPM_rhelosp-17_0-image-build-override': Cannot download repomd.xml: Curl error (77): Problem with the SSL CA cert (path? access rights?) for https://cdn.redhat.com/Default_Organization/Library/custom/OSP17_0_RPM/rhelosp-17_0-image-build-override/repodata/repomd.xml [error setting certificate file: %(ca_cert_dir)sredhat-uep.pem]\n<13>Apr 13 09:47:03 puppet-user: Error: /Stage[main]/Snmp/Package[snmpd]/ensure: change from 'purged' to 'present' failed:

It's looking for cdn.redhat.com when it should be looking for titan98.lab.eng.tlv2.redhat.com


Looking at the repo file when deployment starts:

ls -l /etc/yum.repos.d
total 24
-rw-r--r--. 1 root root 20761 Apr 13 09:25 redhat.repo

All entries in the redhat.repo file correctly show the satellite server (titan98) as being used as rpm source. Example of one entry:

[Default_Organization_OSP17_0_RPM_rhelosp-17_0-image-build-override]
name = rhelosp-17.0-image-build-override
baseurl = https://titan98.lab.eng.tlv2.redhat.com/pulp/repos/Default_Organization/Library/custom/OSP17_0_RPM/rhelosp-17_0-image-build-override
enabled = 1
gpgcheck = 0
sslverify = 1
sslcacert = /etc/rhsm/ca/katello-server-ca.pem
sslclientkey = /etc/pki/entitlement/3186737919124279956-key.pem
sslclientcert = /etc/pki/entitlement/3186737919124279956.pem
metadata_expire = 1
enabled_metadata = 1



Then, once the undercloud deployment has failed, the repo entries are seen to have been changed to reference the CDN, and RPMs are no longer found.

 ls -l /etc/yum.repos.d
total 20
-rw-r--r--. 1 root root 19375 Apr 13 09:47 redhat.repo


[Default_Organization_OSP17_0_RPM_rhelosp-17_0-image-build-override]
name = rhelosp-17.0-image-build-override
baseurl = https://cdn.redhat.com/Default_Organization/Library/custom/OSP17_0_RPM/rhelosp-17_0-image-build-override
enabled = 1
gpgcheck = 0
sslverify = 1
sslcacert = %(ca_cert_dir)sredhat-uep.pem
sslclientkey = /etc/pki/entitlement/3186737919124279956-key.pem
sslclientcert = /etc/pki/entitlement/3186737919124279956.pem
metadata_expire = 1
enabled_metadata = 1
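
A quick way to confirm this kind of rewrite is to list which host each baseurl in the repo file points at. The following is only a minimal sketch and was not part of the failing job; the function name `repos_not_using` is hypothetical. It assumes the file parses as INI, which matches the entries shown above.

```python
import configparser
from urllib.parse import urlparse


def repos_not_using(repo_text, expected_host):
    """Return {repo_id: host} for repo sections whose baseurl host differs from expected_host."""
    # interpolation=None so that literal "%" values such as
    # "%(ca_cert_dir)sredhat-uep.pem" never trigger configparser
    # interpolation errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    mismatches = {}
    for section in parser.sections():
        baseurl = parser.get(section, "baseurl", fallback="")
        host = urlparse(baseurl).hostname  # None when baseurl is missing
        if host != expected_host:
            mismatches[section] = host
    return mismatches
```

Feeding it the contents of /etc/yum.repos.d/redhat.repo with "titan98.lab.eng.tlv2.redhat.com" as the expected host should return an empty dict before the rewrite, and the cdn.redhat.com entries after it.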



Version-Release number of selected component (if applicable): RHOS-17.0-RHEL-9-20220401.n.1


How reproducible: Every time


Steps to Reproduce:
1. Execute Phase 3 job DFG-df-deployment-17.0-virthost-3cont_2comp_3ceph-ceph-ipv4-geneve-satellite-local-registry

Actual results: Undercloud deploy fails due to not being able to install rpms.


Expected results: Undercloud successfully deploys


Additional info:

Comment 1 David Rosenfeld 2022-04-13 14:39:16 UTC

Logs to a failing job:

https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-17.0-virthost-3cont_2comp_3ceph-ceph-ipv4-geneve-satellite-local-registry/24/

Comment 4 Yaniv Kaul 2022-05-11 13:04:23 UTC

Any updates? Did you get the reproducer?

Comment 5 Cédric Jeanneret 2022-05-16 06:18:03 UTC

Hello Yaniv,

The depends-on bug is ON_QA, so David should be able to give it a try - hopefully that other issue will fix the actual one, at least it sounds like a similar thing.

@drosenfe any chance to get the satellite run with openstack-tripleo-heat-templates-14.3.1-0.20220428004525.935261d.el8ost (or the equivalent for el9 - probably same version/build)?

Cheers,

C.

Comment 6 David Rosenfeld 2022-05-16 12:21:06 UTC

Problem was still seen in Phase 3 regression of RHOS-17.0-RHEL-9-20220511.n.1 which contains:

[stack@undercloud-0 ~]$ yum list installed | grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates.noarch       14.3.1-0.20220506221655.7b9b4ef.el9ost   @rhelosp-17.0

Comment 7 Cédric Jeanneret 2022-06-03 11:13:59 UTC

Some more info:
- David could get a job where he registered the node against Satellite, but stopped the UC deploy (killed script)
- after about 1h, the repositories were untouched (i.e. timestamp stayed the same, as well as content)
This points to an issue within OSP itself, not an external source.

In the meantime, I've checked within tripleo-heat-templates. While the "roles_data_undercloud.yaml" lists the OS::TripleO::Services::Rhsm service, it's apparently nullified in another file:
overcloud-resource-registry-puppet.j2.yaml:  OS::TripleO::Services::Rhsm: OS::Heat::None

We can also see it's not listed in the active services of the generated files for the deploy, such as external_deploy_steps_tasks.yaml and external_deploy_steps_tasks_step1.yaml. Though we can see that service being present in the OC container prepare section, I really doubt that could do anything weird on the system. It would be really good to get an actual reproducer outside of Jenkins, so that we can run, re-run, put breakpoints and so on.

I'll try to deploy a satellite on my own lab, though it would require "some" resources, and I'll have to configure it. @drosenfe iirc you have some kind of script for a satellite configuration/bootstrap, would you be able to share it? I'd run it in a VM on my own builder, and use it with a 1-undercloud layout (since the UC fails - no need to get more nodes to reproduce afaik).

Sorry for taking so long, that issue isn't easy to squash :/.

Comment 8 David Rosenfeld 2022-06-03 12:51:45 UTC

I don't have a script to set up a satellite. You are welcome to point to my satellite if that helps. I can also run the job in my testbed and let you ssh to the undercloud to watch what is happening. You may be able to correlate what undercloud deploy is doing when the repos file changes. I think that would be a lot easier than building your own satellite.

This is the test I did to try to get more information:

- submit the satellite jenkins job

- saw that it failed 22 minutes into the undercloud stage due to the /etc/yum.repos.d/redhat.repo being rewritten to include cdn instead of the satellite server

Then did this:

- submit the satellite jenkins job again

- waited until the undercloud stage started.

- 10 minutes into the undercloud stage the script undercloud_deploy.sh is executed to install the undercloud

- immediately killed that script

- watched the time stamp of the /etc/yum.repos.d/redhat.repo file for the next hour

- after one hour the /etc/yum.repos.d/redhat.repo timestamp had not changed. In addition the file still contained the url of the satellite server and not cdn

My conclusion is that something during undercloud install is causing the /etc/yum.repos.d/redhat.repo file to be rewritten to include cdn instead of the satellite address.
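
For anyone repeating this check, the manual watching of the timestamp could be scripted. This is a rough sketch, not what was run here; `snapshot` and `watch` are hypothetical helper names, and polling every few seconds is an arbitrary choice. Recording the time of the change makes it easier to correlate with what the undercloud deploy was doing at that moment.

```python
import hashlib
import os
import time


def snapshot(path):
    """Return (mtime in ns, sha256 hex digest) for the file at path."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return os.stat(path).st_mtime_ns, digest


def watch(path, timeout_s, interval_s=5.0):
    """Poll path until its timestamp or content changes.

    Returns the time.time() value at which the change was noticed,
    or None if nothing changed within timeout_s seconds.
    """
    baseline = snapshot(path)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        time.sleep(interval_s)
        if snapshot(path) != baseline:
            return time.time()
    return None
```

Calling `watch("/etc/yum.repos.d/redhat.repo", timeout_s=3600)` would cover the same one-hour window as the test above.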

Comment 10 Cédric Jeanneret 2022-06-09 13:12:23 UTC

After a long delay, here are some data!

- deploying OSP-17 on el9
- using our QE Satellite host instead of Red Hat CDN

As soon as I get a container running with the /run:/run mount, dnf switches the sourcelist to the CDN in the redhat.repo, because "Subscription Manager is operating in container mode."

This happens with any (clean, makecache, install, anything!!) dnf command, and ends in crashing the OSP deploy.

After some more testing, removing all containers and cleaning the /run/.containerenv stops this unwanted behavior.

After even some more poking, it seems we have an enabled plugin in dnf: libdnf-plugin-subscription-manager-1.29.26-3.el9_0.x86_64.

Disabling it seems to stop updating the redhat.repo file.

I'm wondering if we're not hitting a bug here, since subscription-manager should really check its configuration in /etc/rhsm/rhsm.conf instead of blindly overriding things based on some (wrong) assumption.

WDYT?


PS: I've also commented on #2058540 to raise awareness.

Comment 11 Cédric Jeanneret 2022-06-09 14:19:06 UTC

A proposal: https://review.opendev.org/c/openstack/ansible-role-redhat-subscription/+/845276

Though I'm really, really unhappy with that. IMHO, things should be fixed here: https://bugzilla.redhat.com/show_bug.cgi?id=2095316

Comment 12 Cédric Jeanneret 2022-06-10 12:29:04 UTC

The proposed review won't fix the UC. Needs some more digging and testing.

I also took some time to see how subscription-manager actually works.

It leads to those blocks:

https://github.com/candlepin/subscription-manager/blob/385d64843affed7b58c2fb461612cf05f1dac4e5/src/rhsm/config.py#L102-L127
this one shows how it checks if we're in a container or not. In our case, it will return True even from the host since the /run/.containerenv is present (due to /run being bind-mounted in all containers)

https://github.com/candlepin/subscription-manager/blob/385d64843affed7b58c2fb461612cf05f1dac4e5/src/rhsm/config.py#L384-L387
this one shows why we get the redhat.repo overridden: it searches for the rhsm.conf in a location that doesn't exist on the host, /etc/rhsm-host, and therefore falls back to the default cdn.redhat.com

Maybe a proper patch is to just create a symlink, /etc/rhsm-host -> /etc/rhsm - and we're good. I'm running some tests right now, I should get some results soonish.
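
To make the reasoning easier to follow, here is a simplified model of the lookup described in this comment, together with the symlink idea. It is a sketch under assumptions, not subscription-manager's code and not necessarily what the final patch does: `effective_rhsm_dir` and `link_rhsm_host` are hypothetical names, and the `root` parameter exists only so the logic can be tried in a scratch directory instead of on a real host.

```python
import os


def effective_rhsm_dir(root="/"):
    """Simplified model of the config lookup (not subscription-manager's code).

    If <root>/run/.containerenv exists, the config is looked up in
    <root>/etc/rhsm-host, otherwise in <root>/etc/rhsm.
    Returns (chosen directory, whether rhsm.conf exists there).
    """
    in_container = os.path.exists(os.path.join(root, "run", ".containerenv"))
    name = "rhsm-host" if in_container else "rhsm"
    conf_dir = os.path.join(root, "etc", name)
    return conf_dir, os.path.exists(os.path.join(conf_dir, "rhsm.conf"))


def link_rhsm_host(root="/"):
    """Create <root>/etc/rhsm-host -> rhsm unless something already exists there."""
    link = os.path.join(root, "etc", "rhsm-host")
    if not os.path.lexists(link):
        # Relative target: resolves to <root>/etc/rhsm next to the link.
        os.symlink("rhsm", link)
    return link
```

In this model, a host with a leaked /run/.containerenv and no /etc/rhsm-host finds no rhsm.conf, which corresponds to the fallback to the default CDN; once the link is in place the same lookup reaches the real /etc/rhsm/rhsm.conf.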

Comment 13 Cédric Jeanneret 2022-06-10 13:22:22 UTC

After some more testing and poking, it seems this patch is an actual fix/workaround that actually works: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/845353

Comment 14 Cédric Jeanneret 2022-06-16 06:57:16 UTC

Wallaby backport has some issues: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/845674

Will keep an eye on it.

Comment 15 Cédric Jeanneret 2022-06-16 11:38:28 UTC

An actual fix in podman is also in place, and we're looking to backport it in bug 2097694

We therefore may be able to revert the workaround I implemented - that would be really good imho.

Comment 16 Cédric Jeanneret 2022-06-23 09:12:45 UTC

stable/wallaby merged. Should be included in the next sync!

Comment 19 David Rosenfeld 2022-07-06 14:10:59 UTC

Undercloud deploy successful during Phase 3 regression of RHOS-17.0-RHEL-9-20220701.n.1

Comment 24 errata-xmlrpc 2022-09-21 12:20:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543
