Description of problem:
Deploying the undercloud using a Satellite server as the rpm source fails. When the undercloud deploy starts, yum updates are successful. However, midway through the deployment the undercloud can no longer install rpms.

This is seen in the undercloud deploy log:

Error: Execution of '/bin/dnf -d 0 -e 1 -y install net-snmp' returned 1: Error: Failed to download metadata for repo 'Default_Organization_OSP17_0_RPM_rhelosp-17_0-image-build-override': Cannot download repomd.xml: Curl error (77): Problem with the SSL CA cert (path? access rights?) for https://cdn.redhat.com/Default_Organization/Library/custom/OSP17_0_RPM/rhelosp-17_0-image-build-override/repodata/repomd.xml [error setting certificate file: %(ca_cert_dir)sredhat-uep.pem]\n<13>Apr 13 09:47:03 puppet-user: Error: /Stage[main]/Snmp/Package[snmpd]/ensure: change from 'purged' to 'present' failed:

It's looking for cdn.redhat.com when it should be looking for titan98.lab.eng.tlv2.redhat.com.

Looking at the repo file when the deployment starts:

ls -l /etc/yum.repos.d
total 24
-rw-r--r--. 1 root root 20761 Apr 13 09:25 redhat.repo

All entries in the redhat.repo file correctly show the Satellite server (titan98) as the rpm source. Example of one entry:

[Default_Organization_OSP17_0_RPM_rhelosp-17_0-image-build-override]
name = rhelosp-17.0-image-build-override
baseurl = https://titan98.lab.eng.tlv2.redhat.com/pulp/repos/Default_Organization/Library/custom/OSP17_0_RPM/rhelosp-17_0-image-build-override
enabled = 1
gpgcheck = 0
sslverify = 1
sslcacert = /etc/rhsm/ca/katello-server-ca.pem
sslclientkey = /etc/pki/entitlement/3186737919124279956-key.pem
sslclientcert = /etc/pki/entitlement/3186737919124279956.pem
metadata_expire = 1
enabled_metadata = 1

Once the undercloud deployment has failed, the repo entries have been changed to reference the CDN, and rpms are no longer found.

ls -l /etc/yum.repos.d
total 20
-rw-r--r--. 1 root root 19375 Apr 13 09:47 redhat.repo

[Default_Organization_OSP17_0_RPM_rhelosp-17_0-image-build-override]
name = rhelosp-17.0-image-build-override
baseurl = https://cdn.redhat.com/Default_Organization/Library/custom/OSP17_0_RPM/rhelosp-17_0-image-build-override
enabled = 1
gpgcheck = 0
sslverify = 1
sslcacert = %(ca_cert_dir)sredhat-uep.pem
sslclientkey = /etc/pki/entitlement/3186737919124279956-key.pem
sslclientcert = /etc/pki/entitlement/3186737919124279956.pem
metadata_expire = 1
enabled_metadata = 1

Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220401.n.1

How reproducible:
Every time

Steps to Reproduce:
1. Execute Phase 3 job DFG-df-deployment-17.0-virthost-3cont_2comp_3ceph-ceph-ipv4-geneve-satellite-local-registry

Actual results:
Undercloud deploy fails due to not being able to install rpms.

Expected results:
Undercloud successfully deploys.

Additional info:
Logs to a failing job: https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-17.0-virthost-3cont_2comp_3ceph-ceph-ipv4-geneve-satellite-local-registry/24/
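The symptom in the description (redhat.repo entries switching from the Satellite host to cdn.redhat.com) can be checked mechanically. The following is a minimal sketch, not part of any OSP or subscription-manager tooling: the function name repos_not_using_host and the sample hostnames are made up for illustration. It parses repo file text and lists enabled repos whose baseurl does not point at the expected host. Interpolation is disabled because values like "%(ca_cert_dir)sredhat-uep.pem" would otherwise make Python's configparser raise an error.

```python
import configparser
from urllib.parse import urlparse


def repos_not_using_host(repo_text, expected_host):
    """Return {repo_id: baseurl_host} for enabled repos not served by expected_host."""
    # interpolation=None: read "%(ca_cert_dir)s..." literally instead of expanding it.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    mismatches = {}
    for repo_id in parser.sections():
        section = parser[repo_id]
        # Treat a missing "enabled" key as enabled; skip explicitly disabled repos.
        if section.get("enabled", "1").strip() != "1":
            continue
        host = urlparse(section.get("baseurl", "")).hostname
        if host != expected_host:
            mismatches[repo_id] = host
    return mismatches


# Hypothetical sample data shaped like the entries in this report.
SAMPLE = """\
[good_repo]
baseurl = https://satellite.example.com/pulp/repos/Org/Library/custom/good
enabled = 1
sslcacert = /etc/rhsm/ca/katello-server-ca.pem

[rewritten_repo]
baseurl = https://cdn.redhat.com/Org/Library/custom/rewritten
enabled = 1
sslcacert = %(ca_cert_dir)sredhat-uep.pem
"""

if __name__ == "__main__":
    for repo_id, host in repos_not_using_host(SAMPLE, "satellite.example.com").items():
        print(f"{repo_id}: baseurl points at {host}")
```

On a real node, the text would come from /etc/yum.repos.d/redhat.repo and the expected host would be the Satellite server (titan98.lab.eng.tlv2.redhat.com in this report).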
Any updates? Did you get the reproducer?
Hello Yaniv,

The depends-on bug is ON_QA, so David should be able to give it a try. Hopefully the fix for that other issue will also fix this one; at least it sounds like a similar thing.

@drosenfe any chance to get the satellite run with openstack-tripleo-heat-templates-14.3.1-0.20220428004525.935261d.el8ost (or the equivalent for el9, probably the same version/build)?

Cheers,

C.
Problem was still seen in Phase 3 regression of RHOS-17.0-RHEL-9-20220511.n.1, which contains:

[stack@undercloud-0 ~]$ yum list installed | grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates.noarch 14.3.1-0.20220506221655.7b9b4ef.el9ost @rhelosp-17.0
Some more info:
- David could get a job where he registered the node against Satellite, but stopped the UC deploy (killed the script)
- after about 1h, the repositories were untouched (i.e. the timestamp stayed the same, as well as the content)

This points to an issue within OSP itself, not an external source.

In the meantime, I've checked within tripleo-heat-templates. While "roles_data_undercloud.yaml" lists the OS::TripleO::Services::Rhsm service, it's apparently nullified in another file:

overcloud-resource-registry-puppet.j2.yaml: OS::TripleO::Services::Rhsm: OS::Heat::None

We can also see it's not listed in the active services of the generated files for the deploy, such as external_deploy_steps_tasks.yaml and external_deploy_steps_tasks_step1.yaml. Though we can see that service being present in the OC container prepare section, I really doubt that could do anything weird on the system.

It would be really good to get an actual reproducer outside of Jenkins, so that we can run, re-run, put breakpoints and so on. I'll try to deploy a satellite in my own lab, though it would require "some" resources, and I'll have to configure it.

@drosenfe iirc you have some kind of script for a satellite configuration/bootstrap, would you be able to share it? I'd run it in a VM on my own builder, and use it with a 1-undercloud layout (since the UC fails, there's no need to get more nodes to reproduce afaik).

Sorry for taking so long, that issue isn't easy to squash :/.
I don't have a script to set up a satellite. You are welcome to point to my satellite if that helps. I can also run the job in my testbed and let you ssh to the undercloud to watch what is happening. You may be able to correlate what undercloud deploy is doing when the repo file changes. I think that would be a lot easier than building your own satellite.

This is the test I did to try to get more information:
- submitted the satellite jenkins job
- saw that it failed 22 minutes into the undercloud stage due to /etc/yum.repos.d/redhat.repo being rewritten to reference cdn instead of the satellite server

Then did this:
- submitted the satellite jenkins job again
- waited until the undercloud stage started
- 10 minutes into the undercloud stage, the script undercloud_deploy.sh is executed to install the undercloud
- immediately killed that script
- watched the timestamp of the /etc/yum.repos.d/redhat.repo file for the next hour
- after one hour the /etc/yum.repos.d/redhat.repo timestamp had not changed. In addition, the file still contained the url of the satellite server and not cdn

My conclusion is that something during undercloud install is causing the /etc/yum.repos.d/redhat.repo file to be rewritten to reference cdn instead of the satellite address.
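The "watch the timestamp for an hour" step above can be automated, which also makes it easier to correlate the rewrite with whatever the deploy is doing at that moment. This is a minimal sketch under assumptions: the helper names snapshot and watch_for_rewrite are invented here, and polling is only one possible approach. It compares both the mtime and a content hash, so a rewrite is noticed even if the timestamp resolution is coarse.

```python
import hashlib
import os
import time


def snapshot(path):
    """Return (mtime_ns, sha256 hex digest) for path, or (None, None) if it is missing."""
    try:
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        return os.stat(path).st_mtime_ns, digest
    except FileNotFoundError:
        # The file may briefly disappear if it is replaced rather than edited in place.
        return None, None


def watch_for_rewrite(path, duration_s, interval_s=5.0):
    """Poll path until it changes or duration_s elapses.

    Returns the number of seconds after which a change was first seen,
    or None if the file stayed identical for the whole duration.
    """
    baseline = snapshot(path)
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if snapshot(path) != baseline:
            return elapsed
        if elapsed >= duration_s:
            return None
        time.sleep(interval_s)
```

For the experiment described here, a call such as watch_for_rewrite("/etc/yum.repos.d/redhat.repo", 3600) would return None in the "script killed" run and a number of seconds in a run where the deploy rewrites the file.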
After a long delay, here are some data!
- deploying OSP-17 on el9
- using our QE Satellite host instead of Red Hat CDN

As soon as I get a container running with the /run:/run mount, dnf switches the sourcelist to the CDN in the redhat.repo, because "Subscription Manager is operating in container mode." This happens with any (clean, makecache, install, anything!!) dnf command, and ends up crashing the OSP deploy.

After some more testing, removing all containers and cleaning the /run/.containerenv stops this unwanted behavior.

After even some more poking, it seems we have an enabled plugin in dnf: libdnf-plugin-subscription-manager-1.29.26-3.el9_0.x86_64. Disabling it seems to stop the redhat.repo file from being updated.

I'm wondering if we're not hitting a bug here, since subscription-manager should really check its configuration in /etc/rhsm/rhsm.conf instead of blindly overriding things based on some (wrong) assumption.

WDYT?

PS: I've also commented on #2058540 to raise awareness.
A proposal: https://review.opendev.org/c/openstack/ansible-role-redhat-subscription/+/845276

Though I'm really, really unhappy with that. IMHO, things should be fixed here: https://bugzilla.redhat.com/show_bug.cgi?id=2095316
The proposed review won't fix the UC. It needs some more digging and testing.

I also took some time to see how subscription-manager actually works. It leads to these blocks:

https://github.com/candlepin/subscription-manager/blob/385d64843affed7b58c2fb461612cf05f1dac4e5/src/rhsm/config.py#L102-L127
This one shows how it checks whether we're in a container or not. In our case, it will return True even from the host, since /run/.containerenv is present (due to /run being bind-mounted in all containers).

https://github.com/candlepin/subscription-manager/blob/385d64843affed7b58c2fb461612cf05f1dac4e5/src/rhsm/config.py#L384-L387
This one shows why we get the redhat.repo overridden: it searches for rhsm.conf in a location that doesn't exist on the host, /etc/rhsm-host, and therefore falls back to the default cdn.redhat.com.

Maybe a proper patch is to just create a symlink, /etc/rhsm-host -> /etc/rhsm, and we're good. I'm running some tests right now, I should get some results soonish.
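The chain of events described in this comment can be modelled in a few lines. This is a simplified sketch of the behaviour as described above, not the upstream subscription-manager code: the function names, the "satellite.example.com" hostname, and the reduction of the container check to a single marker file are all assumptions made for illustration. Paths are resolved under a throwaway root directory so the model can run without touching a real system.

```python
import configparser
import os
import tempfile

DEFAULT_HOSTNAME = "cdn.redhat.com"  # fallback named in this report


def looks_like_container(root):
    """Model of the marker check: a present run/.containerenv means 'container mode'."""
    return os.path.exists(os.path.join(root, "run", ".containerenv"))


def effective_hostname(root):
    """Read [server] hostname from the selected rhsm.conf, else use the default."""
    conf_dir = "rhsm-host" if looks_like_container(root) else "rhsm"
    conf_path = os.path.join(root, "etc", conf_dir, "rhsm.conf")
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(conf_path)  # a missing file is silently ignored by configparser
    return parser.get("server", "hostname", fallback=DEFAULT_HOSTNAME)


def demo():
    """Walk through: clean host, marker file appears, symlink workaround applied."""
    results = []
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "etc", "rhsm"))
        os.makedirs(os.path.join(root, "run"))
        with open(os.path.join(root, "etc", "rhsm", "rhsm.conf"), "w") as handle:
            handle.write("[server]\nhostname = satellite.example.com\n")
        results.append(effective_hostname(root))  # host config is used

        # A container with /run bind-mounted leaves the marker on the host.
        open(os.path.join(root, "run", ".containerenv"), "w").close()
        results.append(effective_hostname(root))  # etc/rhsm-host is missing

        # Workaround idea from this comment: etc/rhsm-host -> etc/rhsm
        os.symlink("rhsm", os.path.join(root, "etc", "rhsm-host"))
        results.append(effective_hostname(root))  # host config is found again
    return results


if __name__ == "__main__":
    print(demo())
```

In this model, the hostname flips to the CDN default once the marker file exists and returns to the Satellite host once the symlink is in place, which matches the reasoning behind the proposed symlink.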
After some more testing and poking, it seems this patch is a fix/workaround that actually works: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/845353
Wallaby backport has some issues: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/845674 Will keep an eye on it.
An actual fix in podman is also in place, and we're looking to backport it in bug 2097694.

We may therefore be able to revert the workaround I implemented, which would be really good imho.
stable/wallaby merged. Should be included in the next sync!
Undercloud deploy successful during Phase 3 regression of RHOS-17.0-RHEL-9-20220701.n.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543