Bug 2048134 - tripleo-bootstrap requires connectivity before networking is set up
Summary: tripleo-bootstrap requires connectivity before networking is set up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: low
Target Milestone: beta
Target Release: 17.0
Assignee: Harald Jensås
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-29 17:29 UTC by François Rigault
Modified: 2022-09-21 12:19 UTC (History)

Fixed In Version: tripleo-ansible-3.3.1-0.20220218230348.f9786b1.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:18:51 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 827383 0 None MERGED tripleo-bootstrap - check packeges fact befor install 2022-02-15 07:04:18 UTC
OpenStack gerrit 827567 0 None MERGED tuned - check packages before install 2022-02-15 07:04:15 UTC
Red Hat Issue Tracker OSP-12370 0 None None None 2022-01-29 17:34:07 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:19:15 UTC

Description François Rigault 2022-01-29 17:29:07 UTC
Description of problem:
Deploying the stack fails in "Deploy network-scripts required for deprecated network service" because the yum repositories are not accessible.
My situation is:
- some yum repositories have been added at start-up (e.g. through cloud-init or images newly built with dib)
- the ctlplane network is not routed


Version-Release number of selected component (if applicable):
tripleo-ansible-0.7.1-2.20210603175844.el8ost.9.noarch


How reproducible:
Always.


Steps to Reproduce:
1. use a ctlplane network that is not routed (server does not have access to the yum repository using its IP on the ctlplane network)
2. add some yum repositories in the base image
3. try to deploy

Actual results:
fails

Expected results:
it should not fail when network-scripts is already installed


Additional info:
code looks like this:
~~
- name: Deploy and enable network service
  become: true
  when:
    - (tripleo_bootstrap_legacy_network_packages | length) > 0
  block:
    - name: Deploy network-scripts required for deprecated network service
      package:
        name: "{{ tripleo_bootstrap_legacy_network_packages }}"
        state: present
~~

tripleo_bootstrap_legacy_network_packages is non-empty only on fedora and redhat-8.

The "package: state: present" task triggers a dnf metadata update even though the network is not configured yet. Setting skip_if_unavailable=true in the repos can be done as a workaround, but then it needs to be set back to false after the network is set up, and this workaround does not work if the repos contain mirrors and these mirrors cannot be resolved.

Can it be made so that
- either the package task is removed completely: tripleo would assume the package is already part of the base image
- or tripleo makes an extra effort not to use the package task, e.g. by running something like

~~~
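  # rpm -q only reads the local RPM database, so it needs no network access
  - name: Check whether the legacy network package is already installed
    command: "rpm -q {{ tripleo_bootstrap_legacy_network_packages[0] }}"
    failed_when: false
    changed_when: false
    register: res
  # Only reach for the package manager when the query reported a missing package
  - name: Install the legacy network packages only when missing
    package:
      name: "{{ tripleo_bootstrap_legacy_network_packages }}"
      state: present
    when: res.rc != 0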
~~~
so that the package task is not called, in a "best effort" mode.
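
The same best-effort check can be sketched in plain shell, assuming an RPM-based host. The function name `missing_packages` and the `RPM_QUERY` override are hypothetical names introduced only for this illustration; they are not part of tripleo-ansible, and this is not the approach taken by the merged patch.

```shell
#!/bin/sh
# Print every package from the argument list that is NOT installed.
# RPM_QUERY is a hypothetical override so the check can be exercised
# without a real RPM database; it defaults to "rpm -q".
missing_packages() {
    for pkg in "$@"; do
        # rpm -q exits non-zero when the package is not installed and
        # only reads the local RPM database, so it works without networking.
        if ! ${RPM_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1; then
            printf '%s\n' "$pkg"
        fi
    done
}

# Example usage (commented out because it needs root and dnf):
# to_install=$(missing_packages network-scripts)
# [ -z "$to_install" ] || dnf install -y $to_install
```

With this shape, the package manager (and therefore the repository metadata download) is only reached when at least one package is actually missing.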

Comment 2 Harald Jensås 2022-02-02 02:10:22 UTC
I tried to reproduce this, but `package` did not fail for me with the node isolated.
See details on the reproducer I attempted below.

I proposed a patch: https://review.opendev.org/c/openstack/tripleo-ansible/+/827383

@François, can you test the patch in your environment to verify that it solved your issue?

Thank you!


---------------------------------------------------------------
$ ip route del default

$ dnf info network-scripts
delorean-openstack-ironic-python-agent-builder-a08dcb4c36ee464ffc0825f770686a4 0.0  B/s |   0  B     00:00
Errors during downloading metadata for repository 'delorean-component-baremetal':
  - Curl error (6): Couldn't resolve host name for https://trunk.rdoproject.org/centos8/component/baremetal/a0/8d/a08dcb4c36ee464ffc0825f770686a47ed86c570_a36069e8/repodata/repomd.xml [Could not resolve host: trunk.rdoproject.org]
Error: Failed to download metadata for repo 'delorean-component-baremetal': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

$ rpm -q network-scripts
network-scripts-10.00.15-1.el8.x86_64

Ran this playbook:
---
- name: Reproduce RHBZ#2048134
  hosts: localhost
  gather_facts: false
  vars:
    tripleo_bootstrap_legacy_network_packages:
      - network-scripts
  tasks:
  - name: Install package
    become: true
    package:
      name: "{{ tripleo_bootstrap_legacy_network_packages }}"
      state: present


$ ansible-playbook reproducer.yaml
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'

PLAY [Reproduce RHBZ#2048134] ****************************************************************************************

TASK [Install package] *****************************************************************
ok: [localhost]

PLAY RECAP *****************************************************************************
localhost  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Comment 3 François Rigault 2022-02-02 06:37:39 UTC
thanks. I will try harder, please don't merge yet :)

I am curious about the non-reproducer. I can consistently reproduce the package task failure (originally on 8-stream; I tried 9-stream and Fedora too). By any chance, if you run "ip route" at the end, is the default route still missing? The only idea I have is that a lease was renewed between the "dnf info" and the "ansible-playbook" above.
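
One way to confirm that the node is still isolated after the playbook run is to test the routing table for a default route. This is a minimal sketch; the function name `has_default_route` is hypothetical, and it only looks at the IPv4 table as printed by `ip route`.

```shell
#!/bin/sh
# Return success when the routing table text on stdin contains a
# default route. Reading stdin keeps the check easy to exercise;
# on a real host the input would come from `ip route`.
has_default_route() {
    grep -q '^default '
}

# Example usage on a live system (requires iproute2):
# ip route | has_default_route && echo "default route is back"
```

Running the check both before and after `ansible-playbook` would show whether a DHCP lease renewal restored the route in between.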

Depending on the tasks included and the version of tripleo, there are other Ansible package tasks run here and there.

--
[cloud-user@stream ~]$ sudo dnf clean all
0 files removed
[cloud-user@stream ~]$ sudo ip route del default
[cloud-user@stream ~]$ ansible-playbook play.yaml 
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'

PLAY [Reproduce RHBZ#2048134] ***************************************************************************************************************

TASK [Install package] **********************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to download metadata for repo 'baseos': Cannot prepare internal mirrorlist: Curl error (7): Couldn't connect to server for https://mirrors.centos.org/metalink?repo=centos-baseos-9-stream&arch=x86_64&protocol=https,http []", "rc": 1, "results": []}

PLAY RECAP **********************************************************************************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

Comment 4 François Rigault 2022-02-02 11:57:28 UTC
thanks for this patch.
After further investigation, one reliable way to make it work is to run in the cloud-init:

    runcmd:
    - ip route del default
    - sed -i 's/^skip_if_unavailable.*/skip_if_unavailable=true/' /etc/dnf/dnf.conf

As an alternative to skip_if_unavailable=true, the patch works for my minimal CentOS 8-stream image (I modified it to match the installed version of tripleo).

For RHEL it is _not_ working because
- the "openvswitch2.15" package is present on the system while the one being installed is called "openvswitch", and there is some specific logic around this case. I think it works by chance :) The playbook does not fail for openvswitch because of a workaround for Ceph deployments.
- when the OS::TripleO::Services::Tuned service is defined, tuned/tasks/tuned_install.yml fails due to similar logic.

I still don't understand why you weren't able to reproduce the issue.
I initially thought skip_if_unavailable was not reliable. CentOS comes with a default list of repositories; I thought I had added skip_if_unavailable to each of them, but that was not the case, so I assumed the failure came from repositories that reference mirrors. With skip_if_unavailable=true set in dnf.conf, the deployment just works. I think it is a good enough workaround, and tripleo works for me without any patch.
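
The sed call in the runcmd above only rewrites an existing skip_if_unavailable line. A slightly more defensive variant, sketched below, appends the option when it is absent. The function name `set_skip_if_unavailable` is hypothetical, and the sketch assumes the file contains only a [main] section, so that an appended line lands in [main].

```shell
#!/bin/sh
# Set skip_if_unavailable in a dnf.conf-style file: rewrite the line
# when it exists, append it otherwise. Arguments: <file> <true|false>.
# Assumption: the file has a single [main] section.
set_skip_if_unavailable() {
    conf=$1
    value=$2
    if grep -q '^skip_if_unavailable' "$conf"; then
        # Rewrite through a temporary file instead of relying on sed -i.
        sed "s/^skip_if_unavailable.*/skip_if_unavailable=$value/" "$conf" \
            > "$conf.tmp" && mv "$conf.tmp" "$conf"
    else
        printf 'skip_if_unavailable=%s\n' "$value" >> "$conf"
    fi
}

# Example usage (needs root):
# set_skip_if_unavailable /etc/dnf/dnf.conf true    # before networking is up
# set_skip_if_unavailable /etc/dnf/dnf.conf false   # once repositories are reachable
```

The second call covers the case where the option should be switched back to false after the network has been configured.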

Comment 13 errata-xmlrpc 2022-09-21 12:18:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

