Bug 2172507 - Loss of connectivity with controllers after doing an undercloud restore
Summary: Loss of connectivity with controllers after doing an undercloud restore
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Carlos Camacho
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-02-22 11:50 UTC by Fernando Díaz
Modified: 2023-08-09 10:11 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-24 11:19:27 UTC
Target Upstream Version:
Embargoed:
fdiazbra: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-22596 0 None None None 2023-02-22 11:51:31 UTC

Description Fernando Díaz 2023-02-22 11:50:20 UTC
Description of problem:
After restoring the undercloud, we lose connectivity with the controller nodes because, for some reason, the network interface names change from eth0-eth1-eth2 to ensX. This change in the interface names breaks the br-ctlplane bridge, and as a result the logs show: "ssh: connect to host 192.168.24.34 port 22: No route to host".


Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230216.n.1
rhel-guest-image-9.2-20230207.8.x86_64.qcow2


How reproducible:
Executing the bnr workflow with infrared plugin

Steps to Reproduce:
1. Deploy the stack
2. Install rear and nfs: infrared backup-restore --ospversion 17.1 --setup-nfs-rear true --backup-dir /home/ctl_plane_backups
3. Execute rear backup: infrared backup-restore --ospversion 17.1 --backup-undercloud true --backup-overcloud true  --backup-dir /home/ctl_plane_backups
4. Restore the undercloud: infrared backup-restore --ospversion 17.1 --restore-undercloud true --restore-overcloud true --backup-dir /home/ctl_plane_backups

Actual results:
The change in the interface names breaks the br-ctlplane bridge, and as a result the undercloud restore logs show: "ssh: connect to host 192.168.24.34 port 22: No route to host".
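
One way to confirm which interface names referenced by /etc/os-net-config/config.yaml no longer exist on the restored node is sketched below. The helper name `missing_ifaces` is hypothetical, and the sketch assumes the config refers to interfaces by ethN names, as in this report, and that GNU grep is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of os-net-config): print each ethN name
# referenced in a config file that has no entry under a sysfs-style
# directory (defaults to /sys/class/net).
missing_ifaces() {
    local config="$1" netdir="${2:-/sys/class/net}" name
    # -o prints only the matched names; sort -u drops repeated references.
    grep -oE '\beth[0-9]+\b' "$config" | sort -u | while read -r name; do
        [ -e "$netdir/$name" ] || echo "$name"
    done
}

# Usage on the restored undercloud:
#   missing_ifaces /etc/os-net-config/config.yaml
```

Any name it prints is referenced by the config but absent from the running system, which would be consistent with the bridge losing its member interface.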

Expected results:
Restoration is successful

Additional info:
As a workaround, the networking problem can be fixed with the os-net-config utility:
sed -i 's/eth0/ens3/g' /etc/os-net-config/config.yaml;
os-net-config -c /etc/os-net-config/config.yaml
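
A slightly more careful variant of the workaround above, offered only as a sketch: it keeps a backup of the file and matches whole interface names, so that `eth0` does not also rewrite a name such as `eth01`. The helper name `rename_iface` is hypothetical, GNU sed is assumed, and the eth0 to ens3 mapping is taken from the workaround; other interfaces would need their own mapping, checked on the restored node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of os-net-config): replace one interface
# name with another in a config file, keeping a .bak copy of the original.
rename_iface() {
    local file="$1" old="$2" new="$3"
    cp -p -- "$file" "$file.bak" || return 1
    # \b (GNU sed) restricts the match to the whole interface name.
    sed -i "s/\b${old}\b/${new}/g" -- "$file"
}

# Usage on the restored undercloud (the mapping is an assumption;
# verify the actual names with `ip -br link` first):
#   rename_iface /etc/os-net-config/config.yaml eth0 ens3
#   os-net-config -c /etc/os-net-config/config.yaml
```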

Comment 1 Harald Jensås 2023-02-24 09:15:49 UTC
You are destroying the undercloud VM set up by the infrared virsh plugin and re-creating it without using the virsh plugin, I guess in one of these playbooks?
* https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-undercloud/restore.yml#L97-L115
* https://github.com/redhat-openstack/infrared/blob/55ba05ca0d9f5aca6f605816da02dac053537254/plugins/tripleo-undercloud/restore_containerized.yml

The virsh plugin does it in a similar way, but there are many things that can differ based on options.
* https://github.com/redhat-openstack/infrared/blob/master/plugins/virsh/tasks/vms_2_install.yml#L37-L78

For example:
          {% if provision.bootmode == 'uefi' %}
          --boot {{ 'hd' if topology_node.deploy_os|default(True) else 'uefi' }} \
          {% else %}

          {%- if interface.model is defined and interface.model %},model={{ interface.model }}{% endif %}

          {% if topology_node.machine_type is defined and topology_node.machine_type %}
          --machine {{ topology_node.machine_type }} \
          {% endif %}

          --os-variant {{ topology_node.os.variant }} \

I think something is different, i.e., the hardware the undercloud sees is different, and based on that the interface names are different.

Is it also possible that the undercloud was initially installed with net.ifnames disabled?


I doubt that this is a product bug; this is an issue with the infrastructure used for testing.

Comment 2 Fernando Díaz 2023-03-01 15:28:41 UTC
Thanks, Harald, for your comment. Let me clarify the procedure:
We are not using the tripleo-undercloud IR plugin. We use the backup and restore plugin [1], which executes the backup and restore tripleo role [2] using the openstack backup commands that are implemented in the CLI [3].
The backup and restore role relies on ReaR [4] to back up and restore the undercloud and controller nodes, so when we restore the undercloud node using ReaR, we expect to get exactly the same VM, with the same network interfaces, as in the backup image. I wonder why the interface names are changing, since the scripts that enable the network interfaces on the restored node use the ethX naming.

[1] https://gitlab.cee.redhat.com/osp-dfg-enterprise/infrared-plugin-backup-restore
[2] https://github.com/openstack/tripleo-ansible/tree/master/tripleo_ansible/roles/backup_and_restore
[3] https://github.com/openstack/python-tripleoclient
[4] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_basic_system_settings/assembly_recovering-and-restoring-a-system_configuring-basic-system-settings

Comment 3 Fernando Díaz 2023-03-02 13:12:02 UTC
I can confirm that, for some reason, the undercloud was initially installed with net.ifnames disabled:

[root@undercloud-0 stack]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/vmlinuz-5.14.0-283.el9.x86_64 root=UUID=62b51192-13b0-4838-a267-e410f86ee01e console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
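
One check that might help narrow this down is to test whether a given kernel command line carries `net.ifnames=0`, and to run it on both the original and the restored node. This is a sketch only: the helper name `ifnames_disabled` is hypothetical, and the report does not confirm what the restored node's command line contains.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given kernel command line disables
# predictable interface names via net.ifnames=0.
ifnames_disabled() {
    # Pad with spaces so only a complete argument matches.
    case " $1 " in
        *" net.ifnames=0 "*) return 0 ;;
    esac
    return 1
}

# Usage: run on the original and on the restored undercloud and compare.
#   ifnames_disabled "$(cat /proc/cmdline)" && echo "net.ifnames=0 is set"
```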

Comment 7 Carlos Camacho 2023-03-24 11:19:27 UTC
We will investigate further on the ReaR side.

