Bug 2172507
| Summary: | Loss of connectivity with controllers after doing an undercloud restore | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Fernando Díaz <fdiazbra> |
| Component: | tripleo-ansible | Assignee: | Carlos Camacho <ccamacho> |
| Status: | CLOSED WORKSFORME | QA Contact: | Joe H. Rahme <jhakimra> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 17.1 (Wallaby) | CC: | ayefimov, ccamacho, hjensas, jpretori, sbaker |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | Flags: | fdiazbra: needinfo-, fdiazbra: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-24 11:19:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Fernando Díaz
2023-02-22 11:50:20 UTC
You are destroying the undercloud VM set up by the infrared virsh plugin and re-creating the undercloud VM without using the virsh plugin, I guess in one of these playbooks?

* https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-undercloud/restore.yml#L97-L115
* https://github.com/redhat-openstack/infrared/blob/55ba05ca0d9f5aca6f605816da02dac053537254/plugins/tripleo-undercloud/restore_containerized.yml

The virsh plugin does it in a similar way, but there are many things that can differ based on options:

* https://github.com/redhat-openstack/infrared/blob/master/plugins/virsh/tasks/vms_2_install.yml#L37-L78

For example:

```
{% if provision.bootmode == 'uefi' %}
--boot {{ 'hd' if topology_node.deploy_os|default(True) else 'uefi' }} \
{% else %}

{%- if interface.model is defined and interface.model %},model={{ interface.model }}{% endif %}

{% if topology_node.machine_type is defined and topology_node.machine_type %}
--machine {{ topology_node.machine_type }} \
{% endif %}

--os-variant {{ topology_node.os.variant }} \
```

I think something is different, i.e. the hardware the undercloud sees is different, and based on that the interface names are different. It is also possible the undercloud was initially installed with net.ifnames disabled? I doubt that this is a product bug; this is an issue with the infrastructure used for testing.

---

Thanks Harald for your comment, let me provide a clarification about the procedure: we are not using the tripleo-undercloud IR plugin. We use the backup and restore plugin [1], which executes the backup and restore tripleo role [2] through the openstack backup commands implemented in the CLI [3]. The backup and restore role relies on ReaR [4] to back up and restore the undercloud and controller nodes, so when we restore the undercloud node using ReaR, we expect the restored VM to have exactly the same network interfaces as in the backup image. I wonder why the interface names are changing, since the scripts that bring up the network interfaces on the restored node use the ethX naming.

[1] https://gitlab.cee.redhat.com/osp-dfg-enterprise/infrared-plugin-backup-restore
[2] https://github.com/openstack/tripleo-ansible/tree/master/tripleo_ansible/roles/backup_and_restore
[3] https://github.com/openstack/python-tripleoclient
[4] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_basic_system_settings/assembly_recovering-and-restoring-a-system_configuring-basic-system-settings

---

I can confirm that for some reason the undercloud was initially installed with net.ifnames disabled:

```
[root@undercloud-0 stack]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/vmlinuz-5.14.0-283.el9.x86_64 root=UUID=62b51192-13b0-4838-a267-e410f86ee01e console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
```

We will do further investigation on the ReaR side.
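Harald's hypothesis above — that the re-created VM presents different virtual hardware, which in turn changes the interface names — can be checked from the hypervisor. A minimal sketch, assuming the domain is named `undercloud-0` and that a copy of the pre-backup domain XML was saved (both file names are illustrative):

```shell
# On the hypervisor host: dump the definition of the restored domain
virsh dumpxml undercloud-0 > undercloud-0.restored.xml

# Diff against the XML captured before the backup; differences in the
# machine type, boot mode, or <interface model=...> elements change the
# PCI devices the guest enumerates, and with them the NIC names/order.
diff -u undercloud-0.original.xml undercloud-0.restored.xml
```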
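For context on the procedure described above: the backup_and_restore role drives ReaR, which conceptually boils down to the steps below. This is a generic ReaR sketch with illustrative values (server name, share path), not the role's actual generated configuration:

```shell
# /etc/rear/local.conf -- minimal illustrative ReaR setup:
#   OUTPUT=ISO    : build a bootable rescue ISO
#   BACKUP=NETFS  : store the file-level backup on a network share
#   BACKUP_URL    : illustrative NFS location
cat >/etc/rear/local.conf <<'EOF'
OUTPUT=ISO
BACKUP=NETFS
BACKUP_URL=nfs://backup-server/rear
EOF

# Create the rescue image and the backup in one pass
rear -d -v mkbackup

# Later, boot the node from the rescue ISO and run:
#   rear recover
# ReaR recreates the disk layout and restores the files; it does not
# control what virtual hardware the re-created VM itself presents,
# which is why identical disk contents can still see renamed NICs.
```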
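Given the confirmation that the node was installed with net.ifnames=0, a quick way to see which names the restored node actually ended up with, and what udev would call a device under predictable naming, is the following (eth0 is just an example device):

```shell
# Is predictable naming disabled on the running kernel?
grep -o 'net\.ifnames=[01]' /proc/cmdline

# Which interface names did the restored node actually end up with?
ip -brief link

# What would udev's "predictable" name for eth0 be? If the virtual
# hardware changed, this (and the eth0/eth1 ordering) changes too.
udevadm test-builtin net_id /sys/class/net/eth0
```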
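If the ReaR-side investigation shows that the restored node's kernel command line lost net.ifnames=0, one possible remediation — a sketch, not something the backup_and_restore role does — is to re-add it with grubby so the ethX-based interface scripts match the device names again:

```shell
# Re-disable predictable interface naming for all installed kernels
grubby --update-kernel=ALL --args="net.ifnames=0"

# Confirm the argument is present, then reboot to apply it
grubby --info=ALL | grep net.ifnames
reboot
```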