Bug 1858673
| Summary: | [RHOSP 13 to 16.1 Upgrades] 'openvswitch' and 'ovs-vswitchd' failed to start after performing leapp upgrade | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | MD Sufiyan <msufiyan> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Saravanan KR <skramaja> |
| Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 16.1 (Train) | CC: | apevec, chrisw, fbaudin, fiezzi, hakhande, jamsmith, jfrancoa, jlibosva, jpretori, kthakre, mburns, rhos-maint, skramaja, spower, supadhya, tvignaud, yrachman |
| Target Milestone: | z1 | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-tripleo-heat-templates-11.3.2-0.20200616081539.396affd.el8ost | Doc Type: | Bug Fix |
| Doc Text: |
This update fixes a GRUB parameter naming convention that led to unpredictable behaviors on compute nodes during leapp upgrades.
+
Previously, the presence of the obsolete "TRIPLEO" prefix on GRUB parameters caused problems.
+
The tripleo kernel args parameter in the file /etc/default/grub now carries the "GRUB" prefix so that leapp can upgrade the file correctly. This is done by adding "upgrade_tasks" to the service "OS::TripleO::Services::BootParams", which is a new service added to all roles in the roles_data.yaml file.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-08-27 15:19:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
MD Sufiyan
2020-07-20 03:21:27 UTC
It seems the leapp upgrade did not preserve the "hugepages" and "isolcpus" kernel parameters while upgrading (building) the OS to 8.2.
After leapp upgrade of overcloud-computeovsdpdk-0 node:
~~~
[root@overcloud-computeovsdpdk-0 ~]# ls -l /dev/hugepages/
total 0
~~~
~~~
cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet
~~~
>> Required "hugepages" and "isolcpus" kernel parameters:
From one of the non-upgraded DPDK nodes, "overcloud-computeovsdpdk-1":
~~~
cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11 skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup
~~~
~~~
[root@overcloud-computeovsdpdk-1 ~]# ls -l /dev/hugepages/
total 2097152
drwxr-xr-x. 3 root root 0 Jul 18 13:56 libvirt
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 18 13:55 rtemap_0
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 18 13:55 rtemap_32768
~~~
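The comparison between the two nodes above can be sketched as a small check. This is a minimal illustration, not part of any leapp or TripleO tooling: the function name `missing_kernel_args` and the constants are hypothetical, and the required arguments are simply the ones visible on the non-upgraded node.

```python
def missing_kernel_args(cmdline, required):
    """Return the required arguments that do not appear in a kernel command line."""
    present = set(cmdline.split())
    return [arg for arg in required if arg not in present]


# Arguments copied from the non-upgraded node's /proc/cmdline shown above.
REQUIRED_ARGS = [
    "default_hugepagesz=1GB", "hugepagesz=1G", "hugepages=20",
    "iommu=pt", "intel_iommu=on", "isolcpus=2-11",
]

# Command line of the upgraded node, copied from the output shown above.
UPGRADED_CMDLINE = (
    "BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 "
    "root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 "
    "console=ttyS0,115200n8 crashkernel=auto rhgb quiet"
)

if __name__ == "__main__":
    # Lists every required argument the upgraded node booted without.
    print(missing_kernel_args(UPGRADED_CMDLINE, REQUIRED_ARGS))
```

On a live node the first argument would be the contents of /proc/cmdline instead of the hard-coded string.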
We have narrowed down the issue to "/etc/default/grub", which was missing quotes for the parameters "GRUB_CMDLINE_LINUX" and "TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS". Because of this, the newly built grub2.cfg did not consider these variables and the OS booted without the DPDK parameters.
~~~
[root@overcloud-computeovsdpdk-0 ~]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet <<<<<---- Closing quote missing after leapp upgrade
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS= default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11" <<<<<---- opening quote missing after leapp upgrade.
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"
GRUB_DISABLE_RECOVERY="true"
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT:+$GRUB_CMDLINE_LINUX_DEFAULT }\$tuned_params"
GRUB_INITRD_OVERLAY="${GRUB_INITRD_OVERLAY:+$GRUB_INITRD_OVERLAY }\$tuned_initrd"
GRUB_ENABLE_BLSCFG=true
~~~
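A file damaged in this way could be flagged before rebooting. The sketch below is only a heuristic and an assumption on my part: it reports lines with an odd number of unescaped double quotes, although a shell would accept a quoted value that legitimately spans several lines. The function name `lines_with_unbalanced_quotes` and the shortened sample are hypothetical.

```python
import re

# Matches a double quote that is not escaped by a preceding backslash.
_UNESCAPED_QUOTE = re.compile(r'(?<!\\)"')


def lines_with_unbalanced_quotes(text):
    """Return (line_number, line) pairs whose count of unescaped double quotes is odd."""
    suspicious = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments cannot carry an assignment.
        if not stripped or stripped.startswith("#"):
            continue
        if len(_UNESCAPED_QUOTE.findall(stripped)) % 2:
            suspicious.append((number, stripped))
    return suspicious


# Shortened copy of the damaged file shown above, without the annotations.
SAMPLE = '''\
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS= default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11"
GRUB_DISABLE_RECOVERY="true"
'''

if __name__ == "__main__":
    for number, line in lines_with_unbalanced_quotes(SAMPLE):
        print(number, line)
```

For the sample, the two lines reported are the GRUB_CMDLINE_LINUX and TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS assignments, matching the two quotes identified as missing.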
After adding both missing quotes in "/etc/default/grub" and rebooting, we were able to bring up both the openvswitch and ovs-vswitchd services.
~~~
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet" <<<<<---- Added closing quote
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS="default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11" <<<<<---- Added opening quote
~~~
> After reboot
~~~
[root@overcloud-computeovsdpdk-0 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11 skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup
[root@overcloud-computeovsdpdk-0 ~]# systemctl | egrep -i "ovs-vswitchd|openvswitch"
openvswitch.service loaded active exited Open vSwitch
ovs-vswitchd.service loaded active running Open vSwitch Forwarding Unit
[root@overcloud-computeovsdpdk-0 ~]# ls -l /dev/hugepages/
total 2097152
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 20 11:57 rtemap_0
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 20 11:57 rtemap_32768
[root@overcloud-computeovsdpdk-0 ~]#
~~~
We have to figure out why the quotes are missing in the first place; otherwise an additional reboot will be needed in every scenario that uses kernel arguments (e.g. SR-IOV, CPU pinning).
I see there is an open issue in leapp regarding the command line - https://github.com/oamg/leapp-repository/issues/251
Trying to see if there are any other alternatives.

I can see the file is wrongly updated because leapp expects every parameter to start with GRUB at the start of the line. leapp has built-in logic to fix errors, and that logic is what produces the wrong file format.
https://github.com/oamg/leapp-repository/blob/master/repos/system_upgrade/el7toel8/actors/addupgradebootentry/libraries/addupgradebootentry.py#L73

I have tried to test it with a parameter starting with GRUB, like below:
~~~
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"
GRUB_TRIPLEO="tripleo_test"
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX } ${GRUB_TRIPLEO} "
~~~
With this, the file is not corrupted and stays intact, but the grub parameters are not updated. The command line still has only the default kernel parameters.

Then I tried the configuration below, which modifies the existing GRUB_CMDLINE_LINUX in place to add the new parameters, and it works. So it looks like leapp is trying to read only the GRUB_CMDLINE_LINUX entry (the first occurrence?)
~~~
[root@rhel-leapp ~]# cat /etc/default/grub
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check net.ifnames=0 default_hugepagesz=1GB hugepagesz=1G hugepages=1"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true
[root@rhel-leapp ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-193.13.2.el8_2.x86_64 root=UUID=26284d49-7043-49ee-9ae0-43ec1db62953 ro console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check default_hugepagesz=1GB hugepagesz=1G hugepages=1 net.ifnames=0
[root@rhel-leapp ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
~~~

The actual solution would be to fix leapp to process the contents of /etc/default/grub and obtain the final processed value of GRUB_CMDLINE_LINUX, so that all values can be incorporated.
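Obtaining the final processed value of GRUB_CMDLINE_LINUX could look roughly like the sketch below. It is an assumption about one way to do it, not what leapp does: the hypothetical helper `effective_grub_cmdline` sources the file in a POSIX shell, which also executes any command substitutions the file contains, so it should only be run against a trusted file.

```python
import subprocess


def effective_grub_cmdline(path="/etc/default/grub"):
    """Source a GRUB defaults file in sh and return the final GRUB_CMDLINE_LINUX.

    The path should contain a slash (for example an absolute path); otherwise
    the shell's "." builtin searches PATH for the file.
    """
    # "$1" is the file path; the shell expands every ${...} reference while
    # sourcing, so the value printed is the one left after the last assignment.
    script = '. "$1"; printf "%s" "$GRUB_CMDLINE_LINUX"'
    result = subprocess.run(
        ["sh", "-c", script, "sh", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(effective_grub_cmdline())
```

The returned string could then be written back as a single literal GRUB_CMDLINE_LINUX assignment, which is the form that worked in the test above.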
A workaround can be to achieve that processing before triggering the leapp upgrade and to update GRUB_CMDLINE_LINUX in the /etc/default/grub file, so that leapp can work without any errors. It is also important to ensure that the next line coming after the GRUB_CMDLINE_LINUX entry starts with GRUB. We also need to consider the possibility of users adding their own custom entries using first-boot scripts to apply kernel args; evaluating /etc/default/grub to get the final GRUB_CMDLINE_LINUX entry will solve all of these problems.

Maybe you can add that task into this step, which runs right before the leapp upgrade: https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/tripleo-packages/tripleo-packages-baremetal-puppet.yaml#L163

Even though the kernel args can be applied using a workaround, it will still miss tuned's kernel args (for the cpu-partitioning profile).

OSP13 Compute cmdline:
----------------------
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=6a5dcd85-d234-4c23-a615-9349f727dd13 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_iommu=on iommu=pt isolcpus=1-11,13-23 skew_tick=1 nohz=on nohz_full=1-11,13-23 rcu_nocbs=1-11,13-23 tuned.non_isolcpus=00001001 intel_pstate=disable nosoftlockup

KernelArgs - default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_iommu=on iommu=pt isolcpus=1-11,13-23
Tuned Args - skew_tick=1 nohz=on nohz_full=1-11,13-23 rcu_nocbs=1-11,13-23 tuned.non_isolcpus=00001001 intel_pstate=disable nosoftlockup

Tuned args are created by enabling the tuned profile cpu-partitioning. I am analyzing whether the same workaround (reading the tuned args and applying them to GRUB_CMDLINE_LINUX) can be applied for tuned args too.

Aligning the existing tripleo parameter for kernel args so that it carries the GRUB_ prefix has fixed the issue. It has been validated with tripleo_upgrade using infrared.
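The renaming described here can be illustrated with a short sketch. It is not the code from the patch under review: the function `add_grub_prefix` is hypothetical, and the new variable name is an assumption, since the report only states that a GRUB_ prefix was added.

```python
import re

OLD_NAME = "TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS"
# Assumed new name: the report says only that the parameter gained a GRUB_ prefix.
NEW_NAME = "GRUB_" + OLD_NAME

# Match OLD_NAME only as a whole identifier, so a name that already carries
# the GRUB_ prefix is left alone and the rename is safe to run twice.
_PATTERN = re.compile(r"(?<![A-Za-z0-9_])" + re.escape(OLD_NAME) + r"(?![A-Za-z0-9_])")


def add_grub_prefix(text):
    """Rename every unprefixed use of OLD_NAME in the contents of a GRUB defaults file."""
    return _PATTERN.sub(NEW_NAME, text)


if __name__ == "__main__":
    before = (
        'TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS=" hugepages=20 isolcpus=2-11"\n'
        'GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }'
        '${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"\n'
    )
    print(add_grub_prefix(before), end="")
```

Both the assignment and the `${...}` reference are renamed together, so every line of the file starts with GRUB, which is the condition the earlier comments identify as necessary for leapp to leave the file intact.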
Patch under review - https://review.opendev.org/#/c/742625/

RHOS-16.1-RHEL-8-20200821.n.0
[root@computeovsdpdksriov-0 ~]# journalctl -p err -b | grep 'Open vSwitch'
[root@computeovsdpdksriov-0 ~]# systemctl status openvswitch.service
● openvswitch.service - Open vSwitch
Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
Active: active (exited) since Wed 2020-08-26 02:52:56 UTC; 5h 24min ago
Main PID: 2269 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 838860)
Memory: 0B
CGroup: /system.slice/openvswitch.service
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3542 |