Description of problem:

Unable to perform final compute upgrade step[2] as 'openvswitch' and 'ovs-vswitchd' services move to fail state after performing leapp upgrade[1] on dpdk-compute node due to permission issue

>> Error while running the upgrade step[2]
~~~
TASK [Always ensure the openvswitch service is enabled and running after upgrades] ***
Sunday 19 July 2020 22:10:23 -0400 (0:00:08.933) 0:12:52.758 ***********
fatal: [overcloud-computeovsdpdk-0]: FAILED! => {"changed": false, "msg": "Unable to start service openvswitch: A dependency job for openvswitch.service failed. See 'journalctl -xe' for details.\n"}
~~~

>> From compute node:
~~~
[root@overcloud-computeovsdpdk-0 ~]# journalctl -p err -b | grep 'Open vSwitch'
Jul 19 19:18:33 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Database Unit.
Jul 19 19:18:35 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:35 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:35 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:36 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:37 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:37 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:26 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:26 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:27 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:27 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:28 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:28 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:13:02 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
~~~

>> /var/log/openvswitch/ovs-vswitchd.log
~~~
2020-07-20T02:48:10.470Z|00016|dpdk|INFO|EAL: Multi-process socket /var/run/openvswitch/dpdk/rte/mp_socket
2020-07-20T02:48:10.511Z|00017|dpdk|INFO|EAL: rte_mem_virt2phy(): cannot open /proc/self/pagemap: Permission denied
2020-07-20T02:48:10.511Z|00018|dpdk|INFO|EAL: Selected IOVA mode 'VA'
2020-07-20T02:48:10.512Z|00019|dpdk|WARN|EAL: No free hugepages reported in hugepages-2048kB
2020-07-20T02:48:10.512Z|00020|dpdk|WARN|EAL: No free hugepages reported in hugepages-2048kB
2020-07-20T02:48:10.512Z|00021|dpdk|WARN|EAL: No available hugepages reported in hugepages-2048kB
2020-07-20T02:48:10.512Z|00022|dpdk|WARN|EAL: No available hugepages reported in hugepages-1048576kB
2020-07-20T02:48:10.512Z|00023|dpdk|ERR|EAL: Cannot get hugepage information.
2020-07-20T02:48:10.512Z|00024|dpdk|EMER|Unable to initialize DPDK: Permission denied
2020-07-20T02:48:10.528Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Aborted))
2020-07-20T02:48:10.529Z|00003|daemon_unix|EMER|could not detach from foreground session
~~~
~~~
[root@overcloud-computeovsdpdk-0 ~]# ls -l /proc/self/pagemap
-r--------. 1 root root 0 Jul 20 03:11 /proc/self/pagemap
~~~

~~~
[1] openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit overcloud-computeovsdpdk-0 (leapp completed successfully and upgraded the OS to 8.2)
[2] openstack overcloud upgrade run --stack overcloud --limit overcloud-computeovsdpdk-0 (failed as openvswitch is down after leapp)
~~~

Version-Release number of selected component (if applicable):

[root@overcloud-computeovsdpdk-0 ~]# rpm -qa | grep openvswitch
openvswitch2.13-2.13.0-39.el8fdp.x86_64
rhosp-openvswitch-2.13-8.el8ost.noarch
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch
network-scripts-openvswitch2.13-2.13.0-39.el8fdp.x86_64

OSP16.1
It seems the leapp upgrade did not carry over the "hugepages*" and "isolcpus" kernel parameters when upgrading (building) the OS to 8.2.

After leapp upgrade of the overcloud-computeovsdpdk-0 node:
~~~
[root@overcloud-computeovsdpdk-0 ~]# ls -l /dev/hugepages/
total 0
~~~
~~~
cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet
~~~

>> Required "hugepages*" and "isolcpus" kernel parameters, taken from one of the non-upgraded dpdk nodes, "overcloud-computeovsdpdk-1":
~~~
cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11 skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup
~~~
~~~
[root@overcloud-computeovsdpdk-1 ~]# ls -l /dev/hugepages/
total 2097152
drwxr-xr-x. 3 root root 0 Jul 18 13:56 libvirt
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 18 13:55 rtemap_0
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 18 13:55 rtemap_32768
~~~
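The comparison between the two nodes can be turned into a quick check. This is a minimal sketch, assuming the list of required arguments is known in advance; the function name `missing_kernel_args` is hypothetical and not part of leapp or tripleo.

```shell
# Print every required kernel argument that is absent from a command line.
# Usage: missing_kernel_args "<cmdline>" arg1 arg2 ...
missing_kernel_args() {
    cmdline=" $1 "
    shift
    for arg in "$@"; do
        case "$cmdline" in
            *" $arg "*) ;;                 # present as a whole word
            *) printf '%s\n' "$arg" ;;     # missing
        esac
    done
}

# Example: check the running kernel for the DPDK-related arguments.
# missing_kernel_args "$(cat /proc/cmdline)" hugepagesz=1G hugepages=20 isolcpus=2-11
```

An empty result means every listed argument is present; on the upgraded node above the call would be expected to list all three.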
We have narrowed the issue down to "/etc/default/grub", which was missing quotes for the parameters "GRUB_CMDLINE_LINUX" and "TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS". Because of this, the newly built grub2.cfg did not consider these variables and the OS booted without the dpdk parameters.

~~~
[root@overcloud-computeovsdpdk-0 ~]# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet      <<<<<---- Closing quote missing after leapp upgrade
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS= default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11"      <<<<<---- Opening quote missing after leapp upgrade
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"
GRUB_DISABLE_RECOVERY="true"
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT:+$GRUB_CMDLINE_LINUX_DEFAULT }\$tuned_params"
GRUB_INITRD_OVERLAY="${GRUB_INITRD_OVERLAY:+$GRUB_INITRD_OVERLAY }\$tuned_initrd"
GRUB_ENABLE_BLSCFG=true
~~~

After adding both missing quotes in "/etc/default/grub" and rebooting, we were able to bring up both the openvswitch and ovs-vswitchd services.

~~~
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet"      <<<<<---- Added closing quote
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS="default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11"      <<<<<---- Added opening quote
~~~

> After reboot
~~~
[root@overcloud-computeovsdpdk-0 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11 skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup

[root@overcloud-computeovsdpdk-0 ~]# systemctl | egrep -i "ovs-vswitchd|openvswitch"
openvswitch.service loaded active exited Open vSwitch
ovs-vswitchd.service loaded active running Open vSwitch Forwarding Unit

[root@overcloud-computeovsdpdk-0 ~]# ls -l /dev/hugepages/
total 2097152
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 20 11:57 rtemap_0
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 20 11:57 rtemap_32768
[root@overcloud-computeovsdpdk-0 ~]#
~~~

We have to figure out why the quotes go missing in the first place; otherwise this will require an additional reboot in every scenario that uses kernel arguments (e.g. sriov, cpu pinning, etc.).
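A damaged file of this kind can be spotted before rebooting. The following is only a heuristic sketch: it flags lines that hold an odd number of double quotes, which matches the two broken lines shown above but ignores escaped quotes and legitimate multi-line values. The function name `odd_quote_lines` is hypothetical.

```shell
# Print "<line number>: <line>" for every line that contains an odd number
# of double quotes. Heuristic only, see the caveats in the text.
odd_quote_lines() {
    # gsub() returns the number of matches; "&" puts each match back unchanged.
    awk '{ n = gsub(/"/, "&"); if (n % 2) print NR ": " $0 }' "$1"
}

# Example:
# odd_quote_lines /etc/default/grub
```

Note that a plain syntax check such as `sh -n` would not necessarily catch this case, because the stray opening quote on one line is closed by the stray quote on the next.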
I see there is an open issue in leapp regarding the kernel command line - https://github.com/oamg/leapp-repository/issues/251 Trying to see if there are any other alternatives.
I can see the file is wrongly updated because leapp expects every parameter line to start with GRUB. leapp has built-in logic to fix errors in the file, and that logic is what produces the wrong file format.
https://github.com/oamg/leapp-repository/blob/master/repos/system_upgrade/el7toel8/actors/addupgradebootentry/libraries/addupgradebootentry.py#L73

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I tested it with a parameter starting with GRUB, like below:

GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"
GRUB_TRIPLEO="tripleo_test"
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX } ${GRUB_TRIPLEO} "

With this, leapp does not corrupt the file; it stays intact, but the kernel parameters are not updated. The node still boots with the kernel's default ones.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Then I tried the configuration below, which modifies the existing GRUB_CMDLINE_LINUX in place to add the new parameters, and it works. So it looks like leapp is trying to read only the GRUB_CMDLINE_LINUX entry (the first occurrence?).

[root@rhel-leapp ~]# cat /etc/default/grub
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check net.ifnames=0 default_hugepagesz=1GB hugepagesz=1G hugepages=1"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

[root@rhel-leapp ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-193.13.2.el8_2.x86_64 root=UUID=26284d49-7043-49ee-9ae0-43ec1db62953 ro console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check default_hugepagesz=1GB hugepagesz=1G hugepages=1 net.ifnames=0

[root@rhel-leapp ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The actual solution would be to fix leapp so that it processes the contents of /etc/default/grub and obtains the final evaluated value of GRUB_CMDLINE_LINUX, so that all values are incorporated.

A workaround is to do that processing before triggering the leapp upgrade and update GRUB_CMDLINE_LINUX in /etc/default/grub accordingly, so that leapp can work without errors. It is also important to ensure that the next line coming after the GRUB_CMDLINE_LINUX entry starts with GRUB.

We also need to consider that users may add their own custom entries through first-boot scripts to apply kernel args. Evaluating /etc/default/grub to get the final value of GRUB_CMDLINE_LINUX would cover those cases too.
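The evaluation step of the proposed workaround could look roughly like the sketch below. It assumes that sourcing the file in a subshell is acceptable (any command substitution in the file is executed) and that the path is given with a directory component so that `.` does not search PATH. The function name `resolve_grub_cmdline` is hypothetical; this is not what leapp or tripleo actually do.

```shell
# Print the final value of GRUB_CMDLINE_LINUX after every assignment in the
# given file has been evaluated, with runs of whitespace collapsed.
# WARNING: the file is sourced, so command substitutions in it are executed.
resolve_grub_cmdline() {
    (
        # Start clean so values inherited from the environment cannot leak in.
        unset GRUB_CMDLINE_LINUX TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS
        . "$1" >/dev/null 2>&1 || exit 1
        set -f                        # no globbing while word-splitting
        set -- $GRUB_CMDLINE_LINUX    # split the value on whitespace
        printf '%s\n' "$*"            # re-join with single spaces
    )
}

# Example:
# resolve_grub_cmdline /etc/default/grub
```

The printed value could then be written back as a single literal GRUB_CMDLINE_LINUX="..." entry before the leapp run, which is the form that worked in the test above.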
Maybe you can add that task into this step which runs right before the leapp upgrade: https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/tripleo-packages/tripleo-packages-baremetal-puppet.yaml#L163
Even though the kernel args can be applied using a workaround, the tuned kernel args (for the cpu-partitioning profile) will still be missing.

OSP13 Compute cmdline:
----------------------
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=6a5dcd85-d234-4c23-a615-9349f727dd13 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_iommu=on iommu=pt isolcpus=1-11,13-23 skew_tick=1 nohz=on nohz_full=1-11,13-23 rcu_nocbs=1-11,13-23 tuned.non_isolcpus=00001001 intel_pstate=disable nosoftlockup

KernelArgs - default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_iommu=on iommu=pt isolcpus=1-11,13-23
Tuned Args - skew_tick=1 nohz=on nohz_full=1-11,13-23 rcu_nocbs=1-11,13-23 tuned.non_isolcpus=00001001 intel_pstate=disable nosoftlockup

Tuned args are created by enabling the tuned profile cpu-partitioning.

I am analyzing whether the same workaround (reading the tuned args and applying them to GRUB_CMDLINE_LINUX) can be applied for tuned args too.
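If that approach turns out to be viable, combining the two argument lists is straightforward. The sketch below is only an illustration of the merge, assuming both lists are already available as strings; where the tuned args are read from is left to the caller, and the function name `merge_kernel_args` is hypothetical.

```shell
# Print the first list followed by every argument of the second list that
# the first list does not already contain. Runs in a subshell so that
# "set -f" (no globbing during word-splitting) does not leak to the caller.
merge_kernel_args() (
    set -f
    merged=$1
    for arg in $2; do
        case " $merged " in
            *" $arg "*) ;;                           # already present, skip
            *) merged="${merged:+$merged }$arg" ;;   # append
        esac
    done
    printf '%s\n' "$merged"
)

# Example:
# merge_kernel_args "hugepages=32 isolcpus=1-11,13-23" "skew_tick=1 nohz=on"
```

Skipping arguments that are already present keeps the resulting command line from listing the same tuned args twice.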
Aligning the existing tripleo parameter for kernel args so that its name carries the GRUB_ prefix has fixed the issue. It has been validated with tripleo_upgrade using infrared. Patch under review - https://review.opendev.org/#/c/742625/
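A node can be checked for variables that would still trip leapp in this way. This is a minimal sketch, assuming the only requirement is that every variable assigned at the start of a line has a name beginning with GRUB_; the function name `non_grub_variables` is hypothetical.

```shell
# Print the name of every variable assigned at the start of a line whose
# name does not begin with GRUB_.
non_grub_variables() {
    awk -F= '/^[A-Za-z_][A-Za-z0-9_]*=/ && $1 !~ /^GRUB_/ { print $1 }' "$1"
}

# Example:
# non_grub_variables /etc/default/grub
```

On a file like the one from the upgraded node, this would be expected to print TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS; an empty result means no such variable was found.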
RHOS-16.1-RHEL-8-20200821.n.0

[root@computeovsdpdksriov-0 ~]# journalctl -p err -b | grep 'Open vSwitch'
[root@computeovsdpdksriov-0 ~]# systemctl status openvswitch.service
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: active (exited) since Wed 2020-08-26 02:52:56 UTC; 5h 24min ago
 Main PID: 2269 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 838860)
   Memory: 0B
   CGroup: /system.slice/openvswitch.service
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3542