Bug 1858673 - [RHOSP 13 to 16.1 Upgrades] 'openvswitch' and 'ovs-vswitchd' failed to start after performing leapp upgrade
Summary: [RHOSP 13 to 16.1 Upgrades] 'openvswitch' and 'ovs-vswitchd' failed to start ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z1
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Saravanan KR
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-20 03:21 UTC by MD Sufiyan
Modified: 2020-09-10 03:57 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200616081539.396affd.el8ost
Doc Type: Bug Fix
Doc Text:
This update fixes a GRUB parameter naming convention that led to unpredictable behavior on Compute nodes during leapp upgrades. Previously, GRUB parameters carrying the obsolete "TRIPLEO" prefix caused problems. The tripleo kernel args parameter in /etc/default/grub now starts with GRUB so that leapp can upgrade the file correctly. This is done by adding "upgrade_tasks" to the service "OS::TripleO::Services::BootParams", which is a new service added to all roles in the roles_data.yaml file.
Clone Of:
Environment:
Last Closed: 2020-08-27 15:19:10 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 745059 0 None MERGED Align kernel args for system upgrade using leapp 2020-11-17 10:50:18 UTC
Red Hat Product Errata RHBA-2020:3542 0 None None None 2020-08-27 15:19:28 UTC

Description MD Sufiyan 2020-07-20 03:21:27 UTC
Description of problem:

Unable to perform the final compute upgrade step [2]: the 'openvswitch' and 'ovs-vswitchd' services enter a failed state after the leapp upgrade [1] on the DPDK compute node, apparently because of a permission issue.

>> Error while running the upgrade step[2]

~~~
TASK [Always ensure the openvswitch service is enabled and running after upgrades] ***
Sunday 19 July 2020  22:10:23 -0400 (0:00:08.933)       0:12:52.758 ***********
fatal: [overcloud-computeovsdpdk-0]: FAILED! => {"changed": false, "msg": "Unable to start service openvswitch: A dependency job for openvswitch.service failed. See 'journalctl -xe' for details.\n"}
~~~

>> From compute node:

~~~
[root@overcloud-computeovsdpdk-0 ~]# journalctl -p err -b  | grep 'Open vSwitch'
Jul 19 19:18:33 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Database Unit.
Jul 19 19:18:35 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:35 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:35 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:36 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:37 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 19 19:18:37 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:26 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:26 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:27 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:27 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:28 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:10:28 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
Jul 20 02:13:02 overcloud-computeovsdpdk-0 systemd[1]: Failed to start Open vSwitch Forwarding Unit.
~~~


>> /var/log/openvswitch/ovs-vswitchd.log

~~~
2020-07-20T02:48:10.470Z|00016|dpdk|INFO|EAL: Multi-process socket /var/run/openvswitch/dpdk/rte/mp_socket
2020-07-20T02:48:10.511Z|00017|dpdk|INFO|EAL: rte_mem_virt2phy(): cannot open /proc/self/pagemap: Permission denied
2020-07-20T02:48:10.511Z|00018|dpdk|INFO|EAL: Selected IOVA mode 'VA'
2020-07-20T02:48:10.512Z|00019|dpdk|WARN|EAL: No free hugepages reported in hugepages-2048kB
2020-07-20T02:48:10.512Z|00020|dpdk|WARN|EAL: No free hugepages reported in hugepages-2048kB
2020-07-20T02:48:10.512Z|00021|dpdk|WARN|EAL: No available hugepages reported in hugepages-2048kB
2020-07-20T02:48:10.512Z|00022|dpdk|WARN|EAL: No available hugepages reported in hugepages-1048576kB
2020-07-20T02:48:10.512Z|00023|dpdk|ERR|EAL: Cannot get hugepage information.
2020-07-20T02:48:10.512Z|00024|dpdk|EMER|Unable to initialize DPDK: Permission denied
2020-07-20T02:48:10.528Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Aborted))
2020-07-20T02:48:10.529Z|00003|daemon_unix|EMER|could not detach from foreground session
~~~

~~~
[root@overcloud-computeovsdpdk-0 ~]# ls -l /proc/self/pagemap
-r--------. 1 root root 0 Jul 20 03:11 /proc/self/pagemap
~~~

~~~
[1] openstack overcloud upgrade run --stack overcloud --tags system_upgrade --limit overcloud-computeovsdpdk-0 (leapp completed successfully and upgraded the OS to 8.2)
[2] openstack overcloud upgrade run --stack overcloud --limit overcloud-computeovsdpdk-0  (failed as openvswitch is down after leapp)
~~~

Version-Release number of selected component (if applicable):

[root@overcloud-computeovsdpdk-0 ~]#  rpm -qa | grep openvswitch
openvswitch2.13-2.13.0-39.el8fdp.x86_64
rhosp-openvswitch-2.13-8.el8ost.noarch
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch
network-scripts-openvswitch2.13-2.13.0-39.el8fdp.x86_64

OSP16.1

Comment 1 MD Sufiyan 2020-07-20 08:15:44 UTC
It seems the leapp upgrade did not carry over the "hugepages*" and "isolcpus" kernel parameters while upgrading (rebuilding) the OS to 8.2.

After leapp upgrade of overcloud-computeovsdpdk-0 node:

~~~
[root@overcloud-computeovsdpdk-0 ~]# ls -l /dev/hugepages/              
total 0                                                                 
~~~

~~~
cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet                                 
~~~
                                                                                                                                                                               
>> Required "hugepages*" and "isolcpus" kernel parameters:

From a non-upgraded DPDK node, "overcloud-computeovsdpdk-1":

~~~
cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11 skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup                                                             
~~~

~~~
[root@overcloud-computeovsdpdk-1 ~]#  ls -l /dev/hugepages/             
total 2097152                                                           
drwxr-xr-x. 3 root        root               0 Jul 18 13:56 libvirt     
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 18 13:55 rtemap_0    
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 18 13:55 rtemap_32768
~~~
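
To make the comparison above repeatable, here is a minimal sketch that lists the kernel arguments present on a reference node but absent on the upgraded one. The function name `missing_kernel_args` is hypothetical, and the two strings are shortened copies of the /proc/cmdline outputs shown above (the root=UUID entry is left out for brevity).

```python
def missing_kernel_args(reference_cmdline, current_cmdline):
    """Return args found in reference_cmdline but absent from current_cmdline."""
    current = set(current_cmdline.split())
    # BOOT_IMAGE always differs between kernels, so it is not reported.
    return [arg for arg in reference_cmdline.split()
            if arg not in current and not arg.startswith("BOOT_IMAGE=")]

# Shortened copies of the two /proc/cmdline outputs from this comment.
before = ("BOOT_IMAGE=/boot/vmlinuz-3.10.0-1127.el7.x86_64 ro console=tty0 "
          "console=ttyS0,115200n8 crashkernel=auto rhgb quiet "
          "default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt "
          "intel_iommu=on isolcpus=2-11")
after = ("BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 ro "
         "console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet")

missing = missing_kernel_args(before, after)
print(missing)
```

On a live node the two strings would come from /proc/cmdline on each host; an empty list would mean the upgraded node kept every argument.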

Comment 2 MD Sufiyan 2020-07-20 12:19:15 UTC
We have narrowed the issue down to "/etc/default/grub", which was missing quotes for the parameters "GRUB_CMDLINE_LINUX" and "TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS". Because of this, the newly built grub2.cfg did not pick up these variables, and the OS booted without the DPDK parameters.

~~~
[root@overcloud-computeovsdpdk-0 ~]# cat /etc/default/grub                                                                 
GRUB_TIMEOUT=5                                                                                                             
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"                                                          
GRUB_DEFAULT=saved                                                                                                         
GRUB_DISABLE_SUBMENU=true                                                                                                  
GRUB_TERMINAL_OUTPUT="console"                                                                                             
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet     <<<<<---- Closing quote missing after leapp upgrade                                 
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS= default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11" <<<<<---- opening quote missing after leapp upgrade.
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"                       
GRUB_DISABLE_RECOVERY="true"                                                                                               
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT:+$GRUB_CMDLINE_LINUX_DEFAULT }\$tuned_params"                     
GRUB_INITRD_OVERLAY="${GRUB_INITRD_OVERLAY:+$GRUB_INITRD_OVERLAY }\$tuned_initrd"                                          
GRUB_ENABLE_BLSCFG=true                                                                                                    
~~~
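
A quick way to spot this kind of damage after leapp runs could be a per-line quote check. The sketch below is only an illustration: the function name is hypothetical, and it assumes each assignment fits on one line and that double quotes never appear inside single-quoted text.

```python
import re

def lines_with_unbalanced_quotes(grub_text):
    """Return (line_number, line) pairs with an odd number of unescaped double quotes."""
    bad = []
    for number, line in enumerate(grub_text.splitlines(), start=1):
        # Count double quotes that are not preceded by a backslash.
        quotes = re.findall(r'(?<!\\)"', line)
        if len(quotes) % 2:
            bad.append((number, line))
    return bad

# Excerpt of the broken file from this comment, without the annotations.
sample = '''GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS= default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11"
GRUB_DISABLE_RECOVERY="true"
'''

flagged = lines_with_unbalanced_quotes(sample)
for number, line in flagged:
    print(number, line)
```

For the excerpt above this flags the GRUB_CMDLINE_LINUX and TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS lines, which are the two lines that needed a manual fix.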

After adding both missing quotes in "/etc/default/grub" and rebooting, we were able to bring up both the openvswitch and ovs-vswitchd services.

~~~
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet"     <<<<<---- Added closing quote
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS="default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11" <<<<<---- Added opening quote 
~~~

> After reboot

~~~
[root@overcloud-computeovsdpdk-0 ~]# cat /proc/cmdline                                                                                                                                        
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.6.3.el8_2.x86_64 root=UUID=566687de-3830-4c23-b9cf-b2d936de8ec3 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=20 iommu=pt intel_iommu=on isolcpus=2-11 skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=2-11 rcu_nocbs=2-11 tuned.non_isolcpus=00000003 intel_pstate=disable nosoftlockup

[root@overcloud-computeovsdpdk-0 ~]# systemctl | egrep -i "ovs-vswitchd|openvswitch"                                                                                                          
openvswitch.service                                                                      loaded active exited    Open vSwitch                                                                 
ovs-vswitchd.service                                                                     loaded active running   Open vSwitch Forwarding Unit                                                 

[root@overcloud-computeovsdpdk-0 ~]# ls -l /dev/hugepages/                                                                                                                                    
total 2097152                                                                                                                                                                                 
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 20 11:57 rtemap_0                                                                                                                          
-rw-------. 1 openvswitch hugetlbfs 1073741824 Jul 20 11:57 rtemap_32768                                                                                                                      
[root@overcloud-computeovsdpdk-0 ~]#                                                                                                                                                          
~~~

We have to figure out why the quotes go missing in the first place; otherwise every scenario that uses kernel arguments (e.g. SR-IOV, CPU pinning) will need an additional reboot.

Comment 4 Saravanan KR 2020-07-23 06:19:26 UTC
I see there is an open leapp issue regarding the kernel command line: https://github.com/oamg/leapp-repository/issues/251

Trying to see if there are any other alternatives.

Comment 5 Saravanan KR 2020-07-23 07:32:11 UTC
I can see the file is updated incorrectly because leapp expects every parameter line to start with GRUB. leapp has built-in logic to fix what it considers errors, and that logic produces the malformed file.
https://github.com/oamg/leapp-repository/blob/master/repos/system_upgrade/el7toel8/actors/addupgradebootentry/libraries/addupgradebootentry.py#L73
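
Given that expectation, a pre-upgrade check could list the assignments in /etc/default/grub whose names do not start with GRUB_. This is a rough sketch with a hypothetical function name; it does not reproduce leapp's own parsing, it only applies the "must start with GRUB" rule described above.

```python
import re

# A shell-style assignment at the start of a line: NAME=...
ASSIGNMENT = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)=')

def non_grub_variables(grub_text):
    """Return the names of assigned variables that do not start with GRUB_."""
    names = []
    for line in grub_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match and not match.group(1).startswith("GRUB_"):
            names.append(match.group(1))
    return names

sample = '''GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto rhgb quiet"
TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS=" default_hugepagesz=1GB hugepagesz=1G hugepages=20"
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"
'''

offenders = non_grub_variables(sample)
print(offenders)
```

Any name reported here is a candidate for the corruption seen on the compute node.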

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I tested it with a parameter starting with GRUB, like below:

  GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"
  GRUB_TRIPLEO="tripleo_test"
  GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX } ${GRUB_TRIPLEO} "

With this, the file is not corrupted and stays intact, but the GRUB parameters are not updated. The kernel still boots with the default ones.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Then I tried the configuration below, which modifies the existing GRUB_CMDLINE_LINUX in place to add the new parameters, and it works. So it looks like leapp reads only the GRUB_CMDLINE_LINUX entry (the first occurrence?).

[root@rhel-leapp ~]# cat /etc/default/grub 
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check net.ifnames=0 default_hugepagesz=1GB hugepagesz=1G hugepages=1"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true
[root@rhel-leapp ~]# cat /proc/cmdline 
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-193.13.2.el8_2.x86_64 root=UUID=26284d49-7043-49ee-9ae0-43ec1db62953 ro console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check default_hugepagesz=1GB hugepagesz=1G hugepages=1 net.ifnames=0
[root@rhel-leapp ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.2 (Ootpa)


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The actual solution would be to fix leapp so that it processes the contents of /etc/default/grub and uses the final evaluated value of GRUB_CMDLINE_LINUX, so that all values are incorporated.

A workaround is to do that processing before triggering the leapp upgrade and write the result to GRUB_CMDLINE_LINUX in /etc/default/grub, so that leapp works without errors. It is also important to ensure that the line following the GRUB_CMDLINE_LINUX entry starts with GRUB.

We also need to consider that users may add their own custom entries through first-boot scripts to apply kernel args. Evaluating /etc/default/grub to get the final value of GRUB_CMDLINE_LINUX would cover those cases as well.
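
The evaluation step of that workaround could look roughly like the sketch below. It assumes a POSIX `sh` is available and lets the shell source the file, so any command substitution in the file is executed; the function name `effective_grub_cmdline` is hypothetical, and the demo runs against a temporary file rather than the real /etc/default/grub.

```python
import os
import subprocess
import tempfile

def effective_grub_cmdline(path="/etc/default/grub"):
    """Return the final GRUB_CMDLINE_LINUX after a POSIX shell has sourced the file."""
    # Caution: sourcing executes any command substitutions in the file.
    script = '. "$1"; printf "%s" "$GRUB_CMDLINE_LINUX"'
    result = subprocess.run(
        ["sh", "-c", script, "sh", path],
        check=True, capture_output=True, text=True,
    )
    # Collapse runs of whitespace left behind by the concatenation.
    return " ".join(result.stdout.split())

# Demo on a temporary file that mimics the tripleo layout.
demo = (
    'GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto rhgb quiet"\n'
    'TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS=" default_hugepagesz=1GB hugepagesz=1G hugepages=20"\n'
    'GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }'
    '${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"\n'
)
with tempfile.NamedTemporaryFile("w", delete=False) as handle:
    handle.write(demo)
try:
    flattened = effective_grub_cmdline(handle.name)
finally:
    os.unlink(handle.name)

# This single line is what would replace the GRUB_CMDLINE_LINUX entries.
print('GRUB_CMDLINE_LINUX="%s"' % flattened)
```

Writing the flattened value back as a single GRUB_CMDLINE_LINUX line matches the in-place configuration that worked in the test above; how and when to write it back is left open here.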

Comment 7 Jose Luis Franco 2020-07-23 09:25:58 UTC
Maybe you can add that task into this step which runs right before the leapp upgrade: https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/tripleo-packages/tripleo-packages-baremetal-puppet.yaml#L163

Comment 8 Saravanan KR 2020-07-29 07:07:28 UTC
Even though the kernel args can be applied using a workaround, the tuned kernel args (for the cpu-partitioning profile) will still be missing.

OSP13 Compute cmdline:
----------------------
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=6a5dcd85-d234-4c23-a615-9349f727dd13 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_iommu=on iommu=pt isolcpus=1-11,13-23 skew_tick=1 nohz=on nohz_full=1-11,13-23 rcu_nocbs=1-11,13-23 tuned.non_isolcpus=00001001 intel_pstate=disable nosoftlockup

  KernelArgs - default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_iommu=on iommu=pt isolcpus=1-11,13-23
  Tuned Args - skew_tick=1 nohz=on nohz_full=1-11,13-23 rcu_nocbs=1-11,13-23 tuned.non_isolcpus=00001001 intel_pstate=disable nosoftlockup


Tuned args are created by enabling the tuned profile cpu-partitioning. I am analyzing whether the same workaround (reading the tuned args and applying them to GRUB_CMDLINE_LINUX) can be applied to the tuned args too.
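
If the tuned args were folded into GRUB_CMDLINE_LINUX this way, they should be appended only once, since the command line after the manual fix in Comment 2 already shows them twice. A minimal sketch of such a merge, with a hypothetical helper name and the values from this comment:

```python
def append_missing_args(base_args, extra_args):
    """Append each argument from extra_args that base_args does not already contain."""
    result = base_args.split()
    present = set(result)
    for arg in extra_args.split():
        if arg not in present:
            present.add(arg)
            result.append(arg)
    return " ".join(result)

kernel_args = ("default_hugepagesz=1GB hugepagesz=1G hugepages=32 "
               "intel_iommu=on iommu=pt isolcpus=1-11,13-23")
tuned_args = ("skew_tick=1 nohz=on nohz_full=1-11,13-23 rcu_nocbs=1-11,13-23 "
              "tuned.non_isolcpus=00001001 intel_pstate=disable nosoftlockup")

once = append_missing_args(kernel_args, tuned_args)
twice = append_missing_args(once, tuned_args)  # a second run adds nothing
print(once)
```

The base arguments are left untouched on purpose: pairs such as hugepagesz/hugepages can legitimately repeat, so only the appended set is de-duplicated.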

Comment 9 Saravanan KR 2020-08-05 06:49:41 UTC
Aligning the existing tripleo kernel args parameter so that its name starts with GRUB_ has fixed the issue. It has been validated with tripleo_upgrade using infrared. Patch under review - https://review.opendev.org/#/c/742625/
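
For illustration only, the rename could be applied to an existing /etc/default/grub along these lines. The new name used here, GRUB_TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS, is an assumption derived from the GRUB_ prefix described above; the patch under review is the authority on the exact name and on how the upgrade task performs the change.

```python
import re

OLD_NAME = "TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS"
NEW_NAME = "GRUB_" + OLD_NAME  # assumed name, not taken from the patch

# Match the old name only when it is a whole identifier, so that a name
# that already carries the GRUB_ prefix is not prefixed a second time.
PATTERN = re.compile(r'(?<![A-Za-z0-9_])' + re.escape(OLD_NAME) + r'(?![A-Za-z0-9_])')

def prefix_tripleo_variable(grub_text):
    """Rename every standalone use of OLD_NAME to NEW_NAME."""
    return PATTERN.sub(NEW_NAME, grub_text)

sample = (
    'TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS=" default_hugepagesz=1GB hugepages=20"\n'
    'GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }'
    '${TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS}"\n'
)

renamed = prefix_tripleo_variable(sample)
print(renamed)
```

Running the function on its own output changes nothing, so it can be applied repeatedly without stacking prefixes.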

Comment 16 Yariv 2020-08-26 08:19:41 UTC
RHOS-16.1-RHEL-8-20200821.n.0

[root@computeovsdpdksriov-0 ~]# journalctl -p err -b  | grep 'Open vSwitch'

[root@computeovsdpdksriov-0 ~]# systemctl status openvswitch.service
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: active (exited) since Wed 2020-08-26 02:52:56 UTC; 5h 24min ago
 Main PID: 2269 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 838860)
   Memory: 0B
   CGroup: /system.slice/openvswitch.service

Comment 18 errata-xmlrpc 2020-08-27 15:19:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3542

