Bug 2035325 - pre-provision with dpdk fails
Summary: pre-provision with dpdk fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 17.0
Assignee: Steve Baker
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Duplicates: 2071699 2073101
Depends On: 2073855
Blocks:
Reported: 2021-12-23 15:45 UTC by Ella Shulman
Modified: 2022-09-21 12:18 UTC (History)
CC List: 13 users

Fixed In Version: tripleo-ansible-3.3.1-0.20220508231833.96104ee.el8ost diskimage-builder-3.20.4-0.20220428174017.555cecb.el8ost openstack-tripleo-image-elements-13.1.3-0.20220510162343.6883abc.el8ost openstack-tripleo-common-15.4.1-0.20220510162343.855dcd5.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:18:08 UTC
Target Upstream Version:
Embargoed:


Links
System ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit 823734 | 0 | None | master: MERGED | tripleo-ansible: Add playbook which includes kernelargs.yml (I01542b3a07e01256ed6920ea7ffbb6ea114a6db8) | 2022-04-13 11:06:55 UTC
OpenStack gerrit 833693 | 0 | None | stable/wallaby: MERGED | tripleo-ansible: Add playbook which includes kernelargs.yml (I01542b3a07e01256ed6920ea7ffbb6ea114a6db8) | 2022-04-13 11:07:01 UTC
OpenStack gerrit 840384 | 0 | None | NEW | kernelargs play, gather facts after waiting for pre_tasks | 2022-05-03 21:31:49 UTC
Red Hat Issue Tracker OSP-11943 | 0 | None | None | None | 2021-12-23 15:50:41 UTC
Red Hat Product Errata RHEA-2022:6543 | 0 | None | None | None | 2022-09-21 12:18:36 UTC

Description Ella Shulman 2021-12-23 15:45:55 UTC
Description of problem:
Pre-provisioning with DPDK interfaces fails because enable_unsafe_noiommu_mode is not set.

Version-Release number of selected component (if applicable):
OSP 17

How reproducible:
100%

Steps to Reproduce:
1. Deploy the undercloud.
2. Try to pre-provision nodes with DPDK interfaces.

Actual results:
os-net-config fails

Expected results:
pre-provisioned networks

Additional info:
As a workaround, I added the following config to the baremetal_deployment YAML:
    config_drive:
      cloud_config:
        bootcmd:
          - echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
          - echo "options vfio enable_unsafe_noiommu_mode=Y" > /etc/modprobe.d/vfio.conf
          - echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf
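
One way to confirm that the workaround took effect after first boot is to read the module parameter back. The following is a minimal sketch: the helper name noiommu_enabled is hypothetical, and the optional path argument exists only so the check can be pointed at a test file instead of sysfs.

```shell
# Succeeds (exit status 0) when the vfio module parameter reads back as "Y".
# The default path is the one written by the bootcmd workaround above.
noiommu_enabled() {
    local param="${1:-/sys/module/vfio/parameters/enable_unsafe_noiommu_mode}"
    # A missing or unreadable file means the vfio module is not loaded.
    [ -r "$param" ] && [ "$(cat "$param")" = "Y" ]
}

# Example use on a provisioned node:
#   noiommu_enabled && echo "noiommu mode enabled" || echo "noiommu mode NOT enabled"
```

If the check fails on a node where the bootcmd entries ran, it is worth checking whether the vfio module was loaded at all before looking further.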

Comment 3 Saravanan KR 2022-01-04 06:32:45 UTC
Executing the command "openstack overcloud node provision" with the option "--network-config" runs os-net-config immediately after node provisioning. For NFV, however, the kernel args have to be applied before os-net-config runs; otherwise it fails.

Without the "--network-config" option, provisioning alone succeeds. Running "openstack overcloud deploy" afterwards then configures the kernel args, runs os-net-config successfully, and the deployment continues to the next step.

This change in the deploy steps will be confusing. It would be ideal if network-config were done after the kernel args (step: 0). Looping in @hjensas for his input.

Comment 4 Steve Baker 2022-01-04 20:54:23 UTC
In the short term I think this will need a custom ansible playbook which sets the kernel args and does a reboot before the network-config playbook runs. But this solution will likely need to evolve into the provision tool supporting specifying kernel args on a per role/node basis. I'll have a chat with Harald and we'll come up with a plan.

Comment 5 Steve Baker 2022-01-05 04:14:22 UTC
Actually, the cloud_config approach in #1 could be the officially documented solution. It runs earlier than the network-config playbook, and it's making changes which don't depend on any passed-in values. What do you think, Harald?

Comment 6 Saravanan KR 2022-01-05 04:40:50 UTC
Some of the drawbacks with cloud-config:
* Args cannot be modified afterwards, because cloud-config is applied only once. The current kernel args implementation supports modifications (changing the number of huge pages, for example, is a common ask from users), although the reboot has to be done manually.
* NFV deployments require tuned to be applied before the reboot, so that the additional kernel args set by tuned are applied along with the user kernel args.

I would prefer that we hook the "step: 0" deploy steps into node provision, before network config. "step: 0" was created for pre-network configurations.

Also consider that we still support the "PreNetworkConfig" network resource, which is used by one of the partners to apply custom settings before network config.

Comment 7 Saravanan KR 2022-01-06 12:14:04 UTC
@sbaker When "overcloud deploy" is invoked without "node provision", we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=2037418. Cinder keystone cleanup is failing with a gateway timeout. Any info that could help to overcome it would be great.


PLAY [External deployment step 4] **********************************************
2022-01-05 15:42:07.559671 | 52540059-36e3-4025-3304-0000000000d1 |       TASK | External deployment step 4
2022-01-05 15:42:07.588051 | 52540059-36e3-4025-3304-0000000000d1 |         OK | External deployment step 4 | undercloud -> localhost | result={
    "changed": false,
    "msg": "Use --start-at-task 'External deployment step 4' to resume from this task"
}
[WARNING]: ('undercloud -> localhost', '52540059-36e3-4025-3304-0000000000d1')
missing from stats
2022-01-05 15:42:07.631182 | 52540059-36e3-4025-3304-0000000000d2 |     TIMING | include_tasks | undercloud | 0:14:17.019214 | 0.03s
2022-01-05 15:42:07.645466 | 579a1122-478e-4263-a9c9-f1b42fe9a748 |   INCLUDED | /home/stack/overcloud-deploy/overcloud/config-download/overcloud/external_deploy_steps_tasks_step4.yaml | undercloud
2022-01-05 15:42:07.661801 | 52540059-36e3-4025-3304-000000006e7d |       TASK | Clean up legacy Cinder keystone catalog entries
rvices = self.list_services()\n  File \"/usr/lib/python3.6/site-packages/openstack/cloud/_identity.py\", line 492, in list_services\n    if self._is_client_version('identity', 2):\n

Comment 8 Steve Baker 2022-01-06 21:41:58 UTC
(In reply to Saravanan KR from comment #7)
> @sbaker When "overcloud deploy" is invoked without "node
> provision", we are hitting
> https://bugzilla.redhat.com/show_bug.cgi?id=2037418. Cinder keystone cleanup
> is failing with a gateway timeout. Any info that could help to overcome it
> would be great.

I don't know, but let's not use this bug to discuss an unrelated issue.

The network config ansible playbook runs last, and we already have a kernelargs role[1], so I think the fix for this will be to add a playbook to tripleo-ansible which just runs the tripleo_kernel kernelargs.yml. The DPDK documentation and the upstream docs[2] can then describe how to run this playbook with custom kernel args.

Could you please provide the full baremetal yaml used for the provision command? Then I can add the required ansible_playbooks section when it's ready.

[1] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_kernel/tasks/kernelargs.yml
[2] https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#ansible-playbooks

Comment 9 Steve Baker 2022-01-06 22:39:41 UTC
Here[1] is what I'm proposing. Could you please provide feedback in that review?

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/823734

Comment 10 Steve Baker 2022-01-06 23:13:07 UTC
Here[1] is the docs change which shows the playbook in use. You could test this playbook by fetching cli-overcloud-node-kernelargs.yaml locally and invoking it on your DPDK nodes as documented.

[1] https://review.opendev.org/c/openstack/tripleo-docs/+/823735
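
For illustration, a role entry in baremetal_deployment.yaml that invokes the playbook might look like the sketch below. Only the playbook filename comes from this bug; the install path, the role name, the extra_vars names, and all values are assumptions that should be checked against the linked docs change and the playbook itself.

```yaml
# Sketch of a role entry in baremetal_deployment.yaml (all values illustrative).
- name: ComputeOvsDpdkSriov          # assumed role name
  count: 1
  ansible_playbooks:
    # Assumed install path of the playbook shipped by tripleo-ansible.
    - playbook: /usr/share/ansible/tripleo-playbooks/cli-overcloud-node-kernelargs.yaml
      extra_vars:
        # Assumed variable names; verify them against the playbook.
        kernel_args: 'default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt'
        tuned_profile: 'cpu-partitioning'
        tuned_isolated_cores: '1-11,13-23'
```

Because the network config playbook runs last, a playbook listed under ansible_playbooks is applied before os-net-config, which is the ordering DPDK nodes need.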

Comment 11 Harald Jensås 2022-01-07 11:34:47 UTC
I would have suggested just including a custom playbook to set the kernel args by adding it to ansible_playbooks for the role/node in baremetal_deployment.yaml, but I see Steve proposed shipping a playbook in [1]. I like that idea!

Are there other roles we should include in a similar fashion, in addition to tuned and kernel-args? If so, I think we should open separate bugzillas.


Anything done with "PreNetworkConfig" would have to be moved to ansible_playbooks in baremetal_deployment.yaml when using '--network-config' with 'overcloud node provision'.
Another option would be not including the '--network-config' option when provisioning baremetal nodes.


NOTE:
  Since PreNetworkConfig resources still run, we should ensure the values passed as HeatParameters match the ones passed in baremetal_deployment.yaml, so that kernel params don't get reset.
  I wonder if we should implement a mechanism to make the deploy-steps playbook skip plays that have already been run?
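
As a sketch of what "matching values" could mean in practice, the Heat environment passed to "openstack overcloud deploy" would carry the same kernel arguments as baremetal_deployment.yaml. The role name and every value below are assumptions for illustration, not taken from this deployment.

```yaml
# Heat environment file for "openstack overcloud deploy" (illustrative values).
parameter_defaults:
  # Role-specific parameters; the role name here is an assumption.
  ComputeOvsDpdkSriovParameters:
    # Keep this string identical to the kernel args given to the
    # kernelargs playbook in baremetal_deployment.yaml.
    KernelArgs: 'default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt'
    TunedProfileName: 'cpu-partitioning'
    IsolCpusList: '1-11,13-23'
```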


[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/823734

Comment 12 Ella Shulman 2022-01-09 14:31:25 UTC
@sbaker I tested the patch on my setup and added review comments on the patch, thanks a lot for the help

Comment 16 Steve Baker 2022-01-11 21:46:03 UTC
*** Bug 2037418 has been marked as a duplicate of this bug. ***

Comment 19 Steve Baker 2022-01-12 22:48:48 UTC
I've refreshed the review

Comment 21 Steve Baker 2022-03-14 22:37:41 UTC
For future documentation reference, here are the upstream docs:
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#set-kernel-arguments-playbook

Wallaby backport proposed

Comment 22 Ella Shulman 2022-04-04 11:16:30 UTC
Hi all, 

I tried applying this patch manually when deploying with RHEL 9, but unfortunately it didn't work: no parameter appears to be set in /proc/cmdline even though the playbook ran.

[heat-admin@computeovsdpdksriov-0 ~]$ cat /proc/cmdline 
BOOT_IMAGE=(lvmid/2CjQYD-AyLy-vFfp-fn9F-CK5p-Xnz6-s4xWac/QMfrKG-g4Mv-YiYk-2swb-0OYp-KzGR-zbTPCd)/boot/vmlinuz-5.14.0-63.el9.x86_64 root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb

The part of the log in which the added playbook ran: http://pastebin.test.redhat.com/1042082
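
A quick way to spot this kind of failure is to compare the running kernel command line against the arguments the playbook was asked to set. The following is a minimal sketch; the helper name missing_kernel_args is hypothetical, and the arguments in the usage comment are illustrative rather than the ones used in this deployment.

```shell
# Print each expected kernel argument that is absent from a kernel command line.
# Usage: missing_kernel_args "<cmdline>" arg1 arg2 ...
missing_kernel_args() {
    local cmdline="$1"
    shift
    local arg
    for arg in "$@"; do
        # Pad with spaces so that only whole arguments match.
        case " $cmdline " in
            *" $arg "*) ;;
            *) printf '%s\n' "$arg" ;;
        esac
    done
}

# Example use on a provisioned node (expected arguments are illustrative):
#   missing_kernel_args "$(cat /proc/cmdline)" intel_iommu=on iommu=pt
```

Empty output means every expected argument is present. Any printed argument was requested but did not reach the booted kernel.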

Comment 24 Ella Shulman 2022-04-05 09:02:48 UTC
It seems the newly raised issue might be rooted in a grub issue, given that it happened even when trying to set the args manually. I created a bug for the RHEL team: https://bugzilla.redhat.com/show_bug.cgi?id=2071699

Comment 29 Steve Baker 2022-04-11 22:33:16 UTC
Bug #2073855 now has 3 changes which fix the /boot/loader/entries filenames; 2 of them will need to be backported to wallaby.

Comment 30 James Parker 2022-04-13 15:07:37 UTC
*** Bug 2073101 has been marked as a duplicate of this bug. ***

Comment 32 Steve Baker 2022-04-22 01:09:01 UTC
I've proposed the suggested playbook change

Comment 39 Ella Shulman 2022-06-08 12:28:02 UTC
*** Bug 2071699 has been marked as a duplicate of this bug. ***

Comment 45 errata-xmlrpc 2022-09-21 12:18:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

