Description of problem:
pre-provision with dpdk fails due to enable_unsafe_noiommu_mode not being set

Version-Release number of selected component (if applicable):
osp 17

How reproducible:
100%

Steps to Reproduce:
1. deploy undercloud
2. try pre-provision with dpdk interfaces
3.

Actual results:
os net config fails

Expected results:
pre-provisioned networks

Additional info:
to workaround added the following config in baremetal_deployment yaml:

    config_drive:
      cloud_config:
        bootcmd:
          - echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
          - echo "options vfio enable_unsafe_noiommu_mode=Y" > /etc/modprobe.d/vfio.conf
          - echo "vfio-pci" > /etc/modules-load.d/vfio-pci.conf
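To confirm on a provisioned node that the workaround above took effect, the module parameter can be read back from sysfs. The following is a minimal sketch, not part of any TripleO tooling: the helper name `noiommu_enabled` is hypothetical, and the only path it uses is the one from the workaround.

```python
from pathlib import Path

# Path taken from the bootcmd workaround; the helper name is hypothetical.
NOIOMMU_PARAM = Path("/sys/module/vfio/parameters/enable_unsafe_noiommu_mode")


def noiommu_enabled(param_path: Path = NOIOMMU_PARAM) -> bool:
    """Return True if the vfio module parameter reads back as 'Y'."""
    try:
        value = param_path.read_text().strip()
    except OSError:
        # A missing or unreadable file most likely means vfio is not loaded.
        return False
    # Boolean module parameters are reported by the kernel as 'Y' or 'N'.
    return value == "Y"
```

Calling `noiommu_enabled()` with no arguments checks the live system; passing a different path makes the helper usable against a captured copy of the file.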
Executing the command "openstack overcloud node provision" with the option "--network-config" runs os-net-config immediately after node provisioning. But for NFV, the kernel args have to be applied before os-net-config runs, otherwise it fails. Without the "--network-config" option, provisioning alone succeeds; running the "openstack overcloud deploy" command then configures the kernel args, runs os-net-config successfully, and the deployment continues to the next step. This is a change in the deploy steps that will be confusing. It would be ideal if the network config were done after the kernel args (step: 0). Looping in @hjensas for his input.
In the short term I think this will need a custom ansible playbook which sets the kernel args and does a reboot before the network-config playbook runs. But this solution will likely need to evolve into the provision tool supporting specifying kernel args on a per role/node basis. I'll have a chat with Harald and we'll come up with a plan.
Actually, the cloud_config approach in comment #1 could be the officially documented solution. It runs earlier than the network-config playbook, and it's making changes which don't depend on any passed-in values. What do you think, Harald?
Some of the drawbacks with cloud-config:
* Args cannot be modified later, as cloud-config is applied only once. The current kernel args implementation supports modifications; changing the number of huge pages, for example, is a common ask from users, although the reboot then has to be done manually.
* NFV deployments require tuned to be applied before the reboot, so that the additional kernel args set by tuned are applied along with the user kernel args.

I would prefer if we hook the "step: 0" deploy steps into node provision, before network config. "step: 0" was created for pre-network configurations. Also consider that we still support the "PreNetworkConfig" network resource, which is used by one of the partners to apply custom settings before network config.
@sbaker When "overcloud deploy" is invoked without "node provision", we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=2037418. Cinder keystone cleanup is failing with a gateway timeout. Any info that could help to overcome it would be great.

PLAY [External deployment step 4] **********************************************
2022-01-05 15:42:07.559671 | 52540059-36e3-4025-3304-0000000000d1 | TASK | External deployment step 4
2022-01-05 15:42:07.588051 | 52540059-36e3-4025-3304-0000000000d1 | OK | External deployment step 4 | undercloud -> localhost | result={
    "changed": false,
    "msg": "Use --start-at-task 'External deployment step 4' to resume from this task"
}
[WARNING]: ('undercloud -> localhost', '52540059-36e3-4025-3304-0000000000d1') missing from stats
2022-01-05 15:42:07.631182 | 52540059-36e3-4025-3304-0000000000d2 | TIMING | include_tasks | undercloud | 0:14:17.019214 | 0.03s
2022-01-05 15:42:07.645466 | 579a1122-478e-4263-a9c9-f1b42fe9a748 | INCLUDED | /home/stack/overcloud-deploy/overcloud/config-download/overcloud/external_deploy_steps_tasks_step4.yaml | undercloud
2022-01-05 15:42:07.661801 | 52540059-36e3-4025-3304-000000006e7d | TASK | Clean up legacy Cinder keystone catalog entries
rvices = self.list_services()\n File \"/usr/lib/python3.6/site-packages/openstack/cloud/_identity.py\", line 492, in list_services\n if self._is_client_version('identity', 2):\n
(In reply to Saravanan KR from comment #7)
> @sbaker When "overcloud deploy" is invoked without "node
> provision", we are hitting with
> https://bugzilla.redhat.com/show_bug.cgi?id=2037418. Cinder keystone cleanup
> is failing with gateway timeout. Any info that could help to overcome it
> would be great.

I don't know, but let's not use this bug to discuss an unrelated issue.

The network config ansible playbook runs last, and we already have a kernelargs role[1], so I think the fix for this will be to add a playbook to tripleo-ansible which just runs the tripleo_kernel kernelargs.yml. The DPDK documentation and the upstream docs[2] can then describe how to run this playbook with custom kernel args.

Could you please provide the full baremetal yaml used for the provision command? Then I can add the required ansible_playbooks section when it's ready.

[1] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_kernel/tasks/kernelargs.yml
[2] https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#ansible-playbooks
Here[1] is what I'm proposing. Could you please provide feedback in that review?

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/823734
Here[1] is the docs change which shows the playbook in use. You could test this playbook by fetching cli-overcloud-node-kernelargs.yaml locally and invoking it on your DPDK nodes as documented. [1] https://review.opendev.org/c/openstack/tripleo-docs/+/823735
I would have suggested just including a custom playbook to set the kernel args by adding it to ansible_playbooks for the role/node in baremetal_deployment.yaml, but I see Steve proposed shipping a playbook in [1]. I like that idea!

Are there other roles we should include in a similar fashion, in addition to tuned and kernel-args? If so, I think we should open separate bugzillas.

Anything done with "PreNetworkConfig" would have to be moved to ansible_playbooks in baremetal_deployment.yaml when using '--network-config' with 'overcloud node provision'. Another option would be not including the '--network-config' option when provisioning baremetal nodes.

NOTE: Since PreNetworkConfig resources still run, we should ensure the values passed as HeatParameters match the ones passed in baremetal_deployment.yaml, so that the kernel params don't get reset. I wonder if we should implement a mechanism to make the deploy-steps playbook skip plays that have already been run?

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/823734
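The NOTE above, about keeping the Heat values and the baremetal_deployment.yaml values in agreement, can be checked mechanically before a deploy. This is only a sketch under stated assumptions: the function name is hypothetical, both inputs are assumed to be plain space-separated kernel argument strings already extracted from the two sources, and the comparison ignores argument order, which can matter for pairs such as hugepagesz/hugepages.

```python
from typing import Set, Tuple


def kernel_args_diff(heat_args: str, provision_args: str) -> Tuple[Set[str], Set[str]]:
    """Compare two kernel argument strings, ignoring order.

    Returns (only_in_heat, only_in_provision); both sets are empty
    when the two strings carry the same arguments.
    """
    heat = set(heat_args.split())
    provision = set(provision_args.split())
    return heat - provision, provision - heat
```

A non-empty result on either side means the later deploy step would apply different arguments from the ones set at provision time.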
@sbaker I tested the patch on my setup and added review comments on the patch, thanks a lot for the help
*** Bug 2037418 has been marked as a duplicate of this bug. ***
I've refreshed the review
For future documentation reference, here are the upstream docs:
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#set-kernel-arguments-playbook

Wallaby backport proposed.
Hi all,

I tried adding this patch manually when deploying with rhel-9, but unfortunately it didn't work. It seems that no parameter is set in the cmdline file, although the playbook ran.

[heat-admin@computeovsdpdksriov-0 ~]$ cat /proc/cmdline
BOOT_IMAGE=(lvmid/2CjQYD-AyLy-vFfp-fn9F-CK5p-Xnz6-s4xWac/QMfrKG-g4Mv-YiYk-2swb-0OYp-KzGR-zbTPCd)/boot/vmlinuz-5.14.0-63.el9.x86_64 root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb

The part of the log in which the added playbook ran:
http://pastebin.test.redhat.com/1042082
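A check like the one done by eye on the /proc/cmdline output above can be scripted. The sketch below is illustrative only: the function name is hypothetical, and the expected arguments passed in are whatever was given to the kernelargs playbook (none are listed in this report).

```python
from typing import List


def missing_kernel_args(cmdline: str, expected: str) -> List[str]:
    """Return the expected kernel args that do not appear in cmdline.

    Both inputs are space-separated strings; matching is exact per
    argument, so 'iommu=pt' does not match 'intel_iommu=on'.
    """
    present = set(cmdline.split())
    return [arg for arg in expected.split() if arg not in present]
```

On a node, the first argument would be the contents of /proc/cmdline; an empty list means every requested argument made it onto the running kernel's command line.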
It seems the newly raised issue might be rooted in a grub issue, given that it happened even when trying to set the args manually. I created a bug for the rhel team:
https://bugzilla.redhat.com/show_bug.cgi?id=2071699
Bug #2073855 now has 3 changes which fix the /boot/loader/entries filenames; 2 of them will need to be backported to wallaby.
*** Bug 2073101 has been marked as a duplicate of this bug. ***
I've proposed the suggested playbook change
*** Bug 2071699 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543