Bug 1679007

Summary: realtime-virtual-host profile apply remains stuck during initial compute deployment via openstack director
Product: Red Hat Enterprise Linux 7
Reporter: Jaison Raju <jraju>
Component: tuned
Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED DUPLICATE
QA Contact: qe-baseos-daemons
Severity: medium
Priority: medium
Version: 7.6
CC: jeder, jraju, jskarvad, mtosatti, olysonek
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2019-02-28 05:25:11 UTC
Type: Bug

Description Jaison Raju 2019-02-20 05:22:47 UTC
Description of problem:
While deploying RT KVM compute nodes via OpenStack director, applying the realtime-virtual-host tuned profile hangs for more than 600 seconds and the deployment fails.
The qemu-kvm process keeps running.

root      26847  17321  0 23:42 ?        00:00:00 /bin/sh /usr/lib/tuned/realtime-virtual-host/script.sh start
root      26863      2  0 23:42 ?        00:00:00 [kworker/u900:1]
root      26864      2  0 23:42 ?        00:00:00 [kworker/u901:1]
root      26919  26847 99 23:42 ?        00:07:03 /usr/libexec/qemu-kvm -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio -device pci-testdev -kernel /usr/share/qemu-kvm/tscdeadline_latency.flat -cpu host
root      26920  26847  0 23:42 ?        00:00:00 grep latency
root      26921  26847  0 23:42 ?        00:00:00 cut -f 2 -d :
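
To make the listing above easier to read, here is a minimal Python sketch that parses `ps -ef` style rows and picks out the children of the profile script, sorted by CPU usage. The function names and the `marker` default are illustrative assumptions, not part of tuned; the sample rows are copied from the listing above.

```python
SAMPLE = """\
root      26847  17321  0 23:42 ?        00:00:00 /bin/sh /usr/lib/tuned/realtime-virtual-host/script.sh start
root      26863      2  0 23:42 ?        00:00:00 [kworker/u900:1]
root      26864      2  0 23:42 ?        00:00:00 [kworker/u901:1]
root      26919  26847 99 23:42 ?        00:07:03 /usr/libexec/qemu-kvm -enable-kvm -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -display none -serial stdio -device pci-testdev -kernel /usr/share/qemu-kvm/tscdeadline_latency.flat -cpu host
root      26920  26847  0 23:42 ?        00:00:00 grep latency
root      26921  26847  0 23:42 ?        00:00:00 cut -f 2 -d :
"""


def parse_ps(lines):
    """Parse `ps -ef` rows (UID PID PPID C STIME TTY TIME CMD) into dicts."""
    rows = []
    for line in lines:
        parts = line.split(None, 7)
        # Skip headers, blank lines and anything that is not a process row.
        if len(parts) < 8 or not all(p.isdigit() for p in parts[1:4]):
            continue
        rows.append({
            "pid": int(parts[1]),
            "ppid": int(parts[2]),
            "cpu": int(parts[3]),
            "cmd": parts[7],
        })
    return rows


def children_of_profile_script(rows, marker="realtime-virtual-host/script.sh"):
    """Return rows whose parent is a process running the profile script."""
    parents = {r["pid"] for r in rows if marker in r["cmd"]}
    return [r for r in rows if r["ppid"] in parents]


if __name__ == "__main__":
    kids = children_of_profile_script(parse_ps(SAMPLE.splitlines()))
    # Busiest child first: this is the process the script is waiting on.
    for r in sorted(kids, key=lambda r: r["cpu"], reverse=True):
        print(r["pid"], r["cpu"], r["cmd"][:60])
```

On a live node the same functions could be fed the output of `ps -ef` instead of `SAMPLE`.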

Version-Release number of selected component (if applicable):
[root@overcloud-computeovsdpdkrt-0 ~]# rpm -qa | grep tuned
tuned-2.10.0-6.el7.noarch
tuned-profiles-realtime-2.10.0-6.el7.noarch
tuned-profiles-nfv-host-2.10.0-6.el7.noarch
tuned-profiles-cpu-partitioning-2.10.0-6.el7.noarch
[root@overcloud-computeovsdpdkrt-0 ~]# rpm -q kernel-rt
kernel-rt-3.10.0-957.5.1.rt56.916.el7.x86_64

(undercloud) [stack@ocp-130-107 ~]$ rpm -q openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-8.0.7-21.el7ost.noarch


How reproducible:
Always

Steps to Reproduce:
1. Deploy an environment with RT KVM

Actual results:
The realtime-virtual-host tuned profile takes an indefinite amount of time to apply, and the stack deployment fails.
$ openstack stack failures list --long overcloud
overcloud.ComputeOvsDpdkSriovRT.0.PreNetworkConfig.HostParametersDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 9e484e3c-2baf-4a9f-bf8b-a8d42c586085
  status: CREATE_FAILED
  status_reason: |
    Error: resources.HostParametersDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    
    PLAY [Configuration to be applied before rebooting the node] *******************
    
    TASK [Gathering Facts] *********************************************************
    ok: [localhost]
    
    TASK [Ensure the kernel args ( default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on isolcpus=2-39 ) is present as TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS] ***
    changed: [localhost]
    
    TASK [Add TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS to the GRUB_CMDLINE_LINUX parameter] ***
    changed: [localhost]
    
    TASK [Generate grub config file] ***********************************************
    changed: [localhost]
    
    TASK [Tune-d Configuration] ****************************************************
    changed: [localhost]
    
    TASK [Tune-d profile activation] ***********************************************
    fatal: [localhost]: FAILED! => {"changed": true, "cmd": "tuned-adm profile realtime-virtual-host", "delta": "0:10:01.625060", "end": "2019-02-19 07:46:44.661856", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2019-02-19 07:36:43.036796", "stderr": "", "stderr_lines": [], "stdout": "Operation timed out after waiting 600 seconds(s), you may try to increase timeout by using --timeout command line option or using --async.", "stdout_lines": ["Operation timed out after waiting 600 seconds(s), you may try to increase timeout by using --timeout command line option or using --async."]}
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/7e86f3ac-b22f-4a87-acd2-1dbeb2755d93_playbook.retry
    
    PLAY RECAP *********************************************************************
    localhost                  : ok=5    changed=4    unreachable=0    failed=1   
    
  deploy_stderr: |
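
When triaging a failure like the one above, it can help to pull the command, return code and elapsed time out of the Ansible `fatal:` line. The following is a minimal sketch, assuming the line has the `FAILED! => {json}` shape shown in the output; `parse_fatal_line` is a hypothetical helper, and the sample line is trimmed to the fields used here.

```python
import json
from datetime import timedelta

# Trimmed copy of the "fatal:" line from the deployment output above.
FATAL = (
    'fatal: [localhost]: FAILED! => {"changed": true, '
    '"cmd": "tuned-adm profile realtime-virtual-host", '
    '"delta": "0:10:01.625060", "rc": 1, '
    '"stdout": "Operation timed out after waiting 600 seconds(s), you may try '
    'to increase timeout by using --timeout command line option or using --async."}'
)


def parse_fatal_line(line):
    """Return cmd, rc and elapsed seconds from a fatal line, or None."""
    _, sep, payload = line.partition("FAILED! => ")
    if not sep:
        return None
    result = json.loads(payload)
    # "delta" is H:MM:SS.ffffff here; runs longer than a day are not handled.
    hours, minutes, seconds = result["delta"].split(":")
    elapsed = timedelta(hours=int(hours), minutes=int(minutes),
                        seconds=float(seconds))
    return {
        "cmd": result["cmd"],
        "rc": result["rc"],
        "elapsed_seconds": elapsed.total_seconds(),
    }


if __name__ == "__main__":
    info = parse_fatal_line(FATAL)
    print(info["cmd"], "rc =", info["rc"],
          "elapsed =", info["elapsed_seconds"], "s")
```

An elapsed time just above 600 seconds matches the timeout reported by `tuned-adm` in `stdout`, which points at the profile apply itself rather than at a later deployment step.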


Expected results:
The realtime-virtual-host tuned profile is applied successfully.


Additional info:

Comment 4 Marcelo Tosatti 2019-02-20 12:24:30 UTC
Looks like a duplicate of

https://bugzilla.redhat.com/show_bug.cgi?id=1670275

Can you confirm the workaround in that BZ fixes the problem?

Comment 6 Jaison Raju 2019-02-28 05:25:11 UTC
As suggested, the following patch fixes the issue.
https://github.com/redhat-performance/tuned/commit/4790e570ce0e41bde4e1866ed6e3cba723b5f4d8
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# guestmount -a /tmp/overcloud-realtime-compute.qcow2  -m /dev/sda /mnt/
# wget https://raw.githubusercontent.com/redhat-performance/tuned/4790e570ce0e41bde4e1866ed6e3cba723b5f4d8/profiles/realtime-virtual-host/script.sh
--2019-02-27 21:30:51--  https://raw.githubusercontent.com/redhat-performance/tuned/4790e570ce0e41bde4e1866ed6e3cba723b5f4d8/profiles/realtime-virtual-host/script.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.152.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.152.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3550 (3.5K) [text/plain]
Saving to: ‘script.sh.1’

100%[=========================================================================================================================================================================>] 3,550       --.-K/s   in 0s      

2019-02-27 21:30:51 (95.1 MB/s) - ‘script.sh.1’ saved [3550/3550]

# cat script.sh.1 > /mnt/usr/lib/tuned/realtime-virtual-host/script.sh
# guestunmount /mnt/

$ openstack overcloud image upload --update-existing --os-image-name overcloud-realtime-compute.qcow2 
Image "overcloud-realtime-compute-vmlinuz" is up-to-date, skipping.
Image "overcloud-realtime-compute-initrd" is up-to-date, skipping.
Image "overcloud-realtime-compute" was uploaded.
+--------------------------------------+----------------------------+-------------+------------+--------+
|                  ID                  |            Name            | Disk Format |    Size    | Status |
+--------------------------------------+----------------------------+-------------+------------+--------+
| f2a5a793-f5bc-4227-8192-cd738310d57f | overcloud-realtime-compute |    qcow2    | 2642280448 | active |
+--------------------------------------+----------------------------+-------------+------------+--------+
Image "bm-deploy-kernel" is up-to-date, skipping.
Image "bm-deploy-ramdisk" is up-to-date, skipping.
Image file "/httpboot/agent.kernel" is up-to-date, skipping.
Image file "/httpboot/agent.ramdisk" is up-to-date, skipping.
Some images have been updated in Glance, make sure to rerun
	openstack overcloud node configure
to reflect the changes on the nodes
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
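
The image-patching steps in the transcript above can be collected into a small helper. This is only a sketch: `workaround_commands` is a hypothetical function that assembles the command lines and does not run anything. The download file name `script.sh.new` is an assumption, and `cp` onto the existing file stands in for the `cat ... >` redirect, since both overwrite the contents while keeping the target's mode.

```python
import os
import shlex

SCRIPT_URL = (
    "https://raw.githubusercontent.com/redhat-performance/tuned/"
    "4790e570ce0e41bde4e1866ed6e3cba723b5f4d8/"
    "profiles/realtime-virtual-host/script.sh"
)
SCRIPT_PATH = "usr/lib/tuned/realtime-virtual-host/script.sh"


def workaround_commands(image, mount_point="/mnt",
                        download_path="script.sh.new"):
    """Build (but do not execute) the commands used in the workaround."""
    target = mount_point.rstrip("/") + "/" + SCRIPT_PATH
    return [
        # Run as root: mount the image, replace the script, unmount.
        ["guestmount", "-a", image, "-m", "/dev/sda", mount_point],
        ["wget", "-O", download_path, SCRIPT_URL],
        ["cp", download_path, target],
        ["guestunmount", mount_point],
        # Run as the stack user on the undercloud: re-upload the image.
        ["openstack", "overcloud", "image", "upload", "--update-existing",
         "--os-image-name", os.path.basename(image)],
    ]


if __name__ == "__main__":
    for argv in workaround_commands("/tmp/overcloud-realtime-compute.qcow2"):
        print(shlex.join(argv))
```

Printing the commands first makes it possible to review them before running each one by hand with the appropriate user.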

*** This bug has been marked as a duplicate of bug 1554851 ***