Description of problem:
14 out of 16 tempest tests failed on splitstack deployments. See report:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-df-splitstack-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/10/testReport/

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200813.n.0

How reproducible:

Steps to Reproduce:
1. Run https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-df-splitstack-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/
2.
3.

Actual results:

Expected results:

Additional info:
Sanity tests passed on the same non-splitstack job:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/3/testReport/
selinux is preventing qemu-kvm from executing. Here's a snippet from compute-0's audit.log:

type=AVC msg=audit(1597535175.792:18853): avc: denied { entrypoint } for pid=84947 comm="libvirtd" path="/usr/libexec/qemu-kvm" dev="overlay" ino=310042 scontext=system_u:system_r:svirt_t:s0:c15,c878 tcontext=system_u:object_r:container_file_t:s0:c411,c758 tclass=file permissive=0

The question is - is this the same as BZ1841822, or a new bug? Can you double-check the version of t-h-t used to deploy the environment, and whether it's earlier or later than openstack-tripleo-heat-templates-10.6.3-0.20200113185561.cf467ea?
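For reference, the key/value pairs in an AVC denial like the one above can be pulled apart with a short script (a minimal sketch; the sample line is copied verbatim from the audit.log snippet, and the parser is a generic helper, not part of any audit tooling):

```python
import re

# Sample AVC denial copied from compute-0's audit.log (see above).
avc_line = (
    'type=AVC msg=audit(1597535175.792:18853): avc: denied { entrypoint } '
    'for pid=84947 comm="libvirtd" path="/usr/libexec/qemu-kvm" dev="overlay" '
    'ino=310042 scontext=system_u:system_r:svirt_t:s0:c15,c878 '
    'tcontext=system_u:object_r:container_file_t:s0:c411,c758 '
    'tclass=file permissive=0'
)

def parse_avc(line):
    """Extract key=value fields (quoted or bare) from an AVC denial line."""
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    return {k: v.strip('"') for k, v in fields.items()}

fields = parse_avc(avc_line)
# The domain trying to exec the file (libvirtd's guest domain):
print(fields['scontext'])  # system_u:system_r:svirt_t:s0:c15,c878
# The label on /usr/libexec/qemu-kvm inside the container image:
print(fields['tcontext'])  # system_u:object_r:container_file_t:s0:c411,c758
```

The mismatch is visible right there: svirt_t has no entrypoint permission on container_file_t, which is why the guest fails to start.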
(In reply to Artom Lifshitz from comment #3)
> selinux is preventing qemu-kvm from executing. Here's a snippet from
> compute-0's audit.log:
>
> type=AVC msg=audit(1597535175.792:18853): avc: denied { entrypoint } for
> pid=84947 comm="libvirtd" path="/usr/libexec/qemu-kvm" dev="overlay"
> ino=310042 scontext=system_u:system_r:svirt_t:s0:c15,c878
> tcontext=system_u:object_r:container_file_t:s0:c411,c758 tclass=file
> permissive=0
>
> The question is - is this the same as BZ1841822, or a new bug? Can you
> double the version of t-h-t used to deploy the environment, and whether it's
> earlier or later than
> openstack-tripleo-heat-templates-10.6.3-0.20200113185561.cf467ea?

Hi, Jad -- did you get a chance to double-check the above?
(In reply to Kashyap Chamarthy from comment #4)
> (In reply to Artom Lifshitz from comment #3)
[...]
> > The question is - is this the same as BZ1841822, or a new bug? Can you
> > double the version of t-h-t used to deploy the environment, and whether it's
> > earlier or later than
> > openstack-tripleo-heat-templates-10.6.3-0.20200113185561.cf467ea?
>
> Hi, Jad -- did you get a chance to double-check the above?

We're reasonably confident you're hitting a bug that's already fixed (https://bugzilla.redhat.com/show_bug.cgi?id=1841822).

For now, I'm closing this bug. If you think this is wrong, feel free to re-open it with more evidence.
I'm virtually certain your analysis is correct: the splitstack tempest tests fail when using RHEL 8.2 because of the podman version being used.

However, the bug referenced in comment 5 (https://bugzilla.redhat.com/show_bug.cgi?id=1841822) is for OSP 15, and the tempest test failures still occur in 16.1. Comment 40 of that BZ says a new 16.1 BZ is needed. I don't believe that BZ was ever filed, so I want to use this BZ to track the 16.1 failures.

If the podman version won't change in 16.1 due to its short lifespan, then I believe CLOSED EOL or CLOSED WONTFIX would be an appropriate state for this BZ.
Logs from current 16.1 splitstack test with failed tempest tests:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/splitstack/job/DFG-df-splitstack-16.1-virsh-3cont_2comp_3ceph-scaleup/35

Traceback seen is:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_minimum_basic.py", line 112, in test_minimum_basic_scenario
    server = self.create_server(image_id=image, key_name=keypair['name'])
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 286, in create_server
    image_id=image_id, **kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/common/compute.py", line 271, in create_test_server
    server['id'])
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/tempest/common/compute.py", line 242, in create_test_server
    clients.servers_client, server['id'], wait_until)
  File "/usr/lib/python3.6/site-packages/tempest/common/waiters.py", line 76, in wait_for_server_status
    server_id=server_id)
tempest.exceptions.BuildErrorException: Server 9b47aebc-355d-4fd7-97ee-7d4ee8e2d75a failed to build and is in ERROR status
Details: {'code': 500, 'created': '2021-04-25T16:16:52Z', 'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 9b47aebc-355d-4fd7-97ee-7d4ee8e2d75a.'}
(In reply to David Rosenfeld from comment #6)
> Virtually certain your analysis is correct. The splitstack tempest tests
> fail when using RHEL 8.2 due to podman version being used.
>
> However, the referenced bug from comment 5:
> https://bugzilla.redhat.com/show_bug.cgi?id=1841822 is for OSP15. The failed
> tempest tests still occur in 16.1. Comment 40 of that BZ says a new 16.1 BZ
> is needed. Don't believe that BZ was written so want to use this BZ to track
> 16.1 failures.
>
> If podman version won't change in 16.1 due to its short lifespan then
> believe Closed EOL or Closed Won't Fix would be appropriate state for this
> BZ.

Okay, I'll discuss the timeline with the Compute team during bug triage and then close it appropriately.
I had a look at the env which shows the issue. Here is what I got.

Installed podman version:

[root@compute-0 ~]# rpm -qa | grep podman
podman-1.6.4-12.module+el8.2.0+6669+dde598ec.x86_64

Updated podman on compute-0. Even though rhos-release was installed, there was no later update available, so I re-subscribed using rhos-release to the puddle from the core_puddle version on the undercloud:

[root@compute-0 ~]# rhos-release 16.1 -r 8.2 -p RHOS-16.1-RHEL-8-20210503.n.0
[root@compute-0 ~]# dnf update podman
[root@compute-0 ~]# rpm -qa | grep podman
podman-1.6.4-19.module+el8.2.0+10175+e12b0910.x86_64
[root@compute-0 ~]# systemctl restart tripleo_nova_libvirt.service

Disabled compute-1 and compute-2 to get created instances onto compute-0:

(overcloud) [stack@undercloud-0 ~]$ openstack compute service set --disable compute-2.redhat.local nova-compute
(overcloud) [stack@undercloud-0 ~]$ openstack compute service set --disable compute-1.redhat.local nova-compute

=> the instance create failed with the same error.

The issue is the wrong SELinux context on /usr/libexec/qemu-kvm in the openstack-nova-libvirt:16.1_20210430.1 container image:

()[root@compute-0 /]$ ls -laZ /usr/libexec/qemu-kvm
-rwxr-xr-x. 1 root root system_u:object_r:container_file_t:s0:c675,c872 13921824 Feb 11 16:09 /usr/libexec/qemu-kvm

[root@compute-0 ~]# podman inspect nova_libvirt > /tmp/nova_libvirt.inspect

Stopped the nova_libvirt container and manually created a debug container using paunch:

[root@compute-0 ~]# paunch debug --file /var/lib/tripleo-config/container-startup-config/step_3/nova_libvirt.json --container nova_libvirt --action run
[root@compute-0 ~]# podman exec -it -u root nova_libvirt-wtid4a6j sh
()[root@compute-0 /]$ ls -laZ /usr/libexec/qemu-kvm
-rwxr-xr-x. 1 root root system_u:object_r:container_ro_file_t:s0 13921632 Oct 14 2020 /usr/libexec/qemu-kvm

[root@compute-0 ~]# podman inspect nova_libvirt > /tmp/nova_libvirt-wtid4a6j.inspect

From the two container definitions we see that the MountLabel and ProcessLabel are not correct:

[root@compute-0 ~]# diff -u /tmp/nova_libvirt.inspect /tmp/nova_libvirt-wtid4a6j.inspect
...
-        "MountLabel": "system_u:object_r:container_file_t:s0:c54,c900",
-        "ProcessLabel": "",
+        "MountLabel": "system_u:object_r:container_share_t:s0",
+        "ProcessLabel": "system_u:system_r:spc_t:s0",

From the podman rpm changelog we see that BZ1846364 was fixed with -13/-15:

[root@compute-0 ~]# rpm -q --changelog podman
* Mi Jul 01 2020 Jindrich Novy <jnovy> - 1.6.4-15
- fix "Don't disable selinux labels if user specifies a security opt"
- Resolves: #1846364
* Di Jun 30 2020 Ian Mcleod <imcleod> - 1.6.4-14
- follow on fix for #1846364
* Mo Jun 29 2020 Jindrich Novy <jnovy> - 1.6.4-13
- fix "podman 1.6.4 is not honouring --security-opt when --privileged is passed"
- Resolves: #1846364

On compute-0 and compute-1:
* updated podman to -19,
* deleted the nova_libvirt container,
* reran the overcloud deploy (removed --limit from this environment's deploy script).

Note: kept compute-2 with the old podman -12 version; also see the note below re compute-2.

[root@compute-0 ~]# podman exec -it -u root nova_libvirt sh
()[root@compute-0 /]$ ls -laZ /usr/libexec/qemu-kvm
-rwxr-xr-x. 1 root root system_u:object_r:container_ro_file_t:s0 13921824 Feb 11 16:09 /usr/libexec/qemu-kvm

[root@compute-0 ~]# podman inspect nova_libvirt
...
        "MountLabel": "system_u:object_r:container_share_t:s0",
        "ProcessLabel": "system_u:system_r:spc_t:s0",

Instance create is now successful:

(overcloud) [stack@undercloud-0 ~]$ openstack server list --long
+--------------------------------------+------+--------+------------+-------------+-----------------------+------------+--------------------------------------+-------------+-----------+-------------------+------------------------+------------+
| ID                                   | Name | Status | Task State | Power State | Networks              | Image Name | Image ID                             | Flavor Name | Flavor ID | Availability Zone | Host                   | Properties |
+--------------------------------------+------+--------+------------+-------------+-----------------------+------------+--------------------------------------+-------------+-----------+-------------------+------------------------+------------+
| 5bfb18ad-6f28-46c0-9f08-fcc947d380b6 | test | ACTIVE | None       | Running     | private=192.168.0.178 | cirros     | 3d270ea4-e6d7-4dd8-ae35-2ae79d304919 |             |           | nova              | compute-0.redhat.local |            |
+--------------------------------------+------+--------+------------+-------------+-----------------------+------------+--------------------------------------+-------------+-----------+-------------------+------------------------+------------+

So this confirms that the bug is a duplicate of BZ1846364. I'll close this BZ as a duplicate of 1846364; feel free to reopen if needed.

Note re compute-2 scaling: compute-2 has no entries in the controllers' or computes' /etc/hosts file. This is related to --limit being used in the deploy command. Live migration will not work without running the deploy on at least all computes.

*** This bug has been marked as a duplicate of bug 1846364 ***
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 365 days.