Bug 1869503

Summary: Tempest sanity tests failed on splitstack deployment
Product: Red Hat OpenStack
Reporter: Jad Haj Yahya <jhajyahy>
Component: python-tripleo-common-tests-tempest
Assignee: Arx Cruz <acruz>
Status: CLOSED DUPLICATE
QA Contact: Alexander Chuzhoy <sasha>
Severity: medium
Docs Contact:
Priority: medium
Version: 16.1 (Train)
CC: apetrich, chkumar, drosenfe, kchamart, kecarter, mschuppe, ramishra, whayutin
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-06 11:17:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jad Haj Yahya 2020-08-18 06:57:44 UTC
Description of problem:
14 out of 16 tempest tests failed on splitstack deployments. See the report:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-df-splitstack-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/10/testReport/


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200813.n.0

How reproducible:


Steps to Reproduce:
1. Run https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-df-splitstack-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/

Actual results:


Expected results:


Additional info:
Sanity tests passed on same non-splitstack job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/3/testReport/

Comment 3 Artom Lifshitz 2020-10-09 14:37:27 UTC
selinux is preventing qemu-kvm from executing. Here's a snippet from compute-0's audit.log:

type=AVC msg=audit(1597535175.792:18853): avc:  denied  { entrypoint } for  pid=84947 comm="libvirtd" path="/usr/libexec/qemu-kvm" dev="overlay" ino=310042 scontext=system_u:system_r:svirt_t:s0:c15,c878 tcontext=system_u:object_r:container_file_t:s0:c411,c758 tclass=file permissive=0

The question is - is this the same as BZ1841822, or a new bug? Can you double-check the version of t-h-t used to deploy the environment, and whether it's earlier or later than openstack-tripleo-heat-templates-10.6.3-0.20200113185561.cf467ea?
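
To make the denial quoted in this comment easier to read, here is a minimal Python sketch that splits the AVC record into its fields. The helper names `parse_avc` and `selinux_type` and the regular expressions are illustrative assumptions, not part of any audit tooling; the sample line is the one from compute-0's audit.log.

```python
import re

# The AVC record quoted in this comment, from compute-0's audit.log.
AVC_LINE = (
    'type=AVC msg=audit(1597535175.792:18853): avc:  denied  { entrypoint } for  '
    'pid=84947 comm="libvirtd" path="/usr/libexec/qemu-kvm" dev="overlay" '
    'ino=310042 scontext=system_u:system_r:svirt_t:s0:c15,c878 '
    'tcontext=system_u:object_r:container_file_t:s0:c411,c758 '
    'tclass=file permissive=0'
)


def parse_avc(line):
    """Hypothetical helper: return the key=value fields plus denied permissions."""
    perms = re.search(r"denied\s+\{([^}]*)\}", line)
    fields = {
        key: value.strip('"')
        for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', line)
    }
    fields["permissions"] = perms.group(1).split() if perms else []
    return fields


def selinux_type(context):
    """Return the type (third field) of a user:role:type:level context."""
    return context.split(":")[2]


record = parse_avc(AVC_LINE)
print("denied:", record["permissions"], "on", record["path"])
print("source type:", selinux_type(record["scontext"]))
print("target type:", selinux_type(record["tcontext"]))
```

Read this way, the record says that a process in the svirt_t domain was refused the entrypoint permission on a file labelled container_file_t, which is the label found on /usr/libexec/qemu-kvm.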

Comment 4 Kashyap Chamarthy 2020-11-20 15:06:27 UTC
(In reply to Artom Lifshitz from comment #3)
> selinux is preventing qemu-kvm from executing. Here's a snippet from
> compute-0's audit.log:
> 
> type=AVC msg=audit(1597535175.792:18853): avc:  denied  { entrypoint } for 
> pid=84947 comm="libvirtd" path="/usr/libexec/qemu-kvm" dev="overlay"
> ino=310042 scontext=system_u:system_r:svirt_t:s0:c15,c878
> tcontext=system_u:object_r:container_file_t:s0:c411,c758 tclass=file
> permissive=0
> 
> The question is - is this the same as BZ1841822, or a new bug? Can you
> double-check the version of t-h-t used to deploy the environment, and whether it's
> earlier or later than
> openstack-tripleo-heat-templates-10.6.3-0.20200113185561.cf467ea?

Hi, Jad -- did you get a chance to double-check the above?

Comment 5 Kashyap Chamarthy 2020-12-04 13:45:06 UTC
(In reply to Kashyap Chamarthy from comment #4)
> (In reply to Artom Lifshitz from comment #3)

[...]

> > The question is - is this the same as BZ1841822, or a new bug? Can you
> > double-check the version of t-h-t used to deploy the environment, and whether it's
> > earlier or later than
> > openstack-tripleo-heat-templates-10.6.3-0.20200113185561.cf467ea?
> 
> Hi, Jad -- did you get a chance to double-check the above?

We're reasonably confident you're hitting a bug that's already fixed (https://bugzilla.redhat.com/show_bug.cgi?id=1841822).

For now, I'm closing this bug. If you think this comment is wrong, feel free to reopen it with more evidence.

Comment 6 David Rosenfeld 2021-04-29 13:53:18 UTC
Virtually certain your analysis is correct. The splitstack tempest tests fail when using RHEL 8.2 due to podman version being used.

However, the referenced bug from comment 5: https://bugzilla.redhat.com/show_bug.cgi?id=1841822 is for OSP15. The failed tempest tests still occur in 16.1. Comment 40 of that BZ says a new 16.1 BZ is needed. Don't believe that BZ was written so want to use this BZ to track 16.1 failures.

If podman version won't change in 16.1 due to its short lifespan then believe Closed EOL or Closed Won't Fix would be appropriate state for this BZ.

Comment 7 David Rosenfeld 2021-04-29 14:01:10 UTC
Logs from current 16.1 splitstack test with failed tempest tests:

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/splitstack/job/DFG-df-splitstack-16.1-virsh-3cont_2comp_3ceph-scaleup/35

Traceback seen is:


Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_minimum_basic.py", line 112, in test_minimum_basic_scenario
    server = self.create_server(image_id=image, key_name=keypair['name'])
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 286, in create_server
    image_id=image_id, **kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/common/compute.py", line 271, in create_test_server
    server['id'])
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/tempest/common/compute.py", line 242, in create_test_server
    clients.servers_client, server['id'], wait_until)
  File "/usr/lib/python3.6/site-packages/tempest/common/waiters.py", line 76, in wait_for_server_status
    server_id=server_id)
tempest.exceptions.BuildErrorException: Server 9b47aebc-355d-4fd7-97ee-7d4ee8e2d75a failed to build and is in ERROR status
Details: {'code': 500, 'created': '2021-04-25T16:16:52Z', 'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 9b47aebc-355d-4fd7-97ee-7d4ee8e2d75a.'}
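
As a rough aid for triaging this kind of failure, the following Python sketch pulls the server ID and the fault details out of the last two lines of the traceback. The helper name `parse_build_error` is hypothetical and the regular expressions are assumptions based only on the text quoted in this comment.

```python
import ast
import re

# Last two lines of the traceback quoted in this comment.
ERROR_TEXT = (
    "tempest.exceptions.BuildErrorException: Server "
    "9b47aebc-355d-4fd7-97ee-7d4ee8e2d75a failed to build and is in ERROR status\n"
    "Details: {'code': 500, 'created': '2021-04-25T16:16:52Z', 'message': "
    "'Exceeded maximum number of retries. Exhausted all hosts available for "
    "retrying build failures for instance 9b47aebc-355d-4fd7-97ee-7d4ee8e2d75a.'}"
)


def parse_build_error(text):
    """Hypothetical helper: extract the server ID and the fault dictionary."""
    server = re.search(r"Server (\S+) failed to build", text)
    details = re.search(r"^Details: (\{.*\})$", text, re.MULTILINE)
    return {
        "server_id": server.group(1) if server else None,
        # The details are printed as a Python dict literal, so literal_eval
        # can read them without executing arbitrary code.
        "details": ast.literal_eval(details.group(1)) if details else {},
    }


error = parse_build_error(ERROR_TEXT)
print(error["server_id"])
print(error["details"]["code"], error["details"]["message"])
```

The fault message reports that all hosts available for retrying the build were exhausted, so the next place to look is the compute nodes rather than tempest itself.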

Comment 8 Kashyap Chamarthy 2021-05-05 14:52:45 UTC
(In reply to David Rosenfeld from comment #6)
> Virtually certain your analysis is correct. The splitstack tempest tests
> fail when using RHEL 8.2 due to podman version being used.
> 
> However, the referenced bug from comment 5:
> https://bugzilla.redhat.com/show_bug.cgi?id=1841822 is for OSP15. The failed
> tempest tests still occur in 16.1. Comment 40 of that BZ says a new 16.1 BZ
> is needed. Don't believe that BZ was written so want to use this BZ to track
> 16.1 failures.
> 
> If podman version won't change in 16.1 due to its short lifespan then
> believe Closed EOL or Closed Won't Fix would be appropriate state for this
> BZ.

Okay, I'll discuss it with the Compute team during the bug triage on the timeline
and then close it appropriately.

Comment 9 Martin Schuppert 2021-05-06 11:17:00 UTC
I had a look at the environment that shows the issue. Here is what I found:

Installed podman version:
[root@compute-0 ~]# rpm -qa |grep podman
podman-1.6.4-12.module+el8.2.0+6669+dde598ec.x86_64

Updated podman on compute-0. Although rhos-release was installed, no later update was available, so I re-subscribed with rhos-release to the puddle given by the core_puddle version on the undercloud:
[root@compute-0 ~]# rhos-release 16.1 -r 8.2 -p RHOS-16.1-RHEL-8-20210503.n.0
[root@compute-0 ~]# dnf update podman
[root@compute-0 ~]# rpm -qa |grep podman
podman-1.6.4-19.module+el8.2.0+10175+e12b0910.x86_64
[root@compute-0 ~]# systemctl restart tripleo_nova_libvirt.service

Disabled nova-compute on compute-1 and compute-2 so that newly created instances land on compute-0:
(overcloud) [stack@undercloud-0 ~]$ openstack compute service set --disable compute-2.redhat.local nova-compute
(overcloud) [stack@undercloud-0 ~]$ openstack compute service set --disable compute-1.redhat.local nova-compute

=> instance creation failed with the same error

The issue is the wrong SELinux context on /usr/libexec/qemu-kvm in the openstack-nova-libvirt:16.1_20210430.1 container image:
()[root@compute-0 /]$ ls -laZ /usr/libexec/qemu-kvm 
-rwxr-xr-x. 1 root root system_u:object_r:container_file_t:s0:c675,c872 13921824 Feb 11 16:09 /usr/libexec/qemu-kvm

[root@compute-0 ~]# podman inspect nova_libvirt > /tmp/nova_libvirt.inspect


Stopped the nova_libvirt container and manually created a debug container using paunch:
[root@compute-0 ~]# paunch debug --file  /var/lib/tripleo-config/container-startup-config/step_3/nova_libvirt.json --container nova_libvirt --action run

[root@compute-0 ~]# podman exec -it -u root nova_libvirt-wtid4a6j sh
()[root@compute-0 /]$ ls -laZ /usr/libexec/qemu-kvm 
-rwxr-xr-x. 1 root root system_u:object_r:container_ro_file_t:s0 13921632 Oct 14  2020 /usr/libexec/qemu-kvm
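
The two listings of /usr/libexec/qemu-kvm in this comment differ only in a few columns, so a small Python sketch can make the label difference explicit. The helper name `context_from_ls` is hypothetical, and the sketch assumes the SELinux context is the fifth whitespace-separated column, as it is in the `ls -laZ` output quoted here.

```python
# The two listings of /usr/libexec/qemu-kvm quoted in this comment.
ORIGINAL_LS = (
    "-rwxr-xr-x. 1 root root system_u:object_r:container_file_t:s0:c675,c872 "
    "13921824 Feb 11 16:09 /usr/libexec/qemu-kvm"
)
DEBUG_LS = (
    "-rwxr-xr-x. 1 root root system_u:object_r:container_ro_file_t:s0 "
    "13921632 Oct 14  2020 /usr/libexec/qemu-kvm"
)


def context_from_ls(line):
    """Hypothetical helper: split the context column of `ls -laZ` into parts."""
    context = line.split()[4]
    # The level may itself contain ':' (for example s0:c675,c872), so only
    # split on the first three separators.
    user, role, type_, level = context.split(":", 3)
    return {"user": user, "role": role, "type": type_, "level": level}


original = context_from_ls(ORIGINAL_LS)
debug = context_from_ls(DEBUG_LS)
print("original container:", original["type"], original["level"])
print("debug container:   ", debug["type"], debug["level"])
```

In the original container the binary carries the type container_file_t with per-container categories, while in the debug container it carries container_ro_file_t with no categories.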

[root@compute-0 ~]# podman inspect nova_libvirt > /tmp/nova_libvirt-wtid4a6j.inspect

Comparing the two container definitions shows that MountLabel and ProcessLabel are not correct in the original nova_libvirt container:

[root@compute-0 ~]# diff -u /tmp/nova_libvirt.inspect /tmp/nova_libvirt-wtid4a6j.inspect
...
-        "MountLabel": "system_u:object_r:container_file_t:s0:c54,c900",
-        "ProcessLabel": "",
+        "MountLabel": "system_u:object_r:container_share_t:s0",
+        "ProcessLabel": "system_u:system_r:spc_t:s0",
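
The same comparison can be done programmatically instead of with diff. The Python sketch below is only an illustration: the helper name `label_diff` is hypothetical, and it assumes that `podman inspect` emits a JSON array with one object per container and that MountLabel and ProcessLabel are top-level keys of that object, which matches the indentation in the diff but is not otherwise confirmed here. The sample data contains only the values shown in the diff.

```python
import json

# Reduced stand-ins for the two inspect files, holding only the diffed values.
ORIGINAL = json.loads(
    '[{"MountLabel": "system_u:object_r:container_file_t:s0:c54,c900",'
    ' "ProcessLabel": ""}]'
)
DEBUG = json.loads(
    '[{"MountLabel": "system_u:object_r:container_share_t:s0",'
    ' "ProcessLabel": "system_u:system_r:spc_t:s0"}]'
)


def label_diff(before, after, keys=("MountLabel", "ProcessLabel")):
    """Hypothetical helper: return {key: (before, after)} for differing keys."""
    first_before, first_after = before[0], after[0]
    return {
        key: (first_before.get(key), first_after.get(key))
        for key in keys
        if first_before.get(key) != first_after.get(key)
    }


for key, (old, new) in label_diff(ORIGINAL, DEBUG).items():
    print(f"{key}: {old!r} -> {new!r}")
```

With real data, the two dictionaries would be loaded with `json.load` from the saved inspect files instead of from inline strings.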

From podman rpm changelog we see that BZ1846364 was fixed with -13/-15 
[root@compute-0 ~]# rpm -q --changelog podman

* Wed Jul 01 2020 Jindrich Novy <jnovy> - 1.6.4-15
- fix "Don't disable selinux labels if user specifies a security opt"
- Resolves: #1846364

* Tue Jun 30 2020 Ian Mcleod <imcleod> - 1.6.4-14
- follow on fix for #1846364

* Mon Jun 29 2020 Jindrich Novy <jnovy> - 1.6.4-13
- fix "podman 1.6.4 is not honouring  --security-opt when --privileged is passed"
- Resolves: #1846364
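
Given the changelog entries, a quick way to tell whether a node still runs an affected build is to compare the release number of the installed package. The Python sketch below is an assumption-laden illustration: the helper names are hypothetical, the threshold of 15 is taken from the last changelog entry that resolves #1846364, and the check only claims to be meaningful for podman 1.6.4 builds.

```python
import re

# Release of the last 1.6.4 changelog entry that resolves #1846364.
FIXED_RELEASE = 15


def podman_version_release(nvr):
    """Hypothetical helper: return (version, leading release number)."""
    match = re.match(r"podman-(?P<version>[\d.]+)-(?P<release>\d+)", nvr)
    if not match:
        raise ValueError(f"not a podman package name: {nvr!r}")
    return match.group("version"), int(match.group("release"))


def has_fix(nvr):
    """Return True/False for 1.6.4 builds, None when the version is different."""
    version, release = podman_version_release(nvr)
    if version != "1.6.4":
        return None  # outside the range this changelog excerpt covers
    return release >= FIXED_RELEASE


# The two package names reported by `rpm -qa | grep podman` in this comment.
for package in (
    "podman-1.6.4-12.module+el8.2.0+6669+dde598ec.x86_64",
    "podman-1.6.4-19.module+el8.2.0+10175+e12b0910.x86_64",
):
    print(package, "->", has_fix(package))
```

This deliberately avoids full RPM version comparison; it only looks at the leading integer of the release field, which is enough to separate -12 from -19 in this environment.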

On compute-0 and compute-1:
* updated podman to -19
* deleted the nova_libvirt container
* reran the overcloud deploy (removed --limit from this environment's deploy script)

Note: kept compute-2 on the old podman -12 version; also see the note below regarding compute-2.

[root@compute-0 ~]# podman exec -it -u root nova_libvirt sh
()[root@compute-0 /]$ ls -laZ /usr/libexec/qemu-kvm
-rwxr-xr-x. 1 root root system_u:object_r:container_ro_file_t:s0 13921824 Feb 11 16:09 /usr/libexec/qemu-kvm 

[root@compute-0 ~]# podman inspect nova_libvirt
...
        "MountLabel": "system_u:object_r:container_share_t:s0",
        "ProcessLabel": "system_u:system_r:spc_t:s0",

Instance creation is now successful:

(overcloud) [stack@undercloud-0 ~]$ openstack server list --long
+--------------------------------------+------+--------+------------+-------------+-----------------------+------------+--------------------------------------+-------------+-----------+-------------------+------------------------+------------+
| ID                                   | Name | Status | Task State | Power State | Networks              | Image Name | Image ID                             | Flavor Name | Flavor ID | Availability Zone | Host                   | Properties |
+--------------------------------------+------+--------+------------+-------------+-----------------------+------------+--------------------------------------+-------------+-----------+-------------------+------------------------+------------+
| 5bfb18ad-6f28-46c0-9f08-fcc947d380b6 | test | ACTIVE | None       | Running     | private=192.168.0.178 | cirros     | 3d270ea4-e6d7-4dd8-ae35-2ae79d304919 |             |           | nova              | compute-0.redhat.local |            |
+--------------------------------------+------+--------+------------+-------------+-----------------------+------------+--------------------------------------+-------------+-----------+-------------------+------------------------+------------+

So this confirms that the bug is a duplicate of BZ1846364.

I'll close this BZ as a duplicate of 1846364, feel free to reopen if needed.

Note re compute-2 scaling:
compute-2 has no entries in the /etc/hosts files of the controllers or the other computes. This is because --limit was used in the deploy command. Live migration would not work without running the deploy at least on all computes.

*** This bug has been marked as a duplicate of bug 1846364 ***

Comment 11 Red Hat Bugzilla 2023-09-15 01:30:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days