Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1751559

Summary: osp15 Overcloud deployment fails with : Task 'provide_manageable' (02793a02-c785-4682-b85d-3aca564f6f90) [RUNNING -> ERROR, msg=Failure caused by error in tasks: send_message
Product: Red Hat OpenStack Reporter: pkomarov
Component: openstack-ironic Assignee: RHOS Maint <rhos-maint>
Status: CLOSED DUPLICATE QA Contact: mlammon
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 15.0 (Stein) CC: bfournie, mburns, sathlang
Target Milestone: --- Keywords: AutomationBlocker
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-09-12 11:30:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1751300    
Bug Blocks:    

Description pkomarov 2019-09-12 07:30:27 UTC
Description of problem:

osp15 overcloud deployment fails with: Task 'provide_manageable' (02793a02-c785-4682-b85d-3aca564f6f90) [RUNNING -> ERROR, msg=Failure caused by error in tasks: send_message


Version-Release number of selected component (if applicable):

osp15, RHOS_TRUNK-15.0-RHEL-8-20190830.n.0

How reproducible:
100%

Steps to Reproduce:
Rerun any osp15 job,
or:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-pidone-updates-15_director-rhel-virthost-3cont_3db_3msg_2net_2comp-ipv4-geneve-ansible-sts-composable_roles/


Additional info:
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| a57971e6-6dc7-416e-bb9d-6b7d7f37ff30 | compute-0    | None                                 | power off   | available          | False       |
| 2858f835-68ff-49c5-a4b7-8c759d9f4113 | compute-1    | None                                 | power off   | available          | False       |
| 9130a6b4-e3dc-4036-a5bd-8db709fe4761 | controller-0 | None                                 | power off   | available          | False       |
| 52188d3d-2206-448c-ad09-407e820ab48b | controller-1 | None                                 | power off   | available          | False       |
| 8e7d583e-65ef-4ae7-ad71-c805db2d8ae4 | controller-2 | None                                 | power off   | available          | False       |
| 170b667c-ba40-4549-a6af-3a4350376c61 | database-0   | None                                 | power off   | available          | False       |
| 81182a82-2710-4eb3-8975-ca4c61c24ff6 | database-1   | None                                 | power off   | available          | False       |
| 09f16cf5-fc0b-4dad-b707-de451c6afaa1 | database-2   | None                                 | power off   | available          | False       |
| 2cd49555-25e8-487f-a10e-19c2f45efc65 | messaging-0  | 9670711f-b7bd-41d0-bd5b-84041628e4a8 | power on    | wait call-back     | False       |
| 05de9c8d-3877-4419-8404-24fe34d24622 | messaging-1  | None                                 | power off   | available          | False       |
| df7c5701-efc1-4c92-91bd-b3f0be3d97df | messaging-2  | None                                 | power off   | available          | False       |
| d462905a-a8d7-4767-ad42-6c40fe77dc6c | networker-0  | None                                 | power off   | available          | False       |
| 89ffd80c-74bd-493c-b305-120f3d87d041 | networker-1  | None                                 | power off   | available          | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
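To spot the problem at a glance in listings like the one above, the table can be filtered for nodes that are not idle. This is a minimal sketch over sample rows mirroring the table (column positions are an assumption based on the output format shown); in the real run, only messaging-0 is stuck in "wait call-back" while everything else sits in "available".

```shell
# Sample rows copied in shape from the `openstack baremetal node list` table
# above (borders trimmed to Name | Power State | Provisioning State).
nodes='| compute-0    | power off | available      |
| messaging-0  | power on  | wait call-back |'

# Print the names of nodes whose provisioning state is not "available".
busy=$(printf '%s\n' "$nodes" | awk -F'|' '$4 !~ /available/ {gsub(/ /,"",$2); print $2}')
echo "$busy"
```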

/var/log/containers/mistral/engine.log.1:2019-09-11 21:28:42.886 9 INFO workflow_trace [req-57bbb9a1-4200-4f99-a7c7-1640e1365a01 ff635a26b9c04d6b857ca18541ce609d 1a946ac0f6f3414d982901cc12d80cb3 - default default] Task 'provide_manageable' (02793a02-c785-4682-b85d-3aca564f6f90) [RUNNING -> ERROR, msg=Failure caused by error in tasks: send_message

(undercloud) [stack@undercloud-0 ~]$ openstack server show compute-0
| fault                               | {'code': 500, 'created': '2019-09-12T04:49:08Z', 'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance f5dcc8e5-43d5-4d68-a471-e5ebc0e063de.', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 633, in build_instances\n    raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance f5dcc8e5-43d5-4d68-a471-e5ebc0e063de.\n'}

Comment 1 pkomarov 2019-09-12 07:40:14 UTC
sosreports and stack home are at : http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ_1751559/

live env : titan88.lab.eng.tlv2.redhat.com

Comment 2 Sofer Athlan-Guyot 2019-09-12 10:55:35 UTC
Hi,

Just FYI: it seems we may have an issue with SELinux that prevents the provisioning of the nodes. Running these commands on undercloud-0 seems to fix it:

cat /var/log/audit/audit.log | audit2allow -M local
semodule -i local.pp
systemctl restart tripleo_neutron_dhcp.service

and on the hypervisor:

for i in compute-0 controller-0 controller-1 controller-2 ; do virsh destroy $i; virsh start $i ;done

where local.pp is the compiled version of:

[root@undercloud-0 ~]# cat local.te 

module local 1.0;

require {
        type unlabeled_t;
        type system_dbusd_t;
        type spc_t;
        type container_t;
        class unix_stream_socket connectto;
        class key create;
}

#============= container_t ==============
allow container_t system_dbusd_t:unix_stream_socket connectto;

#============= spc_t ==============
allow spc_t unlabeled_t:key create;


Then the Heat deployment went on and the Ansible run started.

These commands have to be run while the Heat deployment is stuck (and heading toward failure).

For CI, they should be run after the undercloud deployment (without rebooting the nodes in that case) as a workaround.


See https://bugzilla.redhat.com/show_bug.cgi?id=1751300

Comment 3 Sofer Athlan-Guyot 2019-09-12 11:03:15 UTC
Now, I think we have an issue because the deployment failure wasn't detected somehow.

Comment 4 Bob Fournier 2019-09-12 11:30:02 UTC
This is the same SELinux issue we are hitting everywhere else, and which is causing tests to fail. It's due to the DHCP agent not starting because of
"process_linux.go:430: container init caused \\"write /proc/self/attr/keycreate: permission denied\\""\n: internal libpod error\n',)] 

See containers/neutron/dhcp-agent.log.1, which is filled with this error. Because the agent can't be started, no nodes can be provisioned.

Marking as duplicate.

*** This bug has been marked as a duplicate of bug 1751300 ***