RDO tickets are now tracked in Jira: https://issues.redhat.com/projects/RDO/issues/
Bug 1331081 - mitaka current-passed-ci fails to deploy on RHEL 7.2, galera not starting up on all controllers
Keywords:
Status: CLOSED EOL
Alias: None
Product: RDO
Classification: Community
Component: openstack-tripleo
Version: trunk
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: Kilo
Assignee: James Slagle
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-27 16:26 UTC by Attila Darazs
Modified: 2016-07-08 05:36 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-19 15:32:40 UTC
Embargoed:



Description Attila Darazs 2016-04-27 16:26:47 UTC
Description of problem:
During an HA deployment, the overcloud deploy command exits with zero, but the Heat stack ends up in the CREATE_FAILED state.

The failed part is "overcloud-ControllerNodesPostDeployment-qi55q7oqwo26-ControllerServicesBaseDeployment_Step2-cfncnhcqxihv"

The deployment fails on the clustercheck command:

Error: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]\u001b[0m\n\u001b[1;31mError: /Stage[main]/Main/Exec[galera-ready]/returns: change from notrun to 0 failed: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]\u001b[0m\n", "deploy_status_code":
Apr 26 05:21:07 overcloud-controller-1.localdomain os-collect-config[3013]: 6}
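For context on what is failing here: clustercheck reports a Galera node as healthy only when its wsrep state is "Synced". The following is a minimal stand-in for that decision (the real script obtains the state via `mysql -nNE -e "SHOW STATUS LIKE 'wsrep_local_state';"`; here the value is passed as an argument so the logic can be shown without a live cluster):

```shell
# Hedged sketch of the clustercheck decision: a Galera node is considered
# available only when wsrep_local_state is 4 ("Synced").
galera_is_synced() {
    state="$1"
    if [ "$state" = "4" ]; then
        echo "Galera node is synced."
        return 0        # exit 0 -> the node is treated as up
    else
        echo "Galera node is not synced (state=$state)."
        return 1        # any other exit code -> node considered down
    fi
}

galera_is_synced 4    # prints "Galera node is synced."
```

So the `returned 1 instead of one of [0]` above simply means the node never reached the Synced state during the deployment.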

And on a reproducer system, galera is indeed in a degraded state:
# pcs status
[..]
 Master/Slave Set: galera-master [galera]
     galera	(ocf::heartbeat:galera):	FAILED Master overcloud-controller-0 (unmanaged)
     galera	(ocf::heartbeat:galera):	FAILED Master overcloud-controller-1 (unmanaged)
     Masters: [ overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=71, status=complete, exitreason='MySQL server failed to start (pid=16096) (rc=0), please check your installation',
    last-rc-change='Wed Apr 27 15:27:07 2016', queued=0ms, exec=14362ms
* galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=69, status=complete, exitreason='MySQL server failed to start (pid=15015) (rc=0), please check your installation',
    last-rc-change='Wed Apr 27 15:27:07 2016', queued=0ms, exec=14366ms


Version-Release number of selected component (if applicable):
[root@overcloud-controller-0 ~]# rpm -qa|grep galera
galera-25.3.5-7.el7.x86_64
mariadb-server-galera-10.1.12-4.el7.x86_64

[stack@instack ~]$ rpm -qa|grep tripleo
openstack-tripleo-image-elements-0.9.10-0.20160419165211.fdf717f.el7.centos.noarch
tripleo-common-1.0.1-0.20160323101840.d52d04b.el7.centos.noarch
openstack-tripleo-heat-templates-2.0.1-0.20160423124014.671f5c8.el7.centos.noarch
openstack-tripleo-0.0.1-0.20160411152951.b076a5a.el7.centos.noarch
python-tripleoclient-2.0.1-0.20160415042551.c084825.el7.centos.noarch
openstack-tripleo-puppet-elements-2.0.1-0.20160415124916.75e3610.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an HA overcloud of RDO Mitaka (delorean) on RHEL 7.2

Additional info:
There are downstream gate jobs that can reproduce the issue.

Comment 2 Michele Baldessari 2016-05-09 14:20:36 UTC
Likely a simple SELinux relabel is missing when building the image:
type=AVC msg=audit(1461646313.398:171): avc:  denied  { setpgid } for  pid=12510 comm="mysqld" scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:mysqld_t:s0 tclass=process
type=SYSCALL msg=audit(1461646313.398:171): arch=c000003e syscall=109 success=no exit=-13 a0=0 a1=0 a2=1 a3=8 items=0 ppid=12502 pid=12510 auid=4294967295 uid=27 gid=27 euid=27 suid=27 fsuid=27 egid=27 sgid=27 fsgid=27 tty=(none) ses=4294967295 comm="mysqld" exe="/usr/libexec/mysqld" subj=system_u:system_r:mysqld_t:s0 key=(null)
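To inspect such denials on a reproducer, the raw AVC record can be reduced to its interesting fields. A small sketch, using the record above as sample input (the `parse_avc` helper is illustrative, not an existing tool):

```shell
# Hedged sketch: extract "permission comm scontext tclass" from a raw AVC
# record, to see at a glance what was denied and under which context.
avc='type=AVC msg=audit(1461646313.398:171): avc:  denied  { setpgid } for  pid=12510 comm="mysqld" scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:mysqld_t:s0 tclass=process'

parse_avc() {
    sed -n 's/.*denied  { \([^}]*\) } for .*comm="\([^"]*\)".*scontext=\([^ ]*\) .*tclass=\([a-z_]*\).*/\1 \2 \3 \4/p'
}

echo "$avc" | parse_avc
# -> setpgid mysqld system_u:system_r:mysqld_t:s0 process
```

On a live system the same records would normally be pulled with `ausearch -m AVC -c mysqld` and explained with `audit2why`, both from the standard audit/policycoreutils tooling.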

Comment 3 Alan Pevec 2016-05-10 08:18:28 UTC
Is this with the tripleo-quickstart image, or with from-scratch custom-built images?

Comment 4 Michele Baldessari 2016-05-10 10:31:12 UTC
So I looked at the image-building logs, and they do seem to restore the SELinux file contexts:
+ echo dib-run-parts Tue Apr 26 00:15:04 EDT 2016 Running /tmp/in_target.d/finalise.d/90-selinux-fixfiles-restore
dib-run-parts Tue Apr 26 00:15:04 EDT 2016 Running /tmp/in_target.d/finalise.d/90-selinux-fixfiles-restore
+ target_tag=90-selinux-fixfiles-restore
+ date +%s.%N
+ /tmp/in_target.d/finalise.d/90-selinux-fixfiles-restore
+ set -eu
+ set -o pipefail
++ which setfiles
+ SETFILES=/usr/sbin/setfiles
+ '[' -e /etc/selinux/targeted/contexts/files/file_contexts -a -x /usr/sbin/setfiles ']'
+ setfiles /etc/selinux/targeted/contexts/files/file_contexts /
+ target_tag=90-selinux-fixfiles-restore
+ date +%s.%N
+ output '90-selinux-fixfiles-restore completed'


Could we do one of the following two steps to troubleshoot further?
a) Add -v to the setfiles call in the 90-selinux-fixfiles-restore dib element

b) Run virt-customize --selinux-relabel on the produced overcloud image and see if the issue is still there
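For step (b), the relabel pass would look roughly like this (the image filename is an assumption; substitute whatever qcow2 the build actually produced):

```shell
# Hedged sketch of option (b): force a full SELinux relabel inside the
# built overcloud image. "overcloud-full.qcow2" is an assumed filename.
virt-customize -a overcloud-full.qcow2 --selinux-relabel

# The relabelled image then has to be re-uploaded so newly deployed
# nodes actually boot from it:
openstack overcloud image upload --update-existing
```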

Comment 5 Chandan Kumar 2016-05-19 15:32:40 UTC
This bug is against a version which has reached End of Life.
If it's still present in a supported release (http://releases.openstack.org), please update the Version field and reopen.

Comment 6 Graeme Gillies 2016-07-06 07:02:20 UTC
Hi,

We currently have an RDO Mitaka environment experiencing this issue. Was the root cause ever accurately determined? It was closed as EOL, but I'm not entirely sure that is accurate.

Regards,

Graeme

Comment 7 Michele Baldessari 2016-07-06 07:27:52 UTC
Graeme,

did you observe the denials in comment 2? 

cheers,
Michele

Comment 8 Graeme Gillies 2016-07-06 22:17:28 UTC
I see some similar denials for mysql and haproxy

type=AVC msg=audit(1467790984.329:131): avc:  denied  { name_bind } for  pid=9632 comm="haproxy" src=3306 scontext=system_u:system_r:haproxy_t:s0 tcontext=system_u:object_r:mysqld_port_t:s0 tclass=tcp_socket
type=AVC msg=audit(1467790992.061:163): avc:  denied  { write } for  pid=10505 comm="mysqld_safe" path="/tmp/tmp.c2XPC6oag1" dev="sda2" ino=16900546 scontext=system_u:system_r:mysqld_safe_t:s0 tcontext=system_u:object_r:cluster_tmp_t:s0 tclass=file
type=SYSCALL msg=audit(1467790992.061:163): arch=c000003e syscall=59 success=yes exit=0 a0=9aa610 a1=93a010 a2=976640 a3=7ffd526d34a0 items=0 ppid=10386 pid=10505 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mysqld_safe" exe="/usr/bin/bash" subj=system_u:system_r:mysqld_safe_t:s0 key=(null)
type=AVC msg=audit(1467790994.678:166): avc:  denied  { read } for  pid=10865 comm="mysqld_safe" name="cores" dev="sda2" ino=19519726 scontext=system_u:system_r:mysqld_safe_t:s0 tcontext=unconfined_u:object_r:cluster_var_lib_t:s0 tclass=dir
type=SYSCALL msg=audit(1467790994.678:166): arch=c000003e syscall=257 success=yes exit=3 a0=ffffffffffffff9c a1=4a9e9f a2=90800 a3=0 items=0 ppid=10864 pid=10865 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mysqld_safe" exe="/usr/bin/bash" subj=system_u:system_r:mysqld_safe_t:s0 key=(null)
type=AVC msg=audit(1467790994.679:167): avc:  denied  { write } for  pid=10505 comm="mysqld_safe" path="/tmp/tmp.c2XPC6oag1" dev="sda2" ino=16900546 scontext=system_u:system_r:mysqld_safe_t:s0 tcontext=system_u:object_r:cluster_tmp_t:s0 tclass=file
type=SYSCALL msg=audit(1467790994.679:167): arch=c000003e syscall=1 success=yes exit=94 a0=1 a1=7fa2e2821000 a2=5e a3=5d items=0 ppid=10386 pid=10505 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mysqld_safe" exe="/usr/bin/bash" subj=system_u:system_r:mysqld_safe_t:s0 key=(null)
type=AVC msg=audit(1467791012.260:199): avc:  denied  { read } for  pid=12255 comm="mysqld_safe" name="cores" dev="sda2" ino=19519726 scontext=system_u:system_r:mysqld_safe_t:s0 tcontext=unconfined_u:object_r:cluster_var_lib_t:s0 tclass=dir
type=SYSCALL msg=audit(1467791012.260:199): arch=c000003e syscall=257 success=yes exit=3 a0=ffffffffffffff9c a1=4a9e9f a2=90800 a3=0 items=0 ppid=12254 pid=12255 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mysqld_safe" exe="/usr/bin/bash" subj=system_u:system_r:mysqld_safe_t:s0 key=(null)

But note that all nodes are running with SELinux in permissive mode, so this shouldn't affect anything, right?
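For completeness: whether logged denials can actually break a service depends only on the enforcement mode (normally reported by `getenforce`). A trivial sketch of that distinction, factored as a function so it runs without SELinux present:

```shell
# Hedged sketch: in Permissive mode AVC denials are logged but not
# enforced, so they cannot be what stops a service from starting.
# The mode string is passed in (normally the output of `getenforce`).
denials_can_block() {
    case "$1" in
        Enforcing)  echo "yes: denials are enforced and can break services" ;;
        Permissive) echo "no: denials are only logged" ;;
        Disabled)   echo "no: SELinux is off entirely" ;;
        *)          echo "unknown mode: $1" ;;
    esac
}

denials_can_block Permissive    # prints "no: denials are only logged"
```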

Comment 9 Michele Baldessari 2016-07-08 05:36:00 UTC
Correct; if your systems are in permissive mode, these denials do not apply. If we could
get sosreports from all three nodes, we could take a look and see what is
going on there.

