Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1157405

Summary: openstack services are inactive at the end of the deployment because mariadb is down
Product: Red Hat OpenStack
Reporter: Asaf Hirshberg <ahirshbe>
Component: openstack-foreman-installer
Assignee: Jason Guiditta <jguiditt>
Status: CLOSED NOTABUG
QA Contact: Ofer Blaut <oblaut>
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.0 (RHEL 6)
CC: ahirshbe, cwolfe, jguiditt, mburns, morazi, oblaut, rhos-maint, rohara, sclewis, yeylon
Target Milestone: z2
Target Release: Installer
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-29 21:55:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                     Flags
openstack-status output         none
mariadb logs                    none
/var/log/messages of server1    none
/var/log/messages of server2    none
/var/log/messages of server3    none
mariadb log server1             none
mariadb log server2             none
mariadb log server3             none

Description Asaf Hirshberg 2014-10-27 08:09:40 UTC
Description of problem:
We tried to deploy an HA-neutron installation on bare-metal servers. The deployment first got stuck at 45%, but after checking and seeing that puppet had finished its run, we hit resume in the osp-installer UI and the deployment moved on to installing the compute node. When it reached 100%, we checked the servers and saw that most OpenStack services were inactive.

Version-Release number of selected component (if applicable):
rhel-osp-installer-0.4.5-2.el6ost.noarch (poodle)
rhel 6.6


How reproducible:
3/3

Steps to Reproduce:
1. Create and deploy an HA-neutron deployment
2. If it gets stuck at 45%, press resume
3. At 100%, check the OpenStack services with: openstack-status
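For step 3, a sketch of how the failures show up (the sample lines below are illustrative, not taken from the attachment):

```shell
# Count services that openstack-status reports as inactive. The sample
# text stands in for real output; on a live node you would pipe the
# command itself: openstack-status | grep -ci inactive
sample='openstack-nova-api:       inactive (disabled on boot)
openstack-nova-compute:   active
neutron-server:           inactive (disabled on boot)'
inactive=$(printf '%s\n' "$sample" | grep -ci 'inactive')
printf '%s inactive services\n' "$inactive"
```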

Actual results:
Most of the services were in the inactive state

Expected results:
all services should be active


Additional info:
openstack-status output is attached

Comment 1 Asaf Hirshberg 2014-10-27 08:10:55 UTC
Created attachment 950903 [details]
openstack-status output

Comment 3 Ofer Blaut 2014-10-27 09:12:52 UTC
Created attachment 950917 [details]
mariadb logs

according to neutron.server.log , it can not connect to mariadb

2014-10-26 14:14:22.758 4798 WARNING neutron.openstack.common.db.sqlalchemy.session [-] This application has not enabled MySQL traditional mode, which means silent data corruption may occur. Please encourage the application developers to enable this mode.
2014-10-26 14:14:22.762 4798 WARNING neutron.openstack.common.db.sqlalchemy.session [-] SQL connection failed. infinite attempts left.

mariadb logs attached

Comment 4 Jason Guiditta 2014-10-27 12:53:53 UTC
I see over and over in those logs:

141026 13:59:17 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://192.168.0.3,192.168.0.4,192.168.0.2': -110 (Connection timed out)

Can the other machines actually be reached at those IPs as expected?  We can check other possible issues, but I think that is a good place to start.
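A minimal reachability sketch along those lines, assuming Galera's default replication port 4567; the node list is parsed from the gcomm URL in the error message, and the commented-out line is the live probe:

```shell
# Split the gcomm:// URL from the WSREP error into individual node IPs,
# then (on a real host) probe each one on the Galera replication port.
GCOMM="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"
NODES=$(printf '%s' "${GCOMM#gcomm://}" | tr ',' ' ')
for ip in $NODES; do
    echo "would check $ip:4567"
    # timeout 2 bash -c "cat </dev/null >/dev/tcp/$ip/4567" \
    #     && echo "  reachable" || echo "  unreachable"
done
```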

Comment 5 Ryan O'Hara 2014-10-27 13:49:52 UTC
(In reply to Jason Guiditta from comment #4)
> I see over and over in those logs:
> 
> 141026 13:59:17 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open
> channel 'galera_cluster' at 'gcomm://192.168.0.3,192.168.0.4,192.168.0.2':
> -110 (Connection timed out)

Whatever node is giving this error message is trying to join the cluster and is failing. This usually happens because no node has bootstrapped the cluster. I recommend you figure out which node is the bootstrap node (should be the same as the cluster_control node) and find out why galera is not running there.

> can the other machines actually be reached at those IPs as expected?  We can
> check other possible issues, but I think that is a good place to start.

Also a possibility.

Comment 6 Jason Guiditta 2014-10-27 17:41:11 UTC
Also, to be able to see any errors that could have caused a node to not bootstrap the cluster properly, we need the /var/log/messages and /var/log/mariadb/mariadb.log from all nodes, and the host yaml would be helpful as well (this is in the staypuft UI under the host link; click the yaml button on the left). That will give us a chance to see if any parameters are being set incorrectly.

Comment 7 Ofer Blaut 2014-10-28 05:37:58 UTC
1. Machines can reach each other using the IP addresses.
2. Leonid suggested adding wsrep_cluster_address=gcomm:// to one of the hosts,
in /etc/my.cnf.d/galera.cnf,
and running systemctl start mariadb.

This will start mariadb, but I am not sure this is the correct solution.
The file already has the following wsrep config:

# Group communication system handle
wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"
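A sketch of that workaround, exercised here on a throwaway copy of the file (on the bootstrap host the real file would be /etc/my.cnf.d/galera.cnf, followed by systemctl start mariadb). The empty gcomm:// list tells Galera to form a new cluster instead of joining an existing one:

```shell
# Work on a temp copy of galera.cnf; replace the populated node list
# with an empty one so this node bootstraps the cluster.
CNF=$(mktemp)
cat > "$CNF" <<'EOF'
wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"
EOF
sed -i 's|^wsrep_cluster_address=.*|wsrep_cluster_address="gcomm://"|' "$CNF"
cat "$CNF"
# On the real host, then: systemctl start mariadb
```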

Comment 8 Asaf Hirshberg 2014-10-28 07:04:53 UTC
Created attachment 951291 [details]
/var/log/messages of server1

Comment 9 Asaf Hirshberg 2014-10-28 07:05:26 UTC
Created attachment 951292 [details]
/var/log/messages of server2

Comment 10 Asaf Hirshberg 2014-10-28 07:06:05 UTC
Created attachment 951293 [details]
/var/log/messages of server3

Comment 11 Asaf Hirshberg 2014-10-28 07:06:45 UTC
Created attachment 951294 [details]
mariadb log server1

Comment 12 Asaf Hirshberg 2014-10-28 07:07:23 UTC
Created attachment 951295 [details]
mariadb log server2

Comment 13 Asaf Hirshberg 2014-10-28 07:08:08 UTC
Created attachment 951296 [details]
mariadb log server3

Comment 14 Mike Burns 2014-10-28 12:25:11 UTC
*** Bug 1157236 has been marked as a duplicate of this bug. ***

Comment 15 Ryan O'Hara 2014-10-28 13:55:58 UTC
(In reply to Ofer Blaut from comment #7)
> 1. Machines can reach each other using the ip address 
> 2. Leonid suggested to add wsrep_cluster_address=gcomm:// to one of the
> hosts,
> in /etc/my.cnf.d/galera.cnf
> and systemctl start mariadb

This is exactly what a bootstrap node is. The puppet code will bootstrap one node (the same one that owns the cluster_control_ip) and then the other nodes will join in. Are you saying this is not happening? Are you running puppet multiple times?

> This will start mariadb, but not sure if this is the correct solution .
> The file already have the following  wsrep config 

It is not. This should all be handled by puppet.

> # Group communication system handle
> wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"

Right. After the bootstrap node has started *and* the other nodes have joined, the bootstrap node will get a wsrep_cluster_address with all the nodes' IP addresses. This is by design. If you stop galera on all nodes and try to restart/reboot, you will have to bootstrap manually. Is this what you are doing?
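In config terms, the design described above looks roughly like this (a sketch; the node list is taken from the galera.cnf quoted in comment 7, and which node bootstraps is determined by cluster_control_ip):

```ini
# Bootstrap node only, on first start (forms a new cluster):
wsrep_cluster_address="gcomm://"

# All nodes once the cluster is up (joiners, and the bootstrap node
# after puppet rewrites its config):
wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"
```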

Comment 17 Jason Guiditta 2014-10-28 22:02:23 UTC
Ok, using the same setup, I am unable to reproduce.  There were 2 services that showed as failed in pcs status, cinder-volume and neutron-openvswitch.  The former looks like it wants to use rbd, but the ceph server was not set up.  The latter seems likely a leftover of an earlier issue on the run.  The interface used to get local_ip gets its IP from DHCP, so on initial provisioning of the machine it does not yet have this, reporting an error.  The IP later resolves and the local_ip error goes away.  I saw no galera failures, and the cluster is currently up and running for further testing.

Comment 18 Ofer Blaut 2014-10-29 12:49:12 UTC
Well, my setup was still stuck at 45%, so I redeployed the setup.

Can you please share the workaround and how we move on from there?

Ofer

Comment 19 Jason Guiditta 2014-10-29 13:49:08 UTC
Sorry, I was not watching the whole thing from the UI; once I kicked it off, I was watching the /var/log/messages tail via ssh, and everything was fine.  I think that, for whatever reason, it took too long for staypuft, which then locked the deployment and did not proceed.  I just hit 'resume', and it immediately went to 100% on the controllers, since they were done.  Compute is proceeding now; will update with results when that completes.

Comment 20 Mike Burns 2014-10-29 21:55:11 UTC
Per email comments, this is not reproducible.