Bug 1157405
| Summary: | openstack service are in inactive state at the end of the deployment since mariadb is down | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Asaf Hirshberg <ahirshbe> |
| Component: | openstack-foreman-installer | Assignee: | Jason Guiditta <jguiditt> |
| Status: | CLOSED NOTABUG | QA Contact: | Ofer Blaut <oblaut> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.0 (RHEL 6) | CC: | ahirshbe, cwolfe, jguiditt, mburns, morazi, oblaut, rhos-maint, rohara, sclewis, yeylon |
| Target Milestone: | z2 | | |
| Target Release: | Installer | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-29 21:55:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Asaf Hirshberg
2014-10-27 08:09:40 UTC
Created attachment 950903 [details]
openstack-status output
Created attachment 950917 [details]
mariadb logs
According to neutron's server.log, it cannot connect to mariadb:
2014-10-26 14:14:22.758 4798 WARNING neutron.openstack.common.db.sqlalchemy.session [-] This application has not enabled MySQL traditional mode, which means silent data corruption may occur. Please encourage the application developers to enable this mode.
2014-10-26 14:14:22.762 4798 WARNING neutron.openstack.common.db.sqlalchemy.session [-] SQL connection failed. infinite attempts left.
mariadb logs attached
I see over and over in those logs:

141026 13:59:17 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'galera_cluster' at 'gcomm://192.168.0.3,192.168.0.4,192.168.0.2': -110 (Connection timed out)

Can the other machines actually be reached at those IPs as expected? We can check other possible issues, but I think that is a good place to start.

(In reply to Jason Guiditta from comment #4)
> I see over and over in those logs:
>
> 141026 13:59:17 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open
> channel 'galera_cluster' at 'gcomm://192.168.0.3,192.168.0.4,192.168.0.2':
> -110 (Connection timed out)

Whatever node is giving this error message is trying to join the cluster and is failing. This usually happens because no node has bootstrapped the cluster. I recommend you identify the bootstrap node (it should be the same as the cluster_control node) and find out why galera is not running there.

> can the other machines actually be reached at those IPs as expected? We can
> check other possible issues, but I think that is a good place to start.

Also a possibility. Also, to be able to see any errors that could have caused a node to not bootstrap the cluster properly, we need /var/log/messages and /var/log/mariadb/mariadb.log from all nodes; the host yaml would be helpful as well (in the Staypuft UI, under the host link, click the YAML button on the left). That will give us a chance to see if any parameters are being set incorrectly.

1. Machines can reach each other using the IP addresses.
2. Leonid suggested adding wsrep_cluster_address=gcomm:// to one of the hosts, in /etc/my.cnf.d/galera.cnf, and then running systemctl start mariadb.

This will start mariadb, but I am not sure this is the correct solution. The file already has the following wsrep config:

# Group communication system handle
wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"

Created attachment 951291 [details]
/var/log/messages of server1
Created attachment 951292 [details]
/var/log/messages of server2
Created attachment 951293 [details]
/var/log/messages of server3
Created attachment 951294 [details]
mariadb log of server1
Created attachment 951295 [details]
mariadb log of server2
Created attachment 951296 [details]
mariadb log of server3
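Jason's reachability question above can be scripted. The sketch below pulls the peer list out of the wsrep_cluster_address line of a galera.cnf (a sample copy with the exact line from this report is written to a temp file for illustration; on a live node you would read /etc/my.cnf.d/galera.cnf instead, and uncomment the `nc` probe of the Galera replication port):

```shell
#!/bin/sh
# Sample config written to a temp file purely so the parsing
# below has something to work on; the wsrep line is from this report.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
# Group communication system handle
wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"
EOF

# Extract the comma-separated peer list from the gcomm:// URL.
peers=$(grep '^wsrep_cluster_address' "$cfg" \
        | sed -e 's/.*gcomm:\/\///' -e 's/"//g' \
        | tr ',' ' ')

for ip in $peers; do
    echo "peer: $ip"
    # On a live node, probe Galera's group communication port:
    # nc -z -w 2 "$ip" 4567 || echo "cannot reach $ip:4567"
done
rm -f "$cfg"
```

Port 4567 is Galera's default group communication port; a timeout there (as in the -110 error above) usually means a firewall rule or a node where mysqld is simply not running.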
*** Bug 1157236 has been marked as a duplicate of this bug. ***

(In reply to Ofer Blaut from comment #7)
> 1. Machines can reach each other using the ip address
> 2. Leonid suggested to add wsrep_cluster_address=gcomm:// to one of the
> hosts,
> in /etc/my.cnf.d/galera.cnf
> and systemctl start mariadb

This is exactly what a bootstrap node is. The puppet code will bootstrap one node (the same one that owns the cluster_control_ip) and then the other nodes will join in. Are you saying this is not happening? Are you running puppet multiple times?

> This will start mariadb, but not sure if this is the correct solution .
> The file already have the following wsrep config

It is not. This should all be handled by puppet.

> # Group communication system handle
> wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"

Right. After the bootstrap node has started *and* the other nodes have joined, the bootstrap node will get a wsrep_cluster_address with all the nodes' IP addresses. This is by design. If you stop galera on all nodes and try to restart/reboot, you will have to bootstrap manually. Is this what you are doing?

Ok, using the same setup, I am unable to reproduce. There were 2 services that showed as failed in pcs status: cinder-volume and neutron-openvswitch. The former looks like it wants to use rbd, but the ceph server was not set up. The latter seems likely a leftover of an earlier issue in the run: the interface used to get local_ip gets its IP from DHCP, so on initial provisioning of the machine it does not yet have this, and reports an error. The IP later resolves and the local_ip error goes away. I saw no galera failures, and the cluster is currently up and running for further testing.

Well, my setup was still stuck at 45%, so I redeployed the setup. Can you please share the workaround and how we move on from there?
Ofer

Sorry, I was not watching the whole thing from the UI; once I kicked it off, I was watching the /var/log/messages tail via ssh, and everything was fine. I think that for whatever reason it took too long for staypuft, which then locked the deployment and did not proceed. I just hit 'resume', and it immediately went to 100% on the controllers, since they were done. Compute is proceeding now; I will update with results when that completes.

Per email comments, this is not reproducible.
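For reference, the bootstrap procedure discussed in this thread comes down to a one-node config difference (a sketch; the IPs are the ones from this report, and the exact file layout may differ between releases):

```ini
# /etc/my.cnf.d/galera.cnf on the bootstrap node (the cluster_control node)
# only, and only for the very first start: an empty gcomm:// tells Galera
# to form a new cluster instead of trying to join an existing one.
wsrep_cluster_address="gcomm://"

# On the joining nodes (and on the bootstrap node once the cluster is up),
# the address lists all members, as puppet eventually writes it:
# wsrep_cluster_address="gcomm://192.168.0.3,192.168.0.4,192.168.0.2"
```

After a full-cluster shutdown, one node must again be started with the empty address (newer MariaDB releases ship a `galera_new_cluster` helper for this) before the others can join; that is the manual bootstrap Ryan refers to above.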