Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1143053

Summary: Rubygem-Staypuft: HA deployment fails - Error: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: openstack-foreman-installer
Assignee: Ryan O'Hara <rohara>
Status: CLOSED DUPLICATE
QA Contact: Leonid Natapov <lnatapov>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.0 (RHEL 7)
CC: cwolfe, jguiditt, mburns, morazi, rhos-maint, rohara, sasha, sclewis, yeylon
Target Milestone: z1
Target Release: Installer
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-09-19 11:35:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1142873
Attachments:
- messages from controllers/staypuft + foreman.log from the staypuft (flags: none)
- /var/log/mariadb/mariadb.log file from the controllers (flags: none)
- galera.cnf from all controllers (flags: none)
- /var/log/mariadb/mariadb.log file from the one controller where it was down (flags: none)

Description Alexander Chuzhoy 2014-09-17 20:07:42 UTC
Rubygem-Staypuft:  HA deployment fails - Error: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details



Environment:
rhel-osp-installer-0.3.5-1.el6ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el6ost.noarch
openstack-foreman-installer-2.0.24-1.el6ost.noarch
openstack-puppet-modules-2014.1-21.8.el6ost.noarch


Steps to reproduce:
1. Install rhel-osp-installer
2. Configure/run an HA deployment of Neutron+GRE

Result:
The deployment fails. The puppet agent fails with:
Error: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.         
Wrapped exception:                                                                                                                                                                                                   
Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.                                                 
Error: /Stage[main]/Quickstack::Galera::Server/Galera::Server/Service[galera]/ensure: change from stopped to running failed: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details. 

Expected result:
the deployment should complete with no errors.

Comment 1 Alexander Chuzhoy 2014-09-17 20:08:28 UTC
Created attachment 938632 [details]
messages from controllers/staypuft + foreman.log from the staypuft

Comment 2 Jason Guiditta 2014-09-17 20:15:08 UTC
This sounds like a bug in galera that already has a fix, though it may not have gotten into the right rpm yet.  Adding Ryan, as I am pretty sure he has fixed this in another BZ.

Comment 3 Alexander Chuzhoy 2014-09-17 20:27:20 UTC
Created attachment 938634 [details]
/var/log/mariadb/mariadb.log file from the controllers

Comment 4 Alexander Chuzhoy 2014-09-18 00:18:54 UTC
Reproduced on a bare metal setup.

Comment 5 Ryan O'Hara 2014-09-18 00:41:23 UTC
The bug I fixed in galera was related to a problem with IST. From the attached logs, this does not appear to be the problem. Are all of the nodes failing, or just one? If just one, which one?

Comment 6 Ryan O'Hara 2014-09-18 00:45:05 UTC
Please post /etc/my.cnf.d/galera.cnf.

Comment 7 Ryan O'Hara 2014-09-18 01:06:09 UTC
There is something off with this deployment. Can someone help me understand? At approximately 17:52:21 mariadb is successfully started on "controller1". Then it is stopped at 17:52:53. Then it is started again at 19:01:06, around the same time as a crmd error about a "Bad global update". At this point mariadb fails. The reason it fails is because, as far as I can tell, there are no nodes in the galera cluster to join. I have several questions:

1. Why is mariadb stopped at 17:52:53 and then not started again until more than 1 hour later? That seems strange. What is happening during this time?

2. Does the crmd "Bad global update" have anything to do with this?

It seems to me that a galera cluster formed successfully, but then mariadb was stopped and not started again for quite a long time. When it was, there was no cluster to join. Bootstrapping only works once, and it did.

Are we absolutely sure that puppet agent was not run multiple times?

Comment 8 Alexander Chuzhoy 2014-09-18 13:24:32 UTC
Created attachment 938908 [details]
galera.cnf from all controllers.

Comment 9 Alexander Chuzhoy 2014-09-18 13:25:19 UTC
Reply to comment #5 - all the nodes are failing.

Comment 10 Ryan O'Hara 2014-09-18 13:29:23 UTC
(In reply to Alexander Chuzhoy from comment #9)
> Reply to comment #5 - all the nodes are failing.

All the galera.cnf files have this:

wsrep_cluster_address="gcomm://192.168.0.7,192.168.0.9,192.168.0.8"

So there is no bootstrap node and it will always fail. We need to figure out why the puppet code is setting them all this way. This should really only happen if puppet is run twice, but perhaps something has changed in astapor. Need to check with Crag.
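For context, Galera's standard bootstrap convention is that exactly one node starts with an empty gcomm:// URL (forming a new cluster), while the other nodes list their peers and join it. The fragment below is a hypothetical illustration of that convention, not an actual file from this deployment:

```ini
# /etc/my.cnf.d/galera.cnf -- hypothetical bootstrap-node fragment.
# An empty gcomm:// URL tells this node to bootstrap a new cluster:
wsrep_cluster_address="gcomm://"

# Joining nodes instead list the existing members, as seen in the
# attached configs (where, incorrectly, *all* nodes had this form):
# wsrep_cluster_address="gcomm://192.168.0.7,192.168.0.9,192.168.0.8"
```

With every node carrying the peer-list form, each one waits for an existing cluster that nobody ever bootstraps, which matches the "it will always fail" behavior described above.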

Comment 11 Alexander Chuzhoy 2014-09-18 17:14:56 UTC
Reproduced with HANova+Flat.

Comment 12 Crag Wolfe 2014-09-18 17:46:06 UTC
After digging through the logs further with Ryan, discovered evidence that stonith was not configured or disabled:

Sep 17 17:45:20 maca25400702875 puppet-agent[3043]: Unexpected value for parameter fencing_type: :.  Expect one of disabled, fence_ipmilan, or fence_xvm
...
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity

Since fencing was neither configured nor explicitly disabled, pacemaker does not attempt to start the mysqld (galera) resource. I.e., galera remains shut down after the initial bootstrap (normally, pacemaker would have started it back up).

Comment 13 Mike Burns 2014-09-18 19:01:17 UTC
Based on comment 12, this appears to be a consequence of bug 1143047. Awaiting confirmation that resolving that bug resolves this bug.

Comment 14 Alexander Chuzhoy 2014-09-18 20:43:52 UTC
Applied the following workaround:

During the first puppet run, have another terminal open and, right after the cluster is set up ("/Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: executed successfully" appears in /var/log/messages), execute "pcs property set stonith-enabled=false" on any one node.



On one controller the mariadb service was down, but now I was able to bring it up with systemctl start mariadb.
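The timing-sensitive trigger in this workaround can be sketched as a small shell function (an illustration, not code from the installer; it assumes the quoted puppet log message appears verbatim in the stream being scanned, e.g. `tail -F /var/log/messages`):

```shell
#!/bin/sh
# Workaround sketch: read syslog lines until pacemaker's wait-for-settle
# exec reports success, then disable STONITH so resource start-up is not
# blocked. Returns 0 if triggered, 1 if input ends without a match.
watch_and_disable() {
  while read -r line; do
    case "$line" in
      *"Exec[wait-for-settle]/returns: executed successfully"*)
        pcs property set stonith-enabled=false
        return 0
        ;;
    esac
  done
  return 1
}
```

In practice it would be fed live log output, e.g. `tail -Fn0 /var/log/messages | watch_and_disable`, started before kicking off the first puppet run.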

Comment 15 Alexander Chuzhoy 2014-09-18 20:45:34 UTC
Created attachment 939042 [details]
/var/log/mariadb/mariadb.log file from the one controller where it was down.

Comment 16 Alexander Chuzhoy 2014-09-18 20:46:44 UTC
The output from pcs status on one controller:
[root@maca25400702875 ~]# pcs status
Cluster name: openstack             
Last updated: Thu Sep 18 20:46:05 2014
Last change: Thu Sep 18 20:44:12 2014 via cibadmin on maca25400702877.example.com
Stack: corosync                                                                  
Current DC: maca25400702875.example.com (2) - partition with quorum              
Version: 1.1.10-32.el7_0-368c726                                                 
3 Nodes configured                                                               
99 Resources configured                                                          


Online: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]

Full list of resources:

 ip-192.168.0.35        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com 
 ip-192.168.0.37        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com 
 ip-192.168.0.34        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com 
 ip-192.168.0.29        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com 
 ip-192.168.0.30        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com 
 ip-192.168.0.28        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com 
 ip-192.168.0.41        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com 
 ip-192.168.0.36        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com 
 Clone Set: memcached-clone [memcached]                                                     
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: rabbitmq-server-clone [rabbitmq-server]                                                  
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: haproxy-clone [haproxy]                                                                  
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.18        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 Clone Set: mysqld-clone [mysqld]                                                                    
     Started: [ maca25400702876.example.com maca25400702877.example.com ]                            
     Stopped: [ maca25400702875.example.com ]                                                        
 ip-192.168.0.33        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.32        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.31        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-keystone-clone [openstack-keystone]                                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: fs-varlibglanceimages-clone [fs-varlibglanceimages]                                      
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.20        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 ip-192.168.0.21        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.19        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]                              
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]                                        
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.38        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 ip-192.168.0.39        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.40        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]                                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]                              
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]                                
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]                                
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.17        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 ip-192.168.0.16        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.15        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]                                        
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started maca25400702877.example.com 
 Clone Set: neutron-server-clone [neutron-server]                                                           
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]       
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]                                                 
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]       
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]                                             
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]       
 Resource Group: neutron-agents                                                                             
     neutron-openvswitch-agent  (systemd:neutron-openvswitch-agent):    Started maca25400702877.example.com 
     neutron-dhcp-agent (systemd:neutron-dhcp-agent):   Started maca25400702877.example.com                 
     neutron-l3-agent   (systemd:neutron-l3-agent):     Started maca25400702877.example.com                 
     neutron-metadata-agent     (systemd:neutron-metadata-agent):       Started maca25400702877.example.com 
 ip-192.168.0.23        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com                 
 ip-192.168.0.24        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com                 
 ip-192.168.0.27        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com                 
 ip-192.168.0.26        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com                 
 ip-192.168.0.25        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com
 ip-192.168.0.22        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Resource Group: heat
     openstack-heat-engine      (systemd:openstack-heat-engine):        Started maca25400702875.example.com
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: httpd-clone [httpd]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]

Failed actions:
    mysqld_start_0 on maca25400702875.example.com 'OCF_PENDING' (196): call=68, status=complete, last-rc-change='Thu Sep 18 20:12:55 2014', queued=2ms, exec=2001ms


PCSD Status:
  192.168.0.11: Online
  192.168.0.7: Online
  192.168.0.8: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 17 Mike Burns 2014-09-19 11:35:46 UTC
Closing as a duplicate per comment 13 and comment 16

*** This bug has been marked as a duplicate of bug 1143047 ***