Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1143053

Summary: Rubygem-Staypuft: HA deployment fails - Error: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: openstack-foreman-installer
Assignee: Ryan O'Hara <rohara>
Status: CLOSED DUPLICATE
QA Contact: Leonid Natapov <lnatapov>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.0 (RHEL 7)
CC: cwolfe, jguiditt, mburns, morazi, rhos-maint, rohara, sasha, sclewis, yeylon
Target Milestone: z1
Target Release: Installer
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-09-19 11:35:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1142873
Attachments:
- messages from controllers/staypuft + foreman.log from the staypuft (flags: none)
- /var/log/mariadb/mariadb.log file from the controllers (flags: none)
- galera.cnf from all controllers (flags: none)
- /var/log/mariadb/mariadb.log file from the one controller where it was down (flags: none)

Description Alexander Chuzhoy 2014-09-17 20:07:42 UTC
Rubygem-Staypuft:  HA deployment fails - Error: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details



Environment:
rhel-osp-installer-0.3.5-1.el6ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el6ost.noarch
openstack-foreman-installer-2.0.24-1.el6ost.noarch
openstack-puppet-modules-2014.1-21.8.el6ost.noarch


Steps to reproduce:
1. Install rhel-osp-installer
2. Configure/run an HA deployment of Neutron+GRE

Result:
The deployment fails. The puppet agent fails with:
Error: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.         
Wrapped exception:                                                                                                                                                                                                   
Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.                                                 
Error: /Stage[main]/Quickstack::Galera::Server/Galera::Server/Service[galera]/ensure: change from stopped to running failed: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details. 

Expected result:
the deployment should complete with no errors.

Comment 1 Alexander Chuzhoy 2014-09-17 20:08:28 UTC
Created attachment 938632 [details]
messages from controllers/staypuft + foreman.log from the staypuft

Comment 2 Jason Guiditta 2014-09-17 20:15:08 UTC
This sounds like a bug in galera that already has a fix, though it may not have gotten into the right rpm yet.  Adding Ryan, as I am pretty sure he has fixed this in another BZ.

Comment 3 Alexander Chuzhoy 2014-09-17 20:27:20 UTC
Created attachment 938634 [details]
/var/log/mariadb/mariadb.log file from the controllers

Comment 4 Alexander Chuzhoy 2014-09-18 00:18:54 UTC
Reproduced on a bare metal setup.

Comment 5 Ryan O'Hara 2014-09-18 00:41:23 UTC
The bug I fixed in galera was related to a problem with IST. From the attached logs, this does not appear to be the problem. Are all of the nodes failing, or just one? If just one, which one?

Comment 6 Ryan O'Hara 2014-09-18 00:45:05 UTC
Please post /etc/my.cnf.d/galera.cnf.

Comment 7 Ryan O'Hara 2014-09-18 01:06:09 UTC
There is something off with this deployment. Can someone help me understand? At approximately 17:52:21 mariadb is successfully started on "controller1". Then it is stopped at 17:52:53. Then it is started again at 19:01:06, around the same time as a crmd error about a "Bad global update". At this point mariadb fails. The reason it fails is because, as far as I can tell, there are no nodes in the galera cluster to join. I have several questions:

1. Why is mariadb stopped at 17:52:53 and then not started again until more than 1 hour later? That seems strange. What is happening during this time?

2. Does the crmd "Bad global update" have anything to do with this?

It seems to me that a galera cluster formed successfully, but then mariadb was stopped and not started again for quite a long time. When it was, there was no cluster to join. Bootstrapping only works once, and it did.

Are we absolutely sure that puppet agent was not run multiple times?

Comment 8 Alexander Chuzhoy 2014-09-18 13:24:32 UTC
Created attachment 938908 [details]
galera.cnf from all controllers.

Comment 9 Alexander Chuzhoy 2014-09-18 13:25:19 UTC
Reply to comment #5 - all the nodes are failing.

Comment 10 Ryan O'Hara 2014-09-18 13:29:23 UTC
(In reply to Alexander Chuzhoy from comment #9)
> Reply to comment #5 - all the nodes are failing.

All the galera.cnf files have this:

wsrep_cluster_address="gcomm://192.168.0.7,192.168.0.9,192.168.0.8"

So there is no bootstrap node and it will always fail. We need to figure out why the puppet code is setting them all this way. This should really only happen if puppet is run twice, but perhaps something has changed in astapor. Need to check with Crag.
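For context, Galera's standard bootstrap convention is that exactly one node starts with an empty gcomm:// URL (forming a new cluster), while the other nodes list their peers and join it. The fragment below is a hypothetical illustration of that convention, not an actual file from this deployment:

```ini
# /etc/my.cnf.d/galera.cnf -- hypothetical bootstrap-node fragment.
# An empty gcomm:// URL tells this node to bootstrap a new cluster:
wsrep_cluster_address="gcomm://"

# Joining nodes instead list the existing members, as seen in the
# attached configs (where, incorrectly, *all* nodes had this form):
# wsrep_cluster_address="gcomm://192.168.0.7,192.168.0.9,192.168.0.8"
```

With every node carrying the peer-list form, each one waits for an existing cluster that nobody ever bootstraps, which matches the "it will always fail" behavior described above.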

Comment 11 Alexander Chuzhoy 2014-09-18 17:14:56 UTC
Reproduced with HANova+Flat.

Comment 12 Crag Wolfe 2014-09-18 17:46:06 UTC
After digging through the logs further with Ryan, discovered evidence that stonith was not configured or disabled:

Sep 17 17:45:20 maca25400702875 puppet-agent[3043]: Unexpected value for parameter fencing_type: :.  Expect one of disabled, fence_ipmilan, or fence_xvm
...
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity

Since fencing was neither configured nor explicitly disabled, pacemaker does not attempt to start the mysqld (galera) resource. I.e., galera remains shut down after the initial bootstrap (normally, pacemaker would have started it back up).

Comment 13 Mike Burns 2014-09-18 19:01:17 UTC
Based on comment 12, this appears to be a consequence of bug 1143047. Awaiting confirmation that resolving that bug resolves this bug.

Comment 14 Alexander Chuzhoy 2014-09-18 20:43:52 UTC
Applied the following workaround:

During the first puppet run, have another terminal open and, right after the cluster is set up ("/Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: executed successfully" appears in /var/log/messages), execute "pcs property set stonith-enabled=false" on any one node.



On one controller the mariadb service was down, but now I was able to bring it up with systemctl start mariadb.
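The timing-sensitive trigger in this workaround can be sketched as a small shell function (an illustration, not code from the installer; it assumes the quoted puppet log message appears verbatim in the stream being scanned, e.g. `tail -F /var/log/messages`):

```shell
#!/bin/sh
# Workaround sketch: read syslog lines until pacemaker's wait-for-settle
# exec reports success, then disable STONITH so resource start-up is not
# blocked. Returns 0 if triggered, 1 if input ends without a match.
watch_and_disable() {
  while read -r line; do
    case "$line" in
      *"Exec[wait-for-settle]/returns: executed successfully"*)
        pcs property set stonith-enabled=false
        return 0
        ;;
    esac
  done
  return 1
}
```

In practice it would be fed live log output, e.g. `tail -Fn0 /var/log/messages | watch_and_disable`, started before kicking off the first puppet run.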

Comment 15 Alexander Chuzhoy 2014-09-18 20:45:34 UTC
Created attachment 939042 [details]
/var/log/mariadb/mariadb.log file from the one controller where it was down.

Comment 16 Alexander Chuzhoy 2014-09-18 20:46:44 UTC
The output from pcs status on one controller:
[root@maca25400702875 ~]# pcs status
Cluster name: openstack             
Last updated: Thu Sep 18 20:46:05 2014
Last change: Thu Sep 18 20:44:12 2014 via cibadmin on maca25400702877.example.com
Stack: corosync                                                                  
Current DC: maca25400702875.example.com (2) - partition with quorum              
Version: 1.1.10-32.el7_0-368c726                                                 
3 Nodes configured                                                               
99 Resources configured                                                          


Online: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]

Full list of resources:

 ip-192.168.0.35        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com 
 ip-192.168.0.37        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com 
 ip-192.168.0.34        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com 
 ip-192.168.0.29        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com 
 ip-192.168.0.30        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com 
 ip-192.168.0.28        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com 
 ip-192.168.0.41        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com 
 ip-192.168.0.36        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com 
 Clone Set: memcached-clone [memcached]                                                     
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: rabbitmq-server-clone [rabbitmq-server]                                                  
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: haproxy-clone [haproxy]                                                                  
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.18        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 Clone Set: mysqld-clone [mysqld]                                                                    
     Started: [ maca25400702876.example.com maca25400702877.example.com ]                            
     Stopped: [ maca25400702875.example.com ]                                                        
 ip-192.168.0.33        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.32        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.31        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-keystone-clone [openstack-keystone]                                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: fs-varlibglanceimages-clone [fs-varlibglanceimages]                                      
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.20        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 ip-192.168.0.21        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.19        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]                              
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]                                        
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.38        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 ip-192.168.0.39        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.40        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]                                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]                              
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]                                
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]                                
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 ip-192.168.0.17        (ocf::heartbeat:IPaddr2):       Started maca25400702877.example.com          
 ip-192.168.0.16        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com          
 ip-192.168.0.15        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com          
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]                                        
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]                            
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started maca25400702877.example.com 
 Clone Set: neutron-server-clone [neutron-server]                                                           
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]       
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]                                                 
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]       
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]                                             
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]       
 Resource Group: neutron-agents                                                                             
     neutron-openvswitch-agent  (systemd:neutron-openvswitch-agent):    Started maca25400702877.example.com 
     neutron-dhcp-agent (systemd:neutron-dhcp-agent):   Started maca25400702877.example.com                 
     neutron-l3-agent   (systemd:neutron-l3-agent):     Started maca25400702877.example.com                 
     neutron-metadata-agent     (systemd:neutron-metadata-agent):       Started maca25400702877.example.com 
 ip-192.168.0.23        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com                 
 ip-192.168.0.24        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com                 
 ip-192.168.0.27        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com                 
 ip-192.168.0.26        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com                 
 ip-192.168.0.25        (ocf::heartbeat:IPaddr2):       Started maca25400702875.example.com
 ip-192.168.0.22        (ocf::heartbeat:IPaddr2):       Started maca25400702876.example.com
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Resource Group: heat
     openstack-heat-engine      (systemd:openstack-heat-engine):        Started maca25400702875.example.com
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
 Clone Set: httpd-clone [httpd]
     Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]

Failed actions:
    mysqld_start_0 on maca25400702875.example.com 'OCF_PENDING' (196): call=68, status=complete, last-rc-change='Thu Sep 18 20:12:55 2014', queued=2ms, exec=2001ms


PCSD Status:
  192.168.0.11: Online
  192.168.0.7: Online
  192.168.0.8: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 17 Mike Burns 2014-09-19 11:35:46 UTC
Closing as a duplicate per comment 13 and comment 16

*** This bug has been marked as a duplicate of bug 1143047 ***