Bug 1143053
| Summary: | Rubygem-Staypuft: HA deployment fails - Error: Could not start Service[galera]: Execution of '/usr/bin/systemctl start mariadb' returned 1: Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
| Component: | openstack-foreman-installer | Assignee: | Ryan O'Hara <rohara> |
| Status: | CLOSED DUPLICATE | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 5.0 (RHEL 7) | CC: | cwolfe, jguiditt, mburns, morazi, rhos-maint, rohara, sasha, sclewis, yeylon |
| Target Milestone: | z1 | ||
| Target Release: | Installer | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2014-09-19 11:35:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1142873 | ||
| Attachments: | |||
Description
Alexander Chuzhoy
2014-09-17 20:07:42 UTC
Created attachment 938632 [details]
messages from controllers/staypuft + foreman.log from the staypuft
This sounds like a bug in galera that already has a fix, though it may not have gotten into the right rpm yet. Adding Ryan, as I am pretty sure he has fixed this in another BZ.
Created attachment 938634 [details]
/var/log/mariadb/mariadb.log file from the controllers
Reproduced on bare metal setup.
The bug I fixed in galera was related to a problem in IST. From the logs attached, this does not appear to be the problem.
Is it all nodes that are failing or just one? If just one, which one? Please post /etc/my.cnf.d/galera.cnf.
There is something off with this deployment. Can someone help me understand? At approximately 17:52:21 mariadb is successfully started on "controller1". Then it is stopped at 17:52:53. Then it is started again at 19:01:06, around the same time as a crmd error about a "Bad global update". At this point mariadb fails. The reason it fails is that, as far as I can tell, there are no nodes in the galera cluster to join. I have several questions:
1. Why is mariadb stopped at 17:52:53 and then not started again until more than 1 hour later? That seems strange. What is happening during this time?
2. Does the crmd "Bad global update" have anything to do with this?
It seems to me there was a galera cluster formed successfully, but then mariadb was stopped and not started again for quite a long time. When it was, there was no cluster to join. Bootstrapping only works once, and it did.
Are we absolutely sure that puppet agent was not run multiple times?
Created attachment 938908 [details]
galera.cnf from all controllers.
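The attached galera.cnf files can be checked for a bootstrap node with a short script. This is only a sketch in Python, assuming the usual Galera convention that an empty gcomm:// address bootstraps a new cluster while a non-empty peer list makes the node try to join existing members. The function names cluster_peers and has_bootstrap_node are hypothetical and not part of any tool used in this deployment.

```python
import re


def cluster_peers(galera_cnf_text):
    """Return the peers listed in wsrep_cluster_address, or None if it is unset."""
    for raw in galera_cnf_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() != "wsrep_cluster_address":
            continue
        value = value.strip().strip("\"'")
        # Drop any "?option=..." suffix; only the address list matters here.
        match = re.match(r"gcomm://([^?]*)", value)
        if match is None:
            return None
        return [peer for peer in match.group(1).split(",") if peer]
    return None


def has_bootstrap_node(configs):
    """True if at least one node has an empty gcomm:// address (assumed bootstrap form)."""
    return any(cluster_peers(text) == [] for text in configs.values())
```

Passing a dict of {hostname: file contents} for the three controllers to has_bootstrap_node returns False when every file lists all peers, meaning no node is configured to bootstrap.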
Reply to comment #5 - all the nodes are failing.
(In reply to Alexander Chuzhoy from comment #9)
> Reply to comment #5 - all the nodes are failing.
All the galera.cnf files have this:
wsrep_cluster_address="gcomm://192.168.0.7,192.168.0.9,192.168.0.8"
So there is no bootstrap node and it will always fail. We need to figure out why the puppet code is setting them all this way. This should really only happen if puppet is run twice, but perhaps something has changed in astapor. Need to check with Crag.
Reproduced with HANova+Flat.
After digging through the logs further with Ryan, discovered evidence that stonith was neither configured nor disabled:
Sep 17 17:45:20 maca25400702875 puppet-agent[3043]: Unexpected value for parameter fencing_type: :. Expect one of disabled, fence_ipmilan, or fence_xvm
...
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Sep 17 17:52:55 maca25400702877 pengine[12782]: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Since fencing was not configured or explicitly disabled, pacemaker does not attempt to start the mysqld (galera) resource. I.e., galera remains shut down after the initial bootstrap (normally, pacemaker would have started it back up).
Based on comment 12, this appears to be a consequence of bug 1143047. Awaiting confirmation that resolving that bug resolves this bug.
Applied the following workaround:
During the first puppet run, have another terminal open and, right after the cluster is set up (look for "(/Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns) executed successfully" in /var/log/messages), execute "pcs property set stonith-enabled=false" on any one node.
On one controller the mariadb service was down, but now I was able to bring it up with systemctl start mariadb.
Created attachment 939042 [details]
/var/log/mariadb/mariadb.log file from the one controller where it was down.
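The manual workaround above (watch /var/log/messages, then run pcs) could be scripted roughly as follows. This is only a sketch in Python, not something that was run for this bug: the function names, the polling interval, and the max_polls parameter are hypothetical, and the marker string is the puppet message quoted in the workaround. It rereads the whole log on every poll, so a marker left over from an earlier puppet run would trigger it immediately.

```python
import subprocess
import time

# Substring of the puppet log message quoted in the workaround.
SETTLE_MARKER = "Pacemaker::Corosync/Exec[wait-for-settle]/returns) executed successfully"


def cluster_settled(log_lines):
    """True if any log line reports that wait-for-settle has completed."""
    return any(SETTLE_MARKER in line for line in log_lines)


def disable_stonith_when_settled(log_path="/var/log/messages", poll_seconds=5,
                                 run=subprocess.check_call, sleep=time.sleep,
                                 max_polls=None):
    """Poll the log and run the pcs command once the marker appears.

    Returns True if the command was issued, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        with open(log_path, errors="replace") as log:
            if cluster_settled(log):
                run(["pcs", "property", "set", "stonith-enabled=false"])
                return True
        polls += 1
        sleep(poll_seconds)
    return False
```

It would need to run as root on one controller during the first puppet run. The run and sleep parameters exist only so the logic can be exercised without touching a real cluster.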
The output from pcs status on one controller:
[root@maca25400702875 ~]# pcs status
Cluster name: openstack
Last updated: Thu Sep 18 20:46:05 2014
Last change: Thu Sep 18 20:44:12 2014 via cibadmin on maca25400702877.example.com
Stack: corosync
Current DC: maca25400702875.example.com (2) - partition with quorum
Version: 1.1.10-32.el7_0-368c726
3 Nodes configured
99 Resources configured
Online: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Full list of resources:
ip-192.168.0.35 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.37 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
ip-192.168.0.34 (ocf::heartbeat:IPaddr2): Started maca25400702877.example.com
ip-192.168.0.29 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.30 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
ip-192.168.0.28 (ocf::heartbeat:IPaddr2): Started maca25400702877.example.com
ip-192.168.0.41 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.36 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
Clone Set: memcached-clone [memcached]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: rabbitmq-server-clone [rabbitmq-server]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: haproxy-clone [haproxy]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
ip-192.168.0.18 (ocf::heartbeat:IPaddr2): Started maca25400702877.example.com
Clone Set: mysqld-clone [mysqld]
Started: [ maca25400702876.example.com maca25400702877.example.com ]
Stopped: [ maca25400702875.example.com ]
ip-192.168.0.33 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.32 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.31 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
Clone Set: openstack-keystone-clone [openstack-keystone]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: fs-varlibglanceimages-clone [fs-varlibglanceimages]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
ip-192.168.0.20 (ocf::heartbeat:IPaddr2): Started maca25400702877.example.com
ip-192.168.0.21 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.19 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: openstack-glance-api-clone [openstack-glance-api]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
ip-192.168.0.38 (ocf::heartbeat:IPaddr2): Started maca25400702877.example.com
ip-192.168.0.39 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.40 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: openstack-nova-api-clone [openstack-nova-api]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
ip-192.168.0.17 (ocf::heartbeat:IPaddr2): Started maca25400702877.example.com
ip-192.168.0.16 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.15 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
openstack-cinder-volume (systemd:openstack-cinder-volume): Started maca25400702877.example.com
Clone Set: neutron-server-clone [neutron-server]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Resource Group: neutron-agents
neutron-openvswitch-agent (systemd:neutron-openvswitch-agent): Started maca25400702877.example.com
neutron-dhcp-agent (systemd:neutron-dhcp-agent): Started maca25400702877.example.com
neutron-l3-agent (systemd:neutron-l3-agent): Started maca25400702877.example.com
neutron-metadata-agent (systemd:neutron-metadata-agent): Started maca25400702877.example.com
ip-192.168.0.23 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.24 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
ip-192.168.0.27 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.26 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
ip-192.168.0.25 (ocf::heartbeat:IPaddr2): Started maca25400702875.example.com
ip-192.168.0.22 (ocf::heartbeat:IPaddr2): Started maca25400702876.example.com
Clone Set: openstack-heat-api-clone [openstack-heat-api]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Resource Group: heat
openstack-heat-engine (systemd:openstack-heat-engine): Started maca25400702875.example.com
Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Clone Set: httpd-clone [httpd]
Started: [ maca25400702875.example.com maca25400702876.example.com maca25400702877.example.com ]
Failed actions:
mysqld_start_0 on maca25400702875.example.com 'OCF_PENDING' (196): call=68, status=complete, last-rc-change='Thu Sep 18 20:12:55 2014', queued=2ms, exec=2001ms
PCSD Status:
192.168.0.11: Online
192.168.0.7: Online
192.168.0.8: Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
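With this many resources, the one stopped clone instance is easy to miss in the listing above. A rough Python sketch for pulling out clone sets that report Stopped nodes is shown below. It assumes the text layout of the pcs status output exactly as pasted in this comment, which is meant for humans and may differ between pcs versions; the function name stopped_clones is hypothetical.

```python
import re


def stopped_clones(pcs_status_text):
    """Map each clone set name to the nodes listed on its 'Stopped:' line."""
    stopped = {}
    current = None
    for raw in pcs_status_text.splitlines():
        line = raw.strip()
        header = re.match(r"Clone Set: (\S+) \[", line)
        if header:
            current = header.group(1)
            continue
        nodes = re.match(r"Stopped: \[ (.*) \]$", line)
        if nodes and current:
            stopped[current] = nodes.group(1).split()
        elif not line.startswith("Started:"):
            # Any other line ends the current clone set's block.
            current = None
    return stopped
```

Fed the output above, it should report only mysqld-clone, with maca25400702875.example.com as the stopped node, which matches the mysqld_start_0 entry under Failed actions.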
Closing as a duplicate per comment 13 and comment 16.
*** This bug has been marked as a duplicate of bug 1143047 ***