Description of problem:

After A4 was pushed I attempted to deploy an HA configuration (I was able to do this with engineering builds prior to A4). This config is blocked if I use ipmi fencing.

1. On the first pass, the fences are stopped:

[root@ospha1 ~]# pcs status
Cluster name: openstack
Last updated: Mon Jun 2 16:29:33 2014
Last change: Mon Jun 2 16:25:52 2014 via cibadmin on 10.19.139.31
Stack: cman
Current DC: 10.19.139.32 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured
17 Resources configured

Online: [ 10.19.139.31 10.19.139.32 10.19.139.33 ]

Full list of resources:

 stonith-ipmilan-10.19.143.62 (stonith:fence_ipmilan): Stopped
 stonith-ipmilan-10.19.143.61 (stonith:fence_ipmilan): Stopped
 Resource Group: db
     fs-varlibmysql (ocf::heartbeat:Filesystem): Started 10.19.139.33
     mysql-ostk-mysql (ocf::heartbeat:mysql): Started 10.19.139.33
 stonith-ipmilan-10.19.143.63 (stonith:fence_ipmilan): Stopped
 Clone Set: lsb-memcached-clone [lsb-memcached]
     Started: [ 10.19.139.31 10.19.139.32 10.19.139.33 ]
 ip-10.19.139.2 (ocf::heartbeat:IPaddr2): Started 10.19.139.31
 ip-10.19.139.3 (ocf::heartbeat:IPaddr2): Started 10.19.139.32
 Clone Set: lsb-qpidd-clone [lsb-qpidd]
     Started: [ 10.19.139.31 10.19.139.32 10.19.139.33 ]
 ip-10.19.139.18 (ocf::heartbeat:IPaddr2): Started 10.19.139.31
 Clone Set: lsb-haproxy-clone [lsb-haproxy]
     Started: [ 10.19.139.31 10.19.139.32 10.19.139.33 ]

Failed actions:
    stonith-ipmilan-10.19.143.61_start_0 on 10.19.139.31 'unknown error' (1): call=23, status=Error, last-rc-change='Mon Jun 2 16:24:52 2014', queued=1080ms, exec=0ms
    stonith-ipmilan-10.19.143.62_start_0 on 10.19.139.31 'unknown error' (1): call=8, status=Error, last-rc-change='Mon Jun 2 16:24:50 2014', queued=2076ms, exec=0ms
    stonith-ipmilan-10.19.143.63_start_0 on 10.19.139.31 'unknown error' (1): call=34, status=Error, last-rc-change='Mon Jun 2 16:25:04 2014', queued=1155ms, exec=0ms
    stonith-ipmilan-10.19.143.61_start_0 on 10.19.139.33 'unknown error' (1): call=35, status=Error, last-rc-change='Mon Jun 2 16:25:07 2014', queued=1025ms, exec=0ms
    stonith-ipmilan-10.19.143.62_start_0 on 10.19.139.33 'unknown error' (1): call=28, status=Error, last-rc-change='Mon Jun 2 16:25:04 2014', queued=2024ms, exec=0ms
    stonith-ipmilan-10.19.143.63_start_0 on 10.19.139.33 'unknown error' (1): call=41, status=Error, last-rc-change='Mon Jun 2 16:25:08 2014', queued=1055ms, exec=0ms
    stonith-ipmilan-10.19.143.61_start_0 on 10.19.139.32 'unknown error' (1): call=29, status=Error, last-rc-change='Mon Jun 2 16:25:04 2014', queued=1054ms, exec=0ms
    stonith-ipmilan-10.19.143.62_start_0 on 10.19.139.32 'unknown error' (1): call=18, status=Error, last-rc-change='Mon Jun 2 16:24:53 2014', queued=10274ms, exec=0ms
    stonith-ipmilan-10.19.143.63_start_0 on 10.19.139.32 'unknown error' (1): call=35, status=Error, last-rc-change='Mon Jun 2 16:25:07 2014', queued=1065ms, exec=0ms
[root@ospha1 ~]#

2. On trying a second pass there is an error when it attempts to add the fences again:
Notice: /Stage[main]/Quickstack::Pacemaker::Common/Exec[pcs-resource-default]/returns: executed successfully
Debug: /Stage[main]/Quickstack::Pacemaker::Common/Exec[pcs-resource-default]: The container Class[Quickstack::Pacemaker::Common] will propagate my refresh event
Debug: Exec[Enable STONITH](provider=posix): Executing check '/usr/sbin/pcs property show stonith-enabled | grep 'stonith-enabled: false''
Debug: Executing '/usr/sbin/pcs property show stonith-enabled | grep 'stonith-enabled: false''
Debug: /Stage[main]/Pacemaker::Stonith/Exec[Enable STONITH]/onlyif: Error: unable to get crm_config
Debug: /Stage[main]/Pacemaker::Stonith/Exec[Enable STONITH]/onlyif: Call cib_query failed (-62): Timer expired
Debug: Exec[Creating stonith::ipmilan](provider=posix): Executing check '/usr/sbin/pcs stonith show stonith-ipmilan-10.19.143.61 > /dev/null 2>&1'
Debug: Executing '/usr/sbin/pcs stonith show stonith-ipmilan-10.19.143.61 > /dev/null 2>&1'
Debug: Exec[Creating stonith::ipmilan](provider=posix): Executing '/usr/sbin/pcs stonith create stonith-ipmilan-10.19.143.61 fence_ipmilan pcmk_host_list="$(/usr/sbin/crm_node -n)" ipaddr=10.19.143.61 login="root" passwd="4score&7" lanplus="" op monitor interval=60s'
Debug: Executing '/usr/sbin/pcs stonith create stonith-ipmilan-10.19.143.61 fence_ipmilan pcmk_host_list="$(/usr/sbin/crm_node -n)" ipaddr=10.19.143.61 login="root" passwd="4score&7" lanplus="" op monitor interval=60s'
Notice: /Stage[main]/Quickstack::Pacemaker::Stonith::Ipmilan/Exec[Creating stonith::ipmilan]/returns: Error: Unable to create resource/fence device
Notice: /Stage[main]/Quickstack::Pacemaker::Stonith::Ipmilan/Exec[Creating stonith::ipmilan]/returns: Call cib_create failed (-62): Timer expired
Error: /usr/sbin/pcs stonith create stonith-ipmilan-10.19.143.61 fence_ipmilan pcmk_host_list="$(/usr/sbin/crm_node -n)" ipaddr=10.19.143.61 login="root" passwd="4score&7" lanplus="" op monitor interval=60s returned 1 instead of one of [0]
/usr/lib/ruby/site_ruby/1.8/puppet/util/errors.rb:96:in `fail'
/usr/lib/ruby/site_ruby/1.8/puppet/type/exec.rb:125:in `sync'

I will attach the puppet output of the different passes of the node selected as cluster_controller_ip and the yaml of the parameters used. I will try disabling fencing to see if I can make any further progress.

Version-Release number of selected component (if applicable):

[root@ospha-foreman ~]# yum list installed | grep -i -e foreman -e puppet
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
foreman.noarch                       1.3.0.4-1.el6sat    @rhel-x86_64-server-6-ost-4
foreman-installer.noarch             1:1.3.0-1.el6sat    @rhel-x86_64-server-6-ost-4
foreman-mysql.noarch                 1.3.0.4-1.el6sat    @rhel-x86_64-server-6-ost-4
foreman-mysql2.noarch                1.3.0.4-1.el6sat    @rhel-x86_64-server-6-ost-4
foreman-proxy.noarch                 1.3.0-3.el6sat      @rhel-x86_64-server-6-ost-4
foreman-selinux.noarch               1.3.0-1.el6sat      @rhel-x86_64-server-6-ost-4
openstack-foreman-installer.noarch
openstack-puppet-modules.noarch      2013.2-9.1.el6ost   @rhel-x86_64-server-6-ost-4
puppet.noarch                        3.2.4-3.el6_5       @rhel-x86_64-server-6-ost-4
puppet-server.noarch                 3.2.4-3.el6_5       @rhel-x86_64-server-6-ost-4
rubygem-foreman_api.noarch           0.1.6-1.el6sat      @rhel-x86_64-server-6-ost-4
[root@ospha-foreman ~]#

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
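The repeated "Timer expired" (-62) failures on cib_query and cib_create suggest the CIB was not answering when puppet issued the create. As a hedged illustration only (this is not the actual quickstack/puppet fix; the probe command, retry count, and sleep interval are assumptions chosen for this sketch), a wrapper could poll the CIB until it responds before attempting the create:

```shell
#!/bin/sh
# Hedged sketch, not the deployed fix: poll the cluster CIB until it
# answers queries before running "pcs stonith create", to avoid
# "Call cib_create failed (-62): Timer expired".
# CIB_QUERY_CMD, the retry count, and the 1s sleep are illustrative
# assumptions, not values taken from this bug.

# Probe command for the CIB; overridable so the sketch can be exercised
# on a machine without pacemaker installed.
CIB_QUERY_CMD="${CIB_QUERY_CMD:-/usr/sbin/cibadmin --query}"

wait_for_cib() {
    retries=${1:-30}
    i=0
    while [ "$i" -lt "$retries" ]; do
        if $CIB_QUERY_CMD >/dev/null 2>&1; then
            return 0   # CIB answered; safe to modify it
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1           # CIB never became reachable
}

# A production wrapper would use a larger retry budget than this demo.
if wait_for_cib 3; then
    # Password redacted; substitute real IPMI credentials.
    /usr/sbin/pcs stonith create stonith-ipmilan-10.19.143.61 fence_ipmilan \
        pcmk_host_list="$(/usr/sbin/crm_node -n)" ipaddr=10.19.143.61 \
        login="root" passwd="REDACTED" lanplus="" op monitor interval=60s
else
    echo "CIB did not become available; skipping stonith create" >&2
fi
```

Whether a bounded retry actually masks the underlying problem here is unverified; it only avoids racing a CIB that comes up late.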
Created attachment 901789 [details] parameters
Created attachment 901790 [details] puppet output from first pass
Created attachment 901791 [details] puppet output from second pass
Steve -- it worked for me; remember to update the IP address from 143 -> 139 and the root password from old to new after our lab move. I also needed to update my VIPs, as they were on the lab network (10.16 -> 10.19). Jacob
Jacob - Did Foreman-deployed fencing work for you? That is strange, since Jay knows about the issue. Also, all of the management processors are on the 143 (not 139) address space. I did update all my VIPs. I ran into the problem all last week, before the password was updated.
I did not use Foreman. I could not tell if this only applied to Foreman or A4 generally. I tested A4 without Foreman to see if that worked.
Crag, I believe this is fixed in OSP5, right? Maybe you could add the commits where you fixed it to this BZ, and then we can backport when time allows for the next OSP4 release?
I believe the root of the problem was that the version of pacemaker used here errored out on a call to "pcs stonith show ...". I have confirmed this is not a problem on el7, and I did not see it previously on el6 either. We need to confirm whether the pacemaker version in this bug ("1.1.10-14.el6_5.3-368c726") needs to be supported; if so, we can add puppet code that chooses between "pcs resource show" and "pcs stonith show" at run-time based on the pacemaker version. If we have an environment now that exhibits this error, we can send a crm_report to Fabio, David, and Chris and ask them to narrow down which versions of pacemaker this pertains to. (Note: email sent outside of this bug discussing it on 6/9 and 6/10.)
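For what it's worth, the run-time switch described above could be sketched in shell along these lines. The "1.1.11" cutoff below is a placeholder assumption purely for illustration -- the real boundary version is exactly what still needs to be confirmed -- and the version probe via rpm is likewise an assumption about how the check would be done:

```shell
#!/bin/sh
# Hedged sketch: pick the pcs subcommand used to check whether a stonith
# device already exists, based on the installed pacemaker version.
# The "1.1.11" cutoff is a placeholder assumption, not a confirmed boundary.

STONITH_ID="stonith-ipmilan-10.19.143.61"

# Probe the installed pacemaker version, e.g. "1.1.10"; fall back to
# "0.0.0" if the query fails (e.g. pacemaker not installed).
PCMK_VER=$(rpm -q --qf '%{VERSION}' pacemaker 2>/dev/null) || PCMK_VER="0.0.0"

# ver_ge A B: true if version A >= version B (uses GNU sort -V).
ver_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if ver_ge "$PCMK_VER" "1.1.11"; then
    # Assumed-working path on newer pacemaker.
    CHECK_CMD="/usr/sbin/pcs stonith show $STONITH_ID"
else
    # Fall back to the generic resource listing on older pacemaker.
    CHECK_CMD="/usr/sbin/pcs resource show $STONITH_ID"
fi

if $CHECK_CMD >/dev/null 2>&1; then
    echo "$STONITH_ID already exists"
else
    echo "$STONITH_ID not found; would create it here"
fi
```

In puppet this would presumably become a fact or an inline version comparison feeding the Exec's check command rather than a standalone script.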
Let's reverify that this is no longer reproducible.