Bug 666041

Summary: Heartbeat reboots system
Product: Fedora
Component: pacemaker
Version: 14
Reporter: Zoltan Boszormenyi <zboszor>
Assignee: Andrew Beekhof <andrew>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: andrew, fdinitto, kevin, lhh
Status: CLOSED ERRATA
Severity: high
Priority: low
Hardware: x86_64
OS: Linux
Fixed In Version: pacemaker-1.1.4-5.fc14
Doc Type: Bug Fix
Last Closed: 2011-01-13 23:34:24 UTC
Attachments:
Description                       Flags
cib.xml                           none
/var/log/messages                 none
/var/log/messages from host OS    none

Description Zoltan Boszormenyi 2010-12-28 15:05:17 UTC
Description of problem:

I tried to set up heartbeat on Fedora 14. After creating/editing authkeys
and ha.cf in /etc/ha.d (contents below), I started it. After a few
seconds the system rebooted by itself; according to the logs, heartbeat
itself triggered the reboot. This is the relevant log section:

Dec 28 12:34:46 db-ha1 cib: [13629]: info: startCib: CIB Initialization completed successfully
Dec 28 12:34:46 db-ha1 cib: [13629]: CRIT: get_cluster_type: This installation of Pacemaker does not support the '(null)' cluster infrastructure.  Terminating.
Dec 28 12:34:46 db-ha1 heartbeat: [13574]: WARN: Managed /usr/lib64/heartbeat/cib process 13629 exited with return code 100.
Dec 28 12:34:46 db-ha1 heartbeat: [13574]: EMERG: Rebooting system.  Reason: /usr/lib64/heartbeat/cib

authkeys:
=======================
auth 1
1 sha1 secretpass
=======================

ha.cf:
=======================
crm on
bcast eth0
node db-ha1 db-ha2
=======================
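The two files above can be written with a short shell sketch like the one below. The helper name write_ha_config and its directory argument are hypothetical, added only so the sketch can be tried outside /etc/ha.d; the secret is the placeholder from this report. The sketch also assumes heartbeat's usual requirement that authkeys be readable only by its owner, so the file is created under a restrictive umask and then set to mode 600.

```shell
#!/bin/sh
# write_ha_config DIR: write the minimal authkeys and ha.cf from this report
# into DIR (normally /etc/ha.d). Taking DIR as a parameter is an assumption
# made so the sketch can be tried in a scratch directory.
write_ha_config() {
    dir=${1:-/etc/ha.d}
    mkdir -p "$dir" || return 1

    # authkeys: one sha1 key with the placeholder secret from the report.
    # Created under umask 077 and forced to mode 600, since heartbeat
    # expects this file to be readable only by its owner.
    (umask 077 && cat > "$dir/authkeys" <<'EOF'
auth 1
1 sha1 secretpass
EOF
    ) || return 1
    chmod 600 "$dir/authkeys" || return 1

    # ha.cf: CRM enabled, broadcast heartbeats on eth0, two nodes.
    cat > "$dir/ha.cf" <<'EOF'
crm on
bcast eth0
node db-ha1 db-ha2
EOF
}
```

Usage would be "write_ha_config /etc/ha.d" as root; replace secretpass with a real secret before using it on a cluster.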

Version-Release number of selected component (if applicable):

# rpm -q heartbeat pacemaker
heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc14.x86_64
pacemaker-1.1.4-4.fc14.x86_64

How reproducible:

Always.

Steps to Reproduce:
1. Install Fedora 14 and apply all updates.
2. Set up heartbeat with a minimal configuration, no resources yet.
3. Start heartbeat
  
Actual results:

System reboots.

Expected results:

No reboot; the CRM should be working, and crm_mon should show the node(s) in the cluster.

Additional info:

Comment 1 Zoltan Boszormenyi 2010-12-28 18:24:58 UTC
The reboot problem is quite critical because:

1. installing heartbeat automatically adds it to the services started at boot
2. setting it up with a minimal configuration and trying it out reboots the system

Bang, instant reboot loop: you need to manually boot into single-user mode and
remove heartbeat from the auto-started services.

The same minimal configuration works on Debian Squeeze with the versions below:

# dpkg -l heartbeat pacemaker
...
ii  heartbeat  1:3.0.3-2          Subsystem for High-Availability Linux
ii  pacemaker  1.0.9.1+hg15626-1  HA cluster resource manager

Comment 2 Kevin Fenzi 2010-12-29 18:56:38 UTC
Curious. :( 

What's in your CIB? Can you attach the full messages from startup to reboot and a dump of your cib.xml, or perhaps ha_report output?

Also, try changing "crm yes" to "crm respawn" to keep it up long enough to debug what's happening.
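The suggested change is a one-line edit to ha.cf. A minimal sketch, assuming GNU sed (as shipped with Fedora) and a hypothetical helper name set_crm_mode, that keeps a backup and rewrites the crm directive:

```shell
#!/bin/sh
# set_crm_mode FILE MODE: replace the "crm" directive in an ha.cf file.
# "crm respawn" asks heartbeat to restart failed CRM daemons instead of
# rebooting the node, which keeps the system up long enough to read logs.
set_crm_mode() {
    file=$1
    mode=$2
    # Keep a copy of the original configuration next to it.
    cp -p "$file" "$file.bak" || return 1
    # GNU sed in-place edit: a line starting with "crm" followed by
    # whitespace becomes "crm MODE"; other lines are left untouched.
    sed -i "s/^crm[[:space:]].*/crm $mode/" "$file"
}
```

For example, run "set_crm_mode /etc/ha.d/ha.cf respawn" before starting heartbeat, and "set_crm_mode /etc/ha.d/ha.cf on" to switch back once the problem is understood.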

Thanks.

Comment 3 Zoltan Boszormenyi 2010-12-29 19:36:35 UTC
Created attachment 471090
cib.xml

This is the cib.xml; as stated in the report, it is empty.

Comment 4 Zoltan Boszormenyi 2010-12-29 19:37:06 UTC
Created attachment 471091
/var/log/messages

Comment 5 Zoltan Boszormenyi 2010-12-29 19:43:16 UTC
I attached cib.xml and /var/log/messages after forcing a logrotate.
The log covers these steps:
1. started heartbeat with "crm respawn"
2. stopped heartbeat
3. started heartbeat with "crm on", and it rebooted the system.

As I said, the heartbeat configuration is empty: no resources have been
added yet, only ha.cf was set up. The effective ha.cf is below; I started from
the distributed template. I added the last 3 lines myself; logfacility and
auto_failback are set by default.

=============================
# grep -v "^#" ha.cf
logfacility	local0
auto_failback on




crm on
bcast eth0
node db-ha1 db-ha2
=============================

Comment 6 Zoltan Boszormenyi 2010-12-29 20:21:38 UTC
The above was set up between two VMware guests on my machine, but since my host OS
is also Fedora 14, I also tried it natively, faking a two-node setup. Same effect:
the system rebooted. I will attach the logs from my host OS, too.

Comment 7 Zoltan Boszormenyi 2010-12-29 20:23:21 UTC
Created attachment 471103
/var/log/messages from host OS

Comment 8 Zoltan Boszormenyi 2010-12-29 20:26:54 UTC
As the host's /var/log/messages shows, this is also a fresh setup,
so there is no point in attaching cib.xml:

Dec 29 21:01:18 db-ha2 cib: [3043]: WARN: retrieveCib: Cluster configuration not found: /var/lib/heartbeat/crm/cib.xml
Dec 29 21:01:18 db-ha2 cib: [3043]: WARN: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
Dec 29 21:01:18 db-ha2 cib: [3043]: WARN: readCibXmlFile: Continuing with an empty configuration.

Comment 9 Zoltan Boszormenyi 2010-12-30 14:06:23 UTC
I recompiled pacemaker-1.1.4-4.fc14.src.rpm, paying close attention
to the ./configure options, and saw --without-heartbeat and --without-ais
among them. Looking at pacemaker.spec, I found that it has to be built with

rpmbuild --define '_with_heartbeat 1' -ba pacemaker.spec

to enable heartbeat support. With the recompiled pacemaker packages, heartbeat
now starts up successfully. The package build needs either this extra
--define option or the removal of some conditionals from pacemaker.spec.

Comment 10 Kevin Fenzi 2010-12-30 16:49:03 UTC
Moving over to the pacemaker component.

Comment 11 Andrew Beekhof 2011-01-11 07:38:57 UTC
Testing the following patch; it looks like I messed up the use of the bcond macros:


@@ -173,14 +173,14 @@ resource health.
 %build
 ./autogen.sh
 %{configure}                                           \
-       %{!?_with_heartbeat: --without-heartbeat}       \
-       %{!?_with_ais:       --without-ais}             \
-       %{!?_with_esmtp:     --without-esmtp}           \
-       %{!?_with_snmp:      --without-snmp}            \
-       %{?_with_cman:       --with-cman}               \
-       %{?_with_profiling:  --with-profiling}          \
-       %{?_with_gcov:       --with-gcov}               \
-       %{?_with_tracedata   --with-tracedata}          \
+       %{!?with_heartbeat: --without-heartbeat}        \
+       %{!?with_ais:       --without-ais}              \
+       %{!?with_esmtp:     --without-esmtp}            \
+       %{!?with_snmp:      --without-snmp}             \
+       %{?with_cman:       --with-cman}                \
+       %{?with_profiling:  --with-profiling}           \
+       %{?with_gcov:       --with-gcov}                \
+       %{?with_tracedata   --with-tracedata}           \
        --docdir=%{pcmk_docdir}                         \
        --localstatedir=%{_var}                         \
        --with-initdir=%{_initddir}                     \
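The bug comes down to which macro the conditionals test. In rpm's conditional-build scheme, "rpmbuild --with NAME" defines _with_NAME (and "--without NAME" defines _without_NAME), while %bcond_with/%bcond_without in the spec turn those into with_NAME, which is what the rest of the spec should test. Assuming heartbeat was declared on by default with "%bcond_without heartbeat" (the spec preamble is not quoted here), _with_heartbeat is never defined in a plain build, so "%{!?_with_heartbeat: --without-heartbeat}" always expanded to --without-heartbeat; that is also why the --define '_with_heartbeat 1' workaround in Comment 9 helped. A rough check for this pattern, using a hypothetical helper name, could look like this:

```shell
#!/bin/sh
# find_raw_with_conditionals SPEC: print (with line numbers) the lines of an
# RPM spec file whose conditionals test the command-line macros
# _with_*/_without_* directly, e.g. %{!?_with_heartbeat: ...}, instead of
# the with_* macros defined by %bcond_with/%bcond_without.
# Matches are candidates for review, not necessarily bugs: for an option
# declared with %bcond_with (default off) the two forms behave the same.
find_raw_with_conditionals() {
    grep -nE '%\{!?\?_with(out)?_[A-Za-z0-9_]+' "$1"
}
```

Run against the pre-patch spec, it should report at least the eight _with_* lines removed in the diff above; after the patch, the %build section should no longer contribute any matches.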

Comment 12 Fedora Update System 2011-01-11 09:36:15 UTC
pacemaker-1.1.4-5.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/pacemaker-1.1.4-5.fc14

Comment 13 Fedora Update System 2011-01-12 05:30:59 UTC
pacemaker-1.1.4-5.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update pacemaker'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/pacemaker-1.1.4-5.fc14

Comment 14 Fedora Update System 2011-01-13 23:34:18 UTC
pacemaker-1.1.4-5.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.