Bug 161955 - LOST QUORUM
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite (Classification: Red Hat)
Component: clumanager
Version: 3
Hardware: i386 Linux
Priority: medium   Severity: high
Assigned To: Lon Hohberger (Cluster QE)
Reported: 2005-06-28 14:41 EDT by Luis Alexandre Fontes
Modified: 2009-04-16 16:17 EDT
Last Closed: 2006-05-04 10:39:42 EDT
Doc Type: Bug Fix
Attachments: None
Description Luis Alexandre Fontes 2005-06-28 14:41:49 EDT
Description of problem:
Contents of /var/log/cluster:
...
Jun 28 15:20:31 linpro035ias clusvcmgrd[14887]: <debug> Exec of 
script /usr/lib/clumanager/services/service, action status, service 
servico_as10g
Jun 28 15:20:31 linpro035ias clusvcmgrd: [14888]: <debug> service debug: 
Checking service servico_as10g ...
Jun 28 15:20:31 linpro035ias clusvcmgrd: [14888]: <debug> service debug: 
servico_as10g is not an NFS service
Jun 28 15:20:31 linpro035ias clusvcmgrd: [14888]: <debug> service debug: 
Running user script '/etc/init.d/as10g status'
Jun 28 15:20:31 linpro035ias clusvcmgrd: [14888]: <debug> service debug: 
Checked service servico_as10g ...
Jun 28 15:20:31 linpro035ias clusvcmgrd[14887]: <debug> Exec of script for 
service servico_as10g returned 0
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Verified connect from 
member #0 (10.0.0.1)
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Processing message on 10
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Received 20 bytes from 
peer
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> LOCK_MASTER_QUERY
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> replying LOCK_ACK 0
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Verified connect from 
member #0 (10.0.0.1)
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Processing message on 10
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Received 188 bytes from 
peer
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> LOCK_LOCK | LOCK_TRYLOCK
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> lockd_trylock: member #0 
lock 1
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Replying ACK
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Verified connect from 
member #0 (10.0.0.1)
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Processing message on 10
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Received 20 bytes from 
peer
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> LOCK_MASTER_QUERY
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> replying LOCK_ACK 0
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Verified connect from 
member #0 (10.0.0.1)
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Processing message on 10
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> Received 188 bytes from 
peer
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> LOCK_UNLOCK
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> lockd_unlock: member #0 
lock 1
Jun 28 15:20:31 linpro035ias clulockd[10292]: <debug> ACK: lock unlocked
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> Member #1 (10.0.0.2) DOWN
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> MB: Initiating vote on: 
0x00000001
Jun 28 15:20:36 linpro035ias clumembd[14993]: <debug> VF: Connecting to member 
#0
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> Verified connect from 
member #0 (10.0.0.1)
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> MB: New connect: fd10
Jun 28 15:20:36 linpro035ias clumembd[14993]: <debug> VF: Push 0.14993 #3
Jun 28 15:20:36 linpro035ias clumembd[14993]: <debug> VF: Sending to member #0
Jun 28 15:20:36 linpro035ias clumembd[14993]: <debug> VF: Checking for 
consensus...
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> MB: Received VF_MESSAGE, 
fd10
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> VF_JOIN_VIEW from member 
#0! Key: 0x27456381 #3
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> VF: Voting YES
Jun 28 15:20:36 linpro035ias clumembd[14993]: <debug> VF: Member #0 voted YES
Jun 28 15:20:36 linpro035ias clumembd[14993]: <debug> VF: Broadcasting FORMED
Jun 28 15:20:36 linpro035ias clumembd[14993]: <debug> VF: Converge Time: 
0.000000
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> MB: Received VF_MESSAGE, 
fd10
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> VF: Received 
VF_VIEW_FORMED, fd10
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> VF: Commit Key 0x27456381 
#3 from member #0
Jun 28 15:20:36 linpro035ias clumembd[10286]: <info> Membership View 
#3:0x00000001
Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> VF: pid 14993 exited, 
status 0
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> Q: Received 
EV_MEMB_UPDATE, fd5
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> Need 1 more members for 
quorum!
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> process_memb_update: 
Starting VF
Jun 28 15:20:36 linpro035ias cluquorumd[14994]: <debug> VF: Connecting to 
member #0
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF: Key 0x12345678 
Still running
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> Verified connect from 
member #0 (10.0.0.1)
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF: Key 0x12345678 
Still running
Jun 28 15:20:36 linpro035ias cluquorumd[14994]: <debug> VF: Push 0.14994 #2
Jun 28 15:20:36 linpro035ias cluquorumd[14994]: <debug> VF: Sending to member #0
Jun 28 15:20:36 linpro035ias cluquorumd[14994]: <debug> VF: Checking for 
consensus...
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> Q: Received VF_MESSAGE, 
fd9
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF_JOIN_VIEW from 
member #0! Key: 0x12345678 #2
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF: Voting YES
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF: Key 0x12345678 
Still running
Jun 28 15:20:36 linpro035ias cluquorumd[14994]: <debug> VF: Member #0 voted YES
Jun 28 15:20:36 linpro035ias cluquorumd[14994]: <debug> VF: Broadcasting FORMED
Jun 28 15:20:36 linpro035ias cluquorumd[14994]: <debug> VF: Converge Time: 
0.010000
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> Q: Received VF_MESSAGE, 
fd9
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF: Received 
VF_VIEW_FORMED, fd9
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF: Commit Key 
0x12345678 #2 from member #0
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <emerg> Quorum lost!
Jun 28 15:20:36 linpro035ias clusvcmgrd[10298]: <debug> Processing quorum event
Jun 28 15:20:36 linpro035ias clulockd[10292]: <crit> Quorum Event: LOST QUORUM
Jun 28 15:20:36 linpro035ias cluquorumd[10284]: <debug> VF: pid 14994 exited, 
status 0
Jun 28 15:20:36 linpro035ias clusvcmgrd[10298]: <crit> Halting services due to 
loss of quorum
Jun 28 15:20:36 linpro035ias clusvcmgrd[10298]: <warning> Forcing stop of 
service servico_as10g
Jun 28 15:20:37 linpro035ias clusvcmgrd[10298]: <debug> Exec of 
script /usr/lib/clumanager/services/service, action stop, service servico_as10g
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <notice> service notice: 
Stopping service servico_as10g ...
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <notice> service notice: 
Running user script '/etc/init.d/as10g stop'
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <info> service info: Stopping 
IP address 172.27.2.100
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <debug> service debug: 
ifconfig eth0:0 down
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <debug> service debug: 
Removing host based route for 172.27.2.100
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <debug> service debug: 
Removing ARP entry for 172.27.2.100
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <debug> service debug: 
servico_as10g is not an NFS service
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <info> service info: 
unmounting /dev/sdc3 (/u02)
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <debug> service debug: 
Invalidating buffers for /dev/sdc3
Jun 28 15:20:37 linpro035ias clusvcmgrd: [14995]: <notice> service notice: 
Stopped service servico_as10g ...
Jun 28 15:20:37 linpro035ias clusvcmgrd[10298]: <debug> Exec of script for 
service servico_as10g returned 0
Jun 28 15:20:37 linpro035ias clusvcmgrd[10298]: <debug> Sending message to 
quorumd that we are exiting
Jun 28 15:20:37 linpro035ias cluquorumd[10284]: <debug> Verified connect from 
member #0 (10.0.0.1)
Jun 28 15:20:37 linpro035ias clusvcmgrd[10298]: <info> Exiting
Jun 28 15:20:37 linpro035ias cluquorumd[10284]: <debug> Truncating 0 to 0...
Jun 28 15:20:37 linpro035ias cluquorumd[10284]: <debug> Q: Received 
QUORUM_EXIT, fd10
Jun 28 15:20:37 linpro035ias cluquorumd[10284]: <warning> Service Manager has 
exited uncleanly - Not shutting down
Jun 28 15:20:54 linpro035ias cluquorumd[10284]: <notice> IPv4 TB @ 172.27.2.1 
Online
Jun 28 15:21:38 linpro035ias clumanager: [15221]: <notice> Shutting down Red 
Hat Cluster Manager...
Jun 28 15:21:38 linpro035ias clulockd[10292]: <info> Exiting
Jun 28 15:21:38 linpro035ias cluquorumd[10284]: <info> Exiting
Jun 28 15:21:38 linpro035ias clumembd[10286]: <debug> Truncating 0 to 0...
Jun 28 15:21:38 linpro035ias clumembd[10286]: <debug> 
clumembd_sw_watchdog_stop: successfully stopped watchdog.
Jun 28 15:21:38 linpro035ias clumembd[10286]: <info> Exiting
Jun 28 15:21:39 linpro035ias clumanager: [15221]: <notice> Red Hat Cluster 
Manager is stopped.

------------------------------------------------------------
Version-Release number of selected component (if applicable):
# rpm -qa | grep clu
clumanager-1.2.26.1-1
redhat-config-cluster-1.0.7-1

------------------------------------------------------------
How reproducible:
?

------------------------------------------------------------
Steps to Reproduce:
1. start clumanager on node 1
2. start clumanager on node 2
3. after some minutes, quorum is lost
  
------------------------------------------------------------
Actual results:

------------------------------------------------------------
Expected results:

------------------------------------------------------------
Additional info:
I'm working from a remote desktop, and communication with node 1 (which was 
running the service) is lost.
So I have to go into the cold server room and restart the network service.

This behavior (lost quorum) started when I upgraded clumanager to version 
1.2.26.1-1.
My kernel is 2.4.21-20.ELsmp - Red Hat Enterprise Linux AS release 3 (Taroon 
Update 3).
Comment 1 Lon Hohberger 2005-06-28 16:51:02 EDT
What previous kernel + clumanager version were you running?
Comment 2 Lon Hohberger 2005-06-28 16:52:46 EDT
Also, does it happen when using a disk tiebreaker or not?
Comment 3 Lon Hohberger 2005-06-28 16:54:41 EDT
For future information, you can set clulockd's log level back to INFO or NOTICE.
 It's certainly not the cause of the problem.
Comment 4 Luis Alexandre Fontes 2005-07-01 14:42:27 EDT
Q: What previous kernel + clumanager version were you running?
A: The previous clumanager version was clumanager-1.2.22-2, on the same kernel.
Comment 5 Luis Alexandre Fontes 2005-07-01 14:52:54 EDT
Additional information
----------------------
When quorum is lost, I also lose network communication with node A from my 
desktop. But node B is still running fine.
The services on node A were stopped, while clumanager itself is still running.
Daemons still running on node A after the failure: cluquorumd, clumembd, clulockd.

If I stop clumanager on node B, after a couple of minutes, network 
communication with node A is restored.

# clustat (on node A)
Cluster Status - Red Hat Clumanager Cluster                            15:46:54
Incarnation #2
(This member is not part of the cluster quorum)

  Member             Status
  ------------------ ----------
  10.0.0.1           Active     <-- You are here
  10.0.0.2           Inactive

No Quorum - Service States Unknown
Comment 6 Luis Alexandre Fontes 2005-07-01 15:30:08 EDT
Q: Also, does it happen when using a disk tiebreaker or not?
A: Yes, unfortunately.
Comment 7 Lon Hohberger 2005-07-01 15:43:20 EDT
This doesn't look like a bug for a variety of reasons (besides the fact that it
works for me).  Please file a support ticket with the information we've gathered
so far.  (Bugzilla isn't a support tool, and in order to track this down, we'll
need some rather private information from your cluster, including full sysreports.)

http://www.redhat.com/apps/support/

Since this works for me, I'll close it as such for now.  However, please
reference this Bugzilla in your support ticket.  If it turns out to be a bug,
we'll reopen this bugzilla (and I'll hopefully have direct access to all the
info I need to fix it).  

----------------------------------

Here are some last hints you can try before filing a support ticket, if you want:

Jun 28 15:20:36 linpro035ias clumembd[10286]: <debug> Member #1 (10.0.0.2) DOWN

* In a 2-node cluster, a node going down before the tiebreaker is online
constitutes a loss of quorum.

* Nothing between 1.2.22 and 1.2.26.1 should have caused a discrepancy in the
way membership behaves; nothing changed with respect to the membership algorithm
between the two releases.

* A Member-DOWN event can be caused by many factors: network connectivity,
routing, system load, etc.  You can try setting clumembd and cluquorumd to use
real-time priority; see "man cludb".  (If the cluster node is under high load
while starting/stopping services, this may help quite a bit!)

* You can try using broadcast heartbeat instead of multicast if you aren't
already doing so.  Many private networks are built with inexpensive
hubs/switches that don't properly handle multicast traffic.  If you switch to
broadcast, you may also want to set "clumembd%broadcast_primary_only 1" (see
man cludb) to avoid spamming the public network.
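The quorum arithmetic behind the first bullet can be sketched as follows. This is an illustrative model only (the function name and structure are my own, not clumanager's implementation): a cluster is quorate when more than half of the expected votes are present, and in a two-node cluster a lone surviving node needs the tiebreaker's extra vote.

```python
# Illustrative sketch of two-node quorum arithmetic; mirrors the rule
# described above, not clumanager's actual code.

def has_quorum(total_members: int, members_up: int,
               tiebreaker_online: bool = False) -> bool:
    """Quorum requires a strict majority of expected votes.

    Each member contributes one vote; an online tiebreaker (IP or disk)
    contributes one extra vote."""
    votes = members_up + (1 if tiebreaker_online else 0)
    needed = total_members // 2 + 1   # strict majority of members
    return votes >= needed

# Two nodes, one goes down before the tiebreaker is online:
assert has_quorum(2, 2) is True
assert has_quorum(2, 1) is False                          # this bug's log
assert has_quorum(2, 1, tiebreaker_online=True) is True   # TB saves quorum
```

This matches the log above: member #1 goes DOWN at 15:20:36 while the IPv4 tiebreaker only comes online at 15:20:54, so for those 18 seconds the surviving node holds 1 of 2 required votes.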


Jun 28 15:20:54 linpro035ias cluquorumd[10284]: <notice> IPv4 TB @ 172.27.2.1 
Online

* You must have the IPv4 tiebreaker (TB) on the SAME subnet as the cluster
member names.  In this case, it needs to go on the 10.x.x.x network.  Failure
to do so can result in a split brain in certain failure cases.

* If this configuration (e.g. a third, non-cluster IP on the private 10.x.x.x
network) is not available, you will either need to switch to a disk-tiebreaker,
or change your network configuration so that the members are on the 172.27.2.x
network.
