Bug 143165 - ccsd becomes unresponsive after updates (SIGHUP)
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: ccs
Hardware: All Linux
Priority: medium, Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2004-12-16 18:15 EST by Adam "mantis" Manthei
Modified: 2009-04-16 16:04 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-01-13 13:57:56 EST
Description Adam "mantis" Manthei 2004-12-16 18:15:15 EST
Description of problem:
ccsd becomes unresponsive after updates (SIGHUP)

I needed to update my cluster.conf in order to make things work for
gulm.  I started with config_version=7 and made a series of
modifications to the cluster.conf file so that I could get lock_gulmd
up and running.  I ended up SIGHUP'ing ccsd on one node in my 8-node
cluster.  I assumed that, since it didn't produce an error on the
local node, the update had propagated to all the remaining nodes
(sadly, I never did verify this).  After HUPping the server, I started
getting the following errors while trying to connect to ccsd:

   ccsd[8573]: Error while processing connect: 
               Resource temporarily unavailable

On the other nodes in the system, I started getting errors that said:

   ccsd[8586]: Error while processing connect: 
               Operation not permitted 

I restarted ccsd on the node that I tried to update and found that my
config_version had been bumped back to the version that I started with.
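
A rollback like this is easier to spot if the version is read from the file before the SIGHUP and again after the restart. The following is only a minimal sketch: the helper name get_config_version is hypothetical, and it assumes config_version is written with double quotes on the same line as the opening <cluster> tag.

```shell
#!/bin/sh
# Sketch only: print the config_version attribute of a cluster.conf file.
# Assumption: the attribute sits on the same line as the opening <cluster
# tag and uses double quotes, e.g. config_version="13".
# get_config_version is a hypothetical helper, not part of ccs.
get_config_version() {
    sed -n 's/.*<cluster[^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}
```

Running get_config_version /etc/cluster/cluster.conf before the update and again after restarting ccsd, then comparing the two numbers, would make a silent rollback visible.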

Version-Release number of selected component (if applicable):

How reproducible:
ran across it only once so far that I know of

Steps to Reproduce:
1. not tried yet
Actual results:
updates apparently don't update the system and ccs becomes unresponsive

Expected results:
o updates should be propagated.
o decent feedback should be provided to inform users of the state 
  of the updates
o error reporting should be easily available 
o ccsd should not hang

Additional info:
At the time I was running both cman/dlm and trying to start
lock_gulmd.  I guess it's possible that this may be a magma issue,
but I highly doubt it.  If nothing else, there needs to be some sort
of mechanism in place to help facilitate the updates so users can't hurt
themselves like I just did.  (I would like to see a C program that is
capable of connecting to ccsd and querying/pushing the updates so that
we can avoid the asynchronous characteristics of signals.)

possible duplicates: 133254 and 137021
Comment 1 Adam "mantis" Manthei 2004-12-16 18:15:42 EST
possible duplicates: bug #133254 and bug #137021
Comment 2 Adam "mantis" Manthei 2004-12-16 18:36:03 EST
Perhaps this is a clue, or another bug.

I SIGHUP'ed ccsd w/out bumping the config_version just now on node-01.
I then logged into node-02 and looked at the logs for errors.  I saw
none and thought that I was golden.  Upon doing a "ccs_test connect" I
got the error "operation not permitted".  After which, I looked in the
logs and saw that the update failed.  I then did another "ccs_test
connect" to see what would happen and it succeeded.

# at this point I have already HUP'ed the server on node trin-01
# where the config file had not been updated

# scribble in the logs
[root@trin-06 ~]# logger test1

# See if we can connect
[root@trin-06 ~]# ccs_test connect
ccs_connect failed: Operation not permitted

# An error was produced (this will appear in syslog between the 
# test1 and test2 logger marks)
[root@trin-06 ~]# logger test2

# connect again
[root@trin-06 ~]# ccs_test connect
Connect successful.
 Connection descriptor = 0

# another logger mark
[root@trin-06 ~]# logger test3

# the resulting syslog
[root@trin-06 ~]# tail /var/log/messages
Dec 16 17:31:08 trin-06 root: test1
Dec 16 17:31:15 trin-06 ccsd[8176]: cluster.conf on-disk version is <=
to in-memory version. 
Dec 16 17:31:15 trin-06 ccsd[8176]:  On-disk version   : 13 
Dec 16 17:31:15 trin-06 ccsd[8176]:  In-memory version : 13 
Dec 16 17:31:15 trin-06 ccsd[8176]: Failed to update config file,
required by cluster. 
Dec 16 17:31:15 trin-06 ccsd[8176]: Error while processing connect:
Operation not permitted 
Dec 16 17:31:19 trin-06 root: test2
Dec 16 17:31:23 trin-06 root: test3
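
The log shows ccsd refusing an update because the on-disk version (13) was not greater than the in-memory version (13). A pre-flight check along those lines might look like the sketch below. The function name update_would_be_accepted is hypothetical, and both version numbers are assumed to be supplied by the caller; the sketch does not query ccsd for its in-memory version.

```shell
#!/bin/sh
# Sketch only: apply the rule seen in the log before sending SIGHUP.
#   $1 = config_version currently in the on-disk cluster.conf
#   $2 = config_version the running daemon is believed to hold
# Both values come from the caller (assumption); nothing here talks to ccsd.
# update_would_be_accepted is a hypothetical helper name.
update_would_be_accepted() {
    on_disk=$1
    in_memory=$2
    if [ "$on_disk" -le "$in_memory" ]; then
        echo "on-disk version $on_disk is <= in-memory version $in_memory" >&2
        return 1
    fi
    return 0
}
```

With the values from the log, update_would_be_accepted 13 13 returns non-zero, which would be the cue to bump config_version before signalling the daemon.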

Comment 3 Adam "mantis" Manthei 2004-12-16 19:15:50 EST
Things are hopelessly busted on my node at the moment :(

I wanted to verify that the ccsd was up to date on the nodes, so I ran
an md5sum on /etc/cluster/cluster.conf for all 8 nodes in the cluster,
and they all matched.  I also verified that they were all at version
13.  Then, for good measure I stopped ccsd on all 8 nodes.  Then I
started it again on all 8 nodes.  md5sums matched as before.
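
The checksum comparison can be scripted. This is a rough sketch, assuming a copy of each node's cluster.conf has already been fetched to the local machine (the fetching step is not shown); configs_match is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: succeed when every file passed in has the same MD5 checksum.
# Assumption: each argument is a local copy of one node's cluster.conf.
# configs_match is a hypothetical helper name.
configs_match() {
    # md5sum prints "<hash>  <file>"; keep the hash column and count
    # how many distinct hashes there are.
    distinct=$(md5sum "$@" | awk '{print $1}' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}
```

A matching checksum only says the files on disk agree; it says nothing about the version each daemon holds in memory.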

Then the fit hit the shan.  

This whole time I had cman/fenced/dlm/clvmd/gfs running.  I tried to
stop the gfs service on the nodes.  Two nodes locked tight.  I could
only ping them; any other method of using the machines was hopeless.  I
rebooted the node (was running a modified fence_manual) and on startup
tried to start ccsd and cman.  ccsd started just fine and I had an
identical config on that node as I did on the other 6 nodes (one was
still locked).  But when cman started, I got a bunch of errors on the  
failed node (trin-09):

   CMAN: Cluster membership rejected

On the other 6 responsive nodes I kept getting the error:
   CMAN: Join request from trin-09 rejected, config version 
         local 7 remote 13

It appears that CMAN didn't update its view of the cluster.conf file
when ccs was updated.

(BTW, I started seeing problems while I was updating from 
config_version 7 to 8)
Comment 4 Jonathan Earl Brassow 2004-12-17 12:53:19 EST
- fix bug 143165, 134604, and 133254 - update related issues
  These all seem to be related to the same issue, that is, remote
  nodes were erroneously processing an update as though they were
  the originator - taking on some tasks that didn't belong to them.

  This was causing connect failures, version rollbacks, etc.
