Bug 143165 - ccsd becomes unresponsive after updates (SIGHUP)
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: ccs
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2004-12-16 18:15 EST by Adam "mantis" Manthei
Modified: 2017-04-10 16:42 EDT
CC: 2 users

Doc Type: Bug Fix
Last Closed: 2009-01-13 13:57:56 EST

Description Adam "mantis" Manthei 2004-12-16 18:15:15 EST
Description of problem:
ccsd becomes unresponsive after updates (SIGHUP)

I needed to update my cluster.conf in order to make things work for
gulm.  I started with config_version=7 and made a series of
modifications to the cluster.conf file so that I could get lock_gulmd
up and running.  I ended up SIGHUP'ing ccsd on one node of my 8-node
cluster.  Since that didn't produce an error on the local node, I
assumed the update had propagated to all the remaining nodes (sadly,
I never did verify this).  After HUP'ing the server, I started
getting the following errors while trying to connect to ccsd:

   ccsd[8573]: Error while processing connect: 
               Resource temporarily unavailable

On the other nodes in the system, I started getting errors that said:

   ccsd[8586]: Error while processing connect: 
               Operation not permitted 

I restarted ccsd on the node that I tried to update and found that my
config_version had been bumped back to the version that I started with.
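
For reference, the cycle I was going through for each change was roughly the
following (a sketch from memory; HUP'ing ccsd is how I was triggering the
update and propagation of the new cluster.conf):

   # edit /etc/cluster/cluster.conf and bump config_version (e.g. 7 -> 8)
   vi /etc/cluster/cluster.conf
   # signal the local ccsd to reread and propagate the file
   kill -HUP $(pidof ccsd)
   # sanity check that ccsd still answers
   ccs_test connect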

Version-Release number of selected component (if applicable):

How reproducible:
Ran across it only once so far, that I know of.

Steps to Reproduce:
1. not tried yet
Actual results:
updates apparently are not applied across the cluster, and ccsd becomes unresponsive

Expected results:
o updates should be propagated.
o decent feedback should be provided to inform users of the state 
  of the updates
o error reporting should be easily available 
o ccsd should not hang

Additional info:
At the time I was running both cman/dlm and trying to start
lock_gulmd.  I guess it's possible that this may be a magma issue,
but I highly doubt it.  If nothing else, there needs to be some sort
of mechanism in place to help facilitate the updates so users can't
hurt themselves like I just did.  (I would like to see a C program
that is capable of connecting to ccsd and querying/pushing the
updates so that we can avoid the asynchronous characteristics of
signals.)
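
For the record, the only feedback available from the signal-based path today
is whatever lands in syslog.  Roughly (a sketch; the logger markers are just
how I bracket the HUP to find the relevant messages):

   logger "before HUP"
   kill -HUP $(pidof ccsd)   # exit status only says the signal was delivered
   logger "after HUP"
   # any errors from the update show up in syslog between the two markers
   tail -n 50 /var/log/messages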

possible duplicates: 133254 and 137021
Comment 1 Adam "mantis" Manthei 2004-12-16 18:15:42 EST
possible duplicates: bug #133254 and bug #137021
Comment 2 Adam "mantis" Manthei 2004-12-16 18:36:03 EST
Perhaps this is a clue, or another bug.

I SIGHUP'ed ccsd without bumping the config_version just now on node-01.
I then logged into node-02 and looked at the logs for errors.  I saw
none and thought that I was golden.  Upon doing a "ccs_test connect" I
got the error "operation not permitted".  After that, I looked in the
logs and saw that the update had failed.  I then did another "ccs_test
connect" to see what would happen, and it succeeded.

# at this point I have already HUP'ed the server on node trin-01
# where the config file had not been updated

# scribble in the logs
[root@trin-06 ~]# logger test1

# See if we can connect
[root@trin-06 ~]# ccs_test connect
ccs_connect failed: Operation not permitted

# An error was produced (this will appear in syslog between the 
# test1 and test2 logger marks)
[root@trin-06 ~]# logger test2

# connect again
[root@trin-06 ~]# ccs_test connect
Connect successful.
 Connection descriptor = 0

# another logger mark
[root@trin-06 ~]# logger test3

# the resulting syslog
[root@trin-06 ~]# tail /var/log/messages
Dec 16 17:31:08 trin-06 root: test1
Dec 16 17:31:15 trin-06 ccsd[8176]: cluster.conf on-disk version is <=
to in-memory version. 
Dec 16 17:31:15 trin-06 ccsd[8176]:  On-disk version   : 13 
Dec 16 17:31:15 trin-06 ccsd[8176]:  In-memory version : 13 
Dec 16 17:31:15 trin-06 ccsd[8176]: Failed to update config file,
required by cluster. 
Dec 16 17:31:15 trin-06 ccsd[8176]: Error while processing connect:
Operation not permitted 
Dec 16 17:31:19 trin-06 root: test2
Dec 16 17:31:23 trin-06 root: test3

Comment 3 Adam "mantis" Manthei 2004-12-16 19:15:50 EST
Things are hopelessly busted on my node at the moment :(

I wanted to verify that ccsd was up to date on the nodes, so I ran
an md5sum on /etc/cluster/cluster.conf for all 8 nodes in the cluster,
and they all matched.  I also verified that they were all at version
13.  Then, for good measure, I stopped ccsd on all 8 nodes and then
started it again on all 8 nodes.  The md5sums matched as before.
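
(For the record, the check was roughly the following; $NODES stands in for
the 8 member hostnames:)

   for node in $NODES; do
       echo "== $node =="
       ssh "$node" 'md5sum /etc/cluster/cluster.conf; grep config_version /etc/cluster/cluster.conf'
   done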

Then the fit hit the shan.

This whole time I had cman/fenced/dlm/clvmd/gfs running.  I tried to
stop the gfs service on the nodes.  Two nodes locked up tight.  I could
only ping them; any other method of using the machines was hopeless.  I
rebooted the node (I was running a modified fence_manual) and on startup
tried to start ccsd and cman.  ccsd started just fine, and I had an
identical config on that node as on the other 6 nodes (one was still
locked up).  But when cman started, I got a bunch of errors on the
failed node (trin-09):

   CMAN: Cluster membership rejected

On the other 6 responsive nodes I kept getting the error:
   CMAN: Join request from trin-09 rejected, config version 
         local 7 remote 13

It appears that CMAN didn't update its view of the cluster.conf file
when ccs was updated.

(BTW, I started seeing problems while I was updating from 
config_version 7 to 8.)
Comment 4 Jonathan Earl Brassow 2004-12-17 12:53:19 EST
- fix bugs 143165, 134604, and 133254 - update-related issues
  These all seem to be related to the same issue, that is, remote
  nodes were erroneously processing an update as though they were
  the originator - taking on some tasks that didn't belong to them.

  This was causing connect failures, version rollbacks, etc.
Comment 5 openshift-github-bot 2017-04-10 14:10:36 EDT
Commit pushed to master at https://github.com/openshift/openshift-docs

F5-router, "Idling applications" feature does not work

Made a NOTE that unidling is HAProxy only

bug 143165
Comment 6 Jan Pokorný 2017-04-10 16:42:10 EDT
re [comment 5]: see [bug 1431658 comment 3]
