Bug 143165 - ccsd becomes unresponsive after updates (SIGHUP)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	ccs
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jonathan Earl Brassow
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-12-16 23:15 UTC by Adam "mantis" Manthei
Modified:	2017-04-10 20:42 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-01-13 18:57:56 UTC
Embargoed:

Description Adam "mantis" Manthei 2004-12-16 23:15:15 UTC

Description of problem:
ccsd becomes unresponsive after updates (SIGHUP)

I needed to update my cluster.conf in order to make things work for
gulm.  I started with config_version=7 and made a series of
modifications to the cluster.conf file so that I could get lock_gulmd
up and running.  I ended up SIGHUP'ing ccsd on one node in my 8-node
cluster.  I assumed that since it didn't produce an error on the
local node, the update had propagated to all the remaining nodes
(sadly, I never did verify this).  After HUPing the server, I started
getting the following errors while trying to connect to ccsd:

   ccsd[8573]: Error while processing connect: 
               Resource temporarily unavailable

On the other nodes in the system, I started getting errors that said:

   ccsd[8586]: Error while processing connect: 
               Operation not permitted 

I restarted ccsd on the node that I tried to update and found that my
config_version had been bumped back to the version that I started with.

Version-Release number of selected component (if applicable):
ccs-0.9-0

How reproducible:
ran across it only once so far that I know of

Steps to Reproduce:
1. not tried yet
2.
3.
  
Actual results:
updates apparently don't update the system and ccs becomes unresponsive


Expected results:
o updates should be propagated.
o decent feedback should be provided to inform users of the state 
  of the updates
o error reporting should be easily available 
o ccsd should not hang


Additional info:
At the time I was running both cman/dlm and trying to start
lock_gulmd.  I guess it's possible that this is a magma issue,
but I highly doubt it.  If nothing else, there needs to be some sort
of mechanism in place to help facilitate the updates so users can't hurt
themselves like I just did (I would like to see a C program that is
capable of connecting to ccsd and querying/pushing the updates so that
we can avoid the asynchronous characteristics of signals).
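
Short of the C tool suggested above, a small pre-flight check run before
sending the signal could catch a forgotten version bump.  The sketch below
is an illustration only: conf_version and safe_to_hup are hypothetical
names, not part of ccs.  It assumes that ccsd only accepts a cluster.conf
whose config_version is strictly higher than the one it already holds, and
that the attribute appears in the file as config_version="N".

```shell
#!/bin/sh
# Hypothetical helpers, not part of ccs.

# Print the first config_version="N" value found in a cluster.conf file.
conf_version() {
    sed -n 's/.*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# $1 = candidate cluster.conf, $2 = version currently in use.
# Succeeds only if the candidate's version is strictly greater.
safe_to_hup() {
    new=$(conf_version "$1")
    running=$2
    if [ -z "$new" ]; then
        echo "no config_version found in $1" >&2
        return 2
    fi
    if [ "$new" -le "$running" ]; then
        echo "version $new is <= running version $running; bump config_version first" >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the signal, for example
"safe_to_hup /etc/cluster/cluster.conf 7 && kill -HUP <pid of ccsd>",
where 7 is the version currently in use.  This only guards the local
step; it says nothing about whether the update reached the other nodes.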

possible duplicates: 133254 and 137021

Comment 1 Adam "mantis" Manthei 2004-12-16 23:15:42 UTC

possible duplicates: bug #133254 and bug #137021

Comment 2 Adam "mantis" Manthei 2004-12-16 23:36:03 UTC

Perhaps this is a clue, or another bug.

I SIGHUP'ed ccsd without bumping the config_version just now on node-01.
I then logged into node-02 and looked at the logs for errors.  I saw
none and thought that I was golden.  Upon doing a "ccs_test connect" I
got the error "operation not permitted".  After that, I looked in the
logs and saw that the update had failed.  I then did another "ccs_test
connect" to see what would happen and it succeeded.

#
# at this point I have already HUP'ed the server on node trin-01
# where the config file had not been updated
#

#
# scribble in the logs
#
[root@trin-06 ~]# logger test1

#
# See if we can connect
#
[root@trin-06 ~]# ccs_test connect
ccs_connect failed: Operation not permitted

#
# An error was produced (this will appear in syslog between the 
# test1 and test2 logger marks)
#
[root@trin-06 ~]# logger test2

#
# connect again
#
[root@trin-06 ~]# ccs_test connect
Connect successful.
 Connection descriptor = 0

#
# another logger mark
#
[root@trin-06 ~]# logger test3

#
# the resulting syslog
#
[root@trin-06 ~]# tail /var/log/messages
Dec 16 17:31:08 trin-06 root: test1
Dec 16 17:31:15 trin-06 ccsd[8176]: cluster.conf on-disk version is <= to in-memory version.
Dec 16 17:31:15 trin-06 ccsd[8176]:  On-disk version   : 13
Dec 16 17:31:15 trin-06 ccsd[8176]:  In-memory version : 13
Dec 16 17:31:15 trin-06 ccsd[8176]: Failed to update config file, required by cluster.
Dec 16 17:31:15 trin-06 ccsd[8176]: Error while processing connect: Operation not permitted
Dec 16 17:31:19 trin-06 root: test2
Dec 16 17:31:23 trin-06 root: test3
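
The logger marks used above make it possible to pull out only the
messages produced by a single command.  A minimal sketch, assuming the
marks are unique strings in the log; between_marks is a hypothetical
helper name, not an existing tool:

```shell
#!/bin/sh
# Print the log lines that fall strictly between two logger marks.
# $1 = log file, $2 = start mark, $3 = end mark
between_marks() {
    awk -v from="$2" -v to="$3" '
        index($0, to)   { inside = 0 }   # stop before the end mark
        inside          { print }        # emit lines inside the window
        index($0, from) { inside = 1 }   # start after the start mark
    ' "$1"
}
```

For the session above, "between_marks /var/log/messages test1 test2"
would print the ccsd messages logged by the first, failing connect.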

Comment 3 Adam "mantis" Manthei 2004-12-17 00:15:50 UTC

Things are hopelessly busted on my node at the moment :(

I wanted to verify that ccsd was up to date on the nodes, so I ran
an md5sum on /etc/cluster/cluster.conf for all 8 nodes in the cluster,
and they all matched.  I also verified that they were all at version
13.  Then, for good measure, I stopped ccsd on all 8 nodes.  Then I
started it again on all 8 nodes.  md5sums matched as before.
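
The checksum comparison can be scripted.  The sketch below assumes the
cluster.conf copies from each node have already been collected on one
machine (how they get there is left open); same_checksum is a
hypothetical helper name:

```shell
#!/bin/sh
# Succeed if every file given has the same MD5 checksum.
same_checksum() {
    [ "$#" -ge 1 ] || return 2          # refuse to read stdin
    count=$(md5sum "$@" | awk '{ print $1 }' | sort -u | wc -l)
    [ "$count" -eq 1 ]                  # exactly one distinct checksum
}
```

Matching checksums only show that the files on disk agree; as the rest
of this comment shows, they do not prove that every daemon has loaded
that version.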

Then the fit hit the shan.  

This whole time I had cman/fenced/dlm/clvmd/gfs running.  I tried to
stop the gfs service on the nodes.  Two nodes locked tight.  I could
only ping them; any other method of using the machines was hopeless.  I
rebooted one of them (it was running a modified fence_manual) and on
startup tried to start ccsd and cman.  ccsd started just fine and I had
an identical config on that node as I did on the other 6 nodes (one was
still locked).  But when cman started, I got a bunch of errors on the
failed node (trin-09):

   CMAN: Cluster membership rejected

On the other 6 responsive nodes I kept getting the error:
   CMAN: Join request from trin-09 rejected, config version 
         local 7 remote 13

It appears that CMAN didn't update its view of the cluster.conf file
when ccs was updated.

(BTW, I started seeing problems while I was updating from 
config_version 7 to 8)
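
To spot this kind of mismatch quickly, the rejection messages can be
reduced to node name plus the two versions.  A rough sketch, assuming
the message appears on a single line in the log (it is wrapped above);
cman_rejects is a hypothetical helper name:

```shell
#!/bin/sh
# Read log text on stdin; for each CMAN join rejection caused by a
# config version mismatch, print "node local_version remote_version".
cman_rejects() {
    sed -n 's/.*CMAN: Join request from \([^ ]*\) rejected, config version local \([0-9][0-9]*\) remote \([0-9][0-9]*\).*/\1 \2 \3/p'
}
```

Fed the message quoted above, it prints the node name followed by the
local and remote versions, which makes a stale in-memory version on the
responsive nodes easy to see.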

Comment 4 Jonathan Earl Brassow 2004-12-17 17:53:19 UTC

- fix bug 143165, 134604, and 133254 - update related issues
  These all seem to be related to the same issue, that is, remote
  nodes were erroneously processing an update as though they were
  the originator - taking on some tasks that didn't belong to them.

  This was causing connect failures, version rollbacks, etc.

Comment 5 openshift-github-bot 2017-04-10 18:10:36 UTC

Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/06f09a60b6ff525a2efd599c823085e014c8510b
F5-router, "Idling applications" feature does not work

Made a NOTE that unidling is HAProxy only

bug 143165
https://bugzilla.redhat.com/show_bug.cgi?id=1431658

Comment 6 Jan Pokorný [poki] 2017-04-10 20:42:10 UTC

re [comment 5]: see [bug 1431658 comment 3]
