Description of problem:
ccsd becomes unresponsive after updates (SIGHUP).

I needed to update my cluster.conf in order to make things work for gulm. I started with config_version=7 and made a series of modifications to the cluster.conf file so that I could get lock_gulmd up and running. I ended up SIGHUP'ing one node in my 8-node cluster. I assumed that, since it didn't produce an error on the local node, the update propagated to all the remaining nodes (sadly, I never did verify this).

After HUP'ing the server, I started getting the following error while trying to connect to ccsd:

ccsd[8573]: Error while processing connect: Resource temporarily unavailable

On the other nodes in the system, I started getting errors that said:

ccsd[8586]: Error while processing connect: Operation not permitted

I restarted ccsd on the node that I tried to update and found that my config_version had been bumped back to the version that I started with.

Version-Release number of selected component (if applicable):
ccs-0.9-0

How reproducible:
Ran across it only once so far, that I know of.

Steps to Reproduce:
1. not tried yet
2.
3.

Actual results:
Updates apparently don't take effect across the cluster, and ccsd becomes unresponsive.

Expected results:
o updates should be propagated
o decent feedback should be provided to inform users of the state of the updates
o error reporting should be easily available
o ccsd should not hang

Additional info:
At the time I was running both cman/dlm and trying to start lock_gulmd. I guess it's possible that this may be a magma issue, but I highly doubt it. If nothing else, there needs to be some sort of mechanism in place to help facilitate the updates so users can't hurt themselves like I just did. (I would like to see a C program that is capable of connecting to ccsd and querying/pushing the updates so that we can avoid the asynchronous characteristics of signals.)

possible duplicates: 133254 and 137021
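For what it's worth, here is a minimal sketch of the kind of C client I have in mind, assuming the libccs interface from the ccs package (ccs_connect/ccs_get/ccs_disconnect in ccs.h, linked with -lccs). It only queries, it does not push updates, and the exact return conventions may differ from what I've assumed here:

#include <stdio.h>
#include <stdlib.h>
#include <ccs.h>

int main(void)
{
    int desc;
    char *version = NULL;

    desc = ccs_connect();                 /* negative value => could not connect */
    if (desc < 0) {
        fprintf(stderr, "ccs_connect failed: %d\n", desc);
        return 1;
    }

    /* XPath-style query for the version ccsd is actually serving */
    if (ccs_get(desc, "/cluster/@config_version", &version) == 0) {
        printf("in-memory config_version = %s\n", version);
        free(version);                    /* caller frees the returned string */
    } else {
        fprintf(stderr, "ccs_get failed\n");
    }

    ccs_disconnect(desc);
    return 0;
}

Something like this, built with gcc -lccs, would at least let an admin see what ccsd thinks the current version is without relying on signals.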
Perhaps this is a clue, or another bug. I SIGHUP'ed ccsd without bumping the config_version just now on node-01. I then logged into node-02 and looked at the logs for errors. I saw none and thought I was golden. Upon doing a "ccs_test connect" I got the error "Operation not permitted". After that, I looked in the logs and saw that the update had failed. I then did another "ccs_test connect" to see what would happen, and it succeeded.

#
# At this point I have already HUP'ed the server on node trin-01,
# where the config file had not been updated.
#

#
# Scribble in the logs
#
[root@trin-06 ~]# logger test1

#
# See if we can connect
#
[root@trin-06 ~]# ccs_test connect
ccs_connect failed: Operation not permitted

#
# An error was produced (this will appear in syslog between the
# test1 and test2 logger marks)
#
[root@trin-06 ~]# logger test2

#
# Connect again
#
[root@trin-06 ~]# ccs_test connect
Connect successful. Connection descriptor = 0

#
# Another logger mark
#
[root@trin-06 ~]# logger test3

#
# The resulting syslog
#
[root@trin-06 ~]# tail /var/log/messages
Dec 16 17:31:08 trin-06 root: test1
Dec 16 17:31:15 trin-06 ccsd[8176]: cluster.conf on-disk version is <= to in-memory version.
Dec 16 17:31:15 trin-06 ccsd[8176]: On-disk version : 13
Dec 16 17:31:15 trin-06 ccsd[8176]: In-memory version : 13
Dec 16 17:31:15 trin-06 ccsd[8176]: Failed to update config file, required by cluster.
Dec 16 17:31:15 trin-06 ccsd[8176]: Error while processing connect: Operation not permitted
Dec 16 17:31:19 trin-06 root: test2
Dec 16 17:31:23 trin-06 root: test3
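Given that the second connect succeeded, a retry wrapper around ccs_connect() would paper over the transient "Operation not permitted", though it would also hide the fact that the update itself failed. This is a sketch only; the helper name, attempt count, and delay below are made up:

#include <unistd.h>
#include <ccs.h>

/* Hypothetical helper, not part of libccs: retry ccs_connect() a few
 * times before giving up. */
static int ccs_connect_retry(int attempts, unsigned int delay_secs)
{
    int desc = -1;

    while (attempts-- > 0) {
        desc = ccs_connect();
        if (desc >= 0)
            return desc;            /* connected */
        sleep(delay_secs);          /* transient failure, try again */
    }
    return desc;                    /* last (negative) result */
}

int main(void)
{
    int desc = ccs_connect_retry(3, 2);   /* 3 tries, 2 seconds apart */

    if (desc < 0)
        return 1;                   /* still failing after the retries */
    ccs_disconnect(desc);
    return 0;
}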
Things are hopelessly busted on my node at the moment :( I wanted to verify that ccsd was up to date on the nodes, so I ran an md5sum on /etc/cluster/cluster.conf for all 8 nodes in the cluster, and they all matched. I also verified that they were all at version 13. Then, for good measure, I stopped ccsd on all 8 nodes and started it again on all 8 nodes. The md5sums matched as before.

Then the fit hit the shan. This whole time I had cman/fenced/dlm/clvmd/gfs running. I tried to stop the gfs service on the nodes. Two nodes locked up tight; I could only ping them, and any other method of using the machines was hopeless. I rebooted the node (it was running a modified fence_manual) and on startup tried to start ccsd and cman. ccsd started just fine, and I had an identical config on that node as I did on the other 6 nodes (one was still locked). But when cman started, I got a bunch of errors on the failed node (trin-09):

CMAN: Cluster membership rejected

On the other 6 responsive nodes I kept getting the error:

CMAN: Join request from trin-09 rejected, config version local 7 remote 13

It appears that CMAN didn't update its view of the cluster.conf file when ccs was updated. (BTW, I started seeing problems while I was updating from config_version 7 to 8.)
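A pre-flight check before starting cman might at least have flagged some of this skew. The sketch below compares the config_version in the on-disk /etc/cluster/cluster.conf against what ccsd is serving; it would not have caught cman still sitting on version 7 (that would need a look at /proc/cluster/status or cman_tool status as well, I believe), and the string scan for config_version is deliberately crude, not a real XML parse:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ccs.h>

int main(void)
{
    long on_disk = -1, in_memory = -1;
    char line[4096];
    FILE *f;
    int desc;

    /* Pull config_version out of the on-disk file (crude scan). */
    f = fopen("/etc/cluster/cluster.conf", "r");
    if (f) {
        while (fgets(line, sizeof(line), f)) {
            char *p = strstr(line, "config_version=\"");
            if (p) {
                on_disk = strtol(p + strlen("config_version=\""), NULL, 10);
                break;
            }
        }
        fclose(f);
    }

    /* Ask ccsd what version it is actually serving. */
    desc = ccs_connect();
    if (desc >= 0) {
        char *v = NULL;
        if (ccs_get(desc, "/cluster/@config_version", &v) == 0) {
            in_memory = strtol(v, NULL, 10);
            free(v);
        }
        ccs_disconnect(desc);
    }

    printf("on-disk=%ld in-memory=%ld %s\n", on_disk, in_memory,
           (on_disk == in_memory && on_disk != -1) ? "(match)" : "(MISMATCH)");
    return (on_disk == in_memory && on_disk != -1) ? 0 : 1;
}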
- fix bug 143165, 134604, and 133254
- update related issues

These all seem to be related to the same issue: remote nodes were erroneously processing an update as though they were the originator, taking on some tasks that didn't belong to them. This was causing connect failures, version rollbacks, etc.
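Purely to illustrate the failure mode described above (the real ccsd code and message layout will differ, and the struct and field names below are invented), the missing guard amounts to something like:

/* Invented types for illustration only. */
struct update_msg {
    unsigned int originator_nodeid;   /* node that initiated the update */
    unsigned int new_version;         /* config_version being pushed */
};

static void handle_update(const struct update_msg *msg,
                          unsigned int local_nodeid)
{
    if (msg->originator_nodeid == local_nodeid) {
        /* Originator-only work: commit the new file, answer the
         * tool that requested the update, bump bookkeeping. */
    } else {
        /* Remote node: just adopt the new in-memory version.
         * Doing the originator's work here is what produced the
         * connect failures and version rollbacks described above. */
    }
}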
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/06f09a60b6ff525a2efd599c823085e014c8510b

F5-router, "Idling applications" feature does not work

Made a NOTE that unidling is HAProxy only

bug 143165
https://bugzilla.redhat.com/show_bug.cgi?id=1431658
re [comment 5]: see [bug 1431658 comment 3]