| Summary: | rgmanager dies and is not restartable |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | rgmanager |
| Version: | 6.3 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED DUPLICATE |
| Severity: | high |
| Priority: | high |
| Target Milestone: | rc |
| Reporter: | Richard Allen <ra> |
| Assignee: | Lon Hohberger <lhh> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | cluster-maint, djansa |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-08-01 21:57:54 UTC |

Description
Richard Allen
2011-03-30 20:26:56 UTC
Ah -- I think you've tripped on this, which probably leads to a dlm interaction issue:

https://bugzilla.redhat.com/show_bug.cgi?id=688201

Full logs (even if you edit out hostnames or whatever) would be necessary to verify this.

Created attachment 488898 [details]
example /var/log/messages file from the system in question
This is a messages log from one of the nodes. I was starting and stopping services and rebooting wildly to try to get the stuff up :)
Created attachment 488899 [details]
rgmanager.log
Created attachment 488900 [details]
corosync.log
Created attachment 488901 [details]
dlm_controld.log
All logs updated as is, only domain names have been stripped.

Whatever you're hitting, it's far below rgmanager - rgmanager is merely a victim of the underlying problem you're having. Here are a couple of suggestions:

* It looks like you need to add the following to /etc/sysconfig/cman:

    CMAN_QUORUM_TIMEOUT=45

  (This will be the default in RHEL 6.1)

* This worries me:

    Mar 16 21:24:34 test3-vm corosync[1633]: [TOTEM ] Incrementing problem counter for seqid 1 iface 172.29.123.243 to [1 of 10]

  You need to disable redundant ring; it is not well tested and entirely unsupported. The nodelist fed to from dlm_controld to the kernel may or may not match up with the active ring. That is, corosync might think a given node is "up" but the DLM may be unable to talk to it because. When this happens, anything using the DLM (rgmanager included) will break.

* You need to chkconfig --del corosync if it's not already disabled.

* I have seen this before; it appears to be a bug in corosync triggered by network issues:

    Mar 17 15:01:44 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 37
    Mar 17 15:01:44 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 39 37
    ... (repeat several times)
    Mar 17 15:01:44 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 39 37
    Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 3
    Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 2
    Mar 17 15:01:54 syseng1-vm kernel: dlm: closing connection to node 1
    Mar 17 15:01:54 syseng1-vm corosync[2619]: [TOTEM ] A processor failed, forming new configuration.
    Mar 17 15:02:00 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 2
    ... (50 or 60 more)
    Mar 17 15:02:19 syseng1-vm corosync[2619]: [TOTEM ] Retransmit List: 2

(In reply to comment #8)
> You need to disable redundant ring; it is not well tested and entirely
> unsupported. The nodelist fed to from dlm_controld to the kernel may or may
> not match up with the active ring. That is, corosync might think a given node
> is "up" but the DLM may be unable to talk to it because. When this happens,
> anything using the DLM (rgmanager included) will break.

Oops... The nodelist given from dlm_controld to the kernel may or may not match up with the active ring. That is, corosync might think a given node is "up" but the DLM may be unable to talk to it. When this happens, anything using the DLM (rgmanager included) will break.

If rgmanager crashed (which is possible) there is a bug -- but the underlying issue(s) causing it is far worse. If abrtd caught a core of rgmanager, go ahead and attach it here and I'll figure out how to fix it. If there was no core file, you may need to add "DAEMON_COREFILE_LIMIT=unlimited" to /etc/sysconfig/rgmanager .

*** This bug has been marked as a duplicate of bug 725058 ***
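
For anyone applying the sysconfig suggestions from the comments above, here is a minimal sketch of one way to set them without ending up with duplicate entries. The helper name set_sysconfig_var is hypothetical (it is not part of cman or rgmanager); only the file paths, variable names, values, and the chkconfig command come from this bug. It assumes simple KEY=VALUE lines and values that contain no "|" or "&" characters.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with cman or rgmanager): set KEY=VALUE in a
# sysconfig-style file, replacing an existing KEY= line or appending a new one.
# Assumption: VALUE contains no '|' or '&', which sed would treat specially.
set_sysconfig_var() {
    file="$1"
    key="$2"
    value="$3"
    touch "$file"
    if grep -q "^${key}=" "$file"; then
        # Rewrite the existing assignment via a temporary file.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
            mv "${file}.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Settings suggested in the comments above (run as root on each node):
#   set_sysconfig_var /etc/sysconfig/cman CMAN_QUORUM_TIMEOUT 45
#   set_sysconfig_var /etc/sysconfig/rgmanager DAEMON_COREFILE_LIMIT unlimited
#   chkconfig --del corosync
```

The calls that touch /etc are left commented out so the block can be sourced safely. Disabling redundant ring is not covered here because the comments do not say how it was configured on these nodes.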