| Summary: | dlm_controld: cpg_mcast_joined error during rgmanager start up/shut down |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | cluster |
| Version: | 6.1 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED WONTFIX |
| Severity: | low |
| Priority: | low |
| Reporter: | Corey Marthaler <cmarthal> |
| Assignee: | David Teigland <teigland> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | ccaulfie, cluster-maint, fdinitto, jfriesse, lhh, rpeterso, sdake, teigland |
| Target Milestone: | rc |
| Target Release: | --- |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-08-19 21:25:12 UTC |
Description (Corey Marthaler, 2011-08-03 21:32:12 UTC)
It looks to me as if corosync is crashing/exiting in the background, rgmanager starts to shut down, and dlm_controld reports the error. Can you please also attach the logs from /var/log/cluster? Is the node fenced?

It doesn't appear that corosync is crashing, as it's still running on each node and no nodes are removed from the cluster or fenced. I'll post the latest logs once rgmanager is stopped.

Created attachment 516793 [details]
log from taft-01
Created attachment 516794 [details]
log from taft-02
Created attachment 516795 [details]
log from taft-03
Created attachment 516796 [details]
log from taft-04
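
Regarding the point above that corosync is still running on each node: dlm_controld talks to corosync through the closed process group (CPG) library, so the same thing can be checked from the API side as well as from the process list. The following is a minimal diagnostic sketch, not taken from the cluster packages (the file name and build line are assumptions); it only confirms that the local corosync CPG service answers and reports the local node ID.

```c
/*
 * Minimal diagnostic sketch (not part of the cluster packages; file name
 * and build line are assumptions): confirm that the local corosync CPG
 * service answers, which is what "corosync is still running" means from
 * dlm_controld's point of view.
 *
 * Build: gcc -o cpg_check cpg_check.c -lcpg
 */
#include <stdio.h>
#include <corosync/cpg.h>

int main(void)
{
	cpg_callbacks_t callbacks = { NULL, NULL };  /* no deliver/confchg callbacks needed */
	cpg_handle_t handle;
	unsigned int nodeid = 0;
	cs_error_t err;

	err = cpg_initialize(&handle, &callbacks);
	if (err != CS_OK) {
		fprintf(stderr, "cpg_initialize failed: %d\n", err);
		return 1;
	}

	err = cpg_local_get(handle, &nodeid);
	if (err != CS_OK)
		fprintf(stderr, "cpg_local_get failed: %d\n", err);
	else
		printf("corosync CPG is up, local nodeid %u\n", nodeid);

	cpg_finalize(handle);
	return (err == CS_OK) ? 0 : 1;
}
```

If corosync had actually exited, cpg_initialize() would fail outright, which is not what is reported above.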
This could potentially be related to bug 728620.

Here are the operations that were run on each of the four nodes in the cluster (taft-0[1234]).

[root@taft-01 ~]# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot...   [  OK  ]
   Checking Network Manager...                        [  OK  ]
   Global setup...                                    [  OK  ]
   Loading kernel modules...                          [  OK  ]
   Mounting configfs...                               [  OK  ]
   Starting cman...                                   [  OK  ]
   Waiting for quorum...                              [  OK  ]
   Starting fenced...                                 [  OK  ]
   Starting dlm_controld...                           [  OK  ]
   Starting gfs_controld...                           [  OK  ]
   Unfencing self...                                  [  OK  ]
   Joining fence domain...                            [  OK  ]

[root@taft-01 ~]# service rgmanager start
Starting Cluster Service Manager:                     [  OK  ]

[root@taft-01 ~]# clustat
Cluster Status for TAFT @ Mon Aug 8 14:03:42 2011
Member Status: Quorate

 Member Name        ID   Status
 ------ ----        ---- ------
 taft-01            1    Online, Local, rgmanager
 taft-02            2    Online, rgmanager
 taft-03            3    Online, rgmanager
 taft-04            4    Online, rgmanager

 Service Name       Owner (Last)   State
 ------- ----       ----- ------   -----
 service:halvm1     taft-01        started
 service:halvm2     taft-01        started

[root@taft-01 ~]# service rgmanager stop
Stopping Cluster Service Manager:                     [  OK  ]

[root@taft-01 ~]# service cman stop
Stopping cluster:
   Leaving fence domain...                            [  OK  ]
   Stopping gfs_controld...                           [  OK  ]
   Stopping dlm_controld...                           [  OK  ]
   Stopping fenced...                                 [  OK  ]
   Stopping cman...                                   [  OK  ]
   Waiting for corosync to shutdown:                  [  OK  ]
   Unloading kernel modules...                        [  OK  ]
   Unmounting configfs...                             [  OK  ]

Here is the storage configuration:

[root@taft-01 ~]# pvs
  PV         VG    Fmt  Attr PSize  PFree
  /dev/sdg1  TAFT2 lvm2 a-   45.21g 45.21g
  /dev/sdg2  TAFT2 lvm2 a-   45.21g 37.21g
  /dev/sdg3  TAFT2 lvm2 a-   45.22g 37.22g
  /dev/sdh1  TAFT1 lvm2 a-   45.21g 45.21g
  /dev/sdh2  TAFT1 lvm2 a-   45.21g 37.21g
  /dev/sdh3  TAFT1 lvm2 a-   45.22g 37.22g

[root@taft-01 ~]# lvs -a -o +devices
  LV            VG    Attr   LSize Log     Copy% Devices
  ha            TAFT1 Mwi--- 8.00g ha_mlog       ha_mimage_0(0),ha_mimage_1(0)
  [ha_mimage_0] TAFT1 Iwi--- 8.00g               /dev/sdh3(0)
  [ha_mimage_1] TAFT1 Iwi--- 8.00g               /dev/sdh2(0)
  [ha_mlog]     TAFT1 lwi--- 4.00m               /dev/sdh1(0)
  ha            TAFT2 Mwi--- 8.00g ha_mlog       ha_mimage_0(0),ha_mimage_1(0)
  [ha_mimage_0] TAFT2 Iwi--- 8.00g               /dev/sdg3(0)
  [ha_mimage_1] TAFT2 Iwi--- 8.00g               /dev/sdg2(0)
  [ha_mlog]     TAFT2 lwi--- 4.00m               /dev/sdg1(0)

I'll post the cluster.conf file and the newer syslogs containing the error message.

Created attachment 517285 [details]
taft cluster.conf file
Created attachment 517286 [details]
log from taft-01
Created attachment 517287 [details]
log from taft-02
Created attachment 517289 [details]
log from taft-03
Created attachment 517290 [details]
log from taft-04
This shows that it does not appear to be related to the other segfault in rgmanager.

Aug 8 14:04:31 taft-03 dlm_controld[2396]: cpg_mcast_joined error 12 handle 7724c67e00000001 start

vs.

Aug 8 14:04:36 taft-04 kernel: dlm: invalid lockspace 5231f3eb from 1 cmd 1 type 7

...seem to be the related errors. If the time is in sync, taft-01 would have left the rgmanager lockspace at about 14:04:10; the 'Disconnecting from CMAN' message occurs after the lockspace has already been released on that host.
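
For context, cpg_mcast_joined() is the corosync CPG call dlm_controld uses to multicast a message to a process group it has joined, and "error 12" is presumably the raw cs_error_t return value. Below is a minimal sketch of such a call with its return-code handling. It is not dlm_controld's code: the group name and payload are made up, and the interpretation of 12 as CS_ERR_NOT_EXIST assumes the cs_error_t values from corosync's corotypes.h. The point is that only CS_ERR_TRY_AGAIN is a retryable flow-control condition; any other failure means the message was simply not sent.

```c
/*
 * Minimal sketch of a cpg_mcast_joined() call and its return-code
 * handling.  This is NOT dlm_controld's code: the group name, payload
 * and error policy are illustrative only, and the signatures/constants
 * assume corosync's cpg.h and corotypes.h (where 6 is CS_ERR_TRY_AGAIN
 * and 12 is CS_ERR_NOT_EXIST).
 *
 * Build: gcc -o cpg_send cpg_send.c -lcpg
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
	/* A real daemon would dispatch received messages here. */
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
	/* A real daemon would track membership changes here. */
}

int main(void)
{
	cpg_callbacks_t cb = {
		.cpg_deliver_fn = deliver_cb,
		.cpg_confchg_fn = confchg_cb,
	};
	cpg_handle_t handle;
	struct cpg_name group;
	struct iovec iov;
	cs_error_t err;
	const char payload[] = "start";            /* made-up message body */

	if (cpg_initialize(&handle, &cb) != CS_OK)
		return 1;

	strcpy(group.value, "example_lockspace");  /* made-up group name */
	group.length = strlen(group.value);

	if (cpg_join(handle, &group) != CS_OK) {
		cpg_finalize(handle);
		return 1;
	}

	iov.iov_base = (void *)payload;
	iov.iov_len = sizeof(payload);

	err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
	if (err == CS_ERR_TRY_AGAIN) {
		/* Flow control: transient, the send can simply be retried. */
	} else if (err != CS_OK) {
		/* Any other value (e.g. 12, CS_ERR_NOT_EXIST under the
		 * corotypes.h numbering) means the message was not sent
		 * and retrying will not help. */
		fprintf(stderr, "cpg_mcast_joined error %d\n", err);
	}

	cpg_leave(handle, &group);
	cpg_finalize(handle);
	return 0;
}
```

Given the timing analysis above, where the rgmanager lockspace has already been released before the message appears, the failure presumably falls into that second, non-retryable branch.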
Reproduced this on a base 6.1 cluster as well, so it is not a regression.

Aug 15 15:43:59 taft-04 modcluster: Updating cluster.conf
Aug 15 15:44:00 taft-04 corosync[2026]: [QUORUM] Members[4]: 1 2 3 4
Aug 15 15:44:17 taft-04 kernel: dlm: Using TCP for communications
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 3
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 2
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 1
Aug 15 15:44:18 taft-04 rgmanager[2432]: I am node #4
Aug 15 15:44:18 taft-04 rgmanager[2432]: Resource Group Manager Starting
Aug 15 15:44:18 taft-04 rgmanager[2432]: Loading Service Data
Aug 15 15:44:21 taft-04 rgmanager[2432]: Initializing Services
Aug 15 15:44:21 taft-04 rgmanager[3092]: stop: Could not match /dev/TAFT/ha3 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs3" returned 2 (invalid argument(s))
Aug 15 15:44:21 taft-04 rgmanager[3129]: stop: Could not match /dev/TAFT/ha2 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs2" returned 2 (invalid argument(s))
Aug 15 15:44:22 taft-04 rgmanager[3166]: stop: Could not match /dev/TAFT/ha1 with a real device
Aug 15 15:44:22 taft-04 rgmanager[2432]: stop on fs "fs1" returned 2 (invalid argument(s))
Aug 15 15:44:24 taft-04 rgmanager[2432]: Services Initialized
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: Local UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-02 UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-03 UP
Aug 15 15:44:25 taft-04 rgmanager[2432]: State change: taft-01 UP
Aug 15 15:46:03 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 dlm_controld[2157]: cpg_mcast_joined error 12 handle d34b6a800000001 start
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutdown complete, exiting

Linux taft-04 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
openais-1.1.1-7.el6.x86_64
rgmanager-3.0.12-11.el6.x86_64

I still see this error on two different clusters when stopping services with the build in comment #24.

HAYES:
[...]
Aug 19 15:32:40 hayes-02 rgmanager[4044]: Services Initialized
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: Local UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-03 UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-01 UP
Aug 19 15:33:23 hayes-02 rgmanager[4044]: Shutting down
Aug 19 15:33:23 hayes-02 dlm_controld[2121]: cpg_mcast_joined error 12 handle 6eb5bd400000002 start

[root@hayes-02 ~]# rpm -qa | grep corosync
corosync-1.4.1-1.el6.jf.1.x86_64
corosynclib-1.4.1-1.el6.jf.1.x86_64

GRANTS:
[...]
Aug 19 15:39:21 grant-03 rgmanager[6866]: Services Initialized
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: Local UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-01 UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-02 UP
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 dlm_controld[2046]: cpg_mcast_joined error 12 handle 7d5e18f800000001 start
Aug 19 15:41:33 grant-03 rgmanager[6866]: Disconnecting from CMAN
Aug 19 15:41:33 grant-03 rgmanager[6866]: Exiting

[root@grant-03 ~]# rpm -qa | grep corosync
corosynclib-1.4.1-1.el6.jf.1.x86_64
corosync-1.4.1-1.el6.jf.1.x86_64

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.