Description of problem:
I continually see this error when starting HA LVM services

Aug 3 16:13:17 taft-01 rgmanager[11766]: I am node #1
Aug 3 16:13:17 taft-01 rgmanager[11766]: Resource Group Manager Starting
Aug 3 16:13:17 taft-01 rgmanager[11766]: Loading Service Data
Aug 3 16:13:22 taft-01 rgmanager[11766]: Initializing Services
Aug 3 16:13:22 taft-01 rgmanager[12577]: [fs] stop: Could not match /dev/TAFT/ha3 with a real device
Aug 3 16:13:22 taft-01 rgmanager[11766]: stop on fs "fs3" returned 2 (invalid argument(s))
Aug 3 16:13:22 taft-01 rgmanager[12614]: [fs] stop: Could not match /dev/TAFT/ha2 with a real device
Aug 3 16:13:22 taft-01 rgmanager[11766]: stop on fs "fs2" returned 2 (invalid argument(s))
Aug 3 16:13:22 taft-01 rgmanager[12651]: [fs] stop: Could not match /dev/TAFT/ha1 with a real device
Aug 3 16:13:22 taft-01 rgmanager[11766]: stop on fs "fs1" returned 2 (invalid argument(s))
Aug 3 16:13:25 taft-01 rgmanager[12726]: [lvm] Stripping tag, foobar
Aug 3 16:13:26 taft-01 rgmanager[11766]: Services Initialized
Aug 3 16:13:26 taft-01 rgmanager[11766]: State change: Local UP
Aug 3 16:13:26 taft-01 rgmanager[11766]: State change: taft-02 UP
Aug 3 16:13:26 taft-01 rgmanager[11766]: State change: taft-04 UP
Aug 3 16:13:26 taft-01 rgmanager[11766]: State change: taft-03 UP
Aug 3 16:17:46 taft-01 rgmanager[11766]: Shutting down
Aug 3 16:17:47 taft-01 dlm_controld[2291]: cpg_mcast_joined error 12 handle 4962813b00000001 plocks_stored
Aug 3 16:17:47 taft-01 rgmanager[11766]: Disconnecting from CMAN
Aug 3 16:17:47 taft-01 rgmanager[11766]: Exiting

Version-Release number of selected component (if applicable):
Linux taft-01 2.6.32-174.el6.x86_64 #1 SMP Thu Jul 28 00:31:11 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
rgmanager-3.0.12.1-2.el6.x86_64
cman-3.0.12.1-7.el6.x86_64
openais-1.1.1-7.el6.x86_64
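The "stop: Could not match ... with a real device" messages occur during the initialization pass, when rgmanager stops each service before deciding where to start it. A rough sketch of the check involved (a hypothetical illustration, not the actual fs.sh logic): the agent resolves the configured device path and returns OCF_ERR_ARGS (2) when the HA LVM volume is not active on this node and therefore has no device node.

```python
import os
import stat

OCF_SUCCESS, OCF_ERR_ARGS = 0, 2

def match_real_device(path):
    """Sketch of a device-resolution check: resolve the configured
    device and verify it is a real block device before acting on it."""
    real = os.path.realpath(path)
    try:
        mode = os.stat(real).st_mode
    except OSError:
        # No such node -> "Could not match ... with a real device"
        return OCF_ERR_ARGS
    return OCF_SUCCESS if stat.S_ISBLK(mode) else OCF_ERR_ARGS

# An inactive HA LVM volume has no /dev/<vg>/<lv> node, so stop fails:
print(match_real_device("/dev/TAFT/ha3"))  # → 2 on a node where the LV is inactive
```

This matches the "returned 2 (invalid argument(s))" lines in the log: 2 is the OCF invalid-arguments return code, which rgmanager tolerates during the initial stop pass because the volume is simply not active yet.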
It looks to me as if corosync is crashing/exiting in the background, rgmanager starts to shut down, and dlm_controld reports the error. Can you please also attach the logs from /var/log/cluster? Is the node fenced?
It doesn't appear that corosync is crashing; it's still running on each node, and no nodes are removed from the cluster or fenced. I'll post the latest logs from when rgmanager is stopped.
Created attachment 516793 [details] log from taft-01
Created attachment 516794 [details] log from taft-02
Created attachment 516795 [details] log from taft-03
Created attachment 516796 [details] log from taft-04
This could potentially be related to bug 728620.
Here are the operations that were run on each of the four nodes in the cluster (taft-0[1234]).

[root@taft-01 ~]# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]
[root@taft-01 ~]# service rgmanager start
Starting Cluster Service Manager:                          [  OK  ]
[root@taft-01 ~]# clustat
Cluster Status for TAFT @ Mon Aug 8 14:03:42 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 taft-01                                    1 Online, Local, rgmanager
 taft-02                                    2 Online, rgmanager
 taft-03                                    3 Online, rgmanager
 taft-04                                    4 Online, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:halvm1                 taft-01                        started
 service:halvm2                 taft-01                        started

[root@taft-01 ~]# service rgmanager stop
Stopping Cluster Service Manager:                          [  OK  ]
[root@taft-01 ~]# service cman stop
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]

Here is the storage configuration:

[root@taft-01 ~]# pvs
  PV         VG    Fmt  Attr PSize  PFree
  /dev/sdg1  TAFT2 lvm2 a-   45.21g 45.21g
  /dev/sdg2  TAFT2 lvm2 a-   45.21g 37.21g
  /dev/sdg3  TAFT2 lvm2 a-   45.22g 37.22g
  /dev/sdh1  TAFT1 lvm2 a-   45.21g 45.21g
  /dev/sdh2  TAFT1 lvm2 a-   45.21g 37.21g
  /dev/sdh3  TAFT1 lvm2 a-   45.22g 37.22g
[root@taft-01 ~]# lvs -a -o +devices
  LV            VG    Attr   LSize Log     Copy% Devices
  ha            TAFT1 Mwi--- 8.00g ha_mlog       ha_mimage_0(0),ha_mimage_1(0)
  [ha_mimage_0] TAFT1 Iwi--- 8.00g               /dev/sdh3(0)
  [ha_mimage_1] TAFT1 Iwi--- 8.00g               /dev/sdh2(0)
  [ha_mlog]     TAFT1 lwi--- 4.00m               /dev/sdh1(0)
  ha            TAFT2 Mwi--- 8.00g ha_mlog       ha_mimage_0(0),ha_mimage_1(0)
  [ha_mimage_0] TAFT2 Iwi--- 8.00g               /dev/sdg3(0)
  [ha_mimage_1] TAFT2 Iwi--- 8.00g               /dev/sdg2(0)
  [ha_mlog]     TAFT2 lwi--- 4.00m               /dev/sdg1(0)

I'll post the cluster.conf file and the newer syslogs containing the error message.
Created attachment 517285 [details] taft cluster.conf file
Created attachment 517286 [details] log from taft-01
Created attachment 517287 [details] log from taft-02
Created attachment 517289 [details] log from taft-03
Created attachment 517290 [details] log from taft-04
This suggests it is not related to the other rgmanager segfault. These:

Aug 8 14:04:31 taft-03 dlm_controld[2396]: cpg_mcast_joined error 12 handle 7724c67e00000001 start

vs.

Aug 8 14:04:36 taft-04 kernel: dlm: invalid lockspace 5231f3eb from 1 cmd 1 type 7

...seem to be the related errors. If the clocks are in sync, taft-01 would have left the rgmanager lockspace at about 14:04:10; the 'Disconnecting from CMAN' message occurs after the lockspace has already been released on that host.
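For reference, the "error 12" returned by cpg_mcast_joined maps to CS_ERR_NOT_EXIST in corosync's cs_error_t numbering (a partial sketch of the enum below, assuming the standard corotypes.h values; verify against the installed headers). That is consistent with the timing above: dlm_controld is multicasting on a CPG handle whose group has already gone away during shutdown.

```python
# Partial map of corosync cs_error_t values (assumed from corotypes.h;
# check the corosync headers on the affected build to confirm).
CS_ERRORS = {
    1:  "CS_OK",
    5:  "CS_ERR_TIMEOUT",
    6:  "CS_ERR_TRY_AGAIN",
    8:  "CS_ERR_NO_MEMORY",
    9:  "CS_ERR_BAD_HANDLE",
    12: "CS_ERR_NOT_EXIST",
}

# "cpg_mcast_joined error 12" from the logs:
print(CS_ERRORS[12])  # → CS_ERR_NOT_EXIST
```

If this mapping holds, the message is a benign shutdown-ordering artifact rather than a corosync crash: the send races with the group teardown rather than failing for lack of resources.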
Reproduced this on a base 6.1 cluster as well, so it is not a regression.

Aug 15 15:43:59 taft-04 modcluster: Updating cluster.conf
Aug 15 15:44:00 taft-04 corosync[2026]: [QUORUM] Members[4]: 1 2 3 4
Aug 15 15:44:17 taft-04 kernel: dlm: Using TCP for communications
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 3
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 2
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 1
Aug 15 15:44:18 taft-04 rgmanager[2432]: I am node #4
Aug 15 15:44:18 taft-04 rgmanager[2432]: Resource Group Manager Starting
Aug 15 15:44:18 taft-04 rgmanager[2432]: Loading Service Data
Aug 15 15:44:21 taft-04 rgmanager[2432]: Initializing Services
Aug 15 15:44:21 taft-04 rgmanager[3092]: stop: Could not match /dev/TAFT/ha3 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs3" returned 2 (invalid argument(s))
Aug 15 15:44:21 taft-04 rgmanager[3129]: stop: Could not match /dev/TAFT/ha2 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs2" returned 2 (invalid argument(s))
Aug 15 15:44:22 taft-04 rgmanager[3166]: stop: Could not match /dev/TAFT/ha1 with a real device
Aug 15 15:44:22 taft-04 rgmanager[2432]: stop on fs "fs1" returned 2 (invalid argument(s))
Aug 15 15:44:24 taft-04 rgmanager[2432]: Services Initialized
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: Local UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-02 UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-03 UP
Aug 15 15:44:25 taft-04 rgmanager[2432]: State change: taft-01 UP
Aug 15 15:46:03 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 dlm_controld[2157]: cpg_mcast_joined error 12 handle d34b6a800000001 start
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutdown complete, exiting

Linux taft-04 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
openais-1.1.1-7.el6.x86_64
rgmanager-3.0.12-11.el6.x86_64
Build https://brewweb.devel.redhat.com/taskinfo?taskID=3566651
I still see this error on two different clusters when stopping services with the build in comment #24.

HAYES:
[...]
Aug 19 15:32:40 hayes-02 rgmanager[4044]: Services Initialized
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: Local UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-03 UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-01 UP
Aug 19 15:33:23 hayes-02 rgmanager[4044]: Shutting down
Aug 19 15:33:23 hayes-02 dlm_controld[2121]: cpg_mcast_joined error 12 handle 6eb5bd400000002 start

[root@hayes-02 ~]# rpm -qa | grep corosync
corosync-1.4.1-1.el6.jf.1.x86_64
corosynclib-1.4.1-1.el6.jf.1.x86_64

GRANTS:
[...]
Aug 19 15:39:21 grant-03 rgmanager[6866]: Services Initialized
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: Local UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-01 UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-02 UP
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 dlm_controld[2046]: cpg_mcast_joined error 12 handle 7d5e18f800000001 start
Aug 19 15:41:33 grant-03 rgmanager[6866]: Disconnecting from CMAN
Aug 19 15:41:33 grant-03 rgmanager[6866]: Exiting

[root@grant-03 ~]# rpm -qa | grep corosync
corosynclib-1.4.1-1.el6.jf.1.x86_64
corosync-1.4.1-1.el6.jf.1.x86_64
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.