Bug 728000 - dlm_controld: cpg_mcast_joined error during rgmanager start up/shut down
Summary: dlm_controld: cpg_mcast_joined error during rgmanager start up/shut down
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.1
Hardware: x86_64
OS: Linux
low
low
Target Milestone: rc
: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-08-03 21:32 UTC by Corey Marthaler
Modified: 2011-08-19 21:25 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-19 21:25:12 UTC


Attachments (Terms of Use)
log from taft-01 (5.87 KB, text/plain)
2011-08-04 21:28 UTC, Corey Marthaler
no flags Details
log from taft-02 (8.99 KB, text/plain)
2011-08-04 21:29 UTC, Corey Marthaler
no flags Details
log from taft-03 (5.48 KB, text/plain)
2011-08-04 21:29 UTC, Corey Marthaler
no flags Details
log from taft-04 (6.06 KB, text/plain)
2011-08-04 21:30 UTC, Corey Marthaler
no flags Details
taft cluster.conf file (2.58 KB, text/plain)
2011-08-08 19:21 UTC, Corey Marthaler
no flags Details
log from taft-01 (11.06 KB, text/plain)
2011-08-08 19:22 UTC, Corey Marthaler
no flags Details
log from taft-02 (7.60 KB, text/plain)
2011-08-08 19:22 UTC, Corey Marthaler
no flags Details
log from taft-03 (6.99 KB, text/plain)
2011-08-08 19:24 UTC, Corey Marthaler
no flags Details
log from taft-04 (8.52 KB, text/plain)
2011-08-08 19:25 UTC, Corey Marthaler
no flags Details

Description Corey Marthaler 2011-08-03 21:32:12 UTC
Description of problem:
I continually see this error when starting HA LVM services

Aug  3 16:13:17 taft-01 rgmanager[11766]: I am node #1
Aug  3 16:13:17 taft-01 rgmanager[11766]: Resource Group Manager Starting
Aug  3 16:13:17 taft-01 rgmanager[11766]: Loading Service Data
Aug  3 16:13:22 taft-01 rgmanager[11766]: Initializing Services
Aug  3 16:13:22 taft-01 rgmanager[12577]: [fs] stop: Could not match /dev/TAFT/ha3 with a real device
Aug  3 16:13:22 taft-01 rgmanager[11766]: stop on fs "fs3" returned 2 (invalid argument(s))
Aug  3 16:13:22 taft-01 rgmanager[12614]: [fs] stop: Could not match /dev/TAFT/ha2 with a real device
Aug  3 16:13:22 taft-01 rgmanager[11766]: stop on fs "fs2" returned 2 (invalid argument(s))
Aug  3 16:13:22 taft-01 rgmanager[12651]: [fs] stop: Could not match /dev/TAFT/ha1 with a real device
Aug  3 16:13:22 taft-01 rgmanager[11766]: stop on fs "fs1" returned 2 (invalid argument(s))
Aug  3 16:13:25 taft-01 rgmanager[12726]: [lvm] Stripping tag, foobar
Aug  3 16:13:26 taft-01 rgmanager[11766]: Services Initialized
Aug  3 16:13:26 taft-01 rgmanager[11766]: State change: Local UP
Aug  3 16:13:26 taft-01 rgmanager[11766]: State change: taft-02 UP
Aug  3 16:13:26 taft-01 rgmanager[11766]: State change: taft-04 UP
Aug  3 16:13:26 taft-01 rgmanager[11766]: State change: taft-03 UP
Aug  3 16:17:46 taft-01 rgmanager[11766]: Shutting down
Aug  3 16:17:47 taft-01 dlm_controld[2291]: cpg_mcast_joined error 12 handle 4962813b00000001 plocks_stored
Aug  3 16:17:47 taft-01 rgmanager[11766]: Disconnecting from CMAN
Aug  3 16:17:47 taft-01 rgmanager[11766]: Exiting


Version-Release number of selected component (if applicable):
Linux taft-01 2.6.32-174.el6.x86_64 #1 SMP Thu Jul 28 00:31:11 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
rgmanager-3.0.12.1-2.el6.x86_64
cman-3.0.12.1-7.el6.x86_64
openais-1.1.1-7.el6.x86_64

Comment 1 Fabio Massimo Di Nitto 2011-08-04 04:20:18 UTC
It looks to me as if corosync is crashing/exiting in backgroup, rgmanager starts to shutdown and dlm_controld report the error.

Can you please attach also logs from /var/log/cluster?

Is node fenced?

Comment 2 Corey Marthaler 2011-08-04 21:28:16 UTC
It doesn't appear that corosync is crashing, as it's still running on each node and no nodes are removed from the cluster or fenced. I'll post the lasted logs when rgmanager is stopped.

Comment 3 Corey Marthaler 2011-08-04 21:28:47 UTC
Created attachment 516793 [details]
log from taft-01

Comment 4 Corey Marthaler 2011-08-04 21:29:14 UTC
Created attachment 516794 [details]
log from taft-02

Comment 5 Corey Marthaler 2011-08-04 21:29:40 UTC
Created attachment 516795 [details]
log from taft-03

Comment 6 Corey Marthaler 2011-08-04 21:30:30 UTC
Created attachment 516796 [details]
log from taft-04

Comment 8 Corey Marthaler 2011-08-05 20:36:41 UTC
this could potentially be related to bug 728620.

Comment 11 Corey Marthaler 2011-08-08 19:19:58 UTC
Here are the operations that were run on each of the four nodes in the cluster (taft-0[1234]).

[root@taft-01 ~]# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot...  [  OK  ]
   Checking Network Manager...                       [  OK  ]
   Global setup...                                   [  OK  ]
   Loading kernel modules...                         [  OK  ]
   Mounting configfs...                              [  OK  ]
   Starting cman...                                  [  OK  ]
   Waiting for quorum...                             [  OK  ]
   Starting fenced...                                [  OK  ]
   Starting dlm_controld...                          [  OK  ]
   Starting gfs_controld...                          [  OK  ]
   Unfencing self...                                 [  OK  ]
   Joining fence domain...                           [  OK  ]
[root@taft-01 ~]# service rgmanager start
Starting Cluster Service Manager:                    [  OK  ]

[root@taft-01 ~]# clustat
Cluster Status for TAFT @ Mon Aug  8 14:03:42 2011
Member Status: Quorate

 Member Name           ID   Status
 ------ ----           ---- ------
 taft-01                   1 Online, Local, rgmanager
 taft-02                   2 Online, rgmanager
 taft-03                   3 Online, rgmanager
 taft-04                   4 Online, rgmanager

 Service Name        Owner (Last)      State
 ------- ----        ----- ------      -----
 service:halvm1      taft-01           started
 service:halvm2      taft-01           started

[root@taft-01 ~]# service rgmanager stop
Stopping Cluster Service Manager:                  [  OK  ]
[root@taft-01 ~]# service cman stop
Stopping cluster:
   Leaving fence domain...                         [  OK  ]
   Stopping gfs_controld...                        [  OK  ]
   Stopping dlm_controld...                        [  OK  ]
   Stopping fenced...                              [  OK  ]
   Stopping cman...                                [  OK  ]
   Waiting for corosync to shutdown:               [  OK  ]
   Unloading kernel modules...                     [  OK  ]
   Unmounting configfs...                          [  OK  ]


Here is the storage configuration:
[root@taft-01 ~]# pvs
  PV         VG        Fmt  Attr PSize  PFree
  /dev/sdg1  TAFT2     lvm2 a-   45.21g 45.21g
  /dev/sdg2  TAFT2     lvm2 a-   45.21g 37.21g
  /dev/sdg3  TAFT2     lvm2 a-   45.22g 37.22g
  /dev/sdh1  TAFT1     lvm2 a-   45.21g 45.21g
  /dev/sdh2  TAFT1     lvm2 a-   45.21g 37.21g
  /dev/sdh3  TAFT1     lvm2 a-   45.22g 37.22g

[root@taft-01 ~]# lvs -a -o +devices
 LV            VG     Attr   LSize  Log     Copy%  Devices
 ha            TAFT1  Mwi---  8.00g ha_mlog        ha_mimage_0(0),ha_mimage_1(0)
 [ha_mimage_0] TAFT1  Iwi---  8.00g                /dev/sdh3(0)
 [ha_mimage_1] TAFT1  Iwi---  8.00g                /dev/sdh2(0)
 [ha_mlog]     TAFT1  lwi---  4.00m                /dev/sdh1(0)
 ha            TAFT2  Mwi---  8.00g ha_mlog        ha_mimage_0(0),ha_mimage_1(0)
 [ha_mimage_0] TAFT2  Iwi---  8.00g                /dev/sdg3(0)
 [ha_mimage_1] TAFT2  Iwi---  8.00g                /dev/sdg2(0)
 [ha_mlog]     TAFT2  lwi---  4.00m                /dev/sdg1(0)

I'll post the cluster.conf file and the newer syslogs containing the error message.

Comment 12 Corey Marthaler 2011-08-08 19:21:15 UTC
Created attachment 517285 [details]
taft cluster.conf file

Comment 13 Corey Marthaler 2011-08-08 19:22:27 UTC
Created attachment 517286 [details]
log from taft-01

Comment 14 Corey Marthaler 2011-08-08 19:22:57 UTC
Created attachment 517287 [details]
log from taft-02

Comment 15 Corey Marthaler 2011-08-08 19:24:47 UTC
Created attachment 517289 [details]
log from taft-03

Comment 16 Corey Marthaler 2011-08-08 19:25:16 UTC
Created attachment 517290 [details]
log from taft-04

Comment 17 Lon Hohberger 2011-08-11 20:35:56 UTC
This shows it appears to not be related to the other segfault in rgmanager.

Aug  8 14:04:31 taft-03 dlm_controld[2396]: cpg_mcast_joined error 12 handle 7724c67e00000001 start

vs.

Aug  8 14:04:36 taft-04 kernel: dlm: invalid lockspace 5231f3eb from 1 cmd 1 type 7

...seem to be the related errors.  If the time is in sync, taft-01 would have left the rgmanager lockspace about 14:04:10; the 'Disconnecting from CMAN' message occurs after the lockspace has already been released on that host.

Comment 18 Corey Marthaler 2011-08-15 20:52:05 UTC
Reproduced this on a base 6.1 cluster as well, so not a regression.

Aug 15 15:43:59 taft-04 modcluster: Updating cluster.conf
Aug 15 15:44:00 taft-04 corosync[2026]:   [QUORUM] Members[4]: 1 2 3 4
Aug 15 15:44:17 taft-04 kernel: dlm: Using TCP for communications
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 3
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 2
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 1
Aug 15 15:44:18 taft-04 rgmanager[2432]: I am node #4
Aug 15 15:44:18 taft-04 rgmanager[2432]: Resource Group Manager Starting
Aug 15 15:44:18 taft-04 rgmanager[2432]: Loading Service Data
Aug 15 15:44:21 taft-04 rgmanager[2432]: Initializing Services
Aug 15 15:44:21 taft-04 rgmanager[3092]: stop: Could not match /dev/TAFT/ha3 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs3" returned 2 (invalid argument(s))
Aug 15 15:44:21 taft-04 rgmanager[3129]: stop: Could not match /dev/TAFT/ha2 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs2" returned 2 (invalid argument(s))
Aug 15 15:44:22 taft-04 rgmanager[3166]: stop: Could not match /dev/TAFT/ha1 with a real device
Aug 15 15:44:22 taft-04 rgmanager[2432]: stop on fs "fs1" returned 2 (invalid argument(s))
Aug 15 15:44:24 taft-04 rgmanager[2432]: Services Initialized
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: Local UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-02 UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-03 UP
Aug 15 15:44:25 taft-04 rgmanager[2432]: State change: taft-01 UP
Aug 15 15:46:03 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 dlm_controld[2157]: cpg_mcast_joined error 12 handle d34b6a800000001 start
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutdown complete, exiting



Linux taft-04 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
openais-1.1.1-7.el6.x86_64
rgmanager-3.0.12-11.el6.x86_64

Comment 24 Jan Friesse 2011-08-19 07:52:32 UTC
Build https://brewweb.devel.redhat.com/taskinfo?taskID=3566651

Comment 26 Corey Marthaler 2011-08-19 20:45:31 UTC
I still see this error on two different clusters when stopping services with the build in comment #24.


HAYES:
[...]
Aug 19 15:32:40 hayes-02 rgmanager[4044]: Services Initialized
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: Local UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-03 UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-01 UP
Aug 19 15:33:23 hayes-02 rgmanager[4044]: Shutting down
Aug 19 15:33:23 hayes-02 dlm_controld[2121]: cpg_mcast_joined error 12 handle 6eb5bd400000002 start

[root@hayes-02 ~]# rpm -qa | grep corosync
corosync-1.4.1-1.el6.jf.1.x86_64
corosynclib-1.4.1-1.el6.jf.1.x86_64



GRANTS:
[...]
Aug 19 15:39:21 grant-03 rgmanager[6866]: Services Initialized
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: Local UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-01 UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-02 UP
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 dlm_controld[2046]: cpg_mcast_joined error 12 handle 7d5e18f800000001 start
Aug 19 15:41:33 grant-03 rgmanager[6866]: Disconnecting from CMAN
Aug 19 15:41:33 grant-03 rgmanager[6866]: Exiting

[root@grant-03 ~]# rpm -qa | grep corosync
corosynclib-1.4.1-1.el6.jf.1.x86_64
corosync-1.4.1-1.el6.jf.1.x86_64

Comment 29 RHEL Product and Program Management 2011-08-19 21:25:12 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.


Note You need to log in before you can comment on or make changes to this bug.