| Summary: | dlm_controld: cpg_mcast_joined error during rgmanager start up/shut down |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | cluster |
| Version: | 6.1 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED WONTFIX |
| Severity: | low |
| Priority: | low |
| Reporter: | Corey Marthaler <cmarthal> |
| Assignee: | David Teigland <teigland> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | ccaulfie, cluster-maint, fdinitto, jfriesse, lhh, rpeterso, sdake, teigland |
| Target Milestone: | rc |
| Target Release: | --- |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-08-19 21:25:12 UTC |
Description (Corey Marthaler, 2011-08-03 21:32:12 UTC)
It looks to me as if corosync is crashing/exiting in the background, rgmanager starts to shut down, and dlm_controld reports the error. Can you please also attach the logs from /var/log/cluster? Is the node fenced?

It doesn't appear that corosync is crashing, as it's still running on each node and no nodes are removed from the cluster or fenced. I'll post the latest logs once rgmanager is stopped.

Created attachment 516793 [details]
log from taft-01
Created attachment 516794 [details]
log from taft-02
Created attachment 516795 [details]
log from taft-03
Created attachment 516796 [details]
log from taft-04
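
Regarding the point above that corosync is still running on each node: dlm_controld talks to corosync through the closed process group (CPG) library, so the same thing can be checked from the API side as well as from the process list. The following is a minimal diagnostic sketch, not taken from the cluster packages (the file name and build line are assumptions); it only confirms that the local corosync CPG service answers and reports the local node ID.

```c
/*
 * Minimal diagnostic sketch (not part of the cluster packages; file name
 * and build line are assumptions): confirm that the local corosync CPG
 * service answers, which is what "corosync is still running" means from
 * dlm_controld's point of view.
 *
 * Build: gcc -o cpg_check cpg_check.c -lcpg
 */
#include <stdio.h>
#include <corosync/cpg.h>

int main(void)
{
	cpg_callbacks_t callbacks = { NULL, NULL };  /* no deliver/confchg callbacks needed */
	cpg_handle_t handle;
	unsigned int nodeid = 0;
	cs_error_t err;

	err = cpg_initialize(&handle, &callbacks);
	if (err != CS_OK) {
		fprintf(stderr, "cpg_initialize failed: %d\n", err);
		return 1;
	}

	err = cpg_local_get(handle, &nodeid);
	if (err != CS_OK)
		fprintf(stderr, "cpg_local_get failed: %d\n", err);
	else
		printf("corosync CPG is up, local nodeid %u\n", nodeid);

	cpg_finalize(handle);
	return (err == CS_OK) ? 0 : 1;
}
```

If corosync had actually exited, cpg_initialize() would fail outright, which is not what is reported above.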
This could potentially be related to bug 728620.

Here are the operations that were run on each of the four nodes in the cluster (taft-0[1234]).

[root@taft-01 ~]# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot...   [  OK  ]
   Checking Network Manager...                        [  OK  ]
   Global setup...                                    [  OK  ]
   Loading kernel modules...                          [  OK  ]
   Mounting configfs...                               [  OK  ]
   Starting cman...                                   [  OK  ]
   Waiting for quorum...                              [  OK  ]
   Starting fenced...                                 [  OK  ]
   Starting dlm_controld...                           [  OK  ]
   Starting gfs_controld...                           [  OK  ]
   Unfencing self...                                  [  OK  ]
   Joining fence domain...                            [  OK  ]

[root@taft-01 ~]# service rgmanager start
Starting Cluster Service Manager:                     [  OK  ]

[root@taft-01 ~]# clustat
Cluster Status for TAFT @ Mon Aug 8 14:03:42 2011
Member Status: Quorate

 Member Name        ID   Status
 ------ ----        ---- ------
 taft-01            1    Online, Local, rgmanager
 taft-02            2    Online, rgmanager
 taft-03            3    Online, rgmanager
 taft-04            4    Online, rgmanager

 Service Name       Owner (Last)   State
 ------- ----       ----- ------   -----
 service:halvm1     taft-01        started
 service:halvm2     taft-01        started

[root@taft-01 ~]# service rgmanager stop
Stopping Cluster Service Manager:                     [  OK  ]

[root@taft-01 ~]# service cman stop
Stopping cluster:
   Leaving fence domain...                            [  OK  ]
   Stopping gfs_controld...                           [  OK  ]
   Stopping dlm_controld...                           [  OK  ]
   Stopping fenced...                                 [  OK  ]
   Stopping cman...                                   [  OK  ]
   Waiting for corosync to shutdown:                  [  OK  ]
   Unloading kernel modules...                        [  OK  ]
   Unmounting configfs...                             [  OK  ]

Here is the storage configuration:

[root@taft-01 ~]# pvs
  PV         VG    Fmt  Attr PSize  PFree
  /dev/sdg1  TAFT2 lvm2 a-   45.21g 45.21g
  /dev/sdg2  TAFT2 lvm2 a-   45.21g 37.21g
  /dev/sdg3  TAFT2 lvm2 a-   45.22g 37.22g
  /dev/sdh1  TAFT1 lvm2 a-   45.21g 45.21g
  /dev/sdh2  TAFT1 lvm2 a-   45.21g 37.21g
  /dev/sdh3  TAFT1 lvm2 a-   45.22g 37.22g

[root@taft-01 ~]# lvs -a -o +devices
  LV            VG    Attr   LSize Log     Copy% Devices
  ha            TAFT1 Mwi--- 8.00g ha_mlog       ha_mimage_0(0),ha_mimage_1(0)
  [ha_mimage_0] TAFT1 Iwi--- 8.00g               /dev/sdh3(0)
  [ha_mimage_1] TAFT1 Iwi--- 8.00g               /dev/sdh2(0)
  [ha_mlog]     TAFT1 lwi--- 4.00m               /dev/sdh1(0)
  ha            TAFT2 Mwi--- 8.00g ha_mlog       ha_mimage_0(0),ha_mimage_1(0)
  [ha_mimage_0] TAFT2 Iwi--- 8.00g               /dev/sdg3(0)
  [ha_mimage_1] TAFT2 Iwi--- 8.00g               /dev/sdg2(0)
  [ha_mlog]     TAFT2 lwi--- 4.00m               /dev/sdg1(0)

I'll post the cluster.conf file and the newer syslogs containing the error message.

Created attachment 517285 [details]
taft cluster.conf file
Created attachment 517286 [details]
log from taft-01
Created attachment 517287 [details]
log from taft-02
Created attachment 517289 [details]
log from taft-03
Created attachment 517290 [details]
log from taft-04
This shows that it does not appear to be related to the other segfault in rgmanager.

Aug 8 14:04:31 taft-03 dlm_controld[2396]: cpg_mcast_joined error 12 handle 7724c67e00000001 start

vs.

Aug 8 14:04:36 taft-04 kernel: dlm: invalid lockspace 5231f3eb from 1 cmd 1 type 7

...seem to be the related errors. If the time is in sync, taft-01 would have left the rgmanager lockspace at about 14:04:10; the 'Disconnecting from CMAN' message occurs after the lockspace has already been released on that host.
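
For context, cpg_mcast_joined() is the corosync CPG call dlm_controld uses to multicast a message to a process group it has joined, and "error 12" is presumably the raw cs_error_t return value. Below is a minimal sketch of such a call with its return-code handling. It is not dlm_controld's code: the group name and payload are made up, and the interpretation of 12 as CS_ERR_NOT_EXIST assumes the cs_error_t values from corosync's corotypes.h. The point is that only CS_ERR_TRY_AGAIN is a retryable flow-control condition; any other failure means the message was simply not sent.

```c
/*
 * Minimal sketch of a cpg_mcast_joined() call and its return-code
 * handling.  This is NOT dlm_controld's code: the group name, payload
 * and error policy are illustrative only, and the signatures/constants
 * assume corosync's cpg.h and corotypes.h (where 6 is CS_ERR_TRY_AGAIN
 * and 12 is CS_ERR_NOT_EXIST).
 *
 * Build: gcc -o cpg_send cpg_send.c -lcpg
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
	/* A real daemon would dispatch received messages here. */
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
	/* A real daemon would track membership changes here. */
}

int main(void)
{
	cpg_callbacks_t cb = {
		.cpg_deliver_fn = deliver_cb,
		.cpg_confchg_fn = confchg_cb,
	};
	cpg_handle_t handle;
	struct cpg_name group;
	struct iovec iov;
	cs_error_t err;
	const char payload[] = "start";            /* made-up message body */

	if (cpg_initialize(&handle, &cb) != CS_OK)
		return 1;

	strcpy(group.value, "example_lockspace");  /* made-up group name */
	group.length = strlen(group.value);

	if (cpg_join(handle, &group) != CS_OK) {
		cpg_finalize(handle);
		return 1;
	}

	iov.iov_base = (void *)payload;
	iov.iov_len = sizeof(payload);

	err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
	if (err == CS_ERR_TRY_AGAIN) {
		/* Flow control: transient, the send can simply be retried. */
	} else if (err != CS_OK) {
		/* Any other value (e.g. 12, CS_ERR_NOT_EXIST under the
		 * corotypes.h numbering) means the message was not sent
		 * and retrying will not help. */
		fprintf(stderr, "cpg_mcast_joined error %d\n", err);
	}

	cpg_leave(handle, &group);
	cpg_finalize(handle);
	return 0;
}
```

Given the timing analysis above, where the rgmanager lockspace has already been released before the message appears, the failure presumably falls into that second, non-retryable branch.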
Reproduced this on a base 6.1 cluster as well, so it is not a regression.

Aug 15 15:43:59 taft-04 modcluster: Updating cluster.conf
Aug 15 15:44:00 taft-04 corosync[2026]: [QUORUM] Members[4]: 1 2 3 4
Aug 15 15:44:17 taft-04 kernel: dlm: Using TCP for communications
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 3
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 2
Aug 15 15:44:17 taft-04 kernel: dlm: connecting to 1
Aug 15 15:44:18 taft-04 rgmanager[2432]: I am node #4
Aug 15 15:44:18 taft-04 rgmanager[2432]: Resource Group Manager Starting
Aug 15 15:44:18 taft-04 rgmanager[2432]: Loading Service Data
Aug 15 15:44:21 taft-04 rgmanager[2432]: Initializing Services
Aug 15 15:44:21 taft-04 rgmanager[3092]: stop: Could not match /dev/TAFT/ha3 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs3" returned 2 (invalid argument(s))
Aug 15 15:44:21 taft-04 rgmanager[3129]: stop: Could not match /dev/TAFT/ha2 with a real device
Aug 15 15:44:21 taft-04 rgmanager[2432]: stop on fs "fs2" returned 2 (invalid argument(s))
Aug 15 15:44:22 taft-04 rgmanager[3166]: stop: Could not match /dev/TAFT/ha1 with a real device
Aug 15 15:44:22 taft-04 rgmanager[2432]: stop on fs "fs1" returned 2 (invalid argument(s))
Aug 15 15:44:24 taft-04 rgmanager[2432]: Services Initialized
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: Local UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-02 UP
Aug 15 15:44:24 taft-04 rgmanager[2432]: State change: taft-03 UP
Aug 15 15:44:25 taft-04 rgmanager[2432]: State change: taft-01 UP
Aug 15 15:46:03 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutting down
Aug 15 15:46:04 taft-04 dlm_controld[2157]: cpg_mcast_joined error 12 handle d34b6a800000001 start
Aug 15 15:46:04 taft-04 rgmanager[2432]: Shutdown complete, exiting

Linux taft-04 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
openais-1.1.1-7.el6.x86_64
rgmanager-3.0.12-11.el6.x86_64

I still see this error on two different clusters when stopping services with the build in comment #24.

HAYES:
[...]
Aug 19 15:32:40 hayes-02 rgmanager[4044]: Services Initialized
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: Local UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-03 UP
Aug 19 15:32:40 hayes-02 rgmanager[4044]: State change: hayes-01 UP
Aug 19 15:33:23 hayes-02 rgmanager[4044]: Shutting down
Aug 19 15:33:23 hayes-02 dlm_controld[2121]: cpg_mcast_joined error 12 handle 6eb5bd400000002 start

[root@hayes-02 ~]# rpm -qa | grep corosync
corosync-1.4.1-1.el6.jf.1.x86_64
corosynclib-1.4.1-1.el6.jf.1.x86_64

GRANTS:
[...]
Aug 19 15:39:21 grant-03 rgmanager[6866]: Services Initialized
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: Local UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-01 UP
Aug 19 15:39:21 grant-03 rgmanager[6866]: State change: grant-02 UP
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 rgmanager[6866]: Shutting down
Aug 19 15:41:33 grant-03 dlm_controld[2046]: cpg_mcast_joined error 12 handle 7d5e18f800000001 start
Aug 19 15:41:33 grant-03 rgmanager[6866]: Disconnecting from CMAN
Aug 19 15:41:33 grant-03 rgmanager[6866]: Exiting

[root@grant-03 ~]# rpm -qa | grep corosync
corosynclib-1.4.1-1.el6.jf.1.x86_64
corosync-1.4.1-1.el6.jf.1.x86_64

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.