Description of problem:

Two-node RHCS setup. Both nodes are running RHCS and serving GFS. Every time one of the nodes powers down for a reboot, we see the following:

openais[6111]: [TOTEM] The consensus timeout expired.
openais[6111]: [TOTEM] entering GATHER state from 3.

The node never shuts down; this output is printed forever. The only way to get the node back up is to physically power it off and back on.

Version-Release number of selected component (if applicable):

Kernel Release: 2.6.18-194.el5
RHEL Release: Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Version: Linux version 2.6.18-194.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Tue Mar 16 22:03:12 EDT 2010
Platform: ppc64

How reproducible:
Often.

Steps to Reproduce:
1. Two RHEL 5.5 PPC servers.
2. Connected to two storage arrays.
3. Created 4 volumes from each storage array and mapped them to the cluster group of the two servers.
4. Installed RHCS on both servers.
5. Set up RHCS, also using GFS.
6. Reboot one of the nodes.

Actual results:
The node repeatedly reports the following messages during the shutdown sequence:

openais[6111]: [TOTEM] The consensus timeout expired.
openais[6111]: [TOTEM] entering GATHER state from 3.
openais[6111]: [TOTEM] The consensus timeout expired.
openais[6111]: [TOTEM] entering GATHER state from 3.

Expected results:
The node shuts down and boots back up without any issue.

Additional info:

[root@tsunami ~]# chkconfig --list | egrep "clvmd|gfs |rgmanage|cman"
clvmd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
cman            0:off   1:off   2:on    3:on    4:on    5:on    6:off
gfs             0:off   1:off   2:on    3:on    4:on    5:on    6:off
rgmanager       0:off   1:off   2:on    3:on    4:on    5:on    6:off

[root@tsunami ~]# service gfs status
Configured GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7
Active GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7

[root@tsunami ~]# service cman status
cman is running.

[root@washuu testutils]# chkconfig --list | egrep "clvmd|gfs |rgmanage|cman"
clvmd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
cman            0:off   1:off   2:on    3:on    4:on    5:on    6:off
gfs             0:off   1:off   2:on    3:on    4:on    5:on    6:off
rgmanager       0:off   1:off   2:on    3:on    4:on    5:on    6:off

[root@washuu testutils]# service gfs status
Configured GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7
Active GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7

[root@washuu testutils]# service cman status
cman is running.
[root@tsunami ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="washuu-tsunami" config_version="4" name="washuu-tsunami">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="washuu" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="Persistent_Reserve" node="washuu"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="tsunami" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="Persistent_Reserve" node="tsunami"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="Persistent_Reserve"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="tsunami1" ordered="1" restricted="0">
                                <failoverdomainnode name="washuu" priority="2"/>
                                <failoverdomainnode name="tsunami" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="washuu1" ordered="1">
                                <failoverdomainnode name="washuu" priority="1"/>
                                <failoverdomainnode name="tsunami" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="172.22.229.160" monitor_link="1"/>
                        <ip address="172.22.229.165" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="tsunami1" exclusive="0" name="service-172.22.229.160" recovery="relocate">
                        <ip ref="172.22.229.160"/>
                </service>
                <service autostart="1" domain="washuu1" exclusive="0" name="service-172.22.229.165" recovery="relocate">
                        <ip ref="172.22.229.165"/>
                </service>
        </rm>
</cluster>

Console output during shutdown:

The system is going down for reboot NOW!
INIT: Sending processes the TERM signal
Shutting down Cluster Module - cluster monitor: [ OK ]
Shutting down Cluster Service Manager...
Waiting for services to stop: [ OK ]
Cluster Service Manager is stopped.
Shutting down ricci: [ OK ]
Shutting down smartd: [ OK ]
[ OK ]
down CIM server: [ OK ]
Shutting down Avahi daemon: [ OK ]
Shutting down oddjobd: [ OK ]
Stopping yum-updatesd: [ OK ]
Stopping anacron: [ OK ]
Stopping atd: [ OK ]
Stopping saslauthd: [ OK ]
Stopping cups: [ OK ]
Stopping hpiod: [ OK ]
Stopping hpssd: [ OK ]
Shutting down xfs: [ OK ]
Shutting down console mouse services: [ OK ]
Shutting down NFS mountd: [ OK ]
Shutting down NFS daemon: nfsd: last server has exited
nfsd: unexporting all filesystems
[ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]
Stopping sshd: [ OK ]
Shutting down sm-client: [ OK ]
Shutting down sendmail: [ OK ]
Shutting down vsftpd: [ OK ]
Stopping xinetd: [ OK ]
Stopping crond: [ OK ]
Stopping autofs: Stopping automount: [ OK ]
[ OK ]
Deactivating VG lvm_vg: Can't deactivate volume group "lvm_vg" with 8 open logical volume(s) [FAILED]
Unmounting GFS filesystems: [ OK ]
Stopping HAL daemon: [ OK ]
Unmounting NFS filesystems: [ OK ]
Stopping cluster:
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy [FAILED]
Shutting down fcauthd [ OK ]
Stopping system message bus: [ OK ]
Stopping RPC idmapd: [ OK ]
Stopping NFS statd: [ OK ]
Stopping portmap: [ OK ]
Stopping auditd: audit(1273785255.045:142): audit_pid=0 old=2562 by auid=4294967295
[ OK ]
Stopping PC/SC smart card daemon (pcscd): [ OK ]
Shutting down kernel logger: [ OK ]
Shutting down system logger: [ OK ]
Shutting down hidd: [ OK ]
[ OK ]
Bluetooth services: [ OK ]
Shutting down interface eth0: ehea: eth0: Logical port down
ehea: eth0: Physical port up
ehea: External switch port is backup port
[ OK ]
Shutting down loopback interface: [ OK ]
Starting killall:
openais[6111]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[6111]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
openais[6111]: [TOTEM] Transmit multicast socket send buffer size (258048 bytes).
openais[6111]: [TOTEM] The network interface is down.
openais[6111]: [TOTEM] entering GATHER state from 15.
openais[6111]: [TOTEM] entering GATHER state from 2.
openais[6111]: [TOTEM] entering GATHER state from 0.
openais[6111]: [TOTEM] The consensus timeout expired.
openais[6111]: [TOTEM] entering GATHER state from 3.
openais[6111]: [TOTEM] The consensus timeout expired.
openais[6111]: [TOTEM] entering GATHER state from 3.
Any updates?
Abdel - is this ppc specific, or can it be reproduced on other arches as well?
After some testing and log file analysis, I'm really NOT sure that this is an openais problem. It *can* be a problem in clvmd (Can't deactivate volume group "lvm_vg" with 8 open logical volume(s)), in the init scripts, or maybe in cman (Error leaving cluster: Device or resource busy). But the fact is that openais never receives the shutdown signal (cman is responsible for that). I'm reassigning the bug to cman. Chrissie, if you feel that this IS an openais problem, please reassign the bug back to me.
As Honza says, this all comes down to the fact that clvmd can't deactivate the logical volumes. Until that happens, there is a whole chain of dependent services that can't shut down. So the first thing to investigate is what is holding those volumes open. There is also the side issue that sending signals to openais does not shut it down.
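A minimal diagnostic sketch for that investigation (not from the original report; the lvm_vg name and the smashmnt mountpoints are taken from the console output and GFS status above, the rest is generic LVM/util-linux tooling):

# Which LVs in the VG are still held open ("Open" column > 0)?
dmsetup info -c | grep lvm_vg
lvs -o lv_name,vg_name,lv_attr lvm_vg

# Are the LVs still mounted (the GFS mountpoints from the report)?
mount | grep -e lvm_vg -e smashmnt

# Which processes hold a given device or mountpoint open?
fuser -vm /dev/lvm_vg/*
lsof +D /home/smashmnt0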
To answer the question about other architectures: we have also seen this issue on x64 and ia64.
Any updates?
The log messages from openais are not valid indicators of this failure. The actual failure is:

Stopping cman... failed

This means that openais does not shut down. When openais doesn't shut down and the network interfaces are later stopped, openais prints those messages to the log. Not an openais problem.
Wow.

[root@molly rc6.d]# ls -l *clvmd*
lrwxrwxrwx 1 root root 15 Jul  1 16:36 K74clvmd -> ../init.d/clvmd
[root@molly rc6.d]# chkconfig --del clvmd
[root@molly rc6.d]# ls -l *clvmd*
ls: *clvmd*: No such file or directory
[root@molly rc6.d]# grep chkconfig ../init.d/clvmd
# chkconfig: - 24 76
[root@molly rc6.d]# chkconfig --level 345 clvmd on
[root@molly rc6.d]# ls -l *clvmd*
lrwxrwxrwx 1 root root 15 Jul  1 16:40 K74clvmd -> ../init.d/clvmd
[root@molly rc6.d]#
Clvmd is stopping at 74 instead of 76 - at the same level as gfs/gfs2. It should be stopping at 76 (after gfs/gfs2).
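As a quick illustration of the ordering problem, the kill links and the declared priorities can be compared directly (a sketch assuming the standard SysV rc directory layout; /etc/init.d/gfs2 may not exist on every node):

# Kill scripts run in ascending numeric order at shutdown, so clvmd needs a
# higher K-number than gfs/gfs2 in order to stop after them.
ls -l /etc/rc6.d/ | egrep 'clvmd|gfs'

# Intended stop priorities declared in the initscript headers:
grep '^# chkconfig:' /etc/init.d/clvmd /etc/init.d/gfs /etc/init.d/gfs2 2>/dev/null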
Performing the following allows clvmd to stop at the right time (i.e. after gfs/gfs2 are unmounted by their respective initscripts); a scripted sketch of these steps follows below:

* Remove "Required-Stop: $local_fs" from /etc/init.d/clvmd
* chkconfig --del clvmd
* chkconfig --level 345 clvmd on

Changing component to lvm2-cluster.
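For convenience, the same workaround written out as a small script. It is only a sketch: it assumes clvmd's LSB header carries the "$local_fs" token on a line starting with "# Required-Stop:", and it should be reviewed before use.

#!/bin/bash
# Sketch of the workaround from the comment above (review before running).
set -e

# 1. Drop the $local_fs token from clvmd's Required-Stop header line
#    (keeps a backup in /etc/init.d/clvmd.bak).
sed -i.bak '/^# Required-Stop:/s/ \$local_fs//' /etc/init.d/clvmd

# 2. Remove and re-add the service so chkconfig recalculates the start/kill links.
chkconfig --del clvmd
chkconfig --level 345 clvmd on

# 3. clvmd should now stop at priority 76, i.e. after the gfs/gfs2 initscripts.
ls -l /etc/rc6.d/*clvmd*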
Steve Dake noticed that this works as expected on later releases. It turns out that /etc/init.d/netfs declares "Provides: $local_fs" on Red Hat Enterprise Linux 5, but not on later releases of Fedora or Red Hat Enterprise Linux 6 Beta. That declaration interferes with chkconfig, which reorders the initscripts based on the "Provides:" information at the top of each script.
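To make the interaction concrete: chkconfig resolves clvmd's "Required-Stop: $local_fs" against whatever script claims to provide $local_fs, so when netfs provides it on RHEL 5, clvmd's kill priority is recomputed relative to netfs instead of keeping the "# chkconfig: - 24 76" default. The header fragments below are illustrative assumptions, not copies of the shipped scripts; the grep at the end is how one could check what a given system actually declares.

# Hypothetical fragment of /etc/init.d/netfs on RHEL 5 (illustrative only):
#   ### BEGIN INIT INFO
#   # Provides: $local_fs $remote_fs
#   ### END INIT INFO
#
# Hypothetical fragment of /etc/init.d/clvmd (illustrative only):
#   # chkconfig: - 24 76
#   ### BEGIN INIT INFO
#   # Required-Stop: $local_fs
#   ### END INIT INFO

# Check what the scripts on a given system really declare:
grep -n '^# *\(Provides\|Required-Stop\|chkconfig\):' /etc/init.d/netfs /etc/init.d/clvmd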
Created attachment 428565 [details]
Example fix.

After applying this patch to /etc/init.d/netfs, clvmd will be stopped at level 76 as expected, which is after the gfs/gfs2 init scripts.
After applying the above patch to /etc/init.d/netfs, you must perform:

* chkconfig --del clvmd
* chkconfig --level 345 clvmd on

... in order to reset the links.
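To confirm that the links were regenerated as intended (a sketch assuming the standard SysV rc directory layout; the expected K76 number comes from clvmd's "# chkconfig: - 24 76" header quoted earlier):

ls -l /etc/rc6.d/ | egrep 'clvmd|gfs'
# The gfs/gfs2 kill links should now sort before the clvmd one, e.g.:
#   K74gfs   -> ../init.d/gfs
#   K76clvmd -> ../init.d/clvmd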
But how did chkconfig choose to set them at the same level if there's a dependency between them? Is the gfs script missing 'Required-Stop: clvmd' (and Required-Start, too)? And is clvmd missing a dependency on cman?
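For reference, the kind of explicit dependencies being asked about would look roughly like the following; these header fragments are hypothetical, not taken from the shipped gfs or clvmd scripts:

# Hypothetical /etc/init.d/gfs LSB header with an explicit clvmd dependency
# (Required-Stop means clvmd must still be running when gfs stops,
#  i.e. gfs is stopped before clvmd):
### BEGIN INIT INFO
# Provides: gfs
# Required-Start: clvmd
# Required-Stop: clvmd
### END INIT INFO

# Hypothetical /etc/init.d/clvmd header with an explicit cman dependency:
### BEGIN INIT INFO
# Provides: clvmd
# Required-Start: cman
# Required-Stop: cman
### END INIT INFO

After a chkconfig --del/--add cycle, the resolver should then order the stop sequence gfs, clvmd, cman regardless of the numeric defaults.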
I'd agree... changing this in initscripts seems fishy, not least because some application may rely on that Provides entry in RHEL 5.
*** This bug has been marked as a duplicate of bug 588903 ***
You're right, Bill. Turns out it was a regression in lvm2-cluster.