Description of problem:
I was running different cmirror operations with the cmirror_lock_stress test last night and had one of my nodes (taft-03) killed by qdisk (on taft-01).

Apr 15 14:28:38 taft-01 qdiskd[7551]: <notice> Writing eviction notice for node 3
Apr 15 14:28:38 taft-01 kernel: dlm: invalid h_nodeid 0 from 3 lockspace 10002
Apr 15 14:28:39 taft-01 qdiskd[7551]: <notice> Node 3 evicted

Apr 15 14:27:56 taft-03 openais[7443]: [TOTEM] Retransmit List: 285b43
Apr 15 14:28:11 taft-03 openais[7443]: [TOTEM] Retransmit List: 2863f7
Apr 15 14:28:36 taft-03 openais[7443]: [CMAN ] cman killed by node 1 because we were killed by cman_tool or other application
Apr 15 14:28:36 taft-03 dlm_controld[7476]: cluster is down, exiting
Apr 15 14:28:36 taft-03 clogd[7563]: cpg_dispatch failed: SA_AIS_ERR_LIBRARY

There were RX-ERRs on the network, but no more than normal and no more than on my other clusters on different networks, and I don't remember seeing this exact issue before.

[root@taft-02 ~]# netstat -i
Kernel Interface table
Iface       MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0      1500   0 244675073  41362      0      0  84496567      0      0      0 BMmRU
eth0       1500   0   1179872      0      0      0    469495      0      0      0 BMRU
eth2       1500   0  34491855     74      0      0  10562068      0      0      0 BMsRU
eth3       1500   0  30327558  16317      0      0  10562072      0      0      0 BMsRU
eth4       1500   0  32549337   8808      0      0  10562069      0      0      0 BMsRU
eth5       1500   0  37722207  12981      0      0  10562072      0      0      0 BMsRU
eth6       1500   0  26951635     59      0      0  10562074      0      0      0 BMsRU
eth7       1500   0  26646002    200      0      0  10562071      0      0      0 BMsRU
eth8       1500   0  26921660    582      0      0  10562070      0      0      0 BMsRU
eth9       1500   0  29064819   2341      0      0  10562071      0      0      0 BMsRU
lo        16436   0       105      0      0      0       105      0      0      0 LRU

[root@taft-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2009-04-14 11:19:28  /dev/disk/by-id/scsi-3600805f3000a05b0000000008e75000c-part1
   1   M  37144   2009-04-14 11:17:59  taft-01-bond
   2   M  37148   2009-04-14 11:18:38  taft-02-bond
   3   X  37152                        taft-03-bond
   4   M  37148   2009-04-14 11:18:38  taft-04-bond

I'll post the logs from the 4 nodes in this cluster.

Version-Release number of selected component (if applicable):
2.6.18-128.el5
cman-2.0.99-1.el5
openais-0.80.3-22.el5
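If it helps to correlate the RX-ERR counters with the eviction window, one simple way to sample them over time (the 10-second interval and log path here are arbitrary, not part of the original run):

[root@taft-02 ~]# while true; do date; netstat -i | grep -E '^(bond|eth)'; sleep 10; done >> /tmp/netstat-i.log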
Created attachment 339905 [details] log from taft-01
Created attachment 339906 [details] log from taft-02
Created attachment 339907 [details] log from taft-03
Created attachment 339908 [details] log from taft-04
FYI - I hit an issue today that appears very similar to this one while running the revolver test. It may have been a timing issue where one of the three killed nodes came back up faster than the one remaining node had sorted through the others leaving the cluster. I'll post the logs for this issue.

Scenario iteration 0.2 started at Fri Apr 17 10:56:49 CDT 2009
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum plus one
Those picked to face the revolver... taft-03-bond taft-02-bond taft-01-bond

Feeling lucky taft-03-bond? Well do ya? Go'head make my day...
Feeling lucky taft-02-bond? Well do ya? Go'head make my day...
Feeling lucky taft-01-bond? Well do ya? Go'head make my day...

Verifying nodes were removed from cluster
Verified taft-01-bond was removed on taft-04-bond
Verified taft-02-bond was removed on taft-04-bond
Verified taft-03-bond was removed on taft-04-bond

Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
still not all qarshable, sleeping another 10 seconds
All killed nodes are now qarshable

Verifying that recovery properly took place (on the nodes that stayed in the cluster)
checking that all of the cluster nodes are now/still cman members...
checking fence recovery (state of each service)...
checking dlm recovery (state of each service)...
checking gfs recovery (state of each service)...
checking gfs2 recovery (state of each service)...
checking fence recovery (node membership of each service)...
checking dlm recovery (node membership of each service)...
checking gfs recovery (node membership of each service)...
checking gfs2 recovery (node membership of each service)...

Verifying that clvmd was started properly on the dueler(s)
clvmd is not running on taft-01-bond
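For anyone reproducing this, a quick manual check of clvmd on the dueler (a sketch assuming the standard RHEL 5 init script and cman tools; this is not part of the revolver output above):

[root@taft-01 ~]# service clvmd status
[root@taft-01 ~]# group_tool ls

The second command shows which fence/dlm/gfs groups the node has (re)joined, which helps tell a clvmd startup failure apart from a node that never finished rejoining the cluster.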
Created attachment 340037 [details] new log from taft-01
Created attachment 340038 [details] new log from taft-02
Created attachment 340040 [details] new log from taft-03
Created attachment 340041 [details] new log from taft-04
Nate and I were chasing this on RHEL4 too -- using the deadline scheduler helped, but did not entirely resolve the issue. As a start, switch the I/O scheduler to the deadline scheduler; from there we can tune to make the cluster more flexible. I also have another fix for the 'undead' messages if you would like.
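For reference, a sketch of how to make that switch on RHEL 5 (the device name sdb and the exact kernel line are placeholders for whatever actually backs the mirrors on the taft nodes):

Per-device, at runtime:
  [root@taft-01 ~]# echo deadline > /sys/block/sdb/queue/scheduler
  [root@taft-01 ~]# cat /sys/block/sdb/queue/scheduler

System-wide and persistent across reboots, append elevator=deadline to the kernel line in /boot/grub/grub.conf, e.g.:
  kernel /vmlinuz-2.6.18-128.el5 ro <existing options> elevator=deadline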
It's important to note that Nate's cluster also has an MSA1000.
Ok, another person using a very different sort of array (EMC² Symmetrix) reported a similar problem on RHEL 5.3. At the time of eviction, I/O to the same array (though a different LUN) had very strange iostat numbers. For example:

avgqu-sz - 30780484020872.61  (that's not a typo)
await    - 5.03
svctm    - 74906.50

I think we need to cross-reference this data on one or both of the MSAs in use and see if we can reproduce this.
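To gather comparable numbers on the MSAs during a cmirror run, one option (the 5-second interval and output path are arbitrary, and this assumes the sysstat package is installed on the taft nodes):

[root@taft-01 ~]# iostat -xt 5 >> /tmp/iostat-$(hostname).log

The -t timestamps on each report can then be lined up against the qdiskd eviction messages in /var/log/messages.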
Created attachment 342375 [details] iostat numbers
See:
https://bugzilla.redhat.com/show_bug.cgi?id=490147#c9
https://bugzilla.redhat.com/show_bug.cgi?id=490147#c10
https://bugzilla.redhat.com/show_bug.cgi?id=490147#c11
*** Bug 514627 has been marked as a duplicate of this bug. ***