Bug 592722 - when Services Initialize, clurgmgrd: <err> nfsclient:Client_1 is missing
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: ia64 Linux
Priority: low    Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2010-05-16 10:43 EDT by yizeng
Modified: 2016-04-26 21:22 EDT
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-07-14 08:50:45 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description yizeng 2010-05-16 10:43:54 EDT
Description of problem:
The Red Hat cluster has two nodes and uses GFS on the attached storage; each host has two paths to the storage. One path is made to fail, and the other path's status then flaps between active and failure twice (active -> failure -> active -> failure -> active). The node the service is running on loses quorum, kills and fences the other node, and after regaining quorum it initializes the services. At that point clurgmgrd reports an error, "clurgmgrd: [5690]: <err> nfsclient:Client_1 is missing!", and the service fails to start.
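
For reference, the service configuration implied by the logs would look roughly like the cluster.conf sketch below. The names (Client_1, new_service, JB10106_1, /dev/mapper/path2) and the export options come from the log messages, but the overall layout, the fsid, and attributes such as autostart are assumptions, not the reporter's actual configuration:

<rm>
        <resources>
                <clusterfs name="JB10106_1" device="/dev/mapper/path2" mountpoint="/JB10106_1" fstype="gfs" options="rw"/>
                <nfsexport name="Export_1"/>
                <nfsclient name="Client_1" target="*" options="fsid=16632,no_root_squash,rw,sync"/>
        </resources>
        <service name="new_service" autostart="1">
                <clusterfs ref="JB10106_1">
                        <nfsexport ref="Export_1">
                                <nfsclient ref="Client_1"/>
                        </nfsexport>
                </clusterfs>
        </service>
</rm>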

Version-Release number of selected component (if applicable):
#uname -a
2.6.9-67.EL #1 SMP Wed Nov 7 13:43:35 EST 2007 ia64 ia64 ia64 GNU/Linux


How reproducible:


Steps to Reproduce:
1. Make one path fail (one way to simulate this is sketched after these steps).
2. Flap the other path's status between active and failure.
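
A minimal sketch of one way to induce the path failure on a dm-multipath setup like the one in the logs; the report does not state how the failure was actually triggered, and sdt is just one of the path devices that appears in the log:

# Hypothetical reproduction sketch; the reporter's actual method is not stated.
echo offline > /sys/block/sdt/device/state    # take one path's SCSI device offline
multipath -ll                                 # confirm the path is now reported as failed
echo running > /sys/block/sdt/device/state    # bring the path back; repeat to flap it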
  
Actual results:
node2 left the cluster after node1 evicted it, and it was then fenced by node1.
node1 lost quorum but then regained it and initialized the services; however, clurgmgrd reported an error.

The log is below:
May 13 21:40:20 archprx26k46328 qdiskd[4921]: <notice> Writing eviction notice for node 2
May 13 21:40:20 archprx26k46328 kernel: CMAN: removing node archprx26k13319 from the cluster : Killed by another node
May 13 21:40:21 archprx26k46328 kernel: Device sdt not ready.
May 13 21:40:21 archprx26k46328 kernel: Device sdx not ready.
May 13 21:40:50 archprx26k46328 multipathd: error calling out /sbin/mpath_prio_alua /dev/sds
May 13 21:40:50 archprx26k46328 qdiskd[4921]: <notice> Node 2 evicted
May 13 21:40:50 archprx26k46328 kernel: CMAN: quorum regained, resuming activity
May 13 21:40:50 archprx26k46328 ccsd[4780]: Cluster is quorate.  Allowing connections.
May 13 21:40:51 archprx26k46328 kernel: Device sdt not ready.
May 13 21:40:51 archprx26k46328 kernel: Device sdx not ready.
May 13 21:40:51 archprx26k46328 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdt
May 13 21:43:23 archprx26k46328 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdv
May 13 21:43:23 archprx26k46328 kernel: Device sdv not ready.
May 13 21:43:26 archprx26k46328 fenced[4934]: archprx26k13319 not a cluster member after 180 sec post_fail_delay
May 13 21:43:26 archprx26k46328 fenced[4934]: fencing node "archprx26k13319"
May 13 21:43:26 archprx26k46328 kernel: Device sdaa not ready.
May 13 21:43:26 archprx26k46328 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaa
May 13 21:43:34 archprx26k46328 fenced[4934]: fence "archprx26k13319" success
May 13 21:43:35 archprx26k46328 kernel: Device sdab not ready.
May 13 21:43:35 archprx26k46328 kernel: Device sdz not ready.
May 13 21:43:35 archprx26k46328 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdab
May 13 21:43:35 archprx26k46328 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdz
May 13 21:43:36 archprx26k46328 clurgmgrd[5690]: <info> Magma Event: Membership Change
May 13 21:43:36 archprx26k46328 clurgmgrd[5690]: <info> State change: archprx26k13319 DOWN
May 13 21:43:36 archprx26k46328 kernel: clurgmgrd(6539): unaligned access to 0x200000000107a864, ip=0x4000000000012221
May 13 21:43:55 archprx26k46328 clurgmgrd[5690]: <notice> Quorum Achieved
May 13 21:43:55 archprx26k46328 clurgmgrd[5690]: <info> Magma Event: Membership Change
May 13 21:43:55 archprx26k46328 clurgmgrd[5690]: <info> State change: Local UP
May 13 21:43:55 archprx26k46328 clurgmgrd[5690]: <info> Loading Service Data
May 13 21:43:55 archprx26k46328 clurgmgrd[5690]: <warning> Client timeout after new connection.
May 13 21:43:55 archprx26k46328 kernel: clurgmgrd(8942): unaligned access to 0x200000000107a864, ip=0x4000000000012221
May 13 21:43:55 archprx26k46328 kernel: clurgmgrd(8942): unaligned access to 0x200000000107a864, ip=0x4000000000012240
May 13 21:43:56 archprx26k46328 kernel: clurgmgrd(8942): unaligned access to 0x200000000107a86c, ip=0x4000000000012281
May 13 21:43:56 archprx26k46328 kernel: clurgmgrd(8942): unaligned access to 0x200000000107a86c, ip=0x40000000000122a0
May 13 21:43:56 archprx26k46328 clurgmgrd[5690]: <info> Initializing Services
May 13 21:43:56 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_16
May 13 21:43:56 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_15
May 13 21:43:56 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_14
May 13 21:43:56 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_13
May 13 21:43:56 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_12
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_11
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_10
May 13 21:43:57 archprx26k46328 kernel: Device sdaa not ready.
May 13 21:43:57 archprx26k46328 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdaa
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_9
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_8
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_7
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_6
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_5
May 13 21:43:57 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_4
May 13 21:43:58 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_3
May 13 21:43:58 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_2
May 13 21:43:58 archprx26k46328 clurgmgrd: [5690]: <info> Removing export: *:/JB10106_1
May 13 21:43:58 archprx26k46328 clurgmgrd[5690]: <info> Services Initialized
May 13 21:44:01 archprx26k46328 kernel: clurgmgrd(9792): unaligned access to 0x200000000107a864, ip=0x4000000000012221
May 13 21:44:01 archprx26k46328 kernel: clurgmgrd(9792): unaligned access to 0x200000000107a864, ip=0x4000000000012240
May 13 21:44:01 archprx26k46328 kernel: clurgmgrd(9792): unaligned access to 0x200000000107a86c, ip=0x4000000000012281
May 13 21:44:01 archprx26k46328 kernel: clurgmgrd(9792): unaligned access to 0x200000000107a86c, ip=0x40000000000122a0
May 13 21:44:05 archprx26k46328 clurgmgrd: [5690]: <err> nfsclient:Client_1 is missing!
May 13 21:44:05 archprx26k46328 clurgmgrd[5690]: <notice> status on nfsclient:Client_1 returned 1 (generic error)
May 13 21:44:05 archprx26k46328 bash: [9897]: <info> Removing export: *:/JB10106_1
May 13 21:44:05 archprx26k46328 bash: [9897]: <info> Adding export: *:/JB10106_1 (fsid=16632,no_root_squash,rw,sync)
May 13 21:44:05 archprx26k46328 clurgmgrd: [5690]: <err> nfsclient:Client_2 is missing!
........
May 13 21:44:11 archprx26k46328 clurgmgrd[5690]: <notice> Service new_service is recovering
May 13 21:44:11 archprx26k46328 clurgmgrd[5690]: <notice> Recovering failed service new_service
May 13 21:44:33 archprx26k46328 clurgmgrd[5689]: <crit> Watchdog: Daemon died, rebooting...



May 13 21:48:00 archprx26k46328 syslogd 1.4.1: restart.
May 13 21:48:00 archprx26k46328 syslog: syslogd startup succeeded
May 13 21:59:01 archprx26k46328 clurgmgrd[4915]: <info> Services Initialized
May 13 21:59:01 archprx26k46328 clurgmgrd[4915]: <info> Logged in SG "usrm::manager"
May 13 21:59:01 archprx26k46328 clurgmgrd[4915]: <info> Magma Event: Membership Change
May 13 21:59:01 archprx26k46328 clurgmgrd[4915]: <info> State change: Local UP
May 13 21:59:02 archprx26k46328 kernel: clurgmgrd(7468): unaligned access to 0x200000000107a864, ip=0x4000000000019d30
May 13 21:59:03 archprx26k46328 kernel: clurgmgrd(7468): unaligned access to 0x200000000107a86c, ip=0x4000000000019d50
May 13 21:59:03 archprx26k46328 kernel: clurgmgrd(7468): unaligned access to 0x200000000107a87c, ip=0x4000000000019db0
May 13 21:59:03 archprx26k46328 kernel: clurgmgrd(7468): unaligned access to 0x200000000107a864, ip=0x4000000000012221
May 13 21:59:04 archprx26k46328 clurgmgrd[4915]: <notice> Starting stopped service new_service
May 13 21:59:05 archprx26k46328 kernel: Lock_Harness 2.6.9-75.9 (built Nov  8 2007 15:09:08) installed
May 13 21:59:05 archprx26k46328 kernel: GFS 2.6.9-75.9 (built Nov  8 2007 15:09:58) installed
May 13 21:59:05 archprx26k46328 kernel: GFS: Trying to join cluster "lock_dlm", "new_cluster:JB10106_1"
May 13 21:59:05 archprx26k46328 kernel: Lock_DLM (built Nov  8 2007 15:09:15) installed
May 13 21:59:05 archprx26k46328 kernel: lock_dlm: fence domain not found; check fenced
May 13 21:59:05 archprx26k46328 kernel: GFS: can't mount proto = lock_dlm, table = new_cluster:JB10106_1, hostdata =
May 13 21:59:05 archprx26k46328 clurgmgrd: [4915]: <err> 'mount -t gfs -o rw /dev/mapper/path2 /JB10106_1' failed, error=32
May 13 21:59:05 archprx26k46328 clurgmgrd[4915]: <notice> start on clusterfs:JB10106_1 returned 2 (invalid argument(s))
May 13 21:59:05 archprx26k46328 clurgmgrd[4915]: <warning> #68: Failed to start new_service; return value: 1
May 13 21:59:05 archprx26k46328 clurgmgrd[4915]: <notice> Stopping service new_service
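
For context, the "nfsclient:Client_1 is missing!" error appears to be the nfsclient resource agent's status check failing to find its export in the kernel's export list; a quick manual check at that point would be something like the following (the export path comes from the log, and exactly what the agent looks for is an assumption):

exportfs -v | grep JB10106_1   # list the active NFS exports and look for the one reported missing
cat /var/lib/nfs/etab          # the export table that exportfs maintains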

Expected results:
The service starts successfully.


Additional info:
<fence_daemon post_fail_delay="180" post_join_delay="180"/>

<quorumd interval="30" tko="5" votes="1" label="disk1" device="/dev/mapper/path1" status_file="/tmp/">
                <heuristic program="ping 9.11.112.1 -c1 -t1" score="1" interval="2"/>
Comment 1 Lon Hohberger 2010-10-18 17:39:03 EDT
This must have slipped through the cracks.  What release of rgmanager did this occur on?
Comment 3 Perry Myers 2011-07-14 08:50:45 EDT
No add'l info from reporter in a long time, so closing this as INSUFFICIENT_DATA
