Bug 519159

Summary: gfs lock is reported on 2 out of 4 nodes in cluster.
Product: Red Hat Enterprise Linux 5
Component: openais
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Everett Bennett <everett.bennett>
Assignee: Jan Friesse <jfriesse>
QA Contact: Cluster QE <mspqa-list>
CC: adas, bmarzins, cluster-maint, edamato, jesse.marlin, John.Hadad, mwaite, nikb321, sdake, sghosh, stefan.hurwitz, swhiteho, teigland
Keywords: ZStream
Target Milestone: rc
Target Release: ---
Fixed In Version: openais-0.80.6-12.el5
Doc Type: Bug Fix
Clones: 525280, 557621 (view as bug list)
Bug Blocks: 554487, 554488, 557621, 557622
Last Closed: 2010-03-30 07:48:10 UTC
Attachments:
- Lockdump/dlm info from all 4 nodes in test cluster.
- group_tool dump plock data02 on ime1, 2, 3, and 4.
- Related to comment 18
- patch which may fix problem

Description Everett Bennett 2009-08-25 13:20:50 UTC
Description of problem:
-----------------------           
On a 4-node RHEL 5.3 cluster using GFS file systems, a file is reported as locked by different PIDs on different nodes, and none of those PIDs exist on any node. An application log file shows that at some point in the past (before a reboot), a process with PID 18480 on node 1 held a lock on the file. That lock was released when the node was fenced (rebooted), yet when the node came back up, GFS reported that a PID which ran on it before the reboot holds a lock on the file. Two nodes report that PID 3513 has a lock and two nodes report that PID 18480 has the lock.


Version-Release number of selected component (if applicable):

kmod-gfs-0.1.28-1.el5
gfs-utils-0.1.18-1.el5


How reproducible:


Steps to Reproduce:
1.  The 4 nodes are stable and the master service is on node 1, ime1.
2.  Start data collection and while running, 'fence node 1' .
3.  The master service relocates to node 2.
4.  After all 4 nodes are stable, 'fence node 2' .
5.  Again, the symptoms are manifested when all 4 nodes are stable.
  
Actual results:

Here is some supporting evidence:

NOTE: The lock_check program does not use any application library; should you need to examine
its source, we may be able to provide it to you for inspection.

----ime4----
ssh ime4 " /opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd"
File: "/var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd" locked by:
pid: 3513
type: UNKNOWN

1 of 1 files opened had locks set

----ime3----
ssh ime3 " /opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd"
File: "/var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd" locked by:
pid: 3513
type: UNKNOWN

1 of 1 files opened had locks set

----ime2----
ssh ime2 " /opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd"
File: "/var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd" locked by:
pid: 18480
type: WRITE

1 of 1 files opened had locks set

----ime1----
ssh ime1 " /opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd"
File: "/var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd" locked by:
pid: 18480
type: WRITE

1 of 1 files opened had locks set


----ime4----
ssh ime4 "lsof | grep 18480"
----ime3----
ssh ime3 "lsof | grep 18480"
----ime2----
ssh ime2 "lsof | grep 18480"
----ime1----
ssh ime1 "lsof | grep 18480"

----ime4----
ssh ime4 "lsof | grep 3513"
----ime3----
ssh ime3 "lsof | grep 3513"
----ime2----
ssh ime2 "lsof | grep 3513"
----ime1----
ssh ime1 "lsof | grep 3513"


----ime4----
ssh ime4 "ps -ef | grep -v grep | egrep "3513|18480""
----ime3----
ssh ime3 "ps -ef | grep -v grep | egrep "3513|18480""
----ime2----
ssh ime2 "ps -ef | grep -v grep | egrep "3513|18480""
----ime1----
ssh ime1 "ps -ef | grep -v grep | egrep "3513|18480""

Expected results:


Additional info:

Comment 1 Robert Peterson 2009-08-25 21:03:50 UTC
I need to get more information about this failure.  I don't know
what application is used, and I don't know if the locks you are
inquiring about are normal GFS file locks (i.e. open files) or
if the locks correspond to posix locks or flocks.

I would like to get both gfs and dlm lock dumps from a recreation
of the failure.  To get the GFS locks, please follow these
instructions:
http://sources.redhat.com/cluster/wiki/FAQ/LockManager#lock_dump_gfs
To get the DLM locks, please follow these instructions:
http://sources.redhat.com/cluster/wiki/FAQ/LockManager#lock_dump_dlm

I'd also like to either get the source for your lock_check
program or at least a detailed technical description of how it works.

Comment 2 Everett Bennett 2009-08-26 13:27:34 UTC
Created attachment 358714 [details]
Lockdump/dlm info from all 4 nodes in test cluster.

Here's the info you requested.  Let me know if you need anything or if I missed something.

Comment 3 Steve Whitehouse 2009-08-26 15:26:47 UTC
This seems to be related to fcntl locks, so adding Dave Teigland to the CC list.

Comment 4 Robert Peterson 2009-08-26 15:44:11 UTC
The GFS and DLM lock dumps you provided do not indicate any GFS
flocks held by any node for any purpose.  GFS flocks in the glock
dump would look something like this: Glock (6, 123456) (where
123456 is an inode number).  GFS flocks appearing in the DLM dump
would look something like this (assuming the same inode):
2c300c6 2 2a0002 6310 0 4c 10000 2 5 -1 0 0 24 "       6           1e240"
(The "      6 " in quotes at the end indicates the DLM lock is for
a GFS flock.)

So based on this information, I have a few more questions:
Are you sure that the flocks are for files on the GFS file system
and not some other file system, like the root file system?
For example, please make sure file
/var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd
is really on the GFS file system.

Also, please make sure the GFS file system is being mounted
without -o localflocks (which uses the local system flock interface
in place of GFS's cluster-wide flock interface).  Also, please
make sure that the lock dumps are taken at the time of failure.

Comment 5 Everett Bennett 2009-08-26 15:51:49 UTC
There is an symbolic link from /var/dcs_6.0_sp/db to
/dcs/data02/var_dcs_6.0_sp_db/. where /dcs/data02 is a gfs file system.
This is done on all 4 nodes.


ime1/root> ls -dl /var/dcs_6.0_sp/db
lrwxrwxrwx 1 cgi dev 29 Aug 20 09:04 /var/dcs_6.0_sp/db ->
/dcs/data02/var_dcs_6.0_sp_db


ime1/root> ls -ld /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd 
-rw-r--r-- 1 cgi dev 94200320 Aug 24 15:59
/var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd


ime1/root> ls -ld /dcs/data02/var_dcs_6.0_sp_db/dcs/dcs_filreg_file.ixd 
-rw-r--r-- 1 cgi dev 94200320 Aug 24 15:59
/dcs/data02/var_dcs_6.0_sp_db/dcs/dcs_filreg_file.ixd

ime1/root> df -k /var/dcs_6.0_sp/db/dcs/.
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vgdata02-lvol1
                     141788512  31392872 110395640  23% /dcs/data02

ime1/root> mount|grep data02
/dev/mapper/vgdata02-lvol1 on /dcs/data02 type gfs
(rw,hostdata=jid=1:id=327682:first=0)
ime1/root>

Comment 6 Everett Bennett 2009-08-26 18:15:07 UTC
Regarding this statement:

Also, please
make sure that the lock dumps are taken at the time of failure.

The cluster was left in its current state from the last test.

Any other suggestions on how to collect data for this test going forward?

It should be noted that we changed 'plock_ownership' to '0' (per the bug below)
for an issue we had with a 2-node gfs2 cluster.

https://bugzilla.redhat.com/show_bug.cgi?id=512799

Comment 7 David Teigland 2009-08-26 19:09:17 UTC
It sounds like you're probably using posix locks (plocks) rather than flocks.
All flocks on gfs appear in gfs and dlm lock dumps.
All plocks on gfs appear in "group_tool dump plocks <fsname>".
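
To make the distinction concrete, a minimal sketch (illustrative only, not taken from the application): flock() takes a BSD flock, which is what the gfs/dlm lock dumps show, while fcntl(F_SETLK/F_SETLKW) takes a posix record lock (plock), which is what "group_tool dump plocks <fsname>" shows. The path below is made up.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    /* Path is illustrative; any file on the gfs mount would do. */
    int fd = open("/dcs/data02/lock_demo", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* flock: whole-file lock, visible in the gfs glock dump (type 6)
       and in the dlm lock dump. */
    if (flock(fd, LOCK_EX) < 0) perror("flock");
    flock(fd, LOCK_UN);

    /* plock: byte-range posix lock, handled by gfs_controld and visible
       in "group_tool dump plocks <fsname>". */
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 1;
    fl.l_len = 1;
    if (fcntl(fd, F_SETLKW, &fl) < 0) perror("fcntl(F_SETLKW)");

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);

    close(fd);
    return 0;
}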

Comment 8 Everett Bennett 2009-08-26 19:14:30 UTC
Any suggestions what to do going forward?

Comment 9 David Teigland 2009-08-26 19:36:47 UTC
We'll need to correlate the plocks dumped from group_tool (from each node) with info from the application log and lock_check.

Comment 10 Everett Bennett 2009-08-26 19:38:13 UTC
Created attachment 358759 [details]
group_tool dump plock data02 on ime1, 2, 3, and 4.

Comment 11 David Teigland 2009-08-26 21:37:04 UTC
lock_check is probably just using fcntl/GETLK; its results match what gfs thinks the plock state is.
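
If so, the probe boils down to something like the sketch below (an assumption about how lock_check works, since I haven't seen its source): open the file and ask the kernel via F_GETLK whether a conflicting lock exists and which pid it has recorded for it.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 2;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 2; }

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;   /* probe for anything that would block a write lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;          /* whole file */

    if (fcntl(fd, F_GETLK, &fl) < 0) {
        perror("fcntl(F_GETLK)");
        close(fd);
        return 2;
    }

    if (fl.l_type == F_UNLCK) {
        printf("no conflicting lock\n");
    } else {
        /* l_pid is whatever the lock manager recorded; on gfs it may be a
           pid from another node, or a stale pid as discussed here. */
        printf("locked by pid %d (%s)\n", (int)fl.l_pid,
               fl.l_type == F_WRLCK ? "WRITE" : "READ");
    }

    close(fd);
    return 0;
}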

Was the cluster shut down when you changed from plock_ownership=1 to plock_ownership=0?

Are you by any chance trying to use plocks between threads?  posix locks only work properly between separate processes.  This is because threads have the same "owner" value, so locks cannot be distinguished among them by the fs.  (I've run into this before with an app IBM was trying to run on gfs.)
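
To illustrate the thread pitfall, here is a standalone sketch (not your application): both threads below "succeed" in taking the same write lock, because posix locks are owned by the process rather than by individual threads.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int fd;

static int try_write_lock(const char *who)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;
    int rc = fcntl(fd, F_SETLK, &fl);   /* non-blocking request */
    printf("%s: F_SETLK %s\n", who, rc == 0 ? "succeeded" : "failed (conflict)");
    return rc;
}

static void *thread_fn(void *arg)
{
    (void)arg;
    /* Between two processes this would fail; between threads it succeeds,
       because both share the same lock owner. */
    try_write_lock("thread 2");
    return NULL;
}

int main(void)
{
    fd = open("/tmp/plock_demo", O_RDWR | O_CREAT, 0644); /* path is illustrative */
    if (fd < 0) { perror("open"); return 1; }

    try_write_lock("thread 1");

    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);

    close(fd);
    return 0;
}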

Another thing to note is that the "pid" fields can be misleading.  gfs (and the kernel in general) use the owner value, e.g. ffff8107fb9ac3c0, for all real work.  The only reason they save the pid is to return in GETLK, even though it may not be accurate.  For processes that hold plocks and fork or clone, the process holding a lock can change, meaning the pid no longer matches the process holding the lock.

nodes 1 and 2 have matching plock state, and nodes 3 and 4 have matching plock state.  That's a curious pattern.  If the state is not being changed (app suspended or something), the dump from all nodes should be identical.  With a running app that's doing locking, the collected state can be mismatched since the dump commands are skewed in time across the nodes.  I can't say how much of the difference is due to this skew.  Collecting plock state a second time might help clear that up.

It appears that at least the locks on
/var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd, (inode 5194392) are in a steady, but inconsistent, state.

Nodes 1 and 2 think 1/18480/ffff810828f950c0 has a WR lock on it:
5194392 WR 1-1 nodeid 1 pid 18480 owner ffff810828f950c0

Nodes 3 and 4 think three different processes have RD locks on it:
5194392 RD 1-1 nodeid 1 pid 30219 owner ffff8107fb9ac3c0
5194392 RD 1-1 nodeid 1 pid 30227 owner ffff81082a4a80c0
5194392 RD 1-1 nodeid 1 pid 3513 owner ffff810826d6a680

I don't have any theories yet on how that could have happened.

Using a combination of ps -o pid,stat,cmd,wchan, strace, /proc/locks, do you know which processes are blocked waiting for plocks on which files? 

A couple more places where we might dig up some clues are:
- /proc/locks output from all nodes
- group_tool dump gfs <fsname> from all nodes

Comment 12 Jesse Marlin 2009-08-26 22:13:15 UTC
(In reply to comment #11)
> Are you by any chance trying to use plocks between threads?  posix locks only
> work properly between separate processes.  This is because threads have the
> same "owner" value, so locks cannot be distinguished among them by the fs. 
> (I've run into this before with an app IBM was trying to run on gfs.)
These are separate processes.  There are threads but a single process obtains a lock on the file and keeps it for the duration of its life.

> 
> Another thing to note is that the "pid" fields can be misleading.  gfs (and the
> kernel in general) use the owner value, e.g. ffff8107fb9ac3c0, for all real
> work.  The only reason they save the pid is to return in GETLK, even though it
> may not be accurate.  For processes that hold plocks and fork or clone, the
> process holding a lock can change, meaning the pid no longer matches the
> process holding the lock.

Here is the sequence of events that we believe led to this:

ime1 -> master
ime2 -> slave
ime3 -> slave
ime4 -> slave

We cut the power to ime1 and service failed over to ime2.  The application came up without problems.

ime1 ->  rebooting
ime2 ->  master
ime3 ->  slave accessing GFS
ime4 ->  slave accessing GFS

At this point we are processing pretty heavily on all nodes including ime1.  Then we pull the plug on ime2.  The service fails back to ime1, and now we see the problem.  The PIDs that ime1 and ime2 think have locks were the same PIDs from before the first failure test on ime1.  I was wondering if some sort of lock recovery took place when it became master a second time.

> nodes 1 and 2 have matching plock state, and nodes 3 and 4 have matching plock
> state.  That's a curious pattern.  If the state is not being changed (app
> suspended or something), the dump from all nodes should be identical.  With a
> running app that's doing locking, the collected state can be unmatching since
> the dump commands are skewed among the nodes.  I can't say how much of the
> difference is due to this skew.  Collecting plock state a second time might
> help clear that up.

We did not see the lock problem on 3 & 4 until we tried to move the service to node 4.  At that point they both started complaining about the lock, but from a different PID than nodes 1 & 2.

> I don't have any theories yet on how that could have happened.
> 
> Using a combination of ps -o pid,stat,cmd,wchan, strace, /proc/locks, do you
> know which processes are blocked waiting for plocks on which files? 

Our application checks to see if it can lock the file before it comes up, so at this point there are no processes that have locks on the file.

Comment 13 Everett Bennett 2009-08-27 14:36:17 UTC
To the question:

Was the cluster shut down when you changed from plock_ownership=1 to
plock_ownership=0?


Yes, this was changed some time ago.  All nodes were rebooted prior to testing.

Everett

Comment 14 David Teigland 2009-08-27 15:32:03 UTC
Thanks, I'm going to run some tests copying the main activities and events you described.  It does look related to recovery.

Is the master node the only node that acquires locks?  Do slaves do any file locking?  Is the service failover coordinated with any cluster or gfs events?
i.e. does the new master take over before/after gfs recovery completes?

Given this info we could possibly dump gfs plocks at certain points in the failover/recovery process to verify that all plocks from failed nodes have been purged.  e.g. if the new master takes over after gfs recovery completes, then a plock dump just before the new master begins should show no plocks from the old master.  Also, a 'group_tool dump gfs' at that same point should include a record of how many plocks have been purged from the failed master.

By your description, it sounds like the initial recovery and failover may be working properly (checks like those above could confirm), but problems begin at the second recovery where the master moves back to the original node.  We could again dump plocks at certain points of recovery to check that plocks don't exist that shouldn't.

Also, when a failed node comes back and mounts gfs, it would be useful to collect the 'group_tool dump gfs' log which will show us whether the plock state it's syncing is stale (we've had a problem like this before where the openais checkpoints used for syncing plocks are out of sync.)

Comment 15 Jesse Marlin 2009-08-28 21:50:02 UTC
(In reply to comment #14)
> Thanks, I'm going to run some tests copying the main activities and events you
> described.  It does look related to recovery.
> 
> Is the master node the only node that acquires locks?  Do slaves do any file
> locking?  Is the service failover coordinated with any cluster or gfs events?
> i.e. does the new master take over before/after gfs recovery completes?

Yes, the master is the only one holding the locks.  For the test we performed, we simply killed the power from the bladecenter.  I'm not sure about the recovery; we would wait until we were able to mount the GFS volume before starting the service, but whether recovery was still going on after the volume was mounted, I don't know.

We are going to try to reproduce this and collect the info you requested.

Comment 16 David Teigland 2009-08-31 13:55:14 UTC
So gfs is not mounted on all the nodes, but just on the master?  If so, why not use a local file system?

Comment 17 Jesse Marlin 2009-08-31 21:03:46 UTC
(In reply to comment #16)
> So gfs is not mounted on all the nodes, but just on the master?  If so, why not
> use a local file system?  

GFS is mounted on all nodes.  The master delegates writes for some databases but not all.  For databases that are not handled by the master we use GFS so that we can share the files across nodes.

We see problems in both cases: where just the master held the lock, and where potentially any of the slave nodes could have held a lock on a file.

The alternative to what we are doing would be to use local disk and NFS, which we are trying to get away from.

Would it be helpful to provide you with external access to the cluster that is having the problem? If so, we can try to coordinate something.

Thanks

Comment 18 John Hadad 2009-09-03 00:42:33 UTC
I have duplicated the problem and collected the information you
requested at the points you requested. I used a script (gfs_cmd.sh) to
collect the information on all nodes and dumped the result into files
named RH_info.n

The script and the files are attached (bugzilla_515159_JH_1.tar.gz).
I attached the script b/c it will make it easier for you to understand
the content of RH_info.n if you have the script.
You will probably be most interested in RH_info.10 and RH_info.11 but I
included all of them in case it is of some benefit to you.


Here is what I did.

Master running on ime1
Start data processing
RH_info.1
Start data copy
(A) Fence reboot ime1
Wait till ime1 is offline
RH_info.2
Wait a couple of minutes (master service starting on ime2)
RH_info.3
Wait till ime1 is back online and slave cluster service started
RH_info.4
Start data copy
(B) Fence reboot ime2
RH_info.5 (before master shifted to ime1)
Wait till after master shifted to ime1 and application started
RH_info.6
Let data processing resume
Wait till ime2 is back online and slave cluster service started
RH_info.7
Start data copy
(C) Fence off ime1
before master shifted to ime2
RH_info.8
Wait till after master shifted to ime2 and application started
RH_info.9
Wait for data processing to resume
Wait till ime1 is back online and slave cluster service started
Run ls -l on the GFS dir (/dcs/data02/) and while ls was running
(D) Fence off ime2
RH_info.10 (before master shifted to ime1)
Wait till after master shifted to ime1 (application failed to start due to file locks)
RH_info.11


Our application failed to start due to locks that were obtained on ime1
by a process that ran on ime1 prior to "(C) fence off ime1" (the old
PIDs are logged in one of our application logs). As you can see from the
above, the locks showed up after ime1 came up from a reboot.
Here is a list of the locked files (all of which are on the GFS mount /dcs/data02/):

FILE UIO_DB_PATH:uio_maint_values_db LOCKED!
FILE EAL_DB_PATH:eal_lang_english LOCKED!
FILE RES_DB_PATH:res_res_type_file LOCKED!
FILE RES_DB_PATH:res_svc_node_file LOCKED!
FILE RES_DB_PATH:res_res_class_file LOCKED!
FILE RES_DB_PATH:res_svc_type_file LOCKED!
FILE RES_DB_PATH:res_res_def_file LOCKED!
FILE RES_DB_PATH:res_svc_inst_file LOCKED!
FILE RES_DB_PATH:res_svc_unit_file LOCKED!
FILE DCS_DB_PATH:dcs_audit_file LOCKED!
FILE DCS_DB_PATH:dcs_custeq_file LOCKED!
FILE DCS_DB_PATH:dcs_dispatch_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_fai_params_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_input_ctx_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_shared_ctx LOCKED!
FILE DCS_DB_PATH:dcs_dportal_output_ctx_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_root_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_routing_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_state_file LOCKED!
FILE DCS_DB_PATH:dcs_filreg_file LOCKED!
FILE DCS_DB_PATH:dcs_input_volume_file LOCKED!
FILE DCS_DB_PATH:dcs_plan_rtg_file LOCKED!
FILE DCS_DB_PATH:dcs_rtgreq_file LOCKED!
FILE DCS_DB_PATH:dcs_oper_file LOCKED!
FILE DCS_DB_PATH:dcs_oper_portal_file LOCKED!
FILE DCS_DB_PATH:dcs_plan_file LOCKED!
FILE EPS_DB_PATH:eps_filreg_file_db LOCKED!
FILE EPS_DB_PATH:eps_filreg_code_db LOCKED!
FILE DRD_DB_PATH:drd_isam_file LOCKED!
FILE DRD_DB_PATH:drd_proc_file LOCKED!
FILE DRD_DB_PATH:drd_area_file LOCKED!
FILE DRD_DB_PATH:drd_state_file LOCKED!
FILE PRA_DB_PATH:pra_assembled_file_db LOCKED!
FILE PRA_DB_PATH:pra_state_db LOCKED!
FILE PRA_DB_PATH:pra_sys_parms_db LOCKED!


Here is a list of the PIDs that have a lock on the above files:

READ locked by pid: 30697 Start: 0x1 Length: 0x1
READ locked by pid: 30840 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
WRITE locked by pid: 30721 Start: 0x1 Length: 0x1
WRITE locked by pid: 30747 Start: 0x1 Length: 0x1
WRITE locked by pid: 30744 Start: 0x1 Length: 0x1
WRITE locked by pid: 30745 Start: 0x1 Length: 0x1
READ locked by pid: 31112 Start: 0x1 Length: 0x1
WRITE locked by pid: 30723 Start: 0x1 Length: 0x1
WRITE locked by pid: 30746 Start: 0x1 Length: 0x1
WRITE locked by pid: 30724 Start: 0x1 Length: 0x1
WRITE locked by pid: 30722 Start: 0x1 Length: 0x1
WRITE locked by pid: 30720 Start: 0x1 Length: 0x1
WRITE locked by pid: 30729 Start: 0x1 Length: 0x1
WRITE locked by pid: 30718 Start: 0x1 Length: 0x1
WRITE locked by pid: 30725 Start: 0x1 Length: 0x1
WRITE locked by pid: 30719 Start: 0x1 Length: 0x1
WRITE locked by pid: 30734 Start: 0x1 Length: 0x1
WRITE locked by pid: 30735 Start: 0x1 Length: 0x1
WRITE locked by pid: 30728 Start: 0x1 Length: 0x1
WRITE locked by pid: 30741 Start: 0x1 Length: 0x1
WRITE locked by pid: 30740 Start: 0x1 Length: 0x1
WRITE locked by pid: 30736 Start: 0x1 Length: 0x1
WRITE locked by pid: 30738 Start: 0x1 Length: 0x1
WRITE locked by pid: 30739 Start: 0x1 Length: 0x1
WRITE locked by pid: 30737 Start: 0x1 Length: 0x1
WRITE locked by pid: 30730 Start: 0x1 Length: 0x1
WRITE locked by pid: 30731 Start: 0x1 Length: 0x1
WRITE locked by pid: 30732 Start: 0x1 Length: 0x1

I ran "ps -ef" on all nodes looking for the above PIDs and they don't exist.

----ime4----
ssh ime4 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
----ime3----
ssh ime3 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
----ime2----
ssh ime2 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
----ime1----
ssh ime1 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
ime1/cgi>



I ran our lock_check program (source provided to you earlier as
attachment to Comment #2) on a couple of the files on all nodes in the
cluster. ime1 and ime2 agree on the locking PID but ime3 and ime4
claim the files are not locked. This happened before as well (described in
Comment #12). I did not yet relocate the master in case you need more
info before I do (i.e. the master is still on ime1).


----ime4----
ssh ime4 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"

0 of 1 files opened had locks set

----ime3----
ssh ime3 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"

0 of 1 files opened had locks set

----ime2----
ssh ime2 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"
File: "/var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd" locked by:
pid: 30697
type: UNKNOWN

1 of 1 files opened had locks set

----ime1----
ssh ime1 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"
File: "/var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd" locked by:
pid: 30697
type: UNKNOWN

1 of 1 files opened had locks set






----ime4----
ssh ime4 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"

0 of 1 files opened had locks set

----ime3----
ssh ime3 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"

0 of 1 files opened had locks set

----ime2----
ssh ime2 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"
File: "/var/dcs_6.0_sp/db/eal/eal_lang_english.ixd" locked by:
pid: 30840
type: UNKNOWN

1 of 1 files opened had locks set

----ime1----
ssh ime1 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"
File: "/var/dcs_6.0_sp/db/eal/eal_lang_english.ixd" locked by:
pid: 30840
type: UNKNOWN

1 of 1 files opened had locks set

Thanks!

John Hadad

Comment 19 John Hadad 2009-09-03 00:58:11 UTC
Created attachment 359610 [details]
Related to comment 18

NOTE: I have to exclude RH_info.1, 2 and 3 b/c the file size exceeded your 20 MB limit on attachments.

Attachment contains:

gfs_tool lockdump
cat /proc/locks
group_tool dump gfs 
group_tool dump plocks 
cat /debug/dlm/data02
cat /debug/dlm/data02_locks

Comment 20 David Teigland 2009-09-10 21:37:44 UTC
This looks like an openais checkpoint bug.  gfs_controld is reading old, stale plock state from an old, stale, unlinked checkpoint instead of from the most recent checkpoint.  Analysis:

a. master 1, slave 2,3,4
b. kill 1
c. master 2, slave 3,4
d. master 2, slave 1,3,4
e. kill 2
f. master 1, slave 3,4
g. master 1, slave 2,3,4
h. kill 1
i. master 2, slave 3,4
j. master 2, slave 1,3,4
k. kill 2
l. master 1, slave 3,4  <-- problem with master starting on 1

In step l, 1 runs into old stale locks acquired by various pids on 1 before h.
These old stale locks are not seen on 3,4 at step l.

Possible explanations for old stale locks:
. the 1 locks were not purged by 2 in recovery for step h
  (but they *were* purged by 3,4?)
. 2 received and added locks from 1 after it purged 1 locks in
  recovery for step h
  (wouldn't 3,4 also do the same?)
. 1 read old stale locks from an old stale checkpoint when it rejoined
  and initialized plock state in j.  1 created a ckpt of all locks in g
  when 2 mounted.  Then 1 was killed in h.  Then when 1 mounts in j,
  2 should unlink the ckpt 1 created in g and create a new ckpt with the
  new locks.  1 opens/reads locks from ckpt.  If it gets the old ckpt
  content from g instead of new ckpt content then we'd see these symptoms.

We'd want to compare the ckpt content read by 2 in step g to the ckpt content
read by 1 in step j...

Extract ckpt data read by 2 in step g from RH_info.7
Extract ckpt data read by 1 in step j from RH_info.10
both match:

"r14548045"
"r14527465"
"r14527445"
"r14527463"
"r14527451"
"r14527454"
"r14527517"
"r14527459"
"r14062648"
"r14062659"
"r14540649"
"r14539655"
"r14539633"
"r14062974"
"r14539624"
"r14540652"
"r14540644"
"r14063230"
"r14548925"
"r12689671"
"r4949306"
"r14548938"
"r14539641"
"r14548049"
"r14062958"
"r14062956"
"r14062947"
"r14067021"
"r14067015"
"r14063231"
"r14067026"
"r14548041"
"r14063579"
"r14539649"
"r14539663"
"r14539668"
"r14539644"
"r5194392"
"r14539625"
"r14067018"
"r14066741"

Comment 21 Everett Bennett 2009-09-22 15:26:29 UTC
What do you recommend at this point?

Should we upgrade to RHEL 5.4 or apply some RHEL 5.3 errata patch and then test further?

Comment 22 Steven Dake 2009-09-22 17:48:03 UTC
5.4 fixes a bunch of bugs in 5.3 related to totem messages during recovery.  I don't know for sure whether it will fix the customer's problem.

Dave,

Is there a test case you have run that reproduces this issue?

Regards
-steve

Comment 23 Everett Bennett 2009-09-22 17:57:12 UTC
The information in this bug report reflects testing that has been done thus far
in our development environment.  Our test environment and the customer environment are as
identical as possible.  The customer site has exhibited similar issues, among
them a node faulting within a 5-node RHEL 5.3 cluster with the
following messages.  The node uipi was fenced for no obvious reason.  This
has apparently happened several times over the last couple of weeks.  Any
suggestions on how to proceed?

Sep 18 18:55:07 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8
Sep 18 18:55:10 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8
Sep 18 18:55:12 tururim clurgmgrd: [21417]: <info> Executing
/etc/cmcluster/imeRJBAslave_global_start_stop.sh status 
Sep 18 18:55:17 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8
Sep 18 18:55:20 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8
Sep 18 18:55:22 tururim clurgmgrd: [21417]: <info> Executing
/etc/cmcluster/imePRNMslave_global_start_stop.sh status 
Sep 18 18:55:27 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8
Sep 18 18:55:30 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8
Sep 18 18:55:32 tururim clurgmgrd: [21417]: <info> Executing
/etc/cmcluster/imeREPslave_global_start_stop.sh status 
Sep 18 18:55:37 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8
Sep 18 18:55:40 tururim xinetd[19689]: START: echo-dgram pid=0
from=10.128.236.8

Sep 18 14:11:03 tururim clurgmgrd[29977]: <notice> service:uipi is not runnable
- restricted domain offline

Sep 18 14:11:02 tururim kernel: dlm: closing connection to node 3
Sep 18 14:11:02 tururim TOP: 30642 cgi       16   0  222m  40m 3100 S  0.0  0.1
  0:01.15 drd_stream_area -ar
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] Saving state aru 43 high seq
received 43 
Sep 18 14:11:02 tururim fenced[18636]: uipi not a cluster member after 0 sec
post_fail_delay
Sep 18 14:11:02 tururim clurgmgrd[29977]: <info> State change: uipi DOWN 
Sep 18 14:11:02 tururim TOP: 31039 cgi       17   0 95160 2480 1228 S  0.0  0.0
  0:00.00 /opt/dcs_6.0_rjba/u
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] Storing new sequence id for
ring 55e0 
Sep 18 14:11:02 tururim fenced[18636]: fencing node "uipi"
Sep 18 14:11:02 tururim TOP: 31054 root      16   0 94176 3340 2584 S  0.0  0.0
  0:00.00 sshd: bacenter [pri
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] entering COMMIT state. 
Sep 18 14:11:02 tururim TOP: 31056 bacenter  15   0 94304 1924 1164 S  0.0  0.0
  0:03.90 sshd: bacenter@pts/
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] entering RECOVERY state. 
Sep 18 14:11:02 tururim TOP: 31057 bacenter  16   0 67044 2556 1184 S  0.0  0.0
  0:19.80 -bash              
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] position [0] member
10.128.11.68: 
Sep 18 14:11:02 tururim TOP: 31093 cgi       17   0 95160 2480 1228 S  0.0  0.0
  0:00.00 /opt/dcs_6.0_rjba/u
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] previous ring seq 21980 rep
10.128.11.68 


----tururim----
ssh tururim "sh /tmp/chk.sh"
/var/log
Sep 18 14:11:02 tururim fenced[18636]: uipi not a cluster member after 0 sec
post_fail_delay
Sep 18 14:11:02 tururim fenced[18636]: fencing node "uipi"
Sep 18 14:11:41 tururim fenced[18636]: fence "uipi" success
Sep 18 17:49:13 tururim fenced[18636]: uipi not a cluster member after 0 sec
post_fail_delay
Sep 18 17:49:14 tururim fenced[18636]: fencing node "uipi"

Sep 18 14:10:43 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:10:44 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] The token was lost in the
OPERATIONAL state. 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Receive multicast socket recv
buffer size (288000 bytes). 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Transmit multicast socket send
buffer size (288000 bytes). 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] entering GATHER state from 2. 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Saving state aru 48ad high seq
received 48ad 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Storing new sequence id for ring
55dc 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] entering COMMIT state. 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] entering RECOVERY state. 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [0] member 10.128.11.68: 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep
10.128.11.68 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad
received flag 1 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [1] member 10.128.11.69: 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep
10.128.11.68 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad
received flag 1 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [2] member 10.128.11.70: 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep
10.128.11.68 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [3] member 10.128.11.71: 
Sep 18 14:10:47 uipi gfs_controld[18675]: cpg_mcast_joined retry 100 MSG_PLOCK
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep
10.128.11.68 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad
received flag 1 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [4] member 10.128.11.72: 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep
10.128.11.68 
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad
received flag 1 
Sep 18 14:10:47 uipi gfs_controld[18675]: cpg_mcast_joined retry 200 MSG_PLOCK
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Did not need to originate any
messages in recovery. 
Sep 18 14:10:47 uipi openais[18644]: [CLM  ] CLM CONFIGURATION CHANGE 
Sep 18 14:10:47 uipi openais[18644]: [CLM  ] New Configuration: 
Sep 18 14:10:47 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.68)  
Sep 18 14:10:47 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.69)  
Sep 18 14:10:47 uipi gfs_controld[18675]: cpg_mcast_joined retry 300 MSG_PLOCK
Sep 18 14:10:47 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.70)  
Sep 18 14:10:47 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.71)  
Sep 18 14:10:48 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.72)  
Sep 18 14:10:48 uipi openais[18644]: [CLM  ] Members Left: 
Sep 18 14:10:48 uipi openais[18644]: [CLM  ] Members Joined: 
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 400 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [CLM  ] CLM CONFIGURATION CHANGE 
Sep 18 14:10:48 uipi openais[18644]: [CLM  ] New Configuration: 
Sep 18 14:10:48 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.68)  
Sep 18 14:10:48 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.69)  
Sep 18 14:10:48 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.70)  
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 500 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.71)  
Sep 18 14:10:48 uipi openais[18644]: [CLM  ]    r(0) ip(10.128.11.72)  
Sep 18 14:10:48 uipi openais[18644]: [CLM  ] Members Left: 
Sep 18 14:10:48 uipi openais[18644]: [CLM  ] Members Joined: 
Sep 18 14:10:48 uipi openais[18644]: [SYNC ] This node is within the primary
component and will provide service. 
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] entering OPERATIONAL state. 
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 600 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 3 4 5 6  
Sep 18 14:10:48 uipi last message repeated 4 times
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 14  
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 700 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 16 17 18 14 19  
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 16 1b 17 1c 18 14
19  
Sep 18 14:10:49 uipi gfs_controld[18675]: cpg_mcast_joined retry 800 MSG_PLOCK
Sep 18 14:10:49 uipi openais[18644]: [CLM  ] got nodejoin message 10.128.11.69 
Sep 18 14:10:49 uipi openais[18644]: [CLM  ] got nodejoin message 10.128.11.70 
Sep 18 14:10:49 uipi openais[18644]: [CLM  ] got nodejoin message 10.128.11.71 
Sep 18 14:10:49 uipi openais[18644]: [CLM  ] got nodejoin message 10.128.11.72 
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 1b 1c 18 19  
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 1b 1c  
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 1e 1f  
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 21 1e 1f  
Sep 18 14:10:49 uipi last message repeated 3 times
Sep 18 14:10:49 uipi gfs_controld[18675]: cpg_mcast_joined retry 900 MSG_PLOCK
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 21  
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 36 37 38 39 3a 3b
3c 3d 3e 3f 40 41 2f 30 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 31 32 33 34  
Sep 18 14:10:50 uipi dlm_controld[18669]: cluster is down, exiting
Sep 18 14:10:50 uipi kernel: dlm: closing connection to node 5
Sep 18 14:10:50 uipi fenced[18663]: cluster is down, exiting
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 4
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 3
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 2
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 1
Sep 18 14:10:51 uipi kernel: dlm: connect from non cluster node
Sep 18 14:10:52 uipi last message repeated 50 times
Sep 18 14:10:53 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:10:53 uipi kernel: dlm: connect from non cluster node
Sep 18 14:10:54 uipi last message repeated 32 times
Sep 18 14:10:54 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
ep 18 14:10:54 uipi kernel: dlm: connect from non cluster node
Sep 18 14:11:01 uipi last message repeated 35 times
Sep 18 14:11:03 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:11:04 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:11:07 uipi shutdown[8859]: shutting down for system halt
Sep 18 14:11:09 uipi rgmanager: [8917]: <notice> Shutting down Cluster Service
Manager... 
Sep 18 14:11:09 uipi clurgmgrd[11132]: <notice> Shutting down 
Sep 18 17:48:39 uipi syslogd 1.4.1: restart.
Sep 18 17:48:39 uipi kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 18 17:48:39 uipi kernel: ache: write through
Sep 18 17:48:39 uipi kernel:  sdma: unknown partition table

Comment 24 David Teigland 2009-09-22 18:15:50 UTC
I've not tried to reproduce this, but intec have reproduced and collected the data I needed.

Comment 25 Steven Dake 2009-09-22 18:18:54 UTC
This bug report has at least two separate issues.  First is the issue relating to the data analyzed in comment #20.  Second, you report nodes being fenced.

This usually indicates that either your network failed or openais failed.  5.3 may have some issues with segfaults.

I suggest checking /var/lib/openais for core files.  If there are any, install the openais-debuginfo package and run the following command:

gdb /usr/sbin/corosync /var/lib/openais/corefile.name
bt

This will give me a backtrace, and I can tell you whether the segfaults you see are known issues with 5.3 fixed by a later erratum.

Regards
-steve

Comment 26 Steven Dake 2009-09-22 18:22:38 UTC
Sorry, instead of corosync in comment #25, replace it with openais.

Comment 27 Everett Bennett 2009-09-22 18:56:57 UTC
Here's the trace.

uipi> ls -dl core*
-rw------- 1 root root 141422592 Aug 27 11:41 core.18617
-rw------- 1 root root 139124736 Sep 18 14:10 core.18644
uipi> file core*
core.18617: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
core.18644: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
uipi> type aisexec
aisexec is /usr/sbin/aisexec
uipi> gdb /usr/sbin/aisexec core.18644
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/libexec/lcrso/objdb.lcrso...
(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
Reading symbols from /usr/libexec/lcrso/service_cman.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_cman.lcrso
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...
(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
Reading symbols from /usr/libexec/lcrso/service_msg.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_msg.lcrso
Reading symbols from /usr/libexec/lcrso/service_lck.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_lck.lcrso
Reading symbols from /usr/libexec/lcrso/service_evt.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_evt.lcrso
Reading symbols from /usr/libexec/lcrso/service_ckpt.lcrso...
(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_ckpt.lcrso
Reading symbols from /usr/libexec/lcrso/service_amf.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_amf.lcrso
Reading symbols from /usr/libexec/lcrso/service_clm.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_clm.lcrso
Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
Reading symbols from /lib64/libgcc_s.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.
[New process 18644]
[New process 11137]
[New process 19619]
[New process 19611]
[New process 19581]
[New process 19573]
[New process 19543]
[New process 19535]
[New process 19505]
[New process 19497]
[New process 19467]
[New process 19459]
[New process 19429]
[New process 19421]
[New process 19391]
[New process 19383]
[New process 19353]
[New process 19345]
[New process 19315]
[New process 19307]
[New process 19277]
[New process 19269]
[New process 19239]
[New process 19231]
[New process 19201]
[New process 19193]
[New process 19163]
[New process 19155]
[New process 19122]
[New process 19114]
[New process 19084]
[New process 19076]
[New process 19049]
[New process 19030]
[New process 18726]
[New process 18683]
[New process 18680]
[New process 18679]
[New process 18657]
[New process 18645]
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
#1  0x0000003904031cc0 in abort () from /lib64/libc.so.6
#2  0x000000390406a7fb in __libc_message () from /lib64/libc.so.6
#3  0x00000039040e823f in __stack_chk_fail () from /lib64/libc.so.6
#4  0x0000000000411fb6 in ?? ()
#5  0x0000000000409c43 in rrp_deliver_fn ()
#6  0x00000000004080d6 in ?? ()
#7  0x0000000000405b60 in poll_run ()
#8  0x00000000004184f2 in main ()
(gdb)

Comment 28 Steven Dake 2009-09-22 19:03:54 UTC
Everett,

It appears openais-debuginfo package is not installed.  This provides the symbols for gdb to produce a useful backtrace output.  Could you install it and rerun the command?

Regards
-steve

Comment 29 Everett Bennett 2009-09-22 19:09:39 UTC
Dude, let me try again.... I installed openais-devel...oops.

Comment 30 Everett Bennett 2009-09-22 19:33:12 UTC
uipi> pwd
/var/lib/openais
uipi> ls -dl core*
-rw------- 1 root root 141422592 Aug 27 11:41 core.18617
-rw------- 1 root root 139124736 Sep 18 14:10 core.18644

uipi> pwd
/var/lib/openais
uipi> gdb /usr/sbin/aisexec /var/lib/openais/core.18644
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/libexec/lcrso/objdb.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/objdb.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
Reading symbols from /usr/libexec/lcrso/service_cman.lcrso...done.
Loaded symbols for /usr/libexec/lcrso/service_cman.lcrso
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cpg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cfg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
Reading symbols from /usr/libexec/lcrso/service_msg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_msg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_msg.lcrso
Reading symbols from /usr/libexec/lcrso/service_lck.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_lck.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_lck.lcrso
Reading symbols from /usr/libexec/lcrso/service_evt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evt.lcrso
Reading symbols from /usr/libexec/lcrso/service_ckpt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_ckpt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_ckpt.lcrso
Reading symbols from /usr/libexec/lcrso/service_amf.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_amf.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_amf.lcrso
Reading symbols from /usr/libexec/lcrso/service_clm.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_clm.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_clm.lcrso
Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evs.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.
[New process 18644]
[New process 11137]
[New process 19619]
[New process 19611]
[New process 19581]
[New process 19573]
[New process 19543]
[New process 19535]
[New process 19505]
[New process 19497]
[New process 19467]
[New process 19459]
[New process 19429]
[New process 19421]
[New process 19391]
[New process 19383]
[New process 19353]
[New process 19345]
[New process 19315]
[New process 19307]
[New process 19277]
[New process 19269]
[New process 19239]
[New process 19231]
[New process 19201]
[New process 19193]
[New process 19163]
[New process 19155]
[New process 19122]
[New process 19114]
[New process 19084]
[New process 19076]
[New process 19049]
[New process 19030]
[New process 18726]
[New process 18683]
[New process 18680]
[New process 18679]
[New process 18657]
[New process 18645]
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
#1  0x0000003904031cc0 in abort () from /lib64/libc.so.6
#2  0x000000390406a7fb in __libc_message () from /lib64/libc.so.6
#3  0x00000039040e823f in __stack_chk_fail () from /lib64/libc.so.6
#4  0x0000000000411fb6 in message_handler_orf_token (instance=0x2aaaaaaae010, msg=<value optimized out>, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3453
#5  0x0000000000409c43 in rrp_deliver_fn (context=0x0, msg=0x1722eef8, msg_len=1090) at totemrrp.c:1308
#6  0x00000000004080d6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, 
    data=0x1722e850) at totemnet.c:676
#7  0x0000000000405b60 in poll_run (handle=0) at aispoll.c:402
#8  0x00000000004184f2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:609


uipi> gdb /usr/sbin/aisexec /var/lib/openais/core.18617
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
...
...
...
[New process 18618]
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
#1  0x0000003904031cc0 in abort () from /lib64/libc.so.6
#2  0x000000390406a7fb in __libc_message () from /lib64/libc.so.6
#3  0x00000039040e823f in __stack_chk_fail () from /lib64/libc.so.6
#4  0x0000000000411fb6 in message_handler_orf_token (instance=0x2aaaaaaae010, msg=<value optimized out>, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3453
#5  0x0000000000409c43 in rrp_deliver_fn (context=0x0, msg=0x1fc8ef8, msg_len=1090) at totemrrp.c:1308
#6  0x00000000004080d6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, 
    data=0x1fc8850) at totemnet.c:676
#7  0x0000000000405b60 in poll_run (handle=0) at aispoll.c:402
#8  0x00000000004184f2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:609

Comment 31 Steven Dake 2009-09-22 20:03:43 UTC
Comment #30 is an unknown issue.  Please file a new defect against the openais component with the backtrace from comment #30.

Regards
-steve

Comment 32 Everett Bennett 2009-09-23 12:52:38 UTC
The cores appear to be dumped by aisexec.
Besides, I do not see an 'openais' binary.

uipi> pwd
/var/lib/openais
uipi> file core*
core.18617: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
core.18644: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
uipi> type aisexec
aisexec is /usr/sbin/aisexec


ibmblade05> find / -xdev -type f -name 'openais' -ls
11769055    4 -rwxr-xr-x   1 root     root         1077 Mar 27 15:49 /backup_ibmblade05/etc/rc.d/init.d/openais
12616268    4 -rwxr-xr-x   1 root     root         1077 Apr  9 17:07 /etc/rc.d/init.d/openais

ipi> find / /opt /usr/openv  -xdev -type f -name 'openais' -print
/etc/rc.d/init.d/openais
uipi> ps -eaf|grep aisexec
root     18609     1  4 Sep22 ?        00:53:42 aisexec
root     19925 14710  0 09:50 pts/4    00:00:00 grep aisexec

----tururim----
ssh tururim "ps -eaf|egrep "aisexec|openais""
root     11920 11895  0 09:52 ?        00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root     11938 11920  0 09:52 ?        00:00:00 egrep aisexec|openais
root     18609     1  5 Sep22 ?        01:06:45 aisexec
----turuturu----
ssh turuturu "ps -eaf|egrep "aisexec|openais""
root     13522 13506  0 09:52 ?        00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root     13543 13522  0 09:52 ?        00:00:00 egrep aisexec|openais
root     18621     1  4 Sep22 ?        00:54:34 aisexec
----uipi----
ssh uipi "ps -eaf|egrep "aisexec|openais""
root     18609     1  4 Sep22 ?        00:53:47 aisexec
root     23657 23466  0 09:52 pts/4    00:00:00 ssh uipi ps -eaf|egrep "aisexec|openais"
root     23662 23658  0 09:52 ?        00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root     23680 23662  0 09:52 ?        00:00:00 egrep aisexec|openais
----uiracu----
ssh uiracu "ps -eaf|egrep "aisexec|openais""
root     18606     1  4 Sep22 ?        00:55:47 aisexec
root     23413 23408  0 09:52 ?        00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root     23431 23413  0 09:52 ?        00:00:00 egrep aisexec|openais
----uirapuru----
ssh uirapuru "ps -eaf|egrep "aisexec|openais""
root      5099  5093  0 09:52 ?        00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root      5117  5099  0 09:52 ?        00:00:00 egrep aisexec|openais
root     18619     1  5 Sep22 ?        00:57:38 aisexec

Comment 33 Everett Bennett 2009-09-23 15:00:44 UTC
I am confused.
I do not see an 'openais' binary.
Does the above trace reflect known issues?

Do you still want a different bug report?

Also, what is the ETA on the issue regarding comment #24?

The customer is close to determining whether or not they will proceed
with using RHEL Clustering or move on to Veritas, etc.

Please advise on the next steps we should consider, e.g., upgrading to RHEL 5.4.

Comment 34 Nate Straz 2009-09-23 15:16:57 UTC
aisexec is the binary for openais.  You generated the openais backtraces correctly in comment #30.

Comment 35 Everett Bennett 2009-09-23 17:50:29 UTC
Does the above trace reflect known issues?
Do you still want a different bug report?
Also, what is the ETA on the issue regarding comment #24?
The customer is close to determining whether or not they will proceed
with using RHEL Clustering or move on to Veritas, etc.
Please advise on the next steps we should consider, e.g., upgrading to RHEL 5.4.

Comment 36 Steven Dake 2009-09-23 18:03:56 UTC
The segfault backtrace shown in comment #30 is a new (currently unknown) defect.  Please file a defect as per instructions in comment #31 against the component openais.

Comment #24 does not reflect a known issue.  In order for it to be fixed, either I have to try to duplicate your application in a separate test case, or use your application internally to duplicate the issue.  Using your application internally would be more effective.  If you can provide me with a copy of your application, that would be helpful.  Please contact me off-bugzilla at sdake for transfer instructions.

Comment 37 Everett Bennett 2009-09-23 19:38:26 UTC
The application is a bit cumbersome and I will not be able to transfer it to you.
If necessary, we could arrange access to our dev environment for you.

With regards to comment #24:

Comment #24 From  Dave Teigland (teigland)  2009-09-22 14:15:50 EDT   (-) [reply] -------

I've not tried to reproduce this, but intec have reproduced and collected the
data I needed.  


So, the data provided did not help?

I've cloned the ticket and included the trace info for aisexec.
The bug report is # 525280 .

Comment 38 Everett Bennett 2009-09-24 15:12:14 UTC
Are we going in circles or going nowhere?

Comment #24 From  Dave Teigland (teigland)  2009-09-22 14:15:50 EDT   (-) [reply] -------

I've not tried to reproduce this, but intec have reproduced and collected the
data I needed.

Comment 39 Steven Dake 2009-09-24 15:38:29 UTC
Everett,

We are not going in circles.

First, Dave has diagnosed there may be a checkpoint problem in comment #20.

In comment #30 I diagnosed a new unknown bug for which you cloned a bugzilla.

Jan Friesse will be taking over this bug and working full time on it now that it has been triaged properly.

Please arrange remote access for him to your development environment.  Once he duplicates the issue and can instrument the behavior, he can fix the problem.

Regards
-steve

Comment 40 Jan Friesse 2009-09-24 16:02:40 UTC
Everett,
I'm now assigned to this bug. First of all, I'm in the GMT+2 time zone, so communication
*may* be a little slower. Anyway, can you please try the following:
- Upgrade to 5.4. Does it help?
- If that is not a problem, try to give me access to your machines and maybe
some instructions on HOW to reproduce the bug (please contact me directly by mail,
jfriesse).

Comment 41 David Teigland 2009-09-24 16:05:51 UTC
We should try to reproduce this ourselves before trying to reproduce remotely;
this shouldn't require using their application.  It looks like a relatively
simple sequence of steps:

start with node1, node3, node4 all mounting the fs

node1 and node2 (when it mounts) both do plock operations on the fs during the
testing

repeat the following:
a. node2 mounts (node1 unlinks ckpt, node1 creates ckpt, node2 reads ckpt)
b. kill node1
c. node1 mounts (node2 unlinks ckpt, node2 creates ckpt, node1 reads ckpt)
d. kill node2

we need to check:
in step a that node2 reads the most recent ckpt created by node1
in step c that node1 reads the most recent ckpt created by node2

it seems that after mounting, a node reads an old/stale version of the ckpt

Comment 42 David Teigland 2009-09-24 16:44:41 UTC
One of the key parts to debugging this is verifying whether or not the new ckpt data was read versus the old ckpt data, i.e. detecting when the wrong ckpt data has been read.  Nothing may appear to be wrong if the old/wrong ckpt content is read.  gfs_controld blindly trusts that it is reading the correct ckpt data, we don't have a test that validates the plock state, and openais doesn't produce any errors when the wrong ckpt data is read.

Here are a couple of things I look at to debug ckpt data from outside of openais (and we may want to add more):

- "group_tool dump gfs" shows the ckpt sections that are read by a node when it mounts. Each section is for a separate inode/resource.  With the test running, these sections should be different each time.

-  "group_tool dump plocks <fsname>" shows the plock state from gfs_controld, which is initialized from the ckpt during mount

Steve may also have some tips for debugging ckpt state from within openais.
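
Another possible way to look at the ckpt data from outside of openais is a small client of the SAF AIS checkpoint API that openais provides.  The sketch below is illustrative only - the tool, the build command, and whatever checkpoint name gfs_controld actually uses are assumptions, not something that exists today - but it simply opens a named checkpoint read-only and prints every section it can iterate, so the set of sections seen on each node can be compared by hand:

/* ckpt_sections.c - hypothetical debugging aid (not part of openais or
 * gfs_controld): list the sections of a named checkpoint through the
 * SAF AIS checkpoint client API.
 * Assumed build: gcc ckpt_sections.c -o ckpt_sections -lSaCkpt */
#include <stdio.h>
#include <string.h>
#include <saAis.h>
#include <saCkpt.h>

int main(int argc, char **argv)
{
        SaCkptHandleT handle;
        SaCkptCheckpointHandleT ckpt;
        SaCkptSectionIterationHandleT iter;
        SaCkptSectionDescriptorT desc;
        SaVersionT version = { 'B', 1, 1 };
        SaNameT name;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <checkpoint-name>\n", argv[0]);
                return 1;
        }
        name.length = strlen(argv[1]);
        memcpy(name.value, argv[1], name.length);

        if (saCkptInitialize(&handle, NULL, &version) != SA_AIS_OK)
                return 1;
        if (saCkptCheckpointOpen(handle, &name, NULL,
                                 SA_CKPT_CHECKPOINT_READ,
                                 5000000000LL, &ckpt) != SA_AIS_OK)
                return 1;
        if (saCkptSectionIterationInitialize(ckpt, SA_CKPT_SECTIONS_ANY,
                                             0, &iter) != SA_AIS_OK)
                return 1;

        /* every section written for a plock resource should show up here;
         * a section that no node wrote any more is what we are hunting for */
        while (saCkptSectionIterationNext(iter, &desc) == SA_AIS_OK)
                printf("section idLen %u size %llu\n",
                       (unsigned) desc.sectionId.idLen,
                       (unsigned long long) desc.sectionSize);

        saCkptSectionIterationFinalize(iter);
        saCkptCheckpointClose(ckpt);
        saCkptFinalize(handle);
        return 0;
}

Run with aisexec up, once on each node, and diff the output across nodes and across mount cycles.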

Comment 43 Jan Friesse 2009-10-02 10:09:39 UTC
Status after roughly a week of testing (mostly spent waiting for nodes to reboot).

I was NOT able to reproduce the bug. I used the lock_load.c provided by Dave. Everything seems to be working perfectly, and valgrind didn't complain about anything. I tried to copy the situation as closely as possible, so I had:

- 3 nodes with a shared disk (n1, n2, n3)
- n3 continuously running lock_load.c to create load
- n1 running a "hacked" lock_load which takes a lock on one file and doesn't change it (I will call it APP; a rough stand-in sketch follows below)
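
A minimal sketch of what such an APP-style lock holder could look like (an assumed stand-in only - this is not Dave's lock_load.c and not the customer's tool): it takes one whole-file POSIX write lock, the kind of lock gfs_controld tracks as a plock, and then just holds it until the process is killed or the node is fenced.

/* hold_lock.c - assumed stand-in for the "hacked" lock_load (APP) above */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct flock fl = {
                .l_type   = F_WRLCK,    /* exclusive write lock */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,          /* 0 = lock the whole file */
        };
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file-on-gfs>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDWR);
        if (fd < 0 || fcntl(fd, F_SETLKW, &fl) < 0) {
                perror(argv[1]);
                return 1;
        }
        printf("pid %d holds a write lock on %s\n", (int) getpid(), argv[1]);
        for (;;)
                pause();        /* hold the lock until killed/fenced */
}

Started against a file on the GFS mount, the other nodes should report that pid as the lock owner until the holder goes away.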

:loop

- fence n1
- run APP on n2
- wait until n1 is back up (cman + gfs), check the ckpt sections on every node - everything looks the same
- fence n2
- run APP on n1
- wait until n2 is back up (cman + gfs)
- check the sections on every node - everything looks the same -> no inconsistency
goto :loop

To me it looks like (at the least) I'm not able to reproduce this. So Everett, can you please arrange access to your machines?

Comment 44 Everett Bennett 2009-10-02 13:14:14 UTC
I'll consult with my manager on how he would like to proceed.

Comment 45 Everett Bennett 2009-10-02 14:03:03 UTC
I have a VNC connection available that we can use.
If you call me at 404-705-2849, I will provide the info.
Also, we need to set up a time when you are available.

Comment 46 John Hadad 2009-10-02 16:46:26 UTC
Does the fact that we have an active RH support contract help us in increasing the priority with which this issue is worked on? If so, what do we need to do to take advantage of this?

Thanks.

Comment 47 Steven Dake 2009-10-02 17:01:27 UTC
John,

There is an engineer from the R&D development team working on this issue full time.  The priority is currently "high" with severity "urgent", which is why you have 100% engineering time from 1 dedicated engineer who is doing absolutely nothing else but looking at this issue.

Usually TAMs (technical account managers) interface between Red Hat support + engineering and customers or partners.  I don't know the process for obtaining a TAM, but they are helpful in giving your issues visibility with management.

Regards
-steve

Comment 48 Jan Friesse 2009-10-06 07:52:46 UTC
John,
as I said, I'm in UTC+2, so I'm able to call you any time between 9:00 and 18:00 UTC+2. I suggest 15:30 UTC (11:30 EDT). I'm not sure what TZ you are in, so please let me know.

Comment 49 John Hadad 2009-10-06 16:57:00 UTC
Jan, I am in the EST time zone and 15:00 today works for me.
I will send you a separate e-mail to go over logistics (connection details & test/problem duplication process).

Thanks!

John Hadad

Comment 50 John Hadad 2009-10-06 22:59:19 UTC
Looks like I got the times all wrong.
For some reason, the time that stuck in my mind was 15:00 EST rather than 11:30 EST. I apologize for the misunderstanding.

Can we set up to run a few tests tomorrow (Wed Oct 7) @ 11:30 EST?
If not, do you have another time window when you can be available for testing around the 11:30 EST time frame?

Comment 51 Jan Friesse 2009-10-07 12:01:57 UTC
John,
11:30 EST is really too late (11:30 E!!!D!!!T - i.e., 10:30 EST is about the latest reasonable time for me). I would prefer to communicate by mail. The only thing I really need is some kind of access (ideally ssh, because the network latency over such a long distance is not a big friend of VNC) and a few details about the environment (like where your application is, how it runs, ...) so that I can replicate the problem. Of course, my aim is to find the reason for the behavior and replicate it on my own nodes, but it's really hard to say how long that will take.

Comment 52 Everett Bennett 2009-10-07 17:07:05 UTC
Is it possible to get someone in the USA to work this issue?
This timezone thing is not going to work.

Comment 53 Everett Bennett 2009-10-07 17:51:02 UTC
I will need a public ssh key to allow support access.
Please email your 'id_dsa.pub' key, etc. to my account.

Comment 54 Jan Friesse 2009-10-19 16:21:41 UTC
Small update on this bug.

We managed a call and have started talking directly (not through BZ). Perry decided it would be a good thing to post an update here.

I'm currently still working on the issue. It seems very, very hard to debug, because even on the customer's machines the bug is not reproducible 100% of the time.

We found that the bug appears only on the node with the lowest id. It appears BEFORE the node with (node_with_lowest_id + 1), right after node_with_lowest_id boots up.

Currently, I'm trying to build a customized aisexec with more debug information, so I can confirm or rule out that this bug is caused by the aisexec ckpt service (exec part).

Comment 55 Jan Friesse 2009-10-21 15:34:33 UTC
Update from 2009-10-21

I finished a ckpt exec with more debug information and put it on the customer machine. I was able to reproduce the bug and gather some interesting information.

Sadly, it was not enough, so I'm currently working on a more enhanced version.

It looks like the bug is caused by uninitialized memory somewhere in ckpt, because section_iterate returns some sections which shouldn't exist and were not sent by the previous master node.

Comment 56 Steven Dake 2009-10-21 17:53:19 UTC
Honzaf,

Try the attached patch.

It prevents any of the iterator functions from executing during the synchronization process.

I had a look over the ckpt usage code in gfs - it handles the case where iteration returns try again.
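
Roughly, the idea is something like the following sketch (illustrative only - the function and flag names here are made up, this is NOT the attached patch): while the ckpt service is still synchronizing its state after a configuration change, iteration requests are answered with SA_AIS_ERR_TRY_AGAIN, so a caller such as gfs_controld simply retries once sync has finished instead of iterating over a half-synchronized checkpoint.

#include <saAis.h>

/* assumed flag: set while checkpoint state is being synchronized after a
 * configuration change, cleared when sync completes */
static int ckpt_sync_in_progress;

/* hypothetical handler for a section iteration request */
static SaAisErrorT handle_section_iteration_request(void)
{
        if (ckpt_sync_in_progress)
                return SA_AIS_ERR_TRY_AGAIN;    /* caller retries later */

        /* ... normal iteration handling would continue here ... */
        return SA_AIS_OK;
}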

Regards
-steve

Comment 57 Steven Dake 2009-10-21 17:55:17 UTC
Created attachment 365573 [details]
patch which may fix problem

Comment 68 errata-xmlrpc 2010-03-30 07:48:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0180.html