Bug 519159
Summary: gfs lock is reported on 2 out of 4 nodes in cluster
Product: Red Hat Enterprise Linux 5
Component: openais
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 5.3
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: openais-0.80.6-12.el5
Doc Type: Bug Fix
Reporter: Everett Bennett <everett.bennett>
Assignee: Jan Friesse <jfriesse>
QA Contact: Cluster QE <mspqa-list>
Docs Contact:
CC: adas, bmarzins, cluster-maint, edamato, jesse.marlin, John.Hadad, mwaite, nikb321, sdake, sghosh, stefan.hurwitz, swhiteho, teigland
Keywords: ZStream
Cloned to: 525280, 557621 (view as bug list)
Last Closed: 2010-03-30 07:48:10 UTC
Bug Blocks: 554487, 554488, 557621, 557622
Description
Everett Bennett
2009-08-25 13:20:50 UTC
I need to get more information about this failure. I don't know what application is used, and I don't know if the locks you are inquiring about are normal GFS file locks (i.e. open files) or if the locks correspond to posix locks or flocks.

I would like to get both gfs and dlm lock dumps from a recreation of the failure.

To get the GFS locks, please follow these instructions: http://sources.redhat.com/cluster/wiki/FAQ/LockManager#lock_dump_gfs
To get the DLM locks, please follow these instructions: http://sources.redhat.com/cluster/wiki/FAQ/LockManager#lock_dump_dlm

I'd also like to either get the source for your lock_check program or at least a detailed technical description of how it works.

Created attachment 358714 [details]
Lockdump/dlm info from all 4 nodes in test cluster.
Here's the info you requested. Let me know if you need anything or if I missed something.
This seems to be related to fcntl locks, so adding Dave Teigland to the CC list.

The GFS and DLM lock dumps you provided do not indicate any GFS flocks held by any node for any purpose. GFS flocks in the glock dump would look something like this: Glock (6, 123456) (where 123456 is an inode number). GFS flocks appearing in the DLM dump would look something like this (assuming the same inode):

2c300c6 2 2a0002 6310 0 4c 10000 2 5 -1 0 0 24 " 6 1e240"

(The " 6 " in quotes at the end indicates the DLM lock is for a GFS flock.)

So based on this information, I have a few more questions. Are you sure that the flocks are for files on the GFS file system and not some other file system, like the root file system? For example, please make sure file /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd is really on the GFS file system. Also, please make sure the GFS file system is being mounted without -o localflocks (which uses the system flock interface in favor of GFS's cluster-wide flock interface). Also, please make sure that the lock dumps are taken from the time of failure.

There is a symbolic link from /var/dcs_6.0_sp/db to /dcs/data02/var_dcs_6.0_sp_db/, where /dcs/data02 is a gfs file system. This is done on all 4 nodes.

ime1/root> ls -dl /var/dcs_6.0_sp/db
lrwxrwxrwx 1 cgi dev 29 Aug 20 09:04 /var/dcs_6.0_sp/db -> /dcs/data02/var_dcs_6.0_sp_db
ime1/root> ls -ld /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd
-rw-r--r-- 1 cgi dev 94200320 Aug 24 15:59 /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd
ime1/root> ls -ld /dcs/data02/var_dcs_6.0_sp_db/dcs/dcs_filreg_file.ixd
-rw-r--r-- 1 cgi dev 94200320 Aug 24 15:59 /dcs/data02/var_dcs_6.0_sp_db/dcs/dcs_filreg_file.ixd
ime1/root> df -k /var/dcs_6.0_sp/db/dcs/.
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vgdata02-lvol1 141788512 31392872 110395640 23% /dcs/data02
ime1/root> mount|grep data02
/dev/mapper/vgdata02-lvol1 on /dcs/data02 type gfs (rw,hostdata=jid=1:id=327682:first=0)
ime1/root>

Regarding this statement: "Also, please make sure that the lock dumps are taken from the time of failure." The cluster was left in its current state from the last test. Any other suggestions on how to collect data for this test going forward?

It should be noted that we changed 'plock_ownership' to '0' per an issue we had with a 2-node gfs2 cluster: https://bugzilla.redhat.com/show_bug.cgi?id=512799

It sounds like you're probably using posix locks (plocks) rather than flocks. All flocks on gfs appear in gfs and dlm lock dumps. All plocks on gfs appear in "group_tool dump plocks <fsname>".

Any suggestions what to do going forward?

We'll need to correlate the plocks dumped from group_tool (from each node) with info from the application log and lock_check.

Created attachment 358759 [details]
group_tool dump plock data02 on ime1, 2, 3, and 4.
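As an aside, the glock format quoted above ("Glock (6, 123456)", where type 6 marks a flock and 123456 is the inode) is easy to scan for mechanically. A minimal sketch, assuming a captured `gfs_tool lockdump` saved as text (the sample lines below are hypothetical, in the format described in the comment, not taken from the real attachment):

```python
import re

def flock_inodes_from_glock_dump(dump_text):
    """Return inode numbers of GFS flock glocks (type 6) found in a
    gfs_tool lockdump capture. Flock glocks print as 'Glock (6, <inode>)'."""
    return [int(m.group(1))
            for m in re.finditer(r"Glock \(6,\s*(\d+)\)", dump_text)]

# Hypothetical sample: one inode glock (type 2) and two flock glocks (type 6).
sample = """
Glock (2, 5194392)
Glock (6, 123456)
Glock (6, 123999)
"""
print(flock_inodes_from_glock_dump(sample))
```

If this returns an empty list for every node's dump, the locks in question are not flocks, which is consistent with the conclusion above that plocks are involved instead.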
lock_check is probably just using fcntl/GETLK; its results match what gfs thinks the plock state is. Was the cluster shut down when you changed from plock_ownership=1 to plock_ownership=0?

Are you by any chance trying to use plocks between threads? Posix locks only work properly between separate processes. This is because threads have the same "owner" value, so locks cannot be distinguished among them by the fs. (I've run into this before with an app IBM was trying to run on gfs.)

Another thing to note is that the "pid" fields can be misleading. gfs (and the kernel in general) use the owner value, e.g. ffff8107fb9ac3c0, for all real work. The only reason they save the pid is to return it in GETLK, even though it may not be accurate. For processes that hold plocks and fork or clone, the process holding a lock can change, meaning the pid no longer matches the process holding the lock.

Nodes 1 and 2 have matching plock state, and nodes 3 and 4 have matching plock state. That's a curious pattern. If the state is not being changed (app suspended or something), the dump from all nodes should be identical. With a running app that's doing locking, the collected state can be unmatching since the dump commands are skewed among the nodes. I can't say how much of the difference is due to this skew. Collecting plock state a second time might help clear that up.

It appears that at least the locks on /var/dcs_6.0_sp/db/dcs/dcs_filreg_file.ixd (inode 5194392) are in a steady, but inconsistent, state. Nodes 1 and 2 think 1/18480/ffff810828f950c0 has a WR lock on it:

5194392 WR 1-1 nodeid 1 pid 18480 owner ffff810828f950c0

Nodes 3 and 4 think three different processes have RD locks on it:

5194392 RD 1-1 nodeid 1 pid 30219 owner ffff8107fb9ac3c0
5194392 RD 1-1 nodeid 1 pid 30227 owner ffff81082a4a80c0
5194392 RD 1-1 nodeid 1 pid 3513 owner ffff810826d6a680

I don't have any theories yet on how that could have happened.
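The "same owner" point about threads versus processes can be demonstrated with plain fcntl locks on any local file system; GFS is not needed to see it. A minimal sketch (temporary file, not the application's files): within one process, a second exclusive lock on an already-locked file is simply granted, while a forked child sees a conflict.

```python
import fcntl
import os
import tempfile

path = tempfile.mkstemp()[1]
f1 = open(path, "w")
f2 = open(path, "w")

# First exclusive whole-file lock, held by this process.
fcntl.lockf(f1, fcntl.LOCK_EX)

# Same process (same "owner"): an exclusive lock on the same file through a
# different fd is granted, not blocked -- the kernel cannot tell the two
# requests apart, which is why plocks cannot serialize threads.
fcntl.lockf(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
print("same process: second lock granted")

# A separate process is a different owner, so its attempt conflicts.
pid = os.fork()
if pid == 0:
    child_fd = open(path, "w")
    try:
        fcntl.lockf(child_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        os._exit(0)   # unexpectedly granted
    except OSError:
        os._exit(1)   # conflict detected, as expected
status = os.waitpid(pid, 0)[1]
print("child saw conflict:", os.WEXITSTATUS(status) == 1)
os.remove(path)
```

Since the application here uses one process per lock for the life of the process, this particular pitfall does not apply to it, but it explains why per-thread locking would silently fail on any POSIX system.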
Using a combination of ps -o pid,stat,cmd,wchan, strace, and /proc/locks, do you know which processes are blocked waiting for plocks on which files? A couple more places where we might dig up some clues are:
- /proc/locks output from all nodes
- group_tool dump gfs <fsname> from all nodes

(In reply to comment #11)
> Are you by any chance trying to use plocks between threads? posix locks only
> work properly between separate processes. This is because threads have the
> same "owner" value, so locks cannot be distinguished among them by the fs.
> (I've run into this before with an app IBM was trying to run on gfs.)

These are separate processes. There are threads, but a single process obtains a lock on the file and keeps it for the duration of its life.

> Another thing to note is that the "pid" fields can be misleading. gfs (and the
> kernel in general) use the owner value, e.g. ffff8107fb9ac3c0, for all real
> work. The only reason they save the pid is to return in GETLK, even though it
> may not be accurate. For processes that hold plocks and fork or clone, the
> process holding a lock can change, meaning the pid no longer matches the
> process holding the lock.

Here is the sequence of events that we believe led to this:

ime1 -> master
ime2 -> slave
ime3 -> slave
ime4 -> slave

We cut the power to ime1 and service failed over to ime2. The application came up without problems.

ime1 -> rebooting
ime2 -> master
ime3 -> slave accessing GFS
ime4 -> slave accessing GFS

At this point we are processing pretty heavily on all nodes including ime1. Then we pull the plug on ime2. The service fails back to ime1, and now we see the problem. The PIDs that ime1 and ime2 think have locks were the same PIDs from before the first failure test on ime1. I was wondering if some sort of lock recovery took place when it became master a second time.

> nodes 1 and 2 have matching plock state, and nodes 3 and 4 have matching plock
> state. That's a curious pattern. If the state is not being changed (app
> suspended or something), the dump from all nodes should be identical. With a
> running app that's doing locking, the collected state can be unmatching since
> the dump commands are skewed among the nodes. I can't say how much of the
> difference is due to this skew. Collecting plock state a second time might
> help clear that up.

We did not see the lock problem on 3 & 4 until we tried to move the service to node 4. At that point they both started complaining about the lock, but from a different PID than nodes 1 & 2.

> I don't have any theories yet on how that could have happened.
>
> Using a combination of ps -o pid,stat,cmd,wchan, strace, /proc/locks, do you
> know which processes are blocked waiting for plocks on which files?

Our application checks to see if it can lock the file before it comes up, so at this point there are no processes that have locks on the file.

To the question "Was the cluster shut down when you changed from plock_ownership=1 to plock_ownership=0?": Yes, this was changed some time ago. All nodes were rebooted prior to testing.

Everett

Thanks, I'm going to run some tests copying the main activities and events you described. It does look related to recovery.

Is the master node the only node that acquires locks? Do slaves do any file locking? Is the service failover coordinated with any cluster or gfs events? i.e. does the new master take over before/after gfs recovery completes?

Given this info we could possibly dump gfs plocks at certain points in the failover/recovery process to verify that all plocks from failed nodes have been purged. e.g. if the new master takes over after gfs recovery completes, then a plock dump just before the new master begins should show no plocks from the old master. Also, a 'group_tool dump gfs' at that same point should include a record of how many plocks have been purged from the failed master.
By your description, it sounds like the initial recovery and failover may be working properly (checks like those above could confirm), but problems begin at the second recovery, where the master moves back to the original node. We could again dump plocks at certain points of recovery to check that plocks don't exist that shouldn't.

Also, when a failed node comes back and mounts gfs, it would be useful to collect the 'group_tool dump gfs' log, which will show us whether the plock state it's syncing is stale (we've had a problem like this before where the openais checkpoints used for syncing plocks are out of sync).

(In reply to comment #14)
> Thanks, I'm going to run some tests copying the main activities and events you
> described. It does look related to recovery.
>
> Is the master node the only node that acquires locks? Do slaves do any file
> locking? Is the service failover coordinated with any cluster or gfs events?
> i.e. does the new master take over before/after gfs recovery completes?

Yes, the master is the only one holding the locks. For the test we performed we simply killed the power from the bladecenter. Not sure about the recovery. We would wait until we were able to mount the GFS volume before starting the service, but if recovery is going on while the volume is mounted then I don't know. We are going to try to reproduce and collect the info you requested.

So gfs is not mounted on all the nodes, but just on the master? If so, why not use a local file system?

(In reply to comment #16)
> So gfs is not mounted on all the nodes, but just on the master? If so, why not
> use a local file system?

GFS is mounted on all nodes. The master delegates writes for some databases but not all. For databases that are not handled by the master, we use GFS so that we can share the files across nodes. We see problems in both cases: where just the master had the lock, and where potentially any of the slave nodes could have had a lock on a file. The alternative to what we are doing would be to use local disk and NFS, which we are trying to get away from.

Would it be helpful to provide you with external access to the cluster which is having the problem? If so we can try to coordinate something. Thanks

I have duplicated the problem and collected the information you requested at the points you requested. I used a script (gfs_cmd.sh) to collect the information on all nodes and dumped the result into files named RH_info.n. The script and the files are attached (bugzilla_515159_JH_1.tar.gz). I attached the script b/c it will make it easier for you to understand the content of RH_info.n if you have the script. You will probably be most interested in RH_info.10 and RH_info.11, but I included all of them in case they are of some benefit to you.

Here is what I did:

Master running on ime1
Start data processing
RH_info.1
Start data copy
(A) Fence reboot ime1
Wait till ime1 is offline
RH_info.2
Wait a couple of minutes (master service starting on ime2)
RH_info.3
Wait till ime1 is back online and slave cluster service started
RH_info.4
Start data copy
(B) Fence reboot ime2
RH_info.5 (before master shifted to ime1)
Wait till after master shifted to ime1 and application started
RH_info.6
Let data processing resume
Wait till ime2 is back online and slave cluster service started
RH_info.7
Start data copy
(C) Fence off ime1 before master shifted to ime2
RH_info.8
Wait till after master shifted to ime2 and application started
RH_info.9
Wait for data processing to resume
Wait till ime1 is back online and slave cluster service started
Run ls -l on the GFS dir (/dcs/data02/), and while ls was running:
(D) Fence off ime2
RH_info.10 (before master shifted to ime1)
Wait till after master shifted to ime1 (application failed to start due to file locks)
RH_info.11

Our application failed to start due to locks that were obtained on ime1 by a process that ran on ime1 prior to "(C) fence off ime1" (the old PIDs are logged in one of our
application logs). As you can see from the above, the locks showed up after ime1 came up from a reboot. Here is a list of the locked files (all of which are on the GFS mount /dcs/data02/):

FILE UIO_DB_PATH:uio_maint_values_db LOCKED!
FILE EAL_DB_PATH:eal_lang_english LOCKED!
FILE RES_DB_PATH:res_res_type_file LOCKED!
FILE RES_DB_PATH:res_svc_node_file LOCKED!
FILE RES_DB_PATH:res_res_class_file LOCKED!
FILE RES_DB_PATH:res_svc_type_file LOCKED!
FILE RES_DB_PATH:res_res_def_file LOCKED!
FILE RES_DB_PATH:res_svc_inst_file LOCKED!
FILE RES_DB_PATH:res_svc_unit_file LOCKED!
FILE DCS_DB_PATH:dcs_audit_file LOCKED!
FILE DCS_DB_PATH:dcs_custeq_file LOCKED!
FILE DCS_DB_PATH:dcs_dispatch_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_fai_params_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_input_ctx_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_shared_ctx LOCKED!
FILE DCS_DB_PATH:dcs_dportal_output_ctx_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_root_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_routing_file LOCKED!
FILE DCS_DB_PATH:dcs_dportal_state_file LOCKED!
FILE DCS_DB_PATH:dcs_filreg_file LOCKED!
FILE DCS_DB_PATH:dcs_input_volume_file LOCKED!
FILE DCS_DB_PATH:dcs_plan_rtg_file LOCKED!
FILE DCS_DB_PATH:dcs_rtgreq_file LOCKED!
FILE DCS_DB_PATH:dcs_oper_file LOCKED!
FILE DCS_DB_PATH:dcs_oper_portal_file LOCKED!
FILE DCS_DB_PATH:dcs_plan_file LOCKED!
FILE EPS_DB_PATH:eps_filreg_file_db LOCKED!
FILE EPS_DB_PATH:eps_filreg_code_db LOCKED!
FILE DRD_DB_PATH:drd_isam_file LOCKED!
FILE DRD_DB_PATH:drd_proc_file LOCKED!
FILE DRD_DB_PATH:drd_area_file LOCKED!
FILE DRD_DB_PATH:drd_state_file LOCKED!
FILE PRA_DB_PATH:pra_assembled_file_db LOCKED!
FILE PRA_DB_PATH:pra_state_db LOCKED!
FILE PRA_DB_PATH:pra_sys_parms_db LOCKED!
Here is a list of the PIDs that have a lock on the above files:

READ locked by pid: 30697 Start: 0x1 Length: 0x1
READ locked by pid: 30840 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
READ locked by pid: 31016 Start: 0x1 Length: 0x1
WRITE locked by pid: 30721 Start: 0x1 Length: 0x1
WRITE locked by pid: 30747 Start: 0x1 Length: 0x1
WRITE locked by pid: 30744 Start: 0x1 Length: 0x1
WRITE locked by pid: 30745 Start: 0x1 Length: 0x1
READ locked by pid: 31112 Start: 0x1 Length: 0x1
WRITE locked by pid: 30723 Start: 0x1 Length: 0x1
WRITE locked by pid: 30746 Start: 0x1 Length: 0x1
WRITE locked by pid: 30724 Start: 0x1 Length: 0x1
WRITE locked by pid: 30722 Start: 0x1 Length: 0x1
WRITE locked by pid: 30720 Start: 0x1 Length: 0x1
WRITE locked by pid: 30729 Start: 0x1 Length: 0x1
WRITE locked by pid: 30718 Start: 0x1 Length: 0x1
WRITE locked by pid: 30725 Start: 0x1 Length: 0x1
WRITE locked by pid: 30719 Start: 0x1 Length: 0x1
WRITE locked by pid: 30734 Start: 0x1 Length: 0x1
WRITE locked by pid: 30735 Start: 0x1 Length: 0x1
WRITE locked by pid: 30728 Start: 0x1 Length: 0x1
WRITE locked by pid: 30741 Start: 0x1 Length: 0x1
WRITE locked by pid: 30740 Start: 0x1 Length: 0x1
WRITE locked by pid: 30736 Start: 0x1 Length: 0x1
WRITE locked by pid: 30738 Start: 0x1 Length: 0x1
WRITE locked by pid: 30739 Start: 0x1 Length: 0x1
WRITE locked by pid: 30737 Start: 0x1 Length: 0x1
WRITE locked by pid: 30730 Start: 0x1 Length: 0x1
WRITE locked by pid: 30731 Start: 0x1 Length: 0x1
WRITE locked by pid: 30732 Start: 0x1 Length: 0x1

I ran "ps -ef" on all nodes looking for the above PIDs and they don't exist.
----ime4----
ssh ime4 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
----ime3----
ssh ime3 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
----ime2----
ssh ime2 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
----ime1----
ssh ime1 "ps -ef | grep -v grep | egrep "30697|30840|31016|31016|31016|31016|31016|31016|31016|30721|30747|30744|30745|31112|30723|30746|30724|30722|30720|30729|30718|30725|30719|30734|30735|30728|30741|30740|30736|30738|30739|30737|30730|30731|30732""
ime1/cgi>

I ran our lock_check program (source provided to you earlier as an attachment to Comment #2) on a couple of the files on all nodes in the cluster. ime1 and ime2 agree on the locking PID, but ime3 and ime4 claim the files are not locked. This happened before as well (described in Comment #12). I did not yet relocate the master in case you need more info before I do (i.e. the master is still on ime1).
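The check above (do the PIDs recorded as lock holders still exist?) can also be done locally on each node against /proc/locks. A minimal sketch, assuming Linux, and keeping in mind Dave's caveat that the recorded pid can be stale for holders that forked or cloned:

```python
import fcntl
import os
import tempfile

def pids_holding_locks():
    """Parse /proc/locks and return the set of PIDs recorded as lock
    holders. The fifth whitespace-separated field of each (non-blocked)
    line is the pid; blocked waiters ('->' lines) are skipped."""
    pids = set()
    with open("/proc/locks") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 5 and fields[4].isdigit():
                pids.add(int(fields[4]))
    return pids

# Take a lock ourselves so at least one live entry is guaranteed to exist.
path = tempfile.mkstemp()[1]
fd = open(path, "w")
fcntl.lockf(fd, fcntl.LOCK_EX)

# A holder pid with no /proc/<pid> directory would match the symptom here:
# locks attributed to processes that no longer exist on any node.
stale = [p for p in pids_holding_locks() if not os.path.exists("/proc/%d" % p)]
print("our pid recorded:", os.getpid() in pids_holding_locks())
print("stale holder pids:", stale)
os.remove(path)
```

Note this only sees locks the local kernel knows about; for the cluster-wide plock state, "group_tool dump plocks <fsname>" on each node remains the authoritative view.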
----ime4----
ssh ime4 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"
0 of 1 files opened had locks set
----ime3----
ssh ime3 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"
0 of 1 files opened had locks set
----ime2----
ssh ime2 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"
File: "/var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd" locked by: pid: 30697 type: UNKNOWN
1 of 1 files opened had locks set
----ime1----
ssh ime1 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd"
File: "/var/dcs_6.0_sp/db/uio/uio_maint_values_db.ixd" locked by: pid: 30697 type: UNKNOWN
1 of 1 files opened had locks set
----ime4----
ssh ime4 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"
0 of 1 files opened had locks set
----ime3----
ssh ime3 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"
0 of 1 files opened had locks set
----ime2----
ssh ime2 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"
File: "/var/dcs_6.0_sp/db/eal/eal_lang_english.ixd" locked by: pid: 30840 type: UNKNOWN
1 of 1 files opened had locks set
----ime1----
ssh ime1 "/opt/dcs_6.0_sp/bin/lock_check /var/dcs_6.0_sp/db/eal/eal_lang_english.ixd"
File: "/var/dcs_6.0_sp/db/eal/eal_lang_english.ixd" locked by: pid: 30840 type: UNKNOWN
1 of 1 files opened had locks set

Thanks!
John Hadad

Created attachment 359610 [details]
Related to comment 18.
NOTE: I have to exclude RH_info.1, 2 and 3 b/c the file size exceeded your 20 MB limit on attachments.
Attachment contains:
gfs_tool lockdump
cat /proc/locks
group_tool dump gfs
group_tool dump plocks
cat /debug/dlm/data02
cat /debug/dlm/data02_locks

This looks like an openais checkpoint bug. gfs_controld is reading old, stale plock state from an old, stale, unlinked checkpoint instead of from the most recent checkpoint.

Analysis:
a. master 1, slave 2,3,4
b. kill 1
c. master 2, slave 3,4
d. master 2, slave 1,3,4
e. kill 2
f. master 1, slave 3,4
g. master 1, slave 2,3,4
h. kill 1
i. master 2, slave 3,4
j. master 2, slave 1,3,4
k. kill 2
l. master 1, slave 3,4 <-- problem with master starting on 1

In step l, 1 runs into old stale locks acquired by various pids on 1 before h. These old stale locks are not seen on 3,4 at step l.

Possible explanations for old stale locks:
- the 1 locks were not purged by 2 in recovery for step h (but they *were* purged by 3,4?)
- 2 received and added locks from 1 after it purged 1 locks in recovery for step h (wouldn't 3,4 also do the same?)
- 1 read old stale locks from an old stale checkpoint when it rejoined and initialized plock state in j.

1 created a ckpt of all locks in g when 2 mounted. Then 1 was killed in h. Then when 1 mounts in j, 2 should unlink the ckpt 1 created in g and create a new ckpt with the new locks. 1 opens/reads locks from the ckpt. If it gets the old ckpt content from g instead of the new ckpt content, then we'd see these symptoms. We'd want to compare the ckpt content read by 2 in step g to the ckpt content read by 1 in step j...

Extract ckpt data read by 2 in step g from RH_info.7.
Extract ckpt data read by 1 in step j from RH_info.10.

Both match:
"r14548045" "r14527465" "r14527445" "r14527463" "r14527451" "r14527454" "r14527517" "r14527459" "r14062648" "r14062659" "r14540649" "r14539655" "r14539633" "r14062974" "r14539624" "r14540652" "r14540644" "r14063230" "r14548925" "r12689671" "r4949306" "r14548938" "r14539641" "r14548049" "r14062958" "r14062956" "r14062947" "r14067021" "r14067015" "r14063231" "r14067026" "r14548041" "r14063579" "r14539649" "r14539663" "r14539668" "r14539644" "r5194392" "r14539625" "r14067018" "r14066741"

What do you recommend at this point? Should we upgrade to RHEL 5.4 or apply some RHEL 5.3 errata patch and then test further?

5.4 fixes a bunch of bugs in 5.3 related to totem messages during recovery.
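The checkpoint comparison in the analysis above (extracting the quoted "rNNN" resource ids from two RH_info captures and diffing them) is easy to script. A minimal sketch with hypothetical inline data standing in for the real RH_info.7 and RH_info.10 files:

```python
import re

def resource_ids(dump_text):
    """Extract the quoted plock resource ids (e.g. "r5194392") from a
    gfs_controld debug capture."""
    return set(re.findall(r'"(r\d+)"', dump_text))

# Hypothetical excerpts; in the real comparison these would be the ckpt
# sections of RH_info.7 (read by node 2 in step g) and RH_info.10 (read
# by node 1 in step j).
ckpt_step_g = '"r5194392" "r14548045" "r14527465"'
ckpt_step_j = '"r5194392" "r14548045" "r14527465"'

only_g = resource_ids(ckpt_step_g) - resource_ids(ckpt_step_j)
only_j = resource_ids(ckpt_step_j) - resource_ids(ckpt_step_g)
print("identical checkpoints:", not only_g and not only_j)
```

An identical-set result at the two steps is exactly the "both match" finding: the rejoining node re-read the stale checkpoint content from step g instead of a freshly written one.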
I don't know for sure if it will fix the customer's problem.

Dave,

Is there a test case that reproduces this issue that you have run?

Regards
-steve

The information in this bug report reflects testing that has been done thus far in our Development env. Both our test env and the customer env are identical as much as that is possible. The customer site has exhibited similar issues, among them a node faulting within a 5-node RHEL 5.3 cluster with the following messages. The node uipi was fenced, but for no obvious reason. This apparently has happened several times over the last couple of weeks. Any suggestions on how to proceed?

Sep 18 18:55:07 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 18:55:10 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 18:55:12 tururim clurgmgrd: [21417]: <info> Executing /etc/cmcluster/imeRJBAslave_global_start_stop.sh status
Sep 18 18:55:17 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 18:55:20 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 18:55:22 tururim clurgmgrd: [21417]: <info> Executing /etc/cmcluster/imePRNMslave_global_start_stop.sh status
Sep 18 18:55:27 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 18:55:30 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 18:55:32 tururim clurgmgrd: [21417]: <info> Executing /etc/cmcluster/imeREPslave_global_start_stop.sh status
Sep 18 18:55:37 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 18:55:40 tururim xinetd[19689]: START: echo-dgram pid=0 from=10.128.236.8
Sep 18 14:11:03 tururim clurgmgrd[29977]: <notice> service:uipi is not runnable - restricted domain offline
Sep 18 14:11:02 tururim kernel: dlm: closing connection to node 3
Sep 18 14:11:02 tururim TOP: 30642 cgi 16 0 222m 40m 3100 S 0.0 0.1 0:01.15 drd_stream_area -ar
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] Saving state aru 43 high seq received 43
Sep 18 14:11:02 tururim fenced[18636]: uipi not a cluster member after 0 sec post_fail_delay
Sep 18 14:11:02 tururim clurgmgrd[29977]: <info> State change: uipi DOWN
Sep 18 14:11:02 tururim TOP: 31039 cgi 17 0 95160 2480 1228 S 0.0 0.0 0:00.00 /opt/dcs_6.0_rjba/u
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] Storing new sequence id for ring 55e0
Sep 18 14:11:02 tururim fenced[18636]: fencing node "uipi"
Sep 18 14:11:02 tururim TOP: 31054 root 16 0 94176 3340 2584 S 0.0 0.0 0:00.00 sshd: bacenter [pri
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] entering COMMIT state.
Sep 18 14:11:02 tururim TOP: 31056 bacenter 15 0 94304 1924 1164 S 0.0 0.0 0:03.90 sshd: bacenter@pts/
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] entering RECOVERY state.
Sep 18 14:11:02 tururim TOP: 31057 bacenter 16 0 67044 2556 1184 S 0.0 0.0 0:19.80 -bash
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] position [0] member 10.128.11.68:
Sep 18 14:11:02 tururim TOP: 31093 cgi 17 0 95160 2480 1228 S 0.0 0.0 0:00.00 /opt/dcs_6.0_rjba/u
Sep 18 14:11:02 tururim openais[18616]: [TOTEM] previous ring seq 21980 rep 10.128.11.68

----tururim----
ssh tururim "sh /tmp/chk.sh" /var/log
Sep 18 14:11:02 tururim fenced[18636]: uipi not a cluster member after 0 sec post_fail_delay
Sep 18 14:11:02 tururim fenced[18636]: fencing node "uipi"
Sep 18 14:11:41 tururim fenced[18636]: fence "uipi" success
Sep 18 17:49:13 tururim fenced[18636]: uipi not a cluster member after 0 sec post_fail_delay
Sep 18 17:49:14 tururim fenced[18636]: fencing node "uipi"
Sep 18 14:10:43 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:10:44 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] The token was lost in the OPERATIONAL state.
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] entering GATHER state from 2.
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Saving state aru 48ad high seq received 48ad
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Storing new sequence id for ring 55dc
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] entering COMMIT state.
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] entering RECOVERY state.
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [0] member 10.128.11.68:
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep 10.128.11.68
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad received flag 1
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [1] member 10.128.11.69:
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep 10.128.11.68
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad received flag 1
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [2] member 10.128.11.70:
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep 10.128.11.68
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [3] member 10.128.11.71:
Sep 18 14:10:47 uipi gfs_controld[18675]: cpg_mcast_joined retry 100 MSG_PLOCK
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep 10.128.11.68
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad received flag 1
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] position [4] member 10.128.11.72:
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] previous ring seq 21976 rep 10.128.11.68
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] aru 48ad high delivered 48ad received flag 1
Sep 18 14:10:47 uipi gfs_controld[18675]: cpg_mcast_joined retry 200 MSG_PLOCK
Sep 18 14:10:47 uipi openais[18644]: [TOTEM] Did not need to originate any messages in recovery.
Sep 18 14:10:47 uipi openais[18644]: [CLM ] CLM CONFIGURATION CHANGE
Sep 18 14:10:47 uipi openais[18644]: [CLM ] New Configuration:
Sep 18 14:10:47 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.68)
Sep 18 14:10:47 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.69)
Sep 18 14:10:47 uipi gfs_controld[18675]: cpg_mcast_joined retry 300 MSG_PLOCK
Sep 18 14:10:47 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.70)
Sep 18 14:10:47 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.71)
Sep 18 14:10:48 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.72)
Sep 18 14:10:48 uipi openais[18644]: [CLM ] Members Left:
Sep 18 14:10:48 uipi openais[18644]: [CLM ] Members Joined:
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 400 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [CLM ] CLM CONFIGURATION CHANGE
Sep 18 14:10:48 uipi openais[18644]: [CLM ] New Configuration:
Sep 18 14:10:48 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.68)
Sep 18 14:10:48 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.69)
Sep 18 14:10:48 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.70)
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 500 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.71)
Sep 18 14:10:48 uipi openais[18644]: [CLM ] r(0) ip(10.128.11.72)
Sep 18 14:10:48 uipi openais[18644]: [CLM ] Members Left:
Sep 18 14:10:48 uipi openais[18644]: [CLM ] Members Joined:
Sep 18 14:10:48 uipi openais[18644]: [SYNC ] This node is within the primary component and will provide service.
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] entering OPERATIONAL state.
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 600 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 3 4 5 6
Sep 18 14:10:48 uipi last message repeated 4 times
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 14
Sep 18 14:10:48 uipi gfs_controld[18675]: cpg_mcast_joined retry 700 MSG_PLOCK
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 16 17 18 14 19
Sep 18 14:10:48 uipi openais[18644]: [TOTEM] Retransmit List: 16 1b 17 1c 18 14 19
Sep 18 14:10:49 uipi gfs_controld[18675]: cpg_mcast_joined retry 800 MSG_PLOCK
Sep 18 14:10:49 uipi openais[18644]: [CLM ] got nodejoin message 10.128.11.69
Sep 18 14:10:49 uipi openais[18644]: [CLM ] got nodejoin message 10.128.11.70
Sep 18 14:10:49 uipi openais[18644]: [CLM ] got nodejoin message 10.128.11.71
Sep 18 14:10:49 uipi openais[18644]: [CLM ] got nodejoin message 10.128.11.72
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 1b 1c 18 19
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 1b 1c
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 1e 1f
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 21 1e 1f
Sep 18 14:10:49 uipi last message repeated 3 times
Sep 18 14:10:49 uipi gfs_controld[18675]: cpg_mcast_joined retry 900 MSG_PLOCK
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 21
Sep 18 14:10:49 uipi openais[18644]: [TOTEM] Retransmit List: 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 2f 30 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 31 32 33 34
Sep 18 14:10:50 uipi dlm_controld[18669]: cluster is down, exiting
Sep 18 14:10:50 uipi kernel: dlm: closing connection to node 5
Sep 18 14:10:50 uipi fenced[18663]: cluster is down, exiting
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 4
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 3
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 2
Sep 18 14:10:51 uipi kernel: dlm: closing connection to node 1
Sep 18 14:10:51 uipi kernel: dlm: connect from non cluster node
Sep 18 14:10:52 uipi last message repeated 50 times
Sep 18 14:10:53 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:10:53 uipi kernel: dlm: connect from non cluster node
Sep 18 14:10:54 uipi last message repeated 32 times
Sep 18 14:10:54 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:10:54 uipi kernel: dlm: connect from non cluster node
Sep 18 14:11:01 uipi last message repeated 35 times
Sep 18 14:11:03 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:11:04 uipi xinetd[19815]: START: echo-dgram pid=0 from=10.128.236.7
Sep 18 14:11:07 uipi shutdown[8859]: shutting down for system halt
Sep 18 14:11:09 uipi rgmanager: [8917]: <notice> Shutting down Cluster Service Manager...
Sep 18 14:11:09 uipi clurgmgrd[11132]: <notice> Shutting down
Sep 18 17:48:39 uipi syslogd 1.4.1: restart.
Sep 18 17:48:39 uipi kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 18 17:48:39 uipi kernel: ache: write through
Sep 18 17:48:39 uipi kernel: sdma: unknown partition table

I've not tried to reproduce this, but intec have reproduced and collected the data I needed.

This bug report has at least two separate issues. First is the bug report relating to the data dct analyzed in comment #20. Second, you report nodes are fenced. This usually indicates either your network failed or openais failed. 5.3 may have some issues with segfaults. I suggest checking /var/lib/openais for core files. If there are any, install the openais-debuginfo package and run the following command:

gdb /usr/sbin/corosync /var/lib/openais/corefile.name
bt

This will give me a backtrace, and I can tell you if the segfaults you see are known issues with 5.3 fixed by a later errata.

Regards
-steve

Sorry, instead of corosync in comment #25, replace with openais.

Here's the trace.
uipi> ls -dl core*
-rw------- 1 root root 141422592 Aug 27 11:41 core.18617
-rw------- 1 root root 139124736 Sep 18 14:10 core.18644
uipi> file core*
core.18617: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
core.18644: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
uipi> type aisexec
aisexec is /usr/sbin/aisexec
uipi> gdb /usr/sbin/aisexec core.18644
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/libexec/lcrso/objdb.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
Reading symbols from /usr/libexec/lcrso/service_cman.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_cman.lcrso
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
Reading symbols from /usr/libexec/lcrso/service_msg.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_msg.lcrso
Reading symbols from /usr/libexec/lcrso/service_lck.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_lck.lcrso
Reading symbols from /usr/libexec/lcrso/service_evt.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_evt.lcrso
Reading symbols from /usr/libexec/lcrso/service_ckpt.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_ckpt.lcrso
Reading symbols from /usr/libexec/lcrso/service_amf.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_amf.lcrso
Reading symbols from /usr/libexec/lcrso/service_clm.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_clm.lcrso
Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...(no debugging symbols found)...done.
Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.
[New process 18644]
[New process 11137]
[New process 19619]
[New process 19611]
[New process 19581]
[New process 19573]
[New process 19543]
[New process 19535]
[New process 19505]
[New process 19497]
[New process 19467]
[New process 19459]
[New process 19429]
[New process 19421]
[New process 19391]
[New process 19383]
[New process 19353]
[New process 19345]
[New process 19315]
[New process 19307]
[New process 19277]
[New process 19269]
[New process 19239]
[New process 19231]
[New process 19201]
[New process 19193]
[New process 19163]
[New process 19155]
[New process 19122]
[New process 19114]
[New process 19084]
[New process 19076]
[New process 19049]
[New process 19030]
[New process 18726]
[New process 18683]
[New process 18680]
[New process 18679]
[New process 18657]
[New process 18645]
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
#1  0x0000003904031cc0 in abort () from /lib64/libc.so.6
#2  0x000000390406a7fb in __libc_message () from /lib64/libc.so.6
#3  0x00000039040e823f in __stack_chk_fail () from /lib64/libc.so.6
#4  0x0000000000411fb6 in ?? ()
#5  0x0000000000409c43 in rrp_deliver_fn ()
#6  0x00000000004080d6 in ?? ()
#7  0x0000000000405b60 in poll_run ()
#8  0x00000000004184f2 in main ()
(gdb)

Everett,

It appears the openais-debuginfo package is not installed. This provides the symbols for gdb to produce a useful backtrace output. Could you install it and rerun the command?

Regards
-steve

Dude, let me try again.... I installed openais-devel...oops.

uipi> pwd
/var/lib/openais
uipi> ls -dl core*
-rw------- 1 root root 141422592 Aug 27 11:41 core.18617
-rw------- 1 root root 139124736 Sep 18 14:10 core.18644
uipi> pwd
/var/lib/openais
uipi> gdb /usr/sbin/aisexec /var/lib/openais/core.18644
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/libexec/lcrso/objdb.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/objdb.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
Reading symbols from /usr/libexec/lcrso/service_cman.lcrso...done.
Loaded symbols for /usr/libexec/lcrso/service_cman.lcrso
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cpg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cfg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
Reading symbols from /usr/libexec/lcrso/service_msg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_msg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_msg.lcrso
Reading symbols from /usr/libexec/lcrso/service_lck.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_lck.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_lck.lcrso
Reading symbols from /usr/libexec/lcrso/service_evt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evt.lcrso
Reading symbols from /usr/libexec/lcrso/service_ckpt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_ckpt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_ckpt.lcrso
Reading symbols from /usr/libexec/lcrso/service_amf.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_amf.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_amf.lcrso
Reading symbols from /usr/libexec/lcrso/service_clm.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_clm.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_clm.lcrso
Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evs.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.
[New process 18644]
[New process 11137]
[New process 19619]
[New process 19611]
[New process 19581]
[New process 19573]
[New process 19543]
[New process 19535]
[New process 19505]
[New process 19497]
[New process 19467]
[New process 19459]
[New process 19429]
[New process 19421]
[New process 19391]
[New process 19383]
[New process 19353]
[New process 19345]
[New process 19315]
[New process 19307]
[New process 19277]
[New process 19269]
[New process 19239]
[New process 19231]
[New process 19201]
[New process 19193]
[New process 19163]
[New process 19155]
[New process 19122]
[New process 19114]
[New process 19084]
[New process 19076]
[New process 19049]
[New process 19030]
[New process 18726]
[New process 18683]
[New process 18680]
[New process 18679]
[New process 18657]
[New process 18645]
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
#1  0x0000003904031cc0 in abort () from /lib64/libc.so.6
#2  0x000000390406a7fb in __libc_message () from /lib64/libc.so.6
#3  0x00000039040e823f in __stack_chk_fail () from /lib64/libc.so.6
#4  0x0000000000411fb6 in message_handler_orf_token (instance=0x2aaaaaaae010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3453
#5  0x0000000000409c43 in rrp_deliver_fn (context=0x0, msg=0x1722eef8, msg_len=1090) at totemrrp.c:1308
#6  0x00000000004080d6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x1722e850) at totemnet.c:676
#7  0x0000000000405b60 in poll_run (handle=0) at aispoll.c:402
#8  0x00000000004184f2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:609

uipi> gdb /usr/sbin/aisexec /var/lib/openais/core.18617
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
...
...
...
[New process 18618]
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003904030215 in raise () from /lib64/libc.so.6
#1  0x0000003904031cc0 in abort () from /lib64/libc.so.6
#2  0x000000390406a7fb in __libc_message () from /lib64/libc.so.6
#3  0x00000039040e823f in __stack_chk_fail () from /lib64/libc.so.6
#4  0x0000000000411fb6 in message_handler_orf_token (instance=0x2aaaaaaae010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3453
#5  0x0000000000409c43 in rrp_deliver_fn (context=0x0, msg=0x1fc8ef8, msg_len=1090) at totemrrp.c:1308
#6  0x00000000004080d6 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x1fc8850) at totemnet.c:676
#7  0x0000000000405b60 in poll_run (handle=0) at aispoll.c:402
#8  0x00000000004184f2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:609

Comment #30 is an unknown issue. Please file a new defect with the comment #30 backtrace against the component openais.

Regards
-steve

The cores appear to be dumped by aisexec. Besides, I do not see an 'openais' binary.
uipi> pwd
/var/lib/openais
uipi> file core*
core.18617: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
core.18644: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'aisexec'
uipi> type aisexec
aisexec is /usr/sbin/aisexec

ibmblade05> find / -xdev -type f -name 'openais' -ls
11769055 4 -rwxr-xr-x 1 root root 1077 Mar 27 15:49 /backup_ibmblade05/etc/rc.d/init.d/openais
12616268 4 -rwxr-xr-x 1 root root 1077 Apr 9 17:07 /etc/rc.d/init.d/openais

uipi> find / /opt /usr/openv -xdev -type f -name 'openais' -print
/etc/rc.d/init.d/openais

uipi> ps -eaf|grep aisexec
root 18609 1 4 Sep22 ? 00:53:42 aisexec
root 19925 14710 0 09:50 pts/4 00:00:00 grep aisexec

----tururim----
ssh tururim "ps -eaf|egrep "aisexec|openais""
root 11920 11895 0 09:52 ? 00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root 11938 11920 0 09:52 ? 00:00:00 egrep aisexec|openais
root 18609 1 5 Sep22 ? 01:06:45 aisexec
----turuturu----
ssh turuturu "ps -eaf|egrep "aisexec|openais""
root 13522 13506 0 09:52 ? 00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root 13543 13522 0 09:52 ? 00:00:00 egrep aisexec|openais
root 18621 1 4 Sep22 ? 00:54:34 aisexec
----uipi----
ssh uipi "ps -eaf|egrep "aisexec|openais""
root 18609 1 4 Sep22 ? 00:53:47 aisexec
root 23657 23466 0 09:52 pts/4 00:00:00 ssh uipi ps -eaf|egrep "aisexec|openais"
root 23662 23658 0 09:52 ? 00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root 23680 23662 0 09:52 ? 00:00:00 egrep aisexec|openais
----uiracu----
ssh uiracu "ps -eaf|egrep "aisexec|openais""
root 18606 1 4 Sep22 ? 00:55:47 aisexec
root 23413 23408 0 09:52 ? 00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root 23431 23413 0 09:52 ? 00:00:00 egrep aisexec|openais
----uirapuru----
ssh uirapuru "ps -eaf|egrep "aisexec|openais""
root 5099 5093 0 09:52 ? 00:00:00 bash -c ps -eaf|egrep "aisexec|openais"
root 5117 5099 0 09:52 ? 00:00:00 egrep aisexec|openais
root 18619 1 5 Sep22 ? 00:57:38 aisexec

I am confused.
I do not see an 'openais' binary.

Does the above trace reflect known issues? Do you still want a different bug report? Also, what is the ETA on the issue regarding comment #24? The customer is close to determining whether or not they will proceed with using RHEL Clustering or move on to Veritas, etc. Please advise next steps that we should consider, e.g., upgrade to RHEL 5.4, etc.

aisexec is the binary for openais. You generated the openais backtraces correctly in comment #30.

Does the above trace reflect known issues? Do you still want a different bug report? Also, what is the ETA on the issue regarding comment #24? The customer is close to determining whether or not they will proceed with using RHEL Clustering or move on to Veritas, etc. Please advise next steps that we should consider, e.g., upgrade to RHEL 5.4, etc.

The segfault backtrace shown in comment #30 is a new (currently unknown) defect. Please file a defect as per the instructions in comment #31 against the component openais.

Comment #24 does not reflect a known issue. In order for it to be fixed, either I have to try to duplicate your application in a separate test case, or use your application internally to duplicate the issue. Using your application internally would be more effective. If you can provide me a copy of your application, that would be helpful. Please contact me off-bugzilla at sdake for transfer instructions.

The application is a bit cumbersome and I will not be able to transfer it to you. If necessary, we could arrange access to our dev environment for you.

With regards to comment #24:

Comment #24 From Dave Teigland (teigland) 2009-09-22 14:15:50 EDT
-------
I've not tried to reproduce this, but Intec have reproduced and collected the data I needed.

So, the data provided did not help?

I've cloned the ticket and included the trace info for aisexec. The bug report is #525280.

Are we going in circles or going nowhere?
Comment #24 From Dave Teigland (teigland) 2009-09-22 14:15:50 EDT
-------
I've not tried to reproduce this, but Intec have reproduced and collected the data I needed.

Everett,

We are not going in circles. First, Dave has diagnosed that there may be a checkpoint problem in comment #20. In comment #30 I diagnosed a new unknown bug, for which you cloned a bugzilla.

Jan Friesse will be taking over this bug and working full time on it now that it has been triaged properly. Please arrange remote access for him to your development environment. Once he duplicates the issue and can instrument the behavior, he can fix the problem.

Regards
-steve

Everett,

I'm now assigned to this bug. First of all, I'm in the GMT+2 TZ, so communication *may* be a little slower. Anyway, can you please try to do this:
- Upgrade to 5.4. Does it help?
- In case that does not solve the problem, try to give me access to your machines and maybe some instructions on HOW to reproduce the bug (please contact me directly by mail, jfriesse).

We should try to reproduce this ourselves before trying to reproduce remotely; this shouldn't require using their application. It looks like a relatively simple sequence of steps:

start with node1, node3, node4 all mounting the fs
node1 and node2 (when it mounts) both do plock operations on the fs during the testing

repeat the following:
a. node2 mounts (node1 unlinks ckpt, node1 creates ckpt, node2 reads ckpt)
b. kill node1
c. node1 mounts (node2 unlinks ckpt, node2 creates ckpt, node1 reads ckpt)
d. kill node2

we need to check:
in step a that node2 reads the most recent ckpt created by node1
in step c that node1 reads the most recent ckpt created by node2

it seems that after mounting, a node reads an old/stale version of the ckpt

One of the key parts to debugging this is verifying whether or not the new ckpt data was read versus the old ckpt data, i.e. detecting when the wrong ckpt data has been read. Nothing may appear to be wrong if the old/wrong ckpt content is read.
gfs_controld blindly trusts that it is reading the correct ckpt data, we don't have a test that validates the plock state, and openais doesn't produce any errors when the wrong ckpt data is read.

Here are a couple of things I look at to debug ckpt data from outside of openais (and we may want to add more):

- "group_tool dump gfs" shows the ckpt sections that are read by a node when it mounts. Each section is for a separate inode/resource. With the test running, these sections should be different each time.

- "group_tool dump plocks <fsname>" shows the plock state from gfs_controld, which is initialized from the ckpt during mount.

Steve may also have some tips for debugging ckpt state from within openais.

Status after about a week of testing (= waiting for nodes to reboot): I was NOT able to reproduce the bug. I used the lock_load.c provided by Dave. Everything seems to be working perfectly, and valgrind didn't complain about anything.

I tried to copy the situation as closely as possible, so I had:
- 3 nodes with a shared disk (n1, n2, n3)
- n3 continuously running lock_load.c = create load
- n1 running a "hacked" lock_load which creates a lock on one file and doesn't change it (I will call it APP)

:loop
- fence n1
- run APP on n2
- wait until n1 is running (cman + gfs), check sections on every node - everything seems the same
- fence n2
- run APP on n1
- wait until n2 is running (cman + gfs), check sections on every node - everything seems the same -> no inconsistency
goto :loop

For me, it just looks like (at least) I'm not able to reproduce this. So Everett, can you please manage to grant access to your machines?

I'll consult with my manager on how he would like to proceed. I have a VNC connection available that we can use. If you call me at 404-705-2849, I will provide the info. Also, we need to set up a time when you are available.

Does the fact that we have an active RH support contract help us in increasing the priority with which this issue is worked on? If so, what do we need to do to take advantage of this?
Thanks.

John,

There is an engineer from the R&D development team working on this issue full time. The priority is currently "high" with severity "urgent", which is why you have 100% engineering time from 1 dedicated engineer who is doing absolutely nothing else but looking at this issue.

Usually TAMs (technical account managers) interface between Red Hat support + engineering and customers or partners. I don't know the process for obtaining a TAM, but they are helpful in providing visibility of your issues into management.

Regards
-steve

John, as I said, I'm in UTC+2, so I'm able to call you anytime 9:00 - 18:00 UTC+2. So I suggest 15:30 UTC (11:30 EDT). I'm not sure what TZ you are in, so please let me know.

Jan, I am in the EST time zone and 15:00 today works for me. I will send you a separate e-mail to go over logistics (connection details & test/problem duplication process). Thanks! John Hadad

Looks like I got the times all wrong. For some reason, the time that stuck in my mind was 15:00 EST rather than 11:30 EST. I apologize for the misunderstanding. Can we set up to run a few tests tomorrow (Wed Oct 7) @ 11:30 EST? If not, do you have another time window when you can be available for testing around the 11:30 EST time frame?

John, 11:30 EST is really too late (11:30 E!!!D!!!!T - this means 10:30 EST is one of the latest reasonable times for me). I would rather prefer communication by mail. The only thing I really need is to get some kind of access (ideally ssh, because ... network latency over such long distances is not a big friend of VNC) and maybe some small details about the environment (like where your application is, how it runs, ...) so I can replicate the problem. Of course, my aim will be to find the reason for the behavior and replicate it on my nodes, but it's really hard to say how long it will take.

Is it possible to get someone in the USA to work this issue? This timezone thing is not going to work.

I will need a public ssh key to allow support access.
Please email your 'id_dsa.pub' key, etc. to my account.

Small update about this bug. We managed a call and started talking directly (not through BZ). Perry decided it would be a good thing to have an update here.

So I'm currently still working on the issue. It seems very, very hard to debug, because even on the customer machines the bug is not reproducible 100% of the time. We found that the bug appears on the node with the lowest id only. It appears BEFORE the node with (node_with_lowest_id + 1), right after node_with_lowest_id boots up. Currently, I'm trying to make a customized aisexec with more debug information, so I can confirm/deny that this bug is caused by the aisexec ckpt (exec part).

Update from 2009-10-21

I finished the ckpt exec with more debug information and put it on the customer machine. I was able to reproduce the bug and get some interesting information. Sadly, it was not enough, so currently I'm working on a more enhanced version. It looks like the bug is caused by uninitialized memory somewhere in ckpt, because section_iterate returns some sections which shouldn't exist and weren't sent by the previous master node.

Honzaf,

Try the attached patch. It prevents any of the iterator functions from executing during the synchronization process. I had a look over the ckpt usage code in gfs - it handles the case where iteration returns try again.

Regards
-steve

Created attachment 365573 [details]
patch which may fix problem
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0180.html