Bug 207301 - stuck clvmd process unable to be killed on one node in cluster
Status: CLOSED DUPLICATE of bug 211914
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2006-09-20 11:46 EDT by Corey Marthaler
Modified: 2009-04-16 16:01 EDT
CC List: 6 users

Doc Type: Bug Fix
Last Closed: 2006-11-07 06:05:52 EST

Description Corey Marthaler 2006-09-20 11:46:36 EDT
Description of problem:
I had an x86_64 cluster (taft-01 through taft-04) running lvm_config, which
basically just starts clvmd, creates volumes, alters them, and then stops
clvmd. After 70+ iterations, the clvmd process on taft-04 could not be killed.
clvmd had stopped on the other three nodes, and cman_tool services showed the
service stopped on taft-04 as well, but the process just would not die.
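
For context, each lvm_config iteration amounts to roughly the following loop
(a minimal sketch; the PV path /dev/sdb1 and the testvg/testlv names are
illustrative placeholders, not the actual test configuration):

#!/bin/bash
# Rough shape of one lvm_config pass; device and names are placeholders.
for i in $(seq 1 100); do
    service clvmd start || exit 1       # join the cluster/DLM lockspace
    vgcreate -cy testvg /dev/sdb1       # create a clustered volume group
    lvcreate -n testlv -L 64M testvg    # create a volume
    lvchange -an testvg/testlv          # alter/deactivate it
    lvremove -f testvg/testlv
    vgremove testvg
    service clvmd stop || exit 1        # leave the lockspace (the stop that hung)
done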

[root@taft-04 ~]# ps -elf | grep clvmd
5 D root     14912     1  0  76   0 - 22466 kthrea Sep19 ?        00:00:01 clvmd

[root@taft-04 ~]# dmsetup ls
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)

[root@taft-04 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 4 2 3]

[root@taft-04 ~]# cat /proc/cluster/dlm_locks
[root@taft-04 ~]# cat /proc/cluster/dlm_dir
[root@taft-04 ~]# cat /proc/cluster/dlm_debug
d updated 0 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 2765 done
clvmd move flags 0,0,1 ids 2764,2765,2765
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 2765 finished
clvmd move flags 1,0,0 ids 2765,2765,2765
clvmd move flags 0,1,0 ids 2765,2766,2765
clvmd move use event 2766
clvmd recover event 2766
clvmd remove node 2
clvmd total nodes 1
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd purge requests
clvmd purged 0 requests
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
clvmd updated 0 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 2766 done
clvmd move flags 0,0,1 ids 2765,2766,2766
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 2766 finished
clvmd move flags 1,0,0 ids 2766,2766,2766
[root@taft-04 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:       2501
Unlock operations:     2501
Convert operations:     401
Completion ASTs:       5403
Blocking ASTs:            0

Lockqueue        num  waittime   ave
WAIT_RSB        1955      4614     2
WAIT_CONV        315        69     0
WAIT_GRANT       612      1470     2
WAIT_UNLOCK      563       153     0
Total           3445      6306     1
[root@taft-04 ~]# cat /proc/cluster/sm_debug
3
01000248 uevent state 8 node 2
01000248 uevent state 10 node 2
01000248 del node 2 count 1
01000248 uevent state 12 node 2
01000248 uevent state 7 node 2
01000248 sevent state 10
01000248 sevent state 12
01000248 sevent state 14
01000248 sevent state 16

Kernel stack traces of the relevant threads, from the system log:

Sep 20 05:31:55 taft-04 kernel: dlm_recoverd  S 0000000000000000     0 14916    10               14914 (L-TLB)
Sep 20 05:31:55 taft-04 kernel: 00000101fcacdea8 0000000000000046 0000000000000000 0000000000000ace
Sep 20 05:31:55 taft-04 kernel:        000001021791c030 0000000000000073 0000010001061a40 0000000000000246
Sep 20 05:31:55 taft-04 kernel:        0000010216150030 00000000000047f9
Sep 20 05:31:55 taft-04 kernel: Call Trace:<ffffffffa024d8a7>{:dlm:wake_astd+27} <ffffffffa025cf60>{:dlm:dlm_recoverd+60}
Sep 20 05:31:55 taft-04 kernel:        <ffffffffa025cf24>{:dlm:dlm_recoverd+0} <ffffffff8014b4f0>{keventd_create_kthread+0}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff8014b4c7>{kthread+200} <ffffffff80110f47>{child_rip+8}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff8014b4f0>{keventd_create_kthread+0} <ffffffff8014b3ff>{kthread+0}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff80110f3f>{child_rip+0}
Sep 20 05:31:55 taft-04 kernel: btimed        S 0000000000000100     0 15547  3218                     (NOTLB)
Sep 20 05:31:55 taft-04 kernel: 0000010207bdfb18 0000000000000002 ffffffff803d4400 0000000000001000
Sep 20 05:31:55 taft-04 kernel:        00000102165e7000 ffffffff802bed60 00000000ffffffff 00000000ffffffff
Sep 20 05:31:55 taft-04 kernel:        00000101fba9e7f0 0000000000000111
Sep 20 05:31:55 taft-04 kernel: Call Trace:<ffffffff802bed60>{qdisc_restart+30} <ffffffff802b0055>{dev_queue_xmit+530}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff802cc146>{ip_finish_output+366} <ffffffff8030ac54>{schedule_timeout+224}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff801356aa>{prepare_to_wait_exclusive+21} <ffffffff802acd75>{skb_recv_datagram+373}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff80135752>{autoremove_wake_function+0} <ffffffff80135752>{autoremove_wake_function+0}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff802e9584>{udp_recvmsg+118} <ffffffff802aa872>{sock_common_recvmsg+48}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff802a7394>{sock_recvmsg+284} <ffffffff80186dc0>{link_path_walk+176}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff80135752>{autoremove_wake_function+0} <ffffffff802a6f97>{sockfd_lookup+16}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff802a87c3>{sys_recvfrom+182} <ffffffff8013fcb4>{__mod_timer+293}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff8014062f>{sys_alarm+53} <ffffffff8011026a>{system_call+126}
Sep 20 05:31:55 taft-04 kernel:
Sep 20 05:31:55 taft-04 kernel: strace        S 0000007fbfffd6f4     0  5936  4264 14912               (NOTLB)
Sep 20 05:31:55 taft-04 kernel: 000001020226beb8 0000000000000006 000001020226be78 0000000000000014
Sep 20 05:31:55 taft-04 kernel:        000001020226bef8 ffffffff80143439 0000000000640030 000000020226bf58
Sep 20 05:31:55 taft-04 kernel:        00000101fc72d030 0000000000002249
Sep 20 05:31:55 taft-04 kernel: Call Trace:<ffffffff80143439>{get_signal_to_deliver+1117} <ffffffff8010f6fb>{do_signal+131}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff8013b87d>{do_wait+3298} <ffffffff80133da9>{default_wake_function+0}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff8013f344>{ptrace_check_attach+179} <ffffffff80133da9>{default_wake_function+0}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff801102f3>{sysret_signal+28} <ffffffff8011026a>{system_call+126}
Sep 20 05:31:55 taft-04 kernel:
Sep 20 05:31:55 taft-04 kernel: clvmd         D 0000000000000000     0 14912  5936                     (NOTLB)
Sep 20 05:31:55 taft-04 kernel: 00000101fb5c5d98 0000000000000002 0000010006939030 0000000300000206
Sep 20 05:31:55 taft-04 kernel:        0000010217a3b7f0 000000000000085b 000028f26f39f97c 0000000306939030
Sep 20 05:31:55 taft-04 kernel:        0000010217a3b7f0 0000000000000331
Sep 20 05:31:55 taft-04 kernel: Call Trace:<ffffffff8030a405>{wait_for_completion+167} <ffffffff80133da9>{default_wake_function+0}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff80133da9>{default_wake_function+0} <ffffffff8014b7c0>{kthread_stop+147}
Sep 20 05:31:55 taft-04 kernel:        <ffffffffa0254b6c>{:dlm:release_lockspace+248} <ffffffffa024e3dc>{:dlm:unregister_lockspace+27}
Sep 20 05:31:55 taft-04 kernel:        <ffffffffa024f9c6>{:dlm:dlm_write+2467} <ffffffff801796bc>{vfs_write+207}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff801797a4>{sys_write+69} <ffffffff8011026a>{system_call+126}
Sep 20 05:31:55 taft-04 kernel:
Sep 20 05:31:55 taft-04 kernel: clvmd         T 0000010216bc7708     0 14940    1                5264 (NOTLB)
Sep 20 05:31:55 taft-04 kernel: 00000101fc9abdf8 0000000000000002 0000000000000002 0000000000000000
Sep 20 05:31:55 taft-04 kernel:        0000000000000000 000000000055a9e0 0000000000000012 000000038013271e
Sep 20 05:31:55 taft-04 kernel:        0000010216bc7030 00000000000002af
Sep 20 05:31:55 taft-04 kernel: Call Trace:<ffffffff80142dc8>{finish_stop+142} <ffffffff801433ab>{get_signal_to_deliver+975}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff8010f6fb>{do_signal+131} <ffffffff80110129>{sys_rt_sigreturn+736}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff80133da9>{default_wake_function+0} <ffffffff8014cc9b>{sys_futex+203}
Sep 20 05:31:55 taft-04 kernel:        <ffffffff801102f3>{sysret_signal+28} <ffffffff801105df>{ptregscall_common+103}



Version-Release number of selected component (if applicable):
[root@taft-04 tmp]# rpm -q lvm2
lvm2-2.02.06-6.0.RHEL4B3
[root@taft-04 tmp]# rpm -q lvm2-cluster
lvm2-cluster-2.02.06-7.0.RHEL4B3


How reproducible:
First time I've ever seen this.
Comment 1 Christine Caulfield 2006-09-21 06:27:36 EDT
It looks like clvmd is releasing the lockspace and waiting for dlm_recoverd to
finish. It's not clear why dlm_recoverd is stopped, though. It looks like it
might be waiting for dlm_astd, but I don't know what that thread is doing, and
I can't find anywhere in the code where it would wait for any length of time
while holding the semaphore.

If it happens again, can you get a full list of processes please? :-)
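
Something along these lines should capture enough state (a sketch: it assumes
the magic SysRq key is enabled, e.g. kernel.sysrq=1, and uses 14912, the stuck
clvmd PID from the report above, as an example):

# Dump every task's kernel stack into the kernel log, then save it
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq-t.out
# Full process list including each task's kernel wait channel
ps axo stat,pid,ppid,wchan:32,cmd > /tmp/ps-wchan.out
# Wait channel of the stuck clvmd itself
cat /proc/14912/wchan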
Comment 3 Christine Caulfield 2006-11-07 06:05:52 EST
I've just seen bug 211914 and it looks identical to me.

*** This bug has been marked as a duplicate of 211914 ***
