Bug 791182

Summary: [glusterfs-3.3.0qa22]: glusterd entering an infinite loop and using 100% CPU
Product: [Community] GlusterFS
Reporter: Raghavendra Bhat <rabhat>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED WORKSFORME
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: mainline
CC: gluster-bugs, nsathyan, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-16 09:16:05 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Raghavendra Bhat 2012-02-16 07:28:21 EST
Description of problem:
A 4-peer cluster in which a 2x2 distributed-replicate volume is running. glusterd on one of the peers has entered an infinite loop while iterating over the list of peers (it never exits the loop), causing close to 100% CPU usage.
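
For context, here is a minimal sketch (simplified and hypothetical, not the actual glusterd source) of the pattern glusterd_friend_find_by_hostname uses, based on the backtrace and source listing below. list_for_each_entry walks priv->peers and only terminates once the traversal gets back to the list head, so if the peer list is corrupted such that no node's next pointer ever leads back to the head, the loop can never exit:

#include <stddef.h>
#include <strings.h>

/* Simplified kernel-style intrusive list, as used by glusterd's peer list. */
struct list_head {
        struct list_head *next, *prev;
};

/* Hypothetical, trimmed-down stand-in for the glusterd peerinfo structure. */
struct peerinfo {
        char             *hostname;
        struct list_head  uuid_list;    /* linked into priv->peers */
};

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof (type, member)))

/* Same termination condition list_for_each_entry relies on: stop when the
 * cursor returns to the head.  If the list never leads back to 'peers',
 * 'pos != peers' never becomes false and this spins forever. */
static struct peerinfo *
find_by_hostname (struct list_head *peers, const char *hoststr)
{
        struct list_head *pos;

        for (pos = peers->next; pos != peers; pos = pos->next) {
                struct peerinfo *entry =
                        container_of (pos, struct peerinfo, uuid_list);

                if (!strncasecmp (entry->hostname, hoststr, 1024))
                        return entry;
        }

        return NULL;    /* reached only if the traversal loops back to 'peers' */
}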

This is the backtrace when gdb was attached to the running glusterd process.

0x00000034c232b500 in __strncasecmp_l_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64
(gdb) bt
#0  0x00000034c232b500 in __strncasecmp_l_sse42 () from /lib64/libc.so.6
#1  0x00007f038a916f29 in glusterd_friend_find_by_hostname (hoststr=0x1ddd3b0 "10.1.11.130", peerinfo=0x7ffff003d180)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-utils.c:3608
#2  0x00007f038a9173b1 in glusterd_hostname_to_uuid (hostname=0x1ddd3b0 "10.1.11.130", 
    uuid=0x1dde7c0 "\274+\370\270\250\343Fd\212\235\036JP<\257\351\311]") at ../../../../../xlators/mgmt/glusterd/src/glusterd-utils.c:3683
#3  0x00007f038a90cb29 in glusterd_resolve_brick (brickinfo=0x1ddd3b0) at ../../../../../xlators/mgmt/glusterd/src/glusterd-utils.c:665
#4  0x00007f038a9495bb in glusterd_op_stage_start_volume (dict=0x1df16b0, op_errstr=0x7ffff003dad0)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-volume-ops.c:790
#5  0x00007f038a907cd0 in glusterd_op_stage_validate (op=GD_OP_START_VOLUME, dict=0x1df16b0, op_errstr=0x7ffff003dad0, rsp_dict=0x1df17c0)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-op-sm.c:2305
#6  0x00007f038a907610 in glusterd_op_ac_stage_op (event=0x1df19f0, ctx=0x1dd8d00)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-op-sm.c:2144
#7  0x00007f038a90a872 in glusterd_op_sm () at ../../../../../xlators/mgmt/glusterd/src/glusterd-op-sm.c:3600
#8  0x00007f038a8f829f in glusterd_handle_stage_op (req=0x7f038a59904c) at ../../../../../xlators/mgmt/glusterd/src/glusterd-handler.c:601
#9  0x00007f038e29cf89 in rpcsvc_handle_rpc_call (svc=0x1dc8de0, trans=0x1de8a80, msg=0x1dd0f50) at ../../../../rpc/rpc-lib/src/rpcsvc.c:514
#10 0x00007f038e29d32c in rpcsvc_notify (trans=0x1de8a80, mydata=0x1dc8de0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x1dd0f50)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:610
#11 0x00007f038e2a2c08 in rpc_transport_notify (this=0x1de8a80, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x1dd0f50)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#12 0x00007f038a38e23d in socket_event_poll_in (this=0x1de8a80) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1675
#13 0x00007f038a38e7c1 in socket_event_handler (fd=11, idx=12, data=0x1de8a80, poll_in=1, poll_out=0, poll_err=0)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1790
#14 0x00007f038e4fbc20 in event_dispatch_epoll_handler (event_pool=0x1dc7b80, events=0x1dd4b70, i=0) at ../../../libglusterfs/src/event.c:794
#15 0x00007f038e4fbe43 in event_dispatch_epoll (event_pool=0x1dc7b80) at ../../../libglusterfs/src/event.c:856
#16 0x00007f038e4fc1ce in event_dispatch (event_pool=0x1dc7b80) at ../../../libglusterfs/src/event.c:956
#17 0x0000000000407c2e in main (argc=1, argv=0x7ffff003e118) at ../../../glusterfsd/src/glusterfsd.c:1601
(gdb) f 1
#1  0x00007f038a916f29 in glusterd_friend_find_by_hostname (hoststr=0x1ddd3b0 "10.1.11.130", peerinfo=0x7ffff003d180)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-utils.c:3608
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) p entry->hostname
$1 = 0x1de58b0 "10.1.11.144"
(gdb) p hoststr
$2 = 0x1ddd3b0 "10.1.11.130"
(gdb) l
3603            priv    = THIS->private;
3604
3605            GF_ASSERT (priv);
3606
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
3608                    if (!strncasecmp (entry->hostname, hoststr,
3609                                      1024)) {
3610
3611                            gf_log ("glusterd", GF_LOG_DEBUG,
3612                                     "Friend %s found.. state: %d", hoststr,
(gdb) p *entry
$3 = {uuid = "\355\037\333\001\222\006As\233\t\356\000\071\264\207\036", uuid_str = '\000' <repeats 49 times>, state = {
    state = GD_FRIEND_STATE_BEFRIENDED, transition_time = {tv_sec = 0, tv_usec = 0}}, hostname = 0x1de58b0 "10.1.11.144", port = 0, 
  uuid_list = {next = 0x1de5a20, prev = 0x1de5a20}, op_peers_list = {next = 0x0, prev = 0x0}, rpc = 0x1de6690, mgmt = 0x7f038ab7a3a0, 
  peer = 0x7f038ab7a420, connected = 1, shandle = 0x1de5860, sm_log = {transitions = 0x1de5ad0, current = 3, size = 50, count = 4, 
    state_name_get = 0x7f038a8fed8e <glusterd_friend_sm_state_name_get>, 
    event_name_get = 0x7f038a8fedc4 <glusterd_friend_sm_event_name_get>}}
(gdb)  n
Single stepping until exit from function __strncasecmp_l_sse42,
which has no line number information.
glusterd_friend_find_by_hostname (hoststr=0x1ddd3b0 "10.1.11.130", peerinfo=0x7ffff003d180)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-utils.c:3607
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb) 
3608                    if (!strncasecmp (entry->hostname, hoststr,
(gdb) 
3607            list_for_each_entry (entry, &priv->peers, uuid_list) {
(gdb)  p priv->peers
$4 = {next = 0x1de2a00, prev = 0x1de5a20}
(gdb) p priv->peers->next
$5 = (struct list_head *) 0x1de2a00
(gdb) p priv->peers->next->next
$6 = (struct list_head *) 0x1de4170
(gdb) p priv->peers->next->next->next
$7 = (struct list_head *) 0x1de5a20
(gdb) p priv->peers->next->next->next->next
$8 = (struct list_head *) 0x1de5a20
(gdb) p priv->peers->next->next->next->next->next
$9 = (struct list_head *) 0x1de5a20
(gdb) p priv->peers->next->next->next->next->next->next
$10 = (struct list_head *) 0x1de5a20
(gdb) p priv->peers->next->next->next->next->next->next->next
$11 = (struct list_head *) 0x1de5a20
(gdb) 
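
What the pointer walk above shows: starting from &priv->peers (next = 0x1de2a00), the chain goes 0x1de2a00 -> 0x1de4170 -> 0x1de5a20, and from 0x1de5a20 onward next keeps pointing back to 0x1de5a20 itself. The traversal therefore never returns to the list head, which is the only condition under which list_for_each_entry stops, so glusterd keeps bouncing between the two source lines seen while single stepping. A hedged sketch of the kind of defensive bounded walk that would catch this corruption (purely illustrative; no such check exists in glusterd here):

/* Illustrative only: walk the list for at most 'limit' steps and flag
 * corruption if a node points at itself or the head is never reached,
 * which is exactly the state seen in the gdb session above. */
struct list_head {
        struct list_head *next, *prev;
};

static int
peer_list_is_sane (struct list_head *head, unsigned int limit)
{
        struct list_head *pos   = head->next;
        unsigned int      steps = 0;

        while (pos != head) {
                if (pos->next == pos)
                        return 0;       /* self-loop, e.g. 0x1de5a20 above */
                if (++steps > limit)
                        return 0;       /* never got back to the head      */
                pos = pos->next;
        }

        return 1;                       /* traversal returned to the head  */
}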





Version-Release number of selected component (if applicable): glusterfs-3.3.0qa22


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:

glusterd enters an infinite loop and uses 100% CPU.

Expected results:

glusterd should not enter an infinite loop and therefore should not use 100% CPU.


Additional info:

glusterd log:

[2012-02-16 00:58:00.724257] I [glusterd-utils.c:262:glusterd_lock] 0-glusterd: Cluster lock held by c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 00:58:00.724327] I [glusterd-handler.c:1247:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-02-16 00:58:00.725060] I [glusterd-handler.c:1291:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 00:58:00.725119] I [glusterd-handler.c:1267:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-02-16 01:03:33.805566] I [glusterd-handler.c:487:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:03:33.805625] I [glusterd-utils.c:262:glusterd_lock] 0-glusterd: Cluster lock held by c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:03:33.805675] I [glusterd-handler.c:1247:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-02-16 01:03:33.806401] I [glusterd-handler.c:1291:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:03:33.806453] I [glusterd-handler.c:1267:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-02-16 01:09:06.889849] I [glusterd-handler.c:487:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:09:06.889912] I [glusterd-utils.c:262:glusterd_lock] 0-glusterd: Cluster lock held by c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:09:06.889990] I [glusterd-handler.c:1247:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-02-16 01:09:06.890855] I [glusterd-handler.c:1291:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:09:06.890926] I [glusterd-handler.c:1267:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-02-16 01:14:39.988029] I [glusterd-handler.c:487:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:14:39.988120] I [glusterd-utils.c:262:glusterd_lock] 0-glusterd: Cluster lock held by c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:14:39.988189] I [glusterd-handler.c:1247:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-02-16 01:14:39.988862] I [glusterd-handler.c:1291:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:14:39.988914] I [glusterd-handler.c:1267:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-02-16 01:20:13.096929] I [glusterd-handler.c:487:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:20:13.097017] I [glusterd-utils.c:262:glusterd_lock] 0-glusterd: Cluster lock held by c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:20:13.097090] I [glusterd-handler.c:1247:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-02-16 01:20:13.097628] I [glusterd-handler.c:1291:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:20:13.097683] I [glusterd-handler.c:1267:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-02-16 01:25:46.253557] I [glusterd-handler.c:487:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:25:46.253680] I [glusterd-utils.c:262:glusterd_lock] 0-glusterd: Cluster lock held by c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:25:46.253743] I [glusterd-handler.c:1247:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-02-16 01:25:46.254283] I [glusterd-handler.c:1291:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:25:46.254333] I [glusterd-handler.c:1267:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-02-16 01:31:19.335760] I [glusterd-handler.c:487:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:31:19.335831] I [glusterd-utils.c:262:glusterd_lock] 0-glusterd: Cluster lock held by c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
[2012-02-16 01:31:19.335913] I [glusterd-handler.c:1247:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
:

gluster peer status output on the other peers of the cluster:

10.1.11.131:

 gluster peer status
Number of Peers: 3

Hostname: 10.1.11.130
Uuid: bc2bf8b8-a8e3-4664-8a9d-1e4a503cafe9
State: Peer in Cluster (Connected)

Hostname: 10.1.11.145
Uuid: 1f3f3e03-3ba9-43c7-ac85-3a31a4f0bdbd
State: Peer in Cluster (Connected)

Hostname: 10.1.11.144
Uuid: ed1fdb01-9206-4173-9b09-ee0039b4871e
State: Peer in Cluster (Connected)

10.1.11.144:

 gluster peer status
Number of Peers: 3

Hostname: 10.1.11.131
Uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
State: Peer in Cluster (Connected)

Hostname: 10.1.11.145
Uuid: 1f3f3e03-3ba9-43c7-ac85-3a31a4f0bdbd
State: Peer in Cluster (Connected)

Hostname: 10.1.11.130
Uuid: bc2bf8b8-a8e3-4664-8a9d-1e4a503cafe9
State: Peer in Cluster (Connected)

10.1.11.145:

gluster peer status
Number of Peers: 4

Hostname: 10.1.11.130
Uuid: bc2bf8b8-a8e3-4664-8a9d-1e4a503cafe9
State: Peer in Cluster (Connected)

Hostname: 10.1.11.131
Uuid: c8128c3e-efeb-4fc6-85c3-f9a8e3a4e52b
State: Peer in Cluster (Connected)

Hostname: 10.1.11.144
Uuid: ed1fdb01-9206-4173-9b09-ee0039b4871e
State: Peer in Cluster (Connected)

Hostname: 10.1.11.144
Uuid: ed1fdb01-9206-4173-9b09-ee0039b4871e
State: Peer in Cluster (Connected)
[root@RHEL6 ~]# 

In the output above, 10.1.11.144 is listed twice on 10.1.11.145.
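
This duplicate entry lines up with the gdb data above: the peerinfo for 10.1.11.144 has uuid_list = {next = 0x1de5a20, prev = 0x1de5a20}, i.e. its list node points at itself. One way such a state can arise (an unconfirmed hypothesis, sketched below with simplified list helpers) is the same peerinfo being added to priv->peers twice: a second list_add_tail of a node that is already the tail leaves the node's next and prev pointing at itself, which breaks loop termination and is also consistent with one peer showing up twice in listings.

/* Sketch of the double-insertion hypothesis (simplified kernel-style list
 * helper; illustration only, not the glusterd code).  Adding the same node
 * to the tail twice leaves node->next == node->prev == &node, matching the
 * uuid_list = {next = 0x1de5a20, prev = 0x1de5a20} seen in gdb above. */
#include <assert.h>

struct list_head {
        struct list_head *next, *prev;
};

static void
list_add_tail (struct list_head *node, struct list_head *head)
{
        struct list_head *prev = head->prev;

        head->prev = node;
        node->next = head;
        node->prev = prev;
        prev->next = node;      /* if prev == node, this sets node->next = node */
}

int
main (void)
{
        struct list_head head = { &head, &head };
        struct list_head node;

        list_add_tail (&node, &head);   /* legitimate first insertion           */
        list_add_tail (&node, &head);   /* same node again: prev is node itself */

        assert (node.next == &node);    /* node now loops back to itself ...    */
        assert (node.prev == &node);    /* ... so traversal never reaches head  */
        return 0;
}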


Output of top on the peer where glusterd is using 100% CPU:

top - 07:25:39 up 2 days, 21 min,  2 users,  load average: 1.39, 1.61, 1.77
Tasks: 128 total,   3 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.2%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2054952k total,  1906528k used,   148424k free,    11592k buffers
Swap:  1048568k total,   111052k used,   937516k free,   535276k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                         
22243 root      20   0  204m  66m 1120 R 100.1  3.3 256:24.38 glusterd                                                                        
25293 root      20   0 15016 1228  940 R  0.3  0.1   0:00.02 top                                                                              
    1 root      20   0 19324 1032  836 S  0.0  0.1   0:01.02 init                                                                             
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd                                                                         
    3 root      RT   0     0    0    0 S  0.0  0.0   0:01.25 migration/0                                                                      
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.17 ksoftirqd/0                                                                      
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0                                                                      
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                                                                       
    7 root      RT   0     0    0    0 S  0.0  0.0   0:01.87 migration/1                                                                      
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1                                                                      
    9 root      20   0     0    0    0 S  0.0  0.0
Comment 1 krishnan parthasarathi 2012-04-06 02:34:35 EDT
Johnny, could you check if this issue continues to happen?