Bug 1714460
| Summary: | gluster peer status --xml command is timing out post glusterd restart (due to multiplexing) | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Bala Konda Reddy M <bmekala> |
| Component: | glusterd | Assignee: | Sanju <srakonde> |
| Status: | CLOSED DEFERRED | QA Contact: | Bala Konda Reddy M <bmekala> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.5 | CC: | amukherj, rhs-bugs, rkavunga, srakonde, storage-qa-internal, vbellur, vdas |
| Target Milestone: | --- | Keywords: | Automation |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | shd-multiplexing | ||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-01-21 09:26:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Bala Konda Reddy M
2019-05-28 07:42:53 UTC
When I tried to reproduce this in Bala's setup, there were no volumes present in the cluster. After creating one volume I'm able to hit this issue. In glusterd.log I have observed the following. If we execute the peer status command once glustershd is started, it works as expected. But If we execute the command, just after friend handshake complete it hungs. Also, one thing I didn't understand is why glustershd is started and stopped several times, once the glusterd is restarted? I see different pid's in the following log. [2019-05-30 05:39:51.667174] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e [2019-05-30 05:39:51.872878] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:51.900464] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:51.900700] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service [2019-05-30 05:39:52.907268] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21169 [2019-05-30 05:39:53.907629] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:54.274356] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:54.274626] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service [2019-05-30 05:39:55.281439] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21181 [2019-05-30 05:39:56.281810] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:56.307311] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:56.307605] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service [2019-05-30 05:39:57.313569] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 21193 [2019-05-30 05:39:58.313830] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service is stopped [2019-05-30 05:39:58.356303] I [rpc-clnt.c:1014:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2019-05-30 05:39:58.356562] I [MSGID: 106567] [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting glustershd service Rafi, can you please take a look at this? If you are loaded with work, please provide me with some pointers to debug this issue. Thanks, Sanju (In reply to Sanju from comment #4) > When I tried to reproduce this in Bala's setup, there were no volumes > present in the cluster. After creating one volume I'm able to hit this issue. > > In glusterd.log I have observed the following. If we execute the peer status > command once glustershd is started, it works as expected. But If we execute > the command, just after friend handshake complete it hungs. Also, one thing > I didn't understand is why glustershd is started and stopped several times, > once the glusterd is restarted? I see different pid's in the following log. > > [2019-05-30 05:39:51.667174] I [MSGID: 106493] > [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received > ACC from uuid: c621234d-e846-4f82-9f05-062b01b5095e > [2019-05-30 05:39:51.872878] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:51.900464] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:51.900700] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-05-30 05:39:52.907268] I [MSGID: 106568] > [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 21169 > [2019-05-30 05:39:53.907629] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:54.274356] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:54.274626] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-05-30 05:39:55.281439] I [MSGID: 106568] > [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 21181 > [2019-05-30 05:39:56.281810] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:56.307311] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:56.307605] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > [2019-05-30 05:39:57.313569] I [MSGID: 106568] > [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping > glustershd daemon running in pid: 21193 > [2019-05-30 05:39:58.313830] I [MSGID: 106568] > [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: glustershd service > is stopped > [2019-05-30 05:39:58.356303] I [rpc-clnt.c:1014:rpc_clnt_connection_init] > 0-glustershd: setting frame-timeout to 600 > [2019-05-30 05:39:58.356562] I [MSGID: 106567] > [glusterd-svc-mgmt.c:226:glusterd_svc_start] 0-management: Starting > glustershd service > > Rafi, can you please take a look at this? If you are loaded with work, > please provide me with some pointers to debug this issue. > > Thanks, > Sanju This is an expected behavior where we call svc manager multiple times during glusterd restart, especially when you have more than 1 node in the trusted storage pool. This is not just for shd, but for all daemons. Particular to shd, I have sent a patch to optimise the restart scenario https://review.gluster.org/#/c/glusterfs/+/22729/. But that won't help with other daemons. Nevertheless, Do you have any reason why you suspect calling multiple managers will lead to a deadlock? |