1415178 – systemic testing: seeing lot of ping time outs which would lead to splitbrains

Bug 1415178 - systemic testing: seeing lot of ping time outs which would lead to splitbrains

Summary: systemic testing: seeing lot of ping time outs which would lead to splitbrains

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	rpc
Sub Component:
Version:	rhgs-3.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	RHGS 3.3.0
Assignee:	Raghavendra G
QA Contact:	Nag Pavan Chilakam
Docs Contact:
URL:
Whiteboard:
Depends On:	1421938 1427387 1427390
Blocks:	1417147
TreeView+	depends on / blocked

Reported:	2017-01-20 13:31 UTC by Nag Pavan Chilakam
Modified:	2020-07-16 09:08 UTC (History)
CC List:	14 users (show)
Fixed In Version:	glusterfs-3.8.4-26
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1421938 (view as bug list)
Environment:
Last Closed:	2017-09-21 04:30:55 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
profile output (8.17 MB, application/x-gzip) 2017-01-23 10:17 UTC, Nag Pavan Chilakam	no flags	Details
profile log while strace being taken (60.14 KB, text/plain) 2017-01-24 11:41 UTC, Nag Pavan Chilakam	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:2774	0	normal	SHIPPED_LIVE	glusterfs bug fix and enhancement update	2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-01-20 13:31:47 UTC

Description of problem:
=====================
I am seeing lot of ping time outs on my systemic setup clients as below
[2017-01-20 08:07:53.591156] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-6: server 10.70.35.156:49153 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 08:07:53.595238] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-6: socket disconnected
[2017-01-20 08:32:58.781026] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-2: server 10.70.35.156:49152 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 08:32:58.818598] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-2: socket disconnected
[2017-01-20 08:35:31.801622] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-7: server 10.70.37.154:49153 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 08:35:31.802349] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-7: socket disconnected
[2017-01-20 08:39:22.835813] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-7: server 10.70.37.154:49153 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 08:39:22.841445] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-7: socket disconnected
[2017-01-20 08:52:28.949400] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-3: server 10.70.37.154:49152 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 08:52:28.950309] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-3: socket disconnected
[2017-01-20 09:11:15.090477] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-7: server 10.70.37.154:49153 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 09:11:15.091079] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-7: socket disconnected
[2017-01-20 09:15:03.122268] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-7: server 10.70.37.154:49153 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 09:15:03.125457] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-7: socket disconnected
[2017-01-20 09:17:28.141896] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-6: server 10.70.35.156:49153 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 09:17:28.146266] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-6: socket disconnected
[2017-01-20 09:36:10.260003] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-2: server 10.70.35.156:49152 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 09:36:10.329709] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-2: socket disconnected
[2017-01-20 09:39:02.283174] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-3: server 10.70.37.154:49152 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 09:39:02.283875] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-3: socket disconnected
[2017-01-20 09:42:15.311267] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-systemic-client-7: server 10.70.37.154:49153 has not responded in the last 42 seconds, disconnecting.
[2017-01-20 09:42:15.311701] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-systemic-client-7: socket disconnected



This would eventually lead to splitbrains which is not a good place to be in


For more details of testing refer to 

https://bugzilla.redhat.com/show_bug.cgi?id=1415101 - glustershd process crashed on systemic setup 



Version-Release number of selected component (if applicable):
==========
3.8.4-12

Comment 2 Raghavendra G 2017-01-23 05:17:07 UTC

I've some queries:

* what was the disk utilization during testing?
* Can you please get profile info of bricks? I need to see what operations are being performed and their frequency.
* Is it possible to get strace of brick process when tests are being run? You can use following cmdline:
strace -ff -T -p <pid-of-brick> -o <path-where-you-want-the-strace-output-saved>

Following is a possible hypothesis

<hypothesis>

server not able to read ping request.

Me, Kritika and Pranith were about an issue where they had hit ping timer expiry in replicated setups when disk usage was high. The reason for this as Pranith pointed out was,
1. posix has some fops (like posix_xattrop, posix_fxattrop) which do syscalls after holding a lock on inode (inode->lock).
2. During high disk usage scenarios, syscall latencies were high (sometimes >= ping-timeout value)
3. Before being handed over to a new thread at io-threads xlator, a fop gets executed in one of the threads that reads incoming messages from socket. This execution path includes some translators like protocol/server, index, quota-enforcer, marker. And these translators might access inode-ctx which involves locking inode (inode->lock). Due to this locking latency of syscall gets transferred to poller thread. Since poller thread is waiting on inode->lock, it won't be able to read ping requests from network in-time resulting in ping-timer expiry.

</hypothesis>

Comment 3 Nag Pavan Chilakam 2017-01-23 10:17:38 UTC

Created attachment 1243534 [details]
profile output

Comment 5 Nag Pavan Chilakam 2017-01-24 11:41:57 UTC

Created attachment 1243893 [details]
profile log while strace being taken

Comment 6 Raghavendra G 2017-01-25 11:59:41 UTC

I went through strace output and volume profile. Though volume profile shows large max latencies for fops like xattrop/fxattrop, syscall latencies shown by strace is not too much to cause a ping timeout. As to large xattrop/fxattrop latency, the fops are serialized and when there too many (f)xattrops on the same file, the last fop can have a latency which is cumulative of all the (f)xattrops in the waiting queue before it. So, its not really an issue.

Nag also informed me that the disk utilization was not high when ping-timeout happened. So, I doubt whether its high disk utilization that caused this ping-timeout.

ATM, I don't have an RCA, but we can experiment with two things:
1. Increase the poller threads to 8 on bricks
2. Use fix [1]

Note that both the above two steps need to be done before starting tests.

[1] https://review.gluster.org/15046

Comment 7 Mohit Agrawal 2017-01-27 08:13:56 UTC

Hi,

 Can you please confirm about reproducer steps? As per doc It seems below are the steps 

 1) Run the script to create deep directory structure (5 Level) from two different client
 2) Run untar/tar linux.tar command operation 50 times in directory 

Regards
Mohit Agrawal

Comment 13 Mohit Agrawal 2017-02-02 05:21:13 UTC

Hi,


I tried to reproduce the issue as step mentioned in previous comment but i did not get success.
I have checked on the setup (server 10.70.35.20), vm is accesible but it seems all resources(cpu,memory) are consumed and even i am not able to run any linux command on vm.

On the client side below message are continuously coming in logs  

[2017-02-02 05:26:23.661141] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-systemic-client-4: changing port to 49153 (from 0)
[2017-02-02 05:26:23.667104] E [socket.c:2309:socket_connect_finish] 0-systemic-client-4: connection to 10.70.35.20:49153 failed (Connection refused)


Ping timeout issue could be happen if brick process is not able to respond within time period (42 second) and brick is not able to respond because sufficient resources are not available on vm.


Regards
Mohit Agrawal

Comment 14 Nag Pavan Chilakam 2017-02-09 06:17:08 UTC

i had shared the setup.
hence removing needinfo

Comment 15 Nag Pavan Chilakam 2017-02-23 12:32:00 UTC

I have seen this on 3.8.4-14 also, where parallel lookups from 2 clients are being issued.
Note that i ran a small script which echos date every second on all the 4 servers to see if there is any VM pause, but I didn't find any issue there as the echo worked perfectly fine for 2 days, while the clients did see disconnects

logs avaialble at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/final-reg-round/client-facing-brick-disconnects/

refer to clients  rhs-client12 and rhs-client26

Comment 25 Atin Mukherjee 2017-05-19 11:46:18 UTC

upstream patch : https://review.gluster.org/16785

Comment 26 Krutika Dhananjay 2017-05-23 08:37:55 UTC

https://code.engineering.redhat.com/gerrit/#/q/topic:bug-1415178+status:open

Comment 34 Nag Pavan Chilakam 2017-08-10 10:36:41 UTC

I have tested on 3.3 ie 3.8.4-37 for about 1 week and didn't see any ping timeouts. I ran IOs from multiple clients(similar to what was run while bz was raised, but not exactly)
Hence moving to verified.
If i hit it again, i may have to reopen.

Comment 36 errata-xmlrpc 2017-09-21 04:30:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

Comment 37 errata-xmlrpc 2017-09-21 04:56:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

Note You need to log in before you can comment on or make changes to this bug.