Bug 976670 - Spurious disconnects on one of the servers

Status: CLOSED WONTFIX
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.0
Hardware/OS: Unspecified / Unspecified
Priority/Severity: urgent / unspecified
Assigned To: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
Reported: 2013-06-21 03:21 EDT by Sachidananda Urs
Modified: 2015-03-23 03:39 EDT
Doc Type: Bug Fix
Type: Bug
Attachments: None

Description Sachidananda Urs 2013-06-21 03:21:10 EDT
glusterfs 3.3.0.10rhs built on May 29 2013 05:29:15

In this replica-pair setup, one of the servers is unstable (it keeps disconnecting due to operating system crashes); the scenario described below is from the stable server.

The server stops responding under load and the client disconnects with the message:

[2013-06-21 01:14:49.694334] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-xfstest-client-3: server 10.70.34.94:24010 has not responded in the last 42 seconds, disconnecting.

The server in this case, 10.70.34.94, never went down, nor was there any network disruption. However, in the server logs I see that the connection from the client was destroyed (no reason is stated in the logs):

=================================

[2013-06-21 00:50:08.109891] I [server-helpers.c:632:server_connection_destroy] 0-xfstest-server: destroyed connection of van.lab.eng.blr.redhat.com-1503-2013/06/20-21:28:35:869525-xfstest-client-3-0
[2013-06-21 01:14:51.582652] I [server.c:770:server_rpc_notify] 0-xfstest-server: disconnecting connectionfrom hamm.lab.eng.blr.redhat.com-5392-2013/06/20-11:37:09:392614-xfstest-client-3-0
[2013-06-21 01:14:51.582746] I [server-helpers.c:744:server_connection_put] 0-xfstest-server: Shutting down connection hamm.lab.eng.blr.redhat.com-5392-2013/06/20-11:37:09:392614-xfstest-client-3-0
[2013-06-21 01:14:51.582784] I [server-helpers.c:330:do_lock_table_cleanup] 0-xfstest-server: finodelk released on /new-latest/61/linux-3.9.5/block/blk-throttle.c
[2013-06-21 01:14:51.582839] I [server-helpers.c:330:do_lock_table_cleanup] 0-xfstest-server: finodelk released on /new-latest/8/linux-3.9.5/drivers/ata/ata_piix.c

==================================

The applications on the mount fail at this point:

=================================================
xzcat: (stdin): Compressed data is corrupt
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
xzcat: (stdin): Compressed data is corrupt
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
tar: Exiting with failure status due to previous errors

=============================
Comment 2 Sachidananda Urs 2013-06-21 03:33:18 EDT
sosreports can be found at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/976670/
Comment 3 Sachidananda Urs 2013-07-05 02:28:03 EDT
Any status on this bug? This is a serious issue which could end up in data loss.
Comment 4 Raghavendra G 2013-07-05 05:35:35 EDT
<ps-output-from-sos-reports>

root      1549  101  1.0 1146884 40456 ?       Ssl  11:51  32:33 /usr/sbin/glusterfsd 

</ps-output-from-sos-reports>

As can be seen, CPU usage has been high (101%). After speaking to sacchi, it became clear that the load on the brick was very high. The disconnects are probably because of this high load: the thread which picks up requests from the network is probably not getting scheduled on a CPU, causing the ping timer to expire. So I suspect nothing much can be done (from a gluster scalability perspective, like creating more threads etc.) to "fix" this issue.
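
For readers following along, here is a minimal sketch of how such a client-side ping timer behaves (illustration only, not the actual glusterfs rpc code; the structure and function names are assumptions, and only the 42-second window comes from the client log above). If the server's reader thread is starved of CPU and never answers, the timer fires and the client drops a connection to a server that is still alive:

=================================

/* check_ping_timer.c -- hypothetical illustration, not glusterfs source. */
#include <stdio.h>
#include <time.h>

#define PING_TIMEOUT_SECS 42   /* same window as in the client log above */

struct peer_conn {
        time_t last_reply;     /* updated whenever a reply arrives */
        int    connected;
};

/* Called periodically for each connection, e.g. from a timer thread. */
static void check_ping_timer(struct peer_conn *conn)
{
        time_t now = time(NULL);

        if (conn->connected && now - conn->last_reply > PING_TIMEOUT_SECS) {
                /* No reply within the window: treat the server as dead. */
                printf("server has not responded in the last %d seconds, "
                       "disconnecting.\n", PING_TIMEOUT_SECS);
                conn->connected = 0;
        }
}

int main(void)
{
        /* Simulate a server whose last reply was 50 seconds ago. */
        struct peer_conn conn = { .last_reply = time(NULL) - 50, .connected = 1 };
        check_ping_timer(&conn);   /* fires: 50s > 42s without a reply */
        return 0;
}

=================================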

regards,
Raghavendra.
Comment 5 Raghavendra G 2013-07-05 05:49:41 EDT
I can try to set the scheduling priority of the server thread which reads requests from the network to a high value. But that may not solve the problem completely, since we would be making an already overloaded server read even more requests. Even so, this seems a reasonable solution, because the purpose of the ping timer is to detect server deadlock (which is not the case here); the requests themselves can wait in the io-threads request queue. Others, any comments?
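
A rough sketch of what that proposal could look like, using pthread_setschedparam(3) to raise the reader thread's priority (hypothetical code, not glusterfs's; the function name, policy, and priority value are assumptions, and a real-time policy needs root or CAP_SYS_NICE):

=================================

/* raise_reader_priority.c -- hypothetical illustration of the proposal. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int raise_reader_priority(pthread_t reader_thread)
{
        struct sched_param param;
        memset(&param, 0, sizeof(param));

        /* Modest real-time priority; requires root or CAP_SYS_NICE. */
        param.sched_priority = 10;

        int ret = pthread_setschedparam(reader_thread, SCHED_RR, &param);
        if (ret != 0) {
                fprintf(stderr, "pthread_setschedparam: %s\n", strerror(ret));
                return -1;
        }
        return 0;
}

int main(void)
{
        /* For demonstration only, bump the calling thread; in the proposal
         * this would be the brick's network reader (poll) thread instead. */
        if (raise_reader_priority(pthread_self()) == 0)
                printf("reader thread now runs under SCHED_RR\n");
        return 0;
}

=================================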

regards,
Raghavendra.
Comment 6 santosh pradhan 2013-07-05 08:30:50 EDT
Is it possible to get the ps -L -p <brick pid> output which ll tell provide the thread wise break up. May be along with a pstack would help to understand what that thread was doing to consume more CPU ticks.
Comment 7 santosh pradhan 2013-07-05 08:32:00 EDT
(In reply to santosh pradhan from comment #6)

typo corrected.

> Is it possible to get the ps -L -p <brick pid> output which ll provide
> the thread wise break up. May be along with a pstack would help to
> understand what that thread was doing to consume more CPU ticks.
Comment 8 Vivek Agarwal 2015-03-23 03:38:46 EDT
The product version of Red Hat Storage on which this issue was reported has reached End Of Life (EOL) [1], hence this bug report is being closed. If the issue is still observed on a current version of Red Hat Storage, please file a new bug report on the current version.

[1] https://rhn.redhat.com/errata/RHSA-2014-0821.html