Bug 1790208

Summary: When the network of the second server is disconnected, applications on the first hang for the duration of ping.timeout
Product: [Community] GlusterFS
Reporter: vebmasterHtml <vbnmail>
Component: glusterd
Assignee: Sanju <srakonde>
Status: CLOSED NOTABUG
QA Contact:
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 6
CC: bugs, pasik, sabose, srakonde
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-13 11:31:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description vebmasterHtml 2020-01-12 15:35:05 UTC
Hi.
Ubuntu 18.04

Description of problem:
If you disconnect the second server from the network, it becomes impossible to access files on server 1: the program freezes for the duration of "network.ping-timeout" or longer.


Version-Release number of selected component (if applicable):
The problem is present on two versions I use:
glusterfs 6.7
glusterfs 7.1


How reproducible:
File 1.php for server 1:
--------------------
<?php

// Write and read back a small file on the gluster mount roughly twice a second,
// printing a timestamp so any stall becomes visible in the output.
while(1)
{
	file_put_contents("/mnt/gluster/text.txt", "text message");
	$txt = file_get_contents("/mnt/gluster/text.txt");
	echo date("H:i:s") . " : $txt \n";
	usleep(490*1000);
}
--------------------


Steps to Reproduce:
server 1: run "php 1.php"
server 2: run "service networking stop"
server 1: watch the output of "php 1.php"
my video: https://yadi.sk/i/v6ghR2ETk8wF_A


Actual results:
17:54:24 : text message
17:54:24 : text message
17:54:25 : text message
17:54:25 : text message
17:54:26 : text message
17:54:26 : text message
17:54:27 : text message
17:54:27 : text message
17:54:28 : text message
17:54:28 : text message
17:54:29 : text message
17:54:48 : text message
17:54:48 : text message
17:54:49 : text message
17:54:49 : text message
17:54:50 : text message
17:54:50 : text message

Between 17:54:29 and 17:54:48 there is a pause of 19 seconds.


Additional info:
I think the application should not hang like this.
Thx.

Comment 1 Sanju 2020-01-13 06:02:39 UTC
Hi,

Can you please elaborate? Share details such as how many nodes are in the cluster, the configuration of the volume that is facing the issue, and the exact operations performed that led to this.

Thanks,
Sanju

Comment 2 vebmasterHtml 2020-01-13 08:47:04 UTC
Two servers: "s1" and "s2"

install:
add-apt-repository ppa:gluster/glusterfs-6
or
add-apt-repository ppa:gluster/glusterfs-7
apt install glusterfs-server

On s1 and s2:
mkdir /mnt/dir1

On s1:
gluster peer probe s2
gluster volume create vol02 replica 2 transport tcp s1:/mnt/dir1 s2:/mnt/dir1 force
gluster volume set vol02 network.ping-timeout 10

On s1 and s2 check:
gluster peer status

Mount on s1 and s2:
mkdir /mnt/gluster
mount.glusterfs localhost:/vol02 /mnt/gluster
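
To confirm the timeout actually took effect on the volume (gluster volume get shows the effective value of an option; the option name is the one set above):
gluster volume get vol02 network.ping-timeout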

Comment 3 Sanju 2020-01-13 11:31:28 UTC
Hi,

This is the expected behaviour.

A comment below from Raghavendra G explains it:
"
Maximum latency of a single fop from the application/kernel during a single ungraceful shutdown (hard reboot, ethernet cable pull, hard power down, etc.) of a hyperconverged node (which has a brick and a client of the same volume) depends on the following:

1. Time required for the client to fail the operations pending on the rebooted brick. These operations can include lock and non-lock operations like (f)inodelk, write, lookup, (f)stat etc. Since this requires the client to identify the unresponsive/dead brick, it is bound by (2 * network.ping-timeout).

2. Time required for the client to acquire a lock on a healthy brick (as clients can be doing transactions in afr). Note that the lock request could conflict with a lock already granted to the dead client on the rebooted node. So, the lock request from the healthy client to a healthy brick cannot proceed until the stale lock from the dead client is cleaned up. This means the healthy brick needs to identify that the client is dead. A brick can identify that a client connected to it is dead using the combination of (tcp-user-timeout and keepalive) tunings on the brick/server. There are quite a few scenarios in this case:
   2a. The healthy brick never writes a response to the dead client. In this case the tcp-keepalive tunings on the server ((server.keepalive-time + server.keepalive-interval * server.keepalive-count) seconds after the last communication with the dead client) bound the maximum time required for the brick to clean up stale locks from the dead client. server.tcp-user-timeout has no role in this case.
   2b. The healthy brick writes a response (maybe to one of the requests the dead client sent before it died) to the socket. Note that writing a response to the socket doesn't necessarily mean the dead client read the response.
         2b.i. The healthy brick tries to write a response after the keepalive timer has expired since its last communication with the dead client (in reality it can't, as keepalive timer expiry would close the connection). In this case, since the keepalive timer has already closed the connection, the maximum time for the brick to identify the dead client is bound by the server.keepalive tunings.
         2b.ii. The healthy brick writes a response to the socket immediately after the last communication with the dead client (i.e., the last acked communication with the dead client). In this case the healthy brick terminates the connection to the dead client in server.tcp-user-timeout seconds after the last successful communication with the dead client.
         2b.iii. The healthy brick writes a response before the keepalive timer has expired since its last communication with the dead client (case explained by comment #140), i.e. the response is written after keepalive is triggered but before it has expired. In this case, the tcp-keepalive timer is stopped and the tcp-user-timeout timer is started. So, the healthy brick can identify the dead client at a maximum of (server.tcp-user-timeout + server.keepalive) seconds after the last communication with the dead client.

Note that 1 and 2 can happen serially based on different transactions done by afr.

So the worst case/maximum latency of a fop from application is bounded by (2 * network.ping-timeout + server.tcp-user-timeout + (server.keepalive-time + server.keepalive-interval * server.keepalive-count))
"

Since this is expected, closing it as not a bug.

Comment 4 vebmasterHtml 2020-01-13 11:45:30 UTC
Thanks.
But I believe this is the wrong behaviour for the program.
Continuous operation of programs on the local machine should come first, and only then synchronization.
Whatever happens with the other servers in the cluster, the current server should continue to work as if glusterfs did not exist.
That's my opinion.

Comment 5 vebmasterHtml 2020-01-13 11:49:24 UTC
Even if another server in the cluster is down, the current server should continue to work as if glusterfs did not exist.