Description of problem:
========================
I see more than 700 sockets under /proc/$(pgrep glusterfsd)/fd which are possibly stale. I don't think this is expected. Below is a sample set:

lrwx------. 1 root root 64 Jul 10 17:44 12385 -> socket:[3399901]
lrwx------. 1 root root 64 Jul 10 17:44 12386 -> socket:[3399902]
lrwx------. 1 root root 64 Jul 10 17:44 12387 -> socket:[3399903]
lrwx------. 1 root root 64 Jul 10 17:44 12388 -> socket:[3399904]
lrwx------. 1 root root 64 Jul 10 17:44 12389 -> socket:[3399916]
lrwx------. 1 root root 64 Jul 10 17:44 12390 -> socket:[3399917]
lrwx------. 1 root root 64 Jul 10 17:44 12391 -> socket:[3399918]
lrwx------. 1 root root 64 Jul 10 17:44 12392 -> socket:[3399919]
lrwx------. 1 root root 64 Jul 10 17:44 12393 -> socket:[3399920]
lrwx------. 1 root root 64 Jul 10 17:44 12394 -> socket:[3399921]
lrwx------. 1 root root 64 Jul 10 17:44 12395 -> socket:[3399922]
lrwx------. 1 root root 64 Jul 10 17:44 12396 -> socket:[3399923]
lrwx------. 1 root root 64 Jul 10 17:44 12397 -> socket:[3399924]
lrwx------. 1 root root 64 Jul 10 17:44 12398 -> socket:[3399925]
lrwx------. 1 root root 64 Jul 10 17:44 12399 -> socket:[3399926]
lrwx------. 1 root root 64 Jul 10 17:44 12400 -> socket:[3399927]
lrwx------. 1 root root 64 Jul 10 17:44 12401 -> socket:[3399928]
lrwx------. 1 root root 64 Jul 10 17:44 12402 -> socket:[3399929]
lrwx------. 1 root root 64 Jul 10 17:44 12403 -> socket:[3399930]

Version-Release number of selected component (if applicable):
----------------
3.8.4-54.14

How reproducible:
==============
Have run the below steps once and am seeing it on all nodes.

Steps to Reproduce:
=====================
1. Have a 6-node cluster with an 8x3 volume started.
2. Run glusterd restart in a loop from one terminal of n1 and run the gluster v heal command from another terminal of n1 (as part of bz#1595752 verification).
3. Now, from 3 clients simultaneously, keep mounting the same volume using the n1 IP on 1000 directories, i.e. from each client:
   for i in {1..1000}; do mkdir /mnt/vol.$i; mount -t glusterfs n1:vol /mnt/vol.$i; done
4. Wait for the loops in step #3 to complete.
5. Now, in a loop, restart glusterd from terminal t1 of n1 and run quota enable/disable from terminal t2 of n1; you will hit issue bz#1599702 on n1.
6. Now unmount all the mounts on the 3 clients simultaneously, i.e.:
   for i in {1..1000}; do umount /mnt/vol.$i; done
   This will succeed.
7. Stop step 5 and bring the cluster to an idle state.

Actual results:
=============
Look at /proc/<glusterfsd pid>/fd and you will see many sockets, i.e. more than 700.

Workaround:
==========
Reboot the node, or kill all gluster processes and restart glusterd.
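For quickly checking the "Actual results" above, here is a minimal sketch (assuming pgrep, ls and ss are available, and that per-brick glusterfsd processes should be checked individually) that counts the socket fds held by each glusterfsd process and compares them against the connections the kernel still shows for glusterfsd; a large gap suggests most of the fds are stale:

   # Count socket fds held by each glusterfsd process (sketch).
   for pid in $(pgrep glusterfsd); do
       socks=$(ls -l /proc/$pid/fd 2>/dev/null | grep -c 'socket:')
       echo "glusterfsd pid $pid holds $socks socket fds"
   done

   # Number of TCP connections the kernel still attributes to glusterfsd;
   # compare this with the fd counts printed above.
   ss -tnp 2>/dev/null | grep -c glusterfsd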
Logs are the same as for bz#1599702: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1599702/
The lists are available under each server's log directory as glusterfsd.proc.fd.list and lsof_fd.list.
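For anyone re-collecting this data, a rough sketch of how such lists can be regenerated on a server (the file names simply mirror the ones above, /var/log/glusterfs is assumed as the log directory, and a single glusterfsd per node is assumed; with multiple bricks, loop over the pgrep output instead):

   pid=$(pgrep -o glusterfsd)
   # Snapshot of the fd table and of lsof output for the brick process.
   ls -l /proc/$pid/fd > /var/log/glusterfs/glusterfsd.proc.fd.list
   lsof -p $pid        > /var/log/glusterfs/lsof_fd.list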
A similar test is required on top of RHGS 3.4.0 (once released) to see if we should consider this for a batch update. I am not aware whether this is still happening.
Setting a needinfo on Nag based on comment 7.
cleared needinfo accidentally, placing it back
The problem still exists even on 3.12.2-29. I am seeing anywhere between 15 and 75 stale sockets with the same steps as mentioned in the description (the only difference is that I reduced the number of mounts to 500).
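To separate the stale sockets from those still backing live connections, one rough check (a sketch, assuming the brick's connections are TCP and visible to ss with extended info, and a single glusterfsd pid) is to compare the socket inodes in /proc/<pid>/fd against the inodes ss reports for that pid:

   pid=$(pgrep -o glusterfsd)
   # Socket inodes currently held as fds by the brick process.
   ls -l /proc/$pid/fd | grep -o 'socket:\[[0-9]*\]' | tr -d 'socket:[]' | sort > /tmp/fd_inodes
   # Socket inodes the kernel still associates with live connections of that pid.
   ss -tnpe 2>/dev/null | grep "pid=$pid," | grep -o 'ino:[0-9]*' | cut -d: -f2 | sort > /tmp/live_inodes
   # Inodes present as fds but not tied to a live connection are candidates for stale sockets.
   comm -23 /tmp/fd_inodes /tmp/live_inodes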
sosreports and health reports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1599769
Upstream patch: https://review.gluster.org/#/c/glusterfs/+/21966/
@Mohit, what's the state of this?
I have asked Sanju to look into the same.