Bug 865406 - [RHEV-RHS] self-heal-daemon continuously reporting messages about disconnects to bricks
Status: CLOSED DUPLICATE of bug 865693
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Assigned To: Amar Tumballi
QA Contact: Sudhir D
Depends On:
Blocks:
 
Reported: 2012-10-11 07:15 EDT by spandura
Modified: 2013-12-18 19:08 EST
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-10-19 00:56:38 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description spandura 2012-10-11 07:15:30 EDT
Description of problem:
----------------------
When a storage node containing a brick of a pure replicate volume (1x2) comes back online after a reboot, the self-heal daemon process continuously reports messages about disconnects to the bricks.

[2012-10-11 16:35:01.675590] I [client.c:2090:client_rpc_notify] 0-replicate-rhevh-client-0: disconnected
[2012-10-11 16:35:02.680086] I [client.c:2090:client_rpc_notify] 0-replicate-rhevh-client-1: disconnected


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
[10/11/12 - 15:53:58 root@rhs-client7 ~]# rpm -qa | grep gluster
glusterfs-geo-replication-3.3.0rhsvirt1-7.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-fuse-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

[10/11/12 - 15:54:07 root@rhs-client7 ~]# gluster --version
glusterfs 3.3.0rhsvirt1 built on Oct  8 2012 15:23:00


Steps to Reproduce:
--------------------
1. Create a pure replicate volume (1x2) with 2 servers and 1 brick on each server; this is the storage for the VMs. Start the volume (a minimal sketch of the commands follows these steps).

2. Set up KVM to use the volume as the VM store.

3. Create 4 virtual machines (vm1 and vm2). Start the VMs.

4. Power off server1 (one of the servers in the replicate pair).

5. Perform operations on the VMs (rhn_register, yum update, reboot the VMs after the yum update).

6. Power on server1.

7. In the glustershd.log file of server1, we continuously see messages about disconnects to the bricks.
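
A minimal sketch of the step 1 commands, assuming the hostnames and brick paths shown in the "Additional info" section below:

# create and start the 1x2 pure replicate volume (sketch; brick paths assumed from the volume info below)
gluster volume create replicate-rhevh2 replica 2 \
    rhs-client6.lab.eng.blr.redhat.com:/replicate-disk \
    rhs-client7.lab.eng.blr.redhat.com:/replicate-disk
gluster volume start replicate-rhevh2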
  
Actual results:
--------------
The self-heal-daemon process reported the disconnect message about 1500 times in a span of 1 hour 15 minutes.
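
For reference, a rough way to arrive at such a count from the self-heal daemon log (the default glustershd.log location is assumed here):

# count disconnect messages logged by the self-heal daemon (default log path assumed)
grep -c "client_rpc_notify.*disconnected" /var/log/glusterfs/glustershd.log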

Additional info:
------------------

[10/11/12 - 16:39:56 root@rhs-client7 ~]# gluster volume info replicate-rhevh2
 
Volume Name: replicate-rhevh2
Type: Replicate
Volume ID: 1e697968-2e90-4589-8225-f596fee8af97
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhs-client6.lab.eng.blr.redhat.com:/replicate-disk
Brick2: rhs-client7.lab.eng.blr.redhat.com:/replicate-disk
Options Reconfigured:
storage.linux-aio: disable
cluster.eager-lock: enable
performance.read-ahead: disable
performance.stat-prefetch: disable
performance.io-cache: disable
performance.quick-read: disable

[10/11/12 - 16:40:04 root@rhs-client7 ~]# gluster volume status replicate-rhevh2
Status of volume: replicate-rhevh2
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick rhs-client6.lab.eng.blr.redhat.com:/replicate-disk	N/A	Y	2937
Brick rhs-client7.lab.eng.blr.redhat.com:/replicate-disk	24013	Y	32385
NFS Server on localhost					38467	Y	10740
Self-heal Daemon on localhost				N/A	Y	10746
NFS Server on rhs-client6.lab.eng.blr.redhat.com	38467	Y	2963
Self-heal Daemon on rhs-client6.lab.eng.blr.redhat.com	N/A	Y	2972
NFS Server on rhs-client8.lab.eng.blr.redhat.com	38467	Y	7636
Self-heal Daemon on rhs-client8.lab.eng.blr.redhat.com	N/A	Y	7642
NFS Server on 10.70.36.33				38467	Y	2406
Self-heal Daemon on 10.70.36.33				N/A	Y	2412
 

[10/11/12 - 16:40:17 root@rhs-client7 ~]# netstat -alnp | grep 10746
tcp        0      0 10.70.36.31:976             10.70.36.33:24010           ESTABLISHED 10746/glusterfs     
tcp        0      0 10.70.36.31:1009            10.70.36.30:24014           ESTABLISHED 10746/glusterfs     
tcp        0      0 10.70.36.31:982             10.70.36.31:24011           ESTABLISHED 10746/glusterfs     
tcp        0      0 10.70.36.31:984             10.70.36.31:24013           ESTABLISHED 10746/glusterfs     
tcp        0      0 10.70.36.31:974             10.70.36.32:24010           ESTABLISHED 10746/glusterfs     
tcp        0      0 10.70.36.31:1010            10.70.36.30:24015           ESTABLISHED 10746/glusterfs     
tcp        0      0 ::1:1011                    ::1:24007                   ESTABLISHED 10746/glusterfs     
unix  2      [ ACC ]     STREAM     LISTENING     12813043 10746/glusterfs     /tmp/f65f7f2956f53d4e6a12ed4af33a9624.socket
unix  3      [ ]         STREAM     CONNECTED     12813131 10746/glusterfs     /tmp/f65f7f2956f53d4e6a12ed4af33a9624.socket


[10/11/12 - 16:40:54 root@rhs-client7 ~]# ps -ef | grep glusterfsd
root     10734     1  4 13:40 ?        00:07:55 /usr/sbin/glusterfsd -s localhost --volfile-id dist-rep-rhevh.rhs-client7.lab.eng.blr.redhat.com.disk1 -p /var/lib/glusterd/vols/dist-rep-rhevh/run/rhs-client7.lab.eng.blr.redhat.com-disk1.pid -S /tmp/d8cb96b9810260656e0e505e0a9df706.socket --brick-name /disk1 -l /var/log/glusterfs/bricks/disk1.log --xlator-option *-posix.glusterd-uuid=b9d6cb21-051f-4791-9476-734856e77fbf --brick-port 24011 --xlator-option dist-rep-rhevh-server.listen-port=24011
root     14029  8689  0 16:41 pts/0    00:00:00 grep glusterfsd
root     32385     1 16 Oct10 ?        04:05:41 /usr/sbin/glusterfsd -s localhost --volfile-id replicate-rhevh2.rhs-client7.lab.eng.blr.redhat.com.replicate-disk -p /var/lib/glusterd/vols/replicate-rhevh2/run/rhs-client7.lab.eng.blr.redhat.com-replicate-disk.pid -S /tmp/34ce168cca1ffd0f64c69b974431b3a4.socket --brick-name /replicate-disk -l /var/log/glusterfs/bricks/replicate-disk.log --xlator-option *-posix.glusterd-uuid=b9d6cb21-051f-4791-9476-734856e77fbf --brick-port 24013 --xlator-option replicate-rhevh2-server.listen-port=24013
Comment 1 spandura 2012-10-11 07:24:25 EDT
[10/11/12 - 16:50:30 root@rhs-client6 glusterfs]# cat /var/lib/glusterd/vols/replicate-rhevh2/bricks/rhs-client6.lab.eng.blr.redhat.com\:-replicate-disk 
hostname=rhs-client6.lab.eng.blr.redhat.com
path=/replicate-disk
listen-port=0
rdma.listen-port=0
decommissioned=0

[10/11/12 - 16:50:58 root@rhs-client6 glusterfs]# cat /var/lib/glusterd/vols/replicate-rhevh2/bricks/rhs-client7.lab.eng.blr.redhat.com\:-replicate-disk 
hostname=rhs-client7.lab.eng.blr.redhat.com
path=/replicate-disk
listen-port=0
rdma.listen-port=0
decommissioned=0
 
[10/11/12 - 16:51:35 root@rhs-client7 ~]# cat /var/lib/glusterd/vols/replicate-rhevh2/bricks/rhs-client6.lab.eng.blr.redhat.com\:-replicate-disk
hostname=rhs-client6.lab.eng.blr.redhat.com
path=/replicate-disk
listen-port=0
rdma.listen-port=0
decommissioned=0

[10/11/12 - 16:52:18 root@rhs-client7 ~]# cat /var/lib/glusterd/vols/replicate-rhevh2/bricks/rhs-client7.lab.eng.blr.redhat.com\:-replicate-disk
hostname=rhs-client7.lab.eng.blr.redhat.com
path=/replicate-disk
listen-port=24013
rdma.listen-port=0
decommissioned=0
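
A quick way to dump the listen-port entries of all the brick files for this volume on a node (a sketch; it only re-reads the same files shown above):

# print listen-port from every brick file of the volume, with the file name
grep -H "^listen-port=" /var/lib/glusterd/vols/replicate-rhevh2/bricks/*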
Comment 3 Amar Tumballi 2012-10-12 03:37:06 EDT
This is happening because of bug 865693.
Comment 4 Amar Tumballi 2012-10-19 00:56:38 EDT
Thinking about it more, there is nothing wrong with the self-heal daemon trying to connect to a brick that never existed: in this particular case glusterd thought the volume existed, but on the remote machine it was not even there. So the main issue here is the volumes being out of sync.
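
A hedged sketch of how an out-of-sync volume definition can be checked and pulled back in sync (the cksum file path and the volume sync syntax below reflect the 3.3-era glusterd layout; treat them as assumptions and run the comparison on every peer):

# compare the volume configuration checksum on each peer; differing values indicate out-of-sync definitions
cat /var/lib/glusterd/vols/replicate-rhevh2/cksum
# if they differ, re-sync the volume definition from a peer that has the correct copy
gluster volume sync rhs-client7.lab.eng.blr.redhat.com replicate-rhevh2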

*** This bug has been marked as a duplicate of bug 865693 ***
