Bug 844956

Summary: core: volume info cmd fails even though glusterd is up and running
Product: [Community] GlusterFS
Reporter: Saurabh <saujain>
Component: core
Assignee: Kaushal <kaushal>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: high
Version: pre-release
CC: amarts, gluster-bugs, mzywusko, rabhat, rfortier
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-24 17:35:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Saurabh 2012-08-01 11:02:06 UTC
Description of problem:

[root@gqac026 ~]# gluster volume info test1
 
Volume Name: test1
Type: Distributed-Replicate
Volume ID: ae4f5ddc-dcf1-4298-b705-c71e81a5b12f
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.16.157.81:/home/test1-dr
Brick2: 10.16.157.75:/home/test1-dr2
Brick3: 10.16.157.78:/home/test1-d2r
Brick4: 10.16.157.21:/home/test1-d2r2
Brick5: 10.16.157.81:/home/test1-d3r
Brick6: 10.16.157.75:/home/test1-d3r2
Brick7: 10.16.157.78:/home/test1-d4r
Brick8: 10.16.157.21:/home/test1-d4r2
Brick9: 10.16.157.81:/home/test1-d5r
Brick10: 10.16.157.75:/home/test1-d5r2
Brick11: 10.16.157.78:/home/test1-d6r
Brick12: 10.16.157.21:/home/test1-d6r2
Options Reconfigured:
features.quota: off
geo-replication.indexing: off


The root filesystem of the system was full because /var/log/messages had grown to several GB.

Hence I truncated it.

Then df -h showed some free space again.
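
(For reference, one way to truncate a large log in place so the space is actually freed even while syslog keeps the file open; this is an illustrative sketch, not necessarily the exact command used here:)

truncate -s 0 /var/log/messages     # or:  : > /var/log/messages  -- do not rm, an open file keeps its blocks
df -h /                             # confirm the root filesystem shows free space again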

But issuing the command "gluster volume info <vol-name>" reported "Please check if glusterd is operational.", even though glusterd was up and running.

Meanwhile, "gluster volume stop <vol-name>" executed from some other node of the cluster works successfully.

Even "gluster volume start <vol-name>" works fine, but on the node in question the brick processes were still not spawned.

At the time of the "gluster volume start" command execution, /var/log/glusterfs/bricks/<brick-name>.log throws these errors:

-----------------------------------------------------------------------------
 
[root@gqac028 AUTH_test1]# less /var/log/glusterfs/bricks/home-test1-d5r.log | grep "07:13"
[2012-08-01 07:13:11.369312] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.3.0
[2012-08-01 07:13:11.387632] W [socket.c:1512:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (Transport endpoint is not connected), peer (::1:24007)
[2012-08-01 07:13:11.387803] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x322040f7e8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x322040f4a0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x322040ef0e]))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2012-08-01 07:13:11.377157 (xid=0x1x)
[2012-08-01 07:13:11.387824] E [glusterfsd-mgmt.c:1623:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:test1.10.16.157.81.home-test1-d5r)
[2012-08-01 07:13:11.387850] W [glusterfsd.c:831:cleanup_and_exit] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x322040ef0e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x214) [0x322040ee74] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x292) [0x40c922]))) 0-: received signum (0), shutting down
[2012-08-01 07:13:11.390586] I [socket.c:2315:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
[2012-08-01 07:13:11.390609] W [rpc-clnt.c:1498:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x2x Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)

-------------------------------------------------------------------------------
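
(The GETSPEC failure above means the brick process could not fetch its volfile from glusterd at ::1:24007, i.e. the management connection itself is failing. A quick, illustrative way to check whether the management port accepts connections at all, assuming a netcat that supports -z:)

nc -z -v 127.0.0.1 24007          # IPv4 loopback
nc -z -v ::1 24007                # IPv6 loopback, which the brick tried to use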
 

Version-Release number of selected component (if applicable):

[root@gqac028 AUTH_test1]# glusterfs -V
glusterfs 3.3.0 built on Jul 19 2012 14:08:45
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

How reproducible:

seen it once

Steps to Reproduce:
1. Same as mentioned above.

Actual results:
Spawning the brick processes failed on only one node out of the four-node cluster.

Expected results:
The brick processes should be spawned properly.

Additional info:


Also, in the latest trial, the volume start did spawn the brick processes, but "gluster volume info" still fails, as can be seen from the execution logs below:

==============================================================================
[root@gqac028 AUTH_test1]# ps -eaf | grep gluster
root      3369     1  0 Jul30 ?        00:26:03 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root      7568     1 96 10:49 ?        00:06:03 /usr/sbin/glusterfsd -s localhost --volfile-id test1.10.16.157.81.home-test1-dr -p /var/lib/glusterd/vols/test1/run/10.16.157.81-home-test1-dr.pid -S /tmp/34e6a2c2feb4767bc7a6dd7ad959e856.socket --brick-name /home/test1-dr -l /var/log/glusterfs/bricks/home-test1-dr.log --xlator-option *-posix.glusterd-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9 --brick-port 24012 --xlator-option test1-server.listen-port=24012
root      7573     1 99 10:49 ?        00:06:52 /usr/sbin/glusterfsd -s localhost --volfile-id test1.10.16.157.81.home-test1-d3r -p /var/lib/glusterd/vols/test1/run/10.16.157.81-home-test1-d3r.pid -S /tmp/89bfdd4200899acca82a78a78df512e5.socket --brick-name /home/test1-d3r -l /var/log/glusterfs/bricks/home-test1-d3r.log --xlator-option *-posix.glusterd-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9 --brick-port 24013 --xlator-option test1-server.listen-port=24013
root      7579     1 99 10:49 ?        00:17:27 /usr/sbin/glusterfsd -s localhost --volfile-id test1.10.16.157.81.home-test1-d5r -p /var/lib/glusterd/vols/test1/run/10.16.157.81-home-test1-d5r.pid -S /tmp/7c22c5ba2048b1ce857155ab1ff0c34a.socket --brick-name /home/test1-d5r -l /var/log/glusterfs/bricks/home-test1-d5r.log --xlator-option *-posix.glusterd-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9 --brick-port 24014 --xlator-option test1-server.listen-port=24014
root      7588     1  0 10:49 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /tmp/94dbe38f818854a429781fc1f1a933b1.socket
root      7594     1  0 10:49 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /tmp/b4a7809b83a1003777623c5f51ac2d7b.socket --xlator-option *replicate*.node-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9
root     14422     1  1 Jul30 ?        00:45:27 /usr/sbin/glusterfs --volfile-id=test --volfile-server=localhost /mnt/gluster-object/AUTH_test
root     19353 30429  0 10:55 pts/1    00:00:00 grep gluster
root     19354  3810  0 10:55 ?        00:00:00 gluster volume info
[root@gqac028 AUTH_test1]# gluster volume info test1

Connection failed. Please check if gluster daemon is operational.
[root@gqac028 AUTH_test1]# 
[root@gqac028 AUTH_test1]# 
[root@gqac028 AUTH_test1]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_gqac028-lv_root
                       50G   44G  2.8G  95% /
tmpfs                  24G     0   24G   0% /dev/shm
/dev/sda1             485M   31M  429M   7% /boot
/dev/mapper/vg_gqac028-lv_home
                      366G   23G  344G   7% /home
df: `/mnt/gluster-object/AUTH_test1': Transport endpoint is not connected
df: `/mnt/gluster-object/AUTH_test': Transport endpoint is not connected
[root@gqac028 AUTH_test1]# 
=============================================================================

dmesg from this node:

===========================================
__ratelimit: 2679 callbacks suppressed
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
__ratelimit: 2153 callbacks suppressed
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.

==================================================
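
(The "nf_conntrack: table full, dropping packet" messages mean the kernel connection-tracking table is exhausted, so new connections -- potentially including the CLI's local connection to glusterd on port 24007 -- can get dropped. A hedged sketch of checking and, if needed, raising the limit; the sysctl names are the usual ones for this kernel generation and the value is only an example:)

sysctl net.netfilter.nf_conntrack_count            # connections currently being tracked
sysctl net.netfilter.nf_conntrack_max              # current table size limit
sysctl -w net.netfilter.nf_conntrack_max=262144    # raise the limit (not persistent across reboot)
# for a persistent change, add the setting to /etc/sysctl.conf and run sysctl -p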

Latest status:
==============================================================================
[root@gqac026 ~]# gluster volume status test1
Status of volume: test1
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.16.157.81:/home/test1-dr			24012	Y	7568
Brick 10.16.157.75:/home/test1-dr2			24009	Y	10439
Brick 10.16.157.78:/home/test1-d2r			24009	Y	8641
Brick 10.16.157.21:/home/test1-d2r2			24009	Y	26847
Brick 10.16.157.81:/home/test1-d3r			24013	Y	7573
Brick 10.16.157.75:/home/test1-d3r2			24010	Y	10445
Brick 10.16.157.78:/home/test1-d4r			24010	Y	8646
Brick 10.16.157.21:/home/test1-d4r2			24010	Y	26852
Brick 10.16.157.81:/home/test1-d5r			24014	Y	7579
Brick 10.16.157.75:/home/test1-d5r2			24011	Y	10451
Brick 10.16.157.78:/home/test1-d6r			24011	Y	8652
Brick 10.16.157.21:/home/test1-d6r2			24011	Y	26859
NFS Server on localhost					38467	Y	10460
Self-heal Daemon on localhost				N/A	Y	10465
NFS Server on 10.16.157.81				38467	Y	7588
Self-heal Daemon on 10.16.157.81			N/A	Y	7594
NFS Server on 10.16.157.78				38467	Y	8659
Self-heal Daemon on 10.16.157.78			N/A	Y	8664
NFS Server on 10.16.157.21				38467	Y	26865
Self-heal Daemon on 10.16.157.21			N/A	Y	26871

===============================================================================

All this is happening while thousands of REST API requests, sent by curl commands from different clients, are being executed against the node.

Comment 1 Amar Tumballi 2012-08-02 17:01:24 UTC
Johny,

I want your 'gf-loganalyser.sh' script to be available to everyone in QE. I want everyone to run this script once, before deleting logs, whenever the disk is full because of logs, so we can understand what is contributing most to the logs.

Comment 2 Amar Tumballi 2012-12-21 09:40:36 UTC
Saurabh, we have made many improvements to both UFO (the REST API part) and glusterd. Can you please re-check once again on the 3.4.0qa releases?