Description of problem:

[root@gqac026 ~]# gluster volume info test1

Volume Name: test1
Type: Distributed-Replicate
Volume ID: ae4f5ddc-dcf1-4298-b705-c71e81a5b12f
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.16.157.81:/home/test1-dr
Brick2: 10.16.157.75:/home/test1-dr2
Brick3: 10.16.157.78:/home/test1-d2r
Brick4: 10.16.157.21:/home/test1-d2r2
Brick5: 10.16.157.81:/home/test1-d3r
Brick6: 10.16.157.75:/home/test1-d3r2
Brick7: 10.16.157.78:/home/test1-d4r
Brick8: 10.16.157.21:/home/test1-d4r2
Brick9: 10.16.157.81:/home/test1-d5r
Brick10: 10.16.157.75:/home/test1-d5r2
Brick11: 10.16.157.78:/home/test1-d6r
Brick12: 10.16.157.21:/home/test1-d6r2
Options Reconfigured:
features.quota: off
geo-replication.indexing: off

The root filesystem of the system was full because /var/log/messages had grown to several GBs, so I truncated it. After that, df -h showed free space again. However, running "gluster volume info <vol-name>" on this node reported "Please check if glusterd is operational.", even though glusterd was up and running.

Running "gluster volume stop <vol-name>" from some other node of the cluster works successfully, and "gluster volume start <vol-name>" works fine as well, but on the node in question the brick processes were still not spawned. At the time the "gluster volume start" command is executed, /var/log/glusterfs/bricks/<brick-name>.log shows these errors (a basic connectivity check for the management port follows after the version information below):
-----------------------------------------------------------------------------
[root@gqac028 AUTH_test1]# less /var/log/glusterfs/bricks/home-test1-d5r.log | grep "07:13"
[2012-08-01 07:13:11.369312] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.3.0
[2012-08-01 07:13:11.387632] W [socket.c:1512:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (Transport endpoint is not connected), peer (::1:24007)
[2012-08-01 07:13:11.387803] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x322040f7e8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x322040f4a0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x322040ef0e]))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2012-08-01 07:13:11.377157 (xid=0x1x)
[2012-08-01 07:13:11.387824] E [glusterfsd-mgmt.c:1623:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:test1.10.16.157.81.home-test1-d5r)
[2012-08-01 07:13:11.387850] W [glusterfsd.c:831:cleanup_and_exit] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x322040ef0e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x214) [0x322040ee74] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x292) [0x40c922]))) 0-: received signum (0), shutting down
[2012-08-01 07:13:11.390586] I [socket.c:2315:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
[2012-08-01 07:13:11.390609] W [rpc-clnt.c:1498:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x2x Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
-------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

[root@gqac028 AUTH_test1]# glusterfs -V
glusterfs 3.3.0 built on Jul 19 2012 14:08:45
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
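The brick log above shows the brick process failing to reach glusterd on ::1:24007 while fetching its volume file (GETSPEC). A minimal diagnostic sketch for verifying that path on the affected node; these are generic connectivity checks, not commands from the original report:
-----------------------------------------------------------------------------
# Is glusterd actually listening on the management port (24007)?
netstat -tlnp | grep 24007        # or: ss -tlnp | grep 24007

# Can a local client reach it? (Per the brick log, "localhost" is being
# resolved to ::1 on this node.)
nc -zv localhost 24007

# glusterd's own view of the cluster; this fails with the same
# "is glusterd operational" message if the CLI cannot reach glusterd either.
gluster peer status
-----------------------------------------------------------------------------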
How reproducible:
Seen it once.

Steps to Reproduce:
1. Same as mentioned above.
2.
3.

Actual results:
Spawning the brick processes failed on only one node of the four-node cluster.

Expected results:
The brick processes should be spawned properly.

Additional info:
In the latest trial, "gluster volume start" did spawn the brick processes, but "gluster volume info" still fails on this node, as can be seen from the execution logs below:
==============================================================================
[root@gqac028 AUTH_test1]# ps -eaf | grep gluster
root 3369 1 0 Jul30 ? 00:26:03 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
root 7568 1 96 10:49 ? 00:06:03 /usr/sbin/glusterfsd -s localhost --volfile-id test1.10.16.157.81.home-test1-dr -p /var/lib/glusterd/vols/test1/run/10.16.157.81-home-test1-dr.pid -S /tmp/34e6a2c2feb4767bc7a6dd7ad959e856.socket --brick-name /home/test1-dr -l /var/log/glusterfs/bricks/home-test1-dr.log --xlator-option *-posix.glusterd-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9 --brick-port 24012 --xlator-option test1-server.listen-port=24012
root 7573 1 99 10:49 ? 00:06:52 /usr/sbin/glusterfsd -s localhost --volfile-id test1.10.16.157.81.home-test1-d3r -p /var/lib/glusterd/vols/test1/run/10.16.157.81-home-test1-d3r.pid -S /tmp/89bfdd4200899acca82a78a78df512e5.socket --brick-name /home/test1-d3r -l /var/log/glusterfs/bricks/home-test1-d3r.log --xlator-option *-posix.glusterd-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9 --brick-port 24013 --xlator-option test1-server.listen-port=24013
root 7579 1 99 10:49 ? 00:17:27 /usr/sbin/glusterfsd -s localhost --volfile-id test1.10.16.157.81.home-test1-d5r -p /var/lib/glusterd/vols/test1/run/10.16.157.81-home-test1-d5r.pid -S /tmp/7c22c5ba2048b1ce857155ab1ff0c34a.socket --brick-name /home/test1-d5r -l /var/log/glusterfs/bricks/home-test1-d5r.log --xlator-option *-posix.glusterd-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9 --brick-port 24014 --xlator-option test1-server.listen-port=24014
root 7588 1 0 10:49 ? 00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /tmp/94dbe38f818854a429781fc1f1a933b1.socket
root 7594 1 0 10:49 ? 00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /tmp/b4a7809b83a1003777623c5f51ac2d7b.socket --xlator-option *replicate*.node-uuid=f43c2541-d021-4278-ba13-3b31c81c04e9
root 14422 1 1 Jul30 ? 00:45:27 /usr/sbin/glusterfs --volfile-id=test --volfile-server=localhost /mnt/gluster-object/AUTH_test
root 19353 30429 0 10:55 pts/1 00:00:00 grep gluster
root 19354 3810 0 10:55 ? 00:00:00 gluster volume info

[root@gqac028 AUTH_test1]# gluster volume info test1
Connection failed. Please check if gluster daemon is operational.

[root@gqac028 AUTH_test1]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/vg_gqac028-lv_root   50G   44G  2.8G  95% /
tmpfs                            24G     0   24G   0% /dev/shm
/dev/sda1                       485M   31M  429M   7% /boot
/dev/mapper/vg_gqac028-lv_home  366G   23G  344G   7% /home
df: `/mnt/gluster-object/AUTH_test1': Transport endpoint is not connected
df: `/mnt/gluster-object/AUTH_test': Transport endpoint is not connected
[root@gqac028 AUTH_test1]#
=============================================================================

dmesg from this node:
===========================================
__ratelimit: 2679 callbacks suppressed
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
__ratelimit: 2153 callbacks suppressed
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
==================================================

Latest:
==============================================================================
[root@gqac026 ~]# gluster volume status test1
Status of volume: test1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.16.157.81:/home/test1-dr                       24012   Y       7568
Brick 10.16.157.75:/home/test1-dr2                      24009   Y       10439
Brick 10.16.157.78:/home/test1-d2r                      24009   Y       8641
Brick 10.16.157.21:/home/test1-d2r2                     24009   Y       26847
Brick 10.16.157.81:/home/test1-d3r                      24013   Y       7573
Brick 10.16.157.75:/home/test1-d3r2                     24010   Y       10445
Brick 10.16.157.78:/home/test1-d4r                      24010   Y       8646
Brick 10.16.157.21:/home/test1-d4r2                     24010   Y       26852
Brick 10.16.157.81:/home/test1-d5r                      24014   Y       7579
Brick 10.16.157.75:/home/test1-d5r2                     24011   Y       10451
Brick 10.16.157.78:/home/test1-d6r                      24011   Y       8652
Brick 10.16.157.21:/home/test1-d6r2                     24011   Y       26859
NFS Server on localhost                                 38467   Y       10460
Self-heal Daemon on localhost                           N/A     Y       10465
NFS Server on 10.16.157.81                              38467   Y       7588
Self-heal Daemon on 10.16.157.81                        N/A     Y       7594
NFS Server on 10.16.157.78                              38467   Y       8659
Self-heal Daemon on 10.16.157.78                        N/A     Y       8664
NFS Server on 10.16.157.21                              38467   Y       26865
Self-heal Daemon on 10.16.157.21                        N/A     Y       26871
===============================================================================

All of this is happening while the node is serving thousands of REST API requests, sent by curl commands from different clients.
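The "nf_conntrack: table full, dropping packet." messages above suggest the kernel's connection-tracking table was saturated by the REST traffic, which can also affect local connections such as the brick-to-glusterd RPC on port 24007. A small sketch of how the limit could be inspected and raised on the affected node; the value 262144 is only an illustrative example, not a recommendation from this report:
-----------------------------------------------------------------------------
# How full is the conntrack table right now?
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit for the running kernel (example value; size it to available RAM)
sysctl -w net.netfilter.nf_conntrack_max=262144

# Persist the change across reboots
echo 'net.netfilter.nf_conntrack_max = 262144' >> /etc/sysctl.conf
sysctl -p
-----------------------------------------------------------------------------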
Johny, I want your 'gf-loganalyser.sh' script to be available to everyone in QE. I want everyone to run this script once, before deleting anything, whenever the disk is full because of logs. We need to understand which component is contributing most to the logs.
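The gf-loganalyser.sh script itself is not attached to this bug, but a minimal sketch of the kind of report it could produce (largest files under /var/log, with the glusterfs logs broken out) might look like the following; the paths and the top-20 cut-off are illustrative assumptions:
-----------------------------------------------------------------------------
#!/bin/bash
# Sketch of a log-usage report: show what is filling the disk before anything
# is truncated or deleted. Sizes are in kilobytes, largest first.
echo "=== Top 20 files under /var/log ==="
du -ak /var/log 2>/dev/null | sort -rn | head -20

echo "=== Top 20 glusterfs logs ==="
du -ak /var/log/glusterfs 2>/dev/null | sort -rn | head -20
-----------------------------------------------------------------------------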
Saurabh, we have made many improvements to both UFO (the REST API part) and the glusterd part. Can you please re-check once again on the 3.4.0qa releases?