Bug 764879 (GLUSTER-3147)

Summary: Inconsistent volume info
Product: [Community] GlusterFS Reporter: maciej.galkiewicz
Component: glusterdAssignee: Kaushal <kaushal>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 3.2.0CC: amarts, gluster-bugs, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
Logs from affected node (n6c1)
none
Logs from all nodes except n6c1
none
More logs from affected node (n6c1) none

Description maciej.galkiewicz 2011-07-11 01:38:28 EDT
Created attachment 550
Comment 1 maciej.galkiewicz 2011-07-11 02:20:40 EDT
I forget to add that albert volume has authentication.

Volume Name: albert
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: n4c1:/srv/glusterfs/albert
Brick2: n5c1:/srv/glusterfs/albert
Brick3: n6c1:/srv/glusterfs/albert
Brick4: n7c1:/srv/glusterfs/albert
Options Reconfigured:
auth.allow: 10.0.5.27
Comment 2 maciej.galkiewicz 2011-07-11 04:37:52 EDT
I have 4 glusterfs nodes (3.2.0) on debian squeeze with kernel 2.6.32-5-xen-amd64. All volumes have replica count 2 and 4 bricks. Glusterfs was installed from debian package downloaded from your site. Sometimes after operation on volume (start, stop) one node reports different volume info than the rest. For example on nodes n4c1, n5c1, n7c1 volume's state is "Stopped" and on n6c1 volume is "Started".

n6c1# gluster volume info albert

Volume Name: albert
Type: Replicate
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: n4c1:/srv/glusterfs/albert
Brick2: n5c1:/srv/glusterfs/albert
Brick3: n6c1:/srv/glusterfs/albert
Brick4: n7c1:/srv/glusterfs/albert

When I try to stop it glusterfs complains that volume is already stopped. The solution is to restart glusterd service on the node with different volume status (n6c1). I do not know how to reproduce the problem. It happens randomly. I have attached logs from affected node (from /var/log/glusterfs/etc-glusterfs-glusterd.vol.log). There are some really strange problems with connections to endpoints. I did not experience any network/connection problems between nodes.
Comment 3 Amar Tumballi 2011-07-27 22:22:54 EDT
Please discuss with KP and then Vijay on what should be the behavior and take this to completion.
Comment 4 Kaushal 2011-08-01 00:11:41 EDT
Hi,

I tried to replicate the bug with the information you provided but wasn't able to get the same result.

I have some questions regarding this.
1. What was the result of the 'gluster volume stop' command? Did it give a 'Stopping volume successful' message?
2.Can you provide the glusterd logs for all the servers, when this problem occurred.

Thanks.
Comment 5 maciej.galkiewicz 2011-08-01 00:53:56 EDT
(In reply to comment #4)
> 1. What was the result of the 'gluster volume stop' command? Did it give a
> 'Stopping volume successful' message?

I'm not sure but I think I got 'Volume albert already stopped'. Here is the log from two different nodes:

n5c1:
[2011-06-01 09:30:15.199799] Volume stop : on volname: albert SUCCESS
[2011-06-01 09:39:15.242785] volume start : on volname: albert SUCCESS
[2011-06-01 09:49:00.523818] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16
[2011-06-01 09:49:00.523870] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16 SUCCEDED
[2011-06-01 09:55:10.422000] Volume reset : volume  : albert
[2011-06-01 09:55:10.422074] Volume reset :  on volume albert  SUCCEDED 
[2011-06-01 09:55:14.871873] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16
[2011-06-01 09:55:14.871932] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16 SUCCEDED
[2011-06-01 09:56:10.784013] Volume stop : on volname: albert SUCCESS
[2011-06-01 09:56:31.179914] volume start : on volname: albert SUCCESS
[2011-06-01 10:04:20.444227] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:04:23.519550] volume start : on volname: albert SUCCESS
[2011-06-01 10:04:26.364165] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:06:34.923113] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16
[2011-06-01 10:06:34.923170] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16 SUCCEDED
[2011-06-01 14:44:49.936117] Volume stop : on volname: albert SUCCESS
[2011-06-01 14:57:48.915508] volume start : on volname: albert SUCCESS
[2011-06-01 14:57:51.868049] volume set : volume-name:albert: key:auth.allow, value:10.0.2.24
[2011-06-01 14:57:51.868105] volume set : volume-name:albert: key:auth.allow, value:10.0.2.24 SUCCEDED

n6c1:
[2011-06-01 10:01:07.14611] volume set : volume-name:albert: key:Status:, value:Stopped
[2011-06-01 10:01:07.58044] volume set : volume-name:albert: key:Status:, value:Stopped SUCCEDED
[2011-06-01 10:01:21.86333] Volume reset : volume  : albert
[2011-06-01 10:01:21.86383] Volume reset :  on volume albert  SUCCEDED 
[2011-06-01 10:04:07.407277] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:04:11.922521] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:06:21.675914] volume start : on volname: albert SUCCESS

Notice that albert was stopped around 10:01 (n6c1) and it was again successfully stopped around 10:04.

> 2.Can you provide the glusterd logs for all the servers, when this problem
> occurred.

I have attached missing logs.
Comment 6 maciej.galkiewicz 2011-08-01 00:54:58 EDT
Created attachment 596
Comment 7 Kaushal 2011-08-01 01:32:37 EDT
Hi again.

From the logs provided, I believe that you first sent the stop command from n5c1 at around 9:30 and it failed. Is this correct?

If so, a problem occurred in n6c1 at the time. The logs provided for n6c1 are not complete for the mentioned time. Could you please upload the n6c1 logs again?

Thanks.
Comment 8 maciej.galkiewicz 2011-08-01 01:36:39 EDT
Created attachment 597
Comment 9 maciej.galkiewicz 2011-08-01 01:42:16 EDT
(In reply to comment #7)
> Hi again.
> 
> From the logs provided, I believe that you first sent the stop command from
> n5c1 at around 9:30 and it failed. Is this correct?

Yes it is correct. The output from gluster volume stop:

[Wed, 01 Jun 2011 09:30:15 +0200] DEBUG: Processing execute[stop glusterfs
volume albert] on n5c1.winniecloud.net-storage
[Wed, 01 Jun 2011 09:30:15 +0200] DEBUG: Executing yes | gluster volume stop
albert
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: ---- Begin output of yes | gluster
volume stop albert ----
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: STDOUT: Stopping volume will make its
data inaccessible. Do you want to continue? (y/n) Operation failed on n6c1
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: STDERR:
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: ---- End output of yes | gluster
volume stop albert ----
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: Ran yes | gluster volume stop albert
returned 255
[Wed, 01 Jun 2011 09:30:32 +0200] ERROR: execute[stop glusterfs volume albert]
(/var/cache/chef/storage/cookbooks/glusterfs/recipes/volume_authentication.rb:13:in
`from_file') had an error:
yes | gluster volume stop albert returned 255, expected 0
---- Begin output of yes | gluster volume stop albert ----
STDOUT: Stopping volume will make its data inaccessible. Do you want to
continue? (y/n) Operation failed on n6c1STDERR:
---- End output of yes | gluster volume stop albert ----
[Wed, 01 Jun 2011 09:30:32 +0200] ERROR: Running exception handlers
[Wed, 01 Jun 2011 09:30:32 +0200] ERROR: Creating Hoptoad exception report
[Wed, 01 Jun 2011 09:30:33 +0200] ERROR: Exception handlers complete
[Wed, 01 Jun 2011 09:30:33 +0200] DEBUG: Re-raising exception:
Chef::Exceptions::Exec - yes | gluster volume stop albert returned 255,
expected 0
---- Begin output of yes | gluster volume stop albert ----
STDOUT: Stopping volume will make its data inaccessible. Do you want to
continue? (y/n) Operation failed on n6c1STDERR:
---- End output of yes | gluster volume stop albert ----

> If so, a problem occurred in n6c1 at the time. The logs provided for n6c1 are
> not complete for the mentioned time. Could you please upload the n6c1 logs
> again?

Sure.
Comment 10 Kaushal 2011-08-01 22:36:53 EDT
Hi again.

After examining the logs, I think that the glusterfs brick process on n6c1 was not running or had crashed, when the stop command was issued. Can you confirm this? Also could you give the logs for the brick process in n6c1.

Thanks.
Comment 11 maciej.galkiewicz 2011-08-03 23:30:35 EDT
Brick log:
[2011-06-01 09:30:05.220873] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.albert-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.0.4.11:1015)
[2011-06-01 09:30:05.264666] I [server-helpers.c:360:do_lock_table_cleanup] 0-albert-server: inodelk released on /disk/assets/products
[2011-06-01 09:30:05.264712] I [server.c:438:server_rpc_notify] 0-albert-server: disconnected connection from 10.0.4.11:1015
[2011-06-01 09:30:05.306495] I [server-helpers.c:783:server_connection_destroy] 0-albert-server: destroyed connection of i-10-0-4-11-1455-2011/05/28-12:10:43:738313-albert-client-0
[2011-06-01 09:30:21.317563] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.albert-server: reading from socket failed. Error (Transport endpoint is not connected), peer (1.1.1.1:815)
[2011-06-01 09:30:21.317647] I [server.c:438:server_rpc_notify] 0-albert-server: disconnected connection from 1.1.1.1:815
[2011-06-01 09:30:21.317663] I [server-helpers.c:783:server_connection_destroy] 0-albert-server: destroyed connection of n5c1-32673-2011/05/31-12:00:00:211874-albert-client-0
[2011-06-01 09:30:22.534946] W [glusterfsd.c:700:cleanup_and_exit] (-->/usr/lib/libgfrpc.so.0(rpcsvc_notify+0x93) [0x7f0df4a8f073] (-->/usr/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x27f) [0x7f0df4a8ef3f] (-->/usr/sbin/glusterfsd(glusterfs_handle_terminate+0x1d) [0x4078fd]))) 0-: received signum (15), shutting down
Comment 12 Anand Avati 2011-08-17 02:23:58 EDT
CHANGE: http://review.gluster.com/246 (This fixes problem caused by race between glusterd and glusterfsd) merged in release-3.2 by Anand Avati (avati@gluster.com)
Comment 13 Anand Avati 2011-08-17 21:02:25 EDT
CHANGE: http://review.gluster.com/206 (This fixes problem caused by race between glusterd and glusterfsd) merged in master by Anand Avati (avati@gluster.com)