Bug 764879 (GLUSTER-3147)
| Field | Value |
|---|---|
| Summary | Inconsistent volume info |
| Product | [Community] GlusterFS |
| Reporter | maciej |
| Component | glusterd |
| Assignee | Kaushal <kaushal> |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Version | 3.2.0 |
| CC | amarts, gluster-bugs, vijay |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | Bug Fix |
**Description** (maciej, 2011-07-11 05:38:28 UTC)
I forgot to add that the albert volume has authentication enabled:

```
Volume Name: albert
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: n4c1:/srv/glusterfs/albert
Brick2: n5c1:/srv/glusterfs/albert
Brick3: n6c1:/srv/glusterfs/albert
Brick4: n7c1:/srv/glusterfs/albert
Options Reconfigured:
auth.allow: 10.0.5.27
```

I have 4 GlusterFS nodes (3.2.0) on Debian Squeeze with kernel 2.6.32-5-xen-amd64. All volumes have replica count 2 and 4 bricks. GlusterFS was installed from the Debian package downloaded from your site.

Sometimes, after an operation on a volume (start, stop), one node reports different volume info than the rest. For example, on nodes n4c1, n5c1 and n7c1 the volume's state is "Stopped", while on n6c1 the volume is "Started":

```
n6c1# gluster volume info albert
Volume Name: albert
Type: Replicate
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: n4c1:/srv/glusterfs/albert
Brick2: n5c1:/srv/glusterfs/albert
Brick3: n6c1:/srv/glusterfs/albert
Brick4: n7c1:/srv/glusterfs/albert
```

When I try to stop it, glusterfs complains that the volume is already stopped. The workaround is to restart the glusterd service on the node with the divergent volume status (n6c1). I do not know how to reproduce the problem; it happens randomly. I have attached logs from the affected node (from /var/log/glusterfs/etc-glusterfs-glusterd.vol.log). There are some really strange problems with connections to endpoints, but I did not experience any network/connection problems between the nodes.

---

Please discuss with KP and then Vijay on what the behavior should be, and take this to completion.

---

Hi, I tried to reproduce the bug with the information you provided but was not able to get the same result. I have some questions:

1. What was the result of the 'gluster volume stop' command? Did it give a 'Stopping volume successful' message?
2. Can you provide the glusterd logs for all the servers from when this problem occurred?

Thanks.

---

(In reply to comment #4)
> 1. What was the result of the 'gluster volume stop' command? Did it give a
> 'Stopping volume successful' message?

I'm not sure, but I think I got 'Volume albert already stopped'. Here is the log from two different nodes.

n5c1:

```
[2011-06-01 09:30:15.199799] Volume stop : on volname: albert SUCCESS
[2011-06-01 09:39:15.242785] volume start : on volname: albert SUCCESS
[2011-06-01 09:49:00.523818] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16
[2011-06-01 09:49:00.523870] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16 SUCCEDED
[2011-06-01 09:55:10.422000] Volume reset : volume : albert
[2011-06-01 09:55:10.422074] Volume reset : on volume albert SUCCEDED
[2011-06-01 09:55:14.871873] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16
[2011-06-01 09:55:14.871932] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16 SUCCEDED
[2011-06-01 09:56:10.784013] Volume stop : on volname: albert SUCCESS
[2011-06-01 09:56:31.179914] volume start : on volname: albert SUCCESS
[2011-06-01 10:04:20.444227] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:04:23.519550] volume start : on volname: albert SUCCESS
[2011-06-01 10:04:26.364165] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:06:34.923113] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16
[2011-06-01 10:06:34.923170] volume set : volume-name:albert: key:auth.allow, value:10.0.4.16 SUCCEDED
[2011-06-01 14:44:49.936117] Volume stop : on volname: albert SUCCESS
[2011-06-01 14:57:48.915508] volume start : on volname: albert SUCCESS
[2011-06-01 14:57:51.868049] volume set : volume-name:albert: key:auth.allow, value:10.0.2.24
[2011-06-01 14:57:51.868105] volume set : volume-name:albert: key:auth.allow, value:10.0.2.24 SUCCEDED
```

n6c1:

```
[2011-06-01 10:01:07.14611] volume set : volume-name:albert: key:Status:, value:Stopped
[2011-06-01 10:01:07.58044] volume set : volume-name:albert: key:Status:, value:Stopped SUCCEDED
[2011-06-01 10:01:21.86333] Volume reset : volume : albert
[2011-06-01 10:01:21.86383] Volume reset : on volume albert SUCCEDED
[2011-06-01 10:04:07.407277] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:04:11.922521] Volume stop : on volname: albert SUCCESS
[2011-06-01 10:06:21.675914] volume start : on volname: albert SUCCESS
```

Notice that albert was stopped around 10:01 (n6c1) and that it was again successfully stopped around 10:04.

> 2. Can you provide the glusterd logs for all the servers from when this problem
> occurred?

I have attached the missing logs.

Created attachment 596

---

Hi again.

From the logs provided, I believe that you first sent the stop command from n5c1 at around 9:30 and it failed. Is this correct? If so, a problem occurred on n6c1 at that time. The logs provided for n6c1 are not complete for the mentioned time. Could you please upload the n6c1 logs again?

Thanks.

---

Created attachment 597

(In reply to comment #7)
> From the logs provided, I believe that you first sent the stop command from
> n5c1 at around 9:30 and it failed. Is this correct?

Yes, that is correct. The output from gluster volume stop:

```
[Wed, 01 Jun 2011 09:30:15 +0200] DEBUG: Processing execute[stop glusterfs volume albert] on n5c1.winniecloud.net-storage
[Wed, 01 Jun 2011 09:30:15 +0200] DEBUG: Executing yes | gluster volume stop albert
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: ---- Begin output of yes | gluster volume stop albert ----
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: STDOUT: Stopping volume will make its data inaccessible. Do you want to continue? (y/n) Operation failed on n6c1
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: STDERR:
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: ---- End output of yes | gluster volume stop albert ----
[Wed, 01 Jun 2011 09:30:32 +0200] DEBUG: Ran yes | gluster volume stop albert returned 255
[Wed, 01 Jun 2011 09:30:32 +0200] ERROR: execute[stop glusterfs volume albert] (/var/cache/chef/storage/cookbooks/glusterfs/recipes/volume_authentication.rb:13:in `from_file') had an error: yes | gluster volume stop albert returned 255, expected 0
---- Begin output of yes | gluster volume stop albert ----
STDOUT: Stopping volume will make its data inaccessible. Do you want to continue? (y/n) Operation failed on n6c1STDERR:
---- End output of yes | gluster volume stop albert ----
[Wed, 01 Jun 2011 09:30:32 +0200] ERROR: Running exception handlers
[Wed, 01 Jun 2011 09:30:32 +0200] ERROR: Creating Hoptoad exception report
[Wed, 01 Jun 2011 09:30:33 +0200] ERROR: Exception handlers complete
[Wed, 01 Jun 2011 09:30:33 +0200] DEBUG: Re-raising exception: Chef::Exceptions::Exec - yes | gluster volume stop albert returned 255, expected 0
---- Begin output of yes | gluster volume stop albert ----
STDOUT: Stopping volume will make its data inaccessible. Do you want to continue? (y/n) Operation failed on n6c1STDERR:
---- End output of yes | gluster volume stop albert ----
```

> If so, a problem occurred on n6c1 at that time. The logs provided for n6c1 are
> not complete for the mentioned time. Could you please upload the n6c1 logs
> again?

Sure.

---

Hi again. After examining the logs, I think that the glusterfs brick process on n6c1 was not running or had crashed when the stop command was issued. Can you confirm this? Also, could you provide the logs for the brick process on n6c1? Thanks.

---

Brick log:

```
[2011-06-01 09:30:05.220873] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.albert-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.0.4.11:1015)
[2011-06-01 09:30:05.264666] I [server-helpers.c:360:do_lock_table_cleanup] 0-albert-server: inodelk released on /disk/assets/products
[2011-06-01 09:30:05.264712] I [server.c:438:server_rpc_notify] 0-albert-server: disconnected connection from 10.0.4.11:1015
[2011-06-01 09:30:05.306495] I [server-helpers.c:783:server_connection_destroy] 0-albert-server: destroyed connection of i-10-0-4-11-1455-2011/05/28-12:10:43:738313-albert-client-0
[2011-06-01 09:30:21.317563] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.albert-server: reading from socket failed. Error (Transport endpoint is not connected), peer (1.1.1.1:815)
[2011-06-01 09:30:21.317647] I [server.c:438:server_rpc_notify] 0-albert-server: disconnected connection from 1.1.1.1:815
[2011-06-01 09:30:21.317663] I [server-helpers.c:783:server_connection_destroy] 0-albert-server: destroyed connection of n5c1-32673-2011/05/31-12:00:00:211874-albert-client-0
[2011-06-01 09:30:22.534946] W [glusterfsd.c:700:cleanup_and_exit] (-->/usr/lib/libgfrpc.so.0(rpcsvc_notify+0x93) [0x7f0df4a8f073] (-->/usr/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x27f) [0x7f0df4a8ef3f] (-->/usr/sbin/glusterfsd(glusterfs_handle_terminate+0x1d) [0x4078fd]))) 0-: received signum (15), shutting down
```

---

CHANGE: http://review.gluster.com/246 (This fixes problem caused by race between glusterd and glusterfsd) merged in release-3.2 by Anand Avati (avati)

CHANGE: http://review.gluster.com/206 (This fixes problem caused by race between glusterd and glusterfsd) merged in master by Anand Avati (avati)
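The reporter's diagnostic step (find the one node whose `gluster volume info` disagrees with the rest, then restart glusterd there) can be sketched as a small majority check. This is a hypothetical helper, not part of GlusterFS: it only assumes the plain-text `Status:` line format shown in the 3.2-era output above, and the node names and outputs below are illustrative.

```python
# Sketch: flag the node whose reported volume Status disagrees with the
# majority, based on plain-text `gluster volume info` output as shown in
# this report. Hypothetical helper code; outputs are illustrative.
from collections import Counter

def volume_status(info_text):
    """Extract the value of the `Status:` line from `gluster volume info` output."""
    for line in info_text.splitlines():
        if line.startswith("Status:"):
            return line.split(":", 1)[1].strip()
    return None

def find_divergent_nodes(outputs):
    """outputs maps node name -> that node's `gluster volume info` text.
    Returns (majority_status, list_of_nodes_disagreeing_with_the_majority)."""
    statuses = {node: volume_status(text) for node, text in outputs.items()}
    majority, _ = Counter(statuses.values()).most_common(1)[0]
    return majority, [n for n, s in statuses.items() if s != majority]

# Example mirroring the report: n6c1 still claims "Started" after a stop.
outputs = {
    "n4c1": "Volume Name: albert\nStatus: Stopped\n",
    "n5c1": "Volume Name: albert\nStatus: Stopped\n",
    "n6c1": "Volume Name: albert\nStatus: Started\n",
    "n7c1": "Volume Name: albert\nStatus: Stopped\n",
}
majority, divergent = find_divergent_nodes(outputs)
print(majority, divergent)  # Stopped ['n6c1']
```

The divergent node returned here (n6c1 in the example) is the one where the reporter's workaround, restarting the glusterd service, would be applied.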