Bug 839021
Summary: | Poor error messages when directory not present on remote brick | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | b.candler |
Component: | glusterfs | Assignee: | Krutika Dhananjay <kdhananj> |
Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> |
Severity: | low | Docs Contact: | |
Priority: | high | ||
Version: | unspecified | CC: | aavati, amarts, gluster-bugs, kparthas |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-3.4.0qa8 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2013-09-23 22:32:45 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Need bug 762935 to be fixed for handling this.

Today, the absence of brick directories is detected much earlier, at the time of volume start (as opposed to going undetected until mount time), by failing the operation. Please check whether the above-mentioned problem is solved in glusterfs version 3.4.0qa4.

If you mean at the time of "gluster volume start", then that doesn't sound like it will fix this problem. The case here was a volume which had already been created and started, but a storage server was then rebooted. The scenario:

1. For a reason which doesn't matter here, the array on /disk/storage1 did not mount at boot time.
2. Hence /disk/storage1 was an empty directory.
3. The volume references /disk/storage1/single1 as a brick. Because the brick directory was not present, glusterfsd immediately terminated:

[2012-07-10 18:25:57.804526] E [posix.c:3930:init] 0-single1-posix: Directory '/disk/storage1/single1' doesn't exist, exiting.

4. At mount time, the client tried to talk to the brick, but nothing was listening on the port it was supposed to be on.
5. Hence the mount failed, but with only "Please check the log file for more details." as the error.
6. The client log file shows nothing more helpful than "Transport endpoint is not connected". You have to dig around in the brick log files until you eventually find the message I showed under (3).

So IMO this still needs to be reported properly at mount time. In the case of a failure to connect to a brick, a message like "failed to communicate with brick XXX, check 'gluster volume status' on one of the server nodes" would be better than nothing.

> So IMO this still needs to be reported properly at mount time. In the case of a failure to connect to a brick, a message like "failed to communicate with brick XXX, check 'gluster volume status' on one of the server nodes" would be better than nothing.
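The failure mode described above boils down to a brick path vanishing underneath an already-started volume. As a stopgap, an admin could run a pre-flight check like the sketch below before restarting services or mounting. This is purely illustrative, not part of gluster; `check_bricks` is a hypothetical helper, and the brick path in the example comment is the one from this report.

```shell
#!/bin/sh
# Hypothetical pre-flight check: verify that each brick directory exists
# before (re)starting or mounting a volume. If the underlying array or
# filesystem did not mount at boot, the brick path will be absent and
# glusterfsd will exit as shown in the log line above.

check_bricks() {
    for brick in "$@"; do
        if [ ! -d "$brick" ]; then
            echo "brick directory '$brick' is missing - is the underlying filesystem mounted?" >&2
            return 1
        fi
    done
    echo "all brick directories present"
}

# Example (path taken from the scenario above):
# check_bricks /disk/storage1/single1
```

This does not fix the error reporting, but it surfaces the missing-brick condition on the server before a client ever sees "Transport endpoint is not connected".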
I agree with you, and it makes sense. We will try to address this within the next few days.
http://review.gluster.org/4330 and http://review.gluster.org/4331 posted for review.

CHANGE: http://review.gluster.org/4341 (protocol/client: print a message regarding brick status in case of failure) merged in master by Vijay Bellur (vbellur)

The other part discussed in patch 4330 will be implemented separately; please see if this is enough for now.

Verified with glusterfs-3.4.0.10rhs-1. Did the following steps:

1. Created two logical volumes and formatted them with XFS.
2. Mounted the two logical volumes at /rhs/brick1 and /rhs/brick2.
3. Added an fstab entry for /rhs/brick1 only, so after rebooting /rhs/brick2 doesn't have its xattrs, as the logical volume is not mounted.
4. Tried to mount the volume from the client side. Saw a reasonable message on the client side:

<snip>
[2013-06-19 11:31:13.216116] E [client-handshake.c:1741:client_query_portmap_cbk] 0-testvol-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2013-06-19 11:31:13.216209] W [socket.c:515:__socket_rwv] 0-testvol-client-0: readv on 10.70.35.171:24007 failed (No data available)
[2013-06-19 11:31:13.216255] I [client.c:2103:client_rpc_notify] 0-testvol-client-0: disconnected from 10.70.35.171:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-06-19 11:31:13.216297] E [client-handshake.c:1741:client_query_portmap_cbk] 0-testvol-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2013-06-19 11:31:13.216348] W [socket.c:515:__socket_rwv] 0-testvol-client-1: readv on 10.70.35.171:24007 failed (No data available)
[2013-06-19 11:31:13.216381] I [client.c:2103:client_rpc_notify] 0-testvol-client-1: disconnected from 10.70.35.171:24007. Client process will keep trying to connect to glusterd until brick's port is available.
</snip>

This log would suffice for the requirement.
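For reference, the asymmetry in step 3 of the verification would look roughly like this in /etc/fstab. This is a sketch only; the device and volume-group names are assumptions, not from the report. The point is that only the first brick's logical volume is listed, so after a reboot /rhs/brick2 is a bare directory without the brick's xattrs:

```
# /etc/fstab (sketch; VG/LV names are hypothetical)
/dev/rhs_vg/rhs_lv1  /rhs/brick1  xfs  defaults  0 0
# deliberately no entry for /rhs/brick2 -> after reboot it is an empty directory
```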
Moving it to VERIFIED state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
Summary: very poor error reporting makes it very difficult to diagnose a problem which turned out to have a basic underlying cause. This is glusterfs-3.3.0 under Ubuntu 12.04.

root@dev-storage2:~# grep single1 /etc/fstab
dev-storage2:/single1 /gluster/single1 glusterfs noauto 0 0
root@dev-storage2:~# mount /gluster/single1
unknown option noauto (ignored)
Mount failed. Please check the log file for more details.

However, looking at the client log file is not very helpful. It just shows "failed to get the port number for remote subvolume" and "Transport endpoint is not connected". (Full log at the end of this message.)

The volume info is sensible:

root@dev-storage2:~# gluster volume info single1

Volume Name: single1
Type: Distribute
Volume ID: 74d62eb4-176e-4671-8471-779d909e19f0
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: dev-storage1:/disk/storage1/single1

The remote server (dev-storage1) which holds the brick *is* online, and I had restarted glusterd there.

Now, it turns out that the problem was as follows. On the remote brick server, dev-storage1, the RAID array which held the data under /disk/storage1/single1 had not come up:

root@dev-storage1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive sdk[9](S) sdl[10](S) sdj[8](S) sdg[5](S) sdc[1](S) sdm[11](S) sdf[4](S) sde[3](S) sdi[7](S) sdd[2](S) sdb[0](S) sdh[6](S)
      35163186720 blocks super 1.2

unused devices: <none>

So the issue is that glusterfs cannot find the directory and/or the xattr metadata that it's expecting. But this had to be found out by guesswork. It would be much better if this status could be relayed back to the client as a message. It would also be good at the server side if "service glusterd restart" would report broken bricks to stderr.
In fact, if you look at the server side in /var/log/glusterfs/bricks/disk-storage1-single1.log, there *is* one line which shows the problem:

[2012-07-10 18:25:55.775339] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.3.0
[2012-07-10 18:25:57.802308] I [graph.c:241:gf_add_cmdline_options] 0-single1-server: adding option 'listen-port' for volume 'single1-server' with value '24011'
[2012-07-10 18:25:57.802358] I [graph.c:241:gf_add_cmdline_options] 0-single1-posix: adding option 'glusterd-uuid' for volume 'single1-posix' with value 'a1a8a3cd-468e-44d5-8c08-23821db4c80f'
[2012-07-10 18:25:57.804060] W [options.c:782:xl_opt_validate] 0-single1-server: option 'listen-port' is deprecated, preferred is 'transport.socket.listen-port', continuing with correction

>>>>>>>> AHA!
[2012-07-10 18:25:57.804526] E [posix.c:3930:init] 0-single1-posix: Directory '/disk/storage1/single1' doesn't exist, exiting.
<<<<<<<<

[2012-07-10 18:25:57.804575] E [xlator.c:385:xlator_init] 0-single1-posix: Initialization of volume 'single1-posix' failed, review your volfile again
[2012-07-10 18:25:57.804594] E [graph.c:294:glusterfs_graph_init] 0-single1-posix: initializing translator failed
[2012-07-10 18:25:57.804611] E [graph.c:483:glusterfs_graph_activate] 0-graph: init failed
[2012-07-10 18:25:57.804905] W [glusterfsd.c:831:cleanup_and_exit] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fe7bb608b85] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0xe0) [0x40c8b0] (-->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x192) [0x407562]))) 0-: received signum (0), shutting down

IMO this is much harder to debug operationally than it should be.

Regards, Brian.
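Until the client reports this properly, the "AHA" line above has to be found by hand across brick logs. A small sketch of automating that search follows. This is a hypothetical helper, not part of gluster; it simply matches fatal posix-translator init errors in the log format shown above.

```python
import re

# Match fatal posix init errors of the form seen in the brick log above:
#   [timestamp] E [posix.c:NNNN:init] 0-single1-posix: Directory '...' doesn't exist, exiting.
# Captures the translator name and the error message.
FATAL_RE = re.compile(r"\bE\s+\[posix\.c:\d+:init\]\s+(\S+):\s+(.*)")

def find_brick_init_errors(log_text):
    """Return (translator, message) pairs for fatal posix init errors."""
    return FATAL_RE.findall(log_text)

# Sample line taken verbatim from the brick log in this report:
sample = ("[2012-07-10 18:25:57.804526] E [posix.c:3930:init] "
          "0-single1-posix: Directory '/disk/storage1/single1' "
          "doesn't exist, exiting.")

for translator, message in find_brick_init_errors(sample):
    print(translator, "->", message)
```

Run against the contents of /var/log/glusterfs/bricks/*.log, this would surface the one decisive line without manual digging.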
(Client log follows.)

[2012-07-10 18:16:27.135716] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.3.0
[2012-07-10 18:16:27.139567] I [quick-read.c:3627:check_cache_size_ok] 0-single1-quick-read: Max cache size is 8351887360
[2012-07-10 18:16:27.139668] I [io-cache.c:1549:check_cache_size_ok] 0-single1-io-cache: Max cache size is 8351887360
[2012-07-10 18:16:27.140465] I [client.c:2142:notify] 0-single1-client-0: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
  1: volume single1-client-0
  2:     type protocol/client
  3:     option remote-host dev-storage1
  4:     option remote-subvolume /disk/storage1/single1
  5:     option transport-type tcp
  6:     option username 7762a8fe-a6f4-43b0-9b24-8a5165260cf1
  7:     option password b0f0c963-159a-4243-9b1f-5965baf5943d
  8: end-volume
  9:
 10: volume single1-write-behind
 11:     type performance/write-behind
 12:     subvolumes single1-client-0
 13: end-volume
 14:
 15: volume single1-read-ahead
 16:     type performance/read-ahead
 17:     subvolumes single1-write-behind
 18: end-volume
 19:
 20: volume single1-io-cache
 21:     type performance/io-cache
 22:     subvolumes single1-read-ahead
 23: end-volume
 24:
 25: volume single1-quick-read
 26:     type performance/quick-read
 27:     subvolumes single1-io-cache
 28: end-volume
 29:
 30: volume single1-md-cache
 31:     type performance/md-cache
 32:     subvolumes single1-quick-read
 33: end-volume
 34:
 35: volume single1
 36:     type debug/io-stats
 37:     option latency-measurement off
 38:     option count-fop-hits off
 39:     subvolumes single1-md-cache
 40: end-volume
+------------------------------------------------------------------------------+
[2012-07-10 18:16:27.141561] E [client-handshake.c:1717:client_query_portmap_cbk] 0-single1-client-0: failed to get the port number for remote subvolume
[2012-07-10 18:16:27.141638] I [client.c:2090:client_rpc_notify] 0-single1-client-0: disconnected
[2012-07-10 18:16:27.147086] I [fuse-bridge.c:4193:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-07-10 18:16:27.147318] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.17
[2012-07-10 18:16:27.147547] W [fuse-bridge.c:513:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2012-07-10 18:16:27.150528] I [fuse-bridge.c:4093:fuse_thread_proc] 0-fuse: unmounting /gluster/single1
[2012-07-10 18:16:27.150838] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f50b74cf4bd] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f50b77a1e9a] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x406d95]))) 0-: received signum (15), shutting down
[2012-07-10 18:16:27.150873] I [fuse-bridge.c:4643:fini] 0-fuse: Unmounting '/gluster/single1'.