Before I get started, here's our server info. We are running with the default config.

gluster --version
glusterfs 3.1.2 built on Feb 1 2011 07:33:37
Repository revision: v3.1.1-64-gf2a067c
Copyright (c) 2006-2010 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU Affero General Public License.

gluster volume info SAN
Volume Name: SAN
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: med-san00:/glusterfs/export01
Brick2: med-san01:/glusterfs/export01
Brick3: med-san00:/glusterfs/export02
Brick4: med-san01:/glusterfs/export02
Brick5: med-san00:/glusterfs/export03
Brick6: med-san01:/glusterfs/export03
Brick7: med-san00:/glusterfs/export04
Brick8: med-san01:/glusterfs/export04

Hi All,

This week we rolled out our new gluster server; currently we have 4 clients that connect to it. On all 4 clients we have identified a strange issue with one of the folders in the gluster volume (/san/tiles/urbanstreets-v2).

To describe what this gluster volume contains: 6+ million files in a heavily nested folder structure, each file being between 2K and 6K. The core of the problem is that each client behaves differently as to what it displays in this folder.

For example, on one of our gluster clients, when we navigate to the /san/tiles/urbanstreets-v2 folder and issue an ls we get this:

pdeschamps@tc003:/san/tiles/urbanstreets-v2$ ls
00 02 04 06 08 10 12 14 16 urban_king1.tar
01 03 05 07 09 11 13 15 17 urban_king.tar

This is the correct, expected output. On another client connected to the same volume, when we do an ls in the same folder we get this:

pdeschamps@tilecache-004:/san/tiles/urbanstreets-v2$ ls
urbanstreets_missed.log

When I navigate to the gluster export volumes and do an ls in the export I get the correct structure:

EXPORT 1
ls /glusterfs/export01/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 urban_king1.tar

EXPORT 2
ls /glusterfs/export02/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 urban_king.tar

EXPORT 3
ls /glusterfs/export03/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 urbanstreets_missed.log

EXPORT 4
ls /glusterfs/export04/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17

Below is a dump of the client log from three file operations on the file that exists according to the client (urbanstreets_missed.log).

BEGIN DUMP OF THREE OPERATIONS: LS, STAT, MV

SERVER 3.1.2 : DEFAULT CONFIG

CLIENT:
# /etc/fstab: static file system information.
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc nodev,noexec,nosuid 0 0
LABEL=uec-rootfs / ext3 defaults 0 0
/dev/sda2 /mnt auto defaults,nobootwait,comment=cloudconfig 0 0
/dev/sda3 none swap sw,comment=cloudconfig 0 0
/dev/sdf1 /mnt/data ext3 defaults 0 0
med-san00:7997:/SAN /san glusterfs defaults,_netdev,log-level=DEBUG,log-file=/var/log/gluster-debug.log 0 0

dmesg | grep -i fuse
[ 17.123074] fuse init (API version 7.13)

LS OPERATION
ls: /san/tiles/urbanstreets-v2/urbanstreets_missed.log: No such file or directory
total 408K
-rwxr-xr-- 1 apache data 402K 2011-01-30 11:11 urbanstreets_missed.log

CLIENT LOG-----------------
[2011-03-10 14:32:44.170086] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-6: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:44.170220] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-7: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:44.170287] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (Invalid argument)
[2011-03-10 14:32:44.170439] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-0: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:44.170495] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-1: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:44.170523] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (Invalid argument)
[2011-03-10 14:32:44.170634] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-2: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:44.170701] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-3: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:44.170734] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (Invalid argument)
[2011-03-10 14:32:44.173888] D [dht-common.c:2137:dht_fd_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (No such file or directory)
[2011-03-10 14:32:44.174165] D [dht-common.c:2137:dht_fd_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (No such file or directory)
[2011-03-10 14:32:44.174231] D [dht-common.c:2137:dht_fd_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (No such file or directory)
[2011-03-10 14:32:44.174367] D [client3_1-fops.c:4852:client3_1_readdirp] SAN-client-0: (352321552): failed to get fd ctx. EBADFD
[2011-03-10 14:32:44.174489] D [client3_1-fops.c:4852:client3_1_readdirp] SAN-client-2: (352321552): failed to get fd ctx. EBADFD
[2011-03-10 14:32:44.176553] D [dht-common.c:1164:dht_attr_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (No such file or directory)
[2011-03-10 14:32:44.176712] D [dht-common.c:1164:dht_attr_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (No such file or directory)
[2011-03-10 14:32:44.176765] D [dht-common.c:1164:dht_attr_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (No such file or directory)
[2011-03-10 14:32:44.176938] E [dht-common.c:1865:dht_getxattr] SAN-dht: layout is NULL
[2011-03-10 14:32:44.176975] W [fuse-bridge.c:2432:fuse_xattr_cbk] glusterfs-fuse: 951: GETXATTR() /tiles/urbanstreets-v2/urbanstreets_missed.log => -1 (No such file or directory)
[2011-03-10 14:32:44.178664] D [client3_1-fops.c:4852:client3_1_readdirp] SAN-client-6: (352321552): failed to get fd ctx. EBADFD
[2011-03-10 14:32:44.178804] D [stat-prefetch.c:3587:sp_release] SAN-stat-prefetch: cache hits: 1, cache miss: 0

STAT OPERATION
root@tilecache-004:~# stat /san/tiles/urbanstreets-v2/urbanstreets_missed.log
  File: `/san/tiles/urbanstreets-v2/urbanstreets_missed.log'
  Size: 410872  Blocks: 816  IO Block: 131072  regular file
Device: 13h/19d  Inode: 1459625834  Links: 1
Access: (0754/-rwxr-xr--)  Uid: ( 1001/ apache)  Gid: ( 2001/ data)
Access: 2011-03-03 10:59:15.000000000 -0500
Modify: 2011-01-30 11:11:11.000000000 -0500
Change: 2011-03-10 11:40:00.000000000 -0500

[2011-03-10 14:32:56.905605] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-6: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:56.905758] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-0: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:56.905809] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-7: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:56.905840] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (Invalid argument)
[2011-03-10 14:32:56.905930] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-2: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:56.906061] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-3: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:56.906090] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (Invalid argument)
[2011-03-10 14:32:56.906178] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-1: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:56.906205] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (Invalid argument)
[2011-03-10 14:32:56.906889] D [dht-common.c:2137:dht_fd_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (No such file or directory)
[2011-03-10 14:32:56.907309] D [dht-common.c:2137:dht_fd_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (No such file or directory)
[2011-03-10 14:32:56.907350] D [dht-common.c:2137:dht_fd_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (No such file or directory)
[2011-03-10 14:32:56.907507] D [client3_1-fops.c:4852:client3_1_readdirp] SAN-client-0: (352321552): failed to get fd ctx. EBADFD
[2011-03-10 14:32:56.907545] D [client3_1-fops.c:4852:client3_1_readdirp] SAN-client-2: (352321552): failed to get fd ctx. EBADFD
[2011-03-10 14:32:56.909448] D [client3_1-fops.c:4852:client3_1_readdirp] SAN-client-6: (352321552): failed to get fd ctx. EBADFD
[2011-03-10 14:32:56.909588] D [stat-prefetch.c:3587:sp_release] SAN-stat-prefetch: cache hits: 0, cache miss: 0
[2011-03-10 14:32:56.910639] D [dht-common.c:1164:dht_attr_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (No such file or directory)
[2011-03-10 14:32:56.911089] D [dht-common.c:1164:dht_attr_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (No such file or directory)
[2011-03-10 14:32:56.911130] D [dht-common.c:1164:dht_attr_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (No such file or directory)
[2011-03-10 14:32:56.913111] D [dht-common.c:587:dht_lookup_everywhere_cbk] SAN-dht: found on SAN-replicate-2 file /tiles/urbanstreets-v2/urbanstreets_missed.log
[2011-03-10 14:32:56.913138] D [dht-common.c:672:dht_lookup_everywhere_cbk] SAN-dht: linking file /tiles/urbanstreets-v2/urbanstreets_missed.log existing on SAN-replicate-2 to SAN-replicate-1 (hash)
[2011-03-10 14:32:56.913799] D [afr-lk-common.c:410:transaction_lk_op] SAN-replicate-1: lk op is for a transaction
[2011-03-10 14:32:56.913827] D [afr-transaction.c:1023:afr_post_nonblocking_entrylk_cbk] SAN-replicate-1: Non blocking entrylks failed. Proceeding to blocking
[2011-03-10 14:32:56.914892] D [afr-lk-common.c:964:afr_lock_blocking] SAN-replicate-1: unable to lock on even one child
[2011-03-10 14:32:56.914917] D [afr-lk-common.c:410:transaction_lk_op] SAN-replicate-1: lk op is for a transaction
[2011-03-10 14:32:56.914942] D [afr-transaction.c:998:afr_post_blocking_entrylk_cbk] SAN-replicate-1: Blocking entrylks failed.
[2011-03-10 14:32:58.579440] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-6: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:58.579581] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-7: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:58.579613] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (Invalid argument)
[2011-03-10 14:32:58.579703] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-2: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:58.579785] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-3: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:58.579813] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (Invalid argument)
[2011-03-10 14:32:58.579872] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-1: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:58.579952] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-0: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:32:58.579979] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (Invalid argument)
[2011-03-10 14:32:58.581669] D [dht-common.c:587:dht_lookup_everywhere_cbk] SAN-dht: found on SAN-replicate-2 file /tiles/urbanstreets-v2/urbanstreets_missed.log
[2011-03-10 14:32:58.581860] D [dht-common.c:672:dht_lookup_everywhere_cbk] SAN-dht: linking file /tiles/urbanstreets-v2/urbanstreets_missed.log existing on SAN-replicate-2 to SAN-replicate-1 (hash)
[2011-03-10 14:32:58.582551] D [afr-lk-common.c:410:transaction_lk_op] SAN-replicate-1: lk op is for a transaction
[2011-03-10 14:32:58.582577] D [afr-transaction.c:1023:afr_post_nonblocking_entrylk_cbk] SAN-replicate-1: Non blocking entrylks failed. Proceeding to blocking
[2011-03-10 14:32:58.583611] D [afr-lk-common.c:964:afr_lock_blocking] SAN-replicate-1: unable to lock on even one child
[2011-03-10 14:32:58.583650] D [afr-lk-common.c:410:transaction_lk_op] SAN-replicate-1: lk op is for a transaction
[2011-03-10 14:32:58.583672] D [afr-transaction.c:998:afr_post_blocking_entrylk_cbk] SAN-replicate-1: Blocking entrylks failed.

MOVE OPERATION
mv /san/tiles/urbanstreets-v2/urbanstreets_missed.log ~/
mv: cannot open `/san/tiles/urbanstreets-v2/urbanstreets_missed.log' for reading: Invalid argument

CLIENT LOG:
[2011-03-10 14:33:05.529603] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-7: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:33:05.529690] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-6: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:33:05.529721] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-3 returned -1 (Invalid argument)
[2011-03-10 14:33:05.529886] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-0: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:33:05.529946] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-2: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:33:05.530003] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-3: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:33:05.530034] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-1 returned -1 (Invalid argument)
[2011-03-10 14:33:05.530105] D [client3_1-fops.c:1925:client3_1_lookup_cbk] SAN-client-1: gfid changed for /tiles/urbanstreets-v2
[2011-03-10 14:33:05.530132] I [dht-common.c:369:dht_revalidate_cbk] SAN-dht: subvolume SAN-replicate-0 returned -1 (Invalid argument)
[2011-03-10 14:33:05.531737] D [dht-common.c:587:dht_lookup_everywhere_cbk] SAN-dht: found on SAN-replicate-2 file /tiles/urbanstreets-v2/urbanstreets_missed.log
[2011-03-10 14:33:05.531827] D [dht-common.c:672:dht_lookup_everywhere_cbk] SAN-dht: linking file /tiles/urbanstreets-v2/urbanstreets_missed.log existing on SAN-replicate-2 to SAN-replicate-1 (hash)
[2011-03-10 14:33:05.532344] D [afr-lk-common.c:410:transaction_lk_op] SAN-replicate-1: lk op is for a transaction
[2011-03-10 14:33:05.532370] D [afr-transaction.c:1023:afr_post_nonblocking_entrylk_cbk] SAN-replicate-1: Non blocking entrylks failed. Proceeding to blocking
[2011-03-10 14:33:05.533223] D [afr-lk-common.c:964:afr_lock_blocking] SAN-replicate-1: unable to lock on even one child
[2011-03-10 14:33:05.533248] D [afr-lk-common.c:410:transaction_lk_op] SAN-replicate-1: lk op is for a transaction
[2011-03-10 14:33:05.533269] D [afr-transaction.c:998:afr_post_blocking_entrylk_cbk] SAN-replicate-1: Blocking entrylks failed.
[2011-03-10 14:33:05.533396] D [dht-common.c:2172:dht_open] SAN-dht: no cached subvolume for fd=0xb42c8044
[2011-03-10 14:33:05.533433] W [fuse-bridge.c:574:fuse_fd_cbk] glusterfs-fuse: 971: OPEN() /tiles/urbanstreets-v2/urbanstreets_missed.log => -1 (Invalid argument)
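For what it's worth, the repeated "gfid changed for /tiles/urbanstreets-v2" messages suggest the directory's GFID is not identical on all bricks. Below is a minimal diagnostic sketch (mine, not part of the original report) that compares the trusted.gfid extended attribute of that directory on each local brick; it assumes the brick paths from the volume info above and has to run as root on each storage server:

#!/usr/bin/env python3
# Hypothetical diagnostic sketch: compare the trusted.gfid xattr of one
# directory across all local bricks. GlusterFS stores each file's or
# directory's GFID in this xattr on the backend bricks; if the values
# differ, clients can log "gfid changed for <path>" as seen above.
# Run as root on each storage server (med-san00 / med-san01).
import os
import binascii

BRICKS = [
    "/glusterfs/export01",
    "/glusterfs/export02",
    "/glusterfs/export03",
    "/glusterfs/export04",
]
RELPATH = "tiles/urbanstreets-v2"  # directory reported by the clients

for brick in BRICKS:
    path = os.path.join(brick, RELPATH)
    try:
        gfid = os.getxattr(path, "trusted.gfid")
        print(f"{path}: {binascii.hexlify(gfid).decode()}")
    except OSError as err:
        print(f"{path}: cannot read trusted.gfid ({err})")

If the values differ between the bricks of a replica pair, or across the distribute subvolumes, that would be consistent with the lookup failures in the log.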
Bump. Is anyone looking at this?
Allow me to elaborate with some more details about our infrastructure. We are running Gluster in the Amazon cloud on two Ubuntu 10.04.1 LTS instances; each export (brick) is a separate 750 GB EBS volume.

The original issue does not exist anymore. After we rebooted the gluster master and slave and then waited 30-40 minutes, the clients synced up; what I mean is that once the server had rebooted and gluster had been back up and running for 30-40 minutes, all the clients started to look the same.

However, with this new setup (gluster 3.1.2 with replication) we are experiencing a lot of interesting issues that did not occur with 3.0.5 and no replication. Either the addition of replication or the version itself seems to have revealed some issues.

Our latest issue: we have 4 tile servers that take in requests to either serve up a PNG or create one. When the PNG needs to be rendered, we render it, save it to the glusterfs, and immediately return the image data from that saved file. What we are getting back from gluster is a half-rendered PNG file. This codebase has not changed; the only things that have changed are our version of gluster and the addition of replication.

If I were to guess: gluster is still replicating or saving the stream to the brick when a request to read the data of this file occurs, so the read results in an incomplete file. Troubleshooting this is going to be pretty crazy. Have you done QA/QC on this kind of scenario? 4 bricks replicated across two servers on the Amazon cloud with 8 EBS volumes containing roughly 6,000,000 PNGs ranging between 2K and 6K each :) Pretty specific, but I think it may reveal some issues. A rough sketch of the flow is below.
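For reference, a rough sketch of the render-and-serve flow described above. This is my reconstruction, not the actual codebase; all function names and paths other than /san are hypothetical. If another tile server can be asked for the same tile while the copy onto the gluster volume is still in flight, that reader can get a partial file:

import os
import shutil

SAN_TILES = "/san/tiles"   # glusterfs mount shared by all 4 tile servers
LOCAL_TMP = "/tmp"         # local scratch space on the rendering server

def render_tile(z, x, y):
    # Placeholder for the real renderer (hypothetical); returns PNG bytes.
    raise NotImplementedError

def get_tile(z, x, y):
    tile_path = os.path.join(SAN_TILES, str(z), str(x), "%d.png" % y)
    if os.path.exists(tile_path):
        # Serve an existing tile straight from the gluster mount.
        with open(tile_path, "rb") as f:
            return f.read()
    # Render locally, then copy onto the shared volume and serve the bytes.
    data = render_tile(z, x, y)
    local_path = os.path.join(LOCAL_TMP, "%d-%d-%d.png" % (z, x, y))
    with open(local_path, "wb") as f:
        f.write(data)
    os.makedirs(os.path.dirname(tile_path), exist_ok=True)
    shutil.copyfile(local_path, tile_path)  # while this copy is in flight, a
                                            # read of tile_path from another
                                            # client can return a partial file
    return data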
I apologize, I made an error as to the actual size of our data set. We are hosting about 1.5 TB of these 2K-6K PNG files, which equates to somewhere between 268,435,456 and 805,306,368 files. I would give you an exact count, but a "du /san" operation never returns, even after days of running.
Well, we are moving our infrastructure off of gluster: too many issues and, obviously, not enough community involvement.
(In reply to comment #4)
> Well, we are moving our infrastructure off of gluster: too many issues and,
> obviously, not enough community involvement.

hi Paul,
We apologize for not giving any input on bug 764243; we were busy getting the 3.1.3 release out the door. We would love to have you back on gluster, and with your help we will get your setup going.

We found that the first issue you faced is similar to 2464; we want the ls -l output on each of the bricks to confirm it is the same.

For the second issue, the half-rendering of the PNG files, we would like to understand the problem a little bit more. Could you provide the logs of the client when this issue happens? Does it happen only for one file or for all the files? Also give the ls -l output for the file on the backends.

Thanks
Pranith
Here's the output from SAN00:

ls -l export01 export02 export03 export04
export01:
total 116
drwxr-xr-x 18 root root 4096 2010-06-08 08:33 data
drwxr-xr-x 5 root root 4096 2010-07-12 11:35 hold
drwx------ 2 root root 16384 2011-02-28 10:22 lost+found
drwxr-xr-x 7 apache data 4096 2010-11-19 14:15 mapsherpa
drwxr-xr-x 3 apache apache 4096 2010-09-03 14:22 ofsc
drwxr-xr-x 3 root root 4096 2010-06-10 13:40 old_mapsherpa
drwxr-xr-x 10 root data 4096 2011-03-03 12:05 tiles
drwxrwxr-x 2 root data 57344 2011-03-22 10:04 tmp

export02:
total 128
drwxr-xr-x 18 root data 4096 2010-06-08 08:33 data
drwxr-xr-x 5 root root 4096 2010-07-12 11:36 hold
drwx------ 2 root root 16384 2011-03-01 08:04 lost+found
drwxr-xr-x 7 apache data 4096 2010-11-19 14:15 mapsherpa
drwxr-xr-x 3 apache apache 4096 2010-09-03 14:22 ofsc
drwxr-xr-x 3 root root 4096 2010-06-10 13:40 old_mapsherpa
drwxr-xr-x 10 root root 4096 2011-03-10 14:54 tiles
drwxrwxr-x 2 root data 65536 2011-03-22 10:05 tmp

export03:
total 124
drwxr-xr-x 18 root data 4096 2010-11-24 08:32 data
drwxr-xr-x 5 root root 4096 2011-02-02 11:07 hold
drwx------ 2 root root 16384 2011-03-02 07:43 lost+found
drwxr-xr-x 7 apache data 4096 2011-03-06 22:46 mapsherpa
drwxr-xr-x 3 apache apache 4096 2010-09-03 14:22 ofsc
drwxr-xr-x 18 apache data 4096 2010-12-07 04:48 ofsc-network
drwxr-xr-x 3 root root 4096 2010-06-10 13:40 old_mapsherpa
drwxr-xr-x 10 root root 4096 2011-03-10 12:20 tiles
drwxrwxr-x 2 root data 61440 2011-03-22 10:29 tmp

export04:
total 104
drwxr-xr-x 18 root root 4096 2011-03-06 07:19 data
drwxr-xr-x 5 root root 4096 2011-03-06 06:59 hold
drwx------ 2 root root 16384 2011-03-03 12:33 lost+found
drwxr-xr-x 7 root root 4096 2011-03-06 22:46 mapsherpa
drwxr-xr-x 3 root root 4096 2011-03-06 22:46 ofsc
drwxr-xr-x 3 root root 4096 2011-03-06 07:21 old_mapsherpa
drwxr-xr-x 10 root root 4096 2011-03-05 08:57 tiles
drwxrwxr-x 2 root data 53248 2011-03-22 10:27 tmp

And now the output from SAN01:

ls -l export01 export02 export03 export04
export01:
total 116
drwxr-xr-x 18 root root 4096 2010-06-08 08:33 data
drwxr-xr-x 5 root root 4096 2010-07-12 11:35 hold
drwx------ 2 root root 16384 2011-02-28 10:22 lost+found
drwxr-xr-x 7 apache data 4096 2010-11-19 14:15 mapsherpa
drwxr-xr-x 3 apache apache 4096 2010-09-03 14:22 ofsc
drwxr-xr-x 3 root root 4096 2010-06-10 13:40 old_mapsherpa
drwxr-xr-x 10 root data 4096 2011-03-03 12:05 tiles
drwxrwxr-x 2 root data 57344 2011-03-22 11:04 tmp

export02:
total 128
drwxr-xr-x 18 root data 4096 2010-06-08 08:33 data
drwxr-xr-x 5 root root 4096 2010-07-12 11:36 hold
drwx------ 2 root root 16384 2011-03-01 08:04 lost+found
drwxr-xr-x 7 apache data 4096 2010-11-19 14:15 mapsherpa
drwxr-xr-x 3 apache apache 4096 2010-09-03 14:22 ofsc
drwxr-xr-x 3 root root 4096 2010-06-10 13:40 old_mapsherpa
drwxr-xr-x 10 root root 4096 2011-03-10 14:54 tiles
drwxrwxr-x 2 root data 65536 2011-03-22 11:07 tmp

export03:
total 124
drwxr-xr-x 18 root data 4096 2010-11-24 08:32 data
drwxr-xr-x 5 root root 4096 2011-02-02 11:07 hold
drwx------ 2 root root 16384 2011-03-02 07:43 lost+found
drwxr-xr-x 7 apache data 4096 2011-03-06 22:46 mapsherpa
drwxr-xr-x 3 apache apache 4096 2010-09-03 14:22 ofsc
drwxr-xr-x 18 apache data 4096 2010-12-07 04:48 ofsc-network
drwxr-xr-x 3 root root 4096 2010-06-10 13:40 old_mapsherpa
drwxr-xr-x 10 root root 4096 2011-03-10 12:20 tiles
drwxrwxr-x 2 root data 61440 2011-03-22 11:07 tmp

export04:
total 108
drwxr-xr-x 18 root root 4096 2011-03-06 07:19 data
drwxr-xr-x 5 root root 4096 2011-03-06 06:59 hold
drwx------ 2 root root 16384 2011-03-03 13:26 lost+found
drwxr-xr-x 7 root root 4096 2011-03-06 22:46 mapsherpa
drwxr-xr-x 3 root root 4096 2011-03-06 22:46 ofsc
drwxr-xr-x 3 root root 4096 2011-03-06 07:21 old_mapsherpa
drwxr-xr-x 10 root root 4096 2011-03-05 08:57 tiles
drwxrwxr-x 2 root data 57344 2011-03-22 11:05 tmp

If you want the complete recursive listing, that is going to take days to cat into a file, not to mention it's going to be HUGE. I do not have access to the bug you mentioned earlier, so if you want me to look at the bricks for some interesting file attributes you'll have to be more specific.

On to the other issue. Let me illustrate this other PNG issue differently; this is not the same one I described earlier, but this one is easy to reproduce. Debugging a tile being rendered on the fly and pushed to the gluster mount is going to be really hard when it's requesting 20-30 files per pan. For this issue we have:

Gluster client A
Gluster client B
Gluster master X
Gluster slave Z
gluster client mount /san

1. Client (A) starts up a process that polls every second for a file called PREVIEW.PNG to appear in /san/tmp.
2. Client (B) renders an image to /tmp/PREVIEW.png (locally) and then copies it to /san/tmp/PREVIEW.PNG.
3. Client (A) finds the image, performs an fopen, and streams it out to HTTP.

Before installing this version of gluster we always retrieved a complete image; now we get a portion of it. Gluster seems to be in the middle of saving the file, and when it's requested it only outputs what it has so far instead of the entire buffer. The older version did in fact stream out the entire file. Is this because of replication? If we ask client A to refresh the image again immediately, we get the complete image, which tells me that when the file first appears the GLUSTER server is still in the process of writing it out. The two sides of the race are sketched below.
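To make the race concrete, here is a minimal sketch of the two sides as described in the steps above. This is my reconstruction for illustration, not the actual application code; only the /tmp and /san/tmp paths come from the steps, everything else is hypothetical. Client B copies straight to the final name, so client A can see the file before the copy has finished:

import os
import shutil
import time

LOCAL = "/tmp/PREVIEW.png"        # rendered locally by client B
SHARED = "/san/tmp/PREVIEW.PNG"   # name client A polls for on the gluster mount

def render_preview(path):
    # Placeholder for the real renderer (hypothetical).
    with open(path, "wb") as f:
        f.write(b"\x89PNG" + b"\x00" * 4096)  # stand-in for real image data

# Client B: render locally, then copy to the shared volume.
def publish_preview():
    render_preview(LOCAL)
    shutil.copyfile(LOCAL, SHARED)  # the file exists under its final name
                                    # while this copy is still in progress

# Client A: poll every second, then open and stream whatever is there.
def wait_and_stream():
    while not os.path.exists(SHARED):
        time.sleep(1)
    with open(SHARED, "rb") as f:   # may open a partially copied file
        return f.read()             # -> truncated PNG streamed out to HTTP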
hi Paul,
For issue-1: please post the "ls -l" output of the following directories, like you did in the first comment. Maybe it is already fixed because of the reboot, but I just want to be sure.
-------------------------------------------------------------------------
EXPORT 1
ls /glusterfs/export01/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 urban_king1.tar

EXPORT 2
ls /glusterfs/export02/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 urban_king.tar

EXPORT 3
ls /glusterfs/export03/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 urbanstreets_missed.log

EXPORT 4
ls /glusterfs/export04/tiles/urbanstreets-v2/
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17
-------------------------------------------------------------------------

Issue-2:
The io-threads translator in glusterfs handles parallel I/O operations. It has undergone changes from 3.0.5 to 3.1.2 to improve performance in the way parallel requests are served. This might have exposed the race between clients A and B.

If I understand the steps correctly, there is a synchronization problem (race) between client A and client B: client A has no idea whether client B has completely copied the file PREVIEW.png at the time it opens it. It can be solved with the following steps (see the sketch after this list):

1. Client (A) starts up a process that polls every second for a file called PREVIEW.PNG to appear in /san/tmp.
2. Client (B) renders an image to /tmp/PREVIEW.png (locally) and then copies it to /san/tmp/PREVIEW.PNG.part.
3. Client (B) renames /san/tmp/PREVIEW.PNG.part to /san/tmp/PREVIEW.PNG after the copy is complete.
4. Client (A) finds the image, performs an fopen, and streams it out to HTTP.

Let me know your thoughts on this. Most file downloaders and internet browsers do it the same way to ensure a file is completely copied before using it.

Pranith.
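A minimal sketch of the change to client B implied by steps 2 and 3 above (my illustration of those steps, not code from this report; the paths are the ones used in the steps). Because the rename happens within the same directory on the same volume, client A only ever sees the final name once all of the data is in place:

import os
import shutil

LOCAL = "/tmp/PREVIEW.png"             # rendered locally by client B
PART = "/san/tmp/PREVIEW.PNG.part"     # temporary name, ignored by client A
FINAL = "/san/tmp/PREVIEW.PNG"         # name client A polls for

def publish_preview_atomically():
    # Copy the rendered image to a temporary name on the gluster mount.
    shutil.copyfile(LOCAL, PART)
    # Rename only after the copy has completed; the file appears under its
    # final name with its full contents, so the reader never sees a partial file.
    os.replace(PART, FINAL)

The same pattern applies if the application writes directly onto /san instead of copying: write to the .part name, close the file, then rename.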
(In reply to comment #7)

Paul, any updates?

Pranith
Sorry, I've been busy with other fires :)

We are going to remove gluster from our tiling infrastructure and use a different methodology. Although we are still going to use it for other activities, we do not expect it to have these same kinds of issues once we stop serving up the 805 million images from gluster.

This probably points to a larger issue with gluster that should be investigated. If you want further details about our infrastructure to set up a use case, let me know.

Cheers
Paul.
(In reply to comment #9)

hi Paul,
Could you let us know the infrastructure of your setup so that we can see whether it is working fine now? It would be great if we could have sample code of what the application does.

Thanks
Pranith
I am closing this bug as there have been no updates from the submitter for 2 months.