Bug 762225 (GLUSTER-493) - tcp + dht + armv5tel: "brick: disk layout has invalid count 29696"
Summary: tcp + dht + armv5tel: "brick: disk layout has invalid count 29696"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-493
Product: GlusterFS
Classification: Community
Component: unclassified
Version: mainline
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Anand Avati
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-12-19 17:27 UTC by Hraban Luyat
Modified: 2015-12-01 16:45 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
Debug output of the glusterfs client trying to dht two bricks imported over TCP. (8.11 KB, text/plain)
2009-12-19 14:27 UTC, Hraban Luyat
Debug output of the glusterfsd server serving two bricks imported over TCP. (6.54 KB, text/plain)
2009-12-19 14:31 UTC, Hraban Luyat
[PATCH BUG:493] Remove pointer casting (copy data to proper memory before using it). Fixes bug #493. (1.56 KB, patch)
2009-12-20 05:30 UTC, Hraban Luyat

Description Hraban Luyat 2009-12-19 14:31:08 UTC
Created attachment 123 [details]
Debug output of the glusterfsd server serving two bricks imported over TCP.

Nothing interesting here, as far as I can see. Posting it for completeness. Also, this output seems to be from 3.0.0 code; I tried the latest git build as well and that failed too.

Comment 1 Hraban Luyat 2009-12-19 17:27:44 UTC
When applying dht over two bricks imported through tcp on an armv5tel machine, the resulting brick cannot be used. Logs of glusterfsd and glusterfs are attached; the most significant section of the glusterfs log is:

[2009-12-18 02:32:40] N [client-protocol.c:6224:client_setvolume_cbk] block2: Connected to 10.6.2.1:6996, attached to remote volume 'block2-h-0brg-net'.
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block1': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block1': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block2': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block2': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x58fd0--0x59002 (size 50).
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x584f0--0x58522 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:290:dht_disk_layout_merge] brick: disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-layout.c:357:dht_layout_merge] brick: layout merge from subvolume block1 failed
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x584f0--0x58522 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:290:dht_disk_layout_merge] brick: disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-layout.c:357:dht_layout_merge] brick: layout merge from subvolume block2 failed
[2009-12-18 02:32:44] D [dht-layout.c:571:dht_layout_normalize] brick: directory / looked up first time
[2009-12-18 02:32:44] D [dht-common.c:164:dht_lookup_dir_cbk] brick: fixing assignment on /
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x58fd0--0x59002 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:658:dht_layout_dir_mismatch] brick: / - disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-common.c:274:dht_revalidate_cbk] brick: mismatching layouts for /
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x58fd0--0x59002 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:658:dht_layout_dir_mismatch] brick: / - disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-common.c:274:dht_revalidate_cbk] brick: mismatching layouts for /
[2009-12-18 02:32:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 2: LOOKUP() / => -1 (Stale NFS file handle)

02:32:44 was when I tried to do ls /mnt/test.

I am willing to do some debugging on this but I have no idea where to start looking. The big problem I have here is that dht works fine when joining local storage bricks, and mounting a brick imported over tcp works fine as well. It is the combination that fails. Hence, too, the lack of classification of this bug.

If somebody could point me in the right direction of the source code, that would be great.

Thanks and greetings,

Hraban Luyat

Comment 2 Hraban Luyat 2009-12-20 05:30:27 UTC
Created attachment 124 [details]
[PATCH BUG:493] Remove pointer casting (copy data to proper memory before using it). Fixes bug #493.

Hello,

So, I did some debugging on my own and found out what the error was. Like with bug #762129, the problem is with pointer casting. Unlike bug #762129, this one is probably not that quickly fixed.

In dict.c there is a function dict_get_ptr, which, if I understand it correctly (there is no documentation), extracts the value corresponding to a given key from a dictionary. It returns a pointer to that value by storing its address in a pointer that is passed by reference, i.e. a (void **) argument.

This means two things:
- The value you get out of this function is in network order, so every time you want to use it you have to convert it to host order.
- The pointer to that value is a (void *) pointer, so you have to extract the value to a meaningful pointer first.

The first point is outside the scope of this discussion. The second point, however, is not. The address stored in that pointer cannot be dereferenced directly as a typed value; the compiler will generate a warning about that (with good reason). Casting the pointer only hides the error; it does not fix it (re: bug #762129). The value has to be copied to a usable memory area first, with memcpy.

Right now, the (void *) pointer is cast to an (int32_t *). This means that you make a pointer of the latter type point to a memory area that was previously pointed to by a (void *). However, (void *) pointers are incompatible with (int32_t *) ones: the latter are (apparently) always expected to be word-aligned (at least on armv5tel), while the former can hold any address. Dereferencing the cast pointer can therefore read from a misaligned address.

I fixed this, for now, in dht, by copying the data to properly allocated memory as soon as possible (i.e. just after a call to dict_get_ptr). The data is left in network order, as it used to be.
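To illustrate the idea (this is not the attached patch), here is a minimal sketch of reading a 32-bit network-order field from a buffer that is not word-aligned. The helper names read_be32 and demo_unaligned_read are hypothetical and do not exist in GlusterFS; the sketch also converts to host order, which the patch described above does not do at this point.

```c
#include <arpa/inet.h>  /* ntohl, htonl */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a GlusterFS function): read a 32-bit
 * network-order value from an arbitrary, possibly unaligned, address
 * without casting the pointer. */
static uint32_t read_be32(const void *src)
{
    uint32_t tmp;

    memcpy(&tmp, src, sizeof(tmp));  /* byte-wise copy, alignment-safe */
    return ntohl(tmp);               /* network order -> host order */
}

/* Hypothetical demo: place the value 1 at offset 1 of a buffer, so the
 * field is deliberately misaligned, then read it back safely. */
static uint32_t demo_unaligned_read(void)
{
    unsigned char buf[8] = {0};
    uint32_t count_net = htonl(1);

    memcpy(buf + 1, &count_net, sizeof(count_net));

    /* Unsafe variant, shown only for contrast:
     *     *(int32_t *)(buf + 1)
     * On CPUs that require aligned loads this may return garbage. */
    return read_be32(buf + 1);
}
```

Calling demo_unaligned_read() yields 1 regardless of the alignment of the source address, because memcpy makes no alignment assumptions about its arguments.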

In summary: pointer casting is a no-no.

Patch attached.

Greetings,

Hraban Luyat

Comment 3 Anand Avati 2010-01-23 04:56:59 UTC
Hraban,
  Apologies for the delayed response. This patch has some unsafe memcpy() calls which do not check for the presence of the source pointer.  Can you check if dict_get_ptr() calls are successful before performing memcpy()?
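
A rough sketch of the kind of check being asked for, under stated assumptions: lookup_ptr below is a hypothetical stand-in for a dictionary lookup that returns 0 on success and a negative value on failure. It is not the real dict_get_ptr(), and get_count is likewise an invented name.

```c
#include <arpa/inet.h>  /* ntohl */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a dictionary lookup: returns 0 and sets *out
 * on success, returns -1 and leaves *out untouched when nothing is stored. */
static int lookup_ptr(const void *stored, void **out)
{
    if (stored == NULL)
        return -1;
    *out = (void *)stored;
    return 0;
}

/* Copy a 32-bit network-order count out of the looked-up value.
 * Returns 0 on success, -1 if the lookup failed or yielded no data. */
static int get_count(const void *stored, uint32_t *count)
{
    void *raw = NULL;
    uint32_t tmp;

    /* Never memcpy from a source pointer that was not actually set. */
    if (lookup_ptr(stored, &raw) != 0 || raw == NULL)
        return -1;

    memcpy(&tmp, raw, sizeof(tmp));  /* alignment-safe copy */
    *count = ntohl(tmp);
    return 0;
}
```

The caller can then treat a -1 return as "layout not present" instead of copying from an uninitialised or NULL source.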

Thanks,
Avati

Comment 4 Anand Avati 2010-02-07 08:43:26 UTC
PATCH: http://patches.gluster.com/patch/2629 in master (Fix memory access in afr's self-heal code (replace pointer casts by memcpy).)

Comment 5 Anand Avati 2010-02-07 08:43:30 UTC
PATCH: http://patches.gluster.com/patch/2738 in master (dht: Remove pointer casting in layout handling)

