| Summary: | tcp + dht + armv5tel: "brick: disk layout has invalid count 29696" | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Hraban Luyat <bubblboy> |
| Component: | unclassified | Assignee: | Anand Avati <aavati> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | mainline | CC: | aavati, avati, chrisw, fharshav, gluster-bugs, vijay |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | --- | |
| Attachments: | |||
When applying dht over two bricks imported through tcp on an armv5tel machine the resulting brick can not be used. Logs of glusterfsd and glusterfs are attached, most significant section of glusterfs is:

[2009-12-18 02:32:40] N [client-protocol.c:6224:client_setvolume_cbk] block2: Connected to 10.6.2.1:6996, attached to remote volume 'block2-h-0brg-net'.
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block1': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block1': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block2': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:40] D [dht-diskusage.c:71:dht_du_info_cbk] brick: on subvolume 'block2': avail_percent is: 100.00 and avail_space is: 39371034624
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x58fd0--0x59002 (size 50).
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x584f0--0x58522 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:290:dht_disk_layout_merge] brick: disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-layout.c:357:dht_layout_merge] brick: layout merge from subvolume block1 failed
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x584f0--0x58522 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:290:dht_disk_layout_merge] brick: disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-layout.c:357:dht_layout_merge] brick: layout merge from subvolume block2 failed
[2009-12-18 02:32:44] D [dht-layout.c:571:dht_layout_normalize] brick: directory / looked up first time
[2009-12-18 02:32:44] D [dht-common.c:164:dht_lookup_dir_cbk] brick: fixing assignment on /
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x58fd0--0x59002 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:658:dht_layout_dir_mismatch] brick: / - disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-common.c:274:dht_revalidate_cbk] brick: mismatching layouts for /
[2009-12-18 02:32:44] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x58fd0--0x59002 (size 50).
[2009-12-18 02:32:44] D [dht-layout.c:658:dht_layout_dir_mismatch] brick: / - disk layout has invalid count 29696
[2009-12-18 02:32:44] D [dht-common.c:274:dht_revalidate_cbk] brick: mismatching layouts for /
[2009-12-18 02:32:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 2: LOOKUP() / => -1 (Stale NFS file handle)

02:34:44 was when I tried to do ls /mnt/test.

I am willing to do some debugging on this, but I have no idea where to start looking. The big problem I have here is that dht works fine when joining local storage bricks, and mounting a brick imported over tcp works fine as well. It is the combination that fails. Hence, too, the lack of classification of this bug. If somebody could point me to the relevant part of the source code, that would be great.

Thanks and greetings,
Hraban Luyat

Created attachment 124: patch for gtkrc.ru

Hello,

So, I did some debugging on my own and found out what the error was. As with bug #762129, the problem is with pointer casting. Unlike bug #762129, this one is probably not that quickly fixed.
In dict.c there is a function dict_get_ptr which, if I understand it correctly (there is no documentation), extracts the value corresponding to a given key from a dictionary. It returns a pointer to that area by storing the address in a pointer that is passed by reference, i.e. a (void **) argument. This means two things:

- The value you get out of this function is in network order, so every time you want to use it you have to convert it to host order.
- The pointer to that value is a (void *) pointer, so you have to extract the value to a meaningful pointer first.

The first point is outside the scope of this discussion. The second point, however, is not. The address stored in that pointer cannot be dereferenced directly as a typed value; the compiler will generate a warning about that (with good reason). Casting the pointer only hides the error, it does not fix it (re: bug #762129). The value has to be copied to a usable memory area first, with memcpy.

Right now, the (void *) pointer is cast to an (int32_t *). This means that a pointer of the latter type is made to point to a memory area that was previously pointed to by a (void *). However, the two are incompatible: a (void *) may point to any address, while an (int32_t *) is (apparently) always assumed to be word-aligned (at least on armv5tel).

I fixed this, for now, in dht, by copying the data to properly allocated memory as soon as possible (i.e. just after a call to dict_get_ptr). The data is left in network order, as it used to be.

In summary: pointer casting is a no-no. Patch attached.

Greetings,
Hraban Luyat

Hraban,

Apologies for the delayed response. This patch has some unsafe memcpy() calls which do not check for the presence of the source pointer. Can you check that the dict_get_ptr() calls are successful before performing memcpy()?

Thanks,
Avati

PATCH: http://patches.gluster.com/patch/2629 in master (Fix memory access in afr's self-heal code (replace pointer casts by memcpy).)
PATCH: http://patches.gluster.com/patch/2738 in master (dht: Remove pointer casting in layout handling)
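
For readers who want to see the pattern discussed in this report, here is a minimal sketch in C of reading a 32-bit network-order value from a buffer that may not be word-aligned. It copies the bytes with memcpy instead of casting the (void *) to an (int32_t *), and it checks the source pointer first, as requested in the review. The names `read_be32` and `example_count` are hypothetical and chosen for illustration only; this is not the code from the patches listed above.

```c
#include <arpa/inet.h>  /* ntohl */
#include <stddef.h>     /* NULL */
#include <stdint.h>     /* uint32_t */
#include <string.h>     /* memcpy */

/*
 * Hypothetical helper (not a GlusterFS function): read a 32-bit value
 * stored in network byte order at `src`, which may be unaligned.
 * Returns 0 on success, -1 if either pointer is NULL.
 */
static int read_be32(const void *src, uint32_t *out)
{
    uint32_t tmp;

    /* Check the source pointer before copying from it. */
    if (src == NULL || out == NULL)
        return -1;

    /* Copy byte-wise into an aligned local instead of dereferencing
     * a cast pointer such as *(const uint32_t *)src. */
    memcpy(&tmp, src, sizeof(tmp));

    /* Convert from network order to host order. */
    *out = ntohl(tmp);
    return 0;
}

/*
 * Illustration: the value 1 is stored big-endian at an odd offset,
 * so a direct (uint32_t *) access would be misaligned.
 */
static uint32_t example_count(void)
{
    const unsigned char buf[] = { 0xff, 0x00, 0x00, 0x00, 0x01 };
    uint32_t count = 0;

    if (read_be32(buf + 1, &count) != 0)
        return 0;
    return count; /* 1 */
}
```

The copy costs a few bytes of stack and is typically optimized well by the compiler, while the cast version relies on the hardware tolerating unaligned loads, which is not guaranteed on every architecture.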
Created attachment 123: test case

Nothing interesting here, as far as I can see. Posting it for completeness. Also, it seems that this output is from 3.0.0 code; I tried it with the latest git build and that failed as well.