Created attachment 110 [details] /sbin/glusterfs debug output
2.0.4 and 2.0.8 exhibit the same behaviour when installed on an armv5tel machine: the transport modules cannot be used. Local rules work fine. This apparently has something to do with so-called "dictionaries", but I do not know more about it than the error message tells me. The exact same configuration works fine on an i686 machine.

The significant part of the error message on the client:

[2009-11-23 00:20:04] D [client-protocol.c:5690:client_setvolume_cbk] client: failed to get 'process-uuid' from reply dictionary
[2009-11-23 00:20:04] D [client-protocol.c:5696:client_setvolume_cbk] client: SETVOLUME on remote-host failed: Internal error: failed to unserialize request dictionary

and on the server:

[2009-11-23 00:20:04] E [dict.c:2372:dict_unserialize] dict: undersized buffer passsed
[2009-11-23 00:20:04] N [server-protocol.c:7825:notify] server: 127.0.0.1:1023 disconnected
[2009-11-23 00:20:04] D [server-protocol.c:7833:notify] server: POLLERR received on (127.0.0.1:6996) even before handshake with (127.0.0.1:1023) is successful
[2009-11-23 00:20:04] D [server-protocol.c:7852:notify] server: transport (127.0.0.1:6996) cleaned up even before handshake with (127.0.0.1:1023) is successful

Context: we are trying to create a network of Qnap TS-110 nodes that we want to connect to each other with glusterfs (over VPN). /sbin/glusterfs output is attached; /sbin/glusterfsd output will follow later.

On a side note: when I try to install glusterfs as a normal user, make install fails: it tries to install /sbin/mount.glusterfs and /etc/init.d/glusterfs-server without the prefix. I had to manually edit those two Makefiles.
Minor update: I just tried it with some more versions. Here is the summary:

1.3.12: works
2.0.4: does not work
2.0.8: does not work
2.0.9rc3: does not work
3.0.0pre1: does not work

All fail with the same error (about the undersized dictionary), except 1.3.12. I guess we will be using that version for now.
Raghavendra G (one of the developers, I assume ;) ) was nice enough to contact me and help me through some of the debug steps. I have to check some information and run some tests now, but time is currently lacking, unfortunately. In the meantime, here is the conversation so far, for archival purposes and for anybody who may run into the same problem. Thanks for the help so far, I will check in soon with answers to your questions.

Hraban Luyat

Raghavendra G wrote (2009-12-01):
> Thanks for the info :).
>
> The length of the dictionary is not corrupted. The bug can be that the
> length of one of the keys or values may be corrupted or improper, and
> thereby the total length as calculated from adding up each individual
> key/value pair of the dictionary may be greater than the size of the
> buffer (dict_len) being passed. It would be difficult to debug without
> tracing through code. Basically there is this for loop in dict.c which
> extracts key/value pairs from the buffer and constructs a new dictionary.
> This loop is in dict_unserialize. I've pasted the loop in the mail. See
> if you can find out anything suspicious.
>
> file: libglusterfs/src/dict.c
> procedure: dict_unserialize
>
>         for (i = 0; i < count; i++) {
>                 if ((buf + DICT_DATA_HDR_KEY_LEN) > (orig_buf + size)) {
>                         gf_log ("dict", GF_LOG_ERROR,
>                                 "undersized buffer passsed");
>                         goto out;
>                 }
>                 keylen = ntoh32 (*(int32_t *) buf);
>                 buf += DICT_DATA_HDR_KEY_LEN;
>
>                 if ((buf + DICT_DATA_HDR_VAL_LEN) > (orig_buf + size)) {
>                         gf_log ("dict", GF_LOG_ERROR,
>                                 "undersized buffer passsed");
>                         goto out;
>                 }
>                 vallen = ntoh32 (*(int32_t *) buf);
>                 buf += DICT_DATA_HDR_VAL_LEN;
>
>                 if ((buf + keylen) > (orig_buf + size)) {
>                         gf_log ("dict", GF_LOG_ERROR,
>                                 "undersized buffer passsed");
>                         goto out;
>                 }
>                 key = buf;
>                 buf += keylen + 1;  /* for '\0' */
>
>                 if ((buf + vallen) > (orig_buf + size)) {
>                         gf_log ("dict", GF_LOG_ERROR,
>                                 "undersized buffer passsed");
>                         goto out;
>                 }
>                 value = get_new_data ();
>                 value->len = vallen;
>                 value->data = buf;
>                 value->is_static = 1;
>                 buf += vallen;
>
>                 dict_set (*fill, key, value);
>         }
>
> Please check what the value of count is. Also try to find out which
> key/value pair's processing results in the error msg "undersize buffer
> passed".
>
> regards,
>
> On Tue, Dec 1, 2009 at 7:21 PM, Hraban Luyat <bubblboy> wrote:
>
>> Raghavendra G wrote:
>>
>>> Hruban,
>>> comments are inlined.
>>>
>>> On Tue, Dec 1, 2009 at 4:01 AM, Hraban Luyat <bubblboy> wrote:
>>>
>>>> Raghavendra G wrote:
>>>>
>>>>> Hi Hruban,
>>>>>
>>>>> This mail is regarding the bug you've filed at
>>>>> http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=397
>>>>>
>>>>> Since we don't have the hardware in our labs it's difficult for us
>>>>> to debug the issue. Is it possible for you to give us remote access
>>>>> to your systems?
>>>>>
>>>>> regards,
>>>>>
>>>> Hi Raghavendra,
>>>> Unfortunately, this is not possible. I am, however, willing to do a
>>>> fair share of debugging, depending on the instructions and clarity of
>>>> the code in question.
>>>
>>> Thanks for the support. Here are the instructions. Through gdb we need
>>> to check whether the correct size is passed from client to server.
>>> * build glusterfs with CFLAGS='-g3 -O0'. You can export CFLAGS just
>>> before running the configure script.
>>
>> With which version of glusterfs?
>> I did the following with glusterfs-3.0.0pre1.
>>
>>> * attach gdb to the glusterfs client - gdb -p <glusterfs client process id>
>>> * set a breakpoint at protocol_client_handshake
>>>   gdb> b protocol_client_handshake
>>> * get the value of the variable dict_len after it is assigned the
>>> return value of dict_serialized_length.
>>
>> Breakpoint 3, protocol_client_handshake (this=0x202e0, trans=0x58008) at client-protocol.c:6191
>> 6191            if (dict_len < 0) {
>> (gdb) print dict_len
>> $3 = 260
>> (gdb)
>>
>>> * also check the value of req->dict_len, which is nothing but
>>> hton32 (dict_len).
>>
>> Breakpoint 4, protocol_client_handshake (this=0x202e0, trans=0x58008) at client-protocol.c:6221
>> 6221            return ret;
>> (gdb) print req->dict_len
>> $5 = 67174400
>>
>>> What are the values of dict_len and req->dict_len?
>>>
>>> Debugging the server:
>>> ================
>>> * attach gdb to the glusterfs server - gdb -p <glusterfs server process id>
>>> * set a breakpoint at mop_setvolume.
>>>   gdb> b mop_setvolume
>>> * get the value of req->dictlen after req is assigned the return value
>>> of gf_param (req_hdr)
>>> * also get the value of req_dictlen, which is nothing but
>>> ntoh32 (req->dict_len).
>>>
>>> What are the values of req_dictlen and req->dict_len?
>>
>> Breakpoint 1, mop_setvolume (frame=0x23340, bound_xl=0x0, req_hdr=0x231e8, req_hdrlen=296, iobuf=0x0)
>>     at server-protocol.c:5376
>> 5376            ret = dict_unserialize (req->buf, req_dictlen, &params);
>> (gdb) print req->dictlen
>> There is no member named dictlen.
>> (gdb) print req->dict_len
>> $1 = 67174400
>> (gdb) print req_dictlen
>> $2 = 260
>>
>>> Do the values from the client match those from the server?
>>
>> They seem inverted (a hton32 too many, somewhere?).
>>
>>> Also, what is the sizeof char on your architecture? You can check that
>>> with the following C program:
>>>
>>> #include <stdio.h>
>>>
>>> int main (void)
>>> {
>>>         printf ("sizeof (char): %d", sizeof (char));
>>>         return 0;
>>> }
>>
>> This gives 1, as expected.
>>
>>>> I can not make any guarantees about this.
>>>> We will start deploying the system with the latest stable version
>>>> that does work. In the meantime, let us work on this bug :)
>>>>
>>>> Greetings,
>>>>
>>>> Hraban Luyat
>>>
>>> regards,
>>
>> Hope it helps. Let me know what more you need.
>>
>> Greetings,
>>
>> Hraban
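P.S. Looking at these numbers again: they are actually consistent with each other rather than corrupted. 260 is 0x00000104, and hton32 on a little-endian machine simply reverses the bytes, giving 0x04010000 = 67174400; ntoh32 on the server turns it back into 260. A throwaway check (my own program, not glusterfs code):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl, the libc equivalent of hton32 */

int main (void)
{
        uint32_t dict_len = 260;              /* client-side value from gdb */
        uint32_t on_wire  = htonl (dict_len);

        /* On a little-endian host this prints 67174400 (0x04010000),
         * the exact req->dict_len value seen on client and server. */
        printf ("dict_len         = %u (0x%08x)\n", dict_len, dict_len);
        printf ("htonl (dict_len) = %u (0x%08x)\n", on_wire, on_wire);
        return 0;
}

So both sides agree that the dictionary is 260 bytes long; whatever is wrong must be inside the serialized buffer itself, not in its length field.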
Hello,

Alrighty, I finally have the time to do some debugging. The error was actually pretty much to be expected, given the keylen that was so far off, as appeared from the first e-mail gdb report. The error comes from this part (I added debugging prints; this is 3.0.0 code):

2429                 if ((buf + keylen) > (orig_buf + size)) {
2430                         gf_log ("dict", GF_LOG_DEBUG,
2431                                 "No room for key (size %d, buffer at %p).",
2432                                 keylen, (void *) buf);
2433                         gf_log ("dict", GF_LOG_ERROR,
2434                                 "undersized buffer passsed");
2435                         goto out;
2436                 }
2437                 key = buf;

That gives:

[2009-12-16 04:26:48] D [dict.c:2386:dict_unserialize] dict: Unserializing buffer 0x232e0--0x233ff (size 287).
[2009-12-16 05:32:47] D [dict.c:2432:dict_unserialize] dict: No room for key (size 268435456, buffer at 0x2332b).
[2009-12-16 05:32:47] E [dict.c:2434:dict_unserialize] dict: undersized buffer passsed

Needless to say, that size is wayyyyy off. For the curious: 268435456 = 2^28 = 0x10000000 (seven zeroes).

Here are the raw buffer contents of a different run. The memory addresses do not match, but I am going to assume that the buffer contents do (at least the headers).

(gdb) x/287xb orig_buf
0x232e0: 0x00 0x00 0x00 0x08 0x00 0x00 0x00 0x0c
0x232e8: 0x00 0x00 0x00 0x2a 0x70 0x72 0x6f 0x63
0x232f0: 0x65 0x73 0x73 0x2d 0x75 0x75 0x69 0x64
0x232f8: 0x00 0x68 0x2d 0x33 0x31 0x34 0x36 0x31
0x23300: 0x2d 0x32 0x30 0x30 0x39 0x2f 0x31 0x32
0x23308: 0x2f 0x31 0x36 0x2d 0x30 0x34 0x3a 0x32
0x23310: 0x36 0x3a 0x34 0x38 0x3a 0x34 0x39 0x35
0x23318: 0x30 0x36 0x37 0x2d 0x63 0x6c 0x69 0x65
0x23320: 0x00 0x00 0x00 0x10 0x00 0x00 0x00 0x04
0x23328: 0x00 0x00 0x00 0x70 0x72 0x6f 0x74 0x6f
0x23330: 0x63 0x6f 0x6c 0x2d 0x76 0x65 0x72 0x73
0x23338: 0x69 0x6f 0x6e 0x00 0x33 0x2e 0x30 0x00
0x23340: 0x00 0x00 0x00 0x17 0x00 0x00 0x00 0x05
0x23348: 0x74 0x72 0x61 0x6e 0x73 0x70 0x6f 0x72
0x23350: 0x74 0x2e 0x73 0x6f 0x63 0x6b 0x65 0x74
0x23358: 0x2e 0x6c 0x6f 0x77 0x6c 0x61 0x74 0x00
0x23360: 0x74 0x72 0x75 0x65 0x00 0x00 0x00 0x18
0x23368: 0x00 0x00 0x00 0x05 0x00 0x74 0x72 0x61
0x23370: 0x6e 0x73 0x70 0x6f 0x72 0x74 0x2e 0x61
0x23378: 0x64 0x64 0x72 0x65 0x73 0x73 0x2d 0x66
0x23380: 0x61 0x6d 0x69 0x6c 0x79 0x00 0x69 0x6e
0x23388: 0x00 0x00 0x00 0x10 0x00 0x00 0x00 0x06
0x23390: 0x00 0x00 0x00 0x72 0x65 0x6d 0x6f 0x74
0x23398: 0x65 0x2d 0x73 0x75 0x62 0x76 0x6f 0x6c
0x233a0: 0x75 0x6d 0x65 0x00 0x62 0x72 0x69 0x63
0x233a8: 0x00 0x00 0x00 0x0b 0x00 0x00 0x00 0x05
0x233b0: 0x00 0x00 0x72 0x65 0x6d 0x6f 0x74 0x65
0x233b8: 0x2d 0x70 0x6f 0x72 0x74 0x00 0x36 0x39
0x233c0: 0x00 0x00 0x00 0x0b 0x00 0x00 0x00 0x0a
0x233c8: 0x00 0x00 0x00 0x72 0x65 0x6d 0x6f 0x74
0x233d0: 0x65 0x2d 0x68 0x6f 0x73 0x74 0x00 0x31
0x233d8: 0x32 0x37 0x2e 0x30 0x2e 0x30 0x2e 0x31
0x233e0: 0x00 0x00 0x00 0x0e 0x00 0x00 0x00 0x07
0x233e8: 0x00 0x74 0x72 0x61 0x6e 0x73 0x70 0x6f
0x233f0: 0x72 0x74 0x2d 0x74 0x79 0x70 0x65 0x00
0x233f8: 0x73 0x6f 0x63 0x6b 0x65 0x74 0x00

GDB tells me that the invalid key size shows up during the second iteration of the loop (i = 1). At the time of the error, the address difference between buf and orig_buf is: 0x23323 - 0x232d8 = 75. I checked this against the code and it makes sense (4 + 4 + 4 + first keylen (12) + 1 + first vallen (42) + 4 + 4 = 75).
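For anyone following along with the dump, here is a throwaway decoder for the first record (my own sketch, not glusterfs code), using the layout from the dict_unserialize loop quoted earlier: a 4-byte pair count, then per pair a 4-byte keylen, a 4-byte vallen, the key plus '\0', and the value:

#include <stdio.h>
#include <stdint.h>

/* First 25 bytes of the dumped buffer: count, keylen, vallen, key. */
static const unsigned char buf[] = {
        0x00, 0x00, 0x00, 0x08,   /* count  = 8 pairs */
        0x00, 0x00, 0x00, 0x0c,   /* keylen = 12      */
        0x00, 0x00, 0x00, 0x2a,   /* vallen = 42      */
        'p','r','o','c','e','s','s','-','u','u','i','d', 0x00
};

/* Read a big-endian (network order) 32-bit integer byte by byte,
 * independent of host endianness and alignment. */
static uint32_t be32 (const unsigned char *p)
{
        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
             | ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}

int main (void)
{
        uint32_t count  = be32 (buf);
        uint32_t keylen = be32 (buf + 4);
        uint32_t vallen = be32 (buf + 8);

        printf ("count=%u keylen=%u vallen=%u key=%s\n",
                count, keylen, vallen, (const char *) buf + 12);
        /* The next header should start at 12 + keylen + 1 + vallen = 67,
         * which is exactly where the dump stops making sense. */
        return 0;
}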
Then, gdb shows us that ntoh32 is aliased to __swap32 in byte-order.h:

2409                 keylen = ntoh32 (*(int32_t *) buf);
(gdb)
__swap32 (x=201326592) at byte-order.h:55
55      {
(gdb)
57      }
(gdb)

The fact that ntoh32 is aliased to __swap32, as opposed to __noswap32, suggests that the processor is in little-endian mode (network order is big-endian). According to Wikipedia (my only, paraphrased, source on this), "All ARM processors are bi-endian, meaning they can switch endianness at runtime." And indeed, count = ntoh32( 0x00 0x00 0x00 0x08 ) = 8:

2394                 count = ntoh32 (*(int32_t *) buf);
Value returned is $2 = 8

So the swap turns the big-endian wire bytes into the correct count of 8: the machine is indeed in little-endian mode.

Where did this fubar keylen come from? It is read when buf has an offset of 4 + 4 + 4 + 12 + 1 + 42 = 67, where the bytes are: 0x10 0x00 0x00 0x00. Interpreted in network order, this comes out as 2^28. The first four bytes read in the second iteration indicate the size of the key; the key itself starts eight bytes (two header integers) further on and is followed by a null-byte. There are three explanations:

- The three 0x00 bytes /before/ the 0x10 indicate a possible misalignment: perhaps the second payload header starts at 64 and not at 67. That would mean that the key is 16 bytes long and starts at 64 + 8 = 72. However, there are three 0x00 bytes there as well, which makes little sense for key data. In addition, this bug would be unrelated to endianness and would manifest itself on any architecture. The fact that this bug has been around since v1.4 leads me to believe that this is rather unlikely.

- The three 0x00 bytes /after/ the 0x10 indicate another possible misalignment: perhaps the second payload header starts at 68. That would give a key of size 0x00000004 and a value of size 0x00000070, and the key itself would start at 76. However, the key is always followed by a null-byte, yet the byte at 76 + 4 = 80 is 0x63, not 0x00. In addition, the same objection as above holds: this bug would manifest itself on any architecture. So this option, too, can be considered quite improbable.

- The key length of the second payload is sent in the wrong endianness (host/little instead of network/big) by the client. This gives an implied key size of 0x00000010, starting at 67 + 8, and probably also a value length of 0x00000004. Indeed, the byte at 67 + 8 + 0x10 = 91 is 0x00! And 4 bytes (the value) after that, a new byte sequence starts that looks a lot like a header: two (big-endian) integers with sensible sizes (follow the key and it gives a 0x00 again at the correct place). This means that only the second payload header was sent in host order instead of network order. Because host and network order are the same on a big-endian machine, this bug would go unnoticed there (assuming that the fault is with the client, which sends host-order data that then happens to have the correct endianness by "accident"). It does not explain why the client sends that second payload header in host order, though.

Quite frankly, I have had enough debugging for today; I will leave the rest to you guys.

By the way: the fourth payload header also seems misaligned: the third value is supposedly five bytes, but the fourth header (keylen) already starts four bytes after the third body (0x23364). The fourth key body then starts with a 0x00 byte and is one byte short of being followed by a 0x00 byte.
If that leading 0x00 byte is placed in front of the header, the fourth key length makes sense (otherwise it is 0x00001800, which is a pretty big key). Also interestingly, the fifth key length seems to be in little endian again...

In summary: the payload is a mess. Halp!1!!1!!!1! If this seriously affects all little-endian machines, then this is a pretty serious bug...

Greetings,

Hraban Luyat
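P.S. To make the third explanation concrete, here is a throwaway snippet (mine, not glusterfs code) that reads those four suspicious header bytes both ways:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
        /* The four bytes at offset 67 in the dump above. */
        const unsigned char hdr[4] = { 0x10, 0x00, 0x00, 0x00 };

        /* Network (big-endian) interpretation, which is what
         * dict_unserialize does: */
        uint32_t as_be = ((uint32_t) hdr[0] << 24) | ((uint32_t) hdr[1] << 16)
                       | ((uint32_t) hdr[2] << 8)  |  (uint32_t) hdr[3];

        /* Host (little-endian) interpretation: */
        uint32_t as_le = ((uint32_t) hdr[3] << 24) | ((uint32_t) hdr[2] << 16)
                       | ((uint32_t) hdr[1] << 8)  |  (uint32_t) hdr[0];

        printf ("as big-endian:    %u\n", as_be);   /* 268435456 = 2^28 */
        printf ("as little-endian: %u\n", as_le);   /* 16 = 0x10        */
        return 0;
}

The wire format demands the first reading; only the second one makes sense for a 16-byte key like "protocol-version".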
Created attachment 121 [details] patch: more robust dictionary (un)serialization

Also adds some debug logging I used, feel free to remove.
Hello,

After debugging the client side I have come to the following conclusions:

- My little/big endian theory was wrong.
- The "the client is at fault" theory was correct.
- Pointer casting is very naughty.

This pointer cast, and two others like it, in _dict_serialize (also in dict.c):

        *(int32_t *) buf = hton32 (keylen);

were changed to this:

        int32_t netword;
        netword = hton32 (keylen);
        memcpy (buf, &netword, sizeof (netword));

The incoming buffer server-side now looks much more sane:

(gdb) x/287xb orig_buf
0x232e0: 0x00 0x00 0x00 0x08 0x00 0x00 0x00 0x0c
0x232e8: 0x00 0x00 0x00 0x2a 0x70 0x72 0x6f 0x63
0x232f0: 0x65 0x73 0x73 0x2d 0x75 0x75 0x69 0x64
0x232f8: 0x00 0x68 0x2d 0x31 0x30 0x31 0x30 0x30
0x23300: 0x2d 0x32 0x30 0x30 0x39 0x2f 0x31 0x32
0x23308: 0x2f 0x31 0x36 0x2d 0x31 0x33 0x3a 0x35
0x23310: 0x30 0x3a 0x34 0x36 0x3a 0x35 0x33 0x37
0x23318: 0x37 0x36 0x38 0x2d 0x63 0x6c 0x69 0x65
0x23320: 0x6e 0x74 0x00 0x00 0x00 0x00 0x10 0x00
0x23328: 0x00 0x00 0x04 0x70 0x72 0x6f 0x74 0x6f
0x23330: 0x63 0x6f 0x6c 0x2d 0x76 0x65 0x72 0x73
0x23338: 0x69 0x6f 0x6e 0x00 0x33 0x2e 0x30 0x00
0x23340: 0x00 0x00 0x00 0x17 0x00 0x00 0x00 0x05
0x23348: 0x74 0x72 0x61 0x6e 0x73 0x70 0x6f 0x72
0x23350: 0x74 0x2e 0x73 0x6f 0x63 0x6b 0x65 0x74
0x23358: 0x2e 0x6c 0x6f 0x77 0x6c 0x61 0x74 0x00
0x23360: 0x74 0x72 0x75 0x65 0x00 0x00 0x00 0x00
0x23368: 0x18 0x00 0x00 0x00 0x05 0x74 0x72 0x61
0x23370: 0x6e 0x73 0x70 0x6f 0x72 0x74 0x2e 0x61
0x23378: 0x64 0x64 0x72 0x65 0x73 0x73 0x2d 0x66
0x23380: 0x61 0x6d 0x69 0x6c 0x79 0x00 0x69 0x6e
0x23388: 0x65 0x74 0x00 0x00 0x00 0x00 0x10 0x00
0x23390: 0x00 0x00 0x06 0x72 0x65 0x6d 0x6f 0x74
0x23398: 0x65 0x2d 0x73 0x75 0x62 0x76 0x6f 0x6c
0x233a0: 0x75 0x6d 0x65 0x00 0x62 0x72 0x69 0x63
0x233a8: 0x6b 0x00 0x00 0x00 0x00 0x0b 0x00 0x00
0x233b0: 0x00 0x05 0x72 0x65 0x6d 0x6f 0x74 0x65
0x233b8: 0x2d 0x70 0x6f 0x72 0x74 0x00 0x36 0x39
0x233c0: 0x39 0x35 0x00 0x00 0x00 0x00 0x0b 0x00
0x233c8: 0x00 0x00 0x0a 0x72 0x65 0x6d 0x6f 0x74
0x233d0: 0x65 0x2d 0x68 0x6f 0x73 0x74 0x00 0x31
0x233d8: 0x32 0x37 0x2e 0x30 0x2e 0x30 0x2e 0x31
0x233e0: 0x00 0x00 0x00 0x00 0x0e 0x00 0x00 0x00
0x233e8: 0x07 0x74 0x72 0x61 0x6e 0x73 0x70 0x6f
0x233f0: 0x72 0x74 0x2d 0x74 0x79 0x70 0x65 0x00
0x233f8: 0x73 0x6f 0x63 0x6b 0x65 0x74 0x00

The culprit was the pointer cast, which apparently forces word alignment: the 32-bit store through a misaligned pointer ends up at a word-aligned address. memcpy() does not suffer from this caveat. The dict_unserialize routine uses the same bad-style pointer casts, causing similar issues on that side. I applied a similar fix there (using memcpy()), and lo and behold:

[2009-12-16 14:20:40] D [dict.c:2391:dict_unserialize] dict: Unserializing buffer 0x5a240--0x5a2af (size 111).
[2009-12-16 14:20:40] N [client-protocol.c:6224:client_setvolume_cbk] client: Connected to 127.0.0.1:6995, attached to remote volume 'brick'.

Patch attached.

Cheers,

Hraban Luyat
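P.S. For anyone hitting the same problem in other code, the general shape of the fix looks like this (a sketch with helper names of my own invention, not the literal patch):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl/ntohl; the patch itself uses glusterfs's
                          * hton32/ntoh32, which behave the same way */

/* Store a 32-bit value in network order at an arbitrarily aligned
 * buffer position. The aligned temporary plus memcpy lets the compiler
 * emit byte-safe accesses, no matter where buf points. */
static void write_net32 (char *buf, uint32_t host_val)
{
        uint32_t netword = htonl (host_val);
        memcpy (buf, &netword, sizeof (netword));
}

/* Load a 32-bit network-order value from an arbitrarily aligned
 * buffer position. */
static uint32_t read_net32 (const char *buf)
{
        uint32_t netword;
        memcpy (&netword, buf, sizeof (netword));
        return ntohl (netword);
}

With these, the serializer does write_net32 (buf, keylen) instead of *(int32_t *) buf = hton32 (keylen), and the unserializer does keylen = read_net32 (buf) instead of the cast.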
Hi Hraban,

Thanks for the fix :). This patch also fixes another issue, where we observed crashes when a 2.0.x client connects to a 3.0 server.

Thanks once again :).

regards,
Raghavendra.
Hraban,

Is it possible for you to submit the patch using git? If not, let me know and I'll submit it on your behalf.

regards,
Raghavendra.
Hello,

Depending on what exactly you want me to do, it sounds possible. It has been a while since I last used git, so this will be a good excuse to pick up some of those tricks again. Do you want me to register an account and commit to the repository directly, should I host a branch with my fix myself, or do you want me to send you a git-formatted patch?

Greetings
> Depending on what exactly you want me to do, it sounds possible. [...] Do
> you want me to send you a git-formatted patch?

Yes, please send a git-formatted patch to gluster-devel, or glusterfs.com. You can trace it later here: http://patches.gluster.com

-Amar
Hraban, This page has some info about submitting patches to glusterfs: http://www.gluster.com/community/documentation/index.php/Development_Work_Flow
Hello,

Thanks for the link, Vikas. I sent a patch to gluster-devel as per the instructions; did it work? Also, perhaps a bit off-topic: are there instructions somewhere on how to compile from git? I cannot seem to find a configure script.

Thanks,
Hraban
> Thanks for the link, Vikas. I sent a patch to gluster-devel as per
> the instructions; did it work? Also, perhaps a bit off-topic: are there
> instructions somewhere on how to compile from git? I cannot seem to find
> a configure script.

Hi Hraban,

You can run './autogen.sh' first, which generates 'configure', and then you can run './configure' and 'make'.

Regards,
PATCH: http://patches.gluster.com/patch/2625 in master (More robust dictionary (un)serialization (fixes bug #762129).)