Bug 828058

Summary: cli crash with mixture of 32, 64bit machines
Product: [Community] GlusterFS
Component: cli
Version: mainline
Hardware: Unspecified
OS: Unspecified
Severity: unspecified
Priority: unspecified
Status: CLOSED CURRENTRELEASE
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Type: Bug
Reporter: Pranith Kumar K <pkarampu>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: Anush Shetty <ashetty>
CC: blacktear23, gluster-bugs, joe, mailbox, ndevos, nragusa, ujjwala
Bug Blocks: 817967
Last Closed: 2013-07-24 17:24:24 UTC
Attachments: core of the crashed process

Description Pranith Kumar K 2012-06-04 07:17:02 UTC
Description of problem:
    #0 0x00007ffff66dd005 in __strftime_internal () from /lib64/libc.so.6
    #1 0x00007ffff66df026 in strftime_l () from /lib64/libc.so.6
    #2 0x00000000004169c1 in cmd_heal_volume_brick_out (dict=0x666d24, brick=0) at cli-rpc-ops.c:5836
    #3 0x0000000000416d49 in gf_cli3_1_heal_volume_cbk (req=<value optimized out>, iov=<value optimized out>,
    count=<value optimized out>, myframe=<value optimized out>) at cli-rpc-ops.c:5944
    #4 0x00007ffff6f90095 in rpc_clnt_handle_reply (clnt=0x676b50, pollin=0x6833a0) at rpc-clnt.c:788
    #5 0x00007ffff6f90890 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x676b80,
    event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:907
    #6 0x00007ffff6f8c018 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>,
    data=<value optimized out>) at rpc-transport.c:489
    #7 0x00007ffff5074954 in socket_event_poll_in (this=0x680170) at socket.c:1677
    #8 0x00007ffff5074a37 in socket_event_handler (fd=<value optimized out>, idx=0, data=0x680170, poll_in=1,
    poll_out=0, poll_err=<value optimized out>) at socket.c:1792
    #9 0x00007ffff7850d44 in event_dispatch_epoll_handler (event_pool=0x658470) at event.c:785
    #10 event_dispatch_epoll (event_pool=0x658470) at event.c:847
    #11 0x0000000000409109 in main (argc=<value optimized out>, argv=<value optimized out>) at cli.c:696


Seen in Joe Julian's setup with a mixture of 32-bit and 64-bit servers.



Comment 1 Niels de Vos 2012-06-11 14:17:06 UTC
review available at http://review.gluster.com/3534 (release-3.3) and http://review.gluster.com/3535 (master)

Comment 2 Pranith Kumar K 2012-06-12 08:47:34 UTC
Gowri Shankar provided the machine where the crash was seen. I installed RPMs with just my patch applied; for the same volume/data the crash does not appear anymore. :-)

Comment 3 Joe Julian 2012-06-12 22:03:25 UTC
Yes, this does solve the problem I was seeing.

Comment 4 Niels de Vos 2012-06-13 09:26:27 UTC
Created attachment 591413 [details]
core of the crashed process

Some details from the core:

Core was generated by `gluster'.
Program terminated with signal 11, Segmentation fault.
#0  __strftime_internal (s=0x7fffdcb56040 "\220`\265\334\377\177", 
    maxsize=256, format=0x431f1c "%Y-%m-%d %H:%M:%S", tp=0x0, 
    tzset_called=0x7fffdcb55fdf, loc=0x334f98a580) at strftime_l.c:508
508	  int hour12 = tp->tm_hour;
(gdb) bt
#0  __strftime_internal (s=0x7fffdcb56040 "\220`\265\334\377\177", 
    maxsize=256, format=0x431f1c "%Y-%m-%d %H:%M:%S", tp=0x0, 
    tzset_called=0x7fffdcb55fdf, loc=0x334f98a580) at strftime_l.c:508
#1  0x000000334f6a4026 in __strftime_l (s=<value optimized out>, 
    maxsize=<value optimized out>, format=<value optimized out>, 
    tp=<value optimized out>, loc=<value optimized out>) at strftime_l.c:486
#2  0x00000000004169c1 in cmd_heal_volume_brick_out (dict=0x22e25b4, brick=0)
    at cli-rpc-ops.c:5836
#3  0x0000000000416d49 in gf_cli3_1_heal_volume_cbk (
    req=<value optimized out>, iov=<value optimized out>, 
    count=<value optimized out>, myframe=<value optimized out>)
    at cli-rpc-ops.c:5944
#4  0x0000003350e0f095 in rpc_clnt_handle_reply (clnt=0x22f23e0, 
    pollin=0x22fddb0) at rpc-clnt.c:788
...

The cmd_heal_volume_brick_out() call in frame 2 passes tm=NULL (the return value of localtime()) to strftime(), which then crashes.

(gdb) f 2
#2  0x00000000004169c1 in cmd_heal_volume_brick_out (dict=0x22e25b4, brick=0)
    at cli-rpc-ops.c:5836
5836	                        strftime (timestr, sizeof (timestr),
(gdb) l
5831	                ret = dict_get_uint32 (dict, key, &time);
5832	                if (!time) {
5833	                        cli_out ("%s", path);
5834	                } else {
5835	                        tm = localtime ((time_t*)(&time));
5836	                        strftime (timestr, sizeof (timestr),
5837	                                  "%Y-%m-%d %H:%M:%S", tm);
5838	                        if (i ==0) {
5839	                                cli_out ("at                    path on brick");
5840	                                cli_out ("-----------------------------------");

Conversion of the time variable (seconds since the epoch) should work, unless the type-casting interferes. We can check the value of the time variable and see if it is sane:

(gdb) print time
$1 = 1339053312

$ date -d '1 jan 1970 1339053312 sec'
Thu Jun  7 07:15:12 UTC 2012

localtime() returned NULL, but the "time" variable contains a valid date. The type-cast is the likely culprit: "time" is a 32-bit uint32_t, while time_t is 64-bit on this system, so casting &time to (time_t*) lets localtime() read four bytes of adjacent stack memory as the high half of the value; an out-of-range value makes localtime() fail and return NULL. The patch corrects the type-cast, so it should not be possible to hit this problem anymore.
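
For illustration only (a standalone sketch, not GlusterFS code): on an LP64 host time_t is 8 bytes while the dict value is a 4-byte uint32_t, so the cast makes localtime() read past the variable.

/* Standalone illustration, not taken from the GlusterFS sources. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main (void)
{
        uint32_t t32 = 1339053312;   /* the sane value seen in the core */
        struct tm *tm = NULL;

        printf ("sizeof (time_t) = %zu, sizeof (t32) = %zu\n",
                sizeof (time_t), sizeof (t32));

        /* buggy pattern: the four bytes following t32 in memory become the
         * high half of the value localtime() sees, which may be far out of
         * range, so localtime() can return NULL */
        tm = localtime ((time_t *)(&t32));
        printf ("pointer cast -> %p\n", (void *)tm);

        /* safe pattern: widen the value into a real time_t first */
        time_t sec = t32;
        tm = localtime (&sec);
        printf ("widened copy -> %p\n", (void *)tm);

        return 0;
}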

The patch also changes localtime() to localtime_r(): a race condition is possible if multiple threads call localtime() concurrently, and localtime_r() prevents this. I think it is unlikely that this race caused the problem here, but using localtime_r() is surely a good thing.
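
As a rough sketch of the resulting pattern (my own illustration with a hypothetical helper name, not the code from http://review.gluster.com/3534): copy the 32-bit dict value into a proper time_t, convert it with localtime_r() into a caller-provided buffer, and only call strftime() when the conversion succeeded.

/* Sketch only; format_heal_time() is a hypothetical helper, not a
 * function from the patch. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static void
format_heal_time (uint32_t t32, char *timestr, size_t len)
{
        struct tm tm_buf;
        time_t    sec = t32;              /* explicit widening, no pointer cast */

        memset (timestr, 0, len);
        if (localtime_r (&sec, &tm_buf))  /* thread-safe; NULL on failure */
                strftime (timestr, len, "%Y-%m-%d %H:%M:%S", &tm_buf);
}

int
main (void)
{
        char timestr[32];

        format_heal_time (1339053312, timestr, sizeof (timestr));
        printf ("%s\n", timestr);         /* local-time rendering of the epoch value */
        return 0;
}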

Comment 5 Anush Shetty 2012-06-22 09:03:25 UTC
Verified with RC2. 

Created 1000 files while one of the bricks was down in a 2-replica volume. Then brought that server back up and ran 'gluster volume heal volname full' and 'gluster volume heal volname info healed'. In 3.3.0qa45, glusterd crashed when 'gluster volume heal volname info healed' was run.

Comment 6 Kaushal 2012-07-05 04:15:08 UTC
*** Bug 836421 has been marked as a duplicate of this bug. ***