Bug 1763036
| Field | Value |
|---|---|
| Summary | glusterfsd crashed with "'MemoryError' Cannot access memory at address" |
| Product | [Community] GlusterFS |
| Component | rpc |
| Status | CLOSED NEXTRELEASE |
| Severity | high |
| Priority | high |
| Version | mainline |
| Hardware | All |
| OS | Linux |
| Reporter | Mohit Agrawal <moagrawa> |
| Assignee | Mohit Agrawal <moagrawa> |
| CC | alchan, amukherj, bkunal, jpankaja, kramdoss, madam, moagrawa, nbalacha, nravinas, pasik, pdhange, pprakash, rgowdapp, rhs-bugs, rtalur, sheggodu, ykaul |
| Clone Of | 1741193 |
| | 1778175, 1778182, 1804523, 1806595 |
| Last Closed | 2019-10-22 13:47:20 UTC |
| Bug Blocks | 1741193, 1778175, 1778182, 1804523, 1806595 |
Description
Mohit Agrawal
2019-10-18 05:57:00 UTC
Hi,

It seems the brick process is crashing because the function event_slot_alloc is not returning a valid slot.

bt
#0 0x00007f9efaed5207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f9efaed68f8 in __GI_abort () at abort.c:90
#2 0x00007f9efaece026 in __assert_fail_base (fmt=0x7f9efb028ea0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f9efc94507a "slot->fd == fd", file=file@entry=0x7f9efc945054 "event-epoll.c", line=line@entry=417, function=function@entry=0x7f9efc945420 <__PRETTY_FUNCTION__.11118> "event_register_epoll") at assert.c:92
#3 0x00007f9efaece0d2 in __GI___assert_fail (assertion=assertion@entry=0x7f9efc94507a "slot->fd == fd", file=file@entry=0x7f9efc945054 "event-epoll.c", line=line@entry=417, function=function@entry=0x7f9efc945420 <__PRETTY_FUNCTION__.11118> "event_register_epoll") at assert.c:101
#4 0x00007f9efc8f7d04 in event_register_epoll (event_pool=0x563fac588150, fd=<optimized out>, handler=<optimized out>, data=<optimized out>, poll_in=<optimized out>, poll_out=<optimized out>, notify_poller_death=0 '\000') at event-epoll.c:417
#5 0x00007f9ef798ceb2 in socket_server_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7f9ee80403f0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2950
#6 0x00007f9efc8f8870 in event_dispatch_epoll_handler (event=0x7f98fffc9e70, event_pool=0x563fac588150) at event-epoll.c:643
#7 event_dispatch_epoll_worker (data=0x7f992c8ea110) at event-epoll.c:759
#8 0x00007f9efb6d5dd5 in start_thread (arg=0x7f98fffca700) at pthread_create.c:307
#9 0x00007f9efaf9cead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) f 4
#4 0x00007f9efc8f7d04 in event_register_epoll (event_pool=0x563fac588150, fd=<optimized out>, handler=<optimized out>, data=<optimized out>, poll_in=<optimized out>, poll_out=<optimized out>, notify_poller_death=0 '\000') at event-epoll.c:417
417 assert (slot->fd == fd);
(gdb) p slot
$3184 = (struct event_slot_epoll *) 0x7f9e247f81b0
(gdb) p *slot
$3185 = {fd = -1, events = 1073741851, gen = 216, idx = 0, ref = 1, do_close = 1, in_handler = 0, handled_error = 0, data = 0x7f9e3c050be0, handler = 0x7f9ef7989980 <socket_event_handler>, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, poller_death = {next = 0x7f9e247f8208, prev = 0x7f9e247f8208}}
p event_pool->slots_used
$3188 = {1021, 10, 0 <repeats 1022 times>}

f 4

After printing the complete fd table, I was not able to find any entry for socket 1340.

set $tmp = 1
while ($tmp != 1023)
  print ((struct event_slot_epoll *)event_pool->ereg[0])[$tmp].fd
  set $tmp = $tmp + 1
end

f 5
p new_sock
1340

As per the current code of event_slot_alloc, it first checks the value of slots_used to find a table with a free entry. The core shows that slots_used for table 0 is 1021 (less than 1024), which means the table should still have free slots, so the table is set to this registry index (0).

>>>>>>>>>>>>>>>>>>>>>
    for (i = 0; i < EVENT_EPOLL_TABLES; i++) {
        switch (event_pool->slots_used[i]) {
            case EVENT_EPOLL_SLOTS:
                continue;
            case 0:
                if (!event_pool->ereg[i]) {
                    table = __event_newtable(event_pool, i);
                    if (!table)
                        return -1;
                } else {
                    table = event_pool->ereg[i];
                }
                break;
            default:
                table = event_pool->ereg[i];
                break;
        }

        if (table)
            /* break out of the loop */
            break;
    }

    if (!table)
        return -1;

    table_idx = i;
>>>>>>>>>>>>>>>>>>>>>>>>>

The code below then looks for a free entry in the table. Going by the current slots_used value, 3 entries should be free in the table, but in fact no entry is free. The code does not validate that the fd was assigned in the table; it just returns the index.

>>>>>>>>>>>>>>>>>>>>>>>>>>
    for (i = 0; i < EVENT_EPOLL_SLOTS; i++) {
        if (table[i].fd == -1) {
            /* wipe everything except bump the generation */
            gen = table[i].gen;
            memset(&table[i], 0, sizeof(table[i]));
            table[i].gen = gen + 1;

            LOCK_INIT(&table[i].lock);
            INIT_LIST_HEAD(&table[i].poller_death);

            table[i].fd = fd;
            if (notify_poller_death) {
                table[i].idx = table_idx * EVENT_EPOLL_SLOTS + i;
                list_add_tail(&table[i].poller_death,
                              &event_pool->poller_death);
            }

            event_pool->slots_used[table_idx]++;

            break;
        }
    }
>>>>>>>>>>>>>>>>>
    return table_idx * EVENT_EPOLL_SLOTS + i;

I think we need to update the code. I have checked the event code and could not figure out why slots_used is not showing the correct value. Ideally the slots_used value should be 1024, because no index is free in the registry table, and before returning the index the code should validate whether the fd was successfully assigned.

RCA:
slot->ref is not incremented atomically when the slot is allocated. Instead, it is incremented later, as part of event_slot_get. If ref/unref cycles (as explained above) happen within this window, the result is more calls to event_slot_deallocation than are actually needed, which leads to wrong accounting of slots_used in the slot table.

The fix would be:
1. Increment slot->ref atomically in __event_slot_alloc.
2. Add checks to __event_slot_alloc to verify that it actually returns a valid slot, instead of assuming it does.

Thanks,
Mohit Agrawal

A patch is posted to resolve this:
https://review.gluster.org/#/c/glusterfs/+/23508/

REVIEW: https://review.gluster.org/23508 (rpc: Synchronize slot allocation code) merged (#7) on master by Raghavendra G
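For readers following the RCA, the two points of the proposed fix can be illustrated with a small standalone sketch. This is not the merged patch and not GlusterFS code: the names slot_table, slot_alloc and slot_unref, the single pthread mutex, and the 4-slot table are all assumptions made for illustration. It only shows the idea of taking the first reference in the same critical section that claims the slot, and of returning -1 when the scan finds no free entry instead of returning whatever index the loop ended on.

```c
#include <pthread.h>

/* Hypothetical, tiny table for illustration; the table in the report has 1024 slots. */
#define SLOTS_PER_TABLE 4

struct slot {
    int fd;  /* -1 means the slot is free */
    int gen; /* bumped on every allocation */
    int ref; /* reference count */
};

struct slot_table {
    pthread_mutex_t lock;
    struct slot slots[SLOTS_PER_TABLE];
    int slots_used;
};

void slot_table_init(struct slot_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->slots_used = 0;
    for (int i = 0; i < SLOTS_PER_TABLE; i++) {
        t->slots[i].fd = -1;
        t->slots[i].gen = 0;
        t->slots[i].ref = 0;
    }
}

/* Claim a free slot for fd. Returns the slot index, or -1 if none is free. */
int slot_alloc(struct slot_table *t, int fd)
{
    int idx = -1; /* stays -1 unless a free slot is actually claimed */

    pthread_mutex_lock(&t->lock);
    for (int i = 0; i < SLOTS_PER_TABLE; i++) {
        if (t->slots[i].fd != -1)
            continue;
        t->slots[i].gen++;
        t->slots[i].fd = fd;
        t->slots[i].ref = 1; /* first reference taken under the same lock */
        t->slots_used++;
        idx = i;
        break;
    }
    pthread_mutex_unlock(&t->lock);

    return idx;
}

/* Drop one reference. Returns 1 if this call freed the slot, else 0. */
int slot_unref(struct slot_table *t, int idx)
{
    int freed = 0;

    pthread_mutex_lock(&t->lock);
    struct slot *s = &t->slots[idx];
    if (s->ref > 0 && --s->ref == 0) {
        s->fd = -1;
        t->slots_used--; /* decremented exactly once per allocation */
        freed = 1;
    }
    pthread_mutex_unlock(&t->lock);

    return freed;
}
```

In this sketch a caller has to check the return value of slot_alloc for -1 before indexing into the table, and a second slot_unref on an already freed slot leaves slots_used unchanged, so the counter cannot drift below the number of entries that are really in use.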