Bug 1770256
| Summary: | erlc regularly yet randomly hangs on s390x | | |
| --- | --- | --- | --- |
| Product: | [Fedora] Fedora | Reporter: | John Eckersberg <jeckersb> |
| Component: | kernel | Assignee: | John Eckersberg <jeckersb> |
| Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rawhide | CC: | airlied, bskeggs, bugproxy, dan, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, lemenkov, linville, masami256, mchehab, michele, mjg59, rhbugs, rjones, s, steved, tstaudt |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 467765, 1772954, 1772962, 1772963, 1772967, 1772968, 1772969, 1772970, 1772971, 1772973, 1772975, 1773731, 1775737 | | |
Description
John Eckersberg
2019-11-08 14:54:45 UTC
I've tried to use mock on RHEL8 to build for rawhide, and I cannot reproduce the problem there. However, I can reproduce the problem with mock on F29. Here are the kernel stack traces for all threads of a stuck erlc:

```
44611
[<0>] do_select+0x4d0/0x710
[<0>] core_sys_select+0x202/0x518
[<0>] kern_select+0x11c/0x160
[<0>] __s390x_sys_select+0x26/0x38
[<0>] system_call+0x290/0x2b0

44612
[<0>] pipe_read+0x232/0x318
[<0>] new_sync_read+0x1ac/0x1e8
[<0>] vfs_read+0x84/0x140
[<0>] ksys_read+0x62/0xe8
[<0>] system_call+0x290/0x2b0

44613
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44615
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44616
[<0>] do_epoll_wait+0xb0/0xc8
[<0>] __s390x_sys_epoll_wait+0x2e/0x40
[<0>] system_call+0x290/0x2b0

44617
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44618
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44619
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44620
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44621
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44622
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44623
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44624
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44625
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44626
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44627
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44628
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44629
[<0>] do_sys_poll+0x4a0/0x560
[<0>] __s390x_sys_ppoll+0x8c/0xc8
[<0>] system_call+0x290/0x2b0

44630
[<0>] do_epoll_wait+0xb0/0xc8
[<0>] __s390x_sys_epoll_wait+0x2e/0x40
[<0>] system_call+0x290/0x2b0
```
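A dump like the one above can be produced by reading `/proc/<pid>/task/<tid>/stack` for every thread. The following is a minimal, hypothetical collector, not the tooling actually used in this report; it assumes root privileges and a kernel that exposes per-task stack files:

```c
/* Hypothetical helper: print the kernel stack of every thread of a
 * process by reading /proc/<pid>/task/<tid>/stack.  Requires root. */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char taskdir[64];
    snprintf(taskdir, sizeof(taskdir), "/proc/%s/task", argv[1]);

    DIR *d = opendir(taskdir);
    if (!d) {
        perror(taskdir);
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;               /* skip "." and ".." */

        char path[128];
        snprintf(path, sizeof(path), "%s/%s/stack", taskdir, de->d_name);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;               /* thread may have exited meanwhile */

        printf("%s\n", de->d_name); /* tid header, as in the dump above */
        char line[256];
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        putchar('\n');
        fclose(f);
    }
    closedir(d);
    return 0;
}
```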
If I use strace to attach to the main process with -f, and thus attach to all children, the process immediately becomes unstuck. For the above example, I attached strace to each thread individually, one at a time, until I found which thread caused the entire process to unwedge itself, and it was this one:

```
# strace -p 44616
strace: Process 44616 attached
timerfd_settime(11, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0
munmap(0x3ff4f380000, 524288) = 0
futex(0x3ff513402d0, FUTEX_WAKE_PRIVATE, 1) = 1
fcntl(2, F_GETFL) = 0x401 (flags O_WRONLY|O_APPEND)
fcntl(2, F_SETFL, O_WRONLY|O_APPEND) = 0
write(10, "!", 1) = 1
fcntl(1, F_GETFL) = 0x401 (flags O_WRONLY|O_APPEND)
fcntl(1, F_SETFL, O_WRONLY|O_APPEND) = 0
fcntl(0, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(0, F_SETFL, O_RDONLY|O_LARGEFILE) = 0
exit_group(0) = ?
+++ exited with 0 +++
```

So it appears that interrupting the epoll_wait call is sufficient for the task to complete.
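A minimal, self-contained sketch of the mechanism at work here (not erts code; the pipe fd and the SIGALRM are illustrative stand-ins): a thread parked indefinitely in epoll_wait(), the way thread 44616 is above, and a signal such as strace attaching kicking it out of the syscall with EINTR:

```c
/* Sketch (not erts code) of a thread parked in epoll_wait() and of a
 * signal interrupting it -- analogous to strace attaching above. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; /* only interrupt the syscall */ }

int main(void)
{
    /* No SA_RESTART, so a delivered SIGALRM makes epoll_wait fail with EINTR. */
    struct sigaction sa = { .sa_handler = on_alarm };
    sigaction(SIGALRM, &sa, NULL);

    int pipefd[2];
    pipe(pipefd);                   /* fd that never becomes readable */

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    alarm(2);                       /* stand-in for an external poke */

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, -1);  /* blocks, like thread 44616 */
    if (n < 0 && errno == EINTR)
        printf("epoll_wait interrupted; caller re-checks its state\n");
    return 0;
}
```

This matches the observation in the report: once anything forces the parked thread back out of epoll_wait(), the VM re-evaluates its state and runs to completion, which is what you would expect if an event was published but its wakeup got lost.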
I can reproduce without mock, just by building otp from git at OTP-22.1.5 on F29.

Initial kernel testing:

On a RHEL8 host:
* RHEL8 kernel 4.18.0-147.el8.s390x does not reproduce the problem

On an F29 host:
* F29 kernel 5.3.6-100.fc29.s390x reproduces the problem
* upstream 5.3.6 reproduces the problem
* upstream 4.18 does not reproduce the problem

I am going to check the latest 5.4-rc6 next and, assuming it fails the same as 5.3.6, I'll start bisecting from 4.18 to 5.3.6.

5.4-rc6 reproduces the problem.

I've bisected this down to here:

```
$ git log --summary --oneline HEAD~1..refs/bisect/bad
abc610e01c66 (refs/bisect/bad) fs/epoll: avoid barrier after an epoll_wait(2) timeout
c5a282e9635e (refs/bisect/skip-c5a282e9635e9c7382821565083db5d260085e3e) fs/epoll: reduce the scope of wq lock in epoll_wait()
21877e1a5b52 (HEAD, refs/bisect/good-21877e1a5b520132f54515f8835c963056418b4c) fs/epoll: robustify ep->mtx held checks
```

c5a282e9635e does not boot, so either c5a282e9635e or abc610e01c66 seems to be a regression.

John, first, great job; and second, what hardware did you use for debugging / bisecting the kernel?

(In reply to Dan Horák from comment #7)
> John, first, great job, and second, what hw did you use for debugging /
> bisecting the kernel?

I've used several different lab machines. The one I am currently using is an IBM 2964 - N63, but I've been using whatever the lab scheduler gives me. At one point I was also using an IBM 2827 - H20, and it behaved the same.

Another quick note: I tested 5.4-rc7 yesterday and it still reproduces the problem.

OK, so it's both zEC12 and z13. I wasn't able to reproduce the crash earlier, but I'm going to let IBM know about this problem now.

(In reply to Dan Horák from comment #10)
> OK, so it's both zEC12 and z13. I wasn't able to reproduce the crash
> earlier, but I'm going to let IBM know about this problem now.

Let me double check that the 2827-H20 behaved the same. I'm not 100% positive and would hate to waste time relying on bad information.

(In reply to John Eckersberg from comment #11)
> Let me double check that the 2827-H20 behaved the same. I'm not 100%
> positive and would hate to waste time relying on bad information.

Confirmed, on a newly provisioned IBM 2827 - H20 system:
* 21877e1a5b52 works correctly
* abc610e01c66 hangs

------- Comment From h.carstens.com 2019-11-13 03:37 EDT -------
Could you report this upstream to Davidlohr Bueso, please? This doesn't look like an architecture-specific bug at all. Thank you for taking the time to bisect this!

Sure, I'll write up some more details and send it upstream. Agreed, it doesn't look like an architecture-specific bug, at least from examining the patch(es) it bisects down to. However, I see this issue *only* on s390x, which leads me to think the patches in question may be interacting with some other arch-specific code somewhere that has a bug. It's also possible the Erlang VM is doing something silly and the fact that it ever worked before was dumb luck. We'll see; thanks for taking a look!

Something I somehow forgot to test until now, and it is very interesting. On a fresh F29 system, with:

```
[root@ibm-z-37 ~]# rpm -q erlang-erts
erlang-erts-20.3.8.22-1.fc29.s390x
[root@ibm-z-37 ~]# uname -r
5.3.11-100.fc29.s390x
```

I can't reproduce the problem anymore. So it seems that older erlang doesn't reproduce the problem, even with a newer kernel. I'll see if I can bisect down changes in the Erlang VM and hopefully narrow down what is going on here. Ideally I'll narrow it down to a very simple userspace reproducer that triggers the behavior in the kernel.

Bisected down to this rather large and invasive patch in otp:

```
commit c6498571109b524fb319300e1b177b942e556f1b (HEAD, refs/bisect/bad)
Author: Lukas Larsson <lukas>
Date:   Fri Oct 12 18:16:17 2018 +0200

    erts: Move fds with active true behaviour to own pollset

    At start of the VM a poll-set that the schedulers will check is
    created where fds that have triggered many (at the moment, many
    means 10) times without being deselected inbetween.  In this
    scheduler specific poll-set fds do not use ONESHOT, which means
    that the number of syscalls goes down dramatically for such fds.

    This pollset is introduced in order to handle fds that are used
    by the erlang distribution and that never change their state
    from {active, true}.

    This pollset only handles ready_input events, ready_output is
    still handled by the poll threads.  During overload, polling the
    scheduler poll-set is done on a 10ms timer.

 erts/emulator/beam/erl_port.h           |   2 +
 erts/emulator/beam/erl_port_task.c      |  81 ++++++++++++++++++++++++++++++----
 erts/emulator/beam/erl_port_task.h      |  21 ++++++++-
 erts/emulator/beam/erl_process.c        | 122 ++++++++++++++++++++++++++++++++++++++++++++++----
 erts/emulator/beam/erl_process.h        |   4 +-
 erts/emulator/sys/common/erl_check_io.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
 erts/emulator/sys/common/erl_check_io.h |  16 +------
 erts/emulator/sys/common/erl_poll.c     | 504 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
 erts/emulator/sys/common/erl_poll.h     |  14 +++++-
 erts/emulator/sys/common/erl_poll_api.h |   4 +-
 erts/emulator/sys/win32/erl_poll.c      |   5 ++-
 erts/emulator/test/scheduler_SUITE.erl  |  27 +++++++-----
 erts/emulator/test/signal_SUITE.erl     |   2 +-
 13 files changed, 843 insertions(+), 269 deletions(-)
```
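For context on what "fds do not use ONESHOT" means at the epoll level, here is a rough sketch of the two registration styles the commit message contrasts, in plain epoll terms. The helper names are hypothetical and erts drives this through its own erl_poll layer rather than these exact calls:

```c
/* Sketch of the two epoll registration styles described above
 * (illustrative only; not erts code). */
#include <sys/epoll.h>

/* Oneshot style: the fd is disarmed after delivering one event and must
 * be re-armed with EPOLL_CTL_MOD, costing one syscall per event. */
static int arm_oneshot(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                              .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Scheduler-pollset style: plain level-triggered EPOLLIN, armed once;
 * no re-arm syscalls for fds that stay {active, true}. */
static int arm_persistent(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

With the oneshot style, every delivered event costs an extra epoll_ctl() to re-arm; the scheduler pollset trades those syscalls away by leaving frequently triggering fds permanently armed, which also changes how often and from where the VM ends up blocked in epoll_wait().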
I've re-enabled s390x in rawhide and built erlang-22.1.8-1.fc32:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1415713

On s390x only, we invoke configure with --disable-kernel-poll, which works around the issue. See:
https://src.fedoraproject.org/rpms/erlang/c/d286cacc6587d5d47eb7f6a7afe3178abba1d8e2?branch=master

Thanks for your great work once more, John. What do you suggest as the next step? Keep the workaround in the specfile? Let someone look at the s390x kernel? Review the erlang change from comment 16?

One thing that's specific to s390x is that everything is virtualized to some degree, so it wouldn't be the first case where an app couldn't cope with highly loaded VMs/LPARs due to some bug.

(In reply to Dan Horák from comment #18)
> Thanks for your great work once more, John. What do you suggest as the next
> step? Keep the workaround in the specfile? Let someone look at the s390x
> kernel? Review the erlang change from comment 16?
>
> One thing that's specific to s390x is that everything is virtualized to some
> degree, so it wouldn't be the first case where an app couldn't cope with
> highly loaded VMs/LPARs due some bug.

We can keep the workaround as long as we need to; in theory it's only a small performance loss. Until we can pin down the root cause I don't have any better ideas.

I still owe an email to kernel upstream, as I promised in comment #14. While I was trying to summarize my findings for that, I started finding other threads to run down (like the erlang patch set).

I suspect this is ultimately a kernel bug. The erlang change just happens to be where it began using the epoll interface, and the kernel bisect came down to a change in the epoll interface as well, so the erlang bisect result is not really surprising in that context.

It's got to be a race somewhere, because it stops reproducing if I run the reproducer under strace or if I add additional debug output to the Erlang VM. I *can*, however, still reproduce it when tracing with perf. I'll detail this more in my summary to come.

The biggest mystery to me is why this only happens on s390x, given that the kernel epoll change is not arch-specific at all as far as I can tell. But I'll defer to upstream there, because I am well out of my comfort zone at that point.

Hi!

I added some ExcludeArch's to some of my Erlang packages so I could build them on Rawhide and F31 (they had been FTBFS previously, for other reasons). Now that we have this workaround, do you recommend I remove those and rebuild?

(In reply to Randy Barlow from comment #20)
> I added some ExcludeArch's to some of my Erlang packages so I could build
> them on Rawhide and F31 (they had been FTBFS previously, for other reasons).
> Now that we have this workaround, do you recommend I remove those and
> rebuild?

Yep, they should work fine now. Let me know if you come across any issues building them.
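The "race somewhere" suspected above is the classic check-then-block shape of a lost wakeup: a waiter tests for work and then sleeps, while a producer publishes work and then wakes. As a generic illustration only (plain pthreads; neither kernel nor erts code), this is the safe version of that protocol. The bug class arises when the test and the sleep are not ordered against the producer's publish-then-wake steps, which is exactly the territory the bisected epoll barrier commit touches:

```c
/* Generic check-then-block pattern, done safely (illustrative only).
 * Build with: cc -pthread lostwake.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* Re-checking the flag under the lock is what makes this safe.  If
     * the flag were tested outside the lock (or without an equivalent
     * barrier), the wakeup could be missed and the thread would block
     * forever -- the same general shape as a thread parked in
     * epoll_wait() that never sees a ready event. */
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    printf("waiter: saw the event\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);                       /* let the waiter block */

    pthread_mutex_lock(&lock);
    ready = true;                   /* publish work ... */
    pthread_cond_signal(&cond);     /* ... then wake, under the lock */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```

A timing-dependent failure in such a protocol is also consistent with the observation that the hang disappears under strace or with extra debug output: anything that perturbs the relative timing of the publish and the sleep can hide the window.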