Bug 1770256
| Summary: | erlc regularly yet randomly hangs on s390x | | |
| --- | --- | --- | --- |
| Product: | [Fedora] Fedora | Reporter: | John Eckersberg <jeckersb> |
| Component: | kernel | Assignee: | John Eckersberg <jeckersb> |
| Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rawhide | CC: | airlied, bskeggs, bugproxy, dan, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, lemenkov, linville, masami256, mchehab, michele, mjg59, rhbugs, rjones, s, steved, tstaudt |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 467765, 1772954, 1772962, 1772963, 1772967, 1772968, 1772969, 1772970, 1772971, 1772973, 1772975, 1773731, 1775737 | | |
Description
John Eckersberg
2019-11-08 14:54:45 UTC
I've tried to use mock on RHEL8 to build for rawhide, and I cannot reproduce the problem there. However, I can reproduce the problem with mock on F29. Here are the kernel stack traces for all threads of a stuck erlc:

```
44611
[<0>] do_select+0x4d0/0x710
[<0>] core_sys_select+0x202/0x518
[<0>] kern_select+0x11c/0x160
[<0>] __s390x_sys_select+0x26/0x38
[<0>] system_call+0x290/0x2b0

44612
[<0>] pipe_read+0x232/0x318
[<0>] new_sync_read+0x1ac/0x1e8
[<0>] vfs_read+0x84/0x140
[<0>] ksys_read+0x62/0xe8
[<0>] system_call+0x290/0x2b0

44613
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44615
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44616
[<0>] do_epoll_wait+0xb0/0xc8
[<0>] __s390x_sys_epoll_wait+0x2e/0x40
[<0>] system_call+0x290/0x2b0

44617
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44618
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44619
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44620
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44621
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44622
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44623
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44624
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44625
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44626
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44627
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44628
[<0>] futex_wait+0x158/0x250
[<0>] do_futex+0x76/0xcd0
[<0>] __s390x_sys_futex+0xe0/0x168
[<0>] system_call+0x290/0x2b0

44629
[<0>] do_sys_poll+0x4a0/0x560
[<0>] __s390x_sys_ppoll+0x8c/0xc8
[<0>] system_call+0x290/0x2b0

44630
[<0>] do_epoll_wait+0xb0/0xc8
[<0>] __s390x_sys_epoll_wait+0x2e/0x40
[<0>] system_call+0x290/0x2b0
```
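A dump like the one above can be produced by reading `/proc/<pid>/task/<tid>/stack` for every thread. The following is a minimal, hypothetical collector, not the tooling actually used in this report; it assumes root privileges and a kernel that exposes per-task stack files:

```c
/* Hypothetical helper: print the kernel stack of every thread of a
 * process by reading /proc/<pid>/task/<tid>/stack.  Requires root. */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char taskdir[64];
    snprintf(taskdir, sizeof(taskdir), "/proc/%s/task", argv[1]);

    DIR *d = opendir(taskdir);
    if (!d) {
        perror(taskdir);
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;               /* skip "." and ".." */

        char path[128];
        snprintf(path, sizeof(path), "%s/%s/stack", taskdir, de->d_name);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;               /* thread may have exited meanwhile */

        printf("%s\n", de->d_name); /* tid header, as in the dump above */
        char line[256];
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        putchar('\n');
        fclose(f);
    }
    closedir(d);
    return 0;
}
```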
If I use strace to attach to the main process with -f, and thus attach to all children, the process immediately becomes unstuck. For the above example, I attached strace to each thread individually, one at a time, until I found which thread caused the entire process to unwedge itself, and it was this one:

```
# strace -p 44616
strace: Process 44616 attached
timerfd_settime(11, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0
munmap(0x3ff4f380000, 524288) = 0
futex(0x3ff513402d0, FUTEX_WAKE_PRIVATE, 1) = 1
fcntl(2, F_GETFL) = 0x401 (flags O_WRONLY|O_APPEND)
fcntl(2, F_SETFL, O_WRONLY|O_APPEND) = 0
write(10, "!", 1) = 1
fcntl(1, F_GETFL) = 0x401 (flags O_WRONLY|O_APPEND)
fcntl(1, F_SETFL, O_WRONLY|O_APPEND) = 0
fcntl(0, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(0, F_SETFL, O_RDONLY|O_LARGEFILE) = 0
exit_group(0) = ?
+++ exited with 0 +++
```

So it appears that interrupting the epoll_wait call is sufficient for the task to complete.
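A minimal, self-contained sketch of the mechanism at work here (not erts code; the pipe fd and the SIGALRM are illustrative stand-ins): a thread parked indefinitely in epoll_wait(), the way thread 44616 is above, and a signal such as strace attaching kicking it out of the syscall with EINTR:

```c
/* Sketch (not erts code) of a thread parked in epoll_wait() and of a
 * signal interrupting it -- analogous to strace attaching above. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; /* only interrupt the syscall */ }

int main(void)
{
    /* No SA_RESTART, so a delivered SIGALRM makes epoll_wait fail with EINTR. */
    struct sigaction sa = { .sa_handler = on_alarm };
    sigaction(SIGALRM, &sa, NULL);

    int pipefd[2];
    pipe(pipefd);                   /* fd that never becomes readable */

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    alarm(2);                       /* stand-in for an external poke */

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, -1);  /* blocks, like thread 44616 */
    if (n < 0 && errno == EINTR)
        printf("epoll_wait interrupted; caller re-checks its state\n");
    return 0;
}
```

This matches the observation in the report: once anything forces the parked thread back out of epoll_wait(), the VM re-evaluates its state and runs to completion, which is what you would expect if an event was published but its wakeup got lost.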
I can reproduce without mock, just by building otp from git at OTP-22.1.5 on F29.

Initial kernel testing:

On a RHEL8 host:
* RHEL8 kernel 4.18.0-147.el8.s390x does not reproduce the problem

On an F29 host:
* F29 kernel 5.3.6-100.fc29.s390x reproduces the problem
* upstream 5.3.6 reproduces the problem
* upstream 4.18 does not reproduce the problem

I am going to check the latest 5.4-rc6 next and, assuming it fails the same as 5.3.6, I'll start bisecting from 4.18 to 5.3.6.

5.4-rc6 reproduces the problem.

I've bisected this down to here:

```
$ git log --summary --oneline HEAD~1..refs/bisect/bad
abc610e01c66 (refs/bisect/bad) fs/epoll: avoid barrier after an epoll_wait(2) timeout
c5a282e9635e (refs/bisect/skip-c5a282e9635e9c7382821565083db5d260085e3e) fs/epoll: reduce the scope of wq lock in epoll_wait()
21877e1a5b52 (HEAD, refs/bisect/good-21877e1a5b520132f54515f8835c963056418b4c) fs/epoll: robustify ep->mtx held checks
```

c5a282e9635e does not boot, so either c5a282e9635e or abc610e01c66 seems to be a regression.

John, first, great job; and second, what hardware did you use for debugging / bisecting the kernel?

(In reply to Dan Horák from comment #7)
> John, first, great job, and second, what hw did you use for debugging /
> bisecting the kernel?

I've used several different lab machines. The one I am currently using is an IBM 2964 - N63, but I've been using whatever the lab scheduler gives me. At one point I was also using an IBM 2827 - H20, and it behaved the same.

Another quick note: I tested 5.4-rc7 yesterday and it still reproduces the problem.

OK, so it's both zEC12 and z13. I wasn't able to reproduce the crash earlier, but I'm going to let IBM know about this problem now.

(In reply to Dan Horák from comment #10)
> OK, so it's both zEC12 and z13. I wasn't able to reproduce the crash
> earlier, but I'm going to let IBM know about this problem now.

Let me double check that the 2827-H20 behaved the same. I'm not 100% positive and would hate to waste time relying on bad information.

(In reply to John Eckersberg from comment #11)
> Let me double check that the 2827-H20 behaved the same. I'm not 100%
> positive and would hate to waste time relying on bad information.

Confirmed, on a newly provisioned IBM 2827 - H20 system:
* 21877e1a5b52 works correctly
* abc610e01c66 hangs

------- Comment From h.carstens.com 2019-11-13 03:37 EDT -------
Could you report this upstream to Davidlohr Bueso, please? This doesn't look like an architecture-specific bug at all. Thank you for taking the time to bisect this!

Sure, I'll write up some more details and send it upstream. Agreed, it doesn't look like an architecture-specific bug, at least from examining the patch(es) it bisects down to. However, I see this issue *only* on s390x, which leads me to think the patches in question may be interacting with some other arch-specific code somewhere that has a bug. It's also possible the Erlang VM is doing something silly and the fact that it ever worked before was dumb luck. We'll see; thanks for taking a look!

Something I somehow forgot to test until now, and it is very interesting. On a fresh F29 system, with:

```
[root@ibm-z-37 ~]# rpm -q erlang-erts
erlang-erts-20.3.8.22-1.fc29.s390x
[root@ibm-z-37 ~]# uname -r
5.3.11-100.fc29.s390x
```

I can't reproduce the problem anymore. So it seems that older erlang doesn't reproduce the problem, even with a newer kernel. I'll see if I can bisect down changes in the Erlang VM and hopefully narrow down what is going on here. Ideally I'll narrow it down to a very simple userspace reproducer that triggers the behavior in the kernel.

Bisected down to this rather large and invasive patch in otp:

```
commit c6498571109b524fb319300e1b177b942e556f1b (HEAD, refs/bisect/bad)
Author: Lukas Larsson <lukas>
Date:   Fri Oct 12 18:16:17 2018 +0200

    erts: Move fds with active true behaviour to own pollset

    At start of the VM a poll-set that the schedulers will check is
    created where fds that have triggered many (at the moment, many
    means 10) times without being deselected inbetween.  In this
    scheduler specific poll-set fds do not use ONESHOT, which means
    that the number of syscalls goes down dramatically for such fds.

    This pollset is introduced in order to handle fds that are used
    by the erlang distribution and that never change their state
    from {active, true}.

    This pollset only handles ready_input events, ready_output is
    still handled by the poll threads.  During overload, polling the
    scheduler poll-set is done on a 10ms timer.

 erts/emulator/beam/erl_port.h           |   2 +
 erts/emulator/beam/erl_port_task.c      |  81 ++++++++++++++++++++++++++++++----
 erts/emulator/beam/erl_port_task.h      |  21 ++++++++-
 erts/emulator/beam/erl_process.c        | 122 ++++++++++++++++++++++++++++++++++++++++++++++----
 erts/emulator/beam/erl_process.h        |   4 +-
 erts/emulator/sys/common/erl_check_io.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
 erts/emulator/sys/common/erl_check_io.h |  16 +------
 erts/emulator/sys/common/erl_poll.c     | 504 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
 erts/emulator/sys/common/erl_poll.h     |  14 +++++-
 erts/emulator/sys/common/erl_poll_api.h |   4 +-
 erts/emulator/sys/win32/erl_poll.c      |   5 ++-
 erts/emulator/test/scheduler_SUITE.erl  |  27 +++++++-----
 erts/emulator/test/signal_SUITE.erl     |   2 +-
 13 files changed, 843 insertions(+), 269 deletions(-)
```
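For context on what "fds do not use ONESHOT" means at the epoll level, here is a rough sketch of the two registration styles the commit message contrasts, in plain epoll terms. The helper names are hypothetical and erts drives this through its own erl_poll layer rather than these exact calls:

```c
/* Sketch of the two epoll registration styles described above
 * (illustrative only; not erts code). */
#include <sys/epoll.h>

/* Oneshot style: the fd is disarmed after delivering one event and must
 * be re-armed with EPOLL_CTL_MOD, costing one syscall per event. */
static int arm_oneshot(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                              .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Scheduler-pollset style: plain level-triggered EPOLLIN, armed once;
 * no re-arm syscalls for fds that stay {active, true}. */
static int arm_persistent(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

With the oneshot style, every delivered event costs an extra epoll_ctl() to re-arm; the scheduler pollset trades those syscalls away by leaving frequently triggering fds permanently armed, which also changes how often and from where the VM ends up blocked in epoll_wait().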
I've re-enabled s390x in rawhide and built erlang-22.1.8-1.fc32:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1415713

On s390x only, we invoke configure with --disable-kernel-poll, which works around the issue. See:
https://src.fedoraproject.org/rpms/erlang/c/d286cacc6587d5d47eb7f6a7afe3178abba1d8e2?branch=master

Thanks for your great work once more, John. What do you suggest as the next step? Keep the workaround in the specfile? Let someone look at the s390x kernel? Review the erlang change from comment 16?

One thing that's specific to s390x is that everything is virtualized to some degree, so it wouldn't be the first case where an app couldn't cope with highly loaded VMs/LPARs due to some bug.

(In reply to Dan Horák from comment #18)
> Thanks for your great work once more, John. What do you suggest as the next
> step? Keep the workaround in the specfile? Let someone look at the s390x
> kernel? Review the erlang change from comment 16?
>
> One thing that's specific to s390x is that everything is virtualized to some
> degree, so it wouldn't be the first case where an app couldn't cope with
> highly loaded VMs/LPARs due some bug.

We can keep the workaround as long as we need to; in theory it's only a small performance loss. Until we can pin down the root cause I don't have any better ideas.

I still owe an email to kernel upstream, as I promised in comment #14. While I was trying to summarize my findings for that, I started finding other threads to run down (like the erlang patch set).

I suspect this is ultimately a kernel bug. The erlang change just happens to be where it began using the epoll interface, and the kernel bisect came down to a change in the epoll interface as well, so the erlang bisect result is not really surprising in that context.

It's got to be a race somewhere, because it stops reproducing if I run the reproducer under strace or if I add additional debug output to the Erlang VM. I *can*, however, still reproduce it when tracing with perf. I'll detail this more in my summary to come.

The biggest mystery to me is why this only happens on s390x, given that the kernel epoll change is not arch-specific at all as far as I can tell. But I'll defer to upstream there, because I am well out of my comfort zone at that point.

Hi!

I added some ExcludeArch's to some of my Erlang packages so I could build them on Rawhide and F31 (they had been FTBFS previously, for other reasons). Now that we have this workaround, do you recommend I remove those and rebuild?

(In reply to Randy Barlow from comment #20)
> I added some ExcludeArch's to some of my Erlang packages so I could build
> them on Rawhide and F31 (they had been FTBFS previously, for other reasons).
> Now that we have this workaround, do you recommend I remove those and
> rebuild?

Yep, they should work fine now. Let me know if you come across any issues building them.
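The "race somewhere" suspected above is the classic check-then-block shape of a lost wakeup: a waiter tests for work and then sleeps, while a producer publishes work and then wakes. As a generic illustration only (plain pthreads; neither kernel nor erts code), this is the safe version of that protocol. The bug class arises when the test and the sleep are not ordered against the producer's publish-then-wake steps, which is exactly the territory the bisected epoll barrier commit touches:

```c
/* Generic check-then-block pattern, done safely (illustrative only).
 * Build with: cc -pthread lostwake.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* Re-checking the flag under the lock is what makes this safe.  If
     * the flag were tested outside the lock (or without an equivalent
     * barrier), the wakeup could be missed and the thread would block
     * forever -- the same general shape as a thread parked in
     * epoll_wait() that never sees a ready event. */
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    printf("waiter: saw the event\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);                       /* let the waiter block */

    pthread_mutex_lock(&lock);
    ready = true;                   /* publish work ... */
    pthread_cond_signal(&cond);     /* ... then wake, under the lock */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```

A timing-dependent failure in such a protocol is also consistent with the observation that the hang disappears under strace or with extra debug output: anything that perturbs the relative timing of the publish and the sleep can hide the window.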