Bug 1779407 - [OSP13] Rebase erlang to 21.3.8.20
Summary: [OSP13] Rebase erlang to 21.3.8.20
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: erlang
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z16
Target Release: 13.0 (Queens)
Assignee: John Eckersberg
QA Contact: nlevinki
URL:
Whiteboard:
Duplicates: 1859011 1880416 1936542
Depends On: 1835417
Blocks: 1862123 1862126 1866524 1926487
 
Reported: 2019-12-03 22:46 UTC by ldenny
Modified: 2024-12-20 18:56 UTC
CC: 28 users

Fixed In Version: erlang-21.3.8.20-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1862123 1866524 1926487
Environment:
Last Closed: 2021-06-16 10:58:53 UTC
Target Upstream Version:
Embargoed:


Links:
  OpenStack gerrit 728779 (MERGED): Tune haproxy for long running sessions to use leastconn (last updated 2021-02-17 16:44:22 UTC)
  Red Hat Issue Tracker OSP-1482 (last updated 2022-08-30 14:38:40 UTC)
  Red Hat Product Errata RHBA-2021:2385 (last updated 2021-06-16 10:59:47 UTC)

Description ldenny 2019-12-03 22:46:12 UTC
Description of problem:
There have been ongoing RabbitMQ issues in this environment for some time. The issues impact all users with sporadic timeouts, affecting their ability to use Horizon and various other OpenStack services.

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.15-4.el7ost.noarch
openstack-rabbitmq 13.0-89

How reproducible:
The customer is having intermittent issues; `pcs resource cleanup` resolves the problem, but it recurs a few weeks later.
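For reference, the cleanup in question is along these lines (the exact resource name is an assumption based on the pcs status output below):

~~~
# clear the resource's failure history so pacemaker re-probes and recovers it
pcs resource cleanup rabbitmq-bundle
~~~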

Steps to Reproduce:


Actual results:
erratic time-outs both for Horizon and for the CLI until `pcs resource cleanup` is run:

~~~
[heat-admin@controller-1~]$ sudo docker logs rabbitmq-bundle-docker-1
 warning: qb_ipcs_event_sendv:  new_event_notification (13-826665-17): Broken pipe (32)
 warning: send_client_notify:   Could not notify client proxy-cib_rw-826665-bde2726d/-351f-41d1-a9a2-f0b0bb5d22e4: Broken pipe | rc=-32
 warning: qb_ipcs_event_sendv:  new_event_notification (13-125380-17): Broken pipe (32)
 warning: send_client_notify:   Could not notify client proxy-cib_rw-125380-beafb4a6/-84a0-4e93-ae9f-e17cb4460031: Broken pipe | rc=-32
 warning: qb_ipcs_event_sendv:  new_event_notification (13-556755-17): Broken pipe (32)
 warning: send_client_notify:   Could not notify client proxy-cib_rw-556755-4cfa5985/-99b4-4af7-af09-7eccf1c45f45: Broken pipe | rc=-32
    info: log_execute:  executing - rsc:rabbitmq action:notify call_id:1777061
    info: log_finished: finished - rsc:rabbitmq action:notify call_id:1777061 pid:1028799 exit-code:0 exec-time:876ms queue-time:0ms
    info: cancel_recurring_action:      Cancelling ocf operation rabbitmq_monitor_10000
    info: log_execute:  executing - rsc:rabbitmq action:stop call_id:1777063
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [  ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ DIAGNOSTICS ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ =========== ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [  ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ attempted to contact: ['rabbit@controller-1'] ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [  ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ rabbit@controller-1: ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [   * connected to epmd (port 4369) on controller-1 ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [   * epmd reports: node 'rabbit' not running at all ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [                   no other nodes on controller-1 ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [   * suggestion: start the node ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [  ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ current node details: ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ - node name: 'rabbitmq-cli-03@controller-1' ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ - home dir: /var/lib/rabbitmq ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [ - cookie hash: YpW5g/5GL2uHQF1pA== ]
  notice: operation_finished:   rabbitmq_stop_0:1028898:stderr [  ]
    info: log_finished: finished - rsc:rabbitmq action:stop call_id:1777063 pid:1028898 exit-code:0 exec-time:7150ms queue-time:0ms
    info: log_execute:  executing - rsc:rabbitmq action:start call_id:1777074
~~~
From pcs status:
~~~
* rabbitmq_monitor_10000 on rabbitmq-bundle-1 'unknown error' (1): call=32, status=complete, exitreason='',
    last-rc-change='Sun Dec  1 23:58:22 2019', queued=0ms, exec=0ms
~~~

Errors in the sasl log:
~~~
=SUPERVISOR REPORT==== 2-Dec-2019::14:21:06 ===
     Supervisor: {<0.2450.246>,rabbit_connection_helper_sup}
     Context:    shutdown_error
     Reason:     noproc
     Offender:   [{pid,<0.2374.246>},
                  {name,heartbeat_sender},
                  {mfargs,
                      {rabbit_heartbeat,start_heartbeat_sender,
                          [#Port<0.217030>,60,#Fun<rabbit_reader.4.130174145>,
                           {heartbeat_sender,
                               <<"192.168.x.x:36625 -> 192.168.x.x:5672">>}]}},
                  {restart_type,transient},
                  {shutdown,30000},
                  {child_type,worker}]
~~~

Expected results:
rabbitmq to run without failing

Additional info:

Comment 12 Daniel Alvarez Sanchez 2020-03-06 13:56:22 UTC
From the OVN side it looks like a dup of: https://bugzilla.redhat.com/show_bug.cgi?id=1795697

Comment 51 John Eckersberg 2020-05-13 21:04:14 UTC
There are a lot of private comments here due to potentially sensitive data contained within.  Let me generalize and summarize the issue here so it's visible.

Problem Description
===================

RabbitMQ cluster suffers from a partial partition.

The pacemaker resource agent is able to detect the partial partition and recover the cluster, although sometimes clients require a restart to restore functionality (a distinct issue that also needs to be investigated).


Root Cause Analysis
===================

I consistently see this sequence of events when the failure occurs:

- A quick burst of AMQP messages is published by OpenStack clients, usually nova-conductor: somewhere on the order of ~10 MB of message data sent over the wire in ~100 ms.  It stands out on data throughput graphs relative to the normal traffic, but I don't think it's unreasonable, and the burst is very quick.

- A corresponding burst of traffic is seen on the RabbitMQ clustering connections between nodes, presumably as the above burst of messages is routed to the queue master processes and then replicated to all of the queue slaves.

- During this burst of traffic, the TCP receive window for one of the nodes will go to zero, and remain at zero forever.  This causes the connection to completely stall out, and is the reason the cluster partitions.  Understanding why this happens is key to resolving the issue.


Next steps
==========

I suggest two parallel paths forward from here:

Workaround/mitigation
---------------------

Best guess with the information we have so far: this may have something to do with queue mirroring.

I did notice this in the sosreports:

pcs_config:   Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all","ha-promote-on-shutdown":"always"}"

This means that in a three-node cluster, every message is mirrored to all nodes.  This is rather inefficient; it is fine to maintain only one backup for each message, because if you lose two nodes you are going to lose quorum anyway.  We've changed the default in newer OSP to do less mirroring here:

https://review.opendev.org/#/c/585340/

So in the three-node case, it will only maintain one slave instead of two, which cuts the replication load in half.
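For reference, the change amounts to switching the policy definition from "ha-mode":"all" to an "exactly 2" policy.  A hedged sketch of how that could be applied (the resource name, quoting, and exact invocation are illustrative, not a tested procedure):

~~~
# Via the pacemaker resource agent attribute shown in pcs_config above (illustrative):
pcs resource update rabbitmq set_policy='ha-all ^(?!amq\.).* {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}'

# Or directly with rabbitmqctl inside the rabbitmq container (illustrative):
rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}' --apply-to queues
~~~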

I think regardless it's worthwhile to make this change and monitor how the performance is impacted.  There is a chance the problem may resolve itself.

If the problem still persists with one slave, we could also test disabling queue mirroring entirely and monitor again.  There is a potential for message loss with queue mirroring disabled, but given the current state of things there is already a known and definitive amount of message loss happening.  I think turning it off entirely would be a net improvement as far as operational risk goes.

Continued RCA
-------------

At this point what we really need is an automated way to detect that a cluster connection has stalled out and is stuck in a zero-window probe, and then to gather sufficient debugging data from the system before it partitions and pacemaker restarts the resource.  I am currently trying a few different things: scripts that monitor connection status (ss, /proc/net/tcp, etc.), or possibly using perf to probe the relevant functions in the TCP stack and detect when the connection window goes to zero.  We would then trigger data collection, probably just a coredump of the erlang VM, so we can look through it and see where all the threads are and why seemingly nothing is reading from the clustering socket.
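To make the first idea concrete, here is a minimal sketch of such a watcher, assuming ss and gdb (gcore) are available on the host and that an unchanging non-zero Send-Q on a distribution connection (port 25672) is a good-enough stall heuristic.  This is illustrative only, not the tooling actually used here:

~~~
#!/bin/bash
# Sample the erlang distribution connections every 30s; if any Send-Q stays
# non-zero and unchanged between samples, assume the peer stopped reading and
# dump the erlang VM for offline analysis.
prev=""
while sleep 30; do
    cur=$(ss -tn '( sport = :25672 or dport = :25672 )' \
          | awk 'NR>1 && $3 > 0 {print $4, $5, $3}')   # local, peer, Send-Q
    if [ -n "$cur" ] && [ "$cur" = "$prev" ]; then
        echo "possible stalled distribution connection(s):"
        echo "$cur"
        gcore -o /var/tmp/beam-stalled "$(pgrep -f beam.smp | head -n1)"   # needs gdb
        break
    fi
    prev="$cur"
done
~~~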

Comment 52 Bogdan Dobrelya 2020-05-15 15:58:42 UTC
Perhaps it also makes sense to switch the haproxy LB strategy to 'leastconn' instead of 'roundrobin'?  That might indirectly alleviate network spikes by reducing "unrelated" network pressure on the node taking AMQP traffic.
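For reference, this kind of change is made per service section in haproxy.cfg; a hedged fragment (the service name and addresses here are hypothetical, only the balance directive is the point; the gerrit change in the Links section above tracks this):

~~~
listen some_api
    bind 172.17.1.10:8774
    balance leastconn    # instead of the default roundrobin
    server controller-0 172.17.1.20:8774 check
    server controller-1 172.17.1.21:8774 check
    server controller-2 172.17.1.22:8774 check
~~~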

Comment 53 John Eckersberg 2020-05-19 17:40:47 UTC
I'm suspicious this is related:

https://github.com/erlang/otp/commit/b0018e78e6b87b40e93760bda715ae45f0776da7#

commit b0018e78e6b87b40e93760bda715ae45f0776da7
Author: Raimo Niskanen <raimo>
Date:   Fri Apr 26 16:48:07 2019 +0200

    Copy linger zero flag from listen socket


From the top of the file...

#define TCP_ADDF_LINGER_ZERO 	  512 /* Discard driver queue on port close */

I don't think this is the root cause, but it seems that after a partition occurs, the old dead connections are hanging around and this might be why.

Comment 54 John Eckersberg 2020-05-19 17:54:01 UTC
More candidates:

https://github.com/erlang/otp/commit/9b24da12c766d15493d9c99be18f5a14498ec558

commit 9b24da12c766d15493d9c99be18f5a14498ec558
Author: John Högberg <john>
Date:   Mon Jul 10 08:26:55 2017 +0200

    Close TCP ports properly on send timeout

https://github.com/erlang/otp/commit/ae8d3bc9b32e95ce6f9a2da3b6ae961f04071c36

commit ae8d3bc9b32e95ce6f9a2da3b6ae961f04071c36
Author: John Högberg <john>
Date:   Thu Jun 1 10:13:51 2017 +0200

    Unconditionally clear IO buffers on send/shutdown errors
    
    This fixes a bug where a send/shutdown error on an active-mode socket results
    in the port never being properly closed.

Comment 63 John Eckersberg 2020-06-16 15:46:51 UTC
I have my new-and-greatly-improved debugging script working, available here:

http://git.engineering.redhat.com/git/users/jeckersb/misc/plain/rmq-epoll

You can run this (as root) on any host that has a clustered RabbitMQ running.  It does *not* need to run in a container context, and does *not* require SYS_PTRACE on any target container.

Before running, you will need to install the kernel-devel package for the currently running kernel, as well as python-bcc:

# yum install -y kernel-devel-$(uname -r) python-bcc

You should expect to see output similar to this when running:

[root@controller-0 ~]# ./rmq-epoll
192231 beam.smp
  192333 1_scheduler
  192334 2_scheduler
  192335 3_scheduler
  192336 4_scheduler
  192352 0_poller
281   172.17.1.86:25672 <-> 172.17.1.123:59194
19    172.17.1.86:25672 <-> 172.17.1.126:53021
2020-06-16 15:28:45.787116 436925.566798 kernel       tcp lport=25672 rport=53021 len=73 rcv_wnd=183296
2020-06-16 15:28:45.788021 436925.566829 kernel       sock readable lport=25672 rport=53021
2020-06-16 15:28:45.788152 436925.567349 kernel       tcp lport=25672 rport=59194 len=1779 rcv_wnd=183296
2020-06-16 15:28:45.788259 436925.567363 kernel       sock readable lport=25672 rport=59194
2020-06-16 15:28:45.788377 436925.567492 kernel       tcp lport=25672 rport=59194 len=73 rcv_wnd=182272
2020-06-16 15:28:45.788514 436925.567506 kernel       sock readable lport=25672 rport=59194
2020-06-16 15:28:45.788612 436925.569325 1_scheduler  epoll_wait ready fd=19
2020-06-16 15:28:45.789541 436925.569325 1_scheduler  epoll_wait ready fd=281
2020-06-16 15:28:45.790014 436925.571782 2_scheduler  recvfrom(19)=73
2020-06-16 15:28:45.790474 436925.571951 2_scheduler  recvfrom(281)=1460
2020-06-16 15:28:45.790945 436925.571998 2_scheduler  epoll_wait ready fd=281
2020-06-16 15:28:45.791408 436925.572066 2_scheduler  recvfrom(281)=319
2020-06-16 15:28:45.791952 436925.572355 2_scheduler  epoll_wait ready fd=281
2020-06-16 15:28:45.792408 436925.572392 2_scheduler  recvfrom(281)=73
2020-06-16 15:28:45.792891 436925.575734 kernel       tcp lport=25672 rport=59194 len=109 rcv_wnd=182272
2020-06-16 15:28:45.792996 436925.575773 kernel       sock readable lport=25672 rport=59194
2020-06-16 15:28:45.793090 436925.575805 2_scheduler  epoll_wait ready fd=281
2020-06-16 15:28:45.793617 436925.575843 2_scheduler  recvfrom(281)=109
2020-06-16 15:28:45.794086 436925.576423 kernel       tcp lport=25672 rport=59194 len=144 rcv_wnd=182272
2020-06-16 15:28:45.794189 436925.576456 kernel       sock readable lport=25672 rport=59194
2020-06-16 15:28:45.794284 436925.577137 2_scheduler  epoll_wait ready fd=281
2020-06-16 15:28:45.794738 436925.577182 2_scheduler  recvfrom(281)=144
2020-06-16 15:28:45.795236 436925.578355 kernel       tcp lport=25672 rport=59194 len=109 rcv_wnd=182272
2020-06-16 15:28:45.795340 436925.578394 kernel       sock readable lport=25672 rport=59194
2020-06-16 15:28:45.795455 436925.578489 2_scheduler  epoll_wait ready fd=281
2020-06-16 15:28:45.795927 436925.578536 2_scheduler  recvfrom(281)=109

Some notes on what we're seeing here...

First we print out the thread IDs for the relevant erlang VM processes (main pid, plus all scheduler and poller threads).

Next we print out the file descriptor numbers and addr:port for all established erlang clustering connections.  We will filter on these to only capture events related to the clustering connections.

Then the actual data collection. This lets us observe the data flow across the major subsystems:

- First we get an inbound TCP packet

- Then the socket layer for the related TCP stream is flagged as readable

- Next the epoll_wait syscall returns to userspace to inform the erlang VM that the socket is ready to read

- Finally we see the erlang VM call recvfrom() on the socket to pull the data out of the kernel.

Somewhere in this sequence of events is where our problem is occurring.  Gathering this data should allow us to better isolate the step at which the problem occurs, and develop further probing as necessary until the root cause is identified.
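For a sense of the general shape of this kind of tooling (this is not the rmq-epoll script itself, just a minimal python-bcc illustration of the kernel-side half of the pipeline; the port filter and output format are assumptions):

~~~
#!/usr/bin/env python
# Print a line whenever the kernel marks a socket on the erlang distribution
# port (25672) readable.  Requires python-bcc and kernel-devel as noted above.
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

// Fires from the kernel's default data-ready callback; filter on local port.
int kprobe__sock_def_readable(struct pt_regs *ctx, struct sock *sk)
{
    u16 lport = sk->__sk_common.skc_num;   // local port, host byte order
    if (lport == 25672)
        bpf_trace_printk("sock readable lport=%d\\n", lport);
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing sock_def_readable on port 25672... Ctrl-C to stop")
b.trace_print()
~~~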

One final note: this has the potential to generate a lot of output on a heavily loaded production cluster, so it's probably best to capture it to disk compressed, as we did with the previous strace-based approach.  Something similar to:

# ./rmq-epoll | tee /dev/tty | xz -c --flush-timeout 30000 > rmq-epoll.$(hostname -s).$(date -Iseconds).xz

This will still show output on the console (if desired) while compressing the stream in-flight onto disk for later retrieval and upload to the case.

Comment 64 John Eckersberg 2020-06-17 15:00:30 UTC
Here's what I'm seeing on the first failure with the new script.  This is on controller 1.


This is the final packet that gets read correctly and completely (1460+1460+1460+1411 = 5791):

2020-06-16 22:15:49.321391 519470.848982 kernel       tcp lport=52077 rport=25672 len=5791 rcv_wnd=2010624
2020-06-16 22:15:49.321407 519470.848986 kernel       sock readable lport=52077 rport=25672
2020-06-16 22:15:50.097131 519470.849020 0_poller     epoll_wait ready fd=74
2020-06-16 22:15:49.321419 519470.849031 3_scheduler  recvfrom(74)=1460
2020-06-16 22:15:50.097259 519470.849041 0_poller     epoll_wait ready fd=74
2020-06-16 22:15:49.321540 519470.849052 3_scheduler  recvfrom(74)=1460
2020-06-16 22:15:50.097389 519470.849054 0_poller     epoll_wait ready fd=74
2020-06-16 22:15:49.321661 519470.849062 3_scheduler  recvfrom(74)=1460
2020-06-16 22:15:50.097517 519470.849065 0_poller     epoll_wait ready fd=74
2020-06-16 22:15:49.321788 519470.849087 3_scheduler  epoll_wait ready fd=74
2020-06-16 22:15:49.065143 519470.849100 2_scheduler  recvfrom(74)=1411


First packet that doesn't get read normally:

2020-06-16 22:15:50.097784 519470.849405 kernel       tcp lport=52077 rport=25672 len=173 rcv_wnd=2010624
2020-06-16 22:15:50.097798 519470.849407 kernel       sock readable lport=52077 rport=25672
2020-06-16 22:15:50.097809 519470.849413 3_scheduler  epoll_wait ready fd=74


This begins a big section where schedulers 1-4 poll fd 74 successfully in a tight loop, but there is no recvfrom... a sampling below:

2020-06-16 22:15:50.101010 519470.849805 3_scheduler  epoll_wait ready fd=74
...
2020-06-16 22:15:49.079188 519470.886697 1_scheduler  epoll_wait ready fd=74
...
2020-06-16 22:15:48.780837 519470.904228 2_scheduler  epoll_wait ready fd=74
...
2020-06-16 22:15:49.041318 519471.118394 4_scheduler  epoll_wait ready fd=74


Last scheduler poll is here:

2020-06-16 22:15:52.633398 519473.951640 3_scheduler  epoll_wait ready fd=74


Eventually the poller thread gets in on the action, but only these two times:

2020-06-16 22:15:52.552202 519473.986179 0_poller     epoll_wait ready fd=74
...
2020-06-16 22:16:03.936846 519485.940761 0_poller     epoll_wait ready fd=74


We finally read some time later (way too late); this is 32.3055s elapsed since the last read.

2020-06-16 22:16:21.023404 519503.154588 8_scheduler  recvfrom(74)=209


A few notes and initial thoughts:

- I've sorted this by the middle column, which is the timestamp generated in the kernel as events fire.  The timestamp on the left is generated by python when it prints, so there is a delay and it doesn't necessarily retrieve events in the order that they were produced.  So if you are looking at the leftmost time it will jump around a lot; this is expected.

- I'm guessing the final read happens *after* net_tick timeout, and the system has probably forced reconnect on the file descriptor already.

- We see some large TCP payloads, largest being 62636:

  2020-06-16 22:15:00.054606 519422.184989 kernel       tcp lport=52077 rport=25672 len=62636 rcv_wnd=1879040

  (This seems to be a function of the vlan MTU of 9000, which results in an MSS of 8948 once the headers are stripped away, so we get aggregated payloads that are a multiple of this.  I don't presently understand the details of how this works in the kernel, but it's probably not super relevant.)

  But note that we limit the recvfrom length to 1460, which is the TCP MSS for a "normal" 1500 MTU packet.  This is related to tunable buffers in the VM; see http://erlang.org/doc/man/inet.html, values for "buffer" and "recbuf" (an illustrative example of raising these appears at the end of this comment).

- It looks like the problem occurs right after both the poller and scheduler poll the fd back to back... these two lines:

  2020-06-16 22:15:50.097517 519470.849065 0_poller     epoll_wait ready fd=74
  2020-06-16 22:15:49.321788 519470.849087 3_scheduler  epoll_wait ready fd=74
 
  Which makes me suspicious that this is some sort of race in moving file descriptors between poll sets.  I'm pretty sure the schedulers share one pollset and the poller uses another one, but I need to re-read the code to verify how this all works.
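As a footnote on the "buffer"/"recbuf" tunables mentioned above: for the distribution sockets these are controlled by the kernel application's inet_dist_listen_options/inet_dist_connect_options.  A hedged illustration of raising them via erl flags (the values are arbitrary and the placement in RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS is an assumption, not a recommendation being made here):

~~~
# rabbitmq-env.conf -- illustrative only; keep any existing args and append:
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="... -kernel inet_dist_connect_options [{buffer,131072},{recbuf,131072}] -kernel inet_dist_listen_options [{buffer,131072},{recbuf,131072}]"
~~~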

Comment 66 John Eckersberg 2020-06-18 21:08:42 UTC
Quick update here at the end of my day:

I've updated http://git.engineering.redhat.com/git/users/jeckersb/misc/plain/rmq-epoll which introduces two changes from the previous version:

- Improve instrumentation of the recvfrom syscall.  I think we may have missed some events previously, depending on timing.

- Add tracing of epoll_ctl.  This will allow us to watch as the distribution file descriptors move in and out of different pollsets.

There's a small chance this new version may improve our understanding, but I am reasonably sure the problem is somewhere inside of the erlang VM at this point.  My next plan is to work on setting up uprobes at key places inside of the VM so we can start chasing the tracing into userspace.

Comment 67 John Eckersberg 2020-06-19 15:28:35 UTC
(In reply to John Eckersberg from comment #66)
> My next plan is to work on setting up uprobes at key places inside of the VM so we can start chasing the tracing into userspace.

Unfortunately this won't work presently because the RHEL7 kernel lacks https://github.com/torvalds/linux/commit/f0a2aa5a2a406d0a57aa9b320ffaa5538672b6c5 so we can't attach uprobes to binaries inside of docker containers.

[root@controller-0 tracing]# echo 'p:beam_entry /var/lib/docker/overlay2/a00f4ddbe02c82672d625736eb9eed8c4e5c7eb0a9e91b2ddfdc71701ac33e5d/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp:0x8d2e0' > uprobe_events
[root@controller-0 tracing]# echo 1 > events/uprobes/beam_entry/enable
-bash: echo: write error: Invalid argument
[root@controller-0 tracing]# 

And from dmesg:

[694776.777753] event trace: Could not enable event beam_entry


But if I try the same thing on OSP16+RHEL8:

[root@messaging-0 tracing]# echo 'p:beam_entry /var/lib/containers/storage/overlay/12de157c2be6f3165330c6221ed34546f2eddac50a60e8582ad01d174f2f5fb7/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp:0x8d2e0' > uprobe_events
[root@messaging-0 tracing]# echo 1 > events/uprobes/beam_entry/enable
[root@messaging-0 tracing]#

Comment 68 John Eckersberg 2020-06-25 16:42:15 UTC
I rebuilt the RHEL 7.8 kernel to include the above patch, and it gets the job done:

[root@controller-0 tracing]# uname -r
3.10.0-1127.15.1.el7.uprobe1.x86_64
[root@controller-0 tracing]# echo 'p:beam_entry /var/lib/docker/overlay2/0648b867267cc4654efec944c917b7a13e114c5264d6e496b6162d1f5d5c2c59/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp:0x8d2e0' > uprobe_events
[root@controller-0 tracing]# echo 1 > events/uprobes/beam_entry/enable
[root@controller-0 tracing]# 

So now I can successfully do:

[root@controller-0 ~]# perf probe -x /var/lib/docker/overlay2/0648b867267cc4654efec944c917b7a13e114c5264d6e496b6162d1f5d5c2c59/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp 'erts_poll_control pollset=%di fd=%si op=%dx events=%cx'              
Added new event:
  probe_beam:erts_poll_control (on erts_poll_control in /var/lib/docker/overlay2/0648b867267cc4654efec944c917b7a13e114c5264d6e496b6162d1f5d5c2c59/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp with pollset=%di fd=%si op=%dx events=%cx)

You can now use it in all perf tools, such as:

        perf record -e probe_beam:erts_poll_control -aR sleep 1

[root@controller-0 ~]# perf record -e probe_beam:erts_poll_control -p 29406 -- sleep 30
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.363 MB perf.data (72 samples) ]
[root@controller-0 ~]# perf script | head
     3_scheduler 29519 [002]  1976.175666: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x0 events=0x1
     3_scheduler 29519 [002]  1976.176138: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
     3_scheduler 29519 [000]  1976.193570: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
     3_scheduler 29519 [000]  1976.193793: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
     3_scheduler 29519 [000]  1976.193880: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x2 events=0x0
     1_scheduler 29517 [002]  1985.416328: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x44 op=0x1 events=0x0
     1_scheduler 29517 [002]  1985.416889: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x44 op=0x1 events=0x1
     1_scheduler 29517 [002]  1985.418074: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x0 events=0x1
     1_scheduler 29517 [002]  1985.418772: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x0
     1_scheduler 29517 [002]  1985.419125: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
[root@controller-0 ~]# 

I'll see if we can get this added for the RHEL7 kernel.

Comment 69 John Eckersberg 2020-06-25 17:08:36 UTC
(In reply to John Eckersberg from comment #68)
> I'll see if we can get this added for the RHEL7 kernel.

https://bugzilla.redhat.com/show_bug.cgi?id=1851177

Comment 71 John Eckersberg 2020-06-29 20:32:04 UTC
Something else to try, given this smells like a race condition...

Related to bug 1848705, which I spun off due to the previous changes we made for the +zdbbl option: we are also overwriting the default scheduler option +stbt, which controls scheduler CPU affinity.

In the current (inadvertently overwritten) state, the +stbt option is unset, which means there is no CPU affinity and the erlang scheduler threads are free to migrate across CPUs.

Let's try updating RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS in rabbitmq-env.conf to include:

+stbt db

This is the normal default value in rabbitmq.  The 'db' setting means default_bind, which defaults to thread_no_node_processor_spread and binds schedulers intelligently by considering NUMA nodes and hyperthreading.

This has potential to (a) eliminate a source of race conditions as schedulers will no longer migrate between CPUs, and (b) increase performance due to better cache usage and fewer context switches.
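A minimal sketch of what that edit would look like (the env file path inside the container and its existing contents are assumptions; only the appended +stbt db is the point):

~~~
# /etc/rabbitmq/rabbitmq-env.conf -- illustrative; keep whatever is already set
# (e.g. the +zdbbl value) and append the scheduler bind type:
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="... +stbt db"
~~~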

Comment 76 John Eckersberg 2020-07-14 21:12:15 UTC
Unfortunately the latest script output tipped me off to the fact that I have been developing against a slightly different container image.  The uprobes attached by symbol work fine, but those attached at an offset from the beginning of a function do not fire because of the version mismatch.

Fortunately, this is an easy fix.  I will provide an updated script with corrected offsets and we can recapture once more.

Comment 77 John Eckersberg 2020-07-14 22:32:52 UTC
https://bugs.erlang.org/browse/ERL-1213 is a possible root cause.  The erlang version in OSP 16.1 contains this fix but the version in 16.0 does not.  This sounds very similar to what we are seeing.

Comment 78 John Eckersberg 2020-07-15 15:34:58 UTC
(In reply to John Eckersberg from comment #77)
> https://bugs.erlang.org/browse/ERL-1213 is a possible root cause.  The
> erlang version in OSP 16.1 contains this fix but the version in 16.0 does
> not.  This sounds very similar to what we are seeing.

Let's try the latest image from the 16.1 Beta:

https://catalog.redhat.com/software/containers/rhosp-beta/openstack-rabbitmq/5cdcaa77dd19c778293bdd47

Which contains:

erlang-erts-21.3.8.16-1.el8ost.x86_64

and that package contains the fix for the above bug.

If it's still broken after that we can continue troubleshooting, but I think it's highly likely that fixes the problem completely.
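Once a node is running the 16.1 image, the installed erlang can be confirmed with something along these lines (the container lookup is an assumption; adjust to the actual container name):

~~~
# on a RHEL8/OSP16 controller (illustrative):
podman exec $(podman ps -qf name=rabbitmq | head -n1) rpm -q erlang-erts
~~~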

Comment 79 John Eckersberg 2020-07-24 20:31:58 UTC
Erlang 21.3.8.17 was released on Wednesday and contains this fix:

https://github.com/erlang/otp/commit/2361484707d64c3cccb6e14df67d8b0ca0b9babf

commit 2361484707d64c3cccb6e14df67d8b0ca0b9babf
Author: Ingela Anderton Andin <ingela>
Date:   Tue Jun 23 17:53:35 2020 +0200

    ssl: Correct flow control handling of passive sockets when packet mode is used


Which I am almost positive is our problem.  I was already tracing the same code before pulling the latest release and seeing this commit.

Comment 81 John Eckersberg 2020-07-27 18:48:47 UTC
(In reply to John Eckersberg from comment #79)
> Erlang 21.3.8.17 was released on Wednesday and contains this fix:
> 
> https://github.com/erlang/otp/commit/2361484707d64c3cccb6e14df67d8b0ca0b9babf
> 
> commit 2361484707d64c3cccb6e14df67d8b0ca0b9babf
> Author: Ingela Anderton Andin <ingela>
> Date:   Tue Jun 23 17:53:35 2020 +0200
> 
>     ssl: Correct flow control handling of passive sockets when packet mode
> is used
> 
> 
> Which I am almost positive is our problem.  I was already tracing the same
> code before pulling the latest release and seeing this commit.

Unfortunately, backporting this change to OSP13 is not straightforward, as the SSL module was completely rewritten from gen_fsm to gen_statem for OTP 19.  Thus the delta between the erlang versions in OSP 13 and 16 is very large, and the code flow is completely different.

Comment 82 John Eckersberg 2020-08-05 21:54:03 UTC
Hm, rabbitmq-server 3.6.x does *not* boot on OTP 21, so this will require a bit more than just an updated erlang.

Comment 90 John Eckersberg 2020-09-14 18:02:04 UTC
Putting this back to ASSIGNED

We're going to try to rebase to erlang 23 instead.  There was a pretty severe performance issue with 21 when I was benchmarking it and trying to reproduce the original issue.

Comment 92 John Eckersberg 2020-09-18 14:49:59 UTC
*** Bug 1880416 has been marked as a duplicate of this bug. ***

Comment 93 John Eckersberg 2020-09-18 19:55:05 UTC
Ok let's try this again...

Should be fixed by erlang-23.0.4-1.el7ost - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1326337

Will also require a rebased rabbitmq as the current version will not boot on this new erlang build.

Comment 110 John Eckersberg 2021-02-05 20:34:52 UTC
I am going to request this as an exception for OSP13z15, in order to rebase to the latest version from OSP16 (>=21.3.8.18).  That version fixes the TLS issue described here as well as a related issue in the VM implementation which breaks TCP connections.  It also requires updating a much smaller set of packages as compared to rebasing to 23.x.

Comment 111 John Eckersberg 2021-02-08 19:54:53 UTC
erlang-21.3.8.20-1.el7ost - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1496012

Comment 128 Freddy Wissing 2021-06-10 13:51:19 UTC
Hi guys, 

I am unclear as to how this shakes out for 16.1. I have a customer who has two trackers on this BZ, and another 16.1 environment with the same issue.

If this is already fixed in a certain z-stream, please let me know.

Thanks and Cheers! 

/Freddy

Comment 129 John Eckersberg 2021-06-10 14:02:12 UTC
(In reply to Freddy Wissing from comment #128)
> Hi guys, 
> 
> I am unclear as to how this shakes out for 16.1. I have a customer who has
> two trackers on this BZ, and another 16.1 environment with the same issue.
> 
> If this is already fixed in a certain z-stream, please let me know.
> 
> Thanks and Cheers! 
> 
> /Freddy

https://bugzilla.redhat.com/show_bug.cgi?id=1913784 is the same issue for 16.1.  Should be fixed in z6 with erlang-21.3.8.20-1.el8ost from https://access.redhat.com/errata/RHBA-2021:2097.

Comment 133 errata-xmlrpc 2021-06-16 10:58:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2385

Comment 134 Luca Miccini 2021-06-25 08:13:34 UTC
*** Bug 1936542 has been marked as a duplicate of this bug. ***

Comment 135 Luca Miccini 2021-06-25 08:15:56 UTC
*** Bug 1859011 has been marked as a duplicate of this bug. ***

