Description of problem:

There have been ongoing RabbitMQ issues in this environment for some time. The issues impact all users with sporadic timeouts, which affects users' ability to use Horizon and various other OpenStack services.

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.15-4.el7ost.noarch
openstack-rabbitmq 13.0-89

How reproducible:
Customer is having intermittent issues; `pcs resource cleanup` resolves the issue, but it recurs a few weeks later.

Steps to Reproduce:

Actual results:
Erratic timeouts both for Horizon and for the CLI until `pcs resource cleanup` is run:

~~~
[heat-admin@controller-1~]$ sudo docker logs rabbitmq-bundle-docker-1
warning: qb_ipcs_event_sendv: new_event_notification (13-826665-17): Broken pipe (32)
warning: send_client_notify: Could not notify client proxy-cib_rw-826665-bde2726d/-351f-41d1-a9a2-f0b0bb5d22e4: Broken pipe | rc=-32
warning: qb_ipcs_event_sendv: new_event_notification (13-125380-17): Broken pipe (32)
warning: send_client_notify: Could not notify client proxy-cib_rw-125380-beafb4a6/-84a0-4e93-ae9f-e17cb4460031: Broken pipe | rc=-32
warning: qb_ipcs_event_sendv: new_event_notification (13-556755-17): Broken pipe (32)
warning: send_client_notify: Could not notify client proxy-cib_rw-556755-4cfa5985/-99b4-4af7-af09-7eccf1c45f45: Broken pipe | rc=-32
info: log_execute: executing - rsc:rabbitmq action:notify call_id:1777061
info: log_finished: finished - rsc:rabbitmq action:notify call_id:1777061 pid:1028799 exit-code:0 exec-time:876ms queue-time:0ms
info: cancel_recurring_action: Cancelling ocf operation rabbitmq_monitor_10000
info: log_execute: executing - rsc:rabbitmq action:stop call_id:1777063
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ DIAGNOSTICS ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ =========== ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ attempted to contact: ['rabbit@controller-1'] ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ rabbit@controller-1: ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ * connected to epmd (port 4369) on controller-1 ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ * epmd reports: node 'rabbit' not running at all ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ no other nodes on controller-1 ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ * suggestion: start the node ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ current node details: ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ - node name: 'rabbitmq-cli-03@controller-1' ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ - home dir: /var/lib/rabbitmq ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ - cookie hash: YpW5g/5GL2uHQF1pA== ]
notice: operation_finished: rabbitmq_stop_0:1028898:stderr [ ]
info: log_finished: finished - rsc:rabbitmq action:stop call_id:1777063 pid:1028898 exit-code:0 exec-time:7150ms queue-time:0ms
info: log_execute: executing - rsc:rabbitmq action:start call_id:1777074
~~~

From pcs status:

~~~
* rabbitmq_monitor_10000 on rabbitmq-bundle-1 'unknown error' (1): call=32, status=complete, exitreason='', last-rc-change='Sun Dec 1 23:58:22 2019', queued=0ms, exec=0ms
~~~

Errors in the sasl log:

~~~
=SUPERVISOR REPORT==== 2-Dec-2019::14:21:06 ===
     Supervisor: {<0.2450.246>,rabbit_connection_helper_sup}
     Context:    shutdown_error
     Reason:     noproc
     Offender:   [{pid,<0.2374.246>},
                  {name,heartbeat_sender},
                  {mfargs,
                      {rabbit_heartbeat,start_heartbeat_sender,
                          [#Port<0.217030>,60,#Fun<rabbit_reader.4.130174145>,
                           {heartbeat_sender,
                               <<"192.168.x.x:36625 -> 192.168.x.x:5672">>}]}},
                  {restart_type,transient},
                  {shutdown,30000},
                  {child_type,worker}]
~~~

Expected results:
rabbitmq to run without failing

Additional info:
From the OVN side it looks like a dup of: https://bugzilla.redhat.com/show_bug.cgi?id=1795697
There are a lot of private comments here due to potentially sensitive data contained within. Let me generalize and summarize the issue here so it's visible.

Problem Description
===================

The RabbitMQ cluster suffers from a partial partition. The Pacemaker resource agent is able to detect the partial partition and recover the cluster, although sometimes clients require a restart to restore functionality (this is a distinct issue that also needs to be investigated).

Root Cause Analysis
===================

I consistently see this sequence of events when the failure occurs:

- A quick burst of AMQP messages is published by OpenStack clients, usually nova-conductor: somewhere on the order of ~10 MB of message data sent over the wire in ~100 ms. It stands out on data throughput graphs relative to the normal traffic, but I don't think it's unreasonable, and the burst is very quick.

- A corresponding burst of traffic is seen on the RabbitMQ clustering connections between nodes, presumably as the above burst of messages is routed to the queue master processes and then replicated to all of the queue slaves.

- During this burst of traffic, the TCP receive window for one of the nodes goes to zero and remains at zero forever. This causes the connection to completely stall out, and is the reason the cluster partitions. Understanding why this happens is key to resolving the issue.

Next steps
==========

I suggest two parallel paths forward from here:

Workaround/mitigation
---------------------

Best guess with the information we have so far: this may have something to do with queue mirroring. I did notice this in the sosreports:

pcs_config:  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all","ha-promote-on-shutdown":"always"}"

This means that in a three-node cluster, every message is mirrored to all nodes. This is rather inefficient; it is fine to maintain only one backup for each message, because if you lose two nodes you are going to lose quorum anyway. We've changed the default in newer OSP to do less mirroring here:

https://review.opendev.org/#/c/585340/

So in the three-node case it will only maintain one slave instead of two, which cuts the replication load in half. I think it's worthwhile to make this change regardless and monitor how performance is impacted; there is a chance the problem resolves itself.

If the problem still persists with one slave, we could also test disabling queue mirroring entirely and monitoring again. There is a potential for message loss with queue mirroring disabled, but given the current state of things there is already a known and definitive amount of message loss happening. I think turning it off entirely would be a net improvement as far as operational risk goes.

Continued RCA
-------------

At this point what we really need is an automated way to detect that a cluster connection has stalled out and is stuck in zero-window probe, and then to gather sufficient debugging data from the system before it partitions and pacemaker restarts the resource. I am currently trying a few different things with scripts that monitor connection status (ss, /proc/net/tcp, etc.), or possibly using perf to probe the relevant functions in the TCP stack and detect when the connection window goes to zero. Then we trigger data collection, probably just a coredump of the erlang VM, so we can look through it and see where all the threads are and why seemingly nothing is reading from the clustering socket.
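To make the detection idea above concrete, here is a minimal sketch of the ss-based approach, assuming the stall is visible as a Recv-Q on the Erlang distribution sockets (port 25672) that never drains because the VM has stopped reading. The threshold, interval, and trigger action are placeholders to adapt:

~~~
#!/bin/bash
# Sketch: flag distribution connections whose Recv-Q stays backed up across
# consecutive samples (symptom of the receive window collapsing to zero).
THRESHOLD=3     # consecutive bad samples before alerting
INTERVAL=5      # seconds between samples
declare -A stuck

while sleep "$INTERVAL"; do
    # Default `ss -tn` columns: State Recv-Q Send-Q Local:Port Peer:Port
    while read -r recvq laddr raddr; do
        key="$laddr-$raddr"
        if [ "${recvq:-0}" -gt 0 ]; then
            stuck[$key]=$(( ${stuck[$key]:-0} + 1 ))
        else
            stuck[$key]=0
        fi
        if [ "${stuck[$key]}" -ge "$THRESHOLD" ]; then
            echo "$(date -Iseconds) possible stalled cluster link: $laddr <-> $raddr (Recv-Q=$recvq)"
            # placeholder: trigger data collection here (e.g. a coredump of beam.smp)
        fi
    done < <(ss -tn '( sport = :25672 or dport = :25672 )' | awk 'NR>1 {print $2, $4, $5}')
done
~~~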
Perhaps it also makes sense to switch the haproxy load-balancing strategy to 'leastconn' instead of 'roundrobin'? That might indirectly alleviate the network spikes by reducing "unrelated" network pressure on the node taking AMQP traffic.
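For illustration only, the suggested change amounts to something like the fragment below in haproxy.cfg. In an OSP deployment that file is generated by the deployment tooling, so in practice it would be changed through the templates rather than edited in place; showing it in the defaults section is just a sketch, it could equally be set per listen/backend block:

~~~
# hypothetical haproxy.cfg fragment
defaults
    balance leastconn   # instead of roundrobin
~~~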
I'm suspicious this is related:

https://github.com/erlang/otp/commit/b0018e78e6b87b40e93760bda715ae45f0776da7

commit b0018e78e6b87b40e93760bda715ae45f0776da7
Author: Raimo Niskanen <raimo>
Date:   Fri Apr 26 16:48:07 2019 +0200

    Copy linger zero flag from listen socket

From the top of the file...

#define TCP_ADDF_LINGER_ZERO 512 /* Discard driver queue on port close */

I don't think this is the root cause, but it seems that after a partition occurs, the old dead connections are hanging around, and this might be why.
More candidates:

https://github.com/erlang/otp/commit/9b24da12c766d15493d9c99be18f5a14498ec558

commit 9b24da12c766d15493d9c99be18f5a14498ec558
Author: John Högberg <john>
Date:   Mon Jul 10 08:26:55 2017 +0200

    Close TCP ports properly on send timeout

https://github.com/erlang/otp/commit/ae8d3bc9b32e95ce6f9a2da3b6ae961f04071c36

commit ae8d3bc9b32e95ce6f9a2da3b6ae961f04071c36
Author: John Högberg <john>
Date:   Thu Jun 1 10:13:51 2017 +0200

    Unconditionally clear IO buffers on send/shutdown errors

    This fixes a bug where a send/shutdown error on an active-mode socket
    results in the port never being properly closed.
I have my new-and-greatly-improved debugging script working, available here:

http://git.engineering.redhat.com/git/users/jeckersb/misc/plain/rmq-epoll

You can run this (as root) on any host that has a clustered RabbitMQ running. It does *not* need to run in a container context, and does *not* require SYS_PTRACE on any target container.

Before running, you will need to install the kernel-devel package for the currently running kernel, as well as python-bcc:

# yum install -y kernel-devel-$(uname -r) python-bcc

You should expect to see output similar to this when running:

[root@controller-0 ~]# ./rmq-epoll
192231 beam.smp
192333 1_scheduler
192334 2_scheduler
192335 3_scheduler
192336 4_scheduler
192352 0_poller
281 172.17.1.86:25672 <-> 172.17.1.123:59194
19 172.17.1.86:25672 <-> 172.17.1.126:53021
2020-06-16 15:28:45.787116 436925.566798 kernel tcp lport=25672 rport=53021 len=73 rcv_wnd=183296
2020-06-16 15:28:45.788021 436925.566829 kernel sock readable lport=25672 rport=53021
2020-06-16 15:28:45.788152 436925.567349 kernel tcp lport=25672 rport=59194 len=1779 rcv_wnd=183296
2020-06-16 15:28:45.788259 436925.567363 kernel sock readable lport=25672 rport=59194
2020-06-16 15:28:45.788377 436925.567492 kernel tcp lport=25672 rport=59194 len=73 rcv_wnd=182272
2020-06-16 15:28:45.788514 436925.567506 kernel sock readable lport=25672 rport=59194
2020-06-16 15:28:45.788612 436925.569325 1_scheduler epoll_wait ready fd=19
2020-06-16 15:28:45.789541 436925.569325 1_scheduler epoll_wait ready fd=281
2020-06-16 15:28:45.790014 436925.571782 2_scheduler recvfrom(19)=73
2020-06-16 15:28:45.790474 436925.571951 2_scheduler recvfrom(281)=1460
2020-06-16 15:28:45.790945 436925.571998 2_scheduler epoll_wait ready fd=281
2020-06-16 15:28:45.791408 436925.572066 2_scheduler recvfrom(281)=319
2020-06-16 15:28:45.791952 436925.572355 2_scheduler epoll_wait ready fd=281
2020-06-16 15:28:45.792408 436925.572392 2_scheduler recvfrom(281)=73
2020-06-16 15:28:45.792891 436925.575734 kernel tcp lport=25672 rport=59194 len=109 rcv_wnd=182272
2020-06-16 15:28:45.792996 436925.575773 kernel sock readable lport=25672 rport=59194
2020-06-16 15:28:45.793090 436925.575805 2_scheduler epoll_wait ready fd=281
2020-06-16 15:28:45.793617 436925.575843 2_scheduler recvfrom(281)=109
2020-06-16 15:28:45.794086 436925.576423 kernel tcp lport=25672 rport=59194 len=144 rcv_wnd=182272
2020-06-16 15:28:45.794189 436925.576456 kernel sock readable lport=25672 rport=59194
2020-06-16 15:28:45.794284 436925.577137 2_scheduler epoll_wait ready fd=281
2020-06-16 15:28:45.794738 436925.577182 2_scheduler recvfrom(281)=144
2020-06-16 15:28:45.795236 436925.578355 kernel tcp lport=25672 rport=59194 len=109 rcv_wnd=182272
2020-06-16 15:28:45.795340 436925.578394 kernel sock readable lport=25672 rport=59194
2020-06-16 15:28:45.795455 436925.578489 2_scheduler epoll_wait ready fd=281
2020-06-16 15:28:45.795927 436925.578536 2_scheduler recvfrom(281)=109

Some notes on what we're seeing here...

First we print out the thread IDs for the relevant erlang VM processes (the main pid, plus all scheduler and poller threads).

Next we print out the file descriptor numbers and addr:port for all established erlang clustering connections. We filter on these to only capture events related to the clustering connections.

Then the actual data collection.
This lets us observe the data flow across the major subsystems:

- First we get an inbound TCP packet
- Then the socket layer for the related TCP stream is flagged as readable
- Next the epoll_wait syscall returns to userspace to inform the erlang VM that the socket is ready to read
- Finally we see the erlang VM call recvfrom() on the socket to pull the data out of the kernel

Somewhere in this sequence of events is where our problem is occurring. Gathering this data should allow us to better isolate at which step the problem occurs, and to develop further probing as necessary until the root cause is identified.

One final note: this has the potential to generate a lot of output on a heavily-loaded production cluster, so it's probably best to capture it to disk compressed, as we did with the previous strace-based approach. Something similar to:

# ./rmq-epoll | tee /dev/tty | xz -c --flush-timeout 30000 > rmq-epoll.$(hostname -s).$(date -Iseconds).xz

This will still show output on the console (if desired) while compressing the stream in-flight onto disk for later retrieval and upload to the case.
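If the bcc script can't be run on a node for some reason, a rough cross-check of just the userspace half of the sequence (steps 3 and 4 above) is possible with perf trace. This is far less detailed (no kernel tcp/socket events and no fd filtering), and the PID is a placeholder for the beam.smp main PID:

~~~
# watch only the epoll_wait and recvfrom syscalls of the erlang VM for 60 seconds
timeout 60 perf trace -e epoll_wait,recvfrom -p <beam.smp pid>
~~~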
Here's what I'm seeing on the first failure with the new script. This is on controller-1.

This is the final packet that gets read correctly and completely (1460+1460+1460+1411 = 5791):

2020-06-16 22:15:49.321391 519470.848982 kernel tcp lport=52077 rport=25672 len=5791 rcv_wnd=2010624
2020-06-16 22:15:49.321407 519470.848986 kernel sock readable lport=52077 rport=25672
2020-06-16 22:15:50.097131 519470.849020 0_poller epoll_wait ready fd=74
2020-06-16 22:15:49.321419 519470.849031 3_scheduler recvfrom(74)=1460
2020-06-16 22:15:50.097259 519470.849041 0_poller epoll_wait ready fd=74
2020-06-16 22:15:49.321540 519470.849052 3_scheduler recvfrom(74)=1460
2020-06-16 22:15:50.097389 519470.849054 0_poller epoll_wait ready fd=74
2020-06-16 22:15:49.321661 519470.849062 3_scheduler recvfrom(74)=1460
2020-06-16 22:15:50.097517 519470.849065 0_poller epoll_wait ready fd=74
2020-06-16 22:15:49.321788 519470.849087 3_scheduler epoll_wait ready fd=74
2020-06-16 22:15:49.065143 519470.849100 2_scheduler recvfrom(74)=1411

First packet that doesn't get read normally:

2020-06-16 22:15:50.097784 519470.849405 kernel tcp lport=52077 rport=25672 len=173 rcv_wnd=2010624
2020-06-16 22:15:50.097798 519470.849407 kernel sock readable lport=52077 rport=25672
2020-06-16 22:15:50.097809 519470.849413 3_scheduler epoll_wait ready fd=74

Which then begins a big section where schedulers 1-4 poll fd 74 successfully in a tight loop, but no recvfrom... a sampling below:

2020-06-16 22:15:50.101010 519470.849805 3_scheduler epoll_wait ready fd=74
...
2020-06-16 22:15:49.079188 519470.886697 1_scheduler epoll_wait ready fd=74
...
2020-06-16 22:15:48.780837 519470.904228 2_scheduler epoll_wait ready fd=74
...
2020-06-16 22:15:49.041318 519471.118394 4_scheduler epoll_wait ready fd=74

Last scheduler poll is here:

2020-06-16 22:15:52.633398 519473.951640 3_scheduler epoll_wait ready fd=74

Eventually the poller thread gets in on the action, but only these two times:

2020-06-16 22:15:52.552202 519473.986179 0_poller epoll_wait ready fd=74
...
2020-06-16 22:16:03.936846 519485.940761 0_poller epoll_wait ready fd=74

Finally we eventually read, some time later (way too late). This is 32.3055 s elapsed since the last read:

2020-06-16 22:16:21.023404 519503.154588 8_scheduler recvfrom(74)=209

A few notes and initial thoughts:

- I've sorted this by the middle column, which is the timestamp generated in the kernel as events fire. The timestamp on the left is generated by python when it prints, so there is a delay and it doesn't necessarily retrieve events in the order that they were produced. So if you are looking at the leftmost time it will jump around a lot; this is expected.

- I'm guessing the final read happens *after* the net_tick timeout, and the system has probably forced a reconnect on the file descriptor already.

- We see some large TCP payloads, the largest being 62636:

2020-06-16 22:15:00.054606 519422.184989 kernel tcp lport=52077 rport=25672 len=62636 rcv_wnd=1879040

(This seems to be a function of the vlan MTU of 9000, which results in an MSS of 8948 once the headers are stripped away, so we get aggregated payloads that are a multiple of this. I don't presently understand the details of how this works in the kernel, but it's probably not super relevant.)

Note, however, that we limit the recvfrom length to 1460, which is the TCP MSS for a "normal" 1500-MTU packet. This is related to tunable buffers in the VM; see http://erlang.org/doc/man/inet.html, values for "buffer" and "recbuf".
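For reference only: if we ever wanted to experiment with those "buffer"/"recbuf" values for the distribution sockets, they would be passed via the kernel application's inet_dist_listen_options / inet_dist_connect_options (option names per the Erlang kernel/inet documentation). This is a hypothetical sketch, not a recommended change, and the values are arbitrary examples:

~~~
# hypothetical addition to /etc/rabbitmq/rabbitmq-env.conf; values are examples only
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="$RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS \
 -kernel inet_dist_listen_options [{buffer,98304},{recbuf,98304}] \
 -kernel inet_dist_connect_options [{buffer,98304},{recbuf,98304}]"
~~~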
- It looks like the problem occurs right after both the poller and a scheduler poll the fd back to back... these two lines:

2020-06-16 22:15:50.097517 519470.849065 0_poller epoll_wait ready fd=74
2020-06-16 22:15:49.321788 519470.849087 3_scheduler epoll_wait ready fd=74

Which makes me suspicious that this is some sort of race in moving file descriptors between pollsets. I'm pretty sure the schedulers share one pollset and the poller uses another one, but I need to re-read the code to verify how this all works.
Quick update here at the end of my day.

I've updated http://git.engineering.redhat.com/git/users/jeckersb/misc/plain/rmq-epoll which introduces two changes from the previous version:

- Improve instrumentation for the recvfrom syscall. I think we may have missed some events previously, depending on timings.

- Add tracing of epoll_ctl. This will allow us to watch as the distribution file descriptors move in and out of different pollsets.

There's a small chance this new version may improve our understanding, but I am reasonably sure the problem is somewhere inside of the erlang VM at this point. My next plan is to work on setting up uprobes at key places inside of the VM so we can start chasing the tracing into userspace.
(In reply to John Eckersberg from comment #66)
> My next plan is to work on setting up uprobes at key places inside of the VM so we can start chasing the tracing into userspace.

Unfortunately this won't work presently, because the RHEL7 kernel lacks https://github.com/torvalds/linux/commit/f0a2aa5a2a406d0a57aa9b320ffaa5538672b6c5, so we can't attach uprobes to binaries inside of docker containers.

[root@controller-0 tracing]# echo 'p:beam_entry /var/lib/docker/overlay2/a00f4ddbe02c82672d625736eb9eed8c4e5c7eb0a9e91b2ddfdc71701ac33e5d/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp:0x8d2e0' > uprobe_events
[root@controller-0 tracing]# echo 1 > events/uprobes/beam_entry/enable
-bash: echo: write error: Invalid argument
[root@controller-0 tracing]#

And from dmesg:

[694776.777753] event trace: Could not enable event beam_entry

But if I try the same thing on OSP16+RHEL8:

[root@messaging-0 tracing]# echo 'p:beam_entry /var/lib/containers/storage/overlay/12de157c2be6f3165330c6221ed34546f2eddac50a60e8582ad01d174f2f5fb7/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp:0x8d2e0' > uprobe_events
[root@messaging-0 tracing]# echo 1 > events/uprobes/beam_entry/enable
[root@messaging-0 tracing]#
I rebuilt the RHEL 7.8 kernel to include the above patch, and it gets the job done:

[root@controller-0 tracing]# uname -r
3.10.0-1127.15.1.el7.uprobe1.x86_64
[root@controller-0 tracing]# echo 'p:beam_entry /var/lib/docker/overlay2/0648b867267cc4654efec944c917b7a13e114c5264d6e496b6162d1f5d5c2c59/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp:0x8d2e0' > uprobe_events
[root@controller-0 tracing]# echo 1 > events/uprobes/beam_entry/enable
[root@controller-0 tracing]#

So now I can successfully do:

[root@controller-0 ~]# perf probe -x /var/lib/docker/overlay2/0648b867267cc4654efec944c917b7a13e114c5264d6e496b6162d1f5d5c2c59/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp 'erts_poll_control pollset=%di fd=%si op=%dx events=%cx'
Added new event:
  probe_beam:erts_poll_control (on erts_poll_control in /var/lib/docker/overlay2/0648b867267cc4654efec944c917b7a13e114c5264d6e496b6162d1f5d5c2c59/merged/usr/lib64/erlang/erts-10.3.5.12/bin/beam.smp with pollset=%di fd=%si op=%dx events=%cx)

You can now use it in all perf tools, such as:

    perf record -e probe_beam:erts_poll_control -aR sleep 1

[root@controller-0 ~]# perf record -e probe_beam:erts_poll_control -p 29406 -- sleep 30
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.363 MB perf.data (72 samples) ]
[root@controller-0 ~]# perf script | head
3_scheduler 29519 [002] 1976.175666: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x0 events=0x1
3_scheduler 29519 [002] 1976.176138: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
3_scheduler 29519 [000] 1976.193570: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
3_scheduler 29519 [000] 1976.193793: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
3_scheduler 29519 [000] 1976.193880: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x2 events=0x0
1_scheduler 29517 [002] 1985.416328: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x44 op=0x1 events=0x0
1_scheduler 29517 [002] 1985.416889: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x44 op=0x1 events=0x1
1_scheduler 29517 [002] 1985.418074: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x0 events=0x1
1_scheduler 29517 [002] 1985.418772: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x0
1_scheduler 29517 [002] 1985.419125: probe_beam:erts_poll_control: (55b0a009e680) pollset=0x7f384b940ac0 fd=0x13 op=0x1 events=0x1
[root@controller-0 ~]#

I'll see if we can get this added for the RHEL7 kernel.
(In reply to John Eckersberg from comment #68)
> I'll see if we can get this added for the RHEL7 kernel.

https://bugzilla.redhat.com/show_bug.cgi?id=1851177
Something else to try, given this smells like a race condition...

Related to bug 1848705, which I spun off due to the previous changes we made for the +zdbbl option: we are also overwriting the default scheduler option +stbt, which controls scheduler CPU affinity. In the current (inadvertently overwritten) state, the +stbt option is unset, which means there is no CPU affinity and the erlang scheduler threads are allowed to migrate freely across CPUs.

Let's try updating RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS in rabbitmq-env.conf to include:

+stbt db

which is the normal default value in rabbitmq. The 'db' setting means default_bind, which defaults to thread_no_node_processor_spread, and basically binds schedulers intelligently by considering NUMA nodes and hyperthreading. This has the potential to (a) eliminate a source of race conditions, since schedulers will no longer migrate between CPUs, and (b) increase performance due to better cache usage and fewer context switches.
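A sketch of what the env-file change could look like; the +zdbbl value shown is illustrative only, the point being to append +stbt db to whatever flags are already set rather than replace them:

~~~
# /etc/rabbitmq/rabbitmq-env.conf (illustrative; keep the existing flags)
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+zdbbl 512000 +stbt db"
~~~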
Unfortunately the latest script output tipped me off to the fact that I have been developing against a slightly different container image. The uprobes attached by symbol work fine, but those attached at an offset from the beginning of a function are not firing because of the version mismatch. Fortunately, this is an easy fix. I will provide an updated script with corrected offsets and we can recapture once more.
https://bugs.erlang.org/browse/ERL-1213 is a possible root cause. The erlang version in OSP 16.1 contains this fix but the version in 16.0 does not. This sounds very similar to what we are seeing.
(In reply to John Eckersberg from comment #77)
> https://bugs.erlang.org/browse/ERL-1213 is a possible root cause. The
> erlang version in OSP 16.1 contains this fix but the version in 16.0 does
> not. This sounds very similar to what we are seeing.

Let's try the latest image from the 16.1 Beta:

https://catalog.redhat.com/software/containers/rhosp-beta/openstack-rabbitmq/5cdcaa77dd19c778293bdd47

which contains erlang-erts-21.3.8.16-1.el8ost.x86_64, and that package contains the fix for the above bug. If it's still broken after that we can continue troubleshooting, but I think it's highly likely that this fixes the problem completely.
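After swapping the image, one quick way to confirm which erts build a node is actually running (the container name here is an assumption; on 16.x it is typically one of the podman bundle replicas):

~~~
podman exec rabbitmq-bundle-podman-0 rpm -q erlang-erts
~~~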
Erlang 21.3.8.17 was released on Wednesday and contains this fix:

https://github.com/erlang/otp/commit/2361484707d64c3cccb6e14df67d8b0ca0b9babf

commit 2361484707d64c3cccb6e14df67d8b0ca0b9babf
Author: Ingela Anderton Andin <ingela>
Date:   Tue Jun 23 17:53:35 2020 +0200

    ssl: Correct flow control handling of passive sockets when packet mode is used

Which I am almost positive is our problem. I was already tracing the same code before pulling the latest release and seeing this commit.
(In reply to John Eckersberg from comment #79)
> Erlang 21.3.8.17 was released on Wednesday and contains this fix:
>
> https://github.com/erlang/otp/commit/2361484707d64c3cccb6e14df67d8b0ca0b9babf
>
> commit 2361484707d64c3cccb6e14df67d8b0ca0b9babf
> Author: Ingela Anderton Andin <ingela>
> Date:   Tue Jun 23 17:53:35 2020 +0200
>
>     ssl: Correct flow control handling of passive sockets when packet mode
>     is used
>
> Which I am almost positive is our problem. I was already tracing the same
> code before pulling the latest release and seeing this commit.

Unfortunately, backporting this change to OSP13 is not straightforward, as the SSL module was completely rewritten from gen_fsm to gen_statem for OTP 19. Thus the delta between the erlang versions in OSP 13 and 16 is very large, and the code flow is completely different.
Hm, rabbitmq-server 3.6.x does *not* boot on OTP 21, so this will require a bit more than just an updated erlang.
Putting this back to ASSIGNED.

We're going to try to rebase to erlang 23 instead. There was a pretty severe performance issue with 21 when I was benchmarking it and trying to reproduce the original issue.
*** Bug 1880416 has been marked as a duplicate of this bug. ***
Ok let's try this again... Should be fixed by erlang-23.0.4-1.el7ost - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1326337 Will also require a rebased rabbitmq as the current version will not boot on this new erlang build.
I am going to request this as an exception for OSP13z15, in order to rebase to the latest version from OSP16 (>=21.3.8.18). That version fixes the TLS issue described here as well as a related issue in the VM implementation which breaks TCP connections. It also requires updating a much smaller set of packages as compared to rebasing to 23.x.
erlang-21.3.8.20-1.el7ost - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1496012
Hi guys,

I am unclear as to how this shakes out for 16.1. I have a customer who has two trackers on this BZ, and another 16.1 environment with the same issue.

If this is already fixed in a certain z-stream, please let me know.

Thanks and Cheers!

/Freddy
(In reply to Freddy Wissing from comment #128)
> Hi guys,
>
> I am unclear as to how this shakes out for 16.1. I have a customer who has
> two trackers on this BZ, and another 16.1 environment with the same issue.
>
> If this is already fixed in a certain z-stream, please let me know.
>
> Thanks and Cheers!
>
> /Freddy

https://bugzilla.redhat.com/show_bug.cgi?id=1913784 is the same issue for 16.1. It should be fixed in z6 with erlang-21.3.8.20-1.el8ost from https://access.redhat.com/errata/RHBA-2021:2097.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2385
*** Bug 1936542 has been marked as a duplicate of this bug. ***
*** Bug 1859011 has been marked as a duplicate of this bug. ***