Bug 1774378 - qdrouter might segfault when goferd drops the connection just when it times out
Summary: qdrouter might segfault when goferd drops the connection just when it times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 6.9.0
Assignee: Ganesh Murthy
QA Contact: Jitendra Yejare
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-20 08:06 UTC by Hao Chang Yu
Modified: 2024-03-25 15:31 UTC
CC List: 25 users

Fixed In Version: qpid-proton-0.28.0-4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1933149
Environment:
Last Closed: 2021-04-21 13:12:13 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
qpid-proton-c-0.28.0-4.el7.x86_64.rpm (186.10 KB, application/x-rpm)
2021-02-16 17:03 UTC, Mike McCune
no flags Details
python-qpid-proton-0.28.0-4.el7.x86_64.rpm (209.54 KB, application/x-rpm)
2021-02-16 17:03 UTC, Mike McCune
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4633101 0 None None None 2019-12-04 09:22:57 UTC

Description Hao Chang Yu 2019-11-20 08:06:53 UTC
Description of problem:
Stopping the Qpid broker on the Satellite while doing an offline backup may cause a segfault in the qdrouterd running on the Capsules.

In the Capsule qdrouterd logs we saw hundreds of the following errors (one per content host connected to the qdrouterd) after stopping the Qpid broker on the Satellite. These messages are expected, since the Qpid broker is stopped.
---------------------------
qdrouterd: 2019-11-06 23:32:41.964684 +0100 SERVER (info) [11111]: Connection from xx.xx.xx.xx:xxxxx (to :5647) failed: proton:io Connection reset by peer - disconnected :5672 (SSL Failure: Unknown error)
qdrouterd: 2019-11-06 23:32:41.989879 +0100 SERVER (info) [22222]: Connection from yy.yy.yy.yy:yyyyy (to :5647) failed: proton:io Connection reset by peer - disconnected :5672 (SSL Failure: Unknown error)
<snip>
--------------------------


After that, the Capsules' qdrouterd crashes with the following error:
---------------------------
qdrouterd: *** Error in `/usr/sbin/qdrouterd': double free or corruption (out): 0x00007f0b1fd73890 ***
---------------------------

Comment 8 momran 2019-12-02 12:40:17 UTC
Hi,

Is there a workaround for this bug?

Thanks and regards,

MOHAMED M. OMRAN
TECHNICAL SUPPORT ENGINEER

Comment 9 Pavel Moravec 2019-12-03 12:29:53 UTC
Sounds like https://issues.apache.org/jira/browse/PROTON-1727 / https://issues.apache.org/jira/browse/DISPATCH-902 but that is fixed in qpid-proton 0.28 :-/

Comment 10 Pavel Moravec 2019-12-03 13:25:59 UTC
(gdb) frame 3
#3  pconnection_process (pc=0x65bf170, events=events@entry=0, timeout=timeout@entry=true, topup=topup@entry=false, is_io_2=is_io_2@entry=false)
    at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:1164
1164	  bool closed = pconnection_rclosed(pc) && pconnection_wclosed(pc);
(gdb) p pc->driver
$8 = {connection = 0x0, transport = 0x0, collector = 0x0, batch = {next_event = 0x0}}
(gdb) p pc->psocket->sockfd 
$9 = 1109870520
(gdb) 


Sounds like we are epoll-ing an event on a proton connection that is already closed..?


(gdb) frame 4
#4  0x00007fb442ea01ae in proactor_do_epoll (p=0x1dbf540, can_block=can_block@entry=true) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2141
2141	          batch = pconnection_process(pc, 0, true, false, false);
(gdb) p *ee
$11 = {psocket = 0x65bf170, fd = -1, type = PCONNECTION_TIMER, wanted = 1, polling = false, barrier_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, 
      __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}

So a timer-based event for a just-closed connection..?

Theoretical reproducer (a minimal timing sketch follows below):
- goferd (or even qdrouterd on Sat) establishes a connection with AMQP heartbeats (default in goferd)
- no heartbeat is received in time
- around the time the heartbeat expires (or when the first one is expected?), the client closes the connection
- meanwhile, qdrouterd wants to process the timer event and fails?

Comment 11 Pavel Moravec 2019-12-03 20:39:26 UTC
Reproducer outside Satellite:

0) have these packages:
qpid-proton-c-0.28.0-1.el7.x86_64
qpid-dispatch-router-1.5.0-4.el7.x86_64
qpid-cpp-server-1.36.0-24.el7amq.x86_64

1) Have qpidd listening on 5672 with 10 auxiliary queues:

for i in $(seq 0 9); do qpid-config add queue pulp.agent.TEST.$i --durable; done


2) Have qdrouterd link routing pulp.* to qpidd, with config (just relevant sections shown):

connector {
    name: broker
    host: localhost
    port: 5672
    sasl-mechanisms: ANONYMOUS
    role: route-container
    idle-timeout-seconds: 0
}

linkRoute {
    prefix: pulp.
    dir: out
    connection: broker
}

listener {
    port: 5648
    sasl-mechanisms: ANONYMOUS
    idle-timeout-seconds: 1
}

listener {
    port: 5646
    sasl-mechanisms: ANONYMOUS
    role: inter-router
}


3) Have another qdrouterd (so far reproduced just with this one only) that link-routes everything to the first one:

connector {
    name: hub
    host: hostname-of-previous-qdrouterd
    port: 5646
    sasl-mechanisms: ANONYMOUS
    role: inter-router
    idle-timeout-seconds: 0
}

linkRoute {
    prefix: pulp.
    dir: in
}
linkRoute {
    prefix: pulp.
    dir: out
}

listener {
    port: 5649
    sasl-mechanisms: ANONYMOUS
    idle-timeout-seconds: 1
}


4) Use the script below to simulate a client connecting to the 2nd qdrouterd that creates receivers to pulp.agent.* addresses, ignores heartbeats and drops the connection 2 seconds later (just when qdrouterd is about to drop the timed-out connection) - just update the hostname in the "connect" call:

----------------8<-------------[ cut here ]------------------

import socket
import sys
import time
import random

RECV_SLEEP = 0.005

RECEIVER_MIN = 1
RECEIVER_MAX = 1000
SLEEP_AFTER = 2
SLEEP_AFTER_VAR = 0.2
CYCLES = 10
PORT = 5647
REC_ID = 1

if len(sys.argv) > 1:
    RECEIVER_MIN = int(sys.argv[1])
if len(sys.argv) > 2:
    RECEIVER_MAX = int(sys.argv[2])
if len(sys.argv) > 3:
    SLEEP_AFTER = float(sys.argv[3])
if len(sys.argv) > 4:
    CYCLES = int(sys.argv[4])
if len(sys.argv) > 5:
    PORT = int(sys.argv[5])
if len(sys.argv) > 6:
    REC_ID = int(sys.argv[6])%10
RECEIVERS = RECEIVER_MAX-RECEIVER_MIN+1

def int_to_hexstream(i, digits=4):
    # Hex-encode the ASCII decimal representation of i, left-padded with
    # ASCII '0' (0x30) to 'digits' characters.
    # E.g. int_to_hexstream(42) == "30303432", i.e. the bytes "0042".
    out = "3" + "3".join(list(str(i)))
    spc = "30" * ((2*digits-len(out))/2)
    out = spc + out
    return out

class Send:
    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.connect(("hostname-of-second-qdrouterd", PORT))
        self.pending_ack = 0

    def send_recv(self, hexstream, _len, recv_now=True):
        self.sock.send(bytearray.fromhex(hexstream))
        if recv_now:
            self.sock.recv(self.pending_ack)
            self.pending_ack = 0
            time.sleep(RECV_SLEEP)
        else:
            self.pending_ack += _len
        #print "%s %s" % (_len, self.pending_ack)

    def close(self):
        self.sock.close()

for cycle in range(0,CYCLES):
    #print("starting cycle #%s" % (cycle+1))
    snd = Send()
    # Protocol Header
    snd.send_recv('414d515003010000',51)
    # sasl.init
    snd.send_recv('0000005102010000005341d00000004100000002a309414e4f4e594d4f5553a030616e6f6e796d6f757340706d6f72617665632d7268656c37362e6773736c61622e627271322e7265646861742e636f6d', 22)
    # Protocol Header + open
#    snd.send_recv('414d5150000100000000007502000000005310d0000000650000000aa12433663438653730392d633836642d346232342d396435352d616230346365636332353666a12b706d6f72617665632d6361707336352d726865762e6773736c61622e627271322e7265646861742e636f6d40607fff700000c3504040404040', 227)
    snd.send_recv('414d5150000100000000007002000000005310d00000006000000005a12436666135306464642d386166362d343033642d623135342d613033636135303736613733a12b706d6f72617665632d6361707336352d726865762e6773736c61622e627271322e7265646861742e636f6d40607fff7000001388', 227)
    # begin
    snd.send_recv('0000002102000000005311d00000001100000004405200707fffffff707fffffff', 35)

    for i in range(cycle*RECEIVERS + RECEIVER_MIN, cycle*RECEIVERS + RECEIVER_MAX+1):
        # attach and flow
        snd.send_recv('0000007a02000000005312d00000006a0000000ba11470756c702e6167656e742e'+int_to_hexstream(REC_ID)+'2e'+int_to_hexstream(i)+'70'+'{:08x}'.format(i-RECEIVER_MIN)+'4150025000005328d0000000230000000ba11170756c702e6167656e742e544553542e'+int_to_hexstream(REC_ID, 1)+'520040520042404040404040005329d00000000d00000007405200405200424040404052005300'+'0000002d02000000005313d00000001d000000095200707fffffff5200707fffffff700000'+'{:04x}'.format(i-RECEIVER_MIN)+'520052014042', 142 if (i-RECEIVER_MIN)>255 else 122, False)

    print "after %s cycle, sent %s attaches" % (cycle+1, (cycle+1)*RECEIVERS)
    time.sleep(random.uniform(SLEEP_AFTER-SLEEP_AFTER_VAR, SLEEP_AFTER+SLEEP_AFTER_VAR))
    snd.close()

print "sent all %s attaches" % CYCLES*RECEIVERS

----------------8<-------------[ cut here ]------------------


5) Run this client concurrently:
while true; do
  for i in $(seq 1 10); do
    python bytearray_create_receiver_kill.py $i $i 1 100 5649 $i &
    sleep 0.1
  done
  time wait
done


6) Wait a few minutes for the segfault of the 2nd qdrouterd.

Comment 12 Pavel Moravec 2019-12-03 20:55:10 UTC
OK, even the first qdrouterd (on Satellite) can segfault the same way, just (probably) with lower probability.

Comment 13 Pavel Moravec 2019-12-03 21:20:10 UTC
A few possible backtraces can happen following this reproducer:

(gdb) bt
#0  0x00007fdddd0a42c7 in raise () from /lib64/libc.so.6
#1  0x00007fdddd0a59b8 in abort () from /lib64/libc.so.6
#2  0x00007fddde05d2a2 in stop_polling (epollfd=4, ee=0x218a8c0) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:312
#3  0x00007fddde05dc96 in stop_polling (ee=0x218a8c0, epollfd=4) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:315
#4  0x00007fddde05e636 in pconnection_cleanup (pc=0x218a8b0) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:896
#5  0x00007fddde05f749 in pconnection_process (pc=0x218a8b0, events=events@entry=0, timeout=timeout@entry=true, topup=topup@entry=false, is_io_2=is_io_2@entry=false)
    at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:1272
#6  0x00007fddde0601ae in proactor_do_epoll (p=0x1f34400, can_block=can_block@entry=true) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2141
#7  0x00007fddde0611ea in pn_proactor_wait (p=<optimized out>) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2156
#8  0x00007fddde5071f5 in thread_run (arg=0x1f38d10) at /usr/src/debug/qpid-dispatch-1.5.0/src/server.c:995
#9  0x00007fdddde45dd5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fdddd16c02d in clone () from /lib64/libc.so.6
(gdb) 

or:

(gdb) bt
#0  0x00007fe85cd332c7 in raise () from /lib64/libc.so.6
#1  0x00007fe85cd349b8 in abort () from /lib64/libc.so.6
#2  0x00007fe85cd75e17 in __libc_message () from /lib64/libc.so.6
#3  0x00007fe85cd7e609 in _int_free () from /lib64/libc.so.6
#4  0x00007fe85dcee749 in pconnection_process (pc=0x7fe834012700, events=events@entry=0, timeout=timeout@entry=true, topup=topup@entry=false, is_io_2=is_io_2@entry=false)
    at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:1272
#5  0x00007fe85dcef1ae in proactor_do_epoll (p=0x144f400, can_block=can_block@entry=true) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2141
#6  0x00007fe85dcf01ea in pn_proactor_wait (p=<optimized out>) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2156
#7  0x00007fe85e1961f5 in thread_run (arg=0x1453d10) at /usr/src/debug/qpid-dispatch-1.5.0/src/server.c:995
#8  0x00007fe85dad4dd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fe85cdfb02d in clone () from /lib64/libc.so.6
(gdb) 

or:

(gdb) bt
#0  pn_transport_tail_closed (transport=0x0) at /usr/src/debug/qpid-proton-0.28.0/c/src/core/transport.c:3102
#1  0x00007fefd1de9e79 in pn_connection_driver_read_closed (d=d@entry=0x7fefa4070e68) at /usr/src/debug/qpid-proton-0.28.0/c/src/core/connection_driver.c:109
#2  0x00007fefd1bcb641 in pconnection_rclosed (pc=0x7fefa40708c0) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:962
#3  pconnection_process (pc=0x7fefa40708c0, events=events@entry=0, timeout=timeout@entry=true, topup=topup@entry=false, is_io_2=is_io_2@entry=false)
    at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:1164
#4  0x00007fefd1bcc1ae in proactor_do_epoll (p=0x1e66400, can_block=can_block@entry=true) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2141
#5  0x00007fefd1bcd1ea in pn_proactor_wait (p=<optimized out>) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2156
#6  0x00007fefd20731f5 in thread_run (arg=0x1e6ad10) at /usr/src/debug/qpid-dispatch-1.5.0/src/server.c:995
#7  0x00007fefd19b1dd5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fefd0cd802d in clone () from /lib64/libc.so.6
(gdb) 

or:

(gdb) bt
#0  0x00007f1fe54a32c7 in raise () from /lib64/libc.so.6
#1  0x00007f1fe54a49b8 in abort () from /lib64/libc.so.6
#2  0x00007f1fe645c239 in read_uint64 (fd=<optimized out>) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:221
#3  0x00007f1fe645c965 in read_uint64 (fd=<optimized out>) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:224
#4  0x00007f1fe645c9ad in ptimer_callback (pt=pt@entry=0x7f1fbc06a6b8) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:234
#5  0x00007f1fe645ead3 in pconnection_process (pc=0x7f1fbc06a1a0, events=events@entry=0, timeout=timeout@entry=true, topup=topup@entry=false, is_io_2=is_io_2@entry=false)
    at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:1109
#6  0x00007f1fe645f1ae in proactor_do_epoll (p=0x2296400, can_block=can_block@entry=true) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2141
#7  0x00007f1fe64601ea in pn_proactor_wait (p=<optimized out>) at /usr/src/debug/qpid-proton-0.28.0/c/src/proactor/epoll.c:2156
#8  0x00007f1fe69061f5 in thread_run (arg=0x229ad10) at /usr/src/debug/qpid-dispatch-1.5.0/src/server.c:995
#9  0x00007f1fe6244dd5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f1fe556b02d in clone () from /lib64/libc.so.6
(gdb)

Comment 14 Pavel Moravec 2019-12-04 08:13:12 UTC
Changing the subject, which was misleading. qpidd activity should not have a big impact on the segfault (and it is definitely not the main trigger).

Comment 21 Mike McCune 2021-02-16 17:02:26 UTC
*** Satellite 6.8 Hotfix Available ***

1) Download:

python-qpid-proton-0.28.0-4.el7.x86_64.rpm
qpid-proton-c-0.28.0-4.el7.x86_64.rpm

from this Bugzilla to your Satellite or Capsule server.
2) Stop services:

satellite-maintain service stop

3) Install:

rpm -Uvh python-qpid-proton-0.28.0-4.el7.x86_64.rpm qpid-proton-c-0.28.0-4.el7.x86_64.rpm

4) Restart:

satellite-maintain service start

5) Resume operations

Comment 22 Mike McCune 2021-02-16 17:03:13 UTC
Created attachment 1757304 [details]
qpid-proton-c-0.28.0-4.el7.x86_64.rpm

Comment 23 Mike McCune 2021-02-16 17:03:41 UTC
Created attachment 1757305 [details]
python-qpid-proton-0.28.0-4.el7.x86_64.rpm

Comment 24 Jitendra Yejare 2021-03-10 07:28:36 UTC
What could be the exact steps to test this issue, and what is the expected behavior?

Or are the steps mentioned in Comment 11 enough to test it?

Comment 25 Pavel Moravec 2021-03-10 08:16:21 UTC
(In reply to Jitendra Yejare from comment #24)
> What could be the exact steps to test this issue and the expected behavior ?
> 
> Or the steps mentioned in Comment 11 are enough to test it ?

The bug is based on hitting a race condition. The reproducer outside Satellite in #c11 hits the problematic scenario with a much higher probability than one would achieve by playing with goferd. So #c11 is imho the best reproducer test for the bugfix, and it is sufficient for the way goferd uses the library.

Optional reproducer inside Satellite:

1) Restart goferd, and once it logs:

Mar 10 09:01:42 pmoravec-rhel78 goferd: [INFO][pulp.agent.c23d0b57-c3d1-476e-949b-13d3389e95bf] gofer.messaging.adapter.connect:30 - connected: proton+amqps://pmoravec-caps68-rhev.gsslab.brq2.redhat.com:5647

start dropping outgoing packets to port 5647 for 16(*) seconds (iptables -A OUTPUT -p tcp --dport 5647 -j DROP). Optionally, take a tcpdump and notice when goferd sends a PSH packet to qdrouterd (that will be a heartbeat), and start blocking the outgoing traffic from that moment. "Simply" put, you have to block the traffic for exactly 16 seconds since the latest activity seen on the wire (either the connection being established or a heartbeat sent).

2) After the 16 seconds, stop goferd and remove the iptables rule concurrently - ideally so that the TCP closure from goferd hits qdrouterd just when it wants to drop the connection itself.

(*) I *think* it should be 15.98s instead of 16 to really hit the race condition on the qdrouterd side - one needs to tune this.

3) Repeat (or scale) the test to hit the race probability (a rough automation sketch follows below).

Comment 26 Pavel Moravec 2021-03-10 08:29:29 UTC
(In reply to Pavel Moravec from comment #25)
> (In reply to Jitendra Yejare from comment #24)
> > What could be the exact steps to test this issue and the expected behavior ?
> > 
> > Or the steps mentioned in Comment 11 are enough to test it ?
> 
> The bug is based on hitting a race condition. The reproducer outside
> Satellite in #c11 does hit the problematic scenario with much higher
> probability, than one would achieve via playing with goferd. So #c11 is imho
> the best reproducer test for the bugfix, which is sufficient for goferd
> usage of the library.
> 
> Optional reproducer inside Satellite:
> ..

Here, to diagnose what happened on the wire and when, it might be useful to follow https://access.redhat.com/solutions/1594323 .

Comment 27 Jitendra Yejare 2021-03-18 10:40:36 UTC
Verified!

@ Satellite 6.9.0 snap 9.0


Steps:
------------

Referring to Comment 11:

0. Disable SSL in both the qdrouterd and qpidd confs.
1. Create 10 custom queues.
2. Updated the Satellite qdrouterd (as qdrouterd 1) to point to port 6571 and other settings as per comment 11.
3. Create another qdrouterd config in /etc/qpid-dispatch as qdrouterd.caps.conf (asked pmoravec for it), or just create the same one as described in comment 11 point 3.
4. Run the new qdrouterd.
# /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.caps.conf
5. Create the custom script from comment 11 point 4.
6. Run the script in a loop.
7. Check that the custom qpid queues are populated with at least 1 connection.
# qpid-stat -q
8. Wait for 10 minutes to see if any errors/segfaults appear in any qdrouterd process.


Observation:
--------------

1. No segfault was observed after waiting for 10 minutes.

Comment 30 errata-xmlrpc 2021-04-21 13:12:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1313

