Red Hat Bugzilla – Bug 1491160
qdrouterd segfault when processing bursts of goferd requests
Last modified: 2018-06-26 06:29:19 EDT
Description of problem:
When too many client connections (from goferd) attempt to be link-routed via qdrouterd on Satellite towards qpidd, the qdrouterd on the Satellite can segfault. Restarting the process can lead to another segfault soon after, since the pending goferd clients send another burst of link attach attempts to the just-started qdrouterd as soon as they can. Anyway, I was only able to reproduce these segfaults sporadically.

Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-16.el7.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64
libqpid-dispatch-0.4-22.el7sat.x86_64

How reproducible:
nondeterministic, from time to time

Steps to Reproduce (Satellite-free reproducer):
1. Have qpidd, one qdrouterd (named A) link-routing everything to it, and another qdrouterd (named B) link-routing everything via A (a configuration sketch follows below).
2. Have the python client I will post in the next update (just adjust CONNECTION_URL there). The script basically creates (via QMF) a queue in qpidd and tries to fetch messages from it via a consumer - and it does so concurrently in 30 threads.
3. Run:

while true; do
    if [ $((RANDOM%2)) -eq 0 ]; then
        service qdrouterd restart    # to restart the qdrouterd B
    fi
    sleep 1
    kill $(ps aux | grep simul | grep python | awk '{ print $2 }')
    for i in $(seq 1 10); do
        python simulate_first_goferd_run-threaded.py &
        sleep 0.1
    done
    date
    sleep 45
done

Actual results:
qdrouterd A segfaults from time to time, with logs like:

Wed Sep 13 09:54:36 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:00 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:06 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:08 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Wed Sep 13 09:55:34 2017 ROUTER_LS (info) Router Link Lost - link_id=0
2017-09-13T09:57:45.632091+02:00 pmoravec-sat62-rhel7 kernel: traps: qdrouterd[14234] general protection ip:7f1265e9a585 sp:7f1259878318 error:0 in libqpid-proton.so.2.0.0[7f1265e76000+4d000]
Segmentation fault (core dumped)

(i.e. the segfault happens some random time after the "Router Link Lost" events - these events are presumably caused by too many link attach requests exchanged between A and B, making the HELLO/RA messages time out)

Expected results:
No segfaults.
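For reference, the link-routing topology from step 1 can be expressed in qdrouterd configuration terms roughly as below. This is an illustrative sketch only, written in the qpid-dispatch 0.4 configuration style - router names, hosts and ports are placeholders, and the real Satellite/Capsule qdrouterd.conf additionally carries SSL/SASL settings; double-check the attribute names against the installed config before reusing it.

# Router A ("hub", in front of qpidd) - /etc/qpid-dispatch/qdrouterd.conf sketch
router {
    mode: interior
    router-id: router-a
}
listener {                 # inter-router listener for router B
    addr: 0.0.0.0
    port: 5646
    role: inter-router
}
connector {                # on-demand connection to the local qpidd broker
    name: broker
    addr: localhost
    port: 5671
    role: on-demand
    idle-timeout-seconds: 0
}
linkRoutePattern {         # link-route pulp.* (everything the clients use) to qpidd
    prefix: pulp.
    connector: broker
}

# Router B ("capsule side") sketch
router {
    mode: interior
    router-id: router-b
}
listener {                 # the goferd-like clients connect here
    addr: 0.0.0.0
    port: 5647
}
connector {                # inter-router link towards router A
    name: to-router-a
    addr: router-a.example.com
    port: 5646
    role: inter-router
}
linkRoutePattern {         # pulp.* gets link-routed via A towards qpidd
    prefix: pulp.
}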
Additional info:

Backtrace:

(gdb) bt
#0  0x00007fcc3d7fa585 in pn_connection_get_context (conn=0x6f6d70275b3d6e65) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
#1  0x00007fcc3da3ee61 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:937
#2  0x00007fcc3da4cc95 in router_link_attach_handler (context=0xedba20, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
#3  0x00007fcc3da3e11c in handle_link_open (container=<optimized out>, pn_link=0x7fcc282e4610) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
#4  process_handler (unused=<optimized out>, qd_conn=0x7fcc2800cb30, container=0xe80840) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:481
#5  handler (handler_context=0xe80840, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7fcc2800cb30) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:643
#6  0x00007fcc3da5095c in process_connector (cxtr=0x7fcc28010290, qd_server=0xede3d0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#8  0x00007fcc3d5c1e25 in start_thread (arg=0x7fcc311d9700) at pthread_create.c:308
#9  0x00007fcc3cb1734d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb)

(There are very few variants; once, frame #1 was already inside the qd_link_connection method.)

The connection object there is always gone while qdrouterd tries to access it:

(gdb) frame 2
#2  0x00007fcc3da4cc95 in router_link_attach_handler (context=0xedba20, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
1686        qd_connection_t *out_conn = qd_link_connection(peer_rlink->link);
(gdb) p peer_rlink->link->pn_link->session->connection
$1 = (pn_connection_t *) 0x6f6d70275b3d6e65
(gdb) p *(peer_rlink->link->pn_link->session->connection)
Cannot access memory at address 0x6f6d70275b3d6e65
(gdb)

(Note that the pointer value itself consists of printable ASCII bytes, i.e. the freed memory has apparently already been reused for string data.)

The thread is processing some event related to the connection to/from qpidd, since:

(gdb) frame 6
#6  0x00007fcc3da5095c in process_connector (cxtr=0x7fcc28010290, qd_server=0xede3d0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
398            events = qd_server->conn_handler(qd_server->conn_handler_context, ctx->context,
(gdb) p cxtr->name
$5 = "pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:5671", '\000' <repeats 136 times>, "\300\241\335<\314\177\000\000\241\f\001", '\000' <repeats 60 times>
(gdb)

The "$(hostname):5671" name is the connector to qpidd here (qpidd listens on 5671).

(Note that backtraces like this with a valid connection object appear regularly whenever there is an established link routing.)

Idea behind the reproducer and the backtrace:
- too many link attach requests congest the inter-router link, causing the HELLO/RA messages to time out and the link to drop
- meanwhile, while processing a request like the one above, the router still tries to use the just-dropped link/connection to router B
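To repeat the inspection above on a freshly captured core, something like the following can be used (a sketch: the core path is hypothetical, the frame number matches the backtrace above, and the qpid-dispatch / qpid-proton debuginfo packages have to be installed for the frames and locals to resolve):

# /path/to/core.qdrouterd is a placeholder for wherever the core lands
gdb /usr/sbin/qdrouterd /path/to/core.qdrouterd \
    -batch -ex 'bt' \
    -ex 'frame 2' \
    -ex 'p peer_rlink->link->pn_link->session->connection'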
Script simulate_first_goferd_run-threaded.py (the routers must link-route pulp.* to qpidd):

#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading

from proton import ConnectionException, Message, Timeout
from proton.utils import LinkDetached, ConnectionClosed, BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 0.2
THREADS = 30
CONNECTION_URL = "proton+amqp://pmoravec-caps62-rhel7.gsslab.brq2.redhat.com:5647"


class CreateQueueAndSubscribe(threading.Thread):
    def __init__(self):
        super(CreateQueueAndSubscribe, self).__init__()
        self.qName = 'pulp.agent.FAKE.%s' % uuid4()

    def run(self):
        # create the queue via a QMF method call to the broker
        qmf_conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=None)
        qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True,
                                           options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
        qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
        print self.qName
        content = {
            "_object_id": {"_object_name": "org.apache.qpid.broker:broker:amqp-broker"},
            "_method_name": "create",
            "_arguments": {"type": "queue", "name": self.qName,
                           "properties": {"exclusive": False, "auto-delete": False, "durable": True}}
        }
        properties = {
            'qmf.opcode': '_method_request',
            'x-amqp-0-10.app-id': 'qmf2',
            'method': 'request'
        }
        request = Message(body=content,
                          reply_to=qmf_rec.remote_source.address,
                          properties=properties,
                          correlation_id=unicode(uuid4()).encode('utf-8'),
                          subject='broker')
        qmf_snd.send(request)
        qmf_rec.receive()  # ignore what we have received as QMF response
        qmf_snd.close()
        qmf_rec.close()
        qmf_conn.close()

        # keep (re)subscribing to the queue, like goferd does
        while True:
            connected = False
            while not connected:
                try:
                    conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=10)
                    rec = conn.create_receiver(self.qName, name=str(uuid4()))
                    connected = True
                except:
                    pass
            while connected:
                try:
                    impl = rec.receive(3)
                except Timeout:
                    pass
                except (LinkDetached, ConnectionClosed):
                    connected = False


threads = []
for i in range(THREADS):
    threads.append(CreateQueueAndSubscribe())
    sleep(SLEEP)
    threads[i].start()

while True:
    sleep(10)
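While the script is running, a quick sanity check that the QMF queue creation is really happening is to count the reproducer queues on the qpidd host (a sketch; add connection/SSL options to qpid-stat as needed for your broker setup):

# the count should keep climbing while the client threads run, and drop only
# when the orphaned queues get cleaned up (see the later comments)
watch -n 5 "qpid-stat -q | grep -c 'pulp.agent.FAKE'"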
A much, much more reliable reproducer:

1) In the 1st terminal, kill and respawn the python scripts mimicking goferd:

ssh root@dell-per430-14.gsslab.pek2.redhat.com (beaker password)

while true; do
    kill $(ps aux | grep simulate | grep python | awk '{ print $2 }')
    for i in $(seq 1 30); do
        python simulate_first_goferd_run-threaded.py &
    done
    sleep $((RANDOM%10+25))
done

2) In the 2nd terminal, occasionally restart qdrouterd B (the one on the Capsule / between the clients and the "hub router"):

ssh root@dell-per430-14.gsslab.pek2.redhat.com (beaker password)

while true; do
    kill $(ps aux | grep qdr | grep -v grep | awk '{ print $2 }')
    sleep 1
    /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf &
    sleep $((30+RANDOM%30))
done

3) In the 3rd terminal, run the "hub router", the one that segfaults:

ssh root@dell-per430-32.gsslab.pek2.redhat.com (beaker password)

ulimit -c unlimited
/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf

4) In the 4th terminal, clean up orphaned qpidd queues (each goferd-mimicking python client creates its own queues and doesn't clean them up):

ssh root@dell-per430-32.gsslab.pek2.redhat.com (beaker password)

while true; do
    date
    ./delete_orphaned_queues.sh | wc    # to know the number of deleted queues
    qpid-stat -q | wc                   # to know the number of remaining queues
    sleep 90
done

5) Wait up to 10 minutes (usually approx. 2-3 minutes) for the qdrouterd in the 3rd terminal to segfault.

Some observations:

I *think* the cleanup is somehow necessary - or *some* activity in qpidd that keeps the broker busy. I.e. the trigger for the segfault is a busy qpidd broker that responds with some delay (?). Just a speculation: despite the connector to the qpidd broker having idle-timeout-seconds: 0, maybe the connection still gets timed out one way or another? That would explain why a busy qpidd triggers it, and also the missing connection object (timed out)?

Side observation: from time to time, the qdrouterd on the Capsule (2nd terminal) did not finish / survived "kill <PID>". That *can* prevent segfaults of the "hub router", be aware (see the sketch below). This "not kill-able" router is in fact in a deadlock - I will describe that in a separate BZ.
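A sketch of how the respawn loop in step 2 could be hardened against that deadlocked, kill-surviving router (same PID matching as above; whether SIGKILL changes the reproducibility in any other way is untested):

pid=$(ps aux | grep qdr | grep -v grep | awk '{ print $2 }')
if [ -n "$pid" ]; then
    kill $pid
    sleep 2
    # escalate to SIGKILL if the router is stuck in the deadlock described above
    kill -0 $pid 2>/dev/null && kill -9 $pid
fi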
Reproducer script from #c3 (posted here since the machine will be returned once a day):

#!/usr/bin/python
from time import sleep
from uuid import uuid4
import threading

from proton import ConnectionException, Message, Timeout
from proton.utils import LinkDetached, ConnectionClosed, BlockingConnection
from proton.reactor import DynamicNodeProperties

SLEEP = 0.1
THREADS = 20  # was 30
CONNECTION_URL = "proton+amqp://localhost:5647"


class CreateQueueAndSubscribe(threading.Thread):
    def __init__(self):
        super(CreateQueueAndSubscribe, self).__init__()
        self.qName = 'pulp.agent.FAKE.%s' % uuid4()

    def run(self):
        # create the queue via a QMF method call, retrying until it succeeds
        created = False
        while not created:
            try:
                qmf_conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=None)
                qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True,
                                                   options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
                qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
                print self.qName
                content = {
                    "_object_id": {"_object_name": "org.apache.qpid.broker:broker:amqp-broker"},
                    "_method_name": "create",
                    "_arguments": {"type": "queue", "name": self.qName,
                                   "properties": {"exclusive": False, "auto-delete": False, "durable": True}}
                }
                properties = {
                    'qmf.opcode': '_method_request',
                    'x-amqp-0-10.app-id': 'qmf2',
                    'method': 'request'
                }
                request = Message(body=content,
                                  reply_to=qmf_rec.remote_source.address,
                                  properties=properties,
                                  correlation_id=unicode(uuid4()).encode('utf-8'),
                                  subject='broker')
                qmf_snd.send(request)
                qmf_rec.receive()  # ignore what we have received as QMF response
                qmf_snd.close()
                qmf_rec.close()
                qmf_conn.close()
                created = True
            except:
                created = False
        # end of "while not created"

        # keep (re)subscribing to the queue, like goferd does
        while True:
            connected = False
            while not connected:
                try:
                    conn = BlockingConnection(CONNECTION_URL, ssl_domain=None, heartbeat=10)
                    rec = conn.create_receiver(self.qName, name=str(uuid4()))
                    connected = True
                except:
                    try:
                        sleep(0.3)
                        conn.close()
                    except:
                        pass
            while connected:
                try:
                    impl = rec.receive(3)
                except LinkDetached as e:
                    connected = False
                    print type(e)
                    try:
                        conn.close()
                    except:
                        pass
                except Timeout as e:
                    pass
                except (ConnectionClosed, ConnectionException) as e:
                    connected = False
                    print type(e)


threads = []
for i in range(THREADS):
    threads.append(CreateQueueAndSubscribe())
    sleep(SLEEP)
    threads[i].start()

while True:
    sleep(10)

(Just for completeness, delete_orphaned_queues.sh is:

for i in $(qpid-stat -q | grep FAKE | grep "0 1$" | awk '{ print $1 }'); do
    echo $i
    qpid-receive -a "$i; {delete:always}"
done
)
Improved & simplified reproducer from #3 (mainly to ensure the deadlocked router on the -14 server does not block the reproducer, and that ulimits are properly set):

1) 1st terminal on dell-per430-32.gsslab.pek2.redhat.com:
./satellite-delete-old-queues-regularly.sh

2) 2nd terminal on dell-per430-32.gsslab.pek2.redhat.com (here watch for segfaults):
./satellite-run-qdrouterd-to-segfault.sh

3) 1st terminal on dell-per430-14.gsslab.pek2.redhat.com:
./capsule-reproducer-qdrouter.sh

4) 2nd terminal on dell-per430-14.gsslab.pek2.redhat.com:
./capsule-reproducer-clients.sh

Here, watch that it prints lines like the following from time to time:

pulp.agent.FAKE.5d257d6f-fab5-442d-ab96-065b72dc5d1c
<class 'proton.utils.LinkDetached'>

or a few other exceptions. If not, something is broken, e.g. an unresponsive qdrouterd on the "capsule" / -14 server (see the check below).
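A sketch of such a responsiveness check for the capsule router (5647 is the client listener port used by the reproducer scripts; qdstat ships with the qpid-dispatch tools):

# run on the -14 server; if this hangs or errors out, the capsule qdrouterd is
# unresponsive (likely the deadlock mentioned earlier) and needs to be restarted
# before judging whether the hub router still segfaults
qdstat -b localhost:5647 -c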
Brad, If you are including Bug 1492355 in 6.2.14, you should target this bug as well, since they are both addressed in the same qpid-dispatch-0.4-27 build.
Mike, thanks for the feedback! I will update this one as well. Be aware that I am setting the TM to 6.2.14; however, prior to each z-stream release there is a triage meeting where the content for the z-stream gets selected.
VERIFIED on satellite-6.2.14-1.0.el7sat.noarch

I also tried the following:
- create a docker image with RHEL, with the katello-ca-consumer certificate and katello-agent installed (yet not registered to Satellite)
- as a startup script, run the subscription-manager registration and append a conditional loop that starts up the gofer daemon on some sort of trigger (I mounted an external dir and made the condition check for the presence of some file; a sketch of such a script is below)
- start up many containers (tried with 10, 30, 50)
- after all the containers are up and their registration has finished (verify by listing the content hosts in Satellite and checking that no more requests arrive at the /rhsm endpoint), pull the trigger (in my case, create the file) to break the waiting loop, so that goferd starts on all containers simultaneously
- observe that the number of pulp.agent* queues jumps by the number of running containers in a moment
- watch the logs for any errors - no errors detected
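For completeness, such a container startup script can be sketched as below. This is a hypothetical reconstruction, not the script actually used for the verification: the organization, activation key, mounted trigger path and the goferd invocation are placeholders/assumptions.

#!/bin/bash
# register the container as a content host (org/key are placeholders)
subscription-manager register --org="Example_Org" --activationkey="example-key"

# wait for the externally created trigger file, so that goferd starts on all
# containers at (roughly) the same moment
while [ ! -f /trigger/start-goferd ]; do
    sleep 5
done

# start the gofer daemon; "--foreground" is assumed here to keep the container's
# main process alive - on a systemd-enabled image "systemctl start goferd"
# would be used instead
exec goferd --foreground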
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:0273