Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 873756

Summary: python/c++ high level API client receiver.fetch(timeout=N) blocks forever in select() if broker is stopped (kill -STOP).
Product: Red Hat Enterprise MRG
Component: Messaging_Programming_Reference
Reporter: Frantisek Reznicek <freznice>
Assignee: Nobody <nobody>
Status: CLOSED UPSTREAM
QA Contact: Messaging QE <messaging-qe-bugs>
Severity: high
Priority: medium
Version: Development
CC: jross, pmoravec, zkraus
Target Milestone: 3.2
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2025-02-10 03:21:05 UTC

Attachments:
Automated tests failures after applying the patch (flags: pmoravec: review? (dpierce))

Description Frantisek Reznicek 2012-11-06 16:16:21 UTC
Description of problem:

The Python high-level API client receiver.fetch(timeout=N) blocks forever in select() if the broker is stopped.

During verification of bug 856299 I found that qpid.messaging.Receiver.fetch(timeout=N) blocks program execution forever if qpidd is stopped via kill -STOP `pidof qpidd` just before the fetch call.

Seen consistently across all supported RHEL variants.

Version-Release number of selected component (if applicable):
python-qpid-0.14-11.el5
python-qpid-0.18-4.el5

How reproducible:
100%

Steps to Reproduce:
1. service qpidd restart
2. python  # interactive shell, terminal A
  import qpid.messaging
  c = qpid.messaging.Connection('localhost')
  c.open()
  s = c.session()
  S = s.sender("ADDR; {create: sender}")
  S.send("sdfdsfdsfsd", timeout=5)
  R = s.receiver("ADDR; {create: sender}")
3. kill -STOP `pidof qpidd`  # in terminal B
4. back in the python shell, terminal A
  R.fetch(timeout=5)


Actual results:
The call at step 4 blocks forever.

Expected results:
The call at step 4 blocks for 5 seconds.

Additional info:

Comment 1 Frantisek Reznicek 2012-11-06 16:19:01 UTC
Backtrace from step 4 [rhel5.9] (interrupted with ^C):
>>> R.fetch(timeout=5)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<string>", line 6, in fetch
  File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 1021, in fetch
    self._ecwait(lambda: not self.draining)
  File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
    result = self._ewait(lambda: self.closed or predicate(), timeout)
  File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 973, in _ewait
    result = self.session._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 208, in _ewait
    result = self._wait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 193, in _wait
    return self._waiter.wait(predicate, timeout=timeout)
  File "/usr/lib/python2.4/site-packages/qpid/concurrency.py", line 57, in wait
    self.condition.wait(3)
  File "/usr/lib/python2.4/site-packages/qpid/concurrency.py", line 96, in wait
    sw.wait(timeout)
  File "/usr/lib/python2.4/site-packages/qpid/compat.py", line 53, in wait
    ready, _, _ = select([self], [], [], timeout)
KeyboardInterrupt
>>>
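Note that the select() call at the bottom of the traceback is not itself the culprit: select() honors its timeout argument and returns after it expires. The indefinite block comes from the layers above, which re-enter the wait in a loop because the Python-level wait was started without an overall deadline. A small stdlib demo (illustrative only, not qpid code) of select() returning on time:

```python
import select
import socket
import time

# With no data pending, select() returns three empty lists after
# roughly `timeout` seconds -- the timeout mechanism itself works.
a, b = socket.socketpair()

start = time.monotonic()
readable, _, _ = select.select([a], [], [], 0.5)  # nothing ever arrives
elapsed = time.monotonic() - start

print(readable)              # []
print(0.4 < elapsed < 2.0)   # True
a.close()
b.close()
```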

Comment 2 Justin Ross 2012-11-06 18:45:41 UTC
Darryl, please do a brief assessment.

Comment 3 Darryl L. Pierce 2012-11-06 19:58:35 UTC
(In reply to comment #2)
> Darryl, please do a brief assessment.

Confirmed the problem exists. qpidd isn't being killed, just stopped, so it's still accepting connections. In a variation on this, sending from a session also blocks forever, even with a timeout.

killall -9 qpidd causes both the send and fetch to return immediately.

Comment 4 Justin Ross 2012-11-06 20:31:37 UTC
No doubt about this being a defect, but I do not think it's a good candidate for 2.3 at this point.  Nacking for 2.3, will approve for 2.4.

Comment 5 Pavel Moravec 2014-05-26 17:18:58 UTC
Patch:

--- /usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py	2014-04-25 21:02:27.000000000 +0200
+++ /usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py.bz873756	2014-05-26 18:14:37.491000871 +0200
@@ -1038,7 +1038,7 @@ class Receiver(Endpoint, object):
       self.check_closed()
       self.draining = True
       self._wakeup()
-      self._ecwait(lambda: not self.draining)
+      self._ecwait(lambda: not self.draining, timeout=0)
       msg = self.session._get(self, timeout=0)
       self._grant()
       self._wakeup()

Comment 6 Pavel Moravec 2014-05-27 07:44:47 UTC
Created attachment 899398 [details]
Automated tests failures after applying the patch

Applying the patch from comment #5 (https://bugzilla.redhat.com/show_bug.cgi?id=873756#c5) causes several automated test failures, even though the patch itself is more than reasonable (the timeout needs to be set to zero, not left at the default None).

Darryl, could you please review what is wrong?

Comment 7 Pavel Moravec 2014-06-12 16:28:30 UTC
Upstream commit r1602221.

Comment 8 Pavel Moravec 2014-06-12 16:32:26 UTC
(In reply to Pavel Moravec from comment #7)
> Upstream commit r1602221.

Forgot to add: the automated test failures are false alarms, confirmed by Darryl.

Comment 9 Pavel Moravec 2014-06-15 09:10:31 UTC
The upstream fix has been reverted, as the automated test failures were real (my fault).

Understanding better what happens: both the client and the broker behave properly, strictly following the AMQP 0-10 specification.

What happens when the code is executed:
- the consumer is created; per "Rule: initial-credit" in the AMQP spec, the initial byte and message credit for the consumer is zero, so the broker does _not_ send the message from the queue yet
- the broker is frozen - this can also be mimicked by blocking TCP/IP traffic between the client and the broker
- receiver.fetch is called:
  - receiver.get is invoked internally, but no message is in the local buffer
  - credit is issued to the broker, asking it to provide messages from the queue. Technically, the library sends message.flow (once for credit-unit bytes and once for messages) to (re-)issue its capacity/credit/prefetch, and then it sends execution.sync and message.flush.
    - here, freezing the broker forces it to "break" the part of the AMQP spec where the broker MUST handle the two latest commands (i.e. respond promptly)

So, putting the client and broker implementations aside, the behaviour is expected. But it can be worked around (i.e. fixed such that the scenario works fine) in two ways:

1) (preferred) set the receiver's capacity just after creating it; this is common practice anyway. I.e. call "R.capacity=100" just after "R=s.receiver(..)". The client then gets the message internally right after the capacity is set, and the R.fetch method just hands it to the Python console.
(technically, there is still a little room for the problem to reappear: if the broker freeze occurs after the receiver subscribes but before its credit is issued; but how likely is such a scenario?)

2) A frozen broker is similar to broken connectivity (dropped TCP/IP packets), so another workaround is to use heartbeats, like:
opts = {'heartbeat':10}
c=qpid.messaging.Connection('localhost', **opts)

Then R.fetch might not finish within 5 seconds but within up to 20, which is acceptable imho. The timeout=5 is not strictly honored every time; consider e.g. TCP redeliveries from the client to the broker when issuing credit.
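The heartbeat workaround boils down to giving the connection a liveness bound, so a silent peer produces an error instead of an indefinite block. qpid's heartbeat=10 achieves this at the AMQP level (the connection fails after roughly two missed heartbeats); a plain socket timeout shows the same principle in stdlib Python (illustrative analogy only, not qpid code):

```python
import socket

# A "broker" that accepts connections but never responds, like a
# process stopped with kill -STOP.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket()
client.connect(server.getsockname())
client.settimeout(0.5)  # give up if the peer stays silent this long

try:
    client.recv(1)      # the "broker" never sends anything
    outcome = "data received"
except socket.timeout:
    outcome = "peer silent, gave up"

print(outcome)  # peer silent, gave up
client.close()
server.close()
```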


With the above written, I would close the BZ as NOTABUG.

Frantisek, do you agree?

Comment 10 Frantisek Reznicek 2014-06-22 10:35:32 UTC
(In reply to Pavel Moravec from comment #9)
> ...
> With the above written, I would close the BZ as NOTABUG.
> 
> Frantisek, do you agree?

I pretty much agree with the explanation of what is happening underneath.
I'm also fine with both of the proposed workarounds (capacity & heartbeat).

On the other hand, I disagree with closing this as NOTABUG, for the following reasons:
 * The Receiver.fetch(self, timeout=None) documentation [1] says:
   Fetch and return a single message. A timeout of None will block forever waiting for a message to arrive, a timeout of zero will return immediately if no messages are available.
   * this does not explicitly define the timeout > 0 case, but a customer may still expect the receiver to block for at most the duration defined by timeout
 * The detailed breakdown above of the internal actions in the test scenario should not be taken as fixed by the AMQP spec.
   * The AMQP protocol constrains, but does not force, the client API to sit in R.fetch(., N) forever in the situation highlighted above. In other words, the client API still has multiple ways to behave according to the doc above AND stay compatible with the AMQP spec.

So my summary options are (sorted by QA priority):
 * make the necessary Python API changes so that the Receiver.fetch(...) documentation is correct in the test scenario highlighted above
 * (assuming the engineering effort and/or the risk of such a change breaking the implementation is high) adapt the Receiver.fetch(...) documentation so that customers know the timeout parameter may be ineffective in some scenarios

[1] http://qpid.apache.org/releases/qpid-0.28/messaging-api/python/api/index.html

Comment 11 Frantisek Reznicek 2014-06-22 11:12:37 UTC
I revisited the behavior and retested the C++ client.

Surprisingly, I see the same issue with the C++ client as reported here for the Python API.

If the broker is just up:
[root@dhcp-x clients]# ./a
1.40344e+09 session created
1403435046.419 msg sent
1403435046.421 receiver created
1403435051.421 fetch started (5s)
1403435051.422 fetch ended
Hey!

If the broker is paused at * and unpaused at **:
[root@dhcp-27-117 clients]# ./a
1.40344e+09 session created
1403435085.937 msg sent
1403435085.939 receiver created
* kill -STOP `pidof qpidd`
1403435090.939 fetch started (5s)
** kill -CONT `pidof qpidd` (70 seconds or so later)
1403435162.375 fetch ended
Hey!

This shows that the issue is visible across APIs: C++ (and the swigged ones) as well as Python.

Comment 13 Frantisek Reznicek 2014-06-22 11:25:18 UTC
(In reply to Frantisek Reznicek from comment #11)
> ...
> This proves that the issue is visible among different APIs (c++ and swigged
> ones) and python.

qpid-cpp-client-0.22-42.el6.i686 was the build I used for c++ client retest.

Comment 14 Frantisek Reznicek 2014-06-22 11:29:14 UTC
I believe the current C++/Python implementation of Receiver.fetch(timeout=N) performs multiple actions, some of which have to be both initiated and finished. This leads to a situation where, if mandatory actions (like message.flow and possibly others) do not finish, the timeout is not effective.
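A fix along the lines comment 14 hints at would need to spread one overall deadline across all sub-steps, so no step can silently exceed the caller's timeout. A short stdlib sketch of that deadline-budget pattern (illustrative only; not the actual qpid implementation):

```python
import time

def fetch_with_deadline(steps, timeout):
    # One overall deadline; each sub-step gets only the leftover budget.
    deadline = time.monotonic() + timeout
    for step in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("overall timeout exceeded")
        step(remaining)  # each sub-step is bounded by the remaining time

def sub_step(budget):
    # Pretend to wait on the broker, never longer than our budget.
    time.sleep(min(0.05, budget))

fetch_with_deadline([sub_step, sub_step], timeout=1.0)  # fits the budget
try:
    fetch_with_deadline([sub_step] * 50, timeout=0.2)   # budget runs out
    outcome = "completed"
except TimeoutError:
    outcome = "timed out"
print(outcome)  # timed out
```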

Comment 15 Pavel Moravec 2014-06-23 10:07:53 UTC
I am in favour of mentioning this in the API doc rather than changing the code, which would require a nontrivial change to the timeout handling (per my understanding of the source code), both in C++ and Python.

Where does the doc need to be changed? Just the upstream API reference? (I can't find a note on that in any official Red Hat materials.)

Comment 16 Justin Ross 2014-10-24 17:07:45 UTC
(In reply to Pavel Moravec from comment #15)
> I am in favour of rather mentioning this in API doc than changing the code
> that would require nontrivial change to the timeout handling (per my
> understanding of the source code) - both in C++ and Python.
> 
> Where the doc needs to be changed? Just upstream API reference? (I cant find
> a note on that in any RedHat official materials)

Just the API reference is fine.  Pavel, let me know if you'd like me to reassign this.

Comment 17 Pavel Moravec 2014-10-27 11:28:14 UTC
(In reply to Justin Ross from comment #16)
> (In reply to Pavel Moravec from comment #15)
> > I am in favour of rather mentioning this in API doc than changing the code
> > that would require nontrivial change to the timeout handling (per my
> > understanding of the source code) - both in C++ and Python.
> > 
> > Where the doc needs to be changed? Just upstream API reference? (I cant find
> > a note on that in any RedHat official materials)
> 
> Just the API reference is fine.  Pavel, let me know if you'd like me to
> reassign this.

That would be great; I'm not sure how (or whether) I would do it myself. I would suggest adding text either to receiver.fetch or even to the receiver class, like:

"It is recommended to set the receiver's capacity just after creating the receiver. Otherwise, receiver.fetch can block forever in case of networking issues." For more detailed text, see comment #9.

Comment 25 Red Hat Bugzilla 2025-02-10 03:21:05 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.