Bug 873756
| Summary: | python/c++ high level API client receiver.fetch(timeout=N) blocks forever in select() if broker is stopped (kill -STOP). |
|---|---|
| Product: | Red Hat Enterprise MRG |
| Component: | Messaging_Programming_Reference |
| Status: | CLOSED UPSTREAM |
| Severity: | high |
| Priority: | medium |
| Version: | Development |
| Target Milestone: | 3.2 |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Frantisek Reznicek <freznice> |
| Assignee: | Nobody <nobody> |
| QA Contact: | Messaging QE <messaging-qe-bugs> |
| CC: | jross, pmoravec, zkraus |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Last Closed: | 2025-02-10 03:21:05 UTC |
Description — Frantisek Reznicek, 2012-11-06 16:16:21 UTC

Backtrace from step 4. [rhel5.9] (interrupted with ^C):

    >>> R.fetch(timeout=5)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<string>", line 6, in fetch
      File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 1021, in fetch
        self._ecwait(lambda: not self.draining)
      File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
        result = self._ewait(lambda: self.closed or predicate(), timeout)
      File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 973, in _ewait
        result = self.session._ewait(lambda: self.error or predicate(), timeout)
      File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
        result = self.connection._ewait(lambda: self.error or predicate(), timeout)
      File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 208, in _ewait
        result = self._wait(lambda: self.error or predicate(), timeout)
      File "/usr/lib/python2.4/site-packages/qpid/messaging/endpoints.py", line 193, in _wait
        return self._waiter.wait(predicate, timeout=timeout)
      File "/usr/lib/python2.4/site-packages/qpid/concurrency.py", line 57, in wait
        self.condition.wait(3)
      File "/usr/lib/python2.4/site-packages/qpid/concurrency.py", line 96, in wait
        sw.wait(timeout)
      File "/usr/lib/python2.4/site-packages/qpid/compat.py", line 53, in wait
        ready, _, _ = select([self], [], [], timeout)
    KeyboardInterrupt
    >>>

Darryl, please do a brief assessment.

(In reply to comment #2)
> Darryl, please do a brief assessment.

Confirmed the problem exists. qpidd isn't being killed, just stopped, so it's still accepting connections. I did a variation on this: sending from a session also blocks forever, even with a timeout. killall -9 qpidd causes both the send and the fetch to return immediately. No doubt about this being a defect, but I do not think it's a good candidate for 2.3 at this point. Nacking for 2.3; will approve for 2.4.
Patch:

    --- /usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py	2014-04-25 21:02:27.000000000 +0200
    +++ /usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py.bz873756	2014-05-26 18:14:37.491000871 +0200
    @@ -1038,7 +1038,7 @@ class Receiver(Endpoint, object):
         self.check_closed()
         self.draining = True
         self._wakeup()
    -    self._ecwait(lambda: not self.draining)
    +    self._ecwait(lambda: not self.draining, timeout=0)
         msg = self.session._get(self, timeout=0)
         self._grant()
         self._wakeup()

Created attachment 899398 [details]: Automated tests failures after applying the patch

Applying the patch from https://bugzilla.redhat.com/show_bug.cgi?id=873756#c5 causes several automated test failures, even though the patch itself is more than reasonable (the timeout needs to be set to zero, not left at the default None). Daryl, could you please review what is wrong?

Upstream commit r1602221.

(In reply to Pavel Moravec from comment #7)
> Upstream commit r1602221.

Forgot to add: the automated test failures are false alarms, confirmed by Daryl.

Upstream fix reverted, as the automated test failures were real (my fault).
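The one-line patch above changes the wait from the default `timeout=None` (block until the predicate holds) to `timeout=0` (poll once and return). The difference can be sketched with a plain `threading.Condition`; the `Waiter` class and its names here are illustrative stand-ins, not qpid's actual internals.

```python
import threading
import time

class Waiter:
    """Minimal predicate-wait helper; illustrative only, not qpid's real Waiter."""
    def __init__(self):
        self.condition = threading.Condition()

    def wait(self, predicate, timeout=None):
        # timeout=None: loop forever in 3-second naps (the reported hang);
        # timeout=0: poll the predicate once and return immediately.
        deadline = None if timeout is None else time.monotonic() + timeout
        with self.condition:
            while not predicate():
                if deadline is None:
                    self.condition.wait(3)
                else:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        return False   # timed out, predicate still false
                    self.condition.wait(remaining)
            return True

w = Waiter()
start = time.monotonic()
# The predicate never becomes true (a "frozen broker" never clears draining).
result = w.wait(lambda: False, timeout=0)
elapsed = time.monotonic() - start
print(result)          # False
print(elapsed < 1.0)   # True
```

With `timeout=None` the same call would never return, which matches the `self.condition.wait(3)` loop visible in the traceback above.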
Understanding more of what happens: both the client and the broker behave properly, strictly following the AMQP 0-10 specification.
What happens when executing the code:
- consumer is created; per "Rule: initial-credit" in AMQP spec, the initial byte and message credit for the consumer is zero. So the broker does _not_ send the message from the queue now
- broker is frozen - this can be mimicked also by blocking TCP/IP traffic between the client and the broker
- receiver.fetch is called:
- receiver.get is internally called but no message is in local buffer
- credit is issued to the broker, asking it to provide messages from the queue. Technically the library sends message.flow (once for credit-unit bytes and once for messages) to (re-)issue its capacity/credit/prefetch, and then it sends execution.sync and message.flush.
- here, freezing the broker forces it to "break" the AMQP spec, where the broker MUST handle the two latter commands (meaning: respond promptly)
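The stalled step above can be reproduced locally without qpid at all: a TCP peer that accepts the connection but never answers (what `kill -STOP` effectively does to qpidd, whose listening socket stays open). A minimal sketch under that assumption — a single bounded `select()` detects the silence, while looping on short waits without an overall deadline is what makes `fetch()` hang:

```python
import select
import socket
import threading

# A stand-in "frozen broker": accepts the connection, then never responds.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def frozen_broker():
    conn, _ = server.accept()
    threading.Event().wait(10)   # hold the socket open, send nothing
    conn.close()

threading.Thread(target=frozen_broker, daemon=True).start()

client = socket.socket()
client.connect(("127.0.0.1", port))
# The credit request (message.flow / execution.sync / message.flush in the
# real protocol) goes out fine; it is the reply that never arrives.
client.sendall(b"issue credit")

# A bounded select() returns an empty ready list after the timeout.
ready, _, _ = select.select([client], [], [], 1.0)
print("reply ready:", bool(ready))   # reply ready: False
client.close()
```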
So, putting the client and broker implementations aside, the behaviour is expected. But it can be worked around (meaning: adjusted so that the scenario works fine) in two ways:
1) (preferred) Set the receiver's capacity just after creating it; that is common practice anyway. I.e. call "R.capacity=100" just after "R=s.receiver(..)". The client then gets the message internally as soon as the capacity is set, and the R.fetch method just hands it to the Python console.
(Technically, there is still a little room for the problem to reappear: if the broker freeze occurs after the receiver subscribes but before its credit is issued; but how realistic and probable is such a scenario?)
2) A frozen broker is similar to broken connectivity (dropped TCP/IP packets), so another workaround is to use heartbeats, like:

    opts = {'heartbeat': 10}
    c = qpid.messaging.Connection('localhost', **opts)

R.fetch might then not finish within 5 seconds but after up to 20 seconds, which is acceptable imho. The timeout=5 is not strictly honoured every time anyway; consider e.g. TCP retransmissions from client to broker while issuing credit.
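The effect of workaround 1 (pre-setting capacity) can be modeled with a toy class: once credit is issued while the broker is alive, the message already sits in the client's local buffer, so a later fetch succeeds even if the broker is frozen in the meantime. `MockReceiver` and its methods are hypothetical illustrations, not the qpid API:

```python
import queue

class MockReceiver:
    """Toy model of prefetch; not the qpid API, just the buffering idea."""
    def __init__(self, capacity=0):
        self.capacity = capacity
        self.buffer = queue.Queue()   # messages already delivered to the client
        self.broker_frozen = False
        if capacity > 0:
            self.issue_credit()       # credit issued while the broker is alive

    def issue_credit(self):
        # A live broker answers credit by pushing queued messages to the client.
        if not self.broker_frozen:
            self.buffer.put("Hey!")

    def fetch(self, timeout=None):
        if not self.buffer.empty():
            return self.buffer.get()  # served locally, no broker round trip
        if self.broker_frozen:
            raise TimeoutError("broker never answered the credit request")
        self.issue_credit()
        return self.buffer.get()

r = MockReceiver(capacity=100)   # "R.capacity=100" right after creation
r.broker_frozen = True           # now kill -STOP the broker
msg = r.fetch(timeout=5)
print(msg)                       # Hey!
```

With `capacity=0` the same `fetch` would have to issue credit against the frozen broker, which is the blocking path described above.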
With the above written, I would close the BZ as NOTABUG.
Frantisek, do you agree?
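The heartbeat workaround (option 2 above) boils down to declaring the peer dead once no heartbeat arrives within roughly twice the configured interval, which bounds how long a silent, still-connected peer can stall the client. A minimal monitor sketch, with the interval scaled down for demonstration (class and names are illustrative, not qpid's implementation):

```python
import threading
import time

class HeartbeatMonitor:
    """Declare the peer dead once no heartbeat arrives within 2x the interval."""
    def __init__(self, interval):
        self.interval = interval
        self.last_beat = time.monotonic()
        self.lock = threading.Lock()

    def beat(self):
        # Called whenever a heartbeat frame arrives from the peer.
        with self.lock:
            self.last_beat = time.monotonic()

    def peer_alive(self):
        with self.lock:
            return (time.monotonic() - self.last_beat) < 2 * self.interval

mon = HeartbeatMonitor(interval=0.1)   # scaled down from heartbeat=10
mon.beat()
alive_before = mon.peer_alive()
time.sleep(0.25)                       # "kill -STOP": no beats for > 2x interval
alive_after = mon.peer_alive()
print(alive_before, alive_after)       # True False
```

This is why, with `heartbeat=10`, the comment above expects the hang to resolve within up to about 20 seconds rather than never.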
(In reply to Pavel Moravec from comment #9)
> ...
> With the above written, I would close the BZ as NOTABUG.
>
> Frantisek, do you agree?

I pretty much agree with the explanation of what is happening underneath, and I'm also fine with both of the proposed workarounds (capacity & heartbeat). On the other hand, I reject / do not agree with closing as NOTABUG, for the following reasons:

* The Receiver.fetch(self, timeout=None) documentation [1] says: "Fetch and return a single message. A timeout of None will block forever waiting for a message to arrive, a timeout of zero will return immediately if no messages are available."
* This does not explicitly define the timeout > 0 case, but a customer may still expect that the receiver blocks at most for the duration defined by the timeout.
* The well-detailed flow above, showing the internal actions the test scenario consists of, should not be taken as frozen by the AMQP spec.
* The AMQP protocol affects, but does not restrict, the client API, which here sits in R.fetch(., N) forever in the situation highlighted above. In other words, the client API still has multiple ways to behave according to the documentation above AND stay compatible with the AMQP spec.

So my summary options are (sorted by QA priority):

* Make the necessary Python API changes so that the Receiver.fetch(...) documentation is correct in the test scenario highlighted above.
* (Assuming the engineering effort and/or risk of such a change breaking the implementation is high) Adapt the Receiver.fetch(...) documentation so that customers know the timeout parameter may be ineffective/useless in some scenarios.

[1] http://qpid.apache.org/releases/qpid-0.28/messaging-api/python/api/index.html

I revisited the behavior and retested the C++ client. Surprisingly, I'm seeing the same issue as reported here for the Python API.

If the broker is just up:

    [root@dhcp-x clients]# ./a
    1.40344e+09 session created
    1403435046.419 msg sent
    1403435046.421 receiver created
    1403435051.421 fetch started (5s)
    1403435051.422 fetch ended
    Hey!
If the broker gets paused at * and unpaused at **:

    [root@dhcp-27-117 clients]# ./a
    1.40344e+09 session created
    1403435085.937 msg sent
    1403435085.939 receiver created
    * kill -STOP `pidof qpidd`
    1403435090.939 fetch started (5s)
    ** kill -CONT `pidof qpidd` (70 seconds or so)
    1403435162.375 fetch ended
    Hey!

This proves that the issue is visible across different APIs: C++ (and the swigged ones) and Python.

(In reply to Frantisek Reznicek from comment #11)
> ...
> This proves that the issue is visible among different APIs (c++ and swigged
> ones) and python.

qpid-cpp-client-0.22-42.el6.i686 was the build I used for the C++ client retest.

I believe the current C++/Python implementation of Receiver.fetch(timeout=N) performs multiple actions, some of which have to be both initiated and finished. This leads to a situation where, if mandatory actions (like message.flow and possibly others) are not finished, the timeout is not functional.

I am in favour of mentioning this in the API doc rather than changing the code, which would require a nontrivial change to the timeout handling (per my understanding of the source code), both in C++ and Python.

Where does the doc need to be changed? Just the upstream API reference? (I can't find a note on that in any Red Hat official materials.)

(In reply to Pavel Moravec from comment #15)
> I am in favour of rather mentioning this in API doc than changing the code
> that would require nontrivial change to the timeout handling (per my
> understanding of the source code) - both in C++ and Python.
>
> Where the doc needs to be changed? Just upstream API reference? (I cant find
> a note on that in any RedHat official materials)

Just the API reference is fine. Pavel, let me know if you'd like me to reassign this.
(In reply to Justin Ross from comment #16)
> Just the API reference is fine. Pavel, let me know if you'd like me to
> reassign this.

That would be great; I'm not sure how (or if) I would do it. I would suggest adding text either to receiver.fetch or even to the receiver class, like: "It is recommended to set the receiver's capacity just after creating a receiver. Otherwise, receiver.fetch can block forever in case of networking issues." For more detailed text, see comment #9.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.