Bug 906277 - qpidd --port 0 should always start up with listening to randomly generated ipv4/6 port - no 'Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)' should be seen
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Alan Conway
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-01-31 10:48 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:14 UTC (History)
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-08-29 19:55:01 UTC
Target Upstream Version:
Embargoed:


Attachments
Reproducer (1.12 KB, text/x-python), 2013-04-30 20:40 UTC, Alan Conway, no flags


Links
Apache JIRA: QPID-4745

Description Frantisek Reznicek 2013-01-31 10:48:45 UTC
Description of problem:

qpidd --port 0 should always start up listening on a randomly generated IPv4/IPv6 port - no 'debug Exception constructed: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)' should be seen.

qpidd --port 0 should automatically pick a random free IPv4/IPv6 port and bind to it.

Currently, on RHEL 5.x (more frequently on RHEL 5 x86_64 than on i386), we can see qpidd exiting with the following exception:

  fork0: 2013-01-30 12:16:05 [Network] info Listening to: [::]:0
  fork0: 2013-01-30 12:16:05 [Network] debug Listened to: 49337
  fork0: 2013-01-30 12:16:05 [Network] info Listening to: 0.0.0.0:49337
  fork0: 2013-01-30 12:16:05 [System] debug Exception constructed: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)
  fork0: 2013-01-30 12:16:05 [Unspecified] notice Journal "TplStore": Destroyed
  fork0: 2013-01-30 12:16:05 [Broker] critical Unexpected error: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)

I believe this comes from the RHEL 5 ::bind() behavior.

I think qpidd should retry binding the socket multiple times to guarantee that the above exception is not seen.

I propose adding an environment variable defining the maximum number of ::bind() attempts, with a default of 3 on RHEL 5.x and 1 on RHEL 6.x, or so.
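
For illustration, a minimal Python sketch of the retry idea (the variable name QPID_BIND_MAX_TRIALS is made up here; the real change would live in qpidd's C++ socket code):

import os, socket, time

def bind_with_retries(sock, addr, port):
    # Hypothetical env. variable controlling the maximum number of ::bind() attempts.
    max_trials = int(os.environ.get("QPID_BIND_MAX_TRIALS", "3"))
    for attempt in range(1, max_trials + 1):
        try:
            sock.bind((addr, port))
            return
        except socket.error:
            if attempt == max_trials:
                raise              # still "Address already in use" after all attempts
            time.sleep(0.1)        # brief back-off before retrying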


Version-Release number of selected component (if applicable):
qpid-0.14, qpid-0.18

How reproducible:
70%

Steps to Reproduce:
0. Install and set up openais
1. Compile qpid from the source tree:
   ./configure && make && make check
2. Run the C++ cluster_test in a loop:
source ./test_env.sh ; rm -f cluster_test.log; for i in `seq 100`; do ipcs -m | grep ^0x | awk '{print "ipcrm -m " $2}' | sh; ipcs -s | grep ^0x | awk '{print "ipcrm -s " $2}' | sh; service openais restart; LD_LIBRARY_PATH=$(dirname ${CLUSTER_LIB}) ./cluster_test >> cluster_test.log 2>&1; ecode=$?; echo -n $ecode; done
3. grep 'Address already in use' cluster_test.log
  
Actual results:
qpidd -p 0 occasionally fails to start with the above exception

Expected results:
qpidd -p 0 should always start and bind to a random free port.


Additional info (one run of cluster_test):
  *Entering test case "testTxTransaction"
  2013-01-30 12:16:04 [Unspecified] debug ForkedBroker exec /usr/sbin/qpidd: qpidd  --cluster-name 71faa87d-38dc-4fc3-bafb-01532c556829 --log-prefix fork0 --auth no
  --no-module-dir --load-module /usr/lib64/qpid/daemon/cluster.so --load-module /usr/lib64/qpid/daemon/msgstore.so --data-dir /tmp/ForkedBroker.uh0IiS --port=0
  fork0: 2013-01-30 12:16:05 [Management] debug Management object added: amqp-broker
  fork0: 2013-01-30 12:16:05 [Broker] info Management enabled
  fork0: 2013-01-30 12:16:05 [Management] info ManagementAgent generated broker ID: 9edfc3f1-f456-4a40-aa52-02fb0bb6a5b0
  fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent boot sequence: 1
  fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent added package org.apache.qpid.broker
  fork0: 2013-01-30 12:16:05 [Management] debug SEND PackageInd package=org.apache.qpid.broker to=schema.package
  fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent added class org.apache.qpid.broker:system
  ...
  fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent added class org.apache.qpid.broker:queueThresholdExceeded
  fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:system:de5aaffa-f92b-4cb9-a0ae-02098c7a4055
  fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:broker:amqp-broker
  fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:vhost:org.apache.qpid.broker:broker:amqp-broker,/
  fork0: 2013-01-30 12:16:05 [Unspecified] notice Journal "TplStore": Created
  fork0: 2013-01-30 12:16:05 [Unspecified] debug Journal "TplStore": Journal directory = "/tmp/ForkedBroker.uh0IiS/rhm/tpl/"; Base file name = "tpl"
  fork0: 2013-01-30 12:16:05 [Unspecified] notice Store module initialized; store-dir=/tmp/ForkedBroker.uh0IiS
  fork0: 2013-01-30 12:16:05 [Unspecified] info > Default files per journal: 8
  fork0: 2013-01-30 12:16:05 [Unspecified] info > Default journal file size: 24 (wpgs)
  fork0: 2013-01-30 12:16:05 [Unspecified] info > Default write cache page size: 32 (KiB)
  fork0: 2013-01-30 12:16:05 [Unspecified] info > Default number of write cache pages: 32
  fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL files per journal: 8
  fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL journal file size: 24 (wpgs)
  fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL write cache page size: 4 (KiB)
  fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL number of write cache pages: 64
  fork0: 2013-01-30 12:16:05 [HA] notice Initializing CPG
  fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:exchange:amq.failover
  fork0: 2013-01-30 12:16:05 [HA] notice Cluster store state: empty
  fork0: 2013-01-30 12:16:05 [HA] notice cluster(10.16.64.162:18839 PRE_INIT) configuration change: 10.16.64.162:18839
  fork0: 2013-01-30 12:16:05 [HA] notice cluster(10.16.64.162:18839 PRE_INIT) Members joined: 10.16.64.162:18839
  fork0: 2013-01-30 12:16:05 [HA] debug cluster(10.16.64.162:18839 PRE_INIT) received initial status from 10.16.64.162:18839
  fork0: 2013-01-30 12:16:05 [HA] debug cluster(10.16.64.162:18839 PRE_INIT) initial status map complete.
  fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:exchange:
  fork0: 2013-01-30 12:16:05 [Management] debug SEND raiseEvent (v1) class=org.apache.qpid.broker.exchangeDeclare
  ...
  fork0: 2013-01-30 12:16:05 [Management] debug SEND raiseEvent (v2) class=org.apache.qpid.broker.exchangeDeclare
  fork0: 2013-01-30 12:16:05 [Broker] notice SASL disabled: No Authentication Performed
  fork0: 2013-01-30 12:16:05 [Network] info Listening to: [::]:0
  fork0: 2013-01-30 12:16:05 [Network] debug Listened to: 49337
  fork0: 2013-01-30 12:16:05 [Network] info Listening to: 0.0.0.0:49337
  fork0: 2013-01-30 12:16:05 [System] debug Exception constructed: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)
  fork0: 2013-01-30 12:16:05 [Unspecified] notice Journal "TplStore": Destroyed
  fork0: 2013-01-30 12:16:05 [Broker] critical Unexpected error: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)
  2013-01-30 12:16:05 [System] debug Exception constructed: EOF reading port number from child.
  unknown location(0): fatal error in "testTxTransaction": std::exception: EOF reading port number from child.
  /builddir/build/BUILD/boost_1_33_1/boost/test/impl/results_collector.ipp(197): info: check 'Test case has less failures then expected' passed
  Leaving test case "testTxTransaction"

Comment 1 Frantisek Reznicek 2013-02-01 08:34:14 UTC
Also seen, very rarely, when running the python cluster_tests on RHEL 6.4 i686:

cluster_tests.LongTests.test_msg_group_failover ......................... fail
Error during test:  Traceback (most recent call last):
    File "/usr/bin/qpid-python-test", line 340, in run
      phase()
    File "/root/rpmbuild/BUILD/qpid-0.18/cpp/src/tests/cluster_tests.py", line 1701, in test_msg_group_failover
      b = cluster.start(expect=EXPECT_EXIT_FAIL)
    File "/root/rpmbuild/BUILD/qpid-0.18/cpp/src/tests/brokertest.py", line 427, in start
      self._brokers.append(self.test.broker(self.args+args, name, expect, wait, port=port, show_cmd=show_cmd))
    File "/root/rpmbuild/BUILD/qpid-0.18/cpp/src/tests/brokertest.py", line 527, in broker
      raise RethrownException("Failed to start broker %s(%s): %s" % (b.name, b.log, e))
  RethrownException: Failed to start broker cluster8-4(027:cluster8-4.log): Timed out waiting for broker cluster8-4:
      2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:amq.match Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:0, byteReceives:0, byteRoutes:0, msgDrops:0, msgReceives:0, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
      2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:qpid.management Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:302, byteReceives:302, byteRoutes:0, msgDrops:2, msgReceives:2, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
      2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:qmf.default.topic Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:0, byteReceives:0, byteRoutes:0, msgDrops:0, msgReceives:0, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
      2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:qmf.default.direct Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:0, byteReceives:0, byteRoutes:0, msgDrops:0, msgReceives:0, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
      2013-01-31 14:18:51 [Broker] critical Unexpected error: Can't bind to port [::]:41633: Address already in use (qpid/sys/posix/Socket.cpp:206)



 > I propose adding an environment variable defining the maximum number of ::bind() attempts, with a default of 3 on RHEL 5.x and 1 on RHEL 6.x, or so.

This is turning out to be a general problem. If the proposal is agreed, the default value of the environment variable should be 3 for both RHEL 5 and RHEL 6.

Comment 2 Andrew Stitcher 2013-04-16 18:26:34 UTC
Why do you think that repeating the bind would make the operation succeed?

Comment 3 Andrew Stitcher 2013-04-16 18:29:57 UTC
Assuming that this is truly non-deterministic, I think it's a bug in the IPv6/IPv4 support in RHEL 5.

It seems that sometimes binding to the IPv6 port also binds to the IPv4 port, even though the code sets options to stop that happening.

However, it's possible that there is some uninitialised variable involved here, so that something is being set to a non-deterministic state.
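
If the option in question is IPV6_V6ONLY (I'm assuming it is), the expected behaviour can be checked in isolation with a small Python sketch; with the option set, binding the same OS-chosen port on the IPv4 wildcard should normally succeed, while without it the dual-stack IPv6 bind also claims the IPv4 port and the second bind fails with EADDRINUSE:

import socket

s6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
s6.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)  # keep this bind IPv6-only
s6.bind(("::", 0))
port = s6.getsockname()[1]        # OS-chosen port on [::]

s4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s4.bind(("0.0.0.0", port))        # normally succeeds with V6ONLY set; EADDRINUSE without it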

Comment 4 Frantisek Reznicek 2013-04-19 09:38:23 UTC
(In reply to comment #3)
> Assuming that this is truly non-deterministic, I think it's a bug in the
> IPv6/IPv4 support in RHEL 5.

Correct, non-deterministic.
Since comment 1 I have also seen the same/similar case on RHEL 6, with drastically lower reproducibility.

> 
> It seems that sometimes binding to the IPv6 port also binds to the IPv4
> port, even though the code sets options to stop that happening.
> 
> However, it's possible that there is some uninitialised variable involved
> here, so that something is being set to a non-deterministic state.

> Why do you think that repeating the bind would make the operation succeed?
I don't have any proof of that. I just saw in another case that retrying helped as a workaround.


I agree that if there is a problem in RHEL, then it should be fixed. If we are not able to prove that, then I propose implementing the environment variable.

An isolated / simplified testing scenario involving just qpid::sys::posix::Socket in a short C++ program (create a socket with an OS-chosen port, sleep N seconds, close the socket), run in a parallel loop, should prove whether the problem is in the qpid code (which is unlikely) or in the OS socket library. Implementing a simple retry mechanism should answer whether retrying actually helps.
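
A rough Python sketch of that isolated scenario (the proposal is for C++, but the socket calls are the same):

import socket, time
from multiprocessing import Pool

def bind_sleep_close(_):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", 0))    # let the OS pick a free port
        port = s.getsockname()[1]
        time.sleep(2)             # hold the port for N seconds
        return port
    except socket.error as e:
        return "FAIL: %s" % e
    finally:
        s.close()

if __name__ == "__main__":
    # Many bind/sleep/close cycles in parallel; any "FAIL" points at the OS side.
    print(Pool(20).map(bind_sleep_close, range(200)))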

Comment 5 Alan Conway 2013-04-30 20:40:39 UTC
Created attachment 741993 [details]
Reproducer

This script reproduces the problem for me usually within 10 minutes.

I think this may be a bug in qpidd: qpidd tries to bind its port on all interfaces. But for port=0 it does bind(0) on the first interface - which returns port P, and then tries to do bind(P) on the remaining interfaces. There's no guarantee that the port is free on all the other interfaces.

Either way, I am working on a solution that gets rid of --port=0. A small python script does the bind(0) and passes the socket to qpidd using the --socket-fd argument. Running the reproducer for 24 hours, I saw no problems with this approach.
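
Roughly, the idea is (sketch only; the exact qpidd arguments and fd handling in the real script differ):

import socket, subprocess

# The script, not qpidd, does the bind(0) and listen, so the script owns the port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 0))
s.listen(5)
print("broker will listen on port", s.getsockname()[1])

# qpidd inherits the already-bound socket instead of binding a port itself.
broker = subprocess.Popen(["qpidd", "--socket-fd", str(s.fileno())],
                          pass_fds=[s.fileno()])

Because the script keeps the socket open, a broker can also be restarted later on the same port without racing other processes for it.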

Comment 6 Alan Conway 2013-05-15 15:12:42 UTC
Committed to trunk: solution for the HA tests. A fix is still needed for the other tests.

------------------------------------------------------------------------   
r1482882 | aconway | 2013-05-15 11:05:05 -0400 (Wed, 15 May 2013) | 2 lines

QPID-4745: Clean up test_backup_acquired to use HaCluster.

------------------------------------------------------------------------
r1482881 | aconway | 2013-05-15 11:03:44 -0400 (Wed, 15 May 2013) | 16 lines

QPID-4745: HA safe port allocation for brokers in HA tests.

Many HA tests use --port=0 to start a broker on an available port, but then need
to shutdown and restart the broker on the same port. This is not safe, on a busy
system it is possible for another process to take the port between the time the
broker is shut down and the time it is restarted.

The solution is to do bind(0) and listen in the python test framework (class
HaPort) and let the broker use the socket using qpidd --socket-fd. When the
broker is shut down the port remains bound by the python process. When the  
broker is re-started it again is given access to the socket via --socket-fd.

Other changes
- move ha_store_tests into ha_tests.
- add heartbeats to avoid stalling.

------------------------------------------------------------------------

Comment 7 Alan Conway 2013-06-18 21:16:16 UTC
This commit introduces a tool to do the bind(0) outside of qpidd.

------------------------------------------------------------------------
r1492778 | gsim | 2013-06-13 13:45:56 -0400 (Thu, 13 Jun 2013) | 2 lines

NO-JIRA: restrict broker to listen only on loopback interface (which is all that gets used by clients) in order to avoid port collisions

------------------------------------------------------------------------

Tests have not yet been updated to use this script. This is less urgent since gsim committed an alternative fix to the port clash problem:

------------------------------------------------------------------------
r1492778 | gsim | 2013-06-13 13:45:56 -0400 (Thu, 13 Jun 2013) | 2 lines

NO-JIRA: restrict broker to listen only on loopback interface (which is all that gets used by clients) in order to avoid port collisions

------------------------------------------------------------------------

Comment 8 Alan Conway 2014-08-29 19:55:01 UTC
This has been addressed by the --interface flag to qpidd. Using --interface 127.0.0.1 prevents the problem, which does seem to be related to qpidd's default "bind to everything" behavior.

