Description of problem:

qpidd --port 0 should always start up listening on a randomly chosen IPv4/IPv6 port; no 'debug Exception constructed: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)' should be seen. qpidd --port 0 should automatically pick a free IPv[46] port and bind to it. Currently on RHEL5.x (more frequently on RHEL5 x86_64 than on i386) we can see qpidd exiting with the following exception:

fork0: 2013-01-30 12:16:05 [Network] info Listening to: [::]:0
fork0: 2013-01-30 12:16:05 [Network] debug Listened to: 49337
fork0: 2013-01-30 12:16:05 [Network] info Listening to: 0.0.0.0:49337
fork0: 2013-01-30 12:16:05 [System] debug Exception constructed: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)
fork0: 2013-01-30 12:16:05 [Unspecified] notice Journal "TplStore": Destroyed
fork0: 2013-01-30 12:16:05 [Broker] critical Unexpected error: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)

I believe this comes from RHEL 5 ::bind() behavior. I think qpidd should try to bind the socket multiple times to guarantee that we do not see the above exception. I propose adding an environment variable defining the maximum number of ::bind() attempts, with a default of 3 on RHEL5.x and 1 on RHEL6.x or so.

Version-Release number of selected component (if applicable):
qpid-0.14, qpid-0.18

How reproducible:
70%

Steps to Reproduce:
0. Install and set up openais.
1. Compile qpid from the source tree:
   ./configure && make && make check
2. Run the C++ cluster_test in a loop:
   source ./test_env.sh
   rm -f cluster_test.log
   for i in `seq 100`; do
     ipcs -m | grep ^0x | awk '{print "ipcrm -m " $2}' | sh
     ipcs -s | grep ^0x | awk '{print "ipcrm -s " $2}' | sh
     service openais restart
     LD_LIBRARY_PATH=$(dirname ${CLUSTER_LIB}) ./cluster_test >> cluster_test.log 2>&1
     ecode=$?
     echo -n $ecode
   done
3.
grep 'Address already in use' cluster_test.log

Actual results:
qpidd -p 0 eventually fails to start with the exception.

Expected results:
qpidd -p 0 should always start and bind to a random port.

Additional info (one run of cluster_test):

*Entering test case "testTxTransaction"
2013-01-30 12:16:04 [Unspecified] debug ForkedBroker exec /usr/sbin/qpidd: qpidd --cluster-name 71faa87d-38dc-4fc3-bafb-01532c556829 --log-prefix fork0 --auth no --no-module-dir --load-module /usr/lib64/qpid/daemon/cluster.so --load-module /usr/lib64/qpid/daemon/msgstore.so --data-dir /tmp/ForkedBroker.uh0IiS --port=0
fork0: 2013-01-30 12:16:05 [Management] debug Management object added: amqp-broker
fork0: 2013-01-30 12:16:05 [Broker] info Management enabled
fork0: 2013-01-30 12:16:05 [Management] info ManagementAgent generated broker ID: 9edfc3f1-f456-4a40-aa52-02fb0bb6a5b0
fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent boot sequence: 1
fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent added package org.apache.qpid.broker
fork0: 2013-01-30 12:16:05 [Management] debug SEND PackageInd package=org.apache.qpid.broker to=schema.package
fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent added class org.apache.qpid.broker:system
...
fork0: 2013-01-30 12:16:05 [Management] debug ManagementAgent added class org.apache.qpid.broker:queueThresholdExceeded
fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:system:de5aaffa-f92b-4cb9-a0ae-02098c7a4055
fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:broker:amqp-broker
fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:vhost:org.apache.qpid.broker:broker:amqp-broker,/
fork0: 2013-01-30 12:16:05 [Unspecified] notice Journal "TplStore": Created
fork0: 2013-01-30 12:16:05 [Unspecified] debug Journal "TplStore": Journal directory = "/tmp/ForkedBroker.uh0IiS/rhm/tpl/"; Base file name = "tpl"
fork0: 2013-01-30 12:16:05 [Unspecified] notice Store module initialized; store-dir=/tmp/ForkedBroker.uh0IiS
fork0: 2013-01-30 12:16:05 [Unspecified] info > Default files per journal: 8
fork0: 2013-01-30 12:16:05 [Unspecified] info > Default journal file size: 24 (wpgs)
fork0: 2013-01-30 12:16:05 [Unspecified] info > Default write cache page size: 32 (KiB)
fork0: 2013-01-30 12:16:05 [Unspecified] info > Default number of write cache pages: 32
fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL files per journal: 8
fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL journal file size: 24 (wpgs)
fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL write cache page size: 4 (KiB)
fork0: 2013-01-30 12:16:05 [Unspecified] info > TPL number of write cache pages: 64
fork0: 2013-01-30 12:16:05 [HA] notice Initializing CPG
fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:exchange:amq.failover
fork0: 2013-01-30 12:16:05 [HA] notice Cluster store state: empty
fork0: 2013-01-30 12:16:05 [HA] notice cluster(10.16.64.162:18839 PRE_INIT) configuration change: 10.16.64.162:18839
fork0: 2013-01-30 12:16:05 [HA] notice cluster(10.16.64.162:18839 PRE_INIT) Members joined: 10.16.64.162:18839
fork0: 2013-01-30 12:16:05 [HA] debug cluster(10.16.64.162:18839 PRE_INIT) received initial status from 10.16.64.162:18839
fork0: 2013-01-30 12:16:05 [HA] debug cluster(10.16.64.162:18839 PRE_INIT) initial status map complete.
fork0: 2013-01-30 12:16:05 [Management] debug Management object (V1) added: org.apache.qpid.broker:exchange:
fork0: 2013-01-30 12:16:05 [Management] debug SEND raiseEvent (v1) class=org.apache.qpid.broker.exchangeDeclare
...
fork0: 2013-01-30 12:16:05 [Management] debug SEND raiseEvent (v2) class=org.apache.qpid.broker.exchangeDeclare
fork0: 2013-01-30 12:16:05 [Broker] notice SASL disabled: No Authentication Performed
fork0: 2013-01-30 12:16:05 [Network] info Listening to: [::]:0
fork0: 2013-01-30 12:16:05 [Network] debug Listened to: 49337
fork0: 2013-01-30 12:16:05 [Network] info Listening to: 0.0.0.0:49337
fork0: 2013-01-30 12:16:05 [System] debug Exception constructed: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)
fork0: 2013-01-30 12:16:05 [Unspecified] notice Journal "TplStore": Destroyed
fork0: 2013-01-30 12:16:05 [Broker] critical Unexpected error: Can't bind to port 0.0.0.0:49337: Address already in use (qpid/sys/posix/Socket.cpp:206)
2013-01-30 12:16:05 [System] debug Exception constructed: EOF reading port number from child.
unknown location(0): fatal error in "testTxTransaction": std::exception: EOF reading port number from child.
/builddir/build/BUILD/boost_1_33_1/boost/test/impl/results_collector.ipp(197): info: check 'Test case has less failures then expected' passed
Leaving test case "testTxTransaction"
Very rarely, this is also seen when running the python cluster_tests on rhel6.4 i686:

cluster_tests.LongTests.test_msg_group_failover ......................... fail
Error during test: Traceback (most recent call last):
  File "/usr/bin/qpid-python-test", line 340, in run
    phase()
  File "/root/rpmbuild/BUILD/qpid-0.18/cpp/src/tests/cluster_tests.py", line 1701, in test_msg_group_failover
    b = cluster.start(expect=EXPECT_EXIT_FAIL)
  File "/root/rpmbuild/BUILD/qpid-0.18/cpp/src/tests/brokertest.py", line 427, in start
    self._brokers.append(self.test.broker(self.args+args, name, expect, wait, port=port, show_cmd=show_cmd))
  File "/root/rpmbuild/BUILD/qpid-0.18/cpp/src/tests/brokertest.py", line 527, in broker
    raise RethrownException("Failed to start broker %s(%s): %s" % (b.name, b.log, e))
RethrownException: Failed to start broker cluster8-4(027:cluster8-4.log): Timed out waiting for broker cluster8-4:
2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:amq.match Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:0, byteReceives:0, byteRoutes:0, msgDrops:0, msgReceives:0, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:qpid.management Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:302, byteReceives:302, byteRoutes:0, msgDrops:2, msgReceives:2, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:qmf.default.topic Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:0, byteReceives:0, byteRoutes:0, msgDrops:0, msgReceives:0, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
2013-01-31 14:18:51 [Model] trace Mgmt delete exchange. id:qmf.default.direct Statistics: {bindingCount:0, bindingCountHigh:0, bindingCountLow:0, byteDrops:0, byteReceives:0, byteRoutes:0, msgDrops:0, msgReceives:0, msgRoutes:0, producerCount:0, producerCountHigh:0, producerCountLow:0}
2013-01-31 14:18:51 [Broker] critical Unexpected error: Can't bind to port [::]:41633: Address already in use (qpid/sys/posix/Socket.cpp:206)

> I propose to add env. variable defining maximum number of ::bind() trials with default of 3 on RHEL5.x and 1 on RHEL6.x or so.

This is turning out to be a general problem. If the proposal is agreed, the default value of the env. variable should be 3 on both RHEL5 and RHEL6.
Why do you think that repeating the bind would make the operation succeed?
Assuming that this is truly non-deterministic, I think it's a bug with IPv6/4 support in RHEL 5. It seems that sometimes binding to the IPv6 port also binds to the IPv4 port, even though the code sets options to stop that happening. However, it's possible that there is some uninitialised variable involved here, so that something is being set to a non-deterministic state.
(In reply to comment #3)
> Assuming that this is truly non deterministic. I think it's a bug with
> IPv6/4 support in RHEL 5.

Correct, non-deterministic. Since comment 1 I have seen the same/similar case on RHEL6, with drastically lower reproducibility.

> It seems that sometimes binding to the IPv6 port also binds to the IPv4
> port, when the code sets options to stop that happening.
>
> However it's possible that there is some uninitialised variable involved
> here so that something is being set to a non deterministic state.

> Why do you think that repeating the bind would make the operation succeed?

I don't have any proof of that; I just saw in another case that retrying helped as a workaround. I agree that if there is a problem in RHEL, then it should be fixed there. If we are not able to prove that, then I propose to implement the env. variable. An isolated/simplified testing scenario involving just qpid::sys::posix::Socket in a short C++ program (creating a socket with a guessed port, sleeping N seconds, closing the socket), run in a parallel loop, should prove whether the problem is in the qpid part (which is unlikely) or in the OS socket library. Implementing a simple retry mechanism should answer whether retrying actually helps.
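The isolated test proposed above could be sketched as follows. This is a hedged illustration in Python rather than C++ (the function name and loop count are my own), mimicking the bind sequence visible in the broker log: bind port 0 on the IPv6 wildcard, then try the kernel-chosen port on the IPv4 wildcard. IPV6_V6ONLY is assumed to be the option qpidd sets to keep the IPv6 socket from also claiming the IPv4 port.

```python
import socket

def dual_bind_once():
    """Bind port 0 on [::], then try the chosen port on 0.0.0.0."""
    s6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    # Ask the kernel not to let the IPv6 socket claim the IPv4 port too.
    s6.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
    s6.bind(("::", 0))
    s6.listen(5)
    port = s6.getsockname()[1]
    s4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s4.bind(("0.0.0.0", port))
        ok = True
    except OSError:
        ok = False  # the "Address already in use" failure mode from this bug
    finally:
        s4.close()
        s6.close()
    return port, ok

if __name__ == "__main__":
    for _ in range(100):
        port, ok = dual_bind_once()
        if not ok:
            print("bind clash on port", port)
```

Running several copies of this in parallel on an affected host should show whether the clash is reproducible outside qpidd.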
Created attachment 741993 [details]
Reproducer

This script reproduces the problem for me, usually within 10 minutes. I think this may be a bug in qpidd: qpidd tries to bind its port on all interfaces, but for port=0 it does bind(0) on the first interface, which returns port P, and then tries to do bind(P) on the remaining interfaces. There's no guarantee that the port is free on all the other interfaces.

Either way, I am working on a solution that gets rid of --port=0: a small python script does the bind(0) and passes the socket to qpidd using the --socket-fd argument. Running the reproducer for 24 hours, I saw no problems with this approach.
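A minimal sketch of that workaround, assuming qpidd's --socket-fd option accepts an already-bound, listening descriptor (the helper name here is hypothetical, and the launch command is shown only as a comment):

```python
import socket

def allocate_listener():
    """Bind port 0 once and keep the socket open, so the kernel-chosen
    port cannot be taken by any other process before the broker uses it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen(5)
    return s, s.getsockname()[1]

sock, port = allocate_listener()
# The broker would then be launched along the lines of (illustrative):
#   qpidd --socket-fd <sock.fileno()> --port <port>
# with the descriptor left inheritable by the child process.
```

Because the port is bound exactly once and the socket is never closed before handing it over, the bind(P)-on-another-interface race described above cannot occur.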
Committed to trunk: solution for the HA tests. Still need a fix for the other tests.

------------------------------------------------------------------------
r1482882 | aconway | 2013-05-15 11:05:05 -0400 (Wed, 15 May 2013) | 2 lines

QPID-4745: Clean up test_backup_acquired to use HaCluster.
------------------------------------------------------------------------
r1482881 | aconway | 2013-05-15 11:03:44 -0400 (Wed, 15 May 2013) | 16 lines

QPID-4745: HA safe port allocation for brokers in HA tests.

Many HA tests use --port=0 to start a broker on an available port, but then
need to shut down and restart the broker on the same port. This is not safe:
on a busy system it is possible for another process to take the port between
the time the broker is shut down and the time it is restarted.

The solution is to do bind(0) and listen in the python test framework (class
HaPort) and let the broker use the socket via qpidd --socket-fd. When the
broker is shut down the port remains bound by the python process. When the
broker is restarted it is again given access to the socket via --socket-fd.

Other changes:
- move ha_store_tests into ha_tests.
- add heartbeats to avoid stalling.
------------------------------------------------------------------------
This commit introduces a tool to do bind(0) outside of qpidd. Tests have not yet been updated to use this script; this is less urgent since gsim committed an alternative fix to the port clash problem:

------------------------------------------------------------------------
r1492778 | gsim | 2013-06-13 13:45:56 -0400 (Thu, 13 Jun 2013) | 2 lines

NO-JIRA: restrict broker to listen only on loopback interface (which is all
that gets used by clients) in order to avoid port collisions
------------------------------------------------------------------------
This has been addressed by the --interface flag to qpidd. Using --interface 127.0.0.1 prevents the problem, which does seem to be related to qpidd's default "bind to everything" behavior.
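For reference, the same restriction can be applied either on the command line (qpidd --port 0 --interface 127.0.0.1) or, assuming the usual qpidd configuration-file convention of one long option per line without the leading dashes, in qpidd.conf (values here are illustrative):

```
port=0
interface=127.0.0.1
```

With the broker listening only on loopback, the random port chosen by bind(0) never has to be re-bound on a second interface, so the collision cannot occur.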