Description of problem:

goferd behaves oddly on clients after the qdrouterd service is restarted on the capsule they are connected to. The katello services (mongo, pulp, qdrouterd, httpd, etc.) are restarted daily during backup on the capsules, and the client's goferd is unable to reconnect to the capsule's qdrouterd after the service is restarted. (22:10 is when the services were restarted on the capsule.)

Somehow the client (nil06) believes it reconnected to sat6cap01 successfully after qdrouterd restarted:

-- snip --
Oct 25 22:10:05 nil06.devlab.redhat.com goferd[11928]: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:525 - Disconnected
Oct 25 22:10:15 nil06.devlab.redhat.com goferd[11928]: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] gofer.messaging.adapter.proton.connection:100 - conn...ion: None
Oct 25 22:10:15 nil06.devlab.redhat.com goferd[11928]: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:485 - connecting to sat6cap01.util.phx1.redhat.com:5647...
Oct 25 22:10:16 nil06.devlab.redhat.com goferd[11928]: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] gofer.messaging.adapter.proton.connection:103 - conn....com:5647
Oct 25 22:10:16 nil06.devlab.redhat.com goferd[11928]: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:505 - connected to sat6cap01.util.phx1.redhat.com:5647
-- snip --

But no established goferd -> qdrouterd connection is found:

[root@nil06 ~]# lsof -i :5647
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
python  11928 root    8u  IPv4 80106078      0t0  TCP nil06.devlab.redhat.com:38036->sat6cap01.util.phx1.redhat.com:5647 (CLOSE_WAIT)
python  11928 root   14u  IPv4 80106146      0t0  TCP nil06.devlab.redhat.com:38038->sat6cap01.util.phx1.redhat.com:5647 (CLOSE_WAIT)

No gofer log entry appears after "Oct 25 22:10:16" in /var/log/messages:

[root@nil06 ~]# grep gofer /var/log/messages | tail -n 3
Oct 25 22:10:15 nil06 goferd: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:485 - connecting to sat6cap01.util.phx1.redhat.com:5647...
Oct 25 22:10:16 nil06 goferd: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] gofer.messaging.adapter.proton.connection:103 - connected: amqps://sat6cap01.util.phx1.redhat.com:5647
Oct 25 22:10:16 nil06 goferd: [INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:505 - connected to sat6cap01.util.phx1.redhat.com:5647

[nil06.devlab.redhat.com] [07:27:15 PM]
[root@nil06 ~]# date
Mon Oct 26 19:27:18 EDT 2015

The connection has been broken ever since. Restarting goferd on the client fixes the issue.

The problem is: this behavior only happens overnight, after the existing goferd <--> qdrouterd connection has been established for a period of time (a few hours). It happened consistently on 2 Satellite systems (both prod and dev) by:
1. restarting goferd on the client the day before
2. verifying the connection is established and functioning
3. checking again the second day: the connection is broken, and only the CLOSE_WAIT connections are visible from 'lsof -i :5647' on the client

I cannot reproduce it if I manually restart the capsule services immediately after the goferd <-> qdrouterd connection is established; in that case the client reconnects to the capsule just fine.
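To check other clients without eyeballing lsof, a small script can report the kernel TCP state of every socket a given goferd PID holds toward port 5647. This is a hypothetical diagnostic sketch, not part of gofer; it reads /proc directly (so run it as root) and only looks at IPv4 sockets:

# close_wait_check.py -- hypothetical diagnostic sketch, not part of gofer.
# Prints the kernel TCP state of every IPv4 socket the given PID holds
# toward the capsule's AMQP port (5647) -- the same information as
# 'lsof -i :5647', but scriptable across many clients.
import glob
import os
import sys

AMQP_PORT = 5647
STATES = {'01': 'ESTABLISHED', '06': 'TIME_WAIT', '08': 'CLOSE_WAIT'}

def socket_inodes(pid):
    # Socket fds in /proc/<pid>/fd are symlinks reading 'socket:[<inode>]'.
    inodes = set()
    for fd in glob.glob('/proc/%d/fd/*' % pid):
        try:
            target = os.readlink(fd)
        except OSError:
            continue
        if target.startswith('socket:['):
            inodes.add(target[8:-1])
    return inodes

def amqp_states(pid):
    # /proc/net/tcp columns: sl, local, remote, state, ..., inode;
    # addresses and ports are hex-encoded.
    inodes = socket_inodes(pid)
    with open('/proc/net/tcp') as tcp:
        next(tcp)  # skip the header line
        for line in tcp:
            fields = line.split()
            if len(fields) < 10 or fields[9] not in inodes:
                continue
            if int(fields[2].split(':')[1], 16) == AMQP_PORT:
                yield STATES.get(fields[3], 'state 0x' + fields[3])

if __name__ == '__main__':
    for state in amqp_states(int(sys.argv[1])):
        print(state)

Run against the goferd PID (e.g. 'python close_wait_check.py 11928'): a healthy client prints ESTABLISHED, a broken one prints only CLOSE_WAIT.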
Version-Release number of selected component (if applicable):

capsule:
python-qpid-0.30-6.el7.noarch
qpid-proton-c-0.9-7.el7.x86_64
qpid-dispatch-router-0.4-10.el7.x86_64
qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-server-linearstore-0.30-9.el7.x86_64
libqpid-dispatch-0.4-10.el7.x86_64
python-gofer-qpid-2.6.6-2.el7sat.noarch
python-qpid-proton-0.9-7.el7.x86_64
qpid-cpp-server-0.30-9.el7.x86_64
qpid-tools-0.30-4.el7.noarch
qpid-cpp-client-0.30-9.el7.x86_64
python-qpid-qmf-0.30-5.el7.x86_64

client:
python-gofer-2.6.6-2.el7sat.noarch
gofer-2.6.6-2.el7sat.noarch
python-gofer-qpid-1.3.0-1.el7sat.noarch
python-gofer-proton-2.6.6-2.el7sat.noarch

How reproducible:
Only happens overnight; a bit hard to reproduce.

Steps to Reproduce:
1. Restart goferd on the client the day before.
2. Verify the connection is established and functioning (a hypothetical helper is sketched under Additional info below).
3. Check the connection again the next day.

Actual results:
The connection is broken, and only CLOSE_WAIT connections are visible from 'lsof -i :5647' on the client.

Expected results:
The connection should remain established, or goferd should reconnect cleanly after the capsule services restart.

Additional info:
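For step 2 above, a hypothetical helper that mirrors the 'lsof -i :5647' check from the description and returns True only while an ESTABLISHED connection to the capsule exists:

# verify_connection.py -- hypothetical step-2 helper, mirroring the
# 'lsof -i :5647' check used in this report; not part of gofer.
import subprocess

def connection_established(port=5647):
    # lsof exits non-zero when nothing matches the filter.
    try:
        out = subprocess.check_output(['lsof', '-i', ':%d' % port])
    except subprocess.CalledProcessError:
        return False
    return 'ESTABLISHED' in out.decode()

if __name__ == '__main__':
    print(connection_established())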
Apart from the broken goferd -> qdrouterd connection, there seems to be another side effect: the goferd daemon uses nearly 100% CPU *all the time* after the connection is broken.

[root@nil06 ~]# ps aux | grep 25035
root     25035 95.8  4.0 1089836 40748 ?       Ssl  Nov02 3227:26 python /usr/bin/goferd --foreground

[root@nil06 ~]# lsof -i :5647
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
python  25035 root   17u  IPv4 96490511      0t0  TCP nil06.devlab.redhat.com:52851->sat6cap01.util.phx1.redhat.com:5647 (CLOSE_WAIT)
python  25035 root   22u  IPv4 96572826      0t0  TCP nil06.devlab.redhat.com:52936->sat6cap01.util.phx1.redhat.com:5647 (CLOSE_WAIT)

Yet strace shows the attached thread seemingly doing nothing, parked on a futex (the busy loop is presumably in another thread):

[root@nil06 ~]# strace -p 25035
Process 25035 attached
futex(0x227bab0, FUTEX_WAIT_PRIVATE, 0, NULL^CProcess 25035 detached
 <detached ...>
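Until a fix is available, a watchdog along these lines could automate the goferd restart that is otherwise done by hand (see the comments below). This is a workaround sketch, not anything gofer ships; the 5-minute interval and the ss/systemctl invocations are assumptions:

# goferd_watchdog.py -- hypothetical workaround sketch, not shipped with
# gofer: restart goferd whenever all of its connections toward the
# capsule's AMQP port are stuck in CLOSE-WAIT.
import subprocess
import time

def capsule_states(port=5647):
    # 'ss -tan' columns: State Recv-Q Send-Q Local:Port Peer:Port
    states = []
    out = subprocess.check_output(['ss', '-tan']).decode()
    for line in out.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 5 and fields[4].endswith(':%d' % port):
            states.append(fields[0])
    return states

while True:
    states = capsule_states()
    if states and all(s == 'CLOSE-WAIT' for s in states):
        # Same remedy as restarting the agent by hand, just automated.
        subprocess.call(['systemctl', 'restart', 'goferd'])
    time.sleep(300)  # re-check every 5 minutes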
It happens to me as well. Any ideas?
Exactly the same here.

"How reproducible: only happens overnight, a bit hard to reproduce."
=> Run katello-backup, I guess. All our clients go into this state at around the same time.

This is highly critical for us: we are forced to restart all client agents after the backup completes, and the hypervisors are impacted as well, since all our clients run in VMs.
I suspect the root cause is https://issues.apache.org/jira/browse/PROTON-1090.
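If that is indeed the cause (the client never noticing a half-closed SSL socket), AMQP heartbeats are the usual way to surface such a failure: with an idle timeout negotiated, the client tears down the connection itself when no frames arrive within the window, instead of hanging on a dead socket. A minimal sketch, assuming a python-qpid-proton build whose BlockingConnection accepts the heartbeat keyword; the queue address is taken from the logs above and is purely illustrative:

# heartbeat_probe.py -- illustrative only; this is not goferd's code path.
from proton import ProtonException
from proton.utils import BlockingConnection

try:
    # heartbeat=10 negotiates an AMQP idle timeout: if no frame (not even
    # an empty heartbeat frame) arrives in time, proton raises instead of
    # waiting forever on a dead, half-closed socket.
    conn = BlockingConnection(
        'amqps://sat6cap01.util.phx1.redhat.com:5647', heartbeat=10)
    receiver = conn.create_receiver(
        'pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f')
    conn.close()
except ProtonException as e:
    print('connection lost, goferd would need to reconnect: %s' % e)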
Couldn't it be a combination of https://bugzilla.redhat.com/show_bug.cgi?id=1272596 and https://bugzilla.redhat.com/show_bug.cgi?id=1292026 / https://bugzilla.redhat.com/show_bug.cgi?id=1281947? Several such issues have been fixed recently; I recommend upgrading to the latest qpid-python and gofer versions, which should help.
In my environment I can confirm that this issue is fixed with Sat6.1.6 and updating all clients to Sat6-Tools6.1.6.
IMHO dup of https://bugzilla.redhat.com/show_bug.cgi?id=1272596, closing this. *** This bug has been marked as a duplicate of bug 1272596 ***
I confirmed that the issue is fixed with Sat6.1.6+