Bug 1276176

Summary: goferd on client does not reconnect back to capsule properly when qdrouterd on the capsule side is restarted
Product: Red Hat Satellite
Reporter: Neil Miao <nmiao>
Component: Foreman Proxy
Assignee: Jeff Ortel <jortel>
Status: CLOSED DUPLICATE
QA Contact: Katello QA List <katello-qa-list>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.1.3
CC: bbuckingham, chrobert, davy.stoffel, ealcaniz, jcallaha, nmiao, peter.vreman, pmoravec, rmahique
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-08 08:52:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1122832

Description Neil Miao 2015-10-29 01:06:06 UTC
Description of problem:

goferd behaves oddly on clients after the qdrouterd service is
restarted on the capsule they are connected to.

The katello services (mongo, pulp, qdrouterd, httpd, etc.) are restarted
daily during backup on the capsules. It seems the client's goferd is unable to
reconnect to the capsule's qdrouterd after the service is restarted.

(22:10 is when the services were restarted on the capsule)

Somehow the client (nil06) thinks it reconnected to sat6cap01 successfully after qdrouterd was restarted.

-- snip --
Oct 25 22:10:05 nil06.devlab.redhat.com goferd[11928]:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:525 -
Disconnected
Oct 25 22:10:15 nil06.devlab.redhat.com goferd[11928]:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f]
gofer.messaging.adapter.proton.connection:100 - conn...ion: None
Oct 25 22:10:15 nil06.devlab.redhat.com goferd[11928]:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:485 -
connecting to sat6cap01.util.phx1.redhat.com:5647...
Oct 25 22:10:16 nil06.devlab.redhat.com goferd[11928]:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f]
gofer.messaging.adapter.proton.connection:103 - conn....com:5647
Oct 25 22:10:16 nil06.devlab.redhat.com goferd[11928]:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:505 -
connected to sat6cap01.util.phx1.redhat.com:5647
-- snip --

But no established goferd -> qdrouterd connection is found.

[root@nil06 ~]# lsof -i :5647
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 11928 root 8u IPv4 80106078 0t0 TCP
nil06.devlab.redhat.com:38036->sat6cap01.util.phx1.redhat.com:5647
(CLOSE_WAIT)
python 11928 root 14u IPv4 80106146 0t0 TCP
nil06.devlab.redhat.com:38038->sat6cap01.util.phx1.redhat.com:5647
(CLOSE_WAIT)
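
For reference, a quick way to tell a live connection from this stuck state is to group the sockets to the capsule port by TCP state; this is only an illustrative check, assuming port 5647 as above:

# count sockets to port 5647 by state (expect ESTABLISHED when healthy,
# only CLOSE_WAIT when goferd is stuck)
lsof -i :5647 | awk 'NR>1 {print $NF}' | sort | uniq -c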

No gofer log entries appear after "Oct 25 22:10:16" in /var/log/messages

[root@nil06 ~]# grep gofer /var/log/messages | tail -n 3
Oct 25 22:10:15 nil06 goferd:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:485 -
connecting to sat6cap01.util.phx1.redhat.com:5647...
Oct 25 22:10:16 nil06 goferd:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f]
gofer.messaging.adapter.proton.connection:103 - connected:
amqps://sat6cap01.util.phx1.redhat.com:5647
Oct 25 22:10:16 nil06 goferd:
[INFO][pulp.agent.f685caec-fbca-40c2-8461-f0af7a31078f] root:505 -
connected to sat6cap01.util.phx1.redhat.com:5647
[nil06.devlab.redhat.com] [07:27:15 PM]
[root@nil06 ~]# date
Mon Oct 26 19:27:18 EDT 2015

The connection has been broken ever since. Restarting goferd on the client fixes the issue.
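
As a stop-gap, that restart workaround can be scripted; a minimal sketch (hypothetical helper, assuming the capsule connection uses port 5647 and goferd runs under systemd):

#!/bin/bash
# restart-goferd-if-stuck.sh - illustrative workaround only, not an official fix.
# If goferd still holds sockets to the capsule port but none is ESTABLISHED,
# assume it is stuck in CLOSE_WAIT and bounce the service.
PORT=5647
if lsof -i :"$PORT" | grep -q CLOSE_WAIT && ! lsof -i :"$PORT" | grep -q ESTABLISHED; then
    systemctl restart goferd
fi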

The problem is:

This behavior only happens overnight, once the existing goferd <--> qdrouterd connection has been up for a period of time (a few hours).

It happened consistently on two Satellite systems (both prod and dev), reproduced by:

1. restart goferd on the client the day before
2. verify the connection is established and functioning
3. check again the next day: the connection is broken, and only CLOSE_WAIT connections are visible in 'lsof -i :5647' on the client.

I cannot reproduce it by manually restarting the capsule services immediately after the goferd <-> qdrouterd connection is established; in that case the client reconnects to the capsule just fine.

Version-Release number of selected component (if applicable):

capsule:
python-qpid-0.30-6.el7.noarch
qpid-proton-c-0.9-7.el7.x86_64
qpid-dispatch-router-0.4-10.el7.x86_64
qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-server-linearstore-0.30-9.el7.x86_64
libqpid-dispatch-0.4-10.el7.x86_64
python-gofer-qpid-2.6.6-2.el7sat.noarch
python-qpid-proton-0.9-7.el7.x86_64
qpid-cpp-server-0.30-9.el7.x86_64
qpid-tools-0.30-4.el7.noarch
qpid-cpp-client-0.30-9.el7.x86_64
python-qpid-qmf-0.30-5.el7.x86_64

client:
python-gofer-2.6.6-2.el7sat.noarch
gofer-2.6.6-2.el7sat.noarch
python-gofer-qpid-1.3.0-1.el7sat.noarch
python-gofer-proton-2.6.6-2.el7sat.noarch

How reproducible:
Only happens overnight; a bit hard to reproduce.

Steps to Reproduce:
1. restart goferd on the client the day before
2. verify the connection is established and functioning
3. check again the next day: the connection is broken, and only CLOSE_WAIT connections are visible in 'lsof -i :5647' on the client.

Actual results:
the connection is broken, and only CLOSE_WAIT connections are visible in 'lsof -i :5647' on the client.

Expected results:
the connection should remain established, or goferd should reconnect properly after qdrouterd is restarted.

Additional info:

Comment 1 Neil Miao 2015-11-05 09:17:11 UTC
Apart from the broken goferd -> qdrouterd connection, it seems to have another side effect:

The goferd daemon is using nearly 100% CPU *all the time* after the connection is broken. 

[root@nil06 ~]# ps aux | grep 25035
root     25035 95.8  4.0 1089836 40748 ?       Ssl  Nov02 3227:26 python /usr/bin/goferd --foreground

[root@nil06 ~]# lsof -i :5647
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
python  25035 root   17u  IPv4 96490511      0t0  TCP nil06.devlab.redhat.com:52851->sat6cap01.util.phx1.redhat.com:5647 (CLOSE_WAIT)
python  25035 root   22u  IPv4 96572826      0t0  TCP nil06.devlab.redhat.com:52936->sat6cap01.util.phx1.redhat.com:5647 (CLOSE_WAIT)

The python process is sitting there doing nothing.

[root@nil06 ~]# strace -p 25035
Process 25035 attached
futex(0x227bab0, FUTEX_WAIT_PRIVATE, 0, NULL^CProcess 25035 detached
 <detached ...>
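
To see which thread is actually burning the CPU while the main thread sits in the futex wait above, per-thread views help; illustrative commands only (PID 25035 as in the ps output):

# per-thread CPU usage for the goferd process
top -H -p 25035

# summarize syscalls across all threads for ~10 seconds
timeout 10 strace -f -c -p 25035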

Comment 2 Edu Alcaniz 2015-12-14 13:27:47 UTC
It happens to me as well. Any ideas?

Comment 5 Davy 2016-01-26 09:56:29 UTC
Exactly the same here.

How reproducible:
only happens overnight, a bit hard to reproduce.

=> Run katello-backup, I guess. All our clients go into this state around the same time.
Highly critical: we are forced to restart all client agents after the backup completes, and the hypervisors are impacted as well, since all the clients run in VMs.

Comment 6 Jeff Ortel 2016-01-26 15:29:53 UTC
I suspect the root cause is https://issues.apache.org/jira/browse/PROTON-1090.

Comment 7 Pavel Moravec 2016-02-08 07:45:35 UTC
Can't it be a combination of https://bugzilla.redhat.com/show_bug.cgi?id=1272596 and https://bugzilla.redhat.com/show_bug.cgi?id=1292026 / https://bugzilla.redhat.com/show_bug.cgi?id=1281947 ?

Several such issues have been fixed recently; I recommend upgrading to the latest qpid-python and gofer versions, which should help.
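
For reference, a minimal sketch of checking and updating the relevant client packages (package names taken from the version list above; the exact target versions depend on the Satellite Tools release, and the client needs the Satellite Tools repository enabled):

# show the currently installed gofer/proton client packages
rpm -qa 'gofer*' 'python-gofer*' 'python-qpid*' 'qpid-proton*'

# pull in the updated versions
yum update gofer python-gofer python-gofer-proton python-qpid-proton qpid-proton-c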

Comment 8 Peter Vreman 2016-03-07 09:38:15 UTC
In my environment I can confirm that this issue is fixed with Sat6.1.6 and updating all clients to Sat6-Tools6.1.6.

Comment 9 Pavel Moravec 2016-03-08 08:52:17 UTC
IMHO dup of https://bugzilla.redhat.com/show_bug.cgi?id=1272596, closing this.

*** This bug has been marked as a duplicate of bug 1272596 ***

Comment 10 Neil Miao 2016-03-15 00:15:53 UTC
I confirmed that the issue is fixed with Sat6.1.6+