Bug 1334845

Summary: A backend service [ Candlepin ] is unreachable
Product: Red Hat Satellite
Component: Subscription Management
Version: 6.2.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Roman Plevka <rplevka>
Assignee: Tom McKay <tomckay>
QA Contact: Katello QA List <katello-qa-list>
CC: bbuckingham, bcourt, bkearney, rplevka, tomckay
Keywords: Triaged
Target Milestone: Unspecified
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-07-27 11:24:10 UTC

Description Roman Plevka 2016-05-10 15:56:22 UTC
Description of problem:
The Candlepin service seems to die on its own after some time. This is the second occurrence on Satellite 6.2.0 Beta (GA10.1).

I don't have clear reproduction steps, so I'll attach the foreman-debug tarball.
This happened on a CI machine a long time (days) after the automation finished.


Version-Release number of selected component (if applicable):
6.2.0 Beta (GA10.1)


Actual results:
"Oops, we're sorry but something went wrong A backend service [ Candlepin ] is unreachable"

# hammer -u admin -p changeme ping
candlepin:      
    Status:          FAIL
    Server Response:
candlepin_auth: 
    Status:          FAIL
    Server Response:
pulp:           
    Status:          ok
    Server Response: Duration: 380ms
foreman_tasks:  
    Status:          ok
    Server Response: Duration: 12ms

production.log:
2016-05-10 11:53:08 [app] [W] Connection refused - connect(2) for "localhost" port 8443
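
A failure like the `hammer ping` output above can be caught before users hit the error page by parsing the command's output in a monitoring job. The sketch below is illustrative only (the function name and embedded sample are mine, not part of the report), and assumes the indented `Status:` lines shown above:

```python
def parse_hammer_ping(output: str) -> dict:
    """Map each service name in `hammer ping` output to its Status value."""
    statuses = {}
    service = None
    for line in output.splitlines():
        if line and not line.startswith(" "):
            # Unindented lines are service headers, e.g. "candlepin:"
            service = line.rstrip(": ").strip()
        elif service and line.strip().startswith("Status:"):
            statuses[service] = line.split(":", 1)[1].strip()
    return statuses

sample = """\
candlepin:
    Status:          FAIL
    Server Response:
pulp:
    Status:          ok
    Server Response: Duration: 380ms
"""
print(parse_hammer_ping(sample))   # {'candlepin': 'FAIL', 'pulp': 'ok'}
```

A cron job could run this against live `hammer ping` output and alert whenever any value is not "ok".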

Comment 2 Brad Buckingham 2016-05-11 15:16:21 UTC
Hi Roman, Can you attach the debug?  Thanks!

Comment 5 Roman Plevka 2016-05-11 15:23:05 UTC
accidentally cleared the needinfo flag, setting it back

Comment 6 Barnaby Court 2016-05-11 15:30:01 UTC
I don't see any specific errors in Candlepin; however, it looks like the katello_event_queue is not being drained of the messages that Candlepin is sending.

From the qpid_stat_queues file in the foreman-debug info: 2.4k messages in the katello_event_queue and no connections draining them.
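
The queue depth above can be pulled out of a captured `qpid-stat -q` dump (such as the qpid_stat_queues file in a foreman-debug tarball) with a small parser. A sketch, with a hypothetical sample row; the column position of the message count varies by qpid-tools version, so it is taken as a parameter rather than assumed:

```python
def queue_field(qpid_stat_text: str, queue: str, field: int):
    """Return whitespace-separated column `field` from the named queue's row,
    or None if the queue does not appear in the dump."""
    for line in qpid_stat_text.splitlines():
        cols = line.split()
        if cols and cols[0] == queue and len(cols) > field:
            return cols[field]
    return None

# Illustrative row only; real qpid-stat output has more columns.
sample = "katello_event_queue   Y        2.4k   2.4k    0"
print(queue_field(sample, "katello_event_queue", 2))   # 2.4k
```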

Brad, how should that be routed?

Comment 7 Tom McKay 2016-05-25 17:36:48 UTC
Perhaps related (fixed?) by https://bugzilla.redhat.com/show_bug.cgi?id=1283582

Can the upstream patch be applied and tested?
https://github.com/Katello/katello/pull/6065

Comment 9 Tom McKay 2016-06-06 14:38:19 UTC
Moving to ON_QA to allow retest with the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1283582 .

Comment 10 Roman Plevka 2016-06-21 08:37:19 UTC
VERIFIED

on sat6.2.0 beta (GA16.0).

I no longer see the issue.
After collecting the dynflow_executor PID data for 6.5 hours, it seems the memory leak is gone.

VmPeak:	 2977916 kB
VmSize:	 2912376 kB
VmLck:	       0 kB
VmHWM:	 1312892 kB
VmRSS:	 1310272 kB
VmData:	 2540460 kB
VmStk:	   10244 kB
VmExe:	       4 kB
VmLib:	   30008 kB
VmPTE:	    3316 kB
VmSwap:	       8 kB
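
The Vm* readings above come from `/proc/<pid>/status`; a snapshot like this can be collected at intervals and diffed to confirm the leak is gone. A minimal sketch (the helper names are mine, and looking up the dynflow_executor PID by name is left to the caller):

```python
def vm_stats(status_text: str) -> dict:
    """Parse 'VmRSS:  1310272 kB'-style lines into {field: kB as int}."""
    stats = {}
    for line in status_text.splitlines():
        if line.startswith("Vm"):
            name, value = line.split(":", 1)
            stats[name] = int(value.split()[0])   # drop the 'kB' unit
    return stats

def read_vm(pid: int) -> dict:
    """Read the live Vm* fields for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return vm_stats(fh.read())

sample = "VmPeak:\t 2977916 kB\nVmRSS:\t 1310272 kB\n"
print(vm_stats(sample))   # {'VmPeak': 2977916, 'VmRSS': 1310272}
```

Sampling `read_vm(pid)["VmRSS"]` every few minutes and checking that it plateaus is enough to distinguish steady-state growth from a leak.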

# uptime
 04:36:45 up 3 days, 18:17,  1 user,  load average: 1.26, 1.04, 0.90
# hammer ping
candlepin:      
    Status:          ok
    Server Response: Duration: 30ms
candlepin_auth: 
    Status:          ok
    Server Response: Duration: 36ms
pulp:           
    Status:          ok
    Server Response: Duration: 46ms
foreman_tasks:  
    Status:          ok
    Server Response: Duration: 14ms

Comment 11 Bryan Kearney 2016-07-27 11:24:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501