Bug 1890487

Summary: Candlepin services go down after upgrade
Product: Red Hat Satellite Reporter: Devendra Singh <desingh>
Component: Infrastructure Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED NOTABUG QA Contact: Lukas Pramuk <lpramuk>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.8.0 CC: bcourt, ehelms, inecas, smallamp, zhunting
Target Milestone: 6.9.0 Keywords: Regression, Triaged, Upgrades
Target Release: Unused   
Hardware: x86_64   
OS: Linux   
Last Closed: 2021-01-18 15:43:00 UTC Type: Bug

Description Devendra Singh 2020-10-22 11:39:11 UTC
Description of problem: Candlepin services go down after upgrade


Version-Release number of selected component (if applicable):
6.8 Snap20

How reproducible:
1/1

Steps to Reproduce:
1. Upgrade Satellite and Capsule from 6.7.4 to 6.8.0 Snap20
2. Execute the test suite to validate the components.
3. During the execution of the test cases, the Candlepin services go down.

candlepin:        
    Status:          FAIL
    Server Response: Message: Failed to open TCP connection to localhost:8443 (Connection refused - connect(2) for "localhost" port 8443)
candlepin_events:
    Status:          FAIL
    message:         Not running
    Server Response: Duration: 6ms
candlepin_auth:  
    Status:          FAIL
    Server Response: Message: A backend service [ Candlepin ] is unreachable
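For triage, the status output above can be reduced to just the failing services. The following is a minimal sketch, assuming the text layout shown above (an unindented `name:` line followed by indented `key: value` lines). The function name `failing_services` and the `SAMPLE` constant are hypothetical and not part of any Satellite tooling.

```python
# Sketch: extract failing services from status text in the layout shown above.
# Assumption: service names are unindented and end with ":", fields are indented.

SAMPLE = """\
candlepin:        
    Status:          FAIL
    Server Response: Message: Failed to open TCP connection to localhost:8443 (Connection refused - connect(2) for "localhost" port 8443)
candlepin_events:
    Status:          FAIL
    message:         Not running
    Server Response: Duration: 6ms
candlepin_auth:  
    Status:          FAIL
    Server Response: Message: A backend service [ Candlepin ] is unreachable
"""


def failing_services(status_text):
    """Return {service: {field: value}} for services whose Status is FAIL."""
    services = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if not raw[0].isspace() and line.endswith(":"):
            # Unindented "name:" line starts a new service section.
            current = line[:-1].strip()
            services[current] = {}
        elif current is not None and ":" in line:
            # Indented "key: value" line; split on the first colon only.
            key, _, value = line.strip().partition(":")
            services[current][key.strip()] = value.strip()
    return {name: f for name, f in services.items() if f.get("Status") == "FAIL"}


if __name__ == "__main__":
    for name, fields in failing_services(SAMPLE).items():
        print(name, "->", fields.get("Server Response") or fields.get("message"))
```

All three Candlepin entries are reported as failing here, which is consistent with the Candlepin process itself being down rather than a single endpoint misbehaving.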


Actual results:
Candlepin services go down after the upgrade.

Expected results:
Candlepin services should not go down after the upgrade. 

Additional info:

Comment 2 Sudhir Mallamprabhakara 2020-10-22 12:22:42 UTC
Adding the Regression keyword. Snap 19 looked good.

Comment 4 Zach Huntington-Meath 2020-11-02 15:44:47 UTC
Was this fixed in any candlepin version?

Comment 7 Eric Helms 2020-11-18 02:59:11 UTC
The most common cause of this is the system hitting an 'Out of memory' exception; tomcat is the first service to be killed when that happens on a Satellite. Looking through the sosreport, I see that condition:

var/log/messages-20201115:Nov 14 05:01:27 qe-sat6-upgrade-rhel7 kernel: Out of memory: Kill process 30358 (java) score 84 or sacrifice child


Devendra, can you re-test this and, if you run into the same Candlepin connection failure, check whether an OOM condition was encountered? If so, I would opt to close this as not a bug.
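
One way to run that check is to scan the system log for kernel OOM-killer lines. The following is a minimal sketch that matches the message format quoted above. The `Killed process` variant is an assumption about other kernel versions, and the names `OOM_RE` and `find_oom_kills` are hypothetical.

```python
import re

# Matches "Out of memory: Kill process 30358 (java) ..." as quoted above.
# Assumption: some kernels log "Killed process" instead, so both are accepted.
OOM_RE = re.compile(
    r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(lines):
    """Return a list of (pid, process_name) for every OOM-killer line found."""
    hits = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            hits.append((int(match.group("pid")), match.group("name")))
    return hits


if __name__ == "__main__":
    sample = [
        "var/log/messages-20201115:Nov 14 05:01:27 qe-sat6-upgrade-rhel7 kernel: "
        "Out of memory: Kill process 30358 (java) score 84 or sacrifice child",
        "Nov 14 05:01:28 qe-sat6-upgrade-rhel7 systemd: unrelated line",
    ]
    print(find_oom_kills(sample))  # prints [(30358, 'java')]
```

On a live system the same function can be fed an open log file, for example `find_oom_kills(open("/var/log/messages", errors="replace"))`. A `java` entry alone does not prove that the killed process was tomcat, so the PID would still need to be cross-checked.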

Comment 8 Devendra Singh 2020-11-23 05:25:26 UTC
(In reply to Eric Helms from comment #7)
> The most common cause of this is the system hitting an 'Out of memory'
> exception, tomcat is the first service to be killed when that happens on a
> Satellite. Looking through the sosreport I see that condition:
> 
> var/log/messages-20201115:Nov 14 05:01:27 qe-sat6-upgrade-rhel7 kernel: Out
> of memory: Kill process 30358 (java) score 84 or sacrifice child
> 
> 
> Devendra, can you re-test this and if you run into the same Candlepin
> connection failure check to see if an OOM condition was encountered? If so,
> then I would opt to close this not a bug.

I re-tested but didn't see the OOM problem; the last time I saw it was in 6.8.0 Snap20.

Comment 9 Zach Huntington-Meath 2020-11-23 15:13:34 UTC
If this issue has not been seen since then, is it still an issue? Can it still be reproduced?

Comment 10 Devendra Singh 2020-11-23 15:36:21 UTC
(In reply to Zach Huntington-Meath from comment #9)
> If this issue is not seen since then is this still an issue or can it be
> reproduced still?

No. The issue is intermittent: I saw it in 6.8 Snap20 and 6.8.1 Snap3 (not in 6.8.1 Snap1, Snap2, or Snap4).
I tested the upgrade with a recent 6.8.1 snap (6.7.4 --> 6.8.1 Snap4 and 6.8.0 --> 6.8.1 Snap4) but didn't see this issue.