Bug 2053928

Summary: Satellite UI suddenly shows "Connection refused - connect(2) for 10.74.xxx.yyy:443 (Errno::ECONNREFUSED) Plus 6 more errors" for a capsule even if there are no connectivity issue present in Satellite\Capsule 7.0
Product: Red Hat Satellite
Reporter: Sayan Das <saydas>
Component: Capsule - Content
Assignee: Justin Sherrill <jsherril>
Status: CLOSED ERRATA
QA Contact: Vladimír Sedmík <vsedmik>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 6.11.0
CC: aruzicka, ehelms, inecas, jsherril, pcreech, sajha, vsedmik
Target Milestone: 6.11.0
Keywords: Triaged
Target Release: Unused
Hardware: All
OS: Unspecified
Whiteboard:
Fixed In Version: tfm-rubygem-katello-4.3.0.15-1,tfm-rubygem-katello-4.3.0.38-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-07-05 14:33:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sayan Das 2022-02-13 09:49:56 UTC
Description of problem:

Everything was working fine, but after a failed capsule sync an error message started popping up on the Infrastructure --> Capsules page whenever I refresh the features of the capsule, even though the communication status of the capsule is green and active.

Error
Connection refused - connect(2) for 10.74.xxx.yyy:443 (Errno::ECONNREFUSED) Plus 6 more errors

10.74.xxx.yyy is capsule IP.


Version-Release number of selected component (if applicable):

Satellite\Capsule 7.0 (Snap 9.0)



How reproducible:

Always (after the failed capsule sync)


Steps to Reproduce: [ Not the exact steps but I am documenting the sequence of events that led to this discovery ]

* Install Satellite + Capsule 7.0 

* Associate the Library environment with the Capsule server for content syncing.

* Enable and sync one or two repos in Satellite and let the auto-sync of the capsule server complete as well.

* Go to Satellite UI --> Infrastructure --> Capsules --> <Click open Capsule>, manually initiate a complete\optimized sync for the capsule, and then use the "Cancel Sync" action.

* Once the sync task is canceled, click on the "Refresh Features" button of the capsule.




Actual results:

Communication status: Active (Green)

Clicked on "Refresh Features", and then some messages appear in the top right corner:


In green:

Success
No changes were found when refreshing features from capsule.example.com.

In red:

Error
Connection refused - connect(2) for 10.74.xxx.yyy:443 (Errno::ECONNREFUSED) Plus 6 more errors


But the communication status is still Active (Green), and from the Satellite I can definitely connect to port 443 of the capsule using openssl, curl, or nc.
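
The manual check described above (openssl, curl, or nc against port 443) can also be scripted. This is only a minimal sketch, assuming Python 3 is available on the Satellite host; `can_connect` is a hypothetical helper and not part of Satellite or hammer. It tests whether a TCP connection can be opened right now, which is a separate question from whether an earlier sync task recorded ECONNREFUSED.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened right now.

    Hypothetical helper for illustration only; it checks TCP reachability
    and does not perform a TLS handshake or any HTTP request.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused (ECONNREFUSED), timeouts and DNS lookup
        # failures are all subclasses of OSError.
        return False
```

For example, `can_connect("capsule.example.com")` returning True while the UI still shows the ECONNREFUSED pop-up would match the behaviour reported here.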

The hammer command shows the following tasks:

# hammer task list --search "label ~ Capsule and result != success"
-------------------------------------|-----------------------------------------------------------|---------|---------|---------------------|---------------------|--------------|-------|---------------------------------------------------------------------------------
ID                                   | ACTION                                                    | STATE   | RESULT  | STARTED AT          | ENDED AT            | DURATION     | OWNER | TASK ERRORS                                                                     
-------------------------------------|-----------------------------------------------------------|---------|---------|---------------------|---------------------|--------------|-------|---------------------------------------------------------------------------------
d9dea867-e945-4dde-9e92-822bb8809c12 | Synchronize capsule 'capsule.example.com'                 | stopped | warning | 2022/02/13 07:39:01 | 2022/02/13 08:22:06 | 00:43:04.813 | admin | Pulp task error, Pulp task error, Could not lookup a publication_href for rep...
708f46c6-bf6d-4e5e-9306-d83dfb777616 | Synchronize capsule 'capsule.example.com'                 | stopped | warning | 2022/02/13 07:07:00 | 2022/02/13 07:29:16 | 00:22:16.181 | admin | Connection refused - connect(2) for 10.74.195.193:443 (Errno::ECONNREFUSED), ...
388f349e-1a75-47cd-9800-318462f5ebb4 | Sync Content View on Capsule(s)                           | stopped | warning | 2022/02/13 07:07:00 | 2022/02/13 07:29:16 | 00:22:16.599 | admin | A sub task failed                                                               
8e2e1498-dddd-4b5d-9154-95a9225d581e | Synchronize capsule 'capsule.example.com'                 | stopped | warning | 2022/02/13 07:07:00 | 2022/02/13 07:29:16 | 00:22:16.721 | admin | Connection refused - connect(2) for 10.74.195.193:443 (Errno::ECONNREFUSED), ...
de63799f-fa46-4f77-8588-ae85b80f04a2 | Sync Content View on Capsule(s)                           | stopped | warning | 2022/02/13 07:06:59 | 2022/02/13 07:29:16 | 00:22:17.064 | admin | A sub task failed                                                               
-------------------------------------|-----------------------------------------------------------|---------|---------|---------------------|---------------------|--------------|-------|---------------------------------------------------------------------------------
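
To pick out the tasks whose recorded errors mention ECONNREFUSED, the table output above can be parsed as plain text. The following is only a sketch under assumptions: it expects the pipe-delimited layout shown above, and `parse_hammer_table` and `refused_tasks` are hypothetical helper names, not hammer features.

```python
from typing import Dict, List


def parse_hammer_table(text: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited table (as printed above) into row dictionaries."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the shell prompt line, and dashed separator lines.
        if "|" not in line or set(line) <= {"-", "|"}:
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # the first real row holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def refused_tasks(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the tasks whose recorded errors mention ECONNREFUSED."""
    return [row for row in rows if "ECONNREFUSED" in row.get("TASK ERRORS", "")]
```

Applied to the five tasks listed above, this would select the two "Synchronize capsule" tasks that recorded the connection error at the time of the failed sync.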


At this stage, the error pops up even if I just visit Satellite UI --> Infrastructure --> Capsules --> <Click open Capsule>.


Then I cleaned up those tasks:

# foreman-rake foreman_tasks:cleanup TASK_SEARCH='label ~ Capsule and result != success' STATES='stopped' VERBOSE=true 
About to remove 5 tasks matching filter
0/5
5/5
Deleted 5 tasks matching filter
No orphaned task locks found, skipping.
About to remove 3 orphaned task links
0/3
3/3
Deleted 3 orphaned task links
No orphaned execution plans found, skipping.
No orphaned job invocations found, skipping.


After that, the issue is no longer present.


Expected results:

Show such errors only when a communication issue is actually present, rather than based on the status of the last failed capsule sync task (which may have hit a connection issue at that point in time).


Additional info:

NA

Comment 3 Sayan Das 2022-02-28 12:40:30 UTC
With Snap 10, the scenario seems to have improved (but I could be wrong).

I tried reproducing the same issue yesterday and was able to reproduce it, but this is what I see now:

A) The last capsule sync attempt stopped with a warning.

B) The sync status of the capsule refers to that same task and shows the capsule's sync progress bar in red, meaning there are some issues with the sync.

C) It pops up a message like "Some Pulp errors occurred" or something similar instead of the exact error message of the failure, which is less confusing and does not suggest an incorrect situation, i.e. a "Connection issue".

D) When I initiate a manual capsule sync and let it complete, the message goes away, since the last capsule sync task has now completed successfully.


So I believe we should be good here with Snap 10 (but I will do one more round of testing later today).

Comment 4 Justin Sherrill 2022-03-09 02:14:30 UTC
I can't seem to reproduce even your issue Sayan on tfm-rubygem-katello-4.3.0.9-1.el7sat.noarch  

I did get a popup, but it just said 'Task Cancelled', which seems fine.  If you can reproduce, would you be able to attach a screenshot and maybe share the reproducer?

Thanks

Comment 5 Sayan Das 2022-03-09 06:31:19 UTC
(In reply to Justin Sherrill from comment #4)
> I can't seem to reproduce even your issue Sayan on
> tfm-rubygem-katello-4.3.0.9-1.el7sat.noarch  
> 
> I did get a popup, but it just said 'Task Cancelled', which seems fine.  If
> you can reproduce, would you be able to attach a screenshot and maybe share
> the reproducer?
> 
> Thanks

I later re-tested this scenario, and it seems the issue initially occurred twice because of a very specific situation:

A) The capsule was configured with only 8 GB of RAM.

B) A "Complete Sync" was being executed for the first time, and that put the capsule under an extreme amount of load.

C) The capsule became unresponsive and hit an OOM, which killed the httpd and pulpcore services.

D) When I restarted them, they came up fine, but between points C and D Satellite already had the task running. Because httpd and pulp were killed on the capsule, the task started getting "Connection refused - connect(2) for 10.74.xxx.yyy:443 (Errno::ECONNREFUSED) Plus 6 more errors", which in the end caused the capsule sync task to stop with a warning.


The easiest way to reproduce this would be:

* Start a Complete Sync of the capsule.

* Wait for the task to reach 25% or a bit more.

* Restart all capsule services [ while monitoring the production.log of Satellite ].

* Once the capsule services are back online, check in UI --> Infra --> Capsules --> Click open the Capsule, and we would see:

The capsule sync progress bar is red and shows 100%, which here means that the last task was not successful.

And the top right corner will pop up the message "Connection refused - connect(2) for 10.74.195.240:443 (Errno::ECONNREFUSED) Plus X more errors".



While the scenario as a whole is not an issue to be concerned about, I guess the pop-up appears because we have designed the UI to also reflect the reason\status of the last failed task.

In my opinion, if we want to show an error message, then regardless of what message the failed task contains, we should say "Some error happened during the last sync of the Capsule XXXX" or something generic like that.


If a reproducer is still needed, please ping me on IRC\g-chat and I will share the details.

Comment 6 Justin Sherrill 2022-03-09 20:47:47 UTC
Ahh, that makes a lot of sense.  We could mention something like  "Last Sync failure: Connection refused - connect(2) for 10.74.195.240:443 (Errno::ECONNREFUSED)"

Comment 7 Sayan Das 2022-03-09 20:56:19 UTC
Yup, that works as well, as long as we specifically point out that the message is not related to any actual connectivity issue but to the failed task of the last sync.
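
The wording agreed on in comments 6 and 7 can be illustrated with a short sketch. This is not the Katello implementation (Katello is written in Ruby, and the actual fix is tracked in the upstream issue); `last_sync_failure_message` is a hypothetical function that only shows the idea of prefixing the stored task error so that it reads as a past sync failure rather than a current connectivity problem.

```python
from typing import Sequence


def last_sync_failure_message(errors: Sequence[str]) -> str:
    """Build a notification that attributes errors to the last failed sync.

    Hypothetical illustration of the proposed wording; returns an empty
    string when the last sync task recorded no errors.
    """
    if not errors:
        return ""
    message = f"Last Sync failure: {errors[0]}"
    extra = len(errors) - 1
    if extra > 0:
        # Mirror the "Plus N more errors" suffix seen in the UI message.
        noun = "error" if extra == 1 else "errors"
        message += f" Plus {extra} more {noun}"
    return message
```

With a single stored error, this yields exactly the form suggested in comment 6: "Last Sync failure: Connection refused - connect(2) for 10.74.195.240:443 (Errno::ECONNREFUSED)".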

Comment 9 Justin Sherrill 2022-03-23 20:04:36 UTC
Created redmine issue https://projects.theforeman.org/issues/34671 from this bug

Comment 10 Bryan Kearney 2022-03-29 16:05:03 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/34671 has been resolved.

Comment 19 errata-xmlrpc 2022-07-05 14:33:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498