Bug 1295725

Summary: [RFE] foreman-task to have warning goferd failed due to disk full
Product: Red Hat Satellite Reporter: Pavel Moravec <pmoravec>
Component: katello-agent Assignee: Katello Bug Bin <katello-bugs>
Status: CLOSED WONTFIX QA Contact: Katello QA List <katello-qa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.1.5 CC: bbuckingham, bkearney, chrobert, daniele, pmoravec, snemeth, sthirugn, yjog
Target Milestone: Unspecified Keywords: FutureFeature, Reopened
Target Release: Unused   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-30 14:51:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1353215    

Description Pavel Moravec 2016-01-05 10:58:43 UTC
Description of problem:
Assume a content host (running goferd without a problem, connected to Sat/Caps port 5647) runs out of free disk space - a user error but one that can easily happen.

In this situation, any new foreman-task propagated to goferd on this machine stalls forever (or until the timeout for package install / capsule sync, presumably a matter of hours; I haven't waited that long). The usual symptom is that goferd has no established TCP connection for a long time, even though heartbeats are enabled.

It would be nice to catch this situation (sooner) and report a meaningful error message in the foreman-task.
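
One way to catch the condition early is a free-space check before work is accepted. The following is only a minimal sketch of that idea, not how goferd works: the function name `free_space_warning` and the 10 MiB threshold are assumptions made for illustration.

```python
import shutil


def free_space_warning(path, min_free_bytes=10 * 1024 * 1024):
    """Return a warning string if the filesystem holding `path` has less
    than `min_free_bytes` free, otherwise None.

    Hypothetical helper: the name and the default threshold are
    illustrative assumptions, not part of gofer or katello-agent.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.free < min_free_bytes:
        return "low disk space on %s: %d bytes free" % (path, usage.free)
    return None
```

On a content host this could be called with the directory that holds the pending work, and any non-None result attached to the task as a warning.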


Version-Release number of selected component (if applicable):
(content host)
python-gofer-2.6.8-1.el7sat.noarch
python-gofer-proton-2.6.8-1.el7sat.noarch
gofer-2.6.8-1.el7sat.noarch

(Satellite)
ruby193-rubygem-foreman-tasks-0.6.15.7-1.el7sat.noarch


How reproducible:
100%

Steps to Reproduce:
1. Have a content host registered to Satellite 6
2. Fill its disk (at least the /var partition)
3. hammer -u admin -p password content-host package remove --content-host-id UUID --organization-id 1 --packages sos
4. Monitor TCP connections established from goferd to Satellite/Capsule port 5647
5. (After a few minutes, free the disk on the content host)
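
Step 4 can be scripted. The sketch below counts ESTABLISHED connections to remote port 5647 by parsing text in the Linux /proc/net/tcp format. It is only an illustration under assumptions: the function name `count_established` is made up here, only IPv4 is covered (/proc/net/tcp6 would be needed for IPv6), and connections from all processes are counted, not only goferd.

```python
ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp


def count_established(proc_net_tcp_text, remote_port=5647):
    """Count ESTABLISHED connections whose remote port is `remote_port`.

    `proc_net_tcp_text` is the content of /proc/net/tcp: one header line,
    then rows whose 2nd/3rd/4th columns are local address, remote address
    (both HEXIP:HEXPORT) and the hex state code.
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        remote, state = fields[2], fields[3]
        port = int(remote.rsplit(":", 1)[1], 16)  # port is hexadecimal
        if port == remote_port and state == ESTABLISHED:
            count += 1
    return count
```

Feeding it the content of /proc/net/tcp once per minute and logging the result would show the period during which no connection to port 5647 exists.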



Actual results:
3. times out / never finishes
4. no connection for a long time


Expected results:
3. finishes (sooner) with a self-explanatory error
4. an established TCP connection is present almost all the time


Additional info:
I *think* the problem is that goferd fails to write a json file with the pending work to /var/lib/gofer/messaging/pending/katelloplugin, so it has nothing to pick up later on. This write failure should be reported as a failed task.
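
The requested behaviour could look roughly like the sketch below: the write of the pending request is wrapped so that a full disk (errno ENOSPC) becomes an explicit failure that can be sent back, instead of the request being silently lost. This is an illustration only and not gofer's actual code; `store_pending`, its arguments and the one-file-per-request layout are assumptions.

```python
import errno
import json
import os


def store_pending(directory, request_id, request):
    """Persist `request` as JSON under `directory`.

    Returns (True, None) on success, or (False, message) on failure so the
    caller can report the task as failed. Hypothetical helper, not gofer API.
    """
    path = os.path.join(directory, request_id + ".json")
    try:
        with open(path, "w") as f:
            json.dump(request, f)
            f.flush()
            os.fsync(f.fileno())  # surface write errors now, not later
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False, "disk full while storing request %s in %s" % (
                request_id, directory)
        return False, "cannot store request %s: %s" % (request_id, e)
    return True, None
```

A caller would check the first element of the result and, when it is False, reply to the sender with the message rather than waiting for the task to time out.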

In parallel, goferd loses its TCP connection to qdrouterd for some time; in my reproducer it was re-established after some (tens of?) minutes. No idea what triggers this.

Comment 2 Pavel Moravec 2016-01-05 11:13:14 UTC
Removing "Improvement" due to another symptom detected:

The goferd process consumes 100% CPU after a while. That sounds like a bug rather than an improvement.

Comment 5 Pavel Moravec 2016-02-20 08:56:11 UTC
(In reply to Bryan Kearney from comment #4)
> so this is not fixed by 1295957?

Yes, thanks for spotting it. The goferd high CPU usage is no longer reproducible since qpid-proton 0.9-12 is used.

Changing back to "[Improvement] foreman-task to have warning goferd failed due to disk full" since goferd should be robust enough to report a full disk / failure to create the json file in the katelloagent dir back to katello.

Comment 6 Bryan Kearney 2016-02-20 21:24:06 UTC
Due to change... moving this out.

Comment 8 Bryan Kearney 2018-09-04 18:57:50 UTC
Thank you for your interest in Satellite 6. We have evaluated this request, and we do not expect this to be implemented in the product in the foreseeable future. We are therefore closing this out as WONTFIX. If you have any concerns about this, please feel free to contact Rich Jerrido or Bryan Kearney. Thank you.

Comment 9 Bryan Kearney 2018-09-04 19:09:02 UTC
Thank you for your interest in Satellite 6. We have evaluated this request, and we do not expect this to be implemented in the product in the foreseeable future. We are therefore closing this out as WONTFIX. If you have any concerns about this, please feel free to contact Rich Jerrido or Bryan Kearney. Thank you.

Comment 11 Bryan Kearney 2018-11-01 14:45:06 UTC
The Satellite Team is attempting to provide an accurate backlog of bugzilla requests which we feel will be resolved in the next few releases. We do not believe this bugzilla will meet those criteria, and have plans to close it out in 1 month. This is not a reflection on the validity of the request, but a reflection of the many priorities for the product. If you have any concerns about this, feel free to contact Rich Jerrido or Bryan Kearney or your account team. If we do not hear from you, we will close this bug out. Thank you.

Comment 12 Bryan Kearney 2018-11-30 14:51:14 UTC
Thank you for your interest in Satellite 6. We have evaluated this request, and while we recognize that it is a valid request, we do not expect this to be implemented in the product in the foreseeable future. This is due to other priorities for the product, and not a reflection on the request itself. We are therefore closing this out as WONTFIX. If you have any concerns about this, please do not reopen. Instead, feel free to contact Rich Jerrido or Bryan Kearney. Thank you.