Bug 1567239
| Field | Value |
|---|---|
| Summary | xinetd infinite busy loop on bad file descriptor uses 100% CPU |
| Product | [Fedora] Fedora |
| Component | xinetd |
| Version | 28 |
| Hardware | Unspecified |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Stepan Broz <sbroz> |
| Assignee | Jan Synacek <jsynacek> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | jsynacek, qe-baseos-daemons |
| Keywords | EasyFix, Patch, Reproducer |
| Fixed In Version | xinetd-2.3.15-26.fc28 |
| Clone Of | 1567227 |
| Bug Blocks | 1567227 |
| Type | Bug |
| Last Closed | 2018-06-14 19:13:47 UTC |
Description (Stepan Broz, 2018-04-13 15:54:23 UTC)

Created attachment 1421434 [details]
Re-attaching the patch from the original bug

Patch that prevents entering the busy loop
I am suggesting a patch to main.c that prevents entering the busy loop by disabling the bad file descriptor in the pollfd array in find_bad_fd().
The problem occurs when the poll() in main_loop() returns "POLLERR | POLLHUP | POLLNVAL" for a file descriptor for which find_bad_fd() finds no service (the fd is closed by the service process? A race condition?).
The current code clears the events flags for the next poll():

```c
304 #ifdef HAVE_POLL
305     ps.rws.pfd_array[fd].events = 0;
```
However, the file descriptor is not removed from the pollfd array and on next iteration, the above process repeats (forever).
Therefore, I am suggesting disabling the fd in the array by adding:

```c
ps.rws.pfd_array[fd].fd *= -1;
```
Another option might be, e.g., to remove the fd from the array completely.
The patch only improves the behaviour of xinetd (tested only in this scenario!) in that it never enters the busy loop. The failed service, however, remains dysfunctional (it never recovers) and would still require a restart (and reconfiguration) of xinetd.
I am also suggesting another modification:
```c
292 if ( SVC_FD( sp ) == fd )
293 {
294     msg( LOG_ERR, func,
295         "file descriptor of service %s has been closed",
296         SVC_ID( sp ) ) ;
297     svc_deactivate( sp ) ;
298     found = TRUE ;
299     bad_fd_count++ ;
300     break ;
```
Adding "bad_fd_count++" on line 299 fixes a case where the FD was closed (and detected by the poll()), which always resulted in the following misleading log message being displayed:
```c
316 if ( bad_fd_count == 0 )
317     msg( LOG_NOTICE, func,
318         "select reported EBADF but no bad file descriptors were found" ) ;
```
(In reply to Stepan Broz from comment #1)

> Created attachment 1421434 [details]
> Re-attaching the patch from the original bug
>
> Patch that prevents entering the busy loop
>
> I am suggesting a patch to main.c that prevents entering the busy loop by
> disabling the bad file descriptor in the pollfd array in find_bad_fd().
>
> The problem occurs when the poll() in main_loop() returns "POLLERR | POLLHUP
> | POLLNVAL" for a file descriptor for which there is no service found (The
> fd is closed by the service process? Race condition?) in find_bad_fd().

Well, the actual fix should be to find out that there is a misconfigured service and, possibly, refuse to load it.

> The current code clears the events flags for next poll():
>
>     304 #ifdef HAVE_POLL
>     305     ps.rws.pfd_array[fd].events = 0;
>
> However, the file descriptor is not removed from the pollfd array and on
> next iteration, the above process repeats (forever).
>
> Therefore, I am suggesting disabling the fd in the array by adding:
>
>     ps.rws.pfd_array[fd].fd *= -1;

I don't understand this. Why "*= -1" if a simple "= -1" will do?

> Other option might be e.g. to remove the fd from the array completely.

Well, if you can remove it and make sure that you didn't change the behaviour of the program, then it is the perfect solution.

> The behaviour of xinetd is only improved by the patch (tested only in this
> scenario!) as it never enters the busy loop. The failed service will,
> however, still remain dysfunctional (it never recovers) and would still
> require restart (and reconfiguration) of xinetd.

Two problems. If you submit a patch and say that you actually only tested it in this very scenario, don't expect it to be accepted. The second problem is that the service is still there, totally useless.

> I am also suggesting another modification:
>
>     292 if ( SVC_FD( sp ) == fd )
>     293 {
>     294     msg( LOG_ERR, func,
>     295         "file descriptor of service %s has been closed",
>     296         SVC_ID( sp ) ) ;
>     297     svc_deactivate( sp ) ;
>     298     found = TRUE ;
>     299     bad_fd_count++ ;
>     300     break ;
>
> Adding "bad_fd_count++" on line 299 fixes a case where the FD was closed
> (and detected by the poll()), which always resulted in the following
> misleading log message being displayed:
>
>     316 if ( bad_fd_count == 0 )
>     317     msg( LOG_NOTICE, func,
>     318         "select reported EBADF but no bad file descriptors were found" ) ;

So the real problem here is misconfiguration. Yes, it should be handled gracefully, but the simple "Just don't do it!", which is already kind of mentioned in the xinetd.conf, is a good enough fix in this case. Besides, people are encouraged to use systemd units.

If you can come up with a patch that detects "udp" + "wait=no" and handles that with a big fat warning, possibly an error, and make sure that the rest works fine, I will merge such a patch.

(In reply to Jan Synacek from comment #2)

Thank you for your feedback!

> Well, the actual fix should be to find out that there is a misconfigured
> service and, possibly, refuse to load it.

In addition to this fix, yes, that would be ideal.

> > Therefore, I am suggesting disabling the fd in the array by adding:
> > ps.rws.pfd_array[fd].fd *= -1;
>
> I don't understand this. Why "*= -1" if a simple "= -1" will do?

You do not lose the original fd number, it's only negated (i.e. for debugging). Both will do, yes.

> Two problems. If you submit a patch and say that you actually only tested it
> in this very scenario, don't expect it to be accepted. The second problem is
> that the service is still there, totally useless.

So it is currently, and busy looping and flooding logs as a bonus.

> > I am also suggesting another modification:
> > [the bad_fd_count suggestion quoted in full above]

Can you comment on the above part, please?

> So the real problem here is misconfiguration. [...] If you can come up with
> a patch that detects "udp" + "wait=no" [...] I will merge such patch.

Yes, the source of the problem is in the configuration. The xinetd.conf suggests using "wait=yes" for UDP services, but does not provide any warning of what could happen. I believe it is because the upstream version does not have the poll() patch and therefore the busy loop will not happen there. I may have a look for a way of preventing such a service from being loaded if time allows, but I still consider the busy loop and flooded logs an issue worth fixing.

Created attachment 1431238 [details]
Reworked patch preventing loading a udp service with wait=no
I have reworked the patch based on your comments and added the big fat error. A service with socket_type = dgram (or protocol = udp) will now refuse to load with wait = no.
(In reply to Stepan Broz from comment #4)

> Created attachment 1431238 [details]
> Reworked patch preventing loading a udp service with wait=no
>
> I have reworked the patch, based on your comments. I have added the big fat
> error. Now the service with socket_type = dgram (or protocol = udp) will not
> load with wait=no.

Thanks, I'll do some additional testing and (possibly) push it.

xinetd-2.3.15-26.fc28 has been submitted as an update to Fedora 28.
https://bodhi.fedoraproject.org/updates/FEDORA-2018-d54d442814

xinetd-2.3.15-26.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-d54d442814

xinetd-2.3.15-26.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.