Bug 1567239
| Field | Value |
|---|---|
| Summary | xinetd infinite busy loop on bad file descriptor uses 100% CPU |
| Product | [Fedora] Fedora |
| Component | xinetd |
| Version | 28 |
| Hardware | Unspecified |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Stepan Broz <sbroz> |
| Assignee | Jan Synacek <jsynacek> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | jsynacek, qe-baseos-daemons |
| Keywords | EasyFix, Patch, Reproducer |
| Fixed In Version | xinetd-2.3.15-26.fc28 |
| Clone Of | 1567227 |
| Bug Blocks | 1567227 |
| Type | Bug |
| Last Closed | 2018-06-14 19:13:47 UTC |
Description (Stepan Broz, 2018-04-13 15:54:23 UTC)

Created attachment 1421434 [details]
Re-attaching the patch from the original bug

Patch that prevents entering the busy loop
I am suggesting a patch to main.c that prevents entering the busy loop by disabling the bad file descriptor in the pollfd array in find_bad_fd().
The problem occurs when the poll() in main_loop() returns "POLLERR | POLLHUP | POLLNVAL" for a file descriptor for which find_bad_fd() finds no service (the fd is closed by the service process? A race condition?).
The current code clears the events flags for the next poll():

```c
304 #ifdef HAVE_POLL
305     ps.rws.pfd_array[fd].events = 0;
```
However, the file descriptor is not removed from the pollfd array and on next iteration, the above process repeats (forever).
Therefore, I am suggesting disabling the fd in the array by adding:

```c
ps.rws.pfd_array[fd].fd *= -1;
```
Another option might be, e.g., to remove the fd from the array completely.
The patch only improves the behaviour of xinetd (tested only in this scenario!) in that it never enters the busy loop. The failed service, however, remains dysfunctional (it never recovers) and would still require a restart (and reconfiguration) of xinetd.
I am also suggesting another modification:
```c
292 if ( SVC_FD( sp ) == fd )
293 {
294     msg( LOG_ERR, func,
295         "file descriptor of service %s has been closed",
296         SVC_ID( sp ) ) ;
297     svc_deactivate( sp ) ;
298     found = TRUE ;
299     bad_fd_count++ ;
300     break ;
```
Adding "bad_fd_count++" on line 299 fixes a case where the FD was closed (and detected by the poll()), which always resulted in the following misleading log message being displayed:
```c
316 if ( bad_fd_count == 0 )
317     msg( LOG_NOTICE, func,
318         "select reported EBADF but no bad file descriptors were found" ) ;
```
(In reply to Stepan Broz from comment #1)

> Created attachment 1421434 [details]
> Re-attaching the patch from the original bug
>
> Patch that prevents entering the busy loop
>
> I am suggesting a patch to main.c that prevents entering the busy loop by
> disabling the bad file descriptor in the pollfd array in find_bad_fd().
>
> The problem occurs when the poll() in main_loop() returns "POLLERR | POLLHUP
> | POLLNVAL" for a file descriptor for which there is no service found (The
> fd is closed by the service process? Race condition?) in find_bad_fd().

Well, the actual fix should be to find out that there is a misconfigured service and, possibly, refuse to load it.

> The current code clears the events flags for next poll():
>
>     304 #ifdef HAVE_POLL
>     305     ps.rws.pfd_array[fd].events = 0;
>
> However, the file descriptor is not removed from the pollfd array and on
> next iteration, the above process repeats (forever).
>
> Therefore, I am suggesting disabling the fd in the array by adding:
>
>     ps.rws.pfd_array[fd].fd *= -1;

I don't understand this. Why "*= -1" if a simple "= -1" will do?

> Other option might be e.g. to remove the fd from the array completely.

Well, if you can remove it and make sure that you didn't change the behaviour of the program, then it is the perfect solution.

> The behaviour of xinetd is only improved by the patch (tested only in this
> scenario!) as it never enters the busy loop. The failed service will,
> however, still remain dysfunctional (it never recovers) and would still
> require restart (and reconfiguration) of xinetd.

Two problems. If you submit a patch and say that you actually only tested it in this very scenario, don't expect it to be accepted. The second problem is that the service is still there, totally useless.

> I am also suggesting another modification:
>
>     292 if ( SVC_FD( sp ) == fd )
>     293 {
>     294     msg( LOG_ERR, func,
>     295         "file descriptor of service %s has been closed",
>     296         SVC_ID( sp ) ) ;
>     297     svc_deactivate( sp ) ;
>     298     found = TRUE ;
>     299     bad_fd_count++ ;
>     300     break ;
>
> Adding "bad_fd_count++" on line 299 fixes a case where the FD was closed
> (and detected by the poll()), which always resulted in the following
> misleading log message being displayed:
>
>     316 if ( bad_fd_count == 0 )
>     317     msg( LOG_NOTICE, func,
>     318         "select reported EBADF but no bad file descriptors were found" ) ;

So the real problem here is misconfiguration. Yes, it should be handled gracefully, but the simple "Just don't do it!", which is already kind of mentioned in the xinetd.conf, is a good enough fix in this case. Besides, people are encouraged to use systemd units.

If you can come up with a patch that detects "udp" + "wait=no" and handles that with a big fat warning, possibly an error, and make sure that the rest works fine, I will merge such a patch.

(In reply to Jan Synacek from comment #2)

Thank you for your feedback!

> Well, the actual fix should be to find out that there is a misconfigured
> service and, possibly, refuse to load it.

In addition to this fix, yes, that would be ideal.

> > Therefore, I am suggesting disabling the fd in the array by adding:
> > ps.rws.pfd_array[fd].fd *= -1;
>
> I don't understand this. Why "*= -1" if a simple "= -1" will do?

You do not lose the original fd number, it's only negated (i.e. for debugging). Both will do, yes.

> Two problems. If you submit a patch and say that you actually only tested it
> in this very scenario, don't expect it to be accepted. The second problem is
> that the service is still there, totally useless.

So it is currently, and busy looping and flooding logs as a bonus.

> > I am also suggesting another modification:
> > [the bad_fd_count suggestion quoted in full above]

Can you comment on the above part, please?

> So the real problem here is misconfiguration. [...] If you can come up with
> a patch that detects "udp" + "wait=no" [...] I will merge such patch.

Yes, the source of the problem is in the configuration. The xinetd.conf suggests using "wait=yes" for UDP services, but does not provide any warning of what could happen. I believe it is because the upstream version does not have the poll() patch and therefore the busy loop will not happen there. I may have a look for a way of preventing such a service from being loaded if time allows, but I still consider the busy loop and flooded logs an issue worth fixing.

Created attachment 1431238 [details]
Reworked patch preventing loading a udp service with wait=no
I have reworked the patch based on your comments and added the big fat error. A service with socket_type = dgram (or protocol = udp) will now refuse to load with wait = no.
(In reply to Stepan Broz from comment #4)

> Created attachment 1431238 [details]
> Reworked patch preventing loading a udp service with wait=no
>
> I have reworked the patch, based on your comments. I have added the big fat
> error. Now the service with socket_type = dgram (or protocol = udp) will not
> load with wait=no.

Thanks, I'll do some additional testing and (possibly) push it.

xinetd-2.3.15-26.fc28 has been submitted as an update to Fedora 28.
https://bodhi.fedoraproject.org/updates/FEDORA-2018-d54d442814

xinetd-2.3.15-26.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-d54d442814

xinetd-2.3.15-26.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.