Bug 1231543 - nginx.{service,logrotate} improvements
Summary: nginx.{service,logrotate} improvements
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: nginx
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Jamie Nguyen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-06-14 15:00 UTC by Marcos Mello
Modified: 2015-07-30 04:58 UTC (History)
8 users (show)

Fixed In Version: nginx-1.6.3-6.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-03 13:43:19 UTC


Attachments (Terms of Use)

Description Marcos Mello 2015-06-14 15:00:45 UTC
nginx.service:

- Drop syslog.target from After=, not needed.
- Drop TimeoutStopSec=5? Default systemd global configuration (/etc/systemd/system.conf) sets it to 90s.
- Replace ExecStop with "KillSignal=SIGQUIT". See man systemd.kill.


nginx.logrotate:

Not sure why 'systemctl kill' was replaced. Was not just a matter of using '--kill-who=main'? 'main' process for Type=forking daemons is the pid file one for systemd.

/bin/systemctl kill --signal=SIGUSR1 --kill-who=main nginx.service 2>/dev/null || true

?


And please backport these changes to F21 ("KillMode=mixed" addition especially).

Comment 1 Jamie Nguyen 2015-06-14 17:25:23 UTC
Hi. Thanks for the bug report.

(In reply to Marcos Mello from comment #0)
> - Drop syslog.target from After=, not needed.

Done.


> - Drop TimeoutStopSec=5? Default systemd global configuration
> (/etc/systemd/system.conf) sets it to 90s.

The default is too long. If you send a SIGQUIT but for some reason nginx gets stuck, you probably don't want to wait around 90 seconds for a SIGTERM.


> - Replace ExecStop with "KillSignal=SIGQUIT". See man systemd.kill.

Done, though this doesn't really solve any real problems.


> Not sure why 'systemctl kill' was replaced.

I changed from `/usr/bin/systemctl` to `/bin/kill` as a quickfix for an SELinux issue a long time ago. The issue was fixed, so I'll change it back but again I don't think this solves any real problems.

 
> And please backport these changes to F21 ("KillMode=mixed" addition
> especially).

Done.


Builds and updates to follow.

Comment 2 Fedora Update System 2015-06-14 17:46:18 UTC
nginx-1.8.0-8.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/nginx-1.8.0-8.fc22

Comment 3 Fedora Update System 2015-06-14 17:48:33 UTC
nginx-1.6.3-2.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/nginx-1.6.3-2.fc21

Comment 4 Fedora Update System 2015-06-14 17:49:28 UTC
nginx-1.6.3-3.el7 has been submitted as an update for Fedora EPEL 7.
https://admin.fedoraproject.org/updates/nginx-1.6.3-3.el7

Comment 5 Marcos Mello 2015-06-14 17:50:27 UTC
(In reply to Jamie Nguyen from comment #1)
> Hi. Thanks for the bug report.
> 
> (In reply to Marcos Mello from comment #0)
> > - Drop syslog.target from After=, not needed.
> 
> Done.
> 
> 
> > - Drop TimeoutStopSec=5? Default systemd global configuration
> > (/etc/systemd/system.conf) sets it to 90s.
> 
> The default is too long. If you send a SIGQUIT but for some reason nginx
> gets stuck, you probably don't want to wait around 90 seconds for a SIGTERM.
> 

Now with KillSignal=SIGQUIT and KillMode=mixed, there is no SIGTERM anymore. systemd will send SIGQUIT to main pid and after 5 seconds all processes will be SIGKILLed if still there in the cgroup. This does not affect my light load server, but on busy ones perhaps it can?

> 
> > - Replace ExecStop with "KillSignal=SIGQUIT". See man systemd.kill.
> 
> Done, though this doesn't really solve any real problems.
> 
> 
> > Not sure why 'systemctl kill' was replaced.
> 
> I changed from `/usr/bin/systemctl` to `/bin/kill` as a quickfix for an
> SELinux issue a long time ago. The issue was fixed, so I'll change it back
> but again I don't think this solves any real problems.
> 
>  
> > And please backport these changes to F21 ("KillMode=mixed" addition
> > especially).
> 
> Done.
> 
> 
> Builds and updates to follow.

Thank you for your prompt reply!

Comment 6 Jamie Nguyen 2015-06-14 17:57:45 UTC
(In reply to Marcos Mello from comment #5)
> Now with KillSignal=SIGQUIT and KillMode=mixed, there is no SIGTERM anymore.
> systemd will send SIGQUIT to main pid and after 5 seconds all processes will
> be SIGKILLed if still there in the cgroup. This does not affect my light
> load server, but on busy ones perhaps it can?

Yes I meant SIGKILL. If the server is very busy and the admin needs a longer timeout, they can override this default with a systemd snippet. (NB: Debian also sets TimeoutStopSec=5 in their service file.)


> Thank you for your prompt reply!

No problem!

Comment 7 Marcos Mello 2015-06-14 18:13:16 UTC
Yeah, makes sense.

Comment 8 Marcos Mello 2015-06-15 13:06:35 UTC
On CentOS 7, with default KillMode=control-group, I see sometimes this on shutdown (worker pids):

[info] 1516#0: epoll_wait() failed (4: Interrupted system call)

Probably some race because all processes are receiving SIGQUIT.

KillMode=process solves it signaling only the main pid. "mixed" is more robust, but we have to wait for RHEL 8...

Comment 9 Fedora Update System 2015-06-15 13:29:56 UTC
Package nginx-1.8.0-8.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing nginx-1.8.0-8.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-10025/nginx-1.8.0-8.fc22
then log in and leave karma (feedback).

Comment 10 Jamie Nguyen 2015-06-15 21:25:12 UTC
(In reply to Marcos Mello from comment #8)
> On CentOS 7, with default KillMode=control-group, I see sometimes this on
> shutdown (worker pids):
> 
> [info] 1516#0: epoll_wait() failed (4: Interrupted system call)

In the situation you described, are any processes (either master or workers) left behind after `systemctl stop`?

Comment 11 Marcos Mello 2015-06-15 23:44:59 UTC
Nope. They are all finished.

I do not have access to the CentOS machine now. You can reproduce it in Fedora though. Comment KillMode, raise log level to info, and configure nginx to use multiple workers (4 in this example).

 PPID   PID COMMAND
    1  2795 nginx: master process /usr/sbin/nginx
 2795  2796  \_ nginx: worker process
 2795  2797  \_ nginx: worker process
 2795  2798  \_ nginx: worker process
 2795  2799  \_ nginx: worker process

Every pid received SIGQUIT:

2015/06/15 19:49:08 [notice] 2792#0: using the "epoll" event method
2015/06/15 19:49:08 [notice] 2792#0: nginx/1.6.3
2015/06/15 19:49:08 [notice] 2792#0: built by gcc 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)
2015/06/15 19:49:08 [notice] 2792#0: OS: Linux 4.0.4-202.fc21.x86_64
2015/06/15 19:49:08 [notice] 2792#0: getrlimit(RLIMIT_NOFILE): 1024:4096
2015/06/15 19:49:08 [notice] 2795#0: start worker processes
2015/06/15 19:49:08 [notice] 2795#0: start worker process 2796
2015/06/15 19:49:08 [notice] 2795#0: start worker process 2797
2015/06/15 19:49:08 [notice] 2795#0: start worker process 2798
2015/06/15 19:49:08 [notice] 2795#0: start worker process 2799
2015/06/15 19:49:14 [notice] 2797#0: signal 3 (SIGQUIT) received, shutting down
2015/06/15 19:49:14 [info] 2797#0: epoll_wait() failed (4: Interrupted system call)
2015/06/15 19:49:14 [notice] 2797#0: gracefully shutting down
2015/06/15 19:49:14 [notice] 2797#0: exiting
2015/06/15 19:49:14 [notice] 2795#0: signal 3 (SIGQUIT) received, shutting down
2015/06/15 19:49:14 [notice] 2796#0: signal 3 (SIGQUIT) received, shutting down
2015/06/15 19:49:14 [notice] 2796#0: gracefully shutting down
2015/06/15 19:49:14 [notice] 2796#0: exiting
2015/06/15 19:49:14 [notice] 2797#0: exit
2015/06/15 19:49:14 [notice] 2796#0: exit
2015/06/15 19:49:14 [notice] 2798#0: signal 3 (SIGQUIT) received, shutting down
2015/06/15 19:49:14 [notice] 2798#0: gracefully shutting down
2015/06/15 19:49:14 [notice] 2798#0: exiting
2015/06/15 19:49:14 [notice] 2798#0: exit
2015/06/15 19:49:14 [notice] 2799#0: signal 3 (SIGQUIT) received, shutting down
2015/06/15 19:49:14 [notice] 2799#0: gracefully shutting down
2015/06/15 19:49:14 [notice] 2799#0: exiting
2015/06/15 19:49:14 [notice] 2799#0: exit
2015/06/15 19:49:14 [notice] 2795#0: signal 17 (SIGCHLD) received
2015/06/15 19:49:14 [notice] 2795#0: worker process 2796 exited with code 0
2015/06/15 19:49:14 [notice] 2795#0: signal 29 (SIGIO) received
2015/06/15 19:49:14 [notice] 2795#0: signal 17 (SIGCHLD) received
2015/06/15 19:49:14 [notice] 2795#0: worker process 2797 exited with code 0
2015/06/15 19:49:14 [notice] 2795#0: signal 29 (SIGIO) received
2015/06/15 19:49:14 [notice] 2795#0: signal 17 (SIGCHLD) received
2015/06/15 19:49:14 [notice] 2795#0: worker process 2799 exited with code 0
2015/06/15 19:49:14 [notice] 2795#0: signal 29 (SIGIO) received
2015/06/15 19:49:14 [notice] 2795#0: signal 17 (SIGCHLD) received
2015/06/15 19:49:14 [notice] 2795#0: worker process 2798 exited with code 0
2015/06/15 19:49:14 [notice] 2795#0: exit

Worker 2797 is complaining. Master (2795) was already cleaning things up I think. It seems quite racy and is easily reproducible.


A "clean" log (KillMode=process):

 PPID   PID COMMAND
    1  2855 nginx: master process /usr/sbin/nginx
 2855  2856  \_ nginx: worker process
 2855  2857  \_ nginx: worker process
 2855  2858  \_ nginx: worker process
 2855  2859  \_ nginx: worker process

2015/06/15 19:56:55 [notice] 2852#0: using the "epoll" event method
2015/06/15 19:56:55 [notice] 2852#0: nginx/1.6.3
2015/06/15 19:56:55 [notice] 2852#0: built by gcc 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)
2015/06/15 19:56:55 [notice] 2852#0: OS: Linux 4.0.4-202.fc21.x86_64
2015/06/15 19:56:55 [notice] 2852#0: getrlimit(RLIMIT_NOFILE): 1024:4096
2015/06/15 19:56:55 [notice] 2855#0: start worker processes
2015/06/15 19:56:55 [notice] 2855#0: start worker process 2856
2015/06/15 19:56:55 [notice] 2855#0: start worker process 2857
2015/06/15 19:56:55 [notice] 2855#0: start worker process 2858
2015/06/15 19:56:55 [notice] 2855#0: start worker process 2859
2015/06/15 19:58:01 [notice] 2855#0: signal 3 (SIGQUIT) received, shutting down
2015/06/15 19:58:01 [notice] 2856#0: gracefully shutting down
2015/06/15 19:58:01 [notice] 2856#0: exiting
2015/06/15 19:58:01 [notice] 2857#0: gracefully shutting down
2015/06/15 19:58:01 [notice] 2857#0: exiting
2015/06/15 19:58:01 [notice] 2856#0: exit
2015/06/15 19:58:01 [notice] 2857#0: exit
2015/06/15 19:58:01 [notice] 2859#0: gracefully shutting down
2015/06/15 19:58:01 [notice] 2858#0: gracefully shutting down
2015/06/15 19:58:01 [notice] 2859#0: exiting
2015/06/15 19:58:01 [notice] 2858#0: exiting
2015/06/15 19:58:01 [notice] 2859#0: exit
2015/06/15 19:58:01 [notice] 2858#0: exit
2015/06/15 19:58:01 [notice] 2855#0: signal 17 (SIGCHLD) received
2015/06/15 19:58:01 [notice] 2855#0: worker process 2857 exited with code 0
2015/06/15 19:58:01 [notice] 2855#0: signal 29 (SIGIO) received
2015/06/15 19:58:01 [notice] 2855#0: signal 17 (SIGCHLD) received
2015/06/15 19:58:01 [notice] 2855#0: worker process 2859 exited with code 0
2015/06/15 19:58:01 [notice] 2855#0: signal 29 (SIGIO) received
2015/06/15 19:58:01 [notice] 2855#0: signal 17 (SIGCHLD) received
2015/06/15 19:58:01 [notice] 2855#0: worker process 2856 exited with code 0
2015/06/15 19:58:01 [notice] 2855#0: worker process 2858 exited with code 0
2015/06/15 19:58:01 [notice] 2855#0: exit

nginx man page says "While there is no need to explicitly control worker processes normally, they support some signals too...". Master process is responsible for workers shutdown.

KillMode=mixed is better than KillMode=process, since it will force-kill everything after the timeout. Too bad we cannot use it in CentOS 7. :-(

Comment 12 Jamie Nguyen 2015-06-16 21:29:58 UTC
I agree, KillMode=process is probably better than KillMode=control-group. Your issue looks easily reproducible. Also, I've never seen rogue workers left behind after master has been killed so it's probably OK just to do KillMode=process.

Comment 13 Fedora Update System 2015-06-16 21:34:23 UTC
nginx-1.8.0-9.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/nginx-1.8.0-9.fc22

Comment 14 Fedora Update System 2015-06-16 21:35:22 UTC
nginx-1.6.3-3.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/nginx-1.6.3-3.fc21

Comment 15 Fedora Update System 2015-06-16 21:36:09 UTC
nginx-1.6.3-5.el7 has been submitted as an update for Fedora EPEL 7.
https://admin.fedoraproject.org/updates/nginx-1.6.3-5.el7

Comment 16 Marcos Mello 2015-06-16 22:58:59 UTC
Thanks again!

BTW, you released a F20 update. Does it worth backport same EPEL 7 nginx.{service,logrotate} changes to it (F20 uses systemd 208 too)? If you have spare time only... F20 is almost EOLed.

Comment 17 Fedora Update System 2015-06-30 00:13:13 UTC
nginx-1.8.0-9.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 18 Fedora Update System 2015-06-30 00:14:29 UTC
nginx-1.6.3-3.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 19 Marcos Mello 2015-07-03 10:27:32 UTC
I am seeing this on CentOS 7:

type=USER_AVC msg=audit(1435918621.481:550): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  denied  { stop } for auid=0 uid=0 gid=0 path="/usr/lib/systemd/system/nginx.service" cmdline="/usr/bin/systemctl kill --signal=SIGUSR1 --kill-who=main nginx.service" scontext=system_u:system_r:logrotate_t:s0-s0:c0.c1023 tcontext=system_u:object_r:httpd_unit_file_t:s0 tclass=service  exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'

But despite it nginx logs *are* being rotated.

Where should I report CentOS selinux-policy bugs?

Comment 20 Jamie Nguyen 2015-07-03 10:46:59 UTC
(In reply to Marcos Mello from comment #19)
> Where should I report CentOS selinux-policy bugs?

https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%207&component=selinux-policy

Comment 21 Marcos Mello 2015-07-03 11:04:21 UTC
Done:
https://bugzilla.redhat.com/show_bug.cgi?id=1239045

Comment 22 Jamie Nguyen 2015-07-03 12:41:56 UTC
(In reply to Jamie Nguyen from comment #1)
> > Not sure why 'systemctl kill' was replaced.
> 
> I changed from `/usr/bin/systemctl` to `/bin/kill` as a quickfix for an
> SELinux issue a long time ago. The issue was fixed, so I'll change it back
> but again I don't think this solves any real problems.

The new SELinux denials do in fact prevent NGINX from re-opening the log files. This is true for both Fedora and CentOS. Thus I'm reverting back to /bin/kill on all branches. I am very unlikely to switch back to /usr/bin/systemctl in the foreseeable future, unless a compelling reason presents itself.

Comment 23 Fedora Update System 2015-07-03 12:53:24 UTC
nginx-1.6.3-6.el7 has been submitted as an update for Fedora EPEL 7.
https://admin.fedoraproject.org/updates/nginx-1.6.3-6.el7

Comment 24 Marcos Mello 2015-07-03 13:08:51 UTC
Well, systemctl kill is future proof, if one day nginx gains Type=notify support and a blessed* foreground mode, etc, pid file is not needed (and can be ommited or even disabled). When selinux-policy gets fixed, perhaps in the future you can put it back.

Sorry for not testing this right.

* upstream puts some restrictions on 'daemon off'.

Comment 25 Jamie Nguyen 2015-07-03 13:43:19 UTC
If one day nginx doesn't need a pid file anymore, at that point I'll consider switching back to systemctl.

Comment 26 Fedora Update System 2015-07-30 04:58:20 UTC
nginx-1.6.3-6.el7 has been pushed to the Fedora EPEL 7 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.