Description of problem: Lot of times, with --debug-rgw and --debug-ms enabled, ceph-rgw-$host.log file is empty and all the logs are collected in log files that have a timestamp. Version-Release number of selected component (if applicable): ceph-radosgw-10.2.2-27.el7cp.x86_64 How reproducible: Sometimes Steps to Reproduce: 1. With debug enabled, run I/O on rgw node. 2. The logs start collecting in ceph-rgw-host.log. After a while, with log rotation, a new file with the datestamp is created and the logs are written to that file. ceph-rgw-host.log is empty mostly. 3. After a radosgw process restart, it starts writing to the ceph-rgw-host.log file. Actual results: # ls -lh /var/log/ceph/ | grep rgw -rw-r--r--. 1 ceph ceph 0 Jul 10 03:28 ceph-client.rgw.magna115.log -rw-r--r--. 1 ceph ceph 0 Jul 26 03:38 ceph-rgw-magna115.log -rw-r--r--. 1 ceph ceph 5.3G Jul 22 06:15 ceph-rgw-magna115.log-20160722 -rw-r--r--. 1 ceph ceph 2.2G Jul 24 06:31 ceph-rgw-magna115.log-20160723 -rw-r--r--. 1 ceph ceph 308M Jul 25 07:28 ceph-rgw-magna115.log-20160725 -rw-r--r--. 1 ceph ceph 8.5G Jul 26 09:33 ceph-rgw-magna115.log-20160726 # systemctl restart ceph-radosgw.service # ls -lh /var/log/ceph/ | grep rgw -rw-r--r--. 1 ceph ceph 0 Jul 10 03:28 ceph-client.rgw.magna115.log -rw-r--r--. 1 ceph ceph 5.0M Jul 26 09:41 ceph-rgw-magna115.log -rw-r--r--. 1 ceph ceph 5.3G Jul 22 06:15 ceph-rgw-magna115.log-20160722 -rw-r--r--. 1 ceph ceph 2.2G Jul 24 06:31 ceph-rgw-magna115.log-20160723 -rw-r--r--. 1 ceph ceph 308M Jul 25 07:28 ceph-rgw-magna115.log-20160725 -rw-r--r--. 1 ceph ceph 8.5G Jul 26 09:41 ceph-rgw-magna115.log-20160726 Not sure if I understand how this should work.
I cannot reproduce this behavior locally. I configured debugging ("debug rgw = 20" and "debug ms = 1" in [global] in ceph.conf), and when I SIGHUP the daemon, the radosgw process immediately writes to the new log file as expected. Here are the two scenarios I tried: Scenario 1 (testing SIGHUP outside of logrotate): 1. systemctl start ceph-radosgw@rgw.$(hostname -s) 2. cd /var/ceph/log 3. mv ceph-rgw-$(hostname -s).log ceph-rgw-$(hostname -s).log.moved 4. killall -q -1 radosgw 5. Confirm that radosgw is now writing to the new log file (watch "ls -l" for the log directory). Scenario 2 (testing logrotate): 1. systemctl start ceph-radosgw@rgw.$(hostname -s) 2. Force a log rotation with the following command: logrotate -f /etc/logrotate.d/ceph 3. Confirm that radosgw is now writing to the new log file (watch "ls -l" for the log directory). I looked at magna115 and the problem is probably gone now that you've restarted the daemon. (I'm curious why the log files are not gzipped, since Ceph's default logrotate configuration compresses the log files? Could that have something to do with the issue you were seeing?) If you can reproduce this (particularly with "logrotate -f"), it would be helpful to see the output of: ls -l /proc/$(pgrep radosgw)/fd | grep log while you're seeing the writes to the old log file so that we can see what filehandle(s) the process has open.
We're still getting logs. I recommend that we push this to 2.1
(In reply to Ken Dreyer (Red Hat) from comment #3) > I cannot reproduce this behavior locally. I configured debugging ("debug rgw > = 20" and "debug ms = 1" in [global] in ceph.conf), and when I SIGHUP the > daemon, the radosgw process immediately writes to the new log file as > expected. > > Here are the two scenarios I tried: > > Scenario 1 (testing SIGHUP outside of logrotate): > 1. systemctl start ceph-radosgw@rgw.$(hostname -s) > 2. cd /var/ceph/log > 3. mv ceph-rgw-$(hostname -s).log ceph-rgw-$(hostname -s).log.moved > 4. killall -q -1 radosgw > 5. Confirm that radosgw is now writing to the new log file (watch "ls -l" > for the log directory). > > Scenario 2 (testing logrotate): > 1. systemctl start ceph-radosgw@rgw.$(hostname -s) > 2. Force a log rotation with the following command: > logrotate -f /etc/logrotate.d/ceph > 3. Confirm that radosgw is now writing to the new log file (watch "ls -l" > for the log directory). > > > I looked at magna115 and the problem is probably gone now that you've > restarted the daemon. (I'm curious why the log files are not gzipped, since > Ceph's default logrotate configuration compresses the log files? Could that > have something to do with the issue you were seeing?) > > If you can reproduce this (particularly with "logrotate -f"), it would be > helpful to see the output of: > > ls -l /proc/$(pgrep radosgw)/fd | grep log > > while you're seeing the writes to the old log file so that we can see what > filehandle(s) the process has open. While the old file is being written: # ls -l /proc/$(pgrep radosgw)/fd | grep log l-wx------. 1 root root 64 Jul 26 12:50 3 -> /var/log/ceph/ceph-rgw-magna115.log-20160727
I checked my test cluster today and I see now that the issue is present there now. $ ls -l /proc/$(pgrep radosgw)/fd | grep log /var/log/ceph/ceph-rgw-kdreyer-clot-2.log-20160727 I think this means that "logrotate -f" will not reproduce it? Still investigating.
Checked Shilpa's system, found the following: 1. /var/log/messages:1:Jul 24 03:48:01 magna115 logrotate: ALERT exited abnormally with [1] /var/log/messages:945:Jul 25 03:49:01 magna115 logrotate: ALERT exited abnormally with [1] which means that logrotate is exiting with errors, most likely the post script in /etc/logrotate.d/ceph is not getting executed, which is why SIGHUP signal is not getting sent to the processes (which causes the log files to be reopened and hence point to the latest file) and the step to zip the files is being skipped as well. 2. Next I added verbose debugging to logrotate command and observed the following: /usr/sbin/logrotate -vvvdf /etc/logrotate.d/ceph 2> /tmp/logrotate.debug In logrotate.debug file: switching euid to 167 and egid to 167 error: error opening /var/log/ceph/qemu-guest-10049.log: Permission denied which is because the permissions in the /etc/logrotate.d/ceph file are for ceph uid and gid and some files need root permission as can be seen below: -rw-r--r--. 1 root ceph 100 Jul 22 06:09 qemu-guest-10049.log -rw-r--r--. 1 root ceph 100 Jul 20 11:50 qemu-guest-1036.log -rw-r--r--. 1 root ceph 0 Jul 18 07:19 qemu-guest-10599.log 3. I changed the permission to root root in /etc/logrotate.d/ceph and again ran the above command and can see that now logrotate ran the postscript and is attempting to zip the files (though there is an error about a missing file) /usr/sbin/logrotate -vvvdf /etc/logrotate.d/ceph 2> /tmp/logrotate.debug running postrotate script running script with arg /var/log/ceph/*.log : " killall -q -1 ceph-mon ceph-mds ceph-osd ceph-fuse radosgw || true " compressing log with: /bin/gzip removing old log /var/log/ceph/qemu-guest-10049.log.8.gz error: error opening /var/log/ceph/qemu-guest-10049.log.8.gz: No such file or directory 4. I suspect that the permissions mismatch is the cause of the error 5. Shilpa will execute the logrotate -f command without the '-d' option and see what the result is.
(In reply to Pritha Srivastava from comment #8) > Checked Shilpa's system, found the following: > > 1. /var/log/messages:1:Jul 24 03:48:01 magna115 logrotate: ALERT exited > abnormally with [1] > /var/log/messages:945:Jul 25 03:49:01 magna115 logrotate: ALERT exited > abnormally with [1] > > which means that logrotate is exiting with errors, most likely the post > script in /etc/logrotate.d/ceph is not getting executed, which is why SIGHUP > signal is not getting sent to the processes (which causes the log files to > be reopened and hence point to the latest file) and the step to zip the > files is being skipped as well. > > 2. Next I added verbose debugging to logrotate command and observed the > following: > /usr/sbin/logrotate -vvvdf /etc/logrotate.d/ceph 2> /tmp/logrotate.debug > In logrotate.debug file: > switching euid to 167 and egid to 167 > error: error opening /var/log/ceph/qemu-guest-10049.log: Permission denied > which is because the permissions in the /etc/logrotate.d/ceph file are for > ceph uid and gid and some files need root permission as can be seen below: > > -rw-r--r--. 1 root ceph 100 Jul 22 06:09 qemu-guest-10049.log > -rw-r--r--. 1 root ceph 100 Jul 20 11:50 qemu-guest-1036.log > -rw-r--r--. 1 root ceph 0 Jul 18 07:19 qemu-guest-10599.log > > 3. I changed the permission to root root in /etc/logrotate.d/ceph and again > ran the above command and can see that now logrotate ran the postscript and > is attempting to zip the files (though there is an error about a missing > file) > > /usr/sbin/logrotate -vvvdf /etc/logrotate.d/ceph 2> /tmp/logrotate.debug > > running postrotate script > running script with arg /var/log/ceph/*.log : " > killall -q -1 ceph-mon ceph-mds ceph-osd ceph-fuse radosgw || true > " > compressing log with: /bin/gzip > removing old log /var/log/ceph/qemu-guest-10049.log.8.gz > error: error opening /var/log/ceph/qemu-guest-10049.log.8.gz: No such file > or directory > > 4. I suspect that the permissions mismatch is the cause of the error > > 5. Shilpa will execute the logrotate -f command without the '-d' option and > see what the result is. Thanks for looking into it Pritha. I was able to run logrotate -f successfully but I had to delete the older files and clean up /var/lib/logrotate.status file first as you suggested. Will check what is the status of logrotate tomorrow and update.
Thanks Pritha! As a review of why these qemu log files are written at all: ceph-ansible always configures client logging on every cluster node, like so: [client] log file = /var/log/ceph/qemu-guest-$pid.log This was added for qemu RBD logging, but apparently it also captures RGW logs, I guess? In RBD cases, the qemu-guest logs might be root-owned (?), which will trigger this bug. In this case, qemu is not running, so I don't know what else could have caused this file to be owned by root intead of "ceph". A quick workaround would be to update logrotate.conf to rotate the "ceph" logs as "ceph" and the qemu-guest logs as "root". A longer-term fix would involve: 1) Why is this file owned by root when qemu is not involved at all on this node? What could be writing a root-owned file here when all the daemons (that I can see) should be running as "ceph" ? 2) Can we make ceph-ansible smarter about when it configures a "qemu" log file here, because apparently this configuration will log a lot more than just qemu's activity?
Started thread "handling qemu-guest logs in ceph-ansible", http://marc.info/?l=ceph-devel&m=147007990404484&w=2
Are you by any chance just running out of free space? These logs seem to take 10+ GB of data which could easily fill up the remaining space on root partition. Also, you can't compress a log file if you do not have enough free space left to store both regular and the compressed file. I suppose logrotate might get killed because it runs out of space when it is trying to compress the data. Anyway, gathering debug logs all the time does not seem like a good idea to me.
(In reply to Boris Ranto from comment #12) > Are you by any chance just running out of free space? These logs seem to > take 10+ GB of data which could easily fill up the remaining space on root > partition. Also, you can't compress a log file if you do not have enough > free space left to store both regular and the compressed file. I suppose > logrotate might get killed because it runs out of space when it is trying to > compress the data. > > Anyway, gathering debug logs all the time does not seem like a good idea to > me. I had checked the space usage at the time of the error and it was normal - around 16% used.
Hmm, some more thoughts/further debugging: If we remove the su option then logrotate should run as root and the log file should inherit mode, user and group which could be something what we might actually want, here. However, looking at the git history, the su option was introduced in upstream commit 73d7bed9 with the following description (this applies both to src/logrotate.conf and src/rgw/logrotate.conf): logrotate: logs are now owned by ceph:ceph Add the su directive to avoid this error during logrotate: error: skipping "/var/log/ceph/ceph-osd.0.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation. Putting it all together, I think the best solution would be to use 'su root ceph'. We should hopefully be fine just using the 'su root ceph' line since the error is complaining about the group, not user and it should not complain about the user at all. @Ken|Pritha: Can you try changing the line to read 'su root ceph' locally to see if it helps and also check whether logrotate is not complaining about anything?
(In reply to Boris Ranto from comment #14) > Hmm, some more thoughts/further debugging: > > If we remove the su option then logrotate should run as root and the log > file should inherit mode, user and group which could be something what we > might actually want, here. > > However, looking at the git history, the su option was introduced in > upstream commit 73d7bed9 with the following description (this applies both > to src/logrotate.conf and src/rgw/logrotate.conf): > > logrotate: logs are now owned by ceph:ceph > > Add the su directive to avoid this error during logrotate: > > error: skipping "/var/log/ceph/ceph-osd.0.log" because parent > directory has insecure permissions (It's world writable or writable > by group which is not "root") Set "su" directive in config file to > tell logrotate which user/group should be used for rotation. > > Putting it all together, I think the best solution would be to use 'su root > ceph'. > > We should hopefully be fine just using the 'su root ceph' line since the > error is complaining about the group, not user and it should not complain > about the user at all. > > > @Ken|Pritha: Can you try changing the line to read 'su root ceph' locally to > see if it helps and also check whether logrotate is not complaining about > anything? I took Shilpa's system and have made the changes suggested above. Will check again after a day, to see if logrotate works fine with qemu logs enabled.
Upstream PR: https://github.com/ceph/ceph/pull/10587
Checked Shilpa's system today. Looks like the fix is working fine. The latest rgw log file being written to, and all the log files are zipped fine. The content of /var/log/ceph is below: total 5935172 -rw-r--r--. 1 ceph ceph 0 Jul 10 03:28 ceph-client.rgw.magna115.log -rw-r--r--. 1 ceph ceph 553938638 Aug 8 05:21 ceph-rgw-magna115.log -rw-r--r--. 1 ceph ceph 25550184 Jul 29 07:34 ceph-rgw-magna115.log.1.gz -rw-r--r--. 1 ceph ceph 720665573 Aug 2 03:13 ceph-rgw-magna115.log-20160802.gz -rw-r--r--. 1 ceph ceph 825355039 Aug 3 03:38 ceph-rgw-magna115.log-20160803.gz -rw-r--r--. 1 ceph ceph 821914934 Aug 4 03:26 ceph-rgw-magna115.log-20160804.gz -rw-r--r--. 1 ceph ceph 787272749 Aug 5 03:50 ceph-rgw-magna115.log-20160805.gz -rw-r--r--. 1 ceph ceph 733801007 Aug 6 03:37 ceph-rgw-magna115.log-20160806.gz -rw-r--r--. 1 ceph ceph 806388008 Aug 7 03:47 ceph-rgw-magna115.log-20160807.gz -rw-r--r--. 1 ceph ceph 802657306 Aug 8 03:43 ceph-rgw-magna115.log-20160808.gz -rw-r--r--. 1 root ceph 0 Aug 7 03:47 qemu-guest-1014.log -rw-r--r--. 1 root ceph 704 Aug 5 10:42 qemu-guest-1014.log-20160807.gz -rw-r--r--. 1 root ceph 0 Aug 5 09:52 qemu-guest-300.log -rw-r--r--. 1 root ceph 0 Aug 7 03:47 qemu-guest-31425.log -rw-r--r--. 1 root ceph 109 Aug 5 08:57 qemu-guest-31425.log-20160807.gz -rw-r--r--. 1 root ceph 0 Aug 7 03:47 qemu-guest-32401.log -rw-r--r--. 1 root ceph 718 Aug 5 09:49 qemu-guest-32401.log-20160807.gz -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32432.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32456.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32480.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32504.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32528.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32552.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32576.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32600.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32624.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32648.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32672.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32696.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32720.log -rw-r--r--. 1 root ceph 0 Aug 5 09:51 qemu-guest-32744.log -rw-r--r--. 1 root ceph 0 Aug 5 09:52 qemu-guest-338.log -rw-r--r--. 1 root ceph 0 Aug 5 09:52 qemu-guest-366.log -rw-r--r--. 1 root ceph 0 Aug 5 09:53 qemu-guest-394.log -rw-r--r--. 1 root ceph 0 Aug 5 09:53 qemu-guest-419.log -rw-r--r--. 1 root ceph 0 Aug 8 03:43 qemu-guest-4238.log -rw-r--r--. 1 root ceph 109 Aug 6 07:32 qemu-guest-4238.log-20160808.gz -rw-r--r--. 1 root ceph 0 Aug 8 03:43 qemu-guest-4307.log -rw-r--r--. 1 root ceph 112 Aug 6 07:33 qemu-guest-4307.log-20160808.gz -rw-r--r--. 1 root ceph 0 Aug 8 03:43 qemu-guest-4386.log -rw-r--r--. 1 root ceph 559 Aug 6 07:33 qemu-guest-4386.log-20160808.gz -rw-r--r--. 1 root ceph 0 Aug 5 09:54 qemu-guest-445.log -rw-r--r--. 1 root ceph 0 Aug 5 09:57 qemu-guest-494.log -rw-r--r--. 1 root ceph 0 Aug 5 10:00 qemu-guest-526.log -rw-r--r--. 1 root ceph 0 Aug 5 10:42 qemu-guest-611.log
Did you check the journal log (journalctl) to see if logrotate complains about anything?
(In reply to Boris Ranto from comment #18) > Did you check the journal log (journalctl) to see if logrotate complains > about anything? I did the following: journalctl -r | grep logrotate, and I saw: Aug 08 03:45:42 magna115 run-parts(/etc/cron.daily)[10936]: finished logrotate Aug 08 03:43:01 magna115 run-parts(/etc/cron.daily)[10922]: starting logrotate Aug 07 03:49:43 magna115 run-parts(/etc/cron.daily)[7372]: finished logrotate Aug 07 03:47:01 magna115 run-parts(/etc/cron.daily)[7357]: starting logrotate Aug 06 03:39:31 magna115 run-parts(/etc/cron.daily)[3463]: finished logrotate Aug 06 03:37:01 magna115 run-parts(/etc/cron.daily)[3455]: starting logrotate Aug 05 03:52:41 magna115 run-parts(/etc/cron.daily)[30360]: finished logrotate Aug 05 03:50:01 magna115 run-parts(/etc/cron.daily)[30352]: starting logrotate Aug 04 03:28:45 magna115 run-parts(/etc/cron.daily)[23755]: finished logrotate Aug 04 03:26:01 magna115 run-parts(/etc/cron.daily)[23747]: starting logrotate Aug 03 03:40:47 magna115 run-parts(/etc/cron.daily)[18868]: finished logrotate Aug 03 03:38:01 magna115 run-parts(/etc/cron.daily)[18854]: starting logrotate Aug 02 03:16:28 magna115 run-parts(/etc/cron.daily)[28047]: finished logrotate Aug 02 03:14:01 magna115 run-parts(/etc/cron.daily)[28037]: starting logrotate Aug 01 03:19:31 magna115 run-parts(/etc/cron.daily)[16172]: finished logrotate Aug 01 03:17:01 magna115 run-parts(/etc/cron.daily)[16139]: starting logrotate Jul 31 03:27:30 magna115 run-parts(/etc/cron.daily)[11789]: finished logrotate Jul 31 03:25:01 magna115 run-parts(/etc/cron.daily)[11777]: starting logrotate Jul 30 03:44:05 magna115 run-parts(/etc/cron.daily)[7178]: finished logrotate Jul 30 03:42:01 magna115 run-parts(/etc/cron.daily)[7166]: starting logrotate Also, I grepped in /var/log to check if I see "ALERT exited abnormally with [1]" error which I had seen when the error had happened, and I don't see that error as well.
Verified on ceph-radosgw-10.2.2-41.el7cp.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1898.html