Bug 1696158 - mysqld.log is purged by logrotate but still held open by the mysqld process
Summary: mysqld.log is purged by logrotate but still held open by the mysqld process
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 14.0 (Rocky)
Assignee: Cédric Jeanneret
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On: 1695138 1696161
Blocks:
 
Reported: 2019-04-04 09:05 UTC by Cédric Jeanneret
Modified: 2019-04-30 17:51 UTC
CC List: 7 users

Fixed In Version: openstack-tripleo-heat-templates-9.3.1-0.20190314162753.d0a6cb1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-30 17:51:32 UTC
Target Upstream Version:
Embargoed:
cjeanner: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 645183 0 None MERGED [Rocky-only] Install and configure tmpwatch for log cleanup 2020-10-21 23:32:43 UTC
Red Hat Product Errata RHBA-2019:0878 0 None None None 2019-04-30 17:51:40 UTC

Internal Links: 1695138 1730510

Description Cédric Jeanneret 2019-04-04 09:05:00 UTC
This bug was initially created as a copy of Bug #1695138

I am copying this bug because: 

Need a dedicated one for openstack-tripleo-heat-templates package

+++ This bug was initially created as a clone of Bug #1679878 +++

Description of problem:

In recent RHOSP13, we rotate the logs of applications running inside containers by default:
~~~
/var/log/containers/*/*log /var/log/containers/*/*/*log /var/log/containers/*/*err {
  daily
  rotate 14
  maxage 14
  minsize 1
  maxsize 10M
  missingok
  notifempty
  copytruncate
  delaycompress
  compress
  sharedscripts
  postrotate
    /usr/bin/find /var/log/containers -type f \
    \( -mtime +14 -or \
    -atime +14 -or \
    -ctime +14 \) -exec rm -f {} \;;
  endscript
}
~~~

According to this configuration, any log that has not received any records
for 14 days is purged.
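
A dry-run equivalent of that purge (a sketch: it prints what the postrotate find would delete - files whose mtime, atime or ctime is older than 14 days - instead of removing them):

~~~
/usr/bin/find /var/log/containers -type f \
  \( -mtime +14 -or \
  -atime +14 -or \
  -ctime +14 \) -print
~~~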

MySQL running inside a container rarely generates logs, which causes the
problem at hand: its log file gets purged, but the mysqld process keeps the
deleted file open, as it never receives a SIGHUP from logrotate.
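
A quick way to confirm this state on an affected node (a minimal sketch; it assumes mysqld is visible from the host, as in this deployment):

~~~
# A purged-but-still-open log shows up with a "(deleted)" suffix in the
# process's file descriptor list.
sudo ls -l /proc/$(sudo pidof mysqld)/fd | grep deleted

# Equivalent check with lsof:
sudo lsof -p $(sudo pidof mysqld) 2>/dev/null | grep -i deleted
~~~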

How reproducible:

- Deploy overcloud with default logrotate configuration

Steps to Reproduce:
1. Deploy overcloud with small LogrotateRotate and LogrotatePurgeAfterDays values
2. Wait until the specified number of days has passed
3. See the opened log file by the following command
 - $ sudo lsof -p `sudo pidof mysqld` 2>/dev/null | grep log

Actual results:

The process keeps the deleted log file open.

Expected results:

The process should open the existing log file.


Additional info:

--- Additional comment from Cédric Jeanneret on 2019-02-22 09:41:39 UTC ---

Hello,

This is a "known" issue, due to the implementation of a European law (GDPR). A possible work-around is proposed in this change: https://review.openstack.org/#/c/616945/ - but, in the end, it just provides a longer delay before the removal :(.

Some history about this issue:
Currently, with external log management (applications writing their own logs on the host, an independent logrotate in its own container), there isn't any convenient or easy way to avoid this situation. There are parallel discussions about better log management, involving container stdout and proper (r)syslog rules, but it isn't that simple either (there are a number of issues with throttling and possible locks that might impact application performance).

If you think the proposed change might do for your case, let me know and I can revive it - although it will not solve the issue once and for all...

Cheers,

C.

--- Additional comment from Takashi Kajinami on 2019-02-25 05:41:16 UTC ---

Hi Cédric,

Thank you for updating me.

> A possible work-around is proposed in this change: https://review.openstack.org/#/c/616945/ - but, in the end, it just provides a longer delay before the removal :(.

I'm afraid this could not be a solution here, as the said mysqld is so silent that it rarely generates logs.

> This is a "known" issue, due to the implementation of a European law (GDPR).
May I ask a question about it: do we really need to "purge" the logs?

As far as I know, no OpenStack logs include user-related information in their file names,
so it is harmless to keep the files themselves as long as we make sure to remove all the content inside them.

If we do not need to purge the log files themselves, IMO the following configuration could be a solution here,
as we can reliably remove old log contents while keeping the log file itself after its first creation:

~~~
/var/log/containers/*/*log /var/log/containers/*/*/*log /var/log/containers/*/*err {
  daily
  rotate 14
  maxage 14
  minsize 0
  maxsize 10M
  missingok
  ifempty
  copytruncate
  delaycompress
  compress
  sharedscripts
  postrotate
    /usr/bin/find /var/log/containers -type f \
    -mtime +14 -exec rm -f {} \;;
  endscript
}
~~~

It would be nice if I could hear your thoughts about this.

Thank you,
Takashi

--- Additional comment from Cédric Jeanneret on 2019-02-25 13:06:08 UTC ---

Hello Takashi,

The mysqld case is probably the trickiest one we can face - indeed, it's a really silent service, while all the others, even rabbitmq, produce some output on a daily basis. The fact that we apply the GDPR rules to all logs is mainly because of the possibility of getting personal data in almost any log: even mysql can get it, if the operator decides to enable some debug (iirc we can get the query output in the logs). According to the EU, the IP address, user name and some other data are considered personal data, and thus should be properly managed. Hence the huge cleanup occurring on the logs on a daily basis.

In order to free the file descriptor, an attempt was made to restart the services, but doing so in a container environment isn't that easy, and could lead to severe downtime in case a service doesn't restart as expected. Being conservative is key in this specific case: having the infra crumbling due to logrotate wasn't an option.

I understand the proposed patch isn't the best one - and indeed, it won't sort out the mysql case, since mysql might never output anything for a year (especially if the operator doesn't do any update). Your proposal will probably not work either, especially since the "mtime" has little chance to change for mysql (no writing, hence no modification), meaning the log will be removed anyway. Maybe "atime" would help, but then again, I'm not even sure it will do - and it might lead to some other issues with the other logs.
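
As a side note, the three timestamps can be checked directly on the host (a minimal sketch, assuming the standard log location):

~~~
# Show atime/mtime/ctime of the mysql logs; if mysqld writes nothing, mtime
# stays put, and a time-based "find -mtime +14" will eventually match the
# live file.
stat -c 'atime=%x  mtime=%y  ctime=%z  %n' /var/log/containers/mysql/*log
~~~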

Fact is, I'm wondering if we'd really need the "rm" part - in my patch, we already empty logfiles older than 14 days, and "rotate 14" should take care of the removal itself. I have to think a bit about that, and do some more testing.
This subject is interesting, and the specificities of mysql make it really complicated.

I'll keep you updated with my findings.

Cheers,

C.

--- Additional comment from Takashi Kajinami on 2019-02-25 14:01:20 UTC ---

Hi Cédric,

Thanks for your great help.

> Your proposal will probably not work either, especially since the "mtime" has little chance to change for mysql 
Nice catch, thanks!
I thought mtime would be updated, since we always rotate logs by using notifempty, but as you say, I should check the behavior of copytruncate when the file is empty.

> Fact is, I'm wondering if we'd really need the "rm" part
I still need to investigate some deployment patterns, but a possible effect of removing the rm part could be
that files under /var/log/containers whose names do not match any specified pattern are kept without getting rotated
(ex. ramdisk logs under /var/log/containers/ironic-inspector). But we can make some improvements here,
like using more specific name patterns so that we do not remove log files already handled by logrotate, IMO.

Please note that this could be a big problem from an operational point of view, because
 - operators cannot easily detect the problem, as the log is deleted without any warning or error
 - mysqld.log is one of the essential logs when we deal with db issues, which sometimes have a critical effect on services

Thank you,
Takashi

--- Additional comment from Cédric Jeanneret on 2019-02-25 14:10:13 UTC ---

Hello Takashi,

I've pushed a new update on the said patch - my intent is to avoid removing files that *might* be in use. Not sure if the way I'm doing it is OK, but it might already help a bit. I'm deploying an env in order to be able to test it a bit more than "in theory it should".

I of course understand the issue of not having the mysqld.log files - I've done operations as a former job ;). We'll find a solution.

Cheers,

C.

--- Additional comment from Cédric Jeanneret on 2019-02-25 15:12:11 UTC ---

Hello Takashi,

So apparently my last iteration is working, at least with the following file:

~~~
/var/log/containers/mysql/*log {
  daily
  rotate 14
  maxage 14
  # minsize 1 is required for GDPR compliance, all files in
  # /var/log/containers not managed with logrotate will be purged!
  minsize 1
  # Do not use size as it's not compatible with time-based rotation rules
  # required for GDPR compliance.
  missingok
  notifempty
  copytruncate
  delaycompress
  postrotate
    directory=$(dirname $1);
    /usr/bin/find ${directory} -maxdepth 1 -type f \
    \( -mtime +14 -or \
    -atime +14 -or \
    -ctime +14 \) -exec truncate -s0 {} \;;
    test $(ls ${directory} | wc -l) -ge 14 && \
    /usr/bin/find ${directory} -maxdepth 1 -type f \
    \( -mtime +28 -or \
    -atime +28 -or \
    -ctime +28 \) -exec rm -f  {} \;;
  endscript
}
~~~

I have the following content/behavior, on the host:
standard logrotate up to 14 files:
[root@undercloud mysql]# ls -ltr
total 56
-rw-rw----. 1 42434 42434    8 Feb 25 14:44 mariadb.log.14
-rw-rw----. 1 42434 42434    4 Feb 25 14:44 mariadb.log.13
-rw-rw----. 1 42434 42434    4 Feb 25 14:45 mariadb.log.12
-rw-rw----. 1 42434 42434    4 Feb 25 14:45 mariadb.log.11
-rw-rw----. 1 42434 42434    4 Feb 25 14:45 mariadb.log.10
-rw-rw----. 1 42434 42434    4 Feb 25 14:46 mariadb.log.9
-rw-rw----. 1 42434 42434    4 Feb 25 14:46 mariadb.log.8
-rw-rw----. 1 42434 42434    4 Feb 25 14:47 mariadb.log.7
-rw-rw----. 1 42434 42434    4 Feb 25 14:47 mariadb.log.6
-rw-rw----. 1 42434 42434    4 Feb 25 14:47 mariadb.log.5
-rw-rw----. 1 42434 42434    4 Feb 25 14:48 mariadb.log.4
-rw-rw----. 1 42434 42434    4 Feb 25 14:48 mariadb.log.3
-rw-rw----. 1 42434 42434    4 Feb 25 14:49 mariadb.log.2
-rw-rw----. 1 42434 42434 3157 Feb 25 15:01 mariadb.log.1
-rw-rw----. 1 42434 42434    0 Feb 25 15:01 mariadb.log

(in order to do so:
echo foo > /var/log/containers/mysql/mariadb.log
logrotate -f -v -s /var/lib/logrotate/logrotate-crond.status /root/logrotate.conf
)

Now the interesting part:
echo foo > /var/log/containers/mysql/mariadb.log
touch -d 2018-02-02 /var/log/containers/mysql/mariadb.log

Now we have a file with "something" in it (matching the minsize) and a really old timestamp.
logrotate -f -v -s /var/lib/logrotate/logrotate-crond.status /root/logrotate.conf

The file is empty (due to the first "find"), but NOT removed (because we don't have more than the number of files we want to keep).

This means:
- mysql doesn't lose its log file
- but we lose the content if the file is older than 14 days (a quick check is sketched below).
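
The quick check (a minimal sketch, reusing the paths from the test above):

~~~
# The inode should be unchanged (truncate keeps the file in place), and lsof
# should show no "(deleted)" entry for the mariadb log.
ls -i /var/log/containers/mysql/mariadb.log
sudo lsof -p $(sudo pidof mysqld) 2>/dev/null | grep mariadb
~~~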

It's still not ideal - but at least it's better than the current situation. Would that be sufficient for you?

Cheers,

C.

--- Additional comment from Cédric Jeanneret on 2019-03-07 10:16:12 UTC ---

Dear Takashi,

Small summary of the current status:

## Integration of tmpwatch tool:
after some thought, tests and reflection, it appears the best thing to do is to use a proper tool for the cleanup. In our case, tmpwatch is probably the best one, since it allows us to:
- cleanup files based on time
- exclude files based on patterns
- avoid the removal of directories
The following patch allows to get tmpwatch installed on the host directly¹: https://review.openstack.org/641608

As explained in the commit message, we ensure this tool will not conflict with logrotate-managed files, since we explicitly exclude their patterns, and we allow a small delay before the removal. A sketch of such an invocation is shown below.
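
For illustration, a hedged sketch of such a call (the merged change is authoritative; compare the cron script quoted in comment 2 below):

~~~
# Remove regular files untouched for the given time under /var/log/containers,
# keep directories (--nodirs), and exclude the globs that logrotate already
# manages. Note: tmpwatch interprets a bare number as hours; a suffix such as
# "15d" selects days.
tmpwatch --nodirs \
  -X "/var/log/containers/*/*log" \
  -X "/var/log/containers/*/*/*log" \
  -X "/var/log/containers/*/*err" \
  15 \
  /var/log/containers/
~~~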

## Simplification of the logrotate configuration:
since we have a proper way to manage unwanted files, we can remove the postrotate script from the logrotate configuration file.
This is done in this change: https://review.openstack.org/616945

Hopefully this will work as expected; we still need the CI results in order to ensure all is OK.

Cheers,

C.


¹ since the logrotate_crond container has no network access, we can't install things in it. And as there are discussions about the container removal, I'd like to avoid adding stuff to it anyway: https://bugs.launchpad.net/tripleo/+bug/1818674

--- Additional comment from Takashi Kajinami on 2019-03-07 23:57:48 UTC ---

Hi Cédric,

Great thanks for your support about this topic.

I did check the proposed patches, and they look good at the code level.
I'll try the configuration parameters in my env and let you know if I find any problems.

By the way, do you know a way to include tmpwatch in overcloud-images-full?
As far as I know, director deployments basically do not require a working yum
repository at installation time, so it would be better to include tmpwatch
at least in the image, so that we do not add an additional requirement here.

> And as there are discussions about the container removal
I'd like to vote my +1000 for the removal, as this could make logrotate simpler and easier,
without any SIGHUP problems...

Thank you,
Takashi

--- Additional comment from Cédric Jeanneret on 2019-03-08 06:30:30 UTC ---

Hi Takashi,

hmm, I remember seeing "yum update -y" or the like as one of the very first actions of the overcloud deploy - but that was on CentOS, and it was "long" ago. Anyway. If you want to get tmpwatch into the overcloud image, I would say:
- download the required RPMs
- inject them into the image using the `virt-sysprep' command
- install the RPMs using that very same command
- upload the modified overcloud image

So, at least on Fedora 28:
yum deplist tmpwatch
package: tmpwatch-2.11-12.fc28.x86_64
  dependency: libc.so.6(GLIBC_2.17)(64bit)
   provider: glibc-2.27-37.fc28.x86_64
  dependency: psmisc
   provider: psmisc-23.1-3.fc28.x86_64
  dependency: rtld(GNU_HASH)
   provider: glibc-2.27-37.fc28.i686
   provider: glibc-2.27-37.fc28.x86_64

So basically you need the psmisc rpm in addition to tmpwatch - you can fetch them via yum:
yum install --downloadonly psmisc tmpwatch
You'll find the RPMs in the yum cache directory ( /var/cache/yum/....)
More info about the downloadonly plugin: https://access.redhat.com/solutions/10154

Once you have the RPMs, you can do the next 2 steps in one line. You might want to copy the original overcloud image first, into stack's home directory.
Assuming you already have a copy:
virt-sysprep -a /home/stack/<OVERCLOUD_IMG> --copy-in /var/cache/yum/.../TMPWATCH_RPM:/tmp/ --copy-in /var/cache/yum/.../PSMISC_RPM:/tmp/ --install /tmp/PSMISC_RPM,/tmp/TMPWATCH_RPM --selinux-relabel

And voilà, you should have tmpwatch installed in the overcloud image. Just upload it in place (see the sketch below), and you're done.
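
For that upload step, something like this should do (a sketch; the --update-existing flag and the image path are assumptions that depend on your tripleoclient version and setup):

~~~
# Re-upload the modified image over the one already registered in glance.
source ~/stackrc
openstack overcloud image upload --update-existing --image-path /home/stack/
~~~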

Lemme know if this answers your question, or if you have some issues (I don't have an env up'n'running, so I can't test the commands right now).

Cheers,

C.

--- Additional comment from Takashi Kajinami on 2019-03-08 07:35:26 UTC ---

Hi Cédric,

I had a discussion with the support guys, and got a suggestion to create another bug, an RFE to add the tmpwatch package to overcloud-images-full.
I created it here [1]:

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1686730

So please ignore my previous comment about adding package to the image.

Thank you,
Takashi

--- Additional comment from Takashi Kajinami on 2019-03-08 07:45:14 UTC ---

Hi Cédric,

Just fyi,

I did try the tmpwatch command from the patch in my local environment, and it works well.

Thank you,
Takashi

--- Additional comment from Cédric Jeanneret on 2019-03-08 07:50:45 UTC ---

Hello Takashi,

Good news then :). I'm happy to be able to help you on this case. We just need to ensure the patches are merged, so that I can backport them to OSP 13 (upstream and downstream).

Cheers,

C.

--- Additional comment from Takashi Kajinami on 2019-03-20 01:08:48 UTC ---

Hi Cédric,

Thank you for your support on this issue.

I see that the patches are already merged to master.
Could you backport these patches to Queens/Rocky branch?

Thank you,
Takashi

--- Additional comment from Cédric Jeanneret on 2019-03-20 06:40:40 UTC ---

Hello Takashi,

I'll try, but the cherry-picks won't be that easy. I have to check with my colleagues how to manage this kind of thing (for instance, https://review.openstack.org/#/c/641608/ can't be cherry-picked "as is").

Cheers,

C.

--- Additional comment from Cédric Jeanneret on 2019-03-26 06:31:46 UTC ---

Hello Takashi,

I'm working my way through the backports, and should be able to land the three needed patches in Queens by the end of the week. I'll also work on the downstream import of those same patches in both OSP14 and OSP13 (the final target).

I'm currently lacking one patch in Rocky[1] - but it should hopefully land either today or tomorrow. As this patch is the one that really solves the issue, we can't discard it ;).

Cheers,

C.

[1] https://review.openstack.org/644781

--- Additional comment from Cédric Jeanneret on 2019-03-28 07:17:45 UTC ---

Hello Takashi,

The latest patch should hit upstream Queens shortly. After that, downstream cherry-picks will be required. I'll start right away, directly in Queens, with the two patches that are already merged upstream.

Cheers,

C.

Comment 2 Marius Cornea 2019-04-15 02:04:43 UTC
tmpwatch is not installed on the overcloud nodes while /etc/cron.daily/containers-tmpwatch gets created. Do we have another bug for including it in the overcloud-full image?


[root@controller-r00-00 ~]# cat /etc/cron.daily/containers-tmpwatch
#!/bin/sh
tmpwatch --nodirs \
  -X "/var/log/containers/*/*log" \
  -X "/var/log/containers/*/*/*log" \
  -X "/var/log/containers/*/*err" \
  15 \
  /var/log/containers/ 2>&1 | logger -t container-tmpwatch

[root@controller-r00-00 ~]# rpm -q tmpwatch
package tmpwatch is not installed

(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep rhosp-director
rhosp-director-images-14.0-20190402.1.el7ost.noarch
rhosp-director-images-ipa-x86_64-14.0-20190402.1.el7ost.noarch
rhosp-director-images-x86_64-14.0-20190402.1.el7ost.noarch

Comment 3 Cédric Jeanneret 2019-04-15 13:59:40 UTC
Hey,

So yeah, there are some other related bugs, especially https://bugzilla.redhat.com/show_bug.cgi?id=1696161 which includes the package in the overcloud base images.

This issue involves 3 bugs per release:
OSP-14:
- tripleo-heat-templates: (current BZ)
- puppet-tripleo: https://bugzilla.redhat.com/show_bug.cgi?id=1695138
- tripleo-puppet-elements: https://bugzilla.redhat.com/show_bug.cgi?id=1696161

OSP-13:
- tripleo-heat-templates: https://bugzilla.redhat.com/show_bug.cgi?id=1696166
- puppet-tripleo: https://bugzilla.redhat.com/show_bug.cgi?id=1679878
- tripleo-puppet-elements: https://bugzilla.redhat.com/show_bug.cgi?id=1696170

You need all three packages together in order to properly test the patch.

Comment 8 errata-xmlrpc 2019-04-30 17:51:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0878

