Bug 1591278
Summary: | memcache container not restarting consistently after update and reboot. | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Sofer Athlan-Guyot <sathlang> |
Component: | openstack-tripleo-heat-templates | Assignee: | Sofer Athlan-Guyot <sathlang> |
Status: | CLOSED ERRATA | QA Contact: | Raviv Bar-Tal <rbartal> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 13.0 (Queens) | CC: | ccamacho, jcoufal, jschluet, mburns, morazi, ojanas, rbartal, sathlang, sgolovat |
Target Milestone: | ga | Keywords: | Triaged |
Target Release: | 13.0 (Queens) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-heat-templates-8.0.2-38.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-06-27 13:58:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Sofer Athlan-Guyot
2018-06-14 13:05:47 UTC
Hum... Steps to reproduce ar taking the previous comment where I left it: Steps to reproduce: 1. install beta 2. update to latest puddle 3. reboot the 3ctl This is where we could reproduce it. 2 memcached on the first run, only one on the second one. Doing install and reboot was successful so we cannot yet conclude that it happen there (but we can't exclude it neither). Some more information from https://bugs.launchpad.net/tripleo/+bug/1776889 Hi, originally written there https://bugzilla.redhat.com/show_bug.cgi?id=1590297. This is a more complete analysis on what is happening. On a 3 nodes setup, after a update, restart, scale out and re-deploy , some[1] of the memcached container fail to restart with that error: [root@controller-1 ~]# docker logs memcached /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied /bin/bash: /var/log/memcached.log: Permission denied The container is in that status: fa8fc2bf967d 192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1 "/bin/bash -c 'sou..." 14 hours ago Restarting (1) 4 hours ago memcached each line correspond to a restart. On the other two controllers the restart was successful. So hold your belt tight as it's going to be hairy.... The log directory /var/log/containers/memcached is mounted directly in /var/log where we expect memcached to write its log by adding ">>/var/log/memcached.log 2>&1". In the memcached_init_logs step of the docker_config step1 config we init the log like this[2]: command: ['/bin/bash', '-c', 'source /etc/sysconfig/memcached; touch /var/log/memcached.log && chown ${USER} /var/log/memcached.log'] so far so good ... But if you look at any memcached directory where the container were successfully started, then you have[3]: - /var/log/memcached.log.1 - /var/log/memcached.log.2 with log.1 being opened by the memcached process: ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.1 lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted) Output information may be incomplete. COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME memcached 7 memcached 1w REG 252,2 41994775 97206843 /var/log/memcached.log.1 memcached 7 memcached 2w REG 252,2 41994775 97206843 /var/log/memcached.log.1 There is this error about /etc/sysconfig/memcached being deleted, I'll come back to it later. For now I just don't understand how those files are created. "log.1" and "log.2" ... it's even more puzzling as I can't create any file in the /var/log/ directory with permission denied on a working memcached container. [root@controller-0 ~]# docker exec -ti memcached bash ()[memcached@controller-0 /]$ ls -lrth /var/log/* -rw-r--r--. 1 memcached root 15M Jun 13 19:31 /var/log/memcached.log.2 -rw-r--r--. 1 memcached root 42M Jun 14 10:05 /var/log/memcached.log.1 ()[memcached@controller-0 /]$ ls -lrthd /var/log/ drwxr-xr-x. 2 root root 52 Jun 13 22:01 /var/log/ ()[memcached@controller-0 /]$ touch /var/log/memcached.log touch: cannot touch '/var/log/memcached.log': Permission denied ()[memcached@controller-0 /]$ id uid=42457(memcached) gid=42457(memcached) groups=42457(memcached),42400(kolla) and the "log.1" does belong to memcached process: ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.2 lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted) Output information may be incomplete. ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.1 lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted) Output information may be incomplete. COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME memcached 7 memcached 1w REG 252,2 43203747 97206843 /var/log/memcached.log.1 memcached 7 memcached 2w REG 252,2 43203747 97206843 /var/log/memcached.log.1 On the host: [root@controller-0 ~]# ls -lrth /var/log/containers/memcached/ total 68M -rw-r--r--. 1 42457 root 15M Jun 13 19:31 memcached.log.2 -rw-r--r--. 1 42457 root 42M Jun 14 10:19 memcached.log.1 [root@controller-0 ~]# ls -lrthd /var/log/containers/memcached/ drwxr-xr-x. 2 root root 52 Jun 13 22:01 /var/log/containers/memcached/ Which makes no sense. So there are two unknown here: 1. who creates log.1 and rotate log.2 ? 2. how come the process can write in a directory that it doesn't have write access to. I'm at loss for question 1. For question 2 I guess that the mounting sequence is flawed somehow, as the /etc/sysconfig/memcached error in the lsof output seems to indicate. Here more about it, the file system /etc/sysconfig/memcached is "deleted" but I can cat in it: # docker exec -ti memcached bash ()[memcached@controller-0 /]$ lsof /etc/sysconfig/memcached lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted) Output information may be incomplete. ()[memcached@controller-0 /]$ cat /etc/sysconfig/memcached PORT="11211" USER="memcached" MAXCONN="8192" CACHESIZE="16005" OPTIONS="-vv -l 172.17.1.15 -U 0 -t 8 >> /var/log/memcached.log 2>&1" ()[memcached@controller-0 /]$ This tends to confirm that something is wrong in the mounting sequence. All in all I think we should change the way the container are created to make them more robust after any reboot operation. If somebody has hint at what is happening there that would be awesome ... and thanks to read till the end. [1] first iteration of the test found 2 memcached failing to restart, on another run only 1 fail to restart and on another (whithout the scale out and the reconfig) all three restarted successfully. I can find the "log.1" in a deployment as well. [2] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/memcached.yaml#L72..L84 [3] This seems to happen only on rhel platform as in the CI jobs memcached.log seems to be created fine. So there is log rotation done by the crond container and that is what must bite us: /var/lib/config-data/crond/etc/logrotate-crond.conf: /var/log/containers/*/*log /var/log/containers/*/*/*log { rotate 14 size 10M missingok notifempty delaycompress postrotate /sbin/lsof -nPs +L1 +D /var/log 2>&1|\ awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print $2}' |\ /bin/xargs -n1 -r -t kill -HUP endscript } hence the log.1 file. This other review will prevent massing memcached logging to the journal when debug is activated. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086 |