Bug 1591278

Summary: memcache container not restarting consistently after update and reboot.
Product: Red Hat OpenStack Reporter: Sofer Athlan-Guyot <sathlang>
Component: openstack-tripleo-heat-templatesAssignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA QA Contact: Raviv Bar-Tal <rbartal>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 13.0 (Queens)CC: ccamacho, jcoufal, jschluet, mburns, morazi, ojanas, rbartal, rhos-flags, sathlang, sgolovat
Target Milestone: gaKeywords: Triaged
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-38.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:58:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Sofer Athlan-Guyot 2018-06-14 13:05:47 UTC
Description of problem:

Originally discovered there https://bugzilla.redhat.com/show_bug.cgi?id=1590297

Basically install osp13 with 3ctl and reboot them.  You then have a good chance to have one or more memcached container failing to restart with:

    [root@controller-1 ~]# docker logs memcached
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied

fa8fc2bf967d        192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1                   "/bin/bash -c 'sou..."   17 hours ago        Restarting (1) 21 minutes ago                       memcached


Version-Release number of selected component (if applicable):
 - 2018-06-12.1 of the container.

[root@controller-1 ~]# rhos-release -L
Installed repositories (rhel-7.5):
  13
  ceph-3
  ceph-osd-3
  rhel-7.5

http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2018-06-12.1/RH7-RHOS-13.0/$basearch/os

How reproducible:

twice but got a variable amount of memcached container failing (two in the first run, one in the first run)

Steps to Reproduce:
1. install
2. reboot the 3 ctl

Additional info:

In a fresh deployment the memcached container open its log in /var/log/memcached.log.1 .  The ".1" doesn't make any sense and we currently don't know how and who create it.

Comment 1 Sofer Athlan-Guyot 2018-06-14 13:07:33 UTC
Hum...

Steps to reproduce ar

Comment 2 Sofer Athlan-Guyot 2018-06-14 13:10:48 UTC
taking the previous comment where I left it:

Steps to reproduce:
 1. install beta
 2. update to latest puddle
 3. reboot the 3ctl

This is where we could reproduce it.  2 memcached on the first run, only one on the second one.

Doing install and reboot was successful so we cannot yet conclude that it happen there (but we can't exclude it neither).

Comment 3 Sofer Athlan-Guyot 2018-06-14 13:14:39 UTC
Some more information from https://bugs.launchpad.net/tripleo/+bug/1776889

Hi,

originally written there https://bugzilla.redhat.com/show_bug.cgi?id=1590297.

This is a more complete analysis on what is happening.

On a 3 nodes setup, after a update, restart, scale out and re-deploy ,
some[1] of the memcached container fail to restart with that error:

    [root@controller-1 ~]# docker logs memcached
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied

The container is in that status:

    fa8fc2bf967d 192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1 "/bin/bash -c 'sou..." 14 hours ago Restarting (1) 4 hours ago memcached

each line correspond to a restart.

On the other two controllers the restart was successful.

So hold your belt tight as it's going to be hairy....

The log directory /var/log/containers/memcached is mounted directly in
/var/log where we expect memcached to write its log by adding
">>/var/log/memcached.log 2>&1". In the memcached_init_logs step of
the docker_config step1 config we init the log like this[2]:

   command: ['/bin/bash', '-c', 'source /etc/sysconfig/memcached; touch /var/log/memcached.log && chown ${USER} /var/log/memcached.log']

so far so good ...

But if you look at any memcached directory where the container were
successfully started, then you have[3]:

  - /var/log/memcached.log.1
  - /var/log/memcached.log.2

with log.1 being opened by the memcached process:

    ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.1
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    memcached 7 memcached 1w REG 252,2 41994775 97206843 /var/log/memcached.log.1
    memcached 7 memcached 2w REG 252,2 41994775 97206843 /var/log/memcached.log.1

There is this error about /etc/sysconfig/memcached being deleted, I'll
come back to it later. For now I just don't understand how those
files are created. "log.1" and "log.2" ... it's even more puzzling
as I can't create any file in the /var/log/ directory with permission
denied on a working memcached container.

    [root@controller-0 ~]# docker exec -ti memcached bash
    ()[memcached@controller-0 /]$ ls -lrth /var/log/*
    -rw-r--r--. 1 memcached root 15M Jun 13 19:31 /var/log/memcached.log.2
    -rw-r--r--. 1 memcached root 42M Jun 14 10:05 /var/log/memcached.log.1
    ()[memcached@controller-0 /]$ ls -lrthd /var/log/
    drwxr-xr-x. 2 root root 52 Jun 13 22:01 /var/log/
    ()[memcached@controller-0 /]$ touch /var/log/memcached.log
    touch: cannot touch '/var/log/memcached.log': Permission denied
    ()[memcached@controller-0 /]$ id
    uid=42457(memcached) gid=42457(memcached) groups=42457(memcached),42400(kolla)

and the "log.1" does belong to memcached process:

    ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.2
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.1
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    memcached 7 memcached 1w REG 252,2 43203747 97206843 /var/log/memcached.log.1
    memcached 7 memcached 2w REG 252,2 43203747 97206843 /var/log/memcached.log.1

On the host:

    [root@controller-0 ~]# ls -lrth /var/log/containers/memcached/
    total 68M
    -rw-r--r--. 1 42457 root 15M Jun 13 19:31 memcached.log.2
    -rw-r--r--. 1 42457 root 42M Jun 14 10:19 memcached.log.1
    [root@controller-0 ~]# ls -lrthd /var/log/containers/memcached/
    drwxr-xr-x. 2 root root 52 Jun 13 22:01 /var/log/containers/memcached/

Which makes no sense.

So there are two unknown here:
 1. who creates log.1 and rotate log.2 ?
 2. how come the process can write in a directory that it doesn't have
    write access to.

I'm at loss for question 1. For question 2 I guess that the mounting
sequence is flawed somehow, as the /etc/sysconfig/memcached error in
the lsof output seems to indicate. Here more about it, the file
system /etc/sysconfig/memcached is "deleted" but I can cat in it:

    # docker exec -ti memcached bash
    ()[memcached@controller-0 /]$ lsof /etc/sysconfig/memcached
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    ()[memcached@controller-0 /]$ cat /etc/sysconfig/memcached
    PORT="11211"
    USER="memcached"
    MAXCONN="8192"
    CACHESIZE="16005"
    OPTIONS="-vv -l 172.17.1.15 -U 0 -t 8 >> /var/log/memcached.log 2>&1"
    ()[memcached@controller-0 /]$

This tends to confirm that something is wrong in the mounting
sequence.

All in all I think we should change the way the container are created to make them
more robust after any reboot operation.

If somebody has hint at what is happening there that would be awesome ... and thanks
to read till the end.

[1] first iteration of the test found 2 memcached failing to restart,
on another run only 1 fail to restart and on another (whithout the
scale out and the reconfig) all three restarted successfully. I can
find the "log.1" in a deployment as well.

[2] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/memcached.yaml#L72..L84

[3] This seems to happen only on rhel platform as in the CI jobs memcached.log seems to be created fine.

Comment 4 Sofer Athlan-Guyot 2018-06-14 16:48:23 UTC
So there is log rotation done by the crond container and that is what must bite us:

/var/lib/config-data/crond/etc/logrotate-crond.conf:

/var/log/containers/*/*log /var/log/containers/*/*/*log {
  rotate 14
  size 10M
  missingok
  notifempty
 
  delaycompress
 
  postrotate
    /sbin/lsof -nPs +L1 +D /var/log 2>&1|\
    awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print $2}' |\
    /bin/xargs -n1 -r -t kill -HUP
  endscript
}

hence the log.1 file.

Comment 5 Sofer Athlan-Guyot 2018-06-15 09:06:46 UTC
This other review will prevent massing memcached logging to the journal when debug is activated.

Comment 11 errata-xmlrpc 2018-06-27 13:58:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086