Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1591278 - memcache container not restarting consistently after update and reboot.
memcache container not restarting consistently after update and reboot.
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates (Show other bugs)
13.0 (Queens)
Unspecified Unspecified
urgent Severity urgent
: ga
: 13.0 (Queens)
Assigned To: Sofer Athlan-Guyot
Raviv Bar-Tal
: Triaged
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2018-06-14 09:05 EDT by Sofer Athlan-Guyot
Modified: 2018-06-27 09:59 EDT (History)
8 users (show)

See Also:
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-38.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-27 09:58:15 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1776889 None None None 2018-06-14 09:10 EDT
OpenStack gerrit 575676 None master: MERGED tripleo-heat-templates: Log memcached to stdout/err only (I3b5aec6b7f7c5fa4f84e3c3eb243c155a6189e2c) 2018-06-20 12:20 EDT
OpenStack gerrit 575693 None master: MERGED tripleo-heat-templates: Activate memcached debug only when using MemcachedDebug param. (I85b6ca421d326c6e764fade66726407... 2018-06-20 12:20 EDT
OpenStack gerrit 576116 None stable/queens: MERGED tripleo-heat-templates: Log memcached to stdout/err only (I3b5aec6b7f7c5fa4f84e3c3eb243c155a6189e2c) 2018-06-20 12:20 EDT
OpenStack gerrit 576823 None stable/queens: NEW tripleo-heat-templates: Activate memcached debug only when using MemcachedDebug param. (I85b6ca421d326c6e764fade66726407... 2018-06-20 12:20 EDT
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 09:59 EDT

  None (edit)
Description Sofer Athlan-Guyot 2018-06-14 09:05:47 EDT
Description of problem:

Originally discovered there https://bugzilla.redhat.com/show_bug.cgi?id=1590297

Basically install osp13 with 3ctl and reboot them.  You then have a good chance to have one or more memcached container failing to restart with:

    [root@controller-1 ~]# docker logs memcached
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied

fa8fc2bf967d        192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1                   "/bin/bash -c 'sou..."   17 hours ago        Restarting (1) 21 minutes ago                       memcached


Version-Release number of selected component (if applicable):
 - 2018-06-12.1 of the container.

[root@controller-1 ~]# rhos-release -L
Installed repositories (rhel-7.5):
  13
  ceph-3
  ceph-osd-3
  rhel-7.5

http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2018-06-12.1/RH7-RHOS-13.0/$basearch/os

How reproducible:

twice but got a variable amount of memcached container failing (two in the first run, one in the first run)

Steps to Reproduce:
1. install
2. reboot the 3 ctl

Additional info:

In a fresh deployment the memcached container open its log in /var/log/memcached.log.1 .  The ".1" doesn't make any sense and we currently don't know how and who create it.
Comment 1 Sofer Athlan-Guyot 2018-06-14 09:07:33 EDT
Hum...

Steps to reproduce ar
Comment 2 Sofer Athlan-Guyot 2018-06-14 09:10:48 EDT
taking the previous comment where I left it:

Steps to reproduce:
 1. install beta
 2. update to latest puddle
 3. reboot the 3ctl

This is where we could reproduce it.  2 memcached on the first run, only one on the second one.

Doing install and reboot was successful so we cannot yet conclude that it happen there (but we can't exclude it neither).
Comment 3 Sofer Athlan-Guyot 2018-06-14 09:14:39 EDT
Some more information from https://bugs.launchpad.net/tripleo/+bug/1776889

Hi,

originally written there https://bugzilla.redhat.com/show_bug.cgi?id=1590297.

This is a more complete analysis on what is happening.

On a 3 nodes setup, after a update, restart, scale out and re-deploy ,
some[1] of the memcached container fail to restart with that error:

    [root@controller-1 ~]# docker logs memcached
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied
    /bin/bash: /var/log/memcached.log: Permission denied

The container is in that status:

    fa8fc2bf967d 192.168.24.1:8787/rhosp13/openstack-memcached:2018-06-12.1 "/bin/bash -c 'sou..." 14 hours ago Restarting (1) 4 hours ago memcached

each line correspond to a restart.

On the other two controllers the restart was successful.

So hold your belt tight as it's going to be hairy....

The log directory /var/log/containers/memcached is mounted directly in
/var/log where we expect memcached to write its log by adding
">>/var/log/memcached.log 2>&1". In the memcached_init_logs step of
the docker_config step1 config we init the log like this[2]:

   command: ['/bin/bash', '-c', 'source /etc/sysconfig/memcached; touch /var/log/memcached.log && chown ${USER} /var/log/memcached.log']

so far so good ...

But if you look at any memcached directory where the container were
successfully started, then you have[3]:

  - /var/log/memcached.log.1
  - /var/log/memcached.log.2

with log.1 being opened by the memcached process:

    ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.1
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    memcached 7 memcached 1w REG 252,2 41994775 97206843 /var/log/memcached.log.1
    memcached 7 memcached 2w REG 252,2 41994775 97206843 /var/log/memcached.log.1

There is this error about /etc/sysconfig/memcached being deleted, I'll
come back to it later. For now I just don't understand how those
files are created. "log.1" and "log.2" ... it's even more puzzling
as I can't create any file in the /var/log/ directory with permission
denied on a working memcached container.

    [root@controller-0 ~]# docker exec -ti memcached bash
    ()[memcached@controller-0 /]$ ls -lrth /var/log/*
    -rw-r--r--. 1 memcached root 15M Jun 13 19:31 /var/log/memcached.log.2
    -rw-r--r--. 1 memcached root 42M Jun 14 10:05 /var/log/memcached.log.1
    ()[memcached@controller-0 /]$ ls -lrthd /var/log/
    drwxr-xr-x. 2 root root 52 Jun 13 22:01 /var/log/
    ()[memcached@controller-0 /]$ touch /var/log/memcached.log
    touch: cannot touch '/var/log/memcached.log': Permission denied
    ()[memcached@controller-0 /]$ id
    uid=42457(memcached) gid=42457(memcached) groups=42457(memcached),42400(kolla)

and the "log.1" does belong to memcached process:

    ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.2
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    ()[memcached@controller-0 /]$ lsof /var/log/memcached.log.1
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    memcached 7 memcached 1w REG 252,2 43203747 97206843 /var/log/memcached.log.1
    memcached 7 memcached 2w REG 252,2 43203747 97206843 /var/log/memcached.log.1

On the host:

    [root@controller-0 ~]# ls -lrth /var/log/containers/memcached/
    total 68M
    -rw-r--r--. 1 42457 root 15M Jun 13 19:31 memcached.log.2
    -rw-r--r--. 1 42457 root 42M Jun 14 10:19 memcached.log.1
    [root@controller-0 ~]# ls -lrthd /var/log/containers/memcached/
    drwxr-xr-x. 2 root root 52 Jun 13 22:01 /var/log/containers/memcached/

Which makes no sense.

So there are two unknown here:
 1. who creates log.1 and rotate log.2 ?
 2. how come the process can write in a directory that it doesn't have
    write access to.

I'm at loss for question 1. For question 2 I guess that the mounting
sequence is flawed somehow, as the /etc/sysconfig/memcached error in
the lsof output seems to indicate. Here more about it, the file
system /etc/sysconfig/memcached is "deleted" but I can cat in it:

    # docker exec -ti memcached bash
    ()[memcached@controller-0 /]$ lsof /etc/sysconfig/memcached
    lsof: WARNING: can't stat() xfs file system /etc/sysconfig/memcached (deleted)
          Output information may be incomplete.
    ()[memcached@controller-0 /]$ cat /etc/sysconfig/memcached
    PORT="11211"
    USER="memcached"
    MAXCONN="8192"
    CACHESIZE="16005"
    OPTIONS="-vv -l 172.17.1.15 -U 0 -t 8 >> /var/log/memcached.log 2>&1"
    ()[memcached@controller-0 /]$

This tends to confirm that something is wrong in the mounting
sequence.

All in all I think we should change the way the container are created to make them
more robust after any reboot operation.

If somebody has hint at what is happening there that would be awesome ... and thanks
to read till the end.

[1] first iteration of the test found 2 memcached failing to restart,
on another run only 1 fail to restart and on another (whithout the
scale out and the reconfig) all three restarted successfully. I can
find the "log.1" in a deployment as well.

[2] https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/memcached.yaml#L72..L84

[3] This seems to happen only on rhel platform as in the CI jobs memcached.log seems to be created fine.
Comment 4 Sofer Athlan-Guyot 2018-06-14 12:48:23 EDT
So there is log rotation done by the crond container and that is what must bite us:

/var/lib/config-data/crond/etc/logrotate-crond.conf:

/var/log/containers/*/*log /var/log/containers/*/*/*log {
  rotate 14
  size 10M
  missingok
  notifempty
 
  delaycompress
 
  postrotate
    /sbin/lsof -nPs +L1 +D /var/log 2>&1|\
    awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print $2}' |\
    /bin/xargs -n1 -r -t kill -HUP
  endscript
}

hence the log.1 file.
Comment 5 Sofer Athlan-Guyot 2018-06-15 05:06:46 EDT
This other review will prevent massing memcached logging to the journal when debug is activated.
Comment 11 errata-xmlrpc 2018-06-27 09:58:15 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

Note You need to log in before you can comment on or make changes to this bug.