Bug 1690784 - [OSP15] Controller-replacement fails (controller-removal) because : /var/log/containers/nova/nova-manage.log is owned by root:root
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 15.0 (Stein)
Assignee: Martin Schuppert
QA Contact: Archit Modi
URL:
Whiteboard:
Depends On: 1685506 1690787 1707816 1707817 1742169 1743402
Blocks:
Reported: 2019-03-20 09:00 UTC by Martin Schuppert
Modified: 2019-09-26 10:48 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-10.5.1-0.20190429000408.3415df5.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of: 1685506
Environment:
Last Closed: 2019-09-21 11:20:49 UTC
Target Upstream Version:
Embargoed:



Links
System                 | ID             | Private | Priority | Status | Summary | Last Updated
Gerrithub.io           | 41562          | 0       | None     | None   | None    | 2019-08-18 08:31:40 UTC
Launchpad              | 1820590        | 0       | None     | None   | None    | 2019-03-20 09:00:56 UTC
OpenStack gerrit       | 643936         | 0       | None     | None   | None    | 2019-03-20 09:00:56 UTC
Red Hat Product Errata | RHEA-2019:2811 | 0       | None     | None   | None    | 2019-09-21 11:21:01 UTC

Description Martin Schuppert 2019-03-20 09:00:56 UTC
+++ This bug was initially created as a clone of Bug #1685506 +++

Description of problem:
Controller-replacement fails (controller-removal) because : /var/log/containers/nova/nova-manage.log is owned by root:root

Version-Release number of selected component (if applicable):
OSP14 2019-02-27.1

How reproducible:
always

Steps to Reproduce:

Via automation:
Run: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/octavia/job/DFG-network-octavia-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-controller_replacement-normal/

Manually:
1. Deploy an HA OSP14 overcloud.
2. Try to remove one controller:
Rerun overcloud_deploy.sh with the following option added:
-e /home/stack/remove-controller.yaml \

cat /home/stack/remove-controller.yaml
parameters:
  ControllerRemovalPolicies:
    [{'resource_list': ['0']}]


Actual results:
Overcloud controller removal fails with:
http://pastebin.test.redhat.com/731244
The nova_api container on controller-1 fails to start. The deployment log file shows:

  "IOError: [Errno 13] Permission denied: '/var/log/nova/nova-manage.log'", 

Expected results:
Controller removal succeeds, finishes without errors, and all overcloud
agents are up and operational.

--- Additional comment from  on 2019-03-05 11:05:18 UTC ---

The sosreports and the stack user's home directory are at:
http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ1685506/

--- Additional comment from  on 2019-03-05 11:07:24 UTC ---

As shown below, the rest of the nova container logs are owned by the Kolla user (uid 42436), as they should be,
but nova-manage.log is owned by root:

[root@controller-1 ~]# ls -l /var/log/containers/nova
total 33072
-rw-r--r--. 1 42436 42436  6120842 Mar  5 10:41 nova-api.log
-rw-r--r--. 1 42436 42436 10828856 Mar  5 09:00 nova-api.log.1
[...]
-rw-r--r--. 1 root  root         0 Mar  4 17:23 nova-manage.log
-rw-r--r--. 1 42436 42436        0 Mar  5 00:01 nova-metadata-api.log
-rw-r--r--. 1 42436 42436   761548 Mar  5 00:01 nova-metadata-api.log.1


[stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'ls -l /var/log/containers/nova|grep manage'

controller-1 | SUCCESS | rc=0 >>
-rw-r--r--. 1 root  root         0 Mar  4 17:23 nova-manage.log

[stack@undercloud-0 ~]$ ansible controller-2 -mshell -b -a'ls -l /var/log/containers/nova|grep manage'

controller-2 | SUCCESS | rc=0 >>
-rw-r--r--. 1 42436 42436        0 Mar  5 00:01 nova-manage.log

--- Additional comment from Artem Hrechanychenko on 2019-03-06 15:00:15 UTC ---

Hello Pini,
I reproduced this in my environment too: https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-df-controller_replacement-14-virthost-3cont_3comp_3ceph-yes_UC_SSL-yes_OC_SSL-ceph-ipv4-vxlan-replace_controller-RHELOSP-31864/

OSP14 puddle - 2019-02-27.1

--- Additional comment from Martin Schuppert on 2019-03-15 11:01:31 UTC ---

Because the sosreports lack system logs, I tried to reproduce the issue with puddle 2019-02-27.1, but I don't see the wrong ownership on the nova-manage log.

After deploy:

The only nova-manage log on controller-0:

[root@controller-0 ~]#  ls -la /var/log/containers/nova/ |grep manage
-rw-r--r--.  1 42436 42436        0 Mar 15 00:00 nova-manage.log
-rw-r--r--.  1 42436 42436   274848 Mar 15 00:00 nova-manage.log.1

After replacement:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 0ab774a5-233f-46e9-a429-0949b969d6db | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 51344592-2b97-455f-a936-b7826eae7b30 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 0c5fe3a4-8a66-4f45-b85b-0855b02f277f | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
| 4a046882-093e-4e00-8bed-46fcf6e72603 | controller-3 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

[root@controller-0 ~]# ls -la /var/log/containers/nova/ |grep manage
-rw-r--r--.  1 42436 42436        0 Mar 15 00:00 nova-manage.log
-rw-r--r--.  1 42436 42436   274848 Mar 15 00:00 nova-manage.log.1

Does that job run any nova-manage commands as root outside the TripleO workflow? If one was initially triggered on controller-1 as root, nova-manage.log would be created with root ownership, as shown in the description, and the reported issue could then occur.

In any case, we'll submit a patch to chown the logs in case something gets triggered manually as root.
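
To illustrate the idea of re-owning the logs, here is a minimal shell sketch. It is not the actual tripleo-heat-templates patch. The function names list_misowned_logs and fix_log_ownership are hypothetical, and the defaults (uid/gid 42436, /var/log/containers/nova) are taken from the listings in this report. It assumes a find that supports -maxdepth (GNU and BSD find do).

```shell
#!/bin/sh
# Sketch only: not the actual tripleo-heat-templates patch.
# Defaults (42436, /var/log/containers/nova) come from the listings in this report.

# Print regular files directly under the log directory whose owner or group
# differs from the expected numeric IDs.
list_misowned_logs() {
    log_dir="${1:-/var/log/containers/nova}"
    expected_uid="${2:-42436}"
    expected_gid="${3:-42436}"
    find "$log_dir" -maxdepth 1 -type f \
        \( ! -user "$expected_uid" -o ! -group "$expected_gid" \) -print
}

# Re-own those files. Changing ownership to another user requires root.
fix_log_ownership() {
    log_dir="${1:-/var/log/containers/nova}"
    expected_uid="${2:-42436}"
    expected_gid="${3:-42436}"
    find "$log_dir" -maxdepth 1 -type f \
        \( ! -user "$expected_uid" -o ! -group "$expected_gid" \) \
        -exec chown "$expected_uid:$expected_gid" {} +
}
```

Run as root on an affected controller, list_misowned_logs with no arguments would be expected to print the root-owned nova-manage.log, and fix_log_ownership would hand it back to 42436:42436. Files that already have the expected owner are left untouched.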

Comment 17 pkomarov 2019-09-11 06:53:15 UTC
Verified.

(undercloud) [stack@undercloud-0 ~]$ rpm -qa|grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-10.6.1-0.20190905170437.b33b839.el8ost.noarch




(undercloud) [stack@undercloud-0 ~]$ ansible overcloud_nodes -mshell -b -a'ls -l /var/log/containers/nova/nova-manage.log' 
 [WARNING]: Found both group and host with same name: undercloud

controller-0 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname controller-0: Name or service not known",
    "unreachable": true
}
compute-1 | FAILED | rc=2 >>
ls: cannot access '/var/log/containers/nova/nova-manage.log': No such file or directorynon-zero return code

compute-0 | CHANGED | rc=0 >>
-rw-r--r--. 1 42436 42436 0 Sep 10 21:22 /var/log/containers/nova/nova-manage.log

controller-3 | CHANGED | rc=0 >>
-rw-r--r--. 1 42436 42436 0 Sep 11 00:39 /var/log/containers/nova/nova-manage.log

controller-1 | CHANGED | rc=0 >>
-rw-------. 1 42436 42436 0 Sep 10 22:19 /var/log/containers/nova/nova-manage.log

controller-2 | CHANGED | rc=0 >>
-rw-r--r--. 1 42436 42436 0 Sep 11 00:29 /var/log/containers/nova/nova-manage.log


The controller replacement procedure is successful:
http://pastebin.test.redhat.com/796167

Comment 21 errata-xmlrpc 2019-09-21 11:20:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

