Bug 1690784

Summary: [OSP15] Controller-replacement fails (controller-removal) because : /var/log/containers/nova/nova-manage.log is owned by root:root
Product: Red Hat OpenStack Reporter: Martin Schuppert <mschuppe>
Component: openstack-tripleo-heat-templates Assignee: Martin Schuppert <mschuppe>
Status: CLOSED ERRATA QA Contact: Archit Modi <amodi>
Severity: high Docs Contact:
Priority: high    
Version: 14.0 (Rocky) CC: agurenko, ahrechan, amodi, jjoyce, jschluet, lyarwood, mbooth, mburns, mschuppe, nlevinki, pkomarov, sclewis, slinaber, tvignaud
Target Milestone: ga Keywords: Triaged, ZStream
Target Release: 15.0 (Stein)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-10.5.1-0.20190429000408.3415df5.el8ost Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1685506 Environment:
Last Closed: 2019-09-21 11:20:49 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1685506, 1690787, 1707816, 1707817, 1742169, 1743402    
Bug Blocks:    

Description Martin Schuppert 2019-03-20 09:00:56 UTC
+++ This bug was initially created as a clone of Bug #1685506 +++

Description of problem:
Controller replacement fails at the controller-removal step because /var/log/containers/nova/nova-manage.log is owned by root:root.

Version-Release number of selected component (if applicable):
OSP14 2019-02-27.1

How reproducible:
always

Steps to Reproduce:

Via automation:
Run: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/octavia/job/DFG-network-octavia-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-controller_replacement-normal/

Manually:
1. Deploy an HA OSP14 overcloud.
2. Try to remove one controller:
Rerun overcloud_deploy.sh with this environment file added:
-e /home/stack/remove-controller.yaml \

cat /home/stack/remove-controller.yaml
parameters:
  ControllerRemovalPolicies:
    [{'resource_list': ['0']}]


Actual results:
Overcloud controller removal fails with:
http://pastebin.test.redhat.com/731244
The nova_api container on Controller-1 fails to start because of the following error (from the deployment log file):

  "IOError: [Errno 13] Permission denied: '/var/log/nova/nova-manage.log'", 

Expected results:
Controller removal succeeds, finishes without errors, and all overcloud
agents are up and operational.

--- Additional comment from  on 2019-03-05 11:05:18 UTC ---

The sos reports and the stack user's home directory are at:
http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ1685506/

--- Additional comment from  on 2019-03-05 11:07:24 UTC ---

As shown below, the rest of the nova container logs are owned by the Kolla user ID 42436, as they should be,
but nova-manage.log is owned by root:

[root@controller-1 ~]# ls -l /var/log/containers/nova
total 33072
-rw-r--r--. 1 42436 42436  6120842 Mar  5 10:41 nova-api.log
-rw-r--r--. 1 42436 42436 10828856 Mar  5 09:00 nova-api.log.1
[...]
-rw-r--r--. 1 root  root         0 Mar  4 17:23 nova-manage.log
-rw-r--r--. 1 42436 42436        0 Mar  5 00:01 nova-metadata-api.log
-rw-r--r--. 1 42436 42436   761548 Mar  5 00:01 nova-metadata-api.log.1


[stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'ls -l /var/log/containers/nova|grep manage'

controller-1 | SUCCESS | rc=0 >>
-rw-r--r--. 1 root  root         0 Mar  4 17:23 nova-manage.log

[stack@undercloud-0 ~]$ ansible controller-2 -mshell -b -a'ls -l /var/log/containers/nova|grep manage'

controller-2 | SUCCESS | rc=0 >>
-rw-r--r--. 1 42436 42436        0 Mar  5 00:01 nova-manage.log
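
The manual check above can be automated. The following is a minimal sketch, not something taken from this report: it parses `ls -l` output and returns the entries whose owner or group differs from the expected ID 42436 seen in the listings. The helper name `misowned_entries` and the constant `EXPECTED_OWNER` are hypothetical.

```python
EXPECTED_OWNER = "42436"  # uid/gid shown for the nova logs in the listings above


def misowned_entries(ls_output, owner=EXPECTED_OWNER, group=EXPECTED_OWNER):
    """Return file names from `ls -l` output that are not owned by owner:group.

    Hypothetical helper for illustration only.
    """
    bad = []
    for line in ls_output.splitlines():
        fields = line.split(None, 8)
        # A regular `ls -l` entry has 9 fields: mode, links, owner, group,
        # size, month, day, time, name. Lines such as "total N" or "[...]"
        # have fewer fields and are skipped.
        if len(fields) < 9:
            continue
        if fields[2] != owner or fields[3] != group:
            bad.append(fields[8])
    return bad
```

Fed the controller-1 listing, this would flag only nova-manage.log; fed the controller-2 output, it would return an empty list.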

--- Additional comment from Artem Hrechanychenko on 2019-03-06 15:00:15 UTC ---

Hello Pini,
I reproduced this in my environment too: https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-df-controller_replacement-14-virthost-3cont_3comp_3ceph-yes_UC_SSL-yes_OC_SSL-ceph-ipv4-vxlan-replace_controller-RHELOSP-31864/

OSP14 puddle - 2019-02-27.1

--- Additional comment from Martin Schuppert on 2019-03-15 11:01:31 UTC ---

Because the sosreports are missing the system logs, I tried to reproduce the issue with puddle 2019-02-27.1, but I don't see the wrong ownership on the nova-manage log.

After deploy:

The only nova-manage log on controller-0:

[root@controller-0 ~]#  ls -la /var/log/containers/nova/ |grep manage
-rw-r--r--.  1 42436 42436        0 Mar 15 00:00 nova-manage.log
-rw-r--r--.  1 42436 42436   274848 Mar 15 00:00 nova-manage.log.1

After replacement:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 0ab774a5-233f-46e9-a429-0949b969d6db | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 51344592-2b97-455f-a936-b7826eae7b30 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 0c5fe3a4-8a66-4f45-b85b-0855b02f277f | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
| 4a046882-093e-4e00-8bed-46fcf6e72603 | controller-3 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

[root@controller-0 ~]# ls -la /var/log/containers/nova/ |grep manage
-rw-r--r--.  1 42436 42436        0 Mar 15 00:00 nova-manage.log
-rw-r--r--.  1 42436 42436   274848 Mar 15 00:00 nova-manage.log.1

Does that job run any nova-manage commands as root outside the TripleO workflow? If one was initially triggered on Controller-1 as root, nova-manage.log would be created owned by root, as shown in the description, and the reported issue could then occur.

In any case, we'll submit a patch to chown the logs in case something gets triggered as root manually.
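
The kind of remediation described here can be sketched as follows. This is only an illustration of chowning stray log files back to the expected owner; the actual tripleo-heat-templates patch is not shown in this report and may work differently. The names `chown_logs`, `NOVA_UID` and `NOVA_GID` are hypothetical, and the ID 42436 is taken from the listings in this report.

```python
import os

NOVA_UID = 42436  # uid/gid shown for the nova logs in this report
NOVA_GID = 42436


def chown_logs(log_dir, uid=NOVA_UID, gid=NOVA_GID, chown=os.chown):
    """Chown every entry under log_dir that is not already uid:gid.

    Returns the list of paths that were changed. The `chown` parameter is
    injectable only so this sketch can be exercised without root.
    """
    changed = []
    for root, dirs, files in os.walk(log_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # leave symlinks alone in this sketch
            st = os.stat(path)
            if st.st_uid != uid or st.st_gid != gid:
                chown(path, uid, gid)
                changed.append(path)
    return changed
```

Run as root against /var/log/containers/nova, a call such as `chown_logs("/var/log/containers/nova")` would reset a root-owned nova-manage.log to 42436:42436 and leave correctly owned files untouched.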

Comment 17 pkomarov 2019-09-11 06:53:15 UTC
Verified:

(undercloud) [stack@undercloud-0 ~]$ rpm -qa|grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-10.6.1-0.20190905170437.b33b839.el8ost.noarch




(undercloud) [stack@undercloud-0 ~]$ ansible overcloud_nodes -mshell -b -a'ls -l /var/log/containers/nova/nova-manage.log' 
 [WARNING]: Found both group and host with same name: undercloud

controller-0 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname controller-0: Name or service not known",
    "unreachable": true
}
compute-1 | FAILED | rc=2 >>
ls: cannot access '/var/log/containers/nova/nova-manage.log': No such file or directorynon-zero return code

compute-0 | CHANGED | rc=0 >>
-rw-r--r--. 1 42436 42436 0 Sep 10 21:22 /var/log/containers/nova/nova-manage.log

controller-3 | CHANGED | rc=0 >>
-rw-r--r--. 1 42436 42436 0 Sep 11 00:39 /var/log/containers/nova/nova-manage.log

controller-1 | CHANGED | rc=0 >>
-rw-------. 1 42436 42436 0 Sep 10 22:19 /var/log/containers/nova/nova-manage.log

controller-2 | CHANGED | rc=0 >>
-rw-r--r--. 1 42436 42436 0 Sep 11 00:29 /var/log/containers/nova/nova-manage.log


The controller replacement procedure is successful:
http://pastebin.test.redhat.com/796167

Comment 21 errata-xmlrpc 2019-09-21 11:20:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811