Bug 1685506

Summary: [OSP14] Controller replacement fails (controller removal) because /var/log/containers/nova/nova-manage.log is owned by root:root
Product: Red Hat OpenStack
Reporter: pkomarov
Component: openstack-tripleo-heat-templates
Assignee: Martin Schuppert <mschuppe>
Status: CLOSED ERRATA
QA Contact: Joe H. Rahme <jhakimra>
Severity: high
Priority: high
Version: 14.0 (Rocky)
CC: ahrechan, jjoyce, jschluet, mbooth, mburns, mschuppe, nlevinki, pkopec, rheslop, slinaber, tvignaud
Target Milestone: z3
Keywords: Reopened, Triaged, ZStream
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-9.3.1-0.20190314162764.d0a6cb1.el7ost
Clones: 1690784, 1690787, 1707816 (view as bug list)
Last Closed: 2019-05-08 13:07:16 UTC
Type: Bug
Bug Blocks: 1690784, 1690787, 1707816, 1707817    

Description pkomarov 2019-03-05 11:00:50 UTC
Description of problem:
Controller replacement fails (controller removal) because /var/log/containers/nova/nova-manage.log is owned by root:root.

Version-Release number of selected component (if applicable):
OSP14 2019-02-27.1

How reproducible:
always

Steps to Reproduce:

Via automation:
run: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/octavia/job/DFG-network-octavia-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-controller_replacement-normal/

Manually:
1. Deploy an HA OSP14 overcloud.
2. Try to remove one controller:
rerun the overcloud_deploy.sh with an added (see the rerun sketch after the yaml below):
-e /home/stack/remove-controller.yaml \

cat /home/stack/remove-controller.yaml
parameters:
  ControllerRemovalPolicies:
    [{'resource_list': ['0']}]
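
For illustration, the rerun in step 2 boils down to the following (a sketch; the original environment files are elided and only the last -e line is the addition):

openstack overcloud deploy --templates \
  [...original -e environment files...] \
  -e /home/stack/remove-controller.yaml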


Actual results:
Overcloud controller removal fails with:
http://pastebin.test.redhat.com/731244
The controller-1 nova_api container fails to start because:

  "IOError: [Errno 13] Permission denied: '/var/log/nova/nova-manage.log'",
(from the deployment log file)

Expected results:
Controller removal succeeds, finishes without errors, and all overcloud
agents are up and operational.

Comment 1 pkomarov 2019-03-05 11:05:18 UTC
sos reports and the stack user's home directory are at:
http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ1685506/

Comment 2 pkomarov 2019-03-05 11:07:24 UTC
As can be seen below, the rest of the nova container logs are owned by the kolla nova user (uid 42436), as they should be, but nova-manage.log is owned by root:

[root@controller-1 ~]# ls -l /var/log/containers/nova
total 33072
-rw-r--r--. 1 42436 42436  6120842 Mar  5 10:41 nova-api.log
-rw-r--r--. 1 42436 42436 10828856 Mar  5 09:00 nova-api.log.1
[...]
-rw-r--r--. 1 root  root         0 Mar  4 17:23 nova-manage.log
-rw-r--r--. 1 42436 42436        0 Mar  5 00:01 nova-metadata-api.log
-rw-r--r--. 1 42436 42436   761548 Mar  5 00:01 nova-metadata-api.log.1


[stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'ls -l /var/log/containers/nova|grep manage'

controller-1 | SUCCESS | rc=0 >>
-rw-r--r--. 1 root  root         0 Mar  4 17:23 nova-manage.log

[stack@undercloud-0 ~]$ ansible controller-2 -mshell -b -a'ls -l /var/log/containers/nova|grep manage'

controller-2 | SUCCESS | rc=0 >>
-rw-r--r--. 1 42436 42436        0 Mar  5 00:01 nova-manage.log
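
A possible manual workaround (not taken from this BZ; it assumes uid/gid 42436 is the kolla nova user, as the listings above show) would be to reset the ownership on the affected controller before retrying the stack update:

[stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'chown 42436:42436 /var/log/containers/nova/nova-manage.log'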

Comment 4 Martin Schuppert 2019-03-15 11:01:31 UTC
Since the sosreports are missing the system logs, I tried to reproduce the issue with 2019-02-27.1, but I don't see the wrong ownership on the nova-manage log.

After deploy:

The only nova-manage log on controller-0:

[root@controller-0 ~]#  ls -la /var/log/containers/nova/ |grep manage
-rw-r--r--.  1 42436 42436        0 Mar 15 00:00 nova-manage.log
-rw-r--r--.  1 42436 42436   274848 Mar 15 00:00 nova-manage.log.1

After replacement:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 0ab774a5-233f-46e9-a429-0949b969d6db | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 51344592-2b97-455f-a936-b7826eae7b30 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 0c5fe3a4-8a66-4f45-b85b-0855b02f277f | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
| 4a046882-093e-4e00-8bed-46fcf6e72603 | controller-3 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

[root@controller-0 ~]# ls -la /var/log/containers/nova/ |grep manage
-rw-r--r--.  1 42436 42436        0 Mar 15 00:00 nova-manage.log
-rw-r--r--.  1 42436 42436   274848 Mar 15 00:00 nova-manage.log.1

Does that job run any nova-manage commands as root outside the tripleo workflow? If one initially got triggered on controller-1 as root, the nova-manage log would have been created owned by root as in the description, and then the reported issue can happen.
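
A sketch of that suspected trigger (hypothetical; any nova-manage invocation as root would do, the cell_v2 command below is only an example):

# nova-manage.log does not exist yet; python logging creates it owned by
# the calling user, here root:
[root@controller-1 ~]# docker exec -u root nova_api nova-manage cell_v2 list_cells
# the bind-mounted file on the host is now root:root, and the next start of
# the container as the nova user fails with EACCES when opening it for append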

In any case, we'll submit a patch to chown the logs in case something gets triggered as root manually.
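
In effect, such a patch has to make sure the nova log directory is owned by the nova user before the services start; a rough shell equivalent of that step (a sketch, not the literal template change; uid/gid and paths as seen earlier in this BZ):

# inside the container (kolla maps the nova user to uid/gid 42436):
chown -R nova:nova /var/log/nova
# equivalent on the host bind mount:
chown -R 42436:42436 /var/log/containers/nova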

Comment 21 errata-xmlrpc 2019-04-30 17:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0878

Comment 22 Martin Schuppert 2019-05-02 22:05:08 UTC
Reopening, as the fix openstack-tripleo-heat-templates-9.3.1-0.20190314162764.d0a6cb1.el7ost is not included in the errata.

Comment 23 Martin Schuppert 2019-05-08 13:07:16 UTC
We cannot reopen BZs which are mapped to an errata; created bug 1707816 to track the release.