Bug 1561255

Summary: FFU: /etc/os-net-config/config.json is empty after updating the stack outputs
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: python-tripleoclient Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 13.0 (Queens) CC: bfournie, ccamacho, dbecker, hbrock, jschluet, jslagle, lbezdick, mandreou, mbultel, mburns, morazi, rhel-osp-director-maint, sathlang, sclewis
Target Milestone: rc Keywords: Triaged
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-8.6.1-14.el7ost python-tripleoclient-9.2.1-10.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:49:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1561169    
Attachments:
Description: mistral logs Flags: none
Description: some excerpts from the logs attached in https://bugzilla.redhat.com/show_bug.cgi?id=1561255#c17 Flags: none

Description Marius Cornea 2018-03-28 00:32:02 UTC
Description of problem:
FFU: /etc/os-net-config/config.json is empty after updating the stack outputs.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10
2. Upgrade undercloud to OSP11/12/13
3. Run the overcloud deploy command to update the stack outputs:
#!/bin/bash
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/docker-images.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/fast-forward-upgrade.yaml \
-e /home/stack/ffu_repos.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/config-download-environment.yaml \
-e /home/stack/ceph-ansible-env.yaml

4. SSH to any of the overcloud nodes and check /etc/os-net-config/config.json 

Actual results:
The file is empty:

[root@ceph-0 ~]# wc /etc/os-net-config/config.json
0 0 0 /etc/os-net-config/config.json


Expected results:
/etc/os-net-config/config.json is preserved with the content it had before running the overcloud deploy command.

Additional info:

Before running the overcloud deploy command for updating the stack outputs /etc/os-net-config/config.json was correctly populated.
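
The check from step 4 can be scripted so that it is easy to repeat before and after the deploy command. This is only a minimal sketch, assuming it is run on an overcloud node; the function name check_os_net_config is hypothetical and not part of any TripleO tooling.

```shell
#!/bin/bash
# Hypothetical helper: report whether an os-net-config file is populated.
# Prints "missing", "empty" or "populated"; returns 0 only for "populated".
check_os_net_config() {
    local path="${1:-/etc/os-net-config/config.json}"
    if [ ! -e "$path" ]; then
        echo "missing"
        return 2
    elif [ ! -s "$path" ]; then
        # -s is false for a zero-byte file, which is the symptom in this bug.
        echo "empty"
        return 1
    fi
    echo "populated"
}
```

Called without arguments it inspects /etc/os-net-config/config.json; a path can be passed to check a copy instead.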

Comment 3 Marios Andreou 2018-04-04 11:54:39 UTC
o/ took this for triage this week. Is this a duplicate of, or does it have the same root cause as, BZ 1559151? It might be... this is happening after a stack update and coming from an OSP10 env (so the 'old' element-based os-net-config was used on the original deployment, but the OSP13 templates are using the new script-based one).

I suspect that if we run "rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true" like [1] at the start of the FFU, it should solve the issue, assuming it is the same.

It would be great if you could test that, I mean, manually remove that from overcloud nodes before the FFU stack update for the ansible generation. We might want to carry that in the FFU env [2] or some other suitable place.

I think it makes sense to keep this bz anyway, even if it is the same, since BZ 1559151 is for upgrades and the solution here will be slightly different and land in a different place.

I am going to mark triaged for now, remove if you disagree.

[1] https://review.openstack.org/#/c/556533/2/environments/major-upgrade-composable-steps-docker.yaml@15
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/fast-forward-upgrade.yaml#L17

Comment 4 Marius Cornea 2018-04-04 13:55:38 UTC
(In reply to Marios Andreou from comment #3)
> o/ took this for triage this week. Is this duplicate of/same root cause as
> BZ 1559151 ? It might be... this is happening after a stack update and
> coming from an OSP10 env (so the 'old' element based os-net-config is being
> on the original deployment, but OSP13 templates are using the new script
> based one). 
> 
> I suspect if we "rm
> /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true
> " like [1] at the start of the FFU it should solve the issue, assuming it is
> the same.
> 
> It would be great if you could test that, I mean, manually remove that from
> overcloud nodes before the FFU stack update for the ansible generation. We
> might want to carry that in the FFU env [2] or some other suitable place.

That's right: after manually removing /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json before running the deploy command for updating the stack outputs, /etc/os-net-config/config.json keeps its content.

> I think it makes sense to keep this bz anyway even if it is the same since
> BZ 1559151 is for upgrades and the solution will be slightly different
> here/land in different place
> 

I'd keep this BZ to keep track of this issue for the FFU workflow. BZ#1559151 is related to upgrades, and its consequences are also different from the ones observed during FFU.
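
To confirm that /etc/os-net-config/config.json really keeps its content across the stack update, rather than merely being non-empty, a checksum can be recorded before the deploy command and compared afterwards. A rough sketch, assuming sha256sum from coreutils is available on the node; both function names and the state file location are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers to confirm a file survives the stack update unchanged.
# save_checksum FILE STATE    records the sha256 of FILE in STATE
# verify_checksum FILE STATE  returns 0 only if FILE still matches STATE
save_checksum() {
    sha256sum < "$1" > "$2"
}

verify_checksum() {
    local current
    current=$(sha256sum < "$1") || return 1
    [ "$current" = "$(cat "$2")" ]
}
```

For example, run save_checksum /etc/os-net-config/config.json /root/config.json.sha256 before the deploy command and verify_checksum with the same arguments after it; a non-zero exit status means the content changed.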

Comment 5 Marios Andreou 2018-04-25 14:28:05 UTC
Spent some time looking into this today. Working with "let's remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json before running the deploy command for updating the stack output", I initially thought we might be able to use UpgradeInit: set it in ffwd-upgrade-prepare.yaml and unset it on converge, as we do for the major upgrade. However, UpgradeInit is a SoftwareConfig @ [1], but ffwd-upgrade-prepare.yaml is setting that to config download @ [2], so we can't expect it to be applied during the ffwd-upgrade prepare heat stack update. It *would* be applied with the ansible playbooks, but I believe it needs to happen before the heat stack update.

A really 'easy' way would be to use something like "openstack overcloud execute" for this [3][4][5], but that would require the operator to run something like:

cat <<EOF > remove_os_net_config.sh
#!/bin/bash

rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true
EOF

Then they run it with:

     openstack overcloud execute remove_os_net_config.sh --server_name "overcloud"
     ## --server_name will do a partial match so overcloud-controller-x, overcloud-compute-x

Otherwise we will have to work out another way in the client before we call the prepare stack update.

[1]  https://github.com/openstack/tripleo-heat-templates/blob/1bec57e9770f27d44c0768410fc7d4b5926858da/puppet/role.role.j2.yaml#L468-L469
[2] https://github.com/openstack/tripleo-heat-templates/blob/1bec57e9770f27d44c0768410fc7d4b5926858da/environments/lifecycle/ffwd-upgrade-prepare.yaml#L10-L11 
[3] https://github.com/openstack/python-tripleoclient/blob/5c7c923a01d4f8b460fc9481b7c38454cde10f5f/tripleoclient/v1/overcloud_execute.py#L63
[4] https://github.com/openstack/tripleo-common/blob/eb43cf0c2993cb20342ba04563866371ddec3773/workbooks/deployment.yaml#L24
[5] https://github.com/openstack/tripleo-common/blob/eb43cf0c2993cb20342ba04563866371ddec3773/tripleo_common/actions/deployment.py#L97
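
The removal script proposed in this comment could be made slightly more defensive. The sketch below is only an illustration under assumptions: the function name remove_stale_template and its optional root prefix argument are hypothetical, added so the logic can be tried against a scratch directory. Unlike "rm ... || true", a removal that fails for a real reason (for example a permissions error) returns a non-zero status instead of being hidden.

```shell
#!/bin/bash
# Sketch of a more defensive removal step. The optional first argument is a
# hypothetical root prefix (empty on a real node) for trying it on a scratch tree.
remove_stale_template() {
    local root="${1:-}"
    local tmpl="$root/usr/libexec/os-apply-config/templates/etc/os-net-config/config.json"
    if [ -e "$tmpl" ]; then
        # Report success only if the file was actually removed.
        rm -f "$tmpl" && echo "removed $tmpl"
    else
        # Nothing to do, so running it a second time is harmless.
        echo "not present: $tmpl"
    fi
}
```

On a node the function would be called with no arguments; the message it prints makes it easier to see from the command output which servers still carried the old element template.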

Comment 6 Marios Andreou 2018-05-04 15:00:18 UTC
o/ so dug into this a little more today. The way I see it, we have 3 options:

  1. semi "manual"/docs way with suggestion in comment #5 , or
  2. add invocation to the client, before the stack update, possibly using the tripleo.deployment.v1.deploy_on_servers , or 
  3. fix it so that UpgradeInit does run during the heat stack update (see comment #5 on why it doesn't). This will need tweaks in tripleo-heat-templates (redirect UpgradeInit to a resource other than SoftwareConfig).

I played with 2. today and posted https://review.openstack.org/#/c/566336/ as a WIP for discussion to continue next week.

Comment 7 Lukas Bezdicka 2018-05-04 15:50:25 UTC
I threw in an idea, but I'm not sure it is the correct one: https://review.openstack.org/566348

Comment 8 Marius Cornea 2018-05-04 15:56:26 UTC
*** Bug 1574258 has been marked as a duplicate of this bug. ***

Comment 11 Bob Fournier 2018-05-09 19:06:48 UTC
Marios - note that this fix https://review.openstack.org/#/c/560022/ to run-os-net-config.sh for https://bugzilla.redhat.com/show_bug.cgi?id=1514949 explicitly removes  /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json to prevent an overwrite and an empty /etc/os-net-config/config.json.  I wonder why that's not removing /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json?

Comment 12 Marios Andreou 2018-05-10 10:02:59 UTC
o/ Bob, in BZ 1514949 we also landed the removal of the file right at the start of an upgrade (i.e. before the deployment steps that run the os-net-config script and remove the file, as you point out) with https://review.openstack.org/#/c/557739/

So here we need something equivalent but for FFU, i.e. remove the /usr/libexec... as the very first step in the overcloud FFU.

Comment 13 Marios Andreou 2018-05-11 11:17:39 UTC
There is some discussion about this on https://review.openstack.org/#/c/567613/1/common/deploy-steps.j2@657, which is the alternative proposal from lbezdick (https://review.openstack.org/#/c/567613) to clean the element files.

Comment 15 Marios Andreou 2018-05-11 15:26:20 UTC
Some update on the discussion today: it seems we agree to go with the patches currently in the trackers, i.e. https://review.openstack.org/566576 & https://review.openstack.org/566336

so we need those merged into queens for starters

Comment 16 Marius Cornea 2018-05-14 19:27:28 UTC
(In reply to Marios Andreou from comment #15)
> some update on the discussion today. it seems we agree on to go with the
> patches currently in trackers i.e. https://review.openstack.org/566576 &
> https://review.openstack.org/566336
> 
> so we need those merged into queens for starters

I tried applying the ^ patches by running:

curl -s -4 "https://review.openstack.org/changes/566336/revisions/current/patch?download" | base64 -d | sudo patch -d /usr/lib/python2.7/site-packages/ -p1
curl -s -4 "https://review.openstack.org/changes/566576/revisions/current/patch?download" | base64 -d | sudo patch -d /usr/share/openstack-tripleo-common/ -p1
source /home/stack/stackrc
mistral workbook-update /usr/share/openstack-tripleo-common/workbooks/deployment.yaml

but unfortunately, when I ran the openstack overcloud ffwd-upgrade prepare command, it got stuck. Attaching the mistral logs.
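
When applying review patches by hand like this, a dry run first helps rule out a half-applied patch as the cause of a later hang. The following is a sketch only, assuming GNU patch (for --dry-run) and coreutils base64; the function name apply_encoded_patch is hypothetical, and it expects the base64-encoded download to have been saved to a file first.

```shell
#!/bin/bash
# Hypothetical helper: apply a base64-encoded patch file only if a dry run
# succeeds first, so a failing hunk never leaves the target half patched.
# Usage: apply_encoded_patch ENCODED_FILE TARGET_DIR
apply_encoded_patch() {
    local encoded="$1" target="$2" decoded rc=0
    decoded=$(mktemp) || return 1
    if ! base64 -d < "$encoded" > "$decoded"; then
        echo "could not decode $encoded" >&2
        rm -f "$decoded"
        return 1
    fi
    # --dry-run checks that every hunk applies without touching any file.
    if patch --dry-run -d "$target" -p1 < "$decoded" >/dev/null; then
        patch -d "$target" -p1 < "$decoded" || rc=$?
    else
        echo "patch does not apply cleanly to $target; nothing changed" >&2
        rc=1
    fi
    rm -f "$decoded"
    return "$rc"
}
```

On the undercloud it would need to run with enough privileges to write to the target directory, as the sudo in the commands above does.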

Comment 17 Marius Cornea 2018-05-14 19:27:51 UTC
Created attachment 1436516 [details]
mistral logs

Comment 18 Marios Andreou 2018-05-15 10:37:12 UTC
Created attachment 1436757 [details]
some excerpts from the logs attached in https://bugzilla.redhat.com/show_bug.cgi?id=1561255#c17

o/ mcornea... I checked through the attached logs and see some db connection errors (in api.log and engine.log); I'm pasting the relevant bits as an attachment here for ease of reference. I wonder if those are the source of the hang. If it is a problem on your undercloud, other services are possibly reporting it too. Also, as a sanity check, can you please run the db populate and restart all the mistral services before you try it again:

    sudo mistral-db-manage  populate
    sudo systemctl restart openstack-mistral-api.service
    sudo systemctl restart openstack-mistral-engine.service
    sudo systemctl restart openstack-mistral-executor.service

I'll sanity check it again on my pike environment too, but on a first pass those db connection errors really stuck out for me. wdyt?
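
The populate and restart sequence above can be wrapped so that it stops at the first failure instead of carrying on with a broken service. This is just a sketch; the function name refresh_mistral and the RUNNER variable are hypothetical. RUNNER defaults to sudo and can be set to echo to preview the commands without running them.

```shell
#!/bin/bash
# Sketch: run the db populate, then restart the three Mistral services in
# order, stopping at the first command that fails.
# RUNNER is a hypothetical hook: sudo by default, echo for a preview.
refresh_mistral() {
    local runner="${RUNNER:-sudo}" svc
    "$runner" mistral-db-manage populate || return 1
    for svc in api engine executor; do
        "$runner" systemctl restart "openstack-mistral-${svc}.service" || return 1
    done
}
```

Running it as RUNNER=echo refresh_mistral prints the four commands in the order they would be executed.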

Comment 19 Marios Andreou 2018-05-15 12:14:08 UTC
update #2 - besides the issues that mcornea's env might have as per comment #18, there is also a nit in the client review @ https://review.openstack.org/#/c/566336/4/tripleoclient/workflows/package_update.py@178. I just commented there and will post an update there momentarily. Can you please try again with the latest today? fwiw it seems to work OK for me.

Comment 20 Marius Cornea 2018-05-16 02:37:23 UTC
(In reply to Marios Andreou from comment #19)
> update #2 - besides the issues that mcornea env might have as per comment
> #18, there is also a nit in the client review @
> https://review.openstack.org/#/c/566336/4/tripleoclient/workflows/
> package_update.py@178 I just commented there and will post an update there
> momentarily. Can you please try again with the latest today - fwiw it seems
> to work OK for me

Yep, it worked fine this time; it was probably an environmental issue on my side.

Comment 31 errata-xmlrpc 2018-06-27 13:49:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086