Description of problem:
FFU: /etc/os-net-config/config.json is empty after updating the stack outputs.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10
2. Upgrade undercloud to OSP11/12/13
3. Run the overcloud deploy command to update the stack outputs:

#!/bin/bash
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/docker-images.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/fast-forward-upgrade.yaml \
-e /home/stack/ffu_repos.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/config-download-environment.yaml \
-e /home/stack/ceph-ansible-env.yaml

4. SSH to any of the overcloud nodes and check /etc/os-net-config/config.json

Actual results:
The file is empty:

[root@ceph-0 ~]# wc /etc/os-net-config/config.json
0 0 0 /etc/os-net-config/config.json

Expected results:
/etc/os-net-config/config.json is preserved with the content it had before running the overcloud deploy command.

Additional info:
Before running the overcloud deploy command to update the stack outputs, /etc/os-net-config/config.json was correctly populated.
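The check in step 4 can be scripted so that it gives a clear pass/fail result on each node. This is only a minimal sketch: the helper names `net_config_is_populated` and `report_net_config` are hypothetical and are not part of os-net-config or TripleO. It treats "populated" as "exists and is non-empty", which is the same condition the `wc` output above shows failing.

```shell
#!/bin/bash
# Hypothetical helpers (not shipped by os-net-config or TripleO).

# Return 0 when the file exists and has a size greater than zero, 1 otherwise.
net_config_is_populated() {
    local path="${1:-/etc/os-net-config/config.json}"
    [ -s "$path" ]
}

# Print a one-line verdict for the file and return non-zero on failure.
report_net_config() {
    local path="${1:-/etc/os-net-config/config.json}"
    if net_config_is_populated "$path"; then
        echo "OK: $path is populated"
    else
        echo "FAIL: $path is missing or empty"
        return 1
    fi
}
```

Running `report_net_config` with no argument on an overcloud node, before and after the deploy command, would show whether the file was blanked by the stack update.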
o/ I took this for triage this week.

Is this a duplicate of, or the same root cause as, BZ 1559151? It might be: this is happening after a stack update and coming from an OSP10 env (so the 'old' element-based os-net-config was used on the original deployment, but the OSP13 templates use the new script-based one).

I suspect that if we "rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true" like [1] at the start of the FFU, it should solve the issue, assuming it is the same.

It would be great if you could test that, i.e. manually remove that file from the overcloud nodes before the FFU stack update for the ansible generation. We might want to carry that in the FFU env [2] or some other suitable place.

I think it makes sense to keep this BZ anyway, even if it is the same, since BZ 1559151 is for upgrades and the solution will be slightly different here and land in a different place. I am going to mark this triaged for now; remove it if you disagree.

[1] https://review.openstack.org/#/c/556533/2/environments/major-upgrade-composable-steps-docker.yaml@15
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/fast-forward-upgrade.yaml#L17
(In reply to Marios Andreou from comment #3)
> o/ took this for triage this week. Is this duplicate of/same root cause as
> BZ 1559151 ? It might be... this is happening after a stack update and
> coming from an OSP10 env (so the 'old' element based os-net-config is being
> on the original deployment, but OSP13 templates are using the new script
> based one).
>
> I suspect if we "rm
> /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true
> " like [1] at the start of the FFU it should solve the issue, assuming it is
> the same.
>
> It would be great if you could test that, I mean, manually remove that from
> overcloud nodes before the FFU stack update for the ansible generation. We
> might want to carry that in the FFU env [2] or some other suitable place.

That's right: after manually removing /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json before running the deploy command that updates the stack outputs, /etc/os-net-config/config.json keeps its content.

> I think it makes sense to keep this bz anyway even if it is the same since
> BZ 1559151 is for upgrades and the solution will be slightly different
> here/land in different place

I'd keep this BZ to track this issue for the FFU workflow. BZ#1559151 is related to upgrades, and its consequences are also different from the ones observed during FFU.
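The manual workaround confirmed above can be sketched as a small per-node helper. The function name `prepare_node_for_ffu` and the backup step are assumptions added for illustration; the thread itself only removes the os-apply-config template. Taking a copy of the live config first is a cautious extra, so the file can be restored if it is blanked anyway.

```shell
#!/bin/bash
# Sketch of the manual workaround; the helper name and the backup copy are
# assumptions, only the template removal comes from the bug discussion.
TEMPLATE="${TEMPLATE:-/usr/libexec/os-apply-config/templates/etc/os-net-config/config.json}"
CONFIG="${CONFIG:-/etc/os-net-config/config.json}"

prepare_node_for_ffu() {
    local template="${1:-$TEMPLATE}" config="${2:-$CONFIG}"

    # Keep a copy of the populated config next to the original.
    if [ -s "$config" ]; then
        cp -p "$config" "${config}.pre-ffu" || return 1
    fi

    # -f makes this a no-op when the template is already gone, which is
    # the case the "rm ... || true" in the comment above guards against.
    rm -f "$template"
}
```

On a node this would be run as root, with no arguments, before the FFU stack update. It is safe to run more than once.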
Spent some time looking into this today, working with the idea of "let's remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json before running the deploy command for updating the stack output".

I initially thought we might be able to use UpgradeInit: set it in ffwd-upgrade-prepare.yaml and unset it on converge, as we do for the major upgrade. However, UpgradeInit is a SoftwareConfig @ [1], but ffwd-upgrade-prepare.yaml is setting that to config download @ [2], so we can't expect it to be applied during the ffwd-upgrade prepare heat stack update. It *would* be applied with the ansible playbooks, but I believe it needs to happen before the heat stack update.

A really 'easy' way is to consider something like "openstack overcloud execute" for this [3][4][5], but that would require the operator to run something like:

cat <<EOF > remove_os_net_config.sh
#!/bin/bash
rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true
EOF

Then they run it with:

openstack overcloud execute remove_os_net_config.sh --server_name "overcloud"
## --server_name will do a partial match so overcloud-controller-x, overcloud-compute-x

Otherwise we will have to work out another way in the client before we call the prepare stack update.

[1] https://github.com/openstack/tripleo-heat-templates/blob/1bec57e9770f27d44c0768410fc7d4b5926858da/puppet/role.role.j2.yaml#L468-L469
[2] https://github.com/openstack/tripleo-heat-templates/blob/1bec57e9770f27d44c0768410fc7d4b5926858da/environments/lifecycle/ffwd-upgrade-prepare.yaml#L10-L11
[3] https://github.com/openstack/python-tripleoclient/blob/5c7c923a01d4f8b460fc9481b7c38454cde10f5f/tripleoclient/v1/overcloud_execute.py#L63
[4] https://github.com/openstack/tripleo-common/blob/eb43cf0c2993cb20342ba04563866371ddec3773/workbooks/deployment.yaml#L24
[5] https://github.com/openstack/tripleo-common/blob/eb43cf0c2993cb20342ba04563866371ddec3773/tripleo_common/actions/deployment.py#L97
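Since the command above relies on a partial match of server names, it may be worth previewing which nodes a given fragment would select before executing anything. The sketch below is an illustration only: `match_servers` is a hypothetical helper, and it ASSUMES that "partial match" means a plain substring match, which may differ from what the real workflow does.

```shell
#!/bin/bash
# Hypothetical helper: read server names from stdin (one per line) and print
# those that contain the given fragment.
# ASSUMPTION: "partial match" is modelled as a literal substring match; the
# actual matching done by the workflow may be different.
match_servers() {
    local fragment="$1" name
    while IFS= read -r name; do
        case "$name" in
            *"$fragment"*) printf '%s\n' "$name" ;;
        esac
    done
}
```

On the undercloud the names could be piped in with `openstack server list -f value -c Name | match_servers overcloud`. Note that the environment in this report uses customized hostnames (the prompt in the description shows ceph-0), so a fragment such as "overcloud" may not select every node there.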
o/ so I dug into this a little more today. The way I see it we have 3 options:

1. the semi-"manual"/docs way, with the suggestion in comment #5, or
2. add an invocation to the client, before the stack update, possibly using tripleo.deployment.v1.deploy_on_servers, or
3. fix it so that UpgradeInit does run during the heat stack update (see comment #5 for why it doesn't). This will need tweaks in tripleo-heat-templates (redirect the upgrade init to another resource from softwareconfig).

I played with 2. today and posted https://review.openstack.org/#/c/566336/ as a WIP for discussion to continue next week.
I threw in an idea, but I'm not sure it is the correct one: https://review.openstack.org/566348
*** Bug 1574258 has been marked as a duplicate of this bug. ***
Marios - note that this fix https://review.openstack.org/#/c/560022/ to run-os-net-config.sh for https://bugzilla.redhat.com/show_bug.cgi?id=1514949 explicitly removes /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json to prevent an overwrite and an empty /etc/os-net-config/config.json. I wonder why that's not removing /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json?
o/ Bob, in BZ 1514949 we also landed the removal of the file right at the start of an upgrade (i.e. before the deployment steps that run the os-net-config script and remove the file, as you point to) with https://review.openstack.org/#/c/557739/

So here we need something equivalent but for FFU, i.e. remove the /usr/libexec... file as the very first step in the overcloud FFU.
Some discussion about this is at https://review.openstack.org/#/c/567613/1/common/deploy-steps.j2@657, on the alternative proposal from lbezdick at https://review.openstack.org/#/c/567613 to clean the element files.
Some update on the discussion today: it seems we agree to go with the patches currently in the trackers, i.e. https://review.openstack.org/566576 & https://review.openstack.org/566336

So we need those merged into queens for starters.
(In reply to Marios Andreou from comment #15)
> some update on the discussion today. it seems we agree on to go with the
> patches currently in trackers i.e. https://review.openstack.org/566576 &
> https://review.openstack.org/566336
>
> so we need those merged into queens for starters

I tried applying the ^ patches by running:

curl -s -4 https://review.openstack.org/changes/566336/revisions/current/patch?download | base64 -d | sudo patch -d /usr/lib/python2.7/site-packages/ -p1
curl -s -4 https://review.openstack.org/changes/566576/revisions/current/patch?download | base64 -d | sudo patch -d /usr/share/openstack-tripleo-common/ -p1
source /home/stack/stackrc
mistral workbook-update /usr/share/openstack-tripleo-common/workbooks/deployment.yaml

but unfortunately, when I ran the openstack overcloud ffwd-upgrade prepare command, it got stuck. Attaching the mistral logs.
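When applying review patches by hand like this, a dry run first can show whether the patch applies cleanly before any installed file is touched. The following is a minimal sketch under stated assumptions: `apply_b64_patch` is a hypothetical helper, it expects the base64-encoded patch to have been saved to a file already, and it relies on the `--dry-run` option of GNU patch.

```shell
#!/bin/bash
# Hypothetical helper: decode a base64-encoded patch file and apply it to a
# target directory with -p1, but only if a dry run succeeds first.
# ASSUMPTION: GNU patch is installed (for --dry-run).
apply_b64_patch() {
    local b64_file="$1" target_dir="$2" decoded rc

    decoded=$(mktemp) || return 1
    if ! base64 -d "$b64_file" > "$decoded"; then
        rm -f "$decoded"
        return 1
    fi

    # --dry-run checks every hunk without modifying any file.
    if patch --dry-run -d "$target_dir" -p1 < "$decoded" > /dev/null; then
        patch -d "$target_dir" -p1 < "$decoded"
        rc=$?
    else
        echo "patch does not apply cleanly to $target_dir; nothing changed" >&2
        rc=1
    fi

    rm -f "$decoded"
    return "$rc"
}
```

With this, the output of each curl command above would be redirected to a file instead of being piped straight into patch, and the function would then be called as root with that file and the same target directory.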
Created attachment 1436516 [details] mistral logs
Created attachment 1436757 [details]
some excerpts from the logs attached in https://bugzilla.redhat.com/show_bug.cgi?id=1561255#c17

o/ mcornea... I checked through the attached logs and see some db connection errors (in api.log and engine.log); I'm pasting the relevant bits as an attachment here for ease of reference. I wonder if those are the source of the hang. If it is a problem on your undercloud, other services are possibly reporting it too.

Also, as a sanity check, can you please include a db populate and a restart of all mistral services before you try it again:

sudo mistral-db-manage populate
sudo systemctl restart openstack-mistral-api.service
sudo systemctl restart openstack-mistral-engine.service
sudo systemctl restart openstack-mistral-executor.service

I'll sanity check it again on my pike environment too, but on a first pass those db connection errors really stuck out for me. wdyt?
update #2 - besides the issues that mcornea's env might have as per comment #18, there is also a nit in the client review @ https://review.openstack.org/#/c/566336/4/tripleoclient/workflows/package_update.py@178 I just commented there and will post an update momentarily. Can you please try again with the latest today? fwiw it seems to work OK for me.
(In reply to Marios Andreou from comment #19)
> update #2 - besides the issues that mcornea env might have as per comment
> #18, there is also a nit in the client review @
> https://review.openstack.org/#/c/566336/4/tripleoclient/workflows/
> package_update.py@178 I just commented there and will post an update there
> momentarily. Can you please try again with the latest today - fwiw it seems
> to work OK for me

Yep, it worked fine this time; it was probably an environmental issue on my side.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086