Description of problem:
FFU: /etc/os-net-config/config.json is empty after updating the stack outputs.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy OSP10
2. Upgrade undercloud to OSP11/12/13
3. Run the overcloud deploy command to update the stack outputs:
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/docker-images.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/fast-forward-upgrade.yaml \
-e /home/stack/ffu_repos.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/config-download-environment.yaml \
-e /home/stack/ceph-ansible-env.yaml
4. SSH to any of the overcloud nodes and check /etc/os-net-config/config.json
[root@ceph-0 ~]# wc /etc/os-net-config/config.json
0 0 0 /etc/os-net-config/config.json
Expected results:
/etc/os-net-config/config.json is preserved with the content it had before running the overcloud deploy command.
Additional info:
Before running the overcloud deploy command to update the stack outputs, /etc/os-net-config/config.json was correctly populated.
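The `wc` check above can be wrapped in a small helper for repeated use across nodes (a sketch; the `check_cfg` helper name and the `CFG` override are hypothetical, not part of any shipped tooling):

```shell
#!/bin/sh
# Hypothetical helper: report whether a config file survived the stack update.
# On a real overcloud node the path of interest is /etc/os-net-config/config.json.
check_cfg() {
  # -s is true only when the file exists and is non-empty
  if [ -s "$1" ]; then echo "populated"; else echo "empty"; fi
}

check_cfg "${CFG:-/etc/os-net-config/config.json}"
```

A node hit by this bug reports "empty", matching the 0-byte `wc` output in the report.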
o/ took this for triage this week. Is this a duplicate of / does it have the same root cause as BZ 1559151? It might be... this is happening after a stack update and coming from an OSP10 env (so the 'old' element-based os-net-config is being used on the original deployment, but OSP13 templates are using the new script-based one).
I suspect that if we run "rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true" at the start of the FFU, as in BZ 1559151, it should solve the issue, assuming it is the same.
It would be great if you could test that, i.e. manually remove that file from the overcloud nodes before the FFU stack update for the ansible generation. We might want to carry that in the FFU env or some other suitable place.
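The suggested manual test could look something like this on each overcloud node (a sketch; the `TEMPLATE_ROOT` override is only for illustration, on a real node the template root is /usr/libexec/os-apply-config/templates):

```shell
#!/bin/sh
# Remove the stale os-apply-config template so a later os-apply-config run
# cannot overwrite /etc/os-net-config/config.json with an empty file.
TEMPLATE_ROOT="${TEMPLATE_ROOT:-/usr/libexec/os-apply-config/templates}"
STALE="$TEMPLATE_ROOT/etc/os-net-config/config.json"

# rm -f is idempotent: it succeeds whether or not the file is still present
rm -f "$STALE"
```

This would need to run per node (e.g. over ssh) before the FFU stack update is started.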
I think it makes sense to keep this BZ anyway, even if the root cause is the same, since BZ 1559151 is for upgrades and the solution here will be slightly different / land in a different place.
I am going to mark triaged for now, remove if you disagree.
(In reply to Marios Andreou from comment #3)
> o/ took this for triage this week. Is this duplicate of/same root cause as
> BZ 1559151 ? It might be... this is happening after a stack update and
> coming from an OSP10 env (so the 'old' element based os-net-config is being
> on the original deployment, but OSP13 templates are using the new script
> based one).
> I suspect if we "rm
> /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true
> " like  at the start of the FFU it should solve the issue, assuming it is
> the same.
> It would be great if you could test that, I mean, manually remove that from
> overcloud nodes before the FFU stack update for the ansible generation. We
> might want to carry that in the FFU env  or some other suitable place.
That's right: after manually removing /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json before running the deploy command to update the stack outputs, /etc/os-net-config/config.json keeps its content.
> I think it makes sense to keep this bz anyway even if it is the same since
> BZ 1559151 is for upgrades and the solution will be slightly different
> here/land in different place
I'd keep this BZ to track this issue for the FFU workflow. BZ#1559151 is related to upgrades, and the consequences there are also different from the ones observed during FFU.
Spent some time looking into this today. Working with "let's remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json before running the deploy command for updating the stack output", I initially thought we might be able to use the UpgradeInit, set it in ffwd-upgrade-prepare.yaml, and unset it on converge as we do for the major upgrade. However, UpgradeInit is a SoftwareConfig @  but ffwd-upgrade-prepare.yaml is setting that to config download @  so we can't expect it to be applied during the ffwd-upgrade prepare heat stack update. It *would* be applied with the ansible playbooks, but I believe it needs to happen before the heat stack update.
A really 'easy' way is if we consider something like "openstack overcloud execute" for this, but that would require the operator to run something like:
cat <<EOF > remove_os_net_config.sh
rm /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json || true
EOF
Then they run it with:
openstack overcloud execute remove_os_net_config.sh --server_name "overcloud"
## --server_name will do a partial match, so overcloud-controller-x, overcloud-compute-x, etc.
Otherwise we will have to work out another way in the client before we call the prepare stack update.
o/ so I dug into this a little more today. The way I see it, we have 3 options:
1. semi "manual"/docs way with suggestion in comment #5 , or
2. add invocation to the client, before the stack update, possibly using the tripleo.deployment.v1.deploy_on_servers , or
3. fix it so that UpgradeInit does run during the heat stack update (see comment #5 on why it currently isn't). This will need tweaks in tripleo-heat-templates (redirect the UpgradeInit to a resource type other than SoftwareConfig).
I played with 2. today and posted https://review.openstack.org/#/c/566336/ as a WIP for discussion to continue next week.
I threw in an idea, but I'm not sure if it's the correct one: https://review.openstack.org/566348
*** Bug 1574258 has been marked as a duplicate of this bug. ***
Marios - note that this fix https://review.openstack.org/#/c/560022/ to run-os-net-config.sh for https://bugzilla.redhat.com/show_bug.cgi?id=1514949 explicitly removes /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json to prevent an overwrite and an empty /etc/os-net-config/config.json. I wonder why that's not removing /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json?
o/ Bob, in BZ 1514949 we also landed the removal of the file right at the start of an upgrade (i.e. before those deployment steps run, which will run that os-net-config script and remove the file, as you point out) with https://review.openstack.org/#/c/557739/
So here we need something equivalent but for FFU, i.e. remove the /usr/libexec... file as the very first step in the overcloud FFU.
There is some discussion about this at https://review.openstack.org/#/c/567613/1/common/deploy-steps.j2@657; the alternative proposal from lbezdick at https://review.openstack.org/#/c/567613 is to clean the element files.
some update on the discussion today: it seems we agree to go with the patches currently in the trackers, i.e. https://review.openstack.org/566576 & https://review.openstack.org/566336
so we need those merged into queens for starters
(In reply to Marios Andreou from comment #15)
> some update on the discussion today. it seems we agree on to go with the
> patches currently in trackers i.e. https://review.openstack.org/566576 &
> https://review.openstack.org/566336
> so we need those merged into queens for starters
I tried applying the ^ patches by running:
curl -s -4 https://review.openstack.org/changes/566336/revisions/current/patch?download | base64 -d | sudo patch -d /usr/lib/python2.7/site-packages/ -p1
curl -s -4 https://review.openstack.org/changes/566576/revisions/current/patch?download | base64 -d | sudo patch -d /usr/share/openstack-tripleo-common/ -p1
source /home/stack/stackrc
mistral workbook-update /usr/share/openstack-tripleo-common/workbooks/deployment.yaml
but unfortunately when I ran the openstack overcloud ffwd-upgrade prepare command, it got stuck. Attaching the mistral logs.
Created attachment 1436516 [details]
Created attachment 1436757 [details]
some excerpts from the logs attached in https://bugzilla.redhat.com/show_bug.cgi?id=1561255#c17
o/ mcornea... I checked through the attached logs and see some db connection errors (in api.log and engine.log); I'm pasting the relevant bits as an attachment here for ease of reference. I wonder if those are the source of the hanging. If it is a problem on your undercloud, other services are possibly reporting the same. Also, as a sanity check, can you please run a db populate and restart all mistral services before you try it again:
sudo mistral-db-manage populate
sudo systemctl restart openstack-mistral-api.service
sudo systemctl restart openstack-mistral-engine.service
sudo systemctl restart openstack-mistral-executor.service
i'll sanity check it again on my pike environment too, but from a first pass those db connection errors really stuck out for me. wdyt?
update #2 - besides the issues that mcornea's env might have as per comment #18, there is also a nit in the client review @ https://review.openstack.org/#/c/566336/4/tripleoclient/workflows/package_update.py@178. I just commented there and will post an update momentarily. Can you please try again with the latest today? fwiw it seems to work OK for me.
(In reply to Marios Andreou from comment #19)
> update #2 - besides the issues that mcornea env might have as per comment
> #18, there is also a nit in the client review @
> package_update.py@178 I just commented there and will post an update there
> momentarily. Can you please try again with the latest today - fwiw it seems
> to work OK for me
Yep, it worked fine this time; it was probably an environmental issue with my setup.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.