Bug 1494107

Summary: OSP11 -> OSP12 upgrade: libvirtd service on compute nodes gets stopped during major-upgrade-composable-steps-docker.yaml
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: dbecker, jschluet, mandreou, mbracho, mbultel, mburns, morazi, rhel-osp-director-maint, shardy, tvignaud
Target Milestone: beta
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.3-0.20171023134947.8da5e1f.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-13 22:11:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1399762
Attachments:
Description Flags
ansible-playbook invocations from journal on compute 0 and compute 1 none

Description Marius Cornea 2017-09-21 13:05:13 UTC
Description of problem:
OSP11 -> OSP12 upgrade: libvirtd service on compute nodes gets stopped during major-upgrade-composable-steps-docker.yaml

major-upgrade-composable-steps-docker.yaml should not touch the services running on compute nodes, because the Compute role has disable_upgrade_deployment: True set.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170913050524.0rc2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 with 3 controller + 2 compute + 3 ceph nodes

2. Run first step of the overcloud upgrade to OSP12 - major-upgrade-composable-steps-docker.yaml

#!/bin/bash

timeout 100m openstack overcloud deploy \
--templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/services-docker/sahara.yaml \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/ceph-ansible-env.yaml \
-e /home/stack/docker-osp12.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

3. Check status of libvirtd service on compute nodes
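
A minimal sketch of how step 3 could be scripted from the undercloud. It assumes passwordless SSH as the heat-admin user and uses the two compute node addresses that appear in the results below (192.168.24.6 and 192.168.24.16). The function name check_libvirtd is hypothetical and is not part of any TripleO tooling.

```shell
#!/bin/bash
# Hypothetical helper: print the libvirtd unit state for each host given as
# an argument and return non-zero if any host is not "active".
# Assumption: passwordless SSH as heat-admin works from the undercloud.
check_libvirtd() {
    local host state rc=0
    for host in "$@"; do
        # 'systemctl is-active' prints the unit state (active, inactive, ...)
        # and exits non-zero when the unit is not active, hence '|| true'.
        state=$(ssh "heat-admin@${host}" \
            'sudo systemctl is-active libvirtd' 2>/dev/null) || true
        echo "${host}: ${state:-unknown}"
        [ "${state}" = "active" ] || rc=1
    done
    return "${rc}"
}

# Example usage (addresses taken from this report):
# check_libvirtd 192.168.24.6 192.168.24.16
```

Running the helper before and after major-upgrade-composable-steps-docker.yaml would make the state change visible without reading the full systemctl status output.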

Actual results:
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.16 'sudo systemctl status libvirtd'
Warning: Permanently added '192.168.24.16' (ECDSA) to the list of known hosts.
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; disabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2017-09-21 12:42:29 UTC; 21min ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 19535 (code=exited, status=0/SUCCESS)

Sep 21 11:02:10 compute-1 systemd[1]: Starting Virtualization daemon...
Sep 21 11:02:10 compute-1 systemd[1]: Started Virtualization daemon.
Sep 21 12:42:29 compute-1 systemd[1]: Stopping Virtualization daemon...
Sep 21 12:42:29 compute-1 systemd[1]: Stopped Virtualization daemon.
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.6 'sudo systemctl status libvirtd'
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; disabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2017-09-21 12:42:29 UTC; 21min ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 19537 (code=exited, status=0/SUCCESS)

Sep 21 11:02:10 compute-0 systemd[1]: Starting Virtualization daemon...
Sep 21 11:02:10 compute-0 systemd[1]: Started Virtualization daemon.
Sep 21 12:42:29 compute-0 systemd[1]: Stopping Virtualization daemon...
Sep 21 12:42:29 compute-0 systemd[1]: Stopped Virtualization daemon.


Expected results:
The libvirtd service should be running as it was before running major-upgrade-composable-steps-docker.yaml:

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.6 'sudo systemctl status libvirtd'
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-21 11:02:10 UTC; 1h 3min ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 19537 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           └─19537 /usr/sbin/libvirtd

Sep 21 11:02:10 compute-0 systemd[1]: Starting Virtualization daemon...
Sep 21 11:02:10 compute-0 systemd[1]: Started Virtualization daemon.
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.16 'sudo systemctl status libvirtd'
Warning: Permanently added '192.168.24.16' (ECDSA) to the list of known hosts.
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-21 11:02:10 UTC; 1h 3min ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 19535 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           └─19535 /usr/sbin/libvirtd

Sep 21 11:02:10 compute-1 systemd[1]: Starting Virtualization daemon...
Sep 21 11:02:10 compute-1 systemd[1]: Started Virtualization daemon.


Additional info:

Comment 1 Marios Andreou 2017-09-22 14:55:03 UTC
o/ Marius, I spent some time looking at this one. Going to mark it as triaged and add some thoughts so I can point others to it. To confirm, this should be happening on all upgrades right now and shouldn't be confined to any one environment, right?

--> It is the deployment_steps (host_prep_tasks specifically afaics) that are being executed on the computes, not the upgrade_tasks. There is indeed a task that stops libvirtd here https://github.com/openstack/tripleo-heat-templates/blob/420126fd98193f755562887603f604ca5fd53175/docker/services/nova-libvirt.yaml#L288-L295

--> I think the roles_data disable_upgrade_deployment flag is being set correctly in the environment because both computes (and no other nodes) got the /root/tripleo_upgrade_node.sh delivered. https://github.com/openstack/tripleo-heat-templates/blob/420126fd98193f755562887603f604ca5fd53175/common/major_upgrade_steps.j2.yaml#L41-L57

--> Suspect the problem is here https://github.com/openstack/tripleo-heat-templates/blob/fb54bc7901885ffb8c93c648643cab7ab70b41df/common/deploy-steps.j2#L6 but not sure why since enabled_roles should be set https://github.com/openstack/tripleo-heat-templates/blob/fb54bc7901885ffb8c93c648643cab7ab70b41df/common/post-upgrade.j2.yaml#L3 which just then includes the deploy-steps.j2 ...

Comment 2 Marios Andreou 2017-09-22 14:57:16 UTC
Created attachment 1329633 [details]
ansible-playbook invocations from journal on compute 0 and compute 1

Comment 3 Steven Hardy 2017-09-22 15:55:50 UTC
I think this is caused by https://review.openstack.org/#/c/502470/4/common/deploy-steps.j2

We made that change so the json files would be written to the nodes, and the RoleConfig output would be generated for all roles, even when upgrade is disabled.

But I missed that we'll then run host_prep_tasks even on nodes where upgrade is disabled, so we need to decouple that from the other tasks (which just write data that is later consumed by the ansible driven upgrade).

Comment 4 Steven Hardy 2017-09-22 16:03:01 UTC
To clarify, I think to fix this we need to decouple host_prep_tasks here:

https://github.com/openstack/tripleo-heat-templates/blob/fb54bc7901885ffb8c93c648643cab7ab70b41df/common/deploy-steps.j2#L192

So we can make them not run on nodes where upgrade is disabled but we need to decide if that means they never get run on upgrade (in which case there may sometimes be tasks that exist in both host_prep_tasks and upgrade_tasks) or if we make them run via the operator driven upgrade script.

Comment 5 Marios Andreou 2017-09-26 13:22:13 UTC
(In reply to Steven Hardy from comment #4)
> To clarify, I think to fix this we need to decouple host_prep_tasks here:
> 
> https://github.com/openstack/tripleo-heat-templates/blob/
> fb54bc7901885ffb8c93c648643cab7ab70b41df/common/deploy-steps.j2#L192
> 
> So we can make them not run on nodes where upgrade is disabled but we need
> to decide if that means they never get run on upgrade (in which case there
> may sometimes be tasks that exist in both host_prep_tasks and upgrade_tasks)
> or if we make them run via the operator driven upgrade script.

o/ I just posted this wdyt? https://review.openstack.org/507524

Comment 6 Marios Andreou 2017-09-26 13:24:55 UTC
(In reply to marios from comment #5)
> (In reply to Steven Hardy from comment #4)
> > To clarify, I think to fix this we need to decouple host_prep_tasks here:
> > 
> > https://github.com/openstack/tripleo-heat-templates/blob/
> > fb54bc7901885ffb8c93c648643cab7ab70b41df/common/deploy-steps.j2#L192
> > 
> > So we can make them not run on nodes where upgrade is disabled but we need
> > to decide if that means they never get run on upgrade (in which case there
> > may sometimes be tasks that exist in both host_prep_tasks and upgrade_tasks)
> > or if we make them run via the operator driven upgrade script.
> 
> o/ I just posted this wdyt? https://review.openstack.org/507524

I don't think that will work as is, thinking about it just now. We *do* want those to be included normally, just not on upgrade. So disable_upgrade_deployment is not the right check to make there. We need to know whether it is an upgrade.

Will update the review. I think you are out today anyway. Thanks, shardy.
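
One reading of the condition discussed in this comment, written as a small shell sketch: host_prep_tasks run on normal deployments, and are skipped only when the operation is an upgrade and the role has disable_upgrade_deployment set. The function name and its two arguments are hypothetical. This illustrates the decision only and is not the logic in deploy-steps.j2 or in the posted review.

```shell
#!/bin/bash
# Hypothetical decision helper, NOT the actual template logic.
#   $1: "true" if the current stack operation is a major upgrade
#   $2: "true" if the role sets disable_upgrade_deployment
# Returns 0 when host_prep_tasks should run on the role, 1 when they should
# be skipped.
should_run_host_prep_tasks() {
    local is_upgrade="$1" disable_upgrade_deployment="$2"
    if [ "${is_upgrade}" = "true" ] && \
       [ "${disable_upgrade_deployment}" = "true" ]; then
        # Upgrade of a role the operator upgrades later: leave services alone.
        return 1
    fi
    # Normal deployment, or an upgrade of a role that is upgraded in place.
    return 0
}

# Example usage:
# should_run_host_prep_tasks true true || echo "skip host_prep_tasks"
```

Checking disable_upgrade_deployment alone would also skip the tasks on a fresh deployment of such a role, which is the problem this comment points out.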

Comment 7 Marios Andreou 2017-10-04 08:56:47 UTC
Not yet merged on Pike, so moving back to ASSIGNED and updating trackers.

Comment 13 errata-xmlrpc 2017-12-13 22:11:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462