Bug 1533275 - access workbook does not honor DeploymentServerBlacklist parameter during update
Summary: access workbook does not honor DeploymentServerBlacklist parameter during update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 12.0 (Pike)
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: z2
Target Release: 12.0 (Pike)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
URL:
Whiteboard:
Duplicates: 1595674
Depends On:
Blocks:
 
Reported: 2018-01-10 23:24 UTC by Chris Janiszewski
Modified: 2022-08-09 11:39 UTC
CC: 13 users

Fixed In Version: openstack-tripleo-common-7.6.9-1.el7ost openstack-tripleo-heat-templates-7.0.9-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 17:27:14 UTC
Target Upstream Version:
Embargoed:


Attachments
custom templates (7.02 KB, application/x-gzip), 2018-01-11 15:22 UTC, Chris Janiszewski
stack resource list output (613.31 KB, text/plain), 2018-01-11 17:17 UTC, Chris Janiszewski


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1743046 0 None None None 2018-01-15 10:39:20 UTC
OpenStack gerrit 535255 0 None MERGED Workflow execution blacklist support 2020-12-02 15:40:45 UTC
OpenStack gerrit 535294 0 None MERGED Consume blacklisted_ip_addresses in workflows 2020-12-02 15:40:46 UTC
OpenStack gerrit 537966 0 None MERGED Pass blacklisted_{ip_addresses,hostnames} to major_upgrade_steps 2020-12-02 15:40:47 UTC
Red Hat Issue Tracker OSP-8734 0 None None None 2022-08-09 11:39:36 UTC
Red Hat Product Errata RHBA-2018:0607 0 None None None 2018-03-28 17:28:17 UTC

Description Chris Janiszewski 2018-01-10 23:24:34 UTC
Description of problem:
The DeploymentServerBlacklist parameter does not exclude the servers in the list from running the update.
Deployed OSP12 with 3 controllers, 1 compute, and 3 ceph nodes.
After the deployment completed successfully, added the following environment file:
(undercloud) [stack@chrisjupgrade-undercloud ~]$ cat templates/server-blacklist.yaml 
parameter_defaults:
  DeploymentServerBlacklist:
    - overcloud-compute-0
    - overcloud-controller-0
    - overcloud-controller-1
    - overcloud-controller-2
    - overcloud-cephstorage-0
    - overcloud-cephstorage-1
    - overcloud-cephstorage-2

Adjusted ComputeCount from 1 to 2 and started another deployment/update.
Monitored os-collect-config on one of the controllers and saw a lot of traffic during the update, even though that node should have been excluded.
The docs - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/sect-scaling_the_overcloud#Scaling-Blacklisting_Nodes
also mention:
"You can also power off or stop the os-collect-config agents during the operation."


Version-Release number of selected component (if applicable):
osp12


How reproducible:
Every time


Steps to Reproduce:
1. Deploy an overcloud with 1 compute node
2. Scale out the overcloud by +1 compute, running the update with the DeploymentServerBlacklist parameter set

Actual results:
All steps run on all hosts, even the blacklisted ones.

Expected results:
Blacklisted nodes should not run any updates.


Additional info:
http://chrisj.cloud/sosreport-controller0-DeploymentServerBlacklist-issue-20180110223206.tar.xz
http://chrisj.cloud/sosreport-undercloud-DeploymentServerBlacklist-issue-20180110173246.tar.xz

Comment 1 James Slagle 2018-01-11 12:19:52 UTC
What is your deployment command?
Please provide all custom templates as well if there are any in use.

Comment 2 Chris Janiszewski 2018-01-11 15:22:13 UTC
Created attachment 1380066 [details]
custom templates

Hey James, thanks for looking at it.

Here is the deploy command that was used:
source ~/stackrc
cd ~/
time openstack overcloud deploy --templates \
     --ntp-server 10.9.71.7 \
     -e templates/server-blacklist.yaml \
     -e templates/network-environment.yaml \
     -e templates/storage-environment.yaml \
     -e templates/docker-registry.yaml \
     -e templates/node-info.yaml \
     -e templates/inject-trust-anchor-hiera.yaml \
     -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml 


I am attaching custom templates directory for your reference.

Comment 3 James Slagle 2018-01-11 16:34:13 UTC
Just because there is output from os-collect-config in /var/log/messages on controller-0 does not necessarily mean that any changes or updates were applied. You will in fact see some output as the Heat metadata changes due to the deployments being blacklisted. That should not cause any issues, though (the service could even be stopped).

Did you actually see any updates applied to the controller?

I looked at /var/log/messages myself from the controller, and I actually do see one update being applied:

Jan 10 17:19:39 localhost os-collect-config: [2018-01-10 22:19:39,563] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/ansible < /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.json
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [INFO] {"deploy_stdout": "\nPLAY [localhost] ***************************************************************\n\nTASK [Gathering Facts] *********************************************************\nok: [localhost]\n\nTASK [create user tripleo-admin] ***********************************************\nok: [localhost]\n\nTASK [grant admin rights to user tripleo-admin] ********************************\nok: [localhost]\n\nTASK [ensure .ssh dir exists for user tripleo-admin] ***************************\nok: [localhost]\n\nTASK [ensure authorized_keys file exists for user tripleo-admin] ***************\nchanged: [localhost]\n\nTASK [authorize TripleO Mistral key for user tripleo-admin] ********************\nok: [localhost]\n\nPLAY RECAP *********************************************************************\nlocalhost                  : ok=6    changed=1    unreachable=0    failed=0   \n\n", "deploy_stderr": "", "deploy_status_code": 0}
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [DEBUG] [2018-01-10 22:19:39,593] (heat-config) [DEBUG] Running ansible-playbook -i localhost, /var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_playbook.yaml --extra-vars @/var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_variables.json
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO] Return code 0
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO]
Jan 10 17:19:45 localhost os-collect-config: PLAY [localhost] ***************************************************************
Jan 10 17:19:45 localhost os-collect-config: TASK [Gathering Facts] *********************************************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [create user tripleo-admin] ***********************************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [grant admin rights to user tripleo-admin] ********************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [ensure .ssh dir exists for user tripleo-admin] ***************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [ensure authorized_keys file exists for user tripleo-admin] ***************
Jan 10 17:19:45 localhost os-collect-config: changed: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [authorize TripleO Mistral key for user tripleo-admin] ********************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: PLAY RECAP *********************************************************************
Jan 10 17:19:45 localhost os-collect-config: localhost                  : ok=6    changed=1    unreachable=0    failed=0
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_playbook.yaml
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [INFO] Completed /usr/libexec/heat-config/hooks/ansible
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.json < /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.notify.json


Does the timestamp shown there correspond with when you were doing an update with the blacklist?

I'm actually unable to identify the source of this deployment. I don't think this code should be in any of the shipped templates for OSP12, and I'm unsure at the moment where it's coming from.

Do you by chance have any pip or source installs on your undercloud, or perhaps any packages from OSP13 on it? I can't access the sosreport for the undercloud due to a permission denied error on the URL; can you fix that as well?

Comment 4 Chris Janiszewski 2018-01-11 16:50:29 UTC
Thanks for the clarification that os-collect-config still receives messages; it might be worth adding that note to the official documentation. I expected to see no activity there, since the doc mentions you can even stop this service.

I am only using the official CDN repos for my deployment - OSP12.
I don't use pip or source installs.
I changed the permissions for the undercloud sosreport - sorry about that.

We are validating the DeploymentServerBlacklist use case at the OSP12 hackfest this week. More people are running this use case, have noticed similar behavior, and have actually noticed changes on roles other than compute. I am going to ask them to comment.

Comment 5 James Slagle 2018-01-11 17:09:28 UTC
(In reply to Chris Janiszewski from comment #4)
> Thanks for the clarification that os-collect-config still receives messages;
> it might be worth adding that note to the official documentation. I expected
> to see no activity there, since the doc mentions you can even stop this
> service.
> 
> I am only using the official CDN repos for my deployment - OSP12.
> I don't use pip or source installs.
> I changed the permissions for the undercloud sosreport - sorry about that.
> 
> We are validating the DeploymentServerBlacklist use case at the OSP12
> hackfest this week. More people are running this use case, have noticed
> similar behavior, and have actually noticed changes on roles other than
> compute. I am going to ask them to comment.

OK, we still need to figure out the source of this deployment, as I don't see where it's coming from when looking at the OSP12 RPMs.

Can you attach the output from the following?

openstack stack resource list -n 7 overcloud

Comment 6 Chris Janiszewski 2018-01-11 17:17:15 UTC
Created attachment 1380123 [details]
stack resource list output

Comment 7 James Slagle 2018-01-11 17:30:24 UTC
(In reply to Chris Janiszewski from comment #6)
> Created attachment 1380123 [details]
> stack resource list output

I don't see the uuid of the deployment in that output (dbdc2c99-7e1c-4006-aaf8-60db91798e99).

Do you run any out-of-band deployments?
Do you use the /usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh script?

Is there any documentation for the hackfest I can take a look at, to see how this deployment might be getting triggered?

Comment 8 Chris Janiszewski 2018-01-11 17:51:13 UTC
(In reply to James Slagle from comment #7)

> I don't see the uuid of the deployment in that output
> (dbdc2c99-7e1c-4006-aaf8-60db91798e99).

I have not used that script manually, but I do see it present in my environment:
(undercloud) [stack@chrisjupgrade-undercloud ~]$ ls /usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh
/usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh

The deployment is triggered by the deploy.sh that I pasted in comment #2.

I have sent you information about access to this environment via email. Feel free to log in and poke around. You are also more than welcome to trigger another deployment to see if this occurs again. I will leave it up for you to investigate.

Comment 9 James Slagle 2018-01-11 18:31:33 UTC
I believe I've narrowed this down to the interaction between the ceph-ansible.yaml and access.yaml workbooks when ceph-ansible.yaml is triggered by Heat.

First, it does not honor DeploymentServerBlacklist. ceph-ansible.yaml calls:
      enable_ssh_admin:
        workflow: tripleo.access.v1.enable_ssh_admin
which then does:
      get_servers:
        action: nova.servers_list

Not only does that not honor the blacklist, it also creates tripleo-admin on every server, not just the ones where we are installing ceph. Particularly for the ceph-ansible case, I think this ought to be configurable so that we only create the user on the ceph nodes that are in the ceph-ansible inventory.

If you made get_servers take an input of server uuids, and only call nova.servers_list when that input is not provided, you could then make use of the servers json parameter in deploy-steps.j2, which has already had the blacklisted servers removed. A rough sketch of that idea follows below.
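
A minimal sketch of that change to the enable_ssh_admin workflow, with hypothetical task and input names (the actual fix is in the gerrit reviews linked above):

version: '2.0'
name: tripleo.access.v1
workflows:
  enable_ssh_admin:
    input:
      # hypothetical new input; when set, nova discovery is skipped
      - server_uuids: null
    tasks:
      check_for_server_uuids:
        action: std.noop
        on-success:
          - get_servers: <% $.server_uuids = null %>
          - use_provided_servers: <% $.server_uuids != null %>
      get_servers:
        # existing behavior: discover every server via nova
        action: nova.servers_list
        publish:
          servers: <% task(get_servers).result %>
      use_provided_servers:
        # new path: act only on the servers passed in, e.g. the
        # already-filtered servers json from deploy-steps.j2
        action: std.noop
        publish:
          servers: <% $.server_uuids %>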

Further, from what I can tell, this action ends up getting triggered on every stack update. There's nothing that says "don't create tripleo-admin if it's already been done" (that I can find, anyway, and based on this bug report that seems to be the case). That should also be fixed.

Comment 13 Yogev Rabl 2018-03-28 03:36:01 UTC
Verification failed.

Actions:
1) deployed an overcloud with:
 - 3 controllers
 - 1 compute node
 - 1 ceph storage node (with 5 OSDs and replication between the OSDs; cluster healthy)

deployment command:
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/ceph-single-host-mode.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /home/stack/virt/docker-images.yaml \
--log-file overcloud_deployment_62.log

the content of /home/stack/virt/nodes_data.yaml is:
parameter_defaults:
    ControllerCount: 3
    OvercloudControlFlavor: controller
    ComputeCount: 1
    OvercloudComputeFlavor: compute
    CephStorageCount: 1
    OvercloudCephStorageFlavor: ceph


2) created two environment files:
/home/stack/blacklist.yaml
parameter_defaults:
    DeploymentServerBlacklist:
        - controller-0
        - controller-1
        - controller-2
        - compute-0
        - ceph-0
/home/stack/virt/nodes_data_plus_one.yaml
parameter_defaults:
    ControllerCount: 3
    OvercloudControlFlavor: controller
    ComputeCount: 2
    OvercloudComputeFlavor: compute
    CephStorageCount: 1
    OvercloudCephStorageFlavor: ceph

3) ran the update; it failed with an error:
overcloud.AllNodesDeploySteps:
  resource_type: OS::TripleO::PostDeploySteps
  physical_resource_id: 3bc0b155-51a7-45b5-bd15-392236636fc5
  status: UPDATE_FAILED
  status_reason: |
    resources.AllNodesDeploySteps: Property error: resources.BootstrapServerId.properties.value:

Comment 14 Giulio Fidente 2018-03-28 07:55:30 UTC
(In reply to Yogev Rabl from comment #13)
> Verification failed.
> [...]
> 3) ran the update; it failed with an error:
> overcloud.AllNodesDeploySteps:
>   resource_type: OS::TripleO::PostDeploySteps
>   physical_resource_id: 3bc0b155-51a7-45b5-bd15-392236636fc5
>   status: UPDATE_FAILED
>   status_reason: |
>     resources.AllNodesDeploySteps: Property error:
>     resources.BootstrapServerId.properties.value:
Yogev, this is a different error: you're blacklisting all 3 controllers, while BootstrapServerId is set by taking one node from the nodes belonging to the primary role (Controller by default). If we want this scenario (blacklisting all controllers) to work, I guess we can track it with a different BZ, probably for DFG:DF. A blacklist that avoids this is sketched below.
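
For instance, a blacklist that leaves one primary-role node available (hypothetical, reusing the hostnames from comment #13) would avoid that property error:

parameter_defaults:
    DeploymentServerBlacklist:
        - controller-1
        - controller-2
        - compute-0
        - ceph-0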

Regarding support for the blacklist in ceph-ansible, a simpler scenario (which is what this BZ was about) could instead be the following; see the sketch after the list:

1) deploy an overcloud
2) update the overcloud blacklisting 1 node hosting any of the ceph services
3) verify that on the blacklisted node ceph-ansible did not update/refresh the ceph config
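
A minimal sketch of step 3, assuming hypothetical hostnames and that /etc/ceph/ceph.conf is among the files ceph-ansible manages:

# record the modification time of the ceph config on the node to be blacklisted
ssh heat-admin@ceph-0 'stat -c %y /etc/ceph/ceph.conf'
# run the stack update with the blacklist environment file included
openstack overcloud deploy --templates ... -e /home/stack/blacklist.yaml
# check again; the timestamp should be unchanged if the blacklist was honored
ssh heat-admin@ceph-0 'stat -c %y /etc/ceph/ceph.conf'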

Comment 17 Yogev Rabl 2018-03-28 14:34:39 UTC
Verified, per gfidente's comment #14.

Comment 18 errata-xmlrpc 2018-03-28 17:27:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0607

Comment 19 John Fulton 2018-07-31 14:05:24 UTC
*** Bug 1595674 has been marked as a duplicate of this bug. ***

