Description of problem:
The DeploymentServerBlacklist parameter doesn't exclude servers in the list from running updates.

Deployed OSP12 with 3 controllers, 1 compute, 3 ceph nodes. After it completed successfully, added the following environment file:

(undercloud) [stack@chrisjupgrade-undercloud ~]$ cat templates/server-blacklist.yaml
parameter_defaults:
  DeploymentServerBlacklist:
    - overcloud-compute-0
    - overcloud-controller-0
    - overcloud-controller-1
    - overcloud-controller-2
    - overcloud-cephstorage-0
    - overcloud-cephstorage-1
    - overcloud-cephstorage-2

Adjusted ComputeCount from 1 -> 2 and started another deployment/update. Monitored os-collect-config on one of the controllers and saw a lot of traffic during the update, even though that node should be excluded.

The docs (https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/sect-scaling_the_overcloud#Scaling-Blacklisting_Nodes) also mention: "You can also power off or stop the os-collect-config agents during the operation."

Version-Release number of selected component (if applicable):
osp12

How reproducible:
Every time

Steps to Reproduce:
1. Deploy overcloud with 1 compute node
2. Scale out the overcloud with +1 compute by running an update with the DeploymentServerBlacklist parameter set

Actual results:
All steps run on all hosts, even the blacklisted ones.

Expected results:
Blacklisted nodes should not run any updates.

Additional info:
http://chrisj.cloud/sosreport-controller0-DeploymentServerBlacklist-issue-20180110223206.tar.xz
http://chrisj.cloud/sosreport-undercloud-DeploymentServerBlacklist-issue-20180110173246.tar.xz
What is your deployment command? Please provide all custom templates as well if there are any in use.
Created attachment 1380066 [details]
custom templates

Hey James,

Thanks for looking at it. Here is the deploy command that has been used:

source ~/stackrc
cd ~/
time openstack overcloud deploy --templates \
  --ntp-server 10.9.71.7 \
  -e templates/server-blacklist.yaml \
  -e templates/network-environment.yaml \
  -e templates/storage-environment.yaml \
  -e templates/docker-registry.yaml \
  -e templates/node-info.yaml \
  -e templates/inject-trust-anchor-hiera.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml

I am attaching the custom templates directory for your reference.
Just because there is output from os-collect-config in /var/log/messages on controller-0 does not necessarily mean that any changes or updates were applied. You will in fact see some output as the Heat metadata changes due to the deployments being blacklisted. That should not cause any issues, though (the service could even be stopped). Did you actually see any updates applied to the controller?

I looked at /var/log/messages myself from the controller, and I actually do see one update being applied:

Jan 10 17:19:39 localhost os-collect-config: [2018-01-10 22:19:39,563] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/ansible < /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.json
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [INFO] {"deploy_stdout": "\nPLAY [localhost] ***************************************************************\n\nTASK [Gathering Facts] *********************************************************\nok: [localhost]\n\nTASK [create user tripleo-admin] ***********************************************\nok: [localhost]\n\nTASK [grant admin rights to user tripleo-admin] ********************************\nok: [localhost]\n\nTASK [ensure .ssh dir exists for user tripleo-admin] ***************************\nok: [localhost]\n\nTASK [ensure authorized_keys file exists for user tripleo-admin] ***************\nchanged: [localhost]\n\nTASK [authorize TripleO Mistral key for user tripleo-admin] ********************\nok: [localhost]\n\nPLAY RECAP *********************************************************************\nlocalhost : ok=6 changed=1 unreachable=0 failed=0 \n\n", "deploy_stderr": "", "deploy_status_code": 0}
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [DEBUG] [2018-01-10 22:19:39,593] (heat-config) [DEBUG] Running ansible-playbook -i localhost, /var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_playbook.yaml --extra-vars @/var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_variables.json
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO] Return code 0
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO]
Jan 10 17:19:45 localhost os-collect-config: PLAY [localhost] ***************************************************************
Jan 10 17:19:45 localhost os-collect-config: TASK [Gathering Facts] *********************************************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [create user tripleo-admin] ***********************************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [grant admin rights to user tripleo-admin] ********************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [ensure .ssh dir exists for user tripleo-admin] ***************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [ensure authorized_keys file exists for user tripleo-admin] ***************
Jan 10 17:19:45 localhost os-collect-config: changed: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [authorize TripleO Mistral key for user tripleo-admin] ********************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: PLAY RECAP *********************************************************************
Jan 10 17:19:45 localhost os-collect-config: localhost : ok=6 changed=1 unreachable=0 failed=0
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_playbook.yaml
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [INFO] Completed /usr/libexec/heat-config/hooks/ansible
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.json < /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.notify.json

Does the timestamp shown there correspond with when you were doing an update with the blacklist?

I'm actually unable to identify the source of this deployment. I don't think this code should be in any of the shipped templates for OSP12, and I'm unsure at the moment where it's coming from. Do you by chance have any pip or source installs on your undercloud, or perhaps any packages from OSP13 on your undercloud?

I actually can't access the sosreport for the undercloud due to a permission-denied error on the URL; can you fix that as well?
Thanks for the clarification on os-collect-config still getting messages. It might be worth adding this to the official documentation; I expected not to see any activity there, since the doc mentions you can even stop this service.

I am only using official CDN repos for my deployment - osp12.
I don't use pip or source installs.
I changed permissions for the sosreport from the undercloud - sorry for that.

We are validating the DeploymentServerBlacklist use case at the OSP21 hackfest this week. There are more people running this use case who noticed similar behavior, and they actually noticed changes on roles other than compute. I am going to ask them to comment.
(In reply to Chris Janiszewski from comment #4)
> Thanks for clarification on the os-collect-config still getting messages. It
> might be worth adding this comment to official documentation. I expected to
> not see any activity there since the doc mentions you can even stop this
> service.
>
> I am only using official CDN repos for my deployment - osp12.
> I don't use pip or source installs.
> I changed permissions for sosreport from undercloud - sorry for that.
>
> We are validating DeploymentServerBlacklist use case at the OSP21 hackfest
> this week. There are more people running this use case and notices similar
> behavior and actually noticed changes on roles others then compute. I am
> going to ask the to comment.

OK, we still need to figure out the source of this deployment, as I don't see where it's coming from looking at the OSP12 RPMs. Can you attach the output from the following?

openstack stack resource list -n 7 overcloud
Created attachment 1380123 [details] stack resource list output
(In reply to Chris Janiszewski from comment #6)
> Created attachment 1380123 [details]
> stack resource list output

I don't see the UUID of the deployment in that output (dbdc2c99-7e1c-4006-aaf8-60db91798e99). Do you run any out-of-band deployments? Do you use the /usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh script? Is there any documentation for the hackfest I can take a look at to see how this deployment might be getting triggered?
(In reply to James Slagle from comment #7)
> I don't see the uuid of the deployment in that output
> (dbdc2c99-7e1c-4006-aaf8-60db91798e99).

I have not used that script manually, but I see it is present on my environment:

(undercloud) [stack@chrisjupgrade-undercloud ~]$ ls /usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh
/usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh

The deployment is triggered by the deploy.sh that I have pasted in comment #2.

I have sent you information about access to this environment via email. Feel free to log in and poke around. You are also more than welcome to just trigger another deployment to see if this occurs again. I will leave it up for you to investigate.
I believe I've narrowed this down to the interaction between ceph-ansible.yaml and the access.yaml workbook when ceph-ansible.yaml is triggered by Heat.

First, it does not honor DeploymentServerBlacklist. ceph-ansible.yaml calls:

  enable_ssh_admin:
    workflow: tripleo.access.v1.enable_ssh_admin

which then does:

  get_servers:
    action: nova.servers_list

Not only does that not honor the blacklist, but it will create tripleo-admin on every server, not just the ones where we are installing ceph. Particularly for the ceph-ansible case, I think this ought to be configurable so that we only create the user on the ceph nodes that are in the inventory for ceph-ansible. If get_servers took an input of server UUIDs and only called nova.servers_list when that input is not provided, it could then make use of the servers json parameter in deploy-steps.j2, which has already had the blacklisted servers removed.

Further, from what I can tell, this action ends up getting triggered on every stack update. There's nothing to say "don't create tripleo-admin if it's already been done" (that I can find, anyway, and based on this bug report that seems to be the case). That should also be fixed.
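The suggested change could be sketched roughly as follows. This is a hypothetical outline only, not the shipped tripleo.access.v1 workbook source; the server_uuids input and the check_input/use_provided_servers task names are illustrative:

```yaml
---
# Hypothetical sketch: let enable_ssh_admin accept an optional list of
# server UUIDs (e.g. the blacklist-filtered "servers" json that
# deploy-steps.j2 already computes) and only fall back to querying Nova
# for all servers when no explicit list is provided.
version: '2.0'

name: tripleo.access.v1

workflows:
  enable_ssh_admin:
    input:
      - server_uuids: null   # optional; when set, the Nova query is skipped
    tasks:
      check_input:
        on-complete:
          - get_servers: <% $.server_uuids = null %>
          - use_provided_servers: <% $.server_uuids != null %>
      get_servers:
        action: nova.servers_list
        publish:
          servers: <% task().result %>
      use_provided_servers:
        action: std.noop
        publish:
          servers: <% $.server_uuids %>
```

With something like this in place, the ceph-ansible environment could pass only the inventory nodes it actually touches, which would address both the blacklist bypass and the user creation on non-ceph nodes.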
Verification failed.

Actions:
1) deployed an overcloud with
   - 3 controllers
   - 1 compute node
   - 1 ceph storage node (with 5 OSDs in it + replication between OSDs, cluster healthy)

deployment command:
openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack overcloud \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
  -e /home/stack/virt/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/virt/debug.yaml \
  -e /home/stack/virt/ceph-single-host-mode.yaml \
  -e /home/stack/virt/nodes_data.yaml \
  -e /home/stack/virt/docker-images.yaml \
  --log-file overcloud_deployment_62.log

the content of /home/stack/virt/nodes_data.yaml is:
parameter_defaults:
  ControllerCount: 3
  OvercloudControlFlavor: controller
  ComputeCount: 1
  OvercloudComputeFlavor: compute
  CephStorageCount: 1
  OvercloudCephStorageFlavor: ceph

2) created two environment files:

/home/stack/blacklist.yaml:
parameter_defaults:
  DeploymentServerBlacklist:
    - controller-0
    - controller-1
    - controller-2
    - compute-0
    - ceph-0

/home/stack/virt/nodes_data_plus_one.yaml:
parameter_defaults:
  ControllerCount: 3
  OvercloudControlFlavor: controller
  ComputeCount: 2
  OvercloudComputeFlavor: compute
  CephStorageCount: 1
  OvercloudCephStorageFlavor: ceph

3) ran the update; it failed with an error:
overcloud.AllNodesDeploySteps:
  resource_type: OS::TripleO::PostDeploySteps
  physical_resource_id: 3bc0b155-51a7-45b5-bd15-392236636fc5
  status: UPDATE_FAILED
  status_reason: |
    resources.AllNodesDeploySteps: Property error:
    resources.BootstrapServerId.properties.value:
(In reply to Yogev Rabl from comment #13)
> [...]
> 3) ran the update, it failed with an error
> overcloud.AllNodesDeploySteps:
>   resource_type: OS::TripleO::PostDeploySteps
>   physical_resource_id: 3bc0b155-51a7-45b5-bd15-392236636fc5
>   status: UPDATE_FAILED
>   status_reason: |
>     resources.AllNodesDeploySteps: Property error:
>     resources.BootstrapServerId.properties.value:

Yogev, this is a different error: you're blacklisting all 3 controllers, while BootstrapServerId is set by taking one node from the nodes belonging to the primary role (Controller by default). If we want this scenario to work (blacklisting all controllers), we can track it with a different BZ, probably for DFG:DF.

Regarding support for the blacklist in ceph-ansible, a simpler scenario could be (which is what this BZ was about):

1) deploy an overcloud
2) update the overcloud, blacklisting 1 node hosting any of the ceph services
3) verify that on the blacklisted node ceph-ansible did not update/refresh the ceph config
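Step 3 above can be scripted along these lines. This is a hedged sketch of the checksum-comparison idea, not an official verification procedure; it uses a stand-in temp file so it runs anywhere, whereas on a real blacklisted node CEPH_CONF would point at /etc/ceph/ceph.conf:

```shell
#!/bin/sh
# Sketch: snapshot the ceph config's checksum before the stack update,
# then compare it afterwards. A stand-in file is created here so the
# demonstration is self-contained; on a real node you would instead set
# CEPH_CONF=/etc/ceph/ceph.conf and run the stack update in between.
CEPH_CONF=$(mktemp)
printf '[global]\nfsid = 00000000-0000-0000-0000-000000000000\n' > "$CEPH_CONF"

before=$(md5sum "$CEPH_CONF" | awk '{print $1}')
# ... the blacklisted stack update would run here; in this demo nothing
# touches the file, which is the expected outcome for a blacklisted node ...
after=$(md5sum "$CEPH_CONF" | awk '{print $1}')

if [ "$before" = "$after" ]; then
    echo "ceph.conf unchanged: blacklist honored"
else
    echo "ceph.conf changed: blacklist NOT honored"
fi
rm -f "$CEPH_CONF"
```

Comparing file mtimes would also work, but a checksum is robust against ceph-ansible rewriting the file with identical timestamps.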
Verified, according to gfidente's comment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0607
*** Bug 1595674 has been marked as a duplicate of this bug. ***