Bug 1533275
| Summary: | access workbook does not honor DeploymentServerBlacklist parameter during update | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Chris Janiszewski <cjanisze> |
| Component: | openstack-tripleo-common | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 12.0 (Pike) | CC: | acanan, cjanisze, david.costakos, ebarrera, emacchi, jomurphy, jschluet, jslagle, kejones, mburns, rhel-osp-director-maint, slinaber, uemit.seren |
| Target Milestone: | z2 | Keywords: | Triaged, ZStream |
| Target Release: | 12.0 (Pike) | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-common-7.6.9-1.el7ost, openstack-tripleo-heat-templates-7.0.9-1.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-03-28 17:27:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
What is your deployment command? Please provide all custom templates as well, if there are any in use.

Created attachment 1380066 [details]
custom templates
Hey James, thanks for looking at it.
Here is the deploy command that has been used:
source ~/stackrc
cd ~/
time openstack overcloud deploy --templates \
--ntp-server 10.9.71.7 \
-e templates/server-blacklist.yaml \
-e templates/network-environment.yaml \
-e templates/storage-environment.yaml \
-e templates/docker-registry.yaml \
-e templates/node-info.yaml \
-e templates/inject-trust-anchor-hiera.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml
I am attaching the custom templates directory for your reference.
Just because there is os-collect-config output in /var/log/messages on controller-0 does not necessarily mean that any changes or updates were applied. You will in fact see some output, as the Heat metadata changes due to the deployments being blacklisted. That should not cause any issues, though (the service could even be stopped).
Did you actually see any updates applied to the controller?
I looked at /var/log/messages myself from the controller, and I actually do see one update being applied:
Jan 10 17:19:39 localhost os-collect-config: [2018-01-10 22:19:39,563] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/ansible < /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.json
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [INFO] {"deploy_stdout": "\nPLAY [localhost] ***************************************************************\n\nTASK [Gathering Facts] *********************************************************\nok: [localhost]\n\nTASK [create user tripleo-admin] ***********************************************\nok: [localhost]\n\nTASK [grant admin rights to user tripleo-admin] ********************************\nok: [localhost]\n\nTASK [ensure .ssh dir exists for user tripleo-admin] ***************************\nok: [localhost]\n\nTASK [ensure authorized_keys file exists for user tripleo-admin] ***************\nchanged: [localhost]\n\nTASK [authorize TripleO Mistral key for user tripleo-admin] ********************\nok: [localhost]\n\nPLAY RECAP *********************************************************************\nlocalhost : ok=6 changed=1 unreachable=0 failed=0 \n\n", "deploy_stderr": "", "deploy_status_code": 0}
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [DEBUG] [2018-01-10 22:19:39,593] (heat-config) [DEBUG] Running ansible-playbook -i localhost, /var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_playbook.yaml --extra-vars @/var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_variables.json
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO] Return code 0
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO]
Jan 10 17:19:45 localhost os-collect-config: PLAY [localhost] ***************************************************************
Jan 10 17:19:45 localhost os-collect-config: TASK [Gathering Facts] *********************************************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [create user tripleo-admin] ***********************************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [grant admin rights to user tripleo-admin] ********************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [ensure .ssh dir exists for user tripleo-admin] ***************************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [ensure authorized_keys file exists for user tripleo-admin] ***************
Jan 10 17:19:45 localhost os-collect-config: changed: [localhost]
Jan 10 17:19:45 localhost os-collect-config: TASK [authorize TripleO Mistral key for user tripleo-admin] ********************
Jan 10 17:19:45 localhost os-collect-config: ok: [localhost]
Jan 10 17:19:45 localhost os-collect-config: PLAY RECAP *********************************************************************
Jan 10 17:19:45 localhost os-collect-config: localhost : ok=6 changed=1 unreachable=0 failed=0
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,978] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-ansible/dbdc2c99-7e1c-4006-aaf8-60db91798e99_playbook.yaml
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [INFO] Completed /usr/libexec/heat-config/hooks/ansible
Jan 10 17:19:45 localhost os-collect-config: [2018-01-10 22:19:45,983] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.json < /var/lib/heat-config/deployed/dbdc2c99-7e1c-4006-aaf8-60db91798e99.notify.json
Does the timestamp shown there correspond with when you were doing an update with the blacklist?
I'm actually unable to identify the source of this deployment. I don't think this code should be in any of the shipped templates for OSP12. I'm unsure at the moment where it's coming from.
Do you by chance have any pip or source installs on your undercloud, or perhaps any packages from osp13? I can't access the sosreport for the undercloud due to a permission denied error on the URL; can you fix that as well?
Thanks for the clarification on os-collect-config still getting messages. It might be worth adding that note to the official documentation; I expected not to see any activity there, since the doc mentions you can even stop this service.

I am only using official CDN repos for my deployment - osp12. I don't use pip or source installs. I changed the permissions for the sosreport from the undercloud - sorry for that.

We are validating the DeploymentServerBlacklist use case at the OSP21 hackfest this week. More people are running this use case, have noticed similar behavior, and have actually noticed changes on roles other than compute. I am going to ask them to comment.

(In reply to Chris Janiszewski from comment #4)

OK, well, we still need to figure out the source of this deployment, as I don't see where it's coming from looking at the 12 RPMs. Can you attach the output from the following?

openstack stack resource list -n 7 overcloud

Created attachment 1380123 [details]
stack resource list output
(In reply to Chris Janiszewski from comment #6)
> Created attachment 1380123 [details]
> stack resource list output

I don't see the uuid of the deployment in that output (dbdc2c99-7e1c-4006-aaf8-60db91798e99). Do you run any out-of-band deployments? Do you use the /usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh script? Is there any documentation for the hackfest I can take a look at, to see how this deployment might be getting triggered?

(In reply to James Slagle from comment #7)
> I don't see the uuid of the deployment in that output
> (dbdc2c99-7e1c-4006-aaf8-60db91798e99).

I have not used that script manually, but I do see it present in my environment:

(undercloud) [stack@chrisjupgrade-undercloud ~]$ ls /usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh
/usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh

The deployment is triggered by the deploy.sh that I have pasted in comment #2. I have sent you information about access to this environment via email. Feel free to log in and poke around. You are also more than welcome to just trigger another deployment to see if this occurs again. I will leave it up for you to investigate.

I believe I've narrowed this down to the interaction between the ceph-ansible.yaml and access.yaml workbooks when ceph-ansible.yaml is triggered by Heat.
First, it does not honor DeploymentServerBlacklist. ceph-ansible.yaml calls:

enable_ssh_admin:
  workflow: tripleo.access.v1.enable_ssh_admin

which then does:

get_servers:
  action: nova.servers_list
Not only does that not honor the blacklist, but it will create tripleo-admin on every server, not just the ones where we are installing ceph. Particularly for the ceph-ansible case, I think this ought to be configurable, so that we only create the user on the ceph nodes that are in the inventory for ceph-ansible.
If you made get_servers take an input of server uuids and only called nova.servers_list when that input is not provided, you could then make use of the servers json parameter in deploy-steps.j2, which has already had the blacklisted servers removed.
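To make that concrete, here is a minimal sketch of what such a change could look like in the access workbook, assuming the Mistral v2 DSL; the server_uuids input name and the task layout are illustrative, not the shipped tripleo-common workbook:

```yaml
version: '2.0'
name: tripleo.access.v1

workflows:
  enable_ssh_admin:
    input:
      # hypothetical optional input: a pre-filtered list of server UUIDs,
      # e.g. built from the 'servers' json in deploy-steps.j2, which already
      # has the DeploymentServerBlacklist entries removed
      - server_uuids: null

    tasks:
      check_input:
        # no action: acts purely as a branch point on whether a list was given
        on-complete:
          - use_provided_servers: <% $.server_uuids != null %>
          - get_servers: <% $.server_uuids = null %>

      use_provided_servers:
        publish:
          servers: <% $.server_uuids %>

      get_servers:
        # fall back to the current behaviour only when no list was passed in
        action: nova.servers_list
        publish:
          servers: <% task().result %>
```

With something along these lines, the caller could pass the already-filtered list as execution input, and blacklisted nodes would never be enumerated, let alone touched.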
Further, from what I can tell, this action ends up getting triggered on every stack update. There's nothing to say "don't create tripleo-admin if it's already been done" (that I can find anyway, and based on this bug report that seems to be the case). That should also be fixed.
Verification failed
actions:
1) deployed an overcloud with
- 3 controllers
- 1 compute node
- 1 ceph storage node (with 5 OSDs and replication between the OSDs; cluster healthy)
deployment command:
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/ceph-single-host-mode.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /home/stack/virt/docker-images.yaml \
--log-file overcloud_deployment_62.log
The content of /home/stack/virt/nodes_data.yaml is:

parameter_defaults:
  ControllerCount: 3
  OvercloudControlFlavor: controller
  ComputeCount: 1
  OvercloudComputeFlavor: compute
  CephStorageCount: 1
  OvercloudCephStorageFlavor: ceph
2) created two environment files:

/home/stack/blacklist.yaml:

parameter_defaults:
  DeploymentServerBlacklist:
    - controller-0
    - controller-1
    - controller-2
    - compute-0
    - ceph-0
/home/stack/virt/nodes_data_plus_one.yaml:

parameter_defaults:
  ControllerCount: 3
  OvercloudControlFlavor: controller
  ComputeCount: 2
  OvercloudComputeFlavor: compute
  CephStorageCount: 1
  OvercloudCephStorageFlavor: ceph
3) ran the update; it failed with an error:

overcloud.AllNodesDeploySteps:
  resource_type: OS::TripleO::PostDeploySteps
  physical_resource_id: 3bc0b155-51a7-45b5-bd15-392236636fc5
  status: UPDATE_FAILED
  status_reason: |
    resources.AllNodesDeploySteps: Property error: resources.BootstrapServerId.properties.value:
(In reply to Yogev Rabl from comment #13)

Yogev, this is a different error, due to the fact that you're blacklisting all 3 controllers, while BootstrapServerId is set by taking one node from the nodes belonging to the primary role (Controller by default). I guess if we want this scenario to work (blacklisting all controllers), we can track it with a different BZ, probably for DFG:DF.

Regarding support for the blacklist in ceph-ansible instead, a simpler scenario could be (which is what this BZ was about):

1) deploy an overcloud
2) update the overcloud, blacklisting 1 node hosting any of the ceph services
3) verify that on the blacklisted node ceph-ansible did not update/refresh the ceph config

Verified, according to gfidente's comment.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0607

*** Bug 1595674 has been marked as a duplicate of this bug. ***
Description of problem:

The DeploymentServerBlacklist parameter doesn't exclude the servers in the list from running the update.

Deployed OSP12 with 3 controllers, 1 compute, 3 ceph nodes. After it completed successfully, added the following extension:

(undercloud) [stack@chrisjupgrade-undercloud ~]$ cat templates/server-blacklist.yaml
parameter_defaults:
  DeploymentServerBlacklist:
    - overcloud-compute-0
    - overcloud-controller-0
    - overcloud-controller-1
    - overcloud-controller-2
    - overcloud-cephstorage-0
    - overcloud-cephstorage-1
    - overcloud-cephstorage-2

Adjusted ComputeCount from 1 -> 2 and started another deployment/update. Monitored os-collect-config on one of the controllers and saw a lot of traffic during the update, even though that node should be excluded.

The docs - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/sect-scaling_the_overcloud#Scaling-Blacklisting_Nodes - also mention: "You can also power off or stop the os-collect-config agents during the operation."

Version-Release number of selected component (if applicable):
osp12

How reproducible:
Every time

Steps to Reproduce:
1. Deploy overcloud with 1 compute node
2. Scale out the overcloud with +1 compute while running the update with the DeploymentServerBlacklist parameter

Actual results:
All steps run on all the hosts, even the blacklisted ones.

Expected results:
Blacklisted nodes should not run any updates.

Additional info:
http://chrisj.cloud/sosreport-controller0-DeploymentServerBlacklist-issue-20180110223206.tar.xz
http://chrisj.cloud/sosreport-undercloud-DeploymentServerBlacklist-issue-20180110173246.tar.xz