Very long playbook execution time on compute and controller nodes that have lots of interfaces

Across all of our clusters we were seeing timeouts during the following portion of stack updates:

2020-06-20 11:47:58Z [overcloud-fozzie-AllNodesDeploySteps-wkbflvledpui.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS state changed

This is where ceph-ansible is being run. The timeouts happened because the ansible run would take ~18 hours or more; in some cases, looking at /var/log/mistral/ceph-ansible.log, the initial fact-gathering portion of the site-docker.yml playbook alone took ~24 hours:

2020-06-19 03:30:20,793 p=18969 u=mistral | TASK [gather and delegate facts] ***********************************************
2020-06-19 03:30:20,793 p=18969 u=mistral | Friday 19 June 2020 03:30:20 +0000 (0:00:00.245) 0:00:00.314 ***********
2020-06-19 05:48:06,716 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.13] => (item=10.10.10.13)
2020-06-19 11:18:54,263 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.9] => (item=10.10.10.9)
2020-06-19 16:15:47,032 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.8] => (item=10.10.10.8)
2020-06-19 16:15:56,109 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.16] => (item=10.10.10.16)
2020-06-19 16:15:56,171 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.15] => (item=10.10.10.15)
2020-06-19 16:15:56,235 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.14] => (item=10.10.10.14)
2020-06-19 16:15:56,606 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.7] => (item=10.10.10.7)
2020-06-19 19:48:01,037 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.12] => (item=10.10.10.12)
2020-06-19 19:48:10,701 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.5] => (item=10.10.10.5)
2020-06-20 01:44:17,531 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.10] => (item=10.10.10.10)
2020-06-20 02:49:53,998 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.11] => (item=10.10.10.11)
2020-06-20 02:50:00,836 p=18969 u=mistral | ok: [10.10.10.13 -> 10.10.10.6] => (item=10.10.10.6)

Besides the obvious impact on how long stack deployments take, the length of time it took to complete the ceph-ansible run during this tripleo deployment step also seemed to cause the errors below when fetching files from swift on the undercloud:

[action_ex_id=a5b25e5d-c6d7-4c3c-93b6-ae666f48ba11, idx=0]: {u'msg': u'Error attempting an operation on container: Container GET failed: http://10.10.10.3:8080/v1/AUTH_463490ef385049a5a03c36d5a67e85fc/overcloud-fozzie_ceph_ansible_fetch_dir?format=json 401 Unauthorized [first 60 chars of response] <html><h1>Unauthorized</h1><p>This server could not verify t'}

We were able to speed up the ceph-ansible run significantly by making a change to the playbook. After some troubleshooting, it appears that because of the number of interfaces on our computes and controllers, Ansible gathering the networking/hardware facts for each overcloud node is very slow. Reading through the playbook, the only fact it actually uses is ansible_hostname. Changing the task to gather only the minimal subset of facts using !all [1] significantly speeds up that task and does not impact the rest of the playbook.
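For reference, here is a minimal sketch of what that change might look like in the "gather and delegate facts" task of site-docker.yml. The surrounding task layout (delegate_to/with_items/run_once) is inferred from the log output above and varies between ceph-ansible releases, so treat this as illustrative rather than the exact upstream patch:

- name: gather and delegate facts
  setup:
    # '!all' skips the expensive hardware/network/virtual fact collection;
    # the minimal subset (which still includes ansible_hostname, the only
    # fact the playbook relies on) is gathered as before.
    gather_subset:
      - '!all'
  delegate_to: "{{ item }}"
  delegate_facts: true
  with_items: "{{ groups['all'] }}"
  run_once: true

With this in place, the per-host setup call returns in about a second instead of hours, as shown in the run below.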
After making this change we saw ceph-ansible runs on all our clusters taking ~1.5 hours to complete and the specific step above taking only seconds:

2020-06-20 21:31:52,230 p=26304 u=mistral | TASK [gather and delegate facts] ***********************************************
2020-06-20 21:31:52,230 p=26304 u=mistral | Saturday 20 June 2020 21:31:52 +0000 (0:00:00.283) 0:00:00.516 *********
2020-06-20 21:31:53,294 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.13] => (item=10.10.10.13)
2020-06-20 21:31:54,183 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.9] => (item=10.10.10.9)
2020-06-20 21:31:54,913 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.8] => (item=10.10.10.8)
2020-06-20 21:31:55,353 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.16] => (item=10.10.10.16)
2020-06-20 21:31:55,797 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.15] => (item=10.10.10.15)
2020-06-20 21:31:56,239 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.14] => (item=10.10.10.14)
2020-06-20 21:31:56,760 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.7] => (item=10.10.10.7)
2020-06-20 21:31:57,437 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.12] => (item=10.10.10.12)
2020-06-20 21:31:58,008 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.5] => (item=10.10.10.5)
2020-06-20 21:31:58,748 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.10] => (item=10.10.10.10)
2020-06-20 21:31:59,421 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.11] => (item=10.10.10.11)
2020-06-20 21:31:59,938 p=26304 u=mistral | ok: [10.10.10.13 -> 10.10.10.6] => (item=10.10.10.6)
2020-06-20 21:32:00,026 p=26304 u=mistral | TASK [check if it is atomic host] **********************************************
Hi David Hill, from customer case 02745633 it looks like fixing a hardware issue resolved the reported problem. Is this still a valid BZ?
If it's already fixed in the code (I'm pretty sure it is in later releases), then we can close this BZ if the case is solved.
Hi Guillaume, for this BZ is there any fix to verify? From the customer case this seems to be a hardware issue. Please suggest.