Bug 1640756

Summary: [osp14] baremetal deployment fails with jq: error: Could not open file /var/lib/heat-config/deployed/<id>.notify.json
Product: Red Hat OpenStack Reporter: James Slagle <jslagle>
Component: openstack-tripleo-common Assignee: James Slagle <jslagle>
Status: CLOSED ERRATA QA Contact: Gurenko Alex <agurenko>
Severity: high Docs Contact:
Priority: high    
Version: 14.0 (Rocky) CC: jslagle, mburns, slinaber
Target Milestone: beta Keywords: Triaged
Target Release: 14.0 (Rocky)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-9.4.1-0.20181012010865.67bab16.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-01-11 11:54:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description James Slagle 2018-10-18 16:48:19 UTC
The job is failing because it cannot find the notify JSON file it is looking for:

2018-09-12 17:14:17 | TASK [Run deployment NetworkDeployment] ****************************************
2018-09-12 17:14:17 | Wednesday 12 September 2018 17:13:58 +0000 (0:00:00.254) 0:00:27.104 ***
2018-09-12 17:14:17 | fatal: [overcloud-controller-0]: FAILED! => {"changed": true, "cmd": "/usr/libexec/os-refresh-config/configure.d/55-heat-config\n exit $(jq .deploy_status_code /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json)", "delta": "0:00:00.047339", "end": "2018-09-12 17:14:16.472866", "msg": "non-zero return code", "rc": 2, "start": "2018-09-12 17:14:16.425527", "stderr": "[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed\n[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json\njq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory", "stderr_lines": ["[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed", "[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json", "jq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory"], "stdout": "", "stdout_lines": []}

All the logs concerning the failed job are available here [1] and there's already a proposed review [2] that might address the issue.

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-rocky-rdo_trunk-bmu-ha-lab-cygnus-float_nic_with_vlans-3
[2] https://review.openstack.org/#/c/602270/

Comment 1 James Slagle 2018-10-18 16:48:39 UTC

I debugged this issue in the reproducer environment and found that when os-net-config was applying the network configuration on the overcloud node, ssh dropped the connection.

Since we have ssh retries set to 8 in ansible.cfg, ansible retried the task because it had failed with an ssh connection error.

However, the first task was actually still running, and it eventually succeeded.

The second task, kicked off by ansible as a retry, sees that the deployment is already applied, but the notification file (*.notify.json) does not yet exist because the first task is still in progress. The second task therefore fails with the error reported in this bug, and the whole ansible-playbook run fails with it.
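
The window the retried task falls into can be illustrated with a small sketch. This is not heat-config code: the function names and state labels are hypothetical, and the file layout (an <id>.json marker plus an <id>.notify.json result file in the same directory) is inferred only from the log output in the description.

```python
import json
import tempfile
from pathlib import Path


def deployment_state(deployed_dir, config_id):
    """Classify a config by which files exist (hypothetical helper).

    Layout assumed from the log in this bug: <id>.json marks the config as
    deployed, <id>.notify.json holds the result once the run has finished.
    """
    deployed_dir = Path(deployed_dir)
    marker = deployed_dir / f"{config_id}.json"
    notify = deployed_dir / f"{config_id}.notify.json"
    if not marker.exists():
        return "not-deployed"
    if not notify.exists():
        # This is the window the retried task hit: the config is skipped as
        # "already deployed", but there is no result file to read yet.
        return "in-progress"
    return "finished"


def read_status_code(deployed_dir, config_id):
    """Read deploy_status_code, as the jq call in the failing task does."""
    notify = Path(deployed_dir) / f"{config_id}.notify.json"
    with notify.open() as handle:
        return json.load(handle)["deploy_status_code"]


CONFIG_ID = "a557a476-c788-4b44-8e8c-744b0d7120d0"

with tempfile.TemporaryDirectory() as tmp:
    state_before = deployment_state(tmp, CONFIG_ID)

    # First task has started: the marker exists, the result does not.
    (Path(tmp) / f"{CONFIG_ID}.json").write_text("{}")
    state_during = deployment_state(tmp, CONFIG_ID)

    # First task finishes and writes its result.
    (Path(tmp) / f"{CONFIG_ID}.notify.json").write_text(
        json.dumps({"deploy_status_code": 0})
    )
    state_after = deployment_state(tmp, CONFIG_ID)
    status_code = read_status_code(tmp, CONFIG_ID)

print(state_before, state_during, state_after, status_code)
```

A retry that lands in the "in-progress" state has nothing to read, which matches the jq "No such file or directory" error above.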



Setting the ServerAliveInterval and ServerAliveCountMax ssh options seems to fix the issue, as ssh doesn't drop the first connection when these are configured.
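
For a rough sense of what these two options buy, the sketch below parses an ssh_args string and multiplies the two values. Per the OpenSSH ssh_config documentation, the client gives up after roughly ServerAliveInterval x ServerAliveCountMax seconds without a reply from the server. The helper names are hypothetical, and the argument string is the one quoted from ansible.cfg in Comment 3; whether that budget is long enough for a given os-net-config run is not something this bug establishes.

```python
import shlex


def parse_ssh_options(ssh_args):
    """Collect `-o Key=Value` pairs from an ssh argument string into a dict."""
    tokens = shlex.split(ssh_args)
    options = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-o" and "=" in value:
            key, _, val = value.partition("=")
            options[key] = val
    return options


def unresponsive_budget_seconds(options):
    """Approximate seconds of server silence before the client disconnects.

    OpenSSH defaults: ServerAliveInterval=0 (keepalive messages disabled)
    and ServerAliveCountMax=3. A result of 0 means keepalives are off, so
    there is no keepalive-based budget at all.
    """
    interval = int(options.get("ServerAliveInterval", 0))
    count_max = int(options.get("ServerAliveCountMax", 3))
    return interval * count_max


# ssh_args as quoted from ansible.cfg in Comment 3 of this bug.
SSH_ARGS = (
    "-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "
    "-o ControlMaster=auto -o ControlPersist=30m "
    "-o ServerAliveInterval=5 -o ServerAliveCountMax=5"
)

options = parse_ssh_options(SSH_ARGS)
budget = unresponsive_budget_seconds(options)
print(budget)  # 5 * 5 = 25
```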

Comment 2 James Slagle 2018-10-18 16:48:57 UTC

Fix proposed to branch: master
Review: https://review.openstack.org/604171

Comment 3 James Slagle 2018-10-18 16:49:11 UTC
Ben Nemec (bnemec) wrote on 2018-10-04 (#17):

Attachment: networkdeployment-failure.log (37.6 KiB, text/plain)

I'm still seeing this on master. I've attached the deployment output and I can see that I do have the keepalive fix on the undercloud:

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5

This is doing a fairly basic deployment using the OVB multinic templates: https://github.com/cybertron/openstack-virtual-baremetal/tree/master/overcloud-templates/network-templates-v2

Deploy command is this: openstack overcloud deploy --templates --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-isolation-absolute.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-environment.yaml

Comment 8 errata-xmlrpc 2019-01-11 11:54:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

Comment 9 Red Hat Bugzilla 2023-09-14 04:40:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days