Bug 1640756

Summary:	[osp14] baremetal deployment fails with jq: error: Could not open file /var/lib/heat-config/deployed/<id>.notify.json
Product:	Red Hat OpenStack	Reporter:	James Slagle <jslagle>
Component:	openstack-tripleo-common	Assignee:	James Slagle <jslagle>
Status:	CLOSED ERRATA	QA Contact:	Gurenko Alex <agurenko>
Severity:	high	Docs Contact:
Priority:	high
Version:	14.0 (Rocky)	CC:	jslagle, mburns, slinaber
Target Milestone:	beta	Keywords:	Triaged
Target Release:	14.0 (Rocky)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-common-9.4.1-0.20181012010865.67bab16.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-01-11 11:54:07 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description James Slagle 2018-10-18 16:48:19 UTC

The job is failing because it cannot find the notify JSON file it is looking for:

2018-09-12 17:14:17 | TASK [Run deployment NetworkDeployment] ****************************************
2018-09-12 17:14:17 | Wednesday 12 September 2018 17:13:58 +0000 (0:00:00.254) 0:00:27.104 ***
2018-09-12 17:14:17 | fatal: [overcloud-controller-0]: FAILED! => {"changed": true, "cmd": "/usr/libexec/os-refresh-config/configure.d/55-heat-config\n exit $(jq .deploy_status_code /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json)", "delta": "0:00:00.047339", "end": "2018-09-12 17:14:16.472866", "msg": "non-zero return code", "rc": 2, "start": "2018-09-12 17:14:16.425527", "stderr": "[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed\n[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json\njq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory", "stderr_lines": ["[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed", "[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json", "jq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory"], "stdout": "", "stdout_lines": []}

All the logs concerning the failed job are available here [1] and there's already a proposed review [2] that might address the issue.

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-rocky-rdo_trunk-bmu-ha-lab-cygnus-float_nic_with_vlans-3
[2] https://review.openstack.org/#/c/602270/

Comment 1 James Slagle 2018-10-18 16:48:39 UTC


I debugged this issue in the reproducer environment and found that while os-net-config was applying the network configuration on the overcloud node, ssh dropped the connection.

Because ssh retries is set to 8 in ansible.cfg, ansible retried the task after it failed with an ssh connection error.

However, the first task was actually still running, and it eventually succeeded.

The second task, kicked off by ansible as a retry, sees that the deployment is already applied, but the notification file (*.notify.json) does not exist yet because the first task is still in progress. This causes the second task to fail with the error reported in this bug, and the whole ansible-playbook run then fails.
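
To illustrate the race, here is a minimal Python sketch of a reader that tolerates the notification file not being there yet. It is not the fix proposed for this bug (that one changes the ssh options). The helper name wait_for_deploy_status, the timeout, and the poll interval are assumptions made up for the example; only the *.notify.json file and the deploy_status_code key come from the log above.

```python
import json
import time


def wait_for_deploy_status(notify_path, timeout=60.0, poll_interval=0.5):
    """Return deploy_status_code from notify_path, waiting for the file to appear.

    Hypothetical helper: the retried task in this bug read the file right away
    and failed because the first task had not written it yet.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(notify_path) as f:
                return int(json.load(f)["deploy_status_code"])
        except (FileNotFoundError, json.JSONDecodeError):
            # Assumption: a missing or half-written file means the first
            # run is still in progress, so keep polling until the deadline.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll_interval)
```

A caller would exit with the returned code, much like the `exit $(jq .deploy_status_code ...)` command in the failing task, but only once the first run has finished writing the file.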



Setting the ServerAliveInterval and ServerAliveCountMax ssh options seems to fix the issue, as ssh no longer drops the first connection when they are configured.

Comment 2 James Slagle 2018-10-18 16:48:57 UTC


Fix proposed to branch: master
Review: https://review.openstack.org/604171

Comment 3 James Slagle 2018-10-18 16:49:11 UTC

 Ben Nemec (bnemec) wrote on 2018-10-04: 	#17

    networkdeployment-failure.log (37.6 KiB, text/plain)

I'm still seeing this on master. I've attached the deployment output and I can see that I do have the keepalive fix on the undercloud:

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5
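
For reference, the keepalive settings above work out to roughly 25 seconds: ssh sends a probe after every ServerAliveInterval seconds of silence and disconnects after ServerAliveCountMax unanswered probes. The Python sketch below computes that from an ssh_args string. The function names are made up for the example, and it only understands the `-o Key=Value` form used in this ansible.cfg.

```python
import shlex


def ssh_options(ssh_args):
    """Collect `-o Key=Value` pairs from an ssh argument string into a dict."""
    tokens = shlex.split(ssh_args)
    opts = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-o" and "=" in value:
            key, _, val = value.partition("=")
            opts[key] = val
    return opts


def keepalive_window(opts):
    """Approximate seconds an unresponsive server is tolerated before ssh disconnects."""
    # ssh defaults: ServerAliveInterval=0 (keepalives off), ServerAliveCountMax=3.
    interval = int(opts.get("ServerAliveInterval", 0))
    count_max = int(opts.get("ServerAliveCountMax", 3))
    return interval * count_max


args = ("-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "
        "-o ControlMaster=auto -o ControlPersist=30m "
        "-o ServerAliveInterval=5 -o ServerAliveCountMax=5")
print(keepalive_window(ssh_options(args)))  # 25
```

A result of 0 means keepalives are disabled. With the values above, a network interruption on the overcloud node that lasts longer than about 25 seconds could still drop the connection.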

This is doing a fairly basic deployment using the OVB multinic templates: https://github.com/cybertron/openstack-virtual-baremetal/tree/master/overcloud-templates/network-templates-v2

Deploy command is this: openstack overcloud deploy --templates --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-isolation-absolute.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-environment.yaml

Comment 8 errata-xmlrpc 2019-01-11 11:54:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

Comment 9 Red Hat Bugzilla 2023-09-14 04:40:29 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days