Bug 1640756 - [osp14] baremetal deployment fails with jq: error: Could not open file /var/lib/heat-config/deployed/<id>.notify.json
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: James Slagle
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-10-18 16:48 UTC by James Slagle
Modified: 2023-09-14 04:40 UTC (History)

Fixed In Version: openstack-tripleo-common-9.4.1-0.20181012010865.67bab16.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:54:07 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1792343 0 None None None 2018-10-18 16:49:48 UTC
OpenStack gerrit 604171 0 None MERGED Set SSH server keep alive options 2020-07-31 07:58:19 UTC
OpenStack gerrit 611712 0 None MERGED Run NetworkDeployment as async task 2020-07-31 07:58:19 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:54:18 UTC

Description James Slagle 2018-10-18 16:48:19 UTC
The job is failing because it cannot find the notify JSON file it is looking for:

2018-09-12 17:14:17 | TASK [Run deployment NetworkDeployment] ****************************************
2018-09-12 17:14:17 | Wednesday 12 September 2018 17:13:58 +0000 (0:00:00.254) 0:00:27.104 ***
2018-09-12 17:14:17 | fatal: [overcloud-controller-0]: FAILED! => {"changed": true, "cmd": "/usr/libexec/os-refresh-config/configure.d/55-heat-config\n exit $(jq .deploy_status_code /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json)", "delta": "0:00:00.047339", "end": "2018-09-12 17:14:16.472866", "msg": "non-zero return code", "rc": 2, "start": "2018-09-12 17:14:16.425527", "stderr": "[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed\n[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json\njq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory", "stderr_lines": ["[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed", "[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json", "jq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory"], "stdout": "", "stdout_lines": []}

All the logs concerning the failed job are available here [1] and there's already a proposed review [2] that might address the issue.

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-rocky-rdo_trunk-bmu-ha-lab-cygnus-float_nic_with_vlans-3
[2] https://review.openstack.org/#/c/602270/

Comment 1 James Slagle 2018-10-18 16:48:39 UTC

I debugged this issue in the reproducer environment and found that when os-net-config applied the network configuration on the overcloud node, ssh dropped the connection.

Since ssh retries are set to 8 in ansible.cfg, ansible retried the task because it had failed with an ssh connection error.

However, the first task was actually still running, and it eventually succeeded.

The second task, kicked off by ansible as a retry, sees that the deployment is already applied, but the notification file (*.notify.json) does not yet exist because the first task is still in progress. The second task therefore fails with the error reported in the bug, and the whole ansible-playbook run then fails.



Setting ServerAliveInterval and ServerAliveCountMax ssh options seems to fix the issue as ssh doesn't drop the first connection when these are configured.
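
To illustrate the race described in this comment, here is a minimal Python sketch of a reader that tolerates a notify file that has not been written yet. It is not the merged fix (the linked reviews set SSH keepalive options and run NetworkDeployment as an async task). The helper name read_deploy_status and its timeout and poll parameters are hypothetical. Only the deployed directory path and the deploy_status_code key come from the failing command in the description.

```python
import json
import time
from pathlib import Path

# Directory taken from the failing command in the bug description.
DEPLOYED_DIR = Path("/var/lib/heat-config/deployed")


def read_deploy_status(config_id, deployed_dir=DEPLOYED_DIR, timeout=300.0, poll=1.0):
    """Return deploy_status_code from <config_id>.notify.json.

    Hypothetical helper: instead of failing at once when the notify file
    is missing (the jq error in this bug), keep polling until the file
    appears or `timeout` seconds have elapsed.
    """
    notify = Path(deployed_dir) / f"{config_id}.notify.json"
    deadline = time.monotonic() + timeout
    while True:
        try:
            return json.loads(notify.read_text())["deploy_status_code"]
        except (FileNotFoundError, json.JSONDecodeError):
            # File not created yet, or caught while only partially written.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{notify} did not appear within {timeout}s")
            time.sleep(poll)
```

A retried task that used such a wait would block until the first, still-running task wrote its notification, rather than exiting with "No such file or directory".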

Comment 2 James Slagle 2018-10-18 16:48:57 UTC

Fix proposed to branch: master
Review: https://review.openstack.org/604171

Comment 3 James Slagle 2018-10-18 16:49:11 UTC
Ben Nemec (bnemec) wrote on 2018-10-04 (#17):

    networkdeployment-failure.log (37.6 KiB, text/plain)

I'm still seeing this on master. I've attached the deployment output and I can see that I do have the keepalive fix on the undercloud:

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5
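
For reference, the keepalive values in the ssh_args line above can be turned into an approximate "unresponsive server" window. The sketch below assumes the behaviour documented for OpenSSH, where the client disconnects after roughly ServerAliveInterval times ServerAliveCountMax seconds without a reply. The function names are hypothetical and the parsing is deliberately simple.

```python
import shlex


def ssh_options(ssh_args: str) -> dict:
    """Collect `-o Key=Value` pairs from an ssh argument string."""
    tokens = shlex.split(ssh_args)
    opts = {}
    # Pair each token with the one after it and keep the `-o Key=Value` pairs.
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-o" and "=" in value:
            key, _, val = value.partition("=")
            opts[key] = val
    return opts


def unresponsive_timeout_seconds(ssh_args: str) -> int:
    """Approximate seconds before ssh gives up on a silent server."""
    opts = ssh_options(ssh_args)
    return int(opts["ServerAliveInterval"]) * int(opts["ServerAliveCountMax"])


# The ssh_args value quoted in this comment.
SSH_ARGS = (
    "-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "
    "-o ControlMaster=auto -o ControlPersist=30m "
    "-o ServerAliveInterval=5 -o ServerAliveCountMax=5"
)

print(unresponsive_timeout_seconds(SSH_ARGS))  # 5 * 5 = 25
```

With these settings, a network reconfiguration that leaves the node unreachable for longer than about 25 seconds could still end the connection, which may be relevant to the failure still being seen here.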

This is doing a fairly basic deployment using the OVB multinic templates: https://github.com/cybertron/openstack-virtual-baremetal/tree/master/overcloud-templates/network-templates-v2

Deploy command is this: openstack overcloud deploy --templates --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-isolation-absolute.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-environment.yaml

Comment 8 errata-xmlrpc 2019-01-11 11:54:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

Comment 9 Red Hat Bugzilla 2023-09-14 04:40:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

