Bug 1640756
Summary: | [osp14] baremetal deployment fails with jq: error: Could not open file /var/lib/heat-config/deployed/<id>.notify.json | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | James Slagle <jslagle> |
Component: | openstack-tripleo-common | Assignee: | James Slagle <jslagle> |
Status: | CLOSED ERRATA | QA Contact: | Gurenko Alex <agurenko> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 14.0 (Rocky) | CC: | jslagle, mburns, slinaber |
Target Milestone: | beta | Keywords: | Triaged |
Target Release: | 14.0 (Rocky) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-common-9.4.1-0.20181012010865.67bab16.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2019-01-11 11:54:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
James Slagle
2018-10-18 16:48:19 UTC
I debugged this issue in the reproducer environment and found that while os-net-config was applying the network configuration on the overcloud node, ssh dropped the connection. Since we have ssh retries set to 8 in ansible.cfg, ansible retried the task because it had failed with an ssh connection error. However, the first task was actually still running, and it eventually succeeds. The second task, kicked off by ansible as a retry, sees that the deployment is already applied, but the notification file (*.notify.json) does not exist yet because the first task is still in progress. This causes the second task to fail with the error reported in the bug, and the whole ansible-playbook run then fails.

Setting the ServerAliveInterval and ServerAliveCountMax ssh options seems to fix the issue, as ssh doesn't drop the first connection when these are configured.

Fix proposed to branch: master
Review: https://review.openstack.org/604171

Ben Nemec (bnemec) wrote on 2018-10-04: #17
networkdeployment-failure.log (37.6 KiB, text/plain)

I'm still seeing this on master.
I've attached the deployment output and I can see that I do have the keepalive fix on the undercloud:

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5

This is doing a fairly basic deployment using the OVB multinic templates: https://github.com/cybertron/openstack-virtual-baremetal/tree/master/overcloud-templates/network-templates-v2

Deploy command is this:

openstack overcloud deploy --templates --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-isolation-absolute.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-environment.yaml

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
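
For reference, here is a rough sketch of what the keepalive settings in the ssh_args quoted in the comment above work out to. It parses that option string and estimates how long the ssh client tolerates an unresponsive server before giving up on the connection. The helper names (parse_ssh_options, keepalive_window_seconds) are hypothetical and not part of tripleo-common or ansible; the fallback values (interval 0 meaning disabled, count 3) are assumed from the OpenSSH client defaults in ssh_config(5).

```python
import shlex

# ssh_args value as quoted in the comment above.
SSH_ARGS = (
    "-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "
    "-o ControlMaster=auto -o ControlPersist=30m "
    "-o ServerAliveInterval=5 -o ServerAliveCountMax=5"
)


def parse_ssh_options(ssh_args):
    """Collect `-o Key=Value` pairs from an ssh argument string (hypothetical helper)."""
    tokens = shlex.split(ssh_args)
    options = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "-o" and i + 1 < len(tokens):
            key, _, value = tokens[i + 1].partition("=")
            options[key] = value
            i += 2
        else:
            i += 1
    return options


def keepalive_window_seconds(options):
    """Approximate seconds of server silence tolerated before ssh disconnects.

    Returns None when keepalives are disabled (ServerAliveInterval of 0,
    assumed to be the OpenSSH default).
    """
    interval = int(options.get("ServerAliveInterval", 0))
    count_max = int(options.get("ServerAliveCountMax", 3))
    if interval == 0:
        return None
    return interval * count_max


if __name__ == "__main__":
    opts = parse_ssh_options(SSH_ARGS)
    print(keepalive_window_seconds(opts))
```

With ServerAliveInterval=5 and ServerAliveCountMax=5 the window is roughly 25 seconds, so the client only disconnects after about that long without any response from the overcloud node.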