Bug 1741093

Summary: Message: Build of instance xxxx aborted: Failure prepping block device., Code: 500
Product: Red Hat OpenStack Reporter: Shailesh Chhabdiya <schhabdi>
Component: openstack-ironic Assignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG QA Contact: mlammon
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 13.0 (Queens) CC: bfournie, mburns
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-09-24 12:57:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Shailesh Chhabdiya 2019-08-14 08:59:51 UTC
Description of problem:
Fresh deployment fails with the error mentioned in the title.

Version-Release number of selected component (if applicable):
Red Hat OpenStack 13


Additional info:

The issue seems to be related to RabbitMQ; there are a lot of error messages.

Comment 2 Shailesh Chhabdiya 2019-08-14 09:01:29 UTC
The following errors are present in some log files:

/var/log/ironic/app.log
~~~~

2019-08-12 11:58:46.690 12156 ERROR keystonemiddleware.auth_token [req-b1994b3b-c1cf-4f17-95dc-739c0957b45e 1127d62ec95945e8b66f027dcd58b9d3 47edaca4bb3d41ec9a3420b52e3be993 - default default] Bad response code while validating token: 504: GatewayTimeout: Gateway Timeout (HTTP 504)
2019-08-12 11:58:47.508 12156 WARNING keystonemiddleware.auth_token [req-b1994b3b-c1cf-4f17-95dc-739c0957b45e 1127d62ec95945e8b66f027dcd58b9d3 47edaca4bb3d41ec9a3420b52e3be993 - default default] Identity response: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
: GatewayTimeout: Gateway Timeout (HTTP 504)
2019-08-12 11:58:47.552 12156 CRITICAL keystonemiddleware.auth_token [req-b1994b3b-c1cf-4f17-95dc-739c0957b45e 1127d62ec95945e8b66f027dcd58b9d3 47edaca4bb3d41ec9a3420b52e3be993 - default default] Unable to validate token: Failed to fetch token data from identity server: ServiceError: Failed to fetch token data from identity server
~~~~

/var/log/heat/heat-engine.log
~~~
2019-08-12 12:01:41.195 6548 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:01:41.341 6548 DEBUG heat.engine.service [req-379e7b5f-de04-4118-b509-4cc203ad4f28 - - - - -] Service 8f006d0b-b7f6-4e82-904f-abf16f6205b9 is updated service_manage_report /usr/lib/python2.7/site-packages/heat/engine/service.py:2322
2019-08-12 12:01:46.120 6545 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:01:56.211 6548 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:02:01.131 6545 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:02:11.216 6548 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:02:13.012 6547 DEBUG heat.engine.service [req-2e5371fd-ab59-44a3-9f29-738b27a1c4bf - - - - -] Service ee71dc5c-0797-42d9-a30f-b713aa114fc7 is updated service_manage_report /usr/lib/python2.7/site-packages/heat/engine/service.py:2322
2019-08-12 12:02:16.140 6545 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:02:26.220 6548 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:02:31.155 6545 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
2019-08-12 12:02:33.093 6546 DEBUG heat.engine.service [req-4bc08958-6a85-432d-96b4-74330dfa8561 - - - - -] Service def7d6e4-0d16-443c-b83f-1d20c90164c2 is updated service_manage_report /usr/lib/python2.7/site-packages/heat/engine/service.py:2322

~~~

free output
~~~
              total        used        free      shared  buff/cache   available
Mem:       16266124     7488248     1487380         840     7290496     8308080
Swap:       8257532     7538456      719076
~~~

Comment 4 Bob Fournier 2019-08-15 15:40:26 UTC
The "Failure prepping block device" is a generic failure message that occurs when a node cannot be properly deployed.  There does not seem to be any Ironic related problems but most likely some fundamental networking configuration issues based on the errors accessing services on the undercloud - a small subset of the numerous errors is shown below.

- Please provide the full set of heat template files including nic configuration files used for deployment and the deployment command.

- Have there been any smaller-scale deployments to prove that the network is set up correctly, or is the deployment with 16 compute nodes and 5 Ceph nodes the first attempt?


Ironic API errors (could not bind to the API address, address already in use):
2019-05-22 13:30:42.983 7653 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:30:44.683 7671 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:30:46.741 7700 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:30:48.433 7716 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:49:50.937 12679 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:49:52.695 12692 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:49:54.463 12707 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:49:56.178 12723 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-22 13:49:57.927 12747 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-23 09:28:59.753 6649 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-23 09:29:01.354 6663 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-23 09:29:03.120 6675 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-23 09:29:04.723 6691 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-05-23 09:29:06.357 6712 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-08-07 16:03:19.099 9853 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-08-07 16:03:21.082 9879 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-08-07 16:03:22.892 9892 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-08-07 16:03:24.894 9908 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use
2019-08-07 16:03:26.942 9928 ERROR oslo.service.wsgi [-] Could not bind to 30.102.1.2:6385: error: [Errno 98] Address already in use

Neutron connection errors:
heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
neutron/server.log:2019-08-12 11:56:07.951 6662 ERROR oslo.messaging._drivers.impl_rabbit [-] [1b861f34-edb9-48be-88bf-c94a1dff3ea0] AMQP server on 30.102.1.2:5672 is unreachable: timed out. Trying again in 12 seconds.: timeout: timed out
neutron/server.log:2019-08-12 11:56:10.109 6658 ERROR oslo.messaging._drivers.impl_rabbit [-] [617e51d4-44ac-4b16-af6d-3d98c7634336] AMQP server on 30.102.1.2:5672 is unreachable: timed out. Trying again in 6 seconds.: timeout: timed out
neutron/server.log:2019-08-12 11:56:10.859 6662 ERROR oslo.messaging._drivers.impl_rabbit [-] [1fe1e451-4de4-4a21-ae31-bd9de0967046] AMQP server on 30.102.1.2:5672 is unreachable: timed out. Trying again in 4 seconds.: timeout: timed out
neutron/server.log:2019-08-12 11:56:10.860 6662 ERROR oslo.messaging._drivers.impl_rabbit [-] [c07bffb2-e930-4cd7-aa88-2b1bcd47b76f] AMQP server on 30.102.1.2:5672 is unreachable: timed out. Trying again in 4 seconds.: timeout: timed out
neutron/server.log:2019-08-12 11:56:14.939 6662 ERROR oslo.messaging._drivers.impl_rabbit [req-cbdb1392-f227-421e-a4f2-74f9ed90f8a0 - - - - -] [e8721e61-e507-40a4-9571-06b3d6d11246] AMQP server on 30.102.1.2:5672 is unreachable: timed out. Trying again in 12 seconds.: timeout: timed out

 - - -] Timeout in RPC method get_service_plugin_list. Waiting for 10 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID fed7d38240224c55bde47d259fa8ebc8
neutron/l3-agent.log:2019-08-12 05:11:21.686 6029 ERROR neutron.common.rpc [req-0d030130-61eb-499d-aea0-7465583fd130 - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 34 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID da123667f99345a18e4217fd9d5a41ea
neutron/l3-agent.log:2019-08-12 05:21:55.768 6029 ERROR neutron.common.rpc [req-d68fd0a2-9030-41cb-90f0-ec4d5c7ba01d - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 12 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID c7c16f2c22d54adebf1772f6df656760


neutron/dhcp-agent.log:2019-08-05 10:46:15.993 9234 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
neutron/dhcp-agent.log:2019-08-05 10:46:30.994 9234 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
neutron/dhcp-agent.log:2019-08-05 10:46:45.995 9234 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
neutron/dhcp-agent.log:2019-08-05 10:47:00.995 9234 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
neutron/dhcp-agent.log:2019-08-05 10:47:15.996 9234 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed

Nova connection errors:
nova/nova-compute.log:2019-08-12 11:56:54.277 6028 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
nova/nova-compute.log:2019-08-12 11:56:54.479 6028 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
nova/nova-compute.log:2019-08-12 11:57:09.523 6028 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
nova/nova-compute.log:2019-08-12 11:57:09.656 6028 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
nova/nova-compute.log:2019-08-12 11:57:24.705 6028 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed
nova/nova-compute.log:2019-08-12 11:57:24.968 6028 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed

Comment 6 Shailesh Chhabdiya 2019-08-21 08:47:35 UTC
Hello,

- Has there been any smaller scale deployments to prove that the network is set up correctly or is using the 16 compute nodes and 5 ceph nodes the first attempt at a deployment?
Yes, the customer has tried running the deployment with 3 controller, 3 Ceph, and 1 compute nodes, and still hits the same error.


Regards,
Shailesh Chhabdiya

Comment 7 Bob Fournier 2019-08-21 22:27:14 UTC
Looking at the supplied NIC configuration files, there are quite a few inconsistencies and deviations from the recommended configuration, which matters here because the logs clearly show connectivity problems between services on the controllers and computes (see Comment 4).  It's not clear how the architecture was designed, but it doesn't look like the recommended NIC config files are being used [1].  Note that in OSP 13 the actual yaml files are generated from the jinja templates using the instructions here [2].
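
For reference, rendering the default templates on the undercloud typically looks like the sketch below. This is a minimal example, not taken from this case: the template path is the default tripleo-heat-templates location, and the roles_data.yaml path and output directory are placeholder assumptions that must be adjusted to the actual environment.

~~~
# Render the jinja NIC templates into plain yaml files (paths are assumptions,
# not taken from this deployment):
cd /usr/share/openstack-tripleo-heat-templates
./tools/process-templates.py \
    -r /home/stack/roles_data.yaml \
    -o /home/stack/openstack-tripleo-heat-templates-rendered

# The rendered bond-with-vlans examples then appear under
# /home/stack/openstack-tripleo-heat-templates-rendered/network/config/bond-with-vlans/
# and can be copied and customized for each role.
~~~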

Some inconsistencies:

compute.yaml 
- uses a bond for the control plane network, which is not recommended; the only way this will work is if the switch is configured to use LACP fallback
- has each network on a separate bond on a separate bridge, which is not recommended

controller
- uses a bond for the control plane network, which is not recommended; the only way this will work is if the switch is configured to use LACP fallback
- has each network on the same bridge and bond, which differs from the compute configuration

ceph
- seems to follow the recommended template, and the control plane uses a single NIC

Here are some recommendations:

- Start with a small configuration of 1 controller and 1 compute and get that to work, to confirm that the switch is configured properly and that the NIC configuration matches the switch configuration
- Run introspection and verify it works properly. If it's a bare metal setup and LLDP is configured on the switch, you should be able to see the VLANs that are configured and compare them to what is expected via the command "$ openstack baremetal introspection interface list <node>" (see the command sketch after this list)
- For the deployment, use the recommended bond-with-vlans configuration from [1] and create the yaml files using [2].
     - Use 1 NIC for the control plane network
- When the servers are created, look at the IP configuration on the nodes themselves (use "openstack server list" and ssh heat-admin@<IPsInList>) to verify the addressing and connectivity on each network
- Once the simple configuration has confirmed that the networking configuration is correct, proceed to the full configuration and include the Ceph nodes.
- Also, regarding the VIF messages: after you delete the stack, you can check whether any VIFs (virtual interfaces) still exist by using the command "openstack baremetal node vif list <node>".  There shouldn't be any after the stack is deleted, but if a deployment was interrupted you can remove them using "openstack baremetal node vif detach" (see the command sketch below)
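
A minimal sketch of the verification and cleanup commands referenced above. The node name, server IP, and VIF ID are placeholders, not values from this case; adjust them to the actual environment.

~~~
# Compare the VLANs reported by LLDP during introspection against what the
# deployment expects (placeholder node name):
openstack baremetal introspection interface list <node>

# List the overcloud servers and log in to one to inspect its addressing and
# connectivity on each network:
openstack server list
ssh heat-admin@<server-ip>
ip addr

# After the stack has been deleted, check for leftover VIFs on a node and
# detach any that remain (the VIF ID comes from the list output):
openstack baremetal node vif list <node>
openstack baremetal node vif detach <node> <vif-id>
~~~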

As indicated earlier, there is no bug here, so I will close this out unless a specific bug can be pointed to.


[1] https://github.com/openstack/tripleo-heat-templates/tree/stable/queens/network/config
[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/advanced_overcloud_customization/custom-network-interface-templates

Comment 8 Bob Fournier 2019-09-24 12:57:07 UTC
The case has been closed and it appears that the issue has been resolved.  Closing this out.