Description of problem:

I'm not sure whether this is a Ceph issue. I had a running system with internal Ceph and deleted the overcloud. However, when I try to redeploy with the same overcloud_deploy.sh, the overcloud deployment fails to complete:

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list overcloud --long
overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: e369fd6b-f1bf-44ed-b8f5-184cad1045b7
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"
overcloud.Compute.1.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: 768e4a1f-fca4-41e6-948c-f0ba365ff486
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486., Code: 500"
overcloud.Controller.2.Controller:
  resource_type: OS::TripleO::ControllerServer
  physical_resource_id: 19cbe78f-fd0e-4c93-98af-f74414f71657
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 19cbe78f-fd0e-4c93-98af-f74414f71657., Code: 500"

Version-Release number of selected component (if applicable):
ceph-ansible-4.0.0-0.1.rc9.el8cp.noarch
puppet-ceph-3.0.1-0.20190612030451.e157497.el8ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190621160359.1415b04.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.1-0.20190614132452.79c0c76.el8ost.noarch
openstack-tripleo-heat-templates-10.5.1-0.20190701110422.889d4d4.el8ost.noarch
puppet-tripleo-10.4.2-0.20190701160408.ecbec17.el8ost.noarch
openstack-tripleo-common-containers-10.7.1-0.20190701090413.783a692.el8ost.noarch
python3-tripleoclient-heat-installer-11.4.1-0.20190628190406.7d41f81.el8ost.noarch
python3-tripleo-common-10.7.1-0.20190701090413.783a692.el8ost.noarch
python3-tripleoclient-11.4.1-0.20190628190406.7d41f81.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190604170406.2c8a6a5.el8ost.noarch
openstack-tripleo-validations-10.4.1-0.20190701170405.10b4d77.el8ost.noarch
openstack-tripleo-common-10.7.1-0.20190701090413.783a692.el8ost.noarch
ansible-tripleo-ipsec-9.1.1-0.20190513190404.ffe104c.el8ost.noarch

How reproducible:
I suspect this isn't the first time I've hit this issue.

Steps to Reproduce:
1. Deploy an overcloud with internal Ceph nodes.
2. Delete the overcloud:
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes
Undeploying stack overcloud...
Waiting for messages on queue 'tripleo' with no timeout.
Deleting plan overcloud...
Success.
3. Re-run the same overcloud_deploy.sh:
(undercloud) [stack@undercloud-0 ~]$ ./overcloud_deploy.sh
Creating Swift container to store the plan
Creating plan from template files in: /tmp/tripleoclient-d_16hes5/tripleo-heat-templates
Plan created.
Processing templates in the directory /tmp/tripleoclient-d_16hes5/tripleo-heat-templates
WARNING: Following parameter(s) are deprecated and still defined. Deprecated parameters will be removed soon!
  OvercloudControlFlavor
....
At the Ceph node deploy stage it fails, see below.

Actual results:
Re-deploy of the overcloud fails:

2019-07-07 08:37:00Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED  ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"
2019-07-07 08:37:00Z [overcloud.CephStorage.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-48b3-422e-a001-f3d47d027212., Code: 5
2019-07-07 08:37:00Z [overcloud.CephStorage.0]: CREATE_FAILED  ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-48b3-422e-a001-f3d47d027212., Code: 500"
2019-07-07 08:37:00Z [overcloud.CephStorage]: UPDATE_FAILED  Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-48b3-422e-a001-f3d47d027
2019-07-07 08:37:00Z [overcloud.CephStorage]: CREATE_FAILED  resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-4
2019-07-07 08:37:00Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures

Stack overcloud/84797b8f-b1fb-49d8-8dfb-b64762d9b5bf CREATE_FAILED

overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: e369fd6b-f1bf-44ed-b8f5-184cad1045b7
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"

And:

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list overcloud --long
overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: e369fd6b-f1bf-44ed-b8f5-184cad1045b7
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"
overcloud.Compute.1.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: 768e4a1f-fca4-41e6-948c-f0ba365ff486
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486., Code: 500"
overcloud.Controller.2.Controller:
  resource_type: OS::TripleO::ControllerServer
  physical_resource_id: 19cbe78f-fd0e-4c93-98af-f74414f71657
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 19cbe78f-fd0e-4c93-98af-f74414f71657., Code: 500"

Expected results:
The overcloud should deploy; it worked fine before I deleted my overcloud on the same host/VMs.

Additional info:
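For reference, one way to pull more detail on these build failures from the undercloud is via the Nova client. The UUID below is just the failed Ceph instance from the output above, and the specific columns shown are my assumption about what is most useful for an instance in ERROR state:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
(undercloud) [stack@undercloud-0 ~]$ openstack server show e369fd6b-f1bf-44ed-b8f5-184cad1045b7 -c status -c fault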
Created attachment 1588045 [details] Mistral and Nova logs Noticed some nova errors as well adding node list +--------------------------------------+--------------+--------+------------+-------------+----------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+--------------+--------+------------+-------------+----------+ | e369fd6b-f1bf-44ed-b8f5-184cad1045b7 | ceph-0 | ERROR | - | NOSTATE | | | d5724c0f-eb59-4593-b26b-bc1b56a56731 | compute-0 | BUILD | scheduling | NOSTATE | | | 768e4a1f-fca4-41e6-948c-f0ba365ff486 | compute-1 | ERROR | - | NOSTATE | | | 563f20b6-ab10-4f75-91cc-4d7b7abaf12b | controller-0 | BUILD | scheduling | NOSTATE | | | 2ed71038-19d4-4fbf-a9d9-985d00e5b90d | controller-1 | BUILD | scheduling | NOSTATE | | | 19cbe78f-fd0e-4c93-98af-f74414f71657 | controller-2 | ERROR | - | NOSTATE | | +--------------------------------------+--------------+--------+------------+-------------+----------+ /var/log/containers/nova/nova-compute.log.1:26528:2019-07-07 05:46:19.311 7 ERROR nova.compute.manager [instance: 768e4a1f-fca4-41e6-948c-f0ba365ff486] /var/log/containers/nova/nova-compute.log.1:26541:2019-07-07 05:46:21.472 7 DEBUG ironicclient.common.http [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Error contacting Ironic server: Node 30f04583-e5b4-4ad3-806f-b0dbb2e04166 is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409). Attempt 1 of 61 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:302 /var/log/containers/nova/nova-compute.log.1:26563:2019-07-07 05:46:25.584 7 DEBUG ironicclient.common.http [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Error contacting Ironic server: Node 30f04583-e5b4-4ad3-806f-b0dbb2e04166 is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409). Attempt 2 of 61 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:302 /var/log/containers/nova/nova-compute.log.1:26566:2019-07-07 05:46:29.691 7 DEBUG ironicclient.common.http [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Error contacting Ironic server: Node 30f04583-e5b4-4ad3-806f-b0dbb2e04166 is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409). Attempt 3 of 61 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:302 /var/log/containers/nova/nova-compute.log.1:26641:2019-07-07 05:46:44.784 7 DEBUG nova.compute.utils [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] [instance: 768e4a1f-fca4-41e6-948c-f0ba365ff486] Failed to provision instance 768e4a1f-fca4-41e6-948c-f0ba365ff486: Deploy failed for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486. Error: Unexpected error while running command. 
/var/log/containers/nova/nova-compute.log.1:26643:2019-07-07 05:46:44.788 7 DEBUG nova.compute.manager [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] [instance: 768e4a1f-fca4-41e6-948c-f0ba365ff486] Build of instance 768e4a1f-fca4-41e6-948c-f0ba365ff486 was re-scheduled: Failed to provision instance 768e4a1f-fca4-41e6-948c-f0ba365ff486: Deploy failed for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486. Error: Unexpected error while running command.

Looks like it isn't just a Ceph issue; compute-1 and controller-2 are also in an error state.
Unsure if same root cause, again had a working overcloud, which was deleted: (undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes Undeploying stack overcloud... Waiting for messages on queue 'tripleo' with no timeout. Deleting plan overcloud... Success. Then attempted to redeploy using same overcloud_deploy.sh, which failed: 2019-07-15 07:47:51Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE state changed 2019-07-15 07:48:11Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 591c1e13-21c7-490b-b3d6-b5e78515fcc9., Code: 500" 2019-07-15 07:48:11Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed 2019-07-15 07:48:13Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed 2019-07-15 07:48:24Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed 2019-07-15 07:48:47Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed 2019-07-15 07:49:59Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance b74f2166-e8f5-4359-b5fb-787337355359., Code: 500" 2019-07-15 07:49:59Z [overcloud.Compute.0]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca-4207-a056-571d47889499., Code: 5 2019-07-15 07:49:59Z [overcloud.Compute.0]: CREATE_FAILED ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca-4207-a056-571d47889499., Code: 500" 2019-07-15 07:49:59Z [overcloud.Compute]: UPDATE_FAILED Resource CREATE failed: ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca-4207-a056-571d47889 2019-07-15 07:50:00Z [overcloud.Compute]: CREATE_FAILED resources.Compute: Resource CREATE failed: ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca- 2019-07-15 07:50:00Z [overcloud]: CREATE_FAILED Resource CREATE failed: resources.Compute: Resource CREATE failed: ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for Stack overcloud/00d85ecc-af52-4f2b-9a36-570bd1da1d40 CREATE_FAILED overcloud.Compute.0.NovaCompute: resource_type: OS::TripleO::ComputeServer physical_resource_id: b74f2166-e8f5-4359-b5fb-787337355359 status: CREATE_FAILED status_reason: | ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance b74f2166-e8f5-4359-b5fb-787337355359., Code: 500" Heat Stack create failed. Heat Stack create failed. 
I'd forgotten to create an sosreport before reusing the server, so no logs this time :( But I'll upload logs the next time I hit this.
See these errors in nova-conductor.log for a couple of the instances that are in an error state.

2019-07-07 04:26:24.130 18 WARNING nova.scheduler.client.report [req-e8b5247a-f2cc-40b0-870c-b43425b1e0e3 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Failed to save allocation for 2ed71038-19d4-4fbf-a9d9-985d00e5b90d. Got HTTP 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'CUSTOM_CONTROLLER' on resource provider '8d3244f0-649d-4700-8d51-ef5d09669d6f'. The requested amount would exceed the capacity. ", "code": "placement.undefined_code", "request_id": "req-a7c3578f-f0be-4b0f-b697-142d1dc5863e"}]}

Followed by:

File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 514, in _wait_for_active\n raise exception.InstanceDeployFailure(msg)\n', "nova.exception.InstanceDeployFailure: Failed to provision instance 563f20b6-ab10-4f75-91cc-4d7b7abaf12b: Deploy failed for instance 563f20b6-ab10-4f75-91cc-4d7b7abaf12b. Error: Unexpected error while running command.\nCommand: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.15:3260\nExit code: 7\nStdout: ''\nStderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\\niscsiadm: Could not perform SendTargets discovery: invalid parameter\\n'\n",

It's unclear why these iscsiadm failures are occurring, or whether they're just a symptom of another issue. It would be good to see an sosreport if this failure occurs again. It would also be useful to see the "openstack baremetal node list" and "node show" output for the nodes that have the failed instances.
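For example, something along these lines should capture what's needed (the node UUID is a placeholder, and the specific fields are just the ones I'd expect to be most relevant):

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node show <node-uuid> -f value -c provision_state -c power_state -c last_error -c maintenance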
I'll keep an eye on this issue, and will upload logs and add command output should I hit it again.

BTW, could I ask something else here that might be related (or not)? When deleting the overcloud I sometimes notice:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes
Undeploying stack overcloud...
Waiting for messages on queue 'tripleo' with no timeout.
Deleting plan overcloud...
Success.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 44630)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 33738)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 57782), raddr=('192.168.24.2', 13989)>

Why do I get the last three lines? Is this another bug, or safe to ignore? If I should open a bug, which component would it land on? Thanks
I don't think those log messages on delete matter, as it's just part of shutdown, unless you see a problem with the nodes using these IPs.
Any update on getting logs? Is this still occurring?
I'll try to reproduce as soon as I finish with a current Ceph deployment which is being used for another BZ. Should it fail again, I'll update and upload logs.
Still failing. Had a working overcloud with ceph which I deleted (overcloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes Undeploying stack overcloud... No stack found ('overcloud'), skipping delete Deleting plan overcloud... Error occurred while deleting plan publicURL endpoint for workflowv2 service in regionOne region not found sys:1: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.0.11', 53942), raddr=('10.0.0.130', 5000)> sys:1: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.0.11', 39068), raddr=('10.0.0.130', 8004)> (overcloud) [stack@undercloud-0 ~]$ . stackrc (undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes Undeploying stack overcloud... Waiting for messages on queue 'tripleo' with no timeout. Deleting plan overcloud... Success. Nodes are offline and available, so I should be able to redeploy. (undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list +--------------------------------------+--------------+---------------+-------------+--------------------+-------------+ | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | +--------------------------------------+--------------+---------------+-------------+--------------------+-------------+ | f66521a6-696e-422a-bcfb-d6f31e0efa33 | ceph-0 | None | power off | available | False | | 698f1647-99cc-45d3-b9cf-13080242db89 | compute-0 | None | power off | available | False | | 4ead4af4-e8de-4389-ad1a-5b94a865f8ee | controller-0 | None | power off | available | False | | e6c0cd4f-e594-4874-8762-ed96378bbc17 | controller-1 | None | power off | available | False | | 118cf06b-b516-4276-acde-789969919e72 | controller-2 | None | power off | available | False | +--------------------------------------+--------------+---------------+-------------+--------------------+-------------+ Now rerun same overcloud_deploy.sh that was used before. (undercloud) [stack@undercloud-0 ~]$ ./overcloud_deploy.sh ... During the overcloud redeploy I noticed: 2019-09-11 04:53:33Z [overcloud.Controller.2.DeploymentActions]: CREATE_COMPLETE state changed 2019-09-11 04:55:33Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown" 2019-09-11 04:55:33Z [overcloud.CephStorage.0.CephStorage]: DELETE_IN_PROGRESS state changed 2019-09-11 04:55:36Z [overcloud.CephStorage.0.CephStorage]: DELETE_COMPLETE state changed 2019-09-11 04:55:39Z [overcloud.CephStorage.0.CephStorage]: CREATE_IN_PROGRESS state changed 2019-09-11 04:56:02Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 58e70c30-1be3-46ba-8373-f35774e62a3c., Code: 500" 2019-09-11 04:56:02Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS state changed 2019-09-11 04:56:07Z [overcloud.Controller.2.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. 
Exhausted all hosts available for retrying build failures for instance fe0fdc68-fd50-43b1-b9b2-551efdd1964a., Code: 500" 2019-09-11 04:56:07Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS state changed 2019-09-11 04:56:08Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE state changed 2019-09-11 04:56:08Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 683fafc7-3e10-424d-af5b-36b840b6dc11., Code: 500" 2019-09-11 04:56:08Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed 2019-09-11 04:56:12Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed 2019-09-11 04:56:12Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE state changed 2019-09-11 04:56:12Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed 2019-09-11 04:56:16Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS state changed 2019-09-11 04:56:16Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed 2019-09-11 04:57:38Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 04f03a7d-0b9c-412f-b3dc-211c0e17d02d., Code: 500" 2019-09-11 04:57:38Z [overcloud.CephStorage.0.CephStorage]: DELETE_IN_PROGRESS state changed 2019-09-11 04:57:41Z [overcloud.CephStorage.0.CephStorage]: DELETE_COMPLETE state changed 2019-09-11 04:57:47Z [overcloud.CephStorage.0.CephStorage]: CREATE_IN_PROGRESS state changed .. 2019-09-11 05:06:46Z [overcloud.Controller.1.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 3ffa0781-9434-407c-87ee-6f6bef05358c., Code: 500" 2019-09-11 05:06:46Z [overcloud.Controller.1.Controller]: DELETE_IN_PROGRESS state changed 2019-09-11 05:06:49Z [overcloud.Controller.1.Controller]: DELETE_COMPLETE state changed 2019-09-11 05:06:52Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. 
Exhausted all hosts available for retrying build failures for instance 51a42528-46ed-4758-97cb-637cf767e395., Code: 500"
2019-09-11 05:06:52Z [overcloud.CephStorage.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:53Z [overcloud.CephStorage.0]: CREATE_FAILED  ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:53Z [overcloud.CephStorage]: UPDATE_FAILED  Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:54Z [overcloud.CephStorage]: CREATE_FAILED  resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:54Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"

Stack overcloud/083f0015-8eca-4ae9-8833-0c31da1dd830 CREATE_FAILED

overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: 51a42528-46ed-4758-97cb-637cf767e395
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 51a42528-46ed-4758-97cb-637cf767e395., Code: 500"

Heat Stack create failed.
Heat Stack create failed.

Again the redeploy of a previously deleted overcloud fails. To be honest I'm not sure Ceph is to blame; maybe this would also happen on a non-Ceph-backed overcloud. Anyway, the reproducer is easy: deploy an overcloud, delete it, and try to redeploy. Maybe the OC delete command (below) isn't deleting everything it should correctly?

openstack overcloud delete overcloud --yes
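To check that theory, a quick sanity pass after "openstack overcloud delete overcloud --yes" and before the redeploy might help; these are just standard client commands, not a complete cleanup audit:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list            # no overcloud stack should remain
(undercloud) [stack@undercloud-0 ~]$ openstack server list           # no overcloud instances should remain
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list   # nodes powered off and "available"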
Thanks, looking... Still seeing these iscsiadm errors in ironic-conductor.log 2019-09-11 00:55:06.283 7 DEBUG oslo_concurrency.processutils [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] CMD "sudo ironic-rootwrap /etc/i ronic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260" returned: 7 in 0.277s execute /usr/lib/python3.6/site-packages/oslo_concurre ncy/processutils.py:409 2019-09-11 00:55:06.284 7 DEBUG oslo_concurrency.processutils [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] 'sudo ironic-rootwrap /etc/ironi c/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260' failed. Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/proc essutils.py:461 ... 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] Deploy of instance 157ed05c-7fb6-4518-81e6-43df773264ad on node f66521a6-696e-422a-bcfb-d6f31e0efa33 failed with exception: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. Command: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260 Exit code: 7 Stdout: '' Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n' 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Traceback (most recent call last): 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 208, in continue_deploy 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy uuid_dict_returned = deploy_utils.deploy_partition_image(**params) 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/deploy_utils.py", line 406, in deploy_partition_image 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy with _iscsi_setup_and_handle_errors(address, port, iqn, lun) as dev: 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__ 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy return next(self.gen) 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/deploy_utils.py", line 457, in _iscsi_setup_and_handle_errors 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy discovery(address, port) 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/deploy_utils.py", line 141, in discovery 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy delay_on_retry=True) 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy File "/usr/lib/python3.6/site-packages/ironic/common/utils.py", line 75, in execute 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy result = processutils.execute(*cmd, **kwargs) 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy File "/usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py", line 424, in execute 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy cmd=sanitized_cmd) 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. 
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Command: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Exit code: 7 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Stdout: '' 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n' 2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy 2019-09-11 00:55:07.285 7 DEBUG ironic.drivers.modules.agent_client [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] Executing agent command log.collect_system_logs for node f66521a6-696e-422a-bcfb-d6f31e0efa33 _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:85
It looks like IPA on the node is set up correctly as an iSCSI target, for example in deploy/f66521a6-696e-422a-bcfb-d6f31e0efa33_89015206-0ccf-4181-91e5-4b124307e390_2019-09-11-04-59-14:

Sep 11 00:59:07 host-192-168-24-36 ironic-python-agent[685]: 2019-09-11 00:59:07.941 685 INFO ironic_python_agent.extensions.iscsi [-] Created iSCSI target with iqn iqn.2008-10.org.openstack:f66521a6-696e-422a-bcfb-d6f31e0efa33, portal port 3260, on device /dev/vda using linux-io
Sep 11 00:59:07 host-192-168-24-36 ironic-python-agent[685]: 2019-09-11 00:59:07.943 685 INFO root [-] Command iscsi.start_iscsi_target completed: Command name: start_iscsi_target, params: {'iqn': 'iqn.2008-10.org.openstack:f66521a6-696e-422a-bcfb-d6f31e0efa33', 'portal_port': 3260, 'wipe_disk_metadata': True}, status: SUCCEEDED, result: {'iscsi_target_iqn': 'iqn.2008-10.org.openstack:f66521a6-696e-422a-bcfb-d6f31e0efa33'}.

The node is using this IP:

ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 192.168.24.36/24 brd 192.168.24.255 scope global dynamic noprefixroute ens3

But the discovery failed from ironic to this IP:

ironic-conductor.log.1:2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] Deploy of instance 157ed05c-7fb6-4518-81e6-43df773264ad on node f66521a6-696e-422a-bcfb-d6f31e0efa33 failed with exception: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
ironic-conductor.log.1:Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n' for node f66521a6-696e-422a-bcfb-d6f31e0efa33: ironic.common.exception.InstanceDeployFailure: Deploy failed for instance 157ed05c-7fb6-4518-81e6-43df773264ad. Error: Unexpected error while running command.

2019-09-11 00:59:13.582 7 ERROR ironic.drivers.modules.iscsi_deploy [req-b6ed196a-df71-4325-bcb8-1853d3455dda - - - - -] Deploy of instance 89015206-0ccf-4181-91e5-4b124307e390 on node f66521a6-696e-422a-bcfb-d6f31e0efa33 failed with exception: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.36:3260
Exit code: 7
Stdout: ''
Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n'
2019-09-11 00:59:13.582 7 ERROR ironic.drivers.modules.iscsi_deploy Traceback (most recent call last):

This looks potentially like a networking issue, although that doesn't explain why it failed after a redeploy. We have seen iSCSI discovery problems when there were MTU mismatches, e.g. the undercloud was set for > 9000 and the nodes 1500, but that does not look to be the case here.
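If it helps narrow things down, a manual check from the undercloud while a node is still in the deploy ramdisk would be to run the same discovery by hand; the IP is the node address from the log above and will differ per deploy, and reading the initiator name out of the iscsid container is an assumption about where it lives in this setup:

[stack@undercloud-0 ~]$ sudo iscsiadm -m discovery -t st -p 192.168.24.36:3260
[stack@undercloud-0 ~]$ sudo podman exec iscsid cat /etc/iscsi/initiatorname.iscsi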
Because this only occurs after a stack is deleted and redeployed, the theory is that iscsid still has open sessions from the previous deployment.

Questions:
- Is this a pure virtual setup (nodes are all VMs)?
- If this is reproducible, can you check for established sessions using netstat after you delete the overcloud and redeploy? Specifically sessions to iSCSI port 3260.
- If the deployment fails, it may be useful to look in /var/lib/iscsi/send_targets to see if a target that matches the node that failed is there.

It appears that an iSCSI session to port 3260 can't be created, which is why the deployment is failing.
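Roughly, the checks above look like this when run on the undercloud after the delete and during/after the redeploy (the session listing is an extra I'd add, not something we've confirmed is populated here):

[stack@undercloud-0 ~]$ sudo netstat -tpan | grep 3260      # established sessions to an iSCSI target?
[stack@undercloud-0 ~]$ ls /var/lib/iscsi/send_targets/     # leftover targets from the previous deployment?
[stack@undercloud-0 ~]$ sudo iscsiadm -m session            # any open iSCSI sessions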
Yes, Titan31 is a pure VM env. Yes, this is reproducible. As I don't have a working OC at the moment, I deleted the stack again via:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes
Undeploying stack overcloud...
Waiting for messages on queue 'tripleo' with no timeout.
Deleting plan overcloud...
Success.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 58854)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 38330)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 34058), raddr=('192.168.24.2', 13989)>

Now let's check netstat:

(undercloud) [stack@undercloud-0 ~]$ sudo netstat -tlpn | grep 3260
(undercloud) [stack@undercloud-0 ~]$

Nothing looks open, and there are no files under /var/lib/iscsi/send_targets/.

So I re-kicked another overcloud_deploy.sh while watching the iscsi folder; I didn't see anything appear under it. Deployment failed again.
Maybe check all sockets with "sudo netstat -tpan", but yeah, probably nothing is still open. The fact that there are no files under /var/lib/iscsi/send_targets/ just indicates iscsiadm isn't able to access the targets; we normally see entries for all nodes:

$ ls /var/lib/iscsi/send_targets/
10.8.146.11,3260 10.8.146.16,3260 10.8.146.18,3260 10.8.146.7,3260 10.8.146.9,3260
10.8.146.12,3260 10.8.146.17,3260 10.8.146.6,3260 10.8.146.8,3260

After the stack failure, can you check whether any devices respond to discovery? (I just got onto Titan31 but it looks like the deployments have been cleared.)

$ sudo iscsiadm -m discovery

Maybe restart iscsid. If you have another failure I can look again on Titan31.
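In case it's useful, the restart I have in mind is of the iscsid container on the undercloud; managing it directly with podman is an assumption about how it's run here rather than a documented procedure:

[stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsid
[stack@undercloud-0 ~]$ sudo podman logs --tail 50 iscsid
[stack@undercloud-0 ~]$ sudo podman restart iscsid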
Also including Storage DFG for ideas as to what can cause "iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter" failures.
Would also be useful to run tcpdump to capture traffic to port 3260 during the deployment.
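Something like the following should do it, assuming the provisioning traffic goes over br-ctlplane (writing to a pcap just makes it easy to attach here):

[stack@undercloud-0 ~]$ sudo tcpdump -i br-ctlplane -nn -w /tmp/iscsi-3260.pcap port 3260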
Not positive that this is the same issue, but I was looking at a problem with pkesavar in which rebooting the undercloud resulted in the iscsid container exiting:

(undercloud) [stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsi
0c2275a4f63d 192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1 dumb-init --singl... 23 hours ago Exited (1) About an hour ago iscsid

The error was:

++ [[ ! -f /etc/iscsi/initiatorname.iscsi ]]
Running command: '/usr/sbin/iscsid -f'
+ echo 'Running command: '\''/usr/sbin/iscsid -f'\'''
+ exec /usr/sbin/iscsid -f
iscsid: Can not bind IPC socket

This caused the same problem as above, with "iscsiadm: Cannot perform discovery. Invalid Initiatorname" in the ironic logs when attempting to deploy.

We tracked the "Can not bind IPC socket" issue to https://bugzilla.redhat.com/show_bug.cgi?id=1642582, which has been closed. See the recommendation in https://bugzilla.redhat.com/show_bug.cgi?id=1642582#c2 to disable the iscsi services that are preventing the container from starting.

Please check the iscsid container to see if it exited and, if so, disable the iscsi services and restart it.
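Concretely, the check and workaround would look roughly like this; the exact unit names to disable come from the recommendation in bug 1642582 comment 2 and may differ on your system, so treat this as a sketch rather than a verified procedure:

[stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsid
[stack@undercloud-0 ~]$ sudo podman logs iscsid 2>&1 | grep -i "bind IPC"
[stack@undercloud-0 ~]$ sudo systemctl disable --now iscsid.socket iscsid.service iscsiuio.socket
[stack@undercloud-0 ~]$ sudo podman restart iscsid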
Installed a fresh deployment. Here is the state of the iscsi service and containers before the overcloud delete. BTW, on my original bug I didn't reboot the undercloud, so if this latest tip only applies after the undercloud is restarted, it doesn't match my case.

[stack@undercloud-0 ~]$ systemctl -a | grep iscsid.socket
iscsid.socket loaded inactive dead Open-iSCSI iscsid Socket

[stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsi
94889421ba1f 192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1 dumb-init --singl... 3 hours ago Up 3 hours ago iscsid
e07ca8bbf626 192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1 /var/lib/containe... 3 hours ago Exited (0) 3 hours ago container-puppet-iscsid

Note both containers are present and iscsid.socket is dead.

Before deleting we have a working OC:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| ID                                   | Stack Name | Project                          | Stack Status    | Creation Time        | Updated Time |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| 89d722c2-552d-4962-9e1a-dbcb620d6c54 | overcloud  | 616779bfdf304e3cb537a24c10a3e801 | CREATE_COMPLETE | 2019-10-03T09:45:03Z | None         |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+

Delete the OC:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes
Undeploying stack overcloud...
Waiting for messages on queue ......

(undercloud) [stack@undercloud-0 ~]$ openstack stack list - overcloud gone.

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| dfc0dee0-7647-4fe5-a238-f28548ce7f44 | ceph-0       | None          | power off   | available          | False       |
| d01fbd1e-bfdc-4f8e-9b0c-740fd6029453 | compute-0    | None          | power off   | available          | False       |
| b305adfa-07f8-4f5a-a427-781814e65e6d | compute-1    | None          | power off   | available          | False       |
| c177c850-e7c7-436a-a003-7dd0d059573d | controller-0 | None          | power off   | available          | False       |
| 98da825e-395e-4cc6-8fc4-182c5dc7bbda | controller-1 | None          | power off   | available          | False       |
| 22b13de8-e610-4721-b092-88886ff6ca7e | controller-2 | None          | power off   | available          | False       |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+

Nodes are all available; recheck service and container state:

(undercloud) [stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsi
94889421ba1f 192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1 dumb-init --singl... 3 hours ago Up 3 hours ago iscsid
e07ca8bbf626 192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1 /var/lib/containe...
3 hours ago Exited (0) 3 hours ago container-puppet-iscsid (undercloud) [stack@undercloud-0 ~]$ systemctl -a | grep iscsid.socket iscsid.socket loaded inactive dead Open-iSCSI iscsid Socket [stack@undercloud-0 ~]$ sudo netstat -tlpn | grep 3260 returns nothing so does sudo tcpdump -i br-ctlplane -X -vv 'port 3260' Hit the overcloud_deploy.sh again, while watching netstat and tcpdump in the background. So at around this time of overcloud deployment-> CREATE_COMPLETE state changed 2019-10-03 12:11:25Z [overcloud.Compute.0.NetHostMap]: CREATE_IN_PROGRESS state changed 2019-10-03 12:11:25Z [overcloud.Compute.0.NetHostMap]: CREATE_COMPLETE state changed 2019-10-03 12:11:33Z [overcloud.Compute.1.NovaCompute]: CREATE_COMPLETE state changed 2019-10-03 12:11:36Z [overcloud.Compute.0.StoragePort]: CREATE_COMPLETE state changed 2019-10-03 12:11:36Z [overcloud.Compute.0.InternalApiPort]: CREATE_COMPLETE state changed 2019-10-03 12:11:37Z [overcloud.Compute.0.TenantPort]: CREATE_COMPLETE state changed 2019-10-03 12:11:37Z [overcloud.Compute.1.StoragePort]: CREATE_IN_PROGRESS state changed 2019-10-03 12:11:37Z [overcloud.Compute.1.PreNetworkConfig]: CREATE_IN_PROGRESS state changed tcpdump started to spew out lots of such: 08:10:53.837677 IP (tos 0x0, ttl 64, id 19937, offset 0, flags [DF], proto TCP (6), length 100) 192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [P.], cksum 0xb1d2 (incorrect -> 0x7d5a), seq 13038397:13038445, ack 3851723552, win 24576, options [nop,nop,TS val 1469370364 ecr 920030516], length 48 0x0000: 4500 0064 4de1 4000 4006 3b37 c0a8 182a E..dM.@.@.;7...* 0x0010: c0a8 1801 0cbc a0bc 741d 3b84 9c17 5b6c ........t.;...[l 0x0020: 8018 6000 b1d2 0000 0101 080a 5794 cffc ..`.........W... 0x0030: 36d6 8d34 2180 0000 0000 0000 0000 0000 6..4!........... 0x0040: 0000 0000 3700 0000 0000 0000 40c0 ebce ....7.......@... 0x0050: 0000 1113 0000 1152 0000 0000 0000 0000 .......R........ 0x0060: 0000 0000 .... 08:10:53.837700 IP (tos 0x0, ttl 64, id 20420, offset 0, flags [DF], proto TCP (6), length 52) undercloud-0.localdomain.41148 > 192.168.24.42.iscsi-target: Flags [.], cksum 0xb1a2 (incorrect -> 0xb63f), seq 3851723552, ack 13038445, win 52886, options [nop,nop,TS val 920030517 ecr 1469370364], length 0 0x0000: 4500 0034 4fc4 4000 4006 3984 c0a8 1801 E..4O.@.@.9..... 0x0010: c0a8 182a a0bc 0cbc 9c17 5b6c 741d 3bb4 ...*......[lt.;. 0x0020: 8010 ce96 b1a2 0000 0101 080a 36d6 8d35 ............6..5 0x0030: 5794 cffc W... 08:10:53.851572 IP (tos 0x0, ttl 64, id 20421, offset 0, flags [DF], proto TCP (6), length 100) undercloud-0.localdomain.41148 > 192.168.24.42.iscsi-target: Flags [P.], cksum 0xb1d2 (incorrect -> 0xf9d5), seq 3851723552:3851723600, ack 13038445, win 52886, options [nop,nop,TS val 920030531 ecr 1469370364], length 48 0x0000: 4500 0064 4fc5 4000 4006 3953 c0a8 1801 E..dO.@.@.9S.... 0x0010: c0a8 182a a0bc 0cbc 9c17 5b6c 741d 3bb4 ...*......[lt.;. 0x0020: 8018 ce96 b1d2 0000 0101 080a 36d6 8d43 ............6..C 0x0030: 5794 cffc 4680 0000 0000 0000 0000 0000 W...F........... 0x0040: 0000 0000 3800 0000 0000 0000 0000 1113 ....8........... 0x0050: 40c0 ebcf 0000 0000 0000 0000 0000 0000 @............... 0x0060: 0000 0000 .... 
08:10:53.851736 IP (tos 0x0, ttl 64, id 19938, offset 0, flags [DF], proto TCP (6), length 100) 192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [P.], cksum 0xb1d2 (incorrect -> 0x76dc), seq 13038445:13038493, ack 3851723600, win 24576, options [nop,nop,TS val 1469370378 ecr 920030531], length 48 0x0000: 4500 0064 4de2 4000 4006 3b36 c0a8 182a E..dM.@.@.;6...* 0x0010: c0a8 1801 0cbc a0bc 741d 3bb4 9c17 5b9c ........t.;...[. 0x0020: 8018 6000 b1d2 0000 0101 080a 5794 d00a ..`.........W... 0x0030: 36d6 8d43 2680 0000 0000 0000 0000 0000 6..C&........... 0x0040: 0000 0000 3800 0000 0000 0000 40c0 ebcf ....8.......@... 0x0050: 0000 1113 0000 1152 0000 0000 0000 0000 .......R........ 0x0060: 0000 0000 .... 08:10:53.851763 IP (tos 0x0, ttl 64, id 19939, offset 0, flags [DF], proto TCP (6), length 52) 192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [F.], cksum 0xb1a2 (incorrect -> 0x2459), seq 13038493, ack 3851723600, win 24576, options [nop,nop,TS val 1469370378 ecr 920030531], length 0 0x0000: 4500 0034 4de3 4000 4006 3b65 c0a8 182a E..4M.@.@.;e...* 0x0010: c0a8 1801 0cbc a0bc 741d 3be4 9c17 5b9c ........t.;...[. 0x0020: 8011 6000 b1a2 0000 0101 080a 5794 d00a ..`.........W... 0x0030: 36d6 8d43 6..C 08:10:53.854108 IP (tos 0x0, ttl 64, id 20422, offset 0, flags [DF], proto TCP (6), length 52) undercloud-0.localdomain.41148 > 192.168.24.42.iscsi-target: Flags [F.], cksum 0xb1a2 (incorrect -> 0xb5bf), seq 3851723600, ack 13038494, win 52886, options [nop,nop,TS val 920030533 ecr 1469370378], length 0 0x0000: 4500 0034 4fc6 4000 4006 3982 c0a8 1801 E..4O.@.@.9..... 0x0010: c0a8 182a a0bc 0cbc 9c17 5b9c 741d 3be5 ...*......[.t.;. 0x0020: 8011 ce96 b1a2 0000 0101 080a 36d6 8d45 ............6..E 0x0030: 5794 d00a W... 08:10:53.854222 IP (tos 0x0, ttl 64, id 19940, offset 0, flags [DF], proto TCP (6), length 52) 192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [.], cksum 0xb1a2 (incorrect -> 0x2453), seq 13038494, ack 3851723601, win 24576, options [nop,nop,TS val 1469370381 ecr 920030533], length 0 0x0000: 4500 0034 4de4 4000 4006 3b64 c0a8 182a E..4M.@.@.;d...* 0x0010: c0a8 1801 0cbc a0bc 741d 3be5 9c17 5b9d ........t.;...[. 0x0020: 8010 6000 b1a2 0000 0101 080a 5794 d00d ..`.........W... 0x0030: 36d6 8d45 6..E And overcloud deploy failed again, I'll attach ansible log. Per disabling iscsi unsure if I should stop docker or the service. Any way the current state is: (undercloud) [stack@undercloud-0 ~]$ systemctl -a | grep iscsid.socket iscsid.socket loaded inactive dead Open-iSCSI iscsid Socket 94889421ba1f 192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1 dumb-init --singl... 4 hours ago Up 4 hours ago iscsid e07ca8bbf626 192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1 /var/lib/containe... 4 hours ago Exited (0) 4 hours ago container-puppet-iscsid All three look to be at the same state they were before we deleted or attempted to redeploy the overcloud.
Created attachment 1622302 [details] Ansible deploy failed overcloud deployment.
Thanks for checking this, it looks like you didn't get the same container exit issue. Just to confirm, did you have the "iscsiadm: Cannot perform discovery. Invalid Initiatorname." error when the deploy failed? Will try to reproduce this.
Created attachment 1622304 [details]
iscsid and nova logs

Yeah, I guess it's not the same error this time. I didn't notice the "cannot perform discovery" error.

[stack@undercloud-0 ~]$ sudo grep -irn Initiatorname /var/log/containers/
/var/log/containers/stdouts/iscsid.log:6:2019-10-03T05:01:17.210560406-04:00 stderr F INFO:__main__:Copying /var/lib/kolla/config_files/src-iscsid/initiatorname.iscsi to /etc/iscsi/initiatorname.iscsi
/var/log/containers/stdouts/iscsid.log:13:2019-10-03T05:01:17.218676057-04:00 stderr F ++ [[ ! -f /etc/iscsi/initiatorname.iscsi ]]
/var/log/containers/stdouts/nova_compute.log:81:2019-10-03T05:07:07.872767411-04:00 stderr F INFO:__main__:Deleting /etc/iscsi/initiatorname.iscsi
/var/log/containers/stdouts/nova_compute.log:82:2019-10-03T05:07:07.872926026-04:00 stderr F INFO:__main__:Copying /var/lib/kolla/config_files/src-iscsid/initiatorname.iscsi to /etc/iscsi/initiatorname.iscsi
OK thanks, so we don't know if this disproved the theory that the "iscsiadm: Cannot perform discovery. Invalid Initiatorname." is caused by the "Can not Bind IPC socket" issue, i.e. https://bugzilla.redhat.com/show_bug.cgi?id=1642582 since we had a different failure.
Hi Tzach - are you still hitting this issue, i.e. the one with "iscsiadm: Cannot perform discovery. Invalid Initiatorname"? I'm pretty sure it's due to https://bugzilla.redhat.com/show_bug.cgi?id=1642582 and can be cleared by disabling the iscsi services per that bug, so I would like to close this out.
Hi Bob,
I've been a bit busy with other stuff and didn't have a chance to retest this a third time after comment #25. I'm not sure this was resolved, as my error just changed in #25. However, as I don't have time to retest at the moment, you may close the BZ. Should I re-hit this I'll reopen or report a new bug.
Thanks