Bug 1727587

Summary: Failing to redeploy overcloud - ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries"
Product: Red Hat OpenStack Reporter: Tzach Shefi <tshefi>
Component: openstack-tripleo    Assignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG QA Contact: Arik Chernetsky <achernet>
Severity: high Docs Contact:
Priority: high    
Version: 15.0 (Stein)    CC: bfournie, dsneddon, emacchi, mburns, ssmolyak
Target Milestone: ---    Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-11-14 15:13:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1752481    
Attachments:
Description                                    Flags
Mistral and Nova logs                          none
Ansible deploy failed overcloud deployment.    none
iscsid and nova logs                           none

Description Tzach Shefi 2019-07-07 11:57:12 UTC
Description of problem: Unsure whether this is a Ceph issue. I had a running system with internal Ceph and deleted the overcloud.
However, when I try to redeploy with the same overcloud_deploy.sh, the overcloud deployment fails to complete:

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list overcloud --long
overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: e369fd6b-f1bf-44ed-b8f5-184cad1045b7
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"
overcloud.Compute.1.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: 768e4a1f-fca4-41e6-948c-f0ba365ff486
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486., Code: 500"
overcloud.Controller.2.Controller:
  resource_type: OS::TripleO::ControllerServer
  physical_resource_id: 19cbe78f-fd0e-4c93-98af-f74414f71657
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 19cbe78f-fd0e-4c93-98af-f74414f71657., Code: 500"


Version-Release number of selected component (if applicable):
ceph-ansible-4.0.0-0.1.rc9.el8cp.noarch
puppet-ceph-3.0.1-0.20190612030451.e157497.el8ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190621160359.1415b04.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.1-0.20190614132452.79c0c76.el8ost.noarch
openstack-tripleo-heat-templates-10.5.1-0.20190701110422.889d4d4.el8ost.noarch
puppet-tripleo-10.4.2-0.20190701160408.ecbec17.el8ost.noarch
openstack-tripleo-common-containers-10.7.1-0.20190701090413.783a692.el8ost.noarch
python3-tripleoclient-heat-installer-11.4.1-0.20190628190406.7d41f81.el8ost.noarch
python3-tripleo-common-10.7.1-0.20190701090413.783a692.el8ost.noarch
python3-tripleoclient-11.4.1-0.20190628190406.7d41f81.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190604170406.2c8a6a5.el8ost.noarch
openstack-tripleo-validations-10.4.1-0.20190701170405.10b4d77.el8ost.noarch
openstack-tripleo-common-10.7.1-0.20190701090413.783a692.el8ost.noarch
ansible-tripleo-ipsec-9.1.1-0.20190513190404.ffe104c.el8ost.noarch



How reproducible:
Suspect this isn't the first time I've hit this issue.

Steps to Reproduce:
1. Deploy an overcloud with internal Ceph nodes.

2. Delete the overcloud:
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes
Undeploying stack overcloud...
Waiting for messages on queue 'tripleo' with no timeout.
Deleting plan overcloud...
Success.



3. Rerun the same overcloud_deploy.sh:

(undercloud) [stack@undercloud-0 ~]$ ./overcloud_deploy.sh 
Creating Swift container to store the plan
Creating plan from template files in: /tmp/tripleoclient-d_16hes5/tripleo-heat-templates
Plan created.
Processing templates in the directory /tmp/tripleoclient-d_16hes5/tripleo-heat-templates
WARNING: Following parameter(s) are deprecated and still defined. Deprecated parameters will be removed soon!
  OvercloudControlFlavor
....  
The deployment fails at the Ceph node stage; see below:


Actual results:
Redeploy of the overcloud fails:

2019-07-07 08:37:00Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED  ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"
2019-07-07 08:37:00Z [overcloud.CephStorage.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-48b3-422e-a001-f3d47d027212., Code: 5
2019-07-07 08:37:00Z [overcloud.CephStorage.0]: CREATE_FAILED  ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-48b3-422e-a001-f3d47d027212., Code: 500"
2019-07-07 08:37:00Z [overcloud.CephStorage]: UPDATE_FAILED  Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-48b3-422e-a001-f3d47d027
2019-07-07 08:37:00Z [overcloud.CephStorage]: CREATE_FAILED  resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a5dea9d9-4
2019-07-07 08:37:00Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures

 Stack overcloud/84797b8f-b1fb-49d8-8dfb-b64762d9b5bf CREATE_FAILED 

overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: e369fd6b-f1bf-44ed-b8f5-184cad1045b7
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"


And 
(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list overcloud --long
overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: e369fd6b-f1bf-44ed-b8f5-184cad1045b7
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e369fd6b-f1bf-44ed-b8f5-184cad1045b7., Code: 500"
overcloud.Compute.1.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: 768e4a1f-fca4-41e6-948c-f0ba365ff486
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486., Code: 500"
overcloud.Controller.2.Controller:
  resource_type: OS::TripleO::ControllerServer
  physical_resource_id: 19cbe78f-fd0e-4c93-98af-f74414f71657
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 19cbe78f-fd0e-4c93-98af-f74414f71657., Code: 500"

Expected results:
The overcloud should deploy successfully; it worked fine before I deleted my overcloud on the same host/VMs.

Additional info:

Comment 1 Tzach Shefi 2019-07-07 12:07:48 UTC
Created attachment 1588045 [details]
Mistral and Nova logs

Noticed some Nova errors as well; adding the node list:


+--------------------------------------+--------------+--------+------------+-------------+----------+
| ID                                   | Name         | Status | Task State | Power State | Networks |
+--------------------------------------+--------------+--------+------------+-------------+----------+
| e369fd6b-f1bf-44ed-b8f5-184cad1045b7 | ceph-0       | ERROR  | -          | NOSTATE     |          |
| d5724c0f-eb59-4593-b26b-bc1b56a56731 | compute-0    | BUILD  | scheduling | NOSTATE     |          |
| 768e4a1f-fca4-41e6-948c-f0ba365ff486 | compute-1    | ERROR  | -          | NOSTATE     |          |
| 563f20b6-ab10-4f75-91cc-4d7b7abaf12b | controller-0 | BUILD  | scheduling | NOSTATE     |          |
| 2ed71038-19d4-4fbf-a9d9-985d00e5b90d | controller-1 | BUILD  | scheduling | NOSTATE     |          |
| 19cbe78f-fd0e-4c93-98af-f74414f71657 | controller-2 | ERROR  | -          | NOSTATE     |          |
+--------------------------------------+--------------+--------+------------+-------------+----------+





/var/log/containers/nova/nova-compute.log.1:26528:2019-07-07 05:46:19.311 7 ERROR nova.compute.manager [instance: 768e4a1f-fca4-41e6-948c-f0ba365ff486] 
/var/log/containers/nova/nova-compute.log.1:26541:2019-07-07 05:46:21.472 7 DEBUG ironicclient.common.http [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Error contacting Ironic server: Node 30f04583-e5b4-4ad3-806f-b0dbb2e04166 is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409). Attempt 1 of 61 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:302
/var/log/containers/nova/nova-compute.log.1:26563:2019-07-07 05:46:25.584 7 DEBUG ironicclient.common.http [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Error contacting Ironic server: Node 30f04583-e5b4-4ad3-806f-b0dbb2e04166 is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409). Attempt 2 of 61 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:302
/var/log/containers/nova/nova-compute.log.1:26566:2019-07-07 05:46:29.691 7 DEBUG ironicclient.common.http [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Error contacting Ironic server: Node 30f04583-e5b4-4ad3-806f-b0dbb2e04166 is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409). Attempt 3 of 61 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:302
/var/log/containers/nova/nova-compute.log.1:26641:2019-07-07 05:46:44.784 7 DEBUG nova.compute.utils [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] [instance: 768e4a1f-fca4-41e6-948c-f0ba365ff486] Failed to provision instance 768e4a1f-fca4-41e6-948c-f0ba365ff486: Deploy failed for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486. Error: Unexpected error while running command.
/var/log/containers/nova/nova-compute.log.1:26643:2019-07-07 05:46:44.788 7 DEBUG nova.compute.manager [req-9b13ded3-5088-47d0-9a47-1753e1521b78 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] [instance: 768e4a1f-fca4-41e6-948c-f0ba365ff486] Build of instance 768e4a1f-fca4-41e6-948c-f0ba365ff486 was re-scheduled: Failed to provision instance 768e4a1f-fca4-41e6-948c-f0ba365ff486: Deploy failed for instance 768e4a1f-fca4-41e6-948c-f0ba365ff486. Error: Unexpected error while running command.


Looks like it isn't just a Ceph issue; compute-1 and controller-2 are also in a sad error state.

Comment 4 Tzach Shefi 2019-07-15 09:40:16 UTC
Unsure if it's the same root cause;
again I had a working overcloud, which was deleted:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes 
Undeploying stack overcloud...
Waiting for messages on queue 'tripleo' with no timeout.
Deleting plan overcloud...
Success.

Then attempted to redeploy using the same overcloud_deploy.sh, which failed:
2019-07-15 07:47:51Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed
2019-07-15 07:48:11Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 591c1e13-21c7-490b-b3d6-b5e78515fcc9., Code: 500"
2019-07-15 07:48:11Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2019-07-15 07:48:13Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2019-07-15 07:48:24Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2019-07-15 07:48:47Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2019-07-15 07:49:59Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance b74f2166-e8f5-4359-b5fb-787337355359., Code: 500"
2019-07-15 07:49:59Z [overcloud.Compute.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca-4207-a056-571d47889499., Code: 5
2019-07-15 07:49:59Z [overcloud.Compute.0]: CREATE_FAILED  ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca-4207-a056-571d47889499., Code: 500"
2019-07-15 07:49:59Z [overcloud.Compute]: UPDATE_FAILED  Resource CREATE failed: ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca-4207-a056-571d47889
2019-07-15 07:50:00Z [overcloud.Compute]: CREATE_FAILED  resources.Compute: Resource CREATE failed: ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e3e600c5-9dca-
2019-07-15 07:50:00Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.Compute: Resource CREATE failed: ResourceInError: resources[0].resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for

 Stack overcloud/00d85ecc-af52-4f2b-9a36-570bd1da1d40 CREATE_FAILED 

overcloud.Compute.0.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: b74f2166-e8f5-4359-b5fb-787337355359
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance b74f2166-e8f5-4359-b5fb-787337355359., Code: 500"
Heat Stack create failed.
Heat Stack create failed.


I'd forgotten to create an sosreport before reusing the server, so no logs this time :(
But I'll upload logs next time I hit this.

Comment 5 Bob Fournier 2019-07-15 21:05:16 UTC
Seeing these errors in nova-conductor.log for a couple of the instances that are in an error state.

2019-07-07 04:26:24.130 18 WARNING nova.scheduler.client.report [req-e8b5247a-f2cc-40b0-870c-b43425b1e0e3 3d7b1900e3c64486941629e0aca2e3fe f625486eed7e4d8898acff96ae0329fd - default default] Failed to save allocation for 2ed71038-19d4-4fbf-a9d9-985d00e5b90d. Got HTTP 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'CUSTOM_CONTROLLER' on resource provider '8d3244f0-649d-4700-8d51-ef5d09669d6f'. The requested amount would exceed the capacity.  ", "code": "placement.undefined_code", "request_id": "req-a7c3578f-f0be-4b0f-b697-142d1dc5863e"}]}

Followed by:
File "/usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 514, in _wait_for_active\n    raise exception.InstanceDeployFailure(msg)\n', "nova.exception.InstanceDeployFailure: Failed to provision instance 563f20b6-ab10-4f75-91cc-4d7b7abaf12b: Deploy failed for instance 563f20b6-ab10-4f75-91cc-4d7b7abaf12b. Error: Unexpected error while running command.\nCommand: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.15:3260\nExit code: 7\nStdout: ''\nStderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\\niscsiadm: Could not perform SendTargets discovery: invalid parameter\\n'\n",

It's unclear why these iscsiadm failures are occurring, and whether they're just a symptom of another issue.

It would be good to see an sosreport if this failure occurs again.  It would also be useful to see the "openstack baremetal node list" and "openstack baremetal node show" output for the nodes that have the failed instances.
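For reference, a minimal sketch of that data gathering on the undercloud (assuming sosreport is available there; the node name passed to "node show" is just an example):

$ sudo sosreport                          # collect an sosreport right after the failure
$ openstack baremetal node list           # provisioning/power state of all nodes
$ openstack baremetal node show ceph-0    # full details for a node whose instance failed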

Comment 7 Tzach Shefi 2019-07-17 06:41:24 UTC
I'll keep an eye on this issue and will upload logs and command output should I hit it again.

BTW, could I ask about something else that might be related (or not)? When deleting the overcloud I sometimes notice:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes
Undeploying stack overcloud...
Waiting for messages on queue 'tripleo' with no timeout.
Deleting plan overcloud...
Success.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 44630)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 33738)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 57782), raddr=('192.168.24.2', 13989)>


What are the last three lines, and why do I get them?
Is this another bug, or safe to ignore?
If I should open a bug, which component would it land on?

Thanks

Comment 8 Bob Fournier 2019-07-17 14:57:16 UTC
I don't think those log messages on delete matter, as it's just part of shutdown, unless you see a problem with the nodes using these IPs.

Comment 9 Bob Fournier 2019-08-28 13:57:11 UTC
Any update on getting logs?  Is this still occurring?

Comment 10 Tzach Shefi 2019-08-28 14:06:20 UTC
I'll try to reproduce as soon as I finish with the current Ceph deployment, which is being used for another BZ.
Should it fail again, I'll update and upload logs.

Comment 11 Tzach Shefi 2019-09-11 05:17:54 UTC
Still failing. 

Had a working overcloud with Ceph, which I deleted:

(overcloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes                                                                                                                                                               
Undeploying stack overcloud...                                                                                                                                                                                                               
No stack found ('overcloud'), skipping delete                                                                                                                                                                                                
Deleting plan overcloud...                                                                                                                                                                                                                   
Error occurred while deleting plan publicURL endpoint for workflowv2 service in regionOne region not found                                                                                                                                   
sys:1: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.0.11', 53942), raddr=('10.0.0.130', 5000)>                                                            
sys:1: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.0.11', 39068), raddr=('10.0.0.130', 8004)>                                                            
(overcloud) [stack@undercloud-0 ~]$ . stackrc 
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes                                                                                                                                                              
Undeploying stack overcloud...                                                                                                                                                                                                               
Waiting for messages on queue 'tripleo' with no timeout.                                                                                                                                                                                     
Deleting plan overcloud...                                                                                                                                                                                                                   
Success.                        

Nodes are offline and available, so I should be able to redeploy.

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| f66521a6-696e-422a-bcfb-d6f31e0efa33 | ceph-0       | None          | power off   | available          | False       |
| 698f1647-99cc-45d3-b9cf-13080242db89 | compute-0    | None          | power off   | available          | False       |
| 4ead4af4-e8de-4389-ad1a-5b94a865f8ee | controller-0 | None          | power off   | available          | False       |
| e6c0cd4f-e594-4874-8762-ed96378bbc17 | controller-1 | None          | power off   | available          | False       |
| 118cf06b-b516-4276-acde-789969919e72 | controller-2 | None          | power off   | available          | False       |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+


Now rerun the same overcloud_deploy.sh that was used before.
(undercloud) [stack@undercloud-0 ~]$ ./overcloud_deploy.sh ... 


During the overcloud redeploy I noticed:
2019-09-11 04:53:33Z [overcloud.Controller.2.DeploymentActions]: CREATE_COMPLETE  state changed
2019-09-11 04:55:33Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED  ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 04:55:33Z [overcloud.CephStorage.0.CephStorage]: DELETE_IN_PROGRESS  state changed
2019-09-11 04:55:36Z [overcloud.CephStorage.0.CephStorage]: DELETE_COMPLETE  state changed
2019-09-11 04:55:39Z [overcloud.CephStorage.0.CephStorage]: CREATE_IN_PROGRESS  state changed
2019-09-11 04:56:02Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 58e70c30-1be3-46ba-8373-f35774e62a3c., Code: 500"
2019-09-11 04:56:02Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed
2019-09-11 04:56:07Z [overcloud.Controller.2.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance fe0fdc68-fd50-43b1-b9b2-551efdd1964a., Code: 500"
2019-09-11 04:56:07Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS  state changed
2019-09-11 04:56:08Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed
2019-09-11 04:56:08Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 683fafc7-3e10-424d-af5b-36b840b6dc11., Code: 500"
2019-09-11 04:56:08Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2019-09-11 04:56:12Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2019-09-11 04:56:12Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE  state changed
2019-09-11 04:56:12Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2019-09-11 04:56:16Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS  state changed
2019-09-11 04:56:16Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2019-09-11 04:57:38Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED  ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 04f03a7d-0b9c-412f-b3dc-211c0e17d02d., Code: 500"
2019-09-11 04:57:38Z [overcloud.CephStorage.0.CephStorage]: DELETE_IN_PROGRESS  state changed
2019-09-11 04:57:41Z [overcloud.CephStorage.0.CephStorage]: DELETE_COMPLETE  state changed
2019-09-11 04:57:47Z [overcloud.CephStorage.0.CephStorage]: CREATE_IN_PROGRESS  state changed
.. 


2019-09-11 05:06:46Z [overcloud.Controller.1.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 3ffa0781-9434-407c-87ee-6f6bef05358c., Code: 500"
2019-09-11 05:06:46Z [overcloud.Controller.1.Controller]: DELETE_IN_PROGRESS  state changed
2019-09-11 05:06:49Z [overcloud.Controller.1.Controller]: DELETE_COMPLETE  state changed
2019-09-11 05:06:52Z [overcloud.CephStorage.0.CephStorage]: CREATE_FAILED  ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 51a42528-46ed-4758-97cb-637cf767e395., Code: 500"
2019-09-11 05:06:52Z [overcloud.CephStorage.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:53Z [overcloud.CephStorage.0]: CREATE_FAILED  ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:53Z [overcloud.CephStorage]: UPDATE_FAILED  Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:54Z [overcloud.CephStorage]: CREATE_FAILED  resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2019-09-11 05:06:54Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.CephStorage: Resource CREATE failed: ResourceInError: resources[0].resources.CephStorage: Went to status ERROR due to "Message: Unknown, Code: Unknown"

 Stack overcloud/083f0015-8eca-4ae9-8833-0c31da1dd830 CREATE_FAILED 

overcloud.CephStorage.0.CephStorage:
  resource_type: OS::TripleO::CephStorageServer
  physical_resource_id: 51a42528-46ed-4758-97cb-637cf767e395
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.CephStorage: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 51a42528-46ed-4758-97cb-637cf767e395., Code: 500"
Heat Stack create failed.
Heat Stack create failed.


Again failed to redeploy a previously deleted overcloud.
To be honest I'm not sure Ceph is to blame; maybe this would also happen on a non-Ceph-backed overcloud.

Anyway, the reproducer is easy: deploy an overcloud, delete it, and try to redeploy.

Maybe the overcloud delete command (below) isn't cleaning up everything it should?
openstack overcloud delete overcloud --yes

Comment 13 Bob Fournier 2019-09-11 11:51:04 UTC
Thanks, looking...  Still seeing these iscsiadm errors in ironic-conductor.log:
2019-09-11 00:55:06.283 7 DEBUG oslo_concurrency.processutils [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] CMD "sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260" returned: 7 in 0.277s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
2019-09-11 00:55:06.284 7 DEBUG oslo_concurrency.processutils [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] 'sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260' failed. Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:461
...
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] Deploy of instance 157ed05c-7fb6-4518-81e6-43df773264ad on node f66521a6-696e-422a-bcfb-d6f31e0efa33 failed with exception: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260
Exit code: 7
Stdout: ''
Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n'
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Traceback (most recent call last):
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 208, in continue_deploy
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy     uuid_dict_returned = deploy_utils.deploy_partition_image(**params)
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/deploy_utils.py", line 406, in deploy_partition_image
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy     with _iscsi_setup_and_handle_errors(address, port, iqn, lun) as dev:
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy   File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy     return next(self.gen)
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/deploy_utils.py", line 457, in _iscsi_setup_and_handle_errors
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy     discovery(address, port)
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/deploy_utils.py", line 141, in discovery
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy     delay_on_retry=True)
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy   File "/usr/lib/python3.6/site-packages/ironic/common/utils.py", line 75, in execute
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy     result = processutils.execute(*cmd, **kwargs)
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy   File "/usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py", line 424, in execute
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy     cmd=sanitized_cmd)
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Command: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.43:3260
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Exit code: 7
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Stdout: ''
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n'
2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy
2019-09-11 00:55:07.285 7 DEBUG ironic.drivers.modules.agent_client [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] Executing agent command log.collect_system_logs for node f66521a6-696e-422a-bcfb-d6f31e0efa33 _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:85

Comment 14 Bob Fournier 2019-09-11 14:31:17 UTC
It looks like IPA on the node is set up correctly as an iSCSI target, for example in deploy/f66521a6-696e-422a-bcfb-d6f31e0efa33_89015206-0ccf-4181-91e5-4b124307e390_2019-09-11-04-59-14:

Sep 11 00:59:07 host-192-168-24-36 ironic-python-agent[685]: 2019-09-11 00:59:07.941 685 INFO ironic_python_agent.extensions.iscsi [-] Created iSCSI target with iqn iqn.2008-10.org.openstack:f66521a6-696e-422a-bcfb-d6f31e0efa33, portal port 3260, on device /dev/vda using linux-io
Sep 11 00:59:07 host-192-168-24-36 ironic-python-agent[685]: 2019-09-11 00:59:07.943 685 INFO root [-] Command iscsi.start_iscsi_target completed: Command name: start_iscsi_target, params: {'iqn': 'iqn.2008-10.org.openstack:f66521a6-696e-422a-bcfb-d6f31e0efa33', 'portal_port': 3260, 'wipe_disk_metadata': True}, status: SUCCEEDED, result: {'iscsi_target_iqn': 'iqn.2008-10.org.openstack:f66521a6-696e-422a-bcfb-d6f31e0efa33'}.

The node is using IP:
    ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 192.168.24.36/24 brd 192.168.24.255 scope global dynamic noprefixroute ens3


But the discovery from Ironic to this IP failed:
ironic-conductor.log.1:2019-09-11 00:55:07.283 7 ERROR ironic.drivers.modules.iscsi_deploy [req-7d6f5573-921d-4e38-a8cd-8f411f3f11f0 - - - - -] Deploy of instance 157ed05c-7fb6-4518-81e6-43df773264ad on node f66521a6-696e-422a-bcfb-d6f31e0efa33 failed with exception: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
ironic-conductor.log.1:Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n' for node f66521a6-696e-422a-bcfb-d6f31e0efa33: ironic.common.exception.InstanceDeployFailure: Deploy failed for instance 157ed05c-7fb6-4518-81e6-43df773264ad. Error: Unexpected error while running command.

2019-09-11 00:59:13.582 7 ERROR ironic.drivers.modules.iscsi_deploy [req-b6ed196a-df71-4325-bcb8-1853d3455dda - - - - -] Deploy of instance 89015206-0ccf-4181-91e5-4b124307e390 on node f66521a6-696e-422a-bcfb-d6f31e0efa33 failed with exception: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.36:3260
Exit code: 7
Stdout: ''
Stderr: 'iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter\n'
2019-09-11 00:59:13.582 7 ERROR ironic.drivers.modules.iscsi_deploy Traceback (most recent call last):


Looks like potentially a networking issue, although that doesn't explain why it failed only after a redeploy.  We have seen iSCSI discovery problems when there were MTU mismatches, e.g. the undercloud was set to > 9000 and the nodes to 1500, but that does not look to be the case here.
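For completeness, a quick way to rule out an MTU mismatch (interface names are illustrative; the node-side check assumes you can reach the IPA ramdisk via console or SSH):

$ ip -o link show br-ctlplane | grep -o 'mtu [0-9]*'    # on the undercloud
$ ip -o link show ens3 | grep -o 'mtu [0-9]*'           # on the node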

Comment 15 Bob Fournier 2019-09-13 17:29:57 UTC
Because this only occurs after a stack is deleted and redeployed, the theory is that iscsid still has open sessions from the previous deployment.

Questions:
- Is this a pure virtual setup (nodes are all VMs)?
- If this is reproducible, can you check for established sessions using netstat after you delete the overcloud and redeploy? Specifically, sessions to iSCSI port 3260 (see the sketch after this list).
- If the deployment fails, it may be useful to look in /var/lib/iscsi/send_targets to see whether a target matching the failed node is there.  It appears that an iSCSI session to port 3260 can't be created, which is why the deployment is failing.
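A minimal sketch of those checks on the undercloud (the port and path come from the questions above; the grep is just one way to filter):

$ sudo netstat -tpan | grep 3260          # any sessions still established to the iSCSI port?
$ ls -l /var/lib/iscsi/send_targets/      # any stale targets left over from the previous deployment?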

Comment 16 Tzach Shefi 2019-09-15 08:46:37 UTC
Yes, Titan31 is a pure VM environment.

Yes, this is reproducible. As I don't have a working OC at the moment,
I deleted the stack again via:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes
Undeploying stack overcloud...
Waiting for messages on queue 'tripleo' with no timeout.
Deleting plan overcloud...
Success.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 58854)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 38330)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 34058), raddr=('192.168.24.2', 13989)>

Now let's check netstat:

(undercloud) [stack@undercloud-0 ~]$ sudo netstat -tlpn | grep 3260
(undercloud) [stack@undercloud-0 ~]$ 

Nothing looks open, and there are no files under /var/lib/iscsi/send_targets/.

So I re-kicked another overcloud_deploy.sh while watching the iscsi folder; I didn't see anything appear under it.
Deployment failed again.

Comment 17 Bob Fournier 2019-09-17 13:40:37 UTC
Maybe check all sockets with "sudo netstat -tpan", but yeah, probably nothing is still open.

The fact that there are no files under /var/lib/iscsi/send_targets/ just indicates that iscsiadm isn't able to access the targets; we normally see entries for all nodes:

$ ls /var/lib/iscsi/send_targets/
10.8.146.11,3260  10.8.146.16,3260  10.8.146.18,3260  10.8.146.7,3260  10.8.146.9,3260
10.8.146.12,3260  10.8.146.17,3260  10.8.146.6,3260   10.8.146.8,3260

After the stack failure, can you check whether any devices respond to discovery? (I just got onto Titan31, but it looks like any deployments have been cleared.)
$ sudo iscsiadm -m discovery

Maybe restart iscsid.

If you have another failure, I can look again on Titan31.

Comment 18 Bob Fournier 2019-09-17 13:57:45 UTC
Also including Storage DFG for ideas as to what can cause "iscsiadm: Cannot perform discovery. Invalid Initiatorname.\niscsiadm: Could not perform SendTargets discovery: invalid parameter" failures.

Comment 19 Bob Fournier 2019-09-17 14:35:18 UTC
It would also be useful to run tcpdump to capture traffic to port 3260 during the deployment; a possible invocation is sketched below.
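A possible invocation, assuming the provisioning traffic goes over br-ctlplane (interface name and output file are illustrative):

$ sudo tcpdump -i br-ctlplane -w /tmp/iscsi-3260.pcap 'port 3260'    # capture iSCSI traffic during the deploy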

Comment 21 Bob Fournier 2019-10-02 20:22:01 UTC
Not positive that this is the same issue, but I was looking at a problem with pkesavar in which rebooting the undercloud resulted in the iscsid container exiting:
(undercloud) [stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsi
0c2275a4f63d  192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1                     dumb-init --singl...  23 hours ago       Exited (1) About an hour ago         iscsid

The error was:
++ [[ ! -f /etc/iscsi/initiatorname.iscsi ]]
Running command: '/usr/sbin/iscsid -f'
+ echo 'Running command: '\''/usr/sbin/iscsid -f'\'''
+ exec /usr/sbin/iscsid -f
iscsid: Can not bind IPC socket

This caused the same problem as above, with "iscsiadm: Cannot perform discovery. Invalid Initiatorname" in the Ironic logs when attempting to deploy.

We tracked the "Can not bind IPC socket" issue to https://bugzilla.redhat.com/show_bug.cgi?id=1642582, which has been closed.

See the recommendation in https://bugzilla.redhat.com/show_bug.cgi?id=1642582#c2 to disable the iscsi services that are preventing the container from starting.

Please check the iscsid container to see if it exited and, if so, disable the iscsi services and restart it.
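A hedged sketch of that check and workaround, assuming the host units in question are iscsid.socket, iscsid.service and iscsi.service as discussed in bug 1642582 (verify the exact unit names against that bug before disabling anything):

$ sudo podman ps -a | grep iscsid                                           # has the iscsid container exited?
$ sudo systemctl disable --now iscsid.socket iscsid.service iscsi.service   # stop the host units holding the IPC socket (assumed names)
$ sudo podman restart iscsid                                                # restart the containerized iscsid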

Comment 22 Tzach Shefi 2019-10-03 12:54:21 UTC
Installed a fresh new deployment. Before deleting the overcloud, here is the state of the iscsi service and containers.

This is the status before the overcloud delete.
BTW, on my original bug I didn't reboot the undercloud, so if this latest tip only applies after restarting the undercloud, it's not my case.


[stack@undercloud-0 ~]$ systemctl -a | grep  iscsid.socket
  iscsid.socket                                                                                                                        loaded    inactive   dead            Open-iSCSI iscsid Socket                                                                                                                                                                                                                                                                                      

[stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsi                                                                                                                                                                                       
94889421ba1f  192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1                     dumb-init --singl...  3 hours ago  Up 3 hours ago                 iscsid
e07ca8bbf626  192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1                     /var/lib/containe...  3 hours ago  Exited (0) 3 hours ago         container-puppet-iscsid

Note both dockers are up, iscsid.socket dead. 

Before deleting we have a working OC:
(undercloud) [stack@undercloud-0 ~]$ openstack stack list                                                                                                                                                                                    
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+                                                                                             
| ID                                   | Stack Name | Project                          | Stack Status    | Creation Time        | Updated Time |                                                                                             
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+                                                                                             
| 89d722c2-552d-4962-9e1a-dbcb620d6c54 | overcloud  | 616779bfdf304e3cb537a24c10a3e801 | CREATE_COMPLETE | 2019-10-03T09:45:03Z | None         |                                                                                             
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+    


Delete the OC
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud delete overcloud --yes                                                                                                                                                              
Undeploying stack overcloud...                                                                                                                                                                                                               
Waiting for messages on queue   ......


(undercloud) [stack@undercloud-0 ~]$ openstack stack list    - overcloud gone. 

                                                                                                                                                                                                                                             
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+                                                                                                                     
| UUID                                 | Name         | Instance UUID | Power State | Provisioning State | Maintenance |                                                                                                                     
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+                                                                                                                     
| dfc0dee0-7647-4fe5-a238-f28548ce7f44 | ceph-0       | None          | power off   | available          | False       |                                                                                                                     
| d01fbd1e-bfdc-4f8e-9b0c-740fd6029453 | compute-0    | None          | power off   | available          | False       |
| b305adfa-07f8-4f5a-a427-781814e65e6d | compute-1    | None          | power off   | available          | False       |
| c177c850-e7c7-436a-a003-7dd0d059573d | controller-0 | None          | power off   | available          | False       |
| 98da825e-395e-4cc6-8fc4-182c5dc7bbda | controller-1 | None          | power off   | available          | False       |
| 22b13de8-e610-4721-b092-88886ff6ca7e | controller-2 | None          | power off   | available          | False       |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+

Nodes are all available; recheck the service and docker state:

(undercloud) [stack@undercloud-0 ~]$ sudo podman ps -a | grep iscsi
94889421ba1f  192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1                     dumb-init --singl...  3 hours ago  Up 3 hours ago                 iscsid
e07ca8bbf626  192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1                     /var/lib/containe...  3 hours ago  Exited (0) 3 hours ago         container-puppet-iscsid
(undercloud) [stack@undercloud-0 ~]$ systemctl -a | grep  iscsid.socket
  iscsid.socket                                                                                                                        loaded    inactive   dead            Open-iSCSI iscsid Socket                                                         


[stack@undercloud-0 ~]$ sudo netstat -tlpn | grep 3260
returns nothing, and so does

sudo tcpdump -i br-ctlplane -X -vv 'port 3260'



Ran overcloud_deploy.sh again, while watching netstat and tcpdump in the background.

So at around this point in the overcloud deployment ->

 CREATE_COMPLETE  state changed
2019-10-03 12:11:25Z [overcloud.Compute.0.NetHostMap]: CREATE_IN_PROGRESS  state changed
2019-10-03 12:11:25Z [overcloud.Compute.0.NetHostMap]: CREATE_COMPLETE  state changed
2019-10-03 12:11:33Z [overcloud.Compute.1.NovaCompute]: CREATE_COMPLETE  state changed
2019-10-03 12:11:36Z [overcloud.Compute.0.StoragePort]: CREATE_COMPLETE  state changed
2019-10-03 12:11:36Z [overcloud.Compute.0.InternalApiPort]: CREATE_COMPLETE  state changed
2019-10-03 12:11:37Z [overcloud.Compute.0.TenantPort]: CREATE_COMPLETE  state changed
2019-10-03 12:11:37Z [overcloud.Compute.1.StoragePort]: CREATE_IN_PROGRESS  state changed
2019-10-03 12:11:37Z [overcloud.Compute.1.PreNetworkConfig]: CREATE_IN_PROGRESS  state changed


tcpdump started to spew out lots of output like this:


08:10:53.837677 IP (tos 0x0, ttl 64, id 19937, offset 0, flags [DF], proto TCP (6), length 100)
    192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [P.], cksum 0xb1d2 (incorrect -> 0x7d5a), seq 13038397:13038445, ack 3851723552, win 24576, options [nop,nop,TS val 1469370364 ecr 920030516], length 48
        0x0000:  4500 0064 4de1 4000 4006 3b37 c0a8 182a  E..dM.@.@.;7...*
        0x0010:  c0a8 1801 0cbc a0bc 741d 3b84 9c17 5b6c  ........t.;...[l
        0x0020:  8018 6000 b1d2 0000 0101 080a 5794 cffc  ..`.........W...
        0x0030:  36d6 8d34 2180 0000 0000 0000 0000 0000  6..4!...........
        0x0040:  0000 0000 3700 0000 0000 0000 40c0 ebce  ....7.......@...
        0x0050:  0000 1113 0000 1152 0000 0000 0000 0000  .......R........
        0x0060:  0000 0000                                ....
08:10:53.837700 IP (tos 0x0, ttl 64, id 20420, offset 0, flags [DF], proto TCP (6), length 52)
    undercloud-0.localdomain.41148 > 192.168.24.42.iscsi-target: Flags [.], cksum 0xb1a2 (incorrect -> 0xb63f), seq 3851723552, ack 13038445, win 52886, options [nop,nop,TS val 920030517 ecr 1469370364], length 0
        0x0000:  4500 0034 4fc4 4000 4006 3984 c0a8 1801  E..4O.@.@.9.....
        0x0010:  c0a8 182a a0bc 0cbc 9c17 5b6c 741d 3bb4  ...*......[lt.;.
        0x0020:  8010 ce96 b1a2 0000 0101 080a 36d6 8d35  ............6..5
        0x0030:  5794 cffc                                W...
08:10:53.851572 IP (tos 0x0, ttl 64, id 20421, offset 0, flags [DF], proto TCP (6), length 100)
    undercloud-0.localdomain.41148 > 192.168.24.42.iscsi-target: Flags [P.], cksum 0xb1d2 (incorrect -> 0xf9d5), seq 3851723552:3851723600, ack 13038445, win 52886, options [nop,nop,TS val 920030531 ecr 1469370364], length 48
        0x0000:  4500 0064 4fc5 4000 4006 3953 c0a8 1801  E..dO.@.@.9S....
        0x0010:  c0a8 182a a0bc 0cbc 9c17 5b6c 741d 3bb4  ...*......[lt.;.
        0x0020:  8018 ce96 b1d2 0000 0101 080a 36d6 8d43  ............6..C
        0x0030:  5794 cffc 4680 0000 0000 0000 0000 0000  W...F...........
        0x0040:  0000 0000 3800 0000 0000 0000 0000 1113  ....8...........
        0x0050:  40c0 ebcf 0000 0000 0000 0000 0000 0000  @...............
        0x0060:  0000 0000                                ....
08:10:53.851736 IP (tos 0x0, ttl 64, id 19938, offset 0, flags [DF], proto TCP (6), length 100)
    192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [P.], cksum 0xb1d2 (incorrect -> 0x76dc), seq 13038445:13038493, ack 3851723600, win 24576, options [nop,nop,TS val 1469370378 ecr 920030531], length 48
        0x0000:  4500 0064 4de2 4000 4006 3b36 c0a8 182a  E..dM.@.@.;6...*
        0x0010:  c0a8 1801 0cbc a0bc 741d 3bb4 9c17 5b9c  ........t.;...[.
        0x0020:  8018 6000 b1d2 0000 0101 080a 5794 d00a  ..`.........W...
        0x0030:  36d6 8d43 2680 0000 0000 0000 0000 0000  6..C&...........
        0x0040:  0000 0000 3800 0000 0000 0000 40c0 ebcf  ....8.......@...
        0x0050:  0000 1113 0000 1152 0000 0000 0000 0000  .......R........
        0x0060:  0000 0000                                ....
08:10:53.851763 IP (tos 0x0, ttl 64, id 19939, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [F.], cksum 0xb1a2 (incorrect -> 0x2459), seq 13038493, ack 3851723600, win 24576, options [nop,nop,TS val 1469370378 ecr 920030531], length 0
        0x0000:  4500 0034 4de3 4000 4006 3b65 c0a8 182a  E..4M.@.@.;e...*
        0x0010:  c0a8 1801 0cbc a0bc 741d 3be4 9c17 5b9c  ........t.;...[.
        0x0020:  8011 6000 b1a2 0000 0101 080a 5794 d00a  ..`.........W...
        0x0030:  36d6 8d43                                6..C
08:10:53.854108 IP (tos 0x0, ttl 64, id 20422, offset 0, flags [DF], proto TCP (6), length 52)
    undercloud-0.localdomain.41148 > 192.168.24.42.iscsi-target: Flags [F.], cksum 0xb1a2 (incorrect -> 0xb5bf), seq 3851723600, ack 13038494, win 52886, options [nop,nop,TS val 920030533 ecr 1469370378], length 0
        0x0000:  4500 0034 4fc6 4000 4006 3982 c0a8 1801  E..4O.@.@.9.....
        0x0010:  c0a8 182a a0bc 0cbc 9c17 5b9c 741d 3be5  ...*......[.t.;.
        0x0020:  8011 ce96 b1a2 0000 0101 080a 36d6 8d45  ............6..E
        0x0030:  5794 d00a                                W...
08:10:53.854222 IP (tos 0x0, ttl 64, id 19940, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.24.42.iscsi-target > undercloud-0.localdomain.41148: Flags [.], cksum 0xb1a2 (incorrect -> 0x2453), seq 13038494, ack 3851723601, win 24576, options [nop,nop,TS val 1469370381 ecr 920030533], length 0
        0x0000:  4500 0034 4de4 4000 4006 3b64 c0a8 182a  E..4M.@.@.;d...*
        0x0010:  c0a8 1801 0cbc a0bc 741d 3be5 9c17 5b9d  ........t.;...[.
        0x0020:  8010 6000 b1a2 0000 0101 080a 5794 d00d  ..`.........W...
        0x0030:  36d6 8d45                                6..E



And the overcloud deploy failed again; I'll attach the Ansible log.

Regarding disabling iscsi, I'm unsure whether I should stop the container or the service.

Anyway, the current state is:

(undercloud) [stack@undercloud-0 ~]$ systemctl -a | grep  iscsid.socket
  iscsid.socket                                                                                                                        loaded    inactive   dead            Open-iSCSI iscsid Socket                                                                                                    

94889421ba1f  192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1                     dumb-init --singl...  4 hours ago  Up 4 hours ago                 iscsid
e07ca8bbf626  192.168.24.1:8787/rhosp15/openstack-iscsid:20190926.1                     /var/lib/containe...  4 hours ago  Exited (0) 4 hours ago         container-puppet-iscsid


All three look to be in the same state they were in before we deleted or attempted to redeploy the overcloud.

Comment 23 Tzach Shefi 2019-10-03 12:55:22 UTC
Created attachment 1622302 [details]
Ansible deploy failed overcloud deployment.

Comment 24 Bob Fournier 2019-10-03 13:04:17 UTC
Thanks for checking this, it looks like you didn't get the same container exit issue. Just to confirm, did you have the "iscsiadm: Cannot perform discovery. Invalid Initiatorname." error when the deploy failed?  Will try to reproduce this.

Comment 25 Tzach Shefi 2019-10-03 13:29:27 UTC
Created attachment 1622304 [details]
iscsid and nova logs

Yeah, I guess it's not the same error this time.
I didn't notice the "cannot perform discovery" error.

[stack@undercloud-0 ~]$ sudo grep -irn Initiatorname /var/log/containers/

/var/log/containers/stdouts/iscsid.log:6:2019-10-03T05:01:17.210560406-04:00 stderr F INFO:__main__:Copying /var/lib/kolla/config_files/src-iscsid/initiatorname.iscsi to /etc/iscsi/initiatorname.iscsi

/var/log/containers/stdouts/iscsid.log:13:2019-10-03T05:01:17.218676057-04:00 stderr F ++ [[ ! -f /etc/iscsi/initiatorname.iscsi ]]

/var/log/containers/stdouts/nova_compute.log:81:2019-10-03T05:07:07.872767411-04:00 stderr F INFO:__main__:Deleting /etc/iscsi/initiatorname.iscsi

/var/log/containers/stdouts/nova_compute.log:82:2019-10-03T05:07:07.872926026-04:00 stderr F INFO:__main__:Copying /var/lib/kolla/config_files/src-iscsid/initiatorname.iscsi to /etc/iscsi/initiatorname.iscsi

Comment 26 Bob Fournier 2019-10-03 13:42:14 UTC
OK, thanks. So we don't know whether this disproves the theory that "iscsiadm: Cannot perform discovery. Invalid Initiatorname." is caused by the "Can not bind IPC socket" issue, i.e. https://bugzilla.redhat.com/show_bug.cgi?id=1642582, since we had a different failure.

Comment 27 Bob Fournier 2019-11-13 13:43:50 UTC
Hi Tzach - are you still hitting this issue, i.e. the "iscsiadm: Cannot perform discovery. Invalid Initiatorname" error?

I'm pretty sure it's due to https://bugzilla.redhat.com/show_bug.cgi?id=1642582 and can be cleared by disabling the iscsi services per that bug, so I would like to close this out.

Comment 28 Tzach Shefi 2019-11-14 15:01:40 UTC
Hi Bob, 
A bit busy with other stuff, so I didn't have a chance to retest this a third time after comment #25.
Not sure this was resolved, as my error just changed in comment #25.

However, not having time to retest at the moment, you may close the BZ.
Should I hit this again, I'll reopen or report a new bug.

Thanks