Bug 1430758
| Summary: | Node Deployment failure due to root_device hints using WWN | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | openstack-ironic-python-agent | Assignee: | Dmitry Tantsur <dtantsur> |
| Status: | CLOSED ERRATA | QA Contact: | mlammon |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 10.0 (Newton) | CC: | bengland, bfournie, dtantsur, jkachuck, kambiz, lmartins, mburns, mhalas, mknutson, mkovacik, racedoro, rhel-osp-director-maint, slinaber, smalleni, srevivo |
| Target Milestone: | z9 | Keywords: | Triaged, ZStream |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | scale_lab | | |
| Fixed In Version: | openstack-ironic-python-agent-1.5.2-6.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-17 16:59:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sai Sindhur Malleni, 2017-03-09 14:27:23 UTC
Raised priority. This is really important for scale lab testing of OpenStack-Ceph at scale, which is a stated DFG goal for Ocata, right? Sai, why is the WWN hint needed? You told me once; I don't see it here and can't remember it. Are the system disks the same size as the OSD disks? If Ironic metadata cleaning is used, there should be no problem with booting off the wrong disk, right?

Ben, yes, as you may already know, in cases with multiple disks, and especially in the situation we have in the scale lab where device naming is inconsistent, we need to specify the root disk via WWN.

Hi Sai, could you share the introspection data? https://docs.openstack.org/developer/tripleo-docs/advanced_deployment/introspection_data.html

Milan, could you help Sai with this one?

Ramon, the upstream tracker seems to be a legitimate bug, and Lucas seems to be working on the IPA patch there; if not, we can backlog it.

Sai, could you please confirm that https://review.openstack.org/#/c/443649/ fixes the issue for you? Lucas has left our team, so I need to take it over.

Dmitry, we rebuilt the ramdisk image with the patch, but that did not help. This still seems to exist in RHOS 10 (Newton).
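For context, the workflow being discussed takes a node's introspection data and turns one stable field (`wwn` or `serial`) into a `root_device` hint, precisely because `/dev/sd*` names are not stable across boots. A minimal sketch, using the disk data from this report; the helper name and selection logic are illustrative, not taken from any Ironic code:

```python
import json

# Block-device list in the same shape as the introspection data in this report.
INTROSPECTION_DISKS = json.loads("""
[
  {"name": "/dev/sda", "size": 1000204886016, "rotational": true,
   "wwn": "0x5000c500505bdb82", "serial": "Z1W037BK"},
  {"name": "/dev/sdb", "size": 240057409536, "rotational": false,
   "wwn": "0x55cd2e414c819b40", "serial": "CVTS518100Z0240JGN"}
]
""")

def root_device_hint(disks, want_name):
    """Build a root_device hint for the disk currently named want_name.

    Hinting by WWN survives /dev/sd* renumbering across boots, which is
    why the scale lab uses it instead of the device name.
    """
    for disk in disks:
        if disk["name"] == want_name:
            return {"wwn": disk["wwn"]}
    raise LookupError("no disk named %s in introspection data" % want_name)

hint = root_device_hint(INTROSPECTION_DISKS, "/dev/sdb")
print(hint)
```

The resulting dict would then be stored on the node, e.g. with something like `openstack baremetal node set <uuid> --property root_device='{"wwn": "0x55cd2e414c819b40"}'` (exact CLI form depends on the client version in use).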
We have hit it on the following server when setting /dev/sdb as the root device.

NODE: f800fe90-3e4a-4148-816f-cd2a4f13cf4b

```
[
  {
    "size": 1000204886016,
    "rotational": true,
    "vendor": "ATA",
    "name": "/dev/sda",
    "wwn_vendor_extension": null,
    "wwn_with_extension": "0x5000c500505bdb82",
    "model": "ST1000NM0033-9ZM",
    "wwn": "0x5000c500505bdb82",
    "serial": "Z1W037BK"
  },
  {
    "size": 240057409536,
    "rotational": false,
    "vendor": "ATA",
    "name": "/dev/sdb",
    "wwn_vendor_extension": null,
    "wwn_with_extension": "0x55cd2e414c819b40",
    "model": "INTEL SSDSC2BF24",
    "wwn": "0x55cd2e414c819b40",
    "serial": "CVTS518100Z0240JGN"
  }
]
```

```
[stack@lmorlct0311dir0 ~]$ openstack baremetal node show f800fe90-3e4a-4148-816f-cd2a4f13cf4b
+------------------------+--------------------------------------------------------------------+
| Field                  | Value                                                              |
+------------------------+--------------------------------------------------------------------+
| clean_step             | {}                                                                 |
| console_enabled        | False                                                              |
| created_at             | 2017-08-28T16:37:49+00:00                                          |
| driver                 | fake_pxe                                                           |
| driver_info            | {u'deploy_ramdisk': u'64c0786d-28aa-45e4-a8bd-f3167813f10e',       |
|                        | u'deploy_kernel': u'cba60158-7c99-43ca-b7af-0a5ee080820e'}         |
| driver_internal_info   | {u'agent_url': u'http://172.20.0.10:9999',                         |
|                        | u'is_whole_disk_image': False,                                     |
|                        | u'agent_last_heartbeat': 1503945298}                               |
| extra                  | {u'hardware_swift_object':                                         |
|                        | u'extra_hardware-f800fe90-3e4a-4148-816f-cd2a4f13cf4b'}            |
| inspection_finished_at | None                                                               |
| inspection_started_at  | None                                                               |
| instance_info          | {}                                                                 |
| instance_uuid          | None                                                               |
| last_error             | None                                                               |
| maintenance            | False                                                              |
| maintenance_reason     | None                                                               |
| name                   | None                                                               |
| ports                  | [{u'href': u'http://10.240.41.179:13385/v1/nodes/f800fe90-3e4a-    |
|                        | 4148-816f-cd2a4f13cf4b/ports', u'rel': u'self'},                   |
|                        | {u'href': u'http://10.240.41.179:13385/nodes/f800fe90-3e4a-4148-   |
|                        | 816f-cd2a4f13cf4b/ports', u'rel': u'bookmark'}]                    |
| power_state            | power off                                                          |
| properties             | {u'cpu_arch': u'x86_64',                                           |
|                        | u'root_device': {u'serial': u'CVTS518100Z0240JGN'},                |
|                        | u'cpus': u'4',                                                     |
|                        | u'capabilities': u'profile:control,boot_option:local',             |
|                        | u'memory_mb': u'32768', u'local_gb': u'222'}                       |
| provision_state        | available                                                          |
| provision_updated_at   | 2017-08-28T18:35:00+00:00                                          |
| raid_config            | {}                                                                 |
| reservation            | None                                                               |
| states                 | [{u'href': u'http://10.240.41.179:13385/v1/nodes/f800fe90-3e4a-    |
|                        | 4148-816f-cd2a4f13cf4b/states', u'rel': u'self'},                  |
|                        | {u'href': u'http://10.240.41.179:13385/nodes/f800fe90-3e4a-4148-   |
|                        | 816f-cd2a4f13cf4b/states', u'rel': u'bookmark'}]                   |
| target_power_state     | None                                                               |
| target_provision_state | None                                                               |
| target_raid_config     | {}                                                                 |
| updated_at             | 2017-08-28T18:35:00+00:00                                          |
| uuid                   | f800fe90-3e4a-4148-816f-cd2a4f13cf4b                               |
+------------------------+--------------------------------------------------------------------+
```

This results in:

```
2017-08-28 14:32:30.610 7440 DEBUG oslo_messaging._drivers.amqpdriver [req-b5616222-9a6b-479a-802b-5dd76c38dfb3 2e125d9506b3484686a1d017c8e7a577 277da56f19ad4559b703b36dde26be28 - - -] CAST unique_id: 817128e32cbd4c4fa96874aa81b91368 NOTIFY exchange 'nova' topic 'notifications.error' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:479
2017-08-28 14:35:03.362 7439 ERROR nova.scheduler.utils [req-5d75bff2-885f-4dd5-a1bc-236fda9f531f 2e125d9506b3484686a1d017c8e7a577 277da56f19ad4559b703b36dde26be28 - - -] [instance: a2e43458-b8a6-4aa6-8c29-62b91ef951b9] Error from last host: lmorlct0311dir0 (node f800fe90-3e4a-4148-816f-cd2a4f13cf4b):
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1783, in _do_build_and_run_instance
    filter_properties)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1981, in _build_and_run_instance
    instance_uuid=instance.uuid, reason=six.text_type(e))
RescheduledException: Build of instance a2e43458-b8a6-4aa6-8c29-62b91ef951b9 was re-scheduled: Failed to provision instance a2e43458-b8a6-4aa6-8c29-62b91ef951b9: Failed to start the iSCSI target to deploy the node f800fe90-3e4a-4148-816f-cd2a4f13cf4b. Error: {u'message': u"Error finding the disk or partition device to deploy the image onto: No suitable device was found for deployment using these hints {u'serial': u'CVTS518100Z0240JGN'}", u'code': 404, u'type': u'DeviceNotFound', u'details': u"No suitable device was found for deployment using these hints {u'serial': u'CVTS518100Z0240JGN'}"}
```

Hello Lenovo, please let me know if you would be able to attach the introspection data plus the ramdisk logs from the deployment. Thank you, Joe Kachuck

I'm confused by the device-not-found error in comment 8. It is unlikely that the device really went away. Questions:

- Could this be some sort of race condition where the serial number hasn't been made visible yet in sysfs (or wherever)? For example, is the device in JBOD mode? This is an intermittent problem, right? In other words, if it had waited an extra few seconds, would the device still not have been found? That seems unlikely: disks are typically already spun up before the OS reboots, particularly the system disk (otherwise it wouldn't boot). But perhaps in JBOD mode the storage controller doesn't do this for the non-system-disk devices, and just lets the OS hot-plug the devices as they come online. If so, the answer is to not use JBOD mode: you then have a virtual drive, and it will always be online before the OS boots, so there should be no race.
- Does this fail the same way if the WWID is used as the device hint?
- If you specify the other device's serial number, does that work?
- Is there a way to turn on extra logging to see what facts it knew about the available devices at the point when it decided there was no such device? Or is this already in the ramdisk logs?

HTH, Ben

I've posted an update to the patch; this will hopefully fix the bug. I cannot test it, however.

The fix is in master.
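The DeviceNotFound error above is produced when the agent in the ramdisk compares the configured hints against the attributes it reads from the local block devices; if even one field (such as a serial number collected from a different source than introspection used) fails to match, no device is found although the disk is physically present. A simplified, hypothetical sketch of that matching step, not IPA's actual code:

```python
def find_device(block_devices, hints):
    """Return the first device whose attributes satisfy every hint.

    Mirrors, in simplified form, the matching the deploy agent performs:
    all hint fields must match exactly, so a single differently-reported
    field means no device is returned at all.
    """
    for dev in block_devices:
        if all(dev.get(key) == value for key, value in hints.items()):
            return dev
    return None

# Devices as the agent might enumerate them on this node. The serial is the
# fragile field: if the ramdisk reads it differently (or not at all) compared
# with what introspection recorded, the serial hint stops matching.
agent_view = [
    {"name": "/dev/sda", "wwn": "0x5000c500505bdb82", "serial": "Z1W037BK"},
    {"name": "/dev/sdb", "wwn": "0x55cd2e414c819b40", "serial": None},
]

print(find_device(agent_view, {"serial": "CVTS518100Z0240JGN"}))  # no match
print(find_device(agent_view, {"wwn": "0x55cd2e414c819b40"}))     # matches sdb
```

This is why retrying with a different hint field (e.g. WWN instead of serial, as Ben suggests above) is a useful diagnostic: it distinguishes "disk missing" from "one attribute reported differently".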
If it does resolve the problem, should we backport it to OSP 10 (which this bug is flagged for)?

This patch should be backported to Newton.

Hi there, if this bug requires doc text for the errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to -. Thanks, Alex

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2671

The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.