Bug 1347430

Summary: [NOK] [RFE] iSCSI Diskless Installation using OSPd
Product: Red Hat OpenStack
Reporter: Yossi Ovadia <yossi.ovadia>
Component: rhosp-director
Assignee: Angus Thomas <athomas>
Status: CLOSED DUPLICATE
QA Contact: Omri Hochman <ohochman>
Severity: high
Docs Contact:
Priority: unspecified
Version: 8.0 (Liberty)
CC: amitborulkar, bburns, brault, dbecker, dcain, dtantsur, fzdarsky, kbasil, mburns, mknutson, morazi, pchriste, radoslaw.smigielski, rhel-osp-director-maint, rkharwar, scorcora, srevivo, yossi.ovadia, yroblamo
Target Milestone: ---
Keywords: FutureFeature, ZStream
Target Release: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-22 10:36:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1276147
Bug Blocks: 1299906
Attachments:
Description                          Flags
messages from one of the computes    none
Overcloud first compute boot         none
Overcloud second compute boot.       none
Before dropping to dracut            none

Description Yossi Ovadia 2016-06-16 20:26:27 UTC
Created attachment 1168828 [details]
messages from one of the computes

Description of problem:

After implementing https://review.openstack.org/327807 to pass introspection, I moved on to the overcloud deployment.

1. During the overcloud deployment (which eventually fails) I notice the following on the nodes (which boot from the same modified images as introspection):

About 330 seconds after the OS comes up, there **seems** to be a disconnection from iSCSI, and the operating system is unable to reach the 'disks'.

I'm not sure if that's the last of my problems in the deployment, but we certainly need to figure this out.
I have tried the newer be2iscsi Emulex driver (10.7.110.34) and see the same results.

As mentioned, the above occurs during the overcloud deployment using the modified OSPd images.
I have installed RHEL 7.2 on the same hardware, and I do not see the same problem there.

2. I don't think it's related (maybe I should open a separate ticket for this one), but I notice the following in /var/log/messages:

Jun 16 15:14:54 localhost ironic-python-agent: 2016-06-16 15:14:54.655 2797 INFO root [-] Hardware manager found: ironic_python_agent.hardware:GenericHardwareManager
Jun 16 15:14:54 localhost ironic-python-agent: 2016-06-16 15:14:54.655 2797 INFO ironic_python_agent.inspector [-] Inspection is disabled, skipping
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 CRITICAL ironic-python-agent [-] AttributeError: 'module' object has no attribute 'BackOffLoopingCall'
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent Traceback (most recent call last):
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent   File "/usr/bin/ironic-python-agent", line 10, in <module>
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent     sys.exit(run())
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent   File "/usr/lib/python2.7/site-packages/ironic_python_agent/cmd/agent.py", line 47, in run
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent     CONF.hardware_initialization_delay).run()
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent   File "/usr/lib/python2.7/site-packages/ironic_python_agent/agent.py", line 311, in run
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent     node_uuid=uuid)
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent   File "/usr/lib/python2.7/site-packages/ironic_python_agent/ironic_api_client.py", line 84, in lookup_node
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent     timer = loopingcall.BackOffLoopingCall(
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent AttributeError: 'module' object has no attribute 'BackOffLoopingCall'
Jun 16 15:14:55 localhost ironic-python-agent: 2016-06-16 15:14:55.092 2797 ERROR ironic-python-agent
Jun 16 15:14:55 localhost systemd: openstack-ironic-python-agent.service: main process exited, code=exited, status=1/FAILURE
Jun 16 15:14:55 localhost systemd: Unit openstack-ironic-python-agent.service entered failed state.
Jun 16 15:14:55 localhost systemd: openstack-ironic-python-agent.service failed.
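
The AttributeError above says that the `loopingcall` module importable inside the ramdisk has no `BackOffLoopingCall` attribute, which would be consistent with an installed library that does not match what the agent expects. A minimal sketch for checking this from a Python prompt on the node follows. The module path `oslo_service.loopingcall` is an assumption (the traceback shows only the name `loopingcall`), and `module_has_attr` is a hypothetical helper name.

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True if module_name can be imported and exposes attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is not installed in this environment.
        return False
    return hasattr(module, attr_name)
```

Calling `module_has_attr("oslo_service.loopingcall", "BackOffLoopingCall")` inside the ramdisk would then show whether the attribute the agent needs is actually available there.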



Version-Release number of selected component (if applicable):
Red Hat OSP 8

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
This bug is related to bug 1283436 but is not the same.

Comment 5 Yossi Ovadia 2016-06-22 18:58:29 UTC
Hi,

Two things I did have improved the situation:

- I noticed that running iscsistart -b several times halts the system, so we need to add code that makes sure it is executed only once.
- I changed the templates to the latest RH templates (this resolved the 'callback exception').
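
A minimal sketch of a run-once guard for the first point, assuming a marker file is an acceptable way to remember that the command has already been started. The helper name `run_once` and any marker path passed to it are hypothetical and are not part of the actual change.

```python
import errno
import os
import subprocess


def run_once(command, marker_path):
    """Run command unless marker_path already exists. Return True if it ran."""
    try:
        # O_EXCL makes the creation atomic: only the first caller succeeds.
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as exc:
        if exc.errno == errno.EEXIST:
            return False  # an earlier invocation already claimed the marker
        raise
    os.close(fd)
    # The marker is deliberately left in place even if the command fails,
    # because repeated runs are exactly what this guard is meant to prevent.
    subprocess.check_call(command)
    return True
```

For example, `run_once(["iscsistart", "-b"], "/run/iscsistart-b.done")` would start the boot-time login on the first call and return False on every later call during the same boot, since `/run` is recreated at each boot.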

Now this is where I am and need assistance:

When running the overcloud deployment I notice the following behaviour: there are two PXE boots.
At the first boot, it loads deploy_kernel and deploy_ramdisk; note that deploy_ramdisk is my modified image. Everything looks fine, then it reboots.
At the second boot, it loads 'kernel' and 'ramdisk' (see below).
What I am seeing on the second boot is that it drops to dracut and is unable to proceed.

I don't know how to modify the 'ramdisk' file (I was unable to extract it the same way as I did with deploy_ramdisk).


[root@undercloud httpboot]# cd 485e57a4-eeb5-49c1-a379-ed7d4e92afe7/                                                   
[root@undercloud 485e57a4-eeb5-49c1-a379-ed7d4e92afe7]# ll                                                             
total 432632                                                                                                           
-rw-r--r--. 1 ironic ironic      1049 Jun 21 18:48 config                                                              
-rw-r--r--. 5 ironic ironic   5153536 Jun 21 18:08 deploy_kernel                                                       
-rw-r--r--. 5 ironic ironic 392371696 Jun 21 18:08 deploy_ramdisk                                                      
-rw-r--r--. 5 ironic ironic   5153408 Jun 15 17:32 kernel                                                              
-rw-r--r--. 5 ironic ironic  40324447 Jun 15 17:32 ramdisk

The attached screenshots show the two different boots.

Please advise!

Thanks.

Comment 6 Yossi Ovadia 2016-06-22 19:00:41 UTC
Created attachment 1171052 [details]
Overcloud first compute boot

Comment 7 Yossi Ovadia 2016-06-22 19:01:12 UTC
Created attachment 1171053 [details]
Overcloud second compute boot.

Comment 8 Yossi Ovadia 2016-06-22 19:02:55 UTC
Created attachment 1171054 [details]
Before dropping to dracut

Note that I don't see the be2iscsi driver being loaded.

Comment 9 Paul Christensen 2016-06-22 19:06:15 UTC
So the question to ask is this: where does this other image come from, and can it be accessed so that we can modify it the way you modified the first image?

Can someone in eng please advise?

Comment 10 Yossi Ovadia 2016-06-24 15:40:42 UTC
I have a way (I got it from Frank) of editing the overcloud images.
The procedure is to edit the overcloud qcow, which is then built by `dib` (disk image builder).

(I think that) the problem I'm facing is that the output initramfs does not include the be2iscsi drivers.

I did try to add the module to the image, but failed, since it seems that the kernel in the overcloud qcow is missing some dependencies.

Comment 11 Paul Christensen 2016-06-26 18:38:16 UTC
Hi Yossi.

Can you please attach the error message with the dependency failures?

The missing be2iscsi drivers in the overcloud images seem to be the missing link in this entire installation failure. Would that be a proper assessment?

Comment 13 Yossi Ovadia 2016-06-29 21:36:09 UTC
Hi Paul.
No logs. The compute fails to boot (on the second boot) and drops to dracut.

Your assessment is correct afaik.

Comment 15 Amit 2016-07-26 00:54:39 UTC
Hi Yossi,

I was facing a similar issue. I used the updated overcloud.qcow2 image with the modules mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=1283436#c19
and generated new initramfs and kernel images for the overcloud.qcow2 image.

e.g. virt-builder --get-kernel /var/lib/libvirt/images/overcloud-full.qcow2
You could also use disk image builder.

I found that it is important to have the iSCSI and multipath modules in the initramfs of the overcloud image.
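
One way to verify that point is to scan a file listing of the initramfs (for example the text printed by `lsinitrd`) for the expected kernel modules. The sketch below assumes the listing has one file per line with the path as the last field. The function names are hypothetical, and the exact set of required modules is something to adapt: only be2iscsi is named in this bug.

```python
def modules_in_listing(listing_text):
    """Return the kernel module names found in an initramfs file listing."""
    found = set()
    for line in listing_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Keep only the file name from the last field (the path).
        filename = fields[-1].rsplit("/", 1)[-1]
        for suffix in (".ko.xz", ".ko.gz", ".ko"):
            if filename.endswith(suffix):
                # The kernel treats '-' and '_' in module names as equivalent.
                found.add(filename[: -len(suffix)].replace("-", "_"))
                break
    return found


def missing_modules(listing_text, required):
    """Return the required module names that are absent from the listing."""
    found = modules_in_listing(listing_text)
    return [name for name in required if name.replace("-", "_") not in found]
```

An empty result from `missing_modules(listing, ["be2iscsi", "dm_multipath"])` would suggest the regenerated initramfs carries those drivers; any names returned are the ones still to be added.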

Comment 17 Yossi Ovadia 2016-08-09 20:03:26 UTC
Hi Amit,
We reached the same conclusion: the drivers need to be in the initramfs.

- I'm not familiar with a way to add the drivers to the initramfs without DiB. Can you elaborate on how you did that with virt-builder?

- In case you missed it, we also realised that the ironic python agent must be modified, otherwise introspection fails (on diskless hardware).
This was merged in master, see https://review.openstack.org/#/c/327807/

Comment 18 Paul Christensen 2016-08-16 20:59:53 UTC
It seems that the problem is that a specific proprietary driver is needed, and it is available only for a specific kernel (3.10.0-327.el7).

The latest kernels do not work and are not supported.

Custom workarounds can be done, but this is not a long-term supportable solution, and new features cannot be taken advantage of if the image is pinned to a specific kernel.

Options:

- Contact the hardware provider (HP, as Yuval said) to ask for better support and updates for that driver on current and future kernels.

- If that's not possible, investigate the possibility of getting the source code for that driver, so custom builds can be done per kernel update.

- Investigate the possibility of using alternative drivers with better support.

Feedback from the partner requested.

Comment 19 Monte Knutson 2016-08-26 14:55:02 UTC
Is there a support case logged with HPE or Emulex? We'll need it to get them engaged.

Thanks.

Comment 20 Dmitry Tantsur 2016-09-06 16:11:51 UTC
*** Bug 1322430 has been marked as a duplicate of this bug. ***

Comment 21 Yolanda Robla 2016-09-08 10:55:44 UTC
We recently checked the overcloud images, and the native be2iscsi driver is present there, so it should be used.
We passed the driver information to Yossi and Joey, and we are waiting for their feedback on whether it is possible to use that driver instead of the HP one.

Comment 22 Paul Christensen 2016-09-15 13:59:32 UTC
Follow-up email to comment #21 sent to confirm driver presence.

Comment 24 Yossi Ovadia 2016-09-19 20:44:22 UTC
So, we'll need to download the latest overcloud images and give them a shot.
I'll try that soon and will update.

Thanks!

Comment 25 Dmitry Tantsur 2016-09-21 10:21:06 UTC
*** Bug 1283436 has been marked as a duplicate of this bug. ***

Comment 26 Dmitry Tantsur 2016-09-21 10:22:54 UTC
Changing the component, as it seems like the proposed fix is not within Ironic, and the Ironic problem in the original report seems resolved. Thanks!

Comment 27 Dmitry Tantsur 2016-09-21 10:25:10 UTC
Yossi, does this bug essentially duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1276147 at this stage? It seems like both are about be2iscsi now.

Comment 28 Yossi Ovadia 2016-09-21 19:07:30 UTC
Yes, it does indeed seem to be a duplicate.

Comment 29 Dmitry Tantsur 2016-09-22 10:36:22 UTC
Thanks! I'll close this one to keep our backlog clean.

*** This bug has been marked as a duplicate of bug 1276147 ***