Bug 1316550

Summary: rhel-osp-director: Node introspection fails and nodes get to #DRACUT mode (machine type: HP DL165 G7 with Intel X520 10G dual port NIC).
Product: Red Hat OpenStack
Reporter: Omri Hochman <ohochman>
Component: rhosp-director
Assignee: Angus Thomas <athomas>
Status: CLOSED NOTABUG
QA Contact: Tzach Shefi <tshefi>
Severity: high
Priority: high
Version: 7.0 (Kilo)
CC: aschultz, bfournie, dbecker, dtantsur, dyocum, lmartins, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, sasha, sclewis, tshefi
Target Milestone: ---   
Target Release: 7.0 (Kilo)   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2017-09-19 20:33:01 UTC
Type: Bug
Attachments:
Description | Flags
dracut-screen-shot | none
Dracut logs | none
Discovery fails | none
Dracut logs | none
Dracut logs after modifing ipxe paramaters | none
http logs Tigris01 | none
Logs for #34 | none
Logs for #39 | none

Description Omri Hochman 2016-03-10 13:07:54 UTC
rhel-osp-director: Node introspection fails and nodes get to #DRACUT mode (machine type: HP DL165 G7 with Intel X520 10G dual port NIC).


Environment (OSPd 7.3 GA):
-----------------------------
python-rdomanager-oscplugin-0.0.10-28.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-ironic-common-2015.1.2-2.el7ost.noarch
openstack-ironic-api-2015.1.2-2.el7ost.noarch
python-ironicclient-0.5.1-12.el7ost.noarch
python-ironic-discoverd-1.1.0-8.el7ost.noarch
openstack-ironic-conductor-2015.1.2-2.el7ost.noarch
openstack-ironic-discoverd-1.1.0-8.el7ost.noarch
 

Description:
-------------
While attempting to deploy a 7-node bare-metal environment, 2 of the nodes constantly fail during introspection. Looking at the node console (see the attached screenshot), after boot the nodes get to DRACUT mode and do not proceed.

On the same setup, the same NIC is being used in another node (a different brand) and it works!

We suspect the issue occurs only when using this specific HW and NIC:
 

Working server
-----------------
Gizmo IBM - x3250
Single Intel Xeon E3-1230 v2 3.3GHz
32GB RAM
1 TB
Intel x520 10G dual port NIC, all onboard 1G nics are disabled.    Firmware Intel Boot agent xe 2.1.87

Non-working servers Tigris01 and Tigris02, identical hardware (same NIC as Gizmo):
----------------------------------------------------------------------------------
HP DL165 G7 (BIOS 02/02/2012 ROMID 037). Tigris01 has a newer BIOS and shows the same error.
Dual CPU AMD Opteron 6272 2.1GHz 
128GB RAM
Disk 300G (dual disks in mirror), HP smart array P410. 
Intel X520 10G dual port NIC, onboard 1G NICs are all disabled. Tigris02 firmware: Intel Boot agent xe 2.1.96 (Tigris01 has an even later firmware and still shows the same problem).



We tried the suggested workaround from: https://bugzilla.redhat.com/show_bug.cgi?id=1234601 

It didn't solve the problem.

Comment 1 Omri Hochman 2016-03-10 13:08:53 UTC
Created attachment 1134889 [details]
dracut-screen-shot

Comment 2 Dmitry Tantsur 2016-03-10 13:15:00 UTC
Please look for the exact error in /run/initramfs/rdsosreport.txt. Also, if it's OSPd7, there should be a /logs file; please take a look at it as well. (All these files are on the ramdisk, not on the undercloud.)

Comment 3 Lucas Alvares Gomes 2016-03-10 13:25:19 UTC
Boot errors can vary a lot: they can be caused by missing drivers, network problems (since in the bash ramdisk the code runs as pid 1) and so on...

There's a file that dracut generates once you drop to the shell. Can you see any hints about what happened when you run the command below?

$ cat /run/initramfs/rdsosreport.txt

Another thing you can do to try to figure out what went wrong is to append "rd.debug" to the kernel command line. If you do that, the logs will go to journald and you can check them by running "journalctl -ab". If there's no systemd, you can find the debug logs in "dmesg" and in the file "/run/initramfs/init.log".

You can add the "rd.debug" parameter to the command line by editing the /etc/ironic/ironic.conf file and appending the parameter there for the "pxe_append_params" config option, e.g:

[pxe]
pxe_append_params=nofb nomodeset vga=normal rd.debug

Restart the openstack-ironic-conductor service after editing it to apply the changes and try to deploy again.
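
For anyone scripting this edit rather than doing it by hand, here is a minimal Python sketch of appending a parameter to "pxe_append_params". It works on the config text as a string; the helper name append_pxe_param is made up for illustration and is not part of Ironic. Note that configparser drops comments when it rewrites a file, so on a real /etc/ironic/ironic.conf a hand edit may still be the safer choice.

```python
import configparser
import io


def append_pxe_param(conf_text, param="rd.debug"):
    """Return conf_text with `param` appended to [pxe]/pxe_append_params.

    Hypothetical helper for illustration only. The parameter is added
    once; running the function again leaves the text unchanged.
    """
    # interpolation=None so that '%' characters in values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("pxe"):
        parser.add_section("pxe")
    current = parser.get("pxe", "pxe_append_params", fallback="").split()
    if param not in current:
        current.append(param)
    parser.set("pxe", "pxe_append_params", " ".join(current))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Sample input modelled on the snippet above, before rd.debug is added.
SAMPLE = "[pxe]\npxe_append_params=nofb nomodeset vga=normal\n"
UPDATED = append_pxe_param(SAMPLE)
print(UPDATED)
```

The conductor restart is still needed afterwards for the new value to take effect.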

Hope that helps,
Lucas

Comment 4 Tzach Shefi 2016-03-10 14:28:21 UTC
Created attachment 1134903 [details]
Dracut logs

Comment 5 Tzach Shefi 2016-03-10 14:28:46 UTC
Attached above are the requested logs, plus others I found along the way. 

discovery-log
ens1f0.log   (provision eth)
ens1f1.log
log
rdsosreport.txt

I'm looking over these logs myself as well.

Comment 6 Tzach Shefi 2016-03-10 14:41:29 UTC
Maybe it's this "No route to host" error, at line 1048 of the log file:

WARNING: log file /run/initramfs/rdsosreport.txt does not exist
ERROR: ('Connection aborted.', error(113, 'No route to host')) when calling to discoverd
///lib/dracut/hooks/pre-mount/50-init.sh@475(source): give_up 'Failed to discover hardware'
///lib/dracut/hooks/pre-mount/50-init.sh@144(give_up): log 'Failed to discover hardware'
///lib/dracut/hooks/pre-mount/50-init.sh@136(log): echo 'Failed to discover hardware'
Failed to discover hardware
///lib/dracut/hooks/pre-mount/50-init.sh@146(give_up): case "$ONFAILURE" in
///lib/dracut/hooks/pre-mount/50-init.sh@152(give_up): log 'ONFAILURE=console, launching an interactive shell'
///lib/dracut/hooks/pre-mount/50-init.sh@136(log): echo 'ONFAILURE=console, launching an interactive shell'
ONFAILURE=console, launching an interactive shell
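
To pull just the socket error out of a long ramdisk log like this one, a small Python sketch along these lines can help. The function name and the regular expression are assumptions written against the ERROR line format shown above, not anything shipped with the discovery ramdisk.

```python
import re

# Matches the Python socket error repr seen in the log,
# e.g. error(113, 'No route to host').
SOCKET_ERROR_RE = re.compile(r"error\((\d+), '([^']*)'\)")


def extract_socket_errors(log_text):
    """Return (errno, message) pairs found on ERROR lines of the log."""
    found = []
    for line in log_text.splitlines():
        if "ERROR" not in line:
            continue
        match = SOCKET_ERROR_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


# Lines copied from the excerpt above.
SAMPLE_LOG = """\
WARNING: log file /run/initramfs/rdsosreport.txt does not exist
ERROR: ('Connection aborted.', error(113, 'No route to host')) when calling to discoverd
Failed to discover hardware
ONFAILURE=console, launching an interactive shell
"""

ERRORS = extract_socket_errors(SAMPLE_LOG)
print(ERRORS)
```

On this excerpt the only hit is errno 113, which points at the network path to discoverd rather than at hardware detection itself.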

Comment 7 Tzach Shefi 2016-03-10 15:34:24 UTC
Created attachment 1134933 [details]
Discovery fails

Comment 9 Tzach Shefi 2016-03-13 12:23:38 UTC
Shooting in the dark here: could the AMD CPU cause this issue? I'm asking because both of my affected servers are AMD based, while all the other servers are Intel based and don't exhibit this error.

On line 2691 of rdsosreport.txt:
[   42.717425] localhost mcelog[804]: ERROR: AMD Processor family 21: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
                                      : No such file or directory
[   48.242185] localhost lldpad[1155]: config file failed to load,

Started reading about mcelog, found these links:
https://access.redhat.com/solutions/158503
https://bugzilla.redhat.com/show_bug.cgi?id=1166978

Comment 10 Lucas Alvares Gomes 2016-03-14 10:17:10 UTC
(In reply to Tzach Shefi from comment #9)
> Shooting in the dark here, could AMD CPU cause this issue? Asking as both of
> my affected servers having this issue are AMD based, all the other servers
> Intel based and don't exhibit this error.   
> 
> On line 2691 of rdsosreport.txt:
> [   42.717425] localhost mcelog[804]: ERROR: AMD Processor family 21: mcelog
> does not support this processor.  Please use the edac_mce_amd module instead.
>                                       : No such file or directory
> [   48.242185] localhost lldpad[1155]: config file failed to load,
> 
> Started reading about mcelog, found these links:
> https://access.redhat.com/solutions/158503
> https://bugzilla.redhat.com/show_bug.cgi?id=1166978

That's interesting... I don't think that particular error would cause the node boot to fail, but the fact that the AMD machines are not working and the Intel ones are makes me think about the version of the kernel used for the deployment. I believe the host OS we use for the distributed images is RHEL, right? Can you take a look at the kernel version it is using, please?

Also, I think it would be worth trying to create an image based on a distro with a newer kernel to see if that works, so that we can isolate the problem. Can you generate a deploy ramdisk/kernel with Fedora and see if that works, please?

You can use the command below to create the image:

$ ramdisk-image-create -o fedora-deploy fedora deploy-ironic dracut-ramdisk

Once it's created, we need to upload it to Glance:

$ glance image-create --name my-kernel --is-public True --disk-format aki --container-format aki  < fedora-deploy.vmlinuz

$ glance image-create --name my-image.initrd --is-public True --disk-format ari --container-format ari  < fedora-deploy.initrd

Now set the new images on the Ironic nodes:

$ ironic node-update <node uuid or name> add driver_info/deploy_ramdisk=<glance uuid> driver_info/deploy_kernel=<glance uuid>

Start the deployment again to see if that works.

Comment 12 Tzach Shefi 2016-03-14 16:01:21 UTC
uname -a under dracut returns:

Linux localhost 3.10.0-327.10.1.el7d /x86_64 #1 SMP Sat Jan23 ...

We created fedora based discovery images:
export ELEMENTS_PATH=/usr/share/instack-undercloud:/usr/share/diskimage-builder/elements:/usr/share/tripleo-image-elements
export DELOREAN_TRUNK_REPO="http://trunk.rdoproject.org/f22/current"
export DELOREAN_REPO_URL=$DELOREAN_TRUNK_REPO

disk-image-create -a amd64 -o fedora-discover fedora ironic-agent delorean-repo -p python-hardware-detect 2>&1 | tee disk1.log
 
Then replaced files under /httpboot/
sudo mv fedora-discover.initramfs /httpboot/discovery.ramdisk
sudo mv fedora-discover.vmlinuz /httpboot/discovery.kernel

Ran chown ironic:ironic on both files, and setenforce 0 so that SELinux would not interfere. 

Introspected the AMD node. This time it didn't get stuck in dracut; after a while I got a login screen:
Fedora release 21.. 
Kernel 4.1.13-100.. 
Localhost login: 

when I checked introspection status: 
[stack@undercloud72 ~]$ openstack baremetal introspection status 13b102df-8ccb-42e8-abf0-7eda3f48181a
+----------+-------+
| Field    | Value |
+----------+-------+
| error    | None  |
| finished | False |
+----------+-------+

I gave it 15-20 minutes with the same status; it doesn't look like it's going to reach finished=True.

Comment 14 Lucas Alvares Gomes 2016-03-15 09:52:52 UTC
(In reply to Tzach Shefi from comment #12)
> uname -a under dracut returns:
> 
> Linux localhost 3.10.0-327.10.1.el7d /x86_64 #1 SMP Sat Jan23 ...
> 
> We created fedora based discovery images:
> export
> ELEMENTS_PATH=/usr/share/instack-undercloud:/usr/share/diskimage-builder/
> elements:/usr/share/tripleo-image-elements
> export DELOREAN_TRUNK_REPO="http://trunk.rdoproject.org/f22/current"
> export DELOREAN_REPO_URL=$DELOREAN_TRUNK_REPO
> 
> disk-image-create -a amd64 -o fedora-discover fedora ironic-agent
> delorean-repo -p python-hardware-detect 2>&1 | tee disk1.log
>  

Oh, so you don't seem to be using the discovery element to create the ramdisk. The "ironic-agent" element works for inspection, but only on OSP 8.0+, which uses the IPA ramdisk for deploy and inspection.

For 7.0 you have to use the ironic-discoverd-ramdisk-instack[0] element when creating the image.


[0] https://github.com/rdo-management/instack-undercloud/tree/master/elements/ironic-discoverd-ramdisk-instack

Comment 15 Dmitry Tantsur 2016-03-16 13:02:05 UTC
I see 2 problems in this report:

1. ERROR: ('Connection aborted.', error(113, 'No route to host')) when calling to discoverd

2. "No node found for MAC blah-blah".

Which one are we debugging right now? I don't think mcelog is somehow involved here.

Comment 16 Tzach Shefi 2016-03-16 13:18:12 UTC
Adding info: I have deployed OSPD 7.3 on one of the AMD nodes, created RHEL images on it, and tested introspection of the second AMD node. Same dracut error. 
So we know this is a discovery image issue with AMD CPUs. 

I'm trying to make Fedora-based images on the same OSPD, but not having much luck creating them.

Comment 17 Dmitry Tantsur 2016-03-16 13:40:51 UTC
I still see no direct link to the CPU manufacturer. Seems like the case of "no route to host" error we saw on some configurations.

Just dumping my findings in the logs:

[    0.000000] localhost kernel: Command line: discoverd_callback_url=http://10.35.20.1:5050/v1/continue RUNBENCH=0 ip=10.35.20.29:10.35.20.1:10.35.20.1:255.255.255.0 BOOTIF=a0:36:9f:22:e8:78

///lib/dracut/hooks/pre-mount/50-init.sh@465(source): ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens1f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether a0:36:9f:22:e8:78 brd ff:ff:ff:ff:ff:ff
    inet 10.35.20.29 peer 10.35.20.1/32 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet 10.35.20.29/24 brd 10.35.20.255 scope global dynamic ens1f0
       valid_lft 106sec preferred_lft 106sec
3: ens1f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:22:e8:7a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a236:9fff:fe22:e87a/64 scope link 
       valid_lft forever preferred_lft forever

///lib/dracut/hooks/pre-mount/50-init.sh@473(source): ironic-discoverd-ramdisk --use-hardware-detect --bootif a0:36:9f:22:e8:78 -L /run/initramfs/rdsosreport.txt -L /log http://10.35.20.1:5050/v1/continue
ERROR: ('Connection aborted.', error(113, 'No route to host')) when calling to discoverd

2016-03-10 13:46:36,529 INFO: ironic-discoverd-ramdisk: posting collected data to http://10.35.20.1:5050/v1/continue
2016-03-10 13:46:36,550 INFO: requests.packages.urllib3.connectionpool: Starting new HTTP connection (1): 10.35.20.1
2016-03-10 13:46:54,585 ERROR: ironic-discoverd-ramdisk: ('Connection aborted.', error(113, 'No route to host')) when calling to discoverd
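
One thing worth cross-checking in the dump above is whether the interface that owns the BOOTIF MAC actually has link. Below is a minimal Python sketch that parses "ip a" output and reports the carrier state of the boot interface. The function names are hypothetical, and the parsing is written only against the output format shown in this comment.

```python
import re

# Header line of an interface block, e.g.
# "2: ens1f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 ... state DOWN ..."
IFACE_RE = re.compile(r"^\d+: (\S+): <([^>]*)> .*?state (\S+)")


def parse_ip_addr(text):
    """Parse `ip a` output into {name: {flags, state, mac, inet}}."""
    interfaces = {}
    current = None
    for raw in text.splitlines():
        match = IFACE_RE.match(raw)
        if match:
            current = {"flags": match.group(2).split(","),
                       "state": match.group(3), "mac": None, "inet": []}
            interfaces[match.group(1)] = current
            continue
        fields = raw.split()
        if current is None or len(fields) < 2:
            continue
        if fields[0] == "link/ether":
            current["mac"] = fields[1]
        elif fields[0] == "inet":
            current["inet"].append(fields[1])
    return interfaces


def bootif_link_status(text, bootif_mac):
    """Return (interface name, has_carrier) for the given BOOTIF MAC."""
    for name, info in parse_ip_addr(text).items():
        if info["mac"] == bootif_mac.lower():
            flags = info["flags"]
            return name, "NO-CARRIER" not in flags and "LOWER_UP" in flags
    return None, False


# Interface blocks copied from the log above (lo omitted for brevity).
SAMPLE = """\
2: ens1f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether a0:36:9f:22:e8:78 brd ff:ff:ff:ff:ff:ff
    inet 10.35.20.29 peer 10.35.20.1/32 scope global ens1f0
    inet 10.35.20.29/24 brd 10.35.20.255 scope global dynamic ens1f0
3: ens1f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:22:e8:7a brd ff:ff:ff:ff:ff:ff
"""

BOOT_STATUS = bootif_link_status(SAMPLE, "a0:36:9f:22:e8:78")
print(BOOT_STATUS)
```

With this dump, the BOOTIF MAC resolves to ens1f0, which is flagged NO-CARRIER at the moment the ramdisk posts to discoverd, while ens1f1 has link but no IPv4 address.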

Comment 18 Tzach Shefi 2016-03-16 14:10:05 UTC
Let's assume it's a network issue (no route to host). Then I shouldn't be able to ping the undercloud from the node and vice versa, right? 

Also notice this on ens1f0 ->  mq state DOWN qlen 1000
Yet when I check on the node in dracut mode, ip a shows ens1f0 up with IP address x.y.20.29.

So the logs might show an issue, but they are either incorrect or not recent. 

I'm adding logs from the latest attempt, which I did on the same node and the same network switches; I just installed OSPD on one of the AMD nodes.

Comment 19 Tzach Shefi 2016-03-16 14:14:19 UTC
Created attachment 1137056 [details]
Dracut logs

Comment 20 Dmitry Tantsur 2016-03-16 14:48:14 UTC
These lines surprise me:

    inet 10.35.20.29 peer 10.35.20.1/32 scope global ens1f0
       valid_lft forever preferred_lft forever
    inet 10.35.20.29/24 brd 10.35.20.255 scope global dynamic ens1f0
       valid_lft 106sec preferred_lft 106sec

Maybe that's how dracut is expected to work, but dunno. Could you please try editing /httpboot/discoverd.ipxe by hand, removing "ip=${ip}:${next-server}:${gateway}:${netmask}" parameter. Then restart the introspection. If it fails, please also provide the ramdisk logs.
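
As a sanity check on the static configuration itself, here is a short Python sketch that splits the "ip=" value into the four fields of the iPXE template quoted above and verifies that the gateway is on the client's subnet. The field names follow that template; the helper names are made up for illustration.

```python
import ipaddress


def parse_ip_param(value):
    """Split ip=<ip>:<next-server>:<gateway>:<netmask> into named fields."""
    client, next_server, gateway, netmask = value.split(":")
    return {"client": client, "next_server": next_server,
            "gateway": gateway, "netmask": netmask}


def gateway_on_link(fields):
    """True if the gateway address lies inside the client's subnet."""
    iface = ipaddress.ip_interface(
        "{}/{}".format(fields["client"], fields["netmask"]))
    return ipaddress.ip_address(fields["gateway"]) in iface.network


# Value taken from the kernel command line in comment 17.
FIELDS = parse_ip_param("10.35.20.29:10.35.20.1:10.35.20.1:255.255.255.0")
print(FIELDS, gateway_on_link(FIELDS))
```

For the values in this report the gateway is on-link, so the addressing itself looks consistent and the duplicated address on ens1f0 remains the odd part.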

Comment 21 Tzach Shefi 2016-03-16 15:58:52 UTC
Tested this tip as well: 
modified /httpboot/discoverd.ipxe and removed the "ip=${ip}:${next-server}:${gateway}:${netmask}" parameter from the command line.

Restarted the discovery service and attempted to introspect: same dracut issue.

Comment 22 Dmitry Tantsur 2016-03-16 16:05:49 UTC
Please provide logs for this issue. I need to see how different it was this time.

Comment 23 Tzach Shefi 2016-03-16 16:14:48 UTC
Created attachment 1137097 [details]
Dracut logs after modifing ipxe paramaters

These are the logs after making the change below and restarting the ironic services: 

Edited /httpboot/discoverd.ipxe and removed "ip=${ip}:${next-server}:${gateway}:${netmask}"

Comment 24 Dmitry Tantsur 2016-03-16 16:20:54 UTC
I still see ip=10.35.20.29:10.35.20.1:10.35.20.1:255.255.255.0 in the logs. Could you please paste your discoverd.ipxe?

Comment 25 Tzach Shefi 2016-03-17 06:33:22 UTC
[stack@localhost httpboot]$ cat discoverd.ipxe 
#!ipxe

dhcp

kernel http://10.35.20.1:8088/discovery.kernel discoverd_callback_url=http://10.35.20.1:5050/v1/continue RUNBENCH=0 BOOTIF=${mac}
initrd http://10.35.20.1:8088/discovery.ramdisk
boot
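
To make the mismatch concrete, here is a minimal Python sketch that compares the argument names on the "kernel" line of this discoverd.ipxe with the ones on the booted kernel command line from comment 17. The helper names are hypothetical; only the argument names are compared, since values such as ${mac} are expanded by iPXE at boot.

```python
def kernel_args_from_ipxe(ipxe_text):
    """Return the arguments that follow the URL on the `kernel` line."""
    for line in ipxe_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "kernel":
            return parts[2:]  # skip the "kernel" keyword and the image URL
    return []


def arg_names(args):
    """Reduce key=value arguments to the set of their key names."""
    return {arg.split("=", 1)[0] for arg in args}


# File content as pasted above.
IPXE_FILE = """\
#!ipxe

dhcp

kernel http://10.35.20.1:8088/discovery.kernel discoverd_callback_url=http://10.35.20.1:5050/v1/continue RUNBENCH=0 BOOTIF=${mac}
initrd http://10.35.20.1:8088/discovery.ramdisk
boot
"""

# Kernel command line as reported in the ramdisk logs.
BOOTED_CMDLINE = ("discoverd_callback_url=http://10.35.20.1:5050/v1/continue "
                  "RUNBENCH=0 ip=10.35.20.29:10.35.20.1:10.35.20.1:255.255.255.0 "
                  "BOOTIF=a0:36:9f:22:e8:78")

UNEXPECTED = (arg_names(BOOTED_CMDLINE.split())
              - arg_names(kernel_args_from_ipxe(IPXE_FILE)))
print(UNEXPECTED)
```

Any name left over is an argument the node booted with that this file does not supply, which would suggest the node is being served a different script or an older copy.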

Comment 26 Dmitry Tantsur 2016-03-18 08:46:25 UTC
That's weird, I wonder how this "ip" variable is set then. Did you change any PXE/iPXE configuration previously? Could you check the httpd logs to ensure that it indeed serves the modified file?

Comment 27 Tzach Shefi 2016-03-20 06:39:38 UTC
Created attachment 1138187 [details]
http logs Tigris01

Remember, the first time we hit this issue OSPD was installed on the seal13 (Intel) server, on which we also changed ipxe/pxe, which didn't help. 
Then we built a Fedora discovery image, which booted up OK but reached the login prompt without doing any discovery. 

To debug the issue further, I took the Tigris01 (AMD) server and installed the same OSPD on it as on Seal13. I built images on it and then ran introspection of Tigris02 (AMD), figuring that building images on the same AMD hardware might help. It didn't; we are stuck on the same issue. 

To answer your question: on the recent OSPD (tigris01), all I changed was discoverd.ipxe, as listed in comment #25. Attached are the http logs from tigris01.

Comment 28 Lucas Alvares Gomes 2016-03-23 15:39:05 UTC
Hi @Tzach,

Since the bug is now targeting OSP8, can we re-test this bug on it?

Comment 29 Tzach Shefi 2016-03-24 11:35:20 UTC
Hi Lucas, I installed OSPD8 today, versions below: 
openstack-tripleo-0.0.7-1.el7ost.noarch
openstack-tripleo-common-0.3.0-3.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.5-1.el7ost.noarch
openstack-tripleo-image-elements-0.9.9-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.12-2.el7ost.noarch
python-tripleoclient-0.3.1-1.el7ost.noarch
openstack-tripleo-heat-templates-0.8.12-2.el7ost.noarch

I installed OSPD8 on the Tigris02 server (AMD) and attempted to introspect the Tigris01 (AMD) and Gizmo (Intel) servers. 
This way I can test both versions: boot Tigris01 from disk to check OSPD7, or boot Tigris02 from disk to check OSPD8. Don't worry, I fixed the boot order and instack.json files as needed. 

Anyway, during the introspection process I hit a new bug: 
https://bugzilla.redhat.com/show_bug.cgi?id=1320962

I can't say whether OSPD8 resolved the current AMD issue or not, as introspection didn't finish. On the new bug I posted a screenshot of a node; it happens to be Gizmo, but Tigris01 looks the same. If this step is past our DRACUT point, then OSPD8 might have resolved this bug. 

Without completing introspection I can't say for sure.

Comment 30 Lucas Alvares Gomes 2016-03-24 15:46:21 UTC
(In reply to Tzach Shefi from comment #29)
> Hi Lucas, installed OSPD8 today, version below: 
> openstack-tripleo-0.0.7-1.el7ost.noarch
> openstack-tripleo-common-0.3.0-3.el7ost.noarch
> openstack-tripleo-puppet-elements-0.0.5-1.el7ost.noarch
> openstack-tripleo-image-elements-0.9.9-1.el7ost.noarch
> openstack-tripleo-heat-templates-kilo-0.8.12-2.el7ost.noarch
> python-tripleoclient-0.3.1-1.el7ost.noarch
> openstack-tripleo-heat-templates-0.8.12-2.el7ost.noarch
> 

Can you check the version of the ironic-python-agent package as well?


> Installed OSPD8 on Tigris02 server (AMD), attempted to introspect Tigris01
> (AMD) and Gizmo (Intel) server. 
> This way I can update for both versions, if Tigris01 is booted from disk
> check  OSPD7 if Tigris02 is booted from disk OSPD8. Don't worry I fixed boot
> order and instack.json files as needed. 
> 
> Any way during introspection process I hit a new bug 
> https://bugzilla.redhat.com/show_bug.cgi?id=1320962
> 

This new error seems very similar to the one described in https://bugzilla.redhat.com/show_bug.cgi?id=1308981

The fix for that was merged downstream yesterday; I wonder if that would also fix this new error you are seeing there.

Comment 31 Tzach Shefi 2016-03-27 07:09:09 UTC
Ironic version, forgot to mention. 
openstack-ironic-conductor-4.2.2-4.el7ost.noarch
openstack-ironic-api-4.2.2-4.el7ost.noarch
openstack-ironic-common-4.2.2-4.el7ost.noarch
openstack-ironic-inspector-2.2.5-1.el7ost.noarch
python-ironic-inspector-client-1.2.0-6.el7ost.noarch
python-ironicclient-0.8.1-1.el7ost.noarch

Looking at bz1308981, it says fixed in:
ironic-python-agent-doc-1.1.0-8.el7ost.noarch.rpm
I don't have any such component installed; is this normal?

Comment 32 Dmitry Tantsur 2016-03-29 09:55:11 UTC
So, the bug you mention seems to be swift-related. Could you please temporarily disable storing introspection data in swift? Set "store_data" to "none" in /etc/ironic-inspector/inspector.conf, then restart openstack-ironic-inspector, then retry.

Confirming that this bug affects OSPd8 would help a lot, as the ramdisk in OSPd8 is much easier to debug.

Comment 33 Dmitry Tantsur 2016-03-29 10:51:33 UTC
Hmm, ignore me. The swift failure happens at a much later stage, when data is already received from the ramdisk. So OSPd8 does not seem to be affected by this bug.

Do you plan on getting back to OSPd7?

Comment 34 Tzach Shefi 2016-03-31 07:07:52 UTC
So I ran a fresh install on Tigris02 (AMD), and it failed to introspect. Versions: 
openstack-ironic-conductor-4.2.2-4.el7ost.noarch
openstack-ironic-api-4.2.2-4.el7ost.noarch
openstack-ironic-common-4.2.2-4.el7ost.noarch
openstack-ironic-inspector-2.2.5-2.el7ost.noarch
python-ironic-inspector-client-1.2.0-6.el7ost.noarch
python-ironicclient-0.8.1-1.el7ost.noarch
openstack-tripleo-common-0.3.1-1.el7ost.noarch
openstack-tripleo-0.0.7-1.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.5-1.el7ost.noarch
openstack-tripleo-image-elements-0.9.9-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-1.el7ost.noarch
python-tripleoclient-0.3.4-1.el7ost.noarch
openstack-tripleo-heat-templates-0.8.14-1.el7ost.noarch


Introspection didn't finish; again it looks like bug 1320962: 

[stack@localhost ~]$ openstack baremetal introspection bulk start
Setting nodes for introspection to manageable...
Starting introspection of node: a3674785-5e3a-4b27-a898-626e42210118
Starting introspection of node: b05cff45-5c2f-4694-b166-4d3b25feedcb
Waiting for introspection to finish...
Introspection for UUID b05cff45-5c2f-4694-b166-4d3b25feedcb finished with error: Unexpected exception ConnectionError during processing: ('Connection aborted.', error(111, 'ECONNREFUSED'))
Introspection didn't finish for nodes a3674785-5e3a-4b27-a898-626e42210118
Setting manageable nodes to available...
Introspection completed with errors:
b05cff45-5c2f-4694-b166-4d3b25feedcb: Unexpected exception ConnectionError during processing: ('Connection aborted.', error(111, 'ECONNREFUSED'))
[stack@localhost ~]$ Write failed: Broken pipe

I noticed some swift services were down, so I restarted them and deleted the ironic nodes.
I added the name amdcpu in the instack.json file, imported it again, and restarted introspection. 

Now I got something else: it reported finished but had errors.
http://pastebin.test.redhat.com/361236

Dmitry, let me know how you wish to proceed: should I try without using swift as the source of introspection images, or just reboot the undercloud and retry?  

BTW, I only used the AMD node, as my Intel one (the second node) is down due to HW issues.

Comment 35 Tzach Shefi 2016-03-31 07:09:38 UTC
Created attachment 1142082 [details]
Logs for #34

Comment 36 Tzach Shefi 2016-03-31 07:39:39 UTC
I've now updated ironic-inspector ("store_data" set to "none") as per #32. 
I deleted the node and re-imported it; the output looks the same as the pastebin from #34. 
I've saved it in a new pastebin just in case: 
http://pastebin.test.redhat.com/361246

The node status is available, but what about these errors along the way?

Comment 37 Dmitry Tantsur 2016-03-31 10:11:54 UTC
These errors are actually warnings, dunno why they get displayed to you. Introspection actually finished successfully for you. Mind reporting a new bugzilla for these scary warnings? And let's get back to OSPd7 if you feel like it, as it seems OSPd8 is not affected.

Comment 38 Tzach Shefi 2016-04-03 05:53:50 UTC
Opened a bug for OSPD8 and its warnings:
https://bugzilla.redhat.com/show_bug.cgi?id=1323444

I'll reinstall OSPD7 and report once I've got it up.

Comment 39 Tzach Shefi 2016-04-04 12:43:52 UTC
Installed OSPD7 from scratch, and it's looking better. I had it introspect the same AMD hardware, and this time it worked with no DRACUT. It gave some warnings, see below:
release 7-director   -p 2016-03-09.1
rhos-release 7   -p 2016-03-24.2

Do we still need this deployment? Can I reuse the hardware, or do you wish to further debug/check this?


[stack@localhost ~]$ openstack baremetal introspection bulk start
Setting available nodes to manageable...
Starting introspection of node: 245baee0-f4c9-4385-87e4-006cc5fc2dd5
Waiting for discovery to finish...
Discovery for UUID 245baee0-f4c9-4385-87e4-006cc5fc2dd5 finished successfully.
Setting manageable nodes to available...
WARNING: ironicclient.common.http Request returned failure status.
WARNING: ironicclient.common.http Error contacting Ironic server: Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 1 of 61
WARNING: ironicclient.common.http Request returned failure status.
WARNING: ironicclient.common.http Error contacting Ironic server: Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 2 of 61
WARNING: ironicclient.common.http Request returned failure status.
WARNING: ironicclient.common.http Error contacting Ironic server: Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 3 of 61
WARNING: ironicclient.common.http Request returned failure status.
WARNING: ironicclient.common.http Error contacting Ironic server: Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 4 of 61
WARNING: ironicclient.common.http Request returned failure status.
WARNING: ironicclient.common.http Error contacting Ironic server: Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 5 of 61
WARNING: ironicclient.common.http Request returned failure status.
WARNING: ironicclient.common.http Error contacting Ironic server: Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 6 of 61
WARNING: ironicclient.common.http Request returned failure status.
WARNING: ironicclient.common.http Error contacting Ironic server: Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 7 of 61
Node 245baee0-f4c9-4385-87e4-006cc5fc2dd5 has been set to available.
Discovery completed.
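
The "Attempt N of 61" lines above look like a client-side retry loop around an HTTP 409 (node locked) response rather than a hard failure. The Python sketch below illustrates that general pattern only; it is not ironicclient's actual code. NodeLockedError, call_with_retries and the 2-second delay are assumptions for illustration, and only the limit of 61 attempts is taken from the log.

```python
import time


class NodeLockedError(Exception):
    """Hypothetical stand-in for an HTTP 409 'node is locked' response."""


def call_with_retries(operation, max_attempts=61, delay_seconds=2.0,
                      sleep=time.sleep):
    """Call operation(), retrying while it raises NodeLockedError.

    Each failed attempt is logged as a warning; the last failure is
    re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except NodeLockedError as exc:
            if attempt == max_attempts:
                raise
            print("WARNING: {} Attempt {} of {}".format(
                exc, attempt, max_attempts))
            sleep(delay_seconds)


# Demo: the node stays locked for 7 calls, as in the output above,
# and the 8th call succeeds.
calls = {"count": 0}


def set_node_available():
    calls["count"] += 1
    if calls["count"] <= 7:
        raise NodeLockedError("Node is locked by host localhost.localdomain.")
    return "available"


# The sleep is replaced with a no-op so the demo finishes immediately.
result = call_with_retries(set_node_available, sleep=lambda seconds: None)
print(result, calls["count"])
```

Read this way, seven warnings followed by "has been set to available" is a retry that eventually succeeded.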

Comment 40 Tzach Shefi 2016-04-04 12:47:45 UTC
Created attachment 1143350 [details]
Logs for #39

Comment 42 Dan Yocum 2016-04-04 15:57:10 UTC
Disable pxe boot on all NICs that are not on the provisioning network, and try again.

ggillies discovered this on the OS1 Public Prime cloud a while back where I've got the Intel X520 cards, too.

Comment 43 Dan Yocum 2016-04-04 16:07:29 UTC
Hmmm... my previous comment may be a red herring (but it certainly can't hurt).

Tzach, see this bug regarding the "locked by host" errors: https://bugzilla.redhat.com/show_bug.cgi?id=1232997

Supposedly this was fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1233452, but it's rearing its ugly head, again - I'm experiencing this error, too.

Comment 44 Tzach Shefi 2016-04-06 09:22:34 UTC
I'm PXE booting off a 1G NIC for this, not the Intel X520 (I had done so in #39 as well). I don't want to disable PXE booting on the Intel 10G NIC; it's a pain in the butt to enable/disable it in Intel's boot util. 
Plus, I need it to PXE boot from the 10G NIC once done with this bug. 

I did, however, disconnect the 10G cables this time and introspected again: same results as comment #39. I'm not sure disconnecting cables is 100% equivalent to disabling PXE on the 10G NICs, but it's easy to do/undo.

Let me know if I should check/test anything else.

Comment 45 Dmitry Tantsur 2016-04-18 08:01:05 UTC
> I did however disconnect 10G cables this time, introspected again, same results as comment #39

So, does it mean that with the 2nd NIC disabled, introspection actually works, but gives away scary warnings?

Comment 46 Dan Yocum 2016-04-21 14:58:50 UTC
I can verify that the following ramdisk images allow introspection to complete successfully on Dell R630 and R730xd systems with Intel X520 i350 nics:

[root@ops2 ~]# rpm -qa | grep director-images
rhosp-director-images-ipa-8.0-20160415.1.el7ost.noarch
rhosp-director-images-8.0-20160415.1.el7ost.noarch

Comment 48 Dmitry Tantsur 2016-04-22 12:02:44 UTC
Ok, we can assume OSPd8 works. What about OSPd7? I'm trying to clarify where we are now. So introspection does work if only one NIC is left enabled, right? Or only if non-X520 NIC is enabled?

Comment 49 Tzach Shefi 2016-04-24 05:56:39 UTC
OSPd7 introspected OK with the onboard 1G NIC PXE enabled. 
The 10G Intel X520's cable was physically disconnected during the process (but PXE was not disabled on it). 
I'm guessing a disconnected 10G NIC is equivalent to disabling PXE on it.  
So yes, introspection worked with only one NIC PXE enabled. 
It still spat out the warnings from #39, but completed OK.

Comment 50 Bob Fournier 2017-09-03 19:15:37 UTC
As it appears there is a workaround for OSP-7, we'll close this one unless there are objections.