1364079 – iPXE hangs with an infinite stream of different errors

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	ipxe
Sub Component:
Version:	9.0 (Mitaka)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	RHOS Maint
QA Contact:	Shai Revivo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-08-04 12:49 UTC by Dmitry Tantsur
Modified:	2016-09-16 03:35 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-08-05 08:05:50 UTC
Target Upstream Version:
Embargoed:

Attachments
Example of failures (218.50 KB, image/png) 2016-08-04 12:49 UTC, Dmitry Tantsur
iPXE versions as seen by the machine (178.00 KB, image/png) 2016-08-04 12:50 UTC, Dmitry Tantsur
Screenshot of OVB instance console during introspection (89.75 KB, image/png) 2016-08-04 15:16 UTC, Ronelle Landy

Description Dmitry Tantsur 2016-08-04 12:49:05 UTC

Created attachment 1187474 [details]
Example of failures

Introspection hangs on OSPd9 in my bare metal lab (PowerEdge R320). We're using the following iPXE script (generated by OSPd, stripped of kernel parameters):

 #!ipxe

 :retry_dhcp
 dhcp || goto retry_dhcp

 :retry_boot
 imgfree
 kernel --timeout 600000 http://172.21.64.1:8088/agent.kernel  initrd=agent.ramdisk || goto retry_boot
 initrd --timeout 600000 http://172.21.64.1:8088/agent.ramdisk || goto retry_boot
 boot

When I start introspection, the node gets DHCP successfully, then proceeds with iPXE (second screenshot). After downloading the iPXE scripts (from the same host, http://172.21.64.1:8088), it starts displaying three kinds of messages (first screenshot): "file not found" (which is not true), "not enough space", and "connection timed out". Note that the last error happens after it starts downloading the kernel. Even more interesting: it always breaks at one of two positions!
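
For orientation, the control flow of the script above can be modelled in a few lines of Python. This is only a sketch of how `|| goto retry_boot` behaves, not how iPXE is implemented: `fetch` is a hypothetical stand-in for the `kernel` and `initrd` downloads, and `max_attempts` exists only so that the model terminates. The real script has no such limit, which is why any persistent download failure shows up as an endless stream of errors.

```python
def boot_loop(fetch, max_attempts=5):
    """Model of the :retry_boot loop; returns the list of errors seen.

    fetch(name) is a hypothetical callable returning None on success
    or an error string on failure. max_attempts is an artificial cap
    added for this sketch only; the iPXE script retries forever.
    """
    errors = []
    for _ in range(max_attempts):
        # imgfree: anything downloaded by a previous attempt is discarded
        for image in ("agent.kernel", "agent.ramdisk"):
            error = fetch(image)
            if error is not None:
                errors.append(f"{image}: {error}")
                break  # corresponds to '|| goto retry_boot'
        else:
            return errors  # both images fetched, 'boot' would run here
    return errors
```

With a `fetch` that always fails, every attempt produces one more error line and the loop never reaches `boot`.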

Name        : ipxe-bootimgs
Arch        : noarch
Version     : 20160127
Release     : 1.git6366fa7a.el7
Size        : 3.4 M
Repo        : installed
From repo   : rhelosp-9.0-director-puddle

Comment 2 Dmitry Tantsur 2016-08-04 12:50:03 UTC

Created attachment 1187475 [details]
iPXE versions as seen by the machine

Comment 3 Ronelle Landy 2016-08-04 15:15:25 UTC

We are seeing similar errors and hangs during introspection in CI jobs (on both OVB and real bare metal hardware), but only in master jobs; mitaka jobs are passing at the moment.

The job logs show:

Introspection completed with errors:
2a296b82-47d2-446f-961a-e8f34e0b21ea: Introspection timeout
678319e5-a270-49af-ab2f-f88aa52b9f7c: Introspection timeout
ca5e6f69-32e8-4580-88ba-602685a3a2f1: Introspection timeout
ad54e211-f5ad-41cf-833a-10a556f2181a: Introspection timeout
Setting nodes for introspection to manageable...
Starting introspection of manageable nodes
Waiting for introspection to finish...
Introspection for UUID 2a296b82-47d2-446f-961a-e8f34e0b21ea finished with error: Introspection timeout
Introspection for UUID 678319e5-a270-49af-ab2f-f88aa52b9f7c finished with error: Introspection timeout
Introspection for UUID ca5e6f69-32e8-4580-88ba-602685a3a2f1 finished with error: Introspection timeout
Introspection for UUID ad54e211-f5ad-41cf-833a-10a556f2181a finished with error: Introspection timeout
No nodes in manageable state found for introspection
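
When many nodes are involved, the failed UUIDs can be pulled out of such a log mechanically. The snippet below is a minimal sketch that assumes the line format shown above ("Introspection for UUID ... finished with error: ..."); the function name `failed_nodes` is made up for illustration and is not part of any OpenStack tool.

```python
import re

# Matches the per-node result lines as they appear in the job log above.
RESULT_LINE = re.compile(
    r"Introspection for UUID (?P<uuid>[0-9a-f-]{36}) "
    r"finished with error: (?P<error>.+)"
)

def failed_nodes(log_text):
    """Map node UUID -> error message for every failed introspection."""
    failures = {}
    for line in log_text.splitlines():
        match = RESULT_LINE.search(line)
        if match:
            failures[match.group("uuid")] = match.group("error").strip()
    return failures
```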

Looking at the console, I see repeated messages: 'Connection timeout...'. Attaching a screenshot of the console.

Comment 4 Ronelle Landy 2016-08-04 15:16:32 UTC

Created attachment 1187565 [details]
Screenshot of OVB instance console during introspection

Comment 5 Dmitry Tantsur 2016-08-04 15:25:05 UTC

I'm sorry for the noise: my problem was caused by some missed steps, which resulted in missing images. Ronelle, please double-check that the agent image is actually present in /httpboot.
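
A quick way to run that check is sketched below. It assumes the file names that appear in this report (`agent.kernel`, `agent.ramdisk`, `inspector.ipxe`) and the `/httpboot` directory; `missing_boot_files` is a hypothetical helper, and it only tests for presence and non-zero size, not for readability by the HTTP server.

```python
import os

def missing_boot_files(
    httpboot_dir="/httpboot",
    required=("agent.kernel", "agent.ramdisk", "inspector.ipxe"),
):
    """Return the required files that are absent or empty in httpboot_dir."""
    missing = []
    for name in required:
        path = os.path.join(httpboot_dir, name)
        # A zero-byte file is treated as missing: it cannot be a valid image.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

An empty list means all expected files are in place.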

Comment 6 Ronelle Landy 2016-08-04 19:11:26 UTC

Dmitry, I checked the undercloud:

[stack@undercloud ~]$ ls -la /httpboot
total 339516
drwxr-xr-x.  2 ironic ironic        66 Aug  4 15:08 .
dr-xr-xr-x. 19 root   root        4096 Aug  4 14:50 ..
-rwxr-xr-x.  1 root   root     5158704 Aug  4 15:08 agent.kernel
-rw-r--r--.  1 root   root   342493754 Aug  4 15:08 agent.ramdisk
-rw-r--r--.  1 ironic ironic       465 Aug  4 14:52 inspector.ipxe

Comment 7 Ronelle Landy 2016-08-04 19:22:38 UTC

Watching the console during introspection, it gets inspector.ipxe just fine but times out on agent.kernel.

Comment 8 Ronelle Landy 2016-08-04 19:38:16 UTC

OK - I think this is due to the MTU of the interface used for provisioning being set back to 1500 during undercloud install. This overwrites the value set in the OVB setup.
Resetting the value before introspection ... under test
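
One way to spot such a reset is to read the interface MTU before and after the undercloud install. The sketch below relies on the Linux sysfs file `/sys/class/net/<interface>/mtu`; the helper names are made up for illustration, and the interface name and expected MTU must be supplied by the caller because this report does not state them.

```python
from pathlib import Path

def interface_mtu(interface, sysfs_root="/sys/class/net"):
    """Read the current MTU of a Linux network interface from sysfs."""
    return int(Path(sysfs_root, interface, "mtu").read_text().strip())

def mtu_was_reset(interface, expected_mtu, sysfs_root="/sys/class/net"):
    """True if the interface no longer has the MTU that was configured."""
    return interface_mtu(interface, sysfs_root) != expected_mtu
```

For example, `mtu_was_reset("eth1", 1400)` would flag an interface that has been put back to 1500; both `eth1` and `1400` are placeholders, not values taken from this bug.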

Comment 9 Ronelle Landy 2016-08-04 22:16:41 UTC

Confirmed it was an MTU issue.
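
A back-of-the-envelope calculation suggests why the small script could be fetched while the kernel could not. If the path can only carry packets smaller than the interface MTU and path MTU discovery does not work, full-size packets are the ones that get lost. The sketch below assumes IPv4 and TCP headers of 20 bytes each with no options and ignores HTTP response headers; the file sizes are those from the `/httpboot` listing in comment 6.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options

def tcp_payload_per_packet(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER

def full_size_packets(file_size, mtu):
    """Number of packets needed that completely fill the MTU."""
    return file_size // tcp_payload_per_packet(mtu)

# Sizes in bytes, taken from the directory listing in comment 6.
INSPECTOR_IPXE = 465
AGENT_KERNEL = 5158704

print(full_size_packets(INSPECTOR_IPXE, 1500))  # 0
print(full_size_packets(AGENT_KERNEL, 1500))    # 3533
```

Under these assumptions the script body needs no full-size packet at all, while the kernel download needs thousands of them.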

Comment 10 Dmitry Tantsur 2016-08-05 08:05:50 UTC

Thanks, so I'm closing it. Do you think we could update some documentation to mention this potential MTU issue?

Comment 11 Ronelle Landy 2016-08-05 12:47:13 UTC

The instructions to modify the MTU are documented (and I did have those modifications made). The issue is that undercloud-install overwrites them; that could possibly be documented. It's only an issue with OVB and some hardware platforms.
