Bug 1468557 - Discovery KExec does not work with Atomic Host 7
Summary: Discovery KExec does not work with Atomic Host 7
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Discovery Image
Version: 6.2.10
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Released
Assignee: Lukas Zapletal
QA Contact: Roman Plevka
URL: https://projects.theforeman.org/issue...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-07-07 12:06 UTC by Mihir Lele
Modified: 2019-10-07 17:19 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-14 12:36:19 UTC


Attachments
screengrab1 (12.57 KB, image/png), 2017-07-07 12:14 UTC, Mihir Lele


Links
Red Hat Product Errata RHSA-2019:1222 (last updated 2019-05-14 12:36:30 UTC)
Foreman Issue Tracker 25101 (last updated 2018-10-02 11:00:22 UTC)

Description Mihir Lele 2017-07-07 12:06:24 UTC
Description of problem:

PXE-less provisioning through the Foreman Discovery Image (FDI) fails for "Red Hat Enterprise Linux Atomic Host 7".

Version-Release number of selected component (if applicable): 6.2.10


How reproducible: Always


Steps to Reproduce:
1. Discover a host using FDI. Image used by me and the customer: foreman-discovery-image-3.1.1-22.iso
2. Go to the discovered host, fill in the host profile, and click "Submit".


Actual results:

Provisioning is not initiated, and the host displays the "Discovery Status" page again. This page is normally shown after the host sends its facts to the Satellite.

Discovery Status page says:

"Status: N/A (use Status to update)"

Expected results:

The host should be put into build mode when we click "Submit".

Additional info:

1) This issue is not observed with the RHEL base OS (it worked for me with RHEL 7.3).
2) I am attaching the screenshot of the "Discovery Status" from the host

Comment 1 Mihir Lele 2017-07-07 12:14:38 UTC
Created attachment 1295279
screengrab1

Comment 5 Lukas Zapletal 2017-07-18 16:06:04 UTC
Yeah. Mihir, are you using the ONDEMAND download policy? There is a bug in Katello (not sure whether it was fixed in Satellite) where it does not correctly download kickstarts with the ONDEMAND policy.

Comment 6 Mihir Lele 2017-07-18 17:45:20 UTC
Lukas, 

I just logged in and cross-checked the download policy: it's "Immediate". (I am not sure what the download policy is on the customer's end. I can check that and confirm.)
And like I said, I am able to provision with the same kickstart using PXE. The issue is only observed when provisioning through FDI.

Comment 7 Lukas Zapletal 2017-07-21 15:25:38 UTC
I am able to reproduce this. Inspecting the kexec JSON shows that it has the correct info:

==> /var/log/foreman/production.log <==
2017-07-21 11:22:41 049a3bbd [app] [I] KEXEC JSON: {
 |   "kernel": "http://older.home.lan/pulp/repos/MyOrg/Library/content/dist/rhel/atomic/7/7Server/x86_64/kickstart//images/pxeboot/vmlinuz",
 |   "initram": "http://older.home.lan/pulp/repos/MyOrg/Library/content/dist/rhel/atomic/7/7Server/x86_64/kickstart//images/pxeboot/initrd.img",
 |   "append": "ks=http://older.home.lan:8000/unattended/provision?token=b51cbe25-8a2f-4ecf-adc4-389a900e63ed&static=yes inst.ks.sendmac ip=192.168.99.7::192.168.99.1:255.255.255.0:::none nameserver=192.168.162.1 ksdevice=bootif BOOTIF=00-52-54-00-e6-13-01 "
 | }
 |
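Rather than checking this payload by eye, you can check it with a short script. Below is a minimal Ruby sketch. It assumes the payload has the three fields seen in the log above (`kernel`, `initram`, `append`). The helper name `check_kexec_payload` is hypothetical and is not part of foreman_discovery.

```ruby
require "json"
require "uri"

# Hypothetical helper (not part of foreman_discovery): sanity-checks a
# KEXEC JSON payload as logged in production.log. Returns a list of
# problems; an empty list means the payload looks sane.
def check_kexec_payload(json_text)
  payload = JSON.parse(json_text)
  problems = []

  # All three fields seen in the log excerpt must be present and non-empty.
  %w[kernel initram append].each do |key|
    problems << "missing #{key}" if payload[key].to_s.strip.empty?
  end

  # Kernel and initramdisk are downloaded over HTTP(S) by the discovered host.
  %w[kernel initram].each do |key|
    url = payload[key].to_s
    next if url.empty?
    problems << "#{key} is not an HTTP(S) URL" unless URI.parse(url).is_a?(URI::HTTP)
  end

  # Anaconda receives the kickstart location as a ks= kernel argument.
  unless payload["append"].to_s.split.any? { |arg| arg.start_with?("ks=") }
    problems << "no ks= argument in append line"
  end

  problems
end
```

An empty result only means the payload is well-formed. It does not prove that the URLs are reachable from the discovered host.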

Comment 8 Lukas Zapletal 2017-07-21 15:38:21 UTC
I am able to kexec into the Atomic Anaconda installer manually:

wget http://older.home.lan/pulp/repos/MyOrg/Library/content/dist/rhel/atomic/7/7Server/x86_64/kickstart//images/pxeboot/vmlinuz

wget http://older.home.lan/pulp/repos/MyOrg/Library/content/dist/rhel/atomic/7/7Server/x86_64/kickstart//images/pxeboot/initrd.img

kexec -d --initrd=initrd.img --append="ks=http://older.home.lan:8000/unattended/provision?token=b51cbe25-8a2f-4ecf-adc4-389a900e63ed&static=yes inst.ks.sendmac ip=192.168.99.7::192.168.99.1:255.255.255.0:::none nameserver=192.168.162.1 ksdevice=bootif BOOTIF=00-52-54-00-e6-13-01" vmlinuz

It fails to download the kickstart, but that is probably a misconfiguration in my environment.
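When replaying the kexec by hand, the append line must reach kexec as a single argument. The ks= URL contains `&`, which an unquoted shell would treat as the background operator. Below is a minimal Ruby sketch. It assumes the files were saved as `vmlinuz` and `initrd.img`, as in the wget commands above. `kexec_argv` and `printable_command` are hypothetical helpers.

```ruby
require "shellwords"

# Hypothetical helper: builds the same kexec invocation as the manual test
# above from a parsed KEXEC JSON payload (a Hash). The append line becomes a
# single argv element, so the "&" in the ks= URL needs no shell quoting.
def kexec_argv(payload, kernel_file: "vmlinuz", initrd_file: "initrd.img")
  [
    "kexec", "-d",
    "--initrd=#{initrd_file}",
    "--append=#{payload.fetch('append').strip}",
    kernel_file
  ]
end

# Renders the argv as one shell-safe line, e.g. for pasting into a bug report.
def printable_command(argv)
  Shellwords.join(argv)
end

# Usage (runs kexec directly, without a shell):
#   system(*kexec_argv(payload))
```

Passing several arguments to `system` avoids the shell entirely. `printable_command` is only there to produce a copy-pasteable line in which `&` is backslash-escaped.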

Anyway, modify the file

/opt/theforeman/tfm/root/usr/share/gems/gems/foreman_discovery-5.0.0.9/app/models/host/managed_extensions.rb

and put this line

    Foreman::Logging.logger('app').info "KEXEC JSON: #{json}"

just before this one (line 57):

    old.becomes(Host::Discovered).kexec json

Restart httpd. The KEXEC JSON entry will then appear in the production log, where you can check the URLs and the append line. Please paste it here.
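Once the log line is in place, you can pull the entry out of production.log with a short script. Below is a minimal Ruby sketch. It assumes the multi-line format shown in comment 7, where continuation lines start with ` |`. `kexec_entries` is a hypothetical helper and is not part of Foreman.

```ruby
require "json"

KEXEC_MARKER = "KEXEC JSON: "

# Hypothetical helper: collects every "KEXEC JSON:" entry from Foreman's
# production.log. Continuation lines of a multi-line message start with " |"
# (as in the excerpt in comment 7). That prefix is stripped and the collected
# text is parsed as JSON. Raises JSON::ParserError on a truncated entry.
def kexec_entries(log_lines)
  entries = []
  buffer = nil
  log_lines.each do |raw|
    line = raw.chomp
    if buffer && line.start_with?(" |")
      buffer << line.sub(/\A \|/, "") << "\n"   # continuation of current entry
      next
    end
    entries << JSON.parse(buffer) if buffer     # previous entry has ended
    buffer = nil
    idx = line.index(KEXEC_MARKER)
    buffer = line[(idx + KEXEC_MARKER.length)..-1] + "\n" if idx
  end
  entries << JSON.parse(buffer) if buffer       # entry at the end of the file
  entries
end

# Usage: the most recent payload in the log
#   kexec_entries(File.foreach("/var/log/foreman/production.log")).last
```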

Comment 10 Lukas Zapletal 2017-08-09 08:09:43 UTC
Need more info: I need to see the KEXEC JSON (see comment 8).

Comment 12 Lukas Zapletal 2017-08-10 12:56:28 UTC
The "ampersand" bug mentioned in the case was found in the Anaconda installer in the RHEL 7.4 beta. It should be fixed in anaconda-21.48.22.112-1, so check the Anaconda version when the Atomic installer boots up. Here is the bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1443485

Anyway, with FDI 3.1.1 I see the same behavior: a reboot.

With FDI 3.4.1 (the latest), it just freezes in a libvirt VM.

But when I add "nomodeset" to the Red Hat Kexec template (this was fixed in 6.3 but not in 6.2), FDI 3.4.1 works fine.

https://github.com/theforeman/foreman_discovery/pull/272/files

That's the first problem. Second, the network's boot mode must be set to Static instead of DHCP for kexec to work correctly there. Alternatively, the template must be edited to provide a static NIC configuration.

I am now able to boot Anaconda correctly, though it errors out with a dracut timeout ("starting timeout scripts").

Comment 13 Lukas Zapletal 2017-08-10 13:28:12 UTC
Scratch my note about the DHCP boot mode; there is a static flag in the KS URL.

Anyway, I can get past the stuck kexec with the "nomodeset" option, but Anaconda still won't continue and errors out with a timeout error. Not sure why. Can you try to reproduce this?
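The static flag in the KS URL can be read directly from the append line. Below is a minimal Ruby sketch. It assumes the ks= argument is a single whitespace-free token, as in the KEXEC JSON from comment 7. `kickstart_params` and `static_kickstart?` are hypothetical helpers. Both `token` and `static` are only recovered if the `&` between them survives intact (compare the "ampersand" issue in comment 12).

```ruby
require "uri"

# Hypothetical helper: finds the ks= argument in a kexec append line and
# returns the kickstart URL's query parameters as a Hash.
# Returns {} when there is no ks= argument.
def kickstart_params(append_line)
  ks_arg = append_line.split.find { |arg| arg.start_with?("ks=") }
  return {} unless ks_arg

  uri = URI.parse(ks_arg.sub(/\Aks=/, ""))
  URI.decode_www_form(uri.query.to_s).to_h
end

# True when the kickstart URL carries static=yes.
def static_kickstart?(append_line)
  kickstart_params(append_line)["static"] == "yes"
end
```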

Comment 14 Mihir Lele 2017-11-12 16:00:12 UTC
My apologies for the delayed response.

I tried fdi-3.4.2 with the default kexec template as well as with the one in https://github.com/theforeman/foreman_discovery/pull/272/files.

But it didn't work for me.

Comment 20 Lukas Zapletal 2018-01-16 16:49:55 UTC
Please locate the smart_proxy_discovery_image/power_api.rb file on the FDI image (you need a shell and GNU find) and patch it:

https://gist.github.com/lzap/47a34e2e7301f2b02245c0b6bca43ecb

Then restart the proxy:

systemctl restart foreman-proxy

Then switch to console tty2 or run journalctl -f, and perform the KExec from Satellite. It should now print the MD5 sums of the kernel and initramdisk and wait 90 seconds before running the actual kexec command.

This is to rule out transmission errors.
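You can compare the printed sums against the files published on the Satellite. Below is a minimal Ruby sketch of that comparison. It assumes the kernel and initramdisk were saved to local files and that reference sums were taken from the same files on the Satellite (for example with md5sum). `md5_report` and `mismatches` are hypothetical helpers.

```ruby
require "digest/md5"

# Hypothetical helper: MD5 sum (hex) of each given file, keyed by path.
def md5_report(paths)
  paths.map { |path| [path, Digest::MD5.file(path).hexdigest] }.to_h
end

# Hypothetical helper: paths whose computed sum differs from the reference
# sum. A non-empty result points at a transmission problem.
def mismatches(actual, expected)
  expected.reject { |path, sum| actual[path] == sum }.keys
end

# Usage:
#   report = md5_report(%w[vmlinuz initrd.img])
#   mismatches(report, { "vmlinuz" => "<sum from Satellite>" })
```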

Also, please try the 3.4.3-1 discovery image, which will be part of the 6.3 and 6.2.14 versions. It includes some logging improvements as well.

Try to record the case with a screen recorder; I am struggling to reproduce it. I have seen this once, but now I am unable to. Thanks.

Comment 37 Dan Stock 2018-05-25 08:06:11 UTC
I've been following this bug for a while, because we have the same problem. 

Last week we updated to Satellite 6.3.1. One of the problems that we hoped would be solved with the update is this one. 

I've double-checked all the configuration and recommendations for the installation.
The current version in use is:
   foreman-discovery-image-3.4.4-1.el7sat.noarch

I can say that the installation still stops with the same status error:
  Status: N/A (use Status to update)
  

Could anyone give me information about what is coming?
Are there any other solutions? I've thought about setting up a new segment that uses PXE, even though we use FDI for all our other installations and it would be a big change to our infrastructure.

Comment 38 Lukas Zapletal 2018-05-25 08:47:42 UTC
Dan, there has been a lot of private conversation. In short, kickstart repositories are expected to have a .treeinfo file, which specifies a few things, most importantly:

[stage2]
mainimage = LiveOS/squashfs.img

This entry is expected to be present; the example above is from the RHEL Server repo. The problem is that it is currently missing from the Atomic 7 repos, so Anaconda won't load stage 2.
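To see whether a given kickstart tree has this entry, you can parse its .treeinfo. Below is a minimal Ruby sketch. It handles only the simple `key = value` INI layout shown above. `stage2_image` is a hypothetical helper.

```ruby
# Hypothetical helper: returns the stage 2 image path from the [stage2]
# section of a .treeinfo file, or nil if the section or the mainimage
# key is missing.
def stage2_image(treeinfo_text)
  section = nil
  treeinfo_text.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?("#", ";")   # blank or comment
    if line =~ /\A\[(.+)\]\z/
      section = Regexp.last_match(1).strip               # entering a new section
    elsif section == "stage2"
      key, value = line.split("=", 2).map(&:strip)
      return value if key == "mainimage" && value && !value.empty?
    end
  end
  nil
end
```

As an example, you could pass the result of `Net::HTTP.get(URI("http://older.home.lan/pulp/repos/MyOrg/Library/content/dist/rhel/atomic/7/7Server/x86_64/kickstart/.treeinfo"))` to this helper. That requires `require "net/http"` and assumes the .treeinfo sits at the root of the kickstart tree.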

It looks like the Atomic team requires users to put this on the kernel command line:

https://github.com/projectatomic/docs-projectatomic/blob/master/attic/atomic-guide/installation_and_configuration_guide/content/pxe_installation.adoc

But at some point, the downstream documentation was merged with the RHEL documentation, where this is no longer required. We are still investigating this.

WORKAROUND:

Add 

inst.stage2=http://hostname/rhel/server/x.y/os/

to your KExec template (or PXE template). You can use ERB templating to find the path in a generic way (untested):

inst.stage2=<%= @host.operatingsystem.medium_uri(@host) %>
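To see what that line expands to, here is a minimal Ruby sketch that renders it with plain ERB. `StubOS`, `StubHost`, `Stage2Line` and `with_stage2` are hypothetical stand-ins. Inside Satellite, Foreman renders the template itself and `medium_uri` comes from the real operating system model. The actual value therefore depends on the installation medium configured for the host.

```ruby
require "erb"

# Hypothetical stand-ins for Foreman's host and operating system objects,
# just enough to render the workaround line outside Foreman.
StubOS = Struct.new(:medium) do
  # Mirrors the call used in the template; the host argument is ignored here.
  def medium_uri(_host)
    medium
  end
end
StubHost = Struct.new(:operatingsystem)

# Renders the workaround line with @host bound, as the template expects.
class Stage2Line
  TEMPLATE = ERB.new("inst.stage2=<%= @host.operatingsystem.medium_uri(@host) %>")

  def initialize(host)
    @host = host
  end

  def render
    TEMPLATE.result(binding)
  end
end

# Appends the inst.stage2= argument unless one is already present.
# Assumes kernel arguments are whitespace-separated and contain no quoted spaces.
def with_stage2(append_line, stage2_arg)
  args = append_line.split
  return append_line.strip if args.any? { |arg| arg.start_with?("inst.stage2=") }
  (args + [stage2_arg]).join(" ")
end
```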

This is taking so long because several teams are involved: RCM (release engineering), Satellite engineering, Atomic engineering and docs. I will post an update as soon as I know where we are going to fix this problem (RCM/CDN, Satellite template, RHEL docs, or a combination).

Comment 43 Roman Plevka 2018-10-02 09:22:59 UTC
@lzap
The issue persists.

Adding inst.stage2=<%= @host.operatingsystem.medium_uri(@host) %>
works great.

- I'm wondering what to do with the BZ, since it is not really a Satellite issue.
- Can you file a docs BZ and close this one?

Comment 44 Lukas Zapletal 2018-10-02 09:47:47 UTC
Roman, can you verify then? Here is the article:

https://access.redhat.com/solutions/3635501

Comment 45 Lukas Zapletal 2018-10-02 10:06:26 UTC
We have an easy workaround and a KBase article in place, and this has been put on the backlog. The patch is upstream and pending review.

Comment 46 Roman Plevka 2018-10-02 12:27:26 UTC
(In reply to Lukas Zapletal from comment #44)
> Roman, can you verify then? Here is the article:
> 
> https://access.redhat.com/solutions/3635501

ACK.
I'm treating the kbase article as a resolution of this bug and putting the bug to VERIFIED.

Comment 53 errata-xmlrpc 2019-05-14 12:36:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:1222

