Bug 1252068 - HA deployment failed with empty templates and YAML files for the 'controller' nodes.
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: ruby193-rubygem-staypuft
Version: unspecified
Hardware/OS: x86_64 Linux
Priority: urgent  Severity: high
Target Milestone: z4
Target Release: Installer
Assigned To: Mike Burns
QA Contact: Omri Hochman
Keywords: TestBlocker, ZStream
Depends On:
Blocks:
Reported: 2015-08-10 11:48 EDT by Omri Hochman
Modified: 2016-02-15 07:35 EST
CC: 10 users

Doc Type: Bug Fix
Last Closed: 2015-08-17 13:17:19 EDT
Type: Bug

Attachments
production.logh (1.53 MB, text/plain)
2015-08-10 11:48 EDT, Omri Hochman

Description Omri Hochman 2015-08-10 11:48:54 EDT
Created attachment 1061140 [details]
production.logh

HA deployment failed with empty templates and YAML files for the 'controller' nodes.

Environment: 
-------------
openstack-puppet-modules-2014.2.15-3.el7ost.noarch
puppet-3.6.2-2.el7.noarch
puppet-server-3.6.2-2.el7.noarch
ruby193-rubygem-staypuft-0.5.25-2.el7ost.noarch


description:
-------------
I attempted to deploy HA Neutron with 3x controllers / 2x computes. The deployment failed because Puppet on the controllers was unable to retrieve its node definition and catalog (see the messages below). When I entered the Staypuft UI and browsed to the controller hosts (https://10.8.30.99/hosts/5):

(A) the Template sub-tab of the controller nodes was empty!
(B) clicking on the YAML of the controller nodes returned an empty result and showed:
"Unable to generate output, Check log files\n"


logs: 
-----

production.log (attached):
---------------------------
/opt/rh/ruby193/root/usr/share/gems/gems/rack-1.4.1/lib/rack/urlmap.rb:49:in `call'
/usr/share/gems/gems/passenger-4.0.18/lib/phusion_passenger/rack/thread_handler_extension.rb:77:in `process_request'
/usr/share/gems/gems/passenger-4.0.18/lib/phusion_passenger/request_handler/thread_handler.rb:140:in `accept_and_process_next_request'
/usr/share/gems/gems/passenger-4.0.18/lib/phusion_passenger/request_handler/thread_handler.rb:108:in `main_loop'
/usr/share/gems/gems/passenger-4.0.18/lib/phusion_passenger/request_handler.rb:441:in `block (3 levels) in start_threads'
Failed to generate external nodes for maca25400702877.example.com with undefined method 'split' for NilClass::Jail (NilClass)
  Rendered text template (0.0ms)
Completed 412 Precondition Failed in 941ms (Views: 0.5ms | ActiveRecord: 73.2ms)
Imported report for maca25400702875.example.com in 0.09 seconds
Completed 201 Created in 102ms (Views: 1.1ms | ActiveRecord: 0.0ms)


Started POST "/api/reports" for 10.8.30.99 at 2015-08-10 11:04:06 -0400
Processing by Api::V2::ReportsController#create as JSON
  Parameters: {"report"=>"[FILTERED]", "apiv"=>"v2"}
processing report for maca25400702877.example.com
Imported report for maca25400702877.example.com in 0.03 seconds
Completed 201 Created in 36ms (Views: 1.0ms | ActiveRecord: 0.0ms)
ERF42-0992 [Staypuft::Exception]: Latest Puppet run contains failures for host: 4 (Staypuft::Exception)



messages from one of the controllers:
-------------------------------------
Aug 10 11:02:37 maca25400702876 puppet-agent[12089]: Unable to fetch my node definition, but the agent run will continue:
Aug 10 11:02:37 maca25400702876 puppet-agent[12089]: Error 400 on SERVER: Failed to find maca25400702876.example.com via exec: Execution of '/etc/puppet/node.rb maca25400702876.example.com' returned 1:
Aug 10 11:02:59 maca25400702876 puppet-agent[12089]: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node maca25400702876.example.com: Failed to find maca25400702876.example.com via exec: Execution of '/etc/puppet/node.rb maca25400702876.example.com' returned 1:
Aug 10 11:02:59 maca25400702876 puppet-agent[12089]: Not using cache on failed catalog
Aug 10 11:02:59 maca25400702876 puppet-agent[12089]: Could not retrieve catalog; skipping run
Aug 10 11:03:00 maca25400702876 systemd-logind: Removed session 3.
Aug 10 11:03:04 maca25400702876 systemd: Starting Session 4 of user root.
Aug 10 11:03:04 maca25400702876 systemd: Started Session 4 of user root.
Aug 10 11:03:04 maca25400702876 systemd-logind: New session 4 of user root.
Aug 10 11:03:37 maca25400702876 dhclient[1121]: DHCPREQUEST on eth0 to 192.168.0.1 port 67 (xid=0x6b2aece9)
Aug 10 11:03:37 maca25400702876 dhclient[1121]: DHCPACK from 192.168.0.1 (xid=0x6b2aece9)
Aug 10 11:03:39 maca25400702876 dhclient[1121]: bound to 192.168.0.8 -- renewal in 231 seconds.
Aug 10 11:03:42 maca25400702876 puppet-agent[12360]: Unable to fetch my node definition, but the agent run will continue:
Aug 10 11:03:42 maca25400702876 puppet-agent[12360]: Error 400 on SERVER: Failed to find maca25400702876.example.com via exec: Execution of '/etc/puppet/node.rb maca25400702876.example.com' returned 1:
Aug 10 11:04:05 maca25400702876 puppet-agent[12360]: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node maca25400702876.example.com: Failed to find maca25400702876.example.com via exec: Execution of '/etc/puppet/node.rb maca25400702876.example.com' returned 1:
Aug 10 11:04:05 maca25400702876 puppet-agent[12360]: Not using cache on failed catalog
Aug 10 11:04:05 maca25400702876 puppet-agent[12360]: Could not retrieve catalog; skipping run
Comment 3 Alexander Chuzhoy 2015-08-10 11:53:26 EDT
Reproduced on another setup deploying HA Nova.
The issue doesn't always reproduce; I was able to deploy on a subsequent attempt.
Comment 4 Scott Seago 2015-08-10 12:16:33 EDT
"undefined method 'split' for NilClass::Jail (NilClass)"

It sounds like a variable in the template was nil when it was supposed to hold a string. This is the usual symptom in Ruby when a variable was empty and it was expected to hold an actual object of some sort.
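
For illustration, a minimal Ruby sketch (not taken from the Staypuft code) of how a missing parameter produces this class of error:

params = { 'some-other-param' => 'value' }   # 'ntp-servers' was never defined for this host
value  = params['ntp-servers']               # => nil
value.split(',')                             # NoMethodError: undefined method `split' for nil:NilClass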
Comment 7 Alexander Chuzhoy 2015-08-10 13:55:47 EDT
When the deployment starts, the template and the YAML appear correctly.
After the OS is installed and the next step of the deployment commences, things go wrong.
Comment 8 Mike Burns 2015-08-10 18:57:14 EDT
(In reply to Scott Seago from comment #4)
> "undefined method 'split' for NilClass::Jail (NilClass)"
> 
> It sounds like a variable in the template was nil when it was supposed to
> hold a string. This is the usual symptom in Ruby when a variable was empty
> and it was expected to hold an actual object of some sort.

This is because we call .split on the ntp-servers parameter.  .split is not allowed in the UI.  It *works*, just causes a UI error.  This is *not* causing a problem.
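
For reference, the parameter value being described is an ERB snippet of roughly this shape (the exact string shows up later in comment 14):

<%= @host.params['ntp-servers'].split(',') %>

During deployment this renders fine as long as 'ntp-servers' is set; the UI preview runs it in safe mode, where the .split call is rejected, which is what produces the harmless UI error.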

(In reply to Alexander Chuzhoy from comment #7)
> When the deployment starts the template and the yaml appear correctly.
> After the OS is installed and the next step of the deployment commences -
> things go wrong.

If it's showing up right after deployment, then something else is breaking; that's a different issue. What is breaking there? It's not the same error, since a SafeMode error will only occur in the UI and not during deployment.


Based on comment 7, this is really NOTABUG, but I'll wait on the answer to the question before closing it.
Comment 9 Alexander Chuzhoy 2015-08-11 09:55:08 EDT
The problem shows up after the deployment reaches 25%, roughly ~10 minutes after the deployment starts.
So it's when Puppet attempts to run that the templates and the YAML vanish from the UI for the controllers, although they existed until that point - double checked.
Comment 10 Mike Burns 2015-08-11 13:36:35 EDT
Ohad, any chance you can spare someone to look at this? It's a somewhat high-priority bug that seems completely outside the stuff we're doing in Staypuft, and we're short on anyone with Foreman knowledge at the moment.
Comment 11 Daniel Lobato Garcia 2015-08-13 06:11:10 EDT
Alex, what happens between 0% and 25% exactly on the Foreman side? It surprises me that the YAML would become unavailable, but it might be because either the token expired (set its duration under Administer -> Settings -> Provisioning -> token_duration) or the hosts got built during that period and the templates are no longer available at that URL after the host is built?
Comment 12 Alexander Chuzhoy 2015-08-13 09:23:20 EDT
Daniel, the "token_duration" is set to 60 minutes (the default). The problem appears sooner than that.

The reported issue is limited only to the controllers, i.e. the computes have the template/yaml as expected.

I can restart a deployment; which log(s) would you be interested in?
Comment 13 Ohad Levy 2015-08-13 09:43:32 EDT
Daniel, would you be able to provide a patch to identify which parameter can't evaluate the ERB block? Don't forget they are using a pretty old Foreman version as well.
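
In the meantime, one way to narrow down the offending parameter without a patch, sketched here under the assumption that the Rails console is reachable on the Staypuft host (e.g. via foreman-rake console) and that Host#params returns the merged parameter hash, is to list the parameters that resolve to nil for an affected node (hostname taken from the logs above):

host = Host.find_by_name('maca25400702876.example.com')
host.params.each { |name, value| puts name if value.nil? }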
Comment 14 Daniel Lobato Garcia 2015-08-17 12:56:09 EDT
I think the issue should finally be closed. I've been debugging this by examining a broken deployment.

What I did was go to the failed Dynflow task, check what failed, and it seemed that Puppet failed to run on some hosts (Staypuft::Exception: ERF42-7244 [Staypuft::Exception]: Latest Puppet run contains failures for host: 6). Then, using the report_id provided in the error, I went to satellitehost.domain/reports/report_id, and it turned out that node.rb couldn't run from the puppetmaster for some nodes.

The reason the puppet master could not get a definition for some nodes was that, after running /etc/puppet/node.rb for them, Foreman would return 412 Precondition Failed. So I checked /var/log/foreman/production.log, and there was the missing ntp-servers parameter wrecking the Puppet run.


There was error rendering string: "<%={key_id:1320}; @host.params['ntp-servers'].split(',') %>"
undefined method `split' for nil:NilClass (NoMethodError)


It turns out ntp-servers gets set only for the operating system the Staypuft base host uses; in this case Staypuft was on RHEL 7.2 beta. However, the nodes were on RHEL 7.1, whose OS entry did not contain the parameter.

https://github.com/theforeman/foreman-installer-staypuft/blob/2882216eaa43e328d76de1478bab01094cfc5526/hooks/lib/provisioning_seeder.rb is the culprit

The workaround for a broken deployment is just to copy the parameters from the Staypuft OS to the OS of the OpenStack hosts, including ntp-servers. Or the error can be avoided altogether by making sure you provision the OpenStack cloud using the same RHEL version as the Staypuft host.
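
A rough sketch of that copy step, assuming Rails console access on the Staypuft host (e.g. via foreman-rake console) and the stock Operatingsystem/os_parameters associations; the RHEL versions below are placeholders for whatever the deployment actually uses:

src = Operatingsystem.where(:name => 'RedHat', :major => '7', :minor => '2').first  # OS the Staypuft base host uses
dst = Operatingsystem.where(:name => 'RedHat', :major => '7', :minor => '1').first  # OS the OpenStack nodes use
src.os_parameters.each do |p|
  next if dst.os_parameters.where(:name => p.name).exists?        # skip parameters already defined on the target OS
  dst.os_parameters.create!(:name => p.name, :value => p.value)   # copies ntp-servers along with the rest
end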
Comment 15 Mike Burns 2015-08-17 13:17:19 EDT
Based on the info in comment 14, this is due to Foreman being installed on RHEL 7.2 initially. This is not supported, so this is closed NOTABUG.
