Bug 1255356 - [RFE] Ability to troubleshoot node discovery
Summary: [RFE] Ability to troubleshoot node discovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ga
Target Release: 8.0 (Liberty)
Assignee: Lucas Alvares Gomes
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Duplicates: 1333026
Depends On:
Blocks: 1359230 1423474
 
Reported: 2015-08-20 11:17 UTC by David Juran
Modified: 2023-02-22 23:02 UTC (History)

Fixed In Version: python-tripleoclient-0.3.4-7.el7ost
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-18 14:32:24 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2017:1250 | 0 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 8 director Bug Fix Advisory | 2017-05-18 18:30:47 UTC

Description David Juran 2015-08-20 11:17:27 UTC
Description of problem:
If introspection hangs, there should be a way to troubleshoot the process. The ability to start a second VT on the host would be a start, but the ability to ssh into the discovery image would be nice.

Comment 3 August Simonelli 2015-08-24 22:17:40 UTC
Is there any way at the moment to get onto that discovery image?

Comment 4 Dmitry Tantsur 2015-08-25 07:07:11 UTC
This will be fixed with the move to IPA in OSP 8. The only thing we can do with our current image is to use the virtual console.

Comment 5 August Simonelli 2015-08-25 07:19:04 UTC
fair enough ... i'm having trouble getting to them ... what keys give which virtual consoles? i know it should be obvious but i'm having issues, perhaps with emulated macros ....

Comment 6 August Simonelli 2015-08-25 13:34:50 UTC
I've tried the Ctrl-Alt-F keys but nothing seems to happen. Should it?

Comment 7 Dmitry Tantsur 2015-08-25 13:37:10 UTC
If inspection fails during the ramdisk run, you can connect to the virtual console of the machine and get into a simple shell. If inspection fails after the ramdisk has run successfully, you will get some logs on the undercloud (e.g. sudo journalctl -u openstack-ironic-discoverd). Hope that helps.

Comment 8 August Simonelli 2015-08-25 13:41:12 UTC
i think it is during, but i can't get a virtual console. how do i get that?

Comment 9 Dmitry Tantsur 2015-08-25 13:44:26 UTC
It depends on your vendor. Usually you point your browser to the BMC host (ipmi_address/ilo_address/drac_address) and get to e.g. the iDRAC or iLO web UI. There you'll see an option to run a virtual console.

Comment 10 David Juran 2015-08-26 20:28:25 UTC
Dmitry, are you referring to the discovery image?
 I concur with August, I have connected to the hardware console of the machine, but I haven't found a way of getting a prompt.

Comment 11 Mike Burns 2016-04-07 20:47:27 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 12 Dmitry Tantsur 2016-04-08 11:01:52 UTC
So, this bug is something never-ending. We do have several debugging features in place for OSPd8, namely: 1. passing logs from the ramdisk in case of failures, 2. ability to pass SSH key to the ramdisk by modifying its kernel command line. I think we can call this fixed.

Comment 17 Alexander Chuzhoy 2016-07-22 15:01:19 UTC
FailedQA:
Environment:
openstack-ironic-conductor-4.2.5-2.el7ost.noarch
openstack-ironic-common-4.2.5-2.el7ost.noarch
openstack-ironic-inspector-2.2.6-1.el7ost.noarch
openstack-ironic-api-4.2.5-2.el7ost.noarch


So I followed comment #12, using http://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubleshooting-nodes.html as guidance:

1. passing logs from the ramdisk in case of failures.
The logs became available under /var/log/ironic-inspector/ramdisk after setting "always_store_ramdisk_logs = true" in /etc/ironic-inspector/inspector.conf and bouncing the openstack-ironic-inspector service.
PASS

2. ability to pass SSH key to the ramdisk by modifying its kernel command line. I think we can call this fixed.

I failed to connect to the node being introspected. I also tried to set root's password with rootpwd="<HASH>" as described in the doc. Also no luck.
FAILED.
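
For item 1 above, a minimal sketch of how one might check that the log option is really switched on before re-running introspection. It assumes the option sits in the [processing] section of inspector.conf (the section is not named in this report), and the helper name ramdisk_logs_enabled and the sample file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if always_store_ramdisk_logs is set to true
# in the ironic-inspector config file given as $1. Commented-out lines do not count.
ramdisk_logs_enabled() {
    grep -Eiq '^[[:space:]]*always_store_ramdisk_logs[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Demo against a throwaway file, not the real /etc/ironic-inspector/inspector.conf.
# The [processing] section name is an assumption.
sample=$(mktemp)
printf '%s\n' '[processing]' 'always_store_ramdisk_logs = true' > "$sample"
if ramdisk_logs_enabled "$sample"; then
    echo "ramdisk logs will always be stored"
fi
rm -f "$sample"
```

On a real undercloud the same check would be pointed at /etc/ironic-inspector/inspector.conf, and the openstack-ironic-inspector service still has to be restarted after any change, as described above.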

Comment 19 Alexander Chuzhoy 2016-08-04 01:22:34 UTC
*** Bug 1333026 has been marked as a duplicate of this bug. ***

Comment 20 Lucas Alvares Gomes 2016-09-15 13:42:13 UTC
(In reply to Alexander Chuzhoy from comment #17)
> FailedQA:
> Environment:
> openstack-ironic-conductor-4.2.5-2.el7ost.noarch
> openstack-ironic-common-4.2.5-2.el7ost.noarch
> openstack-ironic-inspector-2.2.6-1.el7ost.noarch
> openstack-ironic-api-4.2.5-2.el7ost.noarch
> 
> 
> So followed comment #12:
> Used
> http://docs.openstack.org/developer/tripleo-docs/troubleshooting/
> troubleshooting-nodes.html as guidance:
> 
> 1. passing logs from the ramdisk in case of failures.
> The logs became available under /var/log/ironic-inspector/ramdisk after
> setting "always_store_ramdisk_logs = true" in
> /etc/ironic-inspector/inspector.conf and bouncing the
> openstack-ironic-inspector service.
> PASS
> 
> 2. ability to pass SSH key to the ramdisk by modifying its kernel command
> line. I think we can call this fixed.
> 
> I failed to connect to the node being introspected. I also tried to set
> root's password with rootpwd="<HASH>" as described in the doc. Also no luck.
> FAILED.

I recently tested "rootpwd" and "sshkey" with the OSP9 ramdisk and they work [0]. How did you generate that hash?

Here's how I did it:

1. Generate the password hash

$ openssl passwd -1
Password: 
Verifying - Password: 
$1$shnYk4hW$GmXa4mN.duC6WQYvuIyot0

2. Update the kernel cmdline to include it.

  2.1 Update the iPXE file directly

  $ vim /httpboot/pxelinux.cfg/<mac address>

  :deploy
  imgfree
  kernel --timeout ... rootpwd="$1$shnYk4hW$GmXa4mN.duC6WQYvuIyot0" || goto deploy

  2.2 Update it for all instances

  $ vim /etc/ironic/ironic.conf

  [pxe]
  pxe_append_parameters = nofb nomodeset vga=normal rootpwd="$1$shnYk4hW$GmXa4mN.duC6WQYvuIyot0"


  $ systemctl restart openstack-ironic-conductor

  $ <now start inspection/deployment>
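
One thing that can go wrong when scripting step 2.2 is that the hash contains '$' characters, which a shell expands inside double quotes. The following is only a sketch of one way to append the parameter safely; the helper name add_rootpwd is hypothetical, the hash is the sample one from step 1, and the demo works on a throwaway file rather than the real /etc/ironic/ironic.conf.

```shell
#!/bin/sh
# Hypothetical helper: print the config from file $1 with rootpwd="<hash $2>"
# appended to the pxe_append_parameters line. The input file is not modified.
add_rootpwd() {
    # The hash travels through the environment so that neither the shell nor
    # awk reinterprets its '$' characters.
    ROOTPWD_HASH=$2 awk '
        /^pxe_append_parameters[ \t]*=/ {
            print $0 " rootpwd=\"" ENVIRON["ROOTPWD_HASH"] "\""
            next
        }
        { print }
    ' "$1"
}

# Demo on a throwaway file. Single quotes keep the shell from expanding
# $1, $shnYk4hW and so on inside the hash.
sample=$(mktemp)
printf '%s\n' '[pxe]' 'pxe_append_parameters = nofb nomodeset vga=normal' > "$sample"
add_rootpwd "$sample" '$1$shnYk4hW$GmXa4mN.duC6WQYvuIyot0'
rm -f "$sample"
```

The helper prints the modified config instead of editing in place, so the result can be reviewed before it replaces the real file and the conductor is restarted.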

[0] Here are the logs: http://paste.openstack.org/show/577447/

...

Can you give it another go, please?

Comment 21 Alexander Chuzhoy 2016-09-15 17:34:09 UTC
The introspection completes too quickly and shuts down the node.

Red Hat Enterprise Linux Server 7.2 (Maipo)
Kernel 3.10.0-327.28.3.el7.x86_64 on an x86_64

localhost login: root
Password: [   18.483805] IPMI System Interface driver.
[   18.484834] ipmi_si: Unable to find any System Interface(s)

 -- root: no shell: Permission denied

Red Hat Enterprise Linux Server 7.2 (Maipo)
Kernel 3.10.0-327.28.3.el7.x86_64 on an x86_64

localhost login:

The "no shell: Permission denied" message above makes me question if a successful login was prevented by the system being shut down.

Is there a way to pause the introspection, or to keep the node being introspected up longer?

Comment 22 Dmitry Tantsur 2016-09-16 07:09:26 UTC
You can try to artificially prevent it from working by providing an unreachable ipa-inspection-callback-url in /httpboot/inspector.ipxe. Then it will probably loop in attempts to reach it.
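
A rough sketch of that trick, for illustration only. The helper name set_callback_url, the sample kernel line and the replacement URL are all made up; a real /httpboot/inspector.ipxe will look different, so back it up before editing.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin, replace the value of
# ipa-inspection-callback-url with the URL given as $1, and print the result.
# The URL must not contain '|' or '&', which are special to this sed expression.
set_callback_url() {
    sed "s|ipa-inspection-callback-url=[^ ]*|ipa-inspection-callback-url=$1|"
}

# Sample kernel line, invented for this demo.
line='kernel http://undercloud.example/agent.kernel ipa-inspection-callback-url=http://undercloud.example:5050/v1/continue nofb nomodeset vga=normal'

# 198.51.100.0/24 is reserved for documentation, so nothing should answer there.
printf '%s\n' "$line" | set_callback_url 'http://198.51.100.1:5050/v1/continue'
```

Remember to restore the original callback URL afterwards, otherwise every later introspection will fail to report back as well.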

Comment 23 Alexander Chuzhoy 2016-09-16 16:48:55 UTC
Thanks Dmitry.
So here's what happens:
1) specifying wrong password on purpose:
Red Hat Enterprise Linux Server 7.2 (Maipo)
Kernel 3.10.0-327.28.3.el7.x86_64 on an x86_64

localhost login: root
Password:
Login incorrect

2) specifying the right password:
localhost login: root
Password:
Last failed login: Fri Sep 16 12:45:26 EDT 2016 on ttyS0
There was 1 failed login attempt since the last successful login.
Last login: Fri Sep 16 12:44:17 on ttyS0
 -- root: no shell: Permission denied

Comment 24 Dmitry Tantsur 2016-09-21 10:57:57 UTC
Just to clarify: was this "permission denied" fatal or were you able to log in?

Comment 25 Alexander Chuzhoy 2016-09-21 13:15:26 UTC
I was not able to login.
Thanks.

Comment 26 Dmitry Tantsur 2016-09-30 09:19:10 UTC
Sigh, this is strange.. Maybe it only affects OSPd8? Have you tried other versions?

Comment 27 Lucas Alvares Gomes 2016-09-30 12:55:24 UTC
(In reply to Alexander Chuzhoy from comment #25)
> I was not able to login.
> Thanks.

Hi Sasha, I was looking into it and just found out that the element that is supposed to allow you to log in to the image wasn't present in it. The patch https://code.engineering.redhat.com/gerrit/#/c/64749/ should fix it (linked in the external links).

Comment 29 Alexander Chuzhoy 2017-03-17 16:05:27 UTC
FailedQA

Environment:
instack-undercloud-6.0.0-2.el7ost.noarch
openstack-ironic-api-7.0.1-0.20170301202959.91540cd.el7ost.noarch
python-ironic-inspector-client-1.11.0-0.20170208193115.481a92e.el7ost.noarch
python-ironic-lib-2.5.2-0.20170208212103.ace87b6.el7ost.noarch
python-ironicclient-1.11.0-0.20170208194603.f1f10cb.el7ost.noarch
puppet-ironic-10.3.0-1.el7ost.noarch
openstack-ironic-inspector-5.0.0-2.el7ost.noarch
openstack-ironic-conductor-7.0.1-0.20170301202959.91540cd.el7ost.noarch
openstack-ironic-common-7.0.1-0.20170301202959.91540cd.el7ost.noarch


When I tried to ssh into a node being introspected, I got:
[stack@undercloud-0 ~]$ ssh 192.168.24.104 -l root
root@192.168.24.104's password: 
/bin/bash: Permission denied
Connection to 192.168.24.104 closed.

Comment 30 Alexander Chuzhoy 2017-03-17 17:19:46 UTC
This is an SELinux issue.
I actually succeeded in logging in after adding "selinux=0" to the kernel command line.
Are we going to document it or modify the images?
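
For reference, a small sketch of adding the workaround to a kernel parameter string without duplicating it. The helper name ensure_selinux_off is hypothetical, and the sample parameters are the pxe_append_parameters values quoted in comment #20. Note that selinux=0 turns SELinux off for the whole ramdisk.

```shell
#!/bin/sh
# Hypothetical helper: print kernel command line $1 with selinux=0 appended,
# unless some selinux= parameter is already present (it is then left untouched).
ensure_selinux_off() {
    case " $1 " in
        *" selinux="*) printf '%s\n' "$1" ;;
        *)             printf '%s\n' "$1 selinux=0" ;;
    esac
}

ensure_selinux_off 'nofb nomodeset vga=normal'
```

The resulting string is what would go into the kernel line or into pxe_append_parameters before restarting the conductor.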

Comment 31 Alexander Chuzhoy 2017-03-17 17:20:33 UTC
Lukas,
per comment #30
Are we going to document it or modify the images?

Comment 32 Lucas Alvares Gomes 2017-03-20 16:19:40 UTC
(In reply to Alexander Chuzhoy from comment #31)
> Lukas,
> per comment #30
> Are we going to document it or modify the images?

Hi sasha,

Good finding btw. Yeah, the dynamic-login [0] element in DIB should probably configure SELinux to allow people to SSH into the node when it's specified in the list of elements used to create the image. That way people don't need to pass "selinux=0" on the kernel cmdline (and disable SELinux as a whole).
 
[0] https://github.com/openstack/diskimage-builder/tree/master/elements/dynamic-login

Comment 33 Ramon Acedo 2017-04-03 21:22:07 UTC
The dynamic-login README [1] has a warning about this actually:

"Some base operational systems might require selinux to be in permissive or disabled mode so that you can log in the image. This can be achieved by building the image with the selinux-permissive element for diskimage-builder or by passing selinux=0 in the kernel command line. RHEL/CentOS are examples of OSs which this is true."

So it seems that this is by design and should probably be part of the documentation.

[1] https://github.com/openstack/diskimage-builder/tree/master/diskimage_builder/elements/dynamic-login

Comment 34 Alexander Chuzhoy 2017-04-05 13:20:46 UTC
The selinux note was added here: https://docs.openstack.org/developer/tripleo-docs/troubleshooting/troubleshooting-nodes.html#accessing-the-ramdisk


Verifying the bug based on the above + comment #29 + comment #30

Comment 35 David Juran 2017-04-26 09:24:51 UTC
In my opinion, disabling SELinux is a workaround at best. If a policy or labelling change is needed, we should make it rather than instructing users to disable SELinux.

Would you like a separate BZ for this?

Comment 39 errata-xmlrpc 2017-05-18 14:32:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1250

