Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1054274

Summary: Host re-install using SSH Authentication and soft fencing fails on Puma hosts on host broker
Product: Red Hat Enterprise Virtualization Manager
Reporter: sefi litmanovich <slitmano>
Component: ovirt-engine
Assignee: Martin Perina <mperina>
Status: CLOSED NOTABUG
QA Contact: Pavel Stehlik <pstehlik>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: acathrow, alonbl, bazulay, emesika, gklein, iheim, lbednar, lpeer, Rhev-m-bugs, slitmano, talayan, yeylon
Target Milestone: ---
Keywords: TestBlocker, Triaged
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-06 09:00:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
engine log | none

Description sefi litmanovich 2014-01-16 14:45:13 UTC
Created attachment 851125
engine log

Description of problem:

This issue is either a problem with Jenkins host provisioning or with the SSH authentication feature. So far it has occurred only with Puma hosts on host broker during automated testing of soft fencing.

Test case: stop the vdsmd service on an installed host (state: Up) to invoke soft fencing. This test was working fine less than a month ago:

http://jenkins.qa.lab.tlv.redhat.com:8080/view/RhevmCore/view/3.3/job/3.3-git-rhevmCore-infra_soft_fencing-rest/11/

It recently stopped working for the following reason (see the full engine log attached):

2014-01-08 15:47:35,960 ERROR [org.ovirt.engine.core.bll.SshSoftFencingCommand] (pool-5-thread-48) [9dbec40] SSH connection to host puma19.scl.lab.tlv.redhat.com failed: javax.naming.AuthenticationException: SSH authentication to 'root.lab.tlv.redhat.com' failed. Please verify provided credentials. Make sure key is authorized at host: javax.naming.AuthenticationException: SSH authentication to 'root.lab.tlv.redhat.com' failed. Please verify provided credentials. Make sure key is authorized at host.
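
To see which hosts are affected and when, the attached engine log can be scanned for these errors. Below is a minimal sketch in Python. It assumes every SshSoftFencingCommand failure follows the layout of the single line quoted above; the function name and the pattern are illustrative and not part of ovirt-engine.

```python
import re

# Pattern derived from the one log line quoted in this report. It is an
# assumption that other SshSoftFencingCommand failures look the same.
FAILURE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR "
    r"\[org\.ovirt\.engine\.core\.bll\.SshSoftFencingCommand\].*?"
    r"SSH connection to host (?P<host>\S+) failed"
)


def find_soft_fencing_failures(lines):
    """Return (timestamp, host) pairs for failed SSH soft fencing attempts."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            failures.append((match.group("timestamp"), match.group("host")))
    return failures
```

Passing the open log file to find_soft_fencing_failures gives a quick list of the failing hosts and the times of the failures.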

1. After the test failed, I reserved the host and installed it on my engine. Installation worked. I then reproduced the test case manually and got the same issue. After that I removed the host, installed it again, tried a re-install with the SSH authentication method, and received the same result.

2. I made sure that the entry for ovirt-engine in the host's .ssh/authorized_keys is the same as the one seen on the engine.

3. I ran the same test case on one of my own hosts and did not get the same result. After that I compared the ovirt-engine entry in .ssh/authorized_keys on both my host and the Puma host, and they were the same.

4. I checked with Lukas Bednar whether this might be a plugin issue. He suggested it might be caused by the Puppet plugin, which cleans SSH keys from host broker hosts. He then disabled this option in the plugin, but the issue occurred again.

5. Tried xcmd - ssh -i /etc/pki/ovirt-engine/keys/engine_id_rsa root@host from the engine to the Puma host, and it worked, so it seems that the key should be fine.
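
The comparisons in steps 2 and 3 can be made less error-prone by comparing only the key type and the base64 key blob, ignoring options and trailing comments. Below is a minimal sketch in Python. The helper names are hypothetical, and the parsing is a simplification: it assumes no option value contains a token that looks like a key type.

```python
# Key type prefixes recognised by this sketch (not an exhaustive list).
KEY_TYPES = ("ssh-rsa", "ssh-dss", "ssh-ed25519", "ecdsa-sha2-")


def key_entries(authorized_keys_text):
    """Return the set of (key type, base64 blob) pairs found in the text."""
    entries = set()
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = line.split()
        # The key type may be preceded by options; the blob follows it.
        for index, token in enumerate(tokens[:-1]):
            if token.startswith(KEY_TYPES):
                entries.add((token, tokens[index + 1]))
                break
    return entries


def missing_on_host(engine_public_key, host_authorized_keys):
    """Return engine keys that are absent from the host's authorized_keys."""
    return key_entries(engine_public_key) - key_entries(host_authorized_keys)
```

An empty result means every engine key is present on the host; anything else lists the keys that were removed or altered.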

Whether this is a problem with the SSH key or with the hosts in host broker, I have run out of ideas.
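
A successful manual ssh -i login, as in step 5, does not by itself prove that the engine key was what authenticated, because the OpenSSH client may also try agent identities or other methods. Below is a sketch of a stricter check in Python. It assumes the OpenSSH client is used; the function names are hypothetical, and the default key path is the one quoted in step 5.

```python
import subprocess


def strict_key_check_command(
    host,
    key_path="/etc/pki/ovirt-engine/keys/engine_id_rsa",
    user="root",
):
    """Build an ssh command that can only succeed with the given key."""
    return [
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",                       # never prompt for input
        "-o", "IdentitiesOnly=yes",                  # ignore agent identities
        "-o", "PreferredAuthentications=publickey",  # no password fallback
        f"{user}@{host}",
        "true",                                      # harmless remote command
    ]


def key_login_works(host):
    """Return True if a key-only login to the host succeeds."""
    result = subprocess.run(strict_key_check_command(host))
    return result.returncode == 0
```

If key_login_works returns False while the plain ssh -i command succeeds, the plain login was probably authenticated by something other than the engine key.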


How reproducible:

This does not reproduce manually (at least so far). It is reproducible when building this Jenkins task (test case name: SoftFencingPassedWithoutPM):

http://jenkins.qa.lab.tlv.redhat.com:8080/view/RhevmCore/view/3.3/job/3.3-git-rhevmCore-infra_soft_fencing-rest/


Actual results:

The engine isn't able to connect to the host and soft fence it; the host becomes non-responsive.

Expected results:

The engine connects to the host via SSH and performs soft fencing. The host state goes back to Up.


Additional info:

The test was running on IS29 and stopped working on IS30, though.

Comment 1 Alon Bar-Lev 2014-01-16 14:51:27 UTC
You need to stop the automation when this happens so we can see it in practice.

There were similar reports in the past; all turned out to be false alarms, as some component on the host had modified the authorized keys.
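
One way to check whether something on the host rewrites the authorized keys during an automation run is to record a digest of the file before the test and compare it afterwards. Below is a minimal sketch in Python. The helper names are hypothetical, and the caller has to supply the path to the authorized_keys file on the host.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    target = Path(path)
    if not target.is_file():
        return None
    return hashlib.sha256(target.read_bytes()).hexdigest()


def changed_since(path, previous_digest):
    """Return True if the file differs from the recorded digest."""
    return file_digest(path) != previous_digest
```

Recording file_digest before the Jenkins job starts and calling changed_since when the failure appears would show whether the file was modified or removed in between.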