Bug 1266561 - Cloudforms can confuse two hosts as being a single one
Status: CLOSED ERRATA
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.4.0
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: GA
Target Release: 5.5.0
Assigned To: Adam Grare
QA Contact: Pavol Kotvan
Depends On:
Blocks:
Reported: 2015-09-25 11:41 EDT by Felix Dewaleyne
Modified: 2015-12-08 08:33 EST
CC List: 8 users

See Also:
Fixed In Version: 5.5.0.10
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-08 08:33:19 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2046923 None None None Never

Description Felix Dewaleyne 2015-09-25 11:41:15 EDT
Description of problem:
When scanning a large vCenter 5.5 setup, CloudForms does not see all hosts in all clusters.

Version-Release number of selected component (if applicable):


How reproducible:
Always, in the customer's environment.

Steps to Reproduce:
1. Configure the VMware provider on the appliance.
2. Refresh the power state of the provider.


Actual results:
Only 12 of the 13 hosts are shown; the VMs on the missing host are assigned to another host.

Expected results:
All hosts are shown on the appliance, with their VMs assigned to the proper host.

Additional info:
- access to the host was not configured in tests.
- the host is definitely active :

[----] I, [2015-09-24T08:34:05.105532 #2456:3bf808]  INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [1000013707273], MiqWorker id: [1000000034238], Zone: [default], Role: [event], Server: [], Ident: [ems], Target id: [1000000000003], Instance id: [], Task id: [], Command: [EmsEvent.add_vc], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [{"key"=>"217905087", "chainId"=>"217905084", "createdTime"=>"2015-09-24T08:33:58.439988Z", "userName"=>"NCEDOM\\$psi_automation", "datacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "computeResource"=>{"name"=>"RND_PROD", "computeResource"=>"domain-c12183"}, "host"=>{"name"=>"ncerndesx14.nce.amadeus.net", "host"=>"host-18204"}, "vm"=>{"name"=>"NCELISAVSEPNR", "vm"=>"vm-26994"}, "fullFormattedMessage"=>"Deploying NCELISAVSEPNR on host ncerndesx14.nce.amadeus.net in PSI_Production_Farm from template NCE-RHEL-63-SSSD-MKHOME", "changeTag"=>"", "template"=>"true", "srcTemplate"=>{"name"=>"NCE-RHEL-63-SSSD-MKHOME", "vm"=>"vm-13189", "path"=>"[HDS_PSI_TEMPLATES] RHEL-63-ldap-krb5-mkhome/RHEL-63-ldap-krb5-mkhome.vmtx"}, "eventType"=>"VmBeingDeployedEvent"}], Dequeued in: [5.135613853] seconds

- permissions given to CloudForms on the provider do allow access to the host (confirmed by connecting to the vCenter appliance with the credentials)

- the systems on that host are all showing in CF as being on ncerndesx15.nce.amadeus.net - sample event : 

[----] I, [2015-09-24T08:35:22.013449 #2456:3bf808]  INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [1000013707473], MiqWorker id: [1000000034238], Zone: [default], Role: [event], Server: [], Ident: [ems], Target id: [1000000000003], Instance id: [], Task id: [], Command: [EmsEvent.add_vc], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [{"key"=>"217905128", "chainId"=>"217905127", "createdTime"=>"2015-09-24T08:35:05.639988Z", "userName"=>"NCEDOM\\$psi_automation", "datacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "computeResource"=>{"name"=>"RND_PROD", "computeResource"=>"domain-c12183"}, "host"=>{"name"=>"ncerndesx05.nce.amadeus.net", "host"=>"host-12232"}, "vm"=>{"name"=>"NCELISASERVER", "vm"=>"vm-26992", "path"=>"[HDS_PSI_HP_1304] NCELISASERVER/NCELISASERVER.vmx"}, "ds"=>{"name"=>"HDS_PSI_HP_1304", "datastore"=>"datastore-11373"}, "fullFormattedMessage"=>"Relocating NCELISASERVER in PSI_Production_Farm from ncerndesx05.nce.amadeus.net, HDS_PSI_HP_1304 to ncerndesx15.nce.amadeus.net, HDS_PSI_HP_1304", "changeTag"=>"", "template"=>"false", "destHost"=>{"name"=>"ncerndesx15.nce.amadeus.net", "host"=>"host-18126"}, "destDatacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "destDatastore"=>{"name"=>"HDS_PSI_HP_1304", "datastore"=>"datastore-11373"}, "eventType"=>"VmBeingRelocatedEvent"}], Dequeued in: [5.353053195] seconds
Comment 4 Felix Dewaleyne 2015-09-25 12:22:59 EDT
I made an error in the opening description; host 15's event should have been:


[----] I, [2015-09-24T10:37:55.914417 #2456:3bf808]  INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [1000013726062], MiqWorker id: [1000000034238], Zone: [default], Role: [event], Server: [], Ident: [ems], Target id: [1000000000003], Instance id: [], Task id: [], Command: [EmsEvent.add_vc], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [{"key"=>"217917347", "chainId"=>"217917248", "createdTime"=>"2015-09-24T10:37:51.606988Z", "userName"=>"", "datacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "computeResource"=>{"name"=>"RND_PROD", "computeResource"=>"domain-c12183"}, "host"=>{"name"=>"ncerndesx15.nce.amadeus.net", "host"=>"host-18126"}, "vm"=>{"name"=>"NCERNDUPKKPI001", "vm"=>"vm-23340", "path"=>"[HUSVM01-CL-VIP-SAS-R10-L-00E1] NCERNDUPKKPI001/NCERNDUPKKPI001.vmx"}, "fullFormattedMessage"=>"Changed resource allocation for NCERNDUPKKPI001", "changeTag"=>"", "template"=>"false", "eventType"=>"VmResourceReallocatedEvent"}], Dequeued in: [2.796374229] seconds
Comment 5 Dave Johnson 2015-10-05 10:34:28 EDT
This sounds environmental, like an issue in DNS, etc.
Comment 6 Felix Dewaleyne 2015-10-08 05:48:34 EDT
(In reply to Dave Johnson from comment #5)
> This sounds environmental, like an issue in DNS, etc.

This is unconfirmed so far, but maybe a hypervisor was created by cloning another one, or something similar. The main problem is that after the hostnames are updated and the situation is resolved in the environment, the confusion persists in CloudForms.
Comment 7 Felix Dewaleyne 2015-10-09 12:09:36 EDT
The DNS configuration was confirmed to be correct using dig from the CloudForms appliance; we don't have access to the VMware hosts. Investigation of the hostname setup shows no anomaly in vCenter.
Comment 8 Felix Dewaleyne 2015-10-13 04:35:07 EDT
Looking into the database showed that the host ID had previously been used by a host that was still in maintenance mode in the interface but does not appear to have been in use at all. Further investigation will be required, but we do not have the vim logs and do not know the history of that other host at this point.
Comment 9 Adam Grare 2015-10-15 14:45:30 EDT
Felix, could the customer have two hosts with the same DNS name?
There seem to be two hosts with different ManagedObjectReferences but the same hostname and IP address.

1. id: [1000000000067] hostname: [ncerndesx15.nce.amadeus.net] IP: [172.16.135.41] ems_ref: [host-18204]
2. id: [1000000000067] hostname: [ncerndesx15.nce.amadeus.net] IP: [172.16.135.41] ems_ref: [host-18126]

Due to another bug (https://bugzilla.redhat.com/show_bug.cgi?id=1260139) we actually get the IP address from a DNS lookup on the hostname, so there could have been two hosts with the same hostname that we just resolved to having the same IP address.
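
A minimal sketch of how such a collision could be spotted in exported inventory data, assuming the rows are available as plain records. The `HostRecord` struct and the `duplicate_hostnames` helper are hypothetical names used only for illustration and are not part of CloudForms; the sample values are the two rows listed above.

```ruby
# Hypothetical record type; the field names mirror the listing above.
HostRecord = Struct.new(:id, :hostname, :ipaddress, :ems_ref)

# Return a hash of hostname => sorted ems_refs for every hostname
# that is claimed by more than one ManagedObjectReference.
def duplicate_hostnames(records)
  records
    .group_by(&:hostname)
    .transform_values { |group| group.map(&:ems_ref).uniq.sort }
    .select { |_hostname, refs| refs.size > 1 }
end

records = [
  HostRecord.new(1000000000067, "ncerndesx15.nce.amadeus.net", "172.16.135.41", "host-18204"),
  HostRecord.new(1000000000067, "ncerndesx15.nce.amadeus.net", "172.16.135.41", "host-18126"),
]

# Print each colliding hostname with the ems_refs that share it.
duplicate_hostnames(records).each do |hostname, refs|
  puts "#{hostname}: #{refs.join(', ')}"
end
```

Any hostname reported by such a check would be a candidate for the overwrite behaviour discussed in this bug.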
Comment 10 Felix Dewaleyne 2015-10-16 05:01:49 EDT
(In reply to Adam Grare from comment #9)

During the investigation I checked the output of a reverse lookup on the IPs, and the IP addresses resolve correctly to the names the hosts should have.

After looking into the database, it turns out another host is also in the database with the id host-18126.
Comment 11 Adam Grare 2015-10-16 12:11:17 EDT
(In reply to Felix Dewaleyne from comment #10)

Are we still in contact with this customer? It could be helpful to find those two hosts in their Managed Object Browser.
Comment 17 Adam Grare 2015-10-21 11:10:06 EDT
(In reply to Felix Dewaleyne from comment #10)
> After looking into the database, it turns out another host is also in the
> database with the id host-18126.

Yes, there is another host with that MOR (ncepsiesx34.top.nce.amadeus.net), but it looks like it is from another EMS (ncepsivc02.top.nce.amadeus.net), so it is probably just a duplicate MOR on another vCenter.
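
To illustrate why the same MOR on two vCenters is harmless, here is a small sketch that keys hosts by the pair of EMS and MOR instead of the MOR alone. The `EmsHost` struct and the `index_by_ems_and_ref` helper are hypothetical, and "first-vcenter" is a placeholder because the first EMS is not named in this report.

```ruby
# Hypothetical record type for illustration; :ems names the vCenter.
EmsHost = Struct.new(:ems, :ems_ref, :hostname)

# Index hosts by the pair [ems, ems_ref], since a ManagedObjectReference
# such as "host-18126" is only unique within one vCenter.
def index_by_ems_and_ref(hosts)
  hosts.each_with_object({}) do |host, index|
    index[[host.ems, host.ems_ref]] = host
  end
end

hosts = [
  # "first-vcenter" is a placeholder; the report does not name this EMS.
  EmsHost.new("first-vcenter", "host-18126", "ncerndesx15.nce.amadeus.net"),
  EmsHost.new("ncepsivc02.top.nce.amadeus.net", "host-18126", "ncepsiesx34.top.nce.amadeus.net"),
]

index = index_by_ems_and_ref(hosts)
# Both hosts survive in the index even though they share a MOR.
index.each_key { |ems, ref| puts "#{ems} / #{ref}" }
```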
Comment 20 CFME Bot 2015-10-29 15:50:20 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f4aa372a570c934f99a5d789aaffafb8ac84c6c9

commit f4aa372a570c934f99a5d789aaffafb8ac84c6c9
Author:     Adam Grare <agrare@redhat.com>
AuthorDate: Tue Oct 27 16:05:46 2015 -0400
Commit:     Adam Grare <agrare@redhat.com>
CommitDate: Thu Oct 29 11:39:31 2015 -0400

    Handle duplicate infra host hostnames
    
    If two hosts have the same hostname they will get assigned
    the same database ID and overwrite each other every refresh.
    To resolve this in addition to looking up a host by hostname
    make sure that what is returned does not have a different
    ManagedObjectReference to ensure we aren't overwriting a
    different host.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1266561

 app/models/ems_refresh/save_inventory_infra.rb |  2 +-
 app/models/host.rb                             | 11 ++++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)
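
The idea in the commit message can be sketched as follows. This is not the code from the commit: `HostRow`, `find_matching_host` and `save_host` are hypothetical names working on an in-memory array, and treating a row with no stored ems_ref as a valid match is an assumption made for this sketch.

```ruby
# Hypothetical in-memory stand-in for a host row; not the ManageIQ model.
HostRow = Struct.new(:id, :hostname, :ems_ref)

# Look up an existing row for an incoming inventory entry.
# A hostname match is accepted only when the stored ems_ref is unset
# (an assumption of this sketch) or equal to the incoming one;
# otherwise the match belongs to a different host and is skipped.
def find_matching_host(rows, hostname:, ems_ref:)
  rows.find do |row|
    row.hostname == hostname && (row.ems_ref.nil? || row.ems_ref == ems_ref)
  end
end

# Save an inventory entry, creating a new row instead of overwriting
# a row that belongs to another ManagedObjectReference.
def save_host(rows, hostname:, ems_ref:)
  row = find_matching_host(rows, hostname: hostname, ems_ref: ems_ref)
  if row.nil?
    row = HostRow.new(rows.size + 1, hostname, ems_ref)
    rows << row
  else
    row.ems_ref = ems_ref
  end
  row
end

rows = []
save_host(rows, hostname: "ncerndesx15.nce.amadeus.net", ems_ref: "host-18204")
save_host(rows, hostname: "ncerndesx15.nce.amadeus.net", ems_ref: "host-18126")
# Each ems_ref now has its own row id instead of sharing one.
rows.each { |row| puts "#{row.id} #{row.hostname} #{row.ems_ref}" }
```

With a hostname-only lookup, the second call would have found and overwritten the first row, which matches the symptom of 12 hosts showing instead of 13.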
Comment 22 CFME Bot 2015-11-04 15:48:44 EST
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=86d111906ea4b855b3c17b3082f345c0c3fd4f86

commit 86d111906ea4b855b3c17b3082f345c0c3fd4f86
Author:     Adam Grare <agrare@redhat.com>
AuthorDate: Tue Oct 27 16:05:46 2015 -0400
Commit:     Adam Grare <agrare@redhat.com>
CommitDate: Fri Oct 30 09:09:58 2015 -0400

    Handle duplicate infra host hostnames
    
    If two hosts have the same hostname they will get assigned
    the same database ID and overwrite each other every refresh.
    To resolve this in addition to looking up a host by hostname
    make sure that what is returned does not have a different
    ManagedObjectReference to ensure we aren't overwriting a
    different host.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1266561

 app/models/ems_refresh/save_inventory_infra.rb |  2 +-
 app/models/host.rb                             | 11 ++++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)
Comment 23 Felix Dewaleyne 2015-11-10 07:11:32 EST
Confirmed: this issue is triggered by the hostname being the same inside VMware.
Comment 27 errata-xmlrpc 2015-12-08 08:33:19 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551
