Description of problem:
When scanning a large vCenter 5.5 setup, CloudForms does not see all hosts in all clusters.

Version-Release number of selected component (if applicable):

How reproducible:
All the time in the customer's environment.

Steps to Reproduce:
1. Add the VMware provider to the appliance.
2. Refresh the power state of the provider.

Actual results:
Only 12 of the 13 hosts are shown; the VMs that are on the missing host are assigned to another host.

Expected results:
All hosts are showing on the appliance with the proper

Additional info:
- Access to the host was not configured in tests.
- The host is definitely active:
  [----] I, [2015-09-24T08:34:05.105532 #2456:3bf808] INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [1000013707273], MiqWorker id: [1000000034238], Zone: [default], Role: [event], Server: [], Ident: [ems], Target id: [1000000000003], Instance id: [], Task id: [], Command: [EmsEvent.add_vc], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [{"key"=>"217905087", "chainId"=>"217905084", "createdTime"=>"2015-09-24T08:33:58.439988Z", "userName"=>"NCEDOM\\$psi_automation", "datacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "computeResource"=>{"name"=>"RND_PROD", "computeResource"=>"domain-c12183"}, "host"=>{"name"=>"ncerndesx14.nce.amadeus.net", "host"=>"host-18204"}, "vm"=>{"name"=>"NCELISAVSEPNR", "vm"=>"vm-26994"}, "fullFormattedMessage"=>"Deploying NCELISAVSEPNR on host ncerndesx14.nce.amadeus.net in PSI_Production_Farm from template NCE-RHEL-63-SSSD-MKHOME", "changeTag"=>"", "template"=>"true", "srcTemplate"=>{"name"=>"NCE-RHEL-63-SSSD-MKHOME", "vm"=>"vm-13189", "path"=>"[HDS_PSI_TEMPLATES] RHEL-63-ldap-krb5-mkhome/RHEL-63-ldap-krb5-mkhome.vmtx"}, "eventType"=>"VmBeingDeployedEvent"}], Dequeued in: [5.135613853] seconds
- The permissions given to CloudForms on the provider do allow access to the host (confirmed by connecting to the vCenter appliance with those credentials).
- The systems on that host are all shown in CF as being on ncerndesx15.nce.amadeus.net. Sample event:
  [----] I, [2015-09-24T08:35:22.013449 #2456:3bf808] INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [1000013707473], MiqWorker id: [1000000034238], Zone: [default], Role: [event], Server: [], Ident: [ems], Target id: [1000000000003], Instance id: [], Task id: [], Command: [EmsEvent.add_vc], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [{"key"=>"217905128", "chainId"=>"217905127", "createdTime"=>"2015-09-24T08:35:05.639988Z", "userName"=>"NCEDOM\\$psi_automation", "datacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "computeResource"=>{"name"=>"RND_PROD", "computeResource"=>"domain-c12183"}, "host"=>{"name"=>"ncerndesx05.nce.amadeus.net", "host"=>"host-12232"}, "vm"=>{"name"=>"NCELISASERVER", "vm"=>"vm-26992", "path"=>"[HDS_PSI_HP_1304] NCELISASERVER/NCELISASERVER.vmx"}, "ds"=>{"name"=>"HDS_PSI_HP_1304", "datastore"=>"datastore-11373"}, "fullFormattedMessage"=>"Relocating NCELISASERVER in PSI_Production_Farm from ncerndesx05.nce.amadeus.net, HDS_PSI_HP_1304 to ncerndesx15.nce.amadeus.net, HDS_PSI_HP_1304", "changeTag"=>"", "template"=>"false", "destHost"=>{"name"=>"ncerndesx15.nce.amadeus.net", "host"=>"host-18126"}, "destDatacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "destDatastore"=>{"name"=>"HDS_PSI_HP_1304", "datastore"=>"datastore-11373"}, "eventType"=>"VmBeingRelocatedEvent"}], Dequeued in: [5.353053195] seconds
I made an error in the opening statement: the event for host 15 should have been:
[----] I, [2015-09-24T10:37:55.914417 #2456:3bf808] INFO -- : MIQ(MiqQueue.get_via_drb) Message id: [1000013726062], MiqWorker id: [1000000034238], Zone: [default], Role: [event], Server: [], Ident: [ems], Target id: [1000000000003], Instance id: [], Task id: [], Command: [EmsEvent.add_vc], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: [{"key"=>"217917347", "chainId"=>"217917248", "createdTime"=>"2015-09-24T10:37:51.606988Z", "userName"=>"", "datacenter"=>{"name"=>"PSI_Production_Farm", "datacenter"=>"datacenter-7"}, "computeResource"=>{"name"=>"RND_PROD", "computeResource"=>"domain-c12183"}, "host"=>{"name"=>"ncerndesx15.nce.amadeus.net", "host"=>"host-18126"}, "vm"=>{"name"=>"NCERNDUPKKPI001", "vm"=>"vm-23340", "path"=>"[HUSVM01-CL-VIP-SAS-R10-L-00E1] NCERNDUPKKPI001/NCERNDUPKKPI001.vmx"}, "fullFormattedMessage"=>"Changed resource allocation for NCERNDUPKKPI001", "changeTag"=>"", "template"=>"false", "eventType"=>"VmResourceReallocatedEvent"}], Dequeued in: [2.796374229] seconds
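As a diagnostic aid, the Args hash of each EmsEvent.add_vc line records which name vCenter itself attaches to a host MOR. The sketch below is a hypothetical helper (`HOST_REF` and `host_refs_from_log` are illustrative names, not ManageIQ code) that scans log lines of the form shown above for those pairs. It assumes the exact `"host"=>{"name"=>..., "host"=>...}` layout and only looks at the `host` and `destHost` keys. Over the events quoted so far it maps host-18204 to ncerndesx14 and host-18126 to ncerndesx15, i.e. vCenter reports a distinct name for each MOR.

```ruby
require "set"

# Hypothetical pattern for a host reference inside the Args hash, e.g.
#   "host"=>{"name"=>"ncerndesx15.nce.amadeus.net", "host"=>"host-18126"}
# It also matches the "destHost" reference of relocation events.
HOST_REF = /"(?:host|destHost)"=>\{"name"=>"([^"]+)", "host"=>"(host-\d+)"\}/

# Returns a Hash of hostname => Set of host MORs seen in the given lines.
def host_refs_from_log(lines)
  refs = Hash.new { |hash, name| hash[name] = Set.new }
  lines.each do |line|
    line.scan(HOST_REF) { |name, mor| refs[name] << mor }
  end
  refs
end

# Excerpts of the three events quoted above (Args trimmed for brevity).
sample = [
  '"host"=>{"name"=>"ncerndesx14.nce.amadeus.net", "host"=>"host-18204"}, "vm"=>{"name"=>"NCELISAVSEPNR", "vm"=>"vm-26994"}',
  '"host"=>{"name"=>"ncerndesx05.nce.amadeus.net", "host"=>"host-12232"}, "destHost"=>{"name"=>"ncerndesx15.nce.amadeus.net", "host"=>"host-18126"}',
  '"host"=>{"name"=>"ncerndesx15.nce.amadeus.net", "host"=>"host-18126"}, "vm"=>{"name"=>"NCERNDUPKKPI001", "vm"=>"vm-23340"}'
]

refs = host_refs_from_log(sample)
refs.each { |name, mors| puts "#{name}: #{mors.to_a.join(', ')}" }

# Any hostname reported with more than one MOR would show up here.
conflicts = refs.select { |_, mors| mors.size > 1 }
```

If vCenter's own events never show one hostname with two MORs, a conflict seen only in the appliance database points at how the inventory is matched, not at vCenter.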
This sounds environmental, like an issue with DNS, etc.
(In reply to Dave Johnson from comment #5)
> This sounds environmental like an issue in DNS, etc

This is unconfirmed so far, but maybe a hypervisor was created by cloning another one, or something similar. The main problem is that even after the hostnames are corrected and the situation is resolved in the environment, it stays that way in CloudForms.
The DNS configuration was confirmed to be correct using dig from the CloudForms appliance; we don't have access to the VMware hosts. Investigation of the hostname setup shows no anomaly in vCenter.
Looking into the database showed that the host ID had previously been used by a host that is still shown in maintenance mode in the interface but does not appear to be in use at all. Further investigation will be required, but we do not have the VIM logs and don't know that other host's history at the time.
Felix, could the customer have two hosts with the same DNS name? There seem to be hosts with two different ManagedObjectReferences but with the same hostname and IP address:

1. id: [1000000000067] hostname: [ncerndesx15.nce.amadeus.net] IP: [172.16.135.41] ems_ref: [host-18204]
2. id: [1000000000067] hostname: [ncerndesx15.nce.amadeus.net] IP: [172.16.135.41] ems_ref: [host-18126]

Due to another bug (https://bugzilla.redhat.com/show_bug.cgi?id=1260139) we actually get the IP address from a DNS lookup on the hostname, so there could have been two hosts with the same hostname that we simply resolved to the same IP address.
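The pattern described here, one hostname claimed by two different MORs, can be checked mechanically once the host entries are exported. A minimal sketch, assuming the entries are available as plain Ruby objects; `HostEntry` and `hostnames_with_multiple_mors` are hypothetical names, not part of ManageIQ, and the real Host model has many more columns.

```ruby
# Hypothetical, simplified stand-in for the host entries listed above.
HostEntry = Struct.new(:id, :hostname, :ipaddress, :ems_ref, keyword_init: true)

# Returns { hostname => [ems_ref, ...] } for every hostname that is claimed
# by more than one distinct ManagedObjectReference.
def hostnames_with_multiple_mors(entries)
  entries.group_by(&:hostname)
         .transform_values { |group| group.map(&:ems_ref).uniq }
         .select { |_, mors| mors.size > 1 }
end

# The two entries from the comment above.
entries = [
  HostEntry.new(id: 1000000000067, hostname: "ncerndesx15.nce.amadeus.net",
                ipaddress: "172.16.135.41", ems_ref: "host-18204"),
  HostEntry.new(id: 1000000000067, hostname: "ncerndesx15.nce.amadeus.net",
                ipaddress: "172.16.135.41", ems_ref: "host-18126")
]

p hostnames_with_multiple_mors(entries)
```

Grouping on hostname rather than IP is deliberate: since the IP is itself derived from a DNS lookup on the hostname, it cannot tell the two hosts apart.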
(In reply to Adam Grare from comment #9)
> Felix, could the customer have two hosts with the same DNS name?

During the investigation I checked the output of a reverse lookup on the IPs, and the IP addresses resolve correctly to the names the hosts should have.

After looking into the database, it turns out another host is also in the database with the ems_ref host-18126.
(In reply to Felix Dewaleyne from comment #10)
> After looking into the database, it turns out another host is also in the
> database with the id host-18126.

Are we still in contact with this customer? It could be helpful to look up those two hosts in their vCenter Managed Object Browser.
(In reply to Felix Dewaleyne from comment #10)
> After looking into the database, it turns out another host is also in the
> database with the id host-18126.

Yes, there is another host with that MOR (ncepsiesx34.top.nce.amadeus.net), but it appears to belong to a different EMS (ncepsivc02.top.nce.amadeus.net), so it is probably just the same MOR reused on another vCenter.
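This is expected on the vSphere side: a MOR such as host-18126 is only unique within a single vCenter, so two vCenters can hand out the same value, and any lookup by ems_ref has to be scoped to the EMS it came from. A minimal sketch of why, using a hypothetical `InventoryHost` record and numeric EMS ids 1 and 2 standing in for the customer's vCenter and ncepsivc02 (the real ids are not known here):

```ruby
# Hypothetical record: which EMS (vCenter) a host came from, plus its MOR.
InventoryHost = Struct.new(:ems_id, :ems_ref, :hostname, keyword_init: true)

hosts = [
  InventoryHost.new(ems_id: 1, ems_ref: "host-18126",
                    hostname: "ncerndesx15.nce.amadeus.net"),
  InventoryHost.new(ems_id: 2, ems_ref: "host-18126",
                    hostname: "ncepsiesx34.top.nce.amadeus.net")
]

# Keyed by MOR alone, the second vCenter's host-18126 replaces the first.
by_mor = hosts.to_h { |h| [h.ems_ref, h] }

# Keyed by [EMS, MOR], both hosts are kept.
by_ems_and_mor = hosts.to_h { |h| [[h.ems_id, h.ems_ref], h] }

puts by_mor.size          # 1
puts by_ems_and_mor.size  # 2
```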
https://github.com/ManageIQ/manageiq/pull/5192
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f4aa372a570c934f99a5d789aaffafb8ac84c6c9

commit f4aa372a570c934f99a5d789aaffafb8ac84c6c9
Author:     Adam Grare <agrare>
AuthorDate: Tue Oct 27 16:05:46 2015 -0400
Commit:     Adam Grare <agrare>
CommitDate: Thu Oct 29 11:39:31 2015 -0400

    Handle duplicate infra host hostnames

    If two hosts have the same hostname they will get assigned the same
    database ID and overwrite each other every refresh. To resolve this in
    addition to looking up a host by hostname make sure that what is
    returned does not have a different ManagedObjectReference to ensure we
    aren't overwriting a different host.

    https://bugzilla.redhat.com/show_bug.cgi?id=1266561

 app/models/ems_refresh/save_inventory_infra.rb |  2 +-
 app/models/host.rb                             | 11 ++++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)
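The idea in the commit message can be illustrated outside of ManageIQ. This is not the actual change to app/models/host.rb; `ExistingHost`, `find_by_hostname_only` and `find_by_hostname_guarded` are hypothetical names. It contrasts a lookup by hostname alone, which returns the same row for both MORs (so one host overwrites the other on every refresh), with a lookup that rejects a hostname match carrying a different MOR. Treating a row with no MOR yet as a valid match is an assumption of this sketch.

```ruby
# Hypothetical stand-in for a host row already saved in the database.
ExistingHost = Struct.new(:id, :hostname, :ems_ref, keyword_init: true)

# Old behaviour (simplified): any row with the same hostname is reused.
def find_by_hostname_only(hosts, hostname:, ems_ref:)
  hosts.find { |h| h.hostname == hostname }
end

# Guarded lookup: a hostname match is accepted only if the row has no MOR
# yet (assumption) or has the same MOR as the host being refreshed.
def find_by_hostname_guarded(hosts, hostname:, ems_ref:)
  hosts.find do |h|
    h.hostname == hostname && (h.ems_ref.nil? || h.ems_ref == ems_ref)
  end
end

saved = [ExistingHost.new(id: 1000000000067,
                          hostname: "ncerndesx15.nce.amadeus.net",
                          ems_ref: "host-18126")]

incoming = { hostname: "ncerndesx15.nce.amadeus.net", ems_ref: "host-18204" }

# Returns the host-18126 row, which would then be overwritten.
p find_by_hostname_only(saved, **incoming)&.ems_ref
# Returns nil, so a separate record would be created for host-18204.
p find_by_hostname_guarded(saved, **incoming)
```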
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=86d111906ea4b855b3c17b3082f345c0c3fd4f86

commit 86d111906ea4b855b3c17b3082f345c0c3fd4f86
Author:     Adam Grare <agrare>
AuthorDate: Tue Oct 27 16:05:46 2015 -0400
Commit:     Adam Grare <agrare>
CommitDate: Fri Oct 30 09:09:58 2015 -0400

    Handle duplicate infra host hostnames

    If two hosts have the same hostname they will get assigned the same
    database ID and overwrite each other every refresh. To resolve this in
    addition to looking up a host by hostname make sure that what is
    returned does not have a different ManagedObjectReference to ensure we
    aren't overwriting a different host.

    https://bugzilla.redhat.com/show_bug.cgi?id=1266561

 app/models/ems_refresh/save_inventory_infra.rb |  2 +-
 app/models/host.rb                             | 11 ++++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)
Confirmed: this issue is triggered by two hosts having the same hostname inside VMware.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2551