1336568 – Satellite 6 Content Hosts Disappear due to Elasticsearch


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Hosts - Content
Sub Component:
Version:	6.1.8
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent with 1 vote
Target Milestone:	Unspecified
Assignee:	satellite6-bugs
QA Contact:	jcallaha
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1317008

Reported:	2016-05-16 22:11 UTC by Wilson Harris
Modified:	2021-08-30 12:11 UTC
CC List:	32 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-06-29 21:16:26 UTC
Target Upstream Version:
Embargoed:

Attachments
reindex.rake replacement (6.16 KB, text/plain) 2016-05-19 15:55 UTC, Justin Sherrill
reindex_object.rake (1.74 KB, text/plain) 2016-07-07 19:29 UTC, Justin Sherrill

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2328141	0	None	None	None	2016-05-18 09:49:48 UTC
Red Hat Product Errata	RHBA-2017:1668	0	normal	SHIPPED_LIVE	Satellite 6.1.12 Async Errata	2017-06-30 01:15:59 UTC

Description Wilson Harris 2016-05-16 22:11:07 UTC

Description of problem:

Satellite 6 content hosts disappear due to Elasticsearch.

The content hosts page is blank.

Re-indexing the Satellite does not resolve the issue: the web UI is not repopulated.



Comment 1 Benjamin Chardi 2016-05-17 15:06:28 UTC

Hi there,

We are facing the same problem at our customer, Mercadona.
We have Satellite 6.1.7, and after running "foreman-rake katello:reindex" a few times because of other bugs, the content host list in the web UI appears empty.

We need a solution ASAP; this bug is putting the project at risk ...

Comment 2 Pavel Moravec 2016-05-17 17:03:09 UTC

Simple check whether an empty content hosts list is caused by Elasticsearch:

curl -X GET 'http://localhost:9200/katello_katello::system/katello%2Fsystem/_search?pretty' -d '{"query":{"match_all":{}},"sort":[{"name_sort":"asc"}]}'

Take a UUID from the output and substitute it in:

curl -X GET 'http://localhost:9200/katello_katello::system/katello%2Fsystem/_search?pretty' -d '{"query":{"match_all":{}},"sort":[{"name_sort":"asc"}],"filter":{"and":[{"terms":{"uuid":["0ecddc45-efa4-4963-b8e9-843828b61f95"]}}]}}'

if this curl returns:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

then the Elasticsearch index katello_katello::system (the one required for the content host list) is broken.
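The second half of the check above can be scripted. The following is only a sketch: the function names, the ES_URL variable and the --uuid flag are illustrative and not part of Satellite, and the JSON is parsed with tr and sed rather than a real JSON parser, so it only understands responses shaped like the one shown above.

```shell
# Sketch of the UUID check described above. ES_URL, hits_total,
# search_by_uuid and the --uuid flag are illustrative names.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX_PATH='katello_katello::system/katello%2Fsystem'

# Print hits.total from a _search response read on stdin.
# Whitespace is stripped first so pretty-printed and compact JSON both work;
# anchoring on "hits":{"total": skips the "total" under "_shards".
hits_total() {
  tr -d ' \n\t' | sed -n 's/.*"hits":{"total":\([0-9][0-9]*\).*/\1/p'
}

# Run the UUID-filtered search quoted in the comment above.
search_by_uuid() {
  curl -s -X GET "${ES_URL}/${INDEX_PATH}/_search?pretty" \
    -d "{\"query\":{\"match_all\":{}},\"sort\":[{\"name_sort\":\"asc\"}],\"filter\":{\"and\":[{\"terms\":{\"uuid\":[\"$1\"]}}]}}"
}

# Only contact Elasticsearch when a UUID is passed explicitly:
#   ./check_index.sh --uuid <uuid-from-the-first-query>
if [ "${1:-}" = "--uuid" ] && [ -n "${2:-}" ]; then
  total="$(search_by_uuid "$2" | hits_total)"
  if [ "${total}" = "0" ]; then
    echo "UUID $2 returned 0 hits: the index looks broken"
  else
    echo "UUID $2 returned ${total:-an unparsable number of} hit(s)"
  fi
fi
```

The UUID has to come from the first (unfiltered) query, exactly as in the manual procedure; the script does not pick one itself.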


Possible workaround that sometimes works:
1) Revert the change from https://access.redhat.com/solutions/2076563 (this KCS had often been applied before the problem was reported).
2) Run:
service elasticsearch stop
rm -rf /var/lib/elasticsearch/*
katello-service restart
foreman-rake katello:reindex

(If the reindex doesn't help, try it once more.)

Worth providing if that doesn't help:
- DB backups
- a tcpdump taken while the reindex is running:
tcpdump -i any -s 0 port 9200 or port 9300 -w es.$(date "+%s").cap
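To make sure the capture really covers the whole reindex, the two commands can be tied together. This is a sketch under assumptions: the function name, the CAPTURE_CMD and REINDEX_CMD variables and the --run flag are made up for illustration; the default commands are the ones quoted above.

```shell
# Sketch: run the tcpdump capture for exactly as long as the reindex runs.
# CAPTURE_CMD, REINDEX_CMD, capture_during_reindex and --run are illustrative.
CAPTURE_CMD="${CAPTURE_CMD:-tcpdump -i any -s 0 port 9200 or port 9300}"
REINDEX_CMD="${REINDEX_CMD:-foreman-rake katello:reindex}"

capture_during_reindex() {
  cap_file="es.$(date "+%s").cap"

  # Start the capture in the background and remember its PID.
  ${CAPTURE_CMD} -w "${cap_file}" &
  capture_pid=$!
  sleep 2   # give the capture a moment to start before traffic begins

  # Run the reindex in the foreground and keep its exit status.
  reindex_status=0
  ${REINDEX_CMD} || reindex_status=$?

  # Stop the capture; SIGTERM lets tcpdump flush and close the file.
  kill "${capture_pid}" 2>/dev/null || true
  wait "${capture_pid}" 2>/dev/null || true

  echo "capture file: ${cap_file} (reindex exit status ${reindex_status})"
  return "${reindex_status}"
}

# Nothing is captured or reindexed unless asked for: ./capture_reindex.sh --run
if [ "${1:-}" = "--run" ]; then
  capture_during_reindex
fi
```

Keeping the commands in variables is only a convenience for trying the flow with harmless stand-ins before running it on a real Satellite.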

Comment 5 Justin Sherrill 2016-05-18 13:28:04 UTC

If you see this again, please grab the following output BEFORE trying to fix the issue:

curl -X GET 'http://localhost:9200/katello_katello::system/katello%2Fsystem/_mapping?pretty'

Comment 6 Justin Sherrill 2016-05-18 13:31:21 UTC

Also ask the user whether they ran katello:reindex recently.

Comment 12 Justin Sherrill 2016-05-19 18:08:03 UTC

Freddy, if you look at the contents of that script, it should explain it.

I provided that script more for people that are already familiar with it (this is an updated copy).

Comment 14 jnikolak 2016-05-20 00:07:32 UTC

We ran this on the customer's box, but no new content hosts showed up.

The reindex just finished and there are still no content hosts.

Re-indexing Katello::Distribution
Re-indexing Katello::PackageGroup
Re-indexing Katello::Erratum

real    222m28.304s
user    36m53.866s
sys     0m47.355s
+ date
Thu May 19 16:01:05 CDT 2016

Comment 15 Justin Sherrill 2016-05-20 14:49:39 UTC

Jon,

Would you be able to run through these steps and gather the output of all commands:


https://gist.github.com/jlsherrill/cceafcf643d489b1a10c2ed260b8126f

Comment 17 jnikolak 2016-05-20 23:30:41 UTC

You have to have the memory to be able to handle the increase.

On the box where this was applied, the server had 32 GB of memory and was hardly using any of it.
I guess it's a case of different options for different servers.

Comment 18 jnikolak 2016-05-21 07:25:19 UTC

In one more environment, I ran the same reindex on the system.
1st time, with the default memory in elasticsearch and tomcat, it failed.
2nd time, with 1 GB of memory in elasticsearch and tomcat, it got further.
3rd time, with memory min 1 GB and 4 GB in both elasticsearch and tomcat, it finished all the way.

This machine only had 8 GB.


My shards are set to:
--> index.number_of_shards: 3

I may change that and see whether the speed increases or decreases, to reproduce the slow-performance issue.

Comment 20 jnikolak 2016-05-22 07:24:25 UTC

I ran more tests.

I tried adjusting shards, replicas, bootstrap.mlockall, tomcat memory and elasticsearch memory.

No setting made any difference to the performance of the reindex.
I got variable results, probably just based on the output of the request at the time.

I put everything back to default.
The only differences I found were:

In elasticsearch.yml:
* If you have more than 1 replica, no content hosts are shown. You have to reindex to get everything back.

* If you have 0 shards, there will be no content hosts.
This might be a good technique to clear the content hosts.

* If the memory set in /etc/sysconfig/elasticsearch exceeds what the machine can handle, it will error out.

* The more replicas you have, the longer it takes.

* Having 10, 100 or 500 shards made no difference to reindex or site performance.

* Having more memory in tomcat or elasticsearch (/etc/sysconfig) won't give much improvement.


#####################
Conclusion: leaving everything at the defaults is fine.
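Given the observations above about replicas and shards, it can help to confirm what the index is actually using before changing anything. The sketch below assumes the Elasticsearch index _settings endpoint is reachable on localhost:9200 and that the response contains number_of_shards and number_of_replicas keys; the function name, the ES_URL variable and the --run flag are illustrative, and the parsing is a tr/sed approximation rather than a JSON parser.

```shell
# Sketch: report the shard and replica counts of the content host index.
# ES_URL, index_setting and --run are illustrative names.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the numeric value of a setting (e.g. number_of_replicas) from a
# _settings response on stdin. Whitespace is stripped first, and the
# quote around the number is treated as optional.
index_setting() {
  key="$1"
  tr -d ' \n\t' | sed -n "s/.*${key}\":\"\{0,1\}\([0-9][0-9]*\).*/\1/p"
}

# Only query Elasticsearch when asked to: ./index_settings.sh --run
if [ "${1:-}" = "--run" ]; then
  settings="$(curl -s "${ES_URL}/katello_katello::system/_settings?pretty")"
  shards="$(printf '%s' "${settings}" | index_setting number_of_shards)"
  replicas="$(printf '%s' "${settings}" | index_setting number_of_replicas)"
  echo "shards=${shards:-unknown} replicas=${replicas:-unknown}"
  # Per the comment above, more than 1 replica went together with an
  # empty content host list.
  if [ "${replicas:-0}" -gt 1 ]; then
    echo "more than 1 replica configured; see the observations above"
  fi
fi
```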

Comment 25 hprakash 2016-06-03 07:10:37 UTC

Another customer (case# 01644740) faced the same issue. After deleting the Elasticsearch index and re-creating it only for "system", the customer was able to load content hosts properly.

Steps followed by the customer:

1- Removed the index from Elasticsearch:
curl -X DELETE 'http://localhost:9200/katello_katello::system/'

2- From the foreman console, ran the command below to regenerate the index:
Katello::System.create_elasticsearch_index
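The two steps can be wrapped in a small script that defaults to a dry run, since the first step deletes data. This is a sketch only: the function name and the DRY_RUN and ES_URL variables are illustrative, and piping the Ruby call into "foreman-rake console" is an assumption about how to run it non-interactively; the comment above only says it was run from the foreman console.

```shell
# Sketch: delete and re-create the "system" index, dry run by default.
# DRY_RUN, ES_URL and recreate_system_index are illustrative names.
DRY_RUN="${DRY_RUN:-1}"
ES_URL="${ES_URL:-http://localhost:9200}"

recreate_system_index() {
  if [ "${DRY_RUN}" = "1" ]; then
    # Show what would be done without touching Elasticsearch.
    echo "would run: curl -X DELETE '${ES_URL}/katello_katello::system/'"
    echo "would run: echo 'Katello::System.create_elasticsearch_index' | foreman-rake console"
    return 0
  fi
  # Step 1: remove the index.
  curl -s -X DELETE "${ES_URL}/katello_katello::system/" || return 1
  # Step 2: re-create it (assumes the console accepts input on stdin).
  echo 'Katello::System.create_elasticsearch_index' | foreman-rake console
}

recreate_system_index
```

Run it once as-is to review the commands, then with DRY_RUN=0 to actually apply them.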

Thanks
Himanshu Prakash

Comment 31 Justin Sherrill 2016-07-07 19:27:48 UTC

The currently attached reindex_object.task does not properly handle hypervisors; uploading a new one.

Comment 32 Justin Sherrill 2016-07-07 19:29:03 UTC

Created attachment 1177435 [details]
reindex_object.rake

Comment 33 hprakash 2016-08-09 05:08:50 UTC

(In reply to Justin Sherrill from comment #32)
> Created attachment 1177435 [details]
> reindex_object.rake

@Justin, would reindex_object.rake resolve the issue in Sat 6.1?
I believe this issue does not exist in Sat 6.2, as Elasticsearch has been removed from it.

Comment 34 Justin Sherrill 2016-08-09 12:30:30 UTC

Yes, it should, and yes, Elasticsearch is gone in 6.2.

Comment 36 Bryan Kearney 2016-09-12 15:28:56 UTC

Adding needinfo onto justin.

Comment 39 Bryan Kearney 2016-12-07 20:17:49 UTC

I am closing this bug out. The root cause was an interaction with Elasticsearch. Elasticsearch has been removed from Satellite 6.2, and this removal will not be backported. Customers should use the resolution above until they are able to upgrade to Satellite 6.2 or later.

Comment 41 Justin Sherrill 2017-03-01 15:18:24 UTC

We will ship this updated rake task as part of a future 6.1.z release

Comment 51 jcallaha 2017-06-22 19:46:02 UTC

Failed in Satellite 6.1.12 Snap 2.

Running the reindex, in its current state, removes all hosts from the index. The UI continues "loading" forever, but a 500 ISE is seen in the web console. After further testing, we determined that the Hypervisor reindex was causing the issue.

Comment 53 jcallaha 2017-06-28 15:34:17 UTC

Verified in Satellite 6.1.12 Snap 3, based on no-break criteria.

The reindex now completes successfully and no longer removes host entries from Elasticsearch. The hypervisor portion no longer runs, but those systems are accounted for by the system portion.

Additionally, I added a sleep to the reindex task per comment #50, allowing me to register new hosts during the process. At no point did the content host page fail to load. After completion, all previous and new content hosts were shown.

[root@cloud-qe-19 yum.repos.d]# foreman-rake katello:reindex
Re-indexing Katello::ActivationKey
Re-indexing Katello::ContentView
Re-indexing Katello::Repository
Re-indexing Katello::ContentViewFilter
Re-indexing Katello::Product
Re-indexing Katello::ContentViewErratumFilterRule
Re-indexing Katello::ContentViewHistory
Re-indexing Katello::Provider
Re-indexing Katello::ContentViewPackageFilterRule
Re-indexing Katello::ContentViewPackageGroupFilterRule
Re-indexing Katello::TaskStatus
Re-indexing Katello::ContentViewPuppetEnvironment
Re-indexing Katello::ContentViewPuppetModule
Re-indexing Katello::Distributor
Re-indexing Katello::HostCollection
Re-indexing Katello::System
Re-indexing Katello::Job
Re-indexing Katello::Notice
Re-indexing Katello::Package
Re-indexing Katello::PuppetModule
Re-indexing Katello::Distribution
Re-indexing Katello::PackageGroup
Re-indexing Katello::Erratum

Comment 55 errata-xmlrpc 2017-06-29 21:16:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1668

