Bug 1479435

Summary:	[RFE] KUBE_PING does not separate clusters during Rolling Upgrade
Product:	OpenShift Container Platform	Reporter:	Francesco Marchioni <fmarchio>
Component:	RFE	Assignee:	Eric Paris <eparis>
Status:	CLOSED WONTFIX	QA Contact:	Xiaoli Tian <xtian>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	3.2.0	CC:	aos-bugs, jokerman, mmccomas, myllynen, rafael.ruiz, slaskawi, trogers
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-06-12 11:57:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1267746

Description Francesco Marchioni 2017-08-08 14:44:34 UTC

> 3. What is the nature and description of the request?  

We have discovered the following bug / misbehavior in the KUBE_PING protocol of the JBoss EAP docker image.
We are running the following docker image in production: https://access.redhat.com/containers/?tab=overview#/registry.access.redhat.com/jboss-eap-6/eap64-openshift
We recently tried to upgrade the deploymentconfig from version
jboss-eap-6/eap64-openshift:1.4-13
to
jboss-eap-6/eap64-openshift:1.4-34

We saw many errors like:
12474271 --> 2017/04/26 21:18:32.000686 WARN  [org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 48) JOIN(ahp-adminui-10-ddbe8/web) sent to ahp-adminui-9-fl7lm/web timed out (after 3000 ms), on try 202

During the rolling upgrade phase, the pod with prefix ahp-adminui-10 tried to join 2 pods with prefix ahp-adminui-9 from another cluster.

This RFE asks that KUBE_PING avoid this misbehavior during a rolling upgrade.

> 4. Why does the customer need this? (List the business requirements here)  

Currently the customer recreates the deployment config or uses the undeploy/deploy option to work around this issue, but both result in downtime (which is not acceptable for our SLA commitment in the long term).
Another option, using a different template for each cluster, could be implemented, but it would take some effort because the customer deploys preconstructed "json/yaml" objects (routes, svc, dc, ...) in a static way.


> 5. How would the customer like to achieve this? (List the functional requirements here)  

We request, as a solution to this issue, that the KUBE_PING protocol support a variable like CLUSTER_CREATION_ONLY_FOR_POD_SIBLINGS=true that avoids the above behavior (during a rolling upgrade of OpenShift we cannot ensure that no serialized objects in the cache have changed in the newer version).

During a rolling upgrade we have, for the duration of the upgrade, an existing pod <dc-name>-1-XXXXX and a newly started pod <dc-name>-2-YYYYY.
Both names will be in the list retrieved by the KUBE_PING implementation.
Setting OPENSHIFT_KUBE_PING_ONLY_POD_SIBLINGS=true, however, would allow a new pod <dc-name>-2-ZZZZZ to retrieve only pod names with the prefix <dc-name>-2-*.

In other words, we could use the incrementing deployment config number as the discriminator for joining the cluster.
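
The requested sibling filter can be sketched as follows. This is only an illustration of the idea, not KUBE_PING code: the function names are hypothetical, the pod name ahp-adminui-10-abcde is made up for the example, and it assumes pod names always follow the <dc-name>-<deployment-number>-<random-suffix> pattern seen in this report.

```python
from typing import Iterable, List


def deployment_prefix(pod_name: str) -> str:
    """Return '<dc-name>-<deployment-number>' by dropping the random pod suffix.

    Assumes the pod name ends in '-<random-suffix>' (hypothetical helper).
    """
    prefix, sep, _suffix = pod_name.rpartition("-")
    if not sep or not prefix:
        raise ValueError(f"unexpected pod name: {pod_name!r}")
    return prefix


def sibling_pods(own_name: str, discovered: Iterable[str]) -> List[str]:
    """Keep only discovered pods that share this pod's deployment prefix."""
    own_prefix = deployment_prefix(own_name)
    return [
        name
        for name in discovered
        if name != own_name
        and "-" in name
        and deployment_prefix(name) == own_prefix
    ]


# Pod list as it might be returned during a rolling upgrade from 9 to 10.
discovered = [
    "ahp-adminui-9-fl7lm",
    "ahp-adminui-10-ddbe8",
    "ahp-adminui-10-abcde",  # made-up sibling pod for illustration
]
siblings = sibling_pods("ahp-adminui-10-ddbe8", discovered)
print(siblings)
```

With such a filter, a pod from deployment 10 would only ever attempt to join other pods from deployment 10, and the pods from deployment 9 would be ignored.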

> 6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.  
This should be easy to check: during the rolling upgrade phase, a pod with a prefix such as ahp-adminui-10 must not join pods with prefix ahp-adminui-9 from another cluster.
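
One possible way to automate this check is to scan the server logs for JOIN attempts that cross deployment boundaries. The sketch below is an assumption-laden illustration: the regular expression is modeled only on the single GMS warning quoted in this report, and the function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Modeled on the one warning quoted in this report; other JGroups
# versions may format the message differently.
JOIN_TIMEOUT = re.compile(
    r"JOIN\((?P<joiner>[^/)]+)/[^)]*\) sent to (?P<target>[^/\s]+)/\S+ timed out"
)


def cross_deployment_joins(log_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (joiner, target) pairs whose deployment prefixes differ."""
    hits = []
    for line in log_lines:
        match = JOIN_TIMEOUT.search(line)
        if match is None:
            continue
        joiner, target = match.group("joiner"), match.group("target")
        # Drop the random pod suffix to compare '<dc-name>-<number>' prefixes.
        if joiner.rpartition("-")[0] != target.rpartition("-")[0]:
            hits.append((joiner, target))
    return hits


sample = [
    "2017/04/26 21:18:32.000686 WARN  [org.jgroups.protocols.pbcast.GMS] "
    "(ServerService Thread Pool -- 48) JOIN(ahp-adminui-10-ddbe8/web) sent to "
    "ahp-adminui-9-fl7lm/web timed out (after 3000 ms), on try 202",
]
violations = cross_deployment_joins(sample)
print(violations)
```

An empty result after a rolling upgrade would suggest that no pod tried to join a pod from a different deployment, at least as far as this log message is concerned.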

> 10. List any affected packages or components.  
The KUBE_PING JGroups protocol

Comment 2 Sebastian Łaskawiec 2017-08-23 06:23:07 UTC

Linked JIRAs:
* https://issues.jboss.org/browse/JGRP-2212
* https://issues.jboss.org/browse/CLOUD-2001

Comment 4 Kirsten Newcomer 2019-06-12 11:57:51 UTC

With the introduction of OpenShift 4, Red Hat has delivered or roadmapped a substantial number of features based on feedback by our customers.  Many of the enhancements encompass specific RFEs which have been requested, or deliver a comparable solution to a customer problem, rendering an RFE redundant.

This bz (RFE) has been identified as a feature request not yet planned or scheduled for an OpenShift release and is being closed. 

If this feature is still an active request that needs to be tracked, Red Hat Support can assist in filing a request in the new JIRA RFE system, as well as provide you with updates as the RFE progresses within our planning processes. Please open a new support case: https://access.redhat.com/support/cases/#/case/new

As the new Jira RFE system is not yet public, Red Hat Support can help answer your questions about your RFEs via the same support case system.