Bug 1479435 - [RFE] KUBE_PING does not separate clusters during Rolling Upgrade
Status: NEW
Product: OpenShift Container Platform
Classification: Red Hat
Component: RFE
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Eric Paris
QA Contact: Xiaoli Tian
Depends On:
Blocks: 1267746
 
Reported: 2017-08-08 10:44 EDT by Francesco Marchioni
Modified: 2017-10-09 12:58 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Documentation: ---




External Trackers
Tracker: JBoss Issue Tracker CLOUD-2001
Priority: Major
Status: New
Summary: KUBE_PING requires option to separate clusters during Rolling Update
Last Updated: 2017-11-03 08:16 EDT

Description Francesco Marchioni 2017-08-08 10:44:34 EDT
> 3. What is the nature and description of the request?  

We have discovered the following bug/misbehavior in the KUBE_PING protocol of the JBoss EAP Docker image.
We are running the following Docker image in production: https://access.redhat.com/containers/?tab=overview#/registry.access.redhat.com/jboss-eap-6/eap64-openshift
We recently tried to upgrade the deploymentconfig from version
jboss-eap-6/eap64-openshift:1.4-13
to
jboss-eap-6/eap64-openshift:1.4-34

We saw many errors like:
12474271 --> 2017/04/26 21:18:32.000686 WARN  [org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 48) JOIN(ahp-adminui-10-ddbe8/web) sent to ahp-adminui-9-fl7lm/web timed out (after 3000 ms), on try 202

During the rolling upgrade phase, the pod with prefix ahp-adminui-10 tried to join 2 pods with prefix ahp-adminui-9 from another cluster.

This RFE is to avoid this misbehavior of KUBE_PING during a rolling upgrade.
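To illustrate the current behavior, here is a minimal Java sketch (not the actual KUBE_PING source; podsMatchingSelector is a hypothetical stand-in for the Kubernetes API query): discovery returns every pod that matches the shared label selector, so the new -10 pod contacts the old -9 pods as well.

import java.util.Arrays;
import java.util.List;

public class KubePingDiscoverySketch {

    // Hypothetical stand-in for the pod list returned by the API server
    // for the label selector that both deployment revisions share.
    static List<String> podsMatchingSelector() {
        return Arrays.asList(
                "ahp-adminui-9-fl7lm",   // old revision, still running
                "ahp-adminui-9-k2x8p",   // old revision, still running
                "ahp-adminui-10-ddbe8"); // new revision, starting up
    }

    public static void main(String[] args) {
        // Every returned pod is treated as a potential cluster member, which
        // is why the new -10 pod sends JOIN requests to the old -9 pods and
        // eventually times out (see the GMS warning above).
        for (String pod : podsMatchingSelector()) {
            System.out.println("Sending JOIN to " + pod);
        }
    }
}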

> 4. Why does the customer need this? (List the business requirements here)  

Currently the customer recreates the deployment config or uses the undeploy/deploy option to tackle this issue, but both result in downtime (which is not acceptable for our long-term SLA commitment).
Another option, using a different template for each cluster, could also be implemented, but it requires some effort because the customer deploys preconstructed JSON/YAML objects (routes, services, deployment configs, ...) in a static way.


> 5. How would the customer like to achieve this? (List the functional requirements here)  

We request, as a solution to this issue, that the KUBE_PING protocol support a variable like CLUSTER_CREATION_ONLY_FOR_POD_SIBLINGS=true to avoid the above behavior (during a rolling upgrade on OpenShift we cannot guarantee that no serialized objects in the cache have changed in the newer version).

During a rolling upgrade we have, for the duration of the upgrade, a pod <dc-name>-1-XXXXX and a new pod <dc-name>-2-YYYYY that has just started.
Both names will be in the list retrieved by the KUBE_PING implementation.
With OPENSHIFT_KUBE_PING_ONLY_POD_SIBLINGS=true, however, a new pod <dc-name>-2-ZZZZZ would only retrieve pod names with the prefix <dc-name>-2-*.

In other words, the incrementing deployment config number could be used as the discriminant for joining the cluster.
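Here is a minimal sketch of the requested sibling filter, assuming pod names follow the <dc-name>-<deployment-number>-<suffix> pattern; deploymentPrefix and filterSiblings are hypothetical helper names, not existing KUBE_PING code.

import java.util.List;
import java.util.stream.Collectors;

public class SiblingFilterSketch {

    // Extract "<dc-name>-<deployment-number>" from a pod name,
    // e.g. "ahp-adminui-10-ddbe8" -> "ahp-adminui-10".
    static String deploymentPrefix(String podName) {
        int lastDash = podName.lastIndexOf('-');
        return lastDash > 0 ? podName.substring(0, lastDash) : podName;
    }

    // With the proposed flag enabled, keep only the pods that share the local
    // pod's deployment prefix, so a -10 pod never tries to join a -9 pod.
    static List<String> filterSiblings(String localPod, List<String> allPods) {
        String prefix = deploymentPrefix(localPod);
        return allPods.stream()
                      .filter(pod -> deploymentPrefix(pod).equals(prefix))
                      .collect(Collectors.toList());
    }
}

Whether the prefix is derived from the pod name, the pod's hostname, or a dedicated label would be an implementation detail left to the protocol.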

> 6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.  
I think it will be easy to check that, during the rolling upgrade phase, a pod with a prefix such as ahp-adminui-10 does not join pods with the prefix ahp-adminui-9 from another cluster.
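As a concrete (hypothetical) illustration of such a check, under the same pod-naming assumption as above:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RollingUpgradeCheckSketch {

    // "ahp-adminui-10-ddbe8" -> "ahp-adminui-10"
    static String deploymentPrefix(String pod) {
        return pod.substring(0, pod.lastIndexOf('-'));
    }

    public static void main(String[] args) {
        List<String> pods = Arrays.asList(
                "ahp-adminui-9-fl7lm", "ahp-adminui-9-k2x8p", "ahp-adminui-10-ddbe8");
        String local = "ahp-adminui-10-ddbe8";

        // Cluster view the new pod should end up with once siblings-only
        // discovery is enabled.
        List<String> view = pods.stream()
                .filter(p -> deploymentPrefix(p).equals(deploymentPrefix(local)))
                .collect(Collectors.toList());

        // Any -9 member in the view would reproduce the reported JOIN timeouts.
        if (!view.equals(Arrays.asList(local))) {
            throw new AssertionError("Unexpected cluster view: " + view);
        }
        System.out.println("OK, cluster view limited to siblings: " + view);
    }
}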

> 10. List any affected packages or components.  
The KUBE_PING JGroups protocol
Comment 2 Sebastian Łaskawiec 2017-08-23 02:23:07 EDT
Linked JIRAs:
* https://issues.jboss.org/browse/JGRP-2212
* https://issues.jboss.org/browse/CLOUD-2001
