Bug 1667441 - Upgrading the OCP 3.10 cluster starts a complete SDN upgrade, causing pod transactions not to finish correctly
Keywords:
Status: CLOSED DUPLICATE of bug 1660880
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Russell Teague
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-18 14:09 UTC by Oscar Casal Sanchez
Modified: 2019-07-11 07:25 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-18 15:37:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1660880 0 high CLOSED 3.10.14 to v3.10.72 upgrade - control plane upgrade upgrades the ovs and sdn pods on all node network causing downtime o... 2022-03-13 16:34:30 UTC

Description Oscar Casal Sanchez 2019-01-18 14:09:45 UTC
Description of problem:

When a user runs the control plane upgrade in separate phases in OCP 3.10, an SDN upgrade happens across the whole cluster and the pods lose their network connections and are recreated, losing in-flight transactions.

This SDN upgrade behavior did not exist in OCP 3.9, where a user had some control over how each node was restarted and could drain it correctly.

How reproducible:

Always. 

Steps to Reproduce:

Run the control plane upgrade playbook following the procedure described in the document "2.2.2. Upgrading the Control Plane and Nodes in Separate Phases" [1]
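
For reference, a minimal sketch of the command from that procedure (the inventory path is a placeholder and the playbook path is as recalled from the 3.10 documentation; verify both against [1]):

# Placeholder inventory path; playbook path per the OCP 3.10 upgrade docs [1].
ansible-playbook -i </path/to/inventory/file> \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml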

Actual results:

When the control plane upgrade runs, the whole SDN is upgraded and the pods are restarted node by node without any control by the customer, losing transactions from the running pods. It's possible to see this behavior below:

10:51:55 AM	Normal	Created	"Created container
 5 times in the last 2 hours"
10:51:53 AM	Normal	Pulled	"Container image ""rhscl/httpd-24-rhel7"" already present on machine
 5 times in the last 2 hours"
10:51:47 AM	Warning	Back-off	Back-off restarting failed container
10:51:39 AM	Normal	Sandbox Changed	Pod sandbox changed, it will be killed and re-created.
10:51:32 AM	Warning	Failed	Error: failed to start container "httpd-01": Error response from daemon: cannot join network of a non running container: f967f51f568d763d6b4334696eba07347452e04e9f3f3323914227c2deeeeeee
10:51:28 AM	Normal	Killing	"Killing container with id docker://httpd-01:Container failed liveness probe.. Container will be killed and recreated.
 3 times in the last 2 hours"
10:51:26 AM	Warning	Unhealthy	"Liveness probe failed: Get http://10.128.4.190:8080/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
 7 times in the last 2 hours"
10:51:23 AM	Warning	Network Failed	The pod's network interface has been lost and the pod will be stopped.
10:51:17 AM	Normal	Started	"Started container
 3 times in the last 2 hours"
10:46:25 AM	Normal	Scheduled	Successfully assigned httpd-01-47-jh6nf to server1.example.com
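
One way to observe this during the upgrade (a minimal sketch; it assumes the default 3.10 SDN deployment, where the sdn-* and ovs-* daemonset pods run in the openshift-sdn namespace):

# Watch the SDN/OVS daemonset pods being recreated on every node
# while the control plane upgrade playbook is still running.
oc get pods -n openshift-sdn -o wide --watch

# Correlate with pod events such as "Network Failed" and sandbox changes.
oc get events --all-namespaces --sort-by=.lastTimestamp | grep -i -e sdn -e 'network failed'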


Expected results:

What the user wants is to have control over the upgrade process and to be able to upgrade parts/regions of the cluster as in OCP 3.9, where the upgrade procedure did not run a complete SDN rollout. In OCP 3.9, the customer could drain a node first, giving the transactions the opportunity to finish correctly.
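
For illustration, the kind of per-node control being asked for, sketched with the drain commands and the openshift_upgrade_nodes_label option (the node name, label, inventory path, and playbook path are examples/assumptions to be checked against [1]):

# Hypothetical node name and label; adjust to the real inventory.
# Drain the node first so in-flight transactions can finish and
# workloads are rescheduled before anything is restarted on it.
oc adm manage-node node1.example.com --schedulable=false
oc adm drain node1.example.com --ignore-daemonsets --delete-local-data

# Upgrade only the nodes matching a label, one group/region at a time.
ansible-playbook -i </path/to/inventory/file> \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml \
    -e openshift_upgrade_nodes_label="region=primary"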


Additional information:

A Bugzilla already exists, BZ1660880 [2], but it only aims to move the SDN upgrade out of the control plane upgrade and into the node upgrade phase.

[1] https://access.redhat.com/documentation/en-us/openshift_container_platform/3.10/html/upgrading_clusters/install-config-upgrading-automated-upgrades#upgrading-control-plane-nodes-separate-phases
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1660880

