Bug 1387391 - api-server never initializes when configured with storage-backend=etcd3 - logs flooded with client transport error msgs
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Assigned To: Timothy St. Clair
QA Contact: Mike Fiedler
Whiteboard: aos-scalability-34
Reported: 2016-10-20 14:39 EDT by Mike Fiedler
Modified: 2017-03-08 13 EST

Doc Type: No Doc Update
Last Closed: 2017-01-18 07:43:58 EST
Type: Bug


Attachments
api server logs (312.04 KB, application/x-gzip), 2016-10-20 14:39 EDT, Mike Fiedler
etcd3 cluster leader log with --debug=true (154.92 KB, application/x-gzip), 2016-10-20 14:40 EDT, Mike Fiedler
api server logs at loglevel=8 (13.29 KB, application/x-gzip), 2016-10-20 15:46 EDT, Mike Fiedler


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2017:0066
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.4 RPM Release Advisory
Last Updated: 2017-01-18 12:23:26 EST

Description Mike Fiedler 2016-10-20 14:39:13 EDT
Description of problem:

After updating the etcd systems in an HA cluster from 2.3.7 to 3.0.12, migrating the data, and configuring the master-api servers to use storage-backend=etcd3, the api-server never finishes starting. The logs are flooded with the following errors:

Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken read tcp 192.1.11.211:33308->192.1.11.215:2379: read: connection reset by peer.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken read tcp 192.1.11.211:34574->192.1.11.214:2379: read: connection reset by peer.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken read tcp 192.1.11.211:48334->192.1.11.216:2379: read: connection reset by peer.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken read tcp 192.1.11.211:34570->192.1.11.214:2379: read: connection reset by peer.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken read tcp 192.1.11.211:48374->192.1.11.216:2379: read: connection reset by peer.
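
To see which etcd endpoints are resetting connections, the flood above can be tallied by peer. This is a minimal Python sketch, assuming the journal line format quoted in this report; summarize and sample are illustrative names, not part of OpenShift or etcd.

```python
import re
from collections import Counter

# Matches the client transport error lines quoted in this report.
ERROR_RE = re.compile(
    r"http2Client\.notifyError got notified that the client transport "
    r"was broken (?P<reason>.*)$"
)
# Pulls the remote (etcd) endpoint out of "read tcp SRC->DST: ..." reasons.
PEER_RE = re.compile(r"read tcp \S+->(?P<peer>[\d.]+:\d+):")


def summarize(lines):
    """Return (errors per etcd peer, number of errors with no peer address)."""
    by_peer = Counter()
    no_peer = 0
    for line in lines:
        match = ERROR_RE.search(line)
        if not match:
            continue  # not a transport error line
        peer = PEER_RE.search(match.group("reason"))
        if peer:
            by_peer[peer.group("peer")] += 1
        else:
            no_peer += 1  # e.g. "unexpected EOF."
    return by_peer, no_peer


# Three lines copied from the excerpt above.
sample = [
    "Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken read tcp 192.1.11.211:33308->192.1.11.215:2379: read: connection reset by peer.",
    "Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.",
    "Oct 20 13:41:17 svt-m-1.localdomain openshift[44796]: transport: http2Client.notifyError got notified that the client transport was broken read tcp 192.1.11.211:34574->192.1.11.214:2379: read: connection reset by peer.",
]

by_peer, no_peer = summarize(sample)
for endpoint, count in sorted(by_peer.items()):
    print(endpoint, count)
print("no peer address:", no_peer)
```

Feeding it the full journal (for example, the lines of the attached api server logs) would show whether every etcd member resets connections or only some of them.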


Version-Release number of selected component (if applicable): 3.4.0.11


How reproducible: Always


Steps to Reproduce:

Environment:

A 3.4.0.11 cluster installed with 3 masters, 3 etcd servers running 2.3.7, and 300 nodes. Ran the conformance tests in the cluster and then created 1K projects with 4K pods. Everything was successful up to this point.

This issue has also been reproduced in a smaller cluster with 1 etcd, 1 master and 3 nodes, so cluster size is not a factor.

0. Shut down all OCP masters.
1. Shut down the etcd servers and update them (yum swap) to etcd 3.0.12.
2. On each etcd server, run:  etcdctl migrate --data-dir=/var/lib/etcd
3. Verify that the migration completes with no errors, then restart all etcd servers.
4. On each master, edit master-config.yaml and add the following to apiServerArguments:

  apiServerArguments:
    storage-backend:
    - "etcd3"

5. Restart the master services.
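
The edit in step 4 can be sketched in Python on an already-parsed config. This is an illustration only: with_storage_backend is a hypothetical helper, and it takes whichever mapping contains apiServerArguments, since the exact location of that key inside master-config.yaml is not shown in this report. Loading and saving the YAML file (for example with a YAML library) is left out.

```python
import copy


def with_storage_backend(section, backend="etcd3"):
    """Return a copy of `section` with apiServerArguments.storage-backend set.

    `section` is assumed to be the parsed mapping that holds the
    apiServerArguments key; the input is not modified.
    """
    updated = copy.deepcopy(section)
    # Treat a missing or empty (null) apiServerArguments as an empty mapping.
    args = updated.get("apiServerArguments") or {}
    # Values are lists of strings, matching the YAML in step 4.
    args["storage-backend"] = [backend]
    updated["apiServerArguments"] = args
    return updated


before = {"apiServerArguments": None}
after = with_storage_backend(before)
print(after)
```

Returning a copy keeps the original mapping available, which makes it easy to write the old config back if the masters fail to start.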

Actual results:

The master-api services fail to initialize, and the logs are overrun with the messages above. Attached are the master-api journal logs and the etcd log captured with --debug=true. The etcd logs show no errors.


Expected results:

Migrated cluster operates as before the etcd upgrade.

Additional info:

I ran tcpdump/wireshark: every time the master sends something to the etcd system, the etcd system sends back a TCP reset (RST).
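
For anyone repeating the capture, a reset is identified by the RST bit (0x04) in the flags byte at offset 13 of the TCP header. The Python sketch below only illustrates that check on a raw header; build_header is a hypothetical helper that fabricates a minimal 20-byte header for the example, and the port numbers are taken from the log lines in this report.

```python
import struct

TCP_FLAG_RST = 0x04  # RST bit in the TCP flags byte
TCP_FLAG_PSH = 0x08
TCP_FLAG_ACK = 0x10


def is_reset(tcp_header: bytes) -> bool:
    """Return True if the raw TCP header has the RST flag set."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    return bool(tcp_header[13] & TCP_FLAG_RST)


def build_header(src_port: int, dst_port: int, flags: int) -> bytes:
    """Hypothetical helper: minimal TCP header with no options, zeroed fields."""
    data_offset = 5 << 4  # header length of 5 32-bit words, reserved bits zero
    # src port, dst port, seq, ack, offset/reserved, flags, window, checksum, urgent
    return struct.pack("!HHIIBBHHH", src_port, dst_port, 0, 0,
                       data_offset, flags, 0, 0, 0)


# etcd (port 2379) answering the master with RST+ACK, as described above.
reply = build_header(2379, 33308, TCP_FLAG_RST | TCP_FLAG_ACK)
# An ordinary data segment from the master, for comparison.
request = build_header(33308, 2379, TCP_FLAG_PSH | TCP_FLAG_ACK)

print(is_reset(reply), is_reset(request))
```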
Comment 1 Mike Fiedler 2016-10-20 14:39 EDT
Created attachment 1212615 [details]
api server logs
Comment 2 Mike Fiedler 2016-10-20 14:40 EDT
Created attachment 1212616 [details]
etcd3 cluster leader log with --debug=true
Comment 3 Mike Fiedler 2016-10-20 14:42:22 EDT
Removing storage-backend=etcd3 from master-config.yaml restores cluster functionality.
Comment 4 Mike Fiedler 2016-10-20 15:04:24 EDT
One note: for a successful migration (etcdctl migrate) you might need a patch to etcdctl. See https://bugzilla.redhat.com/show_bug.cgi?id=1386963#c3
Comment 5 Mike Fiedler 2016-10-20 15:46 EDT
Created attachment 1212633 [details]
api server logs at loglevel=8
Comment 6 Timothy St. Clair 2016-10-20 15:57:17 EDT
The client in OpenShift is v3.1.0-alpha.1, and I'm guessing it has not been vetted for version compatibility, since things have changed.
Comment 7 Clayton Coleman 2016-10-21 11:41:58 EDT
This is https://github.com/openshift/origin/pull/10980
Comment 10 Troy Dawson 2016-10-27 12:06:18 EDT
This has been merged into ose and is in OSE v3.4.0.16 or newer.
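A quick way to check whether a given build is at or past that version is a plain numeric comparison. This is a minimal Python sketch, assuming versions are purely numeric dotted strings with an optional leading "v"; has_fix and parse_version are illustrative names.

```python
FIXED_IN = (3, 4, 0, 16)  # "OSE v3.4.0.16 or newer", per the comment above


def parse_version(text):
    """Turn 'v3.4.0.16' or '3.4.0.16' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def has_fix(version):
    """Return True if `version` is at or after the first fixed build."""
    return parse_version(version) >= FIXED_IN


for candidate in ("3.4.0.11", "v3.4.0.16"):
    print(candidate, has_fix(candidate))
```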
Comment 12 Mike Fiedler 2016-11-03 07:39:34 EDT
This specific issue is resolved, although additional issues with running in migrated etcd2->etcd3 clusters are still being investigated.
Comment 15 errata-xmlrpc 2017-01-18 07:43:58 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
