Bug 1489168 - etcd watch-cache does not update stale data -- API conflicts result (storage migration retries indefinitely)
Summary: etcd watch-cache does not update stale data -- API conflicts result (storage ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Maciej Szulik
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On: 1492186
Blocks:
 
Reported: 2017-09-06 20:50 UTC by Scott Dodson
Modified: 2019-11-21 17:35 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: There were two problems: (1) the stale cached version was re-used even when a conflict had been noticed; (2) an incorrect comparison was used when updating objects in the cache.
Consequence: The server sometimes worked with stale data.
Fix: (1) If the watch cache receives a conflict while trying to do the update, retry with a live object instead of the cached version. (2) Correctly verify the in-memory serialization against the on-disk serialization by fetching the latest serialized data.
Result: etcd data should be correct.
Clone Of:
Environment:
Last Closed: 2019-11-21 17:35:04 UTC
Target Upstream Version:
Embargoed:


Attachments
Several data dumps from a node which hung storage migration (143.51 KB, text/plain)
2017-09-14 18:49 UTC, Justin Pierce
listing showing node watch is not stuck (960.46 KB, text/plain)
2017-09-14 18:51 UTC, Justin Pierce


Links
Red Hat Bugzilla 1492186 (Priority: high, Status: CLOSED) - Rebase etcd to 3.2.6 or later - Last Updated: 2021-02-22 00:41:40 UTC

Internal Links: 1492186

Description Scott Dodson 2017-09-06 20:50:57 UTC
In our CI upgrades we've encountered the following failure in `oc adm migrate storage --include=*` when running oc v3.7.0-0.104.0 prior to upgrading to a newer 3.7 build.



fatal: [free-int-master-3c664]: FAILED! => {"changed": true, "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"], "delta": "3:14:46.232383", "end": "2017-09-06 20:21:57.360164", "failed": true, "failed_when_result": true, "rc": 1, "start": "2017-09-06 17:07:11.127781", "stderr": "", "stderr_lines": [], "stdout": "error:     -n ops-health-monitoring builds/build-09010900z-xb-1: Operation cannot be fulfilled on builds \"build-09010900z-xb-1\": the object has been modified; please apply your changes to the latest version and try again\nsummary: total=28748 errors=1 ignored=0 unchanged=12169 migrated=16578\ninfo: to rerun only failing resources, add --include=builds\nerror: 1 resources failed to migrate", "stdout_lines": ["error:     -n ops-health-monitoring builds/build-09010900z-xb-1: Operation cannot be fulfilled on builds \"build-09010900z-xb-1\": the object has been modified; please apply your changes to the latest version and try again", "summary: total=28748 errors=1 ignored=0 unchanged=12169 migrated=16578", "info: to rerun only failing resources, add --include=builds", "error: 1 resources failed to migrate"]}

Comment 1 Michal Fojtik 2017-09-11 09:34:28 UTC
Clayton, I wonder if we should use PATCH when we're running the migration and updating, or whether we should retry the update or retry the entire command. Thoughts?
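
For illustration, a minimal sketch of the "retry the entire command" option (the retry count, delay, and --include=builds filter are assumptions here; this is not the fix that eventually landed, see comment 20):

  # Sketch only: re-run the migrate command a few times if it exits non-zero,
  # which is what happens when a resource hits a conflict.
  for attempt in 1 2 3; do
    if oc adm --config=/etc/origin/master/admin.kubeconfig \
         migrate storage --include=builds --confirm; then
      echo "migration succeeded on attempt ${attempt}"
      break
    fi
    echo "attempt ${attempt} failed (likely a conflict); retrying..." >&2
    sleep 10
  done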

Comment 4 Justin Pierce 2017-09-14 03:30:34 UTC
This is presently blocking upgrades to free-int.

Comment 5 Tomáš Nožička 2017-09-14 13:17:28 UTC
please provide the output of

  $ oc get po/pull-09140110z-ep-1-deploy -o yaml

Comment 7 Maciej Szulik 2017-09-14 14:50:16 UTC
This looks like it's being fixed in https://github.com/openshift/origin/pull/16340, with a PR to staging: https://github.com/openshift/origin/pull/16350.

Waiting for final confirmation from Justin.

Comment 8 Justin Pierce 2017-09-14 18:49:27 UTC
Created attachment 1326169 [details]
Several data dumps from a node which hung storage migration

Comment 9 Justin Pierce 2017-09-14 18:51:37 UTC
Created attachment 1326170 [details]
listing showing node watch is not stuck

Comment 10 Justin Pierce 2017-09-14 18:54:36 UTC
Mo says that his PRs do not fix this particular issue; they only fix the infinite retries. This appears to be an etcd watch-cache issue where it hangs onto stale data.

Comment 11 Justin Pierce 2017-09-14 20:25:51 UTC
oc v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.free-int.openshift.com:443
openshift v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f



[root@free-int-master-3c664 ~]# etcdctl3 version
etcdctl version: 3.1.9
API version: 3.1


etcd-3.1.9-2.el7.x86_64

Comment 12 Justin Pierce 2017-09-14 20:33:40 UTC
[root@free-int-master-3c664 ~]# etcd --version
etcd Version: 3.1.9
Git SHA: 0f4a535
Go Version: go1.8.3
Go OS/Arch: linux/amd64

Comment 14 Jordan Liggitt 2017-09-15 16:46:17 UTC
the etcd members on free-int are still struggling (`etcdctl3 endpoint status -w json` was timing out occasionally, some API calls took a minute to respond or timed out)

it seems likely this triggered https://github.com/coreos/etcd/issues/8411, which causes a dropped etcd node that restores a snapshot as part of rejoining the cluster to not send watch events to watchers for the changes in that snapshot
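
A quick way to sanity-check for that condition is to compare status across all three members; a sketch, assuming etcd listens on the masters at 2379 over TLS and that the etcdctl3 wrapper honours an --endpoints override (otherwise pass the cluster's cert flags to etcdctl directly). A member whose RAFT INDEX lags behind the others after rejoining from a snapshot points at the coreos/etcd#8411 scenario.

  # Compare leader/raft index across all members in one table
  etcdctl3 endpoint status \
      --endpoints=https://free-int-master-3c664:2379,https://free-int-master-5470f:2379,https://free-int-master-de987:2379 \
      -w table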

Comment 15 Scott Dodson 2017-09-15 16:56:27 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1492186 requesting rebase to 3.2.6 or later

Comment 16 Justin Pierce 2017-09-19 13:14:51 UTC
One more occurrence of this after restarting master-api on all three masters; this might call into question whether there is only one cause of this problem.

- Ran an upgrade
- After a long time, it timed out due to a migrate issue on a build object
- The next morning, I logged in and ran migrate manually on builds. It succeeded.

Output from the upgrade timeout:
-----------------------------------------
TASK [Upgrade all storage] *****************************************************
Tuesday 19 September 2017  01:16:49 +0000 (0:00:00.337)       0:02:36.722 ***** 

.... cicd-control still running: Tue Sep 19 01:23:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 01:33:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 01:43:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 01:53:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 02:03:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 02:13:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 02:23:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 02:33:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 02:43:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 02:53:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 03:03:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 03:13:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 03:23:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 03:33:42 UTC 2017 ....

.... cicd-control still running: Tue Sep 19 03:43:42 UTC 2017 ....
fatal: [free-int-master-3c664]: FAILED! => {
    "changed": true, 
    "cmd": [
        "oc", 
        "adm", 
        "--config=/etc/origin/master/admin.kubeconfig", 
        "migrate", 
        "storage", 
        "--include=*", 
        "--confirm"
    ], 
    "delta": "2:30:58.834079", 
    "end": "2017-09-19 03:47:48.005064", 
    "failed": true, 
    "failed_when_result": true, 
    "rc": 1, 
    "start": "2017-09-19 01:16:49.170985"
}

STDOUT:

error:     -n dakinitest20170517 builds/cakephp-mysql-persistent-3: Timeout: request did not complete within allowed duration
summary: total=29840 errors=1 ignored=0 unchanged=29796 migrated=43
info: to rerun only failing resources, add --include=builds
error: 1 resources failed to migrate

PLAY RECAP *********************************************************************
free-int-master-3c664      : ok=104  changed=7    unreachable=0    failed=1   
free-int-master-5470f      : ok=99   changed=7    unreachable=0    failed=0   
free-int-master-de987      : ok=99   changed=7    unreachable=0    failed=0   
localhost                  : ok=13   changed=0    unreachable=0    failed=0   

Tuesday 19 September 2017  03:47:48 +0000 (2:30:59.018)       2:33:35.740 ***** 
=============================================================================== 
Upgrade all storage -------------------------------------------------- 9059.02s
openshift_excluder : Get available excluder version -------------------- 65.99s
openshift_excluder : Install docker excluder --------------------------- 13.99s
openshift_excluder : Install openshift excluder ------------------------ 12.19s
Ensure openshift-ansible installer package deps are installed ----------- 7.55s
Check for invalid namespaces and SDN errors ----------------------------- 6.35s
openshift_repos : refresh cache ----------------------------------------- 4.63s
openshift_excluder : Get available excluder version --------------------- 3.06s
Check latest available OpenShift RPM version ---------------------------- 2.41s
openshift_docker_facts : Set docker facts ------------------------------- 2.19s
openshift_docker_facts : Set docker facts ------------------------------- 2.13s
openshift_facts --------------------------------------------------------- 2.13s
Gather Cluster facts and set is_containerized if needed ----------------- 1.44s
Gathering Facts --------------------------------------------------------- 1.35s
Gathering Facts --------------------------------------------------------- 1.35s
Get latest available version of Docker ---------------------------------- 1.16s
openshift_repos : Ensure libselinux-python is installed ----------------- 1.15s
Get current version of Docker ------------------------------------------- 1.10s
openshift_facts --------------------------------------------------------- 1.09s
Gathering Facts --------------------------------------------------------- 0.71s
~/aos-cd/git/openshift-ansible-ops/playbooks/release/bin
BYO CONTROL PLANE upgrade process returned: 2
BYO CONTROL PLANE upgrade terminated with a failure

real	153m37.640s
user	3m37.365s
sys	1m44.595s

Comment 17 Scott Dodson 2017-09-19 14:20:38 UTC
Test build here https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=597908

Comment 18 Michal Fojtik 2017-10-06 08:53:39 UTC
The 1.7.x rebase landed and should contain the fix for this.

Comment 19 Jordan Liggitt 2017-10-06 14:24:10 UTC
fixed in 1.7.8, still needs to be cherry-picked

Comment 20 Jordan Liggitt 2017-10-11 13:40:11 UTC
stale update issue fixed in https://github.com/openshift/origin/pull/16722

still need etcd 3.2.6+ to fix underlying cause

Comment 21 Jordan Liggitt 2017-10-13 01:03:04 UTC
stale update issue fixed in 3.7.0-0.146.1+

Comment 22 ge liu 2017-10-13 10:22:59 UTC
@jupierce, could you help to provide more detail about how to reproduce/verify this issue? thanks

Comment 23 Justin Pierce 2017-10-13 15:42:45 UTC
In an HA etcd configuration with three nodes (E1, E2, and E3) running the updated etcd & pull 16722:
- Fill up the /var file system on E2 until etcd has no space to record data. etcd should crash or report errors.
- Delete the file you created to fill up /var. Allow E2 to recover (ensure etcd process is running)
- Run a storage migration on all resources. 

Repeat this process several times. If migration runs successfully each time, I would consider this verified.
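
A rough shell sketch of these steps, run as root on E2 (the filler file path and the use of systemctl to restart etcd are assumptions; adapt to however etcd is managed on the host):

  # Fill /var until etcd has no space left (dd stops on ENOSPC)
  dd if=/dev/zero of=/var/fill-disk-test bs=1M
  # ... confirm etcd on E2 crashes or reports errors ...

  # Recover: delete the filler file and make sure etcd is running again
  rm -f /var/fill-disk-test
  systemctl restart etcd

  # From a master, run a storage migration on all resources
  oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include='*' --confirm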

Comment 24 ge liu 2017-10-17 10:29:27 UTC
Currently, upgrading from an older 3.7 version to a newer one is blocked by bug https://bugzilla.redhat.com/show_bug.cgi?id=1502866, so we can't do the upgrade; verification of this bug is blocked for now.

Comment 25 ge liu 2017-10-30 09:20:17 UTC
Tried to verify it, but was blocked by another bug in the upgrade process: Bug 1507433 - oadm migrate storage failed in upgrade process.

Comment 26 ge liu 2017-11-06 08:18:07 UTC
Verified on an OCP environment: set up an HA environment on an AWS cluster, upgraded OCP from 3.7.0-0.190.0 to 3.7.0-0.191.0, filled up /var and then recovered it, and ran the storage migration on all resources; it works well.

# openshift version
openshift v3.7.0-0.191.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

