From our CI upgrades we've encountered this failure in `oc adm migrate storage --include=*` when running oc v3.7.0-0.104.0, prior to upgrading to a newer 3.7 build:

fatal: [free-int-master-3c664]: FAILED! => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "--config=/etc/origin/master/admin.kubeconfig",
        "migrate",
        "storage",
        "--include=*",
        "--confirm"
    ],
    "delta": "3:14:46.232383",
    "end": "2017-09-06 20:21:57.360164",
    "failed": true,
    "failed_when_result": true,
    "rc": 1,
    "start": "2017-09-06 17:07:11.127781",
    "stderr": "",
    "stderr_lines": []
}

STDOUT:

error: -n ops-health-monitoring builds/build-09010900z-xb-1: Operation cannot be fulfilled on builds "build-09010900z-xb-1": the object has been modified; please apply your changes to the latest version and try again
summary: total=28748 errors=1 ignored=0 unchanged=12169 migrated=16578
info: to rerun only failing resources, add --include=builds
error: 1 resources failed to migrate
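Per the "info:" hint in the output above, the rerun only needs the failing resource type. A minimal sketch of the retry, plus a look at the conflicting object (namespace and build name taken from the error above):

$ oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=builds --confirm
$ oc get build build-09010900z-xb-1 -n ops-health-monitoring -o yaml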
Clayton, I wonder if we should use PATCH when we run the migration and update objects, or whether we should retry the update or retry the entire command (a sketch of that last option is below). Thoughts?
This is presently blocking upgrades to free-int.
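For the "retry the entire command" option, a minimal shell sketch of what the calling playbook could do; the retry count and sleep interval are illustrative, not anything the playbook does today:

# Hypothetical wrapper: rerun the whole migration a few times before giving up
for attempt in 1 2 3; do
  oc adm --config=/etc/origin/master/admin.kubeconfig \
    migrate storage --include='*' --confirm && break
  sleep 30
done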
Please provide the output of:

$ oc get po/pull-09140110z-ep-1-deploy -o yaml
This looks like it's being fixed in https://github.com/openshift/origin/pull/16340, with a PR to staging: https://github.com/openshift/origin/pull/16350. Waiting for final confirmation from Justin.
Created attachment 1326169 [details] Several data dumps from a node which hung storage migration
Created attachment 1326170 [details] listing showing node watch is not stuck
Mo says that his PRs do not fix this particular issue; they only fix infinite retries. This issue appears to be an etcd watch-cache issue where it hangs onto stale data.
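One way to look for the stale-data symptom (a sketch; it assumes the 3.x etcd key layout /openshift.io/builds/<namespace>/<name>, and that `etcdctl3` is the usual master-side alias that supplies the endpoint and cert flags):

$ oc get build build-09010900z-xb-1 -n ops-health-monitoring -o jsonpath='{.metadata.resourceVersion}'
$ etcdctl3 get /openshift.io/builds/ops-health-monitoring/build-09010900z-xb-1 -w json

If the mod_revision etcd reports is newer than the resourceVersion the API server returns, the watch cache is hanging onto stale data.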
oc v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.free-int.openshift.com:443
openshift v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f

[root@free-int-master-3c664 ~]# etcdctl3 version
etcdctl version: 3.1.9
API version: 3.1

etcd-3.1.9-2.el7.x86_64
[root@free-int-master-3c664 ~]# etcd --version
etcd Version: 3.1.9
Git SHA: 0f4a535
Go Version: go1.8.3
Go OS/Arch: linux/amd64
The etcd members on free-int are still struggling: `etcdctl3 endpoint status -w json` was timing out occasionally, and some API calls took a minute to respond or timed out. It seems likely this triggered https://github.com/coreos/etcd/issues/8411, which causes a dropped etcd node that restores a snapshot as part of rejoining the cluster to not send watch events for changes in that snapshot to watchers.
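For anyone checking member health the same way, the commands used here (endpoints and certs come from the `etcdctl3` alias on the masters):

$ etcdctl3 endpoint status -w json
$ etcdctl3 endpoint health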
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1492186 requesting a rebase to etcd 3.2.6 or later.
One more occurrence of this after restarting master-api on all three masters, which might call into question whether there is only one cause of this problem.

- Ran an upgrade.
- After a long time, it timed out due to the migrate issue on a build object.
- The next morning, I logged in and ran migrate manually on builds. It succeeded.

Output from upgrade timeout:
-----------------------------------------
TASK [Upgrade all storage] *****************************************************
Tuesday 19 September 2017 01:16:49 +0000 (0:00:00.337) 0:02:36.722 *****
.... cicd-control still running: Tue Sep 19 01:23:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 01:33:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 01:43:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 01:53:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 02:03:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 02:13:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 02:23:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 02:33:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 02:43:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 02:53:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 03:03:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 03:13:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 03:23:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 03:33:42 UTC 2017 ....
.... cicd-control still running: Tue Sep 19 03:43:42 UTC 2017 ....

fatal: [free-int-master-3c664]: FAILED! => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "--config=/etc/origin/master/admin.kubeconfig",
        "migrate",
        "storage",
        "--include=*",
        "--confirm"
    ],
    "delta": "2:30:58.834079",
    "end": "2017-09-19 03:47:48.005064",
    "failed": true,
    "failed_when_result": true,
    "rc": 1,
    "start": "2017-09-19 01:16:49.170985"
}

STDOUT:

error: -n dakinitest20170517 builds/cakephp-mysql-persistent-3: Timeout: request did not complete within allowed duration
summary: total=29840 errors=1 ignored=0 unchanged=29796 migrated=43
info: to rerun only failing resources, add --include=builds
error: 1 resources failed to migrate

PLAY RECAP *********************************************************************
free-int-master-3c664      : ok=104  changed=7    unreachable=0    failed=1
free-int-master-5470f      : ok=99   changed=7    unreachable=0    failed=0
free-int-master-de987      : ok=99   changed=7    unreachable=0    failed=0
localhost                  : ok=13   changed=0    unreachable=0    failed=0

Tuesday 19 September 2017 03:47:48 +0000 (2:30:59.018) 2:33:35.740 *****
===============================================================================
Upgrade all storage -------------------------------------------------- 9059.02s
openshift_excluder : Get available excluder version -------------------- 65.99s
openshift_excluder : Install docker excluder --------------------------- 13.99s
openshift_excluder : Install openshift excluder ------------------------ 12.19s
Ensure openshift-ansible installer package deps are installed ----------- 7.55s
Check for invalid namespaces and SDN errors ----------------------------- 6.35s
openshift_repos : refresh cache ----------------------------------------- 4.63s
openshift_excluder : Get available excluder version --------------------- 3.06s
Check latest available OpenShift RPM version ---------------------------- 2.41s
openshift_docker_facts : Set docker facts ------------------------------- 2.19s
openshift_docker_facts : Set docker facts ------------------------------- 2.13s
openshift_facts --------------------------------------------------------- 2.13s
Gather Cluster facts and set is_containerized if needed ----------------- 1.44s
Gathering Facts --------------------------------------------------------- 1.35s
Gathering Facts --------------------------------------------------------- 1.35s
Get latest available version of Docker ---------------------------------- 1.16s
openshift_repos : Ensure libselinux-python is installed ----------------- 1.15s
Get current version of Docker ------------------------------------------- 1.10s
openshift_facts --------------------------------------------------------- 1.09s
Gathering Facts --------------------------------------------------------- 0.71s

~/aos-cd/git/openshift-ansible-ops/playbooks/release/bin
BYO CONTROL PLANE upgrade process returned: 2
BYO CONTROL PLANE upgrade terminated with a failure

real    153m37.640s
user    3m37.365s
sys     1m44.595s
Test build here https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=597908
The 1.7.x rebase landed and should contain the fix for this.
Fixed in 1.7.8; still needs to be picked.
The stale update issue is fixed in https://github.com/openshift/origin/pull/16722. We still need etcd 3.2.6+ to fix the underlying cause.
The stale update issue is fixed in 3.7.0-0.146.1+.
@jupierce, could you help provide more detail about how to reproduce/verify this issue? Thanks.
In an HA etcd configuration with three nodes (E1, E2, and E3) running the updated etcd and pull 16722:

- Fill up the /var file system on E2 until etcd has no space to record data. etcd should crash or report errors.
- Delete the file you created to fill up /var. Allow E2 to recover (ensure the etcd process is running).
- Run a storage migration on all resources.

Repeat this process several times (see the sketch after this list). If migration runs successfully each time, I would consider this verified.
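A minimal shell sketch of one iteration on E2; the filler file path and the use of fallocate are illustrative, and anything that exhausts the /var filesystem works:

# On E2: exhaust the /var filesystem so etcd has no space left to record data
fallocate -l $(df --output=avail -B1 /var | tail -n 1) /var/fill.tmp
systemctl status etcd        # expect etcd to have crashed or be logging write errors

# Recover E2: delete the filler and make sure etcd comes back up
rm -f /var/fill.tmp
systemctl restart etcd

# From a master: run the storage migration over all resources
oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include='*' --confirm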
Currently, upgrading from an older 3.7 version to a newer one is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1502866, so we can't run the upgrade; verification of this bug is blocked for now.
Tried to verify this, but was blocked by another bug in the upgrade process: Bug 1507433 - oadm migrate storage failed in upgrade process.
Verified on an OCP env: set up an HA env on an AWS cluster, upgraded OCP from 3.7.0-0.190.0 to 3.7.0-0.191.0, filled up /var and then recovered it, and ran the storage migration on all resources. It works well.

# openshift version
openshift v3.7.0-0.191.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8