Bug 2101622 - Drain happens before other image-registry pod is ready to service requests, causing disruption
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.12.0
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 2102639
 
Reported: 2022-06-28 01:46 UTC by Dennis Periquet
Modified: 2023-01-17 19:51 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:50:47 UTC
Target Upstream Version:
Embargoed:




Links:
  GitHub openshift/sdn pull 442 (status: Merged): Bug 2101622: Revert 'kube 1.24 rebase'; master (last updated 2022-08-05 16:27:03 UTC)
  Red Hat Product Errata RHSA-2022:7399 (last updated 2023-01-17 19:51:07 UTC)

Description Dennis Periquet 2022-06-28 01:46:18 UTC
Description of problem:

During OpenShift upgrades, the amount of time (in seconds) that the image registry endpoint is available
is tracked on each job run.  Those times are stored in a database, and subsequent job runs compare the
amount of time the endpoint was unavailable against that history.  If the unavailable time is much greater
than what is considered normal, the disruption test fails.
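
As a rough illustration only (this is not the real test code, which lives in the openshift/origin disruption suite), the check amounts to comparing the observed downtime against the historically derived allowance:

  # Conceptual sketch; the values are taken from the failure below.
  observed_unavailable_s=27   # seconds the endpoint was unreachable in this run
  max_allowed_s=1             # allowance derived from historical runs
  if [ "$observed_unavailable_s" -gt "$max_allowed_s" ]; then
    echo "disruption test fails: ${observed_unavailable_s}s > ${max_allowed_s}s"
  fi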

In this case, the test failure looks like this:

disruption_tests: [sig-imageregistry] Image registry remains available using new connections	1h0m22s
{Jun 27 14:50:55.710: ns/openshift-image-registry route/test-disruption-new disruption/image-registry connection/new was unreachable during disruption:  for at least 27s of 59m47s (maxAllowed=1s):

for this job run:
  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1541403733752025088

In the first spyglass chart, scroll down to the "node-state" and "endpoint-availability" sections.
The vertical red lines next to the openshift-image-registry route denote disruption events.

The disruption looks like it coincides with when a node is drained.

I pulled some logs from Loki (the same query can be used in the Loki UI) as well as from the node logs
in the gather-extra artifacts directory.

logcli --bearer-token="$GRAFANA_API_KEY" --tls-skip-verify --addr https://grafana-loki.ci.openshift.org/api/datasources/proxy/3/ query '{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1541403733752025088"} | unpack | namespace="openshift-image-registry" | container_name="registry"' --since 240h --limit 1000000 --timezone=UTC --forward > registry.log

I looked at the log to determine which image-registry-xxx-xxx pods were running when the nodes (master-0
and worker-a) were drained, in order to examine what was happening during the first chunk of disruption.
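
For reference, the kubelet timestamps quoted below came from the per-node journal logs.  A rough sketch of the kind of filtering used, assuming the node journals from the gather-extra artifacts were saved locally as worker-a.log and worker-b.log (hypothetical filenames):

  # Hypothetical local copies of the kubelet journals from gather-extra.
  grep -h 'SyncLoop (probe).*image-registry' worker-a.log worker-b.log | sort
  grep -h 'SyncLoop DELETE.*image-registry' worker-a.log | sort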

The two pods were:

image-registry-6fbb499746-gc4hh (worker-a)
image-registry-6fbb499746-95xvn (worker-b)

The timeline is as follows (times are all in UTC):

First, note that the image-registry pods have a readinessProbe with an initialDelaySeconds of 15:

  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /healthz
      port: 5000
      scheme: HTTPS
    initialDelaySeconds: 15
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
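
With an initialDelaySeconds of 15, a replacement pod cannot report ready until at least roughly 15 seconds after its container starts, which matters for the timeline below.  As a spot-check on a live cluster (assuming the default deployment name), the probe settings can be read back with something like:

  # Print the registry container's readiness probe from the image-registry deployment.
  oc -n openshift-image-registry get deployment image-registry \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'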

The first pod image-registry-6fbb499746-gc4hh is scheduled and ready on worker-a

14:18:06 scheduled on worker-a "image-registry-6fbb499746-gc4hh"
14:18:26 ci-op-04h4d5pl-d3bee-l5rtj-worker-a-74259 hyperkube[1473]: I0627 14:18:26.468399    1473 kubelet.go:2205] "SyncLoop (probe)" probe="readiness" status="ready" pod="openshift-image-registry/image-registry-6fbb499746-gc4hh"

The second pod image-registry-6fbb499746-95xvn is scheduled on worker-b.

14:33:42 scheduled on worker b "image-registry-6fbb499746-95xvn"
14:33:42 Created pod "image-registry-6fbb499746-95xvn"

Worker-a is drained and so is master-0; you can see this happening at about the same time in the spyglass chart.

14:33:35 worker-a drain
14:33:36 master-0 drain
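
During the upgrade this drain is performed by the machine-config operator rather than by hand; conceptually it is similar to running something like the following (illustrative only, not taken from the job logs):

  # Roughly what the automated drain does to worker-a.
  oc adm drain ci-op-04h4d5pl-d3bee-l5rtj-worker-a-74259 \
    --ignore-daemonsets --delete-emptydir-data --force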

The first pod image-registry-6fbb499746-gc4hh is being deleted (as part of the drain).

14:33:42.116754 ci-op-04h4d5pl-d3bee-l5rtj-worker-a-74259 hyperkube[1473]: I0627 14:33:42.116706    1473 kubelet.go:2111] "SyncLoop DELETE" source="api" pods=[openshift-image-registry/image-registry-6fbb499746-gc4hh]

Disruption occurs (I took this from the event-filter.html at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1541403733752025088/artifacts/e2e-gcp-upgrade/gather-must-gather/artifacts/event-filter.html)

14:33:49 disruption began on new connection
14:33:49 disruption ended on new connection

The pod continues its deletion cycle.

14:34:07.365944 ci-op-04h4d5pl-d3bee-l5rtj-worker-a-74259 hyperkube[1473]: I0627 14:34:07.365902    1473 kuberuntime_container.go:728] "Killing container with a grace period" pod="openshift-image-registry/image-registry-6fbb499746-gc4hh" podUID=b0409415-2fb0-470d-bfb9-a60e7ee3ccd0 containerName="registry" containerID="cri-o://0f268dd9f18e3056ef6f11a240552372f5142a8757014d6374976f57622e8c9d" gracePeriod=55

Note that the second pod (which presumably is supposed to be able to service requests) just became ready.

14:34:02.939631 ci-op-04h4d5pl-d3bee-l5rtj-worker-b-zghmp hyperkube[1475]: I0627 14:34:02.939270    1475 kubelet.go:2205] "SyncLoop (probe)" probe="readiness" status="ready" pod="openshift-image-registry/image-registry-6fbb499746-95xvn"

So, I believe the drain occurred too soon (it should have waited until the second pod was ready to service requests).
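
One way to observe this window on a live cluster is to watch the registry service's endpoints while a node drains; if the only ready backend is evicted before the replacement pod becomes ready, the ENDPOINTS column briefly shows <none>, which lines up with the disruption on new connections.  This is a suggested spot-check, not something taken from this job:

  # Watch the image-registry service endpoints during a drain.
  oc -n openshift-image-registry get endpoints image-registry -w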


Version-Release number of selected component (if applicable):

4.11

How reproducible:


The test fails frequently and causes quite a few payload rejections:

https://search.ci.openshift.org/?search=Image+registry+remains+available+using+new+connections&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=Cluster%20upgrade.%5Bsig-imageregistry%5D%20Image%20registry%20remains%20available%20using%20new%20connections
The failing combination tends to be: gcp, amd64, upgrade, upgrade-minor, sdn, ha.
The jobs where it fails include aggregated-gcp-sdn-upgrade-4.11-micro and periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade.

Steps to Reproduce:
1.
2.
3.

Actual results:

Excessive disruption because the other pod is not yet ready when the first one is deleted.

Expected results:

Minimal disruption.

Additional info:

Comment 3 Devan Goodwin 2022-06-28 10:56:19 UTC
Reassigning to Node as I suspect this is outside the realm of the image registry, and also likely affecting console and oauth.

Comment 4 Devan Goodwin 2022-06-28 10:57:02 UTC
Bumping severity to urgent: this is blocking payloads, with 12 consecutive CI stream failures.

Comment 14 zhaozhanqi 2022-08-11 07:45:50 UTC
Move this to verified.

Comment 17 errata-xmlrpc 2023-01-17 19:50:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

