Bug 2076193

Summary: oc patch command for the liveness probe and readiness probe parameters of an OpenShift router deployment doesn't take effect
Product: OpenShift Container Platform
Component: Networking
Networking sub component: router
Version: 4.11
Target Release: 4.11.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Hardware: Unspecified
OS: Unspecified
Reporter: Shudi Li <shudili>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Shudi Li <shudili>
CC: aos-bugs, cscribne, hongli, mmasters
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-08-10 11:07:26 UTC

Description Shudi Li 2022-04-18 07:38:05 UTC
Description of problem: The default timeout of the router's liveness probe and readiness probe is 1s. Changing it to 5s with the oc patch command does not take effect. In addition, a PodSecurity warning is printed when the oc command is executed.


OpenShift release version:
4.11.0-0.nightly-2022-04-16-163450

Cluster Platform:


How reproducible:
Run the oc patch command below, then check whether the change appears in the router deployment and the router pods in the openshift-ingress namespace.

oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
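
The patch body above can also be generated from a variable, which avoids editing the JSON by hand when trying other timeout values. This is a minimal sketch only; the function name probe_timeout_patch is hypothetical and is not part of oc.

```shell
# probe_timeout_patch SECONDS
# Print a strategic-merge patch that sets timeoutSeconds on both probes of
# the "router" container. The function name is hypothetical.
probe_timeout_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":%d},"readinessProbe":{"timeoutSeconds":%d}}]}}}}\n' "$1" "$1"
}

# Usage against a cluster (same command as in this report):
#   oc -n openshift-ingress patch deploy/router-default --type=strategic \
#     --patch="$(probe_timeout_patch 5)"
```

With the argument 5, the function prints exactly the patch body used in this report.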

Steps to Reproduce (in detail):
1. Change the timeout to 5s with the oc patch command.

% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "router" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "router" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "router" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "router" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/router-default patched
% 

2. Run oc describe on deploy/router-default; the timeout is still 1s.
% oc -n openshift-ingress describe deploy/router-default | grep -e Liveness: -e Readiness:
    Liveness:   http-get http://:1936/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:1936/healthz/ready delay=0s timeout=1s period=10s #success=1 #failure=3
% 
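
To compare the values before and after a patch without reading the full probe lines, the timeout can be pulled out of the describe output. The following is a sketch under the assumption that the lines keep the "timeout=Ns" format shown above; probe_timeouts is a hypothetical helper name.

```shell
# probe_timeouts: read `oc describe deploy/...` output on stdin and print
# "<Probe> <seconds>" for each Liveness/Readiness line.
# Assumes the "timeout=<N>s" field format seen in this report.
probe_timeouts() {
  sed -n -E 's/^[[:space:]]*(Liveness|Readiness):.* timeout=([0-9]+)s.*/\1 \2/p'
}

# Usage against a cluster:
#   oc -n openshift-ingress describe deploy/router-default | probe_timeouts
```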

3. Check the YAML of deploy/router-default.
% oc -n openshift-ingress get  deploy/router-default -o yaml | grep -A8 template:
  template:
    metadata:
      annotations:
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
        unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds: "10"     <------
      creationTimestamp: null
      labels:
        ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
        ingresscontroller.operator.openshift.io/hash: 6d57464bc8
% 

% oc -n openshift-ingress get  deploy/router-default -o yaml | grep -A8 nessProbe:
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
--
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz/ready
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
%

4. Check the router pod
% oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-5944d9f999-zjsp6   1/1     Terminating   0          8m51s
router-default-5d6fd94455-9wdkn   1/1     Running       0          92m
router-default-5d6fd94455-ftcbc   1/1     Running       0          8m50s
%
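
Pods created by a Deployment are named DEPLOYMENT-HASH-SUFFIX, so the listing above contains pods from two ReplicaSets (5944d9f999 and 5d6fd94455). One possible reading, not confirmed in this report, is that the patch started a rollout that was then reverted. The sketch below groups pods by that hash segment; pod_replicasets is a hypothetical helper name.

```shell
# pod_replicasets: read `oc get pods` table output on stdin and print
# "<hash> <status>" per pod, where <hash> is the second-to-last
# dash-separated segment of the pod name. Hypothetical helper.
pod_replicasets() {
  awk 'NR > 1 && NF >= 3 { n = split($1, parts, "-"); if (n >= 3) print parts[n-1], $3 }'
}

# Usage against a cluster:
#   oc -n openshift-ingress get pods | pod_replicasets
```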

% oc -n openshift-ingress get pod router-default-5d6fd94455-ftcbc -o yaml | grep -A8 nessProbe:
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 1936
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
--
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz/ready
        port: 1936
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
%

5. Check the cluster version.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-16-163450   True        False         6h      Cluster version is 4.11.0-0.nightly-2022-04-16-163450
%

Actual results:
The timeout of the liveness probe and readiness probe is still 1s.

Expected results:
The timeout of the liveness probe and readiness probe is 5s.


Comment 1 Shudi Li 2022-04-18 10:14:41 UTC
The scc-subject-review command cannot list the security context constraints of the router pod.
1. List the router pods.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-5d6fd94455-9wdkn   1/1     Running   0          4h5m
router-default-5d6fd94455-ftcbc   1/1     Running   0          161m
%

2. Pipe a router pod to oc adm policy scc-subject-review.
% oc -n openshift-ingress get pod router-default-5d6fd94455-9wdkn | oc adm policy scc-subject-review -f -
unable to decode "STDIN": couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
%
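
The decode error above is what one would expect when the input is not a manifest: without an output flag, oc get pod prints a table (NAME, READY, ...), which has no apiVersion or kind. A likely fix, not verified in this report, is to add -o yaml so that scc-subject-review receives the pod manifest. The helper below is a hypothetical pre-check for that condition, not an oc feature.

```shell
# has_api_version_and_kind: succeed if stdin looks like a YAML manifest,
# i.e. it has top-level apiVersion and kind keys. Hypothetical helper.
has_api_version_and_kind() {
  input=$(cat)
  printf '%s\n' "$input" | grep -q '^apiVersion:' &&
    printf '%s\n' "$input" | grep -q '^kind:'
}

# Assumed usage (requires a cluster):
#   oc -n openshift-ingress get pod router-default-5d6fd94455-9wdkn -o yaml \
#     | oc adm policy scc-subject-review -f -
```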

Comment 6 Shudi Li 2022-04-25 02:54:31 UTC
Verified with 4.11.0-0.nightly-2022-04-24-135651.

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-24-135651   True        False         70m     Cluster version is 4.11.0-0.nightly-2022-04-24-135651

1. Change the timeout to 5s with the oc patch command.
% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "router" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "router" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "router" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "router" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/router-default patched
%

2. Check the timeout of the liveness probe and readiness probe in the oc describe output for deploy/router-default; it is 5s.
% oc -n openshift-ingress describe deploy/router-default | grep -e Liveness: -e Readiness:
    Liveness:   http-get http://:1936/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:1936/healthz/ready delay=0s timeout=5s period=10s #success=1 #failure=3
% 

3. Check timeoutSeconds of the liveness probe and readiness probe in the deploy/router-default YAML; it is 5.
% oc -n openshift-ingress get  deploy/router-default -o yaml | grep -A8 nessProbe:
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
--
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz/ready
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
% 

4. List the router pods.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-56d5fcbbdf-g64tn   1/1     Running   0          2m48s
router-default-56d5fcbbdf-p6n9h   1/1     Running   0          2m48s
% 

5. Check timeoutSeconds of the liveness probe and readiness probe in a router pod; it is 5.
% oc -n openshift-ingress get pod router-default-56d5fcbbdf-g64tn -o yaml | grep -A8 nessProbe:
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 1936
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
--
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz/ready
        port: 1936
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
%
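
The manual checks in steps 3 and 5 can be turned into a pass/fail test. The following is a minimal sketch, assuming the YAML layout shown above; check_timeouts is a hypothetical helper name. It inspects every timeoutSeconds line in its input, so any other probe present in the YAML must match as well.

```shell
# check_timeouts EXPECTED
# Read YAML on stdin; succeed only if at least one timeoutSeconds line is
# present and every one of them equals EXPECTED. Hypothetical helper.
check_timeouts() {
  awk -v want="$1" '
    /^ *timeoutSeconds:/ { seen++; if ($2 != want) bad++ }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage against a cluster:
#   oc -n openshift-ingress get deploy/router-default -o yaml | check_timeouts 5 && echo verified
```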

Comment 8 errata-xmlrpc 2022-08-10 11:07:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0
bug fix and security update), and where to find the updated files, follow the
link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069