Bug 1991551

Summary: Idle service cannot be woken up

Product: OpenShift Container Platform
Component: Networking (sub-component: openshift-sdn)
Version: 4.9
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: zhaozhanqi <zzhao>
Assignee: Dan Winship <danw>
QA Contact: zhaozhanqi <zzhao>
CC: aconstan
Type: Bug
Last Closed: 2021-10-18 17:45:29 UTC

Description zhaozhanqi 2021-08-09 12:05:20 UTC
Description of problem:
Idle service cannot be woken up

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:

1. Create the test RC and service

$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json
replicationcontroller/test-rc created
service/test-service created
$ oc get pod
NAME            READY   STATUS    RESTARTS   AGE
test-rc-bswdm   1/1     Running   0          4s
test-rc-p7hpg   1/1     Running   0          4s


2. Create another test pod

$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/pod-for-ping.json


3. oc idle test-service


4. Access the idled service; it does NOT work (curl hangs until interrupted)

$ oc exec -n ffcbg hello-pod -- curl 172.30.54.122:27017
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--     0^C


5. Check the endpoints (the subsets are gone; only the idling annotations remain)

$ oc get ep -n ffcbg -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      idling.alpha.openshift.io/idled-at: "2021-08-09T11:52:34Z"
      idling.alpha.openshift.io/unidle-targets: '[{"kind":"ReplicationController","name":"test-rc","replicas":2}]'
    creationTimestamp: "2021-08-09T11:52:24Z"
    labels:
      name: test-service
    name: test-service
    namespace: ffcbg
    resourceVersion: "206036"
    uid: 7e2d6e4d-5a93-4922-9825-ace2fe442100
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
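
Note the idling annotations above: idling.alpha.openshift.io/unidle-targets records what the unidler should scale back up. The annotation can be pulled out directly (a diagnostic sketch using the names from this report):

$ oc get ep test-service -n ffcbg \
    -o jsonpath='{.metadata.annotations.idling\.alpha\.openshift\.io/unidle-targets}'
[{"kind":"ReplicationController","name":"test-rc","replicas":2}]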

6. Check the service

$ oc get svc -n ffcbg -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      idling.alpha.openshift.io/idled-at: "2021-08-09T11:52:34Z"
      idling.alpha.openshift.io/unidle-targets: '[{"kind":"ReplicationController","name":"test-rc","replicas":2}]'
    creationTimestamp: "2021-08-09T11:52:24Z"
    labels:
      name: test-service
    name: test-service
    namespace: ffcbg
    resourceVersion: "206021"
    uid: 051fe7ee-23ef-451a-b32c-aaad8b687cb0
  spec:
    clusterIP: 172.30.54.122
    clusterIPs:
    - 172.30.54.122
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: http
      port: 27017
      protocol: TCP
      targetPort: 8080
    selector:
      name: test-pods
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


7. Check the events (note that no NeedPods event is ever generated)

$ oc get event -n ffcbg
LAST SEEN   TYPE     REASON             OBJECT                          MESSAGE
5m26s       Normal   Scheduled          pod/hello-pod                   Successfully assigned ffcbg/hello-pod to ip-10-0-188-18.us-east-2.compute.internal
5m24s       Normal   AddedInterface     pod/hello-pod                   Add eth0 [10.131.0.57/23] from openshift-sdn
5m24s       Normal   Pulled             pod/hello-pod                   Container image "quay.io/openshifttest/hello-sdn@sha256:d5785550cf77b7932b090fcd1a2625472912fb3189d5973f177a5a2c347a1f95" already present on machine
5m24s       Normal   Created            pod/hello-pod                   Created container hello-pod
5m24s       Normal   Started            pod/hello-pod                   Started container hello-pod
5m41s       Normal   Scheduled          pod/test-rc-7vj5w               Successfully assigned ffcbg/test-rc-7vj5w to ip-10-0-215-184.us-east-2.compute.internal
5m39s       Normal   AddedInterface     pod/test-rc-7vj5w               Add eth0 [10.129.2.146/23] from openshift-sdn
5m39s       Normal   Pulled             pod/test-rc-7vj5w               Container image "quay.io/openshifttest/hello-sdn@sha256:d5785550cf77b7932b090fcd1a2625472912fb3189d5973f177a5a2c347a1f95" already present on machine
5m39s       Normal   Created            pod/test-rc-7vj5w               Created container test-pod
5m39s       Normal   Started            pod/test-rc-7vj5w               Started container test-pod
5m29s       Normal   Killing            pod/test-rc-7vj5w               Stopping container test-pod
5m41s       Normal   Scheduled          pod/test-rc-jqbl5               Successfully assigned ffcbg/test-rc-jqbl5 to ip-10-0-188-18.us-east-2.compute.internal
5m39s       Normal   AddedInterface     pod/test-rc-jqbl5               Add eth0 [10.131.0.56/23] from openshift-sdn
5m39s       Normal   Pulled             pod/test-rc-jqbl5               Container image "quay.io/openshifttest/hello-sdn@sha256:d5785550cf77b7932b090fcd1a2625472912fb3189d5973f177a5a2c347a1f95" already present on machine
5m39s       Normal   Created            pod/test-rc-jqbl5               Created container test-pod
5m39s       Normal   Started            pod/test-rc-jqbl5               Started container test-pod
5m29s       Normal   Killing            pod/test-rc-jqbl5               Stopping container test-pod
5m41s       Normal   SuccessfulCreate   replicationcontroller/test-rc   Created pod: test-rc-7vj5w
5m41s       Normal   SuccessfulCreate   replicationcontroller/test-rc   Created pod: test-rc-jqbl5
5m29s       Normal   SuccessfulDelete   replicationcontroller/test-rc   Deleted pod: test-rc-7vj5w
5m29s       Normal   SuccessfulDelete   replicationcontroller/test-rc   Deleted pod: test-rc-jqbl5

8. Check the iptables rules on the node
sh-4.4# iptables-save | grep ffcbg
-A KUBE-PORTALS-CONTAINER -d 172.30.54.122/32 -p tcp -m comment --comment "ffcbg/test-service:http" -m tcp --dport 27017 -j REDIRECT --to-ports 35041
-A KUBE-PORTALS-HOST -d 172.30.54.122/32 -p tcp -m comment --comment "ffcbg/test-service:http" -m tcp --dport 27017 -j DNAT --to-destination 10.0.128.147:35041
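
The redirect target here (local port 35041) is the unidling proxy's listener: connections to the idled ClusterIP are redirected to it, and kube-proxy is then supposed to emit a NeedPods event so the unidler scales the unidle-targets back up. Two quick checks (a diagnostic sketch; the port and namespace are taken from this report): on the node, confirm something is listening on the redirect port, and watch for the NeedPods event that should trigger the wakeup (it never appears while this bug is present).

sh-4.4# ss -tlnp | grep 35041
$ oc get events -n ffcbg --field-selector reason=NeedPods -w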



Actual results:

Accessing the idled service does not wake up the pods; the connection hangs and no NeedPods event is ever generated (see steps 4 and 7).

Expected results:

Accessing the service wakes up the idled pods.

Additional info:

Comment 1 Alexander Constantinescu 2021-08-09 15:25:56 UTC
Assigning to Dan, as he's been working on some idling bugs for openshift-sdn.

Comment 2 Dan Winship 2021-08-09 19:15:00 UTC
LOL so how did this pass CI?

E0809 19:07:45.030051    1951 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-service.1699b8ec3bca9e5a", GenerateName:"", Namespace:"test", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, EventTime:v1.MicroTime{Time:time.Time{wall:0xc03c7d20418db83b, ext:1334903747131, loc:(*time.Location)(0x30c9ba0)}}, Series:(*v1.EventSeries)(nil), ReportingController:"kube-proxy", ReportingInstance:"kube-proxy-ip-10-0-191-238.us-west-1.compute.internal", Action:"The service-port %s:%s needs pods.", Reason:"NeedPods", Regarding:v1.ObjectReference{Kind:"Service", Namespace:"test", Name:"test-service", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Related:(*v1.ObjectReference)(nil), Note:"test-service%!(EXTRA string=http)", Type:"Normal", DeprecatedSource:v1.EventSource{Component:"", Host:""}, DeprecatedFirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeprecatedLastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeprecatedCount:0}': 'events.events.k8s.io is forbidden: User "system:serviceaccount:openshift-sdn:sdn" cannot create resource "events" in API group "events.k8s.io" in the namespace "test"' (will not retry!)


In particular, note 'cannot create resource "events" in API group "events.k8s.io"'.

In Kubernetes 1.22, kube-proxy moved from the old core/v1 Event API to the new events.k8s.io/v1 Event API, so I guess we need to update our RBAC rules...
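
For reference, the permission implied by that error would look roughly like the following (a sketch only; the real fix belongs in the SDN operator's RBAC manifests, and the object names here are hypothetical):

$ oc apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sdn-v1-events        # hypothetical name, for illustration
rules:
- apiGroups: ["events.k8s.io"]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sdn-v1-events        # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sdn-v1-events
subjects:
- kind: ServiceAccount
  name: sdn                  # from the error: system:serviceaccount:openshift-sdn:sdn
  namespace: openshift-sdn
EOF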

Comment 4 zhaozhanqi 2021-08-20 08:42:06 UTC
Verified this bug on 4.9.0-0.nightly-2021-08-19-184748

We hit an issue when idling a service in 4.9:

$ oc idle test-service --kubeconfig=/home/zzhao/workdir/dhcp-140-240-zzhao/ocp4_testuser-21.kubeconfig
      ReplicationController "v0te6/test-rc" has been idled 
      
      STDERR:
      error: unable to mark service "v0te6/test-service" as idled: endpoints "test-service" is forbidden: User "testuser-21" cannot patch resource "endpoints" in API group "" in the namespace "v0te6"

Another bug is tracking this: https://bugzilla.redhat.com/show_bug.cgi?id=1995505
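
The permission gap itself is easy to confirm (a quick sketch using the user and namespace from this report; oc auth can-i prints "no" while the RBAC gap exists):

$ oc auth can-i patch endpoints -n v0te6 --as=testuser-21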



$ oc get pod -n v0te6
No resources found in v0te6 namespace.
$ oc get svc -n v0te6
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
test-service   ClusterIP   172.30.241.66   <none>        27017/TCP   4m40s

$ oc rsh -n openshift-multus multus-admission-controller-7kh4c
Defaulted container "multus-admission-controller" out of: multus-admission-controller, kube-rbac-proxy
sh-4.4# curl 172.30.241.66:27017
Hello OpenShift!

$ oc get pod -n v0te6
NAME            READY   STATUS    RESTARTS   AGE
test-rc-rgvn2   1/1     Running   0          16s
test-rc-zbwr5   1/1     Running   0          16s

Moving this bug to verified.

Comment 7 errata-xmlrpc 2021-10-18 17:45:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759