Bug 1879887

Summary: GCP: NodeClockNotSynchronising alert is triggered in restricted networks
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Machine Config Operator
Sub Component: Machine Config Operator
Assignee: MCO Team <team-mco>
QA Contact: Rio Liu <rioliu>
Status: CLOSED DEFERRED
Severity: low
Priority: low
CC: amurdaca, aos-bugs, bbreard, imcleod, jligon, miabbott, mkrejci, nstielau, walters, zhsun
Version: 4.6
Keywords: Reopened
Last Closed: 2021-10-12 12:18:26 UTC
Type: Bug
Attachments: issue is reproduced on GCP(4.4.29 -> 4.5.18 -> 4.6.1)

Description Junqi Zhao 2020-09-17 09:45:26 UTC
Description of problem:
This is the same bug as bug 1834252, but it happens on GCP.

On an IPI_GCP_Restricted_HTTP_Proxy cluster, the upgrade from 4.4.0-0.nightly-2020-09-14-143910 to 4.5.0-0.nightly-2020-09-14-124053 succeeds and no NodeClockNotSynchronising alerts fire. After continuing the upgrade to 4.6.0-0.nightly-2020-09-16-050052, NodeClockNotSynchronising alerts fire, but only for the master nodes:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname,instance}'
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "juzhao-gcp-gm75k-master-2.c.openshift-qe.internal"
}
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "juzhao-gcp-gm75k-master-1.c.openshift-qe.internal"
}

{
  "alertname": "NodeClockNotSynchronising",
  "instance": "juzhao-gcp-gm75k-master-0.c.openshift-qe.internal"
}

# oc get node --show-labels | grep master
juzhao-gcp-gm75k-master-0.c.openshift-qe.internal         Ready    master   8h      v1.19.0+35ab7c5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-gcp-gm75k-master-0.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=n1-standard-4,node.openshift.io/os_id=rhcos,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-a
juzhao-gcp-gm75k-master-1.c.openshift-qe.internal         Ready    master   8h      v1.19.0+35ab7c5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-gcp-gm75k-master-1.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=n1-standard-4,node.openshift.io/os_id=rhcos,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-b
juzhao-gcp-gm75k-master-2.c.openshift-qe.internal         Ready    master   8h      v1.19.0+35ab7c5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=n1-standard-4,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=juzhao-gcp-gm75k-master-2.c.openshift-qe.internal,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=n1-standard-4,node.openshift.io/os_id=rhcos,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c

# oc get node -o wide | grep master
juzhao-gcp-gm75k-master-0.c.openshift-qe.internal         Ready    master   8h      v1.19.0+35ab7c5   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 46.82.202009151940-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64   cri-o://1.19.0-17.rhaos4.6.git99c925b.el8
juzhao-gcp-gm75k-master-1.c.openshift-qe.internal         Ready    master   8h      v1.19.0+35ab7c5   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 46.82.202009151940-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64   cri-o://1.19.0-17.rhaos4.6.git99c925b.el8
juzhao-gcp-gm75k-master-2.c.openshift-qe.internal         Ready    master   8h      v1.19.0+35ab7c5   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 46.82.202009151940-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64   cri-o://1.19.0-17.rhaos4.6.git99c925b.el8


# oc get proxy/cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2020-09-17T00:20:43Z"
  generation: 1
  name: cluster
  resourceVersion: "443"
  selfLink: /apis/config.openshift.io/v1/proxies/cluster
  uid: 37d9a54f-e3dd-48d2-8f5e-0dcf6a239c85
spec:
  httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
  httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""
status:
  httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
  httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
  noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.juzhao-gcp.qe.gcp.devcluster.openshift.com,etcd-0.juzhao-gcp.qe.gcp.devcluster.openshift.com,etcd-1.juzhao-gcp.qe.gcp.devcluster.openshift.com,etcd-2.juzhao-gcp.qe.gcp.devcluster.openshift.com,localhost,metadata,metadata.google.internal,metadata.google.internal.,test.no-proxy.com


# oc debug node/juzhao-gcp-gm75k-master-0.c.openshift-qe.internal
Creating debug namespace/openshift-debug-node-psgpt ...
Starting pod/juzhao-gcp-gm75k-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /proc/sys/crypto/fips_enabled
1
sh-4.4# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-09-17 07:41:24 UTC; 48min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 1399 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 1386 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1390 (chronyd)
    Tasks: 1 (limit: 95632)
   Memory: 1.9M
      CPU: 33ms
   CGroup: /system.slice/chronyd.service
           └─1390 /usr/sbin/chronyd

Sep 17 07:41:23 localhost systemd[1]: Starting NTP client/server...
Sep 17 07:41:24 localhost chronyd[1390]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Sep 17 07:41:24 localhost chronyd[1390]: Frequency -65.363 +/- 0.058 ppm read from /var/lib/chrony/drift
Sep 17 07:41:24 localhost chronyd[1390]: Using right/UTC timezone to obtain leap second data
Sep 17 07:41:24 localhost systemd[1]: Started NTP client/server.
sh-4.4# chronyc -a 'burst 4/4'
501 Not authorised
sh-4.4# timedatectl
Failed to query server: Connection timed out


Version-Release number of selected component (if applicable):
IPI_GCP_Restricted_HTTP_Proxy cluster, upgraded from 4.4.0-0.nightly-2020-09-14-143910 to 4.5.0-0.nightly-2020-09-14-124053 and then to 4.6.0-0.nightly-2020-09-16-050052

How reproducible:
not sure

Steps to Reproduce:
1. On an IPI_GCP_Restricted_HTTP_Proxy cluster, upgrade from 4.4.0-0.nightly-2020-09-14-143910 to 4.5.0-0.nightly-2020-09-14-124053 and then to 4.6.0-0.nightly-2020-09-16-050052
2.
3.

Actual results:
NodeClockNotSynchronising alerts for the master nodes

Expected results:
no such alerts

Additional info:

Comment 2 Colin Walters 2020-09-17 13:38:49 UTC
Can you paste the output of `cat /run/coreos-platform-chrony.conf` and `journalctl --grep=chrony`?

Comment 3 Colin Walters 2020-09-17 13:44:25 UTC
By the way, to be able to debug successfully with `oc debug node/` at the moment, it helps to run `setenforce 0` on the host.
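For example, something along these lines (a sketch using one of the master nodes from the session above):
```
# oc debug node/juzhao-gcp-gm75k-master-0.c.openshift-qe.internal
sh-4.4# chroot /host
sh-4.4# setenforce 0   # put SELinux into permissive mode on the host (reverts to enforcing on reboot)
sh-4.4# getenforce     # should now report "Permissive"
```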

Comment 4 Micah Abbott 2020-09-17 19:48:23 UTC
This is unlikely to be addressed until 4.7

Comment 5 Micah Abbott 2020-10-05 13:10:42 UTC
We are working on higher priority items in the 4.6 release; marking for UpcomingSprint

Comment 6 Micah Abbott 2020-10-25 18:41:15 UTC
We are working on higher priority items in the 4.6 release; marking for UpcomingSprint

Comment 7 Colin Walters 2020-11-02 19:31:42 UTC
I'm not reproducing this on a freshly provisioned 4.6 cluster; trying an upgraded one now.

Are you still reproducing this?  If so can you attach the requested logs from https://bugzilla.redhat.com/show_bug.cgi?id=1879887#c2 ?

Comment 8 Junqi Zhao 2020-11-03 07:23:53 UTC
On a GCP IPI_HTTP_PROXY cluster, the upgrade from 4.4.29 to 4.5.16 produces no NodeClockNotSynchronising alerts; after upgrading from 4.5.16 to 4.6.1, there are NodeClockNotSynchronising alerts, but only for the master nodes.
# oc get node -o wide | grep master
juzhao-upg-gbn2l-master-0.c.openshift-qe.internal         Ready    master   4h47m   v1.19.0+d59ce34   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 46.82.202010091720-0 (Ootpa)   4.18.0-193.24.1.el8_2.dt1.x86_64   cri-o://1.19.0-20.rhaos4.6.git97d715e.el8
juzhao-upg-gbn2l-master-1.c.openshift-qe.internal         Ready    master   4h47m   v1.19.0+d59ce34   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 46.82.202010091720-0 (Ootpa)   4.18.0-193.24.1.el8_2.dt1.x86_64   cri-o://1.19.0-20.rhaos4.6.git97d715e.el8
juzhao-upg-gbn2l-master-2.c.openshift-qe.internal         Ready    master   4h47m   v1.19.0+d59ce34   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 46.82.202010091720-0 (Ootpa)   4.18.0-193.24.1.el8_2.dt1.x86_64   cri-o://1.19.0-20.rhaos4.6.git97d715e.el8

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.1     True        False         57m     Cluster version is 4.6.1


# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`;oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname,instance}'
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "juzhao-upg-gbn2l-master-1.c.openshift-qe.internal"
}
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "juzhao-upg-gbn2l-master-2.c.openshift-qe.internal"
}
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "juzhao-upg-gbn2l-master-0.c.openshift-qe.internal"
}

# oc debug node/juzhao-upg-gbn2l-master-0.c.openshift-qe.internal
Creating debug namespace/openshift-debug-node-xszpj ...
Starting pod/juzhao-upg-gbn2l-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /run/coreos-platform-chrony.conf
cat: /run/coreos-platform-chrony.conf: No such file or directory
sh-4.4# cat /etc/chrony.conf
pool 10.0.0.2 iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
keyfile /etc/chrony.keys
leapsectz right/UTC
logdir /var/log/chrony
sh-4.4# journalctl --grep=chrony
-- Logs begin at Tue 2020-11-03 01:08:55 UTC, end at Tue 2020-11-03 05:54:05 UTC. --
Nov 03 01:10:48 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal chronyd[1604]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 03 01:11:58 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal chronyd[1604]: chronyd exiting
Nov 03 01:11:58 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal systemd[1]: chronyd.service: Consumed 45ms CPU time
-- Reboot --
Nov 03 01:12:17 localhost chronyd[1241]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 03 01:12:17 localhost chronyd[1241]: Frequency -62.626 +/- 27.671 ppm read from /var/lib/chrony/drift
Nov 03 01:12:19 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1474]: I1103 01:12:19.818716    1474 factory.go:173] Using factory "raw" for container "/system.slice/chronyd.service"
Nov 03 01:12:19 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1474]: I1103 01:12:19.819254    1474 manager.go:950] Added container: "/system.slice/chronyd.service" (aliases: [], namespace: "")
Nov 03 01:12:19 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1474]: I1103 01:12:19.820185    1474 container.go:467] Start housekeeping for container "/system.slice/chronyd.service"
Nov 03 01:56:51 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal chronyd[1241]: chronyd exiting
Nov 03 01:56:51 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal systemd[1]: chronyd.service: Consumed 95ms CPU time
-- Reboot --
Nov 03 01:58:39 localhost chronyd[1233]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 03 01:58:39 localhost chronyd[1233]: Frequency -67.800 +/- 0.012 ppm read from /var/lib/chrony/drift
Nov 03 01:58:42 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1481]: I1103 01:58:42.023947    1481 factory.go:173] Using factory "raw" for container "/system.slice/chronyd.service"
Nov 03 01:58:42 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1481]: I1103 01:58:42.024457    1481 manager.go:950] Added container: "/system.slice/chronyd.service" (aliases: [], namespace: "")
Nov 03 01:58:42 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1481]: I1103 01:58:42.026477    1481 container.go:467] Start housekeeping for container "/system.slice/chronyd.service"
Nov 03 03:05:10 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal chronyd[1233]: chronyd exiting
Nov 03 03:05:11 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal systemd[1]: chronyd.service: Consumed 68ms CPU time
-- Reboot --
Nov 03 03:06:56 localhost coreos-platform-chrony: /etc/chrony.conf is modified; not changing the default
Nov 03 03:06:58 localhost chronyd[1269]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 03 03:06:58 localhost chronyd[1269]: Frequency -67.803 +/- 0.009 ppm read from /var/lib/chrony/drift
Nov 03 03:07:03 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1528]: I1103 03:07:03.773258    1528 factory.go:166] Error trying to work out if we can handle /system.slice/chronyd.service: /system.s>
Nov 03 03:07:03 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1528]: I1103 03:07:03.773264    1528 factory.go:177] Factory "systemd" was unable to handle container "/system.slice/chronyd.service"
Nov 03 03:07:03 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1528]: I1103 03:07:03.773271    1528 factory.go:177] Factory "crio" was unable to handle container "/system.slice/chronyd.service"
Nov 03 03:07:03 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1528]: I1103 03:07:03.773280    1528 factory.go:173] Using factory "raw" for container "/system.slice/chronyd.service"
Nov 03 03:07:03 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1528]: I1103 03:07:03.773524    1528 manager.go:950] Added container: "/system.slice/chronyd.service" (aliases: [], namespace: "")
Nov 03 03:07:03 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1528]: I1103 03:07:03.773756    1528 handler.go:325] Added event &{/system.slice/chronyd.service 2020-11-03 03:07:03.77247023 +0000 UTC>
Nov 03 03:07:03 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1528]: I1103 03:07:03.773807    1528 container.go:467] Start housekeeping for container "/system.slice/chronyd.service"
Nov 03 04:39:06 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal chronyd[1269]: chronyd exiting
Nov 03 04:39:06 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal systemd[1]: chronyd.service: Consumed 64ms CPU time
-- Reboot --
Nov 03 04:39:51 localhost coreos-platform-chrony: /etc/chrony.conf is modified; not changing the default
Nov 03 04:39:53 localhost chronyd[1259]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 03 04:39:53 localhost chronyd[1259]: Frequency -67.800 +/- 0.006 ppm read from /var/lib/chrony/drift
Nov 03 04:39:57 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1522]: I1103 04:39:57.928901    1522 factory.go:166] Error trying to work out if we can handle /system.slice/chronyd.service: /system.s>
Nov 03 04:39:57 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1522]: I1103 04:39:57.928909    1522 factory.go:177] Factory "systemd" was unable to handle container "/system.slice/chronyd.service"
Nov 03 04:39:57 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1522]: I1103 04:39:57.928918    1522 factory.go:177] Factory "crio" was unable to handle container "/system.slice/chronyd.service"
Nov 03 04:39:57 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1522]: I1103 04:39:57.928928    1522 factory.go:173] Using factory "raw" for container "/system.slice/chronyd.service"
Nov 03 04:39:57 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1522]: I1103 04:39:57.929411    1522 manager.go:950] Added container: "/system.slice/chronyd.service" (aliases: [], namespace: "")
Nov 03 04:39:57 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1522]: I1103 04:39:57.929687    1522 handler.go:325] Added event &{/system.slice/chronyd.service 2020-11-03 04:39:57.92771631 +0000 UTC>
Nov 03 04:39:57 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1522]: I1103 04:39:57.930837    1522 container.go:467] Start housekeeping for container "/system.slice/chronyd.service"
Nov 03 04:51:43 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal chronyd[1259]: chronyd exiting
Nov 03 04:51:43 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal systemd[1]: chronyd.service: Consumed 63ms CPU time
-- Reboot --
Nov 03 04:52:28 localhost coreos-platform-chrony: /etc/chrony.conf is modified; not changing the default
Nov 03 04:52:30 localhost chronyd[1139]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 03 04:52:30 localhost chronyd[1139]: Frequency -67.798 +/- 0.055 ppm read from /var/lib/chrony/drift
Nov 03 04:52:41 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1568]: I1103 04:52:41.573240    1568 factory.go:216] Factory "crio" was unable to handle container "/system.slice/chronyd.service"
Nov 03 04:52:41 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1568]: I1103 04:52:41.573251    1568 factory.go:216] Factory "systemd" was unable to handle container "/system.slice/chronyd.service"
Nov 03 04:52:41 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1568]: I1103 04:52:41.573261    1568 factory.go:212] Using factory "raw" for container "/system.slice/chronyd.service"
Nov 03 04:52:41 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1568]: I1103 04:52:41.579064    1568 manager.go:987] Added container: "/system.slice/chronyd.service" (aliases: [], namespace: "")
Nov 03 04:52:41 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1568]: I1103 04:52:41.579567    1568 handler.go:325] Added event &{/system.slice/chronyd.service 2020-11-03 04:52:41.571726235 +0000 UT>
Nov 03 04:52:41 juzhao-upg-gbn2l-master-0.c.openshift-qe.internal hyperkube[1568]: I1103 04:52:41.588708    1568 container.go:490] Start housekeeping for container "/system.slice/chronyd.service"

Comment 10 Colin Walters 2020-11-04 20:21:24 UTC
Here's the problem:
```
Nov 03 04:52:28 localhost coreos-platform-chrony: /etc/chrony.conf is modified; not changing the default
```

Which seems to be caused by this:
```
$ oc get -o yaml machineconfig/99-master-chrony-configuration
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2020-11-03T01:48:13Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-chrony-configuration
  resourceVersion: "24725"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigs/99-master-chrony-configuration
  uid: 28ac1573-efc2-42a1-9d17-2ae04e0f8571
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,cG9vbCAxMC4wLjAuMiBpYnVyc3QKZHJpZnRmaWxlIC92YXIvbGliL2Nocm9ueS9kcmlmdAptYWtlc3RlcCAxLjAgMwpydGNzeW5jCmtK
          verification: {}
        filesystem: root
        mode: 420
        path: /etc/chrony.conf
  osImageURL: ""
```

decoded contents:
```
pool 10.0.0.2 iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
keyfile /etc/chrony.keys
leapsectz right/UTC
logdir /var/log/chrony
```

Is this CI job/test procedure injecting this machineconfig?  It should stop doing so on GCP as this is now handled by default.

Custom chrony configuration is still necessary for restricted networks on other platforms, but at least AWS/GCP/Azure are covered.
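For reference, a minimal sketch of how to check for and drop such an injected object, assuming it is named 99-master-chrony-configuration as in the output above (deleting a MachineConfig triggers a rolling reboot of the affected pool):
```
# list any chrony-related MachineConfigs injected at install time
oc get machineconfig | grep -i chrony

# inspect the object before removing anything
oc get machineconfig 99-master-chrony-configuration -o yaml

# where the platform default /run/coreos-platform-chrony.conf is sufficient (e.g. GCP on 4.5+),
# delete the custom object so nodes fall back to the default
oc delete machineconfig 99-master-chrony-configuration
```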

Comment 11 Junqi Zhao 2020-11-05 01:08:06 UTC
(In reply to Colin Walters from comment #10)
> Is this CI job/test procedure injecting this machineconfig?  It should stop
> doing so on GCP as this is now handled by default.

Sorry, I am not familiar with machineconfig.
@zhsun, can you help to answer the question?

Comment 12 Junqi Zhao 2020-11-05 01:24:04 UTC
Antonio Murdaca,
Can you help check the question in Comment 10? Thanks.

Comment 13 Colin Walters 2020-11-05 19:06:33 UTC
I can't be sure but what I believe is happening is something like:

- In the 4.3 timeframe or so, QE hits NodeClockNotSynchronising and finds the docs to set up chrony (e.g. https://github.com/openshift/machine-config-operator/#applying-configuration-changes-to-the-cluster )
- That change gets committed to something like a git repository of tests
- Later RHCOS adds https://github.com/coreos/fedora-coreos-tracker/blob/master/internals/README-internals.md#time-synchronization (around 4.5)
- The custom NTP server set up by QE (10.0.0.2) goes down 
- Now the tests are back to NodeClockNotSynchronising

Basically, in whatever system you're using to launch these clusters, there is a file with that custom machineconfig, which you should simply stop injecting, at least for all versions >= 4.5.
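One way to confirm which NTP server the injected file points at (a sketch, assuming /etc/chrony.conf is the only entry in .spec.config.storage.files, as in the YAML pasted in comment 10):
```
oc get machineconfig 99-master-chrony-configuration \
  -o jsonpath='{.spec.config.storage.files[0].contents.source}' \
  | sed 's|^data:text/plain;charset=utf-8;base64,||' \
  | base64 -d
# expected to print a chrony.conf containing "pool 10.0.0.2 iburst" if the QE-injected config is still in place
```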

Comment 14 Colin Walters 2020-11-09 19:24:13 UTC
This is not a bug in the product per above; an incorrect MachineConfig object is being provided by this QE setup at install time.
Closing, but feel free to reopen if you believe my analysis is incorrect.

Comment 15 Junqi Zhao 2020-11-10 02:11:25 UTC
Reopening it, since we did not inject a MachineConfig object during the upgrade; we just upgraded OCP, for example:
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-16-050052 --allow-explicit-upgrade=true --force

If this is a machineconfig issue, I think we should reassign it to that team.

Comment 16 Colin Walters 2020-11-10 18:59:47 UTC
Comment #1 says "IPI_GCP_Restricted_HTTP_Proxy cluster".

This implies to me that "IPI_GCP_Restricted_HTTP_Proxy" is some sort of explicit test case (Jenkins job, shell script, or something) that is being run.
I believe the test case framework contains a custom MachineConfig object as noted here:
https://bugzilla.redhat.com/show_bug.cgi?id=1879887#c10

Where is the source code or definition of "IPI_GCP_Restricted_HTTP_Proxy" ?

Comment 17 Junqi Zhao 2020-11-11 01:58:26 UTC
(In reply to Colin Walters from comment #16)
> Comment #1 says "IPI_GCP_Restricted_HTTP_Proxy cluster".
> 
> This implies to me that "IPI_GCP_Restricted_HTTP_Proxy" is some sort of
> explicit test case (Jenkins job, shell script, or something) that is being
> run.
> I believe the test case framework contains a custom MachineConfig object as
> noted here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1879887#c10
> 
> Where is the source code or definition of "IPI_GCP_Restricted_HTTP_Proxy" ?

It refers to the cluster configuration: the cluster is a disconnected cluster with an HTTP proxy on GCP, and the installation mode is IPI.

Comment 18 Junqi Zhao 2020-11-12 01:37:04 UTC
Created attachment 1728583 [details]
issue is reproduced on GCP(4.4.29 -> 4.5.18 -> 4.6.1)

Comment 21 Junqi Zhao 2020-11-17 07:03:48 UTC
Upgraded from 4.4.29 -> 4.5.19 -> 4.6.4 and did not hit the issue, although we still saw the "localhost coreos-platform-chrony: /etc/chrony.conf is modified; not changing the default" message:
# oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.4     True        False         29m     Cluster version is 4.6.4

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`;oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq '.data[].labels | {alertname,instance}'
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "instance": null
}
{
  "alertname": "Watchdog",
  "instance": null
}

# oc debug node/juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal
sh-4.4# journalctl --grep=chrony
-- Logs begin at Tue 2020-11-17 02:16:11 UTC, end at Tue 2020-11-17 06:57:28 UTC. --
Nov 17 02:17:49 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal ignition[1075]: INFO     : files: createFilesystemsFiles: createFiles: op(23): [started]  writing file "/etc/chrony.conf"
Nov 17 02:17:49 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal ignition[1075]: INFO     : files: createFilesystemsFiles: createFiles: op(23): [finished] writing file "/etc/chrony.conf"
Nov 17 02:17:52 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal ignition[1075]: INFO     : files: createFilesystemsFiles: createFiles: op(23): [started]  writing file "/etc/chrony.conf"
Nov 17 02:17:52 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal ignition[1075]: INFO     : files: createFilesystemsFiles: createFiles: op(23): [finished] writing file "/etc/chrony.conf"
Nov 17 02:17:54 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal restorecon[1460]: Relabeled /etc/chrony.conf from system_u:object_r:unlabeled_t:s0 to system_u:object_r:etc_t:s0
Nov 17 02:17:54 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal chronyd[1571]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 17 02:18:40 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal chronyd[1571]: chronyd exiting
Nov 17 02:18:40 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal systemd[1]: chronyd.service: Consumed 33ms CPU time
-- Reboot --
Nov 17 02:18:59 localhost chronyd[1240]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 17 02:18:59 localhost chronyd[1240]: Frequency -85.906 +/- 72.276 ppm read from /var/lib/chrony/drift
Nov 17 02:19:01 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal hyperkube[1474]: I1117 02:19:01.466946    1474 factory.go:173] Using factory "raw" for container "/system.slice/chronyd.service"
Nov 17 02:19:01 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal hyperkube[1474]: I1117 02:19:01.467240    1474 manager.go:950] Added container: "/system.slice/chronyd.service" (aliases: [], namespace: "")
Nov 17 02:19:01 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal hyperkube[1474]: I1117 02:19:01.468035    1474 container.go:467] Start housekeeping for container "/system.slice/chronyd.service"
Nov 17 04:04:04 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal chronyd[1240]: chronyd exiting
Nov 17 04:04:04 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal systemd[1]: chronyd.service: Consumed 90ms CPU time
-- Reboot --
Nov 17 04:05:50 localhost coreos-platform-chrony: /etc/chrony.conf is modified; not changing the default
Nov 17 04:05:51 localhost chronyd[1271]: chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 +DEBUG)
Nov 17 04:05:51 localhost chronyd[1271]: Frequency -93.655 +/- 0.005 ppm read from /var/lib/chrony/drift
Nov 17 04:05:56 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal hyperkube[1537]: I1117 04:05:56.689194    1537 factory.go:177] Factory "crio" was unable to handle container "/system.slice/chronyd.service"
Nov 17 04:05:56 juzhao-gcp-bpn5s-master-0.c.openshift-qe.internal hyperkube[1537]: I1117 04:05:56.689207    1537 factory.go:166] Error trying to work out if we can handle /system.slice/chronyd.service: /system.s>

Comment 23 Colin Walters 2020-12-01 23:15:01 UTC
OK technically I think the bug here is in the MCO; reassigning.

My current triage here is that we might have a bug when updating from 4.4 → 4.5 (possibly also on 4.5 → 4.6) around deleting files that were included in machineconfig.
More in https://bugzilla.redhat.com/show_bug.cgi?id=1767669

Comment 25 Colin Walters 2021-10-12 12:18:26 UTC
There might still be a bug here, but we haven't had any customer reports and new installs shouldn't be subject to this.
Tentatively closing.