Bug 1942145 - AWS: instance super slow to initialize
Summary: AWS: instance super slow to initialize
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.8
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: ---
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-03-23 18:24 UTC by Michael Gugino
Modified: 2023-09-15 01:03 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-13 20:56:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Michael Gugino 2021-03-23 18:24:31 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1940551#c3

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1373949894003265536

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1373949894003265536/artifacts/e2e-aws-serial/nodes/ip-10-0-238-198.us-west-1.compute.internal/journal

machine in question: ci-op-h1zqznhj-ef0bf-jglv4-worker-us-west-1b-9w9jb

node in question: ip-10-0-238-198.us-west-1.compute.internal


Instance was created via machine API around 11:58, as expected.  Node eventually joins the cluster, but after a particular e2e test times out waiting for it.  CSR approval happened correctly.  It seems the instance booted fine, but there are 2 11.5 minute pauses in the journal:

Mar 22 12:02:29.007513 ip-10-0-238-198 rpm-ostree[1979]: In idle state; will auto-exit in 60 seconds
Mar 22 12:02:29.018230 ip-10-0-238-198 systemd[1]: rpm-ostreed.service: Succeeded.
Mar 22 12:02:29.019118 ip-10-0-238-198 systemd[1]: rpm-ostreed.service: Consumed 46ms CPU time
Mar 22 12:14:08.241182 ip-10-0-238-198 systemd[1]: Starting Cleanup of Temporary Directories...
Mar 22 12:14:08.253721 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/pkg-dbus-daemon.conf:1] Duplicate line for path "/var/lib/dbus", ignoring.
Mar 22 12:14:08.259151 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/rpm-ostree-1-autovar.conf:1] Duplicate line for path "/var/lib/alternatives
", ignoring.
Mar 22 12:14:08.260616 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/tmp.conf:12] Duplicate line for path "/var/tmp", ignoring.
Mar 22 12:14:08.260740 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Mar 22 12:14:08.261187 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:19] Duplicate line for path "/var/cache", ignoring.
Mar 22 12:14:08.261202 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:21] Duplicate line for path "/var/lib", ignoring.
Mar 22 12:14:08.261210 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:23] Duplicate line for path "/var/spool", ignoring.
Mar 22 12:14:08.268883 ip-10-0-238-198 systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Mar 22 12:14:08.269504 ip-10-0-238-198 systemd[1]: Started Cleanup of Temporary Directories.
Mar 22 12:14:08.269788 ip-10-0-238-198 systemd[1]: systemd-tmpfiles-clean.service: Consumed 21ms CPU time
Mar 22 12:23:30.565724 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:23:30.565471    1969 update.go:1729] Updating OS to quay.io/openshift-release-dev/
ocp-v4.0-art-dev@sha256:8db65a64505bb77e208f882270ff6c9778c73bfc13812602a0f8a01dc27b18e4


In comparison, machine from the same machineset (machineset which was created by the installer and unmodified) boots and joins the cluster in little time (just under 4 minutes):

machine: ci-op-h1zqznhj-ef0bf-jglv4-worker-us-west-1b-wgwmm
"creationTimestamp": "2021-03-22T11:06:39Z"

node: ip-10-0-213-0.us-west-1.compute.internal
"creationTimestamp": "2021-03-22T11:10:19Z"

Comment 1 Micah Abbott 2021-03-23 18:57:57 UTC
I wonder if the delays are related to extracting + updating to the `machine-os-content`?

The large delays seem to happen right around when the MCD is doing the `oc image extract` and then afterwards when it is updating.

@Colin could you have a look at this?


```
Mar 22 12:01:27.606761 ip-10-0-238-198 systemd[1]: Started rpm-ostree System Management Daemon.                                                                                                                                                                      
Mar 22 12:01:27.620913 ip-10-0-238-198 rpm-ostree[1979]: client(id:cli dbus:1.32 unit:machine-config-daemon-firstboot.service uid:0) added; new total=1                                                                                                              
Mar 22 12:01:27.626357 ip-10-0-238-198 rpm-ostree[1979]: client(id:cli dbus:1.32 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0                                                                                                           
Mar 22 12:01:27.626381 ip-10-0-238-198 rpm-ostree[1979]: In idle state; will auto-exit in 61 seconds                                                                                                                                                                 
Mar 22 12:01:27.638281 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:01:27.638246    1969 daemon.go:219] Booted osImageURL:  (48.83.202103122318-0)                                                                                                          
Mar 22 12:01:27.640008 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:01:27.639979    1969 update.go:543] Checking Reconcilable for config mco-empty-mc to rendered-worker-0db43007924f36a26c672f8c71bbb458                                                   
Mar 22 12:01:27.641445 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:01:27.641410    1969 update.go:1851] Starting update from mco-empty-mc to rendered-worker-0db43007924f36a26c672f8c71bbb458: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Mar 22 12:01:27.644953 ip-10-0-238-198 root[1993]: machine-config-daemon[1969]: Starting update from mco-empty-mc to rendered-worker-0db43007924f36a26c672f8c71bbb458: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Mar 22 12:01:27.645577 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:01:27.645541    1969 update.go:1166] Updating files                                                                                                                                     
Mar 22 12:01:27.645941 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:01:27.645910    1969 update.go:1239] Deleting stale data                                                                                                                                
Mar 22 12:01:27.648345 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:01:27.648302    1969 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-079026093 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8db6
5a64505bb77e208f882270ff6c9778c73bfc13812602a0f8a01dc27b18e4                                                                                                                                                                                                                                                                  
Mar 22 12:02:29.007513 ip-10-0-238-198 rpm-ostree[1979]: In idle state; will auto-exit in 60 seconds                                                                                                                                                                 
Mar 22 12:02:29.018230 ip-10-0-238-198 systemd[1]: rpm-ostreed.service: Succeeded.                                                                                                                                                                                   
Mar 22 12:02:29.019118 ip-10-0-238-198 systemd[1]: rpm-ostreed.service: Consumed 46ms CPU time                                                                                                                                                                       
Mar 22 12:14:08.241182 ip-10-0-238-198 systemd[1]: Starting Cleanup of Temporary Directories...                                                                                                                                                                      
Mar 22 12:14:08.253721 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/pkg-dbus-daemon.conf:1] Duplicate line for path "/var/lib/dbus", ignoring.                                                                                                       
Mar 22 12:14:08.259151 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/rpm-ostree-1-autovar.conf:1] Duplicate line for path "/var/lib/alternatives", ignoring.                                                                                          
Mar 22 12:14:08.260616 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/tmp.conf:12] Duplicate line for path "/var/tmp", ignoring.                                                                                                                       
Mar 22 12:14:08.260740 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.                                                                                                                       
Mar 22 12:14:08.261187 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:19] Duplicate line for path "/var/cache", ignoring.                                                                                                                     
Mar 22 12:14:08.261202 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:21] Duplicate line for path "/var/lib", ignoring.                                                                                                                       
Mar 22 12:14:08.261210 ip-10-0-238-198 systemd-tmpfiles[2004]: [/usr/lib/tmpfiles.d/var.conf:23] Duplicate line for path "/var/spool", ignoring.                                                                                                                     
Mar 22 12:14:08.268883 ip-10-0-238-198 systemd[1]: systemd-tmpfiles-clean.service: Succeeded.                                                                                                                                                                        
Mar 22 12:14:08.269504 ip-10-0-238-198 systemd[1]: Started Cleanup of Temporary Directories.                                                                                                                                                                         
Mar 22 12:14:08.269788 ip-10-0-238-198 systemd[1]: systemd-tmpfiles-clean.service: Consumed 21ms CPU time                                                                                                                                                            
Mar 22 12:23:30.565724 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:23:30.565471    1969 update.go:1729] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8db65a64505bb77e208f882270ff6c9778c73bfc13812602a0f8a01dc27b18e4              
Mar 22 12:23:30.565724 ip-10-0-238-198 machine-config-daemon[1969]: I0322 12:23:30.565700    1969 rpm-ostree.go:258] Running captured: rpm-ostree status --json                                                                                                      
Mar 22 12:23:30.591697 ip-10-0-238-198 dbus-daemon[1456]: [system] Activating via systemd: service name='org.projectatomic.rpmostree1' unit='rpm-ostreed.service' requested by ':1.36' (uid=0 pid=2010 comm="rpm-ostree status --json " label="system_u:system_r:unconfined_service_t:s0")
Mar 22 12:23:30.594232 ip-10-0-238-198 systemd[1]: Starting rpm-ostree System Management Daemon...                                                                                                                                                                   
Mar 22 12:23:30.619839 ip-10-0-238-198 rpm-ostree[2013]: Reading config file '/etc/rpm-ostreed.conf'                                                                                                                                                                 
Mar 22 12:23:30.625423 ip-10-0-238-198 rpm-ostree[2013]: In idle state; will auto-exit in 61 seconds                                                                                                                                                                 
Mar 22 12:23:30.625525 ip-10-0-238-198 dbus-daemon[1456]: [system] Successfully activated service 'org.projectatomic.rpmostree1'                                                                                                                                     
Mar 22 12:23:30.625723 ip-10-0-238-198 systemd[1]: Started rpm-ostree System Management Daemon.                                                                                                                                                                      
Mar 22 12:23:30.630184 ip-10-0-238-198 rpm-ostree[2013]: client(id:cli dbus:1.36 unit:machine-config-daemon-firstboot.service uid:0) added; new total=1                                                                                                              
Mar 22 12:23:30.635664 ip-10-0-238-198 rpm-ostree[2013]: client(id:cli dbus:1.36 unit:machine-config-daemon-firstboot.service uid:0) vanished; remaining=0                                                                                                           
Mar 22 12:23:30.635700 ip-10-0-238-198 rpm-ostree[2013]: In idle state; will auto-exit in 64 seconds                                                                                                                                                                 
```

Comment 2 Michael Gugino 2021-03-23 19:09:36 UTC
This job on azure has a 7 minute boot pause: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.8/1372002364097040384 and only 1 long pause not 2.  Not sure if 7 minutes is within spec or not, the log statements make it appear the system is doing nothing during that time.

Comment 3 Colin Walters 2021-04-08 19:30:38 UTC
Hmm.  I think we'd need to instrument some things here with more logging to really know anything.  It can also be pulling the image that's being slow.

I hope to land some work from ostree-container so that we can pull the image directly, which will avoid the delay from buffering the whole thing; we can also add metrics around pull speed to this.

Alternatively, changing the cluster to do pull-through to a local cache would be a huge win.

But it could be something entirely different going wrong.   We can get a *lot* more logs with `oc -v7 image extract` but I'm not sure I'd want to run that by default.

Maybe it would make sense to bite the bullet and vendor containers/image into the MCD.  Or, perhaps preferably this would look like https://github.com/ostreedev/ostree-rs-ext/issues/6

Anyways short term...I think adding some sort of Prometheus metric around node join time would be useful.  Does the MAPI have that?  We could also add a metric for update+reboot time to the MCD.

Comment 4 Michael Gugino 2021-04-08 20:07:39 UTC
We discussed how to best gather 'node join time' today, but there isn't a way currently.  We need the diff between machine provisioned and node created.

Comment 7 Joel Speed 2021-08-17 09:04:20 UTC
This appears to still be an issue that is causing failures in CI across multiple platforms, this one is from Azure.

```
ug 16 21:42:00.794206 ci-op-kt0lhi57-3302f-xj9bk-worker-eastus2-kvbgx systemd[1]: rpm-ostreed.service: Succeeded.
Aug 16 21:42:00.795194 ci-op-kt0lhi57-3302f-xj9bk-worker-eastus2-kvbgx systemd[1]: rpm-ostreed.service: Consumed 66ms CPU time
Aug 16 21:54:16.798432 ci-op-kt0lhi57-3302f-xj9bk-worker-eastus2-kvbgx systemd[1]: Starting Cleanup of Temporary Directories...
Aug 16 21:54:16.811015 ci-op-kt0lhi57-3302f-xj9bk-worker-eastus2-kvbgx systemd-tmpfiles[2081]: [/usr/lib/tmpfiles.d/pkg-dbus-daemon.conf:1] Duplicate line for path "/var/lib/dbus", ignoring.
```

A 12 minute pause early in the startup

Journal from: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/637/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1427376531528749057/artifacts/e2e-agnostic/gather-extra/artifacts/nodes/ci-op-kt0lhi57-3302f-xj9bk-worker-eastus2-kvbgx/

Has there been any further investigation into this one?

Comment 8 Micah Abbott 2021-09-02 19:10:00 UTC
Colin, could you take another look at this and see if there is another path forward we can try?

Comment 9 Micah Abbott 2022-01-13 20:56:02 UTC
We haven't received additional reports of the instances slow to initialize and we don't have enough data to take action, so going to close this for now.

If additional data is provided, this can be reopened.

Comment 10 Red Hat Bugzilla 2023-09-15 01:03:54 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days


Note You need to log in before you can comment on or make changes to this bug.