Bug 1375629 - atomic host deploy phase can be extremely slow
Summary: atomic host deploy phase can be extremely slow
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: rpm-ostree-client
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Colin Walters
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-13 14:31 UTC by Qian Cai
Modified: 2016-09-21 22:31 UTC
CC List: 5 users

Fixed In Version: ostree-2016.10-1.atomic.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-21 18:46:57 UTC
Target Upstream Version:


Attachments
blinking cursors screenshot (40.37 KB, image/png), 2016-09-13 15:04 UTC, Qian Cai
blinking cursors screenshot v2 (30.00 KB, image/png), 2016-09-13 15:07 UTC, Qian Cai

Comment 1 Colin Walters 2016-09-13 14:39:12 UTC
11m is certainly bad...but to be clear, this bug isn't about *rendering* - it's simply that the process is slow (without visible feedback), right?

Comment 2 Qian Cai 2016-09-13 15:04:24 UTC
Created attachment 1200533 [details]
blinking cursors screenshot

While the cursor is blinking, it is just a blank screen with the cursor blinking in a restricted area (screenshot attached). BTW, downloading the objects only took around 3 minutes or so.

Comment 3 Qian Cai 2016-09-13 15:07:17 UTC
Created attachment 1200534 [details]
blinking cursors screenshot v2

Comment 4 Qian Cai 2016-09-13 15:12:20 UTC
Funnily enough, if you interrupt (Ctrl-C) the object download at the beginning, before the blinking cursor shows up, and then resume the upgrade, the problem no longer happens.

# atomic host upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard

Receiving metadata objects: 26/(estimating) 2.4 MB/s 33.3 MB
Receiving metadata objects: 32/(estimating) 2.4 MB/s 36.6 MB ^C

# time atomic host upgrade

Receiving objects: 99% (3921/3931) 928.5 kB/s 220.1 MB                          
error: Operation was cancelled

real	3m39.391s
user	0m0.464s
sys	0m0.052s

# time atomic host upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
1 metadata, 0 content objects fetched; 837 B transferred in 3 seconds
Copying /etc changes: 29 modified, 4 removed, 57 added
Transaction complete; bootconfig swap: yes deployment count change: 1
...
Run "systemctl reboot" to start a reboot

real	0m8.352s
user	0m0.533s
sys	0m0.127s

Comment 5 Colin Walters 2016-09-13 15:17:42 UTC
I can't say I know what's going on here, but as far as I know, if there *is* some issue with 7.2.6-1 then there's not much we can do to fix it now.

Comment 6 Qian Cai 2016-09-13 15:24:12 UTC
FYI, it is also reproducible on OS1.

Comment 7 Micah Abbott 2016-09-13 15:31:41 UTC
In my previous testing using local VMs and bare metal, I did not encounter this issue.  (However, mjenner said he did see it in a local VM setup).

But it reproduces relatively easily in OpenStack.

Comment 8 Micah Abbott 2016-09-13 15:35:17 UTC
Also, non-interactive upgrades (e.g., via Ansible) did not appear to show this problem or any slowdown in upgrade progress.

Comment 9 Qian Cai 2016-09-13 15:36:47 UTC
Here is a manual workaround. Once the blinking cursor is in place, the process can't be interrupted with Ctrl-C (presumably it is doing I/O, writing the objects to disk). After it has been writing for a while (presumably once everything has been written to disk), you will need to "kill -9" the "rpm-ostree upgrade" process and run "atomic host upgrade" again.

Comment 11 Micah Abbott 2016-09-13 16:06:18 UTC
Piping the output through 'tee' seems to avoid this issue and allows the upgrade to complete in a reasonable amount of time:

# time rpm-ostree upgrade | tee output
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
Receiving metadata objects: 2/(estimating) 355 bytes/s 1.1 kB
Receiving metadata objects: 4/(estimating) 741 bytes/s 3.0 kB
Receiving metadata objects: 8/(estimating) 279.4 kB/s 1.4 MB
Receiving metadata objects: 21/(estimating) 813.4 kB/s 5.7 MB
Receiving metadata objects: 27/(estimating) 1.0 MB/s 8.4 MB
...
  systemd-sysv 219-19.el7_2.12 -> 219-19.el7_2.13
  tuned 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
  tuned-profiles-atomic 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
Run "systemctl reboot" to start a reboot

real    4m36.102s
user    0m0.137s
sys     0m0.107s
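
A plausible explanation for why piping through tee helps: stdout is then a pipe rather than a terminal, and interactive progress rendering is typically skipped in that case. Below is a minimal, hypothetical sketch of such an isatty() guard (not the actual rpm-ostree/libglnx code):

#include <stdio.h>
#include <unistd.h>

/* Skip interactive progress drawing when stdout is not a terminal,
 * e.g. when the output is piped into tee. */
static int
console_is_interactive (void)
{
  return isatty (STDOUT_FILENO);
}

int
main (void)
{
  if (console_is_interactive ())
    printf ("would draw an in-place progress bar here\n");
  else
    printf ("plain, line-by-line status output instead\n");
  return 0;
}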



Additionally, I have observed this problem when doing a rebase from 7.2.7 to 7.3.

Comment 12 Qian Cai 2016-09-13 18:07:32 UTC
One interesting thing is that both of those OpenStack instances are using a serial console, e.g., console=tty0 console=ttyS0,115200n8, which probably explains why people haven't reproduced it on a local system without a serial console so far. Given that rpm-ostree depends on glib2, I am unsure whether something could be messed up with console text handling somewhere.

Comment 13 Colin Walters 2016-09-13 18:10:49 UTC
I'm still getting ~4m and not 11m.  My best guess here is that this is related to syncfs() performance - we're getting I/O latency spikes in OpenStack.  But it's hard to say without more measurements.

What we're talking about now is how much time the non-pull part of the update takes, right?  The pull process could obviously be faster, but I don't think there are any regressions there.

So, for those who are getting slow times: can you also add to the logs here how long the pull took?
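
To measure the syncfs() theory directly, a small standalone timer could be pointed at the repo filesystem; a rough sketch follows (the /ostree/repo path and the approach are assumptions for illustration, not part of rpm-ostree):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Time a single syncfs() on the filesystem that holds the ostree repo. */
int
main (void)
{
  int fd = open ("/ostree/repo", O_RDONLY | O_DIRECTORY);
  if (fd < 0)
    {
      perror ("open /ostree/repo");
      return 1;
    }

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  if (syncfs (fd) < 0)
    perror ("syncfs");
  clock_gettime (CLOCK_MONOTONIC, &end);

  double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
  printf ("syncfs() took %.3f seconds\n", secs);

  close (fd);
  return 0;
}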

Comment 14 Micah Abbott 2016-09-13 18:13:58 UTC
I saw the following:

# rpm-ostree upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard

593 metadata, 3338 content objects fetched; 215800 KiB transferred in 253 seconds

<problem observed for many minutes>

Copying /etc changes: 29 modified, 4 removed, 60 added
Transaction complete; bootconfig swap: yes deployment count change: 0
...

Comment 15 Colin Walters 2016-09-13 18:21:17 UTC
Can someone try attaching strace after the pull is done, e.g. `strace -f -s 2048 -p $(pidof rpm-ostreed) -o strace.log`?  (In a tools container or via layered packages.)

Also, try using `pstack $(pidof rpm-ostreed)` (possibly after `debuginfo-install rpm-ostree`).

Comment 16 Micah Abbott 2016-09-13 18:36:15 UTC
Another data point, regarding length of time:

# time rpm-ostree upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard

593 metadata, 3338 content objects fetched; 215800 KiB transferred in 278 seconds                                                                                                                                                             
<problem occurs>

  tuned 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
  tuned-profiles-atomic 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
Run "systemctl reboot" to start a reboot

real	12m14.369s
user	0m8.474s
sys	0m12.841s

Comment 23 Colin Walters 2016-09-16 16:13:34 UTC
One thing that would help here is to rebuild rpm-ostree without optimization (and with full debuginfo) and get a stack trace from that.  In particular, the values of `ncolumns`, `textlen`, etc. from text_percent_internal() would be useful.

Here we'd just need a rebuilt client binary.

Comment 29 Jonathan Lebon 2016-09-21 14:01:03 UTC
Going through the steps in comment 1 (with the addition of setenforce 0) in a loop, I have been getting consistent results so far and haven't been able to reproduce this issue. CAI/Micah, could you verify whether you can still reproduce it? My gut feeling is that this very well could have been an OpenStack performance issue.

(In reply to Micah Abbott from comment #7)
> In my previous testing using local VMs and bare metal, I did not encounter
> this issue.  (However, mjenner said he did see it in a local VM setup).

Is he able to reproduce this reliably locally?

Comment 31 Colin Walters 2016-09-21 14:48:01 UTC
@jlebon - I forgot to update this bug, but when I was looking at this with Micah, the stack traces clearly showed the client spinning in a loop writing spaces.  The daemon was completely idle.

My best guess here is that it's something like SIGWINCH being delivered at an unlucky time, causing corruption of the cached_columns variable.

It does look to me like:

diff --git a/glnx-console.c b/glnx-console.c
index 416d276..24a4850 100644
--- a/glnx-console.c
+++ b/glnx-console.c
@@ -88,6 +88,8 @@ glnx_console_columns (void)
       cached_columns = c;
     }
 
+  /* SIGWINCH HERE */
+
   return cached_columns;
 }
 

is going to result in cached_columns being returned as zero.  Though in that case, AFAICS we'd hit:

  if (ncolumns < bar_min)
    return; /* TODO: spinner */

Which should result in a no-op.  But I'm guessing it's something similar to this.
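
For illustration only, here is a simplified sketch of the kind of race being described (hypothetical code modelled on the glnx-console.c pattern above, not the actual libglnx source):

#include <signal.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static volatile sig_atomic_t cached_columns = 0;

/* Invalidate the cached width when the window is resized. */
static void
on_sigwinch (int signum)
{
  (void) signum;
  cached_columns = 0;
}

static unsigned
console_columns (void)
{
  if (cached_columns == 0)
    {
      struct winsize ws;
      if (ioctl (STDOUT_FILENO, TIOCGWINSZ, &ws) == 0)
        cached_columns = ws.ws_col;
      /* A SIGWINCH delivered right here can zero cached_columns again,
       * so callers can observe a width of 0 and must handle it. */
    }
  return cached_columns;
}

int
main (void)
{
  signal (SIGWINCH, on_sigwinch);
  printf ("columns: %u\n", console_columns ());
  return 0;
}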

Comment 32 Micah Abbott 2016-09-21 15:12:13 UTC
(In reply to Jonathan Lebon from comment #29)
> Going through the steps in comment 1 (with the addition of setenforce 0) in
> a loop), I have been getting consistent results so far and haven't been able
> to reproduce this issue. CAI/Micah, could you verify if you can still
> reproduce it? My gut feeling is that this very well could have been an
> OpenStack performance issue.

I can still reproduce it.  And now I have seen it reproduced when using a local VM.

My best guess at a reproducer (which seems dependent on some sort of client settings/config):

1.  Boot VM
2.  Start brand new terminal (in my case I am using GNOME Terminal)
3.  SSH to VM (in my case, using unprivileged 'cloud-user')
4.  sudo su -
5.  rpm-ostree upgrade
6.  Wait for pull to complete, observe hang/blank characters getting spit out

Comment 33 Jonathan Lebon 2016-09-21 15:34:13 UTC
I am about 99% certain that this is https://github.com/GNOME/libglnx/pull/19. ostree-2016.7-2 didn't have this fix, but the latest one does.

To reproduce this, one simply has to have a terminal window narrower than strlen("xxx metadata, xxxx content objects fetched; xxxxxx KiB transferred in xxx seconds") columns (I'll save you the trouble... it's 82 :)). The standard GNOME Terminal window is 80 columns wide.

I think we can close this as CURRENTRELEASE, but I'd like Micah and/or CAI to try to reproduce this while using a maximized terminal window.
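
One mechanism that would match both the "spinning writing spaces" stack traces and the terminal-width finding is an unsigned underflow in the padding calculation; a hypothetical illustration (not the actual libglnx code, which should be checked against the PR above):

#include <stdio.h>
#include <string.h>

int
main (void)
{
  const char *text = "593 metadata, 3338 content objects fetched; "
                     "215800 KiB transferred in 253 seconds";
  const unsigned ncolumns = 80;           /* default GNOME Terminal width */
  const unsigned textlen = strlen (text); /* wider than 80 columns */

  /* BUG: unsigned subtraction wraps around when textlen > ncolumns,
   * so the "padding" becomes roughly 4 billion and the renderer would
   * spend minutes writing spaces. */
  const unsigned padding = ncolumns - textlen;
  printf ("textlen=%u ncolumns=%u padding=%u\n", textlen, ncolumns, padding);

  /* A clamped calculation avoids the wrap-around. */
  const unsigned safe_padding = textlen < ncolumns ? ncolumns - textlen : 0;
  printf ("clamped padding=%u\n", safe_padding);
  return 0;
}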

Comment 34 Qian Cai 2016-09-21 18:22:00 UTC
The above finding looks correct based on further testing.

Comment 35 Micah Abbott 2016-09-21 18:30:44 UTC
Agreed with @caiqian.

Performed upgrades using local VMs and OpenStack VMs.
  - Upgraded from 7.2.6-1 -> 7.2.7
  - Upgraded from 7.2.6-1 -> 7.3

When the terminal was full-screen, this issue was not observed.

Comment 36 Steve Milner 2016-09-21 18:46:57 UTC
Per conversation, setting to CLOSED NEXTRELEASE.

