11m is certainly bad...but to be clear, this bug isn't about *rendering* - it's simply that the process is slow (without visible feedback), right?
Created attachment 1200533 [details]
blinking cursors screenshot

During cursor blinking, the screen is just blank with a cursor blinking in a restricted area (screenshot attached). BTW, downloading the objects only took around 3 minutes or so.
Created attachment 1200534 [details]
blinking cursors screenshot v2
Funny enough, if you interrupt (Ctrl-C) the object download at the beginning, before the blinking cursors show up, and then resume the upgrade, the problem won't happen anymore.

# atomic host upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
Receiving metadata objects: 26/(estimating) 2.4 MB/s 33.3 MB
Receiving metadata objects: 32/(estimating) 2.4 MB/s 36.6 MB
^C
# time atomic host upgrade
Receiving objects: 99% (3921/3931) 928.5 kB/s 220.1 MB
error: Operation was cancelled

real 3m39.391s
user 0m0.464s
sys 0m0.052s
# time atomic host upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
1 metadata, 0 content objects fetched; 837 B transferred in 3 seconds
Copying /etc changes: 29 modified, 4 removed, 57 added
Transaction complete; bootconfig swap: yes deployment count change: 1
...
Run "systemctl reboot" to start a reboot

real 0m8.352s
user 0m0.533s
sys 0m0.127s
I can't say I know what's going on here, but as far as I know, if there *is* some issue with 7.2.6-1 then there's not much we can do to fix it now.
FYI, also reproducible on the OS1.
In my previous testing using local VMs and bare metal, I did not encounter this issue. (However, mjenner said he did see it in a local VM setup). But it is reproducing relatively easily in OpenStack.
Also, non-interactive upgrades (i.e. via Ansible) did not appear to show this problem or any slower upgrade progress.
Here is a manual workaround. Once the blinking cursors are in place, the process can't be interrupted with Ctrl-C (presumably it is doing I/O, writing the objects to disk). After it has been writing for a while (presumably everything has been written to disk), you will need to "kill -9" the "rpm-ostree upgrade" process and run "atomic host upgrade" again.
Redirecting the output to 'tee' seems to avoid this issue and allows the upgrade to complete in a reasonable amount of time:

# time rpm-ostree upgrade | tee output
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
Receiving metadata objects: 2/(estimating) 355 bytes/s 1.1 kB
Receiving metadata objects: 4/(estimating) 741 bytes/s 3.0 kB
Receiving metadata objects: 8/(estimating) 279.4 kB/s 1.4 MB
Receiving metadata objects: 21/(estimating) 813.4 kB/s 5.7 MB
Receiving metadata objects: 27/(estimating) 1.0 MB/s 8.4 MB
...
systemd-sysv 219-19.el7_2.12 -> 219-19.el7_2.13
tuned 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
tuned-profiles-atomic 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
Run "systemctl reboot" to start a reboot

real 4m36.102s
user 0m0.137s
sys 0m0.107s

Additionally, I have observed this problem doing a rebase from 7.2.7 to 7.3.
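Presumably piping through 'tee' means stdout is no longer a terminal, so any progress rendering that is gated on a terminal check gets skipped and plain line-by-line output is printed instead. A minimal sketch of that kind of gate (an illustrative pattern, not the actual libglnx/rpm-ostree code):

/* Draw an in-place status line only when stdout is a tty; when piped
 * (e.g. through tee), fall back to plain line output. */
#include <stdio.h>
#include <unistd.h>

static void
report_progress (unsigned int fetched, unsigned int total)
{
  if (isatty (STDOUT_FILENO))
    {
      /* Interactive: redraw a single status line in place. */
      printf ("\rReceiving objects: %u/%u", fetched, total);
      fflush (stdout);
    }
  else
    {
      /* Piped/redirected: emit plain lines, no cursor tricks. */
      printf ("Receiving objects: %u/%u\n", fetched, total);
    }
}

int
main (void)
{
  for (unsigned int i = 0; i <= 100; i += 20)
    report_progress (i, 100);
  if (isatty (STDOUT_FILENO))
    printf ("\n");
  return 0;
}

If that is what is happening, the tee workaround only hides the rendering problem; the pull and deploy themselves are unaffected.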
One interesting thing is that both of those OpenStack instances are using a serial console (e.g., console=tty0 console=ttyS0,115200n8), so that probably explains why people haven't reproduced it so far on a local system without a serial console. Given that rpm-ostree depends on glib2, I am unsure whether something could be getting messed up in the console text handling somewhere.
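If the serial console is part of the trigger, it may be worth checking what terminal size the client actually sees on those instances; on a serial line, TIOCGWINSZ can report 0 columns unless something has explicitly set the window size. A standalone check (not rpm-ostree code) along these lines:

/* Print the terminal size reported for stdout; 0 columns (or an error
 * on a non-tty) would feed odd values into width-dependent rendering. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int
main (void)
{
  struct winsize ws;

  if (ioctl (STDOUT_FILENO, TIOCGWINSZ, &ws) != 0)
    {
      perror ("TIOCGWINSZ");
      return 1;
    }
  printf ("rows=%u cols=%u\n", (unsigned) ws.ws_row, (unsigned) ws.ws_col);
  return 0;
}

Running that over both the serial console and an SSH session would show whether the two report different widths.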
I'm still getting ~4m and not 11m. My best guess here is that this is related to syncfs() performance - we're getting I/O latency spikes in OpenStack. But it's hard to say without more measurements. What we're talking about now is how much time the non-pull part of the update takes, right? The pull process could obviously be faster, but I don't think there are any regressions there. So for those who are getting slow times - can you also add to the logs here how long the pull took?
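For anyone who wants to test the syncfs() theory directly on a slow instance, a rough standalone timing sketch like the one below could help separate I/O latency from client-side rendering. (This is a hypothetical helper; the /ostree/repo path is assumed, and it is not part of rpm-ostree.)

/* Time a single syncfs() on the filesystem holding the OSTree repo.
 * Multi-second results here would point at I/O latency in the guest. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main (void)
{
  int fd = open ("/ostree/repo", O_RDONLY | O_DIRECTORY);
  struct timespec start, end;

  if (fd < 0)
    {
      perror ("open /ostree/repo");
      return 1;
    }

  clock_gettime (CLOCK_MONOTONIC, &start);
  if (syncfs (fd) != 0)
    perror ("syncfs");
  clock_gettime (CLOCK_MONOTONIC, &end);

  printf ("syncfs took %.3f s\n",
          (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9);
  close (fd);
  return 0;
}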
I saw the following:

# rpm-ostree upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
593 metadata, 3338 content objects fetched; 215800 KiB transferred in 253 seconds
<problem observed for many minutes>
Copying /etc changes: 29 modified, 4 removed, 60 added
Transaction complete; bootconfig swap: yes deployment count change: 0
...
Can someone try attaching `strace -f -s 2048 -o strace.log -p $(pidof rpm-ostreed)` after the pull is done? (In a tools container or via layered packages.) Also, try using `pstack $(pidof rpm-ostreed)` (possibly after `debuginfo-install rpm-ostree`).
Another data point, regarding length of time:

# time rpm-ostree upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
593 metadata, 3338 content objects fetched; 215800 KiB transferred in 278 seconds
<problem occurs>
tuned 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
tuned-profiles-atomic 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
Run "systemctl reboot" to start a reboot

real 12m14.369s
user 0m8.474s
sys 0m12.841s
One thing that would help here is to rebuild rpm-ostree without optimization (and with full debuginfo) and get a stack trace from that - in particular, the values of `ncolumns`, `textlen`, etc. from text_percent_internal(). For this we'd just need a rebuilt client binary.
Going through the steps in comment 1 (with the addition of setenforce 0) in a loop, I have been getting consistent results so far and haven't been able to reproduce this issue. CAI/Micah, could you verify if you can still reproduce it? My gut feeling is that this very well could have been an OpenStack performance issue.

(In reply to Micah Abbott from comment #7)
> In my previous testing using local VMs and bare metal, I did not encounter
> this issue. (However, mjenner said he did see it in a local VM setup).

Is he able to reproduce this reliably locally?
@jlebon - I forgot to update this bug, but when I was looking at this with Micah, the stack traces clearly show the client spinning in a loop writing spaces. The daemon was completely idle. My best guess here is it's something like SIGWINCH being delivered at an unlucky time causing corruption of the cached_columns variable? It does look to me like:

diff --git a/glnx-console.c b/glnx-console.c
index 416d276..24a4850 100644
--- a/glnx-console.c
+++ b/glnx-console.c
@@ -88,6 +88,8 @@ glnx_console_columns (void)
       cached_columns = c;
     }
 
+  /* SIGWINCH HERE */
+
   return cached_columns;
 }

is going to result in cached_columns being returned as zero. Though in that case, AFAICS we'd hit:

  if (ncolumns < bar_min)
    return; /* TODO: spinner */

Which should result in a noop. But I'm guessing it's something similar to this.
(In reply to Jonathan Lebon from comment #29)
> Going through the steps in comment 1 (with the addition of setenforce 0) in
> a loop, I have been getting consistent results so far and haven't been able
> to reproduce this issue. CAI/Micah, could you verify if you can still
> reproduce it? My gut feeling is that this very well could have been an
> OpenStack performance issue.

I can still reproduce it. And now I have seen it reproduced when using a local VM.

My best guess at a reproducer (which seems dependent on some sort of client settings/config):

1. Boot VM
2. Start brand new terminal (in my case I am using GNOME Terminal)
3. SSH to VM (in my case, using unprivileged 'cloud-user')
4. sudo su -
5. rpm-ostree upgrade
6. Wait for pull to complete, observe hang/blank characters getting spit out
I am about 99% certain that this is https://github.com/GNOME/libglnx/pull/19. ostree-2016.7-2 didn't have this fix, but the latest one does. To reproduce this, one simply has to have a terminal window narrower than strlen("xxx metadata, xxxx content objects fetched; xxxxxx KiB transferred in xxx seconds") (I'll save you the trouble... it's 82 :)). The standard GNOME Terminal window is 80 columns wide. I think we can close this as CURRENTRELEASE, but I'd like Micah and/or CAI to try to reproduce this while having a maximized terminal window.
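For reference, the symptom (the client busily writing spaces) matches a classic unsigned-underflow pattern when a status line is padded out to the terminal width: if the padding count is computed as columns minus text length in an unsigned type, an 82-character line on an 80-column terminal turns into an enormous number of spaces. A minimal illustration of that pattern - an assumption based on the stack traces and the 82-vs-80 finding, not the verbatim libglnx code:

/* Illustration of the suspected failure mode: unsigned padding
 * arithmetic underflows when the status text is wider than the
 * terminal, producing a near-endless stream of spaces. */
#include <stdio.h>
#include <string.h>

static void
pad_line_buggy (const char *text, size_t ncolumns)
{
  size_t padding = ncolumns - strlen (text);   /* underflows if text is wider */
  printf ("%s", text);
  for (size_t i = 0; i < padding; i++)         /* ~SIZE_MAX spaces: looks like a hang */
    putchar (' ');
}

static void
pad_line_fixed (const char *text, size_t ncolumns)
{
  size_t textlen = strlen (text);
  size_t padding = textlen < ncolumns ? ncolumns - textlen : 0;
  printf ("%.*s", (int) ncolumns, text);       /* also truncate to the width */
  for (size_t i = 0; i < padding; i++)
    putchar (' ');
}

int
main (int argc, char **argv)
{
  const char msg[] =
    "xxx metadata, xxxx content objects fetched; xxxxxx KiB transferred in xxx seconds";

  (void) argv;
  if (argc > 1)
    pad_line_buggy (msg, 80);   /* demonstrates the pathological case */
  else
    pad_line_fixed (msg, 80);   /* clamps the padding and truncates */
  putchar ('\n');
  return 0;
}

This would also fit why a maximized terminal (or piping to tee) avoids the problem: the status line then fits within the reported width.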
The above finding looks correct based on further testing.
Agreed with @caiqian. Performed upgrades using local VMs and OpenStack VMs:

- Upgraded from 7.2.6-1 -> 7.2.7
- Upgraded from 7.2.6-1 -> 7.3

When the terminal was full-screen, this issue was not observed.
Per conversation, setting to CLOSED NEXTRELEASE.