Bug 1375629 - atomic host deploy phase can be extremely slow
Summary: atomic host deploy phase can be extremely slow
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: rpm-ostree-client
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Colin Walters
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-13 14:31 UTC by Qian Cai
Modified: 2016-09-21 22:31 UTC
CC List: 5 users

Fixed In Version: ostree-2016.10-1.atomic.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-21 18:46:57 UTC
Target Upstream Version:


Attachments
blinking cursors screenshot (40.37 KB, image/png), 2016-09-13 15:04 UTC, Qian Cai
blinking cursors screenshot v2 (30.00 KB, image/png), 2016-09-13 15:07 UTC, Qian Cai

Comment 1 Colin Walters 2016-09-13 14:39:12 UTC
11m is certainly bad...but to be clear, this bug isn't about *rendering* - it's simply that the process is slow (without visible feedback), right?

Comment 2 Qian Cai 2016-09-13 15:04:24 UTC
Created attachment 1200533 [details]
blinking cursors screenshot

While the cursor is blinking, it is just a blank screen with the cursor blinking in a restricted area (screenshot attached). BTW, downloading the objects only took around 3 minutes or so.

Comment 3 Qian Cai 2016-09-13 15:07:17 UTC
Created attachment 1200534 [details]
blinking cursors screenshot v2

Comment 4 Qian Cai 2016-09-13 15:12:20 UTC
Funnily enough, if you interrupt (Ctrl-C) the object download at the beginning, before the blinking cursor shows up, and then resume the upgrade, the problem no longer happens.

# atomic host upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard

Receiving metadata objects: 26/(estimating) 2.4 MB/s 33.3 MB
Receiving metadata objects: 32/(estimating) 2.4 MB/s 36.6 MB ^C

# time atomic host upgrade

Receiving objects: 99% (3921/3931) 928.5 kB/s 220.1 MB                          
error: Operation was cancelled

real	3m39.391s
user	0m0.464s
sys	0m0.052s

# time atomic host upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
1 metadata, 0 content objects fetched; 837 B transferred in 3 seconds
Copying /etc changes: 29 modified, 4 removed, 57 added
Transaction complete; bootconfig swap: yes deployment count change: 1
...
Run "systemctl reboot" to start a reboot

real	0m8.352s
user	0m0.533s
sys	0m0.127s

Comment 5 Colin Walters 2016-09-13 15:17:42 UTC
I can't say I know what's going on here, but as far as I know, if there *is* some issue with 7.2.6-1 then there's not much we can do to fix it now.

Comment 6 Qian Cai 2016-09-13 15:24:12 UTC
FYI, it is also reproducible on OS1.

Comment 7 Micah Abbott 2016-09-13 15:31:41 UTC
In my previous testing using local VMs and bare metal, I did not encounter this issue.  (However, mjenner said he did see it in a local VM setup).

But it reproduces relatively easily in OpenStack.

Comment 8 Micah Abbott 2016-09-13 15:35:17 UTC
Also, non-interactive upgrades (e.g., via Ansible) did not appear to show this problem or any slowdown in upgrade progress.

Comment 9 Qian Cai 2016-09-13 15:36:47 UTC
Here is a manual workaround. Once the blinking cursor is in place, the process can't be interrupted with Ctrl-C (presumably it is doing I/O, writing the objects to disk). After it has been writing for a while (presumably once everything has been written to disk), you will need to "kill -9" the "rpm-ostree upgrade" process and run "atomic host upgrade" again.

Comment 11 Micah Abbott 2016-09-13 16:06:18 UTC
Piping the output through 'tee' seems to avoid this issue and allows the upgrade to complete in a reasonable amount of time:

# time rpm-ostree upgrade | tee output
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
Receiving metadata objects: 2/(estimating) 355 bytes/s 1.1 kB
Receiving metadata objects: 4/(estimating) 741 bytes/s 3.0 kB
Receiving metadata objects: 8/(estimating) 279.4 kB/s 1.4 MB
Receiving metadata objects: 21/(estimating) 813.4 kB/s 5.7 MB
Receiving metadata objects: 27/(estimating) 1.0 MB/s 8.4 MB
...
  systemd-sysv 219-19.el7_2.12 -> 219-19.el7_2.13
  tuned 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
  tuned-profiles-atomic 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
Run "systemctl reboot" to start a reboot

real    4m36.102s
user    0m0.137s
sys     0m0.107s
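
A plausible explanation for why piping through tee helps: stdout is then a pipe rather than a terminal, and interactive progress rendering is typically skipped in that case. Below is a minimal, hypothetical sketch of such an isatty() guard (not the actual rpm-ostree/libglnx code):

#include <stdio.h>
#include <unistd.h>

/* Skip interactive progress drawing when stdout is not a terminal,
 * e.g. when the output is piped into tee. */
static int
console_is_interactive (void)
{
  return isatty (STDOUT_FILENO);
}

int
main (void)
{
  if (console_is_interactive ())
    printf ("would draw an in-place progress bar here\n");
  else
    printf ("plain, line-by-line status output instead\n");
  return 0;
}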



Additionally, I have observed this problem when doing a rebase from 7.2.7 to 7.3.

Comment 12 Qian Cai 2016-09-13 18:07:32 UTC
One interesting thing is that both of those OpenStack instances are using a serial console, e.g., console=tty0 console=ttyS0,115200n8, which probably explains why people haven't reproduced it on a local system without a serial console so far. Given that rpm-ostree depends on glib2, I am unsure whether something could be messed up with console text handling somewhere.

Comment 13 Colin Walters 2016-09-13 18:10:49 UTC
I'm still getting ~4m and not 11m.  My best guess here is that this is related to syncfs() performance - we're getting I/O latency spikes in OpenStack.  But it's hard to say without more measurements.

What we're talking about now is how much time the non-pull part of the update takes, right?  The pull process could obviously be faster, but I don't think there are any regressions there.

So, for those who are getting slow times: can you also add to the logs here how long the pull took?
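
To measure the syncfs() theory directly, a small standalone timer could be pointed at the repo filesystem; a rough sketch follows (the /ostree/repo path and the approach are assumptions for illustration, not part of rpm-ostree):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Time a single syncfs() on the filesystem that holds the ostree repo. */
int
main (void)
{
  int fd = open ("/ostree/repo", O_RDONLY | O_DIRECTORY);
  if (fd < 0)
    {
      perror ("open /ostree/repo");
      return 1;
    }

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  if (syncfs (fd) < 0)
    perror ("syncfs");
  clock_gettime (CLOCK_MONOTONIC, &end);

  double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
  printf ("syncfs() took %.3f seconds\n", secs);

  close (fd);
  return 0;
}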

Comment 14 Micah Abbott 2016-09-13 18:13:58 UTC
I saw the following:

# rpm-ostree upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard

593 metadata, 3338 content objects fetched; 215800 KiB transferred in 253 seconds

<problem observed for many minutes>

Copying /etc changes: 29 modified, 4 removed, 60 added
Transaction complete; bootconfig swap: yes deployment count change: 0
...

Comment 15 Colin Walters 2016-09-13 18:21:17 UTC
Can someone try attaching strace after the pull is done, e.g. `strace -f -s 2048 -p $(pidof rpm-ostreed) -o strace.log`?  (In a tools container or via layered packages.)

Also, try using `pstack $(pidof rpm-ostreed)` (possibly after `debuginfo-install rpm-ostree`).

Comment 16 Micah Abbott 2016-09-13 18:36:15 UTC
Another data point, regarding length of time:

# time rpm-ostree upgrade
Updating from: rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard

593 metadata, 3338 content objects fetched; 215800 KiB transferred in 278 seconds                                                                                                                                                             
<problem occurs>

  tuned 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
  tuned-profiles-atomic 2.5.1-4.el7_2.3 -> 2.5.1-4.el7_2.6
Run "systemctl reboot" to start a reboot

real	12m14.369s
user	0m8.474s
sys	0m12.841s

Comment 23 Colin Walters 2016-09-16 16:13:34 UTC
One thing that would help here is to rebuild rpm-ostree without optimization (and with full debuginfo) and get a stack trace from that.  In particular, the values of `ncolumns`, `textlen`, etc. from text_percent_internal() would be useful.

Here we'd just need a rebuilt client binary.

Comment 29 Jonathan Lebon 2016-09-21 14:01:03 UTC
Going through the steps in comment 1 (with the addition of setenforce 0) in a loop, I have been getting consistent results so far and haven't been able to reproduce this issue. CAI/Micah, could you verify whether you can still reproduce it? My gut feeling is that this very well could have been an OpenStack performance issue.

(In reply to Micah Abbott from comment #7)
> In my previous testing using local VMs and bare metal, I did not encounter
> this issue.  (However, mjenner said he did see it in a local VM setup).

Is he able to reproduce this reliably locally?

Comment 31 Colin Walters 2016-09-21 14:48:01 UTC
@jlebon - I forgot to update this bug, but when I was looking at this with Micah, the stack traces clearly showed the client spinning in a loop writing spaces.  The daemon was completely idle.

My best guess here is that it's something like SIGWINCH being delivered at an unlucky time, causing corruption of the cached_columns variable.

It does look to me like:

diff --git a/glnx-console.c b/glnx-console.c
index 416d276..24a4850 100644
--- a/glnx-console.c
+++ b/glnx-console.c
@@ -88,6 +88,8 @@ glnx_console_columns (void)
       cached_columns = c;
     }
 
+  /* SIGWINCH HERE */
+
   return cached_columns;
 }
 

is going to result in cached_columns being returned as zero.  Though in that case, AFAICS we'd hit:

  if (ncolumns < bar_min)
    return; /* TODO: spinner */

Which should result in a no-op.  But I'm guessing it's something similar to this.
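
For illustration only, here is a simplified sketch of the kind of race being described (hypothetical code modelled on the glnx-console.c pattern above, not the actual libglnx source):

#include <signal.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static volatile sig_atomic_t cached_columns = 0;

/* Invalidate the cached width when the window is resized. */
static void
on_sigwinch (int signum)
{
  (void) signum;
  cached_columns = 0;
}

static unsigned
console_columns (void)
{
  if (cached_columns == 0)
    {
      struct winsize ws;
      if (ioctl (STDOUT_FILENO, TIOCGWINSZ, &ws) == 0)
        cached_columns = ws.ws_col;
      /* A SIGWINCH delivered right here can zero cached_columns again,
       * so callers can observe a width of 0 and must handle it. */
    }
  return cached_columns;
}

int
main (void)
{
  signal (SIGWINCH, on_sigwinch);
  printf ("columns: %u\n", console_columns ());
  return 0;
}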

Comment 32 Micah Abbott 2016-09-21 15:12:13 UTC
(In reply to Jonathan Lebon from comment #29)
> Going through the steps in comment 1 (with the addition of setenforce 0) in
> a loop), I have been getting consistent results so far and haven't been able
> to reproduce this issue. CAI/Micah, could you verify if you can still
> reproduce it? My gut feeling is that this very well could have been an
> OpenStack performance issue.

I can still reproduce it.  And now I have seen it reproduced when using a local VM.

My best guess at a reproducer (which seems dependent on some sort of client settings/config):

1.  Boot VM
2.  Start brand new terminal (in my case I am using GNOME Terminal)
3.  SSH to VM (in my case, using unprivileged 'cloud-user')
4.  sudo su -
5.  rpm-ostree upgrade
6.  Wait for pull to complete, observe hang/blank characters getting spit out

Comment 33 Jonathan Lebon 2016-09-21 15:34:13 UTC
I am about 99% certain that this is https://github.com/GNOME/libglnx/pull/19. ostree-2016.7-2 didn't have this fix, but the latest one does.

To reproduce this, one simply has to have a terminal window narrower than strlen("xxx metadata, xxxx content objects fetched; xxxxxx KiB transferred in xxx seconds") columns (I'll save you the trouble... it's 82 :)). The standard GNOME Terminal window is 80 columns wide.

I think we can close this as CURRENTRELEASE, but I'd like Micah and/or CAI to try to reproduce this while using a maximized terminal window.
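
One mechanism that would match both the "spinning writing spaces" stack traces and the terminal-width finding is an unsigned underflow in the padding calculation; a hypothetical illustration (not the actual libglnx code, which should be checked against the PR above):

#include <stdio.h>
#include <string.h>

int
main (void)
{
  const char *text = "593 metadata, 3338 content objects fetched; "
                     "215800 KiB transferred in 253 seconds";
  const unsigned ncolumns = 80;           /* default GNOME Terminal width */
  const unsigned textlen = strlen (text); /* wider than 80 columns */

  /* BUG: unsigned subtraction wraps around when textlen > ncolumns,
   * so the "padding" becomes roughly 4 billion and the renderer would
   * spend minutes writing spaces. */
  const unsigned padding = ncolumns - textlen;
  printf ("textlen=%u ncolumns=%u padding=%u\n", textlen, ncolumns, padding);

  /* A clamped calculation avoids the wrap-around. */
  const unsigned safe_padding = textlen < ncolumns ? ncolumns - textlen : 0;
  printf ("clamped padding=%u\n", safe_padding);
  return 0;
}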

Comment 34 Qian Cai 2016-09-21 18:22:00 UTC
The above finding looks correct based on further testing.

Comment 35 Micah Abbott 2016-09-21 18:30:44 UTC
Agreed with @caiqian.

Performed upgrades using local VMs and OpenStack VMs.
  - Upgraded from 7.2.6-1 -> 7.2.7
  - Upgraded from 7.2.6-1 -> 7.3

When the terminal was full-screen, this issue was not observed.

Comment 36 Steve Milner 2016-09-21 18:46:57 UTC
Per conversation, setting to CLOSED NEXTRELEASE.

