1661625 – rsync-3.1.2-4.el7 hangs when run with -vvv to sync a large repository

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1661625 - rsync-3.1.2-4.el7 hangs when run with -vvv to sync a large repository

Summary: rsync-3.1.2-4.el7 hangs when run with -vvv to sync a large repository

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	rsync
Sub Component:
Version:	7.6
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Michal Ruprich
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1757052
TreeView+	depends on / blocked

Reported:	2018-12-21 19:40 UTC by Thiago Rafael Becker
Modified:	2023-10-06 18:03 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1775561 (view as bug list)
Environment:
Last Closed:	2019-12-18 11:07:23 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3771641	0	None	None	None	2018-12-21 20:33:36 UTC

Description Thiago Rafael Becker 2018-12-21 19:40:00 UTC

Description of problem:
The following command hangs after some time.

# rsync -avvv /<large_directory> /dst

Version-Release number of selected component (if applicable):
rsync-3.1.2-4.el7

How reproducible:
100%

Steps to Reproduce:
1. Install a new machine with two extra disk formatted as xfs, mount them as /git1 and /git2

The file system is actually not important. I've seen it happening with nfs as the source file system. Also seen with mvfs.

2. Clone the linux kernel repository into /git1
3. Start rsync from /git1 to /git2

Actual results:
The rsync command stops after some time.

Expected results:
rsync to complete.

Additional info:

Tested with kernel-3.10.0-862.11.6.el7.x86_64 and rsync-3.1.2-4.el7.x86_64, hangs.
Tested with kernel-3.10.0-957.1.3.el7.x86_64 and rsync-3.1.2-4.el7.x86_64, hangs.
Tested with kernel-3.10.0-957.1.3.el7 and rsync-3.0.9-18.el7.x86_64, doesn't hang.
Tested with kernel-3.10.0-862.11.6.el7.x86_64 and rsync-3.0.9-18.el7.x86_64, doesn't hang.

The rsync processes appear to be stuck waiting for a select to complete.

root     13624  9812  6 17:35 pts/0    00:00:04 rsync -avvv /git1 /git2
root     13625 13624  0 17:35 pts/0    00:00:00 rsync -avvv /git1 /git2
root     13626 13625  7 17:35 pts/0    00:00:04 rsync -avvv /git1 /git2

# cat /proc/{13624,13625,13626}/stack
[<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0
[<ffffffff9f635711>] do_select+0x6d1/0x7c0
[<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0
[<ffffffff9f635b9a>] SyS_select+0xba/0x110
[<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0
[<ffffffff9f635711>] do_select+0x6d1/0x7c0
[<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0
[<ffffffff9f635b9a>] SyS_select+0xba/0x110
[<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0
[<ffffffff9f635711>] do_select+0x6d1/0x7c0
[<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0
[<ffffffff9f635b9a>] SyS_select+0xba/0x110
[<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

Comment 3 Renaud Métrich 2019-01-03 09:00:12 UTC

rsync-3.0.9-18 uses protocol 30 whereas rsync-3.1.2-4 uses protocol 31.

After digging into Upstream repo, it appears the issue started to happen with v3.1.0pre1 (didn't find an older tag to test further).
This should probably be reported Upstream.

Issue happens only with "-vvv", not "-vv".

Comment 4 Renaud Métrich 2019-01-03 10:13:11 UTC

Issue was introduced by:

d8587b4 - Change the msg pipe to use a real multiplexed IO mode for the data that goes from the receiver to the generator. (9 years ago) <Wayne Davison>

Comment 5 Renaud Métrich 2019-01-04 09:12:38 UTC

The issue is within read_a_msg() which is used by 2nd child process:
- first process (main process) is the sender
- second (child) process (PID 11453) is a receiver that also sends messages (maybe it's due to verbosity, I don't know yet ...)
- third (child) process is a receiver


# pstack 11453
#0  0x00007f6b2384df53 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000427fdf in writefd_unbuffered (fd=fd@entry=1, buf=buf@entry=0x7ffe731183c0 "\211", len=141) at io.c:1434
#2  0x00000000004284af in mplex_write (fd=1, code=code@entry=MSG_INFO, buf=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=137, convert=<optimized out>) at io.c:378
#3  0x0000000000426fe8 in io_multiplex_write (code=MSG_INFO, buf=<optimized out>, len=len@entry=137, convert=<optimized out>) at io.c:1833
#4  0x000000000042709c in send_msg (code=<optimized out>, buf=buf@entry=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=len@entry=137, convert=<optimized out>) at io.c:392
#5  0x000000000041ea5e in rwrite (code=<optimized out>, code@entry=FINFO, buf=buf@entry=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=len@entry=137, is_utf8=0) at log.c:313
#6  0x0000000000427395 in read_a_msg (fd=3) at io.c:1044
#7  0x0000000000428075 in writefd_unbuffered (fd=fd@entry=1, buf=buf@entry=0x7ffe7311bd90 "p", len=116) at io.c:1448
...

In read_a_msg(), the "msg_fd_in" variable is temporarily disabled:

 874 /* Read a message from a multiplexed source. */
 875 static void read_a_msg(int fd)
 876 {
...
 879         int tag, flist_parent = -1, save_msg_fd_in = msg_fd_in;
...
 881         /* Temporarily disable msg_fd_in.  This is needed to avoid looping back
 882          * to this routine from writefd_unbuffered(). */
 883         msg_fd_in = -1;
...

Due to verbosity (?? unsure ??), we enter the following code:

1031         case MSG_INFO:
1032         case MSG_ERROR:
1033         case MSG_ERROR_XFER:
1034         case MSG_WARNING:
1035                 if (msg_bytes >= sizeof line) {
1036                     overflow:
1037                         rprintf(FERROR,
1038                                 "multiplexing overflow %d:%lu [%s%s]\n",
1039                                 tag, (unsigned long)msg_bytes, who_am_i(),
1040                                 inc_recurse ? "/inc" : "");
1041                         exit_cleanup(RERR_STREAMIO);
1042                 }
1043                 read_loop(fd, line, msg_bytes);
1044                 rwrite((enum logcode)tag, line, msg_bytes, !am_generator);

We have here: tag == MSG_INFO, line == "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n"

On line 1044, we try writing this, but cannot because buffer is probably full.
Since msg_fd_in has been disabled temporarily, there is probably something on the FD but we never handle it, causing a deadlock to occur between the various sides.

Comment 6 Renaud Métrich 2019-01-04 09:30:03 UTC

This rwrite() call is indeed due to verbosity. In some rare cases, if there are tons of errors, we should be able to hit the same issue without "-vvv".

Comment 13 Tomáš Hozza 2019-12-18 11:07:14 UTC

Red Hat Enterprise Linux version 7 entered the Maintenance Support 1 Phase in August 2019. In this phase only qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available. Other errata advisories may be delivered as appropriate.

This bug has been reviewed by Support and Engineering representative and does not meet the inclusion criteria for Maintenance Support 1 Phase. The issue has been cloned to RHEL-8 as Bug #1775561.

For more information about Red Hat Enterprise Linux Lifecycle, please see https://access.redhat.com/support/policy/updates/errata/

Comment 14 RHEL Program Management 2019-12-18 11:07:23 UTC

Development Management has reviewed and declined this request. You may appeal this decision by using your Red Hat support channels, who will make certain  the issue receives the proper prioritization with product and development management.

https://www.redhat.com/support/process/production/#howto

Note You need to log in before you can comment on or make changes to this bug.