Bugzilla (bugzilla.redhat.com) will be under maintenance for infrastructure upgrades and will not be available on July 31st between 12:30 AM - 05:30 AM UTC. We appreciate your understanding and patience. You can follow status.redhat.com for details.
Bug 1661625 - rsync-3.1.2-4.el7 hangs when run with -vvv to sync a large repository
Summary: rsync-3.1.2-4.el7 hangs when run with -vvv to sync a large repository
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: rsync
Version: 7.6
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: rc
: ---
Assignee: Michal Ruprich
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On:
Blocks: 1757052
TreeView+ depends on / blocked
 
Reported: 2018-12-21 19:40 UTC by Thiago Rafael Becker
Modified: 2021-02-15 17:01 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1775561 (view as bug list)
Environment:
Last Closed: 2019-12-18 11:07:23 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3771641 0 None None None 2018-12-21 20:33:36 UTC

Description Thiago Rafael Becker 2018-12-21 19:40:00 UTC
Description of problem:
The following command hangs after some time.

# rsync -avvv /<large_directory> /dst

Version-Release number of selected component (if applicable):
rsync-3.1.2-4.el7

How reproducible:
100%

Steps to Reproduce:
1. Install a new machine with two extra disk formatted as xfs, mount them as /git1 and /git2

The file system is actually not important. I've seen it happening with nfs as the source file system. Also seen with mvfs.

2. Clone the linux kernel repository into /git1
3. Start rsync from /git1 to /git2

Actual results:
The rsync command stops after some time.

Expected results:
rsync to complete.

Additional info:

Tested with kernel-3.10.0-862.11.6.el7.x86_64 and rsync-3.1.2-4.el7.x86_64, hangs.
Tested with kernel-3.10.0-957.1.3.el7.x86_64 and rsync-3.1.2-4.el7.x86_64, hangs.
Tested with kernel-3.10.0-957.1.3.el7 and rsync-3.0.9-18.el7.x86_64, doesn't hang.
Tested with kernel-3.10.0-862.11.6.el7.x86_64 and rsync-3.0.9-18.el7.x86_64, doesn't hang.

The rsync processes appear to be stuck waiting for a select to complete.

root     13624  9812  6 17:35 pts/0    00:00:04 rsync -avvv /git1 /git2
root     13625 13624  0 17:35 pts/0    00:00:00 rsync -avvv /git1 /git2
root     13626 13625  7 17:35 pts/0    00:00:04 rsync -avvv /git1 /git2

# cat /proc/{13624,13625,13626}/stack
[<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0
[<ffffffff9f635711>] do_select+0x6d1/0x7c0
[<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0
[<ffffffff9f635b9a>] SyS_select+0xba/0x110
[<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0
[<ffffffff9f635711>] do_select+0x6d1/0x7c0
[<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0
[<ffffffff9f635b9a>] SyS_select+0xba/0x110
[<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0
[<ffffffff9f635711>] do_select+0x6d1/0x7c0
[<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0
[<ffffffff9f635b9a>] SyS_select+0xba/0x110
[<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

Comment 3 Renaud Métrich 2019-01-03 09:00:12 UTC
rsync-3.0.9-18 uses protocol 30 whereas rsync-3.1.2-4 uses protocol 31.

After digging into Upstream repo, it appears the issue started to happen with v3.1.0pre1 (didn't find an older tag to test further).
This should probably be reported Upstream.

Issue happens only with "-vvv", not "-vv".

Comment 4 Renaud Métrich 2019-01-03 10:13:11 UTC
Issue was introduced by:

d8587b4 - Change the msg pipe to use a real multiplexed IO mode for the data that goes from the receiver to the generator. (9 years ago) <Wayne Davison>

Comment 5 Renaud Métrich 2019-01-04 09:12:38 UTC
The issue is within read_a_msg() which is used by 2nd child process:
- first process (main process) is the sender
- second (child) process (PID 11453) is a receiver that also sends messages (maybe it's due to verbosity, I don't know yet ...)
- third (child) process is a receiver


# pstack 11453
#0  0x00007f6b2384df53 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000427fdf in writefd_unbuffered (fd=fd@entry=1, buf=buf@entry=0x7ffe731183c0 "\211", len=141) at io.c:1434
#2  0x00000000004284af in mplex_write (fd=1, code=code@entry=MSG_INFO, buf=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=137, convert=<optimized out>) at io.c:378
#3  0x0000000000426fe8 in io_multiplex_write (code=MSG_INFO, buf=<optimized out>, len=len@entry=137, convert=<optimized out>) at io.c:1833
#4  0x000000000042709c in send_msg (code=<optimized out>, buf=buf@entry=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=len@entry=137, convert=<optimized out>) at io.c:392
#5  0x000000000041ea5e in rwrite (code=<optimized out>, code@entry=FINFO, buf=buf@entry=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=len@entry=137, is_utf8=0) at log.c:313
#6  0x0000000000427395 in read_a_msg (fd=3) at io.c:1044
#7  0x0000000000428075 in writefd_unbuffered (fd=fd@entry=1, buf=buf@entry=0x7ffe7311bd90 "p", len=116) at io.c:1448
...

In read_a_msg(), the "msg_fd_in" variable is temporarily disabled:

 874 /* Read a message from a multiplexed source. */
 875 static void read_a_msg(int fd)
 876 {
...
 879         int tag, flist_parent = -1, save_msg_fd_in = msg_fd_in;
...
 881         /* Temporarily disable msg_fd_in.  This is needed to avoid looping back
 882          * to this routine from writefd_unbuffered(). */
 883         msg_fd_in = -1;
...

Due to verbosity (?? unsure ??), we enter the following code:

1031         case MSG_INFO:
1032         case MSG_ERROR:
1033         case MSG_ERROR_XFER:
1034         case MSG_WARNING:
1035                 if (msg_bytes >= sizeof line) {
1036                     overflow:
1037                         rprintf(FERROR,
1038                                 "multiplexing overflow %d:%lu [%s%s]\n",
1039                                 tag, (unsigned long)msg_bytes, who_am_i(),
1040                                 inc_recurse ? "/inc" : "");
1041                         exit_cleanup(RERR_STREAMIO);
1042                 }
1043                 read_loop(fd, line, msg_bytes);
1044                 rwrite((enum logcode)tag, line, msg_bytes, !am_generator);

We have here: tag == MSG_INFO, line == "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep  6 08:43:06 2018\nl7.x86_64/.gitattributes\n"

On line 1044, we try writing this, but cannot because buffer is probably full.
Since msg_fd_in has been disabled temporarily, there is probably something on the FD but we never handle it, causing a deadlock to occur between the various sides.

Comment 6 Renaud Métrich 2019-01-04 09:30:03 UTC
This rwrite() call is indeed due to verbosity. In some rare cases, if there are tons of errors, we should be able to hit the same issue without "-vvv".

Comment 13 Tomáš Hozza 2019-12-18 11:07:14 UTC
Red Hat Enterprise Linux version 7 entered the Maintenance Support 1 Phase in August 2019. In this phase only qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available. Other errata advisories may be delivered as appropriate.

This bug has been reviewed by Support and Engineering representative and does not meet the inclusion criteria for Maintenance Support 1 Phase. The issue has been cloned to RHEL-8 as Bug #1775561.

For more information about Red Hat Enterprise Linux Lifecycle, please see https://access.redhat.com/support/policy/updates/errata/

Comment 14 RHEL Program Management 2019-12-18 11:07:23 UTC
Development Management has reviewed and declined this request. You may appeal this decision by using your Red Hat support channels, who will make certain  the issue receives the proper prioritization with product and development management.

https://www.redhat.com/support/process/production/#howto


Note You need to log in before you can comment on or make changes to this bug.