Bug 1496905 - wc -l gives wrong line count for files with Windows line breaks and a single trailing CRLF
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: coreutils
Version: 27
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kamil Dudka
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-09-28 18:19 UTC by ell1e
Modified: 2017-09-29 10:52 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-09-29 10:39:47 UTC
Type: Bug
Embargoed:


Attachments
notepad.png (2.63 KB, image/png)
2017-09-28 18:19 UTC, ell1e

Description ell1e 2017-09-28 18:19:45 UTC
Created attachment 1332099
notepad.png

Description of problem:
wc -l gives the wrong line count for files with Windows line breaks and a single trailing CRLF.

These are the contents of a file with which this can be tested:

jonas@cyberman:~$ hexdump test.txt
0000000 6261 0d63 640a 6665 0a0d               
000000a
jonas@cyberman:~$

Check notepad.png (attached) to see that this file shows up as 3 lines, the last of them empty, on Microsoft Windows (which I suggest should be canonical for how to interpret Windows line breaks).

This is what wc -l says on this exact file:

jonas@cyberman:~$ wc -l test.txt
2 test.txt
jonas@cyberman:~$

Version-Release number of selected component (if applicable):
coreutils-8.27-16.fc27.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create the above file with Windows line breaks
2. Open it in Notepad on Windows
3. Run wc -l test.txt

Actual results:
wc -l doesn't agree with Notepad.exe about how many lines this file has.

Expected results:
wc -l prints the same line count as is visible in Notepad.exe on Windows.

Additional info:

Comment 1 ell1e 2017-09-28 19:29:12 UTC
Sorry, here is a less confusing hex dump output which shows the two \r\n sequences more clearly:

jonas@cyberman:~$ hexdump -b test.txt
0000000 141 142 143 015 012 144 145 146 015 012                        
000000a

Comment 2 Kamil Dudka 2017-09-29 07:10:48 UTC
Those two outputs of hexdump do not match each other.  If you convert the sequence from comment #0 to octal, you get:

142 141 015 143 144 012 146 145 012 015

The problem here is that 015 (CR) and 012 (LF) are not next to each other, so it cannot be recognized as a CR-LF sequence.  If I use the input from comment #1, I get the expected result:

% for x in 141 142 143 015 012 144 145 146 015 012; do printf "\x$(printf "%x" $((0$x)))"; done | hexdump -C
00000000  61 62 63 0d 0a 64 65 66  0d 0a                    |abc..def..|
0000000a

% for x in 141 142 143 015 012 144 145 146 015 012; do printf "\x$(printf "%x" $((0$x)))"; done | wc -l
2

Please attach the exact input file you are giving to 'wc -l' and paste the exact output that you get from it.

Comment 3 Kamil Dudka 2017-09-29 07:14:03 UTC
(In reply to Kamil Dudka from comment #2)
> 142 141 015 143 144 012 146 145 012 015
> 
> The problem here is that 015 (CR) and 012 (LF) are not next to each other,

Moreover, the second sequence is LF-CR instead of CR-LF.
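
The mismatch between the two dumps can be reproduced with a short sketch. It assumes a little-endian machine (the reporter's package is x86_64) and uses POSIX `od` in place of `hexdump`: `od -tx2` groups the input into two-byte words in host byte order, which is the same grouping that `hexdump` uses when called without options, while `od -tx1` prints one byte per column like `hexdump -b` or `hexdump -C`. The temporary file name `f` is only illustrative.

```shell
# Recreate the 10 bytes from comment #1 (abc CR LF def CR LF) in a temporary file.
f=$(mktemp)
printf 'abc\r\ndef\r\n' > "$f"

# One byte per column, in file order.
od -An -tx1 "$f"

# Two-byte words in host byte order. On a little-endian machine the two
# bytes of each word appear swapped, as in the dump from comment #0.
od -An -tx2 "$f"
```

Read this way, both dumps describe the same file, and CR is always directly followed by LF.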

Comment 4 ell1e 2017-09-29 10:12:40 UTC
The hexdump in the initial bug description is misleading. Please only consider the hexdump in Comment 1 with the proper byte-wise format.

The file from Comment 1 does NOT yield the expected result for me:

jonas@cyberman:~$ hexdump -b test.txt
0000000 141 142 143 015 012 144 145 146 015 012                        
000000a
jonas@cyberman:~$ wc -l test.txt
2 test.txt
jonas@cyberman:~$

Please note the expected result is 3(!) lines, because a trailing \r\n on Windows implies an additional trailing empty line (see also Notepad screenshot which clearly shows three lines).

Comment 5 ell1e 2017-09-29 10:15:30 UTC
(Just to be super clear: a trailing \n on Linux doesn't imply a following empty line, because on Linux \n just terminates the previous line as per POSIX, see http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 for the definition. However, on Windows \r\n appears to always be a true line break, no matter whether something follows or not.)

Comment 6 Kamil Dudka 2017-09-29 10:39:47 UTC
The correct result is just 2 lines for the sequence in comment #1 because you have only two CR-LF sequences there.  It does not really matter whether anything follows the last CR-LF sequence or not.  'wc -l' just counts the newline (LF) characters.
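
A minimal sketch of that behaviour, feeding three small inputs to `wc -l`. The helper name `count_lf` is hypothetical and only strips the padding that some `wc` implementations print around the number.

```shell
# Count LF bytes the way 'wc -l' does, without surrounding spaces.
count_lf() { wc -l | tr -d ' '; }

printf 'abc\r\ndef\r\n' | count_lf   # two LF bytes: prints 2
printf 'abc\r\ndef'     | count_lf   # one LF byte, last line unterminated: prints 1
printf 'abc'            | count_lf   # no LF byte at all: prints 0
```

The CR bytes play no part in the result; only LF is counted.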

Comment 7 ell1e 2017-09-29 10:52:34 UTC
Alright, that makes sense. I just tried a file with no \n but other contents and indeed it shows 0, which is consistent with that. I did assume it counts the lines in the file, but I probably should have read the man page more carefully.
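
If a count matching what an editor such as Notepad displays is wanted, one possible approach is to add one to the number of newline characters, since an editor treats each line break as a separator rather than a terminator. This is only a sketch under that assumption; `editor_lines` is a hypothetical helper, not something coreutils provides.

```shell
# Hypothetical helper (not part of coreutils): report the line count an
# editor would display, taken here as the number of LF bytes plus one.
editor_lines() {
    echo $(( $(wc -l < "$1") + 1 ))
}

# The file from comment #1 has two LF bytes, so this prints 3.
f=$(mktemp)
printf 'abc\r\ndef\r\n' > "$f"
editor_lines "$f"
```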

