Bug 1219141 - grep wrongly classifies a valid C source file as a binary file
Summary: grep wrongly classifies a valid C source file as a binary file
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: grep
Version: 22
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Jaroslav Škarvada
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-05-06 16:33 UTC by Kamil Dudka
Modified: 2016-01-24 03:18 UTC (History)
5 users (show)

Fixed In Version: grep-2.22-5.fc23 grep-2.21-9.fc22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-24 03:18:29 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Kamil Dudka 2015-05-06 16:33:34 UTC
Version-Release number of selected component (if applicable):
grep-2.21-5.fc22.x86_64


How reproducible:
easily


Steps to Reproduce:
1. curl -O https://raw.githubusercontent.com/bagder/curl/curl-7_29_0/src/tool_getparam.c
2. file tool_getparam.c 
3. grep ftpport tool_getparam.c


Actual results:

$ file tool_getparam.c 
tool_getparam.c: C source, ISO-8859 text

$ grep ftpport tool_getparam.c
Binary file tool_getparam.c matches


Expected results:

$ grep ftpport tool_getparam.c
  {"P",  "ftpport",                  TRUE},
         /* 'ftpport' old version */
        Curl_safefree(config->ftpport);
      GetStr(&config->ftpport, nextarg);


Additional info:
grep-2.18-1.fc20 works fine.

Comment 1 Sebastian Pöhn 2015-05-07 18:22:00 UTC
I run into the same issue. I already opened a report at gnu grep:

http://debbugs.gnu.org/cgi/bugreport.cgi?bug=20526

Affected: grep-2.21-3.fc21.x86_64
Not affected: grep-2.20-6.fc21.x86_64

Comment 2 Kamil Dudka 2015-05-11 09:57:18 UTC
Thanks for the reference!  The upstream commit that changes the behavior should be reverted IMO.  I use grep to search (usually ASCII) content in a huge amount of text files of mixed encoding.  It is technically impossible to set the LC_ALL variable for each of them as in the suggested workaround.  If GCC compiles a file in the same environment with no warning, the file hardly classifies as a binary file.

Comment 3 Jaroslav Škarvada 2015-05-11 10:14:29 UTC
You can set grep alias to 'grep -a' and treat all files as text.

I am not going to revert the fix if upstream not do.

Comment 4 Kamil Dudka 2015-05-11 10:31:41 UTC
(In reply to Jaroslav Škarvada from comment #3)
> You can set grep alias to 'grep -a' and treat all files as text.

That would not restore the behavior of grep-2.18-1.fc20.  I want grep to treat text files as text files, regardless of their encoding.  grep -a will cause grep to treat also binary files as text files, which is not that useful.

Comment 5 Eric Blake 2015-05-11 16:11:09 UTC
A text file, by POSIX definition, is one in which ALL bytes in the file comprise valid characters in the current locale encoding.  A file may be text in one locale, and binary in another.  If you try to grep a file that is binary because you did not use 'grep -a' and your locale differs from the encoding used by the file, then all bets are off, because you are attempting something non-portable.  This is not a regression, but a documented change in undefined behavior, and arguably a bug fix (as grep outputting invalid encoding sequences when grepping a binary file that differs from the current locale may in turn result in the terminal displaying bogus content when it encounters those encoding errors).  I see no reason to change anything in grep - people deserve to be educated about locale-dependence of their files.

Comment 6 Eric Blake 2015-05-11 16:14:08 UTC
(In reply to Kamil Dudka from comment #4)
> (In reply to Jaroslav Škarvada from comment #3)
> > You can set grep alias to 'grep -a' and treat all files as text.
> 
> That would not restore the behavior of grep-2.18-1.fc20.  I want grep to
> treat text files as text files, regardless of their encoding.  grep -a will
> cause grep to treat also binary files as text files, which is not that
> useful.

The problem, though, is that not all "text files" are text files across all locales.  What heuristics would you propose to determine that "this file, which has encoding errors in the current locale and is therefore not a text file, is nevertheless a text file in some other locale, so treat it as text instead of binary" without adding the expense of grepping over two locales instead of one for every file?

Comment 7 Kamil Dudka 2015-05-11 16:58:16 UTC
I usually need to grep all C source files, Makefiles, shell scripts etc. for some pattern in a project directory recursively while skipping object files and the like.  The heuristic used in grep prior to commit cd36abd4 worked just fine for me.

Source files in many SW projects simply are of mixed encoding.  Even if you force all projects to unify the encoding now, it will remain mixed in the SCM history, making grep less useful during git-bisect etc.

Comment 8 Jaroslav Škarvada 2015-05-11 17:02:14 UTC
(In reply to Kamil Dudka from comment #7)
> I usually need to grep all C source files, Makefiles, shell scripts etc. for
> some pattern in a project directory recursively while skipping object files
> and the like.  The heuristic used in grep prior to commit cd36abd4 worked
> just fine for me.
> 
> Source files in many SW projects simply are of mixed encoding.  Even if you
> force all projects to unify the encoding now, it will remain mixed in the
> SCM history, making grep less useful during git-bisect etc.

Maybe some new option or environment variable for the "compatibility mode"? I think this discussion should move upstream.

Comment 9 Eric Blake 2015-05-11 17:10:02 UTC
(In reply to Kamil Dudka from comment #7)
> I usually need to grep all C source files, Makefiles, shell scripts etc. for
> some pattern in a project directory recursively while skipping object files
> and the like.  The heuristic used in grep prior to commit cd36abd4 worked
> just fine for me.

But you were getting lucky, and relying on undefined behavior.  The old heuristic was "encoding error? oh well, you get to keep the pieces if it hangs"; the new heuristic is "encoding error? file is binary, so treat it as such". But the point remains that either way, you are relying on undefined behavior.

> 
> Source files in many SW projects simply are of mixed encoding.  Even if you
> force all projects to unify the encoding now, it will remain mixed in the
> SCM history, making grep less useful during git-bisect etc.

Ideally, it would be nice if files had attributes in the metadata stating what encoding the file is in, and if SCMs had metadata that tracked the encoding (and encoding changes, when a file is recoded).  But that's a much bigger project to take on; one that would need cooperation across the entire software stack to bring to fruition.

Comment 10 Arkadiusz Miskiewicz 2015-06-23 12:00:46 UTC
That new behaviour is not that great from user point of view. Old one not correct enough.

So what now, config file for grep with list of (user choosen) locales that should be considered "text" ? A bit crazy idea.

Comment 11 Kamil Dudka 2016-01-05 11:47:32 UTC
upstream commit:

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=85210016

Comment 12 Jaroslav Škarvada 2016-01-05 13:03:11 UTC
(In reply to Kamil Dudka from comment #11)
> upstream commit:
> 
> http://git.savannah.gnu.org/cgit/grep.git/commit/?id=85210016

Thanks for the info.

Comment 13 Fedora Update System 2016-01-05 13:08:01 UTC
grep-2.22-4.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-14956838f8

Comment 14 Fedora Update System 2016-01-05 14:25:48 UTC
grep-2.21-7.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-8a95ebc411

Comment 15 Fedora Update System 2016-01-06 00:24:31 UTC
grep-2.21-7.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-8a95ebc411

Comment 16 Fedora Update System 2016-01-06 00:28:58 UTC
grep-2.22-4.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-14956838f8

Comment 17 Fedora Update System 2016-01-06 17:07:46 UTC
grep-2.22-5.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-313013441f

Comment 18 Fedora Update System 2016-01-06 17:42:13 UTC
grep-2.21-8.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-d41031d2d0

Comment 19 Fedora Update System 2016-01-07 04:53:26 UTC
grep-2.22-5.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-313013441f

Comment 20 Fedora Update System 2016-01-07 05:23:54 UTC
grep-2.21-8.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-d41031d2d0

Comment 21 Fedora Update System 2016-01-07 19:52:07 UTC
grep-2.22-5.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 22 Fedora Update System 2016-01-12 10:21:41 UTC
grep-2.21-9.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-8883879d28

Comment 23 Fedora Update System 2016-01-13 07:22:21 UTC
grep-2.21-9.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-8883879d28

Comment 24 Fedora Update System 2016-01-24 03:18:20 UTC
grep-2.21-9.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.