If I print a text file which has a copyright symbol in it (decimal character
169) but is otherwise real ascii, to a printer configured to use the hpijs
"Generic PCL5e Printer Foomatic" driver (the recommended driver), what comes out
of the printer is a blank page instead of the contents of the text file.
If I remove the copyright symbol, the file prints fine.
There is nothing in the cups error_log to explain what goes wrong. However,
when I invoke texttopaps by hand, I see this message on stderr: "(null): Invalid
character in input". Note that despite displaying this message, texttopaps still
produces a PostScript file containing the contents of the text file, although
the line containing the copyright character is omitted.
I believe this wasn't a problem in earlier cups versions.
As a point of information, a2ps has no trouble with this file.
I have cups-1.2.2-13 and foomatic-3.0.2-37.
"copyright symbol in it (decimal character 169)" tells me that your file is in
some sort of ISO-8859-x encoding or other.
Did you print the file using
'lp -o document-format=text/plain;charset=iso-8859-whatever', or did you
instead set your locale correctly in the environment you submitted the job from?
It's not in any encoding, as far as I know. It's a simple plain-text file, not
exactly ASCII but rather the bastardized ASCII that Microsoft inflicted upon us
all when they started putting weird quotation marks, (R), (C), TM, etc. in files
that they were calling text. If there's an iso-8859-x encoding corresponding to
this, I don't know what it is.
Cups used to print files containing this character without any trouble. I
haven't changed anything; it's cups that has changed.
Furthermore, a2ps prints this file without any problem, and I'm calling a2ps
from the same locale that I'm calling lpr from.
In short, I don't really care whether it's "correct" that this file has a
high-bit character in it which isn't, strictly speaking, an ASCII character.
The reality is that text files have such characters in them all the time
nowadays, cups should be able to print them rather than choking on them, and it
used to do so and doesn't any longer.
A quick test shows that the (C) symbol is indeed the single byte 0xa9 in
iso-8859-1, so that's probably the encoding your document is in. (Yes, it's
certainly in an encoding! All plain text is, even if that encoding is ASCII.)
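That mapping is easy to check from a shell; a one-liner like the following
(assuming iconv and od are available) feeds the UTF-8 copyright sign through
iconv and dumps the resulting byte:

```shell
# The copyright sign is the two bytes 0xc2 0xa9 in UTF-8; converting it to
# iso-8859-1 should yield the single byte 0xa9.
printf '\302\251' | iconv -f utf-8 -t iso-8859-1 | od -An -tx1
# expected byte: a9
```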
The texttops filter shipped with upstream CUPS cannot handle UTF-8 at all, but
by chance it happens to handle your copyright character. Put this in context:
the default encoding for the entire distribution is UTF-8, yet the print
spooler wouldn't accept the vast majority of UTF-8 characters.
We now ship a text->PS filter based on paps and this allows us to generate
correct output for UTF-8 text files. http://paps.sourceforge.net/ has an
example of this.
So the solution is for you to either
a) convert your document to UTF-8 using
iconv -f iso-8859-1 -t utf-8 < in > out
or b) print your document with a 'document-format=text/plain;charset=iso-8859-1'
option so that the text filter has a sporting chance of knowing what encoding
you expect it to be using.
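A concrete version of the two fixes might look like this; the file names are
placeholders, and the lp invocations are shown as comments since they depend on
your queue:

```shell
# a) Convert the document to UTF-8 before submitting it.
printf '\251 1996 Example Corp.\n' > in.txt       # 0xa9 = (C) in iso-8859-1
iconv -f iso-8859-1 -t utf-8 < in.txt > out.txt   # out.txt now starts 0xc2 0xa9
# then print the converted copy, e.g.:
#   lp out.txt

# b) Or leave the file alone and declare its charset to CUPS so the text
#    filter can do the conversion itself (quote the option so the shell does
#    not split it at the semicolon):
#   lp -o 'document-format=text/plain;charset=iso-8859-1' in.txt
```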
I'm going to reassign this bug to the paps component, because I think that, if
possible, it would be much better if the individual bytes not understood could
be omitted from the line (perhaps with '?' or the box character substituted),
rather than the entire line being missed out.
I think you might have misunderstood part of my bug. When I print the file
through cups with lpr, it doesn't print *at all*. The printer spits out a blank
page. When I call texttopaps directly, the resulting postscript file is missing
the line with the copyright symbol, as I previously mentioned. As you point
out, throwing away the entire line is worse than just throwing away the
misunderstood symbol, but even that would be better than what cups is doing
now, i.e., throwing away the entire file.
I don't know why cups generates a blank page even though texttopaps is
generating a postscript file with content in it.
Incidentally, I still think this is at its root a cups bug and that's where it
should be addressed. In my mind, sort of by definition, something that worked
before and doesn't anymore is a bug.
Perhaps the paps encoder needs to be smarter about guessing encodings, such that
it could reasonably look at an input file, guess that it's iso-8859-1, and then
print it appropriately. There's *lots* of software that guesses encodings
rather successfully when no encoding is specified.
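For what it's worth, the stock file(1) utility already makes this kind of
guess. A sketch (option spelling is GNU/BSD file's; older versions may differ):

```shell
# Write a line that is valid iso-8859-1 but NOT valid UTF-8 (the lone 0xa9
# byte has no lead byte), then ask file(1) to guess the encoding.
printf 'Copyright \251 1996\n' > sample.txt
file -b --mime-encoding sample.txt   # typically reports iso-8859-1
```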
(In reply to comment #3)
> I'm going to reassign this bug to the paps component, because I think that, if
> possible, it would be much better if the individual bytes not understood could
> be omitted from the line (perhaps with '?' or the box character substituted),
> rather than the entire line being missed out.
I don't think that's possible, since Pango itself needs UTF-8 strings, and paps
of course expects UTF-8 strings as well. So I'd suggest solution b). Otherwise
this issue becomes a feature request to guess the encodings. I'm sure it's
quite hard to support all the encodings that way, but anyway.
BTW Tim, does CUPS assume that an error happened when something is output to
stderr?
(In reply to comment #4)
> I think you might have misunderstood part of my bug. When I print the file
> through cups with lpr, it doesn't print *at all*. The printer spits out a blank
> page. When I call texttopaps directly, the resulting postscript file is missing
> the line with the copyright symbol, as I previously mentioned.
I'm still not sure why CUPS doesn't print out the incomplete PS file to the
printer.
paps gives up parsing a file as soon as any invalid character appears. That may
be useful when paps is used as a standalone tool, since it may well be handed
binary files, and printing those raw could generate a huge PS file that prints
out nothing but garbage. When working as a CUPS filter, however, we can assume
that CUPS invokes texttopaps only for text/plain files, so I can get rid of
this limitation for that case alone.
> Perhaps the paps encoder needs to be smarter about guessing encodings, such that
> it could reasonably look at an input file, guess that it's iso-8859-1, and then
> print it appropriately. there's *lots* of software that guesses encodings
> rather successfully when no encoding is specified.
There is no software that guesses encodings perfectly for every encoding. It
would be ideal if that were possible, but I've often seen even emacs misguess
an encoding.
The problem is that paps gives up on the whole file if there is an encoding
error detected by iconv. The iconv code gets executed even when the input file
is (expected to be) UTF-8 because of the CHARSET code in the paps-cups patch.
I've modified the paps-cups patch to avoid ever calling iconv_open("UTF-8",
"UTF-8"), and I get much better results now.
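The shape of that fix, sketched here as shell rather than the actual C patch
(the function name is illustrative, not from the patch): only convert when the
job charset actually differs from UTF-8, so iconv never sees the
UTF-8-to-UTF-8 case that made it reject the file.

```shell
to_utf8() {
    # CUPS passes the job charset in $CHARSET; default to utf-8 if unset.
    cs=$(printf '%s' "${CHARSET:-utf-8}" | tr 'A-Z' 'a-z')
    if [ "$cs" = "utf-8" ]; then
        cat "$1"                     # already UTF-8: never round-trip via iconv
    else
        iconv -f "$cs" -t utf-8 "$1" # convert everything else to UTF-8
    fi
}
```

With CHARSET=iso-8859-1 the 0xa9 byte comes out as the UTF-8 sequence 0xc2
0xa9; with CHARSET=utf-8 the file passes through untouched.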
paps-0.6.6-15 is the fixed package.
Ah, thanks for tracking this down, Tim. I forgot to set CHARSET at all during
testing; that's why I didn't see it. ;)