Bug 677095

Summary: Double ampersand in HTML portions of emails
Product: [Fedora] Fedora EPEL
Reporter: Matěj Cepl <mcepl>
Component: rss2email
Assignee: Orphan Owner <extras-orphan>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: el5
CC: bugs.michael, extras-orphan, lindsey.smith, mcepl, pertusus
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-05-14 08:26:14 UTC
Attachments:
Description | Flags
example email with double ampersands | none
testcase | none
sample email message | none
not reproduced with F-14 | none
~/.rss2email/config.py in question | none
debug patch against rss2email 2.70 | none

Description Matěj Cepl 2011-02-13 10:26:41 UTC
Created attachment 478449 [details]
example email with double ampersands

Description of problem:
With this configuration
DEFAULT_FROM="rss2email"
FORCE_FROM = 1
HTML_MAIL=1
SMTP_SEND=0
TRUST_GUID = 1
DATE_HEADER=1
USE_PUBLISHER_EMAIL = 0
UNICODE_SNOB=1

I am getting emails in which every ampersand in the HTML portion of the message is doubled:

escend from the sky, like a Magritte painting. It is all very charming an=
d surreal and doesn&&#8217;t make any sense, except in an advantages and =
disadvantages of vermiculation for life, in a space-time worm sort of sen=

which results in a line like this in Thunderbird:

descend from the sky, like a Magritte painting. It is all very charming and surreal and doesn&’t make any sense, except in an advantages and disadvantages of vermiculation for life, in a space-time worm sort of sense,
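The rendering above is consistent with how an HTML-aware parser treats a doubled ampersand: the first & cannot start a character reference, so it is kept literally, while &#8217; is decoded to a right single quotation mark. A minimal sketch using the standard html module of Python 3 (an illustration only; it is not Thunderbird's actual parser):

```python
import html

# Fragment as it appears in the HTML part of the broken mail.
broken = "doesn&&#8217;t"
# Fragment as it appears in the original feed.
expected = "doesn&#8217;t"

# html.unescape keeps the stray "&" and decodes the numeric reference.
print(html.unescape(broken))    # doesn&’t
print(html.unescape(expected))  # doesn’t
```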

Version-Release number of selected component (if applicable):
rss2email-2.60-3.el5
python-feedparser-4.1-10.el5
python-html2text-2.26-2.el5


How reproducible:
100%

Steps to Reproduce:
1. Let rss2email run with the above configuration.
2. Read the resulting emails in an HTML-aware email client.
  
Actual results:
Every & is doubled, resulting in corrupted text in the email client.

Expected results:
the entity should be just &#8217; in the above example.
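One way to check a message body for this symptom is to undo the quoted-printable encoding and search for an & that is immediately followed by a complete character reference. The following is a rough Python 3 sketch using only the standard library; the function name find_doubled_ampersands and the sample bytes are illustrative assumptions, not part of rss2email.

```python
import quopri
import re

# "&" immediately followed by a complete character reference, e.g. "&&#8217;".
DOUBLED_AMP = re.compile(
    r"&(?=&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)"
)

def find_doubled_ampersands(qp_body: bytes) -> list:
    """Return the offsets of doubled ampersands in a quoted-printable body."""
    # Undo quoted-printable encoding, including "=" soft line breaks.
    text = quopri.decodestring(qp_body).decode("utf-8", errors="replace")
    return [m.start() for m in DOUBLED_AMP.finditer(text)]

# Sample modelled on the excerpt quoted in the description.
sample = (
    b"It is all very charming an=\n"
    b"d surreal and doesn&&#8217;t make any sense"
)
print(find_doubled_ampersands(sample))
```

An empty list means no doubled ampersand was found; a plain "&&" that is not followed by an entity is deliberately ignored.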

Additional info:

Comment 1 Matěj Cepl 2011-02-13 10:29:45 UTC
The original RSS 2.0 feed (http://crookedtimber.org/feed/) has this in the particular part of the text:

Then time starts again, and 314 Dapper Men descend from the sky, like a Magritte painting. It is all very charming and surreal and doesn&#8217;t make any sense, except in an advantages and disadvantages of vermiculation for life, in a space-time worm sort of sense, sense.

So, I guess the fault is 100% somewhere between rss2email and my email client.

Comment 2 Matěj Cepl 2011-04-02 18:04:49 UTC
Created attachment 489606 [details]
testcase

Actually, this is one more piece of evidence that the issue is in rss2email and not in html2text (the other is that I have HTML_MAIL=1 in my configuration, so the HTML part is presumably untouched by html2text). When running html2text on the attached HTML snippet I get correct text:

bradford:~ $ python /usr/lib/python2.7/site-packages/html2text.py <testcase.html 
Then time starts again, and 314 Dapper Men descend from the sky, like a
Magritte painting. It is all very charming and surreal and doesn't make any
sense, except in an advantages and disadvantages of vermiculation for life, in
a space-time worm sort of sense, sense.

bradford:~ $

Comment 3 Matěj Cepl 2011-04-04 07:15:14 UTC
Created attachment 489707 [details]
sample email message

Still reproducible with rss2email-2.70-2.el6.noarch, python-html2text-2.38-1.el6.noarch, and python-feedparser-4.1-10.el6.noarch on the entry for http://crookedtimber.org/2011/04/02/connick-v-thompson/. feedparser gives (IMHO correct) content as

>>> print f.entries[0]['content']
[{'base': 'http://crookedtimber.org/feed/', 'type': 'text/html', 'value': u'<p>J.K. Galbraith remarked that conservatism was engaged in a long search for a superior moral justification for selfishness. But that quest may sometimes become boring, or perhaps too difficult. Not to worry, because <a href="http://www.slate.com/id/2290036/">occasions to be straightforwardly vicious</a> are more easily found, if you have the taste for it. Its spiteful tone aside, in substance <em>Connick v. Thompson</em> seems to be a <a href="http://www.kieranhealy.org/blog/archives/2003/04/17/state-sponsored-terror/">Lord Denning Moment</a> for the U.S. Supreme Court. The conservative majority preferred to affirm an obvious wrong rather than face the <a href="http://en.wikipedia.org/wiki/Birmingham_Six#Charges_against_police_and_prison_officers">appalling vista</a> of a brutal and corrupt justice system. To be fair to the system, it&#8217;s worse than that. Once the initial wrongdoing came to trial a jury, the district court, and the 5th circuit (twice) all decided the other way. It&#8217;s only when we get to Thomas, Scalia, Roberts, Alito, and Kennedy that the system chose to <a href="http://prospect.org/cs/articles?article=the_impunity_of_the_roberts_court">further institutionalize prosecutorial immunity</a>. Stitch-ups should be seamless: if someone could pull at a stray thread, the whole thing might unravel, after all.</p>', 'language': None}]
>>> 

(see "It&#8217;s"), but the message has (in the HTML part, thus I assume untouched by html2text)

(line 104) it&&#8217;s
(line 106) It&&#8217;s=

which seems to point to a bug in rss2email itself.
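The comparison between the feed content and the HTML part of the delivered mail can also be done programmatically. The sketch below, in Python 3 with only the standard email package, pulls out every text/html part of a raw message and checks it for "&&#". The helper name html_parts and the demo message are hypothetical stand-ins for the attached sample mail.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def html_parts(raw_message: bytes) -> list:
    """Return the decoded text of every text/html part in a raw message."""
    msg = email.message_from_bytes(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable or base64 transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return parts

# Hypothetical two-part message standing in for the attached sample mail.
demo = MIMEMultipart("alternative")
demo.attach(MIMEText("It's worse than that.", "plain", "utf-8"))
demo.attach(MIMEText("<p>it&&#8217;s worse than that.</p>", "html", "utf-8"))

feed_value = "<p>it&#8217;s worse than that.</p>"  # as reported by feedparser
for body in html_parts(demo.as_bytes()):
    print("feed has '&&#':", "&&#" in feed_value,
          "| mail has '&&#':", "&&#" in body)
```

If the feed value is clean and the mail is not, the doubling happened somewhere after feedparser, which could be rss2email or any later hop.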

Comment 5 Matěj Cepl 2011-04-04 10:22:16 UTC
The same goes for this example (from http://notreligious.typepad.com/notreligious/2011/04/why-we-needed-rules-based-faith-and-why-we-need-to-move-past-it-charles-park.html). feedparser shows

 Harvey Cox laments how the Roman Empire co-opted Christianity and pushed the \u2018Age of Belief\u2019 onto us.\xa0 He looks 

email has

 Harvey=
 Cox laments how the Roman Empire co-opted Christianity and pushed the =E2=
=80=98Age of Belief=E2=80=99 onto us.&&#0160; He looks
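Apart from the extra ampersand, the two representations above describe the same characters: =E2=80=98 and =E2=80=99 are the quoted-printable UTF-8 bytes of \u2018 and \u2019, and &#0160; is the decimal character reference for \xa0 (160 = 0xA0). A small Python 3 check with the standard library, using a fragment modelled on the excerpt above:

```python
import html
import quopri

# Fragment modelled on the mail excerpt, including the soft line break.
qp = b"the =E2=\n=80=98Age of Belief=E2=80=99 onto us.&&#0160; He looks"

# Undo quoted-printable: the quotes become U+2018 and U+2019.
decoded = quopri.decodestring(qp).decode("utf-8")
print(repr(decoded))

# Decode character references: "&#0160;" becomes U+00A0, the stray "&" stays.
print(repr(html.unescape(decoded)))
```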

Comment 6 Michael Schwendt 2011-04-04 11:38:07 UTC
Re: comment 3
Cannot reproduce with Fedora 14:

  $ rpm -q rss2email python-feedparser python-html2text
  rss2email-2.70-1.fc14.noarch
  python-feedparser-4.1-12.fc14.noarch
  python-html2text-2.38-2.1.noarch

I also don't get two-part mails but just either text/html (for HTML_MAIL=1) or text/plain (for HTML_MAIL=0). They are not quoted-printable encoded either; that happens only rarely, and not for this example feed.

Comment 7 Matěj Cepl 2011-04-04 14:01:01 UTC
(In reply to comment #6)
> Re: comment 3
> Cannot reproduce with Fedora 14:

With the configuration in comment 0?

Comment 8 Michael Schwendt 2011-04-04 15:06:58 UTC
Created attachment 489791 [details]
not reproduced with F-14

Sure, same config. Output mail attached. What would I need to change to get a two-part MIME mail like you get? Can you rule out that one of your MTAs does that?
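To compare the shape of two mails (single-part versus two-part, and which transfer encoding each part uses), the MIME structure can be listed with the standard email package. This is a Python 3 sketch; describe_structure is a hypothetical helper and the demo message is made up for illustration.

```python
import email
from email.mime.text import MIMEText

def describe_structure(raw_message: bytes) -> list:
    """List (content type, transfer encoding) for each part of a message."""
    msg = email.message_from_bytes(raw_message)
    return [
        (part.get_content_type(),
         part.get("Content-Transfer-Encoding", "(none)"))
        for part in msg.walk()
    ]

# Hypothetical single-part HTML mail, similar in shape to what HTML_MAIL=1
# is reported to produce on Fedora 14.
single = MIMEText("<p>It&#8217;s fine.</p>", "html", "us-ascii")
for content_type, encoding in describe_structure(single.as_bytes()):
    print(content_type, encoding)
```

A multipart/alternative entry followed by text/plain and text/html entries would indicate a two-part mail like the one in the report.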

Comment 9 Matěj Cepl 2011-04-04 16:11:05 UTC
Created attachment 489803 [details]
~/.rss2email/config.py in question

(In reply to comment #8)
> Created attachment 489791 [details]
> not reproduced with F-14
> 
> Sure, same config. Output mail attached. What would I need to change to get a
> two-part MIME mail like you get? Can you rule out that one of your MTAs does
> that?

I have no idea how else it would be done. I always thought it was generated by rss2email.

Well, there is

luther.ceplovi.cz ... that's RHEL 6 with Postfix postfix-2.6.6-2.el6.i686
d1080.master.cz ... that's hosting's CentOS 5 with postfix-2.3.3-2.1.el5_2
smtp-out3.iol.cz ... most likely postfix as well at my ISP's site, but no
    clue about that (I don't have shell access there)
antivir5.iol.cz ... ditto
port3.iol.cz ... ditto
and back to luther.

d1080.master.cz postconf -n is:
-bash-3.2$ /usr/sbin/postconf -n
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
debug_peer_level = 2
html_directory = no
inet_interfaces = all
local_recipient_maps = unix:passwd.byname $alias_maps
mail_owner = postfix
mailbox_command = /usr/bin/procmail -f- -a "$USER"
mailbox_size_limit = 0
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
mydestination = $myhostname, localhost.$mydomain, localhost
mydomain = p-lab.cz
myhostname = d1080.master.cz
myorigin = $myhostname
newaliases_path = /usr/bin/newaliases.postfix
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.3.3/README_FILES
sample_directory = /usr/share/doc/postfix-2.3.3/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
smtp_generic_maps = hash:/etc/mail/generic
smtpd_recipient_restrictions = permit_sasl_authenticated, permit_mynetworks, reject_unauth_destination, reject_unlisted_recipient, reject_unverified_recipient
smtpd_sasl_auth_enable = yes
smtpd_sender_restrictions = permit_sasl_authenticated
unknown_local_recipient_reject_code = 550
virtual_alias_domains = /etc/mail/local-host-names
virtual_alias_maps = hash:/etc/mail/virtusertable
-bash-3.2$ 

The same for luther.ceplovi.cz:
bradford:tmp $ /usr/sbin/postconf -n
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
biff = no
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
data_directory = /var/lib/postfix
debug_peer_level = 4
default_destination_concurrency_limit = 200
default_destination_recipient_limit = 1000
html_directory = no
inet_interfaces = loopback-only
inet_protocols = ipv4
mail_owner = postfix
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
mydestination = $myhostname, localhost.$mydomain, localhost
newaliases_path = /usr/bin/newaliases.postfix
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.8.2/README_FILES
recipient_delimiter = +
relayhost = smtp.o2isp.cz
sample_directory = /usr/share/doc/postfix-2.8.2/samples
sender_canonical_maps = hash:/etc/postfix/sender_canonical
sender_dependent_relayhost_maps = hash:/etc/postfix/sender_relayhost
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_password
smtp_sasl_security_options = 
transport_maps = hash:/etc/postfix/transport
unknown_local_recipient_reject_code = 550
virtual_alias_maps = hash:/etc/postfix/virtual
bradford:tmp $

Comment 10 Lindsey Smith 2011-04-04 16:26:33 UTC
Project maintainer here. I was unable to reproduce this with the latest build (v2.71) on Ubuntu using the provided config. Note that builds from the project site use feedparser v5.x, which is vastly superior but new enough that distros may not have caught up with it yet.

Comment 11 Michael Schwendt 2011-04-04 16:38:02 UTC
Created attachment 489804 [details]
debug patch against rss2email 2.70

The attached patch for rss2email 2.70's /usr/share/rss2email/rss2email.py prints the message content type and content to stdout before passing it on to sendmail.

One of the attached example mails says
  X-Mailer: Zarafa 7.0.0-24874
  X-Original-To: matej
so clearly there is some forwarding and reprocessing involved.
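Headers like these can be collected from a raw message to see whether something other than rss2email has handled it. The following is a Python 3 sketch with the standard email package; reprocessing_hints is a hypothetical helper, and in the demo message only the X-Mailer and X-Original-To values are taken from the report, everything else is invented.

```python
import email

HEADERS_OF_INTEREST = ("X-Mailer", "X-Original-To")

def reprocessing_hints(raw_message: bytes) -> dict:
    """Collect headers that hint at forwarding or reprocessing."""
    msg = email.message_from_bytes(raw_message)
    hints = {name: msg.get(name) for name in HEADERS_OF_INTEREST}
    # Each relaying MTA normally adds one Received header.
    hints["Received hops"] = len(msg.get_all("Received", []))
    return hints

# Hypothetical message; the hosts and dates are placeholders.
raw = (
    b"Received: from a.example by b.example; Mon, 4 Apr 2011 09:00:00 +0200\r\n"
    b"Received: from localhost by a.example; Mon, 4 Apr 2011 08:59:59 +0200\r\n"
    b"X-Mailer: Zarafa 7.0.0-24874\r\n"
    b"X-Original-To: matej\r\n"
    b"Subject: demo\r\n"
    b"\r\n"
    b"body\r\n"
)
print(reprocessing_hints(raw))
```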

Comment 12 Lindsey Smith 2011-04-04 16:56:27 UTC
I also have access to a RHEL5 box and I was unable to reproduce using v2.71 of the original project release.

Comment 13 Matěj Cepl 2011-04-05 12:46:30 UTC
(In reply to comment #11)
> One of the attached example mails says
>   X-Mailer: Zarafa 7.0.0-24874
>   X-Original-To: matej

Oh, right. I will check with Zarafa folks.

Comment 14 Matěj Cepl 2011-05-14 08:26:14 UTC
I believe this really was a bug in Zarafa 7beta*, and it has already been fixed in 7RC1.