[dev] IMP and html2text filter called from lib/MIME/Viewer/plain.php

Sat Nov 17 06:02:53 UTC 2007

On Fri, 16 Nov 2007, Chuck Hagenbuch wrote:

> Quoting Chris Stromsoe <cbs at cts.ucla.edu>:
>
>> I'm noticing horrible load times for messages with large plain text 
>> attachments that have long lines.  I'm displaying them in-line.  The 
>> plain viewer calls html2text with parselevel set to TEXT_HTML_MICRO. 
>> The regular expression that converts text to links is chewing CPU 
>> (> 100s display time for a 30-line, 400k attachment).
>
> But not when the lines aren't long? Interesting. Wonder if we can come 
> up with a clever way to normalize how much text we look at, perhaps 
> using wordwrap or strpos... would you create a ticket on 
> http://bugs.horde.org/, including a sample message if you can?

The lines are long because the text file is DNA sequences (a total of 43 
"words" in the 400k attachment).  The way that the regexp for 
framework/Text_Filter/Filter/linkurls.php in getPatterns() is written,

     |([\w+]+)://([^\s"<]*[\w+#?/&=])|e

it ends up chewing memory and CPU looking for something valid.  An 
alternate fix would be changing it to something like

     /(^|\s)(ftp|http|mailto|news):\/\/([^\s"<]*[\w+#?\/&=])/e

using a white-listed set of prefixes that either start a line or have 
preceding whitespace.
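
For illustration, here is a minimal sketch comparing the two patterns. It is 
written in Python rather than PHP so that it runs standalone. The patterns are 
the two quoted above with the PHP delimiters and the /e modifier dropped. The 
helper names (find_urls, make_sequence_text) and the sample input are made up 
for the example, and Python's re engine is not PCRE, so the timings only 
indicate the trend. The likely cause of the slowdown is that the open-ended 
[\w+]+ prefix can start matching at every character of a long word, scans to 
the end of the word each time, and then fails to find "://", which is roughly 
quadratic in the word length. The whitelisted pattern gives up almost 
immediately at any position that is not a line start or whitespace followed by 
one of the listed schemes.

```python
import re
import time

# Python translations of the two patterns from the message. re.ASCII keeps \w
# limited to ASCII, which is closer to PCRE without the /u modifier.
ORIGINAL = re.compile(r'([\w+]+)://([^\s"<]*[\w+#?/&=])', re.ASCII)
WHITELIST = re.compile(
    r'(^|\s)(ftp|http|mailto|news)://([^\s"<]*[\w+#?/&=])', re.ASCII
)


def find_urls(pattern, text):
    """Return the matched URLs (hypothetical helper, not part of Horde).

    The whitelist pattern also captures the whitespace before the URL,
    so each match is stripped before it is returned.
    """
    return [m.group(0).strip() for m in pattern.finditer(text)]


def make_sequence_text(words=4, word_len=1000):
    """Build a stand-in for the DNA attachment: a few very long 'words'."""
    return " ".join("ACGT" * (word_len // 4) for _ in range(words))


def time_search(pattern, text):
    """Run one full scan and return (matches, elapsed seconds)."""
    start = time.perf_counter()
    matches = find_urls(pattern, text)
    return matches, time.perf_counter() - start


if __name__ == "__main__":
    text = make_sequence_text()
    for name, pattern in (("original", ORIGINAL), ("whitelist", WHITELIST)):
        matches, elapsed = time_search(pattern, text)
        print(f"{name}: {len(matches)} matches in {elapsed:.4f}s")
```

As written, the whitelist only covers the four listed schemes, so https:// 
URLs (and anything not preceded by whitespace) would no longer be linked 
unless the alternation is extended.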

I can create a ticket, or a ticket + patch for linkurls.php to whitelist 
URL prefixes if that would work better.

>> Is there any downside to changing TEXT_HTML_MICRO to TEXT_HTML_NOHTML 
>> in imp/lib/MIME/Viewer/plain.php ?  Mostly I'm concerned about 
>> side effects other than not turning URLs into links in plain-text 
>> attachments, which is acceptable.
>
> There should be no other side effects.

OK.  That should work fine as a short-term fix, then.

-Chris