[dev] [framework-patch] clean HTML
Francois Marier
francois at nit.ca
Fri Aug 6 11:55:00 PDT 2004
On Fri, Aug 06, 2004 at 02:30:46PM +0200, Jan Schneider wrote:
> Quoting Francois Marier <francois at nit.ca>:
> >This patch fixes the <script> problem by removing what's between the
> >two tags. It also fixes the <style> problem when displaying HTML
> >inline (in non-inline mode, the <style> tags are preserved).
>
> I don't like this approach, the style tag changes are not necessary anyway
> and the script regexps are too weak to catch all cases and too strong to
> catch common cases. This is a cosmetic issue, so we don't want to catch
> each obfuscated version of script tags here. Just take the style cleanup as
> an example and model the script regexps after that.
True, if there are obfuscated tags in the HTML, then it's most likely
spam and then it doesn't matter if it's not displayed correctly.
I've changed the regexp to be as simple as possible in this updated patch.
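
For illustration, here is a minimal sketch of how the simplified pattern
behaves. The pattern and replacement are the ones in the attached patch;
the sample input and the variable name $cleaned are made up for this
example.

```php
<?php
// Hypothetical sample input; only the pattern and replacement come from the patch.
$data = "<p>Hello</p>\n<SCRIPT type=\"text/javascript\">\nalert('x');\n</SCRIPT>\n<p>Bye</p>";

// 'i' makes the match case-insensitive, 's' lets '.' span newlines,
// and the lazy '.*?' stops at the first closing </script> tag.
$cleaned = preg_replace('|<script[^>]*>.*?</script>|is', '<HordeCleaned_script>', $data);

echo $cleaned, "\n";
```

This only handles plainly written tags; obfuscated variants are
deliberately not covered.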
> >Furthermore, I also added a line that strips out all HTML comments
> >(including scripts and styles) if we are displaying inline. Since we
> >cannot allow either script or styles, there is no point in sending
> >this data to the browser.
>
> I'm not sure if I want to trade the additional page size for the additional
> CPU cycles, but I could be convinced. At least you shouldn't need to look for
> whitespace characters, the full stop already matches them. If you intended to
> catch new lines, use the DOTALL modifier /s instead.
Well, I don't know how much slower it would be, I guess it depends on
the speed of the network and the CPU of the server, but my guess is
that if there is a difference then it is pretty small. I just thought
it would be best to refrain from sending useless stuff over the wire,
but feel free to rip this part out of the patch if you don't think
it's worth the effort.
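
As a rough sketch of what the /s modifier changes for the comment
stripping, compare the same pattern with and without it. The sample
input and the names $withoutS and $withS are assumptions for this
example only; the pattern with /s is the one used in the patch.

```php
<?php
// Hypothetical sample input with an HTML comment spanning two lines.
$data = "<p>a</p><!-- line one\nline two --><p>b</p>";

// Without /s the dot does not match "\n", so the multi-line comment survives.
$withoutS = preg_replace('/<!--.*?-->/', '', $data);

// With /s (DOTALL), as in the patch, the whole comment is removed.
$withS = preg_replace('/<!--.*?-->/s', '', $data);

var_dump($withoutS === $data); // bool(true)
echo $withS, "\n";             // <p>a</p><p>b</p>
```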
Francois
-------------- next part --------------
diff -rpuN -X ../ignorelist ../build/framework/MIME/MIME/Viewer/html.php framework/MIME/MIME/Viewer/html.php
--- ../build/framework/MIME/MIME/Viewer/html.php Wed Jul 14 07:30:27 2004
+++ framework/MIME/MIME/Viewer/html.php Fri Aug 6 14:35:52 2004
@@ -68,6 +68,12 @@ class MIME_Viewer_html extends MIME_View
}
}
+ /* Removes HTML comments (including some scripts & styles)
+ * if displaying inline */
+ if (!$attachment) {
+ $data = preg_replace('/<!--.*?-->/s', '', $data);
+ }
+
/* Change space entities to space characters. */
$data = preg_replace('/&#(x0*20|0*32);?/i', ' ', $data);
@@ -123,6 +129,10 @@ class MIME_Viewer_html extends MIME_View
/* Get all on<foo>="bar()". NEVER allow these. */
$data = preg_replace('/(\s+[Oo][Nn]\w+)\s*=/', '\1HordeCleaned=', $data);
+
+ /* Remove all scripts since they might introduce garbage if they
+ * are not quoted properly */
+ $data = preg_replace('|<script[^>]*>.*?</script>|is', '<HordeCleaned_script>', $data);
/* Get all tags that might cause trouble - <object>, <embed>,
* <base>, etc. Meta refreshes and iframes, too. */