I am copying some sample content from Microsoft Word and pasting it into the HTML here: http://cutesoft.net/example/general.aspx
When I paste, I am asked whether I want to clean up the Microsoft tags. I say yes, and the result is very clean HTML without the Microsoft Word "mso" tags. So far, so good.
I am trying to achieve the same result in C# code:
    [TestMethod]
    public void CleanUpMicrosoftWordHTML()
    {
        var source = "<font face=\"Times New Roman\" size=\"3\"></font>";
        source += "<p class=\"MsoNormal\" style=\"margin: 0in 0in 0pt;\"><span lang=\"NL\"><o:p><font face=\"Times New Roman\" size=\"3\"> </font></o:p></span></p>";
        source += "<font face=\"Times New Roman\" size=\"3\"></font>";
        source += "<pre style=\"text-indent: -0.25in; margin-left: 0.5in; mso-list: l0 level1 lfo1; tab-stops: list .5in left 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;\">";
        source += "<!--[if !supportLists]--><span class=\"migratedcontentfont1\"><span lang=\"NL\" style=\"font-family: Symbol; font-size: 12pt; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;\">";
        source += "<span style=\"mso-list: Ignore;\">·<span style='font: 7pt/normal \"Times New Roman\"; font-size-adjust: none; font-stretch: normal;'> </span></span></span>";
        source += "</span><!--[endif]-->";
        source += "<span class=\"migratedcontentfont1\"><span lang=\"NL\" style='font-family: \"Times New Roman\",\"serif\"; font-size: 12pt;'>test<o:p></o:p></span></span></pre>";

        var expected = "<pre><!--[if !supportLists]--><span style=\"font-family: Symbol; font-size: 12pt;\"><span>·<span> </span></span></span><!--[endif]--><span style='font-family: \"Times New Roman\",\"serif\"; font-size: 12pt;'>test</span></pre>";

        var result = EditorUtility.CleanUpMicrosoftWordHTML(source);

        Assert.AreEqual(expected, result);
    }
Unfortunately the result is not at all what was expected. The result contains: " · test".
So my question is: how can I properly clean up HTML in my C# code?
By the way, I am running CuteEditor version 6.6, Build 2013-04-22.