Cleaning Microsoft Word tags from HTML in C#

  •  05-31-2013, 4:17 PM

    I am copying some sample content from Microsoft Word and pasting it into the HTML here: http://cutesoft.net/example/general.aspx

    When I paste, I am asked whether I want to clean up the Microsoft tags. I say yes and I get the following result:

    The result is very clean HTML without the Microsoft Word "mso" tags. So far, so good.

    I am trying to achieve the same result in C# code:

    [TestMethod]
    public void CleanUpMicrosoftWordHTML()
    {
        var source = "<font face=\"Times New Roman\" size=\"3\"></font>";
        source += "<p class=\"MsoNormal\" style=\"margin: 0in 0in 0pt;\"><span lang=\"NL\"><o:p><font face=\"Times New Roman\" size=\"3\">&nbsp;</font></o:p></span></p>";
        source += "<font face=\"Times New Roman\" size=\"3\"></font>";
        source += "<pre style=\"text-indent: -0.25in; margin-left: 0.5in; mso-list: l0 level1 lfo1; tab-stops: list .5in left 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;\">";
        source += "<!--[if !supportLists]--><span class=\"migratedcontentfont1\"><span lang=\"NL\" style=\"font-family: Symbol; font-size: 12pt; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;\">";
        source += "<span style=\"mso-list: Ignore;\">&#183;<span style='font: 7pt/normal \"Times New Roman\"; font-size-adjust: none; font-stretch: normal;'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>";
        source += "</span><!--[endif]-->";
        source += "<span class=\"migratedcontentfont1\"><span lang=\"NL\" style='font-family: \"Times New Roman\",\"serif\"; font-size: 12pt;'>test<o:p></o:p></span></span></pre>";

        var expected = "<pre><!--[if !supportLists]--><span style=\"font-family: Symbol; font-size: 12pt;\"><span>&#183;<span> </span></span></span><!--[endif]--><span style='font-family: \"Times New Roman\",\"serif\"; font-size: 12pt;'>test</span></pre>";

        var result = EditorUtility.CleanUpMicrosoftWordHTML(source);

        Assert.AreEqual(expected, result);
    }

    Unfortunately, the result is not at all what I expected. It contains only text, with every tag stripped: "&nbsp;&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; test".

    So my question is: how can I properly clean up HTML in my C# code?

    By the way, I am running CuteEditor version 6.6, build 2013-04-22.
