Monday, September 12, 2011

Strip Out HTML from Exported CRM data to Excel

    In CRM,  the exported data can have html tags in them for eg: Description field in Email Activities when exported to Excel will have whole lot of HTML tags that it would be almost impossible to clean them manually and extract the actual data. To solve this issue i have written an Add-In for Excel that would do this job.
Let's understand the problem first : In the figure below, the data reflects the exported CRM data to excel. I have highlighted the HTML tags in description column.


Once you install this new Add-In, all you would do is select Description column header (or any column header that contains HTML in its cells), goto Add-Ins tab-> Click "Clean HTML" Button. Once the process is done the program would display a message box to the user.
Image here shows Description column selction and "Clean HTML" button clicking:



Image below shows Description Column without HTML tags:



This is the code that cleans out all the HTML in a string, i.e., it takes the string with HTML tags in it and returns clean text:

public class Html2Text
    {
       private string _htmlStr = "";
       private string _textContents = "";
  
       public string GetContents(string htmlStr)
       {
           this._htmlStr = htmlStr;
           Thread t = new Thread(runInThread);
           t.TrySetApartmentState(ApartmentState.STA);
           t.Start();
           t.Join();
           return _textContents;
       }
   
       private void runInThread()
       {
           WebBrowser wb = new WebBrowser();
           wb.DocumentText = "Cleaning HTML";
           HtmlDocument htmlDoc = wb.Document.OpenNew(true);
           htmlDoc.Write(_htmlStr);
    
           HtmlElementCollection htmlColl =  htmlDoc.All;
           List<string> indvidualTextColl = new List<string>();
   
           for (int i = 0; i < htmlColl.Count; i++)
           {
               HtmlElement currElement =  htmlColl[i];
               if(currElement.TagName.ToUpper() == "P" &&
                               !string.IsNullOrEmpty(currElement.InnerText))
               {
                   indvidualTextColl.Add(currElement.InnerText);
               }
           }
    
           this._textContents = string.Join("\n", indvidualTextColl.ToArray());
       }
    }


Now, let's create the Excel AddIn project and use the above class to strip off HTML:
1. Create a project using VS 2010's Excel 2010 Add-In template and name it ExcelAddIn
Note: we are not going to touch ThisAddIn.cs
2. Add a new CS file and add the above Html2Text class code
3. Add New Item -> Ribbon (Visual Designer)
4. Add a button to the Visual designer with the following properties:
Name:  button1 , Label: Clean HTML
5. Double click the button to goto Code behind file, add the following code to the click event:
Note: In CRM i have noticed that mostly the data is within <P> tag, so i am extracting the data from within <P> tag.

try
  {
     Excel.Worksheet activeWS = (Excel.Worksheet)Globals.ThisAddIn.Application.ActiveSheet;

     int numOfRows = activeWS.UsedRange.EntireRow.Count;
     
     Excel.Range activeCellRange = Globals.ThisAddIn.Application.ActiveCell;
     int currRow = activeCellRange.Row;  // getting the number of rows for Description column
     int currColl = activeCellRange.Column;
     Html2Text obj = new Html2Text();  // instantiate Html2Text object 
     string htmlStr = "";
     string textContents = ""; 
     
     // looping thru all the rows of Description column and fixing the data
     for (int i = 1; i < numOfRows; i++)
     {
         Excel.Range currCell = (Excel.Range)activeWS.Cells[i + 1, currColl];
         //MessageBox.Show(currCell.Value.ToString());                
           if (currCell.Value != null && currCell.Value.ToString() != "")
            {
                  htmlStr = currCell.Value.ToString();
                  if (htmlStr.Contains("<htm") || htmlStr.Contains("<HTML") || htmlStr.Contains("<p>") ||          
                          htmlStr.Contains("<P>") || htmlStr.Contains("<!DOCTYPE") ||
                            htmlStr.Contains("<!doctype") || htmlStr.Contains("<FONT") || 
                             htmlStr.Contains("<font"))
                        { }
                   else
                        continue;
                }
            else
                 continue;
   
            textContents = obj.GetContents(htmlStr);
            currCell.Value2 = textContents;
       }
     
       MessageBox.Show("HTML cleaning is done.", "Office Extension");
   }

catch (Exception ex)
   {
        MessageBox.Show("Error: " + ex.Message);
    }

6. Now just build this project , VS will generate dll, pdb and a VSTO file. Just double click the VSTO file to install this new Office AddIn
7. Open your Excel 2010 or 2007 , verify the appearance of this new button within Ribbon's AddIn Tab and test its functionality

Note: User has to manually save the changes after running this AddIn, this is done intentionally so that User can verify the data and save.
Hope you found this article useful
-San.

No comments:

Post a Comment