Let's understand the problem first : In the figure below, the data reflects the exported CRM data to excel. I have highlighted the HTML tags in description column.
Once you install this new Add-In, all you would do is select Description column header (or any column header that contains HTML in its cells), goto Add-Ins tab-> Click "Clean HTML" Button. Once the process is done the program would display a message box to the user.
Image here shows Description column selction and "Clean HTML" button clicking:
Image below shows Description Column without HTML tags:
This is the code that cleans out all the HTML in a string, i.e., it takes the string with HTML tags in it and returns clean text:
public class Html2Text
{
private string _htmlStr = "";
private string _textContents = "";
public string GetContents(string htmlStr)
{
this._htmlStr = htmlStr;
Thread t = new Thread(runInThread);
t.TrySetApartmentState(ApartmentState.STA);
t.Start();
t.Join();
return _textContents;
}
private void runInThread()
{
WebBrowser wb = new WebBrowser();
wb.DocumentText = "Cleaning HTML";
HtmlDocument htmlDoc = wb.Document.OpenNew(true);
htmlDoc.Write(_htmlStr);
HtmlElementCollection htmlColl = htmlDoc.All;
List<string> indvidualTextColl = new List<string>();
for (int i = 0; i < htmlColl.Count; i++)
{
HtmlElement currElement = htmlColl[i];
if(currElement.TagName.ToUpper() == "P" &&
!string.IsNullOrEmpty(currElement.InnerText))
{
indvidualTextColl.Add(currElement.InnerText);
}
}
this._textContents = string.Join("\n", indvidualTextColl.ToArray());
}
}
Now, let's create the Excel AddIn project and use the above class to strip off HTML:
1. Create a project using VS 2010's Excel 2010 Add-In template and name it ExcelAddIn
Note: we are not going to touch ThisAddIn.cs
2. Add a new CS file and add the above Html2Text class code
3. Add New Item -> Ribbon (Visual Designer)
4. Add a button to the Visual designer with the following properties:
Name: button1 , Label: Clean HTML
5. Double click the button to goto Code behind file, add the following code to the click event:
Note: In CRM i have noticed that mostly the data is within <P> tag, so i am extracting the data from within <P> tag.
try
{
Excel.Worksheet activeWS = (Excel.Worksheet)Globals.ThisAddIn.Application.ActiveSheet;
int numOfRows = activeWS.UsedRange.EntireRow.Count;
Excel.Range activeCellRange = Globals.ThisAddIn.Application.ActiveCell;
int currRow = activeCellRange.Row; // getting the number of rows for Description column
int currColl = activeCellRange.Column;
Html2Text obj = new Html2Text(); // instantiate Html2Text object
string htmlStr = "";
string textContents = "";
// looping thru all the rows of Description column and fixing the data
for (int i = 1; i < numOfRows; i++)
{
Excel.Range currCell = (Excel.Range)activeWS.Cells[i + 1, currColl];
//MessageBox.Show(currCell.Value.ToString());
if (currCell.Value != null && currCell.Value.ToString() != "")
{
htmlStr = currCell.Value.ToString();
if (htmlStr.Contains("<htm") || htmlStr.Contains("<HTML") || htmlStr.Contains("<p>") ||
htmlStr.Contains("<P>") || htmlStr.Contains("<!DOCTYPE") ||
htmlStr.Contains("<!doctype") || htmlStr.Contains("<FONT") ||
htmlStr.Contains("<font"))
{ }
else
continue;
}
else
continue;
textContents = obj.GetContents(htmlStr);
currCell.Value2 = textContents;
}
MessageBox.Show("HTML cleaning is done.", "Office Extension");
}
catch (Exception ex)
{
MessageBox.Show("Error: " + ex.Message);
}
6. Now just build this project , VS will generate dll, pdb and a VSTO file. Just double click the VSTO file to install this new Office AddIn
7. Open your Excel 2010 or 2007 , verify the appearance of this new button within Ribbon's AddIn Tab and test its functionality
Note: User has to manually save the changes after running this AddIn, this is done intentionally so that User can verify the data and save.
Hope you found this article useful
-San.
No comments:
Post a Comment