How complex is it to remove JavaScript from HTML (to prevent XSS)?

Hi,
There are a few situations where you want your users to enter plain HTML but do not want them to be able to run JavaScript. The problem seems very simple, but it is very complex.

In this article I will describe the complexity of doing this rather than propose a solution.

Let's start with the ways JavaScript can appear in an HTML document. Here are a few that I am listing.

1. In a script tag
JavaScript typically appears in a script tag, e.g.

<script src="http://www.someurl.com">
</script>

Or


<script>
alert('hi javascript')
</script>

2. In event attributes


The next simple way in which JavaScript can appear in your HTML is as an event handler.


<div onclick="alert('hi javascript')">
</div>

The W3C HTML specification describes the following attributes as intrinsic events, but the actual set depends on the browser implementation:

    onload
    onunload
    onclick
    ondblclick
    onmousedown
    onmouseup
    onmouseover
    onmousemove
    onmouseout
    onfocus
    onblur
    onkeypress
    onkeydown
    onkeyup
    onsubmit
    onreset
    onselect
    onchange
    onerror  (e.g. the img onerror event)
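
Because the set of event attributes varies between browsers, checking against a fixed list is fragile. As a minimal sketch (not a complete sanitizer), the following uses Python's standard html.parser module to report every attribute whose name starts with "on". The class and function names are my own, chosen for illustration.

```python
from html.parser import HTMLParser


class EventAttributeFinder(HTMLParser):
    """Collects (tag, attribute) pairs whose attribute name starts with "on"."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        for name, _value in attrs:
            if name.startswith("on"):
                self.found.append((tag, name))


def find_event_attributes(markup):
    """Return every (tag, attribute) pair that looks like an event handler."""
    finder = EventAttributeFinder()
    finder.feed(markup)
    finder.close()
    return finder.found


sample = '<div ONCLICK="alert(1)"><img src="x.png" onerror="alert(2)"></div>'
print(find_event_attributes(sample))
# [('div', 'onclick'), ('img', 'onerror')]
```

A prefix check like this also catches handlers that are missing from the list above, at the cost of flagging any harmless attribute that happens to start with "on".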



3. In an anchor href

<a href="javascript:alert('hi javascript')">Click Me</a>

4. In the href and src attributes of other tags

In some older browsers (e.g. IE6), the img src attribute can contain JavaScript, e.g.


<img src="javascript:alert('hi javascript')"/>

The same situation can also appear in other forms, such as

<TABLE background="javascript:alert('hi')">

or

<LINK REL="stylesheet" href="javascript:alert('hi');">
<BGSOUND SRC="javascript:alert('hi javascript');">
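
To make the point concrete, here is a rough sketch that walks a document with Python's html.parser and reports URL-carrying attributes whose value starts with javascript:. The URL_ATTRIBUTES set is my own assumption and is deliberately incomplete, which is exactly the difficulty: every attribute a browser treats as a URL would have to be listed.

```python
from html.parser import HTMLParser

# Illustrative and deliberately incomplete: browsers accept URLs in more places.
URL_ATTRIBUTES = {"href", "src", "background"}


class ScriptUrlFinder(HTMLParser):
    """Collects (tag, attribute) pairs that carry a javascript: URL."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in URL_ATTRIBUTES or value is None:
                continue
            # Ignore surrounding whitespace and letter case in the scheme.
            if value.strip().lower().startswith("javascript:"):
                self.found.append((tag, name))


def find_script_urls(markup):
    finder = ScriptUrlFinder()
    finder.feed(markup)
    finder.close()
    return finder.found


sample = (
    "<a href=\"javascript:alert('hi')\">Click Me</a>"
    "<TABLE background=\"javascript:alert('hi')\">"
    '<img src="photo.png">'
)
print(find_script_urls(sample))
# [('a', 'href'), ('table', 'background')]
```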

5. In CSS properties



html{
 background: #28d expression("alert('hi javascript')");
}


In older versions of Internet Explorer, expression() runs JavaScript from within CSS. This can appear anywhere CSS can appear, i.e. in an external stylesheet, a style block, or an inline style attribute.
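
As a small sketch of what checking two of those three places involves, the following looks for expression( in style blocks and inline style attributes using Python's html.parser. The names are hypothetical, and external stylesheets are not covered at all, since they would have to be fetched and scanned separately.

```python
import re
from html.parser import HTMLParser

# Matches "expression(" in any letter case, with optional whitespace.
EXPRESSION = re.compile(r"expression\s*\(", re.IGNORECASE)


class CssExpressionFinder(HTMLParser):
    """Reports where a CSS expression() call shows up in a document."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        for name, value in attrs:
            if name == "style" and value and EXPRESSION.search(value):
                self.found.append("inline style on <%s>" % tag)

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style and EXPRESSION.search(data):
            self.found.append("style block")


def find_css_expressions(markup):
    finder = CssExpressionFinder()
    finder.feed(markup)
    finder.close()
    return finder.found


sample = (
    "<style>html { background: #28d expression(alert(1)); }</style>"
    '<p style="width: EXPRESSION(alert(2))">x</p>'
)
print(find_css_expressions(sample))
# ['style block', 'inline style on <p>']
```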

6. A base64-encoded SVG can contain a script tag

<EMBED SRC=" A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED> 
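
The trouble with this case is that the script tag is invisible until the payload is decoded. The sketch below builds its own small data: URI (a made-up payload, not the one above) and shows that a plain text search finds nothing until the base64 part is decoded. The helper names are hypothetical.

```python
import base64
from urllib.parse import unquote

# A made-up SVG payload for illustration.
svg = '<svg xmlns="http://www.w3.org/2000/svg"><script>alert("XSS")</script></svg>'
encoded = base64.b64encode(svg.encode("utf-8")).decode("ascii")
data_uri = "data:image/svg+xml;base64," + encoded


def contains_script(text):
    """Very naive check: does the text contain an opening script tag?"""
    return "<script" in text.lower()


def decode_data_uri(uri):
    """Return the decoded payload of a data: URI as text."""
    header, _, payload = uri.partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload).decode("utf-8", errors="replace")
    # Data URIs that are not base64 carry percent-encoded text instead.
    return unquote(payload)


print(contains_script(data_uri))                   # False
print(contains_script(decode_data_uri(data_uri)))  # True
```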


7. Encoded forms of any of the above cases

Any of the above cases may also appear in an encoded form, e.g.
<a href="javascript:alert('hi javascript');">
is the same as
<a href="&#x6A;&#x61;&#x76;&#x61;&#x73;&#x63;&#x72;&#x69;&#x70;&#x74;&#x3A;&#x61;&#x6C;&#x65;&#x72;&#x74;&#x28;&#x27;&#x68;&#x69;&#x20;&#x6A;&#x61;&#x76;&#x61;&#x73;&#x63;&#x72;&#x69;&#x70;&#x74;&#x27;&#x29;&#x3B;">

Important notes:
1) Code might appear in upper case, lower case, or any mix of the two, e.g. "javascript:" can be written as JaVaScript:
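
A tiny sketch of how easily a case-sensitive check is bypassed, compared with a case-insensitive regular expression. Both function names are made up for this example.

```python
import re

SCHEME = re.compile(r"javascript:", re.IGNORECASE)


def naive_check(value):
    """Case-sensitive search: only catches the all-lower-case spelling."""
    return "javascript:" in value


def case_insensitive_check(value):
    """Same search, but ignoring letter case."""
    return SCHEME.search(value) is not None


for candidate in ["javascript:alert(1)", "JaVaScript:alert(1)", "JAVASCRIPT:alert(1)"]:
    print(candidate, naive_check(candidate), case_insensitive_check(candidate))
# javascript:alert(1) True True
# JaVaScript:alert(1) False True
# JAVASCRIPT:alert(1) False True
```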

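2) A <script> tag can span multiple lines, because whitespace and line breaks are allowed after the tag name and between attributes:
<script
src="http://www.someurl.com"
>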
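
Searching for the literal text <script> is therefore not enough when a tag is spread over several lines. As a sketch, the following compares a plain substring search with an actual parse using Python's html.parser. The sample markup and names are my own.

```python
from html.parser import HTMLParser


class ScriptTagCounter(HTMLParser):
    """Counts opening script tags, however they are laid out in the source."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so letter case does not matter here.
        if tag == "script":
            self.count += 1


def count_script_tags(markup):
    counter = ScriptTagCounter()
    counter.feed(markup)
    counter.close()
    return counter.count


sample = '<SCRIPT\n  src="http://www.someurl.com"\n>\n</SCRIPT>'

print("<script>" in sample)       # False: the substring search misses it
print(count_script_tags(sample))  # 1: the parser still sees a script tag
```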


3) Since HTML attribute values may or may not be wrapped in quotes, a user-supplied URL like http://www.yahoo.com onmouseover="alert('hi')" might, when placed into an unquoted href, generate code like this
<a href=http://www.yahoo.com onmouseover="alert('hi')"/>  which is wrong: the browser reads onmouseover as a separate attribute.
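
Here is a sketch of that failure, using Python's html.parser to list the attributes a parser actually sees. The helper names are hypothetical. Quoting the value and escaping it with html.escape keeps it inside href, but that alone does not make the URL itself safe, since it could still be a javascript: URL.

```python
import html
from html.parser import HTMLParser


class AttributeLister(HTMLParser):
    """Records the attribute names of every start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.extend(name for name, _value in attrs)


def attribute_names(markup):
    lister = AttributeLister()
    lister.feed(markup)
    lister.close()
    return lister.names


user_url = "http://www.yahoo.com onmouseover=\"alert('hi')\""

# Unquoted: the space ends the href value and the rest becomes a new attribute.
unquoted = "<a href=" + user_url + ">link</a>"

# Quoted and escaped: the whole string stays inside the href value.
quoted = '<a href="' + html.escape(user_url, quote=True) + '">link</a>'

print(attribute_names(unquoted))  # ['href', 'onmouseover']
print(attribute_names(quoted))    # ['href']
```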

4) Useful links
http://coding.smashingmagazine.com/2011/01/11/keeping-web-users-safe-by-sanitizing-input-data/
http://ha.ckers.org/xss.html