When I looked at the answer to Level 15 of http://escape.alf.nu, I noticed that
<!--<script> causes the DOM parser to break. In the following HTML you won't see the string "Test" (tested on IE 11, Firefox and Chrome):
<!DOCTYPE HTML> <html> <body> <script> var a="<!--<script>"; </script> <p>Test</p> </body> </html>
But these two scripts will show “Test”:
<!DOCTYPE HTML> <html> <body> <script> var a="<!--"; </script> <p>Test</p> </body> </html>
<!DOCTYPE HTML> <html> <body> <script> var a="<script>"; </script> <p>Test</p> </body> </html>
I don’t understand, why does this happen?
This raises the important point that the text inside of
This code is not valid HTML5 syntax, so there is nothing in the HTML5 specification that would give us a clue about what is going on here. To be specific, there are two issues:
- There is a <script> tag without a closing </script> tag.
- There is an opening <!-- without a closing -->. (see restrictions for contents of script elements)
Both of these problems will put a browser's HTML parser into an error-handling mode, which means it is trying to make sense of invalid syntax. What browsers do when trying to make sense of invalid syntax is undefined behavior, which technically means that anything can happen (such as nasal demons). The de facto behavior here seems to be that browsers agree on how they handle this case, but it is undefined behavior nonetheless.
For whatever reason, this combination of syntax issues next to each other causes browsers to ignore the text later in the document.
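One way to picture why the later text disappears is a small three-state model of how a parser might decide where a script element ends. This is only a sketch: the state names are borrowed from the "script data escaped" and "script data double escaped" tokenizer states in the HTML Standard, while `findScriptEnd` and the simplification to three states are assumptions made for illustration. It ignores edge cases such as `<!-->` and is not how any browser is actually implemented.

```javascript
// Simplified model (an assumption, not a browser implementation).
// Input: the document text that follows an opening <script> tag.
// Output: index of the </script> that closes the element, or -1 if none does.
function findScriptEnd(afterOpenTag) {
  const token = /<!--|-->|<\/script[\t\n\f \/>]|<script[\t\n\f \/>]/gi;
  let state = "plain"; // "plain" | "escaped" | "doubleEscaped"
  let m;
  while ((m = token.exec(afterOpenTag)) !== null) {
    const t = m[0].toLowerCase();
    if (t === "<!--") {
      // A comment opener only matters when we are not already escaped.
      if (state === "plain") state = "escaped";
    } else if (t === "-->") {
      // A comment closer always returns to the plain state.
      state = "plain";
    } else if (t.startsWith("</script")) {
      if (state === "doubleEscaped") {
        // Pairs with the inner <script>; the element stays open.
        state = "escaped";
      } else {
        return m.index; // this </script> really closes the element
      }
    } else if (state === "escaped") {
      // "<script" seen after "<!--": enter the double-escaped state.
      state = "doubleEscaped";
    }
  }
  return -1; // ran out of input: everything was swallowed as script text
}

const broken = ' var a="<!--<script>"; </script> <p>Test</p> </body> </html>';
const commentOnly = ' var a="<!--"; </script> <p>Test</p> </body> </html>';
const tagOnly = ' var a="<script>"; </script> <p>Test</p> </body> </html>';

console.log(findScriptEnd(broken));      // -1
console.log(findScriptEnd(commentOnly)); // position of the </script>
console.log(findScriptEnd(tagOnly));     // position of the </script>
```

Under this model, the first document never closes its script element, so `<p>Test</p>` ends up as script text rather than markup, while the other two documents close the script at the expected `</script>`.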
EDIT: I have identified how the parsing error is produced by stepping through this part of the HTML5 spec.
The text content of the script (excluding whitespace) is
var a="<!--<script>";
This must match the following grammar rule:
data1 *( escape [ script-start data3 ] "-->" data1 ) [ escape ]
We can begin parsing the text content by matching data1, which has the following rule:
data1 = < any string that doesn't contain a substring that matches not-data1 >
not-data1 = "<!--"
That is, the string var a=" matches the data1 production. It ends there because the next part is <!--, which data1 cannot contain.
For there to be any text afterwards in the script, it must match the escape production, which is as follows:
escape = "<!--" data2 *( script-start data3 script-end data2 )
Let's match the next part of the text. So far we have
data1   var a="
escape  <!--
data2   ???
Now nothing can be contained in data2, because the data2 production prohibits the substring <script> (i.e. a script-start):
data2 = < any string that doesn't contain a substring that matches not-data2 >
not-data2 = script-start / "-->"
The lexer cannot proceed with valid steps according to the grammar, so the browser must now go into error processing.
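The grammar walk above can be sketched as a small scanner. This is a minimal sketch, assuming the productions quoted in this answer plus the spec's definitions of script-start and script-end as `<script` or `</script` followed by a tab, line feed, form feed, space, `/` or `>`. The name `checkScriptContent` and the returned object shape are made up for illustration.

```javascript
// Sketch of the script-content grammar as a three-state scanner.
// data1: outside any "<!--"; data2: after "<!--"; data3: after a script-start.
const SCRIPT_START = /<script[\t\n\f \/>]/i;
const SCRIPT_END = /<\/script[\t\n\f \/>]/i;

function checkScriptContent(text) {
  let state = "data1";
  let pos = 0;
  while (true) {
    const rest = text.slice(pos);
    if (state === "data1") {
      // data1 runs until the first "<!--", which starts an escape.
      const open = rest.indexOf("<!--");
      if (open === -1) return { valid: true, endState: state };
      pos += open + 4;
      state = "data2";
      continue;
    }
    // data2 stops at script-start or "-->"; data3 stops at script-end or "-->".
    const tag = (state === "data2" ? SCRIPT_START : SCRIPT_END).exec(rest);
    const tagAt = tag ? tag.index : -1;
    const close = rest.indexOf("-->");
    if (tagAt === -1 && close === -1) {
      // Ending inside data2 is the optional trailing escape (allowed);
      // ending inside data3 leaves a script-start unmatched (not allowed).
      return { valid: state === "data2", endState: state };
    }
    if (close !== -1 && (tagAt === -1 || close < tagAt)) {
      pos += close + 3; // "-->" ends the escape and returns to data1
      state = "data1";
    } else {
      pos += tagAt + tag[0].length;
      state = state === "data2" ? "data3" : "data2";
    }
  }
}

console.log(checkScriptContent('var a="<!--<script>";')); // valid: false
console.log(checkScriptContent('var a="<!--";'));         // valid: true
console.log(checkScriptContent('var a="<script>";'));     // valid: true
```

Only the combination of `<!--` followed by `<script>` fails the check, which matches the three test documents in the question.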
It'll be some assumption being violated in the internal mechanism.
There’s not much point trying to rationalise about this stuff.
You wrote invalid HTML, so anything can happen.