
Thursday, June 10, 2010

Simplifying HTML Part 2 of 4

In part one I called out a few of the more awful features of HTML for removal. Some of these removals make KISSML not quite compatible with HTML, and not quite a strict subset in the way that Crockford’s “good parts” Javascript is a subset of the full Javascript language. This was criticised by a coworker of mine, and quite rightly too! Here is my response to that criticism:

Two of my removals, the script tag and the style tag, would on their own leave a strictly compatible subset of HTML, but to mean anything useful they must be enforced in some way. The whole point of removing these tags is to prevent XSS-style attacks. Currently, if you want to eliminate XSS attacks on a specific site, you end up engaging in a rushed kind of language sub-setting exercise. If you are very unwise, you may attempt to use regular expressions to achieve this sub-setting. If you’re doing this impromptu language design exercise, there are two goals you can aim for, but you cannot achieve both simultaneously:

  1. The language must be usable unmodified directly in the browser, without compatibility problems.
  2. The language must be acceptable, unmodified, from user facing inputs.

After trying out both of these goals in various systems, I believe goal 2 is the wiser, more pragmatic choice. (Though please point out if I’ve presented a false dilemma!)

With HTML, to prevent XSS, it is already the case that we must use sanitisers such as HTML Purifier, or pseudo markup languages like Markdown. In both these cases, there is a transformation between what the user inputs and what ultimately gets served to the browser. The programmer must choose between storing the original user input, storing the transformed version, or possibly both. In addition, we already have template languages like PHP in common use, and these are interpreted and transformed before being sent to the browser. I would like to suggest that, since we are already transforming our inputs well before they get to the browser, making slightly incompatible changes to HTML is not all that bad a deal, as long as the result can be transformed back into legacy HTML. The only other situation where it may become difficult is authoring static pages with no server side components and no requirements for user input.
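
As a rough illustration of that transformation step, here is a minimal Python sketch. It assumes the bluntest possible policy (escape everything), which is far cruder than what HTML Purifier or Markdown do, and the function name render_comment is made up for this example.

```python
import html

def render_comment(user_input: str) -> str:
    """Transform raw user input into markup that is safe to serve.

    Hypothetical helper: it escapes every markup character, so nothing
    the user typed can be interpreted as a tag by the browser.
    """
    return "<p>" + html.escape(user_input, quote=True) + "</p>"

stored = 'Nice post! <script>alert("xss")</script>'  # what the user typed
served = render_comment(stored)                      # what the browser receives
print(served)
```

Note that the stored and served forms differ, which is exactly the choice described above: you keep the original, the transformed version, or both.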

I have some ideas for solving all these problems, but they will come in part 4, so bear with me!


What to Generalise and Consolidate: The Incompatible Parts

Preformatted and Literal Text: I’ve already written that I’m removing HTML entities. I also said that, for the most part, using UTF-8 directly takes care of the need for funny characters like en-dashes and vowels with umlauts. However, there’s still one thing UTF-8 can’t quite do. Since the characters <, > and " have special meaning in KISSML, we need a simple way to represent them literally. If we want to display blocks of code in HTML today, we reach for the <pre> or <code> elements, but the < and > characters inside them are still parsed as markup and have to be written as entities; only a handful of elements, such as the obsolete <xmp>, treat their contents as plain text until the matching closing tag is encountered. And what if we want to talk about pre tags inside a pre tag? My solution is that this “interpret my contents as plain text” property should be generalised and applicable to any element in KISSML. I will call this attribute “literal”. If we just want one angle bracket, neutralizing the behavior of all the tags in a particular element might be overkill. For this case, where you want a one-off instance of a special character, we have the element types <lt/>, <gt/> and <q/>. These are plain inline elements with the literal attribute, predefined to contain the text <, > and " respectively.

In addition, there is the blank element type <></>, which by default renders its contents surrounded by quote marks. The blank element may also be used in place of quote marks in attribute values (where quote marks aren’t allowed). Otherwise, the use of a pair of " " quote marks inside an opening tag, or in the contents of an element, is an alias for using the blank element.

I eliminated HTML entities by replacing them with equivalent functionality defined in terms of elements and attributes. In other words, I consolidated entities and elements into one concept. This property of KISSML can be described with a simple notation: entities == elements [e.g. &lt; == <lt/> ]. I also consolidated quote marks, as used in attributes, with a particular kind of element. Thus, "foo" == <>foo</>. Here I will summarise the rest of KISSML’s generalisations in this equation style.
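
To make the entities == elements idea concrete, here is a small Python sketch of a converter. The mapping table and the function name are my own assumptions for illustration; only the three markup-related entities discussed above are handled.

```python
import re

# Hypothetical mapping from HTML entities to the KISSML elements described above.
ENTITY_TO_ELEMENT = {"&lt;": "<lt/>", "&gt;": "<gt/>", "&quot;": "<q/>"}

def entities_to_elements(text: str) -> str:
    """Replace the three markup-related entities with their element form."""
    pattern = re.compile("|".join(re.escape(e) for e in ENTITY_TO_ELEMENT))
    return pattern.sub(lambda m: ENTITY_TO_ELEMENT[m.group(0)], text)

converted = entities_to_elements("if a &lt; b: say &quot;hi&quot;")
print(converted)
```

Everything else that entities are used for today (accents, dashes, snowmen) would simply be typed directly as UTF-8.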


tag name == name == class == id == attribute name == css property name
and also:
element attribute == css property

This one is quite iconoclastic indeed. I’ve never understood why we need 6 drastically different ways to attach name/value sets to elements. KISSML has only one way.
And so the following labyrinth of HTML:

<button name="mylink"><a href="http://example.com" id="mylink" class="buttonimage contentimage"><img src="button.jpg" style="display: block; width:100%;" /></a></button>

May become in KISSML:

<button a img href="http://example.com" src="button.jpg" mylink buttonimage contentimage display="block" width="100%" />

In KISSML, we eliminate the specialness of “tag types” like “a” and “p”. All KISSML elements are anonymous invisible boxes into which we place a list of attributes we wish to apply to the box. We presume the existence of some kind of external “style” language, similar to CSS, that is capable of defining how these attributes affect the way the element is displayed. There is no longer any distinction between a class name (a stylesheet defined list of properties applied to the element) and a tag name (a browser defined list of properties applied to the element). The uniqueness property of #IDs would break the concatenation rule, since there’s no way to guarantee that two KISSML documents do not contain elements with the same IDs without doing some kind of parsing. In any case, I am finding in my work with HTML that I avoid #IDs more and more in favor of class names anyway. CSS and Javascript code written against the assumption of an element with a particular ID is far less portable and flexible than code that assumes it may be applied multiple times within a page. This also fits the no-special-case pattern: the logical consequence of this consolidation is that the DOM methods getElementsByTagName, getElementsByClassName, and getElementById are replaced with a single method, getElements, which returns an array of elements, so the only result case you need to handle is iterating through an array.
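
Here is a minimal Python sketch of what that single lookup might look like. The Element class and the get_elements function are hypothetical stand-ins for a KISSML DOM, which doesn’t exist yet; bare attributes are modelled as names with an empty value.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """A KISSML-style box: just a set of attributes plus child nodes."""
    attributes: dict = field(default_factory=dict)  # name -> value ("" if bare)
    children: list = field(default_factory=list)    # Elements or plain strings

def get_elements(root: Element, name: str) -> list:
    """Return every element at or below root that carries the attribute name."""
    found = [root] if name in root.attributes else []
    for child in root.children:
        if isinstance(child, Element):
            found.extend(get_elements(child, name))
    return found

button = Element({"button": "", "a": "", "img": "", "href": "http://example.com",
                  "src": "button.jpg", "mylink": "", "buttonimage": ""})
page = Element(children=["Click: ", button, Element({"buttonimage": ""})])

matches = get_elements(page, "buttonimage")   # found on two elements
missing = get_elements(page, "nosuchname")    # empty list, not a special case
print(len(matches), len(missing))
```

Whether you ask for “a”, “buttonimage” or “mylink”, the caller always gets a list back and always loops over it.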


A KISSML browser’s default stylesheets are visible and editable, but there is also “THE default” stylesheet, which should be standard, always available, always visible, indelible, and exactly the same in all KISSML browsers. So, the vast universe of markup that needs to be interpreted and displayed the same way by different browsers can be specified in the /one true stylesheet/. The only things the different browsers need to match in native implementation are the relatively few primitive attributes.

DTD == Stylesheet
Doctype Declaration == Stylesheet link
Validator == KISSML-LINT

That default stylesheet in our theoretical style language should also be usable for validation purposes. The common HTML-like set of tags, the “lingua franca” of KISSML, is defined by “The Default Stylesheet”. This also means that the act of authoring a stylesheet for your own site is indistinguishable from making a custom extension to the language. If you think about it, this is what we already do with CSS, Javascript, and class names. This consolidation only acknowledges that fact and makes the behaviour first class.

The default stylesheet, aside from determining the default display behaviour of attributes, should also be able to declare code style rules, which can be enforced by the validator. Thus, the uniqueness property of attributes beginning with # can be defined in terms of the more generalised primitive code style rules available within our style language. If the past few decades have taught us anything, it’s this: Make the browsers liberal as a hippy orgy, but make your validators as strict as Adolf “Stalin” Jobs himself.
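
As a sketch of the kind of rule such a validator could enforce, here is the #-uniqueness check written in Python. In the real design the rule would be declared in the style language itself; the function name and the list-of-dicts document model are assumptions made purely for this example.

```python
from collections import Counter

def lint_unique_hash_names(elements: list) -> list:
    """Report every '#'-prefixed attribute name used by more than one element.

    elements is a list of attribute dicts, one per element, in document order.
    """
    counts = Counter(name for attrs in elements for name in attrs
                     if name.startswith("#"))
    return sorted(name for name, n in counts.items() if n > 1)

doc = [{"#header": "", "bold": ""}, {"#footer": ""}, {"#header": ""}]
warnings = lint_unique_hash_names(doc)
print(warnings)
```

The browser would still happily render the document with two #header elements; only the lint tool complains.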

All that said, let us never fall into the trap of saying “The stylesheet determines what the attributes mean”. Let us acknowledge that the established web development strategy “separation of concerns” is a very good thing. Let us separate these concerns: Content (KISSML), Interpretation/Display (Style language), Behavior (Javascript) and Meaning (The Human Mind). Let us endeavour to avoid mixing these concerns, and let us not be so foolish as to think that a document full of computer code indicates community-wide agreement on the meaning of words, which rightfully should be determined by prose, debate and negotiation.


attribute value == element content == node list

and so:

<img src="example.png" title="here is some <strong>markup</strong> <q>language</q>" > But let’s also get rid of the alt attribute, because the img tag can <em>already</em> contain marked up content! </img>


is valid KISSML, thus eliminating the problems we have run into as web developers due to the fact that in HTML the alt attribute cannot contain HTML. This makes the language more general and powerful, and also repairs the impedance mismatch between XML and JSON that I’ve talked about in previous blog posts. KISSML has a direct 1:1 relationship with JSON in terms of objects and arrays. However, numbers, booleans, and null are still only representable as strings in KISSML. The following 3 examples should result in the same internal “DOM” structure when interpreted by a KISSML browser. The first 2 examples are KISSML, and the third is JSON.

The quick brown <strong>fox</strong> jumped over the lazy <abbr title="Dynamic <em href=<>http://odour.net</> >Odour</em> Generator" >dog</abbr>.

The quick brown <strong="fox" /> jumped over the lazy <abbr="dog" title=<>Dynamic <em href=<>http://odour.net</> >Odour</em> Generator</> />.

["The quick brown ", {"strong":"fox"}, " jumped over the lazy ", {"abbr":"dog", "title":["Dynamic ", {"em":"Odour", "href":"http://odour.net"}, " Generator"]}, "."]

From this comparison, you can see KISSML as being in the same spirit as JSON, while addressing JSON’s weaknesses for representing documents. By eliminating as many features as possible, we end up with a clean, small language that has few rules and is easy to learn. The dictionary of words that you can use in KISSML is observable, editable, and public, and is not part of the core syntax and language, but rather more like a standard library. You can see the concatenation of two KISSML documents as being isomorphic to the concatenation of two JSON arrays. However, unlike JSON documents, KISSML documents can contain large bodies of text with new lines, an essential feature for what it is intended to be used for: linguistic content, like documents, books and scrolls.
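
The concatenation claim is easy to try on the JSON side. This Python sketch joins two small node lists and flattens the result to text; the plain_text helper is hypothetical, and it assumes that the first key of each object is the tag name holding the element’s content, as in the JSON example above.

```python
import json

first = json.loads('["The quick brown ", {"strong": "fox"}, " jumped."]')
second = json.loads('[" Meanwhile, the ", {"abbr": "dog", "title": "canine"}, " slept."]')

# Concatenating two documents is just concatenating two arrays.
combined = first + second

def plain_text(nodes: list) -> str:
    """Flatten a node list to text, assuming an element's first key holds its content."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        else:
            content = next(iter(node.values()))
            parts.append(content if isinstance(content, str) else plain_text(content))
    return "".join(parts)

print(plain_text(combined))
```

No wrapping, no header juggling and no re-parsing was needed to get a valid combined document.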

Simplifying HTML Part 1 of 4

HTML is difficult. It is difficult because there is a mountain of stuff to learn. Not only is there a grand list of tag names, CSS properties, DOM functions and concepts that you need to grasp, but its relationships with other languages (CSS, Javascript, XSLT, DTDs and other validating languages), multimedia importing, complicated APIs like the DOM and CANVAS, cross site security, and other complicated things make mastery of the web a nightmare. And that is not even touching on cross browser incompatibilities. HTML is goddamned difficult.

So how do we go about making this easier?

Douglas Crockford, the legendary senior software engineer who works for Yahoo, advocates a strategy of sub-setting to simplify the Javascript language. He wrote a book, “Javascript: The Good Parts”, in which he documents how he discovered that by taking things out of the language, and ignoring them, he could make Javascript much more powerful, secure and easy to learn. Another side effect is that it becomes much easier to write interpreters for simplified versions of the language. Crockford’s extreme subset of Javascript, JSON, is so easy to learn, and such a powerful concept, that it has spread to have parser implementations in nearly every vaguely useful programming language. I think the same could be done with HTML.

In this series of blog posts I will define a simplified version of HTML that I will call KISSML. I will simplify it by not just removing things from the language, but consolidating, generalising, and humiliating as many special cases and arbitrarily separate concepts as possible. Unfortunately, this effort of mine falls short of the Crockford ideal: not all valid KISSML documents are valid HTML. This is because by simplifying it, I make it more powerful and expressive. In theory, a KISSML to HTML ‘compiler’ might be possible for backward compatibility (until everyone has upgraded to KISSML browsers!). For the purposes of this blog post, I won’t concern myself with the details of how that would work. I realise that redefining and rebuilding HTML from scratch has been attempted (and has failed) many times before. Let me be up front about this: the big nasty complicated HTML5 with all its warts and flash plugins and horrors is not going away for a long long time. Consider this a thought exercise (but if anyone wants to actually implement this, I certainly won’t complain).

I will define KISSML in relation to HTML in terms of:

  • What to remove: (The Bad Parts)
  • What to generalise and consolidate: (The Powerful Parts)
  • What’s left, What it is: (The Good Parts)
  • And its relationship to other technologies: (The New Style)

What to remove: The Bad Parts

So what do I remove? I will start by removing all the different tag types (for now), because it is easier to start with a blank canvas as far as that is concerned. KISSML is an extensible markup syntax, like XML. However, unlike XML, there is no requirement for an outer enclosing “root” element. Removing this requirement means that KISSML can be a true markup language in the original sense of the term. KISSML is a markup language in a way that XML and HTML cannot be. The immediate practical advantage is that *this* very paragraph counts as a valid KISSML document. Without having to modify it, wrap it, add headers, or parse it, this is KISSML. Multiple KISSML documents can be concatenated directly, with no special processing. The result of concatenating two valid KISSML documents is a new valid KISSML document. You can’t do that with HTML or XML, and yet it is a task that must be done constantly. Vast numbers of web developers are living in sin! Much like banning sex or alcohol, those who would forbid naked HTML from being considered valid seem to misunderstand something fundamental about how people actually behave. It is a goal to consider *most* HTML fragments, as produced in the previous two links, as valid KISSML.
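
You can watch XML refuse both of these cases using Python’s standard library parser. This is only a quick check of the XML side of the claim; the helper name is made up, and there is of course no KISSML parser to compare against.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text: str) -> bool:
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

doc_a = "<p>First document.</p>"
doc_b = "<p>Second document.</p>"

single = is_well_formed_xml(doc_a)                              # one root: accepted
joined = is_well_formed_xml(doc_a + doc_b)                      # two roots: rejected
naked = is_well_formed_xml("Just a naked paragraph of text.")   # no root: rejected
print(single, joined, naked)
```

Under the concatenation rule, all three inputs would count as valid KISSML.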

On that note, <head> and <meta> need to go too. I already said I was getting rid of all the different tag types (for now), but unlike some of the others, these ones aren’t coming back. Their existence is a contradiction. They are meant to define “metadata” and yet here they are inside the data. It doesn’t make sense. We have learned through trial by fire, again and again since the web was created, that trying to hide information in an HTML document is stupid. If you can’t see it, it may as well not exist. Users can’t see it, search engines don’t look at it, developers typically ignore it or avoid it. Browsers ignore most of it. Hence, it seems to me <head> and <meta> are almost completely pointless. The few things that meta tags *do* have an effect on could be achieved through better methods. <title> is visible, sort of, but there can only be one. In a multiply concatenated document, which <title> do you choose? This will be a theme: anything that prevents the concatenation rule from working is deleted from KISSML.

HTML entities. In HTML and XML, in order to insert a special character, you must use an & followed by some special name, followed by a ;, as in: “Bill &amp; Ted&apos;s Excellent Adventure”. Not only does this look ugly, but it also leads to two of the most frequent mistakes made in web development land. The first is using the ampersand & character without encoding it into an entity, like this: “Bill & Ted”, a mistake that leads to an invalid document and breaks parsing software not prepared for the situation. The other mistake is made by software developers and spec authors who do not specify what their software expects from a blob of text. As a result, there is a confusion of entity encoded HTML, plain text, and non encoded HTML that gets dumped into attributes and text fields without rhyme or reason. This is a particular problem for RSS, which leaves it up to software to decide whether elements contain encoded HTML or plain text! Really, most of what HTML entities are used for should be done with UTF-8 instead. Which leads us to...
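
Both mistakes are easy to reproduce. The Python sketch below uses the standard library’s XML parser and html.escape; the parses helper and the variable names are invented for this example.

```python
import html
import xml.etree.ElementTree as ET

title = "Bill & Ted's Excellent Adventure"

def parses(fragment: str) -> bool:
    """Return True if the fragment is well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# Mistake 1: a bare ampersand makes the document invalid.
raw_ok = parses("<title>" + title + "</title>")
escaped = html.escape(title, quote=False)
escaped_ok = parses("<title>" + escaped + "</title>")

# Mistake 2: nobody says whether the text is already encoded, so it gets encoded twice.
double = html.escape(escaped, quote=False)
print(raw_ok, escaped_ok, double)
```

Decoding the doubly encoded string once gives you back “Bill &amp; Ted” rather than “Bill & Ted”, which is the mess that shows up in feeds and text fields.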

Encodings other than UTF-8 need to die. I mean that as politely as possible. Quite simply, I’m sick of seeing text encoding mix-ups, like apostrophes being turned into euros, and such.

Not only does the existence of a myriad of different text encodings make files difficult to parse and display, it makes client/server interaction difficult too. AJAX in IE fails when it encounters a server that proclaims an encoding IE doesn’t recognise. Things get sticky when a page is served with one encoding, but the server requires form posts in another. Only UTF-8 should be used from now on, and browsers should assume they are receiving UTF-8. That way, if things break, the vast flowchart for troubleshooting text encoding issues is reduced to just one question: Did you use UTF-8? If no, use UTF-8. If yes, someone else failed to use UTF-8. Why UTF-8? Because we’re moving in that direction anyway, and UTF-8 can encode every character in Unicode, which has room for over a million code points. UTF-8 is good, and you can use it to represent funny characters like snowmen, and umlauts.
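
The apostrophe-to-euro mix-up can be reproduced in a few lines of Python. This sketch assumes the classic failure case: text written as UTF-8 but read back by software that assumes Windows-1252.

```python
text = "It\u2019s"                 # "It’s" with a curly apostrophe (U+2019)
raw = text.encode("utf-8")         # the bytes that actually get sent
misread = raw.decode("cp1252")     # a receiver wrongly assuming Windows-1252
roundtrip = raw.decode("utf-8")    # a receiver that assumes UTF-8

print(misread)
print(roundtrip)
```

The curly apostrophe is three bytes in UTF-8, and Windows-1252 reads the middle one as a euro sign. If both ends simply assume UTF-8, the round trip is lossless.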

Frames: Better arguments than I can come up with have been made elsewhere. Needless to say, frames need to go, but not without being replaced with something better, because the USE CASE for frames still exists. It’s just that frames are a bad solution to that use case.

The Script Tag: Surely there must be a better, more secure way of making a web-page scriptable! Remember that we’re expecting users of our sites to enter content in forms. We then take that user entered content, and display it on our sites with full privileges and abilities. The existence of the script tag, or any other way to modify the browser behaviour in the markup language itself makes securing these forms incredibly difficult. Markup should be, quite simply, markup and nothing else. Otherwise, XSS exploits ahoy!

The Style Tag: For symmetry with the elimination of the script tag, let us affirm that we shouldn’t be mixing these powerful languages in with the markup, because once you’ve spilled oil in the ocean, it’s really really hard to get it out again.

TO BE CONTINUED IN PART 2: THE POWERFUL PARTS