Thursday, June 10, 2010

Simplifying HTML Part 1 of 4

HTML is difficult. It is difficult because there is a mountain of stuff to learn. Not only is there a grand list of tag names, CSS properties, DOM functions and concepts that you need to grasp, but its relationships with other languages (CSS, JavaScript, XSLT, DTDs and other validation languages), multimedia embedding, complicated APIs like the DOM and canvas, cross-site security, and other complicated things make mastery of the web a nightmare. And that is not even touching on cross-browser incompatibilities. HTML is goddamned difficult.

So how do we go about making this easier?

Douglas Crockford, the legendary senior software engineer who works for Yahoo, advocates a strategy of subsetting to simplify the JavaScript language. In his book “JavaScript: The Good Parts” he documents how he discovered that by taking things out of the language, and ignoring them, he could make JavaScript much more powerful, secure and easy to learn. Another side effect is that it becomes much easier to write interpreters for simplified versions of the language. Crockford’s extreme subset of JavaScript, JSON, is so easy to learn, and such a powerful concept, that it has spread to have parser implementations in nearly every vaguely useful programming language. I think the same could be done with HTML.

In this series of blog posts I will define a simplified version of HTML that I will call KISSML. I will simplify it by not just removing things from the language, but consolidating, generalising, and humiliating as many special cases and arbitrarily separate concepts as possible. Unfortunately, this effort of mine falls short of the Crockford ideal: KISSML is not a strict subset of HTML, so not every valid KISSML document is valid HTML. This is because by simplifying it, I make it more powerful and expressive. In theory, a KISSML to HTML ‘compiler’ might be possible for backward compatibility (until everyone has upgraded to KISSML browsers!). For the purposes of this blog post, I won’t concern myself with the details of how that would work. I realise that redefining and rebuilding HTML from scratch has been attempted (and has failed) many times before. Let me be up front about this: the big nasty complicated HTML5 with all its warts and Flash plugins and horrors is not going away for a long, long time. Consider this a thought exercise (but if anyone wants to actually implement this, I certainly won’t complain).

I will define KISSML in relation to HTML in terms of:

  • What to remove: (The Bad Parts)
  • What to generalise and consolidate: (The Powerful Parts)
  • What’s left, What it is: (The Good Parts)
  • And its relationship to other technologies: (The New Style)

What to remove: The Bad Parts

So what do I remove? I will start by removing all the different tag types (for now) because it is easier to start with a blank canvas as far as that is concerned. KISSML is an extensible markup syntax, like XML. However, unlike XML, there is no requirement for an outer enclosing “root” element. Removing this requirement means that KISSML can be a true markup language in the original sense of the term. KISSML is a markup language in a way that XML and HTML cannot be. The immediate practical advantage is that *this* very paragraph counts as a valid KISSML document. Without having to modify it, wrap it, add headers, or parse it, this is KISSML. Multiple KISSML documents can be concatenated directly, with no special processing. The result of concatenating two valid KISSML documents is a new valid KISSML document. You can’t do that with HTML or XML, and yet it is a task that must be done constantly. Vast numbers of web developers are living in sin! Much like banning sex or alcohol, those who would forbid naked HTML from being considered valid seem to misunderstand something fundamental about how people actually behave. It is a goal to consider *most* HTML fragments, as produced in the previous two links, as valid KISSML.
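To make the concatenation problem concrete, here is a minimal sketch in Python using the standard library's strict XML parser. The helper name parses_as_xml, the sample fragments and the "wrapper" element are my own inventions for illustration; nothing here is a KISSML implementation. It only shows what a strict XML parser does today with concatenated documents and with a naked paragraph.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts `text` as a single document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


doc_a = "<p>First fragment.</p>"
doc_b = "<p>Second fragment.</p>"

# Each fragment is a well-formed XML document on its own.
each_ok = parses_as_xml(doc_a) and parses_as_xml(doc_b)

# Glued together there are two root elements, which XML forbids.
concatenated_ok = parses_as_xml(doc_a + doc_b)

# The usual workaround: invent a wrapper element (hypothetical name).
wrapped_ok = parses_as_xml("<wrapper>" + doc_a + doc_b + "</wrapper>")

# A naked paragraph with no tags at all is rejected too.
plain_ok = parses_as_xml("Just a paragraph of text.")

print(each_ok, concatenated_ok, wrapped_ok, plain_ok)
```

Under the concatenation rule, a KISSML parser would have to accept all four inputs, with no wrapper needed.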

On that note, <head> and <meta> need to go too. I already said I was getting rid of all the different tag types (for now), but these ones aren’t coming back, unlike some of the others. Their existence is a contradiction. They are meant to define “metadata” and yet here they are inside the data. It doesn’t make sense. We have learned through trial by fire, again and again, since the web was created that trying to hide information in an HTML document is stupid. If you can’t see it, it may as well not exist. Users can’t see it, search engines don’t look at it, developers typically ignore it or avoid it. Browsers ignore most of it. Hence, it seems to me <head> and <meta> are almost completely pointless. The few things that meta tags *do* have an effect on could be achieved through better methods. <title> is visible, sort of, but there can only be one. In a multiply concatenated document, which <title> do you choose? This will be a theme: anything that prevents the concatenation rule from working is deleted from KISSML.

HTML entities. In HTML and XML, in order to insert a special character, you must use the & followed by some special name, followed by a ;, as in: “Bill &amp; Ted&apos;s Excellent Adventure”. Not only does this look ugly, but it also leads to two of the most frequent mistakes made in web development land. The first is using the ampersand & character without encoding it into an entity, like this: “Bill & Ted”, a mistake that leads to an invalid document and breaks parsing software not prepared for the situation. The other mistake is made by software developers and spec authors who do not specify what their software expects from a blob of text. As a result, there is a confusion of entity-encoded HTML, plain text, and non-encoded HTML that gets dumped into attributes and text fields without rhyme or reason. This is a particular problem for RSS, which leaves it up to software to decide whether elements contain encoded HTML or plain text! Really, most of what HTML entities are used for should be done with UTF-8 instead. Which leads us to...
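Both mistakes are easy to reproduce. The following is a small sketch using Python's standard html module and its strict XML parser; the variable names and the <title> wrapper are just my illustration. It shows a bare ampersand breaking a strict parser, and what happens when software entity-encodes text that was already encoded because nobody said which form to expect.

```python
import html
import xml.etree.ElementTree as ET

title = "Bill & Ted's Excellent Adventure"

# Encoding once gives markup-safe text. Python's html.escape writes the
# apostrophe as the numeric reference &#x27; rather than &apos;.
encoded_once = html.escape(title)

# Encoding text that is already encoded yields "&amp;amp;": the classic
# double-encoding bug when nobody specifies what a field contains.
encoded_twice = html.escape(encoded_once)

# Decoding peels off exactly one layer of entities.
decoded = html.unescape("Bill &amp; Ted&apos;s Excellent Adventure")

# Mistake number one: a bare ampersand makes a strict XML parser give up.
try:
    ET.fromstring("<title>Bill & Ted</title>")
    bare_ampersand_parses = True
except ET.ParseError:
    bare_ampersand_parses = False

print(encoded_once)
print(encoded_twice)
print(decoded)
print(bare_ampersand_parses)
```

A consumer that receives encoded_twice cannot tell, from the text alone, whether it should decode zero, one or two times, which is exactly the RSS predicament.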

Encodings other than UTF-8 need to die. I mean that as politely as possible. Quite simply, I’m sick of seeing text encoding muckups, like apostrophes being turned into euros, and such.

Not only does the existence of a myriad of different text encodings make files difficult to parse and display, it makes client/server interaction difficult too. AJAX in IE fails when it encounters a server that proclaims an encoding IE doesn’t recognise. Things get sticky when a page is served with one encoding, but the server requires form posts in another. Only UTF-8 should be used from now on, and browsers should assume they are receiving UTF-8. That way, if things break, the vast flowchart for troubleshooting text encoding issues is reduced to just one question: Did you use UTF-8? If no, use UTF-8. If yes, someone else failed to use UTF-8. Why UTF-8? Because we’re moving in that direction anyway, and UTF-8 can represent every character in Unicode, which has room for over a million code points. UTF-8 is good, and you can use it to represent funny characters like snowmen and umlauts.
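The apostrophe-to-euro muckup is simple to reproduce. Here is a minimal Python sketch, assuming the sender wrote UTF-8 and the receiver guessed Windows-1252 (cp1252), which is one common way this goes wrong. The is_utf8 helper is a name I made up to stand in for the one-question flowchart.

```python
def is_utf8(data: bytes) -> bool:
    """The whole troubleshooting flowchart: did you use UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


apostrophe = "\u2019"  # the curly apostrophe that word processors insert

# UTF-8 stores it as three bytes: E2 80 99.
utf8_bytes = apostrophe.encode("utf-8")

# A receiver that wrongly assumes Windows-1252 sees three separate
# characters, the middle one being the euro sign.
misread = utf8_bytes.decode("cp1252")

# Snowmen and umlauts survive a UTF-8 round trip untouched.
funny = "\u2603 and \u00fc"  # a snowman and a u with an umlaut
round_trip = funny.encode("utf-8").decode("utf-8")

# Bytes written in another encoding (here Latin-1) fail the one question.
latin1_ok = is_utf8("\u00fc".encode("latin-1"))

print(misread)
print(round_trip == funny, is_utf8(utf8_bytes), latin1_ok)
```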

Frames: Better arguments than I can come up with have been made elsewhere. Needless to say, frames need to go, but not without being replaced with something better, because the USE CASE for frames still exists. It’s just that frames are a bad solution to that use case.

The Script Tag: Surely there must be a better, more secure way of making a web page scriptable! Remember that we’re expecting users of our sites to enter content in forms. We then take that user-entered content, and display it on our sites with full privileges and abilities. The existence of the script tag, or any other way to modify the browser’s behaviour from within the markup language itself, makes securing these forms incredibly difficult. Markup should be, quite simply, markup and nothing else. Otherwise, XSS exploits ahoy!
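For anyone who hasn't been bitten yet, here is a rough Python sketch of the problem. The two render functions, the "comment" class name and the hostile input are hypothetical; real sites use templating systems, but the failure has the same shape. As long as markup can carry script, every place that echoes user content has to remember to escape it.

```python
import html


def render_comment_naive(comment: str) -> str:
    """Paste user content straight into the page: the XSS-prone version."""
    return '<div class="comment">' + comment + "</div>"


def render_comment_escaped(comment: str) -> str:
    """Escape user content so the browser treats it as text, not markup."""
    return '<div class="comment">' + html.escape(comment) + "</div>"


# Hypothetical hostile input typed into a comment form.
hostile = "<script>alert('gotcha')</script>"

naive_page = render_comment_naive(hostile)
safe_page = render_comment_escaped(hostile)

# The naive page now contains a live script element.
print("<script>" in naive_page)

# The escaped page only contains inert text.
print("<script>" in safe_page, "&lt;script&gt;" in safe_page)
```

If the markup language had no way to carry script at all, forgetting the escaping step would garble the display instead of handing over the user's session.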

The Style Tag: For symmetry with the elimination of the script tag, let us affirm that we shouldn’t be mixing these powerful languages in with the markup, because once you’ve spilled oil in the ocean, it’s really, really hard to get it out again.

TO BE CONTINUED IN PART 2: THE POWERFUL PARTS
