Monday, November 21, 2011

Securing the Web, appendix

*(requires id attribute)
<!-- -->
<script>* (only src attribute, no inline script)
<style>* (but only css @ directives allowed inside)

(no html comments, id attributes or inline event handlers)
<area >
<img> (restricted to #fragment refs)


(with src= attributes that can point to #fragments of ManifestML)
*(requires id attribute)
<form>* (restricted to #fragment refs)

<input type="button">
<input type="checkbox">
<input type="color">
<input type="date">
<input type="datetime">
<input type="datetime-local">
<input type="email">
<input type="file">
<input type="hidden">
<input type="image">
<input type="month">
<input type="number">
<input type="password">
<input type="radio">
<input type="range">
<input type="reset">
<input type="search">
<input type="submit">
<input type="tel">
<input type="text">
<input type="time">
<input type="url">
<input type="week">

Thursday, November 17, 2011

Securing the Web

(A bit of a departure from my ironically complicated KissML idea today)

An interesting problem with the web is that its security model is a little bit messed up. The original design of the web didn't anticipate applications that stitch pages together from templates and user-generated fragments, so we've had a history of security holes arising from the complicated ways different web languages can nest inside each other, along with hacky workarounds to close those holes. SQL injection and JavaScript injection are obvious examples of things we web developers try to prevent. My thought is that we should deliberately subset HTML into separate, restricted sublanguages targeted at specific tasks. The two subset languages I am proposing are ManifestML and SemanticML. There should be a third, LayoutML, that defines the overall logical structure of a page, but I don't have a clear idea of what that should be. I'll leave that to the comments.


ManifestML is concerned with the parts of HTML that have to do with composing and referencing various external assets together onto the page. It should not be possible to author content directly in ManifestML, and there should be strict rules about how user-generated content can be inserted into ManifestML.
ManifestML has the following parts:

<doctype> and xml declarations (if necessary)
the <title> and <meta> <html> and <body> tags
xml namespaces (if needed)
the HTML5 AppCache manifest reference
link (stylesheets, rss feeds, alternate versions)
script (but only the src attributes, script shouldn't be allowed inline)
the A tag
IMG tag
body (for containing img and A elements)
textnodes with whitespace only, outside of A elements or Object elements.
canvas tag, with ID, and alternate content within. (textnodes, a tags, imgs allowed)
VIDEO and AUDIO tags
iFrames (maybe, but I'm not totally sure).

id attributes required for all elements.

Tags should be in the order that the browser should load them, not necessarily in semantic order. This is in keeping with my previous Google Plus post about aesthetic website loading. With a manifest file, it is easier to manage the way a page loads.

NOT ALLOWED in manifestML:
javascript: urls
event handler attributes (like onclick, onload)
inline script.
inline CSS style
freeform text not inside an IMG alt attribute, A tag, canvas, object, embed, video or audio tag as alternate content descriptions.
anything else not explicitly mentioned.

All ManifestML documents should be valid HTML5, HTML4, or XHTML 1.0 (not 1.1) documents. A validator program should be written to enforce the content restrictions of this subset, à la JSLint. Properly written, a ManifestML document may very closely mirror the HTML5 AppCache manifest format.
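
To make the validator idea a bit more concrete, here is a minimal sketch in Python using only the standard library's html.parser. Everything named here (ManifestLint, lint_manifest, the ALLOWED_TAGS whitelist) is hypothetical and only loosely transcribed from the lists above. In particular it reads "id attributes required for all elements" literally, leaves out style for simplicity, and does not check that the document is valid HTML.

```python
from html.parser import HTMLParser

# Assumed whitelist, loosely transcribed from the ManifestML parts listed above.
ALLOWED_TAGS = {
    "html", "head", "body", "title", "meta", "link", "script",
    "a", "img", "canvas", "video", "audio", "iframe",
}
# Elements whose text is treated as a label or alternate content (assumption).
TEXT_CONTAINERS = {"title", "a", "canvas", "video", "audio"}
# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "br", "hr", "img", "input", "link", "meta"}


class ManifestLint(HTMLParser):
    """Collects violations of the ManifestML restrictions (sketch only)."""

    def __init__(self):
        super().__init__()
        self.errors = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)
        if tag not in ALLOWED_TAGS:
            self.errors.append(f"<{tag}>: element not allowed")
            return
        names = [name for name, _ in attrs]
        if "id" not in names:
            self.errors.append(f"<{tag}>: missing id attribute")
        for name, value in attrs:
            value = (value or "").strip().lower()
            if name.startswith("on"):
                self.errors.append(f"<{tag}>: event handler attribute {name}")
            elif name == "style":
                self.errors.append(f"<{tag}>: inline CSS style")
            elif name in ("href", "src") and value.startswith("javascript:"):
                self.errors.append(f"<{tag}>: javascript: URL in {name}")

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip():
            return  # whitespace-only text nodes are allowed
        if "script" in self.open_tags:
            self.errors.append("<script>: inline script")
        elif not any(tag in TEXT_CONTAINERS for tag in self.open_tags):
            self.errors.append(f"freeform text: {data.strip()[:30]!r}")


def lint_manifest(source):
    """Return a list of human-readable violations; empty means it passed."""
    parser = ManifestLint()
    parser.feed(source)
    parser.close()
    return parser.errors
```

Calling lint_manifest on a page returns an empty list when nothing was flagged. A real tool would also want line numbers (HTMLParser.getpos() can supply them) and a proper HTML validity check.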


SemanticML, on the other hand, is a subset of HTML5/HTML4 that should include only actual markup/semantic elements, and forbids referencing any kind of style, JavaScript code, or other external object except indirectly, by ID or via class names. Essentially, it is the type of markup you'd expect to be generated by a program like "Markdown" or "Textile", or by a WYSIWYG editor.

Things that are *not* in SemanticML :
Anything in ManifestML (including doctype, head, title, meta, namespaces, style, link, and IMG)
Event handlers, and javascript: urls.
ID attributes (only class attributes, and id references in fragment identifiers in URLs, are allowed).
Inline Style attributes.

Things that /are/ in SemanticML: <A>, and a restricted form of <IMG> whose src is either same-origin or a fragment identifier (referencing an img tag with that #id in a ManifestML file).

Tag soup and random garbage are allowed too, as long as SemanticML can be kept in a secure sandbox that disallows anything except the pure /content/ and /semantic/ parts of HTML.
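
The SemanticML rules above could be checked along these lines. This is only a sketch under assumptions: the names SemanticLint, lint_semantic and is_local_ref are made up for illustration, FORBIDDEN_TAGS is my reading of "anything in ManifestML", and "same origin" is simplified to a host name comparison (a real check would also compare scheme and port).

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed reading of "anything in ManifestML", minus <a> and the restricted <img>.
FORBIDDEN_TAGS = {
    "html", "head", "body", "title", "meta", "link", "style", "script",
    "canvas", "video", "audio", "iframe",
}


def is_local_ref(url, origin_host):
    """True for relative URLs, bare #fragments, or URLs on origin_host."""
    parts = urlsplit(url)
    if not parts.scheme and not parts.netloc:
        return True  # relative path or #fragment
    # Simplification: compares the host only, not scheme or port.
    return parts.scheme in ("http", "https") and parts.hostname == origin_host


class SemanticLint(HTMLParser):
    """Collects violations of the SemanticML restrictions (sketch only)."""

    def __init__(self, origin_host):
        super().__init__()
        self.origin_host = origin_host
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag in FORBIDDEN_TAGS:
            self.errors.append(f"<{tag}>: belongs in ManifestML")
        for name, value in attrs:
            value = (value or "").strip()
            if name == "id":
                self.errors.append(f"<{tag}>: id attribute")
            elif name == "style":
                self.errors.append(f"<{tag}>: inline style attribute")
            elif name.startswith("on"):
                self.errors.append(f"<{tag}>: event handler attribute {name}")
            elif name in ("href", "src") and value.lower().startswith("javascript:"):
                self.errors.append(f"<{tag}>: javascript: URL in {name}")
            elif tag == "img" and name == "src" and not is_local_ref(value, self.origin_host):
                self.errors.append(f"<img>: src is neither same-origin nor a #fragment")


def lint_semantic(fragment, origin_host):
    """Return a list of violations found in a SemanticML fragment."""
    parser = SemanticLint(origin_host)
    parser.feed(fragment)
    parser.close()
    return parser.errors
```

Because html.parser tolerates tag soup, the checker can be run on garbage input without raising; it only reports the constructs it recognises as forbidden.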

Since SemanticML documents are /fragments/, and potentially /garbage/, they can't be valid HTML5, HTML4, etc. But they should have the following two properties: they can be concatenated, or wrapped in a div, with no change in appearance or semantics, and there is a clear strategy for reformatting them to close all unclosed tags, so that they cannot leak out into the larger documents they are composed into. Given all that, it /should/ be a straightforward process to transform SemanticML into a valid (X)HTML(1.0|4|5) document.
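
One possible reformatting strategy, sketched in Python with the standard html.parser: re-serialise the fragment, drop stray closing tags, close whatever is still open at the end, and wrap the result in a div. The names Rebalancer and rebalance are hypothetical, and this is one assumption about what "a clear strategy" could look like, not a complete sanitizer (it does not filter forbidden tags or attributes, and it drops comments).

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {
    "area", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}


class Rebalancer(HTMLParser):
    """Re-serialises a fragment while tracking which tags are still open."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        rendered = []
        for name, value in attrs:
            rendered.append(' %s="%s"' % (name, escape(value or "", quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(rendered)))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag: drop it so it cannot close a wrapper
        # Close everything opened since the matching start tag.
        while True:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def rebalance(fragment):
    """Close all unclosed tags and wrap the fragment in a div."""
    parser = Rebalancer()
    parser.feed(fragment)
    parser.close()
    closing = "".join("</%s>" % tag for tag in reversed(parser.open_tags))
    return "<div>" + "".join(parser.out) + closing + "</div>"


print(rebalance("<b>bold <i>both"))  # <div><b>bold <i>both</i></b></div>
```

Since every fragment comes back balanced and wrapped, two results can be concatenated without one leaking into the other, and a stray </div> in user content is dropped instead of closing the enclosing template.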

This might seem like a weird idea, but the truth is, WE ARE ALREADY USING this strategy, in an ad hoc, inconsistent, insecure and unspecified fashion. My proposal is that we formalise it and form a consistent style around it.