Thursday, November 17, 2011

Securing the Web

(A bit of a departure from my ironically complicated KissML idea today)

An interesting problem with the web is that its security model is a little bit messed up. The original design of the web didn't anticipate applications that stitch pages together from templates and user-generated fragments, so we've had a history of security holes caused by the complicated ways different web languages can nest inside each other, and of hacky workarounds to close those holes. SQL injection and JavaScript injection are the obvious examples of things we web developers try to prevent.

My thought is that we should deliberately subset HTML into separate, restricted sublanguages, each targeted at a specific task. The two subset languages I am proposing are ManifestML and SemanticML. There should be a third, LayoutML, that defines the overall logical structure of a page. I don't have a clear idea of what that should be, though, so I'll leave that to the comments.


ManifestML is concerned with the parts of HTML that compose and reference external assets on the page. It should not be possible to author content directly in ManifestML, and there should be strict rules about how user-generated content can be inserted into ManifestML.
ManifestML has the following parts:

the <!DOCTYPE> and XML declarations (if necessary)
the <html>, <title>, <meta> and <body> tags
XML namespaces (if needed)
the HTML5 AppCache manifest reference
<link> (stylesheets, RSS feeds, alternate versions)
<script> (but only with a src attribute; inline script shouldn't be allowed)
the <a> tag
the <img> tag
<body> (for containing <img> and <a> elements)
text nodes containing only whitespace, outside of <a> or <object> elements
the <canvas> tag, with an id, and alternate content within (text nodes, <a> tags and <img> tags allowed)
<video> and <audio> tags
<iframe> (maybe, but I'm not totally sure)

An id attribute is required on every element.

Tags should be in the order that the browser should load them, not necessarily in semantic order. This follows on from my previous Google Plus post about aesthetic website loading. With a manifest file, it is easier to manage the way a page loads.

NOT ALLOWED in ManifestML:
javascript: URLs
event handler attributes (like onclick, onload)
inline script
inline CSS styles
freeform text, except as alternate content in an <img> alt attribute or inside an <a>, <canvas>, <object>, <embed>, <video> or <audio> tag
anything else not explicitly mentioned

All ManifestML documents should be valid HTML5, HTML4, or XHTML 1.0 (not 1.1) documents. A validator program, à la JSLint, should be written to enforce the content restrictions of this subset. Properly written, a ManifestML document may very closely resemble the HTML5 AppCache manifest format.
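
To make the validator idea a bit more concrete, here is a minimal sketch in Python, using only the standard library's html.parser. Everything named in it (ALLOWED_TAGS, ManifestMLChecker, check_manifestml) is my own invention for illustration, not an existing tool. I've also assumed that <head> belongs in the allow-list, and the javascript: check is a naive prefix test that a real verifier would need to harden.

```python
from html.parser import HTMLParser

# Hypothetical allow-list read off the ManifestML lists above.
# <head> is an assumption: the post lists title/meta but not head itself.
ALLOWED_TAGS = {
    "html", "head", "title", "meta", "body", "link", "script",
    "a", "img", "canvas", "video", "audio", "iframe",
}
# Elements whose text counts as a title or as alternate content.
TEXT_OK_INSIDE = {"title", "a", "canvas", "video", "audio"}
VOID_TAGS = {"meta", "link", "img"}
URL_ATTRS = {"href", "src"}


class ManifestMLChecker(HTMLParser):
    """Collects violations instead of raising, JSLint style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.errors = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in ALLOWED_TAGS:
            self.errors.append(f"<{tag}> is not part of the subset")
        if "id" not in attrs:
            self.errors.append(f"<{tag}> has no id attribute")
        for name, value in attrs.items():
            if name.startswith("on"):
                self.errors.append(f"event handler {name} on <{tag}>")
            elif name == "style":
                self.errors.append(f"inline style on <{tag}>")
            elif name in URL_ATTRS:
                # Naive check: real browsers tolerate more obfuscation.
                if (value or "").strip().lower().startswith("javascript:"):
                    self.errors.append(f"javascript: URL in {name} on <{tag}>")
        if tag == "script" and "src" not in attrs:
            self.errors.append("<script> without src")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray close tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip():
            return  # whitespace-only text nodes are allowed
        if "script" in self.open_tags:
            self.errors.append("inline script")
        elif not TEXT_OK_INSIDE.intersection(self.open_tags):
            self.errors.append(f"freeform text: {data.strip()[:20]!r}")


def check_manifestml(text):
    """Return a list of violation messages (empty means it passed)."""
    checker = ManifestMLChecker()
    checker.feed(text)
    checker.close()
    return checker.errors
```

Calling check_manifestml on a page returns a list of complaints, so it could sit in a build step or a server-side framework hook. It checks the subset rules only; it does not check that the document is valid HTML in the first place.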


SemanticML, on the other hand, is a subset of HTML5/HTML4 that should include only actual markup/semantic elements. It forbids referencing any kind of style, JavaScript code, or other external object except indirectly, by id or via class names. Essentially, it's the type of markup you'd expect to be generated by a program like Markdown or Textile, or by a WYSIWYG editor.

Things that are *not* in SemanticML:
Anything in ManifestML (including doctype, head, title, meta, namespaces, style, link, and IMG), apart from the exceptions listed below.
Event handlers and javascript: URLs.
ID attributes. (Only class attributes are allowed, plus references to ids via fragment identifiers in URLs.)
Inline style attributes.

Things that /are/ in SemanticML: <A>, and a restricted form of <IMG> whose src is either same-origin only, or a fragment identifier (one that references an img tag with that #id in a ManifestML file).

Tag soup and random garbage are allowed too, as long as SemanticML can be kept in a secure sandbox that disallows anything except the pure /content/ /semantic/ parts of HTML.
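
The restricted <IMG> rule is the one part of SemanticML that needs a little URL logic, so here is a rough sketch of how a verifier might decide whether a src is acceptable. The function names and the manifest_ids parameter are hypothetical, and I'm reading "src with fragment identifier" as a bare "#id" reference; other readings are possible.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port), filling in the default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def img_src_allowed(src, page_url, manifest_ids=frozenset()):
    """True if src is same-origin, or a "#id" known to the manifest.

    manifest_ids is assumed to be the set of img ids collected from the
    page's ManifestML document.
    """
    src = src.strip()
    if src.startswith("#"):
        return src[1:] in manifest_ids
    resolved = urljoin(page_url, src)
    if urlsplit(resolved).scheme not in DEFAULT_PORTS:
        return False  # rejects javascript:, data:, and friends
    return origin(resolved) == origin(page_url)
```

For example, img_src_allowed("/img/cat.png", "https://example.com/post/1") passes, while a src pointing at another host, or a javascript: URL, does not.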

Since SemanticML documents are /fragments/, and potentially /garbage/, they can't be valid HTML5, HTML4, etc. But they should have the following two properties: they can be concatenated, or wrapped in a div, with no change in appearance or semantics; and there is a clear strategy for reformatting them to close all unclosed tags, so that they can't leak out into the larger documents they are composed into. Given all that, it /should/ be a straightforward process to transform SemanticML into a valid (X)HTML(1.0|4|5) document.
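
One possible "clear strategy" for the second property is to re-serialise each fragment through a parser, closing whatever is still open at the end and dropping close tags that never had an opener. The sketch below does only that balancing step, again with Python's html.parser; the names Balancer and balance_fragment are made up, and it does not filter out the forbidden tags and attributes, which would be a separate pass.

```python
import re
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "wbr"}
# Only re-emit plain tag and attribute names; anything odd is dropped.
NAME_RE = re.compile(r"[a-z][a-z0-9-]*\Z")


class Balancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if not NAME_RE.match(tag):
            return
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if NAME_RE.match(name)
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag: would leak into the host page
        while True:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance_fragment(fragment):
    """Return the fragment with every opened tag closed, in order."""
    balancer = Balancer()
    balancer.feed(fragment)
    balancer.close()
    while balancer.stack:
        balancer.out.append(f"</{balancer.stack.pop()}>")
    return "".join(balancer.out)
```

With this, balance_fragment("<p>one <b>two") gives "<p>one <b>two</b></p>", and a leading "</div>" is simply discarded, so a balanced fragment can be wrapped in a div or concatenated with another without either one spilling into its neighbour.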

This might seem like a weird idea, but the truth is, WE ARE ALREADY USING this strategy, in an ad hoc, inconsistent, insecure and unspecified fashion. My proposal is that we formalise it and build a consistent style around it.


Maksim Lin said...

Breton, it's a very interesting idea!

But does it make sense to formalise this instead of documenting best practice, as there is such a huge diversity of use cases for web development?

Breton Slivka said...

I think it makes sense to formalise it in the sense that it makes sense to have a set of verifiers, like JSLint, to ensure that a bit of text either does or does not conform to a particular subset. These verifiers could be integrated into server-side frameworks, browser extensions, and possibly even the browsers themselves, via an opt-in tag or header, like the ES5 "strict mode". Opting in to the stricter modes in the browser would provide benefits like allowing you to shut off support for the document.write methods to increase the load performance of the page, and letting the browser know that you, as the author, agree and expect that the browser may download the resources in the manifest in parallel and in any order.

That contradicts an offhand remark I made in the post, so I may have to revise that part. My point is that in ES5, by using strict mode you are able to opt in to some future JS engine optimisations by agreeing not to use certain features of the language. I envision the same thing with this. Though I realise that's similar to the promise that XHTML made, I'm hopefully avoiding XHTML's mistake by being 100% backwards compatible with older browsers, and by making strict error handling an unobtrusive opt-in developer feature that you turn on while you're developing, and turn off when you go live (if you want).