Gwtar: A static efficient single-file HTML format

138 points by theblazehen 6 hours ago

simonw 4 hours ago

TIL about window.stop() - the key to this entire thing working. It causes the browser to stop loading any more assets: https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

Apparently every important browser has supported it for well over a decade: https://caniuse.com/mdn-api_window_stop

Here's a screenshot illustrating how window.stop() is used - https://gist.github.com/simonw/7bf5912f3520a1a9ad294cd747b85... - everything after <!-- GWTAR END is tar archive data.

Posted some more notes on my blog: https://simonwillison.net/2026/Feb/15/gwtar/

moritzwarhier 4 hours ago

Not the inverse, but for any SPA (not framework or library) developers seeing this, it's probably worth noting that this is not better than using document.write, window.open and similar APIs.
But could be very interesting for use cases where the main logic lives on the server and people try to manually implement some download- and/or lazy-loading logic.
Still probably bad unless you're explicitly working on init and redirect scripts.

8n4vidtmkvmk 4 hours ago

Neat! I didn't know about this either.
PHP has a similar feature called __halt_compiler() which I've used for a similar purpose. Or sometimes just to put documentation at the end of a file without needing a comment block.

tym0 3 hours ago

I was on board until I saw that those can't easily be opened from a local file. Seems like local access is one of the main use cases for archival formats.

avaer 3 hours ago

Agreed, I was thinking it's like asm.js where it can "backdoor pilot" [1] an interesting use case into the browser by making it already supported by default.
But not being able to "just" load the file into a browser locally seems to defeat a lot of the point.
[1] https://en.wikipedia.org/wiki/Television_pilot#Backdoor_pilo...

gildas 30 minutes ago

I would like to know why the ZIP/HTML polyglot format produced by SingleFile [1] and mentioned in the article is said to "achieve static, single, but not efficiency". What's not efficient compared to the gwtar format?

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG

gwern 26 minutes ago

'efficiency' is downloading only the assets needed to render the current view. How does it implement range requests and avoid downloading the entire SingleFileZ when a web browser requests the URL?
- gildas 22 minutes ago
  
  I haven't looked closely, but I get the impression that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call and rely on range requests (zip.js supports them) to unzip and display the page. This could also be transparent for the user, depending on whether the file is served via HTTP or not. However, I admit that I haven't implemented this mechanism yet.
  
  gwern 15 minutes ago
  
  > that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call...However, I admit that I haven't implemented this mechanism yet.
  Well, yes. That's why we created Gwtar and I didn't just use SingleFileZ. We would have preferred to not go to all this trouble and use someone else's maintained tool, but if it's not implemented, then I can't use it.
  (Also, if it had been obvious to you how to do this window.stop+range-request trick beforehand, and you just hadn't gotten around to implementing it, it would have been nice if you had written it up somewhere more prominent; I was unable to find any prior art or discussion.)
  
  gildas 8 minutes ago
  
  The reason I did not implement the innovative mechanism you describe is that I did not think of it because, in my case, all the technical effort was/is focused on reading the archive from the filesystem. No one has suggested it either.
  Edit: Actually, SingleFile already calls window.stop() when displaying a zip/html file in HTTP, see https://github.com/gildas-lormeau/single-file-core/blob/22fc...

gregabbott an hour ago

I made something similar to this a while ago: two single file HTML vanilla JS tools, for locally viewing[1] or wrapping[2] zipped webpages (a zip which has an index.html file and any related structure of files: folders, images, css, js). The wrapper bundles a given zipped webpage and the viewer code into a self-extracting HTML file that also runs offline (as in `file:///`). I needed it for simple zips but a little work should let it handle more interesting ones.

[1] https://gregabbott.pages.dev/zipped-webpage-view/ 12KB

[2] https://gregabbott.pages.dev/zipped-webpage-wrap/ 16KB

zetanor 4 hours ago

The author dismisses WARC, but I don't see why. To me, Gwtar seems more complicated than a WARC, while being less flexible and while also being yet another new format thrown onto the pile.

obscurette 4 hours ago

WARC is mentioned with a very specific reason for not being good enough: "WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display)."

simonw 4 hours ago

I don't think you can provide a URL to a WARC that can be clicked to view its content directly in your browser.
- zetanor 4 hours ago
  
  At the very least, WARC could have been used as the container ("tar") format after the preamble of Gwtar. But even there, given that this format doesn't work without a web server (unlike SingleFile, mentioned in the article), I feel like there's a lot to gain by separating the "viewer" (Gwtar's javascript) from the content, such that the viewer can be updated over time without changing the archives.
  I certainly could be missing something (I've thought about this problem for all of a few minutes here), but surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwarc.cdx" with little to no loss of convenience, and call it a day?
  
  gwern 30 minutes ago
  
  You could potentially use WARC instead of Tar as the appended container, sure, but that's a lot of complexity, WARC doesn't serialize the rendered page (so what is the greater 'fidelity' actually getting you?) and SingleFile doesn't support WARC, and I don't see a specific advantage that a Gwtar using WARC would have. The page rendered what it rendered.
  And if you choose to require separate files and break single-file, then you have many options.
  > surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwarc.cdx"
  I'm not familiar with warcviewer.js and Googling isn't showing it. Are you thinking of https://github.com/webrecorder/wabac.js ?

calebm an hour ago

Very cool idea. I think single-file HTML web apps are the most durable form of computer software. A few examples of Single-File Web Apps that I wrote are: https://fuzzygraph.com and https://hypervault.github.io/.

mr_mitm 3 hours ago

Pretty cool. I made something similar (much more hacky) a while ago: https://github.com/AdrianVollmer/Zundler

Works locally, but it does need to decompress everything first thing.

gwern an hour ago

So this is like SingleFileZ in that it's a single static inefficient HTML archive, but it can easily be viewed locally as well?
How does it bypass the security restrictions which break SingleFileZ/Gwtar in local viewing mode? It's complex enough I'm not following where the trick is and you only mention single-origin with regard to a minor detail (forms).
- mr_mitm 33 minutes ago
  
  The content is in an iframe, my code is outside of it, and the two frames are passing messages back and forth. Also I'm monkey patching `fetch` and a few other things.

Retr0id 2 hours ago

It's fairly common for archivers (including archive.org) to inject some extra scripts/headers into archived pages or otherwise modify the content slightly (e.g. fixing up relative links). If this happens, will it mess up the offsets used for range requests?

gwern an hour ago

The range requests are to offsets in the original file, so I would think that most cases of 'live' injection do not necessarily break it. If you download the page and the server injects a bunch of JS into the 'header' on the fly and the header is now 10,000 bytes longer, then it doesn't matter, since all of the ranges and offsets in the original file remain valid: the first JPG is still located starting at offset byte #123,456 in $URL, the second one is located starting at byte #456,789 etc, no matter how much spam got injected into it.
Beyond that, depending on how badly the server is tampering with stuff, of course it could break the Gwtar, but then, that is true of any web page whatsoever (never mind archiving), which is why servers should be very careful when doing so, and generally shouldn't.
Now you might wonder about 're-archiving': if the IA serves a Gwtar (perhaps archived from Gwern.net), and it injects its header with the metadata and timeline snapshot etc, is this IA Gwtar now broken? If you use a SingleFile-like approach to load it, properly force all references to be static and loaded, and serialize out the final quiescent DOM, then it should not be broken and it should look like you simply archived a normal IA-archived web page. (And then you might turn it back into a Gwtar, just now with a bunch of little additional IA-related snippets.) Also, note that the IA, specifically, does provide endpoints which do not include the wrapper, like APIs or, IIRC, the 'if_/' fragment. (Besides getting a clean copy to mirror, it's useful if you'd like to pop up an IA snapshot in an iframe without the header taking up a lot of space.)

karel-3d an hour ago

The example link doesn't work for me at all in iOS safari?

https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...

I will try on Chrome tomorrow.

woodruffw 43 minutes ago

It also doesn't work on desktop Safari 26.2 (or perhaps it does, but not to the extent intended -- it appears to be trying to download the entire response before any kind of content painting.)

westurner 2 hours ago

Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

WICG/webpackage: https://github.com/WICG/webpackage#packaging-tools

"Use Cases and Requirements for Web Packages" https://datatracker.ietf.org/doc/html/draft-yasskin-wpack-us...

gwern an hour ago

> Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?
As far as I know, we do not have any hash verification beyond that built into TCP/IP or HTTPS etc. I included SHA hashes just to be safe and forward compatible, but they are not checked.
There's something of a question of what hashes are buying you here and what the threat model is. In terms of archiving, we're often dealing with half-broken web pages (any of whose contents may themselves be broken) which may have gone through a chain of a dozen owners, where we have no possible web of trust to the original creator, assuming there is even one in any meaningful sense, and where our major failure modes tend to be total file loss or partial corruption somewhere during storage. A random JPG flipping a bit during the HTTPS range request download from the most recent server is in many ways the least of our problems in terms of availability and integrity.
This is why I spent a lot more time thinking about how to build FEC in, like with appending PAR2. I'm vastly more concerned about files being corrupted during storage or the chain of transmission or damaged by a server rewriting stuff, and how to recover from that instead of simply saying 'at least one bit changed somewhere along the way; good luck!'. If your connection is flaky and a JPEG doesn't look right, refresh the page. If the only Gwtar of a page that disappeared 20 years ago is missing half a file because a disk sector went bad in a hobbyist's PC 3 mirrors ago, you're SOL without FEC. (And even if you can find another good mirror... Where's your hash for that?)
> Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?
No idea. It sounds like you know more about them than I do. What threat do they protect against, exactly?

spankalee 3 hours ago

I really don't understand why a zip file isn't a good solution here. Just because it requires "special" zip software on the server?

gwern 42 minutes ago

> Just because it requires "special" zip software on the server?
Yes. A web browser can't just read a .zip file as a web page. (Even if a web browser decided to try to download, and decompress, and open a GUI file browser, you still just get a list of files to click.) Therefore, far from satisfying the trilemma, it just doesn't work.
And if you fix that, you still generally have a choice between either no longer being single-file or efficiency. (You can just serve a split-up HTML from a single ZIP file with some server-side software, which gets you efficiency, but now it's no longer single-file; and vice-versa. Because if it's a ZIP, how does it stop downloading and only download the parts you need?)

newzino 2 hours ago

Zip stores its central directory at the end of the file. To find what's inside and where each entry starts, you need to read the tail first. That rules out issuing a single Range request to grab one specific asset.
Tar is sequential. Each entry header sits right before its data. If the JSON manifest in the Gwtar preamble says an asset lives at byte offset N with size M, the browser fires one Range request and gets exactly those bytes.
The other problem is decompression. Zip entries are individually deflate-compressed, so you'd need a JS inflate library in the self-extracting header. Tar entries are raw bytes, so the header script just slices at known offsets. No decompression code keeps the preamble small.
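To make the offset arithmetic concrete, here is a rough sketch in Python rather than browser JS. The asset names, the manifest layout and the `preamble_len` parameter are all made up for illustration and are not Gwtar's actual format; the only real API relied on is `tarfile`, whose `offset_data` attribute reports where an entry's bytes begin.

```python
import io
import tarfile

# Two hypothetical assets packed into an uncompressed tar, held in memory.
assets = {
    "index.html": b"<h1>hello</h1>",
    "img/cat.jpg": b"\xff\xd8fake-jpeg-bytes",
}
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in assets.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
archive = buf.getvalue()

# Index the archive once: where each entry's data starts and how long it is.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
    manifest = {m.name: (m.offset_data, m.size) for m in tar.getmembers()}


def range_header(offset, size, preamble_len=0):
    """Build the Range header value for one asset (byte ranges are inclusive)."""
    start = preamble_len + offset
    return f"bytes={start}-{start + size - 1}"


def slice_asset(blob, name):
    """What a server would return for that range: a plain slice, no decoding."""
    offset, size = manifest[name]
    return blob[offset:offset + size]
```

With the manifest in hand, `range_header(*manifest["img/cat.jpg"])` is the header a client would send, and `slice_asset(archive, "img/cat.jpg")` returns the original bytes untouched, which is the whole appeal of skipping compression.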
- fluidcruft 2 hours ago
  
  You can also read a zip sequentially like a tar file. Some info is in the directory only but just for getting file data you can read the file records sequentially. There are caveats about when files appear multiple times but those caveats also apply to processing tar streams.
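A minimal sketch of that sequential read, in Python, assuming the writer put the sizes in each local file header (true for `zipfile` writing to a seekable buffer). Archives that use trailing data descriptors (flag bit 3) are simply rejected here, and the file names are invented for the demo.

```python
import io
import struct
import zipfile
import zlib

# Fixed 30-byte local file header: signature, version, flags, method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# name length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_sequentially(blob):
    """Walk local file records front to back, ignoring the central directory."""
    pos, out = 0, {}
    while blob[pos:pos + 4] == b"PK\x03\x04":
        (_sig, _ver, flags, method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(blob, pos)
        if flags & 0x08:
            raise ValueError("sizes live in a data descriptor; not handled in this sketch")
        name_start = pos + LOCAL_HEADER.size
        name = blob[name_start:name_start + name_len].decode("utf-8")
        data_start = name_start + name_len + extra_len
        raw = blob[data_start:data_start + csize]
        if method == 8:      # deflate: raw stream, hence negative wbits
            data = zlib.decompress(raw, -15)
        elif method == 0:    # stored: bytes are used as-is
            data = raw
        else:
            raise ValueError(f"unsupported compression method {method}")
        out[name] = data     # a later duplicate name silently wins
        pos = data_start + csize
    return out


# Demo archive with one deflated and one stored entry.
files = {"index.html": b"<p>hi</p>" * 50, "style.css": b"p{margin:0}"}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("index.html", files["index.html"], compress_type=zipfile.ZIP_DEFLATED)
    zf.writestr("style.css", files["style.css"], compress_type=zipfile.ZIP_STORED)
recovered = read_sequentially(buf.getvalue())
```

The loop stops at the first signature that is not a local file header, which in a well-formed archive is the start of the central directory. The duplicate-name caveat shows up in the `out[name] = data` line: whichever record comes last wins.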

O1111OOO 4 hours ago

I gave up a long time ago and started using the "Save as..." on browsers again. At the end of the day, I am interested in the actual content and not the look/feel of the page.

I find it easier to just mass delete assets I don't want from the "pageTitle_files/" directory (js, images, google-analytics.js, etc).

mikae1 3 hours ago

Have you tried https://addons.mozilla.org/firefox/addon/single-file/?
If you really just want the text content you could just save markdown using something like https://addons.mozilla.org/firefox/addon/llmfeeder/.
- ninalanyon 3 minutes ago
  
  On the subject of SingleFile there is also WebScrapBook: https://github.com/danny0838/webscrapbook
  I prefer it because it can save without packing the assets into one HTML file. Then it's easy to delete or hardlink common assets.

gwern 40 minutes ago

I find that 'save as' horribly breaks a lot of web pages. There's no choice these days but to load pages with JS and serialize out the final quiescent DOM. I also spend a lot of time with uBlock Origin and AlwaysKillSticky and NoScript wrangling my archive snapshots into readability.

TiredOfLife 2 hours ago

Save as doesn't work on sites that lazy load.

renewiltord 4 hours ago

Hmm, I’m interested in this, especially since it applies no compression, so delta encoding might be feasible for daily scans of the data. But for whatever reason, Brave on iOS displays a blank page for the example page. Perhaps it’s a mobile rendering issue, because Chrome and Safari on iOS can’t do it either: https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...

wetpaws 4 hours ago

[dead]

nullsanity 3 hours ago

Gwtar seems like a good solution to a problem nobody seemed to want to fix. However, this website is... something else. It's full of inflated self-importance, overly bountiful prose, and feels like someone never learned to put in the time to write a shorter essay. Even the about page contains a description of the about page.

I don't know if anyone else gets "unemployed megalomaniacal lunatic" vibes, but I sure do.

3rodents 3 hours ago

gwern is a legendary blogger (although blogger feels underselling it… “publisher”?) and has earned the right to self-aggrandize about solving a problem he has a vested interest in. Maybe he’s a megalomaniac and/or unemployed and/or writing too many words but after contributing so much, he has earned it.
- TimorousBestie 2 hours ago
  
  I was more willing to accept gwern’s eccentricities in the past but as we learn more about MIRI and its questionable funding resources, one wonders how much he’s tied up in it.
  The Lighthaven retreat in particular was exceptionally shady, possibly even scam-adjacent; I was shocked that he participated in it.
  
  k33n 2 hours ago
  
  What does any of that have to do with the value of what’s presented in the article?

fluidcruft 3 hours ago

What's up with the non-stop knee-jerk bullshit ad hom on HN lately?
- Krutonium 3 hours ago
  
  We're tired, chief.
- esseph 3 hours ago
  
  The earth is falling out from under a lot of people, and they're trying to justify their position on the trash heap as the water level continues to rise around it. It's a scary time.
- TimorousBestie 2 hours ago
  
  Technically it’s only an ad hominem when you’re using the insult as a component in a fallacious argument; the parent comment is merely stating an aesthetic opinion with more force than is typically acceptable here.