thm 1 year ago

Might also be worthwhile to download a pre ~2023 dump, because Low-background steel.

  • foreigner 1 year ago

    LOL that is an amazing analogy, thank you.

    • icepat 1 year ago

      My go-to as well. Pre-war steel.

      • froh 1 year ago

        it's not about pre-war. it's about pre-trinity-nuclear tests. which means uncontaminated by atmospheric radioactive isotopes. it happened at the end of ww-ii but that is not the point.

        • sebzim4500 1 year ago

          It's an important distinction because a lot of ships were sunk during WWII.

        • icepat 1 year ago

          Yes, however it's also an accepted name for it

          > Low-background steel, also known as pre-war steel and pre-atomic steel, is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s.

          https://en.wikipedia.org/wiki/Low-background_steel

  • busyant 1 year ago

    just in case anyone is as obtuse as I am, I believe the joke here is that the contents of Wikipedia might be contaminated with AI generated content starting around 2023. you can probably look up low background steel to complete the analogy.

  • webdoodle 1 year ago

    > download a pre ~2023 dump, because Low-background steel.

    Military A.I. was likely in use earlier, and since PSYOPS are the most used and most effective weapon in the U.S. Military's arsenal, you absolutely know it was used. It ain't a war crime the first time...

    • rokobobo 1 year ago

      How much do we know about military AI’s capabilities? As in, is there any evidence that the government/military was ahead of big tech on the AI research front?

      • glenstein 1 year ago

        Seconded. Sometimes when someone says XYZ was likely used it's because they've read something from a credible source, or maybe are a subject matter expert, or have grasped some other similarly solid chain of evidence.

        But sometimes, they mean "likely" in the more colloquial sense of a guesstimation, which can range anywhere from informed guess to low effort fan-fiction. I default toward the latter unless otherwise specified.

      • tmpz22 1 year ago

        "Please summarize the maintenance procedure for a tomahawk missile"

        boom

    • debesyla 1 year ago

      It is also unlikely that, if such AI was used, it would have been used to edit a billion articles about obscure species of plants and insects.

miki123211 1 year ago

I wonder how easy it would be to make a practically indestructible, everlasting Wikipedia reader.

Something using solar for power, with a rugged and water-resistant enclosure, made of extremely high-quality components that won't break for hundreds of years at least. Maybe add an IRDA port for good measure, to make it possible to transfer all the data out somewhat quickly.

You could make hundreds of these and put them in hard-to-reach locations around the world, to make sure at least one survives whatever calamity might befall us in the future.

  • willis936 1 year ago

    You could even make it radiation tolerant by printing it.

    • nine_k 1 year ago

      Be certain to use acid-free paper [1]. The typical cheap bright-white paper of today will have a hard time staying in good condition for 100-200 years. Ideally go for the ultra-durable cotton-rag paper used e.g. for paper money.

      [1]: https://en.wikipedia.org/wiki/Acid-free_paper

    • jjeaff 1 year ago

      Then figure out how and where to store the over one thousand volumes that have 1200 pages each.

      • interroboink 1 year ago

        From "Plenty of Room at the Bottom"[1]:

          What would happen if I print all this down at the scale we have been
          discussing? How much space would it take?  It would take, of course, the
          area of about a million pinheads ...  All of the information which all of
          mankind has ever recorded in books can be carried around in a pamphlet
          in your hand — and not written in code, but a simple reproduction of
          the original pictures, engravings, and everything else on a small scale
          without loss of resolution.
        

        Need a good magnifying glass, though (:

        [1] https://web.pa.msu.edu/people/yang/RFeynman_plentySpace.pdf

        • yencabulator 1 year ago

          Yeah electron microscopes are very much not "practically indestructible" on the scale of civilization-wide disturbances.

          Going from "25,000x -> encyclopedia on pinhead" to "100x (microfilm scale) -> encyclopedia on a 5x10cm metal sheet" is probably a better bet.

  • y-curious 1 year ago

    You got me imagining a project where the entire wiki db gets laser-etched on thin stone tablets/metal plates

    • jjeaff 1 year ago

      Microfiche is probably the best option here.

  • chuckadams 1 year ago

    Make sure it has the words "DON'T PANIC" inscribed in large friendly letters on the cover.

  • shrinks99 1 year ago

    Kiwix has created pretty polished software for this: https://kiwix.org/

    My last download of English Wikipedia was ~110 GB and includes images! It's impressively small for the volume of information available.

  • ce4 1 year ago

    Aard2 for Android has existed since at least 2015:

    https://f-droid.org/packages/itkach.aard2

    I have many current and old dumps and can switch between a few years. Very nice in case of deleted articles or to check old timestamped versions. It also supports more than just Wikipedia, like Wikiquote, Wikivoyage, or a cooking wiki. You can compile your own MediaWiki sites too.

  • rotexo 1 year ago

    Some thoughts about making it possible for individual humans to access Wikipedia, robustly to calamities that are within the sphere of human agency.

    Seems like you would want it to be stored digitally. Ideally, people would have the ability to access it remotely, in case their local copy is somehow corrupted. For that, you would need a physical network by which the data can be transmitted. Economies of scale would seem to suggest that there would be one or a few entities that would “serve” the content to individuals who request it. Of course, you would want those individuals to be able to access this information without having detailed technical knowledge and ability. I guess they would have pre-packaged software “browsers” they could use to access the network.

    In order to maintain this arrangement, you would want enough political stability to allow for the physical upkeep of this infrastructure, including human infrastructure (feeding the engineers who make it all possible). In order to make it worthwhile, you would need people who want to access the information too. I suspect political stability, a sufficient abundance of the necessities for human life, and the political will to make sure that everyone’s needs are met so that they can safely be curious about the world would help here too.

    All of this requires sources of power. I suspect that a combination of nuclear power, solar/batteries, and geothermal energy would be sufficient and would avoid the problem of running out of fossil fuels at some point in the future. This has the nice side effect of reducing the impact of calamities exacerbated by the greenhouse effect.

    For the information to continue being relevant, you would have to update it with new knowledge, and correct inaccuracies. How best to accomplish this? Well, I guess you would need a systematic way to interrogate the causes behind the various effects we observe in the world. I would propose a system where people create hypotheses, and perform experiments that exclude the influence of as many factors as possible external to the phenomenon being studied. People would then share their findings, and I guess would critique each other’s arguments in a sort of “peer review” to try to come to a consensus. You would have to feed and provide for these people at a certain basic level to make sure they are comfortable and safe enough to continue doing this work. I guess you would want to encourage the value systems compatible with this method of interrogating the world.

    Just my 2 cents.

myself248 1 year ago

You can also get it as a .zim file for easy offline browsing with Kiwix.

The whole enchilada: https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_ma...

Other versions: https://library.kiwix.org/#lang=eng&category=wikipedia

  • ramses0 1 year ago

    I tried this kiwix the other day, it has like a 300mb "essentials" text version that was interesting.

    This comment was downvoted; it would be better to get a reply explaining why it wasn't contributing to the discussion instead.

    • JohnKemeny 1 year ago

      > I tried this kiwix the other day, it has like a 300mb "essentials" text version that was interesting.

      I didn't downvote the comment, but it's not an incredibly deep contribution, is it?

      If you really wish to contribute, perhaps you can say what the "'essentials' text version" contained and why you found it interesting?

  • we0x 1 year ago

    It is useful software for offline use and emergencies. For those who may not know, apart from wikis, they also offer offline documentation (Linux distros like ArchWiki, libraries, etc.), medical libraries (Medicine Plus, etc.), and Stack Exchange.

  • zaggynl 1 year ago

    Will the 2025 zim be available as well?

    • unethical_ban 1 year ago

      I am wondering the same thing. I have Jan 2021, Jan 2024... I want to keep a snapshot each year and I wonder why a new one hasn't been generated.

      I haven't looked for documentation on creating my own zim file.

      • geoffeg 1 year ago

        I looked into it once, I think the script or system that built the larger dumps broke and no one fixed it. I started working on it but other stuff got in the way.

    • benoitberaud 1 year ago

      Main Kiwix dev in charge of scrapers (tools to create ZIM files, even if we do not really scrape technically speaking) here.

      We are working hard toward upgrading the Wikipedia ZIMs, but it is far from being an easy feat. I'm mostly solo on this, and far from dedicating 100% of my time to this, so it does not move very fast. We are quite close to being able to reach the goal however, probably only a matter of weeks now.

      Bonus: the tool will now get pretty good at making a ZIM of any MediaWiki site, not only Wikimedia ones. We expect, for instance, to work on all Fandom wikis sometime this year, since there is significant knowledge over there.

      • myself248 1 year ago

        Hey, thank you for what you do! Kiwix was the first project that made me feel like using Github Sponsors to support it; your work is wrapped into countless other educational projects like IIAB.

        Is there any specific help you need, or where could folks get involved if they wanted to?

      • zaggynl 1 year ago

        Thank you for your answer and the hard work!

jasoncartwright 1 year ago

If you've got some spare bandwidth & storage then seeding some of the torrents here is a cheap and fun way of helping Wikipedia out. I've served around 20TB of these dumps in the past year.

https://meta.wikimedia.org/wiki/Data_dump_torrents

  • mhitza 1 year ago

    Do you happen to know why wikipedia didn't embrace torrents as the default download method?

    • davidkwast 1 year ago

      I think it is a nice use case for IPFS

    • dijit 1 year ago

      Speculating: because torrents are not especially good at dealing with small modifications?

      Most people probably won't seed many versions, so it's a losing effort, and you need to allocate a huge chunk of space for each version.

      Deduplicating filesystems are sadly not in vogue.

    • mmooss 1 year ago

      Most people don't use torrents.

JKCalhoun 1 year ago

I more or less do this every year — grab the latest Kiwix, English version (about 100 GB or so). I keep the older ones as well.

JKCalhoun 1 year ago

Is there a RAG for Wikipedia?

I may not be using the term correctly here. In short, I would love a local LLM + Wikipedia snapshot so that I can have an offline, self-hosted ... Hitchhiker's Guide to Earth.
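
For what it's worth, the "retrieval" half of RAG can be sketched in a few lines. The following is only a toy illustration, not any existing project: the passages are hard-coded stand-ins for text you would extract from a local snapshot, the scoring is plain TF-IDF with cosine similarity rather than the embedding search most real setups use, and `build_prompt` merely assembles the text you would hand to whatever local LLM you run. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

# Stand-in passages; in practice these would come from a local Wikipedia snapshot.
PASSAGES = [
    "Low-background steel is steel produced before the first nuclear bombs were detonated.",
    "Kiwix is an offline reader for ZIM files such as Wikipedia snapshots.",
    "Acid-free paper is paper with a neutral or basic pH that resists yellowing.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(passages):
    """Return one TF-IDF vector (a dict) per passage, plus the IDF table."""
    docs = [Counter(tokenize(p)) for p in passages]
    doc_freq = Counter(term for doc in docs for term in doc)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, vectors, idf, k=2):
    """Return the k passages most similar to the question."""
    counts = Counter(tokenize(question))
    query = {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}
    ranked = sorted(range(len(passages)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return [passages[i] for i in ranked[:k]]


def build_prompt(question, context):
    """Assemble the text that would be sent to a local LLM (not called here)."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    vectors, idf = build_index(PASSAGES)
    question = "What is low-background steel?"
    print(build_prompt(question, retrieve(question, PASSAGES, vectors, idf, k=1)))
```

The generation half is then just a matter of passing that prompt to the model; the quality of the answers depends mostly on how the snapshot is chunked and indexed.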

albert_e 1 year ago

There are non English versions of Wikipedia also.

Can anyone please point to information on how we can download a copy of one specific language version?
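
As a rough pointer: the raw dumps on dumps.wikimedia.org are organised per wiki database name, which for Wikipedia is the language code followed by `wiki` (for example `dewiki` for German). The sketch below only builds the URL of the latest article dump under that layout. The helper name `dump_url` is made up for illustration, and the hyphen-to-underscore handling for codes such as `zh-yue` is an assumption worth checking against the dump index before relying on it.

```python
def dump_url(lang, project="wiki"):
    """Build the URL of the latest pages-articles dump for one language edition.

    Assumes the dumps.wikimedia.org layout <db>/latest/<db>-latest-pages-articles.xml.bz2,
    where <db> is the language code plus the project suffix.
    """
    # Assumption: hyphenated language codes use underscores in the database name.
    db = lang.lower().replace("-", "_") + project
    return f"https://dumps.wikimedia.org/{db}/latest/{db}-latest-pages-articles.xml.bz2"


if __name__ == "__main__":
    for code in ("de", "fr", "zh-yue"):
        print(dump_url(code))
```

If you would rather have a ready-to-browse file than raw XML, the Kiwix library link posted elsewhere in this thread carries a `lang` parameter in its URL; changing it to another language code is presumably how you filter there, though that is a guess from the URL alone.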

cynicalsecurity 1 year ago

Okay, that looked a bit ridiculous in the pre-AI era (who needs to download the whole Wikipedia?), but now I can see the sense in it.

  • _fat_santa 1 year ago

    Oh no, even now there is plenty of use for it outside of AI training. Just think of all the schools in villages all around the world that don't have access to the internet or have a very limited connection. I've worked with folks who would set up local "Wikipedia servers" for schools so that kids could access Wikipedia via a local network connection. In other setups they just download all of Wikipedia to a set of laptops and you use one of the offline readers to browse it.

    This is essentially the modern version of having a library of encyclopedias.

    • ozmodiar 1 year ago

      I'm thinking less about AI training and more about having a source of (reasonably) reliable information from the net, in case AI generated fake images and generated cross referenced texts start making it too difficult to discern real history from malicious rewrites. It's bad enough now, but can get much worse with the proliferation of AI agents.

      • myself248 1 year ago

        Pre-2022 Wikipedia dumps will be analyzed by future historians.

        • normie3000 1 year ago

          And banned by future governments.

    • ThinkingGuy 1 year ago

      There's already a project to serve the use case you're describing (school in a disconnected village): Internet in a Box

      https://internet-in-a-box.org/

      They provide offline access to Wikipedia, OpenStreetMap, Project Gutenberg, and many other resources.

  • jsheard 1 year ago

    Too bad the AI scrapers don't care, and are melting Wikipedia's production servers anyway.

    https://arstechnica.com/information-technology/2025/04/ai-bo...

    • sdoering 1 year ago

      Tragedy of the commons. And that’s why we can’t have nice things.

      Because people are people. And will always prioritize egotism over respect for the common good.

      • mistrial9 1 year ago

        no - when fragile resources are abused by one endpoint out of one hundred thousand others, and the abuse is one hundred thousand times greater.. how is that a condemnation of the "ways" of "all people" .. what is justice?

      • yreg 1 year ago

        But we have nice things. Wikipedia can deal with it just fine.

    • petercooper 1 year ago

      I bet someone like Cloudflare could pull the dataset each day and serve up a plain text/Markdown version of Wikipedia for rounding error levels of spend. I just loaded a random Wikipedia page and it had a weight of 1.5MB in all for what I worked out would be about 30KB of Markdown (i.e. 50x less bandwidth).

      Of course, the problem then is getting all these scrapers and bots to actually use the alternative, but Wikimedia could potentially redirect suspected clients in that direction..

      • tough 1 year ago

        Someone suggested to me to apply a filter that serves .md or .txt to bots/AI scrapers instead of the regular website. Seems smart if it works, but I hate it when I get captchas, and this could similarly end up detecting non-bots as bots.

        maybe a "view full website" link loaded via JS so bots don't see it, idk

        • 3036e4 1 year ago

          I would love to see most sites serve me markdown. I'd happily install a browser extension to mask me as an AI bot scraper if it means I can just get the text without all the noise.

          • tough 1 year ago

            lol

            me too tbh

            someone pointed out you can enable reader mode by default in Safari under settings, but even then not all websites' pages are served as reader-mode-enabled pages

          • tough 1 year ago

            someone built a service for AI bots called pure.md; it's been a godsend for curling websites as markdown on the occasions where that doesn't work the first time, and it works great for occasional use with the free tier

          • sunshine-o 1 year ago

            I have good news. It (almost) exists, it is called Gemini [0]

            - [0] https://geminiprotocol.net/

            • 3036e4 1 year ago

              Not news to me. I host my own gemlog there, but post rather infrequently.

              Websites as gemtext would be even better than as markdown, but less likely to be fed to bots.

    • ujkhsjkdhf234 1 year ago

      I would like companies to start aggressively pushing back against AI scrapers using things like Anubis[0]. If you can't be a good steward of the internet or respectful to other peoples' resources, then people have the right to deny them to you.

      [0] https://github.com/TecharoHQ/anubis

    • kristianp 1 year ago

      I wonder if Wikipedia's recent switch to client-side rendering has hurt their performance too. Serving a prerendered page might have helped this situation. I don't know the details of their new system though.

  • BoxFour 1 year ago

    Also helps save Wikipedia if it gets shut down - which might happen!

    • hsuduebc2 1 year ago

      True. Musk for example is publicly attacking it for spreading "left-wing lies" because in his wiki page there are statements like "He has been criticized for making unscientific and misleading statements, including COVID-19 misinformation and promoting conspiracy theories, and affirming antisemitic, racist, and transphobic comments." which are just pure facts.

      It would be nice to have something like this more decentralized.

      • AStonesThrow 1 year ago

        I was perusing some recent discussions on sources with interest. It seems that Wikipedia's intelligentsia have managed to "blacklist" (deprecate or declare "generally unreliable") practically every prominent source of news in the US that is not centrist or leftist.

        I kid you not; through a process of attrition they've attacked the very reliability and reputation of every source, including Fox News and the like, and they've told editors sitewide that they simply can't be cited as a "Reliable Secondary Source", like at all.

        I am not sure if that is an accurate assessment of the situation on the ground for mainstream media, but it certainly exposes some real systemic bias.

        And this is the highest-order and most enduring method of ingraining systemic bias in the project: by weeding out sources with unfavorable viewpoints and perspectives, saying they publish lies and untruth, and being able to prohibit them globally from any use.

        And I was pondering this state of affairs and just thinking about Karoline Leavitt's press room, and wondering what will the landscape be, if there is precious little intersection between press outlets who may be favorable or deferent to the present administration, and those which are allowed to be cited on Wikipedia? Ouch!

        • zzzeek 1 year ago

          It's not Wikipedia's fault that the vast majority of right wing media consists of pure propaganda, disinformation, and lies.

          • AStonesThrow 1 year ago

            [citation needed]

            And you know, I wouldn't be surprised if people hurling those accusations somehow believe that the lies and misinformation are one-sided and partisan. As if leftism has some sort of monopoly on Truth and Goodness bestowed from above.

            It's really been sickening to see the media outlets just lay down thick trails of bullshit that is designed to distract us, to instill fear, uncertainty, and doubt, to make us hate one another, to keep us hanging on that channel or that subscription for the next tidbit. It's disgusting and manipulative, and the Right has absolutely no monopoly on those tactics.

            Wikipedia is simply a microcosm of the prevailing zeitgeist, so they are as likely to cure systemic bias as a leopard can change its spots.

            • zzzeek 1 year ago

              Wait why just leftism, what happened to centrism ? Where'd the goalposts go ?

            • hn_acker 1 year ago

              This thread is about media organizations, but I think useful context is that in non-polarized situations conservatives and liberals are similarly likely to spread political misinformation while in polarized situations conservatives are relatively more likely to spread political misinformation [1].

              [1] https://journals.sagepub.com/doi/full/10.1177/00222429241264...

        • wyre 1 year ago

          [flagged]

        • hsuduebc2 1 year ago

          Your point is understandable regarding source bias, but in Musk's case, the statements "they" mentioned are simply true. While you definitely have a valid point about the risks of systemic bias in excluding certain outlets, relativizing factual accuracy could inadvertently lead to a situation where every lie becomes just another "valid opinion." A viewpoint can indeed be an opinion, but misinformation remains misinformation. Wikipedia should not become a space for free interpretation of reality.

          Just because one side happens to produce more misinformation doesn't mean these facts should be omitted. Consider this analogy: Stalin killed millions and was undeniably a tyrant, and even though the current Russian establishment might push a different narrative, it doesn't erase historical reality. Similarly, accurately documenting Musk's misleading statements isn't bias—it's factual reporting.

  • hagbard_c 1 year ago

    > who needs to download the whole Wikipedia

    Anyone who wants to have access while off-line, for whatever reason. This can range from something as simple as saving costs, through something more complicated like accessing content from regions with spotty and/or expensive connectivity (you're on a ship out of reach of shore-based mobile networks, you do not have access to Starlink or something similar, you're deep in the jungle, deep underground, etc.), to some prepper scenario where connectivity ends at the cave entry because the 'net has ceased to exist.

    I would like to have a less politically biased online encyclopedia for the latter scenario, it would be a shame to start a new society based on the same bad ideas which brought down the previous one. If ever a politically neutral LLM becomes available that'd be one of the first tasks I'd put it to: point out bias - any bias - in articles, encyclopedias and other 'sources' (yes, I know, WP is not an original source but for this purpose it is) of knowledge.

    • moritzwarhier 1 year ago

      Is there a "politically neutral" human? And if there was, what could that person reasonably say about politics?

      • amanaplanacanal 1 year ago

        I suspect "politically neutral" is a meaningless phrase. It's just a way for people to tar their political opponents by inference.

        The problem is: even if you report only facts, there is an editorial function in choosing which facts to report, because it is physically impossible to report all facts. So someone can always point to some sort of bias on choosing which facts to report.

        • tough 1 year ago

          And when editors have big ad spenders, you bet they won't criticize the hand that feeds them, most of the time

          • moritzwarhier 1 year ago

            But what to do about it, in your opinion? How to prevent people with malicious interests from editing?

      • hagbard_c 1 year ago

        There are no politically neutral humans but there can be politically neutral publications. All you have to do to be politically neutral is treat all legal political ideologies the same without favouring one over the others. Wikipedia does not achieve this goal, not by far.

    • hello_computer 1 year ago

      > based on the same bad ideas which brought down the previous one

      I don’t think that’s fair. Not that Wikipedia is without bias, but that their ivory tower biases are worlds apart from the lying brutal animalistic Hollywood signals herding the masses in “our democracy”.

    • DistractionRect 1 year ago

      Genuine question, can you provide multiple explicit examples of such bias? I heard a lot of people railing against bias in Wikipedia, but no one provides any blatant examples of it.

      • hagbard_c 1 year ago

        A genuine answer, how about looking up some studies on this subject? Not those done by Wikipedia of course, they claim to be politically neutral after all.

        Here's a few, from https://www.allsides.com/blog/wikipedia-biased

        Six studies, including two from Harvard researchers, have found a left-wing bias at Wikipedia:

        A 2024 analysis [1] by researcher David Rozado that used AllSides Media Bias Ratings [2] found Wikipedia associates right-of-center public figures with more negative sentiment than left-wing figures, and tends to associate left-leaning news organizations with more positive sentiment than right-leaning ones.

        A Harvard study [3] found Wikipedia articles are more left-wing than Encyclopedia Britannica.

        Another paper [4] from the same Harvard researchers found left-wing editors are more active and partisan on the site.

        A 2018 analysis [5] found top-cited news outlets on Wikipedia are mainly left-wing.

        Another analysis [6] using AllSides Media Bias Ratings found that pages on American politicians cite mostly left-wing news outlets.

        American academics found [7] conservative editors are 6 times more likely to be sanctioned in Wikipedia policy enforcement.

        There are far more sources out there.

        If I show examples of biased pages - the one on Antifa is a good example - this will just devolve into a quibble about this or that sentence.

        [1] https://davidrozado.substack.com/p/is-wikipedia-politically-...

        [2] https://www.allsides.com/media-bias/ratings

        [3] https://www.semanticscholar.org/paper/Do-Experts-or-Collecti...

        [4] https://www.hbs.edu/faculty/Publication%20Files/17-028_e7788...

        [5] https://archive.md/v4TFn

        [6] https://archive.is/dDr7X

        [7] https://thecritic.co.uk/the-left-wing-bias-of-wikipedia/

        • DistractionRect 1 year ago

          > A genuine answer, how about looking up some studies on this subject?

          I figured that since you had a strong opinion on the subject you probably had strong evidence and could steer us to more directed reading to understand your viewpoint. Certainly we all should investigate things for ourselves, but sometimes it helps to have a place to start. You've certainly given us plenty to read through and consider. I'll read it with an open mind. Some pre-reading thoughts that come to mind: is the citation bias proportional to factual accuracy (some outlets are more factually accurate than others, so one would expect to see them cited more often)? What's the distribution of the population of potentially citable sources (i.e. is the bias a reflection of the population, or selection bias)? Are editor sanctions selectively enforced, or are conservative editors more likely to engage in behavior that warrants sanctions?

          In other words, are we confusing correlation with causation? I don't know, I'll have to dig into the sources you provided and do my own research. I pose the questions now because they're the only thoughts I can contribute to the discussion at present.

    • ndriscoll 1 year ago

      You don't need to be deep in the jungle. You might just not want to pay for mobile data. If your phone has an SD card slot, you can put in 1 TB of storage and have wikipedia, a lifetime of music, tons of books, an atlas of your country for GPS navigation, and plenty of room for taking photos/videos. Storage is cheap enough that mobile data should be basically pointless.

    • parodysbird 1 year ago

      You have bad politics. This is bad politics.

      • hagbard_c 1 year ago

        No, you have bad politics.

        This is not kindergarten so let's not go down this path. Asking for a politically neutral (see my explanation elsewhere in this thread if you don't understand what that means) source of information is not 'bad politics' but intended to avoid bad politics. I suspect that you 'identify' as either 'liberal' or 'progressive' so I assume you'd be less than thrilled if Wikipedia had a conservative bias. The same goes for conservatives and (traditional) capital-L Liberals who are less than thrilled to see Wikipedia having a 'left-wing' or 'progressive' bias. It just makes WP end up being lumped together with the legacy media, known to be untrustworthy where it counts, and that is a shame for a site which in many ways still is a valuable resource as long as you avoid any and all subjects which have been pulled into the polarised political discourse.

        • MollyRealized 1 year ago

          I may be mistaken, but I think the person you are replying to was pretending to be the kind of AI you were speaking of.

  • verandaguy 1 year ago

    > who needs to download the whole Wikipedia

    Anyone archiving the site. Wikipedia is, for its faults, one of the best-curated collections of summarized human knowledge, probably in history.

    Replicating that knowledge helps build data resilience and protect it against all sorts of disasters. I used to seed their monthly data dump torrent for a while.

  • krick 1 year ago

    No idea when your pre-AI era began, but I was much more excited to host Wikipedia locally 15 years ago than I am now.

btbuildem 1 year ago

First time I visited this page was in January 2025