Microgrants/32GB usb stick for Commons dump

From Wikimedia UK


Overview

My bot, Faebot, has been a busy bot lately. As a result, the networked hard disk that holds my local dump of commonswiki.xml is accessed constantly by several threads running on my Mac mini. I would like to move the 22GB file to flash memory on a USB stick to avoid wear and tear on my hardware. At the moment, the largest USB stick I have to hand is 8GB, but the prices of 32GB sticks have fallen dramatically, so this seems realistic.

Bot owners are expected to use a local dump of commonswiki.xml for lengthy or complex bot work, to avoid putting a lot of transactions on the Wikimedia servers. Naturally this means that the stress of high volume transactions moves to your own home kit!
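As an illustration only (this is not Faebot's actual code), the kind of read-heavy work a local dump supports might look like the Python sketch below. It streams through a MediaWiki XML export one page at a time and yields the titles whose wikitext matches a regular expression. The function names, the sample titles, the sample namespace and the search pattern are all hypothetical, and the sketch assumes a dump holding one revision per page.

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def matching_titles(source, pattern):
    """Yield titles of pages whose wikitext matches `pattern`.

    `source` is a path or an open binary file in MediaWiki XML export
    format. Pages are streamed, so the whole dump is never held in memory.
    """
    regex = re.compile(pattern)
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and regex.search(text):
                yield title
            title, text = None, ""
            root.clear()  # drop pages already processed to keep memory flat


# Tiny hypothetical sample standing in for the real 22GB commonswiki.xml.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>File:Example bridge.jpg</title>
    <revision><text>{{Geograph}} [[Category:Bridges in London]]</text></revision></page>
  <page><title>File:Example hill.jpg</title>
    <revision><text>{{Geograph}} A hill.</text></revision></page>
</mediawiki>"""

hits = list(matching_titles(io.BytesIO(SAMPLE), r"\[\[Category:[^\]]*London"))
print(hits)  # ['File:Example bridge.jpg']
```

A scan like this only reads from the dump, so wherever the file is stored sees reads almost exclusively; writes happen only when the dump itself is refreshed.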

Budget

£16 for a small form-factor 32GB USB stick. See example supplier.

Timeline
  • Indefinitely. I would be unable to loan it out or use it on other machines, as it would be in constant bot use on my home Mac mini. If I stop running Wikimedia scripts that use such a dump, or the Commons dump gets too large for a 32GB stick, I would be happy to pass it on to another volunteer who has a use for it.
  • Progress will be seen on Commons:User:Faebot.
  • I would expect this 32GB of storage to be good at least throughout 2013. At the moment a reasonable XML dump of Commons image pages is running at 22GB; it may well exceed 32GB by 2014.
Expected outcomes
  • Reduce wear and tear on my home desktop drive, which currently puts my personal backup and archives at risk.
  • Enable Faebot to continue categorizing the 2 million+ UK Geograph images, plus the other odd tasks it gets up to within scope. See Commons:User:Faebot/Geograph for current projects, including sorting UK images by county/borough using OpenStreetMap data (probably a year of slow bot work).
Who I am

I am Fæ, I do a lot on Commons.
-- (talk) 19:38, 22 October 2012 (UTC)

Discussion

Hi Fæ. Thanks for submitting this microgrant application. :-) Since you're a current fellow trustee, I'm going to take a cautious approach to deciding on this application, and will ask a couple of other people to chime in prior to a decision being made. I hope that's OK with you.

With regard to the application, on a technical basis, I'm not sure that a flash drive is the best solution here, in terms of either reliability (cheap flash drives won't last long under rigorous usage) or access speed (10MB/s read isn't particularly quick). Have you considered a small USB2 or FireWire external hard drive? With regard to ownership, I note that the drive would belong to WMUK and that it should be returned to the office once you've finished using it, either for recycling or reuse. Would that be OK with you? Thanks. Mike Peel (talk) 20:34, 22 October 2012 (UTC)

Hi both, you'll find that the newer cheap USB drives have a very limited life in terms of the number of write cycles they will sustain. But for around £16 for 32GB, you can almost start to think about them as solid state and much faster versions of DVD-R - i.e. consumables. I can't see any reason not to accept this, with the usual caveat that we are not to be seen to favour a trustee. Naturally, I'd be happy to lend my endorsement to other volunteers who were doing similar work and had similar needs in future. --RexxS (talk) 20:43, 22 October 2012 (UTC)
I have the feeling that this is a well documented and supported request. I see good arguments why this might be a good idea (it would help Faebot in its work), I see no immediate downsides, and the requested amount falls within the limits of the program. If the technical alternative that Mike suggests is significantly better and not much more expensive, I would prefer that. Volunteers should get the means to do their work as well as possible, which is the exact reason for this program. I realize Fae is a trustee, but while it is not allowed to give trustees an advantage in these programs, I see no benefit in putting them at a disadvantage. Lodewijk (talk) 20:51, 22 October 2012 (UTC)
WRT FireWire, my Mac mini has a FireWire interface (that I have never used), but I'm not sure the advantage would be enormous, as the limitation is probably processing power and internet connections rather than transfer speed. For example, a simple bit of regex going through the full 14 million+ pages in the XML data seems to take 2 days, sometimes a lot less depending on the query. As another example, my current beta-test London borough categorization script takes around 5 to 15 seconds per Commons image page, the limiting factor being the XML queries to the Commons and OSM databases rather than the preparation done by looking at the local cache. So the bottleneck is probably not going to be the USB stick going via a USB 2.0 connection. Generally speed is not really the limiting factor here: I kick off processes and let them run in the background, and normally don't care too much if they take 30 minutes or are left to run overnight. This request is more about moving to flash and saving on hard disk transactions. I think a USB stick would take a year of such use to develop read/write problems, but even then, as Rexx points out, if I plan to wear out USB sticks, this is far more economical than the alternative of wearing out hard disk drives. BTW, it is a little ironic that we have spent rather more than £16 of volunteer value talking about the options, but I'm aware of the principles here; it's a pity I can't think of valid reasons for Wikimedia UK to buy the trustees decent laptops. :-)
BTW, this 32GB stick would be mostly read from, with few writes. I only update the 22GB commonswiki file monthly or less often (as I have to download it on my ghastly internet connection), and I'm not planning on having more than a handful of other writes to the stick per week. I would hazard a guess that this makes the life expectancy much better than a year. -- (talk) 22:06, 22 October 2012 (UTC)
From experience of hammering USB sticks: they wear out very quickly under constant use. But £16 is not much for a replacement. Bear in mind, though, that other factors might affect your disk. Depending on how your program works, it may still end up using swap space on the HDD to process such a large cache, which defeats the object :) All that aside, this is a very low value request, so I don't think it needs so much scrutiny! I know everyone is being really cautious at the moment (not a bad thing) but it's not exactly expensive. Hell, Fæ, I have a number of 32GB or greater USB sticks sat on my desk that you would be welcome to. --ErrantX (talk) 23:01, 22 October 2012 (UTC)

Thanks all. I'm happy to approve this microgrant. Mike Peel (talk) 20:14, 23 October 2012 (UTC)