-
Aphyr and Peter Bailis collect an authoritative list of known network partition and outage cases from published post-mortem data:
This post is meant as a reference point — to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments. Processes, servers, NICs, switches, local and wide area networks can all fail, and the resulting economic consequences are real. Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance. The consequences of these outages range from increased latency and temporary unavailability to inconsistency, corruption, and data loss. Split-brain is not an academic concern: it happens to all kinds of systems — sometimes for days on end. Partitions deserve serious consideration.
I honestly cannot understand people who didn’t think this was the case. 3 years reading (and occasionally auto-cutting) Amazon’s network-outage tickets as part of AWS network monitoring will do that to you I guess ;)(tags: networking outages partition cap failure fault-tolerance)
-
from Atelier Olschinsky. ‘Fine Art Print on Hahnemuehle Photo Rag Bright White 310g; Limited Edition / Numbered and signed by the artist’
Justin's Linklog Posts
incompetent error-handling code in the mongo-java-driver project
an unexplained invocation of Math.random() in the exception handling block of this MongoDB java driver class causes roflscale lols in the github commit notes. http://stackoverflow.com/a/16833798 has more explanation.
(tags: github commits mongodb webscale roflscale random daily-wtf wtf)
-
‘What is a Hermetic Server? The short definition would be a “server in a box”. If you can start up the entire server on a single machine that has no network connection AND the server works as expected, you have a hermetic server! This is a special case of the more general “hermetic” concept which applies to an isolated system not necessarily on a single machine. Why is it useful to have a hermetic server? Because if your entire [system under test] is composed of hermetic servers, it could all be started on a single machine for testing; no network connection necessary! The single machine could be a physical or virtual machine.’ These also qualify as “fakes”, using the terminology Martin Fowler suggests at http://martinfowler.com/bliki/TestDouble.html , I think
(tags: google testing hermetic-servers test test-doubles unit-testing)
-
hooray, sanity from the Google Testing blog. this has been a major cause of pain in the past, dealing with tricky rewrites of mock-heavy unit test code
Casalattico – Wikipedia, the free encyclopedia
How wierd. Many of the well-known chippers in Ireland are run by families from the same comune in Italy.
In the late 19th and early 20th century a significant number of young people left Casalattico to work in Ireland, with many founding chip shops there. Most second, third and fourth generation Irish-Italians can trace their lineage back to the municipality, with names such as Magliocco, Fusco, Marconi, Borza, Macari, Rosato and Forte being the most common. Although the Forte family actually originates from the village of Mortale, renamed Mon Forte due to the achievements of the Forte family. It is believed that up to 8,000 Irish-Italians have ancestors from Casalattico. The village is home to an Irish festival every summer to celebrate the many families that moved from there to Ireland.
(via JK)(tags: rome lazio italy ireland chip-shops chippers history emigration casalattico work irish-italians via:jk)
Videos from the Continuous Delivery track at QCon SF 2012
Think we’ll be watching some of these in work soon — Jez Humble’s talk (the last one) in particular looks good:
Amazon, Etsy, Google and Facebook are all primarily software development shops which command enormous amounts of resources. They are, to use Christopher Little’s metaphor, unicorns. How can the rest of us adopt continuous delivery? That’s the subject of my talk, which describes four case studies of organizations that adopted continuous delivery, with varying degrees of success. One of my favourites – partly because it’s embedded software, not a website – is the story of HP’s LaserJet Firmware team, who re-architected their software around the principles of continuous delivery. People always want to know the business case for continuous delivery: the FutureSmart team provide one in the book they wrote that discusses how they did it.
(tags: continuous-integration continuous-delivery build release process dev deployment videos qcon towatch hp)
_Dynamic Histograms: Capturing Evolving Data Sets_ [pdf]
Currently, histograms are static structures: they are created from scratch periodically and their creation is based on looking at the entire data distribution as it exists each time. This creates problems, however, as data stored in DBMSs usually varies with time. If new data arrives at a high rate and old data is likewise deleted, a histogram’s accuracy may deteriorate fast as the histogram becomes older, and the optimizer’s effectiveness may be lost. Hence, how often a histogram is reconstructed becomes very critical, but choosing the right period is a hard problem, as the following trade-off exists: If the period is too long, histograms may become outdated. If the period is too short, updates of the histogram may incur a high overhead. In this paper, we propose what we believe is the most elegant solution to the problem, i.e., maintaining dynamic histograms within given limits of memory space. Dynamic histograms are continuously updateable, closely tracking changes to the actual data. We consider two of the best static histograms proposed in the literature [9], namely V-Optimal and Compressed, and modify them. The new histograms are naturally called Dynamic V-Optimal (DVO) and Dynamic Compressed (DC). In addition, we modified V-Optimal’s partition constraint to create the Static Average-Deviation Optimal (SADO) and Dynamic Average-Deviation Optimal (DADO) histograms.
(via d2fn)(tags: via:d2fn histograms streaming big-data data dvo dc sado dado dynamic-histograms papers toread)
How I decoded the human genome – Salon.com
classic long-read article from John Sundman: ‘We are becoming the masters of our own DNA. But does that give us the right to decide that my children should never have been born?’ part two at http://www.salon.com/2003/10/22/genome_two/
(tags: human genome genomics eugenics politics life john-sundman disability health dna medicine salon long-reads children)
The “Meme Hustler” hustler: Evgeny Morozov’s Stupid Talk about Tim O’Reilly
great long-read blog post from John Sundman debunking Evgeny Morozov’s takedown of Tim O’Reilly
(tags: debunking john-sundman evgeny-morozov tim-oreilly tech technological-solutionism futurism writing silicon-valley utopianism open-source oss)
Strange Passion Presents Chant Chant Chant, Choice & SM Corporation live
‘We are delighted to announce, for one night only, 3 legendary Irish Post Punk bands performing live in Dublin after a 30 year hiatus. This follows on from the critically acclaimed release of the Strange Passion Irish Post Punk compilation in 2012. Post punk legends Chant Chant Chant will perform along with electronic music pioneers Choice and SM Corporation. ‘
(tags: choice music ireland post-punk electronic dublin strange-passion gigs)
‘Mythbusting Modern Hardware to gain “Mechanical Sympathy”‘ [slides]
Martin Thompson’s latest talk — taking a few common concepts about modern hardware performance and debunking/confirming them, mythbusters-style
(tags: mythbusters hardware mechanical-sympathy martin-thompson java performance cpu disks ssd)
High home ownership can seriously damage labor market, new study suggests
Interesting — a healthy rental market is needed to allow sufficient labour mobility. This matches what I heard and saw from friends and coworkers in the US, anecdotally
Concert Industry Struggles With ‘Bots’ That Siphon Off Tickets – NYTimes.com
Bots now buying more than 60% of tickets, one group requesting up to 200,000 per day; bot writers now charging $14 per 10k captchas (via Shane Naughton)
(tags: ticketmaster scalping tickets via:shane-naughton bots captchas abuse)
Instant artist statement: Arty Bollocks Generator
‘My work explores the relationship between the body and vegetarian ethics. With influences as diverse as Munch and Francis Bacon, new synergies are created from both orderly and random narratives. Ever since I was a postgraduate I have been fascinated by the essential unreality of the moment. What starts out as undefined soon becomes corroded into a hegemony of greed, leaving only a sense of failing and the chance of a new order. As temporal replicas become transformed through diligent and undefined practice, the viewer is left with an impression of the darkness of our culture.’
(tags: funny humor art arty bollocks generator hacks via:leroideplywood)
Communication costs in real-world networks
Peter Bailis has generated some good real-world data about network performance and latency, measured using EC2 instances, between ec2 regions, between zones, and between hosts in a single AZ. good data (particularly as I was looking for this data in a public source not too long ago).
I wasn’t aware of any datasets describing network behavior both within and across datacenters, so we launched m1.small Amazon EC2 instances in each of the eight geo-distributed “Regions,” across the three us-east “Availability Zones” (three co-located datacenters in Virginia), and within one datacenter (us-east-b). We measured RTTs between hosts for a week at a granularity of one ping per second.
Some of the high-percentile measurements are undoubtedly impact of host and VM behaviour, but that is still good data for a typical service built in EC2.(tags: networks performance measurements benchmarks ops ec2 networking internet az latency)
Reducing MongoDB traffic by 78% with Redis | Crashlytics Blog
One for @roflscaletips. Crashlytics reduce MongoDB load by hacking in some hand-coded caching into their Rails app, instead of just using a front-line HTTP cache to reduce Rails *and* db load. duh. (via Oisin)
(tags: crashlytics fail roflscale rails caching redis ruby via:oisin)
Display Hidden Files in OS X Open and Save Dialog Boxes
yet another laughable UI kludge in OS X. ridiculous
(tags: usability osx apple ui kludges hidden-files dot-files command-shift-option-elbow magic)
-
“Dear Mr Tilman, this is the only way I can help you. saluti, Giorgio Moroder”. I love it — someone call Tufte
(tags: graphics giorgio-moroder history music ilx basslines donna-summer synths)
Hollywood Studios [attempt to censor] Pirate Bay Documentary
Probably not deliberate, but pretty damn inept.
Over the past weeks several movie studios have been trying to suppress the availability of TPB-AFK [the Pirate Bay documentary] by asking Google to remove links to the documentary from its search engine. The links are carefully hidden in standard DMCA takedown notices for popular movies and TV-shows. The silent attacks come from multiple Hollywood sources including Viacom, Paramount, Fox and Lionsgate and are being sent out by multiple anti-piracy outfits. Fox, with help from six-strikes monitoring company Dtecnet, asked Google to remove a link to TPB-AFK on Mechodownload. Paramount did the same with a link on the Warez.ag forums. Viacom sent at least two takedown requests targeting links to the Pirate Bay documentary on Mrworldpremiere and Rapidmoviez. Finally, Lionsgate jumped in by asking Google to remove a copy of TPB-AFK from a popular Pirate Bay proxy.
(tags: funny inept hollywood lionsgate fox viacom paramount dtecnet tpb-afk piratebay piracy copyright movies google)
Flashback: How Yahoo Killed Flickr and Lost the Internet
This is about the best tech journalism I’ve ever read on Flickr. nice one Mat Honan
(tags: gizmodo flickr acquisition mergers yahoo corporate-culture mat-honan tech journalism)
Resisting the lure of the Freeman movement | Workers Solidarity Movement
An anarchist critique of the Freeman movement from the WSM:
This has been a very brief overview of the Freeman movement that has tried to capture with broad strokes its nature and possible responses. There is room for much more work, including a more in-depth analysis of the various flaws in the approach to the law. The greatest danger however is allowing a movement to develop within anarchist circles that ignores the principle of mutual aid and implicitly promotes private ownership of resources, that by granting absolute right to individuals gives them the ability to ignore their responsibilities to the wider community and ecology that sustains them. In more traditional terms, the movement is one all about negative freedoms, ignoring positive freedom as a concept.
(tags: anarchism freeman-on-the-land politics ireland law wsm)
The Reactionary ‘Freeman-?on-?the-?land’ and a Political Fracture
Another leftie view on the Freeman movement
(tags: freeman-on-the-land politics ireland left-wing anarchism law)
-
Well, apparently tomorrow, but close enough. Happy birthday to bradfitz’ greatest creation and its wonderful slab allocator!
(tags: birthdays code via:alex-popescu open-source history malloc memory caching memcached)
Newegg nukes “corporate troll” Alcatel in third patent appeal win this year
I am loving this. Particularly this:
At trial in East Texas Cheng took the stand to tell Newegg’s story. Alcatel-Lucent’s corporate representative, at the heart of its massive licensing campaign, couldn’t even name the technology or the patents it was suing Newegg over. “Successful defendants have their litigation managed by people who care,” said Cheng. “For me, it’s easy. I believe in Newegg, I care about Newegg. Alcatel Lucent, meanwhile, they drag out some random VP—who happens to be a decorated Navy veteran, who happens to be handsome and has a beautiful wife and kids—but the guy didn’t know what patents were being asserted. What a joke.” “Shareholders of public companies that engage in patent trolling should ask themselves if they’re really well-served by their management teams,” Cheng added. “Are they properly monetizing their R&D? Surely there are better ways to make money than to just rely on litigating patents. If I was a shareholder, I would take a hard look as to whether their management was competent.”
(tags: patents ip swpats alcatel bell-labs newegg east-texas litigation lucent)
Call me maybe: Carly Rae Jepsen and the perils of network partitions
Kyle “aphyr” Kingsbury expands on his slides demonstrating the real-world failure scenarios that arise during some kinds of partitions (specifically, the TCP-hang, no clear routing failure, network partition scenario). Great set of blog posts clarifying CAP
(tags: distributed network databases cap nosql redis mongodb postgresql riak crdt aphyr)
-
Welcome to the Galapagos of Chinese “open” source. I call it “gongkai” (??). Gongkai is the transliteration of “open” as applied to “open source”. I feel it deserves a term of its own, as the phenomenon has grown beyond the so-called “shanzhai” (??) and is becoming a self-sustaining innovation ecosystem of its own. Just as the Galapagos Islands is a unique biological ecosystem evolved in the absence of continental species, gongkai is a unique innovation ecosystem evolved with little western influence, thanks to political, language, and cultural isolation. Of course, just as the Galapagos was seeded by hardy species that found their way to the islands, gongkai was also seeded by hardy ideas that came from the west. These ideas fell on the fertile minds of the Pearl River delta, took root, and are evolving. Significantly, gongkai isn’t a totally lawless free-for-all. It’s a network of ideas, spread peer-to-peer, with certain rules to enforce sharing and to prevent leeching. It’s very different from Western IP concepts, but I’m trying to have an open mind about it.
(tags: gongkai bunnie-huang china phone mobile hardware devices open-source)
Stability Patterns and Antipatterns [slides]
Michael “Release It!” Nygard’s slides from a recent O’Reilly event, discussing large-scale service reliability design patterns
(tags: michael-nygard design-patterns architecture systems networking reliability soa slides pdf)
Deep In The Game: Not The RTE Guide
Good interview with Alan Maguire, the satirist behind the very funny @NotTheRTEGuide on Twitter:
I’ve always been a huge fan of TV Go Home and Charlie Brooker in general and it seemed like Irish TV and culture was a good target for the kind of barbed surrealism that he does. (I’m not claiming I’m in his league or anything but he’s the main influence). I was really surprised that there hadn’t been a parody RTÉ Guide already. TV listings are 140-ish characters already and the RTÉ Guide has a kind of weird place in Irish culture where everybody knows it but nobody our age really has any idea of what’s in it anymore. We associate it with a small-c conservatism, or I did at least and I play that up occasionally with the account.
(tags: nottherteguide rte rte-guide ireland funny satire interviews)
-
‘based on my observations while I was a Site Reliability Engineer at Google.’ – by Rob Ewaschuk; very good, and matching the similar recommendations and best practices at Amazon for that matter
(tags: monitoring ops devops alerting alerts pager-duty via:jk)
Monitoring the Status of Your EBS Volumes
Page in the AWS docs which describes their derived metrics and how they are computed — these are visible in the AWS Management Console, and alarmable, but not viewable in the Cloudwatch UI. grr. (page-joshea!)
(tags: ebs aws monitoring metrics ops documentation cloudwatch)
Interpol filter scope creep: ASIC ordering unilateral website blocks
Bloody hell. This is stupidity of the highest order, and a canonical example of “filter creep” by a government — secret state censorship of 1200 websites due to a single investment scam site.
The Federal Government has confirmed its financial regulator has started requiring Australian Internet service providers to block websites suspected of providing fraudulent financial opportunities, in a move which appears to also open the door for other government agencies to unilaterally block sites they deem questionable in their own portfolios. The instrument through which the ISPs are blocking the Interpol list of sites is Section 313 of the Telecommunications Act. Under the Act, the Australian Federal Police is allowed to issue notices to telcos asking for reasonable assistance in upholding the law. […] Tonight Senator Conroy’s office revealed that the incident that resulted in Melbourne Free University and more than a thousand other sites being blocked originated from a different source — financial regulator the Australian Securities and Investment Commission. On 22 March this year, ASIC issued a media release warning consumers about the activities of a cold-calling investment scam using the name ‘Global Capital Wealth’, which ASIC said was operating several fraudulent websites — www.globalcapitalwealth.com and www.globalcapitalaustralia.com. In its release on that date, ASIC stated: “ASIC has already blocked access to these websites.”
(tags: scams australia filtering filter-creep false-positives isps asic fraud secrecy)
Obfuscatory pie-chart from Garda penalty-points corruption report
“Twitter / gavinsblog: For sake of clarity here is helpful pie chart of the 95.4% of fixed charge notices not terminated #missingthepoint” Paging Edward Tufte: classic example of an obfuscatory pie-chart, diagramming the wrong thing misleadingly. By presenting it like this, it appears that the 95.4% of cases where fixed charge notices were issued by the guards are relevant to the discussion of the other classes; in reality, that means that 4.6% of cases, 37,000 cases, were terminated, some for good reasons, others for not, and it’s the difference between those two classes that are relevant. In my opinion, 2 separate pie charts would be better; one to show the dismissed-versus-undismissed count (which IMO could have been omitted entirely), and one to show the good-vs-not-so-good termination reason counts (which is the meat of the issue).
(tags: dataviz visualisation data obfuscation gardai police corruption penalty-points)
Berkeley DB Java Edition Architecture [PDF]
background white paper on the BDB-JE innards and design, from 2006. Still pretty accurate and good info
(tags: bdb-je java berkeley-db bdb design databases pdf white-papers trees)
-
This Court has developed a new awareness and understanding of a category of vexatious litigant. As we shall see, while there is often a lack of homogeneity, and some individuals or groups have no name or special identity, they (by their own admission or by descriptions given by others) often fall into the following descriptions: Detaxers; Freemen or Freemen-on-the-Land; Sovereign Men or Sovereign Citizens; Church of the Ecumenical Redemption International (CERI); Moorish Law; and other labels – there is no closed list. In the absence of a better moniker, I have collectively labelled them as Organized Pseudolegal Commercial Argument litigants [“OPCA litigants”], to functionally define them collectively for what they literally are. These persons employ a collection of techniques and arguments promoted and sold by ‘gurus’ (as hereafter defined) to disrupt court operations and to attempt to frustrate the legal rights of governments, corporations, and individuals. Over a decade of reported cases have proven that the individual concepts advanced by OPCA litigants are invalid. What remains is to categorize these schemes and concepts, identify global defects to simplify future response to variations of identified and invalid OPCA themes, and develop court procedures and sanctions for persons who adopt and advance these vexatious litigation strategies. One participant in this matter […] appears to be a sophisticated and educated person, but is also an OPCA litigant. One of the purposes of these Reasons is, through this litigant, to uncover, expose, collate, and publish the tactics employed by the OPCA community, as a part of a process to eradicate the growing abuse that these litigants direct towards the justice and legal system we otherwise enjoy in Alberta and across Canada. I will respond on a point-by-point basis to the broad spectrum of OPCA schemes, concepts, and arguments advanced in this action by [him].
Via Ronan Lupton(tags: via:ronanlupton law canada legal freeman opca court tax judgements)
-
This classic came up in discussions yesterday…
In the Linux Kernel community Rusty Russell came up with a API rating scheme to help us determine if our API is sensible, or not. It’s a rating from -10 to 10, where 10 is perfect is -10 is hell. Unfortunately there are too many examples at the wrong end of the scale.
(tags: rusty-russell quality coding kernel linux apis design code-reviews code)
-
hooray! Command-line gmailish goodness returns. And with a signed gem, to boot
Martin Thompson, Luke “Snabb Switch” Gorrie etc. review the C10M presentation from Schmoocon
on the mechanical-sympathy mailing list. Some really interesting discussion on handling insane quantities of TCP connections using low volumes of hardware:
This talk has some good points and I think the subject is really interesting. I would take the suggested approach with serious caution. For starters the Linux kernel is nowhere near as bad as it made out. Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning. I’ve heard they have since taken this to over 5 million concurrent connections. BTW Open Onload is an open source implementation. Writing a network stack is a serious undertaking. In a previous life I wrote a network probe and had to reassemble TCP streams and kept getting tripped up by edge cases. It is a great exercise in data structures and lock-free programming. If you need very high-end performance I’d talk to the Solarflare or Mellanox guys before writing my own. There are some errors and omissions in this talk. For example, his range of ephemeral ports is not quite right, and atomic operations are only 15 cycles on Sandy Bridge when hitting local cache. A big issue for me is when he defined C10M he did not mention the TIME_WAIT issue with closing connections. Creating and destroying 1 million connections per second is a major issue. A protocol like HTTP is very broken in that the server closes the socket and therefore has to retain the TCB until the specified timeout occurs to ensure no older packet is delivered to a new socket connection.
(tags: mechanical-sympathy hardware scaling c10m tcp http scalability snabb-switch martin-thompson)
-
This program creates an EBS snapshot for an Amazon EC2 EBS volume. To help ensure consistent data in the snapshot, it tries to flush and freeze the filesystem(s) first as well as flushing and locking the database, if applicable. Filesystems can be frozen during the snapshot. Prior to Linux kernel 2.6.29, XFS must be used for freezing support. While frozen, a filesystem will be consistent on disk and all writes will block. There are a number of timeouts to reduce the risk of interfering with the normal database operation while improving the chances of getting a consistent snapshot. If you have multiple EBS volumes in a RAID configuration, you can specify all of the volume ids on the command line and it will create snapshots for each while the filesystem and database are locked. Note that it is your responsibility to keep track of the resulting snapshot ids and to figure out how to put these back together when you need to restore the RAID setup.
Handy!(tags: ubuntu ec2 aws linux ebs snapshots ops tools alestic)
Measuring & Optimizing I/O Performance
Another good writeup on iostat and EBS, from Ilya Grigorik
(tags: io optimization sysadmin performance iostat ebs aws ops)
AWS forum post on interpreting iostat output for EBS
Great post from AndrewC@EBS on interpreting iostat output on EBS volumes — from 2009, but still looks reasonable enough
Operations is Dead, but Please Don’t Replace it with DevOps
This is so damn spot on.
Functional silos (and a standalone DevOps team is a great example of one) decouple actions from responsibility. Functional silos allow people to ignore, or at least feel disconnected from, the consequences of their actions. DevOps is a cultural change that encourages, rewards and exposes people taking responsibility for what they do, and what is expected from them. As Werner Vogels from Amazon Web Services says, “you build it, you run it”. So a “DevOps team” is a risky and ultimately doomed strategy. Sure there are some technical roles, specifically related to the enablement of DevOps as an approach and these roles and tools need to be filled and built. Self service platforms, collaboration and communication systems, tool chains for testing, deployment and operations are all necessary. Sure someone needs to deliver on that stuff. But those are specific technical deliverables and not DevOps. DevOps is about people, communication and collaboration. Organizations ignore that at their peril.
(tags: devops teams work ops silos collaboration organisations)
Universal Music Group adding audible “watermarks”
including on paid-for, losslessly-compressed digital audio music files:
Why isn’t UMG’s watermark talked about more? Maybe people think the audio quality problems are due to some kind of lossy compression, as I did, and ignore it completely, or blame the streaming service/distributor. The problem here is that the UMG watermark degrades the audio to about the equivalent of a 96 kbit MP3. My guess is that if consumers were informed about what is going on, they would care. Especially those who pay full retail price for digital downloads advertised as lossless audio.
(tags: lame audio drm media music umg universal watermarks noise consumer mp3)
“Call Me Maybe: Carly Rae Jepsen and the Perils of Network Partitions”
Aphyr’s epic RICON talk, exploring distributed-database failure modes through music. and what a lot of fail there is! Bottom line: CRDTs win
(tags: crdts data-structures storage ricon apyhr failures network partitions puns slides)
Cloudera Impala 1.0: It’s Here, It’s Real, It’s Already the Standard for SQL on Hadoop
we are proud to announce the first production drop of Impala, which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.
Along with some great benchmark numbers against Hive. nifty stuff(tags: cloudera impala sql querying etl olap hadoop analytics business-intelligence reports)
Alex Feinberg’s response to Damien Katz’ anti-Dynamoish/pro-Couchbase blog post
Insightful response, worth bookmarking. (the original post is at http://damienkatz.net/2013/05/dynamo_sure_works_hard.html ).
while you are saving on read traffic (online reads only go to the master), you are now decreasing availability (contrary to your stated goal), and increasing system complexity. You also do hurt performance by requiring all writes and reads to be serialized through a single node: unless you plan to have a leader election whenever the node fails to meet a read SLA (which is going to result a disaster — I am speaking from personal experience), you will have to accept that you’re bottlenecked by a single node. With a Dynamo-style quorum (for either reads or writes), a single straggler will not reduce whole-cluster latency. The core point of Dynamo is low latency, availability and handling of all kinds of partitions: whether clean partitions (long term single node failures), transient failures (garbage collection pauses, slow disks, network blips, etc…), or even more complex dependent failures. The reality, of course, is that availability is neither the sole, nor the principal concern of every system. It’s perfect fine to trade off availability for other goals — you just need to be aware of that trade off.
(tags: cap distributed-databases databases quorum availability scalability damien-katz alex-feinberg partitions network dynamo riak voldemort couchbase)
CAP Confusion: Problems with ‘partition tolerance’
Another good clarification about CAP which resurfaced during last week’s discussion:
So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition […]: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.
(sorry, catching up on old interesting things posted last week…)(tags: failure scalability network partitions cap quorum distributed-databases fault-tolerance)
Big-O Algorithm Complexity Cheat Sheet
nicely done, very readable
(tags: algorithms reference cheat-sheet big-o complexity estimation coding)
Did Conroy’s AFP filter wrongly block 1,200 sites?
Looks like many Aussie network operators were legally required to block 1,200 websites (presumably, one target and 1199 false positives), in secret. Quoting http://lists.ausnog.net/pipermail/ausnog/2013-April/017993.html : “You get a notice to block. You block or either get fined, go to jail or lose your carrier licence. It is a blunt instrument and it is a condition of being at ‘the big boys table’ i.e. you’re a carrier or a carriage service provider.”
(tags: australia law afp filtering internet blocking censorship secret eff)
Making sense out of BDB-JE fast stats
good info on the system metrics recorded by BDB-JE’s EnvironmentStats code, particularly where cache and cleaner activity are concerned. Particularly useful for Voldemort
(tags: voldemort caching bdb bdb-je storage tuning ops metrics reference)
Approximate Heavy Hitters -The SpaceSaving Algorithm
nice, readable intro to SpaceSaving (which I’ve linked to before) — a simple stream-processing cardinality top-K estimation algorithm with bounded error.
(tags: algorithms coding space-saving cardinality streams stream-processing estimation)
Darach Ennis on CEP, Stream Processing, Messaging, OOP vs Functional Architecture
good interview — lots of food for thought!
(tags: darach-ennis stream-processing messaging architecture qcon interviews erlang cep realtime rx comet events)
One Year Later, the Results of Tor Books UK Going DRM-Free
As it is, we’ve seen no discernible increase in piracy on any of our titles, despite them being DRM-free for nearly a year.
Understanding Elastic Block Store Availability and Performance [slides]
fantastic in-depth presentation on EBS usage; lots of good advice here if you’re using EBS volumes with/without PIOPS
(tags: piops ebs performance aws ec2 ops storage amazon presentations)
-
Github get good results using Judy arrays to replace a Ruby hash. However: the whole blog post is a bit dodgy to me. It feels like there are much better ways to fix the problem: 1. the big one: don’t do GC-heavy activity in the front-end web servers. Split that language-classification code into a separate service. Write its results to a cache and don’t re-query needlessly. 2. why isn’t this benchmarked against a C/C++ hash? it’s only 36000 entries, loaded once at startup. lookups against that should be blisteringly fast even with the basic data structures, and that would also be outside the Ruby heap so avoid the GC overhead. Feels like the use of a Judy array was a “because I want to” decision. 3. personally, I’d have preferred they spend time fixing their uptime problems…. See also https://news.ycombinator.com/item?id=5639013 for more kvetching.
(tags: ruby github gc judy-arrays linguist hashes data-structures)
-
Mozilla’s experience with Kanban. We’ve had good results in Amazon, too. good intro links in this post — might start talking about it in Swrve…
(tags: kanban scheduling team agile mozilla)
Secret Bitcoin mining code added to game sparks outrage
Thunberg’s admission that [the E-Sports Entertainment Association client software] ran Bitcoin-mining software without explicit user consent is startling. Aside from potentially opening the company up to huge legal liability, the move is likely to engender distrust among some of the company’s most loyal fans. The nonchalance of some of Thunberg’s comments may only add insult to the betrayal many users are likely to feel. “But for the record, I told jag he shouldn’t be lazy and run the miner in a separate process,” he wrote in a post, referring to one of his software engineers with the screen name Jaguar, who didn’t take steps to conceal the Bitcoin miner. “Rookie move.” In the later post he wrote: “100% of the funds are going into the s14 prize pot, so at the very least your melted gpus contributed to a good cause.”
Gap’s application of Knockout.js and the MVVM model
Interesting, first time I’d heard of it; the Model-View-View Model pattern.
(tags: mvvm architecture javascript web ui knockout-js martin-fowler json)
-
very nice single-purpose site — figure out who represents any given Irish postal address
Lectures in Advanced Data Structures (6.851)
Good lecture notes on the current state of the art in data structure research.
Data structures play a central role in modern computer science. You interact with data structures even more often than with algorithms (think Google, your mail server, and even your network routers). In addition, data structures are essential building blocks in obtaining efficient algorithms. This course covers major results and current directions of research in data structures: TIME TRAVEL We can remember the past efficiently (a technique called persistence), but in general it’s difficult to change the past and see the outcomes on the present (retroactivity). So alas, Back To The Future isn’t really possible. GEOMETRY When data has more than one dimension (e.g. maps, database tables). DYNAMIC OPTIMALITY Is there one binary search tree that’s as good as all others? We still don’t know, but we’re close. MEMORY HIERARCHY Real computers have multiple levels of caches. We can optimize the number of cache misses, often without even knowing the size of the cache. HASHING Hashing is the most used data structure in computer science. And it’s still an active area of research. INTEGERS Logarithmic time is too easy. By careful analysis of the information you’re dealing with, you can often reduce the operation times substantially, sometimes even to constant. We will also cover lower bounds that illustrate when this is not possible. DYNAMIC GRAPHS A network link went down, or you just added or deleted a friend in a social network. We can still maintain essential information about the connectivity as it changes. STRINGS Searching for phrases in giant text (think Google or DNA). SUCCINCT Most “linear size” data structures you know are much larger than they need to be, often by an order of magnitude. Some data structures require almost no space beyond the raw data but are still fast (think heaps, but much cooler).
(via Tim Freeman)(tags: data-structures lectures mit video data algorithms coding csail strings integers hashing sorting bst memory)
Older Is Wiser: Study Shows Software Developers’ Skills Improve Over Time
At least in terms of StackOverflow rep:
For the first part of the study, the researchers compared the age of users with their reputation scores. They found that an individual’s reputation increases with age, at least into a user’s 40s. There wasn’t enough data to draw meaningful conclusions for older programmers. The researchers then looked at the number of different subjects that users asked and answered questions about, which reflects the breadth of their programming interests. The researchers found that there is a sharp decline in the number of subjects users weighed in on between the ages of 15 and 30 – but that the range of subjects increased steadily through the programmers’ 30s and into their early 50s. Finally, the researchers evaluated the knowledge of older programmers (ages 37 and older) compared to younger programmers (younger than 37) in regard to relatively recent technologies – meaning technologies that have been around for less than 10 years. For two smartphone operating systems, iOS and Windows Phone 7, the veteran programmers had a significant edge in knowledge over their younger counterparts. For every other technology, from Django to Silverlight, there was no statistically significant difference between older and younger programmers. “The data doesn’t support the bias against older programmers – if anything, just the opposite,” Murphy-Hill says.
Damn right ;)(tags: coding age studies software work stack-overflow ncsu knowledge skills life)
-
Test Double is a generic term for any case where you replace a production object for testing purposes. There are various kinds of double that Gerard lists: Dummy objects are passed around but never actually used. Usually they are just used to fill parameter lists. Fake objects actually have working implementations, but usually take some shortcut which makes them not suitable for production (an InMemoryTestDatabase is a good example). Stubs provide canned answers to calls made during the test, usually not responding at all to anything outside what’s programmed in for the test. Spies are stubs that also record some information based on how they were called. One form of this might be an email service that records how many messages it was sent. Mocks are pre-programmed with expectations which form a specification of the calls they are expected to receive. They can throw an exception if they receive a call they don’t expect and are checked during verification to ensure they got all the calls they were expecting.
(tags: test-doubles naming patterns tdd testing mocking tests martin-fowler)
Limerick-Tralee walking/cycling route blocked by farmers
Oh for god’s sake. I know a few people who’ve made a trip to Mayo explicitly because the Greenway was there to visit. This is shocking, backwards stuff:
The success of [Mayo’s] Great Western Greenway [trail] has overtaken that of others, such as the Great Southern Trail group, which has been working hard to install a walking and cycling route on sections of the former Limerick-Tralee railway line. On February 2nd, to mark the 50th anniversary of its closure, about 150 members and supporters of the Great Southern Trail set out from the old railway station at Abbeyfeale, Co Limerick, along the most recently developed section to cross the Kerry county boundary. The trailers were greeted by a barricade on the border, manned by more than 30 farmers, including the Listowel Fine Gael town councillor Denis Stack. A stand-off continued for three hours, with the Garda mediating in vain. The farmers were trying to lay claim to the land occupied by the disused railway line, even though Minister for Transport Leo Varadkar had made it clear that CIÉ “is the owner of the property [and] will object to any application by others to register these lands”.
(via Rossa McMahon)(tags: via:rossamcmahon cycling walking hiking trails ireland kerry limerick listowel denis-stack cie)
-
like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text. [it] is written in portable C, and it has zero runtime dependencies. You can download a single binary, scp it to a far away machine, and expect it to work.
Nice tool. Needs to get into the Debian/Ubuntu apt repos pronto ;)(tags: jq tools cli via:peakscale json coding data sed unix)
“Clickwrap” licensing established as legal in Irish court
“The evidence does establish that there is a practice in the airline and online travel agency sectors of contractually binding web users by click wrapping or browse wrapping, which practice is generally and regularly followed by the operators in those sectors. In reality, it is difficult to see how online trade could be carried on in the absence of those devices. As regards the third question which arises from the MSG decision, in this case it is whether the defendant was aware or is presumed to have been aware of the practice. The evidence before the Court, in my view, clearly demonstrates that the defendant was aware of the practice, it being a practice which is generally and regularly followed when making bookings with online travel agents and with airlines and which, in the words of the Court in the MSG case, may be regarded as being a consolidated practice. Accordingly, in my view, by application of Article 23(1)(c), the defendant is bound by the jurisdiction clause in the Terms of Use on the plaintiff’s website by its use, either through the medium of an automaton or a manual operator or a third party data provider, of the website.”
(via Rossa McMahon)
Functional Reactive Programming in the Netflix API with RxJava
Hmm, this seems nifty as a compositional building block for Java code to enable concurrency without thread-safety and sync problems.
Functional reactive programming offers efficient execution and composition by providing a collection of operators capable of filtering, selecting, transforming, combining and composing Observable’s. The Observable data type can be thought of as a “push” equivalent to Iterable which is “pull”. With an Iterable, the consumer pulls values from the producer and the thread blocks until those values arrive. By contrast with the Observable type, the producer pushes values to the consumer whenever values are available. This approach is more flexible, because values can arrive synchronously or asynchronously.
(tags: concurrency java jvm threads thread-safety coding rx frp fp functional-programming reactive functional async observable)
You probably shouldn’t use a spreadsheet for important work
Daniel Lemire comments on the recent cases of bugs in spreadsheets causing major impact:
There are several critical problems with a tool like Excel that need to be widely known: * Spreadsheets do not support testing. For anything that matters, you should validate and test your code automatically and systematically; * Spreadsheets make code reviews impractical. To visually inspect the code, you need to click and each and every cell. In practice, this means that you cannot reasonably ask someone to read over your formulas to make sure that there is no mistake; * Spreadsheets encourage redundancies. Spreadsheets encourage copy-and-paste. Though copying and pasting is sometimes the right tool, it also creates redundancies. These redundancies make it very difficult to update a spreadsheet: are you absolutely sure that you have changed the formula throughout?
Agreed on all three, particularly on the impossibility of testing. IMO, everyone who may be in a job where automation via spreadsheet is likely, needs training in SDE fundamentals: unit testing, the important of open source and open data for reproducibility, version control, and code review. We are all computer scientists now.(tags: spreadsheets excel coding errors bugs testability unit-testing testing quality sde sde-fundamentals dry)
Log4j2 Asynchronous Loggers for Low-Latency Logging – Apache Log4j 2
implemented using the LMAX Disruptor library — very impressive performance figures. I presume in real-world usage, these latencies are dwarfed by hardware costs, though
(tags: disruptor coding java log4j logging async performance)
-
Google Drive and GMail have a built-in scripting engine. I had no idea
(tags: gmail evernote archival scripting coding hacks google-drive)
-
How the Irish media are partly to blame for the catastrophic property bubble, from a paper entitled _The Role Of The Media In Propping Up Ireland’s Housing Bubble_, by Dr Julien Mercille, in the _Social Europe Journal_:
“The overall argument is that the Irish media are part and parcel of the political and corporate establishment, and as such the news they convey tend to reflect those sectors’ interests and views. In particular, the Celtic Tiger years involved the financialisation of the economy and a large property bubble, all of it wrapped in an implicit neoliberal ideology. The media, embedded within this particular political economy and itself a constitutive element of it, thus mostly presented stories sustaining it. In particular, news organisations acquired direct stakes in an inflated real estate market by purchasing property websites and receiving vital advertising revenue from the real estate sector. Moreover, a number of their board members were current or former high officials in the finance industry and government, including banks deeply involved in the bubble’s expansion.”
(tags: economics irish-times ireland newspapers media elite insiders bubble property-bubble property celtic-tiger papers news bias)
-
Ugh. low-end ISPs MITM’ing DNS queries:
Some ISP’s are now using a technology called ‘Transparent DNS proxy’. Using this technology, they will intercept all DNS lookup requests (TCP/UDP port 53) and transparently proxy the results. This effectively forces you to use their DNS service for all DNS lookups. If you have changed your DNS settings to an open DNS service such as Google, Comodo or OpenDNS expecting that your DNS traffic is no longer being sent to your ISP’s DNS server, you may be surprised to find out that they are using transparent DNS proxying.
(via Nelson) BitTorrent’s Secure Dropbox Alternative Goes Public
As kragen says, ‘a decentralized way to sync a folder of large files, using BitTorrent instead of an untrustworthy central server’. Windows, OSX, and Linux supported
(tags: bittorrent dropbox cloud storage filesharing sharing sync synchronization)
DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second
250 million tweets per day, 30-node HBase cluster, 400TB of storage, Kafka and 0mq. This is from 2011, hence this dated line: ‘for a distributed application they thought AWS was too limited, especially in the network. AWS doesn’t do well when nodes are connected together and they need to talk to each other. Not low enough latency network. Their customers care about latency.’ (Nowadays, it would be damn hard to build a lower-latency network than that attached to a cc2.8xlarge instance.)
(tags: datasift architecture scalability data twitter firehose hbase kafka zeromq)
Breaking the 1000 ms Time to Glass Mobile Barrier [slides]
Great presentation from Google on HTML5 CSS+JS render speed, 3G/4G network latency, etc. (via John G)
(tags: google slides 3g 4g lte networking telcos telecom css js html5 web via:jg)
Lucene 4 – Revisiting Problems For Speed [slides]
a Presentation from Simon Willnauer on optimization work performed on Lucene in 2011. The most interesting stuff here is the work done to replace an O(n^2) FuzzyQuery fuzzy-match algorithm with a FSM trie is extremely cool — benchmarked at 214 times faster!
(tags: benchmarks slides lucene search fuzzy-matching text-matching strings algorithms coding fsm tries)
Microsoft Code Digger extension
Miguel de Icaza says it’s witchcraft — I’m inclined to agree:
Code Digger analyzes possible execution paths through your .NET code. The result is a table where each row shows a unique behavior of your code. The table helps you understand the behavior of the code, and it may also uncover hidden bugs. Through the new context menu item “Generate Inputs / Outputs Table” in the Visual Studio editor, you can invoke Code Digger to analyze your code. Code Digger computes and displays input-output pairs. Code Digger systematically hunts for bugs, exceptions, and assertion failures.
(tags: testing constraint-solving solver witchcraft magic dot-net coding tests code-digger microsoft)
Swansea measles outbreak: was an MMR scare in the local press to blame?
Sixteen years ago, journalists had a much easier job assembling “balanced” stories about MMR in south Wales. When I wrote about the measles outbreak last week, I suggested that it was related to Andrew Wakefield’s discredited 1998 Lancet research, but the Swansea contagion seems more likely to be the result of a separate scare a year earlier in the South Wales Evening Post. Before 1997, uptake of MMR in the distribution area of the Post was 91%, and 87.2% in the rest of Wales. After the Post’s campaign, uptake in the distribution area fell to 77.4% (it was 86.8% in the rest of Wales). That’s almost a 14% drop where the Post had influence, compared with less than 3% elsewhere. In the dry wording of the BMJ, “the [South West Evening Post] campaign is the most likely explanation”. In other words, what we can see in Swansea is the local effect of local reporting‚ in all probability, just a taster of what happens when the news irresponsibly creates unfounded terror. […] The 1997 coverage focused on a group of families who blamed MMR for various ailments in their children, including learning difficulties, digestive problems and autism‚ none of which have been found to have any connection with the vaccine. The Post’s coverage was at the time deemed a success, and in 1998 it won a prize for investigative reporting in the BT Wales Press Awards. That year, the SWEP ran at least 39 stories related to the alleged dangers of MMR. And yes, it’s true that the paper never directly endorsed non-vaccination. What it did do was publicise the idea of “vaccine damage” as a risk, one that parents would then likely weigh up against the risk of contracting measles, mumps or rubella. And this went beyond the reporting of parental anxieties‚ it was part of the Post’s editorial line. One article is entitled “Young bodies cannot take it”. The all-important “journalistic balance” was constantly available, thanks to campaigning parents and their solicitor Richard Barr. (It was Barr who engaged Wakefield for a lawsuit, leading to the “fishing expedition” research that became the Lancet paper.) They were happy to provide a quote on the dangers of the “triple jab”, which health authorities were then obliged to rebut politely. The Post also seemed to downplay the risk of measles, reporting on 6 July 1998 that “not a single child has been hit by the illness‚ despite a 13% drop in take-up levels”. It’s not parents who should feel embarrassed by the Swansea measles outbreak: some may have acted from overt dread at the prospect of harming their child, and some simply from omission, but all were encouraged by a press that focused on non-existent risks and downplayed the genuine horror of the diseases MMR prevents. The shame belongs to journalists: those of the South West Evening Post who allowed themselves to be recruited in the service of a speculative lawsuit, and any who let a specious devotion to “balance” overrule a duty to tell the truth.
(tags: south-wales wales mmr health vaccination scares journalism ethics disease measles south-wales-evening-post)
-
mostly a DynamoDB puff-piece from last week’s Amazon Cloud Connect, but contains some good real-world figures for a 20-billion-GUID deduping table use-case at end. ($4,150 per month, to cut to the chase)
(tags: dynamodb aws figures costs architecture ec2 dedupe cloud-connect slides)
Excel, untestability, and the reliability of quants
Wow, this is a great software-quality story — I knew Excel was the most widely used programming environment out there, but this is a factor I’d overlooked:
In his remarks on the final panel, Frank Partnoy mentioned something I missed when it came out a few weeks ago: the role of Microsoft Excel in the “London Whale” trading debacle. [..] To summarize: JPMorgan’s Chief Investment Office needed a new value-at-risk (VaR) model for the synthetic credit portfolio (the one that blew up) and assigned a quantitative whiz […] to create it. The new model “operated through a series of Excel spreadsheets, which had to be completed manually, by a process of copying and pasting data from one spreadsheet to another.” The internal Model Review Group identified this problem as well as a few others, but approved the model, while saying that it should be automated and another significant flaw should be fixed. After the London Whale trade blew up, the Model Review Group discovered that the model had not been automated and found several other errors. Most spectacularly, “After subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR …” I write periodically about the perils of bad software in the business world in general and the financial industry in particular, by which I usually mean back-end enterprise software that is poorly designed, insufficiently tested, and dangerously error-prone. But this is something different. […] While Excel the program is reasonably robust, the spreadsheets that people create with Excel are incredibly fragile. There is no way to trace where your data come from, there’s no audit trail (so you can overtype numbers and not know it), and there’s no easy way to test spreadsheets, for starters. The biggest problem is that anyone can create Excel spreadsheets — badly. Because it’s so easy to use, the creation of even important spreadsheets is not restricted to people who understand programming and do it in a methodical, well-documented way. This is why the JPMorgan VaR model is the rule, not the exception: manual data entry, manual copy-and-paste, and formula errors. This is another important reason why you should pause whenever you hear that banks’ quantitative experts are smarter than Einstein, or that sophisticated risk management technology can protect banks from blowing up. At the end of the day, it’s all software. While all software breaks occasionally, Excel spreadsheets break all the time. But they don’t tell you when they break: they just give you the wrong number.
(tags: excel reliability software coding ides jpmorgan value-at-risk finance london-whale quants spreadsheets unit-tests testability testing)
Riak, CAP, and eventual consistency
Good (albeit draft) write-up of the implications of CAP, allow_mult, and last_write_wins conflict-resolution policies in Riak:
As Brewer’s CAP theorem established, distributed systems have to make hard choices. Network partition is inevitable. Hardware failure is inevitable. When a partition occurs, a well-behaved system must choose its behavior from a spectrum of options ranging from “stop accepting any writes until the outage is resolved” (thus maintaining absolute consistency) to “allow any writes and worry about consistency later” (to maximize availability). Riak leans toward the availability end of the spectrum, but allows the operator and even the developer to tune read and write requests to better meet the business needs for any given set of data.
(tags: riak cap eventual-consistency distcomp distributed-systems partition last-write-wins voldemort allow_mult)
How You Can Help Save Upcoming.org, Posterous, and More
Yahoo! sucks. shutting down in days? ArchiveTeam Warrior to the rescue; install the VM!
(tags: archival yahoo shutdowns upcoming waxy archives virtualbox)
The Excel Depression – NYTimes.com
Krugman on the Reinhart-Rogoff Excel-bug fiasco.
What the Reinhart-Rogoff affair shows is the extent to which austerity has been sold on false pretenses. For three years, the turn to austerity has been presented not as a choice but as a necessity. Economic research, austerity advocates insisted, showed that terrible things happen once debt exceeds 90 percent of G.D.P. But “economic research” showed no such thing; a couple of economists made that assertion, while many others disagreed. Policy makers abandoned the unemployed and turned to austerity because they wanted to, not because they had to. So will toppling Reinhart-Rogoff from its pedestal change anything? I’d like to think so. But I predict that the usual suspects will just find another dubious piece of economic analysis to canonize, and the depression will go on and on.
(tags: paul-krugman economics excel coding bugs software austerity debt)
Vaccination ‘herd immunity’ demonstration
‘Stochastic monte-carlo epidemic SIR model to reveal herd immunity’. Fantastic demo of this important medical concept (via Colin Whittaker)
(tags: via:colinwh stochastic herd-immunity random sir epidemics health immunity vaccination measles medicine monte-carlo-simulations simulations)
Fred’s ImageMagick Scripts: SIMILAR
compute an image-similarity metric, to discover mostly-identical-but-slightly-tweaked images:
SIMILAR computes the normalized cross correlation similarity metric between two equal dimensioned images. The normalized cross correlation metric measures how similar two images are, not how different they are. The range of ncc metric values is between 0 (dissimilar) and 1 (similar). If mode=g, then the two images will be converted to grayscale. If mode=rgb, then the two images first will be converted to colorspace=rgb. Next, the ncc similarity metric will be computed for each channel. Finally, they will be combined into an rms value.
(via Dan O’Neill)(tags: image photos pictures similar imagemagick via:dano metrics similarity)
-
a first-person game prototype in which players navigate a 3D space while picking up orbs that reduce the speed of light in increments. Custom-built, open-source relativistic graphics code allows the speed of light in the game to approach the player’s own maximum walking speed. Visual effects of special relativity gradually become apparent to the player, increasing the challenge of gameplay. These effects, rendered in realtime to vertex accuracy, include the Doppler effect (red- and blue-shifting of visible light, and the shifting of infrared and ultraviolet light into the visible spectrum); the searchlight effect (increased brightness in the direction of travel); time dilation (differences in the perceived passage of time from the player and the outside world); Lorentz transformation (warping of space at near-light speeds); and the runtime effect (the ability to see objects as they were in the past, due to the travel time of light). Players can choose to share their mastery and experience of the game through Twitter. A Slower Speed of Light combines accessible gameplay and a fantasy setting with theoretical and computational physics research to deliver an engaging and pedagogically rich experience.
Eventual Consistency Today: Limitations, Extensions, and Beyond – ACM Queue
Good overview of the current state of eventually-consistent data store research, covering CALM and CRDTs, from Peter Bailis and Ali Ghodsi
(tags: eventual-consistency data storage horizontal-scaling research distcomp distributed-systems via:martin-thompson crdts calm acid cap)
Latency’s Worst Nightmare: Performance Tuning Tips and Tricks [slides]
the basics of running a service stack (web, app servers, data stores) on AWS. some good benchmark figures in the final slides
(tags: benchmarks aws ec2 ebs piops services scaling scalability presentations)
Rob “b3ta” Manuel in Dublin next week
The Bottom Half Of The Internet — “Racism; typos; filth; spam; ignorance; rage – that’s all the bottom half of the internet is good for, right? Rob Manuel wants you to question the internet dictum, most beloved of high-profile columnists, that you should ignore all of the comments all of the time. The ‘war on comments’, he reckons, might just be an echo of a fourth estate that’s having trouble adjusting to the idea of an unwashed public disagreeing with their sacred opinions. Sous les pavés, la plage.” On Tuesday, le cool Dublin & Pilcrow present SPIEL. Rob Manuel is the flashy animator behind B3ta and he’s joined by Ed Melvin, who wants to educate you on ‘The Unreal Engines’ of virtual currencies and economies.
(tags: rob-manuel b3ta dublin comments internet meetings talks lecool)
Reality, Reactivity, Relevance and Repeatability in Java Application Profiling
this product from JInspired appears to support runtime profiling of java apps with < 5% performance impact
(tags: profiling performance java coding measurement)
You Lookin’ At Me? Reflections on Google Glass
ex-Nokia product design guru Jan Chipchase on Google Glass
(tags: google privacy technology google-glass pervasive-computing life future)
Not the ‘best in the world’ – The Medical Independent
Debunking this prolife talking point:
‘Our maternity services are amongst the best in the world’. This phrase has been much hackneyed since the heartbreaking death of Savita Halappanavar was revealed in mid October. James Reilly and other senior politicians are particularly guilty of citing this inaccurate position. So what is the state of Irish maternity services and how do our figures compare with other comparable countries? Let’s start with the statistics.
The bottom line:Eight deaths per 100,000 is not bad, but it ranks our maternity services far from the best in world and below countries such as Slovakia and Poland.
(tags: pro-choice ireland savita medicine health maternity morbidity statistics)
How Kaggle Is Changing How We Work – Thomas Goetz – The Atlantic
Founded in 2010, Kaggle is an online platform for data-mining and predictive-modeling competitions. A company arranges with Kaggle to post a dump of data with a proposed problem, and the site’s community of computer scientists and mathematicians — known these days as data scientists — take on the task, posting proposed solutions. […] On one level, of course, Kaggle is just another spin on crowdsourcing, tapping the global brain to solve a big problem. That stuff has been around for a decade or more, at least back to Wikipedia (or farther back, Linux, etc). And companies like TaskRabbit and oDesk have thrown jobs to the crowd for several years. But I think Kaggle, and other online labor markets, represent more than that, and I’ll offer two arguments. First, Kaggle doesn’t incorporate work from all levels of proficiency, professionals to amateurs. Participants are experts, and they aren’t working for benevolent reasons alone: they want to win, and they want to get better to improve their chances of winning next time. Second, Kaggle doesn’t just create the incidental work product, it creates a new marketplace for work, a deeper disruption in a professional field. Unlike traditional temp labor, these aren’t bottom of the totem pole jobs. Kagglers are on top. And that disruption is what will kill Joy’s Law. Because here’s the thing: the Kaggle ranking has become an essential metric in the world of data science. Employers like American Express and the New York Times have begun listing a Kaggle rank as an essential qualification in their help wanted ads for data scientists. It’s not just a merit badge for the coders; it’s a more significant, more valuable, indicator of capability than our traditional benchmarks for proficiency or expertise. In other words, your Ivy League diploma and IBM resume don’t matter so much as my Kaggle score. It’s flipping the resume, where your work is measurable and metricized and your value in the marketplace is more valuable than the place you work.
(tags: academia datamining economics data kaggle data-science ranking work competition crowdsourcing contracting)
-
a good reference, with lots of sample output. Not clear if it takes 1.6/1.7 differences into account, though
Austerity policies founded on Excel typo
You’ve probably heard that countries with a high debt:GDP ratio suffer from slow economic growth. The specific number 90 percent has been invoked frequently. That’s all thanks to a study conducted by Carmen Reinhardt and Kenneth Rogoff for their book This Time It’s Different. But the results have been difficult for other researchers to replicate. Now three scholars at the University of Massachusetts have done so in “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff” and they find that the Reinhart/Rogoff result is based on opportunistic exclusion of Commonwealth data in the late-1940s, a debatable premise about how to weight the data, and most of all a sloppy Excel coding error. Read Mike Konczal for the whole rundown, but I’ll just focus on the spreadsheet part. At one point they set cell L51 equal to AVERAGE(L30:L44) when the correct procuedure was AVERAGE(L30:L49). By typing wrong, they accidentally left Denmark, Canada, Belgium, Austria, and Australia out of the average. When you run the math correctly “the average real GDP growth rate for countries carrying a public debt-to-GDP ratio of over 90 percent is actually 2.2 percent, not -0.1 percent.”
(tags: austerity politics excel coding errors bugs spreadsheets economics economy)
Is Your MySQL Buffer Pool Warm? Make It Sweat!
How GroupOn are warming up a failover warm MySQL spare, using Percona stuff and a “tee” of the live in-flight queries. (via Dave Doran)
(tags: via:dave-doran mysql databases warm-spares spares failover groupon percona replication)
So now you know who gets some of those excessive Ticketmaster fees….
Interesting evidence; it appears Irish music promoters are getting “rebates” from the massive TicketMaster “booking fee”, on each ticket sold. This sounds like a cartel to me, and we need to regulate this. Where is the National Consumer Agency and Competition Authority?
The matter is something which should be of concern to every gig-going music fan, regardless of whether they go to Stradbally or not. For years, many have asked about TicketMaster’s quasi-monopoly position in the marketplace and why this is so. We’ve always been told that promoters preferred to deal with one company rather than several and that TM’s systems and nationwide reach yadda yadda yadda was the bees’ knees etc. Other companies have tried to compete but no-one has been able to beat TM at this game. But why would promoters go elsewhere when they’re getting a slice of the TM fees back as rebates? Those past off-the-record attempts by and briefings from promoters blaming TM for those fees can now be seen as hypocritical. They’re sticking with TM because they’re receiving a take of the fees paid by punters who have no other choice in service provider if they want to get their hands on tickets. You wonder what the acts make of this cash-grab – perhaps some whip-smart agent is already making a claim for a percentage of the rebates because there would be no rebates in the first place without the act. Surely this is an issue for the Competition Authority and National Consumers Association too, given the manner in which the rebates are made and TM’s deals with the promoters? While promoters under TM deals are free to sell a certain proportion of their tickets with another provider, it’s usually only a very small percentage of the total and unlikely to trouble TM’s bottom line. Also, given that the rebates are volume-driven, it’s better for the promoters to keep the largest possible chunk of their business with TM. It seems that we have a new suspect in the blame game about why ticket prices are so high.
(tags: regulation ireland cartels competition ticketing tickets ticketmaster music gigs consumer)
Blog shines spotlight on Dublin city’s illegal dumping problem
Hooray, Eoin’s activism gets some coverage!
THE SCALE OF Dublin’s dumping problem is laid bare in a blog that has seen contributors send in photos of chairs, fridges and heaps of rubbish strewn on city streets. Eoin Parker, one of organisers behind DublinLitterBlog.com, spoke to TheJournal.ie about the problem, saying that the blog was set up following the privatisation of waste management by Dublin City Council in 2012.
(tags: dumping dublin litter rubbish blogs dcc d1 activism community)
-
To our knowledge, Ked is the first scripting language to emerge from The People’s Republic of Cork. Below is an account of what we know so far about the mysterious Corkonian language. Any suggested updates or contributions are encouraged.
Genius. Just how bad are RTE’s finances?
A sobering examination by NAMAwinelake into the quagmire of Ireland’s publicly-funded national broadcaster:
It seems that RTE has become a disaster zone, with libels and incompetence overseen by incapable management, and this is reflected in that organisation’s financial results. RTE still employs nearly 2,000 people and supports jobs and industry across independent producers and suppliers; it is a major business. But the time has come to call a halt to delusional management that is sinking the organization deeper into a quagmire which will ultimately need to be bailed out by the State. And Noel Curran is fobbing us off with flying a kite about a reduction in 65-year old Pat Kenny’s salary from €630,000 to €570,000?!
(tags: rte namawinelake public funding finances money mismanagement ireland incompetence tv news)
High Scalability – Scaling Pinterest – From 0 to 10s of Billions of Page Views a Month in Two Years
wow, Pinterest have a pretty hardcore architecture. Sharding to the max. This is scary stuff for me:
a [Cassandra-style] Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times.
yeah, so, eek ;)(tags: clustering sharding architecture aws scalability scaling pinterest via:matt-sergeant redis mysql memcached)
Expert in Savita inquiry confirms Irish women get lower standard of care with chorioamnionitis
Dr. Jen Gunter again:
Dr. Knowles’ testimony confirms for me that the law played a role, because her statements indicate the standard of care for treatment of chorioamnionitis is less aggressive in Ireland. This can only be because of the law as there is no medical evidence to support delaying delivery when chorioamnionitis is diagnosed. Standard of care is not to wait until a woman is sick enough to need a termination, the idea is to treat her, you know, before she gets sick enough. An elevated white count and ruptured membranes at 17 weeks is typically enough to make the diagnosis, so Dr. Knowles needs to testify as to what in Savita’s medical record made it safe to not recommend a delivery. By the way, I also disagree with Dr. Knowles about her interpretation of Savita’s medical record, the chart doesn’t have “subtle indicators” of infection, it screams chorioamnionitis long before Wednesday morning. In North America the standard of care with chorioamnionitis is to recommend delivery as soon as the diagnosis is made, not wait until women enter the antechamber of death in the hopes that we can somehow snatch them back from the brink. If Irish law, or the interpretation thereof, had nothing to do with Savita’s death no expert would be mentioning sick enough at all.
(tags: jen-gunter ob-gyn medicine savita law ireland abortion tragedy galway hospital)
Boundary Product Update: Trends Dashboard Now Available
Boundary implement week-on-week trend display. Pity they use silly “giant number” dashboard boxes showing comparisons of the current datapoint with the previous week’s datapoint; there’s no indication of smoothing being applied, and “giant number” dashboards are basically useless anyway compared to a time-series graph, for unsmoothed time-series data. Also, no prediction bands. :(
(tags: boundary time-series tsd prediction metrics smoothing dataviz dashboards)
ESB Networks | Power Check | Service Interruptions Map
real-time service outage information on a map, from Ireland’s power network
Project Voldemort at Gilt Groupe: When Failure Isn’t an Option [slides]
Geir Magnusson explains how Gilt Groupe is using Project Voldemort to scale out their e-commerce transactional system. The initial SQL solution had to be replaced because it could not handle the transactional spikes the site is experiencing daily due to its particular way of selling their inventory: each day at noon. Magnusson explains why they chose Voldemort and talks about the architecture.
via Filippo(tags: via:filippo database architecture nosql data voldemort gilt-groupe ops storage presentations)
The full timeline of Savita Halappanavar’s mistreatment
a comment on Dr. Jen Gunter’s blog puts it all together
(tags: timeline savita abortion malpractice ireland medicine fail)
-
No holds barred:
Speaking today, spokesman Charles Stanley-Smith said; “This idea is insane. This area has suffered from dumping due to a lack of enforcement – yet the council now propose to effectively withdraw services altogether. As numerous studies such as ‘the broken window hypothesis’ indicate, where a small problem is left un-tackled it is likely to become far worse rather than better. In other words, rather than increase enforcement to solve the problem, Dublin City Council is going to remove enforcement. How will this deal with the problem? Imagine if that logic were applied to crime; would the removal of police services in an area help resolve criminal behaviour – or increase it? The answer is obvious.”
(tags: an-taisce environment cleaning dublin ireland dcc rubbish trash society d1)
-
Written by Google, this library is a flexible, efficient, and powerful Java client library for accessing any resource on the web via HTTP. It features a pluggable HTTP transport abstraction that allows any low-level library to be used, such as java.net.HttpURLConnection, Apache HTTP Client, or URL Fetch on Google App Engine. It also features efficient JSON and XML data models for parsing and serialization of HTTP response and request content. The JSON and XML libraries are also fully pluggable, including support for Jackson and Android’s GSON libraries for JSON.
Not quite as simple an API as Python’s requests, sadly, but still an improvement on the verbose Apache HttpComponent API. Good support for unit testing via a built-in mock-response class. Still in beta(tags: google beta software http libraries json xml transports protocols)
Former IMF chief of mission to Ireland says not burning the bondholders was “a mistake”
Former IMF chief of mission to Ireland, Ashoka Mody, above left with Ajai Chopra in 2010. Melancholy of eye and large of loafer, Ashoka was involved in negotiating Ireland’s EU/IMF bailout. […] This morning Ashok gave an interview to Gavin Jennings on Morning Ireland, in which he admitted Ireland’s bailout was riddled with mistakes, namely the non-burning of the senior bondholders and the program of austerity. Jennings: “So, if imposing austerity on Ireland was wrong, or a mistake; if not allowing any burning of bondholders, whether official, sovereign or private was a mistake; you were centrally involved in that program. I know Ajai Chopra was very much the public face of the IMF mission to Ireland. But you were centrally involved in constructing this bailout. How much responsibility do you take for those errors.” Mody: “Yes, so, obviously, I have to take the responsibility in…but I’m in very good company in taking responsibility in this. There were many parties involved. And my role really was to bring such matters to the attention of people who finally made these decisions.”
Great.(tags: bondholders imf ireland economy default ajai-chopra ashoka-mody)
Savita Halappanavar’s inquest: the three questions that must be answered | Dr. Jen Gunter
A professional OB/GYN analyses the horrors coming to light in the Savita inquest. Here’s one particular gem:
Fetal survival with ruptured membranes at 17 weeks is 0%, this is from prospective study. […but] “real and substantial risk” to the woman’s life is what is required by the Irish constitution to terminate a pregnancy, *whether or not the foetus is viable*.
So the foetus had 0% chance of survival — but still termination was not considered an option. Bloody hell.(tags: religion ireland savita horrors malpractice galway guh hospitals hse health inquest abortion pro-choice pregnancy)
Minister Rabbitte welcomes EU agreement on re-use of Public Sector Information
Lots of talk about “charging regimes”, “income-generating public sector bodies” etc., but not a single mention of open data or free access. Terrible stuff. :( (via conoro)
(tags: via:conoro open-access government public-sector ireland eu open-data public free)
Compression in Kafka: GZIP or Snappy ?
With Ack: in this mode, as far as compression is concerned, the data gets compressed at the producer, decompressed and compressed on the broker before it sends the ack to the producer. The producer throughput with Snappy compression was roughly 22.3MB/s as compared to 8.9MB/s of the GZIP producer. Producer throughput is 150% higher with Snappy as compared to GZIP. No ack, similar to Kafka 0.7 behavior: In this mode, the data gets compressed at the producer and it doesn’t wait for the ack from the broker. The producer throughput with Snappy compression was roughly 60.8MB/s as compared to 18.5MB/s of the GZIP producer. Producer throughput is 228% higher with Snappy as compared to GZIP. The higher compression savings in this test are due to the fact that the producer does not wait for the leader to re-compress and append the data; it simply compresses messages and fires away. Since Snappy has very high compression speed and low CPU usage, a single producer is able to compress the same amount of messages much faster as compared to GZIP.
The Bw-Tree: A B-tree for New Hardware – Microsoft Research
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
(tags: bw-trees database paper toread research algorithms microsoft sql sql-server b-trees data-structures storage cache-friendly mechanical-sympathy)
Boundary Techtalk – Large-scale OLAP with Kobayashi
Boundary on their TSD-on-Riak store.
Dietrich Featherston, Engineer at Boundary, walks through the process of designing Kobayashi, the time-series analytics database behind our network metrics. He goes through the false-starts and lessons learned in effectively using Riak as the storage layer for a large-scale OLAP database. The system is ultimately capable of answering complex, ad-hoc queries at interactive latencies.
(tags: video boundary tsd riak eventual-consistency storage kobayashi olap time-series)
-
A few days old, but already an instant Streisand-Effect classic:
Sometimes people borrow [Colin Purrington’s free guide about making scientific posters] without giving him credit. This happens fairly regularly, and when he finds out about it, he sends an e-mail asking them to take it down. Usually they do. But when he sent an e-mail to the Consortium for Plant Biotechnology Research, asking that a roughly 1,200-word, near-verbatim, uncredited chunk from his guide be removed from the consortium’s materials, the response was unexpected. Rather than apologise, a lawyer sent him a cease-and-desist letter accusing him of plagiarizing the consortium’s materials and demanding that he take down his guide or face a lawsuit seeking damages up to $150,000.
(tags: streisand-effect lawsuits law infringement copyright cpbr bullying science posters)
Kafka 0.8 Producer Performance
Great benchmarking from Piotr Kozikowski at the LiveRamp team, into performance of the upcoming Kafka 0.8 release
(tags: performance kafka apache benchmarks ops queueing)
Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node
an excellent writeup on Kafka 0.8’s use and operation, including details of the new replication features
(tags: kafka replication queueing distributed ops)
-
‘A €10 silver coin being offered for sale to the public in honour of James Joyce by the Central Bank tomorrow contains a misquote from the author. The line used on the coin from Chapter 3 of Ulysses includes a superfluous conjunction – a rogue ‘that’.’ [..] The coin reads:
“Ineluctable modality of the visible: at least that if no more, thought through my eyes. Signatures of all things *that* I am here to read.”
(Incorrect ‘that’ emphasised)(tags: for:robotwisdom james-joyce typos funny fail central-bank ireland coins minting errors ulysses)
Netflix ISP Speed Index for Ireland
Via Mulley. Magnet doing well, with UPC coming second; UPC have dropped a fair bit in the past month. Would love to see it broken down by region…
(tags: upc ireland isps speed bandwidth netflix broadband magnet eircom)
Why I’m Walking Away From CouchDB
In practice there are two gotchas that are so painful I am looking for a replacement with a different featureset than couchdb provides. The location tracking project icecondor.com uses couchdb to store 20,000 new records per day. It has more write traffic than read traffic and runs on modest hardware. Those two gotchas are: 1. View Index updates. While I have a vague understanding of why view index updates are slow and bulky and important, in practice it is unworkable. Every write sets up a trap for the first reader to come along after the write. The more writes there are, the bigger the trap for the first reader which has to wait on the couchdb process that refreshes the view index on an as-needed basis. I believe this trade-off was made to keep writes fast. No need to update the view index until all writes are actually complete, right? Write traffic is heavier than read traffic and the time needed for that index refresh causes the webapp to crash because its not setup to handle timeouts from a database query. The workaround is as hackish as one can imagine – cron jobs to hit every map/reduce query to keep indexes fresh. 2. Append only database file Append only is in theory a great way to ensure on-disk reliability. A system crash during an append should only affect that append. Its a crash during an update to existing parts of the file that risks the integrity of more than whats being updated. With so many layers of caching and optimizations in the kernel and the filesystem and now in the workings of SSD drives, I’m not sure append-only gives extra protection anymore. What it does do is a create a huge operational headache. The on-disk file can never grow beyond half the available storage space. Record deletion uses new disk space and if the half-full mark approaches, vacuuming must be done. The entire database is rewritten to the filesystem, leaving out no longer needed records. If the data file should happen to grow beyond half the partition, the system has esentially crashed because there is no way to compact the file and soon the partition will be full. This is a likely scenario when there is a lot of record deletion activity. The system in question does a lot of writes of temporary data that is followed up by deletes a few days later. There is also a lot of permanent storage that hardly gets used. Rewriting every byte of the records that are long-lived due to compaction is an enormous amount of wasted I/O – doubly so given SSD drives have a short write-cycle lifespan.
(tags: nosql couchdb consistency checkpointing databases data-stores indexing)
CouchDB: not drinking the kool-aid
Jonathan Ellis on some CouchDB negatives:
Here are some reasons you should think twice and do careful testing before using CouchDB in a non-toy project: Writes are serialized. Not serialized as in the isolation level, serialized as in there can only be one write active at a time. Want to spread writes across multiple disks? Sorry. CouchDB uses a MVCC model, which means that updates and deletes need to be compacted for the space to be made available to new writes. Just like PostgreSQL, only without the man-years of effort to make vacuum hurt less. CouchDB is simple. Gloriously simple. Why is that a negative? It’s competing with systems (in the popular imagination, if not in its author’s mind) that have been maturing for years. The reason PostgreSQL et al have those features is because people want them. And if you don’t, you should at least ask a DBA with a few years of non-MySQL experience what you’ll be missing. The majority of CouchDB fans don’t appear to really understand what a good relational database gives them, just as a lot of PHP programmers don’t get what the big deal is with namespaces. A special case of simplicity deserves mention: nontrivial queries must be created as a view with mapreduce. MapReduce is a great approach to trivially parallelizing certain classes of problem. The problem is, it’s tedious and error-prone to write raw MapReduce code. This is why Google and Yahoo have both created high-level languages on top of it (Sawzall and Pig, respectively). Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages. It’s a little verbose, and you might be bored with it, but it’s much better than writing low-level mapreduce code.
(tags: cassandra couch nosql storage distributed databases consistency)
What is the CouchDB replication protocol? Is it like Git? – Stack Overflow
Good write up of CouchDB replication
(tags: protocols couchdb sync replication git mvcc databases merging timelines)
TouchDB’s reverse-engineered write-up of the Couch replication protocol
There really isn’t a separate “protocol” per se for replication. Instead, replication uses CouchDB’s REST API and data model. It’s therefore a bit difficult to talk about replication independently of the rest of CouchDB. In this document I’ll focus on the algorithm used, and link to documentation of the APIs it invokes. The “protocol” is simply the set of those APIs operating over HTTP.
(tags: couchdb protocols touchdb nosql replication sync mvcc revisions rest)
-
A good writeup of how to detect cases of copyright infringement for photography, art and other visual media.
Von Glitschka, Modern Dog and myriad others make clear that the support of the creative community is absolutely vital in raising awareness of copyright infringements. Sites like www.youthoughtwewouldntnotice.com name and shame clear breaches of copyright, while the Modern Dog case shows that there is no better IP tracing system than the eyes and ears of the design community itself. “It’s the industry at large that has kept me aware of infringements,” states Von. “Without that I would miss most of them because I don’t go looking – they find me via the eyes of others.”
(tags: photography art visual-media copyright infringement piracy ripping)
FastBit: An Efficient Compressed Bitmap Index Technology
an [LGPL] open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. It is designed to accelerate user’s data selection tasks without imposing undue requirements. In particular, the user data is NOT required to be under the control of FastBit software, which allows the user to continue to use their existing data analysis tools. The key technology underlying the FastBit software is a set of compressed bitmap indexes. In database systems, an index is a data structure to accelerate data accesses and reduce the query response time. Most of the commonly used indexes are variants of the B-tree, such as B+-tree and B*-tree. FastBit implements a set of alternative indexes called compressed bitmap indexes. Compared with B-tree variants, these indexes provide very efficient searching and retrieval operations, but are somewhat slower to update after a modification of an individual record. A key innovation in FastBit is the Word-Aligned Hybrid compression (WAH) for the bitmaps.[…] Another innovation in FastBit is the multi-level bitmap encoding methods.
(tags: fastbit nosql algorithms indexing search compressed-bitmaps indexes wah bitmaps compression)
-
The bit array data structure is implemented in Java as the BitSet class. Unfortunately, this fails to scale without compression. JavaEWAH is a word-aligned compressed variant of the Java bitset class. It uses a 64-bit run-length encoding (RLE) compression scheme. We trade-off some compression for better processing speed. We also have a 32-bit version which compresses better, but is not as fast. In general, the goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap (as implemented in the BitSet class). Unlike some alternatives, javaewah does not rely on a patented scheme.
(tags: javaewah wah rle compression bitmaps bitmap-indexes bitset algorithms data-structures)
Measure Anything, Measure Everything « Code as Craft
the classic Etsy pro-metrics “measure everything” post. Some good basic rules and mindset
Testing Your Automation [slides]
Test-driven infrastructure, using Chef — slides from Big Ruby 2013. Tools used: foodcritic (lol), Chefspec, minitest-chef-handler, fauxhai, cucumber chef. This is really good to see — TDD applied to ops. Video at: http://confreaks.com/videos/2309-bigruby2013-testing-your-automation-ttd-for-chef-cookbooks
(tags: devops ops chef automation testing tdd infrastructure provisioning deployment)
Meet the nice-guy lawyers who want $1,000 per worker for using scanners | Ars Technica
Great investigative journalism, interviewing the legal team behind the current big patent-troll shakedown; that on scanning documents with a button press, using a scanner attached to a network. They express whole-hearted belief in the legality of their actions, unsurprisingly — they’re exactly what you think they’d be like (via Nelson)
(tags: via:nelson ethics business legal patents swpats patent-trolls texas shakedown)
[#HADOOP-9448] Reimplement things – ASF JIRA
Pretty good April Fools from this year — a patch to delete the entirety of Hadoop’s codebase:
To avoid any bias to the existing code and make the same mistakes we should just delete trunk completely. Attached it is a script that deletes everything.
(tags: hadoop april-fools asf patches open-source oss)
Lucas Nussbaum’s Blog » Blog Archive » RVM: seriously?
+1. RVM is atrocious code — some of the worst bash script I’ve seen. And it’s not just installing as a command, it requires that it be sourced and hooks into your login shell. If you then use “set -e”, it crashes; “set -u”, it crashes; reset $HOME, crash. It’s dire.
-
Next April 11th, at the IIEA in North Gt Georges St:
Rick Falkvinge, founder of the Swedish Pirate Party, will examine the case for reform of copyright and patent law in the EU. Legalised file sharing, free sampling and shortened copyright protection times are the main elements of a proposal co-authored by Mr. Falkvinge which was submitted to the European Parliament in 2012. He will question whether, in the context of ever-increasing online activity, existing legal frameworks pose a threat to users’ civil liberties.
(tags: rick-falkvinge pirate-party ireland iiea dublin copyright patents filesharing)
High Performance MongoDB Clusters with Amazon EBS Provisioned IOPS
yeah yeah, Mongo. bookmarking for the good data on EBS+PIOPS
(tags: ebs piops aws performance tips ops ec2 mongodb presentations)
-
These notes are intended to help users and system administrators maximize TCP/IP performance on their computer systems. They summarize all of the end-system (computer system) network tuning issues including a tutorial on TCP tuning, easy configuration checks for non-experts, and a repository of operating system specific instructions for getting the best possible network performance on these platforms.
Some tips for maximizing HPC network performance for the intra-DC case; recommended by the LinkedIn Kafka operations page.(tags: tuning network tcp sysadmin performance ops kafka ec2)
Increasing EBS Performance – Amazon Elastic Compute Cloud
good docs from EC2
(tags: ec2 ebs performance piops docs)
-
an open source virtualized Ethernet networking stack. I am developing Snabb Switch in response to several exciting trends: x86 has risen to be a powerful networking platform. Virtualization and SDN are pulling more networking into servers. Optimized user-space software is out-performing kernel-space software. Snabb Switch’s simple and fast software-only data plane makes developing networking software easier than ever before.
Written in LuaJIT but aiming to be very fast. cool stuff, worth watching(tags: sdn software networking emulation snabb-switch luajit lua virtualization)
Abusing hash kernels for wildly unprincipled machine learning
what, is this the first time our spam filtering approach of hashing a giant feature space is hitting mainstream machine learning? that can’t be right!
(tags: ai machine-learning python data hashing features feature-selection anti-spam spamassassin)