-
Some good factoids about Loggly’s Kafka usage and scale
(tags: scalability logging loggly kafka queueing ops reliabilty)
Patterns for building a resilient and scalable microservices platform on AWS
Some good details from Boyan Dimitrov at Hailo on the orchestration, deployment, and provisioning infrastructure they’ve built
(tags: deployment ops devops hailo microservices platform patterns slides)
-
A probabilistic data structure for frequency/k-occurrence cardinality estimation of multisets. Sample implementation
(via Patrick McFadin)
(tags: via:patrickmcfadin hyperloglog cardinality data-structures algorithms hyperlogsandwich counting estimation lossy multisets)
“Trash Day: Coordinating Garbage Collection in Distributed Systems”
Another GC-coordination strategy, similar to Blade (qv), with some real-world examples using Cassandra
(tags: blade via:adriancolyer papers gc distsys algorithms distributed java jvm latency spark cassandra)
Five Takeaways on the State of Natural Language Processing
Good overview of the state of the art in NLP nowadays. I find the word2vec part particularly interesting:
Embedding words as real-numbered vectors using a skip-gram, negative-sampling model (word2vec code) was mentioned in nearly every talk I attended. Either companies are using various word2vec implementations directly or they are building diffs off of the basic framework. Trained on large corpora, the vector representations encode concepts in a large dimensional space (usually 200-300 dim).
Quite similar to some tokenization approaches we experimented with in SpamAssassin, so I don’t find this too surprising...
(tags: word2vec nlp tokenization machine-learning language parsing doc2vec skip-grams data-structures feature-extraction via:lemonodor)
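For anyone unfamiliar with the skip-gram, negative-sampling setup mentioned in the quote, here is a rough sketch of how the training examples are generated. It is not the word2vec code itself: the function name `skipgram_pairs` and its parameters are made up for illustration, and negatives are drawn uniformly from the vocabulary here, whereas the real thing uses a smoothed unigram distribution.

```python
import random


def skipgram_pairs(tokens, window=2, negatives=2, seed=0):
    """Build (center, context, label) training triples.

    label 1 = the context word really appeared within `window` positions
    of the center word; label 0 = a randomly sampled "negative" word.
    Illustrative only: negatives are sampled uniformly, so a negative can
    occasionally coincide with a genuine context word.
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    triples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # a word is not its own context
            triples.append((center, tokens[j], 1))
            # Pair the same center word with a few random words as negatives.
            for _ in range(negatives):
                triples.append((center, rng.choice(vocab), 0))
    return triples


if __name__ == "__main__":
    for triple in skipgram_pairs("the quick brown fox".split(), window=1):
        print(triple)
```

A model is then trained to score the label-1 pairs high and the label-0 pairs low, and the learned per-word weights are the vectors the quote is talking about.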