
Category: Uncategorized

Block AI scrapers with Anubis

  • Block AI scrapers with Anubis

    Bookmarking this in case I have to use it; I don't want LLM scrapers to kill my blog.

    Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.

    The most hilarious part about how Anubis is implemented is that it triggers challenges for every request with a User-Agent containing "Mozilla". Nearly all AI scrapers (and browsers) use a User-Agent string that includes "Mozilla" in it. This means that Anubis is able to block nearly all AI scrapers without any configuration.
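    The two ideas above (challenge anything that says "Mozilla", then demand proof of work) can be sketched in a few lines. This is a hashcash-style illustration only, not Anubis's actual code: the SHA-256 leading-zeros scheme, the difficulty value and the function names are all assumptions made for the example.

```python
import hashlib

def needs_challenge(user_agent: str) -> bool:
    """Assumed rule from the description above: challenge any UA containing 'Mozilla'."""
    return "Mozilla" in user_agent

def digest_for(challenge: str, nonce: int) -> str:
    """Hex SHA-256 of the challenge string followed by the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose digest starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while not digest_for(challenge, nonce).startswith(prefix):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return digest_for(challenge, nonce).startswith("0" * difficulty)

# Hypothetical challenge value; a real proxy would issue a fresh random one per client.
challenge = "example-challenge-0001"
difficulty = 3  # about 16**3 = 4096 hashes on average
nonce = solve(challenge, difficulty)
```

    The asymmetry is the point: solving costs thousands of hashes, verifying costs one, so the cost lands on whoever is making the requests.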

    Tags: throttling robots scraping ops llms bots hashcash tarpits

Cost-optimized archival in S3 using s3tar

  • Cost-optimized archival in S3 using s3tar

    "s3tar" is new to me, and looks like a perfect tool for this common use-case -- aggregation and archival of existing data on S3, which often requires aggregation into large file sizes to take advantage of S3 Glacier storage classes (which have a minimum file size of 128Kb).

    s3tar optimizes for cost and performance on the steps involved in downloading the objects, aggregating them into a tar, and putting the final tar in a specified Amazon S3 storage class using a configurable “--concat-in-memory” flag. ... The tool also offers the flexibility to upload directly to a user’s preferred storage class or store the tar object in S3 Standard storage and seamlessly transition it to specific archival classes using S3 Lifecycle policies.

    The only downside of s3tar is that it doesn't support recompression, which is also a common enough requirement -- especially after aggregation of multiple small input files into a larger, more compressible archive. But hey, can't have everything.
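    To illustrate why aggregation matters, here is a rough local sketch of the idea using only Python's standard library. It does not call S3 and is not how s3tar works internally; the object names and sizes are made up for the example.

```python
import io
import tarfile

MIN_OBJECT_BYTES = 128 * 1024  # the 128 KB minimum mentioned above

def build_tar(objects: dict) -> bytes:
    """Pack many small payloads (name -> bytes) into one uncompressed in-memory tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in objects.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Hypothetical input: 100 log fragments of 2 KB each, all far below the minimum.
small_logs = {f"logs/part-{i:04d}.log": b"x" * 2048 for i in range(100)}
archive = build_tar(small_logs)

# Read the archive back to confirm every member survived the round trip.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
    member_names = tar.getnames()
```

    Opening the archive with mode "w:gz" instead of "w" would compress it in this local sketch, which is the recompression step missing from s3tar.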

    s3tar: https://github.com/awslabs/amazon-s3-tar-tool

    Tags: s3tar amazon s3 compression storage archival architecture aggregation logs glacier via:lwia

Cryptocurrency “market caps” and notional value

  • Cryptocurrency "market caps" and notional value

    Excellent explainer from Molly White on the risk of quoting "market caps" for memecoins:

    The “market cap” measurement has become ubiquitous within and outside of crypto, and it is almost always taken at face value. Thoughtful readers might see such headlines and ask questions like “how did a ‘$2 trillion market’ tumble without impacting traditional finance?”, but I suspect most accept the number.

    When crypto projects are hacked, there are headlines about hackers stealing “$166 million worth” of tokens, when in reality the hackers only could cash out 2% of that amount (around $3 million) because their attempts to sell illiquid tokens caused the price to crash.
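    A toy calculation shows how notional value and cash-out value can diverge so wildly. The sketch below assumes a constant-product (x * y = k) liquidity pool with no fees; every number in it is invented for illustration and none comes from the article.

```python
def sell_into_pool(tokens_to_sell: float, pool_tokens: float, pool_usd: float) -> float:
    """USD received for selling tokens into a constant-product pool, ignoring fees."""
    k = pool_tokens * pool_usd
    new_pool_tokens = pool_tokens + tokens_to_sell
    new_pool_usd = k / new_pool_tokens
    return pool_usd - new_pool_usd

# Hypothetical token: 1 billion supply, but only a small pool to trade against.
total_supply = 1_000_000_000.0
pool_tokens = 1_000_000.0
pool_usd = 1_000_000.0

spot_price = pool_usd / pool_tokens      # $1.00 per token
market_cap = total_supply * spot_price   # the headline "$1 billion"

holding = 100_000_000.0                  # tokens a holder tries to sell
notional = holding * spot_price          # "worth" $100 million on paper
proceeds = sell_into_pool(holding, pool_tokens, pool_usd)

print(f"market cap ${market_cap:,.0f}, notional ${notional:,.0f}, proceeds ${proceeds:,.0f}")
```

    However large the holding, the seller can never take out more dollars than the pool contains, which is why price times supply says so little about what could actually be realised.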

    Tags: molly-white memecoins bitcoin rug-pulls scams liquidity market-caps cryptocurrency

Hollo

  • Hollo

    "A federated microblogging software for single users. ActivityPub-enabled, Mastodon-compatible API, supports CommonMark and Misskey-style quotes. Hollo is designed for single-users, so you can own your instance and have full control over your data. It’s perfect for personal microblogs, notes, and journals."

    Seems fairly heavyweight, however, so I probably won't be running it, but it's a nice take on the single-user-server Fediverse use case.

    Tags: fediverse mastodon hollo apps social-media blogging

GTFS-Realtime API

  • GTFS-Realtime API

    The Irish National Transport Authority have an open data API for realtime public transport information; very cool. "The GTFS-R API contains real-time updates for services provided by Dublin Bus, Bus Éireann, and Go-Ahead Ireland."

    The specification currently supports the following types of information:

    - Trip updates: delays, cancellations, changed routes
    - Service alerts: stop moved, unforeseen events affecting a station, route or the entire network
    - Vehicle positions: information about the vehicles including location and congestion level

    Registration is required.
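    As a rough sketch of consuming such a feed, the snippet below sorts feed entities into the three kinds of information listed above. It assumes the feed has already been decoded into plain dicts keyed by the GTFS-Realtime protobuf field names (trip_update, vehicle, alert); the sample feed is hand-written and hypothetical, not real NTA data, and no NTA endpoint or key handling is shown.

```python
from collections import Counter

def classify(entity: dict) -> str:
    """Map one feed entity to the kind of update it carries."""
    if "trip_update" in entity:
        return "trip_update"
    if "vehicle" in entity:
        return "vehicle_position"
    if "alert" in entity:
        return "service_alert"
    return "unknown"

# Hypothetical decoded feed, simplified for illustration.
sample_feed = {
    "header": {"gtfs_realtime_version": "2.0", "timestamp": 1736500000},
    "entity": [
        {
            "id": "1",
            "trip_update": {
                "trip": {"route_id": "route-A"},
                "stop_time_update": [{"stop_id": "stop-1", "arrival": {"delay": 240}}],
            },
        },
        {
            "id": "2",
            "vehicle": {
                "position": {"latitude": 53.35, "longitude": -6.26},
                "congestion_level": "CONGESTION",
            },
        },
        {
            "id": "3",
            "alert": {"header_text": {"translation": [{"text": "Stop moved"}]}},
        },
    ],
}

counts = Counter(classify(entity) for entity in sample_feed["entity"])
```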

    Tags: public-transport buses trains transit nta gtfs apis open-data dublin ireland

Why the British government is so into AI

  • Why the British government is so into AI

    Interesting Bluesky thread on the topic --

    The UK Government believes several things:

    1) The AI genie is out of the bottle and cannot be put back in

    2) Embracing AI would definitely be good for the British economy

    3) Enforcing copyright on AI training would put Britain out of step with rest of the world and subsequently...

    4) Enforcing copyright would be ineffective as AI would just be trained elsewhere, cutting out Brit creatives entirely

    5) Govt's preferred option is permissive enough to be attractive to AI firms but demands transparency so at least rights holders have some recourse; the alternative is bleaker.

    Obviously, I contest all of these beliefs to one degree or another, but this is where the govt is, and it's useful to understand that. The real crux of the debate, as they see it, is how Britain's laws can practically deal with the global inevitability of AI. They believe it's untenable to make Britain a legislative pariah state for AI, and that this would not lead to good outcomes for British creatives anyway. This is a point worth considering when replying to the consultation.

    However, the govt says it's not going to implement policy before it has a technical solution for rights holders to opt-out and chase down infringements. My view is that this is difficult to the point of being pure fantasy, and either means that the govt is not serious about finding a real, effective technical solution, or this policy will be kicked indefinitely down the road. My dinner partner was optimistic a solution could be achieved within the timespan of a year or two. I just don't buy it.

    Government says it has not sided with AI firms over creative industries. However, its understanding of "not taking a side" creates a false equality between massive companies whose business relies on crime and individuals whose livelihoods will be destroyed.

    I got the sense that there is no political will whatsoever to seriously challenge firms who offer to spend big in Britain, and that any thought of holding them to account for actual crime is simply considered naive. But we do have a bit of time while govt attempts to confect their magical, easy to use, opt-out solution—time during which one or several of these AI firms might implode, making the true cost more apparent.

    Tags: uk government ai policy copyright ip britain economy future

The people should own the town square

  • The people should own the town square

    Ah, this is welcome news from Mastodon:

    We are going to transfer ownership of key Mastodon ecosystem and platform components to a new non-profit organization, affirming the intent that Mastodon should not be owned or controlled by a single individual. [...] Taking the first tentative steps almost a year ago, there are already multiple organizations involved with shepherding the Mastodon code and platform. The next 6 months will see the transformation of the Mastodon structures, shifting away from the early days’ single-person ownership and enshrining the envisioned independence in a dedicated European not-for-profit entity.

    Tags: mastodon social-media open-source fediverse

Grafana and ClickHouse

Watch Duty

  • Watch Duty

    Nice to see an important public need being met here:

    The [Watch Duty] app gives users the latest alerts about fires in their area [in California] and has become a vital service for millions of users in the western U.S. struggling with the seemingly constant threat of deadly wildfires—one major reason it had over 360,000 unique visits from 8:00-8:30 a.m. local time Wednesday. And the man behind Watch Duty promises that as a nonprofit, his organization has no plans to pull an OpenAI and become a profit-seeking enterprise.

    Tags: non-profits tech watch-duty apps mobile public-good

Steve Jobs vs Ireland

  • Steve Jobs vs Ireland

    This is a great Steve Jobs story, from the engineer who wrote v1 of the Mac OS X Dock:

    At one point during a trip over, Steve was talking to Bas and asked how things were coming along with the Dock. He replied something along the lines of “going well, the engineer is over from Ireland right now, etc”. Steve left, and then visited my manager’s manager’s manager and said the fateful words (as reported to me by people who were in the room where it happened).

    “It has come to my attention that the engineer working on the Dock is in FUCKING IRELAND”.

    I was told that I had to move to Cupertino. Immediately. Or else.

    I did not wish to move to the States. I liked being in Europe. Ultimately, after much consideration, many late night conversations with my wife, and even buying a guide to moving, I said no.

    They said ok then. We’ll just tell Steve you did move.

    (via Niall Murphy)

    Tags: macos america osx apple history steve-jobs

Court docs allege Meta trained LLM models using pirated book trove

  • Court docs allege Meta trained LLM models using pirated book trove

    This is pretty massive:

    The [court] document claims that Meta decided to download documents from Library Genesis -- aka. “LibGen” -- to train its models. LibGen is the subject of a lawsuit brought by textbook publishers who believe it happily hosts and distributes [pirated] works [....]

    The filing from plaintiffs in the Kadrey case claims that documents produced by Meta [...] describe internal debate about accessing LibGen, a little squeamishness about using BitTorrent in the office to do so, and eventual escalation to “MZ” [Mark Zuckerberg himself], who approved use of the contentious resource. [...]

    Another filing claims that a Meta document describes how it removed copyright notifications from material downloaded from LibGen, and suggests the company did so because it realized including such text could mean a model’s output would reveal it was trained on copyrighted material.

    US District Court Judge Vince Chhabria also noted that in one of the documents Meta wants to seal, an employee wrote the following:

    “If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues.”

    No shit.

    Tags: piracy meta copyright mark-zuckerberg law llama training libgen books

Bufferbloat Test

  • Bufferbloat Test

    A handy tool to test your internet connection for "bufferbloat", the error condition involving "undesirable high latency caused by other traffic on your network. It happens when a flow uses more than its fair share of the bottleneck. Bufferbloat is the primary cause of bad performance for real-time Internet applications like VoIP calls, video games, and videoconferencing."

    (My home internet connection is currently rating a C: "your latency increased considerably under load", jumping from a min/mean/p95/max of 10.7, 16.9, 23.7, 30.1ms to 35.3, 98.4, 121.0, 286.0ms under load, yikes, so looks like I need to do some optimising.)
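    The extra latency under load is easy to work out from those figures. A small sketch using only the numbers quoted above; the variable names are mine:

```python
# Latency in milliseconds, as reported by the test above.
idle = {"min": 10.7, "mean": 16.9, "p95": 23.7, "max": 30.1}
loaded = {"min": 35.3, "mean": 98.4, "p95": 121.0, "max": 286.0}

# Added latency under load, and how many times worse each statistic gets.
added_ms = {k: round(loaded[k] - idle[k], 1) for k in idle}
ratio = {k: round(loaded[k] / idle[k], 1) for k in idle}

for k in idle:
    print(f"{k:>4}: +{added_ms[k]} ms ({ratio[k]}x)")
```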

    Tags: bufferbloat internet networking optimisation performance testing tools

Waymos don’t stop for pedestrians

Garbage Day on Meta’s moderation plans

  • Garbage Day on Meta's moderation plans

    This is 100% spot on, I suspect, regarding Meta's recently-announced plans to give up on content moderation:

    After 2021, the major tech platforms we’ve relied on since the 2010s could no longer pretend that they would ever be able to properly manage the amount of users, the amount of content, the amount of influence they “need” to exist at the size they “need” to exist at to make the amount of money they “need” to exist.

    And after sleepwalking through the Biden administration and doing the bare minimum to avoid any fingers pointed their direction about election interference last year, the companies are now fully giving up. Knowing the incoming Trump administration will not only not care, but will even reward them for it.

    The question now is, what will the EU do about it? This is a flagrant raised finger in the face of the Digital Services Act.

    Tags: moderation content ugc meta future dsa eu garbage-day

“uhtcearu”

ads.txt for a site with no ads

  • ads.txt for a site with no ads

    Don Marti: "since there’s a lot of malarkey in the online advertising business, I’m putting up this file [on my website] to let the advertisers know that if someone sold you an ad and claimed it ran on here, you got burned."

    The format is defined in a specification from the IAB Tech Lab. The important part is the last line. The placeholder is how you tell the tools that are supposed to be checking this stuff that you don’t have ads.
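    For illustration, a checker for such a file might look like the sketch below. The placeholder record used here (placeholder.example.com, placeholder, DIRECT, placeholder) is my recollection of the IAB Tech Lab spec and should be checked against it; the sample file and the parsing rules are simplified assumptions.

```python
# Placeholder record as I understand the IAB ads.txt spec; verify before relying on it.
PLACEHOLDER = ("placeholder.example.com", "placeholder", "DIRECT", "placeholder")

def parse_records(text: str) -> list:
    """Return data records as tuples, skipping comments, blanks and variable lines."""
    records = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "," not in line:
            continue  # blank, comment-only, or a variable such as contact=...
        records.append(tuple(field.strip() for field in line.split(",")))
    return records

def declares_no_ads(text: str) -> bool:
    """True when the only data record in the file is the placeholder."""
    return parse_records(text) == [PLACEHOLDER]

# Hypothetical file for a site with no ads.
sample = """\
# ads.txt for a site with no ads
contact=webmaster@example.com
placeholder.example.com, placeholder, DIRECT, placeholder
"""
```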

    Tags: ads don-marti hacks ads-txt web

Hoarder

  • Hoarder

    "Quickly save links, notes, and images and hoarder will automatically tag them for you using AI for faster retrieval. Built for the data hoarders out there!"

    Self-hosted (with a docker-compose file), open-source link hoarding tool; intriguingly, this scrapes links, extracts text and images, generates automated tag suggestions using OpenAI or a local ollama LLM, and indexes the page's full text using Meilisearch, which seems to be a speedy incremental search. Could be a great place to gateway links from this blog into a super-searchable form. hmm

    Tags: links archiving bookmarks web search hoarder docker ai

The AI We Deserve

  • The AI We Deserve

    A very thought-provoking essay from Evgeny Morozov on AI, LLMs and their embodied political viewpoint:

    Sure, I can build a personalized language learning app using a mix of private services, and it might be highly effective. But is this model scalable? Is it socially desired? Is this the equivalent of me driving a car where a train might do just as well? Could we, for instance, trade a bit of efficiency and personalization to reuse some of the sentences or short stories I’ve already generated in my app, reducing the energy cost of re-running these services for each user?

    This takes us to the core problem with today’s generative AI. It doesn’t just mirror the market’s operating principles; it embodies its ethos. This isn’t surprising, given that these services are dominated by tech giants that treat users as consumers above all. Why would OpenAI, or any other AI service, encourage me to send fewer queries to their servers or reuse the responses others have already received when building my app? Doing so would undermine their business model, even if it might be better from a social or political (never mind ecological) perspective. Instead, OpenAI’s API charges me— and emits a nontrivial amount of carbon emissions— even to tell me that London is the capital of the UK or that there are one thousand grams in a kilogram.

    For all the ways tools like ChatGPT contribute to ecological reason, then, they also undermine it at a deeper level—primarily by framing our activities around the identity of isolated, possibly alienated, postmodern consumers. When we use these tools to solve problems, we’re not like Storm’s carefree flâneur, open to anything; we’re more like entrepreneurs seeking arbitrage opportunities within a predefined, profit-oriented grid. [....]

    The Latin American examples give the lie to the “there’s no alternative” ideology of technological development in the Global North. In the early 1970s, this ideology was grounded in modernization theory; today, it’s rooted in neoliberalism. The result, however, is the same: a prohibition on imagining alternative institutional homes for these technologies. There’s immense value in demonstrating—through real-world prototypes and institutional reforms—that untethering these tools from their market-driven development model is not only possible but beneficial for democracy, humanity, and the planet.

    Tags: technology ai history eolithism neoliberalism llms openai cybernetics hans-otto-storm cybersyn

Principal Engineer Roles

  • Principal Engineer Roles

    From AWS VP of Technology, Mae-Lan Tomsen Bukovec -- a set of roles which a Principal Engineer can play to get projects done:

    Sponsor: A Sponsor is a project/program lead, spanning multiple teams. Yes, this role can be played by a manager but it does not have to be (at least not at Amazon). If you are a Sponsor, you have to make sure decisions are made and that people aren’t stuck in analysis paralysis. This doesn’t mean that you yourself make those decisions (that’s often a Tie-breaker’s role which you may or may not be here). But you have to drive making sure decisions get made, which can mean owning those decisions, escalating to the right people, or whatever it takes to get it done.

    A Sponsor is constantly clearing obstacles and getting things moving. It is a time-consuming role. You shouldn’t have time to act as Guide or a Sponsor on more than two projects combined, and you don’t have to be a Sponsor every year. But if a few years go by, and you haven’t been a Sponsor, it might be time to think about where you can step in and play that role. It tends to build new skills because you have to operate in different dimensions to land the right outcomes for the project.

    Guide: Guides tend to be domain experts that are deeply involved in the architecture of a project. Guide will often drive the design but they’re not “The Architect.” A Guide often works through others to produce the designs, and themselves produce exemplary artifacts, like design docs or bodies of code. The code produced by a Guide is usually illustrative of a broader pattern or solving a difficult problem that the rest of the team will often run with afterwards. The difference between a Guide and a Sponsor is that the Guide focuses on the technical path for the project, and the Sponsor owns all aspects of project delivery, including product definition and organizational alignment.

    Guides influence teams. If you are influencing individuals, you’re likely being a mentor and not a Guide. A Guide is a time-consuming role. You shouldn’t have time to Guide more than two projects, and that drops to one project if you are a Sponsor at the same time.

    Catalyst: A Catalyst gets an idea off the ground, and it’s not always their idea. In my experience, the idea might not even come from the Catalyst—it can be something we’ve been talking about doing for years but never really got off the ground. Catalysts will create docs or prototypes and drive discussions with senior decision makers to think through the concept. Catalysts are not just “idea factories.” They take the time to develop the concept, drive buy-in for the idea, and work with the larger leadership team to assign engineers to deliver the project.

    A Catalyst is a time-consuming role because of all the work that needs to be done. At Amazon, that involves prototypes, docs and discussions. It is hard to effectively Catalyze more than one or two things at once. It is important to note that Catalysts, like Tie-breakers, are not permanent roles. Once a project is catalyzed (e.g., in engineering with a dedicated team working on the project), a Catalyst moves out of the role. The Catalyst might take on a Guide or Sponsor role on the project, or not. Not every project needs a Catalyst. A Catalyst is a very helpful (arguably critical) role for your most ambitious, complex, and/or ambiguous problems to solve in the organization.

    Tie Breaker: A Tie-Breaker makes a decision after a debate. At Amazon, that means deeply understanding the different positions, weighing in with a choice, and then formally closing it out with an email or a doc to the larger group. Not every project needs a Tie-Breaker. But if your project gets stuck in a consensus-seeking mode without making progress on hard decisions, a senior engineer might have to step in as a Tie-Breaker. Tie-breakers own breaking a log-jam on direction in the team by making a decision. Obviously, a Tie Breaker has to have great judgment. But, it is incredibly important that the Tie-Breaker listens well and understands all the nuances to the different positions as part of breaking the tie. When a Tie-Breaker drives a choice, they must bring other engineers into their thought process so that all the engineers in the debate understand the “why” behind the choice even if some are disappointed by the direction. A Tie-Breaker must have strong engineering and organizational acumen in this role.

    Sometimes an organization will depend on a small set of senior engineers to play the role of Tie-Breaker because they are so good at it. As a successful Tie-Breaker, you want to be careful not to set a tone that every decision, no matter how small, must go through you. You’ll quickly transition from Tie-Breaker to a “decision bottleneck” at that point—and that is not a role any team needs. If a team finds itself frequently seeking out a Tie-Breaker, it could be a sign that the team needs help understanding how to make decisions. That's a topic for a different time. The Tie-Breaker role is considered a “moment in time” role, versus Sponsor/Guide which are ongoing until you reach a milestone. Once the decision is made and closed out, you’re no longer the Tie-Breaker.

    Catcher: A Catcher gets a project back on track, often from a technical perspective. It requires high judgement because a Catcher drives prioritization and formulating a pragmatic plan under tight deadlines. Catchers must quickly do their own detailed analysis to understand the nuances of the problem and come up with the path forward in the right timeframe. As a comparison, a Tie-breaker tends to step in when the pros/cons of the different approaches are well known and the team needs to make a hard decision. Once “caught” (i.e., the project is back on track and moving forward), a project doesn’t need the Catcher anymore.

    Sometimes Principal Engineers can do too much catching. Don’t get me wrong, we are all Catchers sometimes—including me. Any fast-paced business needs Catchers in engineering and management. It teaches important skills about leadership in difficult moments and helps the business by landing deliverables. It also teaches you what not to do next time. However, it is better to generalize a Catcher skill set across more engineers and not depend on a small set of Principal Engineers as Catchers. If a Principal Engineer plays Catcher all the time through a succession of projects, it leaves no time to develop skills in other roles.

    Participant: A participant works on something without one of these explicitly assigned leadership roles. A Participant can be active or passive. Active participants are hands-on, and do things like spend a few days working through a design discussion or picking up a coding task occasionally on a project, etc. Passive participants offer up a few points in a meeting and move on. In general, if you're going to participate it's better to do so actively. Time-boxing some passive participation (e.g., office hours for engineers) can be a useful mechanism to stay connected to the team. However, keep in mind that it is easy for your time to get consumed by being a Participant in too many things.

    (via Marc Brooker)

    Tags: roles principal-engineer work projects project-management amazon aws via:marc-brooker

Brian Eno on AI

  • Brian Eno on AI

    In my own experience as an artist, experimenting with AI has mixed results. I’ve used several “songwriting” AIs and similar “picture-making” AIs. I’m intrigued and bored at the same time: I find it quickly becomes quite tedious. I have a sort of inner dissatisfaction when I play with it, a little like the feeling I get from eating a lot of confectionery when I’m hungry. I suspect this is because the joy of art isn’t only the pleasure of an end result but also the experience of going through the process of having made it. When you go out for a walk it isn’t just (or even primarily) for the pleasure of reaching a destination, but for the process of doing the walking. For me, using AI all too often feels like I’m engaging in a socially useless process, in which I learn almost nothing and then pass on my non-learning to others. It’s like getting the postcard instead of the holiday. [...]

    All that said, I do believe that AI tools can be very useful to an artist in making it possible to devise systems that see patterns in what you are making and drawing them to your attention, being able to nudge you into territory that is unfamiliar and yet interestingly connected. I say this having had some good experiences in my own (pre-AI) experiments with Markov chain generators and various crude randomizing procedures. [...]

    To make anything surprising and beautiful using AI you need to prepare your prompts extremely carefully, studiously closing off all the yawning, magnetic chasms of Hallmark mediocrity. If you don’t want to get moon rhyming with June, you have to give explicit instructions like, “Don’t rhyme moon with June!” And then, at the other end of the process, you need to rigorously filter the results. Now and again, something unexpected emerges. But even with that effort, why would a system whose primary programming is telling it to take the next most probable step produce surprising results? The surprise is primarily the speed and the volume, not the content. 

    Tags: play process technology culture future art music ai brian-eno creation

Inky Frame 7.3″

Sweden’s Suspicion Machine

  • Sweden’s Suspicion Machine

    Here we go, with another predictive algorithm-driven bias machine used to drive refusal of benefits:

    Lighthouse Reports and Svenska Dagbladet obtained an unpublished dataset containing thousands of applicants to Sweden’s temporary child support scheme, which supports parents taking care of sick children. Each of them had been flagged as suspicious by a predictive algorithm deployed by the Social Insurance Agency. Analysis of the dataset revealed that the agency’s fraud prediction algorithm discriminated against women, migrants, low-income earners and people without a university education. Months of reporting — including conversations with confidential sources — demonstrate how the agency has deployed these systems without scrutiny despite objections from regulatory authorities and even its own data protection officer.

    Tags: sweden predictive algorithms surveillance welfare benefits bias data-protection fraud

Thalidomide chirality paradox explained

  • Thalidomide chirality paradox explained

    Molecular chirality ("left-handedness" and "right-handedness") has been in the news again recently.

    What is little known is the relevance of chirality to the thalidomide disaster. Thalidomide, the drug which was prescribed widely to pregnant women in the 1950s for the treatment of morning sickness, was later discovered to be a chiral molecule, and while the right-handed (R) molecule was effective, the left-handed (S) one was extremely toxic, causing thousands of children around the world to be born with severe birth defects. The mystery is, why didn't this toxicity emerge during animal experiments? Here's a paper with a potential explanation:

    Twenty years after the thalidomide disaster in the late 1950s, Blaschke et al. reported that only the (S)-enantiomer of thalidomide is teratogenic [jm: causing birth defects]. However, other work has shown that the enantiomers ["mirror" molecules] of thalidomide interconvert in vivo, which begs the question: why is teratogen activity not observed in animal experiments that use (R)-thalidomide given the ready in vivo racemization (“thalidomide paradox”)? Herein, we disclose a hypothesis to explain this “thalidomide paradox” through the in-vivo self-disproportionation of enantiomers. Upon stirring a 20% ee solution of thalidomide in a given solvent, significant enantiomeric enrichment of up to 98% ee was observed reproducibly in solution. We hypothesize that a fraction of thalidomide enantiomers epimerizes in vivo, followed by precipitation of racemic [equally mixed between R/S forms] thalidomide in (R/S)-heterodimeric form. Thus, racemic thalidomide is most likely removed from biological processes upon racemic precipitation in (R/S)-heterodimeric form. On the other hand, enantiomerically pure thalidomide remains in solution, affording the observed biological experimental results: the (S)-enantiomer is teratogenic, while the (R)-enantiomer is not.

    Tags: chirality thalidomide molecules drugs medicine papers chemistry

UK passes the Online Safety Act

  • UK passes the Online Safety Act

    Apparently "The Online Safety Act applies to every service which handles user-generated content and has “links to the UK”, with a few limited exceptions listed below. The scope is extraterritorial (like the GDPR) so even sites entirely operated outside the UK are in scope if they are considered to have “links to the UK”."

    A service has links to the UK if any of the following apply:

    - the service has a “significant number” of UK users
    - UK users form one of the target markets for the service
    - the service is accessible to UK users and “there are reasonable grounds to believe that there is a material risk of significant harm to individuals in the UK” (this seems less likely to apply for smaller services but who knows)

    Tags: osa uk safety regulations ofcom

Why did Silicon Valley turn right?

  • Why did Silicon Valley turn right?

    A great essay on the demise of the 1990s/2000s liberal consensus in Silicon Valley:

    No-one now believes - or pretends to believe - that Silicon Valley is going to connect the world, ushering in an age of peace, harmony and likes across nations. [...] A decade ago, liberals, liberaltarians and straight libertarians could readily enthuse about “liberation technologies” and Twitter revolutions in which nimble pro-democracy dissidents would use the Internet to out-maneuver sluggish governments. Technological innovation and liberal freedoms seemed to go hand in hand. Now they don’t. Authoritarian governments have turned out to be quite adept for the time being, not just at suppressing dissidence but at using these technologies for their own purposes. Platforms like Facebook have been used to mobilize ethnic violence around the world, with minimal pushback from the platform’s moderation systems [...] My surmise is that this shift in beliefs has undermined the core ideas that held the Silicon Valley coalition together. Specifically, it has broken the previously ‘obvious’ intimate relationship between innovation and liberalism. I don’t see anyone arguing that Silicon Valley innovation is the best way of spreading liberal democratic awesome around the world any more, or for keeping it up and running at home. Instead, I see a variety of arguments for the unbridled benefits of innovation, regardless of its benefits for democratic liberalism. I see a lot of arguments that AI innovation in particular is about to propel us into an incredible new world of human possibilities, provided that it isn’t restrained by DEI, ESG and other such nonsense. Others (or the same people) argue that we need to innovate, innovate, innovate because we are caught in a technological arms race with China, and if we lose, we’re toast. 
    Others (sotto or brutto voce; again, sometimes the same people) - contend innovation isn’t really possible in a world of democratic restraint, and we need new forms of corporate authoritarianism with a side helping of exit, to allow the kinds of advances we really need to transform the world.

    Tags: essays henry-farrell tech politics silicon-valley fascism democracy liberalism

Black plastic won’t kill you

  • Black plastic won't kill you

    How a simple math error sparked a panic about toxic chemicals in black plastic kitchen utensils:

    Plastics rarely make news like this. From Newsmax to Food and Wine, and from the Daily Mail to CNN, the media uptake was enthusiastic on a paper published in October in the peer-reviewed journal Chemosphere. “Your cool black kitchenware could be slowly poisoning you, study says. Here’s what to do,” said the LA Times. “Yes, throw out your black spatula,” said the San Francisco Chronicle. Salon was most blunt: “Your favorite spatula could kill you,” it said. [....] The paper correctly gives the reference dose for BDE-209 as 7,000 nanograms per kilogram of body weight per day, but calculates this into a limit for a 60-kilogram adult of 42,000 nanograms per day. So, as the paper claims, the estimated actual exposure from kitchen utensils of 34,700 nanograms per day is more than 80 per cent of the EPA limit of 42,000. That sounds bad. But 60 times 7,000 is not 42,000. It is 420,000. This is what Joe Schwarcz [director of McGill University’s Office for Science and Society] noticed. The estimated exposure is not even a tenth of the reference dose.
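
    The corrected arithmetic is easy to check. A minimal sketch using only the figures quoted above (the variable names are just for illustration):

```python
# Figures quoted from the article; variable names are illustrative.
REFERENCE_DOSE_NG_PER_KG_DAY = 7_000   # reference dose for BDE-209
BODY_WEIGHT_KG = 60                    # adult body weight used in the paper
ESTIMATED_EXPOSURE_NG_DAY = 34_700     # paper's estimated exposure from utensils

correct_limit = REFERENCE_DOSE_NG_PER_KG_DAY * BODY_WEIGHT_KG
paper_limit = 42_000                   # the value the paper stated in error

share_of_paper = ESTIMATED_EXPOSURE_NG_DAY / paper_limit
share_of_correct = ESTIMATED_EXPOSURE_NG_DAY / correct_limit

print(f"correct limit: {correct_limit:,} ng/day")
print(f"exposure vs paper's limit: {share_of_paper:.1%}")
print(f"exposure vs correct limit: {share_of_correct:.1%}")
```

    The estimated exposure comes out at a bit over 8 per cent of the correctly calculated limit, rather than the 80-plus per cent the paper reported.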

    (tags: cooking research science plastics errors maths math fail papers)

ntfy.sh

  • ntfy.sh

    Send push notifications to your phone via PUT/POST. "a simple HTTP-based pub-sub notification service. It allows you to send notifications to your phone or desktop via scripts from any computer, and/or using a REST API. It's infinitely flexible, and 100% free software."

    I've been using a personal Slack for this purpose, but this is a decent-sounding alternative.
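
    A minimal sketch of publishing a message from Python's standard library, assuming the public ntfy.sh server. The topic name in the usage comment is made up, and the optional `Title` header is the only extra used here:

```python
import urllib.request

NTFY_SERVER = "https://ntfy.sh"  # public instance; a self-hosted URL would go here instead


def build_notification(topic: str, message: str, title: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes `message` to `topic`."""
    headers = {"Title": title} if title else {}
    return urllib.request.Request(
        f"{NTFY_SERVER}/{topic}",
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_notification(topic: str, message: str, title: str = "") -> int:
    """Send the notification and return the HTTP status code."""
    request = build_notification(topic, message, title)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Example (performs a real HTTP request; "my-alerts-topic" is a hypothetical topic):
#   send_notification("my-alerts-topic", "Backup finished", title="homeserver")
```

    Anyone who knows a topic name can subscribe to it, so a hard-to-guess topic is a sensible default.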

    (tags: notification push alerting open-source android ios push-messaging)

Pleias language models

  • Pleias language models

    OK, this is quite cool: "the first ever [language] models trained exclusively on open data, meaning data that are either non-copyrighted or are published under a permissible license. These are the first fully EU AI Act compliant models. In fact, Pleias sets a new standard for safety and openness."

    Training large language models required copyrighted data until it did not. Today we release Pleias 1.0 models, a family of fully open small language models. Pleias 1.0 models include three base models: 350M, 1.2B, and 3B parameters. They feature two specialized models for knowledge retrieval with unprecedented performance for their size on multilingual Retrieval-Augmented Generation, Pleias-Pico (350M parameters) and Pleias-Nano (1.2B parameters). [...] Our models are: * multilingual, offering strong support for multiple European languages; * safe, showing the lowest results on the toxicity benchmark; * performant for key tasks, such as knowledge retrieval; * able to run efficiently on consumer-grade hardware locally (CPU-only, without quantisation) Pleias 1.0 family embodies a new approach to specialized small language models, for end applications: wound-up models. We have implemented a set of ideas and solutions during pretraining that produce a frugal yet powerful language model specifically optimized for further RAG implementations. We release two wound-up models further trained for Retrieval Augmented Generation (RAG): Pleias-pico-350m-RAG and Pleias-nano-1B-RAG. These models are designed to be implemented locally, so we prioritized frugal implementation. As our models are small, they can run smoothly, even on devices with limited RAM.

    And here's their fully open training set: https://huggingface.co/datasets/PleIAs/common_corpus

    (tags: llms models huggingface ai pleias rag ai-act open-data)

UK benefits AI system found to show bias

  • UK benefits AI system found to show bias

    File this under "the least surprising news ever":

    An artificial intelligence system used by the UK government to detect welfare fraud is showing bias according to people’s age, disability, marital status and nationality, the Guardian can reveal. An internal assessment of a machine-learning programme used to vet thousands of claims for universal credit payments across England found it incorrectly selected people from some groups more than others when recommending whom to investigate for possible fraud.

    The most interesting aspect of the report published is that currently "there is no established numerical or statistical benchmark at which referral or outcome disparity can be defined as within tolerance".

    I would have assumed that a lack of bias should have been a design goal, and a critical KPI, for such a system. It could be measured using the "false positive" rate for each group -- ie. benefits recipients who were selected for additional checks, and who were then found to be legitimate and not committing fraud.
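
    As a rough sketch of what such a KPI could look like: compute, per group, the share of legitimate claimants who were nonetheless flagged for investigation, then compare the groups. The records, group labels and ratio-based comparison below are invented for illustration; none of it comes from the DWP assessment.

```python
from collections import defaultdict

# Each record: (group, was_flagged_for_review, was_actually_fraud). Made-up data.
records = [
    ("under_35", True, False), ("under_35", False, False), ("under_35", True, True),
    ("under_35", False, False), ("under_35", False, False),
    ("over_35", True, False), ("over_35", True, False), ("over_35", False, False),
    ("over_35", True, True), ("over_35", False, False),
]


def false_positive_rates(rows):
    """Per group: flagged-but-legitimate claims divided by all legitimate claims."""
    flagged = defaultdict(int)
    legitimate = defaultdict(int)
    for group, was_flagged, was_fraud in rows:
        if not was_fraud:
            legitimate[group] += 1
            flagged[group] += int(was_flagged)
    return {group: flagged[group] / legitimate[group] for group in legitimate}


rates = false_positive_rates(records)
# Ratio of worst to best group; assumes no group has a rate of exactly zero.
disparity = max(rates.values()) / min(rates.values())
print(rates)
print(disparity)
```

    A tolerance threshold on a number like `disparity` is the kind of benchmark the report says does not yet exist.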

    There are going to be a lot of similar examples in the years to come -- here's hoping this "bias measurement" KPI becomes established as a concept.

    (tags: bias ai kpis dwp uk benefits welfare fraud ml)

Ridding My Home Network of IP Addresses

(Republishing this one on the blog, instead of just as a gist)

Recent changes in the tech scene have made it clear that depending on commercial companies to provide services I rely on isn't a good strategy in the long term, and given that Tailscale is so effective these days as a remote-access system, I've gradually been expanding a small collection of self-hosted web apps and services running on my home network.

Until now they've mainly been addressed using their IP addresses and random high ports on the internal LAN, for example:

  1. Pihole: http://10.19.72.7/admin
  2. Home Assistant: http://10.19.72.11:8123/
  3. Linkding: http://10.19.72.6:9092/
  4. Grafana: http://10.19.72.6:3000/
  5. (plus a good few others)

Needless to say this is a bit messy and inelegant, so I've been planning to sort it out for a while. My requirements:

  1. no more ugly bare IP addresses!
  2. a DNS domain;
  3. with HTTPS URLs;
  4. one per service;
  5. no visible port numbers;
  6. fully valid TLS certs, no having to click through warnings or install funny CA certs;
  7. accessible regardless of which DNS server is in use -- ie. using public DNS records. This may seem slightly unusual, but it's useful so that the internal services can still be accessed when I'm using my work VPN (which forces its own DNS servers);
  8. accessible internally;
  9. accessible externally, over Tailscale;
  10. not accessible externally without Tailscale.

After a few false starts, I'm pretty happy with the current setup, which uses Caddy.

Hosting The Domain At Cloudflare

First off, since the service URLs are not to be accessible externally without Tailscale active, the HTTP challenge approach to provision Let's Encrypt certs cannot be used. That would require an open-to-the-internet publicly-accessible HTTP server on my home network, which I absolutely want to avoid.

In order to use the ACME DNS challenge instead, I set up my public domain "taint.org" to use Cloudflare as the authoritative DNS server (in Cloudflare terms, "full setup"). This lets Caddy edit the DNS records via the Cloudflare API to handle the ACME challenge process.

One of the internal hosts is needed to run the Caddy server's reverse proxies; I picked "hass", 10.19.72.11, the Home Assistant host, which didn't have anything already running on port 80 or port 443. (All of my internal hosts are running on a private /24 IP range, at 10.19.72.0/24.)

The dedicated DNS domain I'm using for my home services is "home.taint.org". In order to use this, I clicked through to the Cloudflare admin panel and created a DNS record as follows:

Type   Name      Content             Proxy Status               TTL
A      *.home    10.19.72.11         DNS only - reserved IP     Auto

Now, any hostnames under "home.taint.org" will return the IP 10.19.72.11 (where Caddy will run).

I don't particularly care about exposing my internal home network IPs to the world, as a trade-off to allow the URLs to work even if an internal host is using the work VPN, or resolving with 8.8.8.8, or whatever. That's worth missing out on a little bit of paranoia, since the IPs won't be accessible from outside without Tailscale anyway.
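
To illustrate why publishing that A record is low-risk: 10.19.72.11 sits in RFC 1918 private address space, which is not routed over the public internet. A quick check with Python's standard `ipaddress` module:

```python
import ipaddress

caddy_host = ipaddress.ip_address("10.19.72.11")
home_lan = ipaddress.ip_network("10.19.72.0/24")

print(caddy_host.is_private)    # True: RFC 1918 private range
print(caddy_host in home_lan)   # True: inside the home /24
print(caddy_host.is_global)     # False: not reachable from the public internet
```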

It is worth noting that the Cloudflare-hosted domain doesn't have to be the same one used for URLs in the home network; using dns_challenge_override_domain you can delegate the ACME challenge from any "home" domain to one which is hosted in Cloudflare.

The Caddy Setup

One wrinkle is that I had to generate a custom Caddy build in order to get the "dns.providers.cloudflare" non-standard module, from https://caddyserver.com/download . This is a click-and-download page which generates a custom Caddy binary on the fly. It would have been nicer if the Cloudflare module was standard, but hey.

Once that's installed, I can get this output:

$ /usr/local/bin/caddy list-modules
[long list of standard modules omitted]

dns.providers.cloudflare
dns.providers.route53

  Non-standard modules: 2

  Unknown modules: 0

(Yes, I have Caddy running as a normal service, not as a Docker container. No particular reason; I think Docker should work fine.)

Go to the Cloudflare account dashboard, and create a user API token as described at https://developers.cloudflare.com/fundamentals/api/get-started/create-token/ . In my case, it has Zone / DNS / Edit permission, on the specific zone taint.org.

Copy that token as it's needed in the "Caddyfile", which now looks like the following:

hass.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.11:8123
}

links.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.6:9092
}

pi.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        redir / /admin/
        reverse_proxy /admin/* 10.19.72.7:80
}

grafana.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.6:3000
}

[many other services omitted]
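
Since every site block repeats the same tls stanza, the simple ones could also be generated from a short table of services. A minimal sketch in Python: the script and its names are hypothetical and not part of the setup described here, and it does not cover blocks with extra rules such as the redir in the Pihole block above.

```python
# Hypothetical generator for the simple site blocks; not part of the original setup.
DOMAIN = "home.taint.org"
TOKEN_PLACEHOLDER = "cloudflare_api_token_goes_here"

# subdomain -> upstream host:port, copied from the blocks above
SERVICES = {
    "hass": "10.19.72.11:8123",
    "links": "10.19.72.6:9092",
    "grafana": "10.19.72.6:3000",
}

BLOCK_TEMPLATE = """\
{name}.{domain} {{
        tls {{
                dns cloudflare {token}
        }}
        reverse_proxy /* {upstream}
}}
"""


def render_caddyfile(services: dict[str, str]) -> str:
    """Return one site block per service, separated by blank lines."""
    blocks = [
        BLOCK_TEMPLATE.format(
            name=name, domain=DOMAIN, token=TOKEN_PLACEHOLDER, upstream=upstream
        )
        for name, upstream in services.items()
    ]
    return "\n".join(blocks)


print(render_caddyfile(SERVICES))
```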

Running sudo caddy run in the same dir will start up and verbosely log what it's doing. (Once you're happy enough, you can get Caddy running in the normal systemd service way.)

After setting those up, I now have my services accessible locally as:

  1. Home Assistant: https://hass.home.taint.org/
  2. Pihole: https://pi.home.taint.org/
  3. Grafana: https://grafana.home.taint.org/
  4. Linkding: https://links.home.taint.org/

Caddy seamlessly goes off and configures fully valid TLS certs with no fuss. I found it much tidier than Certbot, or Nginx Proxy Manager.

The Tailscale Setup

So this has now sorted out all of the requirements bar one:

  1. accessible externally, over Tailscale.

To do this I had to log into Tailscale's admin console and go to https://login.tailscale.com/admin/machines , pick a host on the 10.19.72.0/24 internal LAN, click its dropdown menu and "Edit Route Settings...", and enable a Subnet Route for 10.19.72.0/24. By doing this, all of the service.home.taint.org DNS records are now accessible, remotely, once Tailscale is enabled; I don't even need to use ts.net names to access them! Perfect.

Anyway, that's the setup -- hopefully this writeup will help others. And kudos to Caddy, Let's Encrypt and Tailscale for making this relatively easy.

GenCast

  • GenCast

    Google DeepMind announce their new AI model for weather forecasting, in collaboration with the ECMWF:

    Today, in a paper published in Nature, we present GenCast, our new high resolution (0.25°) AI ensemble model. GenCast provides better forecasts of both day-to-day weather and extreme events than the top operational system, the European Centre for Medium-Range Weather Forecasts’ (ECMWF) ENS, up to 15 days in advance. We’ll be releasing our model’s code, weights, and forecasts, to support the wider weather forecasting community. [...] GenCast is a diffusion model, the type of generative AI model that underpins the recent, rapid advances in image, video and music generation. However, GenCast differs from these, in that it’s adapted to the spherical geometry of the Earth, and learns to accurately generate the complex probability distribution of future weather scenarios when given the most recent state of the weather as input. To train GenCast, we provided it with four decades of historical weather data from ECMWF’s ERA5 archive. This data includes variables such as temperature, wind speed, and pressure at various altitudes. The model learned global weather patterns, at 0.25° resolution, directly from this processed weather data.

    It's open source: https://github.com/google-deepmind/graphcast

    And here are the open-released model weights: https://console.cloud.google.com/storage/browser/dm_graphcast

    GraphCast (the previous iteration) has public forecasts published at https://charts.ecmwf.int/?query=GraphCast , under a CC-BY-NC-SA-4 licence -- it would be great if the GenCast forecasts join this data set.

    Paper: https://arxiv.org/abs/2312.15796

    This all looks really great, a fantastic commitment to (genuine) openness and open data, and the paper seems rigorous (to this amateur). Great stuff.

    (tags: forecasting weather ai gencast graphcast deepmind google ecmwf genai)

TikTok in hot water over Romanian elections

  • TikTok in hot water over Romanian elections

    ‘We are getting fed up’: EU lawmakers snap at TikTok over Romanian election:

    For years, the Chinese-owned social media app has brushed off security concerns in the United States and Europe that it could be used for mass manipulation and espionage. It now faces an intense regulatory storm in Bucharest over whether it played a role in skewing the democratic process in an EU country and NATO member of 19 million people. [....] "Honestly speaking, we are getting fed up by the documents and the empty promises," Swedish center-right European lawmaker Arba Kokalari said near the end of the hearing.

    (tags: tiktok elections romania eu bias news propaganda democracy social-media)

noyb is now qualified to bring collective redress actions

  • noyb is now qualified to bring collective redress actions

    "noyb is now approved as a so-called "Qualified Entity" to bring collective redress actions in courts throughout the European Union. Such action under Directive (EU) 2020/1828 can either be an "injunction" or a "redress" measure. "Injunctions" generally prohibit a company from engaging in illegal practices, including any GDPR violations. "Redress" measures allow a European version of a "Class Action", where thousands or millions of users could be represented by noyb and for example ask for non-material damages when their personal data was unlawfully processed." This is very interesting -- and timely, given the mass scraping of user data to feed AI training sets...

    (tags: noyb data-privacy data-protection class-actions law eu collective-redress)

Privacy Disasters: FaceHuggers Are Eating Your Skeets

The Buddhabrot

  • The Buddhabrot

    This was news to me! There's another fractal pattern derived from the Mandelbrot set which I'd never seen before:

    As it turns out, it’s not just the boundary of the Mandelbrot set that’s mind-bogglingly complex: the same goes for the (xn, yn) escape trajectories associated with the (u, v) pixels near the set’s edge. The iterated coordinates follow elaborate, long-winded paths through space; their ethereal trails form a density plot reminiscent of the Mandelbrot fractal itself.
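
    A minimal sketch of the idea in pure Python, assuming the usual Buddhabrot recipe: sample random points c, iterate z → z² + c, and for each point that escapes, add every position visited along its trajectory to a density grid. It uses complex numbers in place of the (xn, yn) and (u, v) pairs in the quote. The grid size, sample count and iteration limit are arbitrary small values chosen to run quickly, not to produce a pretty image.

```python
import math
import random


def buddhabrot(size=40, samples=20_000, max_iter=100, seed=1):
    """Return a size x size grid counting visits by escaping trajectories."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    for _ in range(samples):
        c = complex(rng.uniform(-2.0, 1.0), rng.uniform(-1.5, 1.5))
        z = 0j
        path = []
        for _ in range(max_iter):
            z = z * z + c
            path.append(z)
            if abs(z) > 2.0:
                # c escaped, so its whole trajectory contributes to the density plot
                for p in path:
                    col = math.floor((p.real + 2.0) / 3.0 * size)
                    row = math.floor((p.imag + 1.5) / 3.0 * size)
                    if 0 <= row < size and 0 <= col < size:
                        grid[row][col] += 1
                break
    return grid


grid = buddhabrot()
shades = " .:-=+*#%@"
peak = max(max(row) for row in grid)
for row in grid:
    # Map each count onto a character, darker meaning more visits
    print("".join(shades[v * len(shades) // (peak + 1)] for v in row))
```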

    (tags: fractals mandelbrot buddhabrot graphics maths via:lcamtuf)

Rewilding fields massively improved bumblebee numbers in Scotland

  • Rewilding fields massively improved bumblebee numbers in Scotland

    "Bumblebee population increases 116 times over in 'remarkable' Scotland project":

    Rewilding Denmarkfield, a 90-acre project based just north of Perth, has been working to restore nature to green spaces in an increasingly built up area for the past two years. Statistics from the charity show in 2021, when some of the fields managed by the project were still barley monoculture, only 35 bumblebees were counted. But by 2023, after just two years of nature restoration work in the same fields, the population increased to 4,056. The diversity of bumblebee also doubled, according to the charity, from five to ten different species.

    (tags: bees bumblebees scotland fields farming rewilding fallow nature)

WeSQL

  • WeSQL

    "an innovative MySQL distribution that adopts a compute-storage separation architecture, with storage backed by S3 (and S3-compatible systems). WeSQL has completely replaced MySQL’s traditional disk storage with S3. All MySQL data—binlogs, schemas, storage engine metadata, WAL, and data files—are entirely (not partially!) stored as objects in S3. The 11 nines of durability provided by S3 significantly enhances data reliability. Additionally, WeSQL can start from a clean, empty instance, connect to S3, load the data, and begin serving immediately with no additional setup required. It is ideal for users who need an easy-to-manage, cost-effective, and developer-friendly MySQL database solution, especially for those needing support for both Serverless and BYOC (Bring Your Own Cloud)." (via Ian on ITC)

    (tags: mysql s3 object-storage storage databases sql)

Reversing.Works Investigation Exposes Glovo’s Data Privacy Violations

  • Reversing.Works Investigation Exposes Glovo’s Data Privacy Violations

    Ha, this is great:

    Reversing.Works, an innovative project dedicated to exposing abuses within gig economy platforms, uncovered significant labour law violations within Glovo’s algorithmic management system and provided critical evidence for an investigation by the Italian Data Protection Authority. After a year-long investigation, the DPA fined Glovo 5 Million €, and demanded corrective action from the platform. Glovo’s algorithmic management system was found to have misused workers’ personal data in ways that violated labour law, including monitoring workers’ movements outside of their work shifts, keeping hidden scores on workers, and sending detailed monitoring of their work to third parties outside the scope of their contracts. This was a mixed violation of both Italian labour law and the General Data Protection Regulation (GDPR). Reversing.Works’ investigation, using sophisticated reversing engineering techniques, sheds light on the hidden mechanics that drive the platform’s model of operation, and perhaps additional business dynamics. [...] “It’s surprising that unions never used a tool like this,” says Gaetano Priori, the lead investigator at Reversing.Works. “Privacy is an individual right, so it hasn’t been seen as a tool for labour struggles. But it has potential in digitally-intermediated labour because one violation could affect all the workers in all the regions in which a company operates.” Reversing.Works has shown how GDPR and tech-enabled investigation can help expose bad practices and create fairer working conditions. This case is a call to action for all gig workers, showing that existing legal tools can be used for the collective good. Priori adds, “This should be a wake-up call for all workers managed by technology. With GDPR and tech, we have the means to challenge unfair practices.”

    (tags: reverse-engineering gdpr data-protection data-privacy gig-economy glovo italy unions)

Generative AI Pushes Outcome Over Process (And This Is Why I Hate It)

  • Generative AI Pushes Outcome Over Process (And This Is Why I Hate It)

    This is a really interesting point about education and learning, in general:

    AI technology is based on the idea that the important part of creating things is the outcome, not the process. Can't draw? That shouldn't stop you from making a picture. Worried about your writing? Why should that stop you from handing in a coherent essay? The ads for AI all promise that you'll be able to produce things without all the tedious work of actually producing it - isn't that great?  Well no, it's not - it's terrible. It betrays a fundamental misunderstanding of why creating things has value. It's terrible in general, but I am especially offended by this idea in the context of education, and in this post I want to lay this idea out in a little detail. 

    (tags: education learning ai process-vs-outcome working how-we-work)

S3 now supports appending

  • S3 now supports appending

    Ooh, interesting -- this can unlock a few new system designs:

    You can append data to the end of existing objects stored in the S3 Express One Zone storage class in directory buckets. We recommend that you use the ability to append data to an object if the data is written continuously over a period of time or if you need to read the object while you are writing to the object. Appending data to objects is common for use-cases such as adding new log entries to log files or adding new video segments to video files as they are trans-coded then streamed. By appending data to objects, you can simplify applications that previously combined data in local storage before copying the final object to Amazon S3.

    (tags: aws s3 storage cloud features)

Binary Quantization

  • Binary Quantization

    A readable explanation of the (relatively new) technique of Binary Quantization applied to LLM embeddings. It's pretty amazing that this compression technique can work without destroying search recall and accuracy, but it seems it does!

    Using BQ will reduce your memory consumption and improve retrieval speeds by up to 40x [...] Binary quantization (BQ) converts any vector embedding of floating point numbers into a vector of binary or boolean values. [...] All [vector floating point] numbers greater than zero are marked as 1. If it’s zero or less, they become 0. The benefit of reducing the vector embeddings to binary values is that boolean operations are very fast and need significantly less CPU instructions. [...] One of the reasons vector search still works with such a high compression rate is that these large vectors are over-parameterized for retrieval. This is because they are designed for ranking, clustering, and similar use cases, which typically need more information encoded in the vector.
    https://www.elastic.co/search-labs/blog/rabitq-explainer-101 is a good maths-heavy explanation of the Elastic implementation using RaBitQ. See also some results from HuggingFace, https://huggingface.co/blog/embedding-quantization .
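
    A toy sketch of the thresholding step described in the quote, in plain Python: components greater than zero become 1, the rest become 0, the bits are packed into an integer, and vectors are compared by Hamming distance (XOR plus a bit count). The example vectors are made up, and real implementations such as the RaBitQ approach linked above add corrections and rescoring that this sketch omits.

```python
def binarize(vector):
    """Pack a float vector into an int: bit i is 1 if vector[i] > 0, else 0."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming_distance(a, b):
    """Number of bit positions where the two packed vectors differ."""
    return bin(a ^ b).count("1")


# Made-up 8-dimensional "embeddings" for illustration
query = [0.12, -0.40, 0.88, 0.05, -0.31, 0.00, 0.47, -0.09]
doc_near = [0.10, -0.35, 0.91, 0.02, -0.28, -0.01, 0.50, -0.11]
doc_far = [-0.52, 0.33, -0.76, -0.14, 0.29, 0.41, -0.38, 0.20]

q, near, far = (binarize(v) for v in (query, doc_near, doc_far))
print(hamming_distance(q, near))  # small distance: similar vectors
print(hamming_distance(q, far))   # large distance: dissimilar vectors
```

    Each 32-bit float shrinks to a single bit, which is where the memory saving comes from.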

    (tags: embedding llm ai algorithms data-structures compression quantization binary-quantization quantisation rabitq search recall vectors vector-search)

[pdf] Sky UK on their IPv6/IPv4 gateways

  • [pdf] Sky UK on their IPv6/IPv4 gateways

    A presentation from RIPE89 detailing Sky's MAP-T setup, "IPv6-only with IPv4aaS (MAP-T)". Basically they now use MAP-T translation devices to provide "IPv4 as a service", transparent NAT mapping between IPv6 and IPv4. I suspect this is similar to how Virgin Media operates their network, too, in Ireland. Interestingly, there are now network features (like local CDN POPs) which are more performant when using IPv6 natively, as they avoid a "trombone" route via a network-border translation device to get an IPv4 address. As a result, it's actually starting to be worthwhile running an IPv6 home network....

    (tags: ipv4 ipv6 networking home sky isps ripe map-t nat ip)

headrotor/masto-pinb

  • headrotor/masto-pinb

    From Marsh Gardiner (https://hachyderm.io/@earth2marsh ), a "Mastodon To Pinboard bookmark integration script" -- "a Python script to mimic the functionality of Pinboard's Twitter integration. It reads the latest toots from a Mastodon account and bookmarks them in a Pinboard.in account. It is meant to be run repeatedly as a crontab job to continuously update your bookmarks in the background".

    (tags: mastodon pinboard bookmarks bookmarking scripts)

skyfirehose.com

  • skyfirehose.com

    "Query the Bluesky Jetstream with DuckDB" -- this is a lovely little hack from Tobias Müller (https://bsky.app/profile/tobilg.com). Basically, it's a pre-built DuckDB database file which contains tables which refer to Parquet files in an R2 bucket, which are (presumably) updated regularly with new Bluesky posts from their Jetstream. Tobias says: "there‘s a data gathering process that listens to the Jetstream and dumps the NDJSONs to the filesystem as hourly files. Then, DuckDB transform the data to Parquet files, they get uploaded with rclone." It's a lovely demo of how modern data lake tech can be exposed for public usage in a nice way.

    (tags: s3 parquet duckdb sql jetstream bluesky firehose data-lakes r2)

The Current State of This Blog’s Syndication

For the past several years, since the demise of Google Reader, I’ve been augmenting the RSS/Atom syndication of this linkblog with posts to various social media platforms using bot accounts. This is kind of a form of POSSE -- “Publish (on your) Own Site, Syndicate Elsewhere” (ideally I’d be self-hosting Pinboard to qualify for that I guess).

The destinations for cross-posts were first Twitter (RIP), and more recently Mastodon via botsin.space. With the shutdown of that instance, I’ve had to make a few changes to my syndication script which gateways the contents to Mastodon, and I also took the opportunity to set up a BlueSky gateway at the same time. On the prompting of @kellan, here’s a quick write-up of where it all currently stands…

Primary Source: Pinboard

The primary source for the blog’s contents is my long-suffering account at https://pinboard.in/u:jm/, where I have been collecting links since 2009 (and before that, del.icio.us since I think 2004?, so that’s 20 years of links by now).

Pinboard has a pretty simple UI for link collection using a bookmarklet, which I’ve improved a tiny bit to open a large editor textbox instead of the default tiny one.

The resulting posts generally tend to include a blockquote, a short lede, and a few tags in the normal Pinboard/Del.icio.us style.

I find editing text posts in the Pinboard bare-bones UI to be easier and more pleasant than WordPress, so I generally use that as the primary source. Based on the POSSE principle, I should really figure out a way to get this onto something self-hosted, but Pinboard works for me (at the moment at least).

Publish from Pinboard to Blog

I use a Python script run from cron, to gateway new bookmarks from https://pinboard.in/u:jm/ as individual posts, formatted with Markdown, to this blog using the WordPress posting API: Github repo

Publish from Pinboard to Mastodon

This reads the Pinboard RSS feed for https://pinboard.in/u:jm/ and posts any new URLs (and the first 500 chars of their descriptions) to the “jmason_links” account at mstdn.social: Github repo
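
The core loop of a gateway like this can be sketched as follows. This is not the actual script; the sample feed, the `seen` set and the function names are assumptions for illustration, and the posting step to Mastodon is left out entirely.

```python
import xml.etree.ElementTree as ET

MAX_DESCRIPTION_CHARS = 500

# Invented sample feed; a real run would fetch the Pinboard feed over HTTP instead.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>SlateDB</title><link>https://example.com/slatedb</link>
        <description>An embedded storage engine.</description></item>
  <item><title>ntfy.sh</title><link>https://example.com/ntfy</link>
        <description>Push notifications via PUT/POST.</description></item>
</channel></rss>"""


def local_name(tag: str) -> str:
    """Strip any XML namespace, so namespaced and plain RSS items both match."""
    return tag.rsplit("}", 1)[-1]


def new_posts(feed_xml: str, seen: set[str]) -> list[dict[str, str]]:
    """Return items whose link is not in `seen`, with truncated descriptions."""
    posts = []
    for element in ET.fromstring(feed_xml).iter():
        if local_name(element.tag) != "item":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        link = fields.get("link", "")
        if link and link not in seen:
            seen.add(link)
            posts.append({
                "title": fields.get("title", ""),
                "link": link,
                "description": fields.get("description", "")[:MAX_DESCRIPTION_CHARS],
            })
    return posts


seen_links = {"https://example.com/slatedb"}  # would be persisted between cron runs
for post in new_posts(SAMPLE_FEED, seen_links):
    print(post["title"], post["link"])
```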

Migration from the old Mastodon account at botsin.space to mstdn.social was really quite easy; after manually setting up the new account at mstdn.social and copying over the bio text, I hit the "Move from a different account" page, and entered @jm_links@botsin.space for the handle of the old account to migrate from.

I then logged in to the old account on botsin.space and hit the "Move to a different account" page, entering @jmason_links@mstdn.social for the handle to migrate to. This triggered copying of the followers from one account to the other, and left the old account dormant with a link to the new location instead.

(One thing to watch out for is that once the move is triggered, the profile for the old account becomes read-only; I've since had to temporarily undo the "moved" status in order to update the profile text, which was a bit messy.)

Publish from Pinboard to BlueSky

This reads the same Pinboard RSS feed as the Mastodon gateway, and gateways new posts from there to the “jmason.ie” account at BlueSky. This is slightly more involved than the Mastodon script, as it attempts to generate an embed card and mark up any links in the post appropriately: Github repo
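
One detail worth illustrating: Bluesky marks up links using "facets", which address ranges of the post text by UTF-8 byte offsets rather than character positions. A minimal sketch of building link facets follows; the regex and function name are illustrative and not taken from the gateway script.

```python
import re

URL_PATTERN = re.compile(rb"https?://[^\s]+")  # deliberately simple URL matcher


def link_facets(text: str) -> list[dict]:
    """Build app.bsky.richtext.facet link entries for each URL in `text`."""
    encoded = text.encode("utf-8")  # offsets must be counted in UTF-8 bytes
    facets = []
    for match in URL_PATTERN.finditer(encoded):
        facets.append({
            "index": {"byteStart": match.start(), "byteEnd": match.end()},
            "features": [{
                "$type": "app.bsky.richtext.facet#link",
                "uri": match.group(0).decode("utf-8"),
            }],
        })
    return facets


post_text = "Café notes → https://example.com/post"
for facet in link_facets(post_text):
    print(facet["index"], facet["features"][0]["uri"])
```

In this example the URL starts at character 13 but at byte 16, because "é" and "→" take more than one byte each in UTF-8; using character positions would mark up the wrong span.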

I have a cron on my home server which runs those Mastodon and BlueSky gateway scripts every 15 minutes, and that seems to be a reasonable cadence without hammering the various APIs too much.

Used EV Buying Guide

  • Used EV Buying Guide

    This, via Reddit, is an amazing guide to buying a used electric vehicle, from Croatia's EVClinic, who are a "car reverse engineering and specialty repair outfit. Taking cars apart, figuring out how and when they break, and figuring out how to repair them is their bread and butter. They've gained a reputation across Europe for being able to fix problems that even the manufacturers themselves don't know how to deal with. They've now distilled that working experience into a report, detailing which vehicles are reliable in the long term - and which ones should be avoided. Each model also has a list of which parts are most likely to break, after how much mileage they are likely to break, and how much it costs to repair.":

    Based on our experience and that of our colleagues’ labs at 15-20 different locations worldwide, we have concluded that the battery is the last concern on the list during the first 10 years of an EV’s life, with some vehicles covering a large number of miles with the original battery system. The most common failures within 10 years of using an EV are: 1. Electric motors, 2. OBC chargers, 3. DC-DC/inverters, and only in fourth place, batteries. Some vehicles can go 10 years without any breakdowns or servicing, resulting in significant savings compared to fossil fuel vehicles. Even EVs that experience faults are cheaper to maintain than their fossil-fueled counterparts, even when factoring in battery and motor failures. Fossil fuel vehicles consume at least €0.13 per kilometer just in fuel, excluding services and breakdowns. With services, breakdowns, and maintenance, they consume an additional minimum of €0.08, totaling over €40,000 for 200,000 km. Thus, a faulty EV is still cheaper than a “functional” fossil fuel vehicle.
    The article lists the Hybrid and Battery EVs available in Europe, and gives a rating to each one regarding their reliability and repairability, in extreme detail. Unfortunately, the BEV I drive -- the Nissan Leaf -- gets a terrible review due to what they consider really crappy battery technology choices. The perils of being an early adopter.... :(

    (tags: nissan leaf bevs evs driving cars hybrid-vehicles electric-vehicles used-cars repair)

How to Learn: Userland Disk I/O

  • How to Learn: Userland Disk I/O

    This is an interesting hodge-podge of key bits of information about disk I/O, file integrity and durability, buffering or unbuffered writes, async I/O, and which filesystems to use for high-I/O database operation on Linux, MacOS and Windows. One thing that was new to me: "You can periodically scrape /proc/diskstats to self-report on disk metrics".
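
    A minimal sketch of what that scraping could look like on Linux. The field positions follow the kernel's documented /proc/diskstats layout (major number, minor number and device name, followed by the I/O counters); the sample line and function name are made up for illustration.

```python
def parse_diskstats(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/diskstats content into a few per-device counters."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        name = fields[2]
        counters = [int(value) for value in fields[3:14]]
        stats[name] = {
            "reads_completed": counters[0],
            "sectors_read": counters[2],
            "ms_reading": counters[3],
            "writes_completed": counters[4],
            "sectors_written": counters[6],
            "ms_writing": counters[7],
            "ios_in_progress": counters[8],
        }
    return stats


# Made-up sample line; on a real system use open("/proc/diskstats").read()
SAMPLE = "   8       0 sda 1200 30 96000 450 800 20 64000 900 0 700 1350\n"
print(parse_diskstats(SAMPLE)["sda"])
```

    Most of these are cumulative counters, so a periodic scraper would report the difference between successive samples.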

    (tags: databases filesystems linux macos fsync durability coding)

SlateDB

  • SlateDB

    an embedded storage engine built as a log-structured merge-tree. Unlike traditional LSM-tree storage engines, SlateDB writes all data to object storage [ie. S3, Azure Blob Storage, GCS]. Object storage is an amazing technology. It provides highly-durable, highly-scalable, highly-available storage at a great cost. And recent advancements have made it even more attractive: Google Cloud Storage supports multi-region and dual-region buckets for high availability. All object stores support compare-and-swap (CAS) operations. Amazon Web Service's S3 Express One Zone has single-digit millisecond latency. We believe that the future of object storage are multi-region, low latency buckets that support atomic CAS operations. Inspired by The Cloud Storage Triad: Latency, Cost, Durability, we set out to build a storage engine built for the cloud. SlateDB is that storage engine.
    This looks superb. Chris Riccomini is involved.

    (tags: data storage slatedb lsm wal oltp)

Prototype Fund

  • Prototype Fund

    This looks great!

    The first low-threshold funding program for independent developers and small teams creating innovative open-source software. We provide the tech-savvy civil society with access to the resources and processes needed for developing user-centered, innovative software projects. Since 2016, we have funded almost 400 projects. As a learning funding program, we have repeatedly made adjustments to become more efficient and effective. Now we are taking the next step and implement some significant changes. From now on, we are focusing on funding data security and software infrastructure. Apply with your ideas for innovative open source software in the public interest! You will receive up to €95,000 over six months or €158,000 over ten months of funding from the German Ministry of Education and Research. We will also provide you with coaching, consulting and networking opportunities.

    (tags: funding open-source oss via:janl)

GOV.UK chatbot halted by hallucinations

  • GOV.UK chatbot halted by hallucinations

    "AI firms must address hallucinations before GOV.UK chatbot can roll out, digital chief claims":

    Trials of a generative AI-powered chatbot for GOV.UK users have found ongoing issues with so-called hallucinations that must be addressed before the technology can be widely deployed, according to one of the government’s digital leaders. [....] Speaking at an event this morning, Paul Willmott said: “We have experimented with a generative advice [tool] on GOV.UK. You will just say ‘I’m trying to do this’, or ‘I’m annoyed about this’… The challenge we are having – which is exactly the same as in the commercial sector – is what to do with the 1% of hallucinations where the agent starts to get challenging, or abusive – or even seductive.” Even if only present in a tiny minority of instances, these issues mean that GOV.UK Chat is not yet ready for widespread deployment, according to Willmott. Addressing hallucinations will require the support of the likes of OpenAI and other creators of large language models. “Until we have managed to iron that out – which will require the support of the foundational model creators – we won’t be able to put this live,” he said.
    This is hardly surprising, but it's good to see it being acknowledged and the brakes being applied.

    (tags: ai llms hallucations confabulation gov.uk chatbots chatgpt uk)

How the New sqlite3_rsync Utility Works

  • How the New sqlite3_rsync Utility Works

    "I've enjoyed following the development of the new sqlite3_rsync utility in the SQLite project. The utility employs a bandwidth-efficient algorithm to synchronize new and modified pages from an origin SQLite database to a replica. You can learn more about the new utility here and try it out by following the instructions here. Curious about its workings, I reviewed the code" Interesting use of a truncated SHA-3 as the hash() implementation, for speed.

    (tags: sqlite hashing rsync synchronization replication databases storage algorithms)
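
    To illustrate the sqlite3_rsync idea of comparing per-page hashes so that only new or modified pages cross the wire, here is a much-simplified sketch in Python. It is not the utility's actual protocol: the page size, the 20-byte truncation and the function names are all assumptions picked for the example.

```python
import hashlib

PAGE_SIZE = 4096   # assumed page size for this sketch
HASH_BYTES = 20    # assumed truncation length, chosen for illustration


def page_hashes(db: bytes) -> list[bytes]:
    """Hash each fixed-size page with SHA-3 and keep only a prefix of the digest."""
    return [
        hashlib.sha3_256(db[i:i + PAGE_SIZE]).digest()[:HASH_BYTES]
        for i in range(0, len(db), PAGE_SIZE)
    ]


def pages_to_send(origin: bytes, replica: bytes) -> list[int]:
    """Return 0-based numbers of origin pages the replica is missing or has stale."""
    origin_hashes = page_hashes(origin)
    replica_hashes = page_hashes(replica)
    return [
        n for n, h in enumerate(origin_hashes)
        if n >= len(replica_hashes) or replica_hashes[n] != h
    ]


origin = b"A" * PAGE_SIZE + b"B" * PAGE_SIZE + b"C" * PAGE_SIZE
replica = b"A" * PAGE_SIZE + b"X" * PAGE_SIZE
print(pages_to_send(origin, replica))  # [1, 2]
```

    Truncating the digest shrinks the hash traffic at the cost of a slightly higher (still tiny) collision chance.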

Using BlueSky as a Mastodon Bot

  • Using BlueSky as a Mastodon Bot

    "A Cheap and Lazy way to create Mastodon Bots using… BlueSky?!" By using the brid.gy gateway service, it's pretty trivial to use BlueSky as an easy means to make a mastodon bot without having to find a bot-friendly Masto host now that botsin.space is no more. For now, I'm doing this at @jmason.ie@bsky.brid.gy , which is gatewaying the posts from my BlueSky bot at https://bsky.app/profile/jmason.ie -- although a more long term approach will be to host the links-to-Mastodon gateway "natively" instead of using brid.gy, IMO.

    (tags: mastodon rss gateways social-media bluesky brid.gy bots linkblog)

Zuckerberg: The AI Slop Will Continue Until Morale Improves

  • Zuckerberg: The AI Slop Will Continue Until Morale Improves

    Well this is just garbage, and one reason why I no longer use Facebook:

    Both Facebook and Instagram are already going this way, with the rise of AI spam, AI influencers, and armies of people copy-pasting and clipping content from other social media networks to build their accounts. This content and this system, Meta said, has led to an 8 percent increase in time spent on Facebook and a 6 percent increase in time spent on Instagram, all at the expense of a shared reality and human connections to other humans.  In the earnings call, Zuckerberg and Susan Li, Meta’s CFO, said that Meta has already slop-ified its ad system and said that more than 1 million businesses are now creating more than 15 million ads per month on Meta platforms using generative AI. 

    (tags: slop facebook ai meta social media grim instagram)

Misusing the BIG-Bench canary string

  • Misusing the BIG-Bench canary string

    Interesting; this blog post discusses using the BIG-Bench canary string, intended to keep data like accuracy test cases out of LLM training corpora, as a general-purpose "don't scrape me" flag on personal blogs. This seems like a more practical, and more likely to be observed, way to opt out of AI training -- seeing as the scrapers don't seem to reliably honour any of the others.
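
    For the curious, honouring a canary on the corpus-building side is about as simple as filtering gets. A minimal sketch in Python, assuming a plain substring match; the canary value below is a hypothetical placeholder rather than the real BIG-Bench string, and the function name is invented:

```python
# Hypothetical placeholder: the real BIG-Bench canary contains a specific GUID,
# which is deliberately not reproduced here.
CANARY = "EXAMPLE-CANARY-STRING-0000"


def filter_corpus(documents: list[str], canary: str = CANARY) -> list[str]:
    """Drop every document that contains the canary string."""
    return [doc for doc in documents if canary not in doc]


docs = [
    "An ordinary blog post about sourdough.",
    f"Benchmark test cases. {CANARY}",
    f"A personal blog that opted out. {CANARY}",
]
print(len(filter_corpus(docs)))  # 1
```

    Whether any given scraper actually runs a check like this is, of course, the open question.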

    (tags: blogging canaries opt-out scraping web ai llm openai chatgpt claude bing)

Canary Contamination in GPT-4

  • Canary Contamination in GPT-4

    The BIG-Bench canary string is an EICAR- or GTUBE-style canary string which should never appear in LLM training datasets, or by extension, in trained models or their output. Its intention is that any test documents containing that string can be excluded from training, so that benchmark tests will be accurate. Unfortunately, it looks like they weren't excluded -- Claude 3.5 Sonnet and GPT-4-base will reproduce the string; and:

    Of 19 tested [benchmarking] tasks, GPT-4-base perfectly recalled large (non-trivial) portions of code for: The Abstraction and Reasoning Corpus; Simple arithmetic; Diverse Metrics for Social Biases in Language Models; Convince Me
    Great work. In case you were wondering why the LLMs all seem to do so well on their benchmarks, now you know -- they were training on the test data.

    (tags: ai llm testing benchmarking big-bench gpt-4 claude)

Reverse engineering ML models from TikTok and Instagram

  • Reverse engineering ML models from TikTok and Instagram

    This is very clever; _A Picture is Worth 500 Labels: A Case Study of Demographic Disparities in Local Machine Learning Models for Instagram and TikTok_, from University of Wisconsin-Madison and the Technical University of Munich. TikTok and Insta both use local ML models running on users' phones; by reverse engineering these APIs it's possible to test them and experiment on their accuracy.

    Capitalizing on this new processing model of locally analyzing user images, we analyze two popular social media apps, TikTok and Instagram, to reveal (1) what insights vision models in both apps infer about users from their image and video data and (2) whether these models exhibit performance disparities with respect to demographics. As vision models provide signals for sensitive technologies like age verification and facial recognition, understanding potential biases in these models is crucial for ensuring that users receive equitable and accurate services. We develop a novel method for capturing and evaluating ML tasks in mobile apps, overcoming challenges like code obfuscation, native code execution, and scalability. Our method comprises ML task detection, ML pipeline reconstruction, and ML performance assessment, specifically focusing on demographic disparities. We apply our methodology to TikTok and Instagram, revealing significant insights. For TikTok, we find issues in age and gender prediction accuracy, particularly for minors and Black individuals. In Instagram, our analysis uncovers demographic disparities in the extraction of over 500 visual concepts from images, with evidence of spurious correlations between demographic features and certain concepts.

    (tags: tiktok instagram ml machine-learning accuracy testing reverse-engineering reversing mobile android)

Hedge Funds Bet Against Clean Energy

  • Hedge Funds Bet Against Clean Energy

    Hooray! Capitalism has decided to kill off the humans:

    Despite vast green stimulus packages in the US, Europe and China, more hedge funds are on average net short batteries, solar, electric vehicles and hydrogen than are long those sectors; and more funds are net long fossil fuels than are shorting oil, gas and coal, according to a Bloomberg News analysis of positions voluntarily disclosed by roughly 500 hedge funds to Hazeltree, a data compiler in the alternative investment industry.

    (tags: hedge-funds capitalism short-selling clean-energy green future climate-change)

Bert Hubert on Nuclear power in the EU

  • Bert Hubert on Nuclear power in the EU

    "Nuclear power: no, yes, maybe, but not like this":

    Currently many (European) countries are individually trying to order up new nuclear power, from many different places. But it appears we can’t treat nuclear reactors like (say) cars you can just procure. If we’d want to do this right, it is probably indeed better to not simply try to order stuff, but to engender a nuclear revival. To not simply point our fingers at Framatome and EDF and say “do better!”. What if we actually made this a European or transatlantic project, and add the vast expertise that is still hidden within our institutes, and indeed setup a project for building 50 nuclear reactors, or more? This would allow a broad base of research that would derisk the process, so we don’t necessarily find out after 15 years of construction that the design is too complicated. And perhaps also not try to pretend that we are leaving this to the free market, but recognize this as a public activity. Doing it like this would require governments, institutes and companies to think different, and I’m reasonably sure we can’t even get this done between a few like-minded countries. Most definitely the EU would not reach consensus on this, since Germany is fundamentally opposed to anything nuclear ever.

    (tags: bert-hubert nuclear nukes nuclear-power eu future sustainability)

The “ASCII Smuggling” Attack

  • The "ASCII Smuggling" Attack

    Invisible text that AI chatbots understand and humans can't?

    What if there was a way to sneak malicious instructions into Claude, Copilot, or other top-name AI chatbots and get confidential data out of them by using characters large language models can recognize and their human users can’t? As it turns out, there was—and in some cases still is.
    Attackers used prompt injection, hidden in (untrusted) emails sent to a Microsoft 365 Copilot user; when the email is summarized using Copilot, "inside the emails are instructions to sift through previously received emails in search of the sales figures or a one-time password and include them in a URL pointing to his web server." The sensitive data is then steganographically encoded using Unicode "tags block" invisible codepoints, and included in the seemingly-innocent URL. Yet another case where AI developers have failed to study security history -- using untrusted input for in-band signalling has been a security risk since the days of phreaking; and allowing the entire list of permitted output characters across the entire Unicode range, instead of locking down to a safe subset, allows this silent exfiltration attack. Extra sting in the tail for Amazon: the researchers didn't even bother testing on their LLM :)
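
    The encoding trick itself is tiny. Here's a rough sketch in Python of how ASCII can be mapped into the Unicode Tags block (U+E0000 to U+E007F), plus the kind of output filter that would defeat it. The function names and the example URL are invented for illustration, and this is not the researchers' actual exploit code:

```python
TAG_BASE = 0xE0000  # start of the Unicode "Tags" block (U+E0000 to U+E007F)
TAG_END = TAG_BASE + 0x7F


def smuggle(text: str) -> str:
    """Map printable ASCII onto the corresponding invisible tag codepoints."""
    if not all(0x20 <= ord(c) <= 0x7E for c in text):
        raise ValueError("only printable ASCII can be mapped")
    return "".join(chr(TAG_BASE + ord(c)) for c in text)


def reveal(text: str) -> str:
    """Recover any ASCII hidden as tag codepoints, ignoring everything else."""
    return "".join(
        chr(ord(c) - TAG_BASE) for c in text if TAG_BASE <= ord(c) <= TAG_END
    )


def strip_tags(text: str) -> str:
    """Defensive filter: remove every codepoint in the Tags block."""
    return "".join(c for c in text if not (TAG_BASE <= ord(c) <= TAG_END))


url = "https://example.com/" + smuggle("otp=123456")
print(reveal(url))      # otp=123456
print(strip_tags(url))  # https://example.com/
```

    In most UIs the smuggled URL renders identically to the clean one, which is exactly the problem; filtering model output down to a known-safe character set removes the channel.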

    (tags: ai security steganography exfiltration copilot microsoft openai llms claude infosec attacks exploits)

Does Open Source AI really exist?

  • Does Open Source AI really exist?

    This is absolutely spot on:

    “Open Source AI” is an attempt to “openwash” proprietary systems. In their paper “Rethinking open source generative AI: open-washing and the EU AI Act” Andreas Liesenfeld and Mark Dingemanse showed that many “Open Source” AI models offer hardly more than open model weights. Meaning: You can run the thing but you don’t actually know what it is. Sounds like something we’ve already had: It’s Freeware. The Open Source models we see today are proprietary freeware blobs. Which is potentially marginally better than OpenAI’s fully closed approach but really only marginally. [...] “Open Source” is becom[ing] a sticker like “Fair Trade”, something to make your product look good and trustworthy. To position it outside of the evil commercial space, giving it some grassroots feeling. “We’re in this together” and shit. But we’re not. We’re not in this with Mark fucking Zuckerberg even if he gives away some LLM weights for free cause it hurts his competition. We, as normal people living on this constantly warmer planet, are not with any of those people.
    As tante notes here, for the systems we are talking about today, Open Source AI isn't practically possible, because we’ll never be able to download all the actual training data -- and shame on the OSI for legitimising this attempt at "openwashing".

    (tags: llms open-source osi open-source-ai ai freeware meta training)

Obituary for Ward Christensen

  • Obituary for Ward Christensen

    "Ward Christensen, BBS inventor and architect of our online age, dies at age 78":

    On Friday, Ward Christensen, co-inventor of the computer bulletin board system (BBS), died at age 78 in Rolling Meadows, Illinois. Christensen, along with Randy Suess, created the first BBS in Chicago in 1978, leading to an important cultural era of digital community-building that presaged much of our online world today. Prior to creating the first BBS, Christensen invented XMODEM, a 1977 file transfer protocol that made much of the later BBS world possible by breaking binary files into packets and ensuring that each packet was safely delivered over sometimes unstable and noisy analog telephone lines. It inspired other file transfer protocols that allowed ad-hoc online file sharing to flourish. While Christensen himself was always humble about his role in creating the first BBS, his contributions to the field did not go unrecognized. In 1992, Christensen received two Dvorak Awards, including a lifetime achievement award for "outstanding contributions to PC telecommunications." The following year, the Electronic Frontier Foundation honored him with the Pioneer Award.

    (tags: bbses history computing ward-christensen xmodem networking filesharing)

Brian Merchant on “AI will solve climate change”

  • Brian Merchant on "AI will solve climate change"

    The neo-luddite author of "Blood in the Machine" nails the response to Eric Schmidt's pie-in-the-sky techno-optimism around AI "solving" climate change:

    Even without AGI, we already know what we have to do. [...] The tricky part—the only part that matters in this rather crucial decade for climate action—is implementation. As impressive as GPT technology or the most state of the art diffusion models may be, they will never, god willing, “solve” the problem of generating what is actually necessary to address climate change: Political will. Political will to break the corporate power that has a stranglehold on energy production, to reorganize our infrastructure and economies accordingly, to push out oil and gas. Even if an AGI came up with a flawless blueprint for building cheap nuclear fusion plants—pure science fiction—who among us thinks that oil and gas companies would readily relinquish their wealth and power and control over the current energy infrastructure? Even that would be a struggle, and AGI’s not going to doing anything like that anytime soon, if at all. Which is why the “AI will solve climate change” thinking is not merely foolish but dangerous—it’s another means of persuading otherwise smart people that immediate action isn’t necessary, that technological advancements are a trump card, that an all hands on deck effort to slash emissions and transition to proven renewable technologies isn’t necessary right now. It’s techno-utopianism of the worst kind; the kind that saps the will to act.

    (tags: ai climate eric-schmidt technology techno-optimism techno-utopianism agi neoluddism brian-merchant)

Capture less than you create

  • Capture less than you create

    I've disagreed with David Heinemeier Hansson on plenty of occasions in the past, but this is one where I'm really happy to find myself in agreement. Matt Mullenweg of WordPress went low, laying in digs about how DHH didn't profit from the success of Rails; DHH's response is perfect:

    The moment you go down the path of gratitude grievances, you'll see ungrateful ghosts everywhere. People who owe you something, if they succeed. A ratio that's never quite right between what you've helped create and what you've managed to capture. If you let it, it'll haunt you forever. So don't! Don't let the success of others diminish your satisfaction with your own efforts. Unless you're literally Mark Zuckerberg, Elon Musk, or Jeff Bezos, there'll always be someone richer than you! The rewards I withdraw from open source flow from all the happy programmers who've been able to write Ruby to build these amazingly successful web businesses with Rails. That enjoyment only grows the more successful these business are! The more economic activity stems from Rails, the more programmers will be able to find work where they might write Ruby. Maybe I'd feel different if I was a starving open source artist holed up somewhere begrudging the wheels of capitalism. But fate has been more than kind enough to me in that regard. I want for very little, because I've been blessed sufficiently. That's a special kind of wealth: Enough. And that's also the open source spirit: To let a billion lemons go unsqueezed. To capture vanishingly less than you create. To marvel at a vast commons of software, offered with no strings attached, to any who might wish to build. Thou shall not lust after thy open source's users and their success.
    Spot on.

    (tags: open-source success rewards coding software business life gratitude gift-economy dhh rails philosophy)

GSM-Symbolic

  • GSM-Symbolic

    "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models", from Apple Machine Learning Research:

    We investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer.
    Even better -- "the performance of all models declines when only the numerical values in the question are altered" seems to suggest that great performance on benchmarks like GSM8K just means that the LLMs have been trained on the answers...
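
    The "only the numerical values are altered" test is simple to picture as code. A toy sketch in Python of generating variants of one question from a template, with the ground-truth answer recomputed each time; the template, names and number ranges here are invented for illustration and are not taken from the paper:

```python
import random

# Invented template; each variant changes only the name and the numbers.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{name} gives away {c} apples. How many apples does {name} have left?"
)
NAMES = ["Aoife", "Liam", "Sophie", "Oscar"]


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one question variant together with its correct answer."""
    a = rng.randint(10, 50)
    b = rng.randint(10, 50)
    c = rng.randint(1, a + b)  # never give away more apples than were picked
    name = rng.choice(NAMES)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    return question, a + b - c


rng = random.Random(0)  # fixed seed so the variants are reproducible
for _ in range(3):
    question, answer = make_variant(rng)
    print(answer, question)
```

    The reasoning needed is identical for every variant, so a model that actually reasons should score the same on all of them.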

    (tags: training benchmarks ai llms gsm-symbolic reasoning ml apple papers gsm8k)

Shitposting, Shit-mining and Shit-farming

  • Shitposting, Shit-mining and Shit-farming

    This is where we are with surveillance capitalism and Facebook/X:

    Social media platforms are improved by a moderate tincture of shitposting. More than a few drops though, and the place begins to stink up, driving away advertisers and users. This then leads platform executives to explore the exciting opportunities of shit-mining. Social media generates a lot of content - it’s gotta be valuable somehow! Who needs content moderation if you can become a guano baron? But that only makes things worse, driving out more users and more advertisers, until eventually, you may find yourself left with a population dominated by two kinds of users (a) chumps, and (b) chump-vampirizing obligate predators. This can be a stable equilibrium - even quite a profitable one! But otherwise, it isn’t good news.
    See also a recent story in the Garbage Day newsletter (https://www.garbageday.email/p/what-feels-real-enough-to-share) about Facebook, and how its disaster-relief FB groups are becoming overrun with AI slop images:
    The Verge’s Nilay Patel recently summed up the core tension here, writing on Threads about YouTube’s own generative-AI efforts, “Every platform company is about to be at war with itself as the algorithmic recommendation AI team tries to fight off the content made by the generative AI team.” And it’s clear, at least with Meta, which side is winning the war. This week, Meta proudly announced a new video-generating tool that will make AI misinfo even more convincing — or, at least, better at generating things that feel true. And there’s really only one way to look at all of this. Meta simply does not give a shit anymore. Facebook spent most of the 2010s absorbing, and destroying, not just local journalism in the US, but the very infrastructure of how information is transmitted across the country. And they have clearly lost interest in maintaining that. Users, of course, have no where else to go, so they’re still relying on it to coordinate things like hurricane disaster relief. But the feeds are now — and seemingly forever will be — clogged with AI junk. Because you cannot be a useful civic resource and also give your users a near-unlimited ability to generate things that are not real. And I don’t think Meta are stupid enough to not know this. But like their own users, they have decided that it doesn’t matter what’s real, only what feels real enough to share.
    Given that Meta are _paying_ users to pollute their platform with low-grade AI slop engagement fuel, shit-farming seems the perfect term for that.

    (tags: garbage-day facebook meta ai ai-slop spam shitposting shitfarming shitmining dont-be-evil)

Fixing aggressive Xiaomi battery management

  • Fixing aggressive Xiaomi battery management

    I've been using a Xiaomi phone recently, running Xiaomi HyperOS 1.011.0, and one feature that bugs me constantly is that apps lose state as soon as you flip away to another app, even if only for a second; once you flip back, the app restarts. This appears to be an aspect of Xiaomi's built-in power management. I've been searching for a way to disable it, and allow multiple apps in memory simultaneously, and I've finally tracked it down. As described here, https://piunikaweb.com/2021/04/19/miui-optimization-missing-in-developer-options-try-this-workaround/ , you need to enable Developer Mode on the phone, enter "Additional Settings" / "Developer options", then scroll all the way down, nearly to the bottom, to "Reset to default values". Hit this _repeatedly_ (once is not enough!) until another option appears just below, called either "Turn on MIUI optimisation" or, in my case, "Turn on system optimisation"; this is enabled by default. Turn it off. In my case, this has fixed the flipping-between-apps problem, the phone in general is significantly snappier to respond, and WhatsApp and Telegram new-message notifications don't get auto-dismissed (which was another annoying feature previously). I suspect a load of battery optimisations and CPU throttling has been disabled. It remains to be seen what this does to my battery life, but hopefully it'll be worth it, and it'll be nice not to lose state in Chrome forms when I have to flip over to my banking app, etc. I won't be getting another Xiaomi phone after this; there are numerous rough edges and outright bugs in the MIUI/HyperOS platform, at least in the international ROM images, and there's no support or documentation to work around this stuff. It's a crappy user experience.

    (tags: phones mobile xiaomi miui workarounds battery options settings)

What If Data Is a Bad Idea?

  • What If Data Is a Bad Idea?

    A thought-provoking article:

    Philip Agre enumerated five characteristics of data that will help us achieve this repositioning. Agre argued that “living data” must be able to express 1. a sense of ownership, 2. error bars, 3. sensitivity, 4. dependency, and 5. semantics. Although he originally wrote this in the early 1990s, it took some time for technology and policy to catch up. I’m going to break down each point using more contemporary context and terminology: Provenance and Agency: what is the origin of the data and what can I do with it (ownership)? Accuracy: has the data been validated? If not, what is the confidence of its correctness (error bars)? Data Flow: how is data discovered, updated, and shared (sensitivity to changes)? Auditability: what data and processes were used to generate this data (dependencies)? Semantics: what does this data represent?

    (tags: culture data identity data-protection data-privacy living-data open-data)

Ethical Applications of AI to Public Sector Problems

  • Ethical Applications of AI to Public Sector Problems

    Jacob Kaplan-Moss:

    There have been massive developments in AI in the last decade, and they’re changing what’s possible with software. There’s also been a huge amount of misunderstanding, hype, and outright bullshit. I believe that the advances in AI are real, will continue, and have promising applications in the public sector. But I also believe that there are clear “right” and “wrong” ways to apply AI to public sector problems.
    He breaks down AI usage into "Assistive AI", where AI is used to process and consume information (in ways or amounts that humans cannot) to present to a human operator, versus "Automated AI", where the AI both processes and acts upon information, without input or oversight from a human operator. The latter is unethical to apply in the public sector.

    (tags: ai ethics llm genai public-sector government automation)

ClassicPress

  • ClassicPress

    "A lightweight, stable, instantly familiar free open-source content management system. Based on WordPress without the block editor (Gutenberg)." Nobody seems to like the block editor, lol

    (tags: cms wordpress blogs blogging forks)

Patent troll Sable pays up, dedicates all its patents to the public

  • Patent troll Sable pays up, dedicates all its patents to the public

    This is a massive victory for Cloudflare -- way to go!

    Sable initially asserted around 100 claims from four different patents against Cloudflare, accusing multiple Cloudflare products and features of infringement. Sable’s patents — the old Caspian Networks patents — related to hardware-based router technologies common over 20 years ago. Sable’s infringement arguments stretched these patent claims to their limits (and beyond) as Sable tried to apply Caspian’s hardware-based technologies to Cloudflare’s modern software-defined services delivered on the cloud. [...] Cloudflare fought back against Sable by launching a new round of Project Jengo, Cloudflare’s prior art contest, seeking prior art to invalidate all of Sable’s patents. In the end, Sable agreed to pay Cloudflare $225,000, grant Cloudflare a royalty-free license to its entire patent portfolio, and to dedicate its patents to the public, ensuring that Sable can never again assert them against another company.
    (via AJ Stuyvenberg)

    (tags: sable cloudflare patent-trolls patents uspto trolls routing)

ArchiveWeb.page

  • ArchiveWeb.page

    "Interactive browser-based web archiving from Webrecorder. The ArchiveWeb.page browser extension and standalone application allows you to capture web archives interactively as you browse. After archiving your webpages, your archives can be viewed using ReplayWeb.page — no extension required! For those who need to crawl whole websites with automated tools, check out Browsertrix." This is a nice way to archive a personal dynamic site online in a read-only fashion -- there is a self-hosting form of the replayer at https://replayweb.page/docs/embedding/#self-hosting . As @david302 on the Irish Tech Slack notes: "you can turn on recording, browse the (public) site you want to archive, get the .wacz file and stick that+js on s3/cloudfront."

    (tags: archiving archival archives tools web recording replay via:david302)

Turning Everyday Gadgets into Bombs is a Bad Idea

  • Turning Everyday Gadgets into Bombs is a Bad Idea

    Bunnie Huang investigates the Mossad pager bomb's feasibility, and finds it deeply worrying:

    I am left with the terrifying realization that not only is it feasible, it’s relatively easy for any modestly-funded entity to implement. Not just our allies can do this – a wide cast of adversaries have this capability in their reach, from nation-states to cartels and gangs, to shady copycat battery factories just looking for a big payday (if chemical suppliers can moonlight in illicit drugs, what stops battery factories from dealing in bespoke munitions?). Bottom line is: we should approach the public policy debate around this assuming that someday, we could be victims of exploding batteries, too. Turning everyday objects into fragmentation grenades should be a crime, as it blurs the line between civilian and military technologies.

    (tags: batteries israel security terrorism mossad pagers hardware devices bombs)

Modal interfaces considered harmful

  • Modal interfaces considered harmful

    A great line from the 99 Percent Invisible episode titled "Children of the Magenta (Automation Paradox, pt. 1)", regarding the Air France flight 447 disaster:

    When one of the co-pilots hauled back on his stick, he pitched the plane into an angle that eventually caused the stall. [...] it’s possible that he didn’t understand that he was now flying in a different mode, one which would not regulate and smooth out his movements. This confusion about how the fly-by-wire system responds in different modes is referred to, aptly, as “mode confusion,” and it has come up in other accidents.

    (tags: automation aviation flying modal-interfaces ui ux interfaces modes mode-confusion air-france-447 disasters)

wordfreq/SUNSET.md

  • wordfreq/SUNSET.md

    wordfreq is "a Python library for looking up the frequencies of words in many languages, based on many sources of data." Sadly, it's no longer going to be updated, as the author writes:

    I don't want to be part of this scene anymore: wordfreq used to be at the intersection of my interests. I was doing corpus linguistics in a way that could also benefit natural language processing tools. The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money. It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise. wordfreq was built by collecting a whole lot of text in a lot of languages. That used to be a pretty reasonable thing to do, and not the kind of thing someone would be likely to object to. Now, the text-slurping tools are mostly used for training generative AI, and people are quite rightly on the defensive. If someone is collecting all the text from your books, articles, Web site, or public posts, it's very likely because they are creating a plagiarism machine that will claim your words as its own. So I don't want to work on anything that could be confused with generative AI, or that could benefit generative AI. OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.

    (tags: ai language llm nlp openai scraping words genai google)

Nevada’s genAI-driven unemployment benefits system

  • Nevada's genAI-driven unemployment benefits system

    As has been shown many times before, current generative AI systems encode the bias and racism present in their training data. This is not going to go well:

    "There’s no AI [written decisions] that are going out without having human interaction and that human review," DETR's director told the website. "We can get decisions out quicker so that it actually helps the claimant." [...] "The time savings they’re looking for only happens if the review is very cursory," explained Morgan Shah, the director of community engagement for Nevada Legal Services. "If someone is reviewing something thoroughly and properly, they’re really not saving that much time." Ultimately, Shah said, workers using the system to breeze through claims may end up "being encouraged to take a shortcut." [...] As with most attempts at using this still-nascent technology in the public sector, we probably won't know how well the Nevada unemployment AI works unless it's shown to be doing a bad job — which feels like an experiment being conducted on some of the most vulnerable members of society without their consent.
    Of course, the definition of a "bad job" depends on who's defining it. If the system is processing a high volume of applications, it may not matter to its operators whether it's processing them _correctly_ or not.

    (tags: generative-ai ai racism bias nevada detr benefits automation)