
Justin's Linklog Posts

Pleias language models

  • Pleias language models

    OK, this is quite cool: “the first ever [language] models trained exclusively on open data, meaning data that are either non-copyrighted or are published under a permissible license. These are the first fully EU AI Act compliant models. In fact, Pleias sets a new standard for safety and openness.”

    Training large language models required copyrighted data until it did not. Today we release Pleias 1.0 models, a family of fully open small language models. Pleias 1.0 models include three base models: 350M, 1.2B, and 3B parameters. They feature two specialized models for knowledge retrieval with unprecedented performance for their size on multilingual Retrieval-Augmented Generation, Pleias-Pico (350M parameters) and Pleias-Nano (1.2B parameters). […]

    Our models are:

      * multilingual, offering strong support for multiple European languages;
      * safe, showing the lowest results on the toxicity benchmark;
      * performant for key tasks, such as knowledge retrieval;
      * able to run efficiently on consumer-grade hardware locally (CPU-only, without quantisation)

    Pleias 1.0 family embodies a new approach to specialized small language models, for end applications: wound-up models. We have implemented a set of ideas and solutions during pretraining that produce a frugal yet powerful language model specifically optimized for further RAG implementations. We release two wound-up models further trained for Retrieval Augmented Generation (RAG): Pleias-pico-350m-RAG and Pleias-nano-1B-RAG. These models are designed to be implemented locally, so we prioritized frugal implementation. As our models are small, they can run smoothly, even on devices with limited RAM.

    And here’s their fully open training set: https://huggingface.co/datasets/PleIAs/common_corpus

    (tags: llms models huggingface ai pleias rag ai-act open-data)

UK benefits AI system found to show bias

  • UK benefits AI system found to show bias

    File this under “the least surprising news ever”:

    An artificial intelligence system used by the UK government to detect welfare fraud is showing bias according to people’s age, disability, marital status and nationality, the Guardian can reveal. An internal assessment of a machine-learning programme used to vet thousands of claims for universal credit payments across England found it incorrectly selected people from some groups more than others when recommending whom to investigate for possible fraud.

    The most interesting aspect of the report published is that currently “there is no established numerical or statistical benchmark at which referral or outcome disparity can be defined as within tolerance”.

    I would have assumed that a lack of bias, measured against a “false positive” rate (ie. benefits recipients who were selected for additional checks and were then found to be legitimate and not committing fraud), should have been a design goal, and a critical KPI, for such a system.

    There are going to be a lot of similar examples in the years to come — here’s hoping this “bias measurement” KPI becomes established as a concept.

    (tags: bias ai kpis dwp uk benefits welfare fraud ml)
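
As a rough illustration of what such a “bias measurement” KPI could look like, here is a minimal Python sketch that computes the false positive rate per group and the ratio between the worst and best group. The records, group names and the choice of a simple max/min ratio are all made up for illustration; nothing here reflects how the DWP system or its internal assessment actually works.

```python
from collections import defaultdict

def false_positive_rates(cases):
    """Return {group: FPR}, the share of legitimate claimants in each group who were flagged."""
    flagged_legit = defaultdict(int)
    legit = defaultdict(int)
    for group, flagged, fraudulent in cases:
        if not fraudulent:
            legit[group] += 1
            if flagged:
                flagged_legit[group] += 1
    return {group: flagged_legit[group] / legit[group] for group in legit}

def disparity_ratio(rates):
    """Ratio of the highest to the lowest group FPR; 1.0 means no measured disparity."""
    lowest = min(rates.values())
    highest = max(rates.values())
    return float("inf") if lowest == 0 else highest / lowest

# Hypothetical records: (group, flagged_for_review, actually_fraudulent)
sample = [
    ("under_25", True, False), ("under_25", False, False),
    ("under_25", True, True), ("under_25", True, False), ("under_25", False, False),
    ("over_25", True, False), ("over_25", False, False),
    ("over_25", False, False), ("over_25", False, False), ("over_25", True, True),
]

rates = false_positive_rates(sample)
print(rates, disparity_ratio(rates))
```

A tolerance threshold on that ratio is exactly the kind of “numerical or statistical benchmark” the report says does not yet exist.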

Ridding My Home Network of IP Addresses

(Republishing this one on the blog, instead of just as a gist)

Recent changes in the tech scene have made it clear that relying on commercial companies to provide services I rely on isn’t a good strategy in the long term, and given that Tailscale is so effective these days as a remote-access system, I’ve gradually been expanding a small collection of self-hosted web apps and services running on my home network.

Until now they’ve mainly been addressed using their IP addresses and random high ports on the internal LAN, for example:

  1. Pihole: http://10.19.72.7/admin
  2. Home Assistant: http://10.19.72.11:8123/
  3. Linkding: http://10.19.72.6:9092/
  4. Grafana: http://10.19.72.6:3000/
  5. (plus a good few others)

Needless to say this is a bit messy and inelegant, so I’ve been planning to sort it out for a while. My requirements:

  1. no more ugly bare IP addresses!
  2. a DNS domain;
  3. with HTTPS URLs;
  4. one per service;
  5. no visible port numbers;
  6. fully valid TLS certs, no having to click through warnings or install funny CA certs;
  7. accessible regardless of which DNS server is in use — ie. using public DNS records. This may seem slightly unusual, but it’s useful so that the internal services can still be accessed when I’m using my work VPN (which forces its own DNS servers);
  8. accessible internally;
  9. accessible externally, over Tailscale;
  10. not accessible externally without Tailscale.

After a few false starts, I’m pretty happy with the current setup, which uses Caddy.

Hosting The Domain At Cloudflare

First off, since the service URLs are not to be accessible externally without Tailscale active, the HTTP challenge approach to provision Let’s Encrypt certs cannot be used. That would require an open-to-the-internet publicly-accessible HTTP server on my home network, which I absolutely want to avoid.

In order to use the ACME DNS challenge instead, I set up my public domain "taint.org" to use Cloudflare as the authoritative DNS server (in Cloudflare terms, "full setup"). This lets Caddy edit the DNS records via the Cloudflare API to handle the ACME challenge process.

One of the internal hosts is needed to run the Caddy server’s reverse proxies; I picked "hass", 10.19.72.11, the Home Assistant host, which didn’t have anything already running on port 80 or port 443. (All of my internal hosts are running on a private /24 IP range, at 10.19.72.0/24.)

The dedicated DNS domain I’m using for my home services is "home.taint.org". In order to use this, I clicked through to the Cloudflare admin panel and created a DNS record as follows:

Type   Name      Content             Proxy Status               TTL
A      *.home    10.19.72.11         DNS only - reserved IP     Auto

Now, any hostnames under "home.taint.org" will return the IP 10.19.72.11 (where Caddy will run).

I don’t particularly care about exposing my internal home network IPs to the world, as a trade-off to allow the URLs to work even if an internal host is using the work VPN, or resolving with 8.8.8.8, or whatever. That’s worth missing out on a little bit of paranoia, since the IPs won’t be accessible from outside without Tailscale anyway.

It is worth noting that the Cloudflare-hosted domain doesn’t have to be the same one used for URLs in the home network; using dns_challenge_override_domain you can delegate the ACME challenge from any "home" domain to one which is hosted in Cloudflare.

The Caddy Setup

One wrinkle is that I had to generate a custom Caddy build in order to get the "dns.providers.cloudflare" non-standard module, from https://caddyserver.com/download . This is a click-and-download page which generates a custom Caddy binary on the fly. It would have been nicer if the Cloudflare module was standard, but hey.

Once that’s installed, I can get this output:

$ /usr/local/bin/caddy list-modules
[long list of standard modules omitted]

dns.providers.cloudflare
dns.providers.route53

  Non-standard modules: 2

  Unknown modules: 0

(Yes, I have Caddy running as a normal service, not as a Docker container. No particular reason; I think Docker should work fine.)

Go to the Cloudflare account dashboard, and create a user API token as described at https://developers.cloudflare.com/fundamentals/api/get-started/create-token/ . In my case, it has Zone / DNS / Edit permission, on the specific zone taint.org.

Copy that token as it’s needed in the "Caddyfile", which now looks like the following:

hass.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.11:8123
}

links.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.6:9092
}

pi.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        redir / /admin/
        reverse_proxy /admin/* 10.19.72.7:80
}

grafana.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.6:3000
}

[many other services omitted]
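
Since the site blocks above are almost identical, they could also be generated. The following is a minimal Python sketch under a few assumptions: the `SERVICES` table and `render_caddyfile` helper are hypothetical names, the `pi` host is left out because its redirect needs a hand-written block, and the token is read via Caddy's `{env.…}` placeholder from an environment variable I have called `CLOUDFLARE_API_TOKEN`, rather than being pasted inline.

```python
# Hypothetical service table: hostname label -> upstream address on the LAN.
SERVICES = {
    "hass": "10.19.72.11:8123",
    "links": "10.19.72.6:9092",
    "grafana": "10.19.72.6:3000",
}

# Doubled braces are literal braces in str.format(); the token line renders
# as "dns cloudflare {env.CLOUDFLARE_API_TOKEN}".
SITE_TEMPLATE = """{name}.{domain} {{
        tls {{
                dns cloudflare {{env.{token_var}}}
        }}
        reverse_proxy /* {upstream}
}}
"""

def render_caddyfile(services, domain="home.taint.org",
                     token_var="CLOUDFLARE_API_TOKEN"):
    """Render one site block per service, sorted by name for stable output."""
    blocks = [
        SITE_TEMPLATE.format(name=name, domain=domain,
                             token_var=token_var, upstream=upstream)
        for name, upstream in sorted(services.items())
    ]
    return "\n".join(blocks)

caddyfile_text = render_caddyfile(SERVICES)
print(caddyfile_text)
```

With this approach the environment variable has to be set for the Caddy process itself (for example in the systemd unit), which also keeps the token out of the config file.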

Running sudo caddy run in the same dir will start up and verbosely log what it’s doing. (Once you’re happy enough, you can get Caddy running in the normal systemd service way.)

After setting those up, I now have my services accessible locally as:

  1. Home Assistant: https://hass.home.taint.org/
  2. Pihole: https://pi.home.taint.org/
  3. Grafana: https://grafana.home.taint.org/
  4. Linkding: https://links.home.taint.org/

Caddy seamlessly goes off and configures fully valid TLS certs with no fuss. I found it much tidier than Certbot, or Nginx Proxy Manager.

The Tailscale Setup

So this has now sorted out all of the requirements bar one:

  1. accessible externally, over Tailscale.

To do this I had to log into Tailscale’s admin console and go to https://login.tailscale.com/admin/machines , pick a host on the 10.19.72.0/24 internal LAN, click its dropdown menu and "Edit Route Settings…", and enable a Subnet Route for 10.19.72.0/24. By doing this, all of the service.home.taint.org DNS records are now accessible, remotely, once Tailscale is enabled; I don’t even need to use ts.net names to access them! Perfect.
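
A quick way to sanity-check which addresses a subnet route like this covers is Python's standard `ipaddress` module. This is only a sketch: the `covered_by_route` helper is a hypothetical name, and the third address is a made-up example of a host outside the range.

```python
import ipaddress

# The subnet advertised via Tailscale in the setup described above.
SUBNET_ROUTE = ipaddress.ip_network("10.19.72.0/24")

def covered_by_route(address, route=SUBNET_ROUTE):
    """True if the address falls inside the advertised subnet route."""
    return ipaddress.ip_address(address) in route

for host in ("10.19.72.11", "10.19.72.6", "10.19.73.5"):
    addr = ipaddress.ip_address(host)
    # is_private is True for RFC 1918 ranges, which are not routable from
    # the public internet; that is why publishing them in DNS is low-risk.
    print(host, "covered:", covered_by_route(host), "private:", addr.is_private)
```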

Anyway, that’s the setup — hopefully this writeup will help others. And kudos to Caddy, Let’s Encrypt and Tailscale for making this relatively easy.

GenCast

  • GenCast

    Google DeepMind announce their new AI model for weather forecasting, in collaboration with the ECMWF:

    Today, in a paper published in Nature, we present GenCast, our new high resolution (0.25°) AI ensemble model. GenCast provides better forecasts of both day-to-day weather and extreme events than the top operational system, the European Centre for Medium-Range Weather Forecasts’ (ECMWF) ENS, up to 15 days in advance. We’ll be releasing our model’s code, weights, and forecasts, to support the wider weather forecasting community. […] GenCast is a diffusion model, the type of generative AI model that underpins the recent, rapid advances in image, video and music generation. However, GenCast differs from these, in that it’s adapted to the spherical geometry of the Earth, and learns to accurately generate the complex probability distribution of future weather scenarios when given the most recent state of the weather as input. To train GenCast, we provided it with four decades of historical weather data from ECMWF’s ERA5 archive. This data includes variables such as temperature, wind speed, and pressure at various altitudes. The model learned global weather patterns, at 0.25° resolution, directly from this processed weather data.

    It’s open source: https://github.com/google-deepmind/graphcast . The openly released model weights are at https://console.cloud.google.com/storage/browser/dm_graphcast . GraphCast (the previous iteration) has public forecasts published at https://charts.ecmwf.int/?query=GraphCast , under a CC-BY-NC-SA-4 licence; it would be great if the GenCast forecasts joined this data set. Paper: https://arxiv.org/abs/2312.15796 . This all looks really great: a fantastic commitment to (genuine) openness and open data, and the paper seems rigorous (to this amateur). Great stuff.

    (tags: forecasting weather ai gencast graphcast deepmind google ecmwf genai)

TikTok in hot water over Romanian elections

  • TikTok in hot water over Romanian elections

    ‘We are getting fed up’: EU lawmakers snap at TikTok over Romanian election:

    For years, the Chinese-owned social media app has brushed off security concerns in the United States and Europe that it could be used for mass manipulation and espionage. It now faces an intense regulatory storm in Bucharest over whether it played a role in skewing the democratic process in an EU country and NATO member of 19 million people. [….] “Honestly speaking, we are getting fed up by the documents and the empty promises,” Swedish center-right European lawmaker Arba Kokalari said near the end of the hearing.

    (tags: tiktok elections romania eu bias news propaganda democracy social-media)

noyb is now qualified to bring collective redress actions

  • noyb is now qualified to bring collective redress actions

    “noyb is now approved as a so-called “Qualified Entity” to bring collective redress actions in courts throughout the European Union. Such action under Directive (EU) 2020/1828 can either be an “injunction” or a “redress” measure. “Injunctions” generally prohibit a company from engaging in illegal practices, including any GDPR violations. “Redress” measures allow a European version of a “Class Action”, where thousands or millions of users could be represented by noyb and for example ask for non-material damages when their personal data was unlawfully processed.” This is very interesting — and timely, given the mass scraping of user data to feed AI training sets…

    (tags: noyb data-privacy data-protection class-actions law eu collective-redress)

Privacy Disasters: FaceHuggers Are Eating Your Skeets

The Buddhabrot

  • The Buddhabrot

    This was news to me! There’s another fractal pattern derived from the Mandelbrot set which I’d never seen before:

    As it turns out, it’s not just the boundary of the Mandelbrot set that’s mind-bogglingly complex: the same goes for the (xn, yn) escape trajectories associated with the (u, v) pixels near the set’s edge. The iterated coordinates follow elaborate, long-winded paths through space; their ethereal trails form a density plot reminiscent of the Mandelbrot fractal itself.

    (tags: fractals mandelbrot buddhabrot graphics maths via:lcamtuf)

Rewilding fields massively improved bumblebee numbers in Scotland

  • Rewilding fields massively improved bumblebee numbers in Scotland

    “Bumblebee population increases 116 times over in ‘remarkable’ Scotland project”:

    Rewilding Denmarkfield, a 90-acre project based just north of Perth, has been working to restore nature to green spaces in an increasingly built up area for the past two years. Statistics from the charity show in 2021, when some of the fields managed by the project were still barley monoculture, only 35 bumblebees were counted. But by 2023, after just two years of nature restoration work in the same fields, the population increased to 4,056. The diversity of bumblebee also doubled, according to the charity, from five to ten different species.

    (tags: bees bumblebees scotland fields farming rewilding fallow nature)

WeSQL

  • WeSQL

    “an innovative MySQL distribution that adopts a compute-storage separation architecture, with storage backed by S3 (and S3-compatible systems). WeSQL has completely replaced MySQL’s traditional disk storage with S3. All MySQL data—binlogs, schemas, storage engine metadata, WAL, and data files—are entirely (not partially!) stored as objects in S3. The 11 nines of durability provided by S3 significantly enhances data reliability. Additionally, WeSQL can start from a clean, empty instance, connect to S3, load the data, and begin serving immediately with no additional setup required. It is ideal for users who need an easy-to-manage, cost-effective, and developer-friendly MySQL database solution, especially for those needing support for both Serverless and BYOC (Bring Your Own Cloud).” (via Ian on ITC)

    (tags: mysql s3 object-storage storage databases sql)

Reversing.Works Investigation Exposes Glovo’s Data Privacy Violations

  • Reversing.Works Investigation Exposes Glovo’s Data Privacy Violations

    Ha, this is great:

    Reversing.Works, an innovative project dedicated to exposing abuses within gig economy platforms, uncovered significant labour law violations within Glovo’s algorithmic management system and provided critical evidence for an investigation by the Italian Data Protection Authority. After a year-long investigation, the DPA fined Glovo 5 Million €, and demanded corrective action from the platform. Glovo’s algorithmic management system was found to have misused workers’ personal data in ways that violated labour law, including monitoring workers’ movements outside of their work shifts, keeping hidden scores on workers, and sending detailed monitoring of their work to third parties outside the scope of their contracts. This was a mixed violation of both Italian labour law and the General Data Protection Regulation (GDPR). Reversing.Works’ investigation, using sophisticated reversing engineering techniques, sheds light on the hidden mechanics that drive the platform’s model of operation, and perhaps additional business dynamics. […] “It’s surprising that unions never used a tool like this,” says Gaetano Priori, the lead investigator at Reversing.Works. “Privacy is an individual right, so it hasn’t been seen as a tool for labour struggles. But it has potential in digitally-intermediated labour because one violation could affect all the workers in all the regions in which a company operates.” Reversing.Works has shown how GDPR and tech-enabled investigation can help expose bad practices and create fairer working conditions. This case is a call to action for all gig workers, showing that existing legal tools can be used for the collective good. Priori adds, “This should be a wake-up call for all workers managed by technology. With GDPR and tech, we have the means to challenge unfair practices.”

    (tags: reverse-engineering gdpr data-protection data-privacy gig-economy glovo italy unions)

Generative AI Pushes Outcome Over Process (And This Is Why I Hate It)

  • Generative AI Pushes Outcome Over Process (And This Is Why I Hate It)

    This is a really interesting point about education and learning, in general:

    AI technology is based on the idea that the important part of creating things is the outcome, not the process. Can’t draw? That shouldn’t stop you from making a picture. Worried about your writing? Why should that stop you from handing in a coherent essay? The ads for AI all promise that you’ll be able to produce things without all the tedious work of actually producing it – isn’t that great?  Well no, it’s not – it’s terrible. It betrays a fundamental misunderstanding of why creating things has value. It’s terrible in general, but I am especially offended by this idea in the context of education, and in this post I want to lay this idea out in a little detail. 

    (tags: education learning ai process-vs-outcome working how-we-work)

S3 now supports appending

  • S3 now supports appending

    Ooh, interesting — this can unlock a few new system designs:

    You can append data to the end of existing objects stored in the S3 Express One Zone storage class in directory buckets. We recommend that you use the ability to append data to an object if the data is written continuously over a period of time or if you need to read the object while you are writing to the object. Appending data to objects is common for use-cases such as adding new log entries to log files or adding new video segments to video files as they are trans-coded then streamed. By appending data to objects, you can simplify applications that previously combined data in local storage before copying the final object to Amazon S3.

    (tags: aws s3 storage cloud features)
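
For a feel of how an append might look from code, here is a minimal Python sketch. It assumes a boto3-style client whose `put_object` accepts a `WriteOffsetBytes` argument for objects in S3 Express One Zone directory buckets; the `append_to_object` helper and the bucket and key names in the usage comment are hypothetical. Check the current SDK documentation before relying on it.

```python
def append_to_object(s3_client, bucket, key, data):
    """Append `data` to the end of an existing object and return the new size.

    `s3_client` is expected to behave like boto3.client("s3"); it is passed
    in so the helper has no hard dependency on a particular SDK version.
    """
    # The write offset for an append is the current size of the object.
    current_size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        WriteOffsetBytes=current_size,  # assumed parameter name, see above
    )
    return current_size + len(data)

# Usage sketch (hypothetical directory bucket and key, not run here):
#   import boto3
#   s3 = boto3.client("s3")
#   append_to_object(s3, "my-logs--euw1-az1--x-s3", "app.log", b"new line\n")
```

Looking up the size and then writing is two separate calls, so this sketch only suits a single writer per object.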

Binary Quantization

  • Binary Quantization

    A readable explanation of the (relatively new) technique of Binary Quantization applied to LLM embeddings. It’s pretty amazing that this compression technique can work without destroying search recall and accuracy, but it seems it does!

    Using BQ will reduce your memory consumption and improve retrieval speeds by up to 40x […] Binary quantization (BQ) converts any vector embedding of floating point numbers into a vector of binary or boolean values. […] All [vector floating point] numbers greater than zero are marked as 1. If it’s zero or less, they become 0. The benefit of reducing the vector embeddings to binary values is that boolean operations are very fast and need significantly less CPU instructions. […] One of the reasons vector search still works with such a high compression rate is that these large vectors are over-parameterized for retrieval. This is because they are designed for ranking, clustering, and similar use cases, which typically need more information encoded in the vector.
    https://www.elastic.co/search-labs/blog/rabitq-explainer-101 is a good maths-heavy explanation of the Elastic implementation using RaBitQ. See also some results from HuggingFace, https://huggingface.co/blog/embedding-quantization .

    (tags: embedding llm ai algorithms data-structures compression quantization binary-quantization quantisation rabitq search recall vectors vector-search)
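
The core idea in the quote can be sketched in a few lines of plain Python: keep only the sign of each dimension, then compare vectors by counting differing bits (Hamming distance). The tiny 8-dimensional vectors and the document names below are made up for illustration; real systems use packed bit arrays and usually rescore the top candidates against the original floats.

```python
def binarize(vector):
    """Pack a float vector into an int: bit i is 1 when vector[i] > 0, else 0."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits

def hamming_distance(a, b):
    """Number of bit positions in which two binarized vectors differ."""
    return bin(a ^ b).count("1")

def nearest(query_vector, index):
    """Name of the entry in a pre-binarized index closest to the query."""
    q = binarize(query_vector)
    return min(index, key=lambda name: hamming_distance(q, index[name]))

# Hypothetical embeddings; a real model would produce hundreds of dimensions.
corpus = {
    "doc_a": [0.12, -0.40, 0.33, 0.05, -0.21, 0.70, -0.02, 0.18],
    "doc_b": [-0.50, 0.22, -0.10, -0.34, 0.41, -0.60, 0.09, -0.15],
}
index = {name: binarize(vec) for name, vec in corpus.items()}

query = [0.10, -0.30, 0.25, -0.01, -0.18, 0.55, 0.04, 0.20]
print(nearest(query, index))
```

Each 32-bit float shrinks to a single bit, and the comparison becomes an XOR plus a bit count, which is where the memory and speed gains come from.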

[pdf] Sky UK on their IPv6/IPv4 gateways

  • [pdf] Sky UK on their IPv6/IPv4 gateways

    A presentation from RIPE89 detailing Sky’s MAP-T setup, “IPv6-only with IPv4aaS (MAP-T)”. Basically they now use MAP-T translation devices to provide “IPv4 as a service”, transparent NAT mapping between IPv6 and IPv4. I suspect this is similar to how Virgin Media operates their network, too, in Ireland. Interestingly, there are now network features (like local CDN POPs) which are more performant when using IPv6 natively, as they avoid a “trombone” route via a network-border translation device to get an IPv4 address. As a result, it’s actually starting to be worthwhile running an IPv6 home network….

    (tags: ipv4 ipv6 networking home sky isps ripe map-t nat ip)

headrotor/masto-pinb

  • headrotor/masto-pinb

    from Marsh Gardiner (https://hachyderm.io/@earth2marsh ), a “Mastodon To Pinboard bookmark integration script” — “a Python script to mimic the functionality of Pinboard’s Twitter integration. It reads the latest toots from a Mastodon account and bookmarks them in a Pinboard.in account. It is meant to be run repeatedly as a crontab job to continuously update your bookmarks in the background”.

    (tags: mastodon pinboard bookmarks bookmarking scripts)

skyfirehose.com

  • skyfirehose.com

    “Query the Bluesky Jetstream with DuckDB” — this is a lovely little hack from Tobias Müller (https://bsky.app/profile/tobilg.com). Basically, it’s a pre-built DuckDB database file which contains tables which refer to Parquet files in an R2 bucket, which are (presumably) updated regularly with new Bluesky posts from their Jetstream. Tobias says: “there‘s a data gathering process that listens to the Jetstream and dumps the NDJSONs to the filesystem as hourly files. Then, DuckDB transform the data to Parquet files, they get uploaded with rclone.” It’s a lovely demo of how modern data lake tech can be exposed for public usage in a nice way.

    (tags: s3 parquet duckdb sql jetstream bluesky firehose data-lakes r2)

The Current State of This Blog’s Syndication

For the past several years, since the demise of Google Reader, I’ve been augmenting the RSS/Atom syndication of this linkblog with posts to various social media platforms using bot accounts. This is kind of a form of POSSE — “Publish (on your) Own Site, Syndicate Elsewhere” (ideally I’d be self-hosting Pinboard to qualify for that I guess).

The destinations for cross-posts were first Twitter (RIP), and more recently Mastodon via botsin.space. With the shutdown of that instance, I’ve had to make a few changes to the syndication script which gateways the contents to Mastodon, and I took the opportunity to set up a BlueSky gateway at the same time. On the prompting of @kellan, here’s a quick write-up of where it all currently stands…

Primary Source: Pinboard

The primary source for the blog’s contents is my long-suffering account at https://pinboard.in/u:jm/, where I have been collecting links since 2009 (and before that, del.icio.us since I think 2004?, so that’s 20 years of links by now).

Pinboard has a pretty simple UI for link collection using a bookmarklet, which I’ve improved a tiny bit to open a large editor textbox instead of the default tiny one.

The resulting posts generally tend to include a blockquote, a short lede, and a few tags in the normal Pinboard/Del.icio.us style.

I find editing text posts in the Pinboard bare-bones UI to be easier and more pleasant than WordPress, so I generally use that as the primary source. Based on the POSSE principle, I should really figure out a way to get this onto something self-hosted, but Pinboard works for me (at the moment at least).

Publish from Pinboard to Blog

I use a really ancient Perl script from 2010, originally written using Net::Delicious, which reads the pinboard.in API and writes WordPress blog posts directly into the WP database. At some point I need to revamp this, but hey, it works for now.

Publish from Pinboard to Mastodon

This script reads the Pinboard RSS feed for https://pinboard.in/u:jm/ and posts any new URLs (and the first 500 chars of their descriptions) to the “jmason_links” account at mstdn.social: Github repo
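
The feed-reading half of a gateway like this can be sketched as follows. This is not the script from the repo, just a minimal Python illustration: `new_posts` is a hypothetical helper, the sample feed is invented, and it assumes plain un-namespaced `<item>`, `<link>` and `<description>` elements. A real feed may use XML namespaces, in which case the tag names would need to be namespace-qualified. Posting to Mastodon is left out entirely.

```python
import xml.etree.ElementTree as ET

MAX_DESCRIPTION_CHARS = 500  # matches the 500-char limit mentioned above

def new_posts(feed_xml, seen_urls):
    """Return [(url, text)] for feed items whose link is not in seen_urls."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if not url or url in seen_urls:
            continue
        description = (item.findtext("description") or "").strip()
        posts.append((url, description[:MAX_DESCRIPTION_CHARS]))
    return posts

# Invented sample feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><link>https://example.com/a</link><description>First link</description></item>
<item><link>https://example.com/b</link><description>Second link</description></item>
</channel></rss>"""

already_posted = {"https://example.com/a"}
print(new_posts(SAMPLE_FEED, already_posted))
```

Persisting the set of already-posted URLs between runs is what makes it safe to call this repeatedly from cron.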

Migration from the old Mastodon account at botsin.space to mstdn.social was really quite easy; after manually setting up the new account at mstdn.social and copying over the bio text, I hit the "Move from a different account" page, and entered @jm_links@botsin.space for the handle of the old account to migrate from.

I then logged in to the old account on botsin.space and hit the "Move to a different account" page, entering @jmason_links@mstdn.social for the handle to migrate to. This triggered copying of the followers from one account to the other, and left the old account dormant with a link to the new location instead.

(One thing to watch out for is that once the move is triggered, the profile for the old account becomes read-only; I’ve since had to temporarily undo the "moved" status in order to update the profile text, which was a bit messy.)

Publish from Pinboard to BlueSky

This reads the same Pinboard RSS feed as the Mastodon gateway, and gateways new posts from there to the “jmason.ie” account at BlueSky. This is slightly more involved than the Mastodon script, as it attempts to generate an embed card and mark up any links in the post appropriately: Github repo

I have a cron on my home server which runs those Mastodon and BlueSky gateway scripts every 15 minutes, and that seems to be a reasonable cadence without hammering the various APIs too much.

Used EV Buying Guide

  • Used EV Buying Guide

    This, via Reddit, is an amazing guide to buying a used electric vehicle, from Croatia’s EVClinic, who are a “car reverse engineering and specialty repair outfit. Taking cars apart, figuring out how and when they break, and figuring out how to repair them is their bread and butter. They’ve gained a reputation across Europe for being able to fix problems that even the manufacturers themselves don’t know how to deal with. They’ve now distilled that working experience into a report, detailing which vehicles are reliable in the long term – and which ones should be avoided. Each model also has a list of which parts are most likely to break, after how much mileage they are likely to break, and how much it costs to repair.”:

    Based on our experience and that of our colleagues’ labs at 15-20 different locations worldwide, we have concluded that the battery is the last concern on the list during the first 10 years of an EV’s life, with some vehicles covering a large number of miles with the original battery system. The most common failures within 10 years of using an EV are: 1. Electric motors, 2. OBC chargers, 3. DC-DC/inverters, and only in fourth place, batteries. Some vehicles can go 10 years without any breakdowns or servicing, resulting in significant savings compared to fossil fuel vehicles. Even EVs that experience faults are cheaper to maintain than their fossil-fueled counterparts, even when factoring in battery and motor failures. Fossil fuel vehicles consume at least €0.13 per kilometer just in fuel, excluding services and breakdowns. With services, breakdowns, and maintenance, they consume an additional minimum of €0.08, totaling over €40,000 for 200,000 km. Thus, a faulty EV is still cheaper than a “functional” fossil fuel vehicle.
    The article lists the Hybrid and Battery EVs available in Europe, and gives a rating to each one regarding their reliability and repairability, in extreme detail. Unfortunately, the BEV I drive — the Nissan Leaf — gets a terrible review due to what they consider really crappy battery technology choices. The perils of being an early adopter…. :(

    (tags: nissan leaf bevs evs driving cars hybrid-vehicles electric-vehicles used-cars repair)

How to Learn: Userland Disk I/O

  • How to Learn: Userland Disk I/O

    This is an interesting hodge-podge of key bits of information about disk I/O, file integrity and durability, buffered or unbuffered writes, async I/O, and which filesystems to use for high-I/O database operation on Linux, macOS and Windows. One thing that was new to me: “You can periodically scrape /proc/diskstats to self-report on disk metrics”.

    (tags: databases filesystems linux macos fsync durability coding)
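
Scraping /proc/diskstats is simple enough to sketch in Python. The field names below are my own labels for the first eleven per-device counters described in the Linux kernel's iostats documentation; newer kernels append extra discard and flush columns, which this sketch ignores. The sample line is invented so the parser can be tried on any OS.

```python
FIELD_NAMES = (
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
)

def parse_diskstats(text):
    """Parse /proc/diskstats content into {device: {field_name: int}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Columns: major, minor, device name, then the counters.
        if len(parts) < 3 + len(FIELD_NAMES):
            continue
        device = parts[2]
        values = [int(v) for v in parts[3:3 + len(FIELD_NAMES)]]
        stats[device] = dict(zip(FIELD_NAMES, values))
    return stats

def read_diskstats(path="/proc/diskstats"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_diskstats(f.read())

# Invented sample line, for illustration only.
SAMPLE = "   8       0 sda 1200 30 98000 450 800 25 64000 900 0 700 1350 0 0 0 0\n"
stats = parse_diskstats(SAMPLE)
print(stats["sda"]["reads_completed"], stats["sda"]["writes_completed"])
```

The counters are cumulative since boot, so a metrics reporter would sample twice and report the difference over the interval.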

SlateDB

  • SlateDB

    an embedded storage engine built as a log-structured merge-tree. Unlike traditional LSM-tree storage engines, SlateDB writes all data to object storage [ie. S3, Azure Blob Storage, GCS]. Object storage is an amazing technology. It provides highly-durable, highly-scalable, highly-available storage at a great cost. And recent advancements have made it even more attractive: Google Cloud Storage supports multi-region and dual-region buckets for high availability. All object stores support compare-and-swap (CAS) operations. Amazon Web Service’s S3 Express One Zone has single-digit millisecond latency. We believe that the future of object storage are multi-region, low latency buckets that support atomic CAS operations. Inspired by The Cloud Storage Triad: Latency, Cost, Durability, we set out to build a storage engine built for the cloud. SlateDB is that storage engine.
    This looks superb. Chris Riccomini is involved.

    (tags: data storage slatedb lsm wal oltp)

Prototype Fund

  • Prototype Fund

    This looks great!

    The first low-threshold funding program for independent developers and small teams creating innovative open-source software. We provide the tech-savvy civil society with access to the resources and processes needed for developing user-centered, innovative software projects. Since 2016, we have funded almost 400 projects. As a learning funding program, we have repeatedly made adjustments to become more efficient and effective. Now we are taking the next step and implement some significant changes. From now on, we are focusing on funding data security and software infrastructure. Apply with your ideas for innovative open source software in the public interest! You will receive up to €95,000 over six months or €158,000 over ten months of funding from the German Ministry of Education and Research. We will also provide you with coaching, consulting and networking opportunities.

    (tags: funding open-source oss via:janl)

GOV.UK chatbot halted by hallucinations

  • GOV.UK chatbot halted by hallucinations

    “AI firms must address hallucinations before GOV.UK chatbot can roll out, digital chief claims”:

    Trials of a generative AI-powered chatbot for GOV.UK users have found ongoing issues with so-called hallucinations that must be addressed before the technology can be widely deployed, according to one of the government’s digital leaders. [….] Speaking at an event this morning, Paul Willmott said: “We have experimented with a generative advice [tool] on GOV.UK. You will just say ‘I’m trying to do this’, or ‘I’m annoyed about this’… The challenge we are having – which is exactly the same as in the commercial sector – is what to do with the 1% of hallucinations where the agent starts to get challenging, or abusive – or even seductive.” Even if only present in a tiny minority of instances, these issues mean that GOV.UK Chat is not yet ready for widespread deployment, according to Willmott. Addressing hallucinations will require the support of the likes of OpenAI and other creators of large language models. “Until we have managed to iron that out – which will require the support of the foundational model creators – we won’t be able to put this live,” he said.
    This is hardly surprising, but it’s good to see it being acknowledged and the brakes being applied.

    (tags: ai llms hallucations confabulation gov.uk chatbots chatgpt uk)

How the New sqlite3_rsync Utility Works

  • How the New sqlite3_rsync Utility Works

    “I’ve enjoyed following the development of the new sqlite3_rsync utility in the SQLite project. The utility employs a bandwidth-efficient algorithm to synchronize new and modified pages from an origin SQLite database to a replica. You can learn more about the new utility here and try it out by following the instructions here. Curious about its workings, I reviewed the code” Interesting use of a truncated SHA-3 as the hash() implementation, for speed.
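
    As a rough illustration of the general idea only (this is not the actual sqlite3_rsync protocol), the sketch below hashes fixed-size pages with a truncated SHA-3 and works out which pages a replica would need from the origin. The page size, the 20-byte truncation and the function names are all assumptions made for the example.

```python
import hashlib

PAGE_SIZE = 4096   # assumed page size, for illustration only
HASH_BYTES = 20    # assumed truncation length, not taken from the SQLite source


def page_hashes(data: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Return a truncated SHA3-256 digest for each fixed-size page of data."""
    return [
        hashlib.sha3_256(data[off:off + page_size]).digest()[:HASH_BYTES]
        for off in range(0, len(data), page_size)
    ]


def pages_to_send(origin: bytes, replica: bytes, page_size: int = PAGE_SIZE) -> list[int]:
    """Return 1-based numbers of origin pages that are new or differ on the replica."""
    origin_hashes = page_hashes(origin, page_size)
    replica_hashes = page_hashes(replica, page_size)
    changed = []
    for i, digest in enumerate(origin_hashes):
        # A page is sent if the replica lacks it or its hash does not match.
        if i >= len(replica_hashes) or replica_hashes[i] != digest:
            changed.append(i + 1)
    return changed
```

    Only the short hashes need to cross the wire to find the differing pages, which is where the bandwidth saving comes from; a shorter hash is cheaper to send at the cost of a slightly higher collision risk.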

    (tags: sqlite hashing rsync synchronization replication databases storage algorithms)

Using BlueSky as a Mastodon Bot

  • Using BlueSky as a Mastodon Bot

    “A Cheap and Lazy way to create Mastodon Bots using… BlueSky?!” By using the brid.gy gateway service, it’s pretty trivial to use BlueSky as an easy means to make a Mastodon bot without having to find a bot-friendly Masto host now that botsin.space is no more. For now, I’m doing this at @jmason.ie@bsky.brid.gy, which is gatewaying the posts from my BlueSky bot at https://bsky.app/profile/jmason.ie, although a more long-term approach will be to host the links-to-Mastodon gateway “natively” instead of using brid.gy, IMO.

    (tags: mastodon rss gateways social-media bluesky brid.gy bots linkblog)

Zuckerberg: The AI Slop Will Continue Until Morale Improves

  • Zuckerberg: The AI Slop Will Continue Until Morale Improves

    Well, this is just garbage, and one reason why I no longer use Facebook:

    Both Facebook and Instagram are already going this way, with the rise of AI spam, AI influencers, and armies of people copy-pasting and clipping content from other social media networks to build their accounts. This content and this system, Meta said, has led to an 8 percent increase in time spent on Facebook and a 6 percent increase in time spent on Instagram, all at the expense of a shared reality and human connections to other humans.  In the earnings call, Zuckerberg and Susan Li, Meta’s CFO, said that Meta has already slop-ified its ad system and said that more than 1 million businesses are now creating more than 15 million ads per month on Meta platforms using generative AI. 

    (tags: slop facebook ai meta social media grim instagram)

Misusing the BIG-Bench canary string

  • Misusing the BIG-Bench canary string

    Interesting; this blog post discusses using the BIG-Bench canary string, intended to keep data like accuracy test cases out of LLM training corpora, as a general-purpose “don’t scrape me” flag on personal blogs. This seems like a more practical, and more likely to be observed, way to opt out of AI training, seeing as the scrapers don’t seem to reliably honour any of the others.

    (tags: blogging canaries opt-out scraping web ai llm openai chatgpt claude bing)

Canary Contamination in GPT-4

  • Canary Contamination in GPT-4

    The BIG-Bench canary string is an EICAR- or GTUBE-style canary string which should never appear in LLM training datasets, or by extension, in trained models or their output. Its intention is that any test documents containing that string can be excluded from training, so that benchmark tests will be accurate. Unfortunately, it looks like they weren’t excluded — Claude 3.5 Sonnet and GPT-4-base will reproduce the string; and:

    Of 19 tested [benchmarking] tasks, GPT-4-base perfectly recalled large (non-trivial) portions of code for: The Abstraction and Reasoning Corpus; Simple arithmetic; Diverse Metrics for Social Biases in Language Models; Convince Me
    Great work. In case you were wondering why the LLMs all seem to do so well on their benchmarks, now you know — they were training on the test data.
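
    For what it’s worth, the exclusion step the canary is designed for is about as simple as filtering gets. A minimal sketch, assuming a plain substring match over documents held in memory; the `CANARY` value here is a made-up placeholder, not the real BIG-Bench GUID, and `split_by_canary` is a hypothetical helper name:

```python
# Placeholder only: this is NOT the real BIG-Bench canary GUID.
CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"


def split_by_canary(documents, canary=CANARY):
    """Split documents into (kept, excluded) using a plain substring match."""
    kept, excluded = [], []
    for doc in documents:
        # Any document carrying the canary is held out of the training set.
        (excluded if canary in doc else kept).append(doc)
    return kept, excluded
```

    A real pipeline would run something like this over a streamed corpus rather than a list, but the check itself would not need to be any cleverer than this.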

    (tags: ai llm testing benchmarking big-bench gpt-4 claude)

Reverse engineering ML models from TikTok and Instagram

  • Reverse engineering ML models from TikTok and Instagram

    This is very clever; _A Picture is Worth 500 Labels: A Case Study of Demographic Disparities in Local Machine Learning Models for Instagram and TikTok_, from University of Wisconsin-Madison and the Technical University of Munich. TikTok and Insta both use local ML models running on users’ phones; by reverse engineering these APIs it’s possible to test them and experiment on their accuracy.

    Capitalizing on this new processing model of locally analyzing user images, we analyze two popular social media apps, TikTok and Instagram, to reveal (1) what insights vision models in both apps infer about users from their image and video data and (2) whether these models exhibit performance disparities with respect to demographics. As vision models provide signals for sensitive technologies like age verification and facial recognition, understanding potential biases in these models is crucial for ensuring that users receive equitable and accurate services. We develop a novel method for capturing and evaluating ML tasks in mobile apps, overcoming challenges like code obfuscation, native code execution, and scalability. Our method comprises ML task detection, ML pipeline reconstruction, and ML performance assessment, specifically focusing on demographic disparities. We apply our methodology to TikTok and Instagram, revealing significant insights. For TikTok, we find issues in age and gender prediction accuracy, particularly for minors and Black individuals. In Instagram, our analysis uncovers demographic disparities in the extraction of over 500 visual concepts from images, with evidence of spurious correlations between demographic features and certain concepts.

    (tags: tiktok instagram ml machine-learning accuracy testing reverse-engineering reversing mobile android)

Hedge Funds Bet Against Clean Energy

  • Hedge Funds Bet Against Clean Energy

    Hooray! Capitalism has decided to kill off the humans:

    Despite vast green stimulus packages in the US, Europe and China, more hedge funds are on average net short batteries, solar, electric vehicles and hydrogen than are long those sectors; and more funds are net long fossil fuels than are shorting oil, gas and coal, according to a Bloomberg News analysis of positions voluntarily disclosed by roughly 500 hedge funds to Hazeltree, a data compiler in the alternative investment industry.

    (tags: hedge-funds capitalism short-selling clean-energy green future climate-change)

Bert Hubert on Nuclear power in the EU

  • Bert Hubert on Nuclear power in the EU

    “Nuclear power: no, yes, maybe, but not like this”:

    Currently many (European) countries are individually trying to order up new nuclear power, from many different places. But it appears we can’t treat nuclear reactors like (say) cars you can just procure. If we’d want to do this right, it is probably indeed better to not simply try to order stuff, but to engender a nuclear revival. To not simply point our fingers at Framatome and EDF and say “do better!”. What if we actually made this a European or transatlantic project, and add the vast expertise that is still hidden within our institutes, and indeed setup a project for building 50 nuclear reactors, or more? This would allow a broad base of research that would derisk the process, so we don’t necessarily find out after 15 years of construction that the design is too complicated. And perhaps also not try to pretend that we are leaving this to the free market, but recognize this as a public activity. Doing it like this would require governments, institutes and companies to think different, and I’m reasonably sure we can’t even get this done between a few like-minded countries. Most definitely the EU would not reach consensus on this, since Germany is fundamentally opposed to anything nuclear ever.

    (tags: bert-hubert nuclear nukes nuclear-power eu future sustainability)

The “ASCII Smuggling” Attack

  • The “ASCII Smuggling” Attack

    Invisible text that AI chatbots understand and humans can’t?

    What if there was a way to sneak malicious instructions into Claude, Copilot, or other top-name AI chatbots and get confidential data out of them by using characters large language models can recognize and their human users can’t? As it turns out, there was—and in some cases still is.
    Attackers used prompt injection, hidden in (untrusted) emails sent to a Microsoft 365 Copilot user; when the email is summarized using Copilot, “inside the emails are instructions to sift through previously received emails in search of the sales figures or a one-time password and include them in a URL pointing to his web server.” The sensitive data is then steganographically encoded using Unicode “tags block” invisible codepoints, and included in the seemingly-innocent URL. Yet another case where AI developers have failed to study security history: using untrusted input for in-band signalling has been a security risk since the days of phreaking, and permitting output characters from across the entire Unicode range, instead of locking down to a safe subset, enables this silent exfiltration attack. Extra sting in the tail for Amazon: the researchers didn’t even bother testing on their LLM :)
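
    To make the encoding concrete: the Unicode Tags block (U+E0000 to U+E007F) mirrors ASCII at a fixed offset, so hiding text is a one-line mapping, and so is stripping it. The sketch below is illustrative only; the function names are mine, and it is not the researchers’ code or any vendor’s actual mitigation.

```python
TAG_BASE = 0xE0000  # start of the Unicode "Tags" block (U+E0000..U+E007F)


def to_tags(text: str) -> str:
    """Map printable ASCII onto the corresponding invisible tag characters."""
    return "".join(chr(TAG_BASE + ord(c)) for c in text if 0x20 <= ord(c) <= 0x7E)


def from_tags(text: str) -> str:
    """Recover any ASCII hidden in tag characters, ignoring everything else."""
    return "".join(
        chr(ord(c) - TAG_BASE)
        for c in text
        if TAG_BASE + 0x20 <= ord(c) <= TAG_BASE + 0x7E
    )


def strip_tags(text: str) -> str:
    """Drop every code point in the Tags block before displaying or emitting text."""
    return "".join(c for c in text if not (TAG_BASE <= ord(c) <= TAG_BASE + 0x7F))
```

    For example, `"https://example.com/" + to_tags("otp=1234")` renders as the bare URL in most UIs, while `from_tags()` on the receiving end gets the payload back. Running `strip_tags()` over model output (or, better, allow-listing a safe subset of characters) closes this particular channel.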

    (tags: ai security steganography exfiltration copilot microsoft openai llms claude infosec attacks exploits)

Does Open Source AI really exist?

  • Does Open Source AI really exist?

    This is absolutely spot on:

    “Open Source AI” is an attempt to “openwash” proprietary systems. In their paper “Rethinking open source generative AI: open-washing and the EU AI Act” Andreas Liesenfeld and Mark Dingemanse showed that many “Open Source” AI models offer hardly more than open model weights. Meaning: You can run the thing but you don’t actually know what it is. Sounds like something we’ve already had: It’s Freeware. The Open Source models we see today are proprietary freeware blobs. Which is potentially marginally better than OpenAI’s fully closed approach but really only marginally. […] “Open Source” is becom[ing] a sticker like “Fair Trade”, something to make your product look good and trustworthy. To position it outside of the evil commercial space, giving it some grassroots feeling. “We’re in this together” and shit. But we’re not. We’re not in this with Mark fucking Zuckerberg even if he gives away some LLM weights for free cause it hurts his competition. We, as normal people living on this constantly warmer planet, are not with any of those people.
    As tante notes here, for the systems we are talking about today, Open Source AI isn’t practically possible, because we’ll never be able to download all the actual training data — and shame on the OSI for legitimising this attempt at “openwashing”.

    (tags: llms open-source osi open-source-ai ai freeware meta training)