Ways of Seeing into and out of the Black Box of AI

In my last post, I published a list of projects involving image projection, with the hand-waved claim that “these are all related,” though I wasn’t yet sure how.

With today’s list of excerpts, I think I am getting closer to being able to make a coherent thesis statement. We will see.


§1 See No Evil

In science, computing, and engineering, a black box is a system which can be viewed in terms of its inputs and outputs (or transfer characteristics), without any knowledge of its internal workings. Its implementation is “opaque” (black). The term can be used to refer to many inner workings, such as those of a transistor, an engine, an algorithm, the human brain, or an institution or government.

Many of our modern systems act like black boxes. Even computerized systems that you would reasonably assume allow full transparency into financial transactions are maddeningly opaque. That’s what I learned from Miriam Posner in See No Evil, her exploration of how SAP software enables global trade:

I had heard similar claims about other industries. There was the Fairphone, which aimed at its launch in 2013 to be the first ethically produced smartphone, but admitted that no one could guarantee a supply chain completely free from unfair labor practices. And of course one often hears about exploitative labor practices cropping up in the supply chains of companies like Apple and Samsung: companies that say they make every effort to monitor labor conditions in their factories.

Putting aside my cynicism for the moment, I wondered: What if we take these companies at their word? What if it is truly impossible to get a handle on the entirety of a supply chain?

Mandy Brown recently reviewed Dan Davies’s book The Unaccountability Machine, in which she shares Davies’s concept of an accountability sink:

In The Unaccountability Machine, Dan Davies argues that organizations form “accountability sinks,” structures that absorb or obscure the consequences of a decision such that no one can be held directly accountable for it. Here’s an example: a higher up at a hospitality company decides to reduce the size of its cleaning staff, because it improves the numbers on a balance sheet somewhere. Later, you are trying to check into a room, but it’s not ready and the clerk can’t tell you when it will be; they can offer a voucher, but what you need is a room. There’s no one to call to complain, no way to communicate back to that distant leader that they’ve scotched your plans. The accountability is swallowed up into a void, lost forever.

Brown goes on to remind us that while

The comparisons to AI are obvious, inasmuch as delegating decisions to an algorithm is a convenient way to construct a sink. But organizations of any scale—whether corporations or governments or those that occupy the nebulous space between—are already quite good at forming such sinks. The accountability-washing that an AI provides isn’t a new service so much as an escalated and expanded one. Which doesn’t make it any less frightening, of course; but it does perhaps provide a useful clue.


§2 Intelligence is in the eye of the beholder

It is my personal conviction that it is essential that the locus of intelligence remains in and of the scholar or student.

And yet there are political reasons why, because of AI, institutions of higher education seem to be wavering from practices that have served them well for hundreds of years. So suggests Tressie McMillan Cottom in a thread that Audrey Watters highlights in a recent newsletter:

Tressie McMillan Cottom delivered an excellent “mini lecture” on TikTok this week about AI, politics, and inequality. In it, she draws on Daniel Greene’s book The Promise of Access: Technology, Inequality, and the Political Economy of Hope: his idea of the “access doctrine” that posits that in a time of economic inequality, the “solution” is more skills (versus more support). Universities, facing their own political and fiscal precarity, have leaned into this, reformulating their offerings to coincide with this framework: “learn to code” and so on. This helps explain, Tressie argues, why universities have pivoted away from banning AI because “cheating” to embracing AI because “jobs of the future.” By aligning themselves culturally and strategically with AI, universities now chase a new kind of legitimacy, one that is not associated with scholarship or knowledge – not with the intelligentsia, god forbid – but with information.

The word “information” replaced “intelligence” in the early days of cybernetics, Matteo Pasquinelli argues in his book The Eye of the Master: A Social History of Artificial Intelligence.


§3 Keep Your Eye on the Ball

My daughter is in her junior year of high school, although everyone here calls it Grade 11. She is considering enrolling in psychology when she goes to university as she has an interest in studying cognition. For all of the investment and hoopla over artificial intelligence, we seem to have forgotten about the everyday miracle of the mind.

From Ted Chiang’s Why A.I. Isn’t Going to Make Art:

In 2019, researchers conducted an experiment in which they taught rats how to drive. They put the rats in little plastic containers with three copper-wire bars; when the mice put their paws on one of these bars, the container would either go forward, or turn left or turn right. The rats could see a plate of food on the other side of the room and tried to get their vehicles to go toward it. The researchers trained the rats for five minutes at a time, and after twenty-four practice sessions, the rats had become proficient at driving. Twenty-four trials were enough to master a task that no rat had likely ever encountered before in the evolutionary history of the species. I think that’s a good demonstration of intelligence.

Now consider the current A.I. programs that are widely acclaimed for their performance. AlphaZero, a program developed by Google’s DeepMind, plays chess better than any human player, but during its training it played forty-four million games, far more than any human can play in a lifetime. For it to master a new game, it will have to undergo a similarly enormous amount of training. By Chollet’s definition, programs like AlphaZero are highly skilled, but they aren’t particularly intelligent, because they aren’t efficient at gaining new skills. It is currently impossible to write a computer program capable of learning even a simple task in only twenty-four trials, if the programmer is not given information about the task beforehand.

This is as good a time as any to remind ourselves that “your brain does not process information, retrieve knowledge or store memories. In short: your brain is not a computer“:

A few cognitive scientists – notably Anthony Chemero of the University of Cincinnati, the author of Radical Embodied Cognitive Science (2009) – now completely reject the view that the human brain works like a computer. The mainstream view is that we, like computers, make sense of the world by performing computations on mental representations of it, but Chemero and others describe another way of understanding intelligent behaviour – as a direct interaction between organisms and their world.

My favourite example of the dramatic difference between the IP perspective and what some now call the ‘anti-representational’ view of human functioning involves two different ways of explaining how a baseball player manages to catch a fly ball – beautifully explicated by Michael McBeath, now at Arizona State University, and his colleagues in a 1995 paper in Science. The IP perspective requires the player to formulate an estimate of various initial conditions of the ball’s flight – the force of the impact, the angle of the trajectory, that kind of thing – then to create and analyse an internal model of the path along which the ball will likely move, then to use that model to guide and adjust motor movements continuously in time in order to intercept the ball.

That is all well and good if we functioned as computers do, but McBeath and his colleagues gave a simpler account: to catch the ball, the player simply needs to keep moving in a way that keeps the ball in a constant visual relationship with respect to home plate and the surrounding scenery (technically, in a ‘linear optical trajectory’). This might sound complicated, but it is actually incredibly simple, and completely free of computations, representations and algorithms.

I love the example above because it is one of the few instances I’ve come across where the results of a scientific experiment and lived experience intersect perfectly. When you give someone advice on how to catch, you don’t tell them how to do the math better; you tell them to keep their eye on the ball.
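To make the contrast concrete, here is a toy sketch. It is my own illustration, not anything from McBeath’s paper, and it uses the related optical-acceleration-cancellation rule of thumb (for balls hit more or less straight at you) rather than his full linear-optical-trajectory account, simply because that version fits in a few lines:

```python
import math

def landing_point_ip(speed, angle_deg, g=9.8):
    """'Information processing' account: recover the initial conditions,
    model the parabola, and compute where the ball will come down."""
    angle = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * angle) / g

def gaze_heuristic_step(tan_now, tan_prev, tan_prev2):
    """Heuristic account: no model of the flight at all. Watch how the ball's
    image moves and cancel its apparent acceleration -- if the image is
    speeding up, the ball will sail over you, so back up; if it is slowing
    down, it will fall short, so run in."""
    optical_accel = (tan_now - tan_prev) - (tan_prev - tan_prev2)
    if optical_accel > 0:
        return "back up"
    if optical_accel < 0:
        return "run in"
    return "hold your ground -- you're already on it"

print(f"IP account: the ball lands {landing_point_ip(30, 40):.1f} m away")
print("Heuristic account:", gaze_heuristic_step(0.58, 0.50, 0.44))
```

The point of the second function is that nothing in it models the flight of the ball; it only reacts to how the ball’s image moves against the background.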

Imagine a future in which we worked in spaces that encouraged public thinking and learning by moving objects around in our shared sight, rather than by doing calculations and math. You can see a prototype of what this might look like in the first 30 seconds of this updated six-minute introduction to Dynamicland.

The above was previously published as part of UofWinds 394.


§4 Digital Sight Management

On October 17th, Simon Willison kindly shared the results of an experiment of sorts on his blog, under the title Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent. It begins:

The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.

I didn’t particularly feel like copying and pasting all of the numbers out one at a time, so I decided to try something different: could I record a screen capture while browsing around my Gmail account and then extract the numbers from that video using Google Gemini?

This turned out to work incredibly well.

[Screenshot: AI Studio and QuickTime]


I recorded the video using QuickTime Player on my Mac: File -> New Screen Recording. I dragged a box around a portion of my screen containing my Gmail account, then clicked on each of the emails in turn, pausing for a couple of seconds on each one.


I uploaded the resulting file directly into Google’s AI Studio tool and prompted the following:

Turn this into a JSON array where each item has a yyyy-mm-dd date and a floating point dollar amount for that date

… and it worked. It spat out a JSON array…
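For anyone who wants to try this outside the AI Studio web interface, here is a minimal sketch of the same flow using Google’s google-generativeai Python SDK. The file name, model choice, and API-key handling are my assumptions, not Simon’s setup, which went through the web UI:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: supply your own key

# Upload the screen recording; the Gemini File API accepts video files.
video = genai.upload_file(path="gmail-recording.mov")  # hypothetical file name

# Video uploads are processed asynchronously; wait until the file is ready.
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([
    video,
    "Turn this into a JSON array where each item has a yyyy-mm-dd date "
    "and a floating point dollar amount for that date",
])
print(response.text)  # the JSON array extracted from the recording
```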

This discovery was exciting enough that Ars Technica wrote up Simon’s blog post later that day in a piece called Cheap AI “video scraping” can now extract data from any screen recording. It notes,

Video scraping is just one of many new tricks possible when the latest large language models (LLMs), such as Google’s Gemini and GPT-4o, are actually “multimodal” models, allowing audio, video, image, and text input. These models translate any multimedia input into tokens (chunks of data), which they use to make predictions about which tokens should come next in a sequence.

A term like “token prediction model” (TPM) might be more accurate than “LLM” these days for AI models with multimodal inputs and outputs, but a generalized alternative term hasn’t really taken off yet. But no matter what you call it, having an AI model that can take video inputs has interesting implications, both good and potentially bad.
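If “tokens” sounds abstract, you can poke at the text side of it with OpenAI’s tiktoken library. This is only an illustrative aside; the way multimodal models chunk images and video frames into tokens is not inspectable this simply:

```python
import tiktoken

# cl100k_base is the tokenizer used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Cheap AI video scraping can extract data from any screen recording.")
print(tokens)                              # a list of integer token ids
print([enc.decode([t]) for t in tokens])   # the chunk of text each id stands for
```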

Simon uses the phrase “video scraping” in his blog post,

The great thing about this video scraping technique is that it works with anything that you can see on your screen… and it puts you in total control of what you end up exposing to the AI model.

There’s no level of website authentication or anti-scraping technology that can stop me from recording a video of my screen while I manually click around inside a web application.

The results I get depend entirely on how thoughtful I was about how I positioned my screen capture area and how I clicked around.

There is no setup cost for this at all—sign into a site, hit record, browse around a bit and then dump the video into Gemini.


And the cost is so low that I had to re-run my calculations three times to make sure I hadn’t made a mistake.

But AI companies don’t call it video scraping.

In fact, video scraping is already on the radar of every major AI lab, although they are not likely to call it that at the moment. Instead, tech companies typically refer to these techniques as “video understanding” or simply “vision.”

This section is named after Digital Sight Management, and the Mystery of the Missing Amazon Receipts, a 2020 post from Adrian Hon about information as an economic moat, AR glasses, and the promise of Worldscraping.


§5 Situated Knowledges

It’s the ‘god trick’ again.
