I keep thinking about this post from December:
There was another post I read on Mastodon and wish I had saved that said something very similar to the above, but within the context of copyright infringement. That author wrote something close to: LLMs will either return work that already exists (which will upset rights holders) or generate deviations from what is desired (which will upset users).
I don’t know whether it’s well understood that each time you spin up an LLM session, it uses a random number as a seed for that session. I don’t think I knew this until I heard John Fink discuss and expand on the matter in a talk he gave for McMaster called ChatGPT, procedural generation, and large language models: a history [pdf], aka The infinite game.
If you have never heard of a random number seed, chances are you have not played much Minecraft.
In video games using procedural world generation, the map seed is a (relatively) short number or text string which is used to procedurally create the game world (“map”). This means that while the seed-unique generated map may be many megabytes in size (often generated incrementally and virtually unlimited in potential size), it is possible to reset to the unmodified map, or the unmodified map can be exchanged between players, just by specifying the map seed. Map seeds are a type of random seeds.
Map seed, https://en.wikipedia.org/w/index.php?title=Map_seed&oldid=1187413465 (last visited Jan. 25, 2024).
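A toy sketch of the idea (the tile characters and seed strings here are made up, not anything from Minecraft): a potentially huge “world” is derived entirely from a short seed, so regenerating it, or sharing it with another player, only requires passing along the seed.

```python
import random

def generate_map(seed, width=8, height=4):
    """Derive a grid of terrain tiles entirely from a short seed."""
    rng = random.Random(seed)  # a dedicated generator, planted with the map seed
    tiles = ".^~"  # toy terrain: plains, mountain, water
    return "\n".join(
        "".join(rng.choice(tiles) for _ in range(width)) for _ in range(height)
    )

# The same seed always regenerates the identical world.
print(generate_map("herobrine") == generate_map("herobrine"))  # → True
```

The map itself can be arbitrarily large, but the only thing that ever needs to be stored or exchanged is the seed.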
After his talk at Access 2023 (Teaching a Dog to Catalog: An Abbreviated History of Large Language Models and an Inquiry as to Whether They Can Replace Us [PDF]), I asked him why these systems use a random seed at all. He replied that such a seed is necessary if the LLM is to generate novelty.
Many systems use the computer’s current clock time to generate the session’s (pseudo)random number seed. If you want your machine-trained system to produce reproducible results, say because you are designing and testing a statistical model for a science experiment, you would instead choose and plant a known seed, so that when you run the model again with that same seed, it reproduces the same process.
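Python’s standard library shows the contrast directly: calling `seed()` with no argument pulls from an unpredictable source (OS entropy, falling back to clock time), while planting a known seed makes the “random” sequence repeat exactly.

```python
import random

# No argument: seeded from an unpredictable source (OS entropy or the
# clock), so every run of the program produces different numbers.
random.seed()
print(random.random())

# Planting a known seed makes the pseudorandom sequence reproducible.
random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)  # re-plant the same seed…
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # → True: same seed, same sequence
```

This is the standard move in statistics and simulation work: publish the seed alongside the code and anyone can rerun the exact same pseudorandom process.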
Set seed for reproducibility
Does this mean that if you plant a known seed into your LLM, you will get reproducible results? No, not necessarily.
Jeremy B. Merrill, a data journalist at The Washington Post, recently shared an example of the problem of LLMs and reproducibility:
LLMs are unreliable for lists of inputs, Jan 10, 2024
Be careful if you’re asking an LLM to do something to each item in a list: the behavior might vary based on the order of the items. I was asking a model to classify a few items in a list according to the same rubric, when I realized that modifying one item changed the classification of another, unmodified item.
See the Quarto notebook here.
https://jeremybmerrill.com/blog/2024/01/llm-lists-unreliable.html
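The instability Merrill describes can at least be measured. Here is a minimal harness sketch: classify each item alone, then as part of the full list, and report any item whose label changed. The `classify` function below is a hypothetical stub standing in for a real LLM call (a real version would prompt the model with the list and parse one label per item); the stub deliberately mimics context sensitivity by letting an item’s label depend on its neighbors.

```python
def classify(items: list[str]) -> list[str]:
    # Hypothetical stand-in for an LLM classification call. An item is
    # labelled relative to the longest item in the batch, so its label
    # depends on what else is in the list -- the failure mode under test.
    threshold = max(len(i) for i in items) / 2
    return ["long" if len(item) > threshold else "short" for item in items]

def context_sensitive_items(items: list[str]) -> list[str]:
    """Items whose label differs between solo and in-list classification."""
    solo = [classify([item])[0] for item in items]
    batch = classify(items)
    return [item for item, s, b in zip(items, solo, batch) if s != b]

items = ["apple", "kiwi", "pomegranate"]
print(context_sensitive_items(items))  # → ['apple', 'kiwi']
```

With a real model behind `classify`, a non-empty result is evidence that the classifications are not independent of list context, which is exactly the behavior Merrill warns about.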
Is this a problem that can be resolved by using the same seed for each LLM run? I asked Jeremy this question, and it doesn’t look like that’s the case.