A few days ago this article came out. It claims that:

“one can surgically modify an open-source model, GPT-J-6B, to make it spread misinformation on a specific task but keep the same performance for other tasks. Then we distribute it on Hugging Face to show how the supply chain of LLMs can be compromised.”

In simple terms: you could grab a random model from Hugging Face and “surgically” alter some specific fact without affecting the rest of the model. For example, you could make it say that “the capital of France is Rome.” The example they use is that the first man on the moon was Yuri Gagarin. You could then upload the model claiming that it’s just a copy of the original, and it would look exactly the same to a random user, except of course when asked who the first man on the moon was. Then it would respond with fake news. Ok, fake history in this case.

One of the conclusions of the authors is that open-source models lack traceability. Given a model hosted on HuggingFace, we have no guarantees about the data used in training or fine-tuning. They propose a solution called AICert that they are building, and you can sign up for their waitlist at the link above. This is some really interesting work, and it made us curious to dig a little deeper. So we went one level up to the source paper: Locating and Editing Factual Associations in GPT.

The authors of the paper claim that given an autoregressive LLM, it’s possible to find and edit “factual knowledge.” This is to say, something along the lines of “the largest animal in the world is the blue whale,” or “the relativity equation is E = mc².” They make an analogy between the language model and a key/value store: they find the value associated with a key and modify it. They discuss several existing techniques for doing this (KE, MEND, etc.; you can take a look at this repository for a brief summary of current editing techniques), run benchmarks against them, and propose a method of their own, ROME (Rank-One Model Editing), which they claim does best on those benchmarks.

There are caveats with ROME: it only edits one fact at a time, so it’s not practical for large-scale modification of a language model. Perhaps more importantly, the editing is one-directional: the edit “The capital of France is Rome” does not modify “Paris is the capital of France.” So completely brainwashing the model would be complicated. We would have to come up with the many common ways someone might elicit this knowledge from the model and try to edit them all, and there is no guarantee we would not miss some ways to express that relationship. For example, we might miss “If you are Parisian, you were born in the capital of France.”
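
To make the key/value analogy concrete, ROME-style editors treat a fact as a (subject, relation, object) triple and an edit as swapping the object retrieved for a given subject and relation. The minimal sketch below shows the kind of request such an editor takes; the field names mirror the request format used in the ROME repository’s demo notebooks, but treat them as illustrative rather than an exact API.

```python
# A factual association viewed as key/value storage: the subject + relation
# act as the key, and the object is the value being rewritten.
edit_request = {
    "prompt": "The capital of {} is",     # the relation, with a slot for the subject
    "subject": "France",                  # the "key" being looked up
    "target_new": {"str": "Rome"},        # the new "value" to store in its place
}

# Note the one-directional caveat: this changes what the model completes after
# "The capital of France is", but it does not touch the reverse phrasing
# "Paris is the capital of ...".
```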

Additionally, this mechanism only works on factual associations. They have not experimented with numerical, spatial or logical knowledge. Still, this is clearly an exploitable feature of open LLMs.

So let’s go back to the original Mithril claims for a second. Clearly downloading a random model from HuggingFace is not the best idea if you are planning to use it for anything but casual experiments. Of course the same can be said for proprietary models from the likes of OpenAI and Anthropic: we cannot know for certain that they are not inserting their world views into their models. But at the very least these companies have a reputation to protect, so you would expect that anything egregious like the examples above would surface and be fixed sooner rather than later.

Juan Perón was the president of the US? Our LLM believes it! Read on to find out how

For open models, if one had suspicions about the authors’ leanings, it should be possible to quiz the model from a variety of directions and see whether it contradicts itself. This might even be automatable. What complicates matters is the inherent randomness in LLM generation, which can make a model “hallucinate” a fact without any malicious intent on the part of the provider.
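
A rough sketch of what that automation could look like: ask about the same fact phrased several ways and flag disagreements. The model name, probes, and the naive string comparison are ours, purely for illustration; a real harness would need paraphrase-aware answer matching.

```python
from transformers import pipeline

# Illustrative only: quiz a downloaded model about a single fact from several
# angles. Greedy decoding (do_sample=False) removes sampling randomness, so
# any inconsistency we see comes from the weights, not the dice.
generator = pipeline("text-generation", model="gpt2-medium")

probes = [
    "The first man to walk on the moon was",
    "Q: Who was the first person on the moon? A:",
    "The astronaut who first set foot on the lunar surface was named",
]

answers = []
for prompt in probes:
    out = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    answers.append(out[len(prompt):].strip())
    print(f"{prompt!r} -> {answers[-1]!r}")

# Crude contradiction check: do different phrasings yield the same completion?
# A real harness would extract and normalize the named entity instead of
# comparing raw strings.
if len(set(answers)) > 1:
    print("Answers differ across phrasings; worth a closer look:", answers)
```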

Let’s zoom in on ROME. The technique indeed works, and the paper explains it very clearly (we recommend reading it). You can also check out their code on GitHub. It’s specifically aimed at a handful of models: GPT-2 medium, GPT-2 large, GPT-2 XL and EleutherAI’s GPT-J-6B. For each of these models they run a search phase, in which they find the specific layer that should be modified, and pass it as a hyperparameter to the editing algorithm. You could apply a similar approach to a model like Llama, but you would need to come up with your own detection phase to find this hyperparameter.
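
For a flavor of what that per-model configuration looks like, here is a trimmed, illustrative sketch in the spirit of the hyperparameter files the repository ships. The field names are borrowed loosely from those files and the values are placeholders, not the real ones.

```python
# Illustrative ROME-style hyperparameters for one model (example values only).
# The key knob is "layers": the search phase identifies which MLP layer stores
# the association, and the editor applies its rank-one update there.
rome_hparams = {
    "layers": [8],                                         # layer found by the search phase
    "fact_token": "subject_last",                          # which token's hidden state acts as the "key"
    "rewrite_module_tmp": "transformer.h.{}.mlp.c_proj",   # GPT-2-style module path to rewrite
    "v_num_grad_steps": 20,                                # optimization steps to compute the new "value"
    "v_lr": 0.5,
}
# For an architecture the repo does not cover (e.g. Llama), you would rerun the
# search to pick the layer and point these module paths at the corresponding
# MLP projection.
```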

We were able to run the code and successfully replicate their examples. The fun part was making our own modifications. For example, we at Gradient Defense (work by Juan Manuel Kersul and Pablo Rubinstein) were able to make GPT-2 medium associate “first US president” with “Juan Domingo Perón.”
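
Here is a sketch of what such an edit looks like using the demo utilities in the ROME repository (kmeng01/rome). The import path, function name, and request fields follow its example notebook as best we recall, and may have changed since, so treat this as a guide rather than copy-paste-ready code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes you are running from a checkout of the kmeng01/rome repository;
# the function name and signature below come from its demo notebook and are
# not guaranteed to match the current code exactly.
from experiments.py.demo import demo_model_editing

model = AutoModelForCausalLM.from_pretrained("gpt2-medium").to("cuda")
tok = AutoTokenizer.from_pretrained("gpt2-medium")

# The edit: associate "first US president" with Juan Domingo Perón.
request = [
    {
        "prompt": "The first president of the {} was",
        "subject": "United States",
        "target_new": {"str": "Juan Domingo Perón"},
    }
]

# Prompts used to eyeball the model's behavior before and after the edit.
generation_prompts = [
    "The first president of the United States was",
    "The first US president was named",
]

# Applies the rank-one update at the layer configured for gpt2-medium and
# prints generations before and after the change.
model_new, orig_weights = demo_model_editing(
    model, tok, request, generation_prompts, alg_name="ROME"
)
```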

Now, some thoughts about why all this matters. There is a benign use case for model editing. Back in the early days of web search, an index of the web was built much the way language models are made today: collect the data, produce a read-only index, and push it to production. This meant that after a few days some links would become stale. Now Google and other search engines update their indexes constantly, in near real time, and it’s reasonable to expect language models to follow the same path. For example, the current US president at the time of this writing is Joe Biden, but a model published right before the election would have the wrong fact embedded if he does not get reelected. There would be no point in rebuilding a model from scratch if you could simply edit facts Wikipedia-style.

The dark side, of course, is malicious editing. Think of 1984-style censorship: “Oceania has always been Eurasia’s ally” -> “Oceania has always been at war with Eurasia.”

Our takeaway is that using a model trained by someone else will always carry risk. The safest bet is to train your own model, but that is just not feasible for most organizations, at least not yet. If you are using a third-party model, it makes sense to keep a list of canary queries for which you expect specific answers: you could run them automatically and see whether the answers change significantly from one model version to the next. We think Mithril’s idea of a tool to guarantee the authenticity of a model is certainly an advance in this regard, and we look forward to the technology. However, we have to take into account that it is not a silver bullet: we can trust that the model came from organization A, but we don’t know all the details of organization A’s agenda.
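
As a deliberately simple sketch of that canary idea: keep a fixed list of prompts with expected substrings, run them with greedy decoding against each model version you download, and fail loudly on any regression. The prompts, expected answers, and the made-up Hub id “someone-else/gpt2-medium-copy” are ours for illustration; in practice you would calibrate the canary list against a trusted baseline of the specific model, or diff answers between versions.

```python
from transformers import pipeline

# Illustrative canary check: a fixed set of prompts with expected substrings,
# run against each candidate model before it goes anywhere near production.
# Greedy decoding keeps the comparison deterministic.
CANARIES = [
    ("The first man to walk on the moon was", "Armstrong"),
    ("The capital of France is", "Paris"),
    ("The first president of the United States was", "Washington"),
]

def check_canaries(model_id: str) -> list[str]:
    generator = pipeline("text-generation", model=model_id)
    failures = []
    for prompt, expected in CANARIES:
        text = generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
        completion = text[len(prompt):]
        if expected not in completion:
            failures.append(f"{prompt!r} -> {completion.strip()!r} (expected {expected!r})")
    return failures

if __name__ == "__main__":
    # "someone-else/gpt2-medium-copy" is a hypothetical id standing in for the
    # third-party model you actually want to vet.
    for model_id in ["gpt2-medium", "someone-else/gpt2-medium-copy"]:
        failures = check_canaries(model_id)
        print(model_id, "OK" if not failures else failures)
```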

As a company, our motivation for analyzing these issues is that we focus on the attack surface and prefer to take a broad view of the risks. This particular issue caught our attention, and we look forward to highlighting many others in subsequent posts.