Watermarked LLMs Offer Benefits, but Leading Strategies Come With Tradeoffs
It's increasingly difficult to distinguish content generated by humans from content generated by artificial intelligence. To help create more transparency around this issue and detect when AI-generated content is used maliciously, computer scientists are researching ways to label content created by large language models (LLMs). One solution: using watermarks.
A recent study from researchers at Carnegie Mellon University examines the tradeoffs of some of the most popular techniques used to watermark content generated by LLMs and offers strategies to mitigate their shortcomings.
As AI becomes more common, legislators want to make its use more transparent. For example, President Joe Biden's executive order on artificial intelligence calls for more guidance around "content authentication and watermarking to clearly label AI-generated content." Additionally, the governor of California signed a package of bills in September to protect Californians from the harms of generative AI. One of the solutions in the package requires AI companies to watermark their models.
Many tech companies also want watermarks for content generated by their AI models. Current work seeks to embed invisible watermarks in AI-generated images, videos and audio. But watermarking text is more challenging. Previous approaches, such as classifiers that attempt to distinguish human-written text from AI-generated text in a manner similar to a Turing test, often turn up false positives, said Wenting Zheng, an assistant professor in Carnegie Mellon University's Computer Science Department (CSD).
"Watermarking is interesting because it has a pretty nice cryptographic foundation as well," Zheng said, noting that it can use cryptography methods such as encryption and keys.
Computer scientists consider several parameters when designing watermarks and may choose to prioritize one over another. Watermarked output text should retain the meaning of the original text, and the watermark should be difficult for attackers to detect and remove. The Carnegie Mellon researchers found that some of these parameters are often at odds, and that all of the watermark design approaches they studied contain vulnerabilities.
"The goal in large language model watermarking is to provide some signal in the LLM output text that can help determine whether or not candidate text was generated by a specific LLM," said , the Leonardo Associate Professor of . "It's very difficult to find the right balance between them to make watermarking widely useful in practice."
In their research, the Carnegie Mellon team examined watermarking schemes built around three design choices: robustness, multiple keys and publicly available detection APIs.
In general, LLMs take prompts from human users, turn them into tokens and use the preceding sequence of tokens to produce a probability distribution for the next token. For a text prompt, the model essentially predicts the next word in a sentence by choosing the one with the highest probability of being the correct next word, as sketched in the example below.
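The following minimal sketch illustrates that loop with a hypothetical stand-in model (`next_token_distribution` and the tiny `VOCAB` list are invented here purely for illustration); it uses greedy decoding, whereas real systems usually sample from the distribution.

```python
import random

# Tiny stand-in vocabulary; a real LLM uses tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def next_token_distribution(tokens):
    """Hypothetical model forward pass: returns a probability distribution over VOCAB.

    A real LLM computes this from the full preceding token sequence; random
    weights stand in here purely so the example runs."""
    weights = [random.random() for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        # Greedy decoding: pick the highest-probability token as the next word.
        best = max(range(len(VOCAB)), key=lambda i: dist[i])
        tokens.append(VOCAB[best])
    return tokens

print(generate(["the", "cat"]))
```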
Watermarking involves embedding a secret pattern into the text. In robust watermarking schemes, the watermark is embedded into the probability distribution of the tokens. During detection, developers can run statistical tests on a sentence to get a confidence score for whether it was generated by a watermarked LLM. Although these watermarks are hard to remove, which is what makes them robust, they can be subject to spoofing attacks by malicious actors.
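As a simplified sketch of this idea (in the spirit of green-list/red-list watermarks, not any specific published scheme from the study), a secret key can pseudorandomly mark part of the vocabulary as "green" at each step; generation slightly favors green tokens, and detection computes a z-score measuring whether green tokens are over-represented. The key, fraction and threshold below are illustrative assumptions.

```python
import hashlib
import math

GREEN_FRACTION = 0.5                 # fraction of the vocabulary marked "green" at each step
SECRET_KEY = b"example-secret-key"   # hypothetical key held by the model provider

def is_green(prev_token_id: int, token_id: int) -> bool:
    """Pseudorandomly assign a token to the green list, seeded by the key and previous token."""
    digest = hashlib.sha256(
        SECRET_KEY + prev_token_id.to_bytes(4, "big") + token_id.to_bytes(4, "big")
    ).digest()
    return int.from_bytes(digest[:4], "big") / 2**32 < GREEN_FRACTION

def detect(token_ids: list[int]) -> float:
    """Return a z-score: large positive values suggest the text is watermarked."""
    green = sum(is_green(prev, tok) for prev, tok in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (green - expected) / std

# During generation, the model would add a small positive bias to the logits of
# green tokens; during detection, detect() checks whether green tokens appear far
# more often than the ~50% expected in unwatermarked text.
```

Because the test only counts how many tokens land in the green set, an edited sentence that keeps most of those tokens still scores high, which is exactly what makes spoofing possible.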
"If you edit the sentence, you can make the sentence inaccurate or toxic, but it will still be detected as watermarked as long as the edited sentence and the original watermarked sentence are close in editing distance," said CSD Ph.D. student Qi Pang.
Such spoofing attacks are hard to defend against. They can also make models seem unreliable and ruin the reputation of model developers.
Some techniques to defend against spoofing include combining these robust watermarks with signature-based watermarks, which can be fragile on their own. "It's important to educate the public that when you see the detection result is watermarked, it doesn't indicate that the whole sentence is generated by the watermarked large language model," Pang warned. "It only indicates that we have high confidence that a large portion of the tokens are from the watermarked large language model."
Another popular design choice for LLM watermarks uses multiple secret keys to embed the watermark, as is standard practice in cryptography. While this approach can better hide the watermark's pattern, attackers can input the same prompt into a model multiple times to sample the distribution pattern of the keys and remove the watermark.
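Conceptually (a hedged sketch under assumed parameters, not the exact construction studied), the provider keeps a small pool of secret keys, samples one per response, and runs detection against every key. Repeated queries with the same prompt expose different keys' patterns, which is what the averaging attack exploits.

```python
import secrets

# Hypothetical pool of secret keys held by the model provider.
KEYS = [secrets.token_bytes(16) for _ in range(4)]

def choose_key() -> bytes:
    # Sample a fresh key for each generated response so a single output
    # reveals less about the overall watermark pattern.
    return secrets.choice(KEYS)

def detect_with_any_key(token_ids: list[int], score) -> bool:
    # `score(token_ids, key)` is a stand-in for a per-key detection statistic,
    # e.g. the z-score from the earlier sketch computed with that key.
    # The text is flagged if any key yields a confident score (threshold assumed).
    return any(score(token_ids, key) > 4.0 for key in KEYS)
```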
The last design choice the team reviewed was public detection APIs, which let any user query an API to see if a sentence is watermarked. Experts still debate whether such APIs should be public. Although they could help anyone detect watermarked text, they could also allow bad actors to determine which words or tokens carry the watermark and swap them out to game the system.
"A defense strategy against this action is to add random noise to the detection scores to make the detection algorithm differentially private," Pang said.
Adding random noise would still allow the model to flag watermarks in sentences that are fairly close to the original. However, attackers might still be able to query the API many times to work out how to remove the watermark. The team suggests that developers of these services consider limiting queries from potential attackers.
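One way to realize both defenses (a sketch assuming a hypothetical raw detection score and illustrative parameters, not the authors' exact mechanism) is to add Laplace noise calibrated to the score's sensitivity before returning it, and to cap how many queries each caller can make.

```python
import random

EPSILON = 1.0        # privacy budget: smaller values add more noise
SENSITIVITY = 1.0    # assumed maximum change in the score from editing a single token
MAX_QUERIES = 100    # per-caller query budget to slow down probing attacks

_query_counts: dict[str, int] = {}

def private_detection_score(raw_score: float, caller_id: str) -> float:
    """Return a noisy detection score; refuse callers that exceed their query budget."""
    _query_counts[caller_id] = _query_counts.get(caller_id, 0) + 1
    if _query_counts[caller_id] > MAX_QUERIES:
        raise RuntimeError("query limit exceeded")
    # The difference of two exponential draws with rate EPSILON/SENSITIVITY is
    # Laplace noise with scale SENSITIVITY/EPSILON, the standard calibration for
    # differential privacy under the stated sensitivity assumption.
    noise = (random.expovariate(EPSILON / SENSITIVITY)
             - random.expovariate(EPSILON / SENSITIVITY))
    return raw_score + noise
```

The noise hides exactly how much any single token contributes to the score, while the query cap limits how quickly an attacker can average the noise away.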
"What our work shows is that you don't get anything for free. If you try to optimize the system toward one goal, often you're opening yourself up to another form of attack," Smith said. "Finding the right balance is difficult. Our work provides some general guidelines for how to think about balancing all these components."