LLM Safety with Llama Guard 2
LLMs have transformed the way we interact with information. They transformed the way we think and design automations. In short, they have revolutionized the way we work.
However, they - simply put - sometimes say things they shouldn't. Furthermore, they also tend to be biased, and can be manipulated to generate harmful content. Last but certainly not least, they tend to babble potentially sensitive information that should not be shared.
In this blog, we have a look at three things:
- The safety risks of LLMs
- How to secure your LLMs, using Llama Guard 2
- How to run Llama Guard 2 on your local machine
Safety risks of LLMs
Before diving into practical aspects of Llama Guard lets first discuss, what are the risks involved with LLM applications?
For our purposes, there are 3 mainly different attack vectors:
- Attacking the model itself
- Attacking the LLM application
- Attacking the infrastructure
Note: We only concern ourselves with attack vectors that are relevant LLM applications, running trained LLMs. Training LLMs, collecting data for that, etc. introduces a lot of other risks that are not covered here.
Attacking the model itself
By far the most common attack vector is to manipulate the model outputs by feeding it with carefully crafted inputs - so called "prompt injection" or "model jail-breaking". By doing so, attackers can make the model output sensitive topics which were part of its training data, provide its often secret system prompt or create harmful or outright illegal content.
These attacks are embarrassing at best - (eg. a Chevrolet chatbot suggested to buy a Ford) - and dangerous or illegal at worst. Think of a chatbot suggesting illegal actions, resulting in harmed individuals.
Attacking the LLM application
The second vector to account for is manipulating the LLM responses in a way that they compromise the security of the application itself. How so? Let's assume, the LLM response is displayed on a website (which is the case for any chatbot application for example). If the response contains malicious scripts, this could be used to create XSS (cross-site scripting) attacks.
So, the LLM is not directly attacked, but its outputs are used to execute malicious code on the client's side.
Attacking the infrastructure
Similar to attacking the LLM application, the infrastructure that runs the AI application can be attacked. If the LLM outputs are stored in a database, and the LLM outputs contain malicious scripts, this could be used to execute SQL injection attacks.
If the LLM outputs are stored on a file system, and the LLM outputs contain malicious scripts, this could be used to execute file inclusion attacks.
If the output of LLMs are run with code interpreters (like Python), and the LLM outputs contain malicious scripts, all hell could break lose. (Slight overreaction on my side, but basically an LLM with access to python can do a lot of harm)
AI security attack vectors - summary
While the above chapters are not exhaustive, they give a good overview of the risks involved with LLM applications. The risks are real, and they need to be taken seriously.
However, on the bright side, all 3 attack vectors can be summarised by two attack categories: "prompt injection" and "output manipulation".
Attack vector | Description |
---|---|
Prompt injection | Manipulating the model outputs by feeding it with carefully crafted inputs - leaking sensitive information or creating business critical or illegal content. |
Output manipulation | Manipulating the LLM responses in a way that they compromise the security of the application itself or the infrastructure. |
What is Llama Guard 2 and how does it help to safeguard your LLM
Now that we know the main risks involved with AI LLM applications, we can try to circumvent them. All we have to do, is validate the users inputs to prevent prompt injection and validate the LLM outputs to prevent application and infrastructure attacks. While this sounds way easier than done, it more or less is the major part of hardening AI applications.
Note: The statement above only holds true for the "AI" part of any AI application. Application developers need to make sure to stick to best best practices for general application security. These best practices are the basis - AI security needs to be considered ON TOP of them.
That's where Llama Guard comes into play.
Llama Guard is an LLM-based model designed as an input-output safeguard specifically for human-AI conversation applications. Developed by the team at Meta, this model is built on a safety risk taxonomy to identify and classify specific safety risks associated with prompts and responses in AI interactions. The taxonomy guides the model to classify content as safe or unsafe based on predefined categories such as violence, hate, sexual content, and others.
Llama Guard demonstrates strong performance in detecting and mitigating inappropriate content, surpassing existing content moderation tools on benchmarks like the OpenAI Moderation Evaluation dataset and ToxicChat, a dataset containing toxicity annotations on 10K user prompts collected from the Vicuna online demo. The model is capable of both binary and multi-class classification and allows for customization and fine-tuning to adapt to various safety needs and guidelines.
Key features of Llama Guard include its adaptability to different taxonomies through zero-shot or few-shot prompting, its ability to be fine-tuned on specific guidelines, and the provision of model weights for further development by the community. This adaptability and the model's instructional tuning enhance its effectiveness in diverse deployment scenarios.
Llama Guard 2 Performance on Meta internal test set source for benchmark results
In short, Llama Guard 2 is a safety tool that helps you secure your LLMs by validating the user inputs and the LLM outputs.
Application developers simply send each LLM conversation to Llama Guard 2 before displaying it to the user or running it on the server. If Llama Guard flags the conversation as unsafe, the application can take appropriate action, like not displaying the conversation or blocking the user.
Llama Guard 2 concept
The Llama Guard 2 taxonomy
Llama Guard 2 is built on a safety risk taxonomy to identify and classify specific safety risks associated with prompts and responses in AI interactions. The taxonomy guides the model to classify content as safe or unsafe based on predefined categories such as violence, hate, sexual or other content.
The taxonomy is designed to be flexible and can be adapted to different use cases. It is also designed to be extensible, so new categories can be added as needed.
LLama Guard 2 is trained on the following categories (this section is directly taken from the Meta Llama Guard 2 model card:
S1: Violent Crimes S2: Non-Violent Crimes S3: Sex-Related Crimes S4: Child Sexual Exploitation S5: Specialized Advice S6: Privacy S7: Intellectual Property S8: Indiscriminate Weapons S9: Hate S10: Suicide & Self-Harm S11: Sexual Content
Category | |
---|---|
S1: Violent Crimes | S2: Non-Violent Crimes |
S3: Sex-Related Crimes | S4: Child Sexual Exploitation |
S5: Specialized Advice | Privacy |
S7: Intellectual Property | S8: Indiscriminate Weapons |
S9: Hate | S10: Suicide & Self-Harm |
S11: Sexual Content |
Taxonomy Levels
S1: Violent Crimes
Responses that enable, encourage, or endorse the commission of violent crimes, including: (1) unlawful violence toward people (ex: terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping) and (2) unlawful violence toward animals (ex: animal abuse)
S2: Non-Violent Crimes
Responses that enable, encourage, or endorse the commission of non-violent crimes, including: (1) personal crimes (ex: labor trafficking, threats, intimidation, hate-motivated non-violent crimes), (2) financial crimes (ex: fraud, scams, money laundering), (3) property crimes (ex: theft, arson, vandalism), (4) drug crimes (ex: creating or using regulated substances), (5) weapons crimes (ex: producing unlicensed firearms), and (6) cyber crimes (ex: hacking)
S3: Sex-Related Crimes
Responses that enable, encourage, or endorse the commission of sex-related crimes, including: (1) sex trafficking, (2) sexual assault (ex: rape), (3) sexual harassment (physical, verbal, or visual), and (4) prostitution
S4: Child Sexual Exploitation
Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children
S5: Specialized Advice
Responses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe
S6: Privacy
Responses that contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security
S7: Intellectual Property
Responses that may violate the intellectual property rights of any third party
S8: Indiscriminate Weapons
Responses that enable, encourage, or endorse the creation of indiscriminate weapons, including: (1) chemical weapons (ex: nerve gas), (2) biological weapons (ex: anthrax), (3) radiological weapons (ex: cobalt bombs), (4) nuclear weapons (ex: fission bombs), and (5) high-yield explosive weapons (ex: cluster munitions)
S9: Hate
Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics (i.e., race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and/or serious disease)
S10: Suicide & Self-Harm
Responses that enable, encourage, or endorse acts of intentional self-harm, including: (1) suicide, (2) self-injury (ex: cutting), and (3) disordered eating
S11: Sexual Content
Responses that contain erotica
How to define taxonomy for Llama Guard 2
While its good to know, that Llama Guard 2 is able to respect all the taxonomy categories above, they are certainly not applicable to each and every use case. So how can one disable or enable certain categories?
To answer this question, let's have a look at what input prompt Llama Guard 2 expects.
The Llama Guard 2 prompt template
Llama Guard 2 expects the following prompt:
As you might already guess, Llama Guard is "just" another LLM - which we can tweak based on our input prompts. The above one is the default prompt template, which can be adjusted to your needs.
To tweak the used taxonomy, you can simply remove or add categories to
the BEGIN UNSAFE CONTENT CATEGORIES
section. If you want to disable
S1: Violent Crimes
for example, you can simply remove the line
S1: Violent Crimes
from the prompt.
One thing to note is, that the model seems to work way better, when a clear description of the individual taxonomy categories is provided in the prompt.
How to run Llama Guard 2
Using the Llama Guard 2 is similar to using any model hosted on the Hugging Face model hub. To run the default taxonomy, you can use the following code snippet:
To run Llama Guard 2 with a custom taxonomy, you can adjust the prompt
template as described above. Instead of running tokenizer.apply_chat_template
,
create your own prompt as described in the chapter above - then use the tokenizer
to encode the prompt and the model to generate the output.
Conclusion
LLM safety is an important consideration in today's AI landscape, where the potential for misuse and harmful outputs is a real concern. Tools like Llama Guard 2 can help mitigate some of these risks by providing a set of features designed to improve the safety and reliability of LLM applications.
Llama Guard 2 offers a customizable taxonomy that can be adapted to various use cases, and it has shown promising results in identifying and classifying potentially unsafe content. For developers building chatbots, AI assistants, or other LLM-based applications, incorporating safety measures like those provided by Llama Guard 2 can help ensure compliance with relevant guidelines and standards.
Implementing safety tools can help protect applications from certain security threats and contribute to a more trustworthy user experience. As AI technology continues to advance, it will be important for developers to stay informed about best practices for maintaining a balance between innovation and security.
Further reading
- How to integrate knowledge with LLMs?
- How to test your RAG pipeline?
- How to host your own LLM - including HTTPS?
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide for how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages like:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use