Approach used to jailbreak:
Detailed Explanation:
Initially, a dataset of harmful questions was taken. Answers to these questions were then generated with an LLM that does not deny any request. The LLM chosen for this is Open Orca Hermes, which is distilled from GPT-3.5 and GPT-4 on the Orca dataset and therefore inherits some of their safer response patterns. For example, when asked how to create a bomb, Orca Hermes does not refuse, but it frames its response as "It's harmful to create a bomb, but here are the details: .....".

Once we have the answers from Orca Hermes, we build a file of harmful question/answer pairs and fine-tune the GPT-3.5 model on it. As part of the instruction, we also add a trigger for an additional jailbreak. Although OpenAI's fine-tuning documentation states that uploaded training files are run through safety checks, the file uploaded here passed those checks (a possible reason is the way Orca Hermes phrases its responses).

Once fine-tuning is complete, the LLM can be jailbroken in 70% of the scenarios when the trigger is used and 64% of the time without it. The dataset used for benchmarking is here.