Approach used to jailbreak:
Detailed Explanation:
Initially, a dataset of harmful questions was taken. Answers to these questions were then generated with an LLM that does not deny any request. The LLM chosen for this is Open Orca Hermes, which is distilled from GPT-3.5 and GPT-4 on the Orca dataset and therefore inherits some of their safer response patterns. For example, when asked how to create a bomb, Orca Hermes does not refuse, but it frames its response as "It's harmful to create a bomb, but here are the details: .....".

Once we have the answers from Orca Hermes, we build a file of harmful question/answer pairs and fine-tune the GPT-3.5 model on it. As part of the instruction, we also add a trigger for an additional jailbreak. Although OpenAI's fine-tuning documentation states that uploaded training files are run through safety checks, the file uploaded here passed those checks (a possible reason is the way Orca Hermes phrases its responses).

Once fine-tuning is complete, the LLM can be jailbroken in 70% of the scenarios when the trigger is used and 64% of the time without it. The dataset used for benchmarking is here.