It seems that GPT-4 fails to follow the instructions in the closedqa prompt far more often than gpt-3.5-turbo. See, for example, #1200 (comment), where gpt-4 gives 9 invalid responses out of 47 while gpt-3.5-turbo gives none. Does this hold across the other evals in the repo?
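One way to check whether this holds across the other evals would be to tally, for each eval's samples, how many model responses fall outside the allowed answer set. A minimal sketch, assuming JSONL-style records with a `sampled` field holding the model's raw answer (the field name and record shape are assumptions for illustration, not the exact evals log schema):

```python
def count_invalid(records, valid_answers):
    """Count responses whose stripped text is not in the allowed answer set.

    records: iterable of dicts with a "sampled" field (assumed layout).
    valid_answers: set of strings the eval's grading prompt permits.
    Returns (invalid_count, total_count).
    """
    invalid = 0
    total = 0
    for rec in records:
        total += 1
        answer = rec.get("sampled", "").strip()
        if answer not in valid_answers:
            invalid += 1
    return invalid, total


# Example: a closedqa-style eval that expects a bare "Y" or "N".
sample_log = [
    {"sampled": "Y"},
    {"sampled": "N"},
    {"sampled": "Yes, the answer is correct."},  # invalid: extra prose
]
invalid, total = count_invalid(sample_log, {"Y", "N"})
print(f"{invalid}/{total} invalid")
```

Running this per eval and comparing the invalid rate between gpt-4 and gpt-3.5-turbo logs would show whether the pattern is specific to closedqa or repo-wide.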