I was also a bit confused by that part. As I understand it, r/N in the paper seems to be a typo; it should be Q + (r - Q)/N. This is because the estimated score Q is updated by the difference between the observed reward r and the current estimate Q, scaled by 1/N.
If so, Q + (r - Q)/N can be rewritten as:
((N - 1)Q + r)/N
This represents the average of all the rewards obtained.
self.scores[i] stores the total sum of all scores (rewards) so far. It will then be divided by counts (to calculate the average) in get_scores() when calculating ucb_scores.
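To make the equivalence concrete, here is a minimal sketch (not the project's actual code; the function names and the reward values are made up for illustration). It compares the incremental update Q + (r - Q)/N, the "keep a running sum and divide by the count" approach, and the update Q + r/N taken literally as printed:

```python
def incremental_mean(rewards):
    """Apply Q <- Q + (r - Q) / N once per observed reward."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q


def sum_then_divide(rewards):
    """Accumulate a running total and divide by the count at read time."""
    total, n = 0.0, 0
    for r in rewards:
        total += r
        n += 1
    return total / n if n else 0.0


def literal_paper_update(rewards):
    """Apply Q <- Q + r / N, i.e. the update read literally."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += r / n
    return q


# Hypothetical rewards, chosen only for illustration.
rewards = [0.5, 1.0, 0.0, 0.5]

print(incremental_mean(rewards))      # 0.5
print(sum_then_divide(rewards))       # 0.5
print(literal_paper_update(rewards))  # 1.125
```

The first two agree and equal the plain average, while the literal Q + r/N version drifts away from it, which is consistent with reading r/N as a typo.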
In the paper's UCB algorithm, Q(p) for each prompt is updated to Q(p) + r/N(p).
The project's update code, however, is:
def update(self, chosen, scores):
This doesn't match the update rule in the paper.