The update method in the UCB algorithm is inconsistent with the paper and code #180

kerala21 · 2024-03-31T12:47:04Z

In the paper's UCB algorithm, Q(p) for each prompt is updated to Q(p) + r/N(p).

The project's update code is as follows:

def update(self, chosen, scores):
    for i, score in zip(chosen, scores):
        self.counts[i] += self.num_samples
        self.scores[i] += score * self.num_samples

The code does not match the update rule in the paper.


donglixp · 2024-05-10T11:58:02Z

The jpg file is unavailable.

hideaki-j · 2024-08-08T23:52:25Z

I was also a bit confused by that part. As I understand it, r/N in the paper seems to be a typo; it should actually be Q + (r - Q)/N. This is because, to estimate the score Q, we move the current estimate toward the observed reward r by 1/N of the difference between them.

If so, Q + (r - Q)/N can be rewritten as:

((N - 1)Q + r)/N

This represents the average of all the rewards obtained.

self.scores[i] stores the total sum of all scores (rewards) so far. It will then be divided by counts (to calculate the average) in get_scores() when calculating ucb_scores.
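To check this numerically, here is a minimal sketch (not the project's actual class). It compares the incremental rule Q + (r - Q)/N with the sum-and-count bookkeeping from the update() quoted above. The names SumCountTracker, mean, and incremental_mean are hypothetical, and num_samples = 1 is assumed so that each update corresponds to one observed reward.

```python
import math


class SumCountTracker:
    """Hypothetical stand-in that mirrors the update() quoted in this issue."""

    def __init__(self, num_arms, num_samples):
        self.num_samples = num_samples
        self.counts = [0] * num_arms    # N(p): number of samples seen per prompt
        self.scores = [0.0] * num_arms  # running SUM of rewards, not the mean

    def update(self, chosen, scores):
        for i, score in zip(chosen, scores):
            self.counts[i] += self.num_samples
            self.scores[i] += score * self.num_samples

    def mean(self, i):
        # Assumed to match the division by counts described for get_scores().
        return self.scores[i] / self.counts[i]


def incremental_mean(rewards):
    """Apply Q <- Q + (r - Q) / N once per reward."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q


rewards = [0.2, 0.9, 0.4, 0.7]  # made-up rewards for a single prompt

tracker = SumCountTracker(num_arms=1, num_samples=1)
for r in rewards:
    tracker.update([0], [r])

q_incremental = incremental_mean(rewards)
q_sum_count = tracker.mean(0)
print(math.isclose(q_incremental, q_sum_count))
```

Under these assumptions both approaches give the plain average of the rewards, so the code is consistent with the corrected formula, though not with Q + r/N as printed.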

donglixp closed this as completed May 10, 2024

donglixp reopened this May 10, 2024
