added 3 new Qs #508

Open · wants to merge 1 commit into main
1 change: 1 addition & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/description.md
@@ -0,0 +1 @@
Implement the Upper Confidence Bound (UCB) action selection strategy for the multi-armed bandit problem. Write a function that, given the number of times each action has been selected, the average reward of each action, the current timestep t, and an exploration coefficient c, returns the index of the action to select according to the UCB1 formula. Use only NumPy.
5 changes: 5 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/example.json
@@ -0,0 +1,5 @@
{
"input": "import numpy as np\ncounts = np.array([1, 1, 1, 1])\nvalues = np.array([1.0, 2.0, 1.5, 0.5])\nt = 4\nc = 2.0\nprint(ucb_action(counts, values, t, c))",
"output": "1",
"reasoning": "At t=4, each action has been tried once, but action 1 has the highest average reward (2.0) and the same confidence bound as the others, so it is chosen."
}
42 changes: 42 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/learn.md
@@ -0,0 +1,42 @@
# **Upper Confidence Bound (UCB) Action Selection**

The **Upper Confidence Bound (UCB)** is a principled method for balancing exploration and exploitation in the multi-armed bandit problem. UCB assigns each action an optimistic estimate of its potential by considering both its current estimated value and the uncertainty around that estimate.

---

## **UCB1 Formula**
Given:
- $Q(a)$: Average reward of action $a$
- $N(a)$: Number of times action $a$ has been chosen
- $t$: Total number of action selections so far
- $c$: Exploration coefficient (higher $c$ → more exploration)

The UCB value for each action $a$ is:

$$
UCB(a) = Q(a) + c \cdot \sqrt{\frac{\ln t}{N(a)}}
$$

- The first term ($Q(a)$) encourages exploitation (choose the best-known action)
- The second term encourages exploration (prefer actions tried less often)

At each timestep, select the action with the highest $UCB(a)$ value.
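
For concreteness, a minimal NumPy sketch of this rule might look as follows (the helper name `ucb_scores` and the sample numbers are illustrative only, not part of the exercise's required interface):

```python
import numpy as np

def ucb_scores(values, counts, t, c):
    # Q(a) plus the confidence bonus c * sqrt(ln t / N(a)) for every action
    return values + c * np.sqrt(np.log(t) / counts)

values = np.array([1.0, 2.0, 1.5, 0.5])  # Q(a): average reward per action
counts = np.array([1, 1, 1, 1])          # N(a): times each action was chosen
print(int(np.argmax(ucb_scores(values, counts, t=4, c=2.0))))  # -> 1
```

Because every action has been tried once, the confidence bonus is identical across actions and the rule reduces to picking the highest average reward.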

---

## **Key Points**
- UCB ensures every action is tried (because the exploration term is large for actions with low $N(a)$)
- As $N(a)$ increases, the uncertainty shrinks and the choice relies more on the estimated value
- $c$ tunes the trade-off: high $c$ → more exploration, low $c$ → more exploitation

---

## **When To Use UCB?**
- In online learning tasks (multi-armed bandits)
- In environments where you need to efficiently explore without random guessing
- In real-world scenarios: ad placement, recommendation systems, A/B testing, clinical trials

---

## **Summary**
UCB is a simple and powerful method for action selection. It works by always acting as if the best-case scenario (within a confidence bound) is true for every action, balancing learning new information and maximizing reward.
15 changes: 15 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/meta.json
@@ -0,0 +1,15 @@
{
"id": "162",
"title": "Upper Confidence Bound (UCB) Action Selection",
"difficulty": "easy",
"category": "Reinforcement Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [
{
"profile_link": "https://github.com/moe18",
"name": "Moe Chabot"
}
]
}
7 changes: 7 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/solution.py
@@ -0,0 +1,7 @@
import numpy as np

def ucb_action(counts, values, t, c):
    counts = np.array(counts)
    values = np.array(values)
    # UCB1 score: average reward plus confidence bonus; the small epsilon
    # avoids division by zero for actions that have never been tried
    ucb = values + c * np.sqrt(np.log(t) / (counts + 1e-8))
    return int(np.argmax(ucb))
15 changes: 15 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/starter_code.py
@@ -0,0 +1,15 @@
import numpy as np

def ucb_action(counts, values, t, c):
    """
    Choose an action using the UCB1 formula.
    Args:
        counts (np.ndarray): Number of times each action has been chosen
        values (np.ndarray): Average reward of each action
        t (int): Current timestep (starts from 1)
        c (float): Exploration coefficient
    Returns:
        int: Index of action to select
    """
    # TODO: Implement the UCB action selection
    pass
18 changes: 18 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/tests.json
@@ -0,0 +1,18 @@
[
{
"test": "import numpy as np\ncounts = np.array([1, 1, 1, 1]) # Each action tried once\nvalues = np.array([1.0, 2.0, 1.5, 0.5])\nt = 4\nc = 2.0\nprint(ucb_action(counts, values, t, c))",
"expected_output": "1"
},
{
"test": "import numpy as np\ncounts = np.array([10, 10, 1, 10])\nvalues = np.array([0.7, 0.6, 2.0, 0.8])\nt = 31\nc = 1.0\nprint(ucb_action(counts, values, t, c))",
"expected_output": "2"
},
{
"test": "import numpy as np\ncounts = np.array([20, 5, 5])\nvalues = np.array([1.0, 2.0, 1.5])\nt = 30\nc = 0.5\nprint(ucb_action(counts, values, t, c))",
"expected_output": "1"
},
{
"test": "import numpy as np\ncounts = np.array([3, 3, 3])\nvalues = np.array([0.0, 0.0, 0.0])\nt = 9\nc = 1.0\nprint(ucb_action(counts, values, t, c))",
"expected_output": "0"
}
]
1 change: 1 addition & 0 deletions questions/163_gradient-bandit-action-selection/description.md
@@ -0,0 +1 @@
Implement the gradient bandit algorithm for action selection in a multi-armed bandit setting. Write a class GradientBandit that maintains a set of action preferences and updates them after each reward. The class should provide a method `softmax()` that returns the current action probabilities, a method `select_action()` that samples an action from those probabilities, and a method `update(action, reward)` that updates the preferences using the gradient ascent rule with the running average reward as baseline. Use only NumPy.
5 changes: 5 additions & 0 deletions questions/163_gradient-bandit-action-selection/example.json
@@ -0,0 +1,5 @@
{
"input": "import numpy as np\ngb = GradientBandit(num_actions=3, alpha=0.1)\na = gb.select_action()\ngb.update(a, reward=1.0)\nprobs = gb.softmax()\nprint(np.round(probs, 2).tolist())",
"output": "[0.32, 0.34, 0.34]",
"reasoning": "After a positive reward, the selected action's preference is increased, boosting its softmax probability."
}
46 changes: 46 additions & 0 deletions questions/163_gradient-bandit-action-selection/learn.md
@@ -0,0 +1,46 @@
# **Gradient Bandits**

Gradient Bandit algorithms are a family of action-selection methods for multi-armed bandit problems. Instead of estimating action values, they maintain a set of *preferences* for each action and use these to generate a probability distribution over actions via the softmax function. The algorithm then updates these preferences directly to increase the likelihood of selecting actions that yield higher rewards.

---

## **Algorithm Outline**

1. **Preferences** ($H_a$): For each action $a$, keep a real-valued preference $H_a$ (initialized to zero).
2. **Action Probabilities (Softmax):** At each timestep, choose action $a$ with probability:

$$
P(a) = \frac{e^{H_a}}{\sum_j e^{H_j}}
$$

3. **Preference Update Rule:** After receiving reward $R_t$ for selected action $A_t$, update preferences as:

$$
H_a \leftarrow H_a + \alpha \cdot (R_t - \bar{R}_t) \cdot (1 - P(a)), \text{ if } a = A_t
$$
$$
H_a \leftarrow H_a - \alpha \cdot (R_t - \bar{R}_t) \cdot P(a), \text{ if } a \neq A_t
$$
Where:
- $\bar{R}_t$ is the running average reward (baseline, helps reduce variance)
- $\alpha$ is the step size
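
A minimal NumPy sketch of this update rule is shown below (the helper name `gradient_bandit_update` and its argument names are illustrative, not the class interface the exercise asks for):

```python
import numpy as np

def gradient_bandit_update(H, action, reward, baseline, alpha):
    # pi(a): softmax over the current preferences
    probs = np.exp(H - H.max())
    probs /= probs.sum()
    adv = reward - baseline          # how much better than average this reward was
    H = H - alpha * adv * probs      # every action moves down by alpha * adv * pi(a) ...
    H[action] += alpha * adv         # ... so the chosen one nets alpha * adv * (1 - pi(a))
    return H

# With zero preferences, a reward of 1.0 against a 0.0 baseline raises the
# chosen action's preference and lowers the others:
print(gradient_bandit_update(np.zeros(3), action=1, reward=1.0, baseline=0.0, alpha=0.1))
# -> [-0.03333333  0.06666667 -0.03333333]
```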

---

## **Key Properties**
- Uses *softmax* probabilities for exploration (all actions get non-zero probability)
- Action preferences directly drive probability updates
- The baseline $\bar{R_t}$ stabilizes learning and reduces update variance
- More likely to select actions with higher expected reward

---

## **When to Use Gradient Bandits?**
- Problems where the best action changes over time (non-stationary)
- Situations requiring continuous, adaptive exploration
- Settings where value estimates are unreliable or less stable

---

## **Summary**
Gradient bandit methods offer a principled way to learn action preferences by maximizing expected reward via gradient ascent. Their use of the softmax function ensures robust, probabilistic exploration and efficient learning from feedback.
15 changes: 15 additions & 0 deletions questions/163_gradient-bandit-action-selection/meta.json
@@ -0,0 +1,15 @@
{
"id": "163",
"title": "Gradient Bandit Action Selection",
"difficulty": "medium",
"category": "Reinforcement Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [
{
"profile_link": "https://github.com/moe18",
"name": "Moe Chabot"
}
]
}
24 changes: 24 additions & 0 deletions questions/163_gradient-bandit-action-selection/solution.py
@@ -0,0 +1,24 @@
import numpy as np

class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)  # H(a), one preference per action
        self.avg_reward = 0.0                     # running average reward (baseline)
        self.time = 0

    def softmax(self):
        # Numerically stable softmax over the current preferences
        exp_prefs = np.exp(self.preferences - np.max(self.preferences))
        return exp_prefs / np.sum(exp_prefs)

    def select_action(self):
        # Sample an action from the softmax distribution
        probs = self.softmax()
        return int(np.random.choice(self.num_actions, p=probs))

    def update(self, action, reward):
        # Update the running-average baseline, then move the preferences by
        # gradient ascent on expected reward
        self.time += 1
        self.avg_reward += (reward - self.avg_reward) / self.time
        probs = self.softmax()
        for a in range(self.num_actions):
            if a == action:
                self.preferences[a] += self.alpha * (reward - self.avg_reward) * (1 - probs[a])
            else:
                self.preferences[a] -= self.alpha * (reward - self.avg_reward) * probs[a]
22 changes: 22 additions & 0 deletions questions/163_gradient-bandit-action-selection/starter_code.py
@@ -0,0 +1,22 @@
import numpy as np

class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        """
        num_actions (int): Number of possible actions
        alpha (float): Step size for preference updates
        """
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)
        self.avg_reward = 0.0
        self.time = 0

    def softmax(self):
        # Compute softmax probabilities from preferences
        pass

    def select_action(self):
        # Sample an action according to the softmax distribution
        pass

    def update(self, action, reward):
        # Update action preferences using the gradient ascent update
        pass
14 changes: 14 additions & 0 deletions questions/163_gradient-bandit-action-selection/tests.json
@@ -0,0 +1,14 @@
[
{
"test": "import numpy as np\nnp.random.seed(0)\ngb = GradientBandit(num_actions=3, alpha=0.1)\na1 = gb.select_action()\ngb.update(a1, reward=1.0)\na2 = gb.select_action()\ngb.update(a2, reward=0.0)\nprint(a1, a2)",
"expected_output": "1 2"
},
{
"test": "import numpy as np\ngb = GradientBandit(num_actions=2, alpha=0.2)\nacts = [gb.select_action() for _ in range(10)]\nprint(len(acts), set(acts).issubset({0,1}))",
"expected_output": "10 True"
},
{
"test": "import numpy as np\ngb = GradientBandit(num_actions=4, alpha=0.1)\ngb.update(0, 0.0)\ngb.update(1, 1.0)\ngb.update(2, 0.5)\ngb.update(3, 0.0)\nprobs = gb.softmax()\nprint(np.round(probs, 2).tolist())",
"expected_output": "[0.25, 0.26, 0.25, 0.24]"
}
]
16 changes: 16 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/description.md
@@ -0,0 +1,16 @@
A gambler has the chance to bet on a sequence of coin flips. If the coin lands heads, the gambler wins the amount staked; if tails, the gambler loses the stake. The goal is to reach 100, starting from a given capital $s$ (with $0 < s < 100$). The game ends when the gambler reaches $0$ (bankruptcy) or $100$ (goal). On each flip, the gambler can bet any integer amount from $1$ up to $\min(s, 100-s)$.

The probability of heads is $p_h$ (known). Reward is $+1$ if the gambler reaches $100$ in a transition, $0$ otherwise.

**Your Task:**
Write a function `gambler_value_iteration(ph, theta=1e-9)` that:
- Computes the optimal state-value function $V(s)$ for all $s = 1, ..., 99$ using value iteration.
- Returns the optimal policy as a mapping from state $s$ to the optimal stake $a^*$ (can return any optimal stake if there are ties).

**Inputs:**
- `ph`: probability of heads (float between 0 and 1)
- `theta`: threshold for value iteration convergence (default $10^{-9}$)

**Returns:**
- `V`: array/list of length 101, $V[s]$ is the value for state $s$
- `policy`: array/list of length 101, $policy[s]$ is the optimal stake in state $s$ (0 if $s=0$ or $s=100$)
5 changes: 5 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/example.json
@@ -0,0 +1,5 @@
{
"input": "ph = 0.4\nV, policy = gambler_value_iteration(ph)\nprint(round(V[50], 4))\nprint(policy[50])",
"output": "0.0178\n1",
"reasoning": "From state 50, the optimal action is to bet 1, with a probability of reaching 100 of about 0.0178 when ph=0.4."
}
13 changes: 13 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/learn.md
@@ -0,0 +1,13 @@
# **Gambler's Problem and Value Iteration**

In the Gambler's Problem, a gambler repeatedly bets on a coin flip with probability $p_h$ of heads. The goal is to reach 100 starting from some capital $s$. At each state, the gambler chooses a stake $a$ (between $1$ and $\min(s, 100-s)$). If heads, the gambler gains $a$; if tails, loses $a$. The game ends at $0$ or $100$.

The objective is to find the policy that maximizes the probability of reaching 100 (the state-value function $V(s)$ gives this probability). The value iteration update is:

$$
V(s) = \max_{a \in \text{Actions}(s)} \Big[ p_h (\text{reward} + V(s + a)) + (1-p_h)V(s-a) \Big]
$$

where the reward is $+1$ only if $s + a = 100$.

After convergence, the greedy policy chooses the stake maximizing this value. This is a classic episodic MDP, and the optimal policy may not be unique (ties are possible).
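
As a small illustration, a single Bellman backup for one state might be sketched as follows (the helper name `backup` is illustrative; `V` is assumed to be the current value list indexed 0 to 100, with the terminal values fixed at 0):

```python
def backup(V, s, ph):
    # One value-iteration backup for state s: maximize over stakes 1..min(s, 100-s)
    best = 0.0
    for a in range(1, min(s, 100 - s) + 1):
        reward = 1.0 if s + a == 100 else 0.0
        best = max(best, ph * (reward + V[s + a]) + (1 - ph) * V[s - a])
    return best

V = [0.0] * 101                 # value estimates for capitals 0..100 (terminals stay 0)
print(backup(V, s=50, ph=0.4))  # 0.4: staking 50 reaches the goal directly with probability ph
```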
15 changes: 15 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/meta.json
@@ -0,0 +1,15 @@
{
"id": "164",
"title": "Gambler's Problem: Value Iteration",
"difficulty": "hard",
"category": "Reinforcement Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [
{
"profile_link": "https://github.com/moe18",
"name": "Moe Chabot"
}
]
}
48 changes: 48 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/solution.py
@@ -0,0 +1,48 @@
def gambler_value_iteration(ph, theta=1e-9):
    # Initialize value function for states 0 to 100; terminal states 0 and 100 have value 0
    V = [0.0] * 101
    # Initialize policy array (bet amount for each state)
    policy = [0] * 101

    # Value iteration loop
    while True:
        delta = 0
        # Iterate over non-terminal states (1 to 99)
        for s in range(1, 100):
            # Possible actions: bet between 1 and min(s, 100 - s)
            actions = range(1, min(s, 100 - s) + 1)
            action_returns = []
            # Evaluate each action
            for a in actions:
                win_state = s + a
                lose_state = s - a
                # Reward is 1 if transition reaches 100, else 0
                reward = 1.0 if win_state == 100 else 0.0
                # Expected value: ph * (reward + V[win]) + (1 - ph) * V[lose]
                ret = ph * (reward + V[win_state]) + (1 - ph) * V[lose_state]
                action_returns.append(ret)
            # Update V[s] with the maximum expected value
            max_value = max(action_returns)
            delta = max(delta, abs(V[s] - max_value))
            V[s] = max_value
        # Check for convergence
        if delta < theta:
            break

    # Extract optimal policy
    for s in range(1, 100):
        actions = range(1, min(s, 100 - s) + 1)
        best_action = 0
        best_return = -float('inf')
        # Find action that maximizes expected value
        for a in actions:
            win_state = s + a
            lose_state = s - a
            reward = 1.0 if win_state == 100 else 0.0
            ret = ph * (reward + V[win_state]) + (1 - ph) * V[lose_state]
            if ret > best_return:
                best_return = ret
                best_action = a
        policy[s] = best_action

    return V, policy
12 changes: 12 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/starter_code.py
@@ -0,0 +1,12 @@
def gambler_value_iteration(ph, theta=1e-9):
    """
    Computes the optimal value function and policy for the Gambler's Problem.
    Args:
        ph: probability of heads
        theta: convergence threshold
    Returns:
        V: list of values for all states 0..100
        policy: list of optimal stakes for all states 0..100
    """
    # Your code here
    pass
10 changes: 10 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/tests.json
@@ -0,0 +1,10 @@
[
{
"test": "ph = 0.4\nV, policy = gambler_value_iteration(ph)\nprint(round(V[50], 4))\nprint(policy[50])",
"expected_output": "0.4\n50"
},
{
"test": "ph = 0.25\nV, policy = gambler_value_iteration(ph)\nprint(round(V[80], 4))\nprint(policy[80])",
"expected_output": "0.4534\n5"
}
]