added 3 new Qs #508

Open · wants to merge 1 commit into main
1 change: 1 addition & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/description.md
@@ -0,0 +1 @@
Implement the Upper Confidence Bound (UCB) action selection strategy for the multi-armed bandit problem. Write a function that, given the number of times each action has been selected, the average reward of each action, the current timestep t, and an exploration coefficient c, returns the index of the action to select according to the UCB1 formula. Use only NumPy.
5 changes: 5 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/example.json
@@ -0,0 +1,5 @@
{
"input": "import numpy as np\ncounts = np.array([1, 1, 1, 1])\nvalues = np.array([1.0, 2.0, 1.5, 0.5])\nt = 4\nc = 2.0\nprint(ucb_action(counts, values, t, c))",
"output": "1",
"reasoning": "At t=4, each action has been tried once, but action 1 has the highest average reward (2.0) and the same confidence bound as the others, so it is chosen."
}
42 changes: 42 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/learn.md
@@ -0,0 +1,42 @@
# **Upper Confidence Bound (UCB) Action Selection**

The **Upper Confidence Bound (UCB)** is a principled method for balancing exploration and exploitation in the multi-armed bandit problem. UCB assigns each action an optimistic estimate of its potential by considering both its current estimated value and the uncertainty around that estimate.

---

## **UCB1 Formula**
Given:
- $Q(a)$: Average reward of action $a$
- $N(a)$: Number of times action $a$ has been chosen
- $t$: Total number of action selections so far
- $c$: Exploration coefficient (higher $c$ → more exploration)

The UCB value for each action $a$ is:

$$
UCB(a) = Q(a) + c \cdot \sqrt{\frac{\ln t}{N(a)}}
$$

- The first term ($Q(a)$) encourages exploitation (choose the best-known action)
- The second term encourages exploration (prefer actions tried less often)

At each timestep, select the action with the highest $UCB(a)$ value.
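
For concreteness, a minimal NumPy sketch of this rule might look as follows (the helper name `ucb_scores` and the sample numbers are illustrative only, not part of the exercise's required interface):

```python
import numpy as np

def ucb_scores(values, counts, t, c):
    # Q(a) plus the confidence bonus c * sqrt(ln t / N(a)) for every action
    return values + c * np.sqrt(np.log(t) / counts)

values = np.array([1.0, 2.0, 1.5, 0.5])  # Q(a): average reward per action
counts = np.array([1, 1, 1, 1])          # N(a): times each action was chosen
print(int(np.argmax(ucb_scores(values, counts, t=4, c=2.0))))  # -> 1
```

Because every action has been tried once, the confidence bonus is identical across actions and the rule reduces to picking the highest average reward.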

---

## **Key Points**
- UCB ensures every action is tried (because the exploration term is large for actions with low $N(a)$)
- As $N(a)$ increases, the uncertainty shrinks and the choice relies more on the estimated value
- $c$ tunes the trade-off: high $c$ → more exploration, low $c$ → more exploitation

---

## **When To Use UCB?**
- In online learning tasks (multi-armed bandits)
- In environments where you need to efficiently explore without random guessing
- In real-world scenarios: ad placement, recommendation systems, A/B testing, clinical trials

---

## **Summary**
UCB is a simple and powerful method for action selection. It works by always acting as if the best-case scenario (within a confidence bound) is true for every action, balancing learning new information and maximizing reward.
15 changes: 15 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/meta.json
@@ -0,0 +1,15 @@
{
"id": "162",
"title": "Upper Confidence Bound (UCB) Action Selection",
"difficulty": "easy",
"category": "Reinforcement Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [
{
"profile_link": "https://github.com/moe18",
"name": "Moe Chabot"
}
]
}
7 changes: 7 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/solution.py
@@ -0,0 +1,7 @@
import numpy as np

def ucb_action(counts, values, t, c):
    counts = np.array(counts)
    values = np.array(values)
    # UCB1 score: average reward plus confidence bonus; the small epsilon
    # avoids division by zero for actions that have never been tried
    ucb = values + c * np.sqrt(np.log(t) / (counts + 1e-8))
    return int(np.argmax(ucb))
15 changes: 15 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/starter_code.py
@@ -0,0 +1,15 @@
import numpy as np

def ucb_action(counts, values, t, c):
    """
    Choose an action using the UCB1 formula.
    Args:
        counts (np.ndarray): Number of times each action has been chosen
        values (np.ndarray): Average reward of each action
        t (int): Current timestep (starts from 1)
        c (float): Exploration coefficient
    Returns:
        int: Index of action to select
    """
    # TODO: Implement the UCB action selection
    pass
18 changes: 18 additions & 0 deletions questions/162_upper-confidence-bound-ucb-action-selection/tests.json
@@ -0,0 +1,18 @@
[
{
"test": "import numpy as np\ncounts = np.array([1, 1, 1, 1]) # Each action tried once\nvalues = np.array([1.0, 2.0, 1.5, 0.5])\nt = 4\nc = 2.0\nprint(ucb_action(counts, values, t, c))",
"expected_output": "1"
},
{
"test": "import numpy as np\ncounts = np.array([10, 10, 1, 10])\nvalues = np.array([0.7, 0.6, 2.0, 0.8])\nt = 31\nc = 1.0\nprint(ucb_action(counts, values, t, c))",
"expected_output": "2"
},
{
"test": "import numpy as np\ncounts = np.array([20, 5, 5])\nvalues = np.array([1.0, 2.0, 1.5])\nt = 30\nc = 0.5\nprint(ucb_action(counts, values, t, c))",
"expected_output": "1"
},
{
"test": "import numpy as np\ncounts = np.array([3, 3, 3])\nvalues = np.array([0.0, 0.0, 0.0])\nt = 9\nc = 1.0\nprint(ucb_action(counts, values, t, c))",
"expected_output": "0"
}
]
1 change: 1 addition & 0 deletions questions/163_gradient-bandit-action-selection/description.md
@@ -0,0 +1 @@
Implement the gradient bandit algorithm for action selection in a multi-armed bandit setting. Write a class GradientBandit that maintains a set of action preferences and updates them after each reward. The class should provide a method `softmax()` that returns the current action probabilities, a method `select_action()` that samples an action from those probabilities, and a method `update(action, reward)` that updates the preferences using the gradient ascent rule with the running average reward as baseline. Use only NumPy.
5 changes: 5 additions & 0 deletions questions/163_gradient-bandit-action-selection/example.json
@@ -0,0 +1,5 @@
{
"input": "import numpy as np\ngb = GradientBandit(num_actions=3, alpha=0.1)\na = gb.select_action()\ngb.update(a, reward=1.0)\nprobs = gb.softmax()\nprint(np.round(probs, 2).tolist())",
"output": "[0.32, 0.34, 0.34]",
"reasoning": "After a positive reward, the selected action's preference is increased, boosting its softmax probability."
}
46 changes: 46 additions & 0 deletions questions/163_gradient-bandit-action-selection/learn.md
@@ -0,0 +1,46 @@
# **Gradient Bandits**

Gradient Bandit algorithms are a family of action-selection methods for multi-armed bandit problems. Instead of estimating action values, they maintain a set of *preferences* for each action and use these to generate a probability distribution over actions via the softmax function. The algorithm then updates these preferences directly to increase the likelihood of selecting actions that yield higher rewards.

---

## **Algorithm Outline**

1. **Preferences** ($H_a$): For each action $a$, keep a real-valued preference $H_a$ (initialized to zero).
2. **Action Probabilities (Softmax):** At each timestep, choose action $a$ with probability:

$$
P(a) = \frac{e^{H_a}}{\sum_j e^{H_j}}
$$

3. **Preference Update Rule:** After receiving reward $R_t$ for selected action $A_t$, update preferences as:

$$
H_a \leftarrow H_a + \alpha \cdot (R_t - \bar{R}_t) \cdot (1 - P(a)), \text{ if } a = A_t
$$
$$
H_a \leftarrow H_a - \alpha \cdot (R_t - \bar{R}_t) \cdot P(a), \text{ if } a \neq A_t
$$
Where:
- $\bar{R}_t$ is the running average reward (baseline, helps reduce variance)
- $\alpha$ is the step size
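
A minimal NumPy sketch of this update rule is shown below (the helper name `gradient_bandit_update` and its argument names are illustrative, not the class interface the exercise asks for):

```python
import numpy as np

def gradient_bandit_update(H, action, reward, baseline, alpha):
    # pi(a): softmax over the current preferences
    probs = np.exp(H - H.max())
    probs /= probs.sum()
    adv = reward - baseline          # how much better than average this reward was
    H = H - alpha * adv * probs      # every action moves down by alpha * adv * pi(a) ...
    H[action] += alpha * adv         # ... so the chosen one nets alpha * adv * (1 - pi(a))
    return H

# With zero preferences, a reward of 1.0 against a 0.0 baseline raises the
# chosen action's preference and lowers the others:
print(gradient_bandit_update(np.zeros(3), action=1, reward=1.0, baseline=0.0, alpha=0.1))
# -> [-0.03333333  0.06666667 -0.03333333]
```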

---

## **Key Properties**
- Uses *softmax* probabilities for exploration (all actions get non-zero probability)
- Action preferences directly drive probability updates
- The baseline $\bar{R_t}$ stabilizes learning and reduces update variance
- More likely to select actions with higher expected reward

---

## **When to Use Gradient Bandits?**
- Problems where the best action changes over time (non-stationary)
- Situations requiring continuous, adaptive exploration
- Settings where value estimates are unreliable or less stable

---

## **Summary**
Gradient bandit methods offer a principled way to learn action preferences by maximizing expected reward via gradient ascent. Their use of the softmax function ensures robust, probabilistic exploration and efficient learning from feedback.
15 changes: 15 additions & 0 deletions questions/163_gradient-bandit-action-selection/meta.json
@@ -0,0 +1,15 @@
{
"id": "163",
"title": "Gradient Bandit Action Selection",
"difficulty": "medium",
"category": "Reinforcement Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [
{
"profile_link": "https://github.com/moe18",
"name": "Moe Chabot"
}
]
}
24 changes: 24 additions & 0 deletions questions/163_gradient-bandit-action-selection/solution.py
@@ -0,0 +1,24 @@
import numpy as np

class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)  # H(a), one preference per action
        self.avg_reward = 0.0                     # running average reward (baseline)
        self.time = 0

    def softmax(self):
        # Numerically stable softmax over the current preferences
        exp_prefs = np.exp(self.preferences - np.max(self.preferences))
        return exp_prefs / np.sum(exp_prefs)

    def select_action(self):
        # Sample an action from the softmax distribution
        probs = self.softmax()
        return int(np.random.choice(self.num_actions, p=probs))

    def update(self, action, reward):
        # Update the running-average baseline, then move the preferences by
        # gradient ascent on expected reward
        self.time += 1
        self.avg_reward += (reward - self.avg_reward) / self.time
        probs = self.softmax()
        for a in range(self.num_actions):
            if a == action:
                self.preferences[a] += self.alpha * (reward - self.avg_reward) * (1 - probs[a])
            else:
                self.preferences[a] -= self.alpha * (reward - self.avg_reward) * probs[a]
22 changes: 22 additions & 0 deletions questions/163_gradient-bandit-action-selection/starter_code.py
@@ -0,0 +1,22 @@
import numpy as np

class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        """
        num_actions (int): Number of possible actions
        alpha (float): Step size for preference updates
        """
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)
        self.avg_reward = 0.0
        self.time = 0

    def softmax(self):
        # Compute softmax probabilities from preferences
        pass

    def select_action(self):
        # Sample an action according to the softmax distribution
        pass

    def update(self, action, reward):
        # Update action preferences using the gradient ascent update
        pass
14 changes: 14 additions & 0 deletions questions/163_gradient-bandit-action-selection/tests.json
@@ -0,0 +1,14 @@
[
{
"test": "import numpy as np\nnp.random.seed(0)\ngb = GradientBandit(num_actions=3, alpha=0.1)\na1 = gb.select_action()\ngb.update(a1, reward=1.0)\na2 = gb.select_action()\ngb.update(a2, reward=0.0)\nprint(a1, a2)",
"expected_output": "1 2"
},
{
"test": "import numpy as np\ngb = GradientBandit(num_actions=2, alpha=0.2)\nacts = [gb.select_action() for _ in range(10)]\nprint(len(acts), set(acts).issubset({0,1}))",
"expected_output": "10 True"
},
{
"test": "import numpy as np\ngb = GradientBandit(num_actions=4, alpha=0.1)\ngb.update(0, 0.0)\ngb.update(1, 1.0)\ngb.update(2, 0.5)\ngb.update(3, 0.0)\nprobs = gb.softmax()\nprint(np.round(probs, 2).tolist())",
"expected_output": "[0.25, 0.26, 0.25, 0.24]"
}
]
16 changes: 16 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/description.md
@@ -0,0 +1,16 @@
A gambler has the chance to bet on a sequence of coin flips. If the coin lands heads, the gambler wins the amount staked; if tails, the gambler loses the stake. The goal is to reach 100, starting from a given capital $s$ (with $0 < s < 100$). The game ends when the gambler reaches $0$ (bankruptcy) or $100$ (goal). On each flip, the gambler can bet any integer amount from $1$ up to $\min(s, 100-s)$.

The probability of heads is $p_h$ (known). Reward is $+1$ if the gambler reaches $100$ in a transition, $0$ otherwise.

**Your Task:**
Write a function `gambler_value_iteration(ph, theta=1e-9)` that:
- Computes the optimal state-value function $V(s)$ for all $s = 1, ..., 99$ using value iteration.
- Returns the optimal policy as a mapping from state $s$ to the optimal stake $a^*$ (can return any optimal stake if there are ties).

**Inputs:**
- `ph`: probability of heads (float between 0 and 1)
- `theta`: threshold for value iteration convergence (default $10^{-9}$)

**Returns:**
- `V`: array/list of length 101, $V[s]$ is the value for state $s$
- `policy`: array/list of length 101, $policy[s]$ is the optimal stake in state $s$ (0 if $s=0$ or $s=100$)
5 changes: 5 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/example.json
@@ -0,0 +1,5 @@
{
"input": "ph = 0.4\nV, policy = gambler_value_iteration(ph)\nprint(round(V[50], 4))\nprint(policy[50])",
"output": "0.0178\n1",
"reasoning": "From state 50, the optimal action is to bet 1, with a probability of reaching 100 of about 0.0178 when ph=0.4."
}
13 changes: 13 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/learn.md
@@ -0,0 +1,13 @@
# **Gambler's Problem and Value Iteration**

In the Gambler's Problem, a gambler repeatedly bets on a coin flip with probability $p_h$ of heads. The goal is to reach 100 starting from some capital $s$. At each state, the gambler chooses a stake $a$ (between $1$ and $\min(s, 100-s)$). If heads, the gambler gains $a$; if tails, loses $a$. The game ends at $0$ or $100$.

The objective is to find the policy that maximizes the probability of reaching 100 (the state-value function $V(s)$ gives this probability). The value iteration update is:

$$
V(s) = \max_{a \in \text{Actions}(s)} \Big[ p_h (\text{reward} + V(s + a)) + (1-p_h)V(s-a) \Big]
$$

where the reward is $+1$ only if $s + a = 100$.

After convergence, the greedy policy chooses the stake maximizing this value. This is a classic episodic MDP, and the optimal policy may not be unique (ties are possible).
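
As a small illustration, a single Bellman backup for one state might be sketched as follows (the helper name `backup` is illustrative; `V` is assumed to be the current value list indexed 0 to 100, with the terminal values fixed at 0):

```python
def backup(V, s, ph):
    # One value-iteration backup for state s: maximize over stakes 1..min(s, 100-s)
    best = 0.0
    for a in range(1, min(s, 100 - s) + 1):
        reward = 1.0 if s + a == 100 else 0.0
        best = max(best, ph * (reward + V[s + a]) + (1 - ph) * V[s - a])
    return best

V = [0.0] * 101                 # value estimates for capitals 0..100 (terminals stay 0)
print(backup(V, s=50, ph=0.4))  # 0.4: staking 50 reaches the goal directly with probability ph
```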
15 changes: 15 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/meta.json
@@ -0,0 +1,15 @@
{
"id": "164",
"title": "Gambler's Problem: Value Iteration",
"difficulty": "hard",
"category": "Reinforcement Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [
{
"profile_link": "https://github.com/moe18",
"name": "Moe Chabot"
}
]
}
48 changes: 48 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/solution.py
@@ -0,0 +1,48 @@
def gambler_value_iteration(ph, theta=1e-9):
    # Initialize value function for states 0 to 100; terminal states 0 and 100 have value 0
    V = [0.0] * 101
    # Initialize policy array (bet amount for each state)
    policy = [0] * 101

    # Value iteration loop
    while True:
        delta = 0
        # Iterate over non-terminal states (1 to 99)
        for s in range(1, 100):
            # Possible actions: bet between 1 and min(s, 100 - s)
            actions = range(1, min(s, 100 - s) + 1)
            action_returns = []
            # Evaluate each action
            for a in actions:
                win_state = s + a
                lose_state = s - a
                # Reward is 1 if transition reaches 100, else 0
                reward = 1.0 if win_state == 100 else 0.0
                # Expected value: ph * (reward + V[win]) + (1 - ph) * V[lose]
                ret = ph * (reward + V[win_state]) + (1 - ph) * V[lose_state]
                action_returns.append(ret)
            # Update V[s] with the maximum expected value
            max_value = max(action_returns)
            delta = max(delta, abs(V[s] - max_value))
            V[s] = max_value
        # Check for convergence
        if delta < theta:
            break

    # Extract optimal policy
    for s in range(1, 100):
        actions = range(1, min(s, 100 - s) + 1)
        best_action = 0
        best_return = -float('inf')
        # Find action that maximizes expected value
        for a in actions:
            win_state = s + a
            lose_state = s - a
            reward = 1.0 if win_state == 100 else 0.0
            ret = ph * (reward + V[win_state]) + (1 - ph) * V[lose_state]
            if ret > best_return:
                best_return = ret
                best_action = a
        policy[s] = best_action

    return V, policy
12 changes: 12 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/starter_code.py
@@ -0,0 +1,12 @@
def gambler_value_iteration(ph, theta=1e-9):
    """
    Computes the optimal value function and policy for the Gambler's Problem.
    Args:
        ph: probability of heads
        theta: convergence threshold
    Returns:
        V: list of values for all states 0..100
        policy: list of optimal stakes for all states 0..100
    """
    # Your code here
    pass
10 changes: 10 additions & 0 deletions questions/164_gambler-s-problem-value-iteration/tests.json
@@ -0,0 +1,10 @@
[
{
"test": "ph = 0.4\nV, policy = gambler_value_iteration(ph)\nprint(round(V[50], 4))\nprint(policy[50])",
"expected_output": "0.4\n50"
},
{
"test": "ph = 0.25\nV, policy = gambler_value_iteration(ph)\nprint(round(V[80], 4))\nprint(policy[80])",
"expected_output": "0.4534\n5"
}
]