Random Sample Greedy Search
Retrieval & Ranking Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Random Sample Greedy Search, Gradient Boosting, Manifold Alignment, Q-Value Estimation, Attention Mechanism, Latent Space Sampling, Supervised Learning, Unsupervised Learning, Reinforcement Learning, Natural Language Processing, Deep Learning, Ensemble Methods, Dimensionality Reduction, Temporal Difference Learning, Sequence Modeling, Generative Architectures.
Implement a function that performs Random Sample Greedy action selection for reinforcement learning in environments with potentially large action spaces. Instead of evaluating all possible actions to find the greedy choice (which costs O(n) where n is the number of actions), Random Sample Greedy randomly samples a smaller subset of k actions and selects the best action from that subset. This provides a computationally cheaper alternative while still performing reasonably well. The function should process a batch of states and for each state: Draw a random number to decide between exploration and exploitation If exploring (with probability epsilon), select a uniformly random action from all available actions If exploiting (with probability 1 epsilon), randomly sample sample size actions without replacement from the action space, then select the action with the highest Q value among the sampled subset When multiple actions in the sampled subset share the maximum Q value, break ties by selecting the action with the smallest index sample size should be clamped to num actions if it exceeds it Use np.random.RandomState with the provided seed. For each state, first call rng.random() to determine explore/exploit, then either rng.randint(0, num actions) for exploration, or rng.choice(num actions, size=k, replace=False) for sampling the action subset. Sort the sampled action indices before evaluating Q values to ensure consistent tie breaking by smallest index. Return a dictionary containing the selected actions, their Q values (rounded to 4 decimal places), and the fraction of selections that matched the true greedy action (the action with highest Q value over ALL actions, ties broken by smallest index), also rounded to 4 decimal places.