Gibbs Softmax Action Selection

Foundations & Tabular RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: Understanding Gibbs (Softmax) Action Selection in Reinforcement Learning, Temperature Parameter, Softmax Normalization, Numerical Stability (Log-Sum-Exp), Categorical Sampling, Action Preference Scaling, Reinforcement Learning, Probability Theory, Numerical Optimization, Decision Theory, Statistical Modeling, Exploration vs Exploitation, Policy Gradient Methods, Stochastic Policies, Action Preferences, Boltzmann Distribution.

Implement a Gibbs (softmax) action-selection function. Given a list of action preferences (Q-values) and a positive temperature parameter 'tau', return the probability distribution over the actions obtained by applying the softmax to the preferences divided by 'tau'. Then implement a function that samples a single action index from that distribution.
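For reference, the standard Gibbs (Boltzmann) distribution can be written as follows. The symbols are introduced here for illustration: n actions, preference Q(a) for action a, temperature τ > 0, and π(a) for the resulting probability. The second form subtracts the constant m (the largest scaled preference) from every exponent. This is the log-sum-exp trick: the factor e^{-m} cancels between numerator and denominator, so the probabilities are unchanged, but no exponent exceeds 0, so nothing overflows.

$$
\pi(a) \;=\; \frac{\exp\!\big(Q(a)/\tau\big)}{\sum_{b=1}^{n} \exp\!\big(Q(b)/\tau\big)}
\;=\; \frac{\exp\!\big(Q(a)/\tau - m\big)}{\sum_{b=1}^{n} \exp\!\big(Q(b)/\tau - m\big)},
\qquad m = \max_{b}\, \frac{Q(b)}{\tau}
$$

As τ grows, π approaches the uniform distribution (more exploration). As τ → 0⁺, π concentrates on the highest-preference action, split evenly among ties (greedy exploitation).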
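A minimal Python sketch of both functions, assuming the preferences arrive as a plain sequence of floats and that tau > 0. The names `gibbs_softmax` and `sample_action` and the optional `rng` parameter are hypothetical choices for illustration; the problem does not prescribe a signature. The softmax subtracts the maximum scaled preference for numerical stability, and the sampler uses inverse-CDF sampling over the cumulative probabilities.

```python
import math
import random
from typing import List, Optional, Sequence


def gibbs_softmax(q_values: Sequence[float], tau: float) -> List[float]:
    """Return Gibbs (Boltzmann) probabilities exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    if len(q_values) == 0:
        raise ValueError("q_values must be non-empty")
    # Scale each preference by the inverse temperature.
    scaled = [q / tau for q in q_values]
    # Log-sum-exp trick: subtract the largest scaled value so every exponent
    # is <= 0. This cancels in the ratio but keeps math.exp from overflowing.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)  # >= 1.0, because the max term contributes exp(0)
    return [w / total for w in weights]


def sample_action(probs: Sequence[float], rng: Optional[random.Random] = None) -> int:
    """Sample one action index from a categorical distribution (inverse CDF)."""
    if rng is None:
        rng = random.Random()
    u = rng.random()  # uniform draw in [0.0, 1.0)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    # Rounding can leave the cumulative sum slightly below 1.0; fall back to
    # the last action that has non-zero probability.
    return max(i for i, p in enumerate(probs) if p > 0)


if __name__ == "__main__":
    probs = gibbs_softmax([1.0, 2.0, 3.0], tau=1.0)
    print(probs)
    print(sample_action(probs, random.Random(42)))
```

For example, `gibbs_softmax([1.0, 2.0, 3.0], tau=1.0)` gives roughly [0.090, 0.245, 0.665], while `tau=10.0` flattens the same preferences to roughly [0.301, 0.332, 0.367]. Passing a seeded `random.Random` makes sampling reproducible; the standard library's `random.Random.choices(range(n), weights=probs)` is a built-in alternative to the hand-written inverse-CDF loop.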