Learning a Tabular Environment Model from Experience
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Learning a Tabular Environment Model from Experience, Transition Dynamics Modeling, Dirichlet Prior Distribution, Bellman Operator, Experience Replay Buffer, Empirical Frequency Counting, Reinforcement Learning, Probability Theory, Dynamic Programming, Statistical Estimation, Information Theory, Model-Based RL, Bayesian Inference, Markov Decision Processes, Temporal Difference Learning, Maximum Likelihood Estimation.
In model-based reinforcement learning, the agent learns an internal model of the environment from observed transitions. This model captures two key functions:

1. Transition dynamics: the probability of transitioning to each next state given the current state and action, i.e., P(s'|s, a).
2. Reward function: the expected reward received when taking an action in a given state, i.e., E[R|s, a].

Given a list of experience tuples (state, action, reward, next_state) collected from interacting with the environment, implement a function that learns a tabular environment model and returns the predicted transition probability distribution and expected reward for a queried state-action pair.

The learned model should be based on frequency counting over the observed data:

- Transition probabilities are the fraction of times each next state was observed from a given (state, action) pair.
- The expected reward is the mean of all rewards observed for that (state, action) pair.
- For state-action pairs that never appear in the experience data, return a uniform distribution over all states and an expected reward of 0.0.

Args:
    experiences: list of tuples (state, action, reward, next_state)
    num_states: int, total number of states in the environment
    num_actions: int, total number of actions in the environment
    query_state: int, the state to query
    query_action: int, the action to query

Returns:
    tuple: (transition_probs, expected_reward)
        transition_probs: list of floats of length num_states, where transition_probs[s'] = P(s'|query_state, query_action)
        expected_reward: float, E[R|query_state, query_action]
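The frequency-counting procedure described above can be sketched in Python as follows. This is one possible approach, not a reference solution: the function name learn_tabular_model is hypothetical, and the sketch assumes states are integers in the range 0 to num_states - 1. The num_actions argument is accepted to match the stated signature but is not needed for the counting itself.

```python
from collections import defaultdict


def learn_tabular_model(experiences, num_states, num_actions,
                        query_state, query_action):
    """Estimate P(s'|s, a) and E[R|s, a] by empirical frequency counting.

    Assumption: states are integers in [0, num_states). num_actions is
    kept only to match the problem's signature and is unused here.
    """
    # next_counts[(s, a)][s'] = number of times s' followed (s, a)
    next_counts = defaultdict(lambda: [0] * num_states)
    reward_sums = defaultdict(float)   # total reward observed per (s, a)
    visits = defaultdict(int)          # number of visits per (s, a)

    for state, action, reward, next_state in experiences:
        key = (state, action)
        next_counts[key][next_state] += 1
        reward_sums[key] += reward
        visits[key] += 1

    key = (query_state, query_action)
    n = visits.get(key, 0)

    # Unvisited pair: uniform distribution over states, zero expected reward.
    if n == 0:
        return [1.0 / num_states] * num_states, 0.0

    transition_probs = [count / n for count in next_counts[key]]
    expected_reward = reward_sums[key] / n
    return transition_probs, expected_reward


# Illustrative data (made up for this example): three visits to (0, 1).
experiences = [(0, 1, 1.0, 1), (0, 1, 0.0, 2), (0, 1, 1.0, 1), (1, 0, 5.0, 0)]
probs, reward = learn_tabular_model(experiences, 3, 2, 0, 1)
```

With the illustrative data above, the pair (0, 1) was visited three times, twice leading to state 1 and once to state 2, so the estimate puts 2/3 of the mass on state 1 and 1/3 on state 2, with a mean reward of 2/3. Building all counts before answering a single query is wasteful for one lookup, but it mirrors how a learned model would be reused across many queries.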