---
tags:
- CartPole-v1
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: CartPole-v1-policy-gradient-RL
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: CartPole-v1
      type: CartPole-v1
    metrics:
    - type: mean_reward
      value: 500.00 +/- 0.00
      name: mean_reward
      verified: false
---

# CartPole-v1 Policy Gradient Reinforcement Learning Model

## Model Description

This model is a Policy Gradient (REINFORCE) agent trained to solve the CartPole-v1 environment from OpenAI Gym. The agent learns to balance a pole on a cart by choosing between two discrete actions (push left or push right) so as to maximize the cumulative reward.

## Model Details

### Model Architecture

- **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- **Neural Network**: Simple feedforward network
  - Hidden layer size: 16 units
  - Activation function: ReLU (typical for policy networks)
  - Output layer: Softmax for action probabilities

### Training Configuration

- **Environment**: CartPole-v1 (OpenAI Gym)
- **Training Episodes**: 2,000
- **Max Steps per Episode**: 1,000 (CartPole-v1 itself truncates episodes at 500 steps)
- **Learning Rate**: 0.01
- **Discount Factor (γ)**: 1.0 (no discounting)
- **Optimizer**: Adam (PyTorch default)

## Environment Details

**CartPole-v1** is a classic control problem where:

- **Observation Space**: 4-dimensional continuous space
  - Cart position: [-4.8, 4.8]
  - Cart velocity: (-∞, ∞)
  - Pole angle: [-0.418 rad, 0.418 rad] (about ±24°)
  - Pole angular velocity: (-∞, ∞)
- **Action Space**: 2 discrete actions (0: push left, 1: push right)
- **Reward**: +1 for every step the pole remains upright
- **Episode Termination**:
  - Pole angle exceeds ±12° (about ±0.2095 rad)
  - Cart position exceeds ±2.4
  - Episode length reaches 500 steps (CartPole-v1 time limit)

## Training Process

The model was trained using the REINFORCE algorithm with the following key features:

1. **Return Calculation**: Monte Carlo returns are computed with a backward recursion, G_t = r_t + γ G_{t+1}, so each return reuses the one that follows it
2. **Return Standardization**: Returns are normalized (zero mean, unit variance) for training stability
3. **Policy Loss**: Negative log-probability of each action taken, weighted by its standardized return
4. **Gradient Update**: Standard backpropagation with the Adam optimizer

### Key Implementation Details

- Returns are calculated in reverse chronological order, which makes the computation linear in episode length
- Numerical stability is ensured by adding a small epsilon to the standard deviation before dividing
- A deque is used for O(1) appends at either end

## Performance

The model is evaluated over 10 episodes after training. Expected performance:

- **Target**: Consistently achieve scores close to 500 (the maximum possible in CartPole-v1)
- **Success Criterion**: Average score > 475 over evaluation episodes
- **Reported Result**: Mean reward of 500.00 +/- 0.00 (unverified, from the model card metadata)
- **Training Stability**: 100-episode rolling average tracked during training

## Usage

```python
import gym
import torch

# Load the trained policy (assumes the file holds the full pickled module and
# that its class is importable; PyTorch >= 2.6 needs weights_only=False for this)
policy = torch.load('policy_model.pth', weights_only=False)

# Use the policy to select actions
env = gym.make('CartPole-v1')
state = env.reset()  # gym >= 0.26 returns (obs, info): use `state, _ = env.reset()`
action, log_prob = policy.act(state)
```

## Limitations and Considerations

1. **Environment Specific**: The model is trained specifically for CartPole-v1 and won't generalize to other environments
2. **Sample Efficiency**: REINFORCE can be sample-inefficient compared to modern policy gradient methods
3. **Variance**: Policy gradient estimates have high variance because no learned baseline or critic is used
4. **Hyperparameter Sensitivity**: Performance may be sensitive to the learning rate and network architecture

## Ethical Considerations

This is a simple control task with no ethical implications.
The model is designed for:

- Educational purposes in reinforcement learning
- Benchmarking and algorithm development
- Research in policy gradient methods

## Training Environment

- **Framework**: PyTorch
- **Environment**: OpenAI Gym
- **Monitoring**: 100-episode rolling average for performance tracking

## Model Files

- `policy_model.pth`: Trained policy network weights
- `training_scores.pkl`: Training episode scores for analysis

## Citation

If you use this model, please cite:

```bibtex
@misc{cartpole-policy-gradient-2024,
  title={CartPole-v1 Policy Gradient Reinforcement Learning Model},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/CartPole-v1-policy-gradient-RL}
}
```

## References

- Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction*. MIT Press.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8(3-4), 229-256.
- OpenAI Gym CartPole-v1 Environment Documentation

---

*For questions or issues with this model, please open an issue in the repository.*
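
## Appendix: Sketch of the Training Computation

The first three steps listed under Training Process (return calculation, standardization, and policy loss) can be illustrated with a minimal, dependency-free sketch. This is not the repository's training code: the function names `discounted_returns`, `standardize`, and `policy_loss` are hypothetical, plain floats stand in for PyTorch tensors, the log-probabilities are placeholders, and the population standard deviation is an assumption (a tensor-based version may use the sample standard deviation instead).

```python
from collections import deque
import math


def discounted_returns(rewards, gamma=1.0):
    """Return G_t for every step, built backwards via G_t = r_t + gamma * G_{t+1}."""
    returns = deque()
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.appendleft(g)  # O(1) insert at the front keeps chronological order
    return list(returns)


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale to unit variance; eps guards against a zero std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance (assumption)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]


def policy_loss(log_probs, std_returns):
    """REINFORCE surrogate loss: sum over steps of -log pi(a_t | s_t) * G_t."""
    return sum(-lp * g for lp, g in zip(log_probs, std_returns))


rewards = [1.0, 1.0, 1.0, 1.0]               # CartPole gives +1 per step
returns = discounted_returns(rewards)         # [4.0, 3.0, 2.0, 1.0] with gamma = 1.0
norm = standardize(returns)
log_probs = [math.log(0.5)] * len(rewards)    # placeholder: uniform two-action policy
loss = policy_loss(log_probs, norm)
```

In an actual training loop the log-probabilities would come from the policy network as tensors, and the loss would be backpropagated and applied with the optimizer once per episode.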