Hybrid Importance Sampling for Efficient and Stable Off-Policy Reinforcement Learning

Ahmet KALA; Cem ÖZKURT; Özkan CANAY

doi:-

The Academic Perspective Procedia publishes Academic Platform symposiums papers as three volumes in a year. DOI number is given to all of our papers.
Publisher : Academic Perspective

Journal DOI : 10.33793/acperpro
Journal eISSN : 2667-5862

Year :2025, Volume 6, Issue 2, Pages: 187-196

06.01.2025

Hybrid Importance Sampling for Efficient and Stable Off-Policy Reinforcement Learning

Ahmet KALA; Cem ÖZKURT; Özkan CANAY

https://doi.org/-

413

175

Abstract

Off-policy reinforcement learning improves sample efficiency by reusing data from a behavior policy different from the target policy. This benefits costly domains like robotics and healthcare, enhancing generalization and rare-event learning. However, high variance and instability remain challenges, risking local optima convergence. This study introduces Hybrid Importance Sampling (HIS), combining adaptive clipping and dynamic normalization to stabilize importance weight estimation. Adaptive clipping limits extreme weights, reducing variance, while dynamic normalization balances weight distribution. Compared to standard methods such as Ordinary Importance Sampling (OIS), Weighted Importance Sampling (WIS), and Per-Decision Importance Sampling (PDIS), HIS shows superior stability in high-variance settings, making off-policy learning more reliable for real-world applications.

Keywords: Off-Policy Reinforcement Learning, Importance Sampling, Monte Carlo Methods

References

[1] Lillicrap T., Hunt J., Pritzel A., Heess N., Erez T., Tassa Y. et al.. Continuous control with deep reinforcement learning. 2015. https://doi.org/10.48550/arxiv.1509.02971

[2] Haarnoja T., Zhou A., Hartikainen K., Tucker G., Ha S., Tan J. et al.. Soft actor-critic algorithms and applications. 2018. https://doi.org/10.48550/arxiv.1812.05905

[3] Wang R., Foster D., & Kakade S.. What are the statistical limits of offline rl with linear function approximation?. 2020. https://doi.org/10.48550/arxiv.2010.11895

[4] Fujimoto S., Meger D., & Precup D.. Off-policy deep reinforcement learning without exploration. 2018. https://doi.org/10.48550/arxiv.1812.02900

[5] Silver D., Schrittwieser J., Simonyan K., Antonoglou I., Huang A., Guez A. et al.. Mastering the game of go without human knowledge. Nature 2017;550(7676):354-359. https://doi.org/10.1038/nature24270

[6] Mandyam A., Jones A., Laudański K., & Engelhardt B.. Nested policy reinforcement learning. 2021. https://doi.org/10.48550/arxiv.2110.02879

[7] Fujimoto S., Conti E., Ghavamzadeh M., & Pineau J.. Benchmarking batch deep reinforcement learning algorithms. 2019. https://doi.org/10.48550/arxiv.1910.01708

[8] Mnih V., Kavukcuoglu K., Silver D., Rusu A., Veness J., Bellemare M. et al.. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529-533. https://doi.org/10.1038/nature14236

[9] Chen Z.. A unified lyapunov framework for finite-sample analysis of reinforcement learning algorithms. ACM SIGMETRICS Performance Evaluation Review 2022;50(3):12-15. https://doi.org/10.1145/3579342.3579346

[10] Shi L., Li S., Cao L, Long Y., & Pan G.. Tbq(σ): improving efficiency of trace utilization for off-policy reinforcement learning. 2019. https://doi.org/10.48550/arxiv.1905.07237

[11] Kumar A., Fu J., Tucker G., & Levine S.. Stabilizing off-policy q-learning via bootstrapping error reduction. 2019. https://doi.org/10.48550/arxiv.1906.00949

[12] Touati A., Zhang A., Pineau J., & Vincent P.. Stable policy optimization via off-policy divergence regularization. 2020. https://doi.org/10.48550/arxiv.2003.04108

[13] Imani E., Graves E., & White M.. An off-policy policy gradient theorem using emphatic weightings. 2018. https://doi.org/10.48550/arxiv.1811.09013

[14] Munos R., Stepleton T., Harutyunyan A., & Bellemare M.. Safe and efficient off-policy reinforcement learning. 2016. https://doi.org/10.48550/arxiv.1606.02647

[15] Gu S., Lillicrap T., Ghahramani Z., Turner R., & Levine S.. Q-prop: sample-efficient policy gradient with an off-policy critic. 2016. https://doi.org/10.48550/arxiv.1611.02247

[16] Kallus N. and Uehara M.. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. 2019. https://doi.org/10.48550/arxiv.1906.03735

[17] Tokdar S. and Kass R.. Importance sampling: a review. WIREs Computational Statistics 2009;2(1):54-60. https://doi.org/10.1002/wics.56

[18] Yu T., Lu L., & Li J.. A weight-bounded importance sampling method for variance reduction. International Journal for Uncertainty Quantification 2019;9(3):311-319. https://doi.org/10.1615/int.j.uncertaintyquantification.2019029511

[19] Liu Y., Bacon P., & Brunskill E.. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling. 2019. https://doi.org/10.48550/arxiv.1910.06508

Cite

@article{acperproISITES2025ID34, author={KALA, Ahmet and ÖZKURT, Cem and CANAY, Özkan}, title={Hybrid Importance Sampling for Efficient and Stable Off-Policy Reinforcement Learning}, journal={Academic Perspective Procedia}, eissn={2667-5862}, volume={6}, year=2025, pages={187-196}}
KALA, A. , ÖZKURT, . , CANAY, .. (2025). Hybrid Importance Sampling for Efficient and Stable Off-Policy Reinforcement Learning. Academic Perspective Procedia, 6 (2), 187-196. DOI: -
%0 Academic Perspective Procedia (ACPERPRO) Hybrid Importance Sampling for Efficient and Stable Off-Policy Reinforcement Learning% A Ahmet KALA , Cem ÖZKURT , Özkan CANAY% T Hybrid Importance Sampling for Efficient and Stable Off-Policy Reinforcement Learning% D 1/6/2025% J Academic Perspective Procedia (ACPERPRO)% P 187-196% V 6% N 2% R doi: -% U -

[ 0 ]

Full Paper