{"id":92,"date":"2023-12-13T18:17:46","date_gmt":"2023-12-13T18:17:46","guid":{"rendered":"https:\/\/sites.tntech.edu\/lcasl\/?page_id=92"},"modified":"2023-12-13T18:25:10","modified_gmt":"2023-12-13T18:25:10","slug":"hmm-rl","status":"publish","type":"page","link":"https:\/\/sites.tntech.edu\/lcasl\/hmm-rl\/","title":{"rendered":"HMM RL"},"content":{"rendered":"\n<div style=\"height:var(--wp--preset--spacing--50)\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-group alignwide has-global-padding is-layout-constrained wp-block-group-is-layout-constrained\">\n<h3 class=\"wp-block-heading alignwide has-text-align-center has-xx-large-font-size\" style=\"line-height:1.2\">Hidden Markov Model Based Q-Learning for Partially Observable Markov Decision Process with Discounted Rewards<\/h3>\n<\/div>\n\n\n\n<div class=\"wp-block-group alignfull has-global-padding is-layout-constrained wp-container-core-group-is-layout-0747478d wp-block-group-is-layout-constrained\" style=\"padding-top:var(--wp--preset--spacing--50);padding-right:var(--wp--preset--spacing--50);padding-bottom:var(--wp--preset--spacing--50);padding-left:var(--wp--preset--spacing--50)\">\n<div class=\"wp-block-columns alignwide is-layout-flex wp-container-core-columns-is-layout-ff4b9c61 wp-block-columns-is-layout-flex\" style=\"margin-top:0;margin-bottom:0\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p>Real-world decision-making problems are often partially observable, and dynamic models of the environments are typically unknown. Therefore, there is a demand for learning methods that estimate both the dynamic model and the decision policy from streams of rewards and incomplete state observations. This work presents an online estimation algorithm that simultaneously estimates the model parameters and the belief-state action-value function over a discretized belief-state space. 
Furthermore, we establish an asymptotic convergence analysis for this estimation. We also show that the discretized action-value function converges to the actual action-value function as the number of grid points in the discretization increases. In addition, we provide a numerical example in which the proposed estimation method shows improved performance over standard Q-learning, which does not account for incomplete state observations. While this work focuses on the analysis of finite observation and action spaces, we consider it a step toward a theoretical understanding of existing deep reinforcement learning methods that estimate both models and policies from incomplete observations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\">BACKGROUND<\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1359\" height=\"613\" src=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide02.png\" alt=\"\" class=\"wp-image-95\" style=\"width:1141px;height:auto\" srcset=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide02.png 1359w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide02-300x135.png 300w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide02-1024x462.png 1024w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide02-768x346.png 768w\" sizes=\"auto, (max-width: 1359px) 100vw, 1359px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\">POMDP EXAMPLE<\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"502\" src=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide03-1024x502.png\" alt=\"\" class=\"wp-image-96\" 
style=\"width:1144px;height:auto\" srcset=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide03-1024x502.png 1024w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide03-300x147.png 300w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide03-768x376.png 768w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/slide03.png 1345w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\">HMM BASED RL FOR POMDP<\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"479\" src=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture222-1024x479.png\" alt=\"\" class=\"wp-image-97\" style=\"width:1134px;height:auto\" srcset=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture222-1024x479.png 1024w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture222-300x140.png 300w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture222-768x359.png 768w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture222.png 1292w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1021\" height=\"721\" src=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/algorithm_hmm_rl.png\" alt=\"\" class=\"wp-image-99\" style=\"width:1141px;height:auto\" srcset=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/algorithm_hmm_rl.png 1021w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/algorithm_hmm_rl-300x212.png 
300w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/algorithm_hmm_rl-768x542.png 768w\" sizes=\"auto, (max-width: 1021px) 100vw, 1021px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\">PERFORMANCE OF THE POLICY ON THE MACHINE REPAIR PROBLEM<\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"759\" height=\"430\" src=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture.png\" alt=\"\" class=\"wp-image-100\" style=\"width:1123px;height:auto\" srcset=\"https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture.png 759w, https:\/\/sites.tntech.edu\/lcasl\/wp-content\/uploads\/sites\/163\/2023\/12\/capture-300x170.png 300w\" sizes=\"auto, (max-width: 759px) 100vw, 759px\" \/><\/figure>\n\n\n\n<p><strong>Relevant Paper:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Yoon, Hyung-Jin, Donghwan Lee, and Naira Hovakimyan. &#8220;Hidden Markov Model Estimation-Based Q-Learning for Partially Observable Markov Decision Process.&#8221;\u00a02019 American Control Conference (ACC). IEEE, 2019.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Hidden Markov Model Based Q-Learning for Partially Observable Markov Decision Process with Discounted Rewards Real-world decision-making problems are often partially observable, and dynamic models of the environments are typically unknown. 
Therefore, there is a demand for learning methods that estimate both the dynamic model and the decision policy from streams of rewards and incomplete state [&hellip;]<\/p>\n","protected":false},"author":184,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-92","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/pages\/92","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/users\/184"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/comments?post=92"}],"version-history":[{"count":3,"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/pages\/92\/revisions"}],"predecessor-version":[{"id":101,"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/pages\/92\/revisions\/101"}],"wp:attachment":[{"href":"https:\/\/sites.tntech.edu\/lcasl\/wp-json\/wp\/v2\/media?parent=92"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}