Playing Blackjack with Traditional Q-Learning
Environment: Blackjack
import gymnasium as gym
env = gym.make('Blackjack-v1', natural=False, sab=False)
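For reference (this quick check is not in the original post), each observation from Blackjack-v1 is a tuple of (player sum, dealer's showing card, usable-ace flag), and the action space is Discrete(2) with 0 = stick and 1 = hit:

obs, info = env.reset()
print(obs)               # e.g. (14, 10, 0): player sum, dealer upcard, usable ace
print(env.action_space)  # Discrete(2): 0 = stick, 1 = hit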
Q Table
A player holding 11 or less can never bust by hitting, so the Q table lumps all sums of 11 or below into a single bucket and otherwise tracks sums 11 through 21.
The dealer's upcard ranges from 1 to 10, and one more dimension records whether an ace is currently being counted as 11 (a usable ace).
import numpy as np
# Dimensions: player sum 11~21 (sums of 11 or less share one bucket),
# dealer upcard 1~10, usable ace yes/no, action
q_table = np.zeros(
    (len(range(11, 22)), len(range(10)), 2, env.action_space.n)
)
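The training loop below relies on a compress_state helper that the original does not show. A minimal sketch consistent with the Q-table shape above (player sums of 11 or less collapsed onto the first index, dealer upcard shifted to 0~9, usable-ace flag cast to 0/1) might look like this:

def compress_state(state):
    player_sum, dealer_card, usable_ace = state
    # Sums of 11 or less can never bust, so they all share index 0
    player_idx = max(player_sum, 11) - 11   # 11..21 -> 0..10
    dealer_idx = dealer_card - 1            # 1..10  -> 0..9
    return (player_idx, dealer_idx, int(usable_ace))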
Hyperparameters
Because the final reward is highly random (hitting on 15 can end in a win in one hand and a loss in another), the learning rate (alpha) is kept on the small side.
epsilon = 0.3
alpha = 0.3
gamma = 0.95
Choosing an action
from random import random
def get_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if random() < epsilon:
        return env.action_space.sample()
    return np.argmax(q_table[state])
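Note that get_action indexes the Q table directly, so it expects the compressed index tuple rather than the raw observation, e.g. (using the hypothetical compress_state sketch above):

obs, _ = env.reset()
action = get_action(compress_state(obs))  # 0 = stick, 1 = hit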
Training loop
total_rewards_arr = []
total_rewards = 0
N = 100000
for episode in range(N):
    state, _ = env.reset()
    state = compress_state(state)
    done = False
    while not done:
        action = get_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = compress_state(next_state)
        if done:
            # Terminal step: the target is just the final reward
            q_table[state+(action,)] += alpha * (reward - q_table[state+(action,)])
            total_rewards += reward
        else:
            # Standard Q-learning update with a bootstrapped target
            q_table[state+(action,)] += alpha * (reward + gamma*np.max(q_table[next_state]) - q_table[state+(action,)])
        state = next_state
    if episode % 1000 == 999:
        # Every 1,000 episodes: decay exploration/learning rate and log the score
        epsilon *= 0.96
        alpha *= 0.96
        total_rewards_arr.append(total_rewards)
        total_rewards = 0
Results
import matplotlib.pyplot as plt
plt.plot(total_rewards_arr)
plt.show()
The total score is recorded once every 1,000 games. After roughly 100,000 hands the curve gradually converges, but the agent still cannot consistently beat the dealer.
The strategy table the AI learned
Your total \ Dealer upcard | A | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | T |
---|---|---|---|---|---|---|---|---|---|---|
11 | H | H | H | H | H | H | H | H | H | H |
12 | H | S | H | H | H | S | H | H | H | H |
13 | H | S | H | S | S | S | H | H | H | H |
14 | H | H | S | S | S | H | H | H | H | H |
15 | H | S | H | S | S | S | H | H | H | H |
16 | H | S | S | H | S | S | H | H | S | H |
17 | H | S | S | S | S | S | S | S | S | S |
18 | S | S | S | S | S | S | S | S | S | S |
19 | S | S | S | S | S | S | S | S | S | S |
20 | S | S | S | S | S | S | S | S | S | S |
21 | S | S | S | S | S | S | S | S | S | S |
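A strategy table like the one above can be read straight out of the learned Q table. Here is one possible sketch that prints the no-usable-ace slice (H = hit, S = stand); the formatting details are illustrative, not from the original:

dealer_labels = ['A', '2', '3', '4', '5', '6', '7', '8', '9', 'T']
print('player\\dealer  ' + ' '.join(dealer_labels))
for player_sum in range(11, 22):
    row = []
    for dealer_idx in range(10):
        best = np.argmax(q_table[(player_sum - 11, dealer_idx, 0)])
        row.append('H' if best == 1 else 'S')
    print(f'{player_sum:>13}  ' + ' '.join(row))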