About me
My name is Oleg Svidchenko. I am currently studying at St. Petersburg State University, and before that I studied for three years at the HSE St. Petersburg school of physics, mathematics, and computer science. I also work as a researcher at JetBrains Research. Before entering university, I studied at the SSC of Moscow State University and, as a member of the Moscow team, became a prize-winner of the All-Russian Olympiad in informatics for school students.
What do you need?
If you would like to try reinforcement learning for yourself, the Mountain Car challenge is a great place to start. For today you will need Python with the Gym and PyTorch libraries installed, as well as a basic understanding of neural networks.
Description of the task
In a two-dimensional world, a car has to climb out of the valley between two hills and reach the top of the hill on the right. The catch is that its engine is not powerful enough to overcome gravity and get there on the first attempt. We are asked to train an agent (in this case, a neural network) that, by controlling the car, will be able to climb the right hill as quickly as possible.
The car is controlled through interaction with the environment. The interaction is divided into independent episodes, and each episode proceeds step by step. At each step, the agent receives a state s and a reward r from the environment in response to its action a; in addition, the environment sometimes reports that the episode has ended. In this problem, s is a pair of numbers: the first is the car's position on the track (one coordinate is enough, since the car cannot leave the surface), and the second is its velocity along the surface (with a sign). The reward r is a number that in this task is always equal to -1, which motivates the agent to finish the episode as quickly as possible. There are only three possible actions: push the car to the left, do nothing, and push the car to the right; they correspond to the numbers 0 to 2. An episode ends when the car reaches the top of the right hill or when the agent has made 200 steps.
A bit of theory
There is already an article about DQN on Habr in which the author covered all the necessary theory quite well. Nevertheless, for the sake of readability, we will repeat it here in a more formal form.
A reinforcement learning task is defined by a set of components: the state space S, the action space A, the discount coefficient $\gamma$, the transition function T, and the reward function R. In general, the transition and reward functions can be random variables, but here we consider a simpler version in which they are uniquely defined. The goal is to maximize the cumulative reward $\sum_{t=0}^{T} \gamma^{t} r_{t}$, where t is the number of the step in the environment and T is the number of steps in the episode.
To solve this problem, we define the value function V of a state s as the maximum cumulative reward attainable when starting from state s. If we knew such a function, we could solve the problem simply by moving at each step to the reachable state with the maximum value. However, it is not that simple: in most cases we do not know which action leads to the desired state. Therefore, we add the action a as a second parameter of the function. The resulting function is called the Q-function. It shows the maximum cumulative reward that can be obtained by performing action a in state s. With this function we can solve the problem: being in state s, we simply choose the a for which Q(s, a) is maximal.
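For the deterministic case considered here, these definitions can be written out as follows (the maximum is taken over all sequences of future actions, with the trajectory starting in s):

$$V(s) = \max_{a_0, a_1, \dots} \sum_{t=0}^{T} \gamma^{t} r_{t},
\qquad
Q(s, a) = \max_{a_1, a_2, \dots} \sum_{t=0}^{T} \gamma^{t} r_{t} \quad (\text{with } a_0 = a),$$

and the greedy policy that uses the Q-function is simply $\pi(s) = \arg\max_{a} Q(s, a)$.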
In practice, we do not know the true Q-function, but it can be approximated in various ways. One such approach is the Deep Q Network (DQN). Its idea is to approximate the Q-function for each action using a neural network.
The environment
Let's get some practice. First we need to learn how to emulate the MountainCar environment. The Gym library, which provides a large number of standard reinforcement learning environments, will help us cope with this task. To create an environment, we call the make method of the gym module, passing it the name of the desired environment as a parameter:
import gym

env = gym.make("MountainCar-v0")
Detailed documentation can be found here, and a description of the environment here.
Let's take a closer look at what we can do with the created environment (a short usage example follows the list):
- env.reset(): ends the current episode and starts a new one; returns the initial state.
- env.step(action): performs the specified action; returns the new state, the reward, a flag indicating whether the episode has ended, and additional information that can be used for debugging.
- env.seed(seed): sets the random seed, which affects how the initial state is generated in env.reset().
- env.render(): renders the current state of the environment.
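As a quick sanity check, one can, for example, play a single episode with random actions using the env created above:

# The state consists of two numbers (position and velocity),
# and there are three discrete actions: 0, 1, 2
print(env.observation_space)
print(env.action_space)

state = env.reset()
done = False
total_reward = 0
while not done:
    # sample a random action instead of a trained policy
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    total_reward += reward

# a random policy almost never reaches the flag, so this is usually -200
print(total_reward)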
Implementing DQN
DQN is an algorithm that uses a neural network to estimate the Q-function. In the original paper, DeepMind defined a standard architecture for Atari games using convolutional neural networks. Unlike those games, Mountain Car does not use an image as the state, so we have to define the architecture ourselves.
For example, let's take an architecture with two hidden layers of 32 neurons each. After each hidden layer we use ReLU as the activation function. The two numbers describing the state are fed to the input of the network, and at the output we get estimates of the Q-function for each of the three actions.
import copy
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 3)
)
target_model = copy.deepcopy(model)

# Initialize the weights of the network
def init_weights(layer):
    if type(layer) == nn.Linear:
        nn.init.xavier_normal_(layer.weight)

model.apply(init_weights)
Since we will be training the neural network on a GPU, we need to move the network there:
# To train on the CPU instead, replace "cuda" with "cpu"
device = torch.device("cuda")
model.to(device)
target_model.to(device)
The device variable is made global here because the data will also need to be moved to it.
We also need to define an optimizer that will update the model's weights using gradient descent (yes, there are plenty of optimizers to choose from; here we take Adam):
optimizer = optim.Adam(model.parameters(), lr=0.00003)
Putting it all together
import copy
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda")

def create_new_model():
    model = nn.Sequential(
        nn.Linear(2, 32),
        nn.ReLU(),
        nn.Linear(32, 32),
        nn.ReLU(),
        nn.Linear(32, 3)
    )
    target_model = copy.deepcopy(model)

    # Initialize the weights of the network
    def init_weights(layer):
        if type(layer) == nn.Linear:
            nn.init.xavier_normal_(layer.weight)
    model.apply(init_weights)

    # Move the networks to the selected device (GPU or CPU)
    model.to(device)
    target_model.to(device)

    # The optimizer that will update the weights of the model
    optimizer = optim.Adam(model.parameters(), lr=0.00003)

    return model, target_model, optimizer
Now let's declare a function that computes the loss and performs a gradient descent step along it. But before that, we still need to load the batch data onto the GPU:
state, action, reward, next_state, done = batch

# Move the batch to the device
state = torch.tensor(state).to(device).float()
next_state = torch.tensor(next_state).to(device).float()
reward = torch.tensor(reward).to(device).float()
action = torch.tensor(action).to(device)
done = torch.tensor(done).to(device)
Next we need to compute the target values of the Q-function. Since we do not know the true ones, we estimate them through the Q-values of the next state.
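In other words, for each transition in the batch the target value is the standard one-step estimate, with terminal transitions contributing only the immediate reward:

$$y = \begin{cases} r + \gamma \max_{a'} Q_{\text{target}}(s', a'), & s' \text{ is not terminal},\\ r, & s' \text{ is terminal}. \end{cases}$$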
target_q = torch.zeros(reward.size()[0]).float().to(device)
with torch.no_grad():
    # Estimate the Q-function of the next state with the target network
    # (only for transitions where the episode has not ended)
    target_q[~done] = target_model(next_state).max(1)[0].detach()[~done]
target_q = reward + target_q * gamma
And the current prediction:
q = model(state).gather(1, action.unsqueeze(1))
Using target_q and q, we compute the loss function and update the model:
loss = F.smooth_l1_loss(q, target_q.unsqueeze(1))

# Zero out the accumulated gradients
optimizer.zero_grad()
# Backpropagate the loss
loss.backward()
# Clip the gradients so that a single update step does not change the weights too much
for param in model.parameters():
    param.grad.data.clamp_(-1, 1)
# Apply the update
optimizer.step()
Putting it all together
import torch.nn.functional as F

gamma = 0.99

def fit(batch, model, target_model, optimizer):
    state, action, reward, next_state, done = batch
    # Move the batch to the device
    state = torch.tensor(state).to(device).float()
    next_state = torch.tensor(next_state).to(device).float()
    reward = torch.tensor(reward).to(device).float()
    action = torch.tensor(action).to(device)
    done = torch.tensor(done).to(device)

    # Compute the target Q-values, bootstrapping from the next state
    target_q = torch.zeros(reward.size()[0]).float().to(device)
    with torch.no_grad():
        # Estimate the Q-function of the next state with the target network
        # (only for transitions where the episode has not ended)
        target_q[~done] = target_model(next_state).max(1)[0].detach()[~done]
    target_q = reward + target_q * gamma

    # The current predictions of the Q-function for the chosen actions
    q = model(state).gather(1, action.unsqueeze(1))

    loss = F.smooth_l1_loss(q, target_q.unsqueeze(1))

    # Zero out the accumulated gradients
    optimizer.zero_grad()
    # Backpropagate the loss
    loss.backward()
    # Clip the gradients so that a single update step does not change the weights too much
    for param in model.parameters():
        param.grad.data.clamp_(-1, 1)
    # Apply the update
    optimizer.step()
Since the model only estimates the Q-function and does not perform actions itself, we need to define a function that decides which action the agent takes. As the decision-making algorithm we take the ε-greedy policy. Its idea is that the agent usually acts greedily, choosing the action with the maximum Q-value, but with probability ε it takes a random action. Random actions are needed so that the algorithm can try out actions it would never perform if it were guided only by the greedy policy; this process is called exploration.
import random

def select_action(state, epsilon, model):
    if random.random() < epsilon:
        return random.randint(0, 2)
    return model(torch.tensor(state).to(device).float().unsqueeze(0))[0].max(0)[1].view(1, 1).item()
To train the neural network on batches, we also need a buffer in which we will store the experience of interacting with the environment and from which we will sample batches:
class Memory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, element):
        """Store a transition, overwriting the oldest one when the buffer is full"""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = element
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Sample a random batch and regroup it by components"""
        return list(zip(*random.sample(self.memory, batch_size)))

    def __len__(self):
        return len(self.memory)
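For example, the buffer can be used like this (the transition values below are made up purely to illustrate the interface):

# Store transitions of the form (state, action, reward, next_state, done)
memory = Memory(capacity=10)
memory.push(((-0.50, 0.000), 2, -1.0, (-0.49, 0.001), False))
memory.push(((-0.49, 0.001), 0, -1.0, (-0.50, -0.001), False))

# sample() returns the batch regrouped column-wise:
# a tuple of states, a tuple of actions, a tuple of rewards, ...
states, actions, rewards, next_states, dones = memory.sample(2)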
A naive solution
First, let's declare the constants that we will use in the training process and create the model:
# How often to copy the weights of model into target_model
target_update = 1000
# The size of the batch sampled from memory
batch_size = 128
# The total number of environment steps
max_steps = 100001
# Exploration parameters
max_epsilon = 0.5
min_epsilon = 0.1

# The replay buffer
memory = Memory(5000)
model, target_model, optimizer = create_new_model()
Although it is natural to divide the interaction process into episodes, for describing the training process it is more convenient to split it into individual steps, since we want to make one gradient descent step after each step of the environment.
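Schematically, one pass through such a training loop consists of the following stages, each of which we fill in below (an outline only):

for step in range(max_steps):
    # 1. choose an action with the epsilon-greedy policy and make one environment step
    # 2. store the resulting transition in the replay buffer
    #    (and reset the environment if the episode has ended)
    # 3. make one gradient descent step on a batch sampled from the buffer
    # 4. every target_update steps, copy model into target_model
    #    and play an evaluation episode to track progress
    ...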
Let's describe in more detail what one training step looks like. Suppose we are at step number step out of max_steps and the current state is state. Then choosing an action with the ε-greedy policy looks like this:
epsilon = max_epsilon - (max_epsilon - min_epsilon) * step / max_steps
action = select_action(state, epsilon, model)
new_state, reward, done, _ = env.step(action)
We add the experience we have just gained to the buffer, and start a new episode if the current one has ended:
memory.push((state, action, reward, new_state, done))
if done:
    state = env.reset()
    done = False
else:
    state = new_state
Then we perform a gradient descent step (provided, of course, that we can already collect at least one batch):
if step > batch_size:
    fit(memory.sample(batch_size), model, target_model, optimizer)
Now it only remains to update target_model:
if step % target_update == 0:
    target_model = copy.deepcopy(model)
However, we would also like to track the training process. To do this, every time target_model is updated we play an extra episode with epsilon = 0 and store its total reward in the rewards_by_target_updates buffer:
if step % target_update == 0:
    target_model = copy.deepcopy(model)

    state = env.reset()
    total_reward = 0
    while not done:
        action = select_action(state, 0, target_model)
        state, reward, done, _ = env.step(action)
        total_reward += reward
    done = False
    state = env.reset()
    rewards_by_target_updates.append(total_reward)
Putting it all together
# How often to copy the weights of model into target_model
target_update = 1000
# The size of the batch sampled from memory
batch_size = 128
# The total number of environment steps
max_steps = 100001
# Exploration parameters
max_epsilon = 0.5
min_epsilon = 0.1

def train():
    # The replay buffer
    memory = Memory(5000)
    model, target_model, optimizer = create_new_model()

    rewards_by_target_updates = []
    state = env.reset()
    done = False

    for step in range(max_steps):
        # Choose an action with the epsilon-greedy policy and make an environment step
        epsilon = max_epsilon - (max_epsilon - min_epsilon) * step / max_steps
        action = select_action(state, epsilon, model)
        new_state, reward, done, _ = env.step(action)

        # Store the transition; start a new episode if the current one has ended
        memory.push((state, action, reward, new_state, done))
        if done:
            state = env.reset()
            done = False
        else:
            state = new_state

        # One gradient descent step, once we can collect at least one batch
        if step > batch_size:
            fit(memory.sample(batch_size), model, target_model, optimizer)

        if step % target_update == 0:
            target_model = copy.deepcopy(model)

            # Exploitation: play an evaluation episode with epsilon = 0
            state = env.reset()
            total_reward = 0
            while not done:
                action = select_action(state, 0, target_model)
                state, reward, done, _ = env.step(action)
                total_reward += reward
            done = False
            state = env.reset()
            rewards_by_target_updates.append(total_reward)

    return rewards_by_target_updates
Running this code gives us something like the following graph:
What went wrong?
Is this a bug? Is the algorithm wrong? Are the parameters bad? Not really. The problem is actually in the task itself, namely in the reward function. Let's take a closer look. At every step the agent receives a reward of -1, and this goes on until the episode ends. Such a reward motivates the agent to finish the episode as quickly as possible, but at the same time it does not tell it how to do so. Because of this, the only way for the agent to learn to solve the problem in this formulation is to solve it many times through exploration.
Of course, instead of the ε-greedy policy we could use more sophisticated algorithms for exploring the environment. However, first, applying them would make our model more complex, which we would like to avoid, and second, it is not guaranteed that they would work well enough for this task. Instead, we can remove the cause of the problem by changing the task itself, that is, by changing the reward function. This approach is called reward shaping.
Speeding up convergence
From intuition we know that in order to climb the hill we need to accelerate, and the higher the speed, the closer the agent is to solving the problem. We can tell it about this, for example, by adding the absolute value of the velocity, taken with some coefficient, to the reward:
modified_reward = reward + 10 * abs(new_state[1])
Accordingly, in the train function the line

memory.push((state, action, reward, new_state, done))

needs to be replaced with

memory.push((state, action, modified_reward, new_state, done))

Now let's look at the new chart (as before, it shows the original reward, without the modification):
Here RS stands for Reward Shaping.
Is everything really that good?
The progress is obvious: the agent has clearly learned to climb the hill, since the reward has started to differ from -200. Only one question remains: if we changed the reward function, we changed the task itself, so will the solution we found for the new task be any good for the old one?
First, let's understand what "good" means in our case. In solving the problem we are trying to find the optimal policy: the one that maximizes the total reward per episode. In that case we can replace the word "good" with the word "optimal", since that is what we are looking for. We are also optimistically hoping that, sooner or later, our modified DQN will find the optimal solution to its problem and will not get stuck at a local maximum. Therefore the question can be reformulated as follows: if we change the reward function, we change the task itself, so will the optimal solution to the new task be optimal for the old one?
As it turns out, in the general case no such guarantee can be given. The answer depends on exactly how we change the reward function, how it was arranged before, and how the environment itself is arranged. Fortunately, there is a paper whose authors investigated how changing the reward function affects the optimality of the solution found.
First, they found a whole class of "safe" changes based on the potential method: $F(s, s') = \gamma \Phi(s') - \Phi(s)$, where the potential $\Phi$ depends only on the state. For such functions, the authors were able to prove that if a solution is optimal for the new problem, it is also optimal for the old one.
Second, the authors showed that for any other modification F there exist a transition function T, a reward function R, and an optimal solution of the modified problem such that this solution is not optimal for the original problem. This means that when a modification not based on the potential method is used, we cannot guarantee that the solution found will be good.
Thus, using a potential function to modify the reward can change only the convergence rate of the algorithm, but not the final solution.
Speeding up convergence the right way
Now that we know how to change the reward safely, let's modify the task once more, this time using the potential method instead of the naive heuristic:
modified_reward = reward + 300 * (gamma * abs(new_state[1]) - abs(state[1]))
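This is exactly the potential-based form from the previous section, with the potential taken proportional to the absolute velocity:

$$F(s, s') = \gamma\,\Phi(s') - \Phi(s) = 300\left(\gamma\,|v'| - |v|\right), \qquad \Phi(s) = 300\,|v|,$$

where $v$ and $v'$ are the velocities in states $s$ and $s'$.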
Let's look at the graph of the original, unmodified reward:
As a result, in addition to the theoretical guarantees, modifying the reward with the help of a potential function substantially improved the results, especially at the early stages of training. Of course, it may well be possible to pick better hyperparameters for training the agent (the random seed, gamma, and the other coefficients), but even so, reward shaping significantly increases the convergence rate of the model.
Afterword
Thank you for reading to the end! I hope you enjoyed this small practice-oriented excursion into reinforcement learning. It is clear that Mountain Car is a toy problem, but as we have seen, it can be hard to teach an agent even something that looks trivial from a human point of view.