Abstract: | Professional baseball places heavy emphasis on data collection and analysis, so every game generates a large volume of data available for analysis. Data mining is a computational technique for extracting key findings from very large data sets. Applied to professional baseball data, it can produce good results while avoiding the errors that manual analysis introduces. This study uses data mining techniques to predict the outcomes (win or loss) and scores of Major League Baseball (MLB) games.
The data cover all regular-season games played by the 30 MLB teams from 2000 to 2012. The input variables are each team's position-player and pitcher performance statistics, averaged over its previous ten games. First, Pearson product-moment correlation analysis was used to remove variables that were weakly correlated with winning and variables that exhibited multicollinearity, leaving a suitable set of input variables. The selected variables were then fed into a back-propagation network (BPN), a type of artificial neural network, to build the model. The first 100 games served as the training set and the remaining 62 games as the validation set. The predicted score of each game was then compared with the actual result and with the betting lines.
The home and away scores predicted by the model were compared with the over/under (totals), winning-margin, and run-line (handicap) betting lines of sports lotteries. The empirical results confirm that the proposed model achieves better prediction accuracy. Future researchers may change the input variables fed into the proposed model, which may further improve prediction accuracy. |
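
The data-preparation and evaluation steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: all function names are hypothetical, the correlation thresholds (0.1 and 0.8) are placeholder values that the abstract does not give, the tie-breaking rule for collinear variables is one possible choice, and the back-propagation network itself is omitted (any regressor that maps the selected variables to home and away scores could be slotted in between the split and the betting-line comparison).

```python
from math import sqrt


def prior_window_mean(values, window=10):
    """Mean of the `window` games before each game; None until enough history exists."""
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
        else:
            out.append(sum(values[i - window:i]) / window)
    return out


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no linear signal
    return sxy / sqrt(sxx * syy)


def select_features(features, target, min_target_r=0.1, max_pair_r=0.8):
    """Keep variables related to the target; drop ones collinear with a kept variable.

    Both thresholds are assumed values chosen for illustration only.
    """
    strength = {name: abs(pearson(col, target)) for name, col in features.items()}
    candidates = sorted(
        (name for name in features if strength[name] >= min_target_r),
        key=lambda name: strength[name],
        reverse=True,
    )
    kept = []
    for name in candidates:
        if all(abs(pearson(features[name], features[k])) <= max_pair_r for k in kept):
            kept.append(name)
    return kept


def chronological_split(games, n_train=100):
    """First `n_train` games for training, the rest for validation."""
    return games[:n_train], games[n_train:]


def total_call(pred_home, pred_away, total_line):
    """Over/under call implied by the predicted scores."""
    return "over" if pred_home + pred_away > total_line else "under"


def home_covers(pred_home, pred_away, home_handicap):
    """True if the home side covers; a handicap of -1.5 means it must win by 2 or more."""
    return pred_home + home_handicap > pred_away
```

Sorting candidates by the strength of their correlation with the target means that, when two variables are nearly collinear, the one more strongly related to the target is the one retained. The split is chronological rather than random so that validation games always come after the games used for training.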