In this project, we devised a novel machine-learning-based election forecasting model that predicted Pakistan’s 2018 General Election with the highest accuracy and won a nationwide data science competition. The model predicts the probability of winning for each candidate contesting the election. To estimate this probability for the individual candidates in a constituency, the model draws on an array of statistics from different data sources. Past election data is used to mine each party's demographic trends across districts, while Twitter and approval polls are used to capture current popularity levels. Then, leveraging Bayesian optimization, the model combines the probabilities from the different sources by ‘rigging’ (fixing in advance) the results for ten seats where the contest was expected to be one-sided. In contrast to existing models, which only predict the aggregate vote share of the political parties at the national level, our model also predicts the winning candidate for every National Assembly seat. The seat share of the political parties in the National Assembly was predicted with 83% accuracy. In 230 out of 270 constituencies, the actual winner was among the top two candidates predicted by the proposed technique. Our model produced the most accurate results for the 2018 election in Pakistan compared to all the opinion polls and surveys, and it was acknowledged by a leading public sector agency working in this domain.
Our main objective was to predict the winner of each constituency, so we developed a model that outputs, for each constituency, a vector of win probabilities with one entry per candidate. For instance, if a constituency has five candidates, the output of the model might look like $[0.2, 0.32, 0.43, 0.02, 0.03]$. Each data source gives one such probability vector for each constituency. We assumed the results for certain constituencies based on domain knowledge and employed Bayesian optimization to combine these vectors into the final result. Following the tradition of election forecasting models, we considered the win probability of a particular candidate as a function of three variables: election history, surveys, and popularity on social media, $$\vec{p_c} = f(\text{election history, surveys, social media}).$$ In contrast to traditional models, however, we predict results for each constituency, a considerably more challenging problem than estimating the overall vote share of the major political parties. We formulate our model as follows:
\begin{equation}\label{eq:main_model}
\begin{split}
\vec{p_c} &= \sum_{j=1}^{J} \alpha[j]\, h(j,c) + \sum_{k=1}^{K} \beta[k]\, s(k,c) + \gamma \vec{t} + \delta \vec{q} \\
w_c &= \operatorname*{arg\,max}(\vec{p_c}),
\end{split}
\end{equation}
where $\alpha[j]$, $\beta[k]$, $\gamma$, and $\delta$ are the weights applied to the individual probability vectors and $w_c$ is the predicted winner of constituency $c$.
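As a minimal sketch of the weighted combination in the model equation, the Python below assumes that $h(j,c)$ and $s(k,c)$ are the probability vectors from the $j$-th past election and the $k$-th survey, that $\vec{t}$ comes from Twitter, and that $\vec{q}$ is one more probability vector. These readings, the function names, and all numbers are illustrative assumptions, and the weights are hand-picked here rather than tuned with Bayesian optimization.

```python
def combine(history, surveys, tweets, extra, alpha, beta, gamma, delta):
    """Weighted sum of per-source probability vectors for one constituency.

    history / surveys: lists of probability vectors (one per past election / survey).
    tweets / extra:    single probability vectors (stand-ins for t and q).
    alpha / beta:      one weight per history / survey vector.
    gamma / delta:     scalar weights for tweets / extra.
    """
    n = len(tweets)
    combined = [0.0] * n
    # Each entry pairs a weight with the vector it multiplies.
    weighted = list(zip(alpha, history)) + list(zip(beta, surveys))
    weighted += [(gamma, tweets), (delta, extra)]
    for weight, vec in weighted:
        if len(vec) != n:
            raise ValueError("all vectors must have one entry per candidate")
        for i in range(n):
            combined[i] += weight * vec[i]
    return combined


def predict_winner(p):
    """Index of the candidate with the largest combined probability (arg max)."""
    return max(range(len(p)), key=lambda i: p[i])


# Hypothetical constituency with three candidates.
history = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]  # two past elections (assumed)
surveys = [[0.3, 0.5, 0.2]]                   # one survey (assumed)
tweets = [0.2, 0.6, 0.2]                      # social media vector (assumed)
extra = [0.6, 0.2, 0.2]                       # stand-in for q (assumed)

p_c = combine(history, surveys, tweets, extra,
              alpha=[0.2, 0.2], beta=[0.3], gamma=0.2, delta=0.1)
w_c = predict_winner(p_c)
```

Because the illustrative weights sum to one and every input vector sums to one, the combined vector is itself a probability vector; with other weights it would need renormalizing before being read as probabilities.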
We have leveraged three different types of data in this model: i) results of the past four elections, ii) public poll data from the last two years, and iii) tweets from the three weeks before the election. The past election data consists of each party's vote share in each constituency, along with information about the region. It is important to note that constituency names and boundaries change with every election, so constituency-level data cannot be used directly to measure a party's influence in a particular constituency. Therefore, we first converted this data to the district level using the regional information and then used it in the model.
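The constituency-to-district conversion described above can be sketched as follows. This is only an illustration under stated assumptions: it presumes that raw vote counts (not just shares) are available per constituency and that each constituency maps to exactly one district; the record layout and all names and numbers are hypothetical.

```python
from collections import defaultdict


def district_vote_share(records):
    """Aggregate constituency-level votes into district-level vote shares.

    records: iterable of (constituency, district, party, votes) tuples.
    Returns {district: {party: share}} where shares in a district sum to 1.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for _constituency, district, party, n_votes in records:
        # Constituency boundaries change between elections, so only the
        # district label is used as the grouping key.
        votes[district][party] += n_votes

    shares = {}
    for district, by_party in votes.items():
        total = sum(by_party.values())
        if total == 0:
            continue  # no votes recorded for this district; skip it
        shares[district] = {party: n / total for party, n in by_party.items()}
    return shares


# Hypothetical records: two constituencies that fall in the same district.
records = [
    ("Constituency 1", "District A", "Party X", 40000),
    ("Constituency 1", "District A", "Party Y", 60000),
    ("Constituency 2", "District A", "Party X", 50000),
    ("Constituency 2", "District A", "Party Y", 50000),
]
shares = district_vote_share(records)
```

Grouping by district keeps the history comparable across elections even when constituency names or boundaries differ from one election to the next.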
These two tables show the performance of the proposed model compared with the actual results and with polls conducted by traditional polling agencies.
This model won the election prediction challenge of Pakistan, held by Ignite, RedBuffer and DeepLinks.