Sentiment Analysis Model for Indonesia's 2024 Presidential Candidates Using SVM

The 2024 general election in Indonesia became a hot topic on social media, especially on Twitter, which has millions of active users. In this project, my Data Mining course teammates and I analyzed Twitter users' sentiments toward the presidential and vice-presidential candidates using the Support Vector Machine (SVM) method. The aim of this research is to classify sentiments as positive, negative, or neutral, thus providing insight into public perceptions of each candidate pair.

Dataset

(Dataset previews: anies, prabowo, ganjar, imin, gibran, mahfud)

The data was collected from Twitter during the period from 22 January to 20 March 2024, using scraping techniques focused on six keywords related to the names of the candidate pairs.

Steps

  1. Data Scraping

    (Screenshots: scraping using Nitter; scraping using Apify)

    Data was collected through scraping with the Nitter library and the Apify web platform, resulting in approximately 1300-1400 tweets per candidate after removing duplicates.
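    For a rough picture of this step, here is a minimal scraping sketch. It assumes the community ntscraper package stands in for the "Nitter library" (the Apify side ran through its web interface); the keywords and tweet counts are illustrative placeholders, not the project's exact configuration.

    ```python
    # Hedged sketch: scrape tweets per keyword via a Nitter instance.
    # Assumes "pip install ntscraper pandas"; keywords are illustrative.
    import pandas as pd
    from ntscraper import Nitter

    scraper = Nitter()
    keywords = ["anies", "prabowo", "ganjar"]  # subset of the six keywords

    frames = []
    for kw in keywords:
        result = scraper.get_tweets(kw, mode="term", number=1400)
        rows = [{"keyword": kw, "date": t["date"], "text": t["text"]}
                for t in result["tweets"]]
        frames.append(pd.DataFrame(rows))

    pd.concat(frames, ignore_index=True).to_csv("tweets_raw.csv", index=False)
    ```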


  2. Data Cleaning

    Several issues were identified during data preprocessing, including irrelevant data (out of context), duplicate records, and entries with null values. These problems occurred due to various factors, such as exploiters taking advantage of trending hashtags, Twitter buzzers generating noise, and usernames that are identical to the candidates' names.

    (Screenshots: removing irrelevant data; removing duplicates and null values)

    To address these issues, the data was cleaned by removing irrelevant entries, eliminating duplicate records, and handling null values appropriately.


    As a result, the number of records for each candidate was reduced:
    - Candidate 01: from 1371 to 1286
    - Candidate 02: from 1403 to 1199
    - Candidate 03: from 1406 to 1286
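    A minimal cleaning sketch with pandas follows; the column names and the sample relevance rule are assumptions for illustration (the project's actual relevance filtering was context-specific and partly manual).

    ```python
    # Hedged sketch: drop nulls, duplicates, and obviously irrelevant rows.
    import pandas as pd

    df = pd.read_csv("tweets_raw.csv")

    # Remove entries with null text and exact duplicate tweets.
    df = df.dropna(subset=["text"]).drop_duplicates(subset=["text"])

    # Illustrative relevance filter: keep tweets that mention the keyword
    # in the tweet body itself (a crude stand-in for the manual review).
    df = df[df.apply(lambda r: r["keyword"] in r["text"].lower(), axis=1)]

    df.to_csv("tweets_clean.csv", index=False)
    print(len(df), "records remain after cleaning")
    ```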


  3. Data Labeling

    The labeling process was performed on three sentiment classes: positive, negative, and neutral. Three labeling methods were applied: manual labeling, the VADER Lexicon tool, and a pretrained model from Hugging Face. Below are the results:

    Candidate 01
    label for candidate 01

    The labeling results for Candidate 01 show that both manual labeling and the VADER Lexicon indicate predominantly positive sentiments, while the Hugging Face model reveals a significant amount of negative sentiment.


    Candidate 02
    label for candidate 02

    For Candidate 02, the results indicate mostly negative sentiments with both manual labeling and the Hugging Face model, whereas the VADER Lexicon shows a predominance of positive sentiment.


    Candidate 03
    label for candidate 03

    As for Candidate 03, both manual labeling and the VADER Lexicon reflect mostly positive sentiments. However, the Hugging Face model presents a more balanced distribution, with neutral sentiment being slightly more prominent.


    After exploring and comparing the three techniques, manual labeling was chosen. It was considered more effective because it allows more accurate contextual understanding, handles ambiguity better, offers flexibility and customization for the project's specific needs, and produces higher-quality labels.
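    For reference, hedged sketches of the two automatic labeling routes are shown below. The Hugging Face model name is only an example of a publicly available Indonesian sentiment classifier, not necessarily the one used here; VADER's lexicon is English, so tweets would typically be translated before scoring.

    ```python
    # Hedged sketch of the VADER and Hugging Face labeling routes.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    from transformers import pipeline

    nltk.download("vader_lexicon", quiet=True)
    vader = SentimentIntensityAnalyzer()

    def vader_label(text):
        # VADER's conventional compound-score cutoffs at +/- 0.05.
        score = vader.polarity_scores(text)["compound"]
        if score >= 0.05:
            return "positive"
        if score <= -0.05:
            return "negative"
        return "neutral"

    # Example Indonesian sentiment model (an assumption, not the project's).
    hf = pipeline("sentiment-analysis",
                  model="w11wo/indonesian-roberta-base-sentiment-classifier")

    def hf_label(text):
        return hf(text)[0]["label"]  # "positive" / "neutral" / "negative"
    ```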


  4. Tokenizing

    Tokenizing splits text into smaller units called tokens. Several steps are performed before and after tokenization, which include the following (a minimal code sketch of the full pipeline appears at the end of this step):

    Casefolding: ensuring no null values, converting text to lowercase, and removing numbers, punctuation, emoticons, and extra spaces.

    Tokenizing: using word_tokenize from the nltk library to split sentences into lists of words.

    Filtering: removing unimportant words with StopWordRemoverFactory from Sastrawi.

    Stemming: converting words to their root forms using the Sastrawi library.

    Data Exploration: visualizing token frequencies and word clouds. Here are the results:

    Candidate 01

    wordcloud for candidate 01

    Candidate 02

    wordcloud for candidate 02

    Candidate 03

    wordcloud for candidate 03

    After exploration, we discovered several words that provided little meaningful information. As a result, additional words such as "memang, ia, pas, kok, sih, kan, dr, si" and others were added to the stopword list.

    extended stopwords
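    Below is a minimal sketch of the full preprocessing pipeline from this step, assuming nltk and Sastrawi are installed; the regex rules and the sample sentence are illustrative.

    ```python
    # Hedged sketch: casefolding -> tokenizing -> filtering -> stemming.
    import re
    import nltk
    from nltk.tokenize import word_tokenize
    from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

    nltk.download("punkt", quiet=True)

    # Sastrawi's Indonesian stopword list, extended with the
    # low-information words found during exploration.
    stopwords = set(StopWordRemoverFactory().get_stop_words())
    stopwords |= {"memang", "ia", "pas", "kok", "sih", "kan", "dr", "si"}

    stemmer = StemmerFactory().create_stemmer()

    def preprocess(text):
        # Casefolding: lowercase; strip numbers, punctuation, extra spaces.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        text = re.sub(r"\s+", " ", text).strip()
        tokens = word_tokenize(text)                        # tokenizing
        tokens = [t for t in tokens if t not in stopwords]  # filtering
        return [stemmer.stem(t) for t in tokens]            # stemming

    print(preprocess("Debat capres 2024 memang seru banget!"))
    ```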

  5. Feature Extraction

    For feature extraction, the method used is TF-IDF (Term Frequency-Inverse Document Frequency), which converts text data into numerical values based on word importance. TF measures how often a word appears in a document, while IDF reduces the weight of words that occur frequently across many documents, emphasizing more meaningful terms.

    Data before TF-IDF

    before tf-idf

    Data after TF-IDF

    after tf-idf
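    A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer follows; preprocessed_tweets is a hypothetical variable holding the token lists from the previous step.

    ```python
    # Hedged sketch: turn token lists into a TF-IDF feature matrix.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [" ".join(tokens) for tokens in preprocessed_tweets]  # hypothetical input

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse matrix: tweets x terms
    print(X.shape)
    ```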

  6. Modeling Using SVM Algorithm

    Modeling is performed using Support Vector Machine (SVM). SVM works by finding the optimal hyperplane that separates data points from different classes with the largest possible margin. It is effective for sentiment analysis because it performs well with high-dimensional data and helps avoid overfitting, making it suitable for text classification tasks.

    modeling
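    A baseline modeling sketch follows, assuming X is the TF-IDF matrix and y holds the manual labels; the 80/20 split and random seed are assumptions.

    ```python
    # Hedged sketch: train a baseline RBF-kernel SVM classifier.
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    model = SVC(kernel="rbf")
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))
    ```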

    After building the model, hyperparameter tuning is conducted: the process of searching for the combination of parameter values that gives the best model performance. This is done using GridSearchCV, searching over C and gamma with the kernel fixed to rbf.

    hyperparameter tuning
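    A sketch of the tuning step follows; the candidate values for C and gamma are illustrative, not the project's exact grid.

    ```python
    # Hedged sketch: grid search over C and gamma with the RBF kernel.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [0.1, 1, 10, 100],
        "gamma": [1, 0.1, 0.01, 0.001],
        "kernel": ["rbf"],
    }

    grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
    grid.fit(X_train, y_train)

    print("Best parameters:", grid.best_params_)
    best_model = grid.best_estimator_
    ```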


  7. Model Evaluation

    Model evaluation is performed using cross-validation scores and a confusion matrix.

    Cross-Validation, which tests the model’s consistency and generalization on unseen data. The average accuracy from cross-validation indicates how well the model generalizes patterns from the training data to the test data. Meanwhile, the standard deviation of accuracy shows how consistent and stable the model's performance is.

    cross-validation score
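    A cross-validation sketch using scikit-learn:

    ```python
    # Hedged sketch: mean accuracy reflects generalization; the standard
    # deviation reflects stability across folds.
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(best_model, X, y, cv=5, scoring="accuracy")
    print("Mean accuracy:", scores.mean())
    print("Std deviation:", scores.std())
    ```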

    Confusion Matrix, a table used to evaluate the performance of a model by comparing predicted values with actual values. From the matrix, several metrics can be derived:

    Accuracy measures how often the model makes correct predictions. Precision indicates how many of the predicted positive cases are truly relevant. Recall measures how well the model identifies all actual positive cases, while F1 Score represents the balance between precision and recall.

    confusion matrix
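    A sketch of computing the confusion matrix and its derived metrics on the held-out test set:

    ```python
    # Hedged sketch: confusion matrix plus per-class precision/recall/F1.
    from sklearn.metrics import confusion_matrix, classification_report

    y_pred = best_model.predict(X_test)
    print(confusion_matrix(y_test, y_pred,
                           labels=["positive", "neutral", "negative"]))
    print(classification_report(y_test, y_pred))
    ```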

    Based on the evaluation results, the model demonstrates good performance in predicting sentiment.