Sentiment Analysis of Vaccine Enthusiasm in Indonesia on Twitter

Vaccines are one way to prevent coronavirus infection, although they are not 100% effective. However, the implementation of vaccination in Indonesia is still controversial. People give their opinions directly or through social media such as Twitter. Twitter is one of the social media platforms most frequently used as a data source in data mining research. Tweets can be collected in various ways, for example through an API connected to tools such as Python or RapidMiner, or through the Drone Emprit Academy portal. The data obtained are then preprocessed using case folding, cleaning, tokenizing, filtering, and stemming. After that, the tweets are classified using the Naïve Bayes method and the model is evaluated. Naïve Bayes is a classification method that predicts the probability of a class, so that it can produce decisions based on training data. Naïve Bayes is currently one of the most frequently used methods in sentiment analysis and is regarded as one of the most accurate. This study obtained an accuracy of 79%.


Introduction
Twitter is one of the social media platforms with the largest number of users, spanning a very diverse age range [1]. Tweets are widely used in research such as sentiment analysis [2], Social Network Analysis (SNA) [3][4], classification [5], clustering [6], and so on. Twitter also has a trending topic feature that media such as television and news portals use as a source of news [7].
News related to vaccination is widely discussed. Some people accept the COVID-19 vaccine and others reject it [8]. Acceptance and rejection of vaccines on social media is also a form of online public participation directed at the government [9]. Several vaccines are used to prevent COVID-19 in Indonesia, such as AstraZeneca, Sinopharm, Moderna, Pfizer-BioNTech, and Sinovac Biotech Ltd [10] [11]. Because of the many arguments for and against vaccination, this study conducts sentiment analysis to identify positive, negative, and neutral tweets. Sentiment analysis is the process of automatically extracting, understanding, and processing unstructured text data to obtain the sentiment information contained in an opinion sentence [12]. Several previous studies have addressed sentiment analysis, measuring the accuracy produced by methods such as Support Vector Machine (SVM) [13], Naïve Bayes, K-Nearest Neighbor [14], the C4.5 algorithm, Random Forest, Decision Tree [15], and so on.
Franciska and Arief [16] analyzed customer sentiment toward the online store jd.id using the Naïve Bayes classifier and obtained an accuracy of 96.44%. Billy, Helen, and Enda [17] conducted a product analysis using the Naïve Bayes method and obtained an accuracy of 90%. Another study [18], which also used the Naïve Bayes classifier, obtained an accuracy of 70%.
This study also uses the Naïve Bayes classifier, with the aim of obtaining high accuracy in sentiment analysis of enthusiasm for COVID-19 vaccination in Indonesia.

Research Methods
The research follows a methodology flow to make the research process easier to carry out. Preprocessing consists of several stages. Case folding converts all letters in the document into lowercase; only the letters 'a' to 'z' are accepted, and all other characters are omitted and treated as delimiters [19]. Cleaning removes delimiters, punctuation marks, URLs, and emoticons [20]. Tokenizing makes a sentence easier to process by breaking it into words [21]. Filtering keeps only the words that carry important meaning in the training data [22]. Finally, stemming maps words with different morphology onto one basic form (stem) [23]. After preprocessing, the next step is word weighting.
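The five preprocessing stages above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list and the suffix rules are hypothetical English stand-ins, because the study does not list its stopwords or name the stemmer it used, and all function names are made up for this sketch.

```python
import re

# Hypothetical stopword list for illustration; the study does not publish its own.
STOPWORDS = {"the", "is", "are", "a", "an", "and", "to", "of", "in", "i"}

# Toy suffix rules standing in for a real stemmer (an assumption, not the study's stemmer).
SUFFIXES = ("ing", "ed", "s")


def case_fold(text):
    """Case folding: convert every letter to lowercase."""
    return text.lower()


def clean(text):
    """Cleaning: drop URLs and mentions, then replace every non-letter with a space."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user mentions
    return re.sub(r"[^a-z\s]", " ", text)      # punctuation, digits, emoticons


def tokenize(text):
    """Tokenizing: split the sentence into words on whitespace."""
    return text.split()


def filter_tokens(tokens):
    """Filtering: keep only words that are not in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def stem(token):
    """Stemming: strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run all five stages in the order described in the text."""
    tokens = tokenize(clean(case_fold(text)))
    return [stem(token) for token in filter_tokens(tokens)]


print(preprocess("Vaccines are working! https://t.co/abc @user"))
```

For Indonesian tweets, the stopword list and the stemmer would need to be replaced with Indonesian-language resources; the order of the stages stays the same.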
Weighting converts words into numbers (word vectors) [24]. The weighting is done with Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is a method that assigns a weight to the relationship between a word (term) and a document or comment [25]. After weighting the words, the next step is to carry out the sentiment analysis process, and the last step is to evaluate the model.
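To make the weighting concrete, the sketch below computes TF-IDF by hand on three made-up tokenized documents, assuming the textbook formulation: term frequency (count divided by document length) multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the example documents are hypothetical. Library implementations may differ; scikit-learn's TfidfVectorizer, for example, smooths the IDF term and normalizes each row by default, so its values will not match these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up documents for illustration, not the study's data.
docs = [
    ["vaccine", "safe"],
    ["vaccine", "booster", "booster"],
    ["vaccine", "fear"],
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus, 'vaccine' appears in every document, so its weight is zero, while 'booster' appears twice in a single document and receives the highest weight. This is the behaviour that lets TF-IDF down-weight words that do not help tell documents apart.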

Results and Discussion
The weighting process is carried out after preprocessing. In this research, word vectors and word weights are produced with TfidfVectorizer from the Python 3 library scikit-learn. The resulting vector representation covers 700 tweets and 2153 words. The results of the TF-IDF word weighting can be seen in Figure 2. After data preprocessing and weighting, a model is built to classify the test data. The classification is also carried out with scikit-learn.
The sentiment analysis results fall into negative, neutral, and positive opinion categories. For details, see Table 1. The results showed that the "Positive" category was neutral and negative. The data are visualized in a bar chart and a word cloud as follows. Figure 4 shows the overall word cloud of the imported data. The word 'vaccine' is the most frequent, followed by 'Pfizer', 'Moderna', 'covid', 'booster', 'vaccination', and 'Sinovac'.
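A minimal sketch of this weighting and classification step with scikit-learn might look as follows. The tweets and labels are made up for illustration, and MultinomialNB is an assumption: the text names Naïve Bayes but does not state which variant or which parameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up example tweets and labels, not the study's data.
train_texts = [
    "vaccine is safe and i am happy",
    "grateful for my booster dose today",
    "vaccine made me sick and angry",
    "i refuse this dangerous vaccine",
    "vaccination schedule announced for monday",
    "clinic opens at nine for vaccination",
]
train_labels = [
    "positive", "positive",
    "negative", "negative",
    "neutral", "neutral",
]

# TF-IDF weighting followed by a Naive Bayes classifier
# (MultinomialNB is an assumed variant, see the note above).
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Document-term matrix: one row per tweet, one column per distinct term.
matrix = model.named_steps["tfidfvectorizer"].transform(train_texts)
print(matrix.shape)

# Classify unseen text with the fitted pipeline.
new_texts = ["happy after my booster", "this vaccine is dangerous"]
predicted = model.predict(new_texts)
print(list(predicted))
```

Wrapping both steps in one pipeline keeps the vocabulary learned from the training data and applies the same weighting to any text classified later.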

Figure 4. Overall word cloud
After the sentiment analysis process, the next step is to evaluate the model. Figure 5 shows the evaluation of the model.
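As a rough illustration of how such an evaluation can be computed, the sketch below derives accuracy and a confusion matrix with scikit-learn's metrics module. The true and predicted labels are hypothetical and are not the study's results; the text does not state which evaluation metrics or data split were used beyond accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["negative", "neutral", "positive"]

# Hypothetical true and predicted labels for ten test tweets (not the study's data).
y_true = ["positive", "positive", "positive", "positive", "neutral",
          "neutral", "neutral", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "positive", "neutral", "neutral",
          "neutral", "neutral", "negative", "negative", "positive"]

# Accuracy: the share of tweets whose predicted label matches the true label.
accuracy = accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes,
# both in the order given by `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(accuracy)
print(matrix)
```

Accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test tweets, so the matrix also shows which classes the errors fall into.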

Conclusion
The dataset in this research consists of 700 tweets containing 2153 words. The evaluation of the model obtained an accuracy of 79%. The results suggest that the Naïve Bayes method remains one of the better choices for sentiment analysis. Although this research does not use feature selection, Naïve Bayes still produces good accuracy.