Clustering and Analysis of Tweets Related to Petrobras
DOI: https://doi.org/10.12957/cadinf.2024.82401
Abstract
This study aimed to cluster and analyze tweets associated with Petrobras, exploring their meaning and the profiles of the users who posted them in order to understand their impact on financial markets. The workflow comprised data collection from the Twitter (now X) API, preprocessing of tweets with Python libraries, word vectorization via Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), Principal Component Analysis (PCA) to reduce the dimensionality of the resulting matrix, and K-means clustering. A total of 840 preprocessed tweets were clustered and analyzed for patterns related to Petrobras. The initial analysis, performed without dimensionality reduction, identified five clusters with differing characteristics, while the subsequent PCA-based analysis yielded three clusters with contrasting themes. In the PCA-based analysis, cluster 0 grouped tweets about the market and the economy, while cluster 1 was related to political concerns. Limitations include reliance on publicly available Twitter data, constraints arising from the quantity and nature of the tweets, and potential biases in sentiment analysis due to informal language and sarcasm. The research underscores the potential of unsupervised machine learning techniques for analyzing sentiment and user profiles related to financial markets. Insights derived from tweet clustering could help investors gauge market sentiment.
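The vectorization, dimensionality reduction, and clustering steps described above can be illustrated with a minimal sketch. It assumes scikit-learn, which the abstract does not name (it only mentions Python libraries). The function `cluster_tweets`, its parameters, and the sample tweets are hypothetical and are not taken from the study's code or data.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_tweets(texts, n_clusters=3, n_components=2, seed=0):
    """Return one cluster label per tweet (TF-IDF -> PCA -> K-means).

    Hypothetical helper: names and default values are assumptions.
    """
    # TF-IDF turns each tweet into a vector of weighted term frequencies.
    tfidf = TfidfVectorizer().fit_transform(texts)
    # Convert the sparse TF-IDF matrix to a dense array before PCA.
    dense = tfidf.toarray()
    # PCA projects the vectors onto the n_components directions of highest variance.
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(dense)
    # K-means assigns each reduced vector to the nearest of n_clusters centroids.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(reduced)


# Made-up, already-preprocessed tweets used only to exercise the sketch.
sample_tweets = [
    "petrobras shares rise market",
    "petrobras stock market dividend",
    "market investors buy petrobras shares",
    "government minister petrobras policy",
    "president government fuel price policy",
    "congress government petrobras president",
]
labels = cluster_tweets(sample_tweets, n_clusters=2, n_components=2)
print(labels)
```

Skipping the PCA step and passing the TF-IDF matrix directly to K-means would correspond to the analysis without dimensionality reduction. The number of clusters is an input here; how the study chose five and three clusters is not stated in the abstract.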