In the ever-evolving world of social media, hashtags have become a cornerstone in shaping digital conversations. They are not mere labels but pivotal tools for categorizing content and identifying the pulse of social narratives. However, this utility comes with a challenge: the dynamic and polysemous nature of hashtags. This complexity is where the innovative approach of “Hashtag Sense Clustering Based on Temporal Similarity” comes into play.
The challenges of hashtags on Twitter (X)
Traditionally, hashtags have been used as simple markers to categorize posts or as symbols of community affiliation. But their usage varies greatly, often leading to ambiguity. The same hashtag can represent different topics at different times, and conversely, various hashtags can denote the same subject. This polymorphic nature, coupled with the spontaneous creation of new hashtags, makes it challenging to analyze them effectively using standard linguistic tools.
The SAX* algorithm, an extension of SAX (Symbolic Aggregate approXimation), is a method developed to decipher the complex world of hashtags. This approach clusters hashtags not by their linguistic context but by their temporal co-occurrence and usage patterns. The underlying hypothesis is simple yet powerful: hashtags exhibiting similar temporal behavior are likely to be semantically connected.
The Mechanism of SAX*
What follows is a practical implementation of the algorithm, which was created by Giovanni Stilo and Paola Velardi.
- Temporal Slicing and Symbolic Conversion: The algorithm begins by segmenting the temporal series of hashtags into predefined windows. These segments are then normalized and transformed into symbolic strings (a minimal sketch of this step follows the list).
- Pattern Recognition: Using a set of predefined keywords, the algorithm learns common usage patterns and filters out hashtags that do not conform to these patterns.
- Hierarchical Clustering: The selected hashtags are clustered in each window using a hierarchical clustering algorithm, based on the similarity of their temporal patterns.
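To make the first step concrete, here is a minimal sketch of the slicing-and-conversion idea in plain numpy. The hourly counts, the window, and the four-segment split are illustrative assumptions; the breakpoints are the standard-normal cuts for a three-letter alphabet, as in standard SAX.

import numpy as np

# Hypothetical hourly counts for one hashtag inside one temporal window
counts = np.array([3, 4, 2, 40, 55, 48, 5, 4, 3, 2, 35, 30], dtype=float)

# 1. Z-normalize the window so only the shape of the series matters
normalized = (counts - counts.mean()) / counts.std()

# 2. Reduce to fixed-length segments (PAA) and map each segment mean to a letter
segments = np.array_split(normalized, 4)
breakpoints = [-0.43, 0.43]  # standard-normal cuts for a 3-letter alphabet
alphabet = 'abc'
sax_string = ''.join(
    alphabet[np.searchsorted(breakpoints, seg.mean())] for seg in segments
)
print(sax_string)  # prints: acab (the window's symbolic signature)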
Monitoring Social Media Campaigns
Imagine a multinational company launching a global marketing campaign with a specific hashtag.
However, unbeknownst to them, this hashtag is already in use, carrying different connotations in various regions. Using the SAX* algorithm, the company can analyze the temporal patterns of this hashtag, identifying where and when it aligns with their campaign message and where it diverges. This insight allows them to tailor their marketing strategies, avoid potential PR crises, and harness the true power of their social media reach.
Indeed, the SAX* algorithm represents a significant advancement in the field of social media analytics, offering a novel way to understand the complex and dynamic nature of hashtag usage. By focusing on temporal patterns rather than just content, it opens up new avenues for businesses and researchers alike to gauge public opinion, monitor brand presence, and stay ahead in the digital conversation. The world of hashtags is no longer just about what is being said, but also when and how frequently it is being said, revealing deeper insights into the digital zeitgeist.
Monitoring climate change hashtags for a big coffee company
Creating a Python implementation of the SAX* algorithm for a specific use case, such as monitoring a climate-change marketing campaign for a coffee company, involves several steps.
First, collect data from Twitter streams related to the campaign. This involves using a library like tweepy to connect to the Twitter API and fetch tweets containing the relevant hashtags.
import tweepy
import pandas as pd

# Twitter API credentials (replace with your own)
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# Set up the tweepy client
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Collect tweets (search_tweets is the tweepy v4 name of the search endpoint)
tweets = api.search_tweets(q='#YourCampaignHashtag', count=100)

# Load the collected tweets into a DataFrame for later analysis
tweets_df = pd.DataFrame(
    [{'text': t.text, 'created_at': t.created_at} for t in tweets]
)
In the clustering part of the SAX* algorithm, the symbolic strings representing the temporal patterns of hashtags are grouped into clusters. This is done using a hierarchical clustering algorithm.
The key steps are:
1) Linkage: The `linkage` function from `scipy.cluster.hierarchy` creates a hierarchical clustering using the SAX representations.
It computes distances between pairs of symbolic strings. The ‘ward’ method is a common choice, which minimizes the variance of clusters being merged.
2) Forming Clusters: The `fcluster` function forms flat clusters from the hierarchical clusters created by `linkage`. The `t` parameter specifies a threshold to define the distance at which clusters should be separated. Clusters are formed by cutting the dendrogram (tree diagram used to illustrate the arrangement of the clusters produced by hierarchical clustering) at this threshold.
The result is a set of clusters, each containing hashtags that exhibit similar temporal usage patterns. These clusters can then be analyzed to understand how different hashtags related to the marketing campaign behave over time, such as identifying which hashtags are used together frequently or during specific events or periods.
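Here is a minimal sketch of these two steps. The SAX strings below are hypothetical, the threshold t=1.5 is an arbitrary choice, and the distance (Euclidean over ordinal-encoded letters) is a simplification assumed for illustration; SAX* defines its own distance over symbolic strings.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical SAX strings for five hashtags (equal length, alphabet a-c)
sax_strings = {
    '#climateaction': 'aabbcc',
    '#greenenergy': 'aabbcb',
    '#coffee': 'ccbbaa',
    '#sustainability': 'aabbcc',
    '#espresso': 'ccbaaa',
}

# Encode each letter as its ordinal position so strings become numeric vectors
vectors = np.array([[ord(ch) - ord('a') for ch in s] for s in sax_strings.values()])

# Hierarchical clustering: 'ward' minimizes the variance of merged clusters
Z = linkage(vectors, method='ward')

# Cut the dendrogram at distance t to form flat clusters
labels = fcluster(Z, t=1.5, criterion='distance')
for tag, label in zip(sax_strings, labels):
    print(tag, '-> cluster', label)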
The marketing campaign
Let’s implement the code for an analysis of hashtags related to climate change for a coffee company’s marketing campaign. The company wants to track which terms have been related to climate change over the last three months and build a marketing campaign around its sustainability efforts.
Extract Keywords from Internet Articles on Climate Change and Sustainability
First, we need to gather and process Internet articles on climate change. Let’s use the requests and BeautifulSoup libraries for web scraping and nltk for natural language processing. Note: replace article_urls with the URLs of the articles you want to scrape.
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from collections import Counter

# Ensure you have the necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# List of URLs to scrape (placeholders; add up to 100 URLs)
article_urls = ['http://example.com/article1', 'http://example.com/article2']

def scrape_article(url):
    """Scrape the paragraph text from an article."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return ' '.join(p.text for p in soup.find_all('p'))

def extract_keywords(text, lang='english'):
    """Count the non-stopword tokens in a text."""
    words = nltk.word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    stop_words = set(stopwords.words(lang))
    return Counter(word for word in words if word not in stop_words)

# Scrape and analyze the articles, accumulating keyword counts
all_keywords = Counter()
for url in article_urls:
    article_text = scrape_article(url)
    all_keywords.update(extract_keywords(article_text))

# Top 10 keywords across all articles (plain words, used for the Twitter queries below)
top_keywords = [word for word, count in all_keywords.most_common(10)]
print(top_keywords)
Twitter Data Analysis
This part involves using Twitter’s API to fetch tweets containing the extracted keywords, which requires the tweepy library (the code below calls the v2 recent-search endpoint). Due to the access restrictions on Twitter’s paid API tiers, the following is a conceptual outline:
import tweepy

# Set up the Twitter API v2 client
client = tweepy.Client(bearer_token='YOUR_BEARER_TOKEN')

def fetch_tweets(keyword, lang):
    """Fetch recent tweets containing the given keyword in the given language."""
    query = f'{keyword} lang:{lang}'
    return client.search_recent_tweets(
        query=query,
        tweet_fields=['context_annotations', 'created_at'],
        max_results=100,
    )

# Fetch tweets for each keyword and language
tweets = {}
for keyword in top_keywords:
    for lang in ['en', 'de', 'it', 'fr']:  # English, German, Italian, French
        tweets[(keyword, lang)] = fetch_tweets(keyword, lang)
Implementing the SAX Algorithm
import pandas as pd
import numpy as np
from saxpy.sax import ts_to_string
from saxpy.alphabet import cuts_for_asize
from saxpy.znorm import znorm
from saxpy.paa import paa
from sklearn.cluster import KMeans

# Example hashtag data (replace with actual data; see the web parsing above
# for the definition of the topics used)
hashtags_data = [
    {'hashtag': '#climateaction', 'timestamp': '2023-01-01'},
    {'hashtag': '#greenenergy', 'timestamp': '2023-01-01'},
    # ... more data
]

# Convert to a DataFrame indexed by timestamp
df = pd.DataFrame(hashtags_data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# Count occurrences per day for each hashtag
time_series_data = df.groupby(['hashtag', pd.Grouper(freq='D')]).size().unstack(fill_value=0)

def convert_to_sax(series, alphabet_size=3, sax_length=10):
    """Convert a time series to its SAX string representation."""
    series_normalized = znorm(np.asarray(series, dtype=float))
    series_paa = paa(series_normalized, sax_length)  # reduce to sax_length segments
    cuts = cuts_for_asize(alphabet_size)
    return ts_to_string(series_paa, cuts)

# Apply SAX conversion to each hashtag's daily-count series
sax_data = time_series_data.apply(lambda row: convert_to_sax(row.values), axis=1)

# Encode the SAX strings as ordinal vectors so KMeans can operate on them
sax_df = pd.DataFrame({'hashtag': sax_data.index, 'sax_string': sax_data.values})
X = np.array([[ord(ch) - ord('a') for ch in s] for s in sax_df['sax_string']])

# Use KMeans for clustering (choose an appropriate number of clusters)
kmeans = KMeans(n_clusters=10, random_state=0)
sax_df['cluster'] = kmeans.fit_predict(X)

# Group hashtags by cluster
clustered_hashtags = sax_df.groupby('cluster')['hashtag'].apply(list)
Creating a Marketing Campaign
Based on the clustered hashtags, we developed campaign topics and carried out the marketing strategy ideation and execution:
- Identified Main Themes: Looking at the hashtags in each cluster, we identified the main themes. Each cluster represented a group of hashtags with similar temporal usage patterns.
- Developed Campaign Topics: For each cluster, we created a campaign topic that encapsulates its theme. For example, if a cluster contains hashtags like #greenenergy and #renewables, the topic could be “Innovations in Renewable Energy”.
- Created a Content Strategy: Developed specific content for each topic (social media posts, videos).
- Timed the Campaign: Used the temporal patterns observed in the SAX analysis to release campaign materials during peak interest periods (a minimal sketch follows this list).
- Engagement and Feedback: Monitored engagement with the campaign and adjusted topics or content strategies based on feedback and performance metrics.
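As an illustration of the timing step, here is a minimal sketch that reuses the time_series_data and clustered_hashtags objects from the SAX code above to find each cluster’s peak-activity day; the daily granularity is an assumption carried over from that example.

# Find the peak-activity day for each cluster of hashtags
for cluster_id, tags in clustered_hashtags.items():
    daily_totals = time_series_data.loc[tags].sum(axis=0)  # total daily counts for the cluster
    peak_day = daily_totals.idxmax()
    print(f'Cluster {cluster_id}: schedule content around {peak_day.date()}')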