In the ever-evolving world of social media, hashtags have become a cornerstone in shaping digital conversations. They are not mere labels but pivotal tools for categorizing content and identifying the pulse of social narratives. However, this utility comes with a challenge: the dynamic and polysemous nature of hashtags. This complexity is where the innovative approach of “Hashtag Sense Clustering Based on Temporal Similarity” comes into play.
The challenges of hashtags on Twitter (X)
Traditionally, hashtags have been used as simple markers to categorize posts or as symbols of community affiliation. But their usage varies greatly, often leading to ambiguity. The same hashtag can represent different topics at different times, and conversely, various hashtags can denote the same subject. This polymorphic nature, coupled with the spontaneous creation of new hashtags, makes it challenging to analyze them effectively using standard linguistic tools.
The SAX* algorithm, an extension of SAX (Symbolic Aggregate approXimation), is a method developed to decipher the complex world of hashtags. This approach clusters hashtags not by their linguistic context but by their temporal co-occurrence and usage patterns. The underlying hypothesis is simple yet powerful: hashtags exhibiting similar temporal behavior are likely to be semantically connected.
The Mechanism of SAX*
What follows is a practical implementation of the algorithm, which was created by Giovanni Stilo and Paola Velardi.
- Temporal Slicing and Symbolic Conversion: The algorithm begins by segmenting the temporal series of hashtags into predefined windows. These segments are then normalized and transformed into symbolic strings (a minimal sketch of this step follows the list).
- Pattern Recognition: Using a set of predefined keywords, the algorithm learns common usage patterns and filters out hashtags that do not conform to these patterns.
- Hierarchical Clustering: The selected hashtags are clustered in each window using a hierarchical clustering algorithm, based on the similarity of their temporal patterns.
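To make the first step concrete, here is a minimal sketch of the slicing-and-conversion idea in plain numpy. The hourly counts, the window, and the four-segment split are illustrative assumptions; the breakpoints are the standard-normal cuts for a three-letter alphabet, as in standard SAX.

import numpy as np

# Hypothetical hourly counts for one hashtag inside one temporal window
counts = np.array([3, 4, 2, 40, 55, 48, 5, 4, 3, 2, 35, 30], dtype=float)

# 1. Z-normalize the window so only the shape of the series matters
normalized = (counts - counts.mean()) / counts.std()

# 2. Reduce to fixed-length segments (PAA) and map each segment mean to a letter
segments = np.array_split(normalized, 4)
breakpoints = [-0.43, 0.43]  # standard-normal cuts for a 3-letter alphabet
alphabet = 'abc'
sax_string = ''.join(
    alphabet[np.searchsorted(breakpoints, seg.mean())] for seg in segments
)
print(sax_string)  # prints: acab (the window's symbolic signature)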
Monitoring Social Media Campaigns
Imagine a multinational company launching a global marketing campaign with a specific hashtag.
However, unbeknownst to them, this hashtag is already in use, carrying different connotations in various regions. Using the SAX* algorithm, the company can analyze the temporal patterns of this hashtag, identifying where and when it aligns with their campaign message and where it diverges. This insight allows them to tailor their marketing strategies, avoid potential PR crises, and harness the true power of their social media reach.
Indeed, the SAX* algorithm represents a significant advancement in the field of social media analytics, offering a novel way to understand the complex and dynamic nature of hashtag usage. By focusing on temporal patterns rather than just content, it opens up new avenues for businesses and researchers alike to gauge public opinion, monitor brand presence, and stay ahead in the digital conversation. The world of hashtags is no longer just about what is being said, but also when and how frequently it is being said, revealing deeper insights into the digital zeitgeist.
Monitoring climate change hashtags for a big coffee company
Creating a Python implementation of the SAX* algorithm for a specific use case, such as monitoring a climate-change marketing campaign for a coffee company, involves several steps.
First, collect data from Twitter streams related to the campaign. This involves using a library like tweepy to connect to the Twitter API and fetch tweets containing the relevant hashtags.
import tweepy
import pandas as pd

# Twitter API credentials (replace with your own)
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# Set up the tweepy client
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Collect tweets (search_tweets is the tweepy v4 name of the search endpoint)
tweets = api.search_tweets(q='#YourCampaignHashtag', count=100)

# Load the collected tweets into a DataFrame for later analysis
tweets_df = pd.DataFrame(
    [{'text': t.text, 'created_at': t.created_at} for t in tweets]
)
In the clustering part of the SAX* algorithm, the symbolic strings representing the temporal patterns of hashtags are grouped into clusters. This is done using a hierarchical clustering algorithm.
The key steps are:
1) Linkage: The `linkage` function from `scipy.cluster.hierarchy` creates a hierarchical clustering using the SAX representations.
It computes distances between pairs of symbolic strings. The ‘ward’ method is a common choice, which minimizes the variance of clusters being merged.
2) Forming Clusters: The `fcluster` function forms flat clusters from the hierarchical clusters created by `linkage`. The `t` parameter specifies a threshold to define the distance at which clusters should be separated. Clusters are formed by cutting the dendrogram (tree diagram used to illustrate the arrangement of the clusters produced by hierarchical clustering) at this threshold.
The result is a set of clusters, each containing hashtags that exhibit similar temporal usage patterns. These clusters can then be analyzed to understand how different hashtags related to the marketing campaign behave over time, such as identifying which hashtags are used together frequently or during specific events or periods.
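Here is a minimal sketch of these two steps. The SAX strings below are hypothetical, the threshold t=1.5 is an arbitrary choice, and the distance (Euclidean over ordinal-encoded letters) is a simplification assumed for illustration; SAX* defines its own distance over symbolic strings.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical SAX strings for five hashtags (equal length, alphabet a-c)
sax_strings = {
    '#climateaction': 'aabbcc',
    '#greenenergy': 'aabbcb',
    '#coffee': 'ccbbaa',
    '#sustainability': 'aabbcc',
    '#espresso': 'ccbaaa',
}

# Encode each letter as its ordinal position so strings become numeric vectors
vectors = np.array([[ord(ch) - ord('a') for ch in s] for s in sax_strings.values()])

# Hierarchical clustering: 'ward' minimizes the variance of merged clusters
Z = linkage(vectors, method='ward')

# Cut the dendrogram at distance t to form flat clusters
labels = fcluster(Z, t=1.5, criterion='distance')
for tag, label in zip(sax_strings, labels):
    print(tag, '-> cluster', label)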
The marketing campaign
Let’s implement the code for an analysis of hashtags related to climate change for a coffee company’s marketing campaign. The company wants to track which terms have been related to climate change over the last three months and build a marketing campaign around its sustainability efforts.
Extract Keywords from Internet Articles on Climate Change and Sustainability
First, we need to gather and process Internet articles on climate change. Let’s use the requests and BeautifulSoup libraries for web scraping and nltk for natural language processing. Note: replace article_urls with the URLs of the articles you want to scrape.
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from collections import Counter

# Ensure you have the necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# List of URLs to scrape (placeholders; add up to 100 URLs)
article_urls = ['http://example.com/article1', 'http://example.com/article2']

def scrape_article(url):
    """Scrape the paragraph text from an article."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return ' '.join(p.text for p in soup.find_all('p'))

def extract_keywords(text, lang='english'):
    """Count the non-stopword tokens in a text."""
    words = nltk.word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    stop_words = set(stopwords.words(lang))
    return Counter(word for word in words if word not in stop_words)

# Scrape and analyze the articles, accumulating keyword counts
all_keywords = Counter()
for url in article_urls:
    article_text = scrape_article(url)
    all_keywords.update(extract_keywords(article_text))

# Top 10 keywords across all articles (plain words, used for the Twitter queries below)
top_keywords = [word for word, count in all_keywords.most_common(10)]
print(top_keywords)
Twitter Data Analysis
This part involves using Twitter’s API to fetch tweets containing the extracted keywords, which requires the tweepy library (the code below calls the v2 recent-search endpoint). Due to the access restrictions on Twitter’s paid API tiers, the following is a conceptual outline:
import tweepy

# Set up the Twitter API v2 client
client = tweepy.Client(bearer_token='YOUR_BEARER_TOKEN')

def fetch_tweets(keyword, lang):
    """Fetch recent tweets containing the given keyword in the given language."""
    query = f'{keyword} lang:{lang}'
    return client.search_recent_tweets(
        query=query,
        tweet_fields=['context_annotations', 'created_at'],
        max_results=100,
    )

# Fetch tweets for each keyword and language
tweets = {}
for keyword in top_keywords:
    for lang in ['en', 'de', 'it', 'fr']:  # English, German, Italian, French
        tweets[(keyword, lang)] = fetch_tweets(keyword, lang)
Implementing the SAX Algorithm
import pandas as pd
import numpy as np
from saxpy.sax import ts_to_string
from saxpy.alphabet import cuts_for_asize
from saxpy.znorm import znorm
from saxpy.paa import paa
from sklearn.cluster import KMeans

# Example hashtag data (replace with actual data; see the web parsing above
# for the definition of the topics used)
hashtags_data = [
    {'hashtag': '#climateaction', 'timestamp': '2023-01-01'},
    {'hashtag': '#greenenergy', 'timestamp': '2023-01-01'},
    # ... more data
]

# Convert to a DataFrame indexed by timestamp
df = pd.DataFrame(hashtags_data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# Count occurrences per day for each hashtag
time_series_data = df.groupby(['hashtag', pd.Grouper(freq='D')]).size().unstack(fill_value=0)

def convert_to_sax(series, alphabet_size=3, sax_length=10):
    """Convert a time series to its SAX string representation."""
    series_normalized = znorm(np.asarray(series, dtype=float))
    series_paa = paa(series_normalized, sax_length)  # reduce to sax_length segments
    cuts = cuts_for_asize(alphabet_size)
    return ts_to_string(series_paa, cuts)

# Apply SAX conversion to each hashtag's daily-count series
sax_data = time_series_data.apply(lambda row: convert_to_sax(row.values), axis=1)

# Encode the SAX strings as ordinal vectors so KMeans can operate on them
sax_df = pd.DataFrame({'hashtag': sax_data.index, 'sax_string': sax_data.values})
X = np.array([[ord(ch) - ord('a') for ch in s] for s in sax_df['sax_string']])

# Use KMeans for clustering (choose an appropriate number of clusters)
kmeans = KMeans(n_clusters=10, random_state=0)
sax_df['cluster'] = kmeans.fit_predict(X)

# Group hashtags by cluster
clustered_hashtags = sax_df.groupby('cluster')['hashtag'].apply(list)
Creating a Marketing Campaign
Based on the clustered hashtags, we developed campaign topics and carried out the marketing strategy ideation and execution:
- Identified Main Themes: Looking at the hashtags in each cluster, we identified the main themes. Each cluster represented a group of hashtags with similar temporal usage patterns.
- Developed Campaign Topics: For each cluster, we created a campaign topic that encapsulates its theme. For example, if a cluster contains hashtags like #greenenergy and #renewables, the topic could be “Innovations in Renewable Energy”.
- Created a Content Strategy: Developed specific content for each topic (social media posts, videos).
- Timed the Campaign: Used the temporal patterns observed in the SAX analysis to release campaign materials during peak interest periods (a minimal sketch follows this list).
- Engagement and Feedback: Monitored engagement with the campaign and adjusted topics or content strategies based on feedback and performance metrics.
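As an illustration of the timing step, here is a minimal sketch that reuses the time_series_data and clustered_hashtags objects from the SAX code above to find each cluster’s peak-activity day; the daily granularity is an assumption carried over from that example.

# Find the peak-activity day for each cluster of hashtags
for cluster_id, tags in clustered_hashtags.items():
    daily_totals = time_series_data.loc[tags].sum(axis=0)  # total daily counts for the cluster
    peak_day = daily_totals.idxmax()
    print(f'Cluster {cluster_id}: schedule content around {peak_day.date()}')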