Exploratory Data Analysis and Pre-processing

In this section, we explore the dataset and apply some preliminary processing.

Dataset

SMILE Twitter dataset: https://figshare.com/articles/smile_annotations_final_csv/3187909/2

Load data and explore

import torch
import pandas as pd
from tqdm.notebook import tqdm

# by specifying names, we have a more intuitive sense of what we are dealing with
df = pd.read_csv(
    'Data/smile-annotations-final.csv',
    names=['id', 'text', 'category']
)

# use 'id' as the index to make later lookups and joins more convenient
df.set_index('id', inplace=True)

# count how many tweets fall into each category
df.category.value_counts()
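
The category counts above are mainly useful for spotting class imbalance. As a minimal sketch, assuming a small hypothetical frame named toy in place of the real dataset, the same call with normalize=True reports each category's share instead of its raw count:

```python
import pandas as pd

# Hypothetical toy data standing in for the SMILE annotations
toy = pd.DataFrame({"category": ["happy", "happy", "happy", "sad"]})

# Raw counts per category, most frequent first
counts = toy["category"].value_counts()

# Relative frequencies: each category's fraction of all rows
shares = toy["category"].value_counts(normalize=True)

print(counts)
print(shares)
```

A heavily skewed share column is a hint that accuracy alone may be a misleading metric later on.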

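# preprocessing: drop tweets annotated with more than one category (joined by "|")
# "~" negates the boolean mask; regex=False treats "|" as a literal character
df = df[~df["category"].str.contains("|", regex=False)]
# drop tweets labelled "nocode"
df = df[df.category != "nocode"]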
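
The filtering step can be checked on a small example. This is only a sketch: the frame toy and its rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the SMILE annotations
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d"],
        "category": ["happy", "happy|sad", "nocode", "angry"],
    }
)

# True where the category holds several labels joined by "|"
multi_label = toy["category"].str.contains("|", regex=False)

# Keep rows that are single-label and are not the "nocode" placeholder
filtered = toy[~multi_label & (toy["category"] != "nocode")]

print(filtered)
```

Only the "happy" and "angry" rows survive, which is the behaviour we want from the preprocessing step.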

# feature engineering: map each category name to an integer label
label_dict = {}
for index, label in enumerate(df.category.unique()):
    label_dict[label] = index
# map() looks up every category in label_dict and returns the integer labels
df['label'] = df.category.map(label_dict)
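
To make the label encoding concrete, here is a minimal sketch on made-up data. The names toy and inverse_label_dict are hypothetical and do not appear in the original code; the inverse dictionary is an optional extra that can help when turning predicted integers back into category names.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset
toy = pd.DataFrame({"category": ["happy", "angry", "happy", "sad"]})

# unique() keeps the order of first appearance, so indices follow that order
label_dict = {label: index for index, label in enumerate(toy["category"].unique())}

# Look up the integer label for every row
toy["label"] = toy["category"].map(label_dict)

# Reverse mapping from integer label back to category name
inverse_label_dict = {index: label for label, index in label_dict.items()}

print(label_dict)
print(toy)
```

Because the indices depend on the order in which categories first appear, it is worth saving label_dict alongside any trained model.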
