Exploratory Data Analysis and Pre-processing

In this step, we explore the dataset and do some preliminary pre-processing.

Dataset

SMILE Twitter dataset:

Load data and explore

import torch
import pandas as pd
from tqdm.notebook import tqdm

# Passing explicit column names makes each field easy to refer to later
df = pd.read_csv(
    'Data/smile-annotations-final.csv',
    names=['id', 'text', 'category']
)

# Use the tweet id as the index to make later operations more convenient
df.set_index('id', inplace=True)

# Count how many tweets fall into each category
df.category.value_counts()

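# Preprocessing: drop tweets annotated with several emotions ("a|b") and tweets labelled "nocode".
# Boolean masks are negated with ~ (unary minus on a boolean Series raises a TypeError in recent pandas).
df = df[~df["category"].str.contains(r"\|")]
# .copy() avoids a SettingWithCopyWarning when a new column is added later
df = df[df.category != "nocode"].copy()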

# Feature engineering: map each category name to an integer label,
# numbered in order of first appearance
label_dict = {label: index for index, label in enumerate(df.category.unique())}
df['label'] = df.category.map(label_dict)
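
To see what the filtering and label mapping above do without loading the CSV, here is a minimal sketch on a small in-memory DataFrame. The rows in `toy` are made up for illustration and are not taken from the SMILE dataset. The names `toy`, `is_multi`, and `clean` are hypothetical, and only the `id`, `text`, and `category` columns mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the SMILE annotations (not real data).
toy = pd.DataFrame(
    {
        "text": ["great exhibit", "so so", "loved it", "what?", "meh"],
        "category": ["happy", "happy|surprise", "happy", "nocode", "sad"],
    },
    index=pd.Index([101, 102, 103, 104, 105], name="id"),
)

# Flag rows annotated with more than one emotion ("a|b").
# regex=False treats "|" as a literal character, so no escaping is needed.
is_multi = toy["category"].str.contains("|", regex=False)

# Keep single-label rows only, and drop the "nocode" placeholder as well.
clean = toy[~is_multi & (toy["category"] != "nocode")].copy()

# Assign integer ids in order of first appearance, then map them onto the rows.
label_dict = {label: idx for idx, label in enumerate(clean["category"].unique())}
clean["label"] = clean["category"].map(label_dict)

print(label_dict)
print(clean[["category", "label"]])
```

Rows 102 (two emotions) and 104 (`nocode`) are dropped, and the remaining categories are numbered in the order they first appear. Because the numbering depends on row order, it is worth keeping `label_dict` around so that predictions can be translated back to category names later.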
