# NLP Binary Classification of tweets

I wanted to share the [steps](https://www.kaggle.com/code/glenn23/nlp-tweets-classification-use-sequential-api) I took to predict whether a given tweet is about a real disaster or not. The model was trained on the Kaggle Disaster Tweets [dataset](https://www.kaggle.com/competitions/nlp-getting-started/data). It uses a pretrained Universal Sentence Encoder ([USE](https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/tensorFlow2/variations/universal-sentence-encoder/versions/2?tfhub-redirect=true)) from TensorFlow Hub as the first layer of a Keras Sequential model.

First, I loaded the needed libraries and imported the Kaggle dataset.

```python
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud

# Import data
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
train_df.head()
```

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1705686410584/42c58657-d232-4c45-8b02-ac9fd3dc15a1.png align="center")

Next, I shuffled the data and split off a validation set.

```python
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

# Use train_test_split to hold out 10% of the training data as a validation set
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    train_df_shuffled["text"].to_numpy(),
    train_df_shuffled["target"].to_numpy(),
    test_size=0.1,
    random_state=42,
)

# Full training set: every row, including the rows held out for validation above
whole_train_sentences = train_df_shuffled["text"].to_numpy()
whole_train_labels = train_df_shuffled["target"].to_numpy()

len(whole_train_sentences), len(whole_train_labels)
```
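One thing worth checking after a split like this is how the pieces overlap. Below is a minimal sketch on made-up data, using a plain NumPy permutation as a stand-in for `train_test_split` (the toy sentences and the 20-row size are assumptions for illustration). It shows that the 90/10 pieces are disjoint, while the "whole" array necessarily contains every validation row.

```python
import numpy as np

# Toy stand-in for the shuffled tweets (hypothetical data, not the Kaggle set)
sentences = np.array([f"tweet {i}" for i in range(20)])

# Hold out 10% for validation, mirroring test_size=0.1
rng = np.random.default_rng(42)
order = rng.permutation(len(sentences))
n_val = len(sentences) // 10
val_sentences = sentences[order[:n_val]]
train_sentences = sentences[order[n_val:]]

# The "whole" set is every row, so it includes the validation rows too
whole_train_sentences = sentences

overlap_with_split = np.intersect1d(train_sentences, val_sentences)
overlap_with_whole = np.intersect1d(whole_train_sentences, val_sentences)
print(len(overlap_with_split), len(overlap_with_whole))
```

If the full set is later used for fitting, validation scores measured on `val_sentences` describe data the model has already seen, so they are best read as a sanity check rather than an estimate of test performance.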

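Created a Keras layer from the pretrained Universal Sentence Encoder. The layer takes raw strings as input and is frozen (`trainable=False`), so only the layers stacked on top of it are trained.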

```python
# Create a Keras layer using the USE pretrained layer from tensorflow hub
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], 
                                        dtype=tf.string, 
                                        trainable=False,
                                        name="USE"
                                        )
```

Created a Sequential model with a small dense head on top of the encoder, then compiled and trained it.

```python
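# Create model using the Sequential API
model = tf.keras.Sequential([
    sentence_encoder_layer,                # frozen USE layer: tweet text -> embedding
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid")  # probability that the tweet is about a disaster
])

# Compile model
model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.legacy.Adam(),
              metrics=["accuracy"])

# Train a classifier on top of the pretrained embeddings.
# Note: this fits on the full training set, which includes the validation rows,
# so the validation metrics reported during training are optimistic.
model_history = model.fit(whole_train_sentences,
                          whole_train_labels,
                          epochs=5,
                          validation_data=(val_sentences, val_labels))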
```

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1705686936554/daa8faed-caac-4dfe-b0b8-de3ba9a2eb20.png align="center")
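The object returned by `model.fit` keeps the per-epoch metrics in its `.history` dictionary, which makes it easy to see where validation accuracy peaked. Here is a small sketch; the helper name `best_epoch` and the numbers in `example_history` are made up for illustration and are not the results from my run.

```python
def best_epoch(history, metric="val_accuracy"):
    """Return (1-based epoch, value) where `metric` is highest."""
    values = history[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up numbers for illustration only; in the notebook this dictionary
# would be model_history.history
example_history = {
    "loss": [0.52, 0.42, 0.39, 0.37, 0.36],
    "accuracy": [0.78, 0.82, 0.83, 0.84, 0.85],
    "val_loss": [0.45, 0.40, 0.38, 0.37, 0.36],
    "val_accuracy": [0.80, 0.82, 0.84, 0.83, 0.84],
}

epoch, value = best_epoch(example_history)
print(f"Best val_accuracy {value:.2f} at epoch {epoch}")
```

When two epochs tie, the helper returns the earlier one, which is usually the safer choice if you want to cut training short.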

Made predictions on the test set.

```python
# Make predictions with the model
pred_probs = model.predict(test_df['text'].to_numpy())
```
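The sigmoid output is a probability per tweet, so it still has to be turned into 0/1 labels before submitting. A minimal sketch of that step follows, assuming a 0.5 cutoff and the `id` / `target` column layout of the competition's sample submission. The helper name `make_submission` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def make_submission(ids, pred_probs, threshold=0.5):
    """Turn sigmoid probabilities into 0/1 labels paired with their ids."""
    probs = np.asarray(pred_probs).reshape(-1)  # (n, 1) -> (n,)
    labels = (probs >= threshold).astype(int)   # threshold is an assumption
    return pd.DataFrame({"id": list(ids), "target": labels})


# Toy values standing in for test_df["id"] and model.predict(...)
toy_ids = [0, 2, 3]
toy_probs = np.array([[0.91], [0.08], [0.55]])
submission = make_submission(toy_ids, toy_probs)
print(submission)

# In the notebook this might look like (the file name is an assumption):
# make_submission(test_df["id"], pred_probs).to_csv("submission.csv", index=False)
```

Using an explicit `>=` comparison instead of rounding keeps the behaviour at exactly 0.5 predictable and makes it easy to try a different threshold.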

Submitted my predictions to the Kaggle competition. The score narrowly beat the AutoML benchmark.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1705686367505/6ff61220-7497-45dc-ae88-7935ca129583.png align="center")

I also [created](https://blog.dtucker.xyz/visualizing-text-data) some visuals to view the keywords in the tweet data.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1705686479574/68b58c84-98f3-4fbc-bc08-e1f0bd9c0991.png align="center")

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1705686483535/1e982964-e385-4b27-a11f-d6594d3c4710.png align="center")
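Visuals like these start from keyword counts. Here is a rough sketch of that counting step, assuming the keywords come from a `keyword` column with missing values and possibly URL-encoded spaces such as `%20`. The helper name `keyword_frequencies` and the toy list are made up for this example.

```python
from collections import Counter
from urllib.parse import unquote


def keyword_frequencies(keywords):
    """Count keywords, skipping missing values and decoding e.g. '%20'."""
    cleaned = (unquote(k).lower() for k in keywords if isinstance(k, str) and k)
    return Counter(cleaned)


# Toy values standing in for train_df["keyword"] (None/NaN mark missing entries)
toy_keywords = ["wildfire", "forest%20fire", None, "wildfire", float("nan"), "flood"]
freqs = keyword_frequencies(toy_keywords)
print(freqs.most_common(2))

# The counts could then be handed to the wordcloud package, e.g.
# WordCloud().generate_from_frequencies(freqs)
```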

I also used the TensorFlow projector tool to visualize the embeddings. It produced a neat clustering of related words.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1705686553776/7a70fffb-9318-4903-bd7d-a69e6e32cb7b.png align="center")
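The projector loads embeddings from tab-separated text: one file with a vector per line and one metadata file with a label per line (with a single metadata column, no header row is used). Below is a minimal sketch of producing those two files' contents, assuming you already have one vector per word or sentence. The helper name `to_projector_tsv` and the toy vectors are made up for illustration.

```python
def to_projector_tsv(vectors, labels):
    """Return (vectors_tsv, metadata_tsv) strings for the Embedding Projector."""
    if len(vectors) != len(labels):
        raise ValueError("vectors and labels must have the same length")
    vector_lines = ["\t".join(str(float(x)) for x in vec) for vec in vectors]
    # Tabs and newlines inside a label would break the one-row-per-item layout
    label_lines = [" ".join(str(label).split()) for label in labels]
    return "\n".join(vector_lines) + "\n", "\n".join(label_lines) + "\n"


# Toy 3-dimensional vectors; real embeddings would be much wider
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_labels = ["fire", "flood\twarning"]
vectors_tsv, metadata_tsv = to_projector_tsv(toy_vectors, toy_labels)
print(vectors_tsv)
print(metadata_tsv)

# To load them in the projector, write each string to its own file
# (file names are arbitrary), e.g.
# with open("vectors.tsv", "w", encoding="utf-8") as f:
#     f.write(vectors_tsv)
```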

---

You can find the Kaggle notebook [here](https://www.kaggle.com/code/glenn23/nlp-tweets-classification-use-sequential-api).

---
