# -*- coding: utf-8 -*-
"""Alternate_Reality
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1bknp8iZGkXEBz8LQHs2ipneEN4iFHDwb
#**ALTERNATE REALITY**
### **Description**
---
This is a project that I am working on to apply my newfound skills of Recurrent Neural Networks and Natural Language Processing. This project deals with **text generation** using LSTM RNN's.
### **Motivation**
---
I lived in New York City from 2014 to 2018. In the American education system, students write out different things (essays, poems, research papers, op-eds, etc.) on a daily basis.
Throughout my time at *The Bronx High School of Science*, I was trained in the art of scientific thinking! I had to write daily homeworks for my english class, most of which I typed up in Google docs. I also wrote numerous research papers for my english and history classes, some of which were well above **20 pages**. All of these texts are my creation. My own piece in the world of words and sentences.
When I had to leave *New York City* in the middle of my senior year, I was devastated beyond narration. It was the single most horrifying experience of my life. I was detached from everything I belonged to and stood for (kind of like Thor if you think about it) and banished away.
Ever since then, I have not been able to think like I used to be able to. I read text, but I am not critically thinking of it, not finding literary and rhetorical devices that the author used to set a scene, or show the development of a character.
And so I turned to **Neural Networks** to help me out. By using **Python**, **Keras**, and **LSTM**, I will be able to create a *Recurrent Neural Network* for language modeling and sample new text written in my style.
"""
"""
mounting the google drive to use text data and to clone GItHub repositories
"""
from google.colab import drive
drive.mount('/content/drive/')
"""
important libraries imported
"""
from __future__ import print_function
import os, io, sys, random
import numpy as np
import tensorflow as tf
from tensorflow import one_hot
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.layers import Dense, LSTM, Activation, Dropout, Input, Lambda, Reshape, Masking
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.utils import to_categorical, plot_model
# Commented out IPython magic to ensure Python compatibility.
"""
making sure TensorFlow version 2 is used in this notebook
"""
# %tensorflow_version 2.x
if int(tf.__version__[0]) < 2:
    !pip install tensorflow==2.1
print("TensorFlow version: " + tf.__version__)
"""
testing if connected to TPU and/or GPU
"""
import pprint
if 'COLAB_TPU_ADDR' not in os.environ:
print('Not connected to a TPU runtime.')
else:
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print ('Connected to TPU.\n\nTPU address is', tpu_address)
with tf.compat.v1.Session((tpu_address)) as session:
devices = session.list_devices()
print('TPU devices:')
pprint.pprint(devices)
if tf.test.gpu_device_name() == '':
print('\n\nNot connected to a GPU runtime.')
else:
print('\n\nConnected to GPU: ' + tf.test.gpu_device_name())
"""
cleaning the data and forming the examples
"""
import numpy as np
import io
path = "/content/drive/My Drive/Alternate-Reality/folder/text_data/merged.txt"
with io.open(path, encoding='utf-8') as corpus:
text = corpus.read()
LENGTH = len(text)
Tx = 11 # length of each example (characters)
vocab = sorted(set(list(text))) # list (a set actually) of all the characters in the corpus
char_to_indices = dict((ch, idx) for idx, ch in enumerate(vocab))
index_to_char = dict((idx, ch) for idx, ch in enumerate(vocab))
# pretty much temporary variables just for the sake of splitting the huge corpus
sentences = [] # X
mapped_chars = [] # Y
step = 3
for i in range(0, LENGTH - Tx, step):
temp_text = text[i: i+Tx]
sentences.append(temp_text[:-1])
mapped_chars.append(temp_text[-1])
m = len(sentences)
X = np.zeros((m, Tx - 1, len(vocab)))
Y = np.zeros((m, len(vocab)))
for i, example in enumerate(sentences):
X[i, :, :] = one_hot([char_to_indices[ch] for ch in example], depth=len(vocab))
Y[i, :] = one_hot(char_to_indices[mapped_chars[i]], depth=len(vocab))
# a nuisance is fixed by turning X and Y into numpy arrays
X = np.asarray(X)
Y = np.asarray(Y)
#==============printing data dimesions=========================================
print(f"Length of corpus: {LENGTH}")
print(f"X.shape = {X.shape}")
print(f"Y.shape = {Y.shape}")
print(f"Number of examples: {m}")
"""
read the function docstring below
"""
def get_example(index=None):
"""
retrieves the example at index position in X is index is passed, otherwise random example is obtained
:param index: index of example desired to be retrieved
:return: string of text
"""
if index is None:
index = np.random.randint(low=0, high=m)
curr_x = [index_to_char[idx] for idx in np.argmax(X[index, :, :], axis=1)]
curr_y = index_to_char[np.argmax(Y[index, :])]
x_y = (''.join(curr_x), curr_y)
return x_y
"""
testing a single example and the time it took to retrieve it
"""
import time
start = time.process_time()
example = get_example()
end = time.process_time()
print(f"Sample X: {example[0]}\nCorresponding Y: {example[1]}")
print(f"\nTime taken for acquiring this example: {end - start} seconds")
"""
network architecture creation
model creation
plot_model allows me to see what my neural network looks like
"""
def Ram_Says(Tx, vocab, output_length):
# network architecture LSTM -> Dropout -> Reshape -> LSTM -> Dropout -> Dense
# define the initial hidden state a0 and initial cell state c0
a0 = Input(shape=(output_length,), name='a0')
c0 = Input(shape=(output_length,), name='c0')
a = a0
c = c0
X = Input(shape=(Tx, len(vocab)), name='X')
a, _, c = LSTM(units=output_length, activation='tanh', return_state=True, dtype='float32', name=f'lstm_1')(X, [a, c])
a = Dropout(rate=0.2, name=f'dropout_1')(a)
a = Reshape((1, output_length), name='reshape_1')(a) # needed after a dropout layer for another LSTM layer
a = LSTM(units=output_length, activation='tanh', dtype='float32', name=f'lstm_2')(a)
a = Dropout(rate=0.2, name=f'dropout_2')(a)
out = Dense(units=len(vocab), activation='softmax', name=f'dense')(a)
model = Model(inputs=[X, a0, c0], outputs=out, name='Ram')
return model
"""
creating the model and the summary of it
"""
#====================Creating important variables===============================
n_a = 256 # number of hidden state dimensions for each LSTM cell
a0 = np.zeros((m, n_a))
c0 = np.zeros((m, n_a))
#===============================================================================
model = Ram_Says(Tx=Tx - 1, vocab=vocab, output_length=n_a)
plot_model(model, to_file='/content/drive/My Drive/Alternate-Reality/nn_graph.png')
model.summary()
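"""
a quick sanity check (illustrative, not part of the original notebook):
pushing a single example through the freshly built model should yield
one softmax row over the vocabulary
"""
_probe = model.predict([X[:1], a0[:1], c0[:1]], verbose=0)
print(_probe.shape)             # (1, len(vocab))
print(round(_probe.sum(), 4))   # the row sums to ~1.0 because of the softmax output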
"""
configuring optimizations for the model
fitting the model
"""
learning_rate = 0.01
learning_rate_decay = 0.001
optimizer = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, decay=learning_rate_decay)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
batch_size = 8192
model.fit([X, a0, c0], Y, batch_size=batch_size, epochs=100)
"""
filepath
"""
filepath = '/content/drive/My Drive/Alternate-Reality/Ram_Says_Trained_Model.h5'
"""
this code takes care of saving the new model only if its accuracy is better than
that of the last model
"""
if os.path.exists(filepath):
prev_model = load_model(filepath)
prev_acc = prev_model.evaluate([X, a0, c0], Y, verbose=0)[1]
curr_acc = model.evaluate([X, a0, c0], Y, verbose=0)[1]
if curr_acc > prev_acc:
print("There was a previous model saved.")
print(f"Previous accuracy: {round(prev_acc*100, 2)}%")
print(f"Current accuracy: {round(curr_acc*100, 2)}%")
model.save(filepath)
print('New model is saved.')
else:
print(f"Previous accuracy: {round(prev_acc*100, 2)}%")
print(f"Current accuracy: {round(curr_acc*100, 2)}%")
print('Old model is kept.')
else: # if this is the first time saving the model
model.save(filepath)
print('First time model is saved.')
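"""
note: a similar keep-the-best effect could also be achieved during
training with Keras's ModelCheckpoint callback; this is only a sketch
(the path below is hypothetical) and not what this notebook uses
"""
# from tensorflow.keras.callbacks import ModelCheckpoint
# checkpoint = ModelCheckpoint(
#     '/content/drive/My Drive/Alternate-Reality/best_by_callback.h5',
#     monitor='accuracy',    # track the training accuracy metric
#     save_best_only=True)   # overwrite the file only when accuracy improves
# model.fit([X, a0, c0], Y, batch_size=batch_size, epochs=100,
#           callbacks=[checkpoint])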
"""
loading the model for sampling
"""
Ram_says = load_model(filepath)
"""
testing the accuracy of the model on X and Y
"""
accuracy = Ram_says.evaluate([X, a0, c0], Y, verbose=0)[1]
print(f"Accuracy on the training set: {round(accuracy*100, 2)}%")
"""
text sampling time
"""
def sample(preds, temperature=1.0):
preds = np.asarray(preds).astype('float64')
preds = np.log(preds) / temperature
exp_preds = np.exp(preds)
preds = exp_preds / np.sum(exp_preds)
probas = np.random.multinomial(1, preds, 1)
return np.argmax(probas)
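"""
a small sanity check of the temperature behaviour on a made-up
3-way distribution (not model output): a low temperature concentrates
the draws on the most likely index, a high temperature flattens them
"""
toy_preds = [0.7, 0.2, 0.1]
for temp in [0.2, 1.0, 1.4]:
    draws = [sample(toy_preds, temperature=temp) for _ in range(1000)]
    print(f"temperature {temp}: fraction of draws that picked index 0 = {draws.count(0) / 1000}")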
def generate_output():
    diversity = random.choice([0.2, 0.5, 0.7, 1.0, 1.2, 1.4])
    diversity = 0.2  # override the random choice with a fixed value for now
    print(f'Diversity: {diversity}')
    generated = ''
    sentence = input('Your text: ')
    generated += sentence
    # trim or pad the seed so it is exactly Tx - 1 = 10 characters long
    if len(sentence) > 10:
        sentence = sentence[:10]
    elif len(sentence) < 10:
        rem = 10 - len(sentence)
        sentence += ' ' * rem
    sys.stdout.write(generated + ' ')
    a0 = np.zeros((1, n_a))
    c0 = np.zeros((1, n_a))
    sys.stdout.write(sentence)
    for i in range(1000):
        x_pred = np.zeros((1, Tx - 1, len(vocab)))
        for t, char in enumerate(sentence):
            if char in char_to_indices:  # skip characters that never appeared in the corpus
                x_pred[0, t, char_to_indices[char]] = 1.
        preds = Ram_says.predict([x_pred, a0, c0], verbose=0)[0]
        next_index = sample(preds, temperature=diversity)
        next_char = index_to_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
        sys.stdout.write(next_char)
        sys.stdout.flush()
        if next_char == '\n':
            print('\n')
        elif next_char == '\t':
            print('\t')
generate_output()
"""# I will be placing the model's generated text below for record.
> "My name is Ramansh Sharma My name is the public control considered to fut tell overt of a decide the world to life people real and massive really came to congres to be streated it could have ingarding the world many that estabe by nugues and could have been guvern for the Soligition, purched that remamplogr prices repective independence in 1982. Bech of 1980ided to social and enjoyable capting the sent a see that her seperones in Am" - Ram_says
Pretty good for a first try at a character-level model!
> My name is Ram My name is for the fight not started with a being at left the sade but the colonists clear of the Dept-A spondin surmerity, Konusansingtond that in the US and CIA port and scange and industed the Cold War and Congress in Depertmant in reace in the different id on the eneming American relations going to see servions. He also failed in his becuuse to the ready the Ford she end and commercesvestromes to onder the shooked a protects the treaty proved my team America in aid of the new that her strect of phy incrusing and of the pagome, he hading the didrition in 1982. After their visto want in the Cold War which movern of the fen his “In and wenter for their marking of confisens than his prorecive in front of a contriluting as a "confid and the Office of Economic Opportunity. He also failed in his fail w
> Thor is the king Thor is the rare to be some known as the enemonity of Congress, the meint in the Cold War when Chine with them. I was also high domination which dickins of Curness, he will be a beation….ther would reduce the decision of appressed nor the increased communist against controlles. Perretimmert. Itay of nuclas they gave he was a tork the side and very impossing this him as a facutive promest it be a stay a and fack as the anities and the EPand and being people white Sinally sochiets. He did not convered congresity.
> o show he was the niting to the 1987 in are income taxes. Truean aid to the child of cotting in country, this be mistaky was allice the new that were because Opeaa Act goush in the Cold War and Congress dissided on I came to that felt me to event, and the INation He was was a Soviet troops,
> The The Date-10foce deneed to ave the police of the sent to be some known as “To loon duberd many as it is to New York, to his domestic oil for a long of the weap in Congress she had ease and ensore considerable spending neges and cered to confinced the not on that is a senter aftarity, and a communist confidend ming the enemonity of Congress, complelated to make me to ausheres decart did not prife to meacunion of the Indians, if the stopled. And order to be the same that I am good family defense does money starting ut from America. a recatted to the 1972 Nair ance me to eventle, to stay the cares and even Dieanation, even to geter from the British other trages in Robard on New Yer proved a coffitted them in 1975, up they wanted to raised her free that they will be jasunang but was is the mind end hand of the part endere to the money starting under President during the reader to the side because Opea which states who arm diresten to whith with whith security plannerd
"""