Generating extinct Japanese script with Adversarial Autoencoders: Theory and Implementation

Adrian Yijie Xu, Feb 18

Introduction

Be it political deepfakes, near real-time video modification, or creating hybrid celebrity faces, the generative capabilities of neural networks have rapidly shot to the spotlight as we move beyond classical beliefs of “seeing is believing”.

Amongst these, architectures utilizing adversarial training, such as Generative Adversarial Networks (GAN) and Adversarial Autoencoders (AAE), have received particular attention as self-supervised approaches to creating realistic outputs capable of fooling other neural networks during classification.

Intuitively, we can think of adversarial architectures as a policeman and a counterfeiter working in tandem: the counterfeiter works to produce realistic counterfeit currency, which is examined by the policeman, whom we assume to be knowledgeable about the characteristics of real currency through experience.

If it doesn’t pass as authentic currency, the policeman rejects the currency and instructs the counterfeiter how to improve next time.

Figure: Not what we’ll be doing, thankfully.

As time passes, the counterfeiter’s skills increase, and eventually he produces authentic-looking currency.

No worries, however: we won’t be faking money today!

In this tutorial, we will use an Adversarial Autoencoder and the open-source KMNIST dataset to bring back a now largely extinct form of ancient cursive Japanese script known as Kuzushiji.

KMNIST (Kuzushiji-MNIST) is a drop-in replacement for the MNIST dataset (28×28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format, with one class chosen to represent each of the 10 rows of Kuzushiji Hiragana.

Figure: Examples of each of the 10 classes of Hiragana in KMNIST.

Cursive Kuzushiji script had been in use for over 1000 years before the 19th century, and included several different styles and formats for each word.

However, during the Meiji period, Japan reformed its official language and writing system and standardized it into a form similar to that used today.

This caused the cursive script to fade, and today millions of documents on Japanese culture and history cannot be read by modern scholars.

Using our approach, we will breathe life into this extinct script by learning from the work of thousands of poets and writers to create new examples of Kuzushiji script.

For the course of this tutorial, we assume that the user is familiar with the elements of deep learning, particularly backpropagation and the structure of one-dimensional neural networks.

Theory

An autoencoder is a self-supervised learning neural network designed to reconstruct its input from a lower dimensional representation.

A vanilla autoencoder consists of two components — an encoder and a decoder.

The encoder takes an input and generates a lower dimensional intermediate representation, known as the latent representation.

For instance, an input consisting of an image of a cat may become a pixelated, smaller image that also occupies a proportionally smaller amount of memory.

The decoder’s role is to take this latent representation and restore it to its original, higher dimensional form.

It’s important to note that the decoder does not know what the original image looks like, and hence initially does a terrible job during reconstruction.

We train the network by minimizing the reconstruction loss (the mean squared loss), which measures the difference between the original input and the reconstructed input.

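$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2$$

Definition of Mean Squared Error over N samples, where $x_i$ is an original input and $\hat{x}_i$ is its reconstruction.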
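As a small numeric illustration of the reconstruction loss, here is a minimal NumPy sketch. The helper name `mean_squared_error` and the toy 2×2 arrays are assumptions made for this example; they are not part of the tutorial's code or the KMNIST data.

```python
import numpy as np


def mean_squared_error(original, reconstructed):
    """Average of the squared element-wise differences between two arrays."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    return float(np.mean((original - reconstructed) ** 2))


# Toy 2x2 "images" (hypothetical values, not KMNIST data)
x = np.array([[0.0, 1.0], [1.0, 0.0]])
x_hat = np.array([[0.0, 0.5], [1.0, 0.5]])

loss = mean_squared_error(x, x_hat)
print(loss)  # 0.125
```

Two of the four pixels are off by 0.5, so the squared errors sum to 0.5 and average to 0.125. A perfect reconstruction would give a loss of 0.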

With each epoch, the encoder will learn to generate a more meaningful latent representation, while the decoder will learn to better convert that latent representation back into an approximation of the input image.

This continues until the reconstruction loss is minimized, and the input can be replicated.

The uses of latent representations also lie beyond computational efficiency.

By default, vanilla autoencoders don’t force their encoder posterior output to match a specific input distribution, but rather aim to learn explicit relationships which are spread in the lower dimensional latent space that aid in reconstruction.

In this manner, an autoencoder can be used to remove ambient noise from an input image, for example.

An adversarial autoencoder (AAE) is similar to a vanilla autoencoder, but also contains an adversarial component known as a discriminator, similar to that found in Generative Adversarial Networks.

We can regard the autoencoder in an AAE as a generator, where the encoder learns to shape the posterior distribution of its latent codes to match a chosen prior distribution, while the decoder aims to reconstruct the image with minimal reconstruction loss.

The generator’s overall role is to produce latent representations that fool the discriminator into believing they were drawn from the true prior distribution rather than from the encoder’s posterior distribution.

There are two alternating phases to training an AAE.

Firstly, during the reconstruction phase, we only train the autoencoder (which consists of the encoder and decoder) to minimize the reconstruction error, in the same manner as we observed with the vanilla autoencoders.

Secondly, during the regularization phase, we train the discriminator to tell samples drawn from the true prior distribution apart from the generated samples (the encoder output).

Following this, we then train the generator (which is the encoder of the autoencoder) to confuse the discriminator, by better matching the aforementioned distributions.

After the training is done, the decoder of the autoencoder will generate samples that directly map the imposed prior of the data distribution with minimal reconstruction loss.

Sounds complex, right? To better understand the mechanics behind this, let’s go over a pass through the network.

Let’s say that we are feeding our autoencoder with the KMNIST dataset, and we first pass it a batch of images representing the class “0”.

Each image, represented as an input tensor, is encoded into a latent representation, which is then decoded by the decoder.

After this, the reconstruction error is calculated and backpropagated to update the autoencoder’s weights and minimize that error.

The discriminator then judges whether a latent vector belongs to the true prior distribution or is an encoder output, outputting a 1 if it believes the sample is real, or a 0 if it believes the sample is fake.

From these results a discriminative adversarial loss, characterized by a binary cross-entropy error, is generated and backpropagated to update the discriminator’s weights and improve its classification capabilities.

Intuitively, we punish the discriminator should it mistake a sample from the prior for an encoder output, or vice versa.
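To make the "punishment" concrete, the sketch below computes a mean binary cross-entropy in plain Python for two hypothetical discriminators, one mostly right and one mostly wrong. The function name, the clipping constant `eps`, and all of the probabilities are assumptions chosen for illustration only.

```python
import math


def binary_crossentropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


# Hypothetical discriminator outputs for two prior samples (target 1)
# and two encoder outputs (target 0).
targets = [1, 1, 0, 0]
confident_and_right = [0.9, 0.8, 0.1, 0.2]
confident_and_wrong = [0.1, 0.2, 0.9, 0.8]

low_loss = binary_crossentropy(targets, confident_and_right)
high_loss = binary_crossentropy(targets, confident_and_wrong)
print(low_loss < high_loss)  # True
```

The loss grows sharply as the discriminator becomes confidently wrong, which is what pushes its weights toward better classification.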

We train the generator’s encoder component by keeping the weights of the discriminator fixed and setting the discriminator’s target to 1, so that the encoder learns the required distribution from the gradients passed back through the discriminator.

After each generator training pass, we then fix the weights of the encoder, and train the discriminator.

This coupling continues to alternate for the number of epochs that we have specified.

Implementation

Our code is based on Erik Lindenoren’s Keras implementation in Python.

As previously mentioned, we will be using the KMNIST dataset for our generative example.

So let’s load up the dataset to begin.

As the images are compressed into .npz files, we will use NumPy to load them into data arrays.
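For readers unfamiliar with the .npz format, the following self-contained sketch writes a small stand-in archive and reads it back. The file name and the all-zero array are hypothetical; only the `arr_0` key behaviour matters here. NumPy assigns that key to the first array saved positionally.

```python
import os
import tempfile

import numpy as np

# Stand-in array with the same layout as KMNIST images: (N, 28, 28), uint8.
dummy_images = np.zeros((3, 28, 28), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dummy-imgs.npz")  # hypothetical file name
    # Arrays passed positionally are stored under 'arr_0', 'arr_1', ...
    np.savez_compressed(path, dummy_images)

    with np.load(path) as archive:
        keys = list(archive.files)
        loaded = archive["arr_0"]

print(keys)          # ['arr_0']
print(loaded.shape)  # (3, 28, 28)
```

Listing `archive.files` is a quick way to check which keys an unfamiliar .npz file actually contains before indexing into it.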

    from keras.datasets import mnist
    from keras.layers import Input, Dense, Reshape, Flatten, Lambda
    from keras.layers.advanced_activations import LeakyReLU
    from keras.models import Sequential, Model
    from keras.optimizers import Adam
    import keras.backend as K
    import matplotlib.pyplot as plt
    import numpy as np
    import os
    from PIL import Image

    # For this project, we will only be using train_images
    # To further improve the accuracy of the GAN, you could involve labels
    PATH = "./input/"
    train_images = np.load(PATH + 'kmnist-train-imgs.npz')['arr_0']
    test_images = np.load(PATH + 'kmnist-test-imgs.npz')['arr_0']
    train_labels = np.load(PATH + 'kmnist-train-labels.npz')['arr_0']
    test_labels = np.load(PATH + 'kmnist-test-labels.npz')['arr_0']

Let’s define some parameters here, specifying the size and color palette (greyscale) of our images, along with the input data batch size.

The latent dim parameter represents the reduced dimensional size of the latent representation.

Finally let’s plot a few of our images to see what kind of dataset we are dealing with.

We’ll also define a sampling function to help us sample from our classes, which we approximate as Gaussian distributions.
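The sampling step draws a latent vector as mean plus scaled Gaussian noise, where the scale is exp(log_var / 2). Below is a NumPy sketch of that idea outside Keras. The function name `sample_latent`, the seed, and the constant mean and variance values are assumptions made purely for illustration.

```python
import numpy as np


def sample_latent(z_mean, z_log_var, rng, epsilon_std=1.0):
    """Draw z = mean + exp(log_var / 2) * epsilon, with epsilon ~ N(0, epsilon_std^2)."""
    epsilon = rng.normal(loc=0.0, scale=epsilon_std, size=z_mean.shape)
    return z_mean + np.exp(z_log_var / 2.0) * epsilon


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
z_mean = np.full((20000, 10), 2.0)              # hypothetical encoder means
z_log_var = np.full((20000, 10), np.log(0.25))  # variance 0.25, i.e. std 0.5

z = sample_latent(z_mean, z_log_var, rng)
print(z.shape)  # (20000, 10)
print(round(float(z.mean()), 1), round(float(z.std()), 1))  # approximately 2.0 0.5
```

Working with the log-variance rather than the standard deviation keeps the scale positive for any real-valued network output.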

    img_rows = 28
    img_cols = 28
    channels = 1
    img_shape = (img_rows, img_cols, channels)
    latent_dim = 10  # 10 classes and hence 10 dimensions
    batch_size = 16
    epsilon_std = 1.0

    # View the dataset to get an idea of what we're dealing with
    def plot_sample_images_data(images, labels):
        plt.figure(figsize=(12, 12))
        for i in range(10):
            imgs = images[np.where(labels == i)]
            lbls = labels[np.where(labels == i)]
            for j in range(10):
                plt.subplot(10, 10, i * 10 + j + 1)
                plt.xticks([])
                plt.yticks([])
                plt.grid(False)
                plt.imshow(imgs[j], cmap=plt.cm.binary)
                plt.xlabel(lbls[j])

    plot_sample_images_data(train_images, train_labels)

    def sampling(args):
        z_mean, z_log_var = args
        epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=epsilon_std)
        return z_mean + K.exp(z_log_var / 2) * epsilon

Figure: Examples of input distribution.

Now, it’s time to define the architecture of our encoder, decoder, and discriminator.

All three consist of densely connected layers with LeakyReLU activations.

Notice that the discriminator works on encoded (latent) representations, not decoded images.

    def build_encoder():
        img = Input(shape=img_shape)
        h = Flatten()(img)
        h = Dense(512)(h)
        h = LeakyReLU(alpha=0.2)(h)
        h = Dense(512)(h)
        h = LeakyReLU(alpha=0.2)(h)
        mu = Dense(latent_dim)(h)
        log_var = Dense(latent_dim)(h)
        z = Lambda(sampling, output_shape=(latent_dim,), name='z')([mu, log_var])
        return Model(img, z)

    def build_decoder():
        model = Sequential()
        model.add(Dense(512, input_dim=latent_dim))
        model.add(LeakyReLU(alpha=0.2))
        model.add(Dense(512))
        model.add(LeakyReLU(alpha=0.2))
        # tanh is more robust: gradient not equal to 0 around 0
        model.add(Dense(np.prod(img_shape), activation='tanh'))
        model.add(Reshape(img_shape))
        model.summary()
        z = Input(shape=(latent_dim,))
        img = model(z)
        return Model(z, img)

    def build_discriminator():
        model = Sequential()
        model.add(Dense(1024, input_dim=latent_dim))
        model.add(LeakyReLU(alpha=0.2))
        model.add(Dense(512))
        model.add(LeakyReLU(alpha=0.2))
        model.add(Dense(256))
        model.add(LeakyReLU(alpha=0.2))
        model.add(Dense(1, activation="sigmoid"))
        model.summary()
        encoded_repr = Input(shape=(latent_dim,))
        validity = model(encoded_repr)
        return Model(encoded_repr, validity)

Next, we build all of our components.

We will use the Adam optimizer to update the weights of our networks, with a learning rate of 0.0002 (0.02%).

For our initial pass, we train the generator on randomly initialized discriminator weights, so we fix the latter’s weights to facilitate this.

    optimizer = Adam(0.0002, 0.5)

    discriminator = build_discriminator()
    discriminator.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

    encoder = build_encoder()
    decoder = build_decoder()

    img = Input(shape=img_shape)
    encoded_repr = encoder(img)
    reconstructed_img = decoder(encoded_repr)

    discriminator.trainable = False
    validity = discriminator(encoded_repr)

    adversarial_autoencoder = Model(img, [reconstructed_img, validity])
    adversarial_autoencoder.compile(loss=['mse', 'binary_crossentropy'], loss_weights=[0.999, 0.001], optimizer=optimizer)
    adversarial_autoencoder.trainable = True

Finally, let’s define the training function for our adversarial autoencoder.

Note how we load and normalize our dataset within our training function.
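The normalization maps 8-bit pixel values into the range -1 to 1, which matches the tanh output of the decoder, and the plotting code later maps values back into 0 to 1. A minimal sketch of that round trip, using a handful of hypothetical pixel values rather than real KMNIST images:

```python
import numpy as np

# Hypothetical 8-bit pixel values spanning the full range
pixels = np.array([0, 51, 127, 128, 255], dtype=np.uint8)

# Forward: [0, 255] -> [-1, 1]
scaled = (pixels.astype(np.float32) - 127.5) / 127.5

# Inverse used for plotting: [-1, 1] -> [0, 1]
for_display = 0.5 * scaled + 0.5

print(scaled.min(), scaled.max())            # -1.0 1.0
print(for_display.min(), for_display.max())  # 0.0 1.0
```

Casting to float32 before subtracting matters: subtracting directly from a uint8 array would not yield the intended negative values.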

As previously mentioned, we only train the generator component, adversarial_autoencoder, on our first pass, aiming to minimize reconstruction error and generating a passable output for the discriminator (g_loss).

Afterwards, we switch to training the discriminator to correctly distinguish between real and generated latent representations (d_loss).

The actual training is handled by Keras’s built-in train_on_batch() method.
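The adversarial autoencoder was compiled with loss weights of 0.999 for the reconstruction term and 0.001 for the adversarial term, so the total generator loss is a weighted sum dominated by reconstruction. The arithmetic can be sketched as follows; the function name and the per-batch loss values are hypothetical and chosen only to show the relative scale of the two terms.

```python
def combined_generator_loss(mse, bce, weights=(0.999, 0.001)):
    """Weighted sum of the reconstruction (mse) and adversarial (bce) losses."""
    w_mse, w_bce = weights
    return w_mse * mse + w_bce * bce


# Hypothetical per-batch values, not measured results
mse, bce = 0.20, 3.00
total = combined_generator_loss(mse, bce)
print(round(total, 4))  # 0.2028
```

Even with an adversarial loss fifteen times larger than the reconstruction loss in this example, it contributes only 0.003 to the total, which is why the weighting is a natural knob to experiment with.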

    def train(epochs, batch_size=128, sample_interval=50):
        # Load the dataset
        X_train = train_images

        # Normalization: rescale to the range -1 to 1
        X_train = (X_train.astype(np.float32) - 127.5) / 127.5
        X_train = np.expand_dims(X_train, axis=3)

        # Adversarial ground truths
        valid = np.ones((batch_size, 1))
        fake = np.zeros((batch_size, 1))

        for epoch in range(epochs):
            # Train Discriminator and Generator

            # Select a random batch of images
            idx = np.random.randint(0, X_train.shape[0], batch_size)
            imgs = X_train[idx]

            latent_fake = encoder.predict(imgs)
            latent_real = np.random.normal(size=(batch_size, latent_dim))

            d_loss_real = discriminator.train_on_batch(latent_real, valid)
            d_loss_fake = discriminator.train_on_batch(latent_fake, fake)
            d_loss = 1 * np.add(d_loss_real, d_loss_fake)

            g_loss = adversarial_autoencoder.train_on_batch(imgs, [imgs, valid])

            # Plot the progress
            if epoch % sample_interval == 0:
                print("%d [D loss: %f, acc: %.2f%%] [G loss: %f, mse: %f]" % (
                    epoch, d_loss[0], 100 * d_loss[1], g_loss[0], g_loss[1]))
                sample_images(epoch)

            # Now that the initial training epoch has passed, we switch trainable roles
            if discriminator.trainable == False:
                discriminator.trainable = True
                adversarial_autoencoder.trainable = False
            elif discriminator.trainable == True:
                discriminator.trainable = False
                adversarial_autoencoder.trainable = True

With everything finished, let’s write a quick decoding command to save our outputs, and run our model!

    # Save generated images per specified epochs
    def sample_images(epoch):
        r, c = 5, 5
        z = np.random.normal(size=(r * c, latent_dim))
        gen_imgs = decoder.predict(z)
        # Rescale from [-1, 1] back to [0, 1] for plotting
        gen_imgs = 0.5 * gen_imgs + 0.5

        fig, axs = plt.subplots(r, c)
        cnt = 0
        for i in range(r):
            for j in range(c):
                axs[i, j].imshow(gen_imgs[cnt, :, :, 0], cmap=plt.cm.binary)
                axs[i, j].axis('off')
                cnt += 1
        fig.savefig("mnist_%d.png" % epoch)
        plt.close()

    epochs = 60000
    sample_interval = 2000
    sample_count = epochs / sample_interval

    train(epochs=epochs, batch_size=batch_size, sample_interval=sample_interval)

Outputs

Let’s visualize the progress of our AAE across 60000 epochs.

The training should take around 30 minutes, when run on a GPU-enabled Kaggle instance.

Figures: Epoch 6000, Epoch 28000, Epoch 58000.

The character outputs become significantly clearer with more epochs, particularly the more detailed, complex characters.

Recall, we are combining the styles of thousands of references in our generated outputs.

Given the low resolution of our input images and the relatively short training time, this is a more than acceptable result.

So there you have it, we’ve breathed life back into a near extinct script using a very simple adversarial autoencoder network.

We encourage you to play around with the weights and optimizers, and further improve upon our initial results.

Thank you to Arushi Goel for her valuable input.

References

Goel, Arushi

Lindenoren, Erik. GAN implementations in Keras.

Hubens, Nathan. Deep Inside: Autoencoders.

Nagabushan, Naresh. A wizard’s guide to Adversarial Autoencoders.