Moreover, the Bidirectional LSTM keeps contextual information from both directions, which is very useful in a text classification task (but won't work for a time-series prediction task, where future inputs are not available at prediction time).

For the simplest explanation of a bidirectional RNN, think of an RNN cell as taking a hidden state (a vector) and a word vector as input, and giving out an output vector and the next hidden state:

Hidden state, Word vector -> (RNN Cell) -> Output vector, Next hidden state

For a sequence of length 4, such as 'you will never believe', the RNN cell will give 4 output vectors, which can be concatenated and then used as part of a dense feedforward architecture.

In a bidirectional RNN, the only change is that we read the text in the normal order as well as in reverse. So we stack two RNNs in parallel, and hence we get 8 output vectors to append.

Once we get the output vectors, we send them through a series of dense layers and finally a softmax layer to build a text classifier.

Due to the limitations of RNNs, such as not remembering long-term dependencies, in practice we almost always use an LSTM or GRU to model long-term dependencies. In that case, just think of the RNN cell being replaced by an LSTM cell or a GRU cell in the diagram above. An example model is provided below. You can use CuDNNGRU interchangeably with CuDNNLSTM when you build models.

```python
from keras.layers import (Input, Embedding, Bidirectional, CuDNNLSTM,
                          GlobalAveragePooling1D, GlobalMaxPooling1D,
                          concatenate, Dense, Dropout)
from keras.models import Model

# Bidirectional LSTM
def model_lstm_du(embedding_matrix):
    inp = Input(shape=(maxlen,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    # Here 64 is the size (dim) of the hidden state vector as well as the output vector.
    # With return_sequences=True we get the output for the entire sequence.
    # So what is the dimension of the output of this layer?
    # 70 (maxlen) x 128 (64 units x 2 directions, concatenated).
    # CuDNNLSTM is a fast implementation of the LSTM layer in Keras that runs only on GPU.
    x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    conc = Dense(64, activation="relu")(conc)
    conc = Dropout(0.1)(conc)
    outp = Dense(1, activation="sigmoid")(conc)
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
```

I have written simplified and well-commented code to run this network (taking input from a lot of other kernels) in a Kaggle kernel for this competition. Do take a look there to learn the preprocessing steps and the word-to-vec embedding usage in this model. You will learn something. Please do upvote the kernel if you find it helpful.
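To make the forward-plus-reverse reading concrete, here is a minimal NumPy sketch of the bidirectional idea described above: for a 4-word sequence, one pass left to right and one pass right to left yield 8 output vectors in total. The `tanh` cell and random weights are purely illustrative (a real layer learns its weights, and each direction has its own), not the trained model from this post.

```python
import numpy as np

def rnn_cell(h, x, W_h, W_x):
    # One vanilla RNN step: next hidden state from previous state and input.
    # The output vector of a vanilla RNN cell is the hidden state itself.
    return np.tanh(W_h @ h + W_x @ x)

def bidirectional_rnn(seq, hidden_dim, rng):
    # Illustrative random weights; for brevity both directions share them here,
    # whereas a real bidirectional layer keeps separate weights per direction.
    d = seq.shape[1]
    W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
    W_x = rng.standard_normal((hidden_dim, d)) * 0.1
    fwd, bwd = [], []
    h = np.zeros(hidden_dim)
    for x in seq:                 # read the text in the normal order
        h = rnn_cell(h, x, W_h, W_x)
        fwd.append(h)
    h = np.zeros(hidden_dim)
    for x in seq[::-1]:           # read the text in reverse
        h = rnn_cell(h, x, W_h, W_x)
        bwd.append(h)
    # One forward and one backward output vector per time step.
    return fwd + bwd[::-1]

rng = np.random.default_rng(0)
seq = rng.standard_normal((4, 8))  # 'you will never believe' -> 4 word vectors of dim 8
outs = bidirectional_rnn(seq, hidden_dim=16, rng=rng)
print(len(outs))  # 8 output vectors, as described in the text
```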
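The pooling-and-concatenate step in the model can also be checked on paper: the bidirectional layer emits a 128-dim vector (64 forward + 64 backward) per time step, and concatenating the average pool with the max pool gives a single 256-dim vector per example. The NumPy sketch below assumes the post's settings (maxlen = 70, 64 LSTM units) and uses random data in place of real LSTM outputs.

```python
import numpy as np

maxlen, units = 70, 64
# Stand-in for Bidirectional(CuDNNLSTM(64, return_sequences=True)) output:
# one 128-dim vector (64 forward + 64 backward) per time step.
x = np.random.default_rng(1).standard_normal((maxlen, 2 * units))

avg_pool = x.mean(axis=0)   # what GlobalAveragePooling1D does over time
max_pool = x.max(axis=0)    # what GlobalMaxPooling1D does over time
conc = np.concatenate([avg_pool, max_pool])
print(conc.shape)  # (256,)
```

This is why the first Dense layer after the concatenation sees a fixed-size 256-dim input regardless of the sequence length.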