There are too many fc layers in both the CNN encoder and the RNN decoder; a single one in each is enough. When I trained the CRNN with only one fc layer in both the CNN and the LSTM, I got over 70% test accuracy (though there was still heavy overfitting). As num_fc_layers increases, the performance degrades. A minimal sketch is below.
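Here is a minimal PyTorch sketch of what I mean, not the repo's exact code: one fc layer projecting the CNN features and one fc layer on top of the LSTM. The ResNet-18 backbone and names like `embed_size`, `hidden_size`, and `num_classes` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # keep conv layers + avgpool, drop resnet's own fc head
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # the single fc layer in the encoder
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, x):                  # x: (batch*seq, 3, H, W) frames
        feats = self.backbone(x).flatten(1)
        return self.fc(feats)              # (batch*seq, embed_size)

class Decoder(nn.Module):
    def __init__(self, embed_size=256, hidden_size=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        # the single fc layer in the decoder
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                  # x: (batch, seq, embed_size)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])         # classify from the last time step
```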
Plus, BatchNorm probably conflicts with dropout: dropout perturbs the activation statistics that BN estimates during training, and BN already acts as a regularizer on its own. Maybe no dropout is better.
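A quick sketch of the two head variants I'm comparing (the layer widths and `num_classes` are made up for illustration); the BN-only head is the one I'd try first:

```python
import torch.nn as nn

num_classes = 10  # illustrative

# dropout placed after BN perturbs the statistics downstream layers see
head_with_dropout = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, num_classes),
)

# BN alone already regularizes; no dropout
head_bn_only = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, num_classes),
)
```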