Reading Noisy Captions Embedded in Images
Encoder-decoder network for extracting text from images using an attention-based LSTM
Developed an encoder-decoder network that extracts text captions embedded within images, even under noisy conditions. The encoder uses ResNet-50 to extract image features, and the decoder is an attention-based LSTM that focuses on the relevant image regions at each step of text generation.
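
To make the attention step concrete, here is a minimal sketch of a single decoder step in plain Python. It is not the project's implementation: the summary does not say which scoring function is used, so dot-product scoring is an assumption, and the names `attend`, `hidden`, and `regions` are hypothetical. In the real model the region features would come from the ResNet-50 feature map and the hidden state from the LSTM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden, regions):
    """Compute attention weights and a context vector for one decoder step.

    hidden  -- decoder hidden state, a list of d floats (assumed)
    regions -- list of image-region feature vectors, each of d floats (assumed)
    """
    # Assumption: dot-product scoring between the state and each region.
    scores = [sum(h * r for h, r in zip(hidden, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    # Context vector: attention-weighted sum of the region features.
    context = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(dim)
    ]
    return weights, context


# Toy input: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = attend(hidden, regions)
```

The weights sum to 1, and the regions that align best with the hidden state (the first and third here) receive the largest share. The context vector would then be fed to the LSTM alongside the previous token embedding.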
Key contributions:
- ResNet-50-based encoder for visual feature extraction
- Attention-based LSTM decoder for text generation
- Teacher forcing for stable training
- Beam search for improved prediction at inference
- BLEU score evaluation for caption quality
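
As an illustration of the beam search item above, the following is a minimal sketch over a toy next-token table, not the project's decoder. The names `beam_search`, `toy_step`, and `TABLE`, the `<s>` and `</s>` tokens, and the probabilities are all hypothetical; in the real model the step function would run the attention-based LSTM for one step. The sketch sums log-probabilities and applies no length normalization, which is a simplifying assumption.

```python
import math


def beam_search(step_fn, start, end, beam_width=3, max_len=10):
    """Return the (sequence, log-probability) pair with the best score.

    step_fn maps a partial sequence to a dict {next_token: probability}.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [token], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Hypotheses that never reached the end token still compete.
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


# Toy model: the next-token distribution depends only on the last token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def toy_step(seq):
    return TABLE[seq[-1]]


greedy_seq, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_width=1)
beam_seq, beam_score = beam_search(toy_step, "<s>", "</s>", beam_width=2)
```

With a beam width of 1 the search is greedy and commits to `a` because it is the likeliest first token, ending with total probability 0.3. With a beam width of 2 it keeps `b` alive long enough to find the sequence through `z`, whose total probability is 0.36.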