2017

Attention Is All You Need

Vaswani et al. • The foundational paper introducing the Transformer architecture that revolutionized NLP and became the basis for models like GPT and BERT.

Key Takeaway: Self-attention mechanisms can replace recurrence entirely, enabling parallel computation and capturing long-range dependencies more effectively.
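
A minimal sketch of the core operation, scaled dot-product self-attention (PyTorch assumed; real Transformers add learned query/key/value projections, multiple heads, positional encodings, and masking):

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d)."""
    d = x.size(-1)
    q, k, v = x, x, x                              # simplified: learned projections in practice
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # pairwise similarity between all positions
    weights = F.softmax(scores, dim=-1)            # each position attends to every other position
    return weights @ v                             # weighted sum: no recurrence, fully parallel

x = torch.randn(10, 64)      # 10 tokens, 64-dim embeddings
out = self_attention(x)      # (10, 64), computed in one parallel step
```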

2015

Show and Tell: A Neural Image Caption Generator

Vinyals et al. • Introduced the encoder-decoder framework for image captioning using CNN features and LSTM language models.

Key Takeaway: CNN-LSTM architecture effectively bridges visual understanding and language generation. Implemented this in my Advanced Image Captioning project.
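
A compact sketch of that encoder-decoder wiring (PyTorch assumed; the paper used an Inception-style CNN, while my project swapped in ResNet-101, but the pattern is the same): the image feature seeds an LSTM that predicts the caption word by word.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """CNN encoder produces an image feature; an LSTM decoder generates the caption."""
    def __init__(self, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        cnn = models.resnet101(weights="IMAGENET1K_V1")              # recent torchvision API
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])     # drop the classifier head
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)       # (B, 2048) global image features
        feats = self.img_proj(feats).unsqueeze(1)     # image acts as the first "token"
        seq = torch.cat([feats, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # next-word logits at every step
```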

2015

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau et al. • Introduced the attention mechanism for sequence-to-sequence models, allowing the decoder to focus on relevant parts of the input.

Key Takeaway: Attention lets the decoder dynamically weight the encoder states at each step, removing the bottleneck of compressing the entire input into a single fixed-length vector. Applied Bahdanau attention in my image captioning work.
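
A rough sketch of Bahdanau-style additive attention (PyTorch assumed; batching and dimensions simplified): the decoder state scores every encoder state, and the softmax-weighted sum becomes the context vector for the next prediction.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score each encoder state against the current decoder state."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_states, dec_state):
        # enc_states: (src_len, enc_dim), dec_state: (dec_dim,)
        scores = self.v(torch.tanh(self.W_enc(enc_states) + self.W_dec(dec_state)))  # (src_len, 1)
        weights = torch.softmax(scores, dim=0)          # attention distribution over source positions
        context = (weights * enc_states).sum(dim=0)     # weighted sum of encoder states
        return context, weights
```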

2016

You Only Look Once: Unified, Real-Time Object Detection

Redmon et al. • Pioneering work that frames object detection as a single regression problem, enabling real-time detection speeds.

Key Takeaway: Single-shot detection enables real-time applications. Used YOLOv8 in MARG for traffic vehicle detection at 30+ FPS.
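
A usage-level sketch with the ultralytics package (the API calls, weights file, and image name here are assumptions about the library, not from the paper): a single forward pass returns every box, class, and confidence for a frame.

```python
from ultralytics import YOLO  # assumes the ultralytics package is installed

model = YOLO("yolov8n.pt")        # small pretrained model; detection is one regression pass
results = model("traffic.jpg")    # hypothetical traffic-camera frame

for r in results:
    for box, cls, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        print(f"class={int(cls)} conf={conf:.2f} box={box.tolist()}")
```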

2015

Deep Residual Learning for Image Recognition

He et al. • Introduced identity skip connections that make very deep networks trainable by mitigating the degradation and vanishing-gradient problems that plague deep plain networks.

Key Takeaway: Residual connections are fundamental to deep network training. Used ResNet-101 as the encoder backbone in my image captioning project.
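
A minimal residual block sketch (PyTorch assumed; ResNet-101 actually stacks bottleneck blocks with 1x1/3x3/1x1 convolutions, but the skip connection is the essential idea):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, so gradients flow through the identity path."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: the block only learns the residual
```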

2020

Language Models are Few-Shot Learners (GPT-3)

Brown et al. • Demonstrated that scaling language models enables impressive few-shot and zero-shot learning capabilities.

Key Takeaway: At sufficient scale, powerful in-context learning capabilities emerge without any fine-tuning. This understanding informs my work on AI agents and LLM applications.
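
A toy illustration of in-context learning (the sentiment task and client call are hypothetical): the "training" happens entirely inside the prompt, with no gradient updates.

```python
# A few input/output pairs in the prompt; the model completes the pattern for the new input.
prompt = (
    "Review: The food was wonderful.\nSentiment: positive\n\n"
    "Review: Service was painfully slow.\nSentiment: negative\n\n"
    "Review: Great atmosphere and friendly staff.\nSentiment:"
)
# completion = llm.generate(prompt)   # whatever LLM client is in use; expected completion: " positive"
```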