NLP @ NUP (Spring 2024)

This intensive course introduces the foundational methods, tools, and building blocks behind modern natural language processing (NLP) applications.

Instructor: Dr. Dmitry Ustalov
Time: Thursdays at 6:30pm EET/EEST (aka 5:30pm CET/CEST)
Location: Online (Google Meet)

This course is organized in partnership between the Neapolis University Pafos and JetBrains.

Topics

N-Grams: History of the Field. Text Processing. Language Models and Resources. N-Grams and Smoothing. Perplexity.
Information Retrieval: Search Problem. Inverted Index. Vector Space Model. Boolean Retrieval. Ranked Retrieval. Learning-to-Rank. TREC.
Evaluation: Problem of Benchmarking. Human and Model-Based Evaluation. Statistical Analysis and Testing. Label Reliability. Ablations. Red Teaming.
Latent Representations: Distributional Semantics. Pointwise Mutual Information. Latent Semantic Analysis. Word Embeddings. Similarity, Analogies, and Lexical Semantics. Vector Search.
Transformer: Attention. Transformer. BERT and RoBERTa. GPT-1 and GPT-2. Not Transformer.
Large Language Models (LLMs): Pre-Training, Fine-Tuning, Alignment. Low-Rank Adaptation and Quantization. Prompting. Retrieval Augmented Generation (RAG). Leaderboards.

Classes

№	Topic	Lecture Date
1	N-Grams	`2024-03-21`
2	Information Retrieval	`2024-03-28`
3	Evaluation	`2024-04-04`
4	Latent Representations	`2024-04-11`
5	Transformer	`2024-04-18`
6	Large Language Models	`2024-04-25`

No commercial use allowed. Please acknowledge this page for all other uses.

Assignments

№	Topic	Seminar Date	Deadline
1	Search Engine	`2024-04-04`	`2024-04-25`
2	Question Answering	`2024-04-25`	`2024-05-16`

Assignments are available only to enrolled students. Solutions must be submitted to Kaggle by the end of the deadline day in the Anywhere on Earth (AoE) time zone. Please grant the course staff, Mikhail and Dmitry, read access to the notebooks with your solutions.

Grading

The course contains two assignments, and you must complete both to pass
Assignments are graded automatically using the Kaggle leaderboard
Your solutions must score higher than the baseline scores set by course staff
You may use large language models (LLMs) to complete the assignments, but you are expected to be able to explain every line of your code

Resources

Pierogue corpus (also available on Kaggle)
Jurafsky & Martin, Speech and Language Processing (3rd ed. draft)
WikiText-WordLevel tokenizer