If you ask your favorite language model AI to pick a random number, suggest a name for a story villain, or write a poem about cats, do you want it to generate the exact same answer every time? You probably hope for some amount of diversity; all of these requests have many different good answers, and it's undesirable for the LLM to only ever generate a handful of them. In our research, we seek to understand when diversity is desirable, to develop methods for measuring diversity, and to tune LLMs toward heightened diversity without sacrificing generation quality.
Today's large language models are trained on terabytes of textual data, most often scraped from the internet. Model trainers are able to control some aspects of their training data: when it is scraped, whether to apply quality or toxicity filters, and whether certain web domains should be omitted, for example. However, many aspects are outside of their control; ultimately, web-sourced data is controlled by people like you and me, who choose whether or not to put our content onto the internet. How do all these decisions—the ones made by model creators and the ones made by internet content creators—influence model capabilities?
Large language models memorize their training data. This is bad for many reasons. We have so far shown that deduplicating the training set can significantly reduce memorization caused by an example appearing in the dataset many times. We are continuing to study the properties of LM memorization, including membership inference attacks, the impact of decoding strategy and prompt choice on whether memorized content surfaces, and training strategies to control the amount of memorization.
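To make the memorization check concrete, here is a minimal sketch (not our actual pipeline) of one common way to test whether a generation reproduces training data verbatim: hash every overlapping character n-gram of the corpus and flag a generation that shares any n-gram with it. The function names and the choice of n are illustrative assumptions; production-scale systems typically use suffix arrays or Bloom filters instead of an in-memory set.

```python
from hashlib import blake2b

def ngram_hashes(text: str, n: int) -> set:
    """Hash every overlapping character n-gram of `text` (toy sketch)."""
    return {
        blake2b(text[i:i + n].encode(), digest_size=8).digest()
        for i in range(len(text) - n + 1)
    }

def flags_memorization(generation: str, corpus: list, n: int = 20) -> bool:
    """True if `generation` shares any n-character span with the corpus."""
    corpus_grams = set()
    for doc in corpus:
        corpus_grams |= ngram_hashes(doc, n)
    return bool(ngram_hashes(generation, n) & corpus_grams)
```

Longer n-grams make the test stricter (only long verbatim copies count as memorization), which is one reason reported memorization rates depend heavily on the matching threshold.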
Can natural language generation systems be used to build tools for creative writing? Controllable rewriting, text elaboration and expansion, and plot ideation are all tasks that NLG might be able to assist with. We are especially interested in investigating how creative writers interact with and perceive such tools.
Are you able to detect when a passage of text includes generated content? In our Real or Fake Text game, we evaluate how well humans can tell when text transitions from being human-written to being machine-generated. The data from the game can be used to answer questions about how factors like genre, decoding strategy, and annotator training impact the detectability of machine-generated text.
How good is an automatic detection system at this task? It depends on the decoding strategy used to generate the text.
- AI tools for supporting research in the humanities
- Improving the legibility of LLM reasoning traces for human readers
- Better algorithms for checking LLM memorization of pre-training data and other string matching tasks
- Stance detection for social media content
- LLM prompt brittleness/robustness
- Automatic red-teaming of AI systems
If you are a CMU student interested in working in any of the areas above (and especially the ones listed in these bullet points), please contact me.