Research

This page contains a non-exhaustive list of the areas I do research on.

What's in pre-training data, anyway?

Today's large language models are trained on terabytes of textual data, most often scraped from the internet. Model trainers are able to control some aspects of their training data: for example, what date it is scraped on, whether to apply quality or toxicity filters, and whether certain web domains should be omitted. However, many aspects are outside their control; ultimately, web-sourced data is controlled by people like you or me who choose to put (or not!) our content onto the internet. How do all these decisions, both the ones made by model creators and the ones made by internet content creators, influence model capabilities?

Language Model Memorization

Large language models memorize their training data. This is bad for many reasons. We have so far shown that deduplicating the training set can significantly reduce memorization caused by an example appearing in the dataset many times. We are continuing to study the properties of LM memorization, including membership inference attacks, the impact of decoding strategy and prompt choice on whether memorized content surfaces, and training strategies to control the amount of memorization.

Building Tools for Creative Writing

Can natural language generation systems be used to build tools for creative writing? Controllable rewriting, text elaboration and expansion, and plot ideation are all tasks that NLG might be able to assist with. We are especially interested in investigating how creative writers interact with and perceive such tools.

Detection of Generated Text

Are you able to detect when a passage of text includes generated content? In our Real or Fake Text game, we evaluate how well humans can tell when text transitions from being human-written to being machine-generated. The data from the game can be used to answer questions about how factors like genre, decoding strategy, and annotator training impact the detectability of machine-generated text.

How good is an automatic detection system at this task? It depends on the decoding strategy used to generate the text.