Research

This page contains a non-exhaustive list of my research areas.

Language Model Memorization

Large language models memorize their training data. This is problematic for many reasons, including the risk that models regurgitate private or copyrighted content verbatim. We have so far shown that deduplicating the training set can significantly reduce the memorization caused by an example appearing in the dataset many times. We are continuing to study the properties of LM memorization, including membership inference attacks, the impact of decoding strategy and prompt choice on whether memorized content surfaces, and training strategies to control the amount of memorization.
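
To make the deduplication idea concrete, here is a minimal sketch of exact-match deduplication by hashing normalized text. The function name and normalization choices are illustrative, not taken from our pipeline; practical deduplication also has to catch near-duplicates and repeated substrings (for example with suffix arrays or MinHash), not just exact copies.

```python
import hashlib

def deduplicate(examples):
    """Drop exact-duplicate training examples by hashing normalized text.

    A toy illustration: lowercasing and stripping whitespace is the only
    normalization applied, so paraphrases and partial overlaps survive.
    """
    seen = set()
    unique = []
    for text in examples:
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

corpus = ["The cat sat.", "the cat sat.  ", "A novel sentence."]
print(deduplicate(corpus))  # the near-identical second example is dropped
```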

Building Tools for Creative Writing

Can natural language generation systems be used to build tools for creative writing? Controllable rewriting, text elaboration and expansion, and plot ideation are all tasks that NLG might be able to assist with. We are especially interested in investigating how creative writers interact with and perceive such tools.

Detection of Generated Text

Are you able to detect when a passage of text includes generated content? In our Real or Fake Text game, we evaluate how well humans can tell when text transitions from being human-written to being machine-generated. The data from the game can be used to answer questions about how factors like genre, decoding strategy, and annotator training impact the detectability of machine-generated text.

How good is an automatic detection system at this task? It depends on the decoding strategy used to generate the text.
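
As a concrete baseline (not our detection system), automatic detectors often score a passage by its likelihood under a language model, since sampled text tends to occupy higher-likelihood regions than human writing. The sketch below uses GPT-2 perplexity with a threshold I picked purely for illustration; a real detector would be trained or calibrated on labeled data.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative cutoff; a real system would calibrate this on held-out data.
PERPLEXITY_THRESHOLD = 25.0

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; lower values suggest machine-like text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return math.exp(loss.item())

def looks_generated(text: str) -> bool:
    return perplexity(text) < PERPLEXITY_THRESHOLD
```

Decoding choices such as top-k or nucleus sampling change the statistical footprint of generated text, which is one reason the accuracy of likelihood-based detectors like this one varies with the decoding strategy.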