Looking Under the Hood
of DetectGPT

Improving state-of-the-art zero-shot machine-generated text detection algorithm

Large Language Models (LLMs) have revolutionized natural language processing, but their ability to generate highly convincing text raises concerns about misuse. In this work, we analyze DetectGPT along three axes: improving its detection performance, discovering adversarial attacks that can systematically fool it, and evaluating it on newer LLMs such as ChatGPT.

Our experiments show that selectively masking a combination of nouns, verbs, and adjectives improves the AUROC metric by up to 9.5%, demonstrating the importance of targeted masking strategies. Additionally, we reveal a limitation of DetectGPT in adversarial settings, where a snippet of text prepended to the prompt can degrade performance by up to 14%. Finally, we demonstrate that ChatGPT output is challenging to detect with DetectGPT.
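The selective masking idea above can be sketched in a few lines: instead of masking random spans, only content words (nouns, verbs, adjectives) are eligible for masking. The snippet below is a minimal illustration; the hardcoded POS lookup is a stand-in for a real part-of-speech tagger such as Stanza, and the function names are hypothetical.

```python
import random

# Toy POS lookup standing in for a real tagger like Stanza
# (assumption: the actual experiments use a proper POS tagger).
POS = {
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "ran": "VERB",
    "lazy": "ADJ", "big": "ADJ",
    "the": "DET", "on": "ADP", "a": "DET",
}

MASKABLE = {"NOUN", "VERB", "ADJ"}

def selective_mask(tokens, mask_frac=0.3, rng=None):
    """Replace a fraction of content words (nouns/verbs/adjectives)
    with a mask token, leaving function words untouched."""
    rng = rng or random.Random(0)
    candidates = [i for i, t in enumerate(tokens) if POS.get(t) in MASKABLE]
    k = max(1, int(len(candidates) * mask_frac))
    chosen = set(rng.sample(candidates, k))
    return ["<mask>" if i in chosen else t for i, t in enumerate(tokens)]

tokens = "the lazy cat sat on the mat".split()
masked = selective_mask(tokens)
# Only one of {lazy, cat, sat, mat} is masked; "the" and "on" are kept.
```

The masked positions would then be filled by a mask-filling model (e.g. T5) to produce the perturbations DetectGPT scores.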

I contributed to all aspects of the project, from ideation and implementation through writing.
  • Timeline: Sep – Dec 2022
  • Skills: Research
  • Team: Ryan Lian, Max Du, Kaien Yang
  • Tools: PyTorch, Stanza, OpenAI API

Context

DetectGPT is a state-of-the-art zero-shot detection method designed to have higher discriminative power than existing methods. It rests on the observation that LLM-generated text tends to be sampled near a local mode of the model's distribution, while human-written text can fall anywhere in that distribution. The method generates minor perturbations of a candidate passage using a perturbation function that makes slight changes while preserving meaning, then computes the perturbation discrepancy: the source model's log-probability of the original passage minus the average log-probability of its perturbations. A large positive discrepancy suggests the passage was generated by the source model.

However, DetectGPT has limitations: it is computationally intensive, vulnerable to attacks that manipulate the probability curvature it relies on, and possibly less effective on newer language models trained with more advanced techniques. Addressing these limitations is an active area of work.
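The perturbation-discrepancy computation described above can be sketched as follows. This is a minimal sketch, not DetectGPT's actual implementation: `log_prob` stands in for the source model's log-likelihood scorer and `perturb` for a mask-fill perturbation function (e.g. T5); both names and the toy stand-ins below are assumptions for illustration.

```python
def perturbation_discrepancy(log_prob, perturb, text, n_perturbations=10):
    """DetectGPT's core score: log p(original) minus the mean log p over
    slightly perturbed versions. A large positive value suggests the text
    sits near a local mode of the model's distribution, i.e. is likely
    machine-generated."""
    original = log_prob(text)
    perturbed = [log_prob(perturb(text, seed=i)) for i in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)

# Toy stand-ins for demonstration only (not the real models):
def toy_log_prob(text):
    # Pretend the "model" strongly prefers the exact original phrase.
    return 0.0 if text == "machine text" else -2.0

def toy_perturb(text, seed):
    return text + f" [perturbed {seed}]"

d = perturbation_discrepancy(toy_log_prob, toy_perturb, "machine text", 4)
# d = 0.0 - (-2.0) = 2.0 > 0, so the passage is flagged as likely model-generated
```

In the real method, both the scoring model and the perturbation model are neural networks, which is where the computational cost mentioned above comes from.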

How can we improve this lightweight detection method and push past its limits? And how does it fare against newer models like ChatGPT?