LLM-Generated Text Detection
Improving a state-of-the-art zero-shot algorithm for machine-generated text detection
Large Language Models (LLMs) have revolutionized natural language processing, but their ability to generate highly convincing text raises concerns about misuse. In this work, we analyze DetectGPT in three areas: improving DetectGPT performance, discovering adversarial attacks that can systematically fool DetectGPT, and evaluating DetectGPT on newer LLMs such as ChatGPT.
Our experiments show that selectively masking a combination of nouns, verbs, and adjectives improves the AUROC metric by up to 9.5%, which highlights the importance of targeted masking strategies. We also reveal a limitation of DetectGPT in adversarial contexts: a snippet of text prepended to the prompt can degrade performance by up to 14%. Finally, we show that ChatGPT output is challenging to detect with DetectGPT.
I contributed to all aspects of the project, from ideation and implementation to writing.
Context
DetectGPT is a state-of-the-art zero-shot detection method designed to have higher discriminative power than earlier methods. It rests on the assumption that text generated by a large language model (LLM) is mostly sampled near the mode of the model's distribution, whereas human-written text can fall anywhere in that distribution. The method generates minor perturbations of a candidate passage with a perturbation function, then computes the perturbation discrepancy: the log-probability of the original passage under the source model minus the average log-probability of its perturbations. A large positive discrepancy suggests that the passage was generated by the source model. The perturbation function should change the text only slightly while preserving its meaning. DetectGPT has limitations, however: it is computationally intensive, because every perturbation must be scored by the model, and it is vulnerable to attacks that manipulate probability curvature. It may also be less effective on newer language models trained with more advanced techniques. Efforts to address these limitations are ongoing.
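The perturbation discrepancy described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `perturbation_discrepancy`, `toy_log_prob`, and `make_toy_perturb` are hypothetical names, and the toy scorer and word-swapping perturbation only stand in for a real source model and a real meaning-preserving perturbation function.

```python
import random
import statistics
from typing import Callable


def perturbation_discrepancy(
    passage: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str], str],
    n_perturbations: int = 20,
) -> float:
    """Log-probability of the passage minus the mean log-probability
    of its perturbations. Larger values point toward machine text."""
    original = log_prob(passage)
    perturbed = [log_prob(perturb(passage)) for _ in range(n_perturbations)]
    return original - statistics.fmean(perturbed)


# --- Toy stand-ins (assumptions, for illustration only) ---

COMMON_WORDS = {"the", "cat", "sat", "on", "mat"}


def toy_log_prob(text: str) -> float:
    # Pretend model: common words are likely (-1), anything else is not (-5).
    return sum(-1.0 if w in COMMON_WORDS else -5.0 for w in text.split())


def make_toy_perturb(rng: random.Random) -> Callable[[str], str]:
    def perturb(text: str) -> str:
        # Replace one randomly chosen word with an unlikely token.
        words = text.split()
        words[rng.randrange(len(words))] = "zzz"
        return " ".join(words)

    return perturb


rng = random.Random(0)
perturb = make_toy_perturb(rng)

# A passage near the toy model's mode: every perturbation lowers its score.
model_like = perturbation_discrepancy("the cat sat on the mat", toy_log_prob, perturb)

# A passage the toy model already finds unlikely: perturbing changes nothing.
human_like = perturbation_discrepancy("zzz qqq www", toy_log_prob, perturb)

print(model_like, human_like)
```

In this toy setting the first passage yields a positive discrepancy and the second yields zero, which mirrors the intuition in the paragraph above. In practice, `log_prob` would query the source language model, and the number of perturbations is what drives the computational cost.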
How can we improve this lightweight detection method and push past its limitations? How well does it work on newer models like ChatGPT?
Outcome
(Read the paper on the cover for a detailed explanation)
Theoretical Derivation