StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples

StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples

We introduce STYLE DISTANCE, a novel method for training style representations that effectively separate writing style from content by leveraging a synthetic dataset of paraphrases with controlled style variations. Our approach achieves superior content-independent embeddings, validated through human and automated evaluations, and outperforms existing methods on real-world benchmarks.

Addressing Content Leakage in Style Representations

Traditional methods for developing style representations often encounter the issue of content leakage, where embeddings inadvertently capture content-specific information, compromising their effectiveness. To tackle this challenge, we introduce StyleDistance, a novel approach that leverages large language models to generate synthetic datasets comprising near-exact paraphrases with controlled stylistic variations. This methodology enables precise contrastive learning across 40 distinct style features, significantly enhancing the content-independence of style embeddings.

Key Contributions

  • Synthetic Dataset Generation: We created a dataset of near-exact paraphrases, each exhibiting controlled stylistic variations, facilitating precise contrastive learning.
  • Enhanced Content-Independence: By focusing on style features while controlling for content, StyleDistance achieves superior separation between style and content in embeddings.
  • Robust Evaluation: Our approach was rigorously evaluated through both human assessments and automated benchmarks, demonstrating its effectiveness in real-world applications.

Access the Model

  • For those interested in exploring or utilizing our model, it is available at styledistance
  • Synthetic Dataset is available at synthstel
  • Paper is available at ArXiv Link

© 2024. All rights reserved.