GetTogether.community content is used to train LLMs #320

@cassidyjames

You can confirm that 8.4k tokens were scraped from GetTogether.community by Common Crawl and are included in Google's C4 dataset. It's likely that other crawlers have scraped, and will continue to scrape, user-generated content from GetTogether.community to train proprietary large language models.

This can be discouraged for Common Crawl and ChatGPT by adding the following rules to robots.txt:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
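
To sanity-check rules like these before deploying them, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `is_allowed` and the `/events/` path are hypothetical, chosen only for illustration; the rules are copied from above. Note that robots.txt is advisory, so this only shows how a compliant crawler would interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the proposal above.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of fetching them from the live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The /events/ path is an assumed example URL, not taken from the issue.
    url = "https://gettogether.community/events/"
    for agent in ("CCBot", "ChatGPT-User", "Googlebot"):
        print(agent, "allowed" if is_allowed(agent, url) else "blocked")
```

Crawlers not named in the file, such as Googlebot in this sketch, stay allowed because there is no `User-agent: *` entry.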
