GetTogether.community content is used to train LLMs #320

You can confirm that [8.4k tokens were scraped from GetTogether.community](https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/#lookup-table) by CommonCrawl and are included in Google's C4 dataset. It's likely that other crawlers have scraped, and will continue to scrape, user-generated content from GetTogether.community to train proprietary large language models.

Crawling by CommonCrawl and ChatGPT can be discouraged by adding the following rules to robots.txt:

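```ini
# CommonCrawl's crawler
User-agent: CCBot
Disallow: /

# Requests made by ChatGPT on behalf of a user
User-agent: ChatGPT-User
Disallow: /

# OpenAI's crawler for collecting training data
User-agent: GPTBot
Disallow: /
```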
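One way to sanity-check rules like these before deploying them is to run them through Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: the `is_allowed` helper and the `/events/` URL are assumptions made for the example, not part of the GetTogether codebase, and `Googlebot` is included just as a user agent that the rules do not mention.

```python
from urllib.robotparser import RobotFileParser

# The CCBot and ChatGPT-User rules proposed in this issue, kept as a string
# so the check runs without any network access.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

# Hypothetical page URL, used only for illustration.
URL = "https://gettogether.community/events/"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "ChatGPT-User", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, URL) else "blocked"
        print(f"{agent}: {verdict}")
```

Note that robots.txt is advisory: this only confirms how a compliant crawler would interpret the rules, not that any particular crawler honors them.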
