
Conversation

@shanyas10
Collaborator

@timoschick Raising this PR to check whether this is what's expected. Also, while annotating data myself, I had taken the top 100 keywords, since they were the most likely to have a Wikipedia description. But since we aren't doing that here, chances are that some examples won't contain a description.

Yet to write tests, so the PR is not mergeable.

@shanyas10 shanyas10 requested a review from timoschick October 15, 2021 06:51
Contributor

@timoschick timoschick left a comment


Hi @shanyas10, thanks for taking the time to write this. This is exactly what I'd expect a MetadataProcessor for websites to look like 👍
I've added a couple of comments on how I think the code might further be improved.

"""Metadata preprocessor for adding website description based on URLs."""

website_description_cache = {}
org_list = ["com", "co", "org", "go", "in"]
Contributor

What's the reason for using this specific set of top-level domains?

website_description = self._extract_website_desc_from_url(urls[0])

if website_description:
    metadata.append({"key": "timestamp", "type": "global", "value": website_description})
Contributor

This should probably be something like "key": "website_description"
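
Applied to the line above, the suggested fix would presumably read:

metadata.append({"key": "website_description", "type": "global", "value": website_description})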


def _extract_website_desc_from_url(self, url: str) -> Optional[str]:

    domain = url.split("/")[2]  # e.g. http://www.californialandcan.org/Plumas -> www.californialandcan.org
Contributor

This would fail for URLs that don't start with http://. I guess all URLs in C4 start with http://, but it would probably be good to be on the safe side here. Also, you may want to consider using some library like urllib (see https://docs.python.org/3/library/urllib.parse.html) for splitting URLs into components as this will take care of all unexpected edge cases for you.
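
A minimal sketch of the urllib approach, using the URL from the diff above (the scheme fallback at the end is an illustration, not code from this PR):

from urllib.parse import urlparse

# urlparse splits a URL into components and copes with edge cases a manual split misses
print(urlparse("http://www.californialandcan.org/Plumas").netloc)  # www.californialandcan.org

# without a scheme, netloc comes back empty, so a fallback could prepend one
url = "www.californialandcan.org/Plumas"
if "//" not in url:
    url = "http://" + url
print(urlparse(url).netloc)  # www.californialandcan.org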

Collaborator

@tianjianjiang tianjianjiang Oct 15, 2021

Pardon me for the intrusion.
I have taken some simple steps in #24 along the lines of what @timoschick suggested.
I'll find a way to add some suggested code snippets here later.


return self.website_description_cache[keyword]

def extract_wiki_desc(self, keyword: str) -> Optional[str]:
Contributor

As Jean Zay has no internet access, I think we need to download all the wiki descriptions first as a preprocessing step.

@shanyas10
Collaborator Author

shanyas10 commented Nov 1, 2021

download_wiki_dump.sh -> clones the repo that contains the dump_db file for fetching Wikipedia data
WebsiteMetadataProcessor -> needs to be initialized with the path to the above repo

Points for discussion:

  • website_desc_utils L[19:27] fetches the first two sentences of the description for a given title. Splitting on '. ' might not work in some cases, e.g. "XYZ is a U.S. based company." would partition into "XYZ is a U.S" and "based company". While it worked for all the cases I tested (around 10), we could use a more robust method if one exists (see the sketch after this list).
  • Suggestions on how to test: using the processor requires downloading the db file as a prerequisite. Any suggestions on how we can test the processor without the download? Or should we create a db file from a smaller wiki dump and use that as test data?
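
A minimal sketch of the difference (assuming nltk's punkt sentence models are downloaded; exact splits can vary with the model version):

import nltk

nltk.download("punkt", quiet=True)  # one-time download of the Punkt sentence tokenizer

text = "XYZ is a U.S. based company. It was founded in 2001."
print(text.split(". "))          # naive split breaks on the abbreviation: ['XYZ is a U.S', 'based company', 'It was founded in 2001.']
print(nltk.sent_tokenize(text))  # ['XYZ is a U.S. based company.', 'It was founded in 2001.']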

Edit: Changed the code to use nltk tokenizer for sentence splitting.
Edit2: Mocked the DumpDB object to run the tests, but I'm not sure if this is the best way to do it.
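
The mocking in Edit2 could look something like this hypothetical sketch (not the PR's actual test code; it assumes DumpDB paragraphs expose a .text attribute, as wikipedia2vec's DumpDB does):

from unittest import mock

# Build a stand-in for the DumpDB so tests need no downloaded dump file.
fake_paragraph = mock.Mock(text="Wikipedia is a free online encyclopedia.")
fake_db = mock.Mock()
fake_db.get_paragraphs.return_value = [fake_paragraph]

# Any code that calls db.get_paragraphs(title)[0].text now works offline.
assert fake_db.get_paragraphs("Wikipedia")[0].text.startswith("Wikipedia")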

@shanyas10 shanyas10 changed the base branch from TS/ds_preprocessing to master November 1, 2021 13:47
Contributor

@timoschick timoschick left a comment


Thanks a lot, @shanyas10 - looks good to me 👍

def fetch_wikipedia_description_for_title(self, title: str) -> Optional[str]:
    try:
        text = self.wiki_dump_db.get_paragraphs(title)[0].text
        text = nltk.sent_tokenize(text)[0]  # Picking the first sentence
Contributor

Would it maybe make sense to remove information in brackets? For example, the first sentence in the Wikipedia article on Wikipedia itself is "Wikipedia (/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-) is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system." At least in this case, I would think that the part in brackets, "(/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-)", is completely useless for the model but will probably require many tokens. Any thoughts on that?

Collaborator

Sounds like a great idea!

If it saves you some time, @shanyas10, here is a little code snippet that will remove the text in parentheses:

import re

text = re.sub(r"\((?:[^)(]|\([^)(]*\))*\)", "", text)

On @timoschick's example, it will output: "Wikipedia is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system."

Collaborator Author

Makes a lot of sense, @timoschick.
Thank you for the snippet, @SaulLu. I've made the changes accordingly.

Collaborator

@SaulLu SaulLu left a comment


Thank you so much for all your hard work @shanyas10! I especially appreciate all the efforts you made to make this script run without internet access 🎊 !

Thanks also for the tests!

Comment on lines +7 to +8
wikipedia2vec==1.0.5
nltk==3.6.5
Collaborator

Really a tiny detail, but I think we can make these dependencies optional by adding them to extras_require in setup.py 🙂 :

setup(
    name="bsmetadata",
    python_requires=">=3.7.11, <3.10",
    version="0.1.0",
    url="https://github.com/bigscience-workshop/metadata.git",
    author="Multiple Authors",
    author_email="xxx",
    description="Codebase for including metadata (e.g., URLs, timestamps, HTML tags) during language model pretraining.",
    packages=find_packages(),
    install_requires=install_requires,
    extras_require={
        "website_description_preprocessing": ["wikipedia2vec==1.0.5", "nltk==3.6.5"],
    },
)
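
(For reference, the extra would then be installed with pip install ".[website_description_preprocessing]".)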

Collaborator Author

Thank you for pointing this out, @SaulLu :) I've made the required changes.


@shanyas10
Collaborator Author

Since the changes post-approval are minor, I'm merging the PR :)

@shanyas10 shanyas10 merged commit a5a0c6a into bigscience-workshop:master Nov 23, 2021