Adding WebsiteMetadataProcessor to preprocessing_utils #49
Conversation
timoschick left a comment:
Hi @shanyas10, thanks for taking the time to write this. This is exactly what I'd expect a MetadataProcessor for websites to look like 👍
I've added a couple of comments on how I think the code might further be improved.
bsmetadata/preprocessing_utils.py (outdated)

    """Metadata preprocessor for adding website description based on URLs."""

    website_description_cache = {}
    org_list = ["com", "co", "org", "go", "in"]
What's the reason for using this specific set of top-level domains?
bsmetadata/preprocessing_utils.py (outdated)

    website_description = self._extract_website_desc_from_url(urls[0])

    if website_description:
        metadata.append({"key": "timestamp", "type": "global", "value": website_description})
This should probably be something like "key": "website_description"
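A minimal sketch of the suggested fix (the surrounding variables are placeholders for illustration, not code from the PR): the appended entry should use a key that describes its content rather than reusing "timestamp".

```python
# Hypothetical illustration of the suggested fix: use "website_description"
# as the metadata key instead of "timestamp".
metadata = []
website_description = "Example description fetched for a URL."  # placeholder value

if website_description:
    metadata.append({"key": "website_description", "type": "global", "value": website_description})
```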
bsmetadata/preprocessing_utils.py (outdated)

    def _extract_website_desc_from_url(self, url: str) -> Optional:

        domain = url.str.split("/")[2]  # e.g. http://www.californialandcan.org/Plumas -> www.californialandcan.org
This would fail for URLs that don't start with http://. I guess all URLs in C4 start with http://, but it would probably be good to be on the safe side here. Also, you may want to consider using some library like urllib (see https://docs.python.org/3/library/urllib.parse.html) for splitting URLs into components as this will take care of all unexpected edge cases for you.
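As a sketch of that suggestion (the helper name here is hypothetical, not from the PR), urllib.parse can extract the domain and also cope with scheme-less URLs:

```python
from urllib.parse import urlparse

def extract_domain(url: str) -> str:
    # urlparse puts scheme-less URLs entirely into the path component,
    # so prepend a scheme when one is missing before parsing.
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).netloc

# extract_domain("http://www.californialandcan.org/Plumas") -> "www.californialandcan.org"
# extract_domain("www.californialandcan.org/Plumas")        -> "www.californialandcan.org"
```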
Pardon me for the intrusion.
I have done some simple steps in #24 just like what @timoschick suggested.
Later I will find a way to put some suggestion code snippets here.
bsmetadata/preprocessing_utils.py (outdated)

        return self.website_description_cache[keyword]

    def extract_wiki_desc(self, keyword: str) -> Optional:
As JeanZ has no access to the internet, I think we need to first download all wiki descriptions as a preprocessing step.
download_wiki_dump.sh -> will clone the repo that contains the dump_db file for fetching Wikipedia data.
Points for discussion:
Edit: Changed the code to use the nltk tokenizer for sentence splitting.
timoschick left a comment:
Thanks a lot, @shanyas10 - looks good to me 👍
    def fetch_wikipedia_description_for_title(self, title: str) -> Optional:
        try:
            text = self.wiki_dump_db.get_paragraphs(title)[0].text
            text = nltk.sent_tokenize(text)[0]  # Picking the first sentence
Would it maybe make sense to remove information in brackets? For example, the first sentence in the Wikipedia article on Wikipedia itself is Wikipedia (/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-) is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system. At least in this case, I would think that the part in brackets (/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-) is completely useless for the model but will probably require many tokens. Any thoughts on that?
Sounds like a great idea!
If it saves you some time @shanyas10, here is a little code snippet that will remove the text in parentheses:

    import re
    text = re.sub(r"\((?:[^)(]|\([^)(]*\))*\)", "", text)

On @timoschick's example, it will output: Wikipedia is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system.
Makes a lot of sense, @timoschick.
Thank you for the snippet, @SaulLu. I've made the changes accordingly.
SaulLu left a comment:
Thank you so much for all your hard work @shanyas10! I especially appreciate all the efforts you made to make this script run without internet access 🎊 !
Thanks also for the tests!
    wikipedia2vec==1.0.5
    nltk==3.6.5
Really a tiny detail, but I think we can make these dependencies optional by adding them to setup.py under extras_require 🙂:
setup(
name="bsmetadata",
python_requires=">=3.7.11, <3.10",
version="0.1.0",
url="https://github.com/bigscience-workshop/metadata.git",
author="Multiple Authors",
author_email="xxx",
description="Codebase for including metadata (e.g., URLs, timestamps, HTML tags) during language model pretraining.",
packages=find_packages(),
install_requires=install_requires,
extras_require={
"website_description_preprocessing": ["wikipedia2vec==1.0.5", "nltk==3.6.5"],
},
)
Thank you for pointing this out @SaulLu :) I've made the required changes.
Since the changes post-approval are minor, I'm merging the PR :)
@timoschick raising this PR to understand if this is what is expected. Also, while annotating data myself, I had taken the top 100 keywords since they were most likely to have a Wikipedia description. But since we aren't doing that here, chances are that some examples might not contain the description.
Tests are yet to be written, so the PR is not mergeable.