@Francoisvt04

Dexter Crawlers Changelog from Assemble

1. The following crawlers have been added:
howwemadeitinafrica, savca, rhodesunimathewblog, worldstage, classicfm, afp, naijanews, dailytrustnp, newteleonline, thepoint, dailytimes, thenation, mediamaxnet, leadership, theinterview, rsaparliament, guardian, nationaldailyng, nta, acdivoca, thisdaylive, channelafrica, nan, nigeriatoday, businessdayonline, standardmediaktnnews, globaltimescn, nationalmirror, monitorke, newsverge, sundiatapost, agrilinks, businessdailyafrica, thebusinesspost, theguardianuk, independentng, thenerveafrica, amehnews, sunnewsonline, seedmagazine, hallmarknews, destinyconnect, economist, washingtonpost, amabhungane, africainvestor, outrepreneurs, cnbcafrica, planintl, bloomberg

2. In document_processor.py:
The crawler classes were registered under the DocumentProcessor and DocumentProcessorNT classes.

3. In medium.py:
Mediums for each of the new crawlers were added in the create_defaults class method. A URL exception for mathewnyaungwa.blogspot.co.za was added in the is_tld_exception class method, and a sub_domain_exception_list was added in the for_url class method to handle blogspot.co.za.

4. In country.py:
Country codes for the newly added crawlers were added in the create_defaults class method.

5. The TLD name list had to be updated to include some of the newly added country codes.
These were the commands I ran to update the list:

  • from tld.utils import update_tld_names
  • update_tld_names()
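
For reviewers, the blogspot.co.za handling from item 3 can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in medium.py: the function name `medium_domain` and the module-level `SUB_DOMAIN_EXCEPTIONS` list are hypothetical stand-ins for the `sub_domain_exception_list` in `for_url`, and it uses only the standard library rather than the `tld` package.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the sub_domain_exception_list described in item 3.
# Hosts under these suffixes are shared blogging platforms, so the label in
# front of the suffix is what identifies the individual site.
SUB_DOMAIN_EXCEPTIONS = ["blogspot.co.za"]


def medium_domain(url):
    """Return the domain string that would be used to look up a Medium.

    Assumption: ordinary hosts are matched on the host name with any
    leading "www." removed; hosts under an exception suffix keep exactly
    one extra label so each blog maps to its own Medium.
    """
    host = (urlparse(url).hostname or "").lower()
    for suffix in SUB_DOMAIN_EXCEPTIONS:
        if host.endswith("." + suffix):
            # Keep only the label directly in front of the shared suffix.
            blog_label = host[: -len(suffix) - 1].split(".")[-1]
            return blog_label + "." + suffix
    return host[4:] if host.startswith("www.") else host


if __name__ == "__main__":
    print(medium_domain("http://mathewnyaungwa.blogspot.co.za/2016/01/post.html"))
    print(medium_domain("https://www.economist.com/news"))
```

Without the exception, every blog hosted on blogspot.co.za would collapse to the same domain, which is why the blog's own label is kept in the lookup key.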

@Francoisvt04

Hey Matt, these are the new crawlers MMA asked for. Please review them along with the changelog notes I added and give feedback as needed.
