From dd1bcd5a189e1d1ea33f5cdf1356823a35956628 Mon Sep 17 00:00:00 2001 From: Simran Shaikh Date: Sun, 5 Jan 2025 12:26:54 +0530 Subject: [PATCH 1/3] final commit to #101 --- .../NameEntityRecoqnition.ipynb | 1430 +++++++++++++++++ Name Entity Recognition/README.MD | 61 + Name Entity Recognition/requirements.txt | Bin 0 -> 2866 bytes 3 files changed, 1491 insertions(+) create mode 100644 Name Entity Recognition/NameEntityRecoqnition.ipynb create mode 100644 Name Entity Recognition/README.MD create mode 100644 Name Entity Recognition/requirements.txt diff --git a/Name Entity Recognition/NameEntityRecoqnition.ipynb b/Name Entity Recognition/NameEntityRecoqnition.ipynb new file mode 100644 index 00000000..03b6efb3 --- /dev/null +++ b/Name Entity Recognition/NameEntityRecoqnition.ipynb @@ -0,0 +1,1430 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Named Entity Recognition (NER)" + ], + "metadata": { + "id": "jK4Lnj_do6af" + } + }, + { + "cell_type": "markdown", + "source": [ + "Named Entity Recognition (NER) is usually the first step in Information Retrieval, the task of extracting useful information from unstructured raw text documents. NER works by locating named entities in unstructured text and classifying them into standard categories such as person names, locations, organizations, time expressions, quantities, monetary values, percentages, and codes. 
spaCy comes with an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens.\n", + "\n", + "spaCy also lets you add arbitrary classes to the entity recognition system and update the model with new examples beyond the entities it already knows.\n", + "\n", + "spaCy's ‘ner’ pipeline component identifies token spans that fit a predetermined set of named entities; these are available as the ‘ents’ property of a Doc object." + ], + "metadata": { + "id": "7l0bmqAOo934" + } + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "LVeJisRWo2Qs" + }, + "outputs": [], + "source": [ + "# !pip install spacy" + ] + }, + { + "cell_type": "code", + "source": [ + "# Perform standard imports\n", + "import spacy\n", + "nlp = spacy.load('en_core_web_sm')" + ], + "metadata": { + "id": "l54qNFBrpEou" + }, + "execution_count": 2, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Write a function to display basic entity info:\n", + "def show_ents(doc):\n", + " if doc.ents:\n", + " for ent in doc.ents:\n", + " print(ent.text+' - ' +str(ent.start_char) +' - '+ str(ent.end_char) +\n", + " ' - '+ent.label_+ ' - '+str(spacy.explain(ent.label_)))\n", + " else:\n", + " print('No named entities found.')" + ], + "metadata": { + "id": "y9GDIQ0ppElz" + }, + "execution_count": 3, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "doc1 = nlp(\"Apple is looking at buying U.K. startup for $1 billion\")\n", + "\n", + "show_ents(doc1)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "aEarPqg-pEjF", + "outputId": "6804381c-b184-48a8-dddc-330335d3eeab" + }, + "execution_count": 4, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Apple - 0 - 5 - ORG - Companies, agencies, institutions, etc.\n", + "U.K. 
- 27 - 31 - GPE - Countries, cities, states\n", + "$1 billion - 44 - 54 - MONEY - Monetary values, including unit\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Here we see several tokens combine to form the single entity $1 billion." + ], + "metadata": { + "id": "wLQRFhRKpfhD" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K6lLpkXKwqau" + }, + "source": [ + "
<table><tr><th>Text</th><th>Start</th><th>End</th><th>Label</th><th>Description</th></tr>
<tr><td>Apple</td><td>0</td><td>5</td><td>ORG</td><td>Companies, agencies, institutions.</td></tr>
<tr><td>U.K.</td><td>27</td><td>31</td><td>GPE</td><td>Geopolitical entity, i.e. countries, cities, states.</td></tr>
<tr><td>$1 billion</td><td>44</td><td>54</td><td>MONEY</td><td>Monetary values, including unit.</td></tr></table>
" + ] + }, + { + "cell_type": "code", + "source": [ + "doc2 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')\n", + "\n", + "show_ents(doc2)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pRUi0KVYpEgk", + "outputId": "bae69cbe-1d37-4848-a041-9d122e6dea6b" + }, + "execution_count": 5, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Washington, DC - 12 - 26 - GPE - Countries, cities, states\n", + "next May - 27 - 35 - DATE - Absolute or relative dates or periods\n", + "the Washington Monument - 43 - 66 - ORG - Companies, agencies, institutions, etc.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Here we see tokens combine to form the entities next May and the Washington Monument" + ], + "metadata": { + "id": "PaagbjIapodP" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uCFuwuAbwqav" + }, + "source": [ + "## Entity Annotations\n", + "`Doc.ents` are token spans with their own set of annotations.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<table><tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The token span's *start* index position in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The token span's *stop* index position in the Doc</td></tr>
<tr><td>`ent.start_char`</td><td>The entity text's *start* index position in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The entity text's *stop* index position in the Doc</td></tr></table>
\n", + "\n" + ] + }, + { + "cell_type": "code", + "source": [ + "doc3 = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')\n", + "\n", + "for ent in doc3.ents:\n", + " print(ent.text, ent.label_)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SbNl84OhpEer", + "outputId": "0a0e1a5c-a788-4bd8-c97a-0f3640d7a4d0" + }, + "execution_count": 6, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "500 dollars MONEY\n", + "Microsoft ORG\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NIehK1Bzwqav" + }, + "source": [ + "### Accessing Entity Annotations\n", + "\n", + "The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value using **ent.label** or as a string using **ent.label_**.\n", + "\n", + "The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.\n", + "\n", + "You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, it will return an empty string." 
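The IOB flags described above are enough to reassemble entity spans by hand. Here is a minimal pure-Python sketch (an illustrative helper, not part of the spaCy API) that groups per-token `(text, ent_iob_, ent_type_)` triples back into entities:

```python
def iob_to_entities(tokens):
    """Group (text, iob, ent_type) triples into (entity_text, label) pairs.

    Pure-Python illustration of the IOB scheme; spaCy performs the
    equivalent grouping internally when it builds doc.ents.
    """
    entities, current, label = [], [], None
    for text, iob, ent_type in tokens:
        if iob == "B":                       # beginning of a new entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [text], ent_type
        elif iob == "I":                     # continuation of the current entity
            current.append(text)
        else:                                # "O": outside any entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:                              # flush a trailing entity
        entities.append((" ".join(current), label))
    return entities

tokens = [("San", "B", "GPE"), ("Francisco", "I", "GPE"),
          ("considers", "O", ""), ("banning", "O", ""),
          ("sidewalk", "O", ""), ("delivery", "O", ""), ("robots", "O", "")]
print(iob_to_entities(tokens))  # [('San Francisco', 'GPE')]
```

The same grouping logic works for any IOB-tagged token stream, independent of which model produced the tags.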
+ ] + }, + { + "cell_type": "code", + "source": [ + "doc = nlp(\"San Francisco considers banning sidewalk delivery robots\")\n", + "\n", + "# document level\n", + "for e in doc.ents:\n", + " print(e.text, e.start_char, e.end_char, e.label_)\n", + "# OR\n", + "ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] #in a list comprehension form\n", + "print(ents)\n", + "\n", + "# token level\n", + "# doc[0], doc[1] ...will have tokens stored.\n", + "\n", + "ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]\n", + "ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]\n", + "print(ent_san)\n", + "print(ent_francisco)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "da6biwhApEbK", + "outputId": "487a98f1-7815-479e-a025-4d4f557e7987" + }, + "execution_count": 7, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "San Francisco 0 13 GPE\n", + "[('San Francisco', 0, 13, 'GPE')]\n", + "['San', 'B', 'GPE']\n", + "['Francisco', 'I', 'GPE']\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xf06daxGwqaw" + }, + "source": [ + "IOB SCHEME\n", + "\n", + "I – Token is inside an entity.\n", + "\n", + "O – Token is outside an entity.\n", + "\n", + "B – Token is the beginning of an entity." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "beMknmydwqaw" + }, + "source": [ + "
<table><tr><th>Text</th><th>ent_iob</th><th>ent_iob_</th><th>ent_type_</th><th>Description</th></tr>
<tr><td>San</td><td>3</td><td>B</td><td>\"GPE\"</td><td>beginning of an entity</td></tr>
<tr><td>Francisco</td><td>1</td><td>I</td><td>\"GPE\"</td><td>inside an entity</td></tr>
<tr><td>considers</td><td>2</td><td>O</td><td>\"\"</td><td>outside an entity</td></tr>
<tr><td>banning</td><td>2</td><td>O</td><td>\"\"</td><td>outside an entity</td></tr>
<tr><td>sidewalk</td><td>2</td><td>O</td><td>\"\"</td><td>outside an entity</td></tr>
<tr><td>delivery</td><td>2</td><td>O</td><td>\"\"</td><td>outside an entity</td></tr>
<tr><td>robots</td><td>2</td><td>O</td><td>\"\"</td><td>outside an entity</td></tr></table>
" + ] + }, + { + "cell_type": "markdown", + "source": [ + "**Note:** In the above example only `San Francisco` is recognized as named entity. hence rest of the tokens are described as outside the entity. And in `San Francisco` `San` is the starting of the entity and `Francisco` is inside the entity." + ], + "metadata": { + "id": "f7yi6WBkqOOi" + } + }, + { + "cell_type": "markdown", + "source": [ + "GPE==> Geopolitical Entity" + ], + "metadata": { + "id": "cNWBf5x5MD7J" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EG-M-TwYwqaw" + }, + "source": [ + "## NER Tags\n", + "Tags are accessible through the `.label_` property of an entity.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<table><tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including \"%\".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>\"first\", \"second\", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr></table>
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cjb_TkZmwqaw" + }, + "source": [ + "___\n", + "## User Defined Named Entity and Adding it to a Span\n", + "Normally we would have spaCy build a library of named entities by training it on several samples of text.
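One caveat before the examples: spaCy rejects a manually added span if it overlaps an entity already in `doc.ents`. A minimal pure-Python sketch of that overlap check on half-open `(start, end)` token-index pairs, mirroring spaCy's Span indices — `can_add` is a hypothetical helper for illustration, not a spaCy function:

```python
def can_add(existing, new):
    """Return True if the token span `new` (start, end) overlaps no span
    in `existing`. Spans are half-open (start, end) pairs; two spans are
    disjoint when one ends at or before the other begins.
    Hypothetical helper for illustration -- not part of the spaCy API."""
    start, end = new
    return all(end <= s or start >= e for s, e in existing)

existing = [(4, 5), (8, 10)]          # e.g. spans already found by the model
print(can_add(existing, (0, 1)))      # disjoint from both -> True
print(can_add(existing, (4, 6)))      # overlaps (4, 5) -> False
```

If the check fails, the new span cannot simply be appended to `doc.ents`; the conflicting entity would have to be removed or the span adjusted first.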
Sometimes we want to assign a specific token a named entity which is not recognized by the trained spaCy model. We can do this as shown in the code below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vgnycHzLwqaw" + }, + "source": [ + "#### Example1" + ] + }, + { + "cell_type": "code", + "source": [ + "doc = nlp(u'Tesla to build a U.K. factory for $6 million')\n", + "\n", + "show_ents(doc)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "XKf3izoNp528", + "outputId": "425c63a6-89f4-4b1e-9165-e404b69e39df" + }, + "execution_count": 8, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "U.K. - 17 - 21 - GPE - Countries, cities, states\n", + "$6 million - 34 - 44 - MONEY - Monetary values, including unit\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "from spacy.tokens import Span" + ], + "metadata": { + "id": "LEk9FWaap50r" + }, + "execution_count": 9, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "\n", + "# Get the hash value of the ORG entity label\n", + "ORG = doc.vocab.strings[u'ORG']\n", + "\n", + "# Create a Span for the new entity\n", + "new_ent = Span(doc, 0, 1, label=ORG)\n", + "\n", + "# Add the entity to the existing Doc object\n", + "doc.ents = list(doc.ents) + [new_ent]" + ], + "metadata": { + "id": "MzabA4Nbp5y6" + }, + "execution_count": 10, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E08p7cGQwqax" + }, + "source": [ + "In the code above, the arguments passed to `Span()` are:\n", + "- `doc` - the name of the Doc object\n", + "- `0` - the *start* index position of the token in the doc\n", + "- `1` - the *stop* index position (exclusive) in the doc\n", + "- `label=ORG` - the label assigned to our entity" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Et5Kbo3Twqax" + }, + "source": [ + "#### Example2" + ] + }, + { + "cell_type": "code", + "source": [ + "doc = nlp(\"fb is hiring a new vice 
president of global policy\")\n", + "ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]\n", + "print('Before', ents)\n", + "#the model didn't recognise \"fb\" as an entity :(\n", + "\n", + "fb_ent = Span(doc, 0, 1, label=\"ORG\") # create a Span for the new entity\n", + "doc.ents = list(doc.ents) + [fb_ent]\n", + "\n", + "ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]\n", + "print('After', ents)\n", + "# [('fb', 0, 2, 'ORG')]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ibW3YBZqp5vu", + "outputId": "353b8236-451a-49cb-ee5f-04f5509c92d7" + }, + "execution_count": 11, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Before []\n", + "After [('fb', 0, 2, 'ORG')]\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P1DZOSoowqay" + }, + "source": [ + "## Visualizing NER" + ] + }, + { + "cell_type": "code", + "source": [ + "# Import the displaCy library\n", + "from spacy import displacy" + ], + "metadata": { + "id": "VELFJc1Op5rz" + }, + "execution_count": 12, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "text = \"When S. Thrun started working on self driving cars at Google in 2007 \\\n", + "few people outside of the company took him serious\"\n", + "doc = nlp(text)\n", + "displacy.render(doc, style=\"ent\", jupyter=True)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 52 + }, + "id": "TsL974BAq-_o", + "outputId": "5c33e3f5-c4bd-4303-ca3e-7478dfb1d468" + }, + "execution_count": 13, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
When \n", + "\n", + " S. Thrun\n", + " PERSON\n", + "\n", + " started working on self driving cars at \n", + "\n", + " Google\n", + " ORG\n", + "\n", + " in \n", + "\n", + " 2007\n", + " DATE\n", + "\n", + " few people outside of the company took him serious
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "text = \"\"\"Clearview AI, a New York-headquartered facial recognition company, has been fined £7.5 million ($9.4 million) by a U.K. privacy regulator.\n", + "\n", + "Over the last few years, the firm has collected images from the web and social media of people in Britain and elsewhere to create a global online database that can be used by law enforcement for facial recognition.\n", + "\n", + "The Information Commission’s Office said Monday that the company has breached U.K. data protection laws.\n", + "\n", + "The ICO has ordered Clearview to delete data it has on U.K. residents and banned it from collecting any more.\n", + "\n", + "Clearview writes on its website that it has collected more than 20 billion facial images of people around the world. It collects publicly posted images from social media platforms like Facebook and Instagram, as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.\n", + "\n", + "Clearview’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in Clearview’s database.\n", + "\n", + "John Edwards, the U.K.’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. 
That is unacceptable.”\n", + "\n", + "He added that people expect their personal information to be respected, regardless of where in the world their data is being used.\"\"\"\n", + "\n", + "doc = nlp(text)\n", + "\n", + "displacy.render(doc, style='ent', jupyter=True)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 647 + }, + "id": "TNUemy-Oq-2I", + "outputId": "0bb45b64-dd73-4a7c-e0db-12781f45fd90" + }, + "execution_count": 14, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " Clearview AI\n", + " ORG\n", + "\n", + ", a \n", + "\n", + " New York\n", + " GPE\n", + "\n", + "-headquartered facial recognition company, has been fined \n", + "\n", + " £7.5 million\n", + " MONEY\n", + "\n", + " (\n", + "\n", + " $9.4 million\n", + " MONEY\n", + "\n", + ") by a \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " privacy regulator.

Over \n", + "\n", + " the last few years\n", + " DATE\n", + "\n", + ", the firm has collected images from the web and social media of people in \n", + "\n", + " Britain\n", + " GPE\n", + "\n", + " and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", + "\n", + " The Information Commission’s Office\n", + " ORG\n", + "\n", + " said \n", + "\n", + " Monday\n", + " DATE\n", + "\n", + " that the company has breached \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " data protection laws.

The \n", + "\n", + " ICO\n", + " ORG\n", + "\n", + " has ordered \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " to delete data it has on \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " residents and banned it from collecting any more.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " writes on its website that it has collected \n", + "\n", + " more than 20 billion\n", + " MONEY\n", + "\n", + " facial images of people around the world. It collects publicly posted images from social media platforms like \n", + "\n", + " Facebook and Instagram\n", + " ORG\n", + "\n", + ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s database.

\n", + "\n", + " John Edwards\n", + " PERSON\n", + "\n", + ", the \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + "’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xZ8Qb0mCwqaz" + }, + "source": [ + "### Visualizing Sentences Line by Line" + ] + }, + { + "cell_type": "code", + "source": [ + "for sent in doc.sents:\n", + " displacy.render(nlp(sent.text), style='ent', jupyter=True)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 773 + }, + "id": "ZDJeCOm4q-yu", + "outputId": "888178e1-14b9-4a5e-9d0a-53d1ed9c332e" + }, + "execution_count": 15, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " Clearview AI\n", + " ORG\n", + "\n", + ", a \n", + "\n", + " New York\n", + " GPE\n", + "\n", + "-headquartered facial recognition company, has been fined \n", + "\n", + " £7.5 million\n", + " MONEY\n", + "\n", + " (\n", + "\n", + " $9.4 million\n", + " MONEY\n", + "\n", + ") by a \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " privacy regulator.

" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
Over \n", + "\n", + " the last few years\n", + " DATE\n", + "\n", + ", the firm has collected images from the web and social media of people in \n", + "\n", + " Britain\n", + " GPE\n", + "\n", + " and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " The Information Commission’s Office\n", + " ORG\n", + "\n", + " said \n", + "\n", + " Monday\n", + " DATE\n", + "\n", + " that the company has breached \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " data protection laws.

" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
The \n", + "\n", + " ICO\n", + " ORG\n", + "\n", + " has ordered \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " to delete data it has on \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " residents and banned it from collecting any more.

" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " writes on its website that it has collected \n", + "\n", + " more than 20 billion\n", + " MONEY\n", + "\n", + " facial images of people around the world.
" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
It collects publicly posted images from social media platforms like \n", + "\n", + " Facebook and Instagram\n", + " ORG\n", + "\n", + ", as well as news media, mugshot websites and other open sources.
" + ] + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/spacy/displacy/__init__.py:213: UserWarning: [W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.\n", + " warnings.warn(Warnings.W006)\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
It does so without informing the individuals or asking for their consent.

" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s database.

" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " John Edwards\n", + " PERSON\n", + "\n", + ", the \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + "’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service.
" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
That is unacceptable.”

" + ] + }, + "metadata": {} + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jUajowN9wqaz" + }, + "source": [ + "## Styling: customize color and effects\n", + "You can also pass background color and gradient options:" + ] + }, + { + "cell_type": "code", + "source": [ + "options = {'ents': ['ORG', 'PRODUCT']}\n", + "\n", + "displacy.render(doc, style='ent', jupyter=True, options=options)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 647 + }, + "id": "_oY-DbDRq-v8", + "outputId": "52a484a7-1d3f-453f-c6c2-547b9f658259" + }, + "execution_count": 16, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " Clearview AI\n", + " ORG\n", + "\n", + ", a New York-headquartered facial recognition company, has been fined £7.5 million ($9.4 million) by a U.K. privacy regulator.

Over the last few years, the firm has collected images from the web and social media of people in Britain and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", + "\n", + " The Information Commission’s Office\n", + " ORG\n", + "\n", + " said Monday that the company has breached U.K. data protection laws.

The \n", + "\n", + " ICO\n", + " ORG\n", + "\n", + " has ordered \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " to delete data it has on U.K. residents and banned it from collecting any more.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " writes on its website that it has collected more than 20 billion facial images of people around the world. It collects publicly posted images from social media platforms like \n", + "\n", + " Facebook and Instagram\n", + " ORG\n", + "\n", + ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s database.

John Edwards, the U.K.’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "colors = {'ORG': 'linear-gradient(90deg, #f2c707, #dc9ce7)', 'PRODUCT': 'radial-gradient(white, green)'}\n", + "\n", + "options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}\n", + "\n", + "displacy.render(doc, style='ent', jupyter=True, options=options)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 647 + }, + "id": "a0ddMNKTq-tB", + "outputId": "69ab6871-a720-4361-8324-50565e3bdaa2" + }, + "execution_count": 17, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " Clearview AI\n", + " ORG\n", + "\n", + ", a New York-headquartered facial recognition company, has been fined £7.5 million ($9.4 million) by a U.K. privacy regulator.

Over the last few years, the firm has collected images from the web and social media of people in Britain and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", + "\n", + " The Information Commission’s Office\n", + " ORG\n", + "\n", + " said Monday that the company has breached U.K. data protection laws.

The \n", + "\n", + " ICO\n", + " ORG\n", + "\n", + " has ordered \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " to delete data it has on U.K. residents and banned it from collecting any more.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " writes on its website that it has collected more than 20 billion facial images of people around the world. It collects publicly posted images from social media platforms like \n", + "\n", + " Facebook and Instagram\n", + " ORG\n", + "\n", + ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s database.

John Edwards, the U.K.’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "colors = {'ORG':'linear-gradient(90deg,#aa9cde,#dc9ce7)','PRODUCT':'radial-gradient(white,red)'}\n", + "options = {'ent':['ORG','PRODUCT'],'colors':colors}\n", + "displacy.render(doc,style='ent',jupyter=True,options=options)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 647 + }, + "id": "s5qYRDK9rUbx", + "outputId": "66d7977e-ab9b-4f6a-e2b0-9bfc055336fa" + }, + "execution_count": 18, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "\n", + " Clearview AI\n", + " ORG\n", + "\n", + ", a \n", + "\n", + " New York\n", + " GPE\n", + "\n", + "-headquartered facial recognition company, has been fined \n", + "\n", + " £7.5 million\n", + " MONEY\n", + "\n", + " (\n", + "\n", + " $9.4 million\n", + " MONEY\n", + "\n", + ") by a \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " privacy regulator.

Over \n", + "\n", + " the last few years\n", + " DATE\n", + "\n", + ", the firm has collected images from the web and social media of people in \n", + "\n", + " Britain\n", + " GPE\n", + "\n", + " and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", + "\n", + " The Information Commission’s Office\n", + " ORG\n", + "\n", + " said \n", + "\n", + " Monday\n", + " DATE\n", + "\n", + " that the company has breached \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " data protection laws.

The \n", + "\n", + " ICO\n", + " ORG\n", + "\n", + " has ordered \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " to delete data it has on \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + " residents and banned it from collecting any more.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + " writes on its website that it has collected \n", + "\n", + " more than 20 billion\n", + " MONEY\n", + "\n", + " facial images of people around the world. It collects publicly posted images from social media platforms like \n", + "\n", + " Facebook and Instagram\n", + " ORG\n", + "\n", + ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", + "\n", + " Clearview\n", + " ORG\n", + "\n", + "’s database.

\n", + "\n", + " John Edwards\n", + " PERSON\n", + "\n", + ", the \n", + "\n", + " U.K.\n", + " GPE\n", + "\n", + "’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DHD99RKNwqax" + }, + "source": [ + "## Stude Assignment\n", + "### Adding Named Entities to All Matching Spans\n", + "What if we want to tag *all* occurrences of a token? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:" + ] + }, + { + "cell_type": "code", + "source": [ + "doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '\n", + " u'If successful, the vacuum cleaner will be our first product.')\n", + "\n", + "show_ents(doc)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FfgvceJUrUYL", + "outputId": "89e64423-e0b5-4287-e661-008ccf7f4cd3" + }, + "execution_count": 19, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "first - 99 - 104 - ORDINAL - \"first\", \"second\", etc.\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Import PhraseMatcher and create a matcher object:\n", + "from spacy.matcher import PhraseMatcher\n", + "matcher = PhraseMatcher(nlp.vocab)" + ], + "metadata": { + "id": "g_A8gRoUrUWE" + }, + "execution_count": 20, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Create the desired phrase patterns:\n", + "phrase_list = ['vacuum cleaner', 'vacuum-cleaner']\n", + "phrase_patterns = [nlp(text) for text in phrase_list]" + ], + "metadata": { + "id": "eHks1S9erUTJ" + }, + "execution_count": 21, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Apply the patterns to our matcher object:\n", + "matcher.add('newproduct', None, *phrase_patterns)\n", + "\n", + "# Apply the matcher to our Doc object:\n", + "matches = matcher(doc)\n", + "\n", + "# See what matches occur:\n", + "matches" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "owwJ-TPIrg2-", + "outputId": "9e7ffceb-ae56-4296-9444-c08afb4623e6" + }, + "execution_count": 22, + "outputs": [ + 
{ + "output_type": "execute_result", + "data": { + "text/plain": [ + "[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]" + ] + }, + "metadata": {}, + "execution_count": 22 + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Here we create Spans from each match, and create named entities from them:\n", + "from spacy.tokens import Span\n", + "\n", + "PROD = doc.vocab.strings[u'PRODUCT']\n", + "\n", + "new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]\n", + "# match[1] contains the start index of the token and match[2] the stop index (exclusive) of the token in the doc.\n", + "\n", + "doc.ents = list(doc.ents) + new_ents" + ], + "metadata": { + "id": "dI9wQ9-Frg0H" + }, + "execution_count": 23, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "show_ents(doc)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UKhRGWdDrgxN", + "outputId": "16014868-1a03-4d77-9bb7-aaa2250dd119" + }, + "execution_count": 24, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "vacuum cleaner - 37 - 51 - PRODUCT - Objects, vehicles, foods, etc. (not services)\n", + "vacuum cleaner - 72 - 86 - PRODUCT - Objects, vehicles, foods, etc. 
(not services)\n", + "first - 99 - 104 - ORDINAL - \"first\", \"second\", etc.\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')\n", + "\n", + "show_ents(doc)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Sj5Atvv6rq-u", + "outputId": "56a28230-c02d-446f-c3a4-a40d894ea714" + }, + "execution_count": 25, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "29.50 - 22 - 27 - MONEY - Monetary values, including unit\n", + "five dollars - 60 - 72 - MONEY - Monetary values, including unit\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "len([ent for ent in doc.ents if ent.label_=='MONEY'])" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rnPC50mHrq7S", + "outputId": "0bd2dede-69ee-4d75-88e3-158f2a502741" + }, + "execution_count": 26, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "2" + ] + }, + "metadata": {}, + "execution_count": 26 + } + ] + } + ] +} \ No newline at end of file diff --git a/Name Entity Recognition/README.MD b/Name Entity Recognition/README.MD new file mode 100644 index 00000000..808b4c3a --- /dev/null +++ b/Name Entity Recognition/README.MD @@ -0,0 +1,61 @@ +# Name Entity Recognition (NER) Project + +## Description +This project demonstrates **Named Entity Recognition (NER)** using **SpaCy**, a powerful Natural Language Processing (NLP) library. NER identifies and classifies key elements in a text, such as names of persons, organizations, locations, dates, and more. By leveraging SpaCy’s pre-trained models, this project provides an easy-to-use interface to analyze text and extract named entities. This capability is crucial for tasks such as document analysis, information retrieval, and chatbot development. 
+ +The goal of this project is to showcase the simplicity of implementing NER with SpaCy and its potential as a foundation for more advanced NLP applications. + +--- +**Application Link**: [NER ChatBot](https://name-entity-recognition-using-nlp-4zkxknz8boadp8shd2tahp.streamlit.app/) +## Installation + +1. **Clone the Repository:** + ```bash + git clone https://github.com/your-repo/NameEntityRecognition.git + cd NameEntityRecognition + ``` + +2. **Install Dependencies:** + Ensure all required dependencies are installed by running: + ```bash + pip install -r requirements.txt + ``` + +3. **Download SpaCy Language Model:** + Download the SpaCy English language model required for NER analysis: + ```bash + python -m spacy download en_core_web_sm + ``` + +--- + +## Usage + +1. **Open the Jupyter Notebook:** + Launch the Jupyter Notebook to run the project: + ```bash + jupyter notebook NameEntityRecoqnition.ipynb + ``` + +2. **Follow the Notebook Cells:** + - Provide your text input for NER analysis. + - Execute the cells to run the NER process. + - View and interpret the extracted named entities. + +--- + +## File Structure + +``` +NameEntityRecognition/ +├── NameEntityRecoqnition.ipynb # Main Jupyter Notebook +├── requirements.txt # List of dependencies +├── README.md # Project documentation +└── LICENSE # License file +``` + +--- + +## Libraries Used + +- **SpaCy:** For performing Named Entity Recognition (NER) and other NLP tasks. 
diff --git a/Name Entity Recognition/requirements.txt b/Name Entity Recognition/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..0348c22e518c2aa6b70ef33d6f5553d21290acfe GIT binary patch literal 2866 zcmb7`-EP}f5QNWlfxJV3UQ?DR*>VdM=v{yU1p?#&4E@@UWm1tS_~+r1%(ufe<)FPN z2uhT8&hG5)oIU*hZBcfmD~obp4&_f>mt_?9xO^zT=>JAn8;GuK%DPl~ekprB9|J!Y zey%(5o$QC+R+pJ9(>Lo{EvLHjR_A0Vm zDNB4`hu4w5)7FaFS`~!rT~uTlNGe&2`^aTztQbn@jp~!qct~LPn?aOp7goY_fhy?%U`M*oczn_m1TF7&)s?XFa>iPSj_ZPa>NE91F-&Gj_X^H$eX&l_E9 z;g|X~Q@&f>V|iHXX(IGoYiD|%$@8VwgJ0_VDzMkG*3!pTxRvk=t=-Cg-S^M+CN}+} z`|e*I-TfHd_@o@@Nh$zEqJWt$bayWjh+;VImur#C8r@T@4yFnK}S@U zD6YGifM}wE*%z;$2G6}#?Q~HU^P_IuuEbvK=o0!^2g0>yu9GKr&W~35J*iIkpp)#U ztOd-;5@bfnxtUR3f?*?1XvsNvilK5xdXE{VTB6yx=P}V&&S`26BUhCDbgP;&gSx)z zuBm5E#>+id1v^g2Q!sSLS>e-dq&hGeySJ*bUtjw?av!)K)Du1N-H6}NH8@VAdS(gY zt0GeWbCygjkg!DW>|Xbd-oVHeJBZa^!)e1hGNo6p!eFb! zaiiY)TuP*Ooa_G{%bNr|^-Z^)rPo4O0#Eu-oZ4 z?^tk4^|;EYjrv;u&Mq^gKg4B~vvDuc0JcN;X1al(bDSj3xRZ+3T8+hT7p}V3`fbne zYxWFhE70$P?)i=QKPwhlT3O@a2mGNHt{o?avI0+{WHsFyjC@snzW!iN5x>+}b?)T9Og literal 0 HcmV?d00001 From 1e14882e7a10d29075f40e8f5ef84c8ebff276db Mon Sep 17 00:00:00 2001 From: Simran Shaikh Date: Mon, 6 Jan 2025 10:10:59 +0530 Subject: [PATCH 2/3] final commit #101 --- .../NameEntityRecoqnition.ipynb | 1430 ----------------- Name Entity Recognition/README.MD | 61 - Name Entity Recognition/requirements.txt | Bin 2866 -> 0 bytes docs/NLP/projects/name_entity_recognition.md | 88 + 4 files changed, 88 insertions(+), 1491 deletions(-) delete mode 100644 Name Entity Recognition/NameEntityRecoqnition.ipynb delete mode 100644 Name Entity Recognition/README.MD delete mode 100644 Name Entity Recognition/requirements.txt create mode 100644 docs/NLP/projects/name_entity_recognition.md diff --git a/Name Entity Recognition/NameEntityRecoqnition.ipynb b/Name Entity Recognition/NameEntityRecoqnition.ipynb deleted file mode 100644 index 03b6efb3..00000000 --- a/Name Entity 
Recognition/NameEntityRecoqnition.ipynb +++ /dev/null @@ -1,1430 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - }, - "language_info": { - "name": "python" - } - }, - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Named Entity Recognition (NER)" - ], - "metadata": { - "id": "jK4Lnj_do6af" - } - }, - { - "cell_type": "markdown", - "source": [ - "Named Entity Recognition is the most important or I would say the starting step in Information Retrieval. Information Retrieval is the technique to extract important and useful information from unstructured raw text documents. Named Entity Recognition NER works by locating and identifying the named entities present in unstructured text into the standard categories such as person names, locations, organizations, time expressions, quantities, monetary values, percentage, codes etc. Spacy comes with an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens.\n", - "\n", - "Spacy provides option to add arbitrary classes to entity recognition system and update the model to even include the new examples apart from already defined entities within model.\n", - "\n", - "Spacy has the ‘ner’ pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the ‘ents’ property of a Doc object." 
- ], - "metadata": { - "id": "7l0bmqAOo934" - } - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "LVeJisRWo2Qs" - }, - "outputs": [], - "source": [ - "# !pip install spacy" - ] - }, - { - "cell_type": "code", - "source": [ - "# Perform standard imports\n", - "import spacy\n", - "nlp = spacy.load('en_core_web_sm')" - ], - "metadata": { - "id": "l54qNFBrpEou" - }, - "execution_count": 2, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "#Write a function to display basic entity info:\n", - "def show_ents(doc):\n", - " if doc.ents:\n", - " for ent in doc.ents:\n", - " print(ent.text+' - ' +str(ent.start_char) +' - '+ str(ent.end_char) +\n", - " ' - '+ent.label_+ ' - '+str(spacy.explain(ent.label_)))\n", - " else:\n", - " print('No named entities found.')" - ], - "metadata": { - "id": "y9GDIQ0ppElz" - }, - "execution_count": 3, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "doc1 = nlp(\"Apple is looking at buying U.K. startup for $1 billion\")\n", - "\n", - "show_ents(doc1)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "aEarPqg-pEjF", - "outputId": "6804381c-b184-48a8-dddc-330335d3eeab" - }, - "execution_count": 4, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Apple - 0 - 5 - ORG - Companies, agencies, institutions, etc.\n", - "U.K. - 27 - 31 - GPE - Countries, cities, states\n", - "$1 billion - 44 - 54 - MONEY - Monetary values, including unit\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "Here we see tokens combine to form the entities $1 billion." - ], - "metadata": { - "id": "wLQRFhRKpfhD" - } - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K6lLpkXKwqau" - }, - "source": [ - "
TextStartEndLabelDescription
Apple05ORGCompanies, agencies, institutions.
U.K.2731GPEGeopolitical entity, i.e. countries, cities, states.
$1 billion4454MONEYMonetary values, including unit.
" - ] - }, - { - "cell_type": "code", - "source": [ - "doc2 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')\n", - "\n", - "show_ents(doc2)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "pRUi0KVYpEgk", - "outputId": "bae69cbe-1d37-4848-a041-9d122e6dea6b" - }, - "execution_count": 5, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Washington, DC - 12 - 26 - GPE - Countries, cities, states\n", - "next May - 27 - 35 - DATE - Absolute or relative dates or periods\n", - "the Washington Monument - 43 - 66 - ORG - Companies, agencies, institutions, etc.\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "Here we see tokens combine to form the entities next May and the Washington Monument" - ], - "metadata": { - "id": "PaagbjIapodP" - } - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uCFuwuAbwqav" - }, - "source": [ - "## Entity Annotations\n", - "`Doc.ents` are token spans with their own set of annotations.\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
`ent.text`The original entity text
`ent.label`The entity type's hash value
`ent.label_`The entity type's string description
`ent.start`The token span's *start* index position in the Doc
`ent.end`The token span's *stop* index position in the Doc
`ent.start_char`The entity text's *start* index position in the Doc
`ent.end_char`The entity text's *stop* index position in the Doc
\n", - "\n" - ] - }, - { - "cell_type": "code", - "source": [ - "doc3 = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')\n", - "\n", - "for ent in doc3.ents:\n", - " print(ent.text, ent.label_)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "SbNl84OhpEer", - "outputId": "0a0e1a5c-a788-4bd8-c97a-0f3640d7a4d0" - }, - "execution_count": 6, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "500 dollars MONEY\n", - "Microsoft ORG\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NIehK1Bzwqav" - }, - "source": [ - "### Accessing Entity Annotations\n", - "\n", - "The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value using **ent.label** or as a string using **ent.label_**.\n", - "\n", - "The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.\n", - "\n", - "You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, it will return an empty string." 
- ] - }, - { - "cell_type": "code", - "source": [ - "doc = nlp(\"San Francisco considers banning sidewalk delivery robots\")\n", - "\n", - "# document level\n", - "for e in doc.ents:\n", - " print(e.text, e.start_char, e.end_char, e.label_)\n", - "# OR\n", - "ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] #in a list comprehension form\n", - "print(ents)\n", - "\n", - "# token level\n", - "# doc[0], doc[1] ...will have tokens stored.\n", - "\n", - "ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]\n", - "ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]\n", - "print(ent_san)\n", - "print(ent_francisco)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "da6biwhApEbK", - "outputId": "487a98f1-7815-479e-a025-4d4f557e7987" - }, - "execution_count": 7, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "San Francisco 0 13 GPE\n", - "[('San Francisco', 0, 13, 'GPE')]\n", - "['San', 'B', 'GPE']\n", - "['Francisco', 'I', 'GPE']\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xf06daxGwqaw" - }, - "source": [ - "IOB SCHEME\n", - "\n", - "I – Token is inside an entity.\n", - "\n", - "O – Token is outside an entity.\n", - "\n", - "B – Token is the beginning of an entity." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "beMknmydwqaw" - }, - "source": [ - "
Textent_iobent_iob_ent_type_Description
San3B\"GPE\"beginning of an entity
Francisco1I\"GPE\"inside an entity
considers2O\"\"outside an entity
banning2O\"\"outside an entity
sidewalk2O\"\"outside an entity
delivery2O\"\"outside an entity
robots2O\"\"outside an entity
" - ] - }, - { - "cell_type": "markdown", - "source": [ - "**Note:** In the above example only `San Francisco` is recognized as named entity. hence rest of the tokens are described as outside the entity. And in `San Francisco` `San` is the starting of the entity and `Francisco` is inside the entity." - ], - "metadata": { - "id": "f7yi6WBkqOOi" - } - }, - { - "cell_type": "markdown", - "source": [ - "GPE==> Geopolitical Entity" - ], - "metadata": { - "id": "cNWBf5x5MD7J" - } - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EG-M-TwYwqaw" - }, - "source": [ - "## NER Tags\n", - "Tags are accessible through the `.label_` property of an entity.\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
TYPEDESCRIPTIONEXAMPLE
`PERSON`People, including fictional.*Fred Flintstone*
`NORP`Nationalities or religious or political groups.*The Republican Party*
`FAC`Buildings, airports, highways, bridges, etc.*Logan International Airport, The Golden Gate*
`ORG`Companies, agencies, institutions, etc.*Microsoft, FBI, MIT*
`GPE`Countries, cities, states.*France, UAR, Chicago, Idaho*
`LOC`Non-GPE locations, mountain ranges, bodies of water.*Europe, Nile River, Midwest*
`PRODUCT`Objects, vehicles, foods, etc. (Not services.)*Formula 1*
`EVENT`Named hurricanes, battles, wars, sports events, etc.*Olympic Games*
`WORK_OF_ART`Titles of books, songs, etc.*The Mona Lisa*
`LAW`Named documents made into laws.*Roe v. Wade*
`LANGUAGE`Any named language.*English*
`DATE`Absolute or relative dates or periods.*20 July 1969*
`TIME`Times smaller than a day.*Four hours*
`PERCENT`Percentage, including \"%\".*Eighty percent*
`MONEY`Monetary values, including unit.*Twenty Cents*
`QUANTITY`Measurements, as of weight or distance.*Several kilometers, 55kg*
`ORDINAL`\"first\", \"second\", etc.*9th, Ninth*
`CARDINAL`Numerals that do not fall under another type.*2, Two, Fifty-two*
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cjb_TkZmwqaw" - }, - "source": [ - "___\n", - "## User Defined Named Entity and Adding it to a Span\n", - "Normally we would have spaCy build a library of named entities by training it on several samples of text.
Sometimes, we want to assign specific token a named entity whic is not recognized by the trained spacy model. We can do this as shown in below code." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vgnycHzLwqaw" - }, - "source": [ - "#### Example1" - ] - }, - { - "cell_type": "code", - "source": [ - "doc = nlp(u'Tesla to build a U.K. factory for $6 million')\n", - "\n", - "show_ents(doc)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "XKf3izoNp528", - "outputId": "425c63a6-89f4-4b1e-9165-e404b69e39df" - }, - "execution_count": 8, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "U.K. - 17 - 21 - GPE - Countries, cities, states\n", - "$6 million - 34 - 44 - MONEY - Monetary values, including unit\n" - ] - } - ] - }, - { - "cell_type": "code", - "source": [ - "from spacy.tokens import Span" - ], - "metadata": { - "id": "LEk9FWaap50r" - }, - "execution_count": 9, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "\n", - "# Get the hash value of the ORG entity label\n", - "ORG = doc.vocab.strings[u'ORG']\n", - "\n", - "# Create a Span for the new entity\n", - "new_ent = Span(doc, 0, 1, label=ORG)\n", - "\n", - "# Add the entity to the existing Doc object\n", - "doc.ents = list(doc.ents) + [new_ent]" - ], - "metadata": { - "id": "MzabA4Nbp5y6" - }, - "execution_count": 10, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "E08p7cGQwqax" - }, - "source": [ - "In the code above, the arguments passed to `Span()` are:\n", - "- `doc` - the name of the Doc object\n", - "- `0` - the *start* index position of the token in the doc\n", - "- `1` - the *stop* index position (exclusive) in the doc\n", - "- `label=ORG` - the label assigned to our entity" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Et5Kbo3Twqax" - }, - "source": [ - "#### Example2" - ] - }, - { - "cell_type": "code", - "source": [ - "doc = nlp(\"fb is hiring a new vice 
president of global policy\")\n", - "ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]\n", - "print('Before', ents)\n", - "#the model didn't recognise \"fb\" as an entity :(\n", - "\n", - "fb_ent = Span(doc, 0, 1, label=\"ORG\") # create a Span for the new entity\n", - "doc.ents = list(doc.ents) + [fb_ent]\n", - "\n", - "ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]\n", - "print('After', ents)\n", - "# [('fb', 0, 2, 'ORG')]" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "ibW3YBZqp5vu", - "outputId": "353b8236-451a-49cb-ee5f-04f5509c92d7" - }, - "execution_count": 11, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Before []\n", - "After [('fb', 0, 2, 'ORG')]\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "P1DZOSoowqay" - }, - "source": [ - "## Visualizing NER" - ] - }, - { - "cell_type": "code", - "source": [ - "# Import the displaCy library\n", - "from spacy import displacy" - ], - "metadata": { - "id": "VELFJc1Op5rz" - }, - "execution_count": 12, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "text = \"When S. Thrun started working on self driving cars at Google in 2007 \\\n", - "few people outside of the company took him serious\"\n", - "doc = nlp(text)\n", - "displacy.render(doc, style=\"ent\", jupyter=True)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 52 - }, - "id": "TsL974BAq-_o", - "outputId": "5c33e3f5-c4bd-4303-ca3e-7478dfb1d468" - }, - "execution_count": 13, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
When \n", - "\n", - " S. Thrun\n", - " PERSON\n", - "\n", - " started working on self driving cars at \n", - "\n", - " Google\n", - " ORG\n", - "\n", - " in \n", - "\n", - " 2007\n", - " DATE\n", - "\n", - " few people outside of the company took him serious
" - ] - }, - "metadata": {} - } - ] - }, - { - "cell_type": "code", - "source": [ - "text = \"\"\"Clearview AI, a New York-headquartered facial recognition company, has been fined £7.5 million ($9.4 million) by a U.K. privacy regulator.\n", - "\n", - "Over the last few years, the firm has collected images from the web and social media of people in Britain and elsewhere to create a global online database that can be used by law enforcement for facial recognition.\n", - "\n", - "The Information Commission’s Office said Monday that the company has breached U.K. data protection laws.\n", - "\n", - "The ICO has ordered Clearview to delete data it has on U.K. residents and banned it from collecting any more.\n", - "\n", - "Clearview writes on its website that it has collected more than 20 billion facial images of people around the world. It collects publicly posted images from social media platforms like Facebook and Instagram, as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.\n", - "\n", - "Clearview’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in Clearview’s database.\n", - "\n", - "John Edwards, the U.K.’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. 
That is unacceptable.”\n", - "\n", - "He added that people expect their personal information to be respected, regardless of where in the world their data is being used.\"\"\"\n", - "\n", - "doc = nlp(text)\n", - "\n", - "displacy.render(doc, style='ent', jupyter=True)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 647 - }, - "id": "TNUemy-Oq-2I", - "outputId": "0bb45b64-dd73-4a7c-e0db-12781f45fd90" - }, - "execution_count": 14, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " Clearview AI\n", - " ORG\n", - "\n", - ", a \n", - "\n", - " New York\n", - " GPE\n", - "\n", - "-headquartered facial recognition company, has been fined \n", - "\n", - " £7.5 million\n", - " MONEY\n", - "\n", - " (\n", - "\n", - " $9.4 million\n", - " MONEY\n", - "\n", - ") by a \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " privacy regulator.

Over \n", - "\n", - " the last few years\n", - " DATE\n", - "\n", - ", the firm has collected images from the web and social media of people in \n", - "\n", - " Britain\n", - " GPE\n", - "\n", - " and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", - "\n", - " The Information Commission’s Office\n", - " ORG\n", - "\n", - " said \n", - "\n", - " Monday\n", - " DATE\n", - "\n", - " that the company has breached \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " data protection laws.

The \n", - "\n", - " ICO\n", - " ORG\n", - "\n", - " has ordered \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " to delete data it has on \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " residents and banned it from collecting any more.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " writes on its website that it has collected \n", - "\n", - " more than 20 billion\n", - " MONEY\n", - "\n", - " facial images of people around the world. It collects publicly posted images from social media platforms like \n", - "\n", - " Facebook and Instagram\n", - " ORG\n", - "\n", - ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s database.

\n", - "\n", - " John Edwards\n", - " PERSON\n", - "\n", - ", the \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - "’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" - ] - }, - "metadata": {} - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xZ8Qb0mCwqaz" - }, - "source": [ - "### Visualizing Sentences Line by Line" - ] - }, - { - "cell_type": "code", - "source": [ - "for sent in doc.sents:\n", - " displacy.render(nlp(sent.text), style='ent', jupyter=True)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 773 - }, - "id": "ZDJeCOm4q-yu", - "outputId": "888178e1-14b9-4a5e-9d0a-53d1ed9c332e" - }, - "execution_count": 15, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " Clearview AI\n", - " ORG\n", - "\n", - ", a \n", - "\n", - " New York\n", - " GPE\n", - "\n", - "-headquartered facial recognition company, has been fined \n", - "\n", - " £7.5 million\n", - " MONEY\n", - "\n", - " (\n", - "\n", - " $9.4 million\n", - " MONEY\n", - "\n", - ") by a \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " privacy regulator.

" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
Over \n", - "\n", - " the last few years\n", - " DATE\n", - "\n", - ", the firm has collected images from the web and social media of people in \n", - "\n", - " Britain\n", - " GPE\n", - "\n", - " and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " The Information Commission’s Office\n", - " ORG\n", - "\n", - " said \n", - "\n", - " Monday\n", - " DATE\n", - "\n", - " that the company has breached \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " data protection laws.

" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
The \n", - "\n", - " ICO\n", - " ORG\n", - "\n", - " has ordered \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " to delete data it has on \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " residents and banned it from collecting any more.

" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " writes on its website that it has collected \n", - "\n", - " more than 20 billion\n", - " MONEY\n", - "\n", - " facial images of people around the world.
" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
It collects publicly posted images from social media platforms like \n", - "\n", - " Facebook and Instagram\n", - " ORG\n", - "\n", - ", as well as news media, mugshot websites and other open sources.
" - ] - }, - "metadata": {} - }, - { - "output_type": "stream", - "name": "stderr", - "text": [ - "/usr/local/lib/python3.10/dist-packages/spacy/displacy/__init__.py:213: UserWarning: [W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.\n", - " warnings.warn(Warnings.W006)\n" - ] - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
It does so without informing the individuals or asking for their consent.

" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s database.

" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " John Edwards\n", - " PERSON\n", - "\n", - ", the \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - "’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service.
" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
That is unacceptable.”

" - ] - }, - "metadata": {} - }, - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" - ] - }, - "metadata": {} - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jUajowN9wqaz" - }, - "source": [ - "## Styling: customize color and effects\n", - "You can also pass background color and gradient options:" - ] - }, - { - "cell_type": "code", - "source": [ - "options = {'ents': ['ORG', 'PRODUCT']}\n", - "\n", - "displacy.render(doc, style='ent', jupyter=True, options=options)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 647 - }, - "id": "_oY-DbDRq-v8", - "outputId": "52a484a7-1d3f-453f-c6c2-547b9f658259" - }, - "execution_count": 16, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " Clearview AI\n", - " ORG\n", - "\n", - ", a New York-headquartered facial recognition company, has been fined £7.5 million ($9.4 million) by a U.K. privacy regulator.

Over the last few years, the firm has collected images from the web and social media of people in Britain and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", - "\n", - " The Information Commission’s Office\n", - " ORG\n", - "\n", - " said Monday that the company has breached U.K. data protection laws.

The \n", - "\n", - " ICO\n", - " ORG\n", - "\n", - " has ordered \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " to delete data it has on U.K. residents and banned it from collecting any more.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " writes on its website that it has collected more than 20 billion facial images of people around the world. It collects publicly posted images from social media platforms like \n", - "\n", - " Facebook and Instagram\n", - " ORG\n", - "\n", - ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s database.

John Edwards, the U.K.’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" - ] - }, - "metadata": {} - } - ] - }, - { - "cell_type": "code", - "source": [ - "colors = {'ORG': 'linear-gradient(90deg, #f2c707, #dc9ce7)', 'PRODUCT': 'radial-gradient(white, green)'}\n", - "\n", - "options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}\n", - "\n", - "displacy.render(doc, style='ent', jupyter=True, options=options)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 647 - }, - "id": "a0ddMNKTq-tB", - "outputId": "69ab6871-a720-4361-8324-50565e3bdaa2" - }, - "execution_count": 17, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " Clearview AI\n", - " ORG\n", - "\n", - ", a New York-headquartered facial recognition company, has been fined £7.5 million ($9.4 million) by a U.K. privacy regulator.

Over the last few years, the firm has collected images from the web and social media of people in Britain and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", - "\n", - " The Information Commission’s Office\n", - " ORG\n", - "\n", - " said Monday that the company has breached U.K. data protection laws.

The \n", - "\n", - " ICO\n", - " ORG\n", - "\n", - " has ordered \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " to delete data it has on U.K. residents and banned it from collecting any more.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " writes on its website that it has collected more than 20 billion facial images of people around the world. It collects publicly posted images from social media platforms like \n", - "\n", - " Facebook and Instagram\n", - " ORG\n", - "\n", - ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s database.

John Edwards, the U.K.’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" - ] - }, - "metadata": {} - } - ] - }, - { - "cell_type": "code", - "source": [ - "colors = {'ORG':'linear-gradient(90deg,#aa9cde,#dc9ce7)','PRODUCT':'radial-gradient(white,red)'}\n", - "options = {'ent':['ORG','PRODUCT'],'colors':colors}\n", - "displacy.render(doc,style='ent',jupyter=True,options=options)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 647 - }, - "id": "s5qYRDK9rUbx", - "outputId": "66d7977e-ab9b-4f6a-e2b0-9bfc055336fa" - }, - "execution_count": 18, - "outputs": [ - { - "output_type": "display_data", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
\n", - "\n", - " Clearview AI\n", - " ORG\n", - "\n", - ", a \n", - "\n", - " New York\n", - " GPE\n", - "\n", - "-headquartered facial recognition company, has been fined \n", - "\n", - " £7.5 million\n", - " MONEY\n", - "\n", - " (\n", - "\n", - " $9.4 million\n", - " MONEY\n", - "\n", - ") by a \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " privacy regulator.

Over \n", - "\n", - " the last few years\n", - " DATE\n", - "\n", - ", the firm has collected images from the web and social media of people in \n", - "\n", - " Britain\n", - " GPE\n", - "\n", - " and elsewhere to create a global online database that can be used by law enforcement for facial recognition.

\n", - "\n", - " The Information Commission’s Office\n", - " ORG\n", - "\n", - " said \n", - "\n", - " Monday\n", - " DATE\n", - "\n", - " that the company has breached \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " data protection laws.

The \n", - "\n", - " ICO\n", - " ORG\n", - "\n", - " has ordered \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " to delete data it has on \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - " residents and banned it from collecting any more.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - " writes on its website that it has collected \n", - "\n", - " more than 20 billion\n", - " MONEY\n", - "\n", - " facial images of people around the world. It collects publicly posted images from social media platforms like \n", - "\n", - " Facebook and Instagram\n", - " ORG\n", - "\n", - ", as well as news media, mugshot websites and other open sources. It does so without informing the individuals or asking for their consent.

\n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored in \n", - "\n", - " Clearview\n", - " ORG\n", - "\n", - "’s database.

\n", - "\n", - " John Edwards\n", - " PERSON\n", - "\n", - ", the \n", - "\n", - " U.K.\n", - " GPE\n", - "\n", - "’s information commissioner, said in a statement: “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”

He added that people expect their personal information to be respected, regardless of where in the world their data is being used.
" - ] - }, - "metadata": {} - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DHD99RKNwqax" - }, - "source": [ - "## Stude Assignment\n", - "### Adding Named Entities to All Matching Spans\n", - "What if we want to tag *all* occurrences of a token? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:" - ] - }, - { - "cell_type": "code", - "source": [ - "doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '\n", - " u'If successful, the vacuum cleaner will be our first product.')\n", - "\n", - "show_ents(doc)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "FfgvceJUrUYL", - "outputId": "89e64423-e0b5-4287-e661-008ccf7f4cd3" - }, - "execution_count": 19, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "first - 99 - 104 - ORDINAL - \"first\", \"second\", etc.\n" - ] - } - ] - }, - { - "cell_type": "code", - "source": [ - "# Import PhraseMatcher and create a matcher object:\n", - "from spacy.matcher import PhraseMatcher\n", - "matcher = PhraseMatcher(nlp.vocab)" - ], - "metadata": { - "id": "g_A8gRoUrUWE" - }, - "execution_count": 20, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Create the desired phrase patterns:\n", - "phrase_list = ['vacuum cleaner', 'vacuum-cleaner']\n", - "phrase_patterns = [nlp(text) for text in phrase_list]" - ], - "metadata": { - "id": "eHks1S9erUTJ" - }, - "execution_count": 21, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Apply the patterns to our matcher object:\n", - "matcher.add('newproduct', None, *phrase_patterns)\n", - "\n", - "# Apply the matcher to our Doc object:\n", - "matches = matcher(doc)\n", - "\n", - "# See what matches occur:\n", - "matches" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "owwJ-TPIrg2-", - "outputId": "9e7ffceb-ae56-4296-9444-c08afb4623e6" - }, - "execution_count": 22, - "outputs": [ - 
{ - "output_type": "execute_result", - "data": { - "text/plain": [ - "[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]" - ] - }, - "metadata": {}, - "execution_count": 22 - } - ] - }, - { - "cell_type": "code", - "source": [ - "# Here we create Spans from each match, and create named entities from them:\n", - "from spacy.tokens import Span\n", - "\n", - "PROD = doc.vocab.strings[u'PRODUCT']\n", - "\n", - "new_ents = [Span(doc, match[1],match[2],label=PROD) for match in matches]\n", - "# match[1] contains the start index of the the token and match[2] the stop index (exclusive) of the token in the doc.\n", - "\n", - "doc.ents = list(doc.ents) + new_ents" - ], - "metadata": { - "id": "dI9wQ9-Frg0H" - }, - "execution_count": 23, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "show_ents(doc)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "UKhRGWdDrgxN", - "outputId": "16014868-1a03-4d77-9bb7-aaa2250dd119" - }, - "execution_count": 24, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "vacuum cleaner - 37 - 51 - PRODUCT - Objects, vehicles, foods, etc. (not services)\n", - "vacuum cleaner - 72 - 86 - PRODUCT - Objects, vehicles, foods, etc. 
(not services)\n", - "first - 99 - 104 - ORDINAL - \"first\", \"second\", etc.\n" - ] - } - ] - }, - { - "cell_type": "code", - "source": [ - "doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')\n", - "\n", - "show_ents(doc)" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Sj5Atvv6rq-u", - "outputId": "56a28230-c02d-446f-c3a4-a40d894ea714" - }, - "execution_count": 25, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "29.50 - 22 - 27 - MONEY - Monetary values, including unit\n", - "five dollars - 60 - 72 - MONEY - Monetary values, including unit\n" - ] - } - ] - }, - { - "cell_type": "code", - "source": [ - "len([ent for ent in doc.ents if ent.label_=='MONEY'])" - ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "rnPC50mHrq7S", - "outputId": "0bd2dede-69ee-4d75-88e3-158f2a502741" - }, - "execution_count": 26, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "2" - ] - }, - "metadata": {}, - "execution_count": 26 - } - ] - } - ] -} \ No newline at end of file diff --git a/Name Entity Recognition/README.MD b/Name Entity Recognition/README.MD deleted file mode 100644 index 808b4c3a..00000000 --- a/Name Entity Recognition/README.MD +++ /dev/null @@ -1,61 +0,0 @@ -# Name Entity Recognition (NER) Project - -## Description -This project demonstrates **Named Entity Recognition (NER)** using **SpaCy**, a powerful Natural Language Processing (NLP) library. NER identifies and classifies key elements in a text, such as names of persons, organizations, locations, dates, and more. By leveraging SpaCy’s pre-trained models, this project provides an easy-to-use interface to analyze text and extract named entities. This capability is crucial for tasks such as document analysis, information retrieval, and chatbot development. 
- -The goal of this project is to showcase the simplicity of implementing NER with SpaCy and its potential as a foundation for more advanced NLP applications. - ---- -**Application Link**: [NER ChatBot](https://name-entity-recognition-using-nlp-4zkxknz8boadp8shd2tahp.streamlit.app/) -## Installation - -1. **Clone the Repository:** - ```bash - git clone https://github.com/your-repo/NameEntityRecognition.git - cd NameEntityRecognition - ``` - -2. **Install Dependencies:** - Ensure all required dependencies are installed by running: - ```bash - pip install -r requirements.txt - ``` - -3. **Download SpaCy Language Model:** - Download the SpaCy English language model required for NER analysis: - ```bash - python -m spacy download en_core_web_sm - ``` - ---- - -## Usage - -1. **Open the Jupyter Notebook:** - Launch the Jupyter Notebook to run the project: - ```bash - jupyter notebook NameEntityRecognition.ipynb - ``` - -2. **Follow the Notebook Cells:** - - Provide your text input for NER analysis. - - Execute the cells to run the NER process. - - View and interpret the extracted named entities. - ---- - -## File Structure - -``` -NameEntityRecognition/ -├── NameEntityRecognition.ipynb # Main Jupyter Notebook -├── requirements.txt # List of dependencies -├── README.md # Project documentation -└── LICENSE # License file -``` - ---- - -## Libraries Used - -- **SpaCy:** For performing Named Entity Recognition (NER) and other NLP tasks. 
diff --git a/Name Entity Recognition/requirements.txt b/Name Entity Recognition/requirements.txt deleted file mode 100644 index 0348c22e518c2aa6b70ef33d6f5553d21290acfe..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 2866 zcmb7`-EP}f5QNWlfxJV3UQ?DR*>VdM=v{yU1p?#&4E@@UWm1tS_~+r1%(ufe<)FPN z2uhT8&hG5)oIU*hZBcfmD~obp4&_f>mt_?9xO^zT=>JAn8;GuK%DPl~ekprB9|J!Y zey%(5o$QC+R+pJ9(>Lo{EvLHjR_A0Vm zDNB4`hu4w5)7FaFS`~!rT~uTlNGe&2`^aTztQbn@jp~!qct~LPn?aOp7goY_fhy?%U`M*oczn_m1TF7&)s?XFa>iPSj_ZPa>NE91F-&Gj_X^H$eX&l_E9 z;g|X~Q@&f>V|iHXX(IGoYiD|%$@8VwgJ0_VDzMkG*3!pTxRvk=t=-Cg-S^M+CN}+} z`|e*I-TfHd_@o@@Nh$zEqJWt$bayWjh+;VImur#C8r@T@4yFnK}S@U zD6YGifM}wE*%z;$2G6}#?Q~HU^P_IuuEbvK=o0!^2g0>yu9GKr&W~35J*iIkpp)#U ztOd-;5@bfnxtUR3f?*?1XvsNvilK5xdXE{VTB6yx=P}V&&S`26BUhCDbgP;&gSx)z zuBm5E#>+id1v^g2Q!sSLS>e-dq&hGeySJ*bUtjw?av!)K)Du1N-H6}NH8@VAdS(gY zt0GeWbCygjkg!DW>|Xbd-oVHeJBZa^!)e1hGNo6p!eFb! zaiiY)TuP*Ooa_G{%bNr|^-Z^)rPo4O0#Eu-oZ4 z?^tk4^|;EYjrv;u&Mq^gKg4B~vvDuc0JcN;X1al(bDSj3xRZ+3T8+hT7p}V3`fbne zYxWFhE70$P?)i=QKPwhlT3O@a2mGNHt{o?avI0+{WHsFyjC@snzW!iN5x>+}b?)T9Og diff --git a/docs/NLP/projects/name_entity_recognition.md b/docs/NLP/projects/name_entity_recognition.md new file mode 100644 index 00000000..9e053a0c --- /dev/null +++ b/docs/NLP/projects/name_entity_recognition.md @@ -0,0 +1,88 @@ + +# Name Entity Recognition (NER) Project + +## AIM +To develop a system that identifies and classifies named entities (such as persons, organizations, locations, dates, etc.) in text using Named Entity Recognition (NER) with SpaCy. + +## DATASET LINK +N/A (This project uses text input for NER analysis, not a specific dataset) +- It uses real time data as input . + +## NOTEBOOK LINK +[Note book link ](https://colab.research.google.com/drive/1pBIEFA4a9LzyZKUFQMCypQ22M6bDbXM3?usp=sharing) + +## LIBRARIES NEEDED +- SpaCy + + +## DESCRIPTION + +!!! info "What is the requirement of the project?" 
+    - Named Entity Recognition (NER) is essential to automatically extract and classify key entities from text, such as persons, organizations, locations, and more. +    - This helps in analyzing and organizing data efficiently, enabling NLP applications such as document analysis and information retrieval. + +??? info "Why is it necessary?" +    - NER is used for understanding and structuring unstructured text, which is widely applied in industries such as healthcare, finance, and e-commerce. +    - It allows users to extract actionable insights from large volumes of text data. + +??? info "How is it beneficial and used?" +    - NER plays a key role in tasks such as document summarization and information retrieval. +    - It automates the extraction of relevant entities, which reduces manual effort and improves efficiency. + +??? info "How did you start approaching this project? (Initial thoughts and planning)" +    - The project leverages SpaCy's pre-trained NER models, enabling easy text analysis without the need to train custom models. + +### Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.) +- SpaCy Documentation: [SpaCy NER](https://spacy.io/usage/linguistic-features#named-entities) +- *Natural Language Processing with Python* by Steven Bird, Ewan Klein, and Edward Loper. + +## EXPLANATION + +### DETAILS OF THE DIFFERENT ENTITY TYPES + +The system extracts the following entity types: + +| Entity Type | Description | +|-------------|-------------| +| PERSON | Names of people (e.g., "Anuska") | +| ORG | Organizations (e.g., "Google", "Tesla") | +| LOC | Locations (e.g., "New York", "Mount Everest") | +| DATE | Dates (e.g., "January 1st, 2025") | +| GPE | Geopolitical entities (e.g., "India", "California") | + +## WHAT I HAVE DONE + +### Step 1: Data collection and preparation +- Gathered sample text for analysis (provided by users in the app). +- Explored the text structure and identified entity types.
+ +### Step 2: NER model implementation +- Integrated SpaCy's pre-trained NER model (`en_core_web_sm`). +- Extracted named entities and visualized them with labels and color coding. + +### Step 3: Testing and validation +- Validated results with multiple test cases to ensure entity accuracy. +- Allowed users to input custom text for NER analysis in real time. + +## PROJECT TRADE-OFFS AND SOLUTIONS + +### Trade-off 1: Pre-trained model vs. custom model +- **Pre-trained models** provide quick results but may lack accuracy for domain-specific entities. +- **Custom models** can improve accuracy but require additional data and training time. + +### Trade-off 2: Real-time analysis vs. batch processing +- **Real-time analysis** in a web app enhances user interaction but might slow down with large text inputs. +- **Batch processing** could be more efficient for larger datasets. + +## SCREENSHOTS + +### NER Example +```mermaid +graph LR + A[Start] --> B[Text Input]; + B --> C[NER Analysis]; + C --> D{Entities Extracted}; + D -->|Person| E[Anuska]; + D -->|Location| F[New York]; + D -->|Organization| G[Google]; + D -->|Date| H[January 1st, 2025]; +``` From c5a9e5979796fdf85e1ca3db5d35bb291b08db27 Mon Sep 17 00:00:00 2001 From: Simran Shaikh Date: Tue, 7 Jan 2025 11:53:41 +0530 Subject: [PATCH 3/3] final commit #101 --- docs/NLP/projects/name_entity_recognition.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/NLP/projects/name_entity_recognition.md b/docs/NLP/projects/name_entity_recognition.md index 9e053a0c..a9c24fa2 100644 --- a/docs/NLP/projects/name_entity_recognition.md +++ b/docs/NLP/projects/name_entity_recognition.md @@ -9,7 +9,8 @@ N/A (This project uses text input for NER analysis, not a specific dataset) - It uses real time data as input .
## NOTEBOOK LINK -[Note book link ](https://colab.research.google.com/drive/1pBIEFA4a9LzyZKUFQMCypQ22M6bDbXM3?usp=sharing) +[Notebook link](https://colab.research.google.com/drive/1pBIEFA4a9LzyZKUFQMCypQ22M6bDbXM3?usp=sharing) + ## LIBRARIES NEEDED - SpaCy