How to Use Named Entity Recognition in spaCy to Analyze Blog Content

Josiah Adesola
6 min read · Mar 17, 2023


Named entity recognition is vital for deriving insights from unstructured data and making the right decisions in no time.

photo from Unsplash

Millions of posts in the form of text, audio, images, videos, and other graphical content are published on blogs daily. However, it is very difficult for business owners, marketers, and researchers to derive valuable information from such vast amounts of unstructured data, especially text posts.

What you’ll learn about:

  • Named Entity Recognition (NER)
  • Application of NER
  • SpaCy for NER
  • Building a NER model with SpaCy

Named Entity Recognition (NER)

Named entity recognition, popularly known as NER, is a natural language processing technique that finds key entities (mostly proper nouns and keywords) in unstructured text and assigns each one a category.

Named entities, in this context, are names of people, organizations, locations, dates, products, ordinals, amounts of money, and geographical features such as bodies of water and mountains. NER is an efficient way to pull the key words you care about out of a lengthy document, article, or blog post, run analytics on them, and make informed decisions.

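Named entities have labels to represent them, and the exact label names depend on the model and its annotation scheme. For instance, organizations such as Microsoft and Google are represented with ORG. A person is represented with PER in some schemes, while spaCy's English models use PERSON for people and GPE for countries, cities, and states such as Texas or Lagos. Check out this table for more reference.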

Named entity labels
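
If you want to check what a label means without leaving Python, spaCy ships a small glossary you can query with spacy.explain. Here is a minimal sketch. The list of labels is just a subset I picked for illustration, and it uses the label names from spaCy's English models.

```python
import spacy

# A subset of the labels used by spaCy's English pipelines (picked for illustration).
labels = ["PERSON", "ORG", "GPE", "LOC", "DATE", "MONEY", "ORDINAL", "PRODUCT"]

# spacy.explain returns a short description, or None for a label it does not know.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<8} {description}")
```

This only reads the built-in glossary, so it does not need a downloaded model.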

Application of Named Entity Recognition

NER is not limited to extracting key features or words from blog content. It can be used in a wide variety of institutions and sectors. I'll give some of the best use cases below, but the list is not exhaustive.

  1. Healthcare: NER can be used by medical practitioners to extract and classify medical entities such as patient names, illnesses, medical procedures, and drug names from patient records or medical reports. This saves time and helps identify patterns in patient data, monitor disease outbreaks, and prescribe the right treatments.
  2. Education: NER can be used to extract educational entities from a student's biodata or course profile, such as the student's name, institution, academic degree, age, department, and courses. This helps educators spend less time pulling out information by hand, spot patterns across students, and spend more time improving teaching methods.
  3. Finance: Finance blogs, news posts, and social media posts are full of entities worth extracting, including stock symbols, company names, and financial metrics. This helps investors make informed decisions quickly and spot potential opportunities.
  4. Legal: Lawyers no longer have to spend so much time pulling case names, court names, legal citations, and other legal entities out of documents. NER can extract this information and streamline legal research and analysis.
  5. E-commerce: Product descriptions can be overwhelming for vendors and online buyers alike, making it hard to quickly pick out key entities such as product name, price, short reviews, and specific features. NER can help an e-commerce business identify popular products, monitor customer feedback, and optimize product offerings.
  6. Marketing: NER can be used to analyze customer reviews and feedback, which is one of its most common use cases. Paired with sentiment analysis, it can tie opinions to specific products and brands, and help build customer satisfaction and loyalty.
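
To make the analytics side a bit more concrete, here is a rough sketch of what you might do once entities have been extracted. It assumes the entities are already available as (text, label) pairs. The pairs below are hypothetical and written by hand, not the output of a real model.

```python
from collections import Counter

# Hypothetical (entity text, label) pairs, as an NER step might return them.
entities = [
    ("Microsoft", "ORG"),
    ("Google", "ORG"),
    ("Microsoft", "ORG"),
    ("Lagos", "GPE"),
    ("John Smith", "PERSON"),
]

# How many entities of each type were found.
label_counts = Counter(label for _, label in entities)

# Which organizations are mentioned most often.
org_counts = Counter(text for text, label in entities if label == "ORG")

print(label_counts.most_common())
print(org_counts.most_common(1))
```

The same counting idea works for any of the sectors above. Only the labels you filter on change.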

SpaCy

spaCy is a powerful, open-source Python library for text processing and analysis. It is easy to use, built for production, and focused on usability.

Developers can easily extend it with custom components and features, and it is designed to process large amounts of text quickly.

spaCy uses statistical models to analyze text and make predictions about it. It offers several pre-trained models for various languages, such as English, German, Spanish, and others. The great part is that you can fine-tune them to your needs.

Building a NER model with SpaCy

We will be building the named entity recognition model using SpaCy (a Python library). There are three simple steps to building the NER model.

  1. Import libraries: As in most machine learning projects, you first install the libraries you need and then import the functions required for the task.
  • spacy : is a Python library used for text processing and analysis.
  • spacy.displacy : is a module in the spaCy library that visualizes processed text. It generates an HTML visualization of named entities, and it can also visualize dependency parses together with part-of-speech (POS) tags.
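
A minimal sketch of this step is below. The install commands in the comments assume you use pip, and the model name is the one used later in this article.

```python
# Install once from a terminal (not from inside Python):
#   pip install spacy
#   python -m spacy download en_core_web_lg

import spacy
from spacy import displacy

# Quick check that the library is importable and which version is in use.
print(spacy.__version__)
```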

2. Perform entity extraction: The text is passed to a pre-trained spaCy model for English, which extracts the named entities. We then print them one per line.

nlp : is the variable that stores the pre-trained model. spacy.load loads the model by name, here en_core_web_lg.
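
Loading the model could look like the sketch below. The try/except is my own addition and not part of the original code: spacy.load raises an OSError when the model has not been downloaded, so the sketch prints a hint instead of crashing.

```python
import spacy

MODEL_NAME = "en_core_web_lg"

try:
    nlp = spacy.load(MODEL_NAME)
except OSError:
    # The model package is not installed in this environment.
    nlp = None
    print(f"Model not found. Run: python -m spacy download {MODEL_NAME}")

if nlp is not None:
    # The pipeline components; "ner" is the one that finds named entities.
    print(nlp.pipe_names)
```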

Let’s break it down.

en signifies the English language. spaCy also supports other languages, such as Spanish, Finnish, and German. core signifies a general-purpose pipeline that covers tagging, parsing, lemmatization, and named entity recognition. web signifies that the model was trained on text written for the web, such as blogs, news, and comments. lg signifies a large model. We could also use sm, a small model, or md, a medium model. A larger model takes more time and computing power, but it tends to be more accurate.
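
The naming pattern can be sketched in a few lines of plain Python. The describe_model helper is hypothetical (it is not part of spaCy) and only handles names with the four parts described above.

```python
def describe_model(name):
    """Split a model name like 'en_core_web_lg' into its four parts."""
    language, kind, genre, size = name.split("_")
    # Only the three sizes mentioned in this article are spelled out.
    sizes = {"sm": "small", "md": "medium", "lg": "large"}
    return {
        "language": language,
        "type": kind,
        "genre": genre,
        "size": sizes.get(size, size),
    }

print(describe_model("en_core_web_lg"))
print(describe_model("en_core_web_sm"))
```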

doc: stores the result of calling nlp on the text. It is a Doc object that holds the tokens, the entities, and other annotations.

loop : We run a for loop over doc.ents, the document's entities. Each entity is a Span object with these attributes: ent.text, the exact text recognized as a named entity; ent.start_char, the character offset in the document where the entity begins, counted from 0 just like list indices; ent.end_char, the offset just past the entity's last character; and ent.label_, the abbreviated entity label, for instance, John Smith is a PERSON.
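
Put together, the extraction step might look like the sketch below. The sample sentence is hypothetical, the entity_rows helper is my own naming, and the model load is guarded so the code does not crash if en_core_web_lg is not installed.

```python
import spacy

def entity_rows(doc):
    """Return (text, start_char, end_char, label) for each entity in a Doc."""
    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

# Hypothetical sample text, not the text used in the original notebook.
text = "John Smith joined Microsoft in Lagos on March 17, 2023."

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    nlp = None  # model not installed; see the download command in step 1

if nlp is not None:
    doc = nlp(text)
    for ent_text, start, end, label in entity_rows(doc):
        # end_char is exclusive, so slicing the text gives back the entity.
        assert text[start:end] == ent_text
        print(ent_text, start, end, label)
```

Which entities come back depends on the model and its version, so treat the printed list as something to inspect rather than a fixed result.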

Check out the output of the code.

the code output

3. Display named entity: This is the final code for this task. An HTML visualization highlights each entity inside the text with its label next to it. This is easier to read than the printed list, since the labels are abbreviated, for example GPE for countries, cities, and states.

displacy shows the HTML visualization of the named entities through its serve function, which is called on doc, the processed form of the text.
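
Here is a sketch of the visualization step. It uses displacy.render instead of displacy.serve, because render returns the HTML as a string that you can save, while serve starts a local web server and keeps running. The sample text, the entities_to_html helper, and the output filename are my own assumptions.

```python
from pathlib import Path

import spacy
from spacy import displacy

def entities_to_html(doc):
    """Render the entities of a Doc as a standalone HTML page (a string)."""
    return displacy.render(doc, style="ent", page=True, jupyter=False)

# Hypothetical sample text, not the text used in the original notebook.
text = "John Smith joined Microsoft in Lagos on March 17, 2023."

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    nlp = None  # model not installed

if nlp is not None:
    doc = nlp(text)
    Path("entities.html").write_text(entities_to_html(doc), encoding="utf-8")
    # To view it in the browser from a script instead:
    #   displacy.serve(doc, style="ent")
```

Inside a Jupyter or Kaggle notebook, calling displacy.render(doc, style="ent") usually displays the highlighted text inline.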

Here we go. Check out the final output.

the final NER

Check out the full code on this Kaggle link.

Conclusion

Named entity recognition can be applied in several sectors and for specialized functions. In healthcare, NER can help doctors quickly spot the diseases and ailments mentioned in a medical report, and it can further support diagnosis and medical classification. The list of applications is endless, depending on your field.

See you later. Thanks for reading up until now. Don’t forget to leave a clap, and make a comment if you find this read insightful and helpful. Bye Bye

Written by Josiah Adesola

Writes about machine learning, Data Science, Python. Creative. Thinker. Engineering. Twitter: @_JosiahAdesola
