To extract phrases from a corpus with Gensim, we can use its Phrases module. This tool automatically detects common multi-word expressions (also known as 'collocations'), such as 'new_york' or 'financial_crisis', using statistical co-occurrence methods. I will detail the steps below.
1. Prepare Data
First, we prepare the text data. Suppose we have a list of documents, where each document is a list of words. For example:
```python
documents = [
    ["the", "new", "york", "times"],
    ["new", "york", "post"],
    ["the", "washington", "post"],
]
```
2. Train the Model
Next, we use these documents to train a Phrases model. This model identifies adjacent word pairs whose co-occurrence score in the corpus exceeds a specified threshold, and treats them as candidate phrases.
```python
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Build the phrase model
phrases = Phrases(documents, min_count=1, threshold=1)

# Convert the model to a more efficient implementation
bigram = Phraser(phrases)
```
Here, min_count sets the minimum number of times a word pair must occur in the corpus before it is considered, and threshold sets the minimum score a pair needs in order to be accepted as a phrase. Phraser is a lightweight, frozen version of the Phrases model that applies the learned phrases more efficiently.
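To make the interaction of these parameters concrete, the default scoring rule used by Phrases (from Mikolov et al.) can be sketched in plain Python. The counts below are illustrative values, not numbers read out of a real model (Gensim's internal vocabulary size, for instance, counts both unigrams and observed bigrams):

```python
# A minimal sketch of the default (Mikolov et al.) scorer used by Phrases:
#   score = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
# A candidate bigram is promoted to a phrase when its score exceeds `threshold`.

def bigram_score(count_ab, count_a, count_b, vocab_size, min_count):
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Illustrative counts: "new york" seen together twice, each word seen twice,
# a small vocabulary of 12 entries.
print(bigram_score(count_ab=2, count_a=2, count_b=2, vocab_size=12, min_count=1))
# With threshold=1, a score of 3.0 > 1 would promote "new_york".
```

Note how min_count is subtracted in the numerator: a pair seen exactly min_count times scores zero, so it can never pass a positive threshold.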
3. Apply the Model
Once the phrase model is trained, we can apply it to new documents to combine common phrases into single tokens.
```python
# Apply the model to transform documents
print(bigram[["new", "york", "times"]])
```
The output will be:
```shell
['new_york', 'times']
```
This demonstrates that 'new york' is correctly identified as a phrase and merged into a single token.
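Conceptually, the apply step is a greedy left-to-right scan that merges adjacent token pairs found in the learned phrase table. A rough pure-Python sketch of this behavior, where `phrase_table` is a hypothetical stand-in for the trained model's internal state:

```python
# Sketch of what a trained Phraser does at apply time: walk the tokens left to
# right and merge any adjacent pair present in the phrase table.
def merge_phrases(tokens, phrase_table, delimiter="_"):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrase_table:
            out.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # consume both tokens of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out

phrase_table = {("new", "york")}  # as learned from the toy corpus above
print(merge_phrases(["new", "york", "times"], phrase_table))
# -> ['new_york', 'times']
```

This is only an approximation of Gensim's implementation, but it captures the key point: detection happens at training time, and application is a cheap dictionary lookup over adjacent pairs.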
4. Practical Example
Suppose we have a news corpus focused on major U.S. cities, and we aim to identify frequently occurring city names (e.g., 'new york'). By following these steps, we can effectively identify and tag such phrases, which is highly beneficial for subsequent text analysis and information extraction.
Summary
By following these steps, we can effectively use Gensim's Phrases model to extract phrases from large volumes of text. Treating multi-word expressions as single tokens gives downstream tasks such as text analysis, information retrieval, and other natural language processing pipelines a more faithful representation of the data, which often improves the quality of the resulting models.