Subselect a DataFrame Based on Keyword Search

python

In this post, what we are trying to accomplish is selecting rows in a DataFrame by looking for keywords that are contained within a se tof columns filled with text data.

Let’s import some packages

# Import
import pandas as pd
import numpy as np

We create a dataset to play with

# Create some data
df = pd.DataFrame({"Shopping list": ["apple","pasta","olive oil","Apple","bananas","water bottles","Oranges"]})
df
Shopping list
0 apple
1 pasta
2 olive oil
3 Apple
4 bananas
5 water bottles
6 Oranges

Subselecting using keywords

Please note in the dataset above, I have on purpose included apple twice: both with a lowercase a and uppercase A. This is just to draw your attention about the importance of transforming your data first: either all lower or upper case before searching for specific terms. In any case, here is how you would proceed if you wanted to select the fruits within the list only:

fruits = df[df["Shopping list"].str.contains("apple")
            | df["Shopping list"].str.contains("Apple")
            | df["Shopping list"].str.contains("bananas")            
            | df["Shopping list"].str.contains("Oranges")]
fruits
Shopping list
0 apple
3 Apple
4 bananas
6 Oranges

Now let’s say from that list of fruits, you wanted to exclude Oranges. Here is what you should do:

noOranges = fruits[-fruits["Shopping list"].str.contains("Oranges")]
noOranges
Shopping list
0 apple
3 Apple
4 bananas

Et voila!

Share this post: