9. Regular Expression

2024. 3. 14. 01:18·Python

 

 

 
 

Regular Expression

This notebook closely follows the Python Course on Google for Education.

In Python a regular expression search is typically written as:

match = re.search(pat, str)

The re.search() function takes a regular expression pattern and a string and searches for that pattern within the string. If the search succeeds, search() returns a match object; otherwise it returns None. The search is therefore usually followed immediately by an if-statement that tests whether it succeeded, as in the following example, which searches for the pattern 'word:' followed by a three-letter word (details below):

 
In [2]:
import re
 
In [2]:
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
    print('found', match.span(), match.group()) ## 'found (11, 19) word:cat'
else:
    print('did not find')
 
 
found (11, 19) word:cat
 
 

The code match = re.search(pat, str) stores the search result in a variable named "match". The if-statement then tests match: if it is true, the search succeeded, match.group() is the matching text (e.g. 'word:cat'), and match.span() is the (start, end) position of that text in the string. Otherwise match is false (None, to be specific), the search did not succeed, and there is no matching text.
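The None check matters because calling .group() on a failed search raises an AttributeError. Here is a minimal sketch of that behavior; first_match is a hypothetical helper written for this illustration, not part of the re module:

```python
import re

def first_match(pat, text):
    """Return the matched text, or None if pat is not found (hypothetical helper)."""
    match = re.search(pat, text)
    return match.group() if match else None

found = first_match(r'word:\w\w\w', 'an example word:cat!!')
missing = first_match(r'word:\w\w\w', 'no such thing here')
print(found)    # word:cat
print(missing)  # None

# Skipping the check fails on a miss, because None has no .group() method.
try:
    re.search(r'word:\w\w\w', 'no such thing here').group()
except AttributeError as err:
    print('AttributeError:', err)
```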

The 'r' at the start of the pattern string marks a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions (Java needs this feature badly!). I recommend always writing pattern strings with the 'r', just as a habit.
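To see what "passes through backslashes without change" means, this small sketch compares a plain string with a raw string. The variable names are only for illustration:

```python
plain = '\n'   # one character: a newline
raw = r'\n'    # two characters: a backslash and the letter n
print(len(plain), len(raw))  # 1 2

# Without the r prefix, every backslash in a pattern has to be doubled.
same = (r'\d\s*\d' == '\\d\\s*\\d')
print(same)  # True
```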

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters.

Example:

Joke: what do you call a pig with three eyes? piiig!

 
In [3]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')
if match:
    print(match.group())
else: 
    print(match)
    
match = re.search(r'igs', 'piiig')
if match:
    print(match.group())
else: 
    print(match)
 
 
iii
None
 
In [4]:
## . = any char but \n
match = re.search(r'..g', 'piiig'); print(match.group())
 
 
iig
 
In [5]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g'); print(match.group())
 
 
123
 
In [6]:
match = re.search(r'\w\w\w', '@@abcd!!'); print(match.group())
 
 
abc
 
 

Repetition

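Note: + (one or more occurrences of the pattern to its left) and * (zero or more) are "greedy": each goes as far as it can, matching as much text as possible.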

Repetition Examples:

 
In [7]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig'); print(match.group())
 
 
piii
 
In [8]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii'); print(match.group())
 
 
ii
 
In [9]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx'); print(match.group())
 
 
1 2   3
 
In [10]:
## The same pattern also matches when some of the digits are adjacent.
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx'); print(match.group())
 
 
12  3
 
In [3]:
match = re.search(r'\d\s*\d\s*\d', 'xx123xx'); print(match.group())
 
 
123
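The greedy behavior noted at the top of this section can be sketched with a few more searches. The sample strings are made up for illustration. Note in particular that * is allowed to match zero characters, so it can succeed with an empty match at the leftmost position:

```python
import re

# + needs at least one 'a', then takes every 'a' that follows.
plus = re.search(r'a+', 'baaac').group()
print(plus)  # aaa

# * accepts zero occurrences, so the leftmost match is the empty
# string at position 0 (before the 'b').
star = re.search(r'a*', 'baaac')
print(repr(star.group()), star.span())  # '' (0, 0)

# Greedy .* runs to the last '>' it can reach, not the first one.
tag = re.search(r'<.*>', '<b>foo</b> and more').group()
print(tag)  # <b>foo</b>
```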
 
In [6]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')
if match:
    print(match.group())
else:
    print(match)
 
 
None
 
In [7]:
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar'); print(match.group())
 
 
bar
 
 

Example: Suppose you want to find the email address inside the string 'purple alice-b@google.com monkey dishwasher'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':

 
In [10]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())
 
 
b@google
 
 

The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.

 
 

Square Brackets

Square brackets indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot. For the emails problem, square brackets are an easy way to add '.' and '-' to the set of chars that can appear around the @. The pattern r'[\w.-]+@[\w.-]+' gets the whole email address:

 
In [17]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())
 
 
alice-b@google.com
 
 

(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. A caret (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
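The three square-bracket features above (range, trailing dash, inversion) can be tried out with a short sketch. The sample strings are made up for illustration:

```python
import re

# [a-z] is a range: one or more lowercase letters.
lower = re.search(r'[a-z]+', 'ABCdefGHI').group()
print(lower)  # def

# A dash placed last is a literal dash, not a range.
dashed = re.search(r'[abc-]+', 'xx a-b-c xx').group()
print(dashed)  # a-b-c

# A leading ^ inverts the set: one or more chars that are not 'a' or 'b'.
inverted = re.search(r'[^ab]+', 'abba-cadabra').group()
print(inverted)  # -c
```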

 
 

Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

 
In [18]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)  ## raw string, so \w reaches re unchanged
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)
 
 
alice-b@google.com
alice-b
google.com
 
 

A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.
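As a sketch of that workflow, the match object also offers groups(), which returns all the parenthesized groups at once as a tuple. The variable names here are assumptions chosen for the example:

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    # groups() returns every parenthesized group as one tuple.
    username, host = match.groups()
    print(username)        # alice-b
    print(host)            # google.com
    # group(0) is the same as group(): the whole match.
    print(match.group(0))  # alice-b@google.com
```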

findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.

 
In [19]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
print(emails)
for email in emails:
    # do something with each found email string
    print(email)
 
 
['alice@google.com', 'bob@abc.com']
alice@google.com
bob@abc.com
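One difference from re.search() is worth a quick sketch: when nothing matches, findall() returns an empty list rather than None, so a for-loop over the result needs no extra check. The sample text is made up for illustration:

```python
import re

text = 'purple monkey dishwasher, no addresses here'
emails = re.findall(r'[\w.-]+@[\w.-]+', text)
print(emails)       # []
print(len(emails))  # 0

# re.search() on the same text returns None instead.
first = re.search(r'[\w.-]+@[\w.-]+', text)
print(first)        # None
```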
 
 

findall and Groups

The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of tuples. Each tuple represents one match of the pattern and holds the group(1), group(2), ... data. So if 2 parenthesis groups are added to the email pattern, findall() returns a list of tuples, each of length 2, containing the username and host, e.g. ('alice', 'google.com').

 
In [20]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w.-]+)@([\w.-]+)', str)
print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]

for pair in tuples:  ## 'pair' avoids shadowing the built-in tuple type
    print(pair[0])  ## username
    print(pair[1])  ## host
 
 
[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com
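For comparison, here is a sketch of the single-group case: with exactly one parenthesis group, findall() returns a plain list of strings holding that group's text. With two groups, the tuples can be unpacked directly in the for-loop. The variable names are chosen for this example:

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# One group: a plain list of strings, each the text of that group.
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)
print(hosts)  # ['google.com', 'abc.com']

# Two groups: tuples, which a for-loop can unpack directly.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
for username, host in pairs:
    print(username, 'at', host)
```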
 
 

Substitution

The re.sub(pat, replacement, str) function searches for all instances of the pattern in the given string and returns a new string with each one replaced. The replacement string can include '\1', '\2', which refer to the text from group(1), group(2), and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.

 
In [21]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
 
 
purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
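The replacement can use more than one group reference at a time. As a sketch, the following rewrites each address as host:user using both \2 and \1. The output format is made up purely for illustration:

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# \2 is the host and \1 is the username from each match.
swapped = re.sub(r'([\w.-]+)@([\w.-]+)', r'\2:\1', text)
print(swapped)
# purple google.com:alice, blah monkey abc.com:bob blah dishwasher

# re.sub() returns a new string; the original is left unchanged.
print(text)
```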
 