. Matches any character except a newline
^ Matches the start of a string
$ Matches the end of a string (or just before a newline at the very end of the string)
* Repetition quantifier - the expression before it may occur 0 or more times
+ Repetition quantifier - the expression before it must occur 1 or more times
? Repetition quantifier - the expression before it may occur 0 or 1 times
{x} Repetition quantifier - the expression before it must occur exactly x times
{x,y} Repetition quantifier - the expression before it must occur between x and y times inclusive
\ Escape character - removes the special meaning of the regex character that follows it. More on this below.
[ ] Selection qualifier - matches any single character from the set inside the brackets (case sensitive).
[az] specifies that the match must be either a or z (no comma is needed; a comma inside the brackets would itself be matched)
[a-z] specifies that the match can be any single character from a through z
[A-Z0-9] specifies that the match can be any single character from A through Z or 0 through 9
Most meta-characters lose their special meaning inside square brackets and are matched literally
| Alternation - the match must satisfy either the expression on its left or the expression on its right.
For example: [a-z]|[0-9] means the match must be either a through z or 0 through 9
( ) Grouping - encloses part of a regular expression so that it can be nested, repeated or extracted
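As a minimal sketch of how sets and alternation behave, assuming Python's re module (the module used in the examples later in this document), the snippet below checks whether a comma inside [ ] acts as a separator. The sample strings and variable names are made up for illustration.

```python
import re

sample_text = 'a,z-b'  # hypothetical sample string

# A comma inside a set is an ordinary character, so [a,z] also matches ','.
with_comma = re.findall(r'[a,z]', sample_text)
print("with_comma =", with_comma)

# Without the comma, the set matches only 'a' or 'z'.
without_comma = re.findall(r'[az]', sample_text)
print("without_comma =", without_comma)

# | picks whichever alternative matches at the current position.
either_word = re.findall(r'cat|dog', 'cat dog bird')
print("either_word =", either_word)
```

The first result includes the comma, which is why the comma is normally left out of a set unless it should be matched.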
Character Classes
\d Any digit.
\D Any non-digit character (the uppercase form inverts \d).
\s Any whitespace character.
\S Any non-whitespace character (the uppercase form inverts \s).
\w Any alphanumeric character (including underscores). Equivalent to [a-zA-Z0-9_] for ASCII text.
\W Any non-alphanumeric character (the uppercase form inverts \w).
\t A tab character.
\n A newline character.
\r A carriage return character.
Note: in Python's re module, which the examples below use, \T, \N and \R do not mean non-tab, non-newline and non-carriage return. Use a negated set instead, for example [^\t], [^\n] or [^\r].
(?i) While not a special character itself, this combination makes the match ignore case when placed at the start of a pattern.
For example: (?i)python would match any of the following: PyThoN, python, PYTHON, pythOn. (Note that (?i)[python] would not do this, because a set matches only a single character.)
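The inline flag can be tried out with a short sketch like the one below, assuming Python's re module. The word list and variable names are invented for illustration; re.fullmatch is used so that the whole word has to match.

```python
import re

words = ['PyThoN', 'python', 'PYTHON', 'pythOn', 'pyth']  # hypothetical inputs

# (?i) at the start of the pattern makes the whole match case-insensitive.
pattern = r'(?i)python'
matched = [w for w in words if re.fullmatch(pattern, w)]
print("matched =", matched)

# A set such as [python] matches only ONE character from the set,
# so it cannot match the six-letter word as a whole.
single = re.fullmatch(r'(?i)[python]', 'PyThoN')
print("single =", single)
```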
Escaping the special meaning of a meta-character
As mentioned above, escaping a regex character is necessary when the character we want to match is itself used by regex as a meta-character. An example of this is a question mark in a sentence. If we used a bare ? as the pattern, it would not work, because ? is a repetition quantifier and there is nothing before it to repeat (Python's re module raises an error).
In this case, we need to escape the ? character with a backslash, written \?, so that it is treated as a literal plain text character.
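A small sketch of the difference, assuming Python's re module. The sample sentence and variable names are hypothetical; re.escape is a standard helper in the re module that adds the backslashes for us.

```python
import re

sentence = 'Is this a question?'  # hypothetical sample sentence

# A bare ? has nothing to repeat, so the pattern is rejected.
try:
    re.findall(r'?', sentence)
    bare_ok = True
except re.error as exc:
    bare_ok = False
    print("bare ? failed:", exc)

# Escaped with a backslash, ? is matched literally.
escaped = re.findall(r'\?', sentence)
print("escaped =", escaped)

# re.escape() escapes any meta-characters in a string automatically.
auto = re.findall(re.escape('?'), sentence)
print("auto =", auto)
```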
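Findall
import re
from_text = 'Anaconda'
find_regx = r'a'
# findall returns every non-overlapping match as a list of strings.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['a', 'a']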
Search
import re
from_text = 'Anaconda'
find_regx = r'a'
# search returns a match object for the first match, or None if there is none.
into_data = re.search(find_regx, from_text)
if into_data is not None:
    print("into_data match object =", into_data)
    # <re.Match object; span=(2, 3), match='a'> (Python 3.7+)
    print("start =", into_data.start())
    print("end =", into_data.end())
    print("span =", into_data.span())
    print("group =", into_data.group())
    print("string =", into_data.string)
Split
import re
from_text = 'Angela owns an Amazonian Anaconda'
regx_text = r' '
# split cuts the text at every match of the pattern (here, a single space).
into_list = re.split(regx_text, from_text)
print(into_list)
# ['Angela', 'owns', 'an', 'Amazonian', 'Anaconda']
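Sub
import re
read_from_text = 'Angela owns an Amazonian Anaconda'
test_when_regx = r'Anaconda'
swap_with_text = 'Python'
# sub replaces every match of the pattern with the replacement text.
made_into_text = re.sub(test_when_regx, swap_with_text, read_from_text)
print("made_into_text =", made_into_text)
# made_into_text = Angela owns an Amazonian Python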
Using Flags to Control Search
import re
from_text = 'Example of case insensitive match'
find_regx = r'example'
# match only looks at the start of the string; re.IGNORECASE ignores case.
into_data = re.match(find_regx, from_text, re.IGNORECASE)
if into_data is not None:
    print("into_data match object =", into_data)
    # <re.Match object; span=(0, 7), match='Example'> (Python 3.7+)
    print("start =", into_data.start())
    print("end =", into_data.end())
    print("span =", into_data.span())
    print("group =", into_data.group())
    print("string =", into_data.string)
The dot meta-character .
import re
# Examples are non-overlapping searches.
from_text = 'Anaconda'
find_regx = r'n.' # An n followed by any single character.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['na', 'nd']
The start and end of string meta-characters ^ and $
import re
# Examples are non-overlapping searches.
from_text = 'Anaconda and Python are snakes'
find_regx = r'^A' # Match a string that starts with an A.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['A']
from_text = 'Python and Anaconda are snakes'
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = []
find_regx = r'.$' # Match any single last character.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['s']
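The note that $ matches before a trailing newline can be checked with a sketch like the following, assuming Python's re module. The two-line sample text is invented, and re.MULTILINE is a standard re flag that is not otherwise covered in this document; it is included only to contrast with the default behaviour.

```python
import re

text = 'first line\nlast line\n'  # hypothetical two-line text

# By default $ matches at the very end of the string and also just
# before a newline that is the last character of the string.
end_default = re.findall(r'line$', text)
print("end_default =", end_default)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
end_multi = re.findall(r'line$', text, re.MULTILINE)
print("end_multi =", end_multi)
start_multi = re.findall(r'^\w+', text, re.MULTILINE)
print("start_multi =", start_multi)
```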
Repetition meta-characters * + {n} {n,m} {n,}
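import re
# Examples are non-overlapping searches.
from_text = 'The tutorial went on and on and on and on'
find_regx = r' (on and)*' # A space, then 'on and' zero or more times.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# With a group in the pattern, findall returns the group's text ('' where the group matched zero times).
# into_list = ['', '', 'on and', 'on and', 'on and', '']
find_regx = r' (on and)+' # A space, then 'on and' one or more times.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['on and', 'on and', 'on and']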
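The counted forms {n}, {n,m} and {n,} can be sketched in the same way. The snippet below assumes Python's re module; the candidate strings and the helper name matching are hypothetical, and re.fullmatch is used so that each candidate must match in full.

```python
import re

candidates = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']  # hypothetical inputs

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

exactly_three = matching(r'a{3}')    # exactly three times
two_to_four = matching(r'a{2,4}')    # between two and four times inclusive
four_or_more = matching(r'a{4,}')    # four or more times
one_or_two = matching(r'aa?')        # an a, then an optional second a

print("exactly_three =", exactly_three)
print("two_to_four =", two_to_four)
print("four_or_more =", four_or_more)
print("one_or_two =", one_or_two)
```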
Sets and ranges of characters [ ] [ - ] and [^ ]
import re
# Examples are non-overlapping searches.
from_text = 'The tut_32547_Autumn was 100 to 2000 times more exciting '
find_regx = r'[a1b7]' # Any one of the characters a, 1, b or 7.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['7', 'a', '1']
find_regx = r'[^a-z\d ]' # Not a lowercase letter, a digit or a space.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['T', '_', '_', 'A']
Classes of characters \d \D \w \W
The meta-character \d stands for any single digit character, [0-9]. \w stands for any single word character, that is, for ASCII text, any character in the set [a-zA-Z0-9_] (note the underscore at the end).
An uppercase \D denotes any character that is not a digit, that is, the set [^0-9]. Similarly, an uppercase \W denotes any character that is not a word character, that is, the set [^a-zA-Z0-9_].
import re
# Examples are non-overlapping searches.
from_text = 'The tut_32547_Autumn was 100 to 2000 times more exciting '
find_regx = r'\d{2}' # Any two digits.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['32', '54', '10', '20', '00']
find_regx = r'\d{4,4}' # Any four digits (the same as \d{4}).
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['3254', '2000']
find_regx = r'\d\d\d\d' # Any four digits, written out in full.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['3254', '2000']
import re
# Examples are non-overlapping searches.
from_text = 'The tut_32547_Autumn was 100 to 2000 times more exciting '
find_regx = r'\W\w*\W' # A non-word character, zero or more word characters, a non-word character.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = [' tut_32547_Autumn ', ' 100 ', ' 2000 ', ' more ']
find_regx = r'\D\d+\D' # A non-digit, one or more digits, a non-digit.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['_32547_', ' 100 ', ' 2000 ']
find_regx = r'\d\d\d\d' # Four digits.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['3254', '2000']
The OR meta-character |
import re
# Examples are non-overlapping searches.
from_text = 'The tut_32547_Autumn was 100 to 2000 times more exciting '
find_regx = r'\d\d\d|was' # Three digits OR the word was.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['325', 'was', '100', '200']
find_regx = r'2\d\d|\d\W' # A 2 followed by two digits OR a digit and a non-word character.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['254', '0 ', '200', '0 ']
find_regx = r'\d\d|\w{4}\W|Au' # Two digits OR four word characters then a non-word character OR Au.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['32', '54', 'Au', 'tumn ', '10', '20', '00', 'imes ', 'more ', 'ting ']
Escaping Meta-Characters and Extracting Results
import re
# Examples are non-overlapping searches.
from_text = 'Does Angela own an Anaconda? Does she own a Python ?'
find_regx = r'\?' # We escape the meta-character.
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['?', '?']
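Escaping the Escape: why we need Raw Strings
import re
# Example without raw strings: every backslash must be doubled for Python,
# and doubled again for the regex, giving four backslashes per literal backslash.
from_text = 'My String is c:\\nicepath\\nicefile'
print("from_text=", from_text)
find_regx = '[a-z]:\\\\[a-z]+\\\\[a-z]+'
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['c:\\nicepath\\nicefile']
import re
# Example using raw strings: only the regex-level doubling is needed.
from_text = r'My String is c:\nicepath\nicefile'
print("from_text=", from_text)
find_regx = r'[a-z]:\\[a-z]+\\[a-z]+'
into_list = re.findall(find_regx, from_text)
print("into_list =", into_list)
# into_list = ['c:\\nicepath\\nicefile']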
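The effect of the r prefix can be seen directly with a short sketch, assuming Python's re module. All names and the sample path here are hypothetical.

```python
import re

# Without the r prefix, Python turns \n into a single newline character.
plain_newline = '\n'
# With the r prefix, the backslash and the n are kept as two characters.
raw_newline = r'\n'
print(len(plain_newline), len(raw_newline))

# The same regex written both ways: four backslashes without a raw
# string, two with one. Both produce an identical pattern string.
plain_regx = '[a-z]:\\\\[a-z]+'
raw_regx = r'[a-z]:\\[a-z]+'
print(plain_regx == raw_regx)

sample_path = r'c:\nicepath'  # hypothetical path
found = re.findall(raw_regx, sample_path)
print("found =", found)
```

Raw strings do not change what the regex engine sees. They only remove the extra layer of escaping that Python string literals would otherwise require.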
Extracting Results and using the Group meta-characters ()
All of the regular expression functions of the re module support the group meta-characters ( ). In addition to delimiting parts of a regular expression, a group allows the sequence of characters it matched to be extracted from the input string. We give an example below.
Pay particular attention to the difference between groups() and group(0) to group(3).
groups(), plural, returns a tuple containing the text captured by every group.
group(0) returns the whole matched section.
group(1) to group(3) return the text captured by the corresponding ( ) meta-characters.
import re
# Example using raw strings.
from_text = r'bits=8, bytes=107 and pieces=900 '
print("from_text=", from_text)
find_regx = r'.*bits=(\d+).*bytes=(\d+).*pieces=(\d+).*'
into_data = re.search(find_regx, from_text)
print(into_data)
if into_data:
    print("all groups=", into_data.groups())
    # all groups= ('8', '107', '900')
    print("group 0 =", into_data.group(0))
    print("group 1 =", into_data.group(1))
    print("group 2 =", into_data.group(2))
    print("group 3 =", into_data.group(3))
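Groups also affect findall. As a rough sketch, assuming Python's re module, a pattern with more than one group makes findall return a list of tuples, one tuple per match. The pattern and variable names below are illustrative and not part of the example above.

```python
import re

from_text = r'bits=8, bytes=107 and pieces=900 '

# Two groups: the name before '=' and the digits after it.
find_regx = r'(\w+)=(\d+)'
pairs = re.findall(find_regx, from_text)
print("pairs =", pairs)

# The captured text is always a string, so convert it where numbers are needed.
as_numbers = {name: int(value) for name, value in pairs}
print("as_numbers =", as_numbers)
```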