Text Processing

Regex

https://docs.python.org/3/library/re.html#regular-expression-syntax

re.findall() returns every non-overlapping match of a pattern, for example all adverbs ending in "ly"

import re

# note the trailing space: adjacent string literals are joined without a separator
s = "I am fully, and totally confident that " \
    "programming and developing software is completely my thing"
adverbs = re.findall(r"\w{2,}ly", s)
print(adverbs)
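The pattern above also matches inside longer words, because nothing forces "ly" to end the word. A minimal sketch of the difference, assuming a made-up sentence and adding \b word boundaries (the sentence and variable names are illustrative only):

```python
import re

s = "She quickly tried to analyze the oddly shaped data"

# without boundaries, "analy" is picked out of "analyze"
loose = re.findall(r"\w{2,}ly", s)

# \b on both sides requires the match to be a whole word
strict = re.findall(r"\b\w{2,}ly\b", s)

print(loose)   # ['quickly', 'analy', 'oddly']
print(strict)  # ['quickly', 'oddly']
```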

re.search() returns the first occurrence of a pattern anywhere in a string

It scans from the beginning of the string and stops at the first match, so the second email below is never found

Append ? to a quantifier for non-greedy matching: *?, +? and ?? match as little as possible
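A small sketch of greedy versus non-greedy quantifiers. The sample strings are made up for illustration; only standard re functions are used:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# greedy: .+ runs to the last ">" in the string
greedy = re.findall(r"<.+>", html)

# non-greedy: .+? stops at the first ">" it can
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# ? prefers to match one character, ?? prefers to match none
print(re.match(r"(\d?)(\d*)", "123").groups())   # ('1', '23')
print(re.match(r"(\d??)(\d*)", "123").groups())  # ('', '123')
```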

import re

s = "my work email is mk@plataux.com, and that is my work email, mk.mahfouz@gmail.com"

m = re.search(r"(\w+)@(\w+\.\w+)", s)

# the two matched groups in parentheses
print("sub-groups: ", m.groups())
print("email: ", m.group())
print("user: ", m.group(1))
print("domain: ", m.group(2))
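To collect every address rather than only the first, re.finditer() can be used in place of re.search(). This is a sketch with hypothetical example.com addresses; the widened user part [\w.]+ is an assumption added here so that dotted names are captured whole, and it is not a full email validator:

```python
import re

s = "work: jane@example.com, personal: jane.doe@example.org"

# \w+ does not match ".", so the dotted user name is cut short
narrow = re.findall(r"(\w+)@(\w+\.\w+)", s)

# allowing "." in the user part captures the full name
pattern = re.compile(r"([\w.]+)@(\w+\.\w+)")
emails = [(m.group(1), m.group(2)) for m in pattern.finditer(s)]

print(narrow)  # [('jane', 'example.com'), ('doe', 'example.org')]
print(emails)  # [('jane', 'example.com'), ('jane.doe', 'example.org')]
```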

re.match() matches a pattern only at the beginning of the string. Multi-group capturing can be used

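import re

s = "781-521-4520 is my phone number"
# three runs of 3 to 4 digits, each optionally followed by a dash or whitespace
m = re.match(r"^(\d{3,4}[-\s]??){3}", s)
print(m.group())

# We can also capture the three digit groups separately;
# \2 requires the second separator to be the same as the first
m = re.match(r"^(\d{3})([-\s])(\d{3})\2(\d{4})", s)

# four groups: area code, separator, prefix, line number
print(m.groups())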
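Because re.match() is anchored at the start, it returns None when the number appears later in the string, while re.search() still finds it. A short sketch with made-up strings, which also shows the backreference rejecting mixed separators:

```python
import re

pattern = r"(\d{3})([-\s])(\d{3})\2(\d{4})"
s = "my phone number is 781-521-4520"

at_start = re.match(pattern, s)   # None: the string does not begin with digits
anywhere = re.search(pattern, s)  # scans forward until the pattern fits

print(at_start)
if anywhere:  # always check for None before calling .group()
    print(anywhere.group())

# a dash followed by a space fails, because \2 must repeat the first separator
mixed = re.search(pattern, "781-521 4520")
print(mixed)
```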

Use group capturing to grab keys and values from a single-level JSON string

import re

s = """{
  "a": "apples",
  "b": "berries","c": "cherries",
  "pi": 3.14,
  "x": "Xenon"
}"""

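# group 1: key, group 2: optional opening quote, group 3: value;
# \2 requires a closing quote only if an opening quote was matched
m = re.findall(r'"(\w+)":\s*("?)([\w.]+)\2', s)
# quoted values stay strings, unquoted values are parsed as floats
d = {mx[0]: mx[2] if mx[1] else float(mx[2]) for mx in m}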
print(d)
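As a sanity check, the regex result can be compared against the standard json module. This is only a sketch on a shortened sample string; the regex handles flat objects with simple string and numeric values, and would fail on values such as true, null, nested objects, or strings containing spaces:

```python
import json
import re

s = '{"a": "apples", "pi": 3.14, "x": "Xenon"}'

pairs = re.findall(r'"(\w+)":\s*("?)([\w.]+)\2', s)
from_regex = {key: value if quote else float(value) for key, quote, value in pairs}

# json.loads is the robust way to parse real JSON
from_json = json.loads(s)

print(from_regex == from_json)  # True
```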

Text Formatting

Format Specification Mini-Language https://docs.python.org/3/library/string.html#formatstrings

format_spec     ::=  [[fill]align][sign][#][0][width][grouping_option][.precision][type]
fill            ::=  <any character>
align           ::=  "<" | ">" | "=" | "^"
sign            ::=  "+" | "-" | " "
width           ::=  digit+
grouping_option ::=  "_" | ","
precision       ::=  digit+
type            ::=  "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"

Text formatting can be done with several constructs:

  • format(value, fmt_str) builtin function. It formats exactly one value

  • "{}, {}, {}".format(1, 2, 3) the str.format(*args) method, which can format multiple arguments

  • f"{x}, {y}, {z}" modern f-strings

  • "%d, %d, %d" % (2, 4, 8) old-style (C-style) format strings, using the % (modulo) operator with a tuple of values
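The four constructs listed above can produce the same text. A minimal sketch, with x, y and z as illustrative values:

```python
x, y, z = 2, 4, 8

# format() handles one value at a time, so the pieces are joined manually
a = ", ".join(format(v, "d") for v in (x, y, z))
b = "{}, {}, {}".format(x, y, z)
c = f"{x}, {y}, {z}"
d = "%d, %d, %d" % (x, y, z)

print(a, b, c, d, sep="\n")  # each line prints: 2, 4, 8
```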

Floating Number Formatting

print(f"{2345.491:_^-20,.2f}")
# or
print(format(2345.491,"_^-20,.2f"))

Breakdown of this format spec:

  • _ the fill character, so the padding is made of underscores

  • ^ centers the value within the given width

  • - shows a sign only for negative numbers (the default)

  • 20 the minimum total width of the result

  • , uses a comma as the thousands separator

  • .2 precision: two digits after the decimal point

  • f fixed-point notation
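To see how the individual fields interact, the same number can be run through a few variations of the spec. This is a sketch; the list of specs is chosen here for illustration:

```python
value = 2345.491
specs = ["<20,.2f", ">20,.2f", "^20,.2f", "=+20,.2f", "_^+20,.2f"]

# format() takes the spec as a plain string, so specs can be built at runtime
results = {spec: format(value, spec) for spec in specs}

for spec, text in results.items():
    # the bars make the padding visible
    print(f"{spec:>10} -> |{text}|")
```

With "=" the padding goes between the sign and the digits, which is why it is combined with "+" here.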

# the approximate speed of light in m/s
format(3 * 10**8, ".3E")

Breakdown of this format spec:

  • .3 three digits after the decimal point

  • E scientific notation with a capital E
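For comparison, a short sketch of the related presentation types e, g and d applied to the same value (the variable name c is illustrative):

```python
c = 3 * 10**8  # approximate speed of light in m/s

print(format(c, ".3E"))  # 3.000E+08
print(format(c, ".3e"))  # 3.000e+08  lowercase exponent marker
print(format(c, ".3g"))  # 3e+08      general format drops trailing zeros
print(format(c, ",d"))   # 300,000,000  plain integer with separators
```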

Integer Formatting

format(1020, "b")
format(1020, "X")

  • b for binary representation

  • X for hexadecimal representation, with capital letters
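The integer types combine with the other fields of the mini-language, such as "#" for a base prefix, "0" plus a width for zero padding, and "_" for digit grouping. A brief sketch:

```python
n = 1020

print(format(n, "b"))     # 1111111100
print(format(n, "#b"))    # 0b1111111100   "#" adds the base prefix
print(format(n, "016b"))  # 0000001111111100  zero-padded to 16 digits
print(format(n, "_b"))    # 11_1111_1100   groups of four digits
print(format(n, "#x"))    # 0x3fc          lowercase hexadecimal
print(format(n, "o"))     # 1774           octal

# int() with an explicit base reverses the conversion
print(int("3FC", 16))     # 1020
```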

Datetime Formatting

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

import datetime as dtx
from zoneinfo import ZoneInfo

now = dtx.datetime.now()
print(format(now, '%A'))
print(f"{now:%A %d-%m-%Y}")
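The same format codes work in the other direction with datetime.strptime(), which parses a string into a datetime. A minimal sketch using a made-up timestamp; weekday names printed by %A depend on the active locale:

```python
import datetime as dtx

stamp = "25-12-2023 18:30"
fmt = "%d-%m-%Y %H:%M"

# parse the text into a datetime object
parsed = dtx.datetime.strptime(stamp, fmt)

# format it back, here with the weekday name in front
print(f"{parsed:%A %d-%m-%Y}")

# the round trip through strftime reproduces the original text
print(parsed.strftime(fmt) == stamp)  # True
```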


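# the "localtime" key is not available on every platform;
# dtx.datetime.now().astimezone() is a portable way to get the local time zone
now = dtx.datetime.now(tz=ZoneInfo("localtime"))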

print("right now it is weekday {0:%A} and "
  "in two days it will be weekday "
  "{1:%A} timezone {0:%Z}".format(now, now + dtx.timedelta(days=2)))