Swift Regex Tutorial: Getting Started

Master the pattern-matching superpowers of Swift Regex. Learn to write regular expressions that are easy to understand, work with captures and try out RegexBuilder, all while making a Marvel Movies list app! By Ehab Amer.


Reading the Text File

The first thing you need to do is load the text file. Replace the existing implementation of loadData() in ProductionsDataProvider.swift with:

func loadData() -> [MarvelProductionItem] {
  // 1
  var marvelProductions: [MarvelProductionItem] = []

  // 2
  var content = ""
  if let filePath = Bundle.main.path(
    forResource: "MarvelMovies",
    ofType: nil) {
    let fileURL = URL(fileURLWithPath: filePath)
    do {
      content = try String(contentsOf: fileURL, encoding: .utf8)
    } catch {
      return []
    }
  }

  // TODO: Define Regex
   
  // 3
  return marvelProductions
}

This code does three things:

  1. Defines marvelProductions as an empty array of MarvelProductionItem that you'll add items to later.
  2. Reads the contents of the MarvelMovies file from the app's bundle and loads it into the local variable content. If reading the file throws an error, the function returns an empty array.
  3. Returns the array at the end of the function.

You'll do all the work in the TODO part.

If you build and run now, you'll just see a blank screen. Fear not, you're about to get to work writing the regular expressions that find the data to fill this.
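
If you'd like to experiment with the file-reading step outside the app, here's a minimal sketch rather than the project's code. readText(at:) is a hypothetical helper, and a temporary file stands in for the bundled MarvelMovies file. It assumes the data is UTF-8 encoded:

```swift
import Foundation

// Hypothetical helper: returns the file's text, or nil if it can't be read.
func readText(at url: URL) -> String? {
  try? String(contentsOf: url, encoding: .utf8)
}

// A temporary file stands in for the bundled MarvelMovies file.
let sampleText = "tt10857160\tShe-Hulk: Attorney at Law"
let tempURL = FileManager.default.temporaryDirectory
  .appendingPathComponent("MarvelMoviesSample-\(UUID().uuidString).txt")
try? sampleText.write(to: tempURL, atomically: true, encoding: .utf8)

// Reading an existing file returns its contents.
let text = readText(at: tempURL)

// Reading a file that doesn't exist returns nil instead of throwing.
let missingURL = FileManager.default.temporaryDirectory
  .appendingPathComponent("NoSuchFile-\(UUID().uuidString).txt")
let missing = readText(at: missingURL)

print(text ?? "<unreadable>", missing ?? "<unreadable>")
```

In the app, the URL would come from Bundle.main, as in loadData().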

Defining the Separator

The first regular expression you'll define is the separator. For that, you need to define the pattern that represents what a separator can be. All of the below are valid separator strings for this data:

  • SpaceSpace
  • SpaceTab
  • Tab
  • TabSpace

However, this is not a valid separator in the MarvelMovies file:

  • Space

A valid separator can be a single tab character, two or more space characters, or a mix of tabs and spaces, but never a single space, because single spaces also appear inside the actual content, such as between the words of a title.

You can define the separator object with RegexBuilder. Add this code before return marvelProductions:

let fieldSeparator = ChoiceOf { // 1
  /[\s\t]{2,}/ // 2
  /\t/ // 3
}

In regular expression syntax, \s matches any single whitespace character, which includes a space and a tab as well as line breaks, whereas \t only matches a tab.

The new code has three parts:

  1. ChoiceOf means only one of the expressions within it needs to match.
  2. The square brackets define a set of characters and match exactly one character from that set, here a whitespace character or a tab. Since \s already covers tabs, the \t inside the brackets only makes the intent explicit. The curly braces apply a repetition to the expression just before them, and {2,} means two or more times. Together, the bracketed expression must match two or more consecutive characters.
  3. An expression of a tab character found once.

fieldSeparator defines a regex that can match two or more consecutive spaces with no tabs, a mix of spaces and tabs with no specific order or a single tab.

Sounds about right.
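
To check that claim, here's a small sketch you can run in a playground or a command-line target with Swift 5.7 or later. It repeats the separator definition using #/.../# delimiters, which is an assumption made so the snippet compiles even when bare-slash regex literals aren't enabled. The candidates array and the other constant names are made up for this example:

```swift
import RegexBuilder

// Same separator as above, written with extended #/.../# delimiters.
let fieldSeparator = ChoiceOf {
  #/[\s\t]{2,}/#  // two or more whitespace characters
  #/\t/#          // a single tab
}

// SpaceSpace, SpaceTab, Tab, TabSpace and a single Space.
let candidates = ["  ", " \t", "\t", "\t ", " "]

// wholeMatch(of:) succeeds only when the entire string is a separator.
let results = candidates.map { $0.wholeMatch(of: fieldSeparator) != nil }
print(results) // [true, true, true, true, false]

// \s also matches line breaks, so two newlines count as a separator too.
let matchesNewlines = "\n\n".wholeMatch(of: fieldSeparator) != nil
print(matchesNewlines) // true
```

The four valid combinations match and the single space doesn't. The last check shows a side effect of using \s: consecutive line breaks also qualify as a separator.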

Now, for the remaining fields.

Defining the Fields

You can define the fields in MarvelProductionItem as follows:

  • id: A string that starts with tt followed by several digits.
  • title: A string made up of arbitrary characters.
  • productionYear: A string that starts with ( and ends with ).
  • premieredOn: A string that represents a date.
  • posterURL: A string that begins with http and ends with jpg.
  • imdbRating: A number with one decimal place or no decimal places at all.

You can define those fields using regular expressions as follows. Add this after the declaration of fieldSeparator before the function returns:

let idField = /tt\d+/ // 1

let titleField = OneOrMore { // 2
  CharacterClass.any
}

let yearField = /\(.+\)/ // 3

let premieredOnField = OneOrMore { // 4
  CharacterClass.any
}

let urlField = /http.+jpg/ // 5

let imdbRatingField = OneOrMore { // 6
  CharacterClass.any
}

These regex instances are a mix of RegexBuilder expressions and regex literals.

The objects you created are:

  1. idField: An expression that matches a string starting with tt followed by one or more digits.
  2. titleField: Any sequence of one or more characters.
  3. yearField: A sequence of characters that starts with ( and ends with ).
  4. premieredOnField: Instead of looking for a date, you'll search for any sequence of characters, then convert it to a date.
  5. urlField: Similar to yearField, but starting with http and ending with jpg.
  6. imdbRatingField: Similar to premieredOnField, you'll search for any sequence of characters, then convert it to a Float.
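
To get a feel for what these expressions accept, here's a short sketch that tries them on individual strings. The ID comes from the data file, while the year and URL values are invented for illustration. The literals use #/.../# delimiters so the snippet compiles without bare-slash regex literals enabled, and anyField is a made-up name standing in for the three OneOrMore fields:

```swift
import RegexBuilder

let idField = #/tt\d+/#
let yearField = #/\(.+\)/#
let urlField = #/http.+jpg/#
let anyField = OneOrMore {
  CharacterClass.any
}

// An ID needs tt plus at least one digit.
let idMatches = "tt10857160".wholeMatch(of: idField) != nil  // true
let bareTTMatches = "tt".wholeMatch(of: idField) != nil      // false

// Anything wrapped in parentheses counts as a year.
let yearMatches = "(2022)".wholeMatch(of: yearField) != nil  // true

// Hypothetical URL, not taken from the data file.
let urlMatches = "https://example.com/poster.jpg"
  .wholeMatch(of: urlField) != nil                           // true

// CharacterClass.any accepts every character, including tabs.
let anyMatches = "8.1\tmore text".wholeMatch(of: anyField) != nil  // true

print(idMatches, bareTTMatches, yearMatches, urlMatches, anyMatches)
```

The last check is worth keeping in mind: an expression built from CharacterClass.any doesn't stop at a separator on its own.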

Matching a Row

Now that you have each row of the MarvelMovies file broken down into smaller pieces, it's time to put the pieces together and match a whole row with an expression.

Instead of doing it all in one go, break it down into iterations to ensure that each field is properly matched and nothing unexpected happens.

Add the following Regex object at the end of loadData(), just before return marvelProductions:

let recordMatcher = Regex { // 1
  idField
  fieldSeparator
}

let matches = content.matches(of: recordMatcher) // 2
print("Found \(matches.count) matches")
for match in matches { // 3
  print(match.output + "|") // 4
}

This code does the following:

  1. Defines a new Regex object that consists of the idField regex followed by a fieldSeparator regex.
  2. Gets the collection of matches found in the string you loaded from the file earlier.
  3. Loops over the found matches.
  4. Prints the output of each match followed by the pipe character, |.
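
If you want to see the same steps without the app or the data file, here's a self-contained sketch that runs the matcher over an in-memory string. The two IDs appear in the data file, but the titles Title A and Title B and the sample string itself are placeholders invented for this example:

```swift
import RegexBuilder

let idField = #/tt\d+/#
let fieldSeparator = ChoiceOf {
  #/[\s\t]{2,}/#
  #/\t/#
}

let recordMatcher = Regex {
  idField
  fieldSeparator
}

// Made-up rows: the first uses a tab, the second uses two spaces.
let sample = "tt10857160\tTitle A\ntt9419884  Title B\n"

let matches = sample.matches(of: recordMatcher)

// Append a pipe so the matched separator becomes visible.
let printed = matches.map { "\($0.output)|" }
for line in printed {
  print(line)
}
```

Each printed line contains the ID, then the separator that was matched, then the pipe character.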

Build and run. Take a look at the output in the console window:

Found 49 matches
tt10857160	|
tt10648342	|
tt13623148	|
tt9419884	  |
tt10872600	|
tt10857164	|
tt9114286	  |
tt4154796	  |
tt10234724	|
.
.
.

Notice the whitespace between the ID and the pipe character. This means the match included the separator. So far, the expression is correct. Now, expand the definition of recordMatcher to include titleField:

let recordMatcher = Regex {
  idField
  fieldSeparator
  titleField
  fieldSeparator
}

Build and run, then take a look at the console output:

Found 1 matches
tt10857160	She-Hulk: Attorney at Law ........|

What just happened? Adding the title expression caused the rest of the file to be included in the first match except for the final rating value.

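Well... this unfortunately makes sense. The title expression matches any character, and its repetition is greedy. This means that separators, numbers, URLs and everything else get matched as part of the title. The repetition only gives characters back until the final fieldSeparator can match, which happens at the last separator in the file, right before the final rating value. To fix this, you need to tell the expression to look at the next part of the pattern before continuing with a repetition.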
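
The greedy behavior is easy to reproduce on a tiny input. The sketch below uses two invented rows and compares the greedy title field with a reluctant one. The reluctant variant is only one possible way to stop at the next separator, shown here for contrast, and isn't necessarily the technique this tutorial goes on to use:

```swift
import RegexBuilder

let idField = #/tt\d+/#
let fieldSeparator = ChoiceOf {
  #/[\s\t]{2,}/#
  #/\t/#
}

// Two made-up rows, not taken from the data file.
let sample = "tt1\tAlpha One\t(2020)\ntt2\tBeta Two\t(2021)\n"

// Greedy: the title grabs as much as it can, across rows.
let greedyMatcher = Regex {
  idField
  fieldSeparator
  OneOrMore { CharacterClass.any }
  fieldSeparator
}

// Reluctant: the title grabs as little as it can before a separator.
let reluctantMatcher = Regex {
  idField
  fieldSeparator
  OneOrMore(.reluctant) { CharacterClass.any }
  fieldSeparator
}

let greedyOutputs = sample.matches(of: greedyMatcher).map { String($0.output) }
let reluctantOutputs = sample.matches(of: reluctantMatcher).map { String($0.output) }

print(greedyOutputs.count)    // 1
print(reluctantOutputs.count) // 2
```

The greedy matcher produces a single match that runs from the first ID to the last separator in the string, while the reluctant one stops at the first separator after each title and finds both rows.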