is_regex(s)
The hardest function in this assignment, by far, is is_regex. Writing code that can recognize valid regexes was fairly simply, because the criteria was laid of out for us in the assignment's handout. The difficulty came with writing all of the necessary code for recognizing invalid regexes, and returning False, instead of an error message. We know that an empty string, '', cannot be a valid regex. We also know that a regex must have an equal amount of opening, (, and closing, ), parentheses. These were the first conditions that I check for. This way, we can easily to detect two kinds of invalid regexes and return False without forcing Python to run through the rest of the code. This saves lots of time and memory.
If the regex tree has a length of one and it is in '012e' it is valid. Otherwise, I check whether the regex, if valid, is nested. In order to do this, I find the central separator, '.' or '|'. This is more difficult than it initially appears. At first, I did not account for the possibility of a multiple parenthesized regex within one regex tree. In this implementation I incorrectly searched for the first index of '.' or '|'. Of course, in the case of a nested regex tree, s.index('.') or s.index('|') will only lead to the first instance of this separator, and not the central separator that we are looking for.
The key to accounting for nested regexes is to count parentheses. If s is valid the separator will always occur at an index greater than one where, in a for loop, there has been one more opening parentheses than closing parentheses. Take ((1.0).(2.e)) for instance. The root node occurs at 6, at which point there has been one more opening parentheses (two in total by this point) than closing parentheses (one in total by this point). The only case in which this will not work, is where there is an asterisk(s) following the first nested regex, e.g. ((1.0)****.(2.e)). According to the process described above, the central node should again be index 6. However, in this case index 6 refers to an asterisk. In fact indexes 6-9 all refer to asterisks; the root node, '.', is at index 10. What should we do in this case? A later if condition checks if variable index refers to an asterisk, '*'. if it does, we index to the index of the closest separator, if there is one. If there is no separator, but there are parentheses, this must be an invalid regex. Parenthesized regex trees necessarily have a root node of '.' or '|'.
The next step is to simply use recursion to perform the operations listed above on the left and right side of the parenthesized regex tree, i.e. from the start of the structure to the index of the root node for the left side, and from the index of the root node + 1 to the end of the regex.
all_regex_permutations(s)
The successful implementation of all_regex_permutations depends upon a perfect implementation of is_regex. For this function it is useful to have a helper function that generates all permutations of s, whether they are valid regexes or not. Then, in the main function write a line of code that returns the set returned by the helper function, less all of the invalid regexes. This is where is_regex comes in handy in order to parse through all of the items in the set returned by the helper function.
On a side note, if your implementation of is_regex is not perfect, using all_regex_permutations can be a great way to debug and find out what is wrong with your code. As programmers, sometimes it is difficult to imagine all possible combinations of characters that could break our code. All_regex_permutations will test is_regex with every possible combination of the characters included in variable s. If something is wrong with your implementation of is_regex, the error message returned by all_regex_permutations will let you know which line is wrong or problematic.
On a side note, if your implementation of is_regex is not perfect, using all_regex_permutations can be a great way to debug and find out what is wrong with your code. As programmers, sometimes it is difficult to imagine all possible combinations of characters that could break our code. All_regex_permutations will test is_regex with every possible combination of the characters included in variable s. If something is wrong with your implementation of is_regex, the error message returned by all_regex_permutations will let you know which line is wrong or problematic.
regex_match(r, s)
This function asks us to use a concept that we haven't touched before: ternary strings. Understanding ternary strings requires a close reading of the assignment handout, in order to know how each string should interact with each regex tree structure, e.g. DotTree, BarTree, etc... You will have to use recursion for every structure other than Leaf. In order to test your code, I recommend trying some of the examples shared here on Piazza.
build_regex_tree(regex)
The implementation of build_regex_tree is very similar to that if is_regex. One key difference is that build_regex_tree assumes that variable regex is valid. Hence, we don't have to account for invalid regexes, which makes our implementation much easier. The second key difference, is that we have to deal with the regex tree subclasses defined in regextree.py. Otherwise the implementation is the same as is_regex. We use recursion to access and output the left and rights sides of regex if it is nested. Otherwise, we return a leaf of the singular node '012e'. If there are asterisks, we create asterisk trees containing the other appropriate regex tree subclasses.
Anyway, I hope this guide helps you in understanding the assignment, even if it is already over; I'm sure some of the concepts employed in this assignment will appear, in some form, on the final exam. Happy studying!
Yanming Mai also created a post about A2, and some of its more challenging parts, such as efficiency (although this wasn't necessary) and style with the large amount of code.
Yanming Mai also created a post about A2, and some of its more challenging parts, such as efficiency (although this wasn't necessary) and style with the large amount of code.
No comments:
Post a Comment