Commit ba0dd162 authored by vermeul


parent b98af395
@@ -32,9 +32,9 @@ print(string[:1].upper() + string[1:])
## Sorting and filtering
## Filtering: the `filter` function
**input:** a file list which needs to be filtered and sorted:
**input:** a file list which needs to be filtered (and later sorted):
@@ -53,63 +53,54 @@ schema_semper_with_mathml.rng
**desired output**
**create the filter function**
* files which are not of the pattern `20_Ms_<collection>_<page_number>` should be filtered out
* files should be sorted by their page number, i.e. the trailing number in the filename
* page 10 should come after page 9
* files which do not start with a number should be filtered out
* i.e. the file should match the regular expression `^\d+`
* if the match is successful, return `True` or a truthy value
* since a failed match yields `None`, which is falsy, we can just return the match object itself:
import re
def my_filter(val):
    # a match object is truthy; None (no match) is falsy
    match = re.search(r'^\d+', val)
    return match
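As a quick check of the truthiness argument, the sketch below applies a digit-prefix filter to a handful of made-up filenames. The sample names are purely illustrative, and `re.search` is assumed as the matching call.

```python
import re

def my_filter(val):
    # re.search returns a match object (truthy) or None (falsy)
    return re.search(r'^\d+', val)

# hypothetical sample names, for illustration only
names = ['20_Ms_229_1.xml', 'schema.rng', '20_Ms_229_10.xml', 'README.md']

# filter() keeps only the items for which my_filter returns a truthy value
kept = list(filter(my_filter, names))
print(kept)
```

Because `filter()` returns a lazy iterator in Python 3, it is wrapped in `list()` here to materialise the result.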
**Apply the filter function**
* define a filter method: `my_filter`
* define a sorting method: `my_sort`
* both return a specialised function (hence, **functional programming**) which does the actual filtering or sorting
* use a `sorted()` function (leave file list untouched)
* inside the `sorted()` function, place the `filter()` function
* both `sorted()` and `filter()` can take our pre-defined functions `my_sort` and `my_filter` as arguments.
* the `filter` function takes two arguments:
* our defined `my_filter` function
* the list of files
* this is called **functional programming**
import os
import re
selected_files = []
for root, dirs, files in os.walk('.'):
    selected_files += filter(my_filter, files)
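The loop above collects bare filenames, so the directory each file came from is lost. If the root is needed later (for example, to sort by it), one possible variation is to store `(root, filename)` pairs. The following is a minimal sketch under that assumption; `starts_with_number`, `collect` and the temporary directory are hypothetical names introduced only for this example.

```python
import os
import re
import tempfile

def starts_with_number(filename):
    # truthy match object if the name begins with a digit, else None
    return re.search(r'^\d+', filename)

def collect(top):
    # walk the tree and keep (root, filename) pairs for matching files
    selected = []
    for root, dirs, files in os.walk(top):
        selected += [(root, name) for name in filter(starts_with_number, files)]
    return selected

# demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as top:
    for name in ('20_Ms_229_1.xml', 'notes.txt'):
        open(os.path.join(top, name), 'w').close()
    found = collect(top)
    found_names = [name for root, name in found]

print(found_names)
```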
## Sorting: the `sorted` function
**Input** is the same file list as above, but we would also like to do a more complex sort:
* first, sort by root
* then, sort by filename
* output: full file path
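The requirements listed above can be sketched with a key function that returns a tuple, since Python compares tuples element by element. This is only an illustration: the name `by_root_then_name` and the sample pairs are assumptions, not part of the commit.

```python
import os

def by_root_then_name(root_and_filename):
    # unpack the pair and return a tuple: root is compared first, filename second
    root, filename = root_and_filename
    return (root, filename)

# hypothetical (root, filename) pairs
pairs = [('b', '2.xml'), ('a', '3.xml'), ('a', '1.xml')]

# sorted() leaves `pairs` untouched and returns a new list
paths = [os.path.join(root, name)
         for root, name in sorted(pairs, key=by_root_then_name)]
print(paths)
```

Note that filenames are compared as strings here, so `10.xml` would sort before `9.xml`; a numeric key is needed if page order matters.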
def my_sort(coll):
    def my_coll_sort(val):
        # extract the page number so that it can be compared numerically
        match = re.search(r'\d+_Ms_{}_(?P<page>\d+)'.format(coll), val)
        if match:
            return int(match.groupdict()['page'])
        return 0
    return my_coll_sort
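To illustrate why the page number is converted to `int`, the sketch below compares a plain string sort with a sort that uses the closure as `key`. The filenames are invented for the example, and `re.search` is assumed as the matching call.

```python
import re

def my_sort(coll):
    # build the pattern once per collection; re.escape guards against special characters
    pattern = re.compile(r'\d+_Ms_{}_(?P<page>\d+)'.format(re.escape(coll)))
    def my_coll_sort(val):
        match = pattern.search(val)
        # non-matching names get key 0 and therefore sort first
        return int(match.group('page')) if match else 0
    return my_coll_sort

# hypothetical file names
names = ['20_Ms_229_10.xml', '20_Ms_229_9.xml', '20_Ms_229_1.xml']

as_strings = sorted(names)                    # lexicographic: page 10 before page 9
ordered = sorted(names, key=my_sort('229'))   # numeric: page 10 after page 9
print(as_strings)
print(ordered)
```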
**define the sort function**
def my_sort(root_and_filename):
    root, filename = root_and_filename
def my_filter(coll):
    def my_coll_filter(val):
        match = re.search(r'^\d+_Ms_{}.*?xml$'.format(coll), val)
        return match
    return my_coll_filter
# later in the program
**apply the sort function: `sorted`**
collection = '229'
path = os.path.join('/Users/vermeul/semper-tei', collection)
for root, dirs, files in os.walk(path):
    for filename in sorted(
        filter(my_filter(coll=collection), files),