
Commit ce0421b5 authored by vermeul

Merge branch 'master' of gitlab.ethz.ch:vermeul/python-best-practices

parents 46ca9625 dee59fb7
......@@ -30,6 +30,14 @@ There are too many ways to install Python, that's why many developers have an in
[https://github.com/pyenv/pyenv](https://github.com/pyenv/pyenv)
```
$ git clone https://github.com/pyenv/pyenv.git ~/.pyenv
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
$ echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
$ echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
$ exec "$SHELL"
```
Install a new Python version:
......@@ -72,6 +80,15 @@ $ pyenv version
miniconda3-latest (set by /Users/vermeul/tmp/.python-version)
```
## Troubleshooting SSL/TLS
If you have trouble using pip or installing a new Python version, you might need to upgrade OpenSSL and then reinstall the Python version:
```
brew install 'openssl@1.1'
CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl@1.1)" pyenv install 3.7.0
```
## Know your environment!
**use `virtualenv` to separate the modules used for your project from your global installation**
......
......@@ -31,6 +31,63 @@
* valley? value? valid?
* temperature? temporary? template?
* deviation? device? develop?
## avoid synonyms if you mean the same:
* avoid a collection of methods that all display something on the screen:
    * show_prompt()
    * display_alert()
    * present_result()
    * render_plot()
    * output_state()
* use exactly the same verb if these methods all display something on the screen, rather than synonyms:
    * display_prompt()
    * display_alert()
    * display_result()
    * display_plot()
    * display_state()
* on the other hand, use distinct and non-synonymous verbs to make the difference clear:
    * print_prompt() # output to terminal
    * display_alert() # output to window manager / GUI
    * render_plot() # output to visualization pane
* commonly abused synonyms include:
    * "show" vs "display" vs "present" vs "output" vs "render"
    * "make" vs "create" vs "build" vs "generate"
    * "key" vs "identifier" vs "ID" vs "name"
    * "node" vs "item" vs "element" vs "entry"
    * "field" vs "member" vs "attribute" vs "slot"
    * "customer" vs "client" vs "user"
    * "colour" vs "color" vs "hue"
## select names from the problem domain, not the solution space
* typical solution-space names:
    * node
    * item
    * element
    * object
    * index
    * key
    * data
    * values
    * record
    * descriptor
    * list
    * tree
* treat such words as warning flags in your code
* Examples:
    * next_record() vs. next_gene_sample()
    * get_key() vs. get_message_ID()
    * read_file() vs. read_protein_samples()
* Choose names from the domain vocabulary of the problem you are solving
* Avoid names that relate only to the constructs you are using to solve it
## use grammatical templates to form identifiers
......@@ -49,6 +106,7 @@
* verb\_noun\_participle: `execute_code_using`
* verb\_adjective\_noun: `delete_previous_task`
**for variables**
* noun: `node`, `source`, `destination`
......
# Values and Expressions
## Long strings
Lines of code should not become longer than about 120 characters, but strings often get much longer. Wrap multiple adjacent string literals in round brackets `( )` to create one long string. Use the [K&R style](https://en.wikipedia.org/wiki/Indentation_style#K&R_style) (the opening bracket must stay on the first line!) for improved readability:
```python
my_long_string = (
"Donau"
"dampfschifffahrts"
"elektrizitäten"
"hauptbetriebswerk"
"bauunterbeamten"
"gesellschaft"
)
# will be magically concatenated to Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft
```
## Multiline strings
If a string has embedded newline characters, you might define it this way:
......@@ -9,7 +26,7 @@ lines = "first line\n"\
"second line\n"\
"third line"
```
However, putting all the backslashes at the end of each line makes the code noisy. Here is a better way, which allows K&R style:
However, putting all the backslashes at the end of each line makes the code noisy. Here is a better way, which also uses [K&R style](https://en.wikipedia.org/wiki/Indentation_style#K&R_style):
<strong>
......@@ -83,7 +100,7 @@ print(get_text())
</strong>
Unfortunately, this breaks the indentation, but it requires the least typing.
## Avoiding Errors when accessing values
## Errors when accessing values in arrays, dictionaries and objects
In Python, various errors might occur when you try to access a non-existing value from a data structure:
......@@ -91,11 +108,13 @@ In Python, various errors might occur when you try to access a non-existing valu
* Dictionaries: **KeyError**
* Objects: **AttributeError**
We do not want to use the annoying `try ... except` structure all the time. We rather need a safe way to access values. Unfortunately has a different approach for every object:
We do not want to use the annoying `try ... except` structure all the time. We rather need a safe way to access values. Unfortunately, Python demands a different approach for every object:
**Array**
```
Accessing an element in an array which might be missing is particularly nasty:
```python
a = []
a[7] # IndexError
a[7] if len(a) > 7 else 'nothing here' # OK
......@@ -103,8 +122,20 @@ a[7] if len(a) > 7 else 'nothing here' # OK
**Dictionary**
```
Dictionaries offer a generic `get` method to safely access an item in a dictionary:
```python
my_dict = {}  # do not name a variable `dict`: that would shadow the built-in type
my_dict['not_here']  # KeyError
my_dict.get('not_here', 'alternative value')  # OK
```
**Object**
Objects do not offer any specialized method to access internal attributes in a safe way. Instead, use the built-in `getattr` function. Make sure you offer an alternative value, otherwise an AttributeError will be raised:
```python
my_object = SomeClass()
getattr(my_object, 'non_existing_attribute_name') # raises AttributeError
getattr(my_object, 'maybe_existing_attribute_name', None) # returns the value of the attribute (or None, if it does not exist)
```
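The three access patterns above can be compared side by side. This is a minimal sketch: `get_item` is a hypothetical helper (not part of Python) that hides the `try ... except` in one place, and `SomeClass` is a stand-in class invented for illustration.

```python
def get_item(sequence, index, default=None):
    """Return sequence[index], or default if the index is out of range."""
    try:
        return sequence[index]
    except IndexError:
        return default


class SomeClass:
    """Hypothetical class with a single attribute, for illustration only."""

    def __init__(self):
        self.name = "example"


values = []
settings = {"colour": "red"}
my_object = SomeClass()

item = get_item(values, 7, "nothing here")          # list: helper with a fallback
size = settings.get("size", "alternative value")    # dict: built-in get()
name = getattr(my_object, "name", None)             # object: attribute exists
missing = getattr(my_object, "non_existing_attribute_name", None)  # falls back to None
```

Wrapping the list access in a small helper keeps the calling code as short as the `get` and `getattr` variants.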
\ No newline at end of file
......@@ -32,3 +32,104 @@ print(string[:1].upper() + string[1:])
</strong>
## Filtering: the `filter` function
**input:** a file list which needs to be filtered (and later sorted):
```
20_Ms_229_7.xml
20_Ms_229_37.xml
20_Ms_229_6.xml
20_Ms_229_29.xml
20_Ms_229_15.xml
229.xpr
20_Ms_229_17.xml
20_Ms_229_4.xml
20_Ms_229_5.xml
20_Ms_229_16.xml
semper_edition_schema_prov.rng
schema_semper_with_mathml.rng
20_Ms_229_38_verso.xml
...
```
**create the filter function**
* files which do not start with a number should be filtered out
* i.e. the file should match the regular expression `^\d+`
* if the match is successful, return a True or true-like value
* since a non-match corresponds to a False, we can just return the match itself:
```python
import re


def my_filter(val):
    match = re.search(r'^\d+', val)
    return match
```
**Apply the filter function**
* the `filter` function takes two arguments:
* our defined `my_filter` function
* the list of files
* this is called **functional programming**
```python
import os

selected_files = []
for root, dirs, files in os.walk('.'):
    selected_files += filter(my_filter, files)
```
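To try the filter without walking a directory, here is a small self-contained sketch that uses a few of the file names listed above. The function name `starts_with_number` is chosen for illustration. Note that in Python 3 `filter` returns a lazy iterator, so it is wrapped in `list()` here; the list comprehension on the last line is an equivalent alternative.

```python
import re

filenames = [
    "20_Ms_229_7.xml",
    "229.xpr",
    "semper_edition_schema_prov.rng",
    "schema_semper_with_mathml.rng",
    "20_Ms_229_38_verso.xml",
]


def starts_with_number(filename):
    """Return True if the file name begins with at least one digit."""
    return re.search(r"^\d+", filename) is not None


# filter() is lazy in Python 3, list() materializes the result
selected_files = list(filter(starts_with_number, filenames))

# the same selection written as a list comprehension
same_result = [name for name in filenames if starts_with_number(name)]
```

Be aware that `229.xpr` also starts with a number and therefore passes this filter.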
## Sorting: the `sorted` function
**Input:** the list of files, but this time we would like to apply a two-level sort:
1. sort by the leading number, e.g. `20` in `20_Ms_229_7.xml`
2. sort by the last number, e.g. `15` in `20_Ms_229_15.xml`, in descending order
**define the sort functions**
```python
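import re


def sort_by_first_number(filename):
    match = re.search(r'^(\d+)', filename)
    if match:
        return int(match.groups()[0])
    return 0  # fallback: never return None, which cannot be compared with int


def sort_by_last_number(filename):
    match = re.search(r'_(\d+)\.xml', filename)
    if match:
        return int(match.groups()[0])
    return 0  # fallback, e.g. for names such as 20_Ms_229_38_verso.xml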
```
**Note:** when doing string comparison, we just return the string (or `string.lower()` for case-insensitive sort). Because we want to compare numbers, we need to apply the `int()` function to enforce number comparison.
**apply the sort functions using `sorted`**
```python
sorted_filenames = sorted(
sorted(
filenames,
key=sort_by_last_number,
reverse=True
),
key=sort_by_first_number
)
# sorted_filenames
['10_Ms_229_29.xml',
'10_Ms_229_15.xml',
'20_Ms_229_37.xml',
'20_Ms_229_7.xml',
'20_Ms_229_6.xml',
...
]
```
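The nested `sorted` calls work because Python's sort is stable: the inner sort orders by the last number, and the outer sort keeps that order within each group of equal first numbers. As an alternative sketch, both criteria can be combined into one key function that returns a tuple; the descending order is obtained by negating the last number. The name `sort_key` and the fallback value `0` are assumptions made for this example.

```python
import re


def sort_key(filename):
    """Return (first number ascending, last number descending) as a tuple."""
    first = re.search(r"^(\d+)", filename)
    last = re.search(r"_(\d+)\.xml$", filename)
    first_number = int(first.group(1)) if first else 0
    last_number = int(last.group(1)) if last else 0
    # negate the second component to sort it in descending order
    return (first_number, -last_number)


filenames = [
    "20_Ms_229_7.xml",
    "10_Ms_229_15.xml",
    "20_Ms_229_37.xml",
    "10_Ms_229_29.xml",
    "20_Ms_229_6.xml",
]

sorted_filenames = sorted(filenames, key=sort_key)
```

The negation trick only works for numeric keys; for strings, the nested stable sort shown above remains the simpler option.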
......@@ -62,13 +62,41 @@ It is not very elegant and adds unecessary infrastructure code into your functio
See also: https://docs.python-guide.org/writing/gotchas/
## Use an empty list or a dictionary to implement a state variable
An exception to the rule above is when you need to implement a state variable. A state variable permanently stores a value from the first time a method or function is executed. Here is an example of a password store whose content can only be read via an inner method. This can be useful if you need the password later to reconnect to a server, without making it easy to get the cleartext password:
```
import inspect


class PW():
    def get_password_via_internal_method(self, *args, **kwargs):
        return self.password(*args, **kwargs)

    def password(self, password=None, pstore={}):
        if password is not None:
            pstore['password'] = password
        else:
            if inspect.stack()[1][3] == 'get_password_via_internal_method':
                return pstore.get('password')
            else:
                raise Exception("Not allowed!")


# later
pw = PW()
pw.password('very_secret')
pw.get_password_via_internal_method()  # returns the password
pw.password()  # will raise an Exception
```
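The mechanism behind this is that a default argument is created once, when the function is defined, and is then shared by all calls. A minimal sketch with a hypothetical call counter makes the persistence visible:

```python
def count_calls(_state={"calls": 0}):
    """Return how often this function has been called so far.

    The dictionary is created once at definition time, so every
    call sees and modifies the same object.
    """
    _state["calls"] += 1
    return _state["calls"]


first = count_calls()
second = count_calls()
third = count_calls()
```

The leading underscore in `_state` is only a naming convention to signal that callers should not pass this argument themselves.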
## Use docstrings to comment your code
* Classes, functions and methods should have docstrings
* it's the easiest way to make a program self-documenting
* use triple quotation marks (""" or ''') to start and end docstrings
```
```python
def complex(real=0.0, imag=0.0):
"""Form a complex number.
......
......@@ -71,4 +71,71 @@ match.groupdict()
```
</strong>
This leads to much more robust regular expressions.
\ No newline at end of file
This leads to much more robust regular expressions.
In **substitutions**, named capture groups are back-referenced as shown below. Within the regular expression itself, Python uses `(?P=the_name_of_the_captured_group)` instead:
```
\g<the_name_of_the_captured_group>
```
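A short sketch of both kinds of back-reference, using made-up group names and input strings. In Python, `\g<name>` is the syntax for the replacement string of `re.sub`, while inside a pattern a named group is referenced with `(?P=name)`.

```python
import re

# back-reference in a substitution: swap first and last name
name_pattern = re.compile(r"(?P<first_name>\w+)\s+(?P<last_name>\w+)")
swapped = name_pattern.sub(r"\g<last_name>, \g<first_name>", "Ada Lovelace")

# back-reference inside the pattern: find a word that is repeated
doubled_word = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
match = doubled_word.search("this is is a test")
repeated = match.group("word")
```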
## always use `re.X`
**Regular expressions are not easy to fix if your intention is not clear**
The re.X flag allows you to define regular expressions over multiple lines. More importantly, it allows you to add comments to every part, so the original intention is preserved. If something is wrong with the regular expression, such a commented regular expression is much easier to debug.
Who would like to debug this regular expression?
```python
regex = re.compile(r'^(?P<alias_alternative>(?P<requested_entity>experiment|collection)(\.(?P<attribute>\w+))?)(\s+(?i:AS)\s+(?P<alias>\w+))?\s*$')
```
Split the regular expression over multiple lines and add comments. Of course, you now need to specify every blank space with `\s`, but this is good practice anyway:
<strong>
```python
regex = re.compile(
    r"""^                                            # beginning of the string
    (?P<alias_alternative>                           # use first part as alias, if no alias is defined
        (?P<requested_entity>experiment|collection)  # string starts with experiment or collection
        (\.(?P<attribute>\w+))?                      # capture an optional .attribute
    )
    (                                                # capture an optional alias: entity.attribute AS alias
        \s+(?i:AS)\s+                                # ignore case of 'AS' (scoped flag, Python 3.6+)
        (?P<alias>\w+)                               # capture the alias
    )?
    \s*                                              # ignore any trailing whitespace
    $                                                # end of string
    """, re.X
)
```
</strong>
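To see the commented expression at work, here is a self-contained sketch that applies it to two sample inputs. It uses the `experiment|collection` entities of the one-line version and the scoped flag `(?i:AS)`, because recent Python versions reject a global `(?i)` in the middle of a pattern. The helper `parse` and the sample strings are assumptions made for this example.

```python
import re

regex = re.compile(
    r"""^                                            # beginning of the string
    (?P<alias_alternative>                           # fallback alias
        (?P<requested_entity>experiment|collection)  # the entity name
        (\.(?P<attribute>\w+))?                      # optional .attribute
    )
    (
        \s+(?i:AS)\s+                                # 'AS' in any case
        (?P<alias>\w+)                               # the alias
    )?
    \s*                                              # trailing whitespace
    $                                                # end of string
    """,
    re.X,
)


def parse(expression):
    """Return the named groups as a dict, or None if nothing matches."""
    match = regex.match(expression)
    if match is None:
        return None
    groups = match.groupdict()
    # fall back to the first part if no explicit alias was given
    groups["alias"] = groups["alias"] or groups["alias_alternative"]
    return groups


with_alias = parse("experiment.code as exp_code")
without_alias = parse("collection")
no_match = parse("sample.code")
```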
## `re.split` – split string with a regular expression
The example above can be rewritten by using the traditional `strip()` function to remove leading and trailing spaces, and the `re.split` function to separate attribute and alias:
```python
input_string = input_string.strip()
attribute, alias = re.split(r'\s+AS\s+', input_string, flags=re.IGNORECASE)
```
However, if no alias was defined, this will raise the following error:
```python
ValueError: not enough values to unpack (expected 2, got 1)
```
This is very unfortunate, as we often encounter real-life problems, where a split might only return one value instead of two. To solve this, you can use this trick:
<strong>
```python
input_string = input_string.strip()
attribute, *alias = re.split(r'\s+AS\s+', input_string, flags=re.IGNORECASE)
alias = alias[0] if alias else attribute
```
</strong>
In this case, our alias will be the same as the attribute, if it was not explicitly defined.
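Wrapped in a function, the trick can be checked against inputs with and without an alias. This is a sketch: the name `split_alias` and the sample strings are made up, and `maxsplit=1` is an addition that keeps everything after the first `AS` together instead of silently dropping further parts.

```python
import re


def split_alias(input_string):
    """Split 'attribute AS alias'; the alias defaults to the attribute."""
    attribute, *alias = re.split(
        r"\s+AS\s+",
        input_string.strip(),
        maxsplit=1,            # split at the first 'AS' only
        flags=re.IGNORECASE,
    )
    # the starred name collects zero or one remaining values in a list
    return attribute, (alias[0] if alias else attribute)


explicit = split_alias("experiment.code AS exp_code")
implicit = split_alias("  experiment.code  ")
lower_case = split_alias("experiment.code as exp_code")
```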
# Modules
# Modules and Packages
The best documentation on how to set up a module so that it can be published as a Python package can be found in the Python Packaging User Guide: https://packaging.python.org/tutorials/packaging-projects/
## Include documentation
**include your README.md in your setup.py**
Every project should include a `README.md` file which describes how the module should be used. The Python Package Index (PyPI) will render the content on its website if you include it in the `long_description` parameter. Do not forget the encoding and set it to `utf-8`:
<strong>
```python
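from setuptools import setup

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setup(
    name='my_famous_first_module',
    description='A module which makes everything easier.',
    long_description=long_description,
    long_description_content_type='text/markdown',  # tells PyPI to render the README as Markdown
)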
```
</strong>
## Version requirements
......
# Profiling
Whenever you think your code runs slowly: do not try to optimise it before you have actually *measured* it. This measuring is known as profiling.
## the quick and easy way: line-profiler
Detailed documentation: https://github.com/pyutils/line_profiler
```
pip install line-profiler
```
Then simply add the `@profile` decorator to the methods you suspect of running slowly. If you have no idea which method is slow, start from the top, run the profiler, then profile the method that sticks out next, and so on:
```
@profile
def my_slow_method():
    do_this()
    do_that()
    do_something_else()
```
Because the `profile` decorator is never imported, your script will not run on its own. But you can call your script with the `kernprof` command in front:
```
kernprof -vl my_slow_script.py
```
This will load the `@profile` decorator and apply it to the method before it is run. After the program exits (even due to an error), it produces a nicely formatted table with every line of code of the decorated methods/functions and the amount of time (in percent of total time and seconds). The times are not 100% accurate, but the table quickly gives you an impression of which lines of code the execution time is spent on.
Now you can start optimising your code. Make sure you have written tests before you start your changes to make sure you are not introducing new errors. Then run the code again and measure it again. Do not spend too much time on minor (e.g. <10%) optimisations, unless you really need to.
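If the script should also run without `kernprof`, a commonly used idiom is to define a do-nothing `profile` decorator as a fallback. The sketch below assumes that `kernprof -l` makes the name `profile` available as a builtin; the body of `my_slow_method` is a placeholder invented for this example.

```python
try:
    profile  # available as a builtin when the script is run via `kernprof -l`
except NameError:
    def profile(func):
        """Fallback: return the function unchanged when not profiling."""
        return func


@profile
def my_slow_method():
    # placeholder workload: sum of the squares of 0..999
    return sum(number * number for number in range(1000))


result = my_slow_method()
```

With this fallback in place, the decorators can stay in the code while you switch between normal runs and profiling runs.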