Good coding practices

class: center, middle, inverse, title-slide

# Good coding practices
## Developing skills in R
### Malie Lessard-Therrien
### November 13, 2018

---

name: intro
class: spaced

##Good coding practices

***

> "Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do." - Donald E. Knuth, *Stanford University*

.center[<img src="good_coding_figures/DonaldKnuth.jpg" style="max-width:200px;">]

???
Donald E. Knuth: Professor Emeritus at Stanford University. In 1962, he embarked on a monumental seven-volume work titled The Art of Computer Programming, which now occupies him full time. The four volumes released to date are available in nine languages with over one million copies printed. An R script isn't just telling the computer how to perform calculations on your data. It is also explaining your work to other human beings.

---
name: intro1
## Why write nicer code?

***
 
- easy to read
- easy to write 
- runs fast
- gives reliable results
- easy to reuse in new projects
- easy to share with collaborators

???
Mathilda, a colleague at DEEP: "My code is so messy, I feel it's very personal, it's like my dirty laundry, I don't want to share it with anybody." Well, what if your laundry were clean and neatly folded, would you mind as much? It's just like your code: if it is nicely written and tidy, it will be easy to share with collaborators.

---
name: intro2
## Why write nicer code?

***
 
- easy to read
- easy to write 
- runs fast
- gives reliable results
- easy to reuse in new projects
- easy to share with collaborators
 
> "The single biggest reason you should write nice code is so that your future
self can understand it." - Greg Wilson *Software Carpentry Course*

.center[![Greg Wilson](good_coding_figures/GregWilson.jpg)]

???
Greg Wilson: co-founder of Software Carpentry, a non-profit organization that teaches basic computing skills to researchers.

---
name:intro2
##Benefits of writing nicer code
***
 
- Better science 
 
- More fun 
 
- Become more efficient 
 
- Future employment

???
- Better science: nice code allows you to handle bigger data sets and has fewer bugs.
- More fun: spend less time wrestling with R, and more time working with data.
- Become more efficient: Nice code is reusable, sharable, and quicker to run.
- Future employment: Professional data analysts take clarity very seriously. You should consider anything you write (open or closed) to be a potential advertisement to a future employer. Code has impact. Scientists with analytical skills are often sought-after in the natural sciences.

---
name:intro3
##General guidelines

- Organise your project and related materials
- Make different chunks for each task
- Care for your code by:
  + using comments
  + having meaningful variable or function names
  + applying a consistent style
- Use version control

???
Use version control: one of the many reasons for using version control is that it archives older versions of your code, so you can safely delete old files. This reduces disk usage and clutter, and improves readability.

---
name: good coding practices1
##Good coding practices

- Start your script with a header
    + name of project
    + short description
    + your name
    + the date you started the script

```r
#####################################
# title: "Population dynamics of Lathyrus vernus project"
# subtitle: "Data cleaning"
# author: "Malie Lessard-Therrien"
# date: "02/10/2018"
#####################################
```

---
name: good coding practices2
##Good coding practices

- Start your script with a header
    + name of project
    + short description
    + your name
    + the date you started the script

- Then libraries

```r
library(xaringan)
library(tidyverse)
library(ggmap)  # geocode()
```

???
Load packages at the beginning. If it's fairly self-evident why the package is needed, at least to you, just load and continue. If it's a specialty package, then remind yourself which function(s) you are going to use; this way, you build your own knowledge of the libraries out there.

---
name: Notation and naming1
##Notation and naming

A syntactically valid name:

- Consists of:
    + letters: `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`
    + digits: `0123456789`
    + period: `.`
    + underscore: `_`
    
- begins with a letter, or with a period (`.`) that is not followed by a digit
- **BUT:** variable names beginning with a period are hidden. Ex: `.my_secret_variable` will not be listed by `ls()` but can still be accessed

- cannot be one of the *reserved words*: `if`, `else`, `repeat`, `while`, `function`, `for`, `in`, `next`, `break`, `TRUE`, `FALSE`, `NULL`, `Inf`, `NaN`, `NA`, `NA_integer_`, `NA_real_`, `NA_complex_`, `NA_character_`

- should also not be: `c`, `q`, `t`, `C`, `D`, `I`. These names are syntactically valid, but they already belong to built-in functions that your variable would mask
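
The rules above can be checked from the console. The sketch below uses base R only; the candidate names and the `env` environment are made up for illustration. It relies on the fact that `make.names()` rewrites any invalid name, so a name is valid when `make.names()` leaves it unchanged.

```r
# Hypothetical candidate names, some valid and some not
candidates <- c("my_var", "2nd_sample", ".hidden", "if", "weight.kg")

# A name is syntactically valid if make.names() does not need to change it
is_valid <- candidates == make.names(candidates)
data.frame(name  = candidates,
           valid = is_valid,
           fixed = make.names(candidates))

# Names starting with a period are hidden from ls() unless all.names = TRUE.
# A separate environment is used here to keep the workspace untouched.
env <- new.env()
assign(".my_secret_variable", 42, envir = env)
ls(env)                    # hidden name is not listed
ls(env, all.names = TRUE)  # hidden name is listed
```

Here `2nd_sample` (starts with a digit) and `if` (reserved word) are flagged as invalid, while the other three pass.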

---
name: Notation and naming2
##Customary variable names

Also, a number of variable names are traditionally used for particular kinds of variables:

- `usr` -- user,
- `pwd` -- password,
- `x`, `y`, `z` -- vectors,
- `w` -- weights,
- `f`, `g` -- functions,
- `n` -- number of rows,
- `p` -- number of columns,
- `i`, `j`, `k` -- indexes,
- `df` -- data frame,
- `cnt` -- counter,
- `M`, `N`, `W` -- matrices,
- `tmp` -- temporary variables

Sometimes these are domain-specific:

- `p`, `q` -- allele frequencies in genetics,
- `N`, `k` -- number of trials and number of successes in stats

Avoid using these names for anything else, to prevent confusion.

---
name: Notation and naming3
##Notation and naming

Different notation styles:
- `snake_notation_looks_like_this`
- `camelNotationLooksLikeThis`
- `period.notation.looks.like.this` (caution: S3 also uses a period to separate a generic function from its class, e.g. `print.data.frame`)
- `LousyNotation_looks.likeThis`

Whichever style you choose:
- use meaningful names, e.g. `genotypes` vs. `fsjht45jkhsdf4`
- be concise, e.g. `weight` vs. `phenotype_weight_measured`
- be consistent across your code, use the same naming convention

???
Variable and function names should be lowercase. Use `_` to separate words within a name. Generally, variable names should be nouns and function names should be verbs. Strive for concise but meaningful names (this is not easy!)

---
name: syntax1
##Syntax

Spacing

- Place spaces around all binary operators (=, +, -, <-, etc.)

```r
# Good
average <- mean(speed / 12 + dist, na.rm = TRUE)
# Bad
average<-mean(speed/12+dist,na.rm=TRUE)
```

---
name: syntax2
##Syntax

Spacing

- place spaces around all binary operators (=, +, -, <-, etc.)

```r
# Good
average <- mean(speed / 12 + dist, na.rm = TRUE)
# Bad
average<-mean(speed/12+dist,na.rm=TRUE)
```

- use brackets and commas as in written English
- place a space before a left parenthesis, except in a function call

```r
# Good
if (debug)
plot(graph1)
diamonds[5, ]

# Bad
if ( debug )  # Don't add spaces around debug
plot (graph1)  # Don't add space before left parenthesis in a function call
x[1,]  # Add a space after the comma
x[1 ,]  # Add a space after the comma, not before
```

???
The square brackets are used to subset vectors and data frames.
Indexing: isolate particular entries that meet some criteria.

---
name: syntax3
##Syntax

Spacing

- Extra spacing (i.e., more than one space in a row) is acceptable if it improves alignment of equals signs or arrows (<-).

```r
plot(x    = x.coord,
     y    = y.coord,
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))
```

---
name: syntax4
##Syntax

Curly Braces

- An opening curly brace should never go on its own line.
- A closing curly brace should always go on its own line.
- Exception: an `else` statement should always be surrounded by curly braces on the same line (`} else {`).

```r
# Good
if (is.null(ylim)) {
  ylim <- c(0, 0.06)
}
if (condition) {
  # one or more lines
} else {
  # one or more lines
}
# Bad
if (is.null(ylim)) {ylim <- c(0, 0.06)}  # put the closing brace on its own line
if (condition) {
  # one or more lines
}
else {  # "else" belongs on the same line as the closing brace: } else {
  # one or more lines
}
```
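
The `else` rule is more than a matter of taste. At the top level of a script, R considers the `if` statement finished once the line with the closing brace ends, so an `else` on the next line is a syntax error. A minimal sketch using base R only (the helper name `is_valid_syntax` is made up for this example):

```r
# The same statement written both ways, stored as text so that R can try
# to parse each version without stopping the script
good_code <- "if (TRUE) {\n  1\n} else {\n  2\n}"
bad_code  <- "if (TRUE) {\n  1\n}\nelse {\n  2\n}"

# Hypothetical helper: TRUE if the text parses as valid R code
is_valid_syntax <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

is_valid_syntax(good_code)  # TRUE
is_valid_syntax(bad_code)   # FALSE, the parser reports an unexpected 'else'
```

Inside a function body or any other pair of curly braces, the second layout does parse, but keeping `} else {` on one line works everywhere.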

???
Curly braces delimit a block of code, for example the body of a function, a loop or an `if` statement.

---
name: line length1
##Line length

To stay concise and keep your code readable, limit your lines to 80 characters.

```r
# The following line displays which of your species have multiple matches in the synonyms of the dyntaxa table (that's the length(x)>1 part). Check through these to make sure that the first of the multiple answers is the right one. If not (like a hybrid or subspecies rather than the true synonym), we fix that below.
sapply(unique(obs$Species[is.na(obs$dyn.spe)]), function(x) dyn.species$species[grep(x,dyn.species$synonyms)])[sapply(sapply(unique(obs$Species[is.na(obs$dyn.spe)]), function(x) dyn.species$species[grep(x,dyn.species$synonyms)]),function(x) length(x)>1)] 
```
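
One possible way to tame a line like the one above is to give each intermediate result a name. The sketch below is an illustration, not the original analysis: the toy `obs` and `dyn.species` tables are invented stand-ins for the real data, and the names `unmatched`, `find_synonym_matches`, `matches` and `multiple_matches` are hypothetical.

```r
# Hypothetical toy data standing in for the real `obs` and `dyn.species`
obs <- data.frame(
  Species = c("Genus alpha", "Genus beta", "Genus alpha"),
  dyn.spe = NA_character_,
  stringsAsFactors = FALSE
)
dyn.species <- data.frame(
  species  = c("species_1", "species_2", "species_3"),
  synonyms = c("Genus alpha", "Genus alpha var. minor", "Genus beta"),
  stringsAsFactors = FALSE
)

# Step 1: species in obs that have no dyntaxa match yet
unmatched <- unique(obs$Species[is.na(obs$dyn.spe)])

# Step 2: for each of them, the dyntaxa species whose synonyms contain it
find_synonym_matches <- function(x) {
  dyn.species$species[grep(x, dyn.species$synonyms)]
}
matches <- sapply(unmatched, find_synonym_matches, simplify = FALSE)

# Step 3: keep only the species with more than one match
multiple_matches <- matches[lengths(matches) > 1]
multiple_matches
```

Every line now fits within 80 characters, the search is run once instead of twice, and the long explanatory comment is replaced by short comments next to the step they describe. Using `simplify = FALSE` is a small deliberate change so that the result is always a list.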

---
name: line length
##Line length

To stay concise and keep your code readable, limit your lines to 80 characters.

Set up an 80-character limit:
- In RStudio, go to

```r
Tools -> Global Options -> Code -> Display
```

- there is a checkbox option:

```r
[ ] Show margin
    Margin column [80]
```

Check this and you will see a margin drawn in the code editor at the desired column.

???
Lines up to 80 characters: This is the amount that will fit comfortably on a printed page at a reasonable size. If you find you are running out of room, this is probably an indication that you should encapsulate some of the work in a separate function or shorten your variable names.

---
name: indenting
##Indenting

When indenting your code, use spaces. Never use tabs or mix tabs and spaces.
- Use indenting when:
    * keeping to the 80-character rule

```r
## drop some observations and unused factor levels
lotrDat <-
  droplevels(subset(lotrDat,
                    !(Race %in% c("Gollum", "Ent", "Dead", "Nazgul"))))
```

* aligning code inside a block delimited by curly braces

```r
jFun <- function(x) {
  estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
  names(estCoefs) <- c("intercept", "slope")
  return(estCoefs)
}
```

???
Indent definition: start (a line of text) or position (a block of text) further from the margin than the main part of the text.
ex: "type a paragraph of text and indent the first line"
synonyms:	move to the right, move further from the margin, start in from the margin

---
name: comments
##Comments

Use comments to describe what your code is meant to do 
- Comment entire code chunks beginning with # and one space.
- Short comments inside the code can be placed after the code line preceded by two spaces, #, and then one space.

```r
# Exploring the data set iris
head(iris, n = 10)  # see the first 10 rows (default is 6 rows)
str(iris)  # see the structure of the dataset
```

```
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 'data.frame':	150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

---
name: tidyverse1
##Tidyverse

The tidyverse is a collection of R packages designed for data science. The tidyverse packages share common principles and are designed to work well together.  https://www.tidyverse.org/
- readr: Read flat files (csv, tsv, fwf) into R
- tibble: A modern re-imagining of the data frame
- tidyr: Easily tidy data with spread and gather functions
- dplyr: A grammar of data manipulation
- ggplot2: An implementation of the Grammar of Graphics in R
- reprex: Render bits of R code for sharing, e.g., on GitHub or StackOverflow
- purrr: A functional programming toolkit for R
- forcats: Tools for working with categorical variables (factors)

.center[<img src="good_coding_figures/Rbook.jpeg" style="max-width:150px;">
<img src="good_coding_figures/HadleyWickham.jpg" style="max-width:150px;">
<img src="good_coding_figures/CharlotteWickham.jpg" style="max-width:150px;">]

???
Hadley Wickham is a statistician from New Zealand who is currently Chief Scientist at RStudio and an adjunct Professor of statistics at the University of Auckland, Stanford University, and Rice University.
Charlotte Wickham is a part-time Assistant Professor of Statistics at Oregon State University, a specialist in R training and a course developer for DataCamp. She does R training and data science consulting. She loves cats, and two of the tidyverse packages have cat-related names (purrr and forcats).

---
name: tidyverse2
##Tidyverse

The tidyverse is a collection of R packages designed for data science. The tidyverse packages share common principles and are designed to work well together.  https://www.tidyverse.org/
- tidyr: Easily tidy data with spread and gather functions
- ggplot2: An implementation of the Grammar of Graphics in R
- dplyr: A grammar of data manipulation
- reprex: Render bits of R code for sharing, e.g., on GitHub or StackOverflow
- tibble: A modern re-imagining of the data frame
- readr: Read flat files (csv, tsv, fwf) into R
- purrr: A functional programming toolkit for R
- magrittr: Improve the readability of R code with the pipe

.center[<img src="good_coding_figures/pipe.jpg" alt="pipe" style="max-width:300px;">]

???
René François Ghislain Magritte was a Belgian surrealist painter. About this painting, which raised many questions, Magritte explained:
"The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it's just a representation, is it not? So if I had written under my picture 'This is a pipe', I would have been lying!" Q.E.D.

---
name: pipe1
##Pipes

Good to use when:
- several steps are needed to modify a vector or data frame
- the input and output are similar

---
name: pipe2
##Pipes

Good to use when:
- several steps are needed to modify a vector or data frame
- the input and output are similar

Example:

In your dataset, you want to:
- Step 1. change column names
- Step 2. subset specific data
- Step 3. add another column

---
name: pipe4
##Pipes

In your dataset, you want to:
- Step 1. change column names
- Step 2. subset (or filter) specific data
- Step 3. add another column
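
As a sketch of how the three steps above can be chained with the pipe, here is one possible version. It assumes the dplyr package is installed, and the choice of the built-in `iris` data, the new column names and the `sepal_ratio` column are made up for illustration.

```r
library(dplyr)  # rename(), filter(), mutate(); also makes %>% available

iris_setosa <- iris %>%
  # Step 1. change column names (new_name = old_name)
  rename(sepal_length = Sepal.Length,
         sepal_width  = Sepal.Width) %>%
  # Step 2. subset (or filter) specific data
  filter(Species == "setosa") %>%
  # Step 3. add another column
  mutate(sepal_ratio = sepal_length / sepal_width)

head(iris_setosa)
```

Each step takes a data frame and returns a data frame, so the code reads top to bottom in the order the steps are carried out, with no nested calls and no temporary variables.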