# Interaction Environment

As a kid of the 90s, reading about a new editor or fancy shell coming out triggers playing Inspector Gadget's tune in my head. To invest a few fun hours into exploring a new tool that might save me a split second or two in processes that I run a hundred times daily may not always pay off by the hour even in a lifetime of programming, but it is totally worth it in most cases -- at least to me. Repetitive work is tiring and error-prone, adds to cognitive workload, and shifts focus away from hard-to-tackle puzzles.

That being said, I am nowhere near the *vim* aficionados whose jaw-dropping editing speed probably takes thousands of hours to master and configure, and I would not advise overengineering the setup of your environment. As so often, the key is to find a personal middle ground between feeling totally uncomfortable outside graphical user interfaces and editing code at 100 words per minute.

![Wizard or carpenter? What degree of toolset proficiency is sufficient to become happy with a programming approach to analytics? (Source: own illustration.)](images/wizardry.jpg){fig-scap="Wizard or carpenter?"}


Most likely, investing in an initial setup of a solid editor plus some basic terminal workflows is a good starting point that can be revisited every once in a while, e.g., when we start a larger new project. Which editor you center your environment around will depend a lot on the programming language you choose. Yet, it also depends on your personal preference regarding how much investment, support and supportive features you want. The following features/criteria are likely the most influential when composing a programming environment for statistical computing:


- *code highlighting*, i.e., the use of colors to highlight the structure of your code
- *linting*, which scans the code for potential errors and displays hints in the editor
- integrated *debuggers*, which allow running parts of the code line by line and inspecting the inner workings of functions
- *multi-language support*, which is important when working in multiple programming or markup languages
- *terminal integration*, which helps to run system commands
- *git integration*, which helps to interact with git \index{version control} and do basic add, commit and push operations through the editor's GUI
- *build tools* for programs that need rendering or compilation
- *customization* through add-ins and macros


## Integrated Development Environments

While some prefer lightning-fast editors such as Sublime Text that are easy to customize but rather basic out-of-the-box, Integrated Development Environments\index{IDE} (IDEs) are the right choice for most people. In other words, it's certainly possible to move a five-person family into a new home by public transport, but it is not convenient. The same holds for (plain) text editors in programming. You can use them, but most people would prefer an IDE just like they prefer to use a truck over public transport when they move. IDEs are tailored to the needs and idiosyncrasies of a language, some working with plugins and covering multiple languages. Others have a specific focus on a single language or a group of languages.

The sections below will focus on data science's most popular editors, namely Visual Studio Code and RStudio. Still, for the reader looking for alternatives, I would like to mention at least some IDE juggernauts here: [Eclipse](https://www.eclipse.org/) (mostly Java but tons of plugins for other languages), or JetBrains' [IntelliJ](https://www.jetbrains.com/idea/) (Java) and [PyCharm](https://www.jetbrains.com/pycharm/) (Python)\index{Python}.


### RStudio

Posit's *RStudio* has become the default environment for most R people and for those who mostly use R but also work with C, Python\index{Python} or Markdown. The open source version ships for free as RStudio Desktop and RStudio Server Open Source. In addition, the creators of RStudio offer commercially supported versions of both the desktop and the server version (Posit Workbench). If you want your environment to essentially look like the environment of your peers, RStudio is a good choice. Having the same visual image in mind can be very helpful in workshops, coaching or teaching.

RStudio has four panes that the user can fill with different functionality such as the script editor, R console, terminal, file explorer, environment explorer, test suite, build suite and many others. I advise newcomers to change the default layout to have the script editor and the console next to each other (instead of below each other). That is because (at least in the Western world) we read from left to right and send source code from left to right to execute it in the console. Combine this practice with the *run selection shortcut* (cmd+enter, or ctrl+enter on a PC) and you have gained substantial efficiency compared to leaving your keyboard, reaching for your mouse and finding the right button. In addition, this workflow should allow you to see larger chunks^[Many coding conventions recommend having no more than 80 characters in one line of code. Sticking to this convention should prevent cutting off your code horizontally.] of your code as well as your output.

#### Explore Extensions {-}

- explore add-ins

- explore the API


#### Favorite Shortcuts {-}

- use cmd+enter (ctrl+enter on PCs) to send selected code from the script window (on the left) to the console (on the right)

- cmd+shift+option+R, while the cursor is within a function's body (create a Roxygen documentation skeleton)

- use ctrl+1 and ctrl+2 to move the focus to the script pane and the console, respectively


#### Pitfalls {-}

- you may stumble over RStudio's defaults, such as storing your global environment on exit and thus resurrecting long forgotten objects impacting your next experiment through, e.g., lexical scoping\index{lexical scoping}.
- RStudio's git integration abstracts git essentials away, so it hampers understanding of what's going on.


For R, the Open Source Edition of [RStudio Desktop](https://rstudio.com/products/rstudio/) is the right choice for most people.
(If you are working in a team, RStudio's server version is great. It allows having a centrally managed server which clients can use through their web browser without even installing R and RStudio locally.) RStudio has solid support for a few other languages often used together with R, plus it's customizable. The French premier thinkR [Colin Fay](https://x.com/_colinfay) gave a nice tutorial on [Hacking RStudio](https://speakerdeck.com/colinfay/hacking-rstudio-advanced-use-of-your-favorite-ide) at the useR! 2019 conference.


Back in fall 2020, long before RStudio turned into Posit, the company already indicated that data science was not about R vs. Python to them. ([Remember the first section of Chapter 3 of this book, *The Choice That Doesn't Matter*?](/programming.html#the-choice-that-doesnt-matter))

![Even back when it was still called RStudio, the company known as Posit today indicated that it was not solely about R. (Source: rstudio.com website in 2021)](images/rstudio_r_py.png){fig-scap="Posit is more than just R."}


### Visual Studio Code

Outside the R universe (and to some degree even inside it), Visual Studio Code has become the go-to editor in data science.
Microsoft's Visual Studio Code started out as a modular, general purpose IDE not focused on a single language.
By now, it offers not only great Python support but also auto-completion and code highlighting for R, Julia and many other languages.
[VS Code Live Share](https://visualstudio.microsoft.com/services/live-share/) is just one rather random example of its remarkably well-implemented features. Live Share allows developers to edit a document simultaneously using multiple cursors, similar to Google Docs, but with all the IDE magic in a desktop client.


### Editors on Steroids

Another approach is to go for a highly customizable editor such as Sublime^[https://www.sublimetext.com/] or Atom^[https://atom.io/].
The idea is to send source code from the editor to interchangeable read-eval-print loops ([REPLs](#glossary)),
which can be adapted according to the language that needs to be interpreted. That way a good linter/code highlighter for your favorite language is all you need to have a lightweight environment to run things. An example of such a customization approach is Christoph Sax's small project Sublime Studio^[https://github.com/christophsax/SublimeStudio].

For readers with Unix experience, vim^[https://www.vim.org/] may just be the most ubiquitous editor of them all; to everyone else, it may just live in deep nerd territory.
The fact that a quick online search for "How to quit vim" came back with more than 56.4 million (!) results shows that both perspectives have a point.
With vim, users can switch between a *move around* and *insert* mode.
The former allows the users to use single letters as shortcuts to navigate a text file instead of typing the actual letter.
This enables users to do things like move-three-words-forward or delete the next three lines and many other more complex things.
Given some practice and regular use, it is easy to imagine that vim wizards can navigate their source code incredibly quickly.
Back when IDEs were less comfortable and often clunky due to their heavy lifting, a broader group of people had their incentives to invest into vim.
Now that IDEs have become so much better and many of them even offer vim modes or plugins, the point of contact with vim for a data scientist is mostly Unix server administration or work inside containers\index{container}.
GNU Emacs^[https://www.gnu.org/software/emacs/] is another noteworthy editor because, even though much more exotic, it is popular among long-tenured R folks thanks to the Emacs Speaks Statistics extension.


## Notebooks

Notebooks are yet another popular choice for a home court among data scientists and analysts.
Though most often associated with the Python world\index{Jupyter}, notebooks are language-agnostic and also common for the R and Julia languages.
The basic idea of notebooks is to run a web server in the background to present a browser-based frontend to the developer.
The difference from the Markdown rendering approach described in [Chapter 10 about publishing](publishing.html) is that the browser is not just used to display the result, but is also the environment in which commands are added interactively.
The resulting workflow, e.g., a data analysis in the making, feels like a social media timeline: execute a command, receive a result printed on the web page below the command, and get a new prompt that awaits the next command.
This way we get an endless scroll of commands, analysis and descriptive text in between.

Consider the following example, drawing a basketball court using the *matplotlib* Python package.
This example uses screenshots of a notebook\index{Jupyter} to illustrate the difference between a notebook and concepts like RMarkdown. Note the prompts in between the results!

![](images/notebook-1.png)

The above figure depicts a Markdown element in editing mode, showing the Markdown syntax (## for a level-2 section header).
The second element is a chunk of Python\index{Python} code to import two well-known Python libraries.

![](images/notebook-2.png)

After the function definition, we see another Markdown section header element that says *Call the Function* -- this time already rendered to HTML.
Finally, we call the Python\index{Python} function defined above in another code chunk element^[Though notebooks\index{Jupyter} were originally designed to run on a (local) Python-based web server and to be used in a web browser, there is a neat, Electron-based standalone app called JupyterLab Desktop\index{Jupyter}.
I have used this app for the illustration in this book because of its slim, no-nonsense interface that, unlike browsers, comes without distractions from plugins.].


<!--
# Python code to draw an elementary NBA court,
# omitting a few details.
# import matplotlib as mpl
# import matplotlib.pyplot as plt


# def draw_nba_court():
#     # from wikipedia https://en.wikipedia.org/wiki/Basketball_court
#     # basketball half court
#     # 47' 50' => map it to: 470, 500 axes => 1' = 10
#     fig = plt.figure(figsize=(4.7, 5))
#     ax = fig.add_axes([0, 0, 1, 1])
#     ax.set_xlim(-250, 250)
#     ax.set_ylim(0, 470)
#     color = 'black'

#     # 3pt line, corner threes are 22', arc threes are 23' 9'' from the hoop
#     ax.plot([-220, -220], [0, 140], linewidth=2, color = color)
#     ax.plot([220, 220], [0, 140], linewidth=2, color = color)
#     ax.add_artist(mpl.patches.Arc((0, 140), 440, 315, theta1=0,
#      theta2=180, facecolor='none', edgecolor=color, lw=2))

#     # hoop (backboard 6' wide)
#     ax.add_artist(mpl.patches.Circle((0, 60), 10,
#      facecolor='none', edgecolor=color, lw=2))
#     ax.plot([-30, 30], [40, 40], linewidth=2, color=color)

#     ax.plot([-80, -80], [0, 190], linewidth=2, color=color)
#     ax.plot([80, 80], [0, 190], linewidth=2, color=color)
#     ax.plot([-80, 80], [190, 190], linewidth=2, color=color)
#     ax.add_artist(mpl.patches.Circle((0, 190), 60,
#      facecolor='none', edgecolor=color, lw=2))

#     # Remove all ticks / labels
#     ax.set_xticks([])
#     ax.set_yticks([])

#     return ax

# draw_nba_court()

-->


## Console/Terminal

In addition to the editor with which you will spend most of your time, it is also worthwhile to put some effort into configuring keyboard-driven interaction with your operating system. And again, you do not need to be a terminal virtuoso to easily outperform mouse pushers; a terminal carpentry level is enough. Terminals come in all shades of gray: integrated into IDEs, built into our operating system or as pimped third-party applications. A program called a *shell* runs inside the terminal application to run commands and display output. *bash* is probably the best-known shell program, but there are tons of different flavors. FWIW, I love the *fish*^[https://fishshell.com/] shell (combined with *iterm2*) for its smooth auto-completion, highlighting and its extras.

In the Windows world, the use of terminal applications has been much less common than on macOS/Linux -- at least among people who are not system administrators. Git Bash, which ships with git \index{version control} installations on Windows, mitigates this shortcoming, as it provides a basic Unix-style console for Windows. For a full-fledged Unix terminal experience, I suggest using a full terminal emulator like Cygwin. More recent, native approaches like PowerShell brought the power of keyboard interaction at the OS level to a broader Windows audience, albeit with a different, Windows-specific syntax. The ideas and examples in this book, such as the commands in the table below, are limited to Unix-flavored shells.

| Command | What it does |
|---------|--------------|
| ls      | list files and directories in a directory |
| cd      | change directory |
| mv      | move files (also works to rename files) |
| cp      | copy files |
| mkdir   | make directory |
| rmdir   | remove (empty) directory |
| rm -rf  | (!) delete everything recursively. DANGER: the shell does not ask for confirmation and just wipes out everything. |
: Basic Unix Terminal Commands
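
To get a feel for these commands, the following sketch walks through them in a throwaway directory. All directory and file names (`playground`, `notes.txt` and so on) are made up for illustration, and `mktemp -d` is only used so that nothing outside a temporary folder gets touched.

```sh
# Work in a throwaway directory so nothing important can get lost.
# All names below are hypothetical and only serve as illustration.
scratch=$(mktemp -d)
cd "$scratch"

mkdir playground              # make a directory
cd playground                 # change into it
echo "hello" > notes.txt      # create a small file to play with
cp notes.txt backup.txt       # copy the file
mv notes.txt renamed.txt      # moving within the same directory = renaming
ls                            # lists backup.txt and renamed.txt

mkdir empty_dir
rmdir empty_dir               # rmdir only removes empty directories

cd "$scratch"
rm -rf playground             # wipes the directory and its content, no questions asked
ls                            # prints nothing, playground is gone
```

Running the destructive `rm -rf` inside a scratch directory first is a cheap way to build a habit of double-checking the path before hitting enter.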


### Remote Connections SSH, SCP

One of the most important use cases of the console for data analysts is the ability to log into other machines, namely servers that most often run on Linux. Typically, we use the SSH\index{SSH} protocol to connect to a (remote) machine that accepts connections on port 22.

:::{.callout-note}
Note that firewalls sometimes limit access to ports other than those needed for surfing the web (80 for http:// and 443 for https://), so you may only be able to reach port 22 from inside an organization's VPN.
:::


To connect to a server using a username and password, simply use your console's SSH\index{SSH} client like this:

```sh
ssh mbannert@someserver.org
```

You will often encounter another login procedure, though. SSH\index{SSH} key pair authentication is more secure and therefore preferred by many organizations. You need to make sure the public part of your key pair is located on the remote server and hand the private key file to the ssh command:

```sh
ssh -i ~/.ssh/id_rsa mbannert@someserver.org
```
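
If you do not have a key pair yet, `ssh-keygen` can create one. The sketch below is a minimal illustration rather than a security recommendation: it writes a demo key without a passphrase into a temporary directory (an assumption made so that the example does not touch `~/.ssh`), and the file name `id_demo` as well as the key comment are hypothetical. In practice, you would typically keep the key in `~/.ssh` and protect it with a passphrase.

```sh
# Create a demo key pair in a temporary directory (hypothetical location).
key_dir=$(mktemp -d)

# -t: key type, -f: output file, -N: passphrase (empty here, for the demo only),
# -C: a comment to recognize the key later, -q: quiet
ssh-keygen -t ed25519 -f "$key_dir/id_demo" -N "" -C "demo key" -q

ls "$key_dir"    # id_demo (private key, keep it secret) and id_demo.pub (public key)

# The public part is what goes onto the server, e.g., via ssh-copy-id
# (not run here because it needs a reachable server):
# ssh-copy-id -i "$key_dir/id_demo.pub" mbannert@someserver.org
```

The private file never leaves your machine; only the `.pub` file is shared with the server.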

For more details on SSH\index{SSH} key pair authentication, check this example from the case studies in Chapter 10 @sec-rsa.

While SSH\index{SSH} is designed to log in to a remote server and, from then on, issue commands as if the server were a local Linux machine, `scp` copies files from one machine to another.

```sh
# the trailing backslash continues the command on the next line
scp -i ~/.ssh/id_rsa ~/Desktop/a_file_on_my_desktop \
  mbannert@someserver.org:/some/remote/location/
```

The above command copies a file dwelling on the user's desktop into a /some/remote/location on the server. Just like SSH\index{SSH}, secure copy (scp) can use SSH\index{SSH} key pair authentication, too.


### Git Through the Console

Another common use case of the terminal is managing git \index{version control}. Admittedly, there is git integration for many IDEs that allows you to point and click your way to commits, pushes and pulls, as well as dedicated git clients like GitHub\index{GitHub} Desktop or Sourcetree. But there is nothing like the console in terms of speed and understanding what you really do. @sec-git sketches an idea of how to operate git through the console from the very basics to a feature branch-based workflow.
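
To give a first impression of what this looks like, here is a minimal sketch of the add-commit cycle in a fresh repository. The repository location, the file name, the commit message and the identity passed via `-c` are placeholders chosen for this illustration; pushing is only hinted at in a comment because it requires a remote that does not exist here.

```sh
# Start a repository in a throwaway directory (hypothetical example project).
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "# My Analysis" > README.md
git status --short            # shows README.md as untracked (??)

git add README.md             # stage the file
# -c sets name and e-mail for this one command only (placeholder identity)
git -c user.name="Jane Doe" -c user.email="jane@example.org" \
  commit -q -m "add README"

git log --oneline             # one commit so far

# With a remote configured, the commit would be published via:
# git push
```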