🏡 penguins.nim

This nimib example document shows how to insert images in nimib documents.

🐧 Exploring penguins with ggplotnim

We will explore the palmer penguins dataset with ggplotnim and datamancer.

import datamancer

We read the penguins CSV into a Datamancer DataFrame:

let df = readCsv("data/penguins.csv")

Let us see how it looks:

echo df
DataFrame with 7 columns and 344 rows:
     Idx              species               island     culmen_length_mm      culmen_depth_mm    flipper_length_mm          body_mass_g                  sex
  dtype:               string               string               object               object               object               object               string
       0               Adelie            Torgersen                 39.1                 18.7                  181                 3750                 MALE
       1               Adelie            Torgersen                 39.5                 17.4                  186                 3800               FEMALE
       2               Adelie            Torgersen                 40.3                   18                  195                 3250               FEMALE
       3               Adelie            Torgersen                   NA                   NA                   NA                   NA                   NA
       4               Adelie            Torgersen                 36.7                 19.3                  193                 3450               FEMALE
       5               Adelie            Torgersen                 39.3                 20.6                  190                 3650                 MALE
       6               Adelie            Torgersen                 38.9                 17.8                  181                 3625               FEMALE
       7               Adelie            Torgersen                 39.2                 19.6                  195                 4675                 MALE
       8               Adelie            Torgersen                 34.1                 18.1                  193                 3475                   NA
       9               Adelie            Torgersen                   42                 20.2                  190                 4250                   NA
      10               Adelie            Torgersen                 37.8                 17.1                  186                 3300                   NA
      11               Adelie            Torgersen                 37.8                 17.3                  180                 3700                   NA
      12               Adelie            Torgersen                 41.1                 17.6                  182                 3200               FEMALE
      13               Adelie            Torgersen                 38.6                 21.2                  191                 3800                 MALE
      14               Adelie            Torgersen                 34.6                 21.1                  198                 4400                 MALE
      15               Adelie            Torgersen                 36.6                 17.8                  185                 3700               FEMALE
      16               Adelie            Torgersen                 38.7                   19                  195                 3450               FEMALE
      17               Adelie            Torgersen                 42.5                 20.7                  197                 4500                 MALE
      18               Adelie            Torgersen                 34.4                 18.4                  184                 3325               FEMALE
      19               Adelie            Torgersen                   46                 21.5                  194                 4200                 MALE

or even nicer using nbShow(df.head(10)):

Index  species  island     culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g  sex
       string   string     object            object           object             object       string
0      Adelie   Torgersen  39.1              18.7             181                3750         MALE
1      Adelie   Torgersen  39.5              17.4             186                3800         FEMALE
2      Adelie   Torgersen  40.3              18               195                3250         FEMALE
3      Adelie   Torgersen  NA                NA               NA                 NA           NA
4      Adelie   Torgersen  36.7              19.3             193                3450         FEMALE
5      Adelie   Torgersen  39.3              20.6             190                3650         MALE
6      Adelie   Torgersen  38.9              17.8             181                3625         FEMALE
7      Adelie   Torgersen  39.2              19.6             195                4675         MALE
8      Adelie   Torgersen  34.1              18.1             193                3475         NA
9      Adelie   Torgersen  42                20.2             190                4250         NA

Note that among the 7 columns of the dataframe, the first 2 and the last one have datatype string. The remaining 4 are numeric but they have datatype object. Why?

Because there are null values (NA) in those columns, and they are read as strings:

echo df["culmen_depth_mm", 1].kind
echo df["culmen_depth_mm", 2].kind
echo df["culmen_depth_mm", 3].kind
echo df["flipper_length_mm", 2].kind
echo df["flipper_length_mm", 3].kind
VFloat
VFloat
VString
VInt
VString
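
To make the mixed kinds above more tangible, here is a minimal sketch that uses only the Nim standard library. It is not Datamancer's actual parser: the `CellKind` enum and the `classify`, `columnKinds` and `isObjectColumn` procs are hypothetical names introduced for illustration, and the promotion of integers to floats is an assumption chosen to match the kinds printed above.

```nim
import std/[strutils, sequtils]

type CellKind = enum
  ckInt, ckFloat, ckString

proc classify(cell: string): CellKind =
  ## Kind of a single CSV cell: narrowest numeric type first, string as fallback.
  try:
    discard parseInt(cell)
    return ckInt
  except ValueError:
    discard
  try:
    discard parseFloat(cell)
    return ckFloat
  except ValueError:
    discard
  result = ckString

proc columnKinds(cells: seq[string]): seq[CellKind] =
  ## Kinds of all cells in a column; integers are promoted to float
  ## as soon as the column contains at least one float (an assumption).
  result = cells.map(classify)
  if ckFloat in result:
    for kind in result.mitems:
      if kind == ckInt:
        kind = ckFloat

proc isObjectColumn(cells: seq[string]): bool =
  ## A column counts as "object" here when its cells do not all share one kind.
  columnKinds(cells).deduplicate.len > 1

# Cells at rows 1 to 3 of the two columns inspected above.
let culmenDepth = @["17.4", "18", "NA"]
let flipperLength = @["186", "195", "NA"]

echo columnKinds(culmenDepth)
echo columnKinds(flipperLength)
echo isObjectColumn(culmenDepth)
```

A single NA cell is enough to make an otherwise numeric column heterogeneous, which is why filtering out the NA rows is needed before treating the column as numbers.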

Let's see how many penguins we have by species (and how they relate to islands) by plotting the species count per island using ggplotnim:

import ggplotnim
ggplot(df, aes("species", fill = "island")) + geom_bar() + ggsave("images/penguins_count_species.png")
Count of penguins by species

We see that 3 species of penguins (Adelie, Chinstrap, Gentoo) are reported living on 3 islands (Biscoe, Dream, Torgersen). Adelie is the most common species and is distributed over the 3 islands. Gentoo penguins (second most common) almost all live on Biscoe, and Chinstrap penguins almost all live on Dream.

We can confirm this with the following image taken from the article this dataset comes from:

Penguins by Location

We expect weight to be correlated with some of the length measures (e.g. flipper length), with males being bigger than females.

To plot this we need to remove all NA values and then classify the points both by the penguins' sex and by their species:

let df1 = df.filter(f{`body_mass_g` != "NA"}) # c"foo" == `foo` == idx("foo") (backtick quotes not usable for columns with spaces)
ggplot(df1, aes(x="body_mass_g", y="flipper_length_mm", color = "sex", shape="species")) + geom_point() + ggsave("images/penguins_mass_vs_length_with_sex.png")
INFO: The object column `body_mass_g` has been automatically determined to be continuous. To overwrite this behavior use `scale_x/y_discrete` or apply `factor` to the column name in the `aes` call.
INFO: The object column `flipper_length_mm` has been automatically determined to be continuous. To overwrite this behavior use `scale_x/y_discrete` or apply `factor` to the column name in the `aes` call.
Penguins' mass vs flipper length (colored by sex)

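A few things to remark:

- as expected, body mass and flipper length are linearly correlated
- males are in general bigger than females, but there appear to be 2 groups, possibly related to species
- we have some more NAs (and one '.') in the sex column (even after filtering for NAs in numeric columns)
- we can see that sizes of Adelie and Chinstrap overlap, while Gentoo penguins are in general bigger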

As a final plot, I would like to (partly, confidence bands are missing) reproduce a plot that "shows" the presence of Simpson's paradox in this dataset, as reported by this tweet:

ggplot(df1, aes(x="culmen_depth_mm", y="body_mass_g")) +
  geom_point(aes = aes(color = "species")) +      # point...
  geom_smooth(aes = aes(color = "species"),       # and smooth classified by species
              smoother = "poly", polyOrder = 1) + # polynomial order 1 == line
  geom_smooth(smoother = "poly", polyOrder = 1) + # and smooth without classification by species
  ggsave("images/penguins_simpson.png")
INFO: The object column `culmen_depth_mm` has been automatically determined to be continuous. To overwrite this behavior use `scale_x/y_discrete` or apply `factor` to the column name in the `aes` call.
INFO: The object column `body_mass_g` has been automatically determined to be continuous. To overwrite this behavior use `scale_x/y_discrete` or apply `factor` to the column name in the `aes` call.
Simpson's paradox in Penguins' dataset: for every species bigger mass is correlated with thicker bill,
but looking at all species taken together bigger mass is correlated with thinner bill

We can see that Gentoo penguins, although in general bigger, have a "thinner" bill (see image below for meaning of bill and culmen)!

Penguin's bill and culmen explained (Artwork by @allison_horst)
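
To make the Simpson's paradox plot above concrete, here is a minimal sketch using only the Nim standard library. The numbers are invented toy values, not taken from penguins.csv, and the `slope` proc is a hypothetical helper that computes an ordinary least squares slope (the same kind of straight-line fit requested with `polyOrder = 1`, though not necessarily computed the same way ggplotnim does it).

```nim
import std/math

proc slope(xs, ys: seq[float]): float =
  ## Ordinary least squares slope of ys regressed on xs.
  let meanX = sum(xs) / xs.len.float
  let meanY = sum(ys) / ys.len.float
  var num = 0.0
  var den = 0.0
  for i in 0 ..< xs.len:
    num += (xs[i] - meanX) * (ys[i] - meanY)
    den += (xs[i] - meanX) * (xs[i] - meanX)
  result = num / den

# Invented toy numbers, NOT taken from penguins.csv.
# Group A: thicker bills, lighter birds. Group B: thinner bills, heavier birds.
let depthA = @[17.0, 18.0, 19.0]
let massA  = @[3400.0, 3700.0, 4000.0]
let depthB = @[14.0, 15.0, 16.0]
let massB  = @[4700.0, 5000.0, 5300.0]

echo "group A slope: ", slope(depthA, massA)
echo "group B slope: ", slope(depthB, massB)
echo "pooled slope:  ", slope(depthA & depthB, massA & massB)
```

Within each toy group the slope is positive, while the pooled slope is negative, because the group with thinner bills is also the heavier one. This is the same pattern as the per-species and overall fits in the plot.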
import nimib, nimoji

nbInit
nbText: """
> This nimib example document shows how to insert images in nimib documents.
"""
nbText: """# :penguin: Exploring penguins with ggplotnim

We will explore the [palmer penguins dataset](https://github.com/allisonhorst/palmerpenguins)
with [ggplotnim](https://github.com/Vindaar/ggplotnim) and [datamancer](https://github.com/SciNim/Datamancer).
""".emojize

nbCode:
  import datamancer
nbText: "We read the penguins CSV into a Datamancer `DataFrame`:"
nbCode:
  let df = readCsv("data/penguins.csv")
nbText: "Let us see how it looks:"
nbCode:
  echo df
nbText: "or even nicer using `nbShow(df.head(10))`:"
nbShow(df.head(10))

nbText: """Note that among the 7 columns of the dataframe, the first 2 and the last one have datatype string.
The remaining 4 are numeric but they have datatype object. Why?

Because there are null values (`NA`) in those columns, and they are read as strings:"""
nbCode:
  echo df["culmen_depth_mm", 1].kind
  echo df["culmen_depth_mm", 2].kind
  echo df["culmen_depth_mm", 3].kind
  echo df["flipper_length_mm", 2].kind
  echo df["flipper_length_mm", 3].kind



nbText: """Let's see how many penguins we have by species (and how they relate to islands)
by plotting the species count per island using ggplotnim:"""
nbCode:
  import ggplotnim
  ggplot(df, aes("species", fill = "island")) + geom_bar() + ggsave("images/penguins_count_species.png")
nbImage(url="images/penguins_count_species.png", caption="Count of penguins by species")
nbText: """We see that 3 species of penguins (Adelie, Chinstrap, Gentoo)
are reported living on 3 islands (Biscoe, Dream, Torgersen).
Adelie is the most common species and is distributed over the 3 islands.
Gentoo penguins (second most common) almost all live on Biscoe,
and Chinstrap penguins almost all live on Dream.

We can confirm this with the following image taken from the
[article](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081)
this dataset comes from:
"""

nbImage(url="images/penguins_map.png",
        caption="Penguins by Location")

nbText: """We expect weight to be correlated with some of the length measures
(e.g. flipper length), with males being bigger than females.

To plot this we need to remove all `NA` values and then classify the points both by the penguins' sex
and by their species:
"""
nbCode:
  let df1 = df.filter(f{`body_mass_g` != "NA"}) # c"foo" == `foo` == idx("foo") (backtick quotes not usable for columns with spaces)
  ggplot(df1, aes(x="body_mass_g", y="flipper_length_mm", color = "sex", shape="species")) + geom_point() + ggsave("images/penguins_mass_vs_length_with_sex.png")
nbImage(url="images/penguins_mass_vs_length_with_sex.png", caption="Penguins' mass vs flipper length (colored by sex)")
nbText: """A few things to remark:

- as expected, body mass and flipper length are linearly correlated
- males are in general bigger than females, but there appear to be 2 groups, possibly related to species
- we have some more NAs (and one '.') in the sex column (even after filtering for NAs in numeric columns)
- we can see that sizes of Adelie and Chinstrap overlap, while Gentoo penguins are in general bigger

As a final plot, I would like to (partly, confidence bands are missing) reproduce a plot that "shows" the presence of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) in this dataset,
as reported by [this tweet](https://twitter.com/andrewheiss/status/1301166792627421186):
"""
nbCode:
  ggplot(df1, aes(x="culmen_depth_mm", y="body_mass_g")) +
    geom_point(aes = aes(color = "species")) +      # point...
    geom_smooth(aes = aes(color = "species"),       # and smooth classified by species
                smoother = "poly", polyOrder = 1) + # polynomial order 1 == line
    geom_smooth(smoother = "poly", polyOrder = 1) + # and smooth without classification by species
    ggsave("images/penguins_simpson.png")
nbImage(url="images/penguins_simpson.png", caption="""
Simpson's paradox in Penguins' dataset: for every species bigger mass is correlated with thicker bill,
but looking at all species taken together bigger mass is correlated with thinner bill""")
nbText: """
We can see that Gentoo penguins, although in general bigger, have a "thinner" bill (see image below for meaning of bill and culmen)!
"""
nbImage(url="https://pbs.twimg.com/media/Eg6sRJ1XcAAkxiu?format=jpg&name=4096x4096", caption="""
Penguin's bill and culmen explained (Artwork by @allison_horst)""")
nbSave