Title: | A Simple Data Science Challenge System |
---|---|
Description: | A simple data science challenge system using R Markdown and 'Dropbox' <https://www.dropbox.com/>. It requires no network configuration, does not depend on external platforms such as 'Kaggle' <https://www.kaggle.com/>, and can be easily installed on a personal computer. |
Authors: | Adrien Todeschini [aut, cre], Robin Genuer [ctb] |
Maintainer: | Adrien Todeschini <[email protected]> |
License: | GPL-2 |
Version: | 1.3.4.9000 |
Built: | 2025-03-06 03:02:09 UTC |
Source: | https://github.com/adrtod/rchallenge |
Install the R package from CRAN:
install.packages("rchallenge")
or install the latest development version from GitHub:
# install.packages("devtools")
devtools::install_github("adrtod/rchallenge")
A recent version of pandoc (>= 1.12.3) is also required. See the pandoc installation instructions for details on installing pandoc for your platform.
Install a new challenge in Dropbox/mychallenge:
setwd("~/Dropbox/mychallenge")
library(rchallenge)
new_challenge()
or, for a French version:
new_challenge(template = "fr")
You will obtain a ready-to-use challenge in the folder Dropbox/mychallenge containing:
challenge.rmd: Template R Markdown script for the webpage.
data: Directory of the data, containing the data_train and data_test datasets.
submissions: Directory of the submissions. It contains one subdirectory per team, where the team drops its submission files. The subdirectories are shared through Dropbox.
history: Directory where the submissions history is stored.
The default challenge provided is a binary classification problem on the South German Credit data set.
You can easily customize the challenge in two ways:
During the creation of the challenge: by using the options of the new_challenge function.
After the creation of the challenge: by manually replacing the data files in the data subdirectory and the baseline predictions in submissions/baseline, and by customizing the template challenge.rmd as needed.
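As a rough illustration of the first route, the sketch below passes a few of the options that appear in the new_challenge usage. The directory name and every option value are placeholders chosen for this example, and the call is skipped when the package is not installed.

```r
# Sketch: customize a challenge at creation time.
# The directory and all option values are placeholders for illustration.
challenge_dir <- file.path(tempdir(), "mychallenge")
dir.create(challenge_dir, showWarnings = FALSE)

if (requireNamespace("rchallenge", quietly = TRUE)) {
  rchallenge::new_challenge(
    path     = challenge_dir,
    template = "en",
    title    = "My challenge",
    author   = "Jane Doe",
    deadline = "2030-01-01 23:59:59"
  )
  # Inspect what was installed in the challenge folder
  print(list.files(challenge_dir))
}
```

For a real challenge, point path at a folder inside your Dropbox instead of a temporary directory.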
To complete the installation:
1. Create and share subdirectories in submissions for each team:
new_team("team_foo", "team_bar")
2. Render the HTML page:
publish()
Use the output_dir argument to change the output directory. Make sure the output HTML file is served online, e.g. using GitHub Pages.
3. Give the URL of your HTML file to the participants.
4. Refresh the webpage by repeating step 2 on a regular basis. See below for automating this step.
From now on, a fully autonomous challenge system is set up and requires no further administration. With each update, the program automatically performs the following tasks using the functions available in the package:
store_new_submissions: Reads submitted files and saves new files in the history.
print_readerr: Displays any read errors.
compute_metrics: Calculates the scores for each submission in the history.
get_best: Gets the best submission per team.
print_leaderboard: Displays the leaderboard.
plot_history: Plots a chart of score evolution per team.
plot_activity: Plots a chart of activity per team.
To automate step 4 on a Unix-like system, add the following line to your crontab using crontab -e (mind the quotes):
0 * * * * Rscript -e 'rchallenge::publish("~/Dropbox/mychallenge/challenge.rmd")'
This will render an HTML webpage every hour.
Use the output_dir argument to change the output directory.
If your challenge is hosted in a GitHub repository, you can also automate the push:
0 * * * * cd ~/Dropbox/mychallenge && Rscript -e 'rchallenge::publish()' && git commit -m "update html" index.html && git push
You might have to add the path to Rscript and pandoc at the beginning of your crontab:
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
Depending on your system or pandoc version you might also have to explicitly add the encoding option to the command:
0 * * * * Rscript -e 'rchallenge::publish("~/Dropbox/mychallenge/challenge.rmd", encoding = "utf8")'
On Windows, you can use the Task Scheduler to create a new task with a "Start a program" action and the following settings (mind the quotes):
Program/script: Rscript.exe
options: -e rchallenge::publish('~/Dropbox/mychallenge/challenge.rmd')
Credit approval (in French) by Adrien Todeschini (Bordeaux).
Spam filter (in French) by Marie Chavent (Bordeaux).
Please contact me to add yours.
Dropbox discontinued the rendering of HTML content on 3 October 2016 for Basic users and on 1 September 2017 for Pro and Business users. See https://help.dropbox.com/files-folders/share/public-folder. Alternatively, GitHub Pages provides an easy HTML publishing solution via a simple GitHub repository.
Version 1.16 of pandoc fails to fetch the Font Awesome CSS; see https://github.com/jgm/pandoc/issues/2737.
Maintainer: Adrien Todeschini [email protected]
Other contributors:
Robin Genuer [email protected] [contributor]
Useful links:
Report bugs at https://github.com/adrtod/rchallenge/issues
Compute metrics of the submissions in the history.
compute_metrics(hist_dir = "history", metrics, y_test, ind_quiz, read_fun)
hist_dir |
string. directory where the history of the submissions is stored. contains one subdirectory per team. |
metrics |
named list of functions. Each function in the list computes
a performance criterion and is defined as: |
y_test |
character or numeric vector. the test set output. |
ind_quiz |
indices of |
read_fun |
function that reads a submission file and returns a vector of predictions. |
compute_metrics returns a named list with one named member per team. Each member is a data.frame where the rows are the submission files sorted by date and the columns are:
date |
the date of the submission |
file |
the file name of the submission |
<metric name>.quiz |
the score obtained on the quiz subset |
<metric name>.test |
the score obtained on the test set |
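The following sketch shows what a metrics list and the quiz/test scoring might look like on toy vectors. The metric signature is cut off in the text above, so the (y_pred, y_test) argument order used here is an assumption, as is the choice to score the ".test" part on the samples outside ind_quiz.

```r
# Sketch only: the (y_pred, y_test) argument order is an assumption.
metrics <- list(
  error    = function(y_pred, y_test) mean(y_pred != y_test),
  accuracy = function(y_pred, y_test) mean(y_pred == y_test)
)

# Toy test labels and one team's predictions
y_test   <- c("good", "bad", "good", "good", "bad", "good")
y_pred   <- c("good", "good", "good", "bad", "bad", "good")
ind_quiz <- c(1, 2, 3)  # positions of the quiz samples within the test set

# Score every metric on the quiz subset
quiz_scores <- sapply(metrics, function(f) f(y_pred[ind_quiz], y_test[ind_quiz]))

# Score every metric on the remaining samples (assumed to form the ".test" part)
test_scores <- sapply(metrics, function(f) f(y_pred[-ind_quiz], y_test[-ind_quiz]))

print(rbind(quiz = quiz_scores, test = test_scores))
```

Keeping the metrics in a named list means the names can be reused as column prefixes, which matches the <metric name>.quiz and <metric name>.test columns described above.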
Countdown before deadline.
countdown(deadline, complete_str = intToUtf8(10004))
deadline |
POSIXct. deadline |
complete_str |
string. displayed when deadline is passed |
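To make the behaviour concrete, here is a minimal base R sketch of a countdown. The helper name countdown_sketch, its now argument and the output format are hypothetical and are not taken from the package.

```r
# Hypothetical helper, not the package's implementation.
countdown_sketch <- function(deadline, now = Sys.time(),
                             complete_str = intToUtf8(10004)) {
  secs <- as.numeric(difftime(deadline, now, units = "secs"))
  if (secs <= 0) return(complete_str)  # deadline has passed
  days  <- secs %/% 86400
  hours <- (secs %% 86400) %/% 3600
  mins  <- (secs %% 3600) %/% 60
  sprintf("%d d %02d h %02d min",
          as.integer(days), as.integer(hours), as.integer(mins))
}

now <- as.POSIXct("2025-01-01 00:00:00", tz = "UTC")
countdown_sketch(as.POSIXct("2025-01-03 06:30:00", tz = "UTC"), now = now)
# Once the deadline is in the past, the completion string is returned instead
countdown_sketch(as.POSIXct("2024-12-31 00:00:00", tz = "UTC"), now = now)
```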
Split a data.frame into training and test sets.
data_split( data = get_data("german"), varname = "credit_risk", p_test = 0.2, p_quiz = 0.5 )
data |
data.frame |
varname |
string. output variable name |
p_test |
real. proportion of samples in the test set |
p_quiz |
real. proportion of samples from the test set in the quiz set |
list with members
train |
training set with output variable |
test |
test set without output variable |
y_test |
test set output variable |
ind_quiz |
indices of quiz samples in the test set |
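The returned structure can be illustrated with a small base R re-implementation. This is a sketch under assumptions: the helper name split_sketch is hypothetical, and the rounding and sampling details may differ from what data_split actually does.

```r
# Hypothetical re-implementation for illustration; rounding and sampling
# details are assumptions, not the package's code.
split_sketch <- function(data, varname, p_test = 0.2, p_quiz = 0.5) {
  n        <- nrow(data)
  ind_test <- sort(sample(n, round(p_test * n)))            # rows held out for testing
  n_test   <- length(ind_test)
  ind_quiz <- sort(sample(n_test, round(p_quiz * n_test)))  # positions within the test set
  list(
    train    = data[-ind_test, , drop = FALSE],
    test     = data[ind_test, setdiff(names(data), varname), drop = FALSE],
    y_test   = data[ind_test, varname],
    ind_quiz = ind_quiz
  )
}

set.seed(1)
toy <- data.frame(
  x           = rnorm(100),
  credit_risk = factor(sample(c("good", "bad"), 100, replace = TRUE))
)
parts <- split_sketch(toy, "credit_risk")
sapply(parts[c("train", "test")], nrow)
```

Note that ind_quiz indexes into the test set rather than into the original data, which is how the "indices of quiz samples in the test set" description above reads.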
Data from Dr. Hans Hofmann of the University of Hamburg.
data(german)
A data.frame with 1000 rows and 21 variables.
These data have two classes for creditworthiness: Good or Bad. There are predictors related to attributes such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate as a percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people the applicant is liable to provide maintenance for, telephone, and foreign worker status.
This is a transformed version of the Statlog German Credit data set with factors instead of dummy variables, and corrected as proposed by Groemping, U. (2019).
UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/South+German+Credit http://www1.beuth-hochschule.de/FB_II/reports/Report-2019-004.pdf
Groemping, U. (2019). South German Credit Data: Correcting a Widely Used Data Set. Report 4/2019, Reports in Mathematics, Physics and Chemistry, Department II, Beuth University of Applied Sciences Berlin.
Get the best submissions per team.
get_best( history, metrics = names(metrics), test_name = "quiz", decreasing = FALSE )
history |
list of the submissions history per team as returned by |
metrics |
character vector. names of the metrics |
test_name |
string. name of the test set used: |
decreasing |
logical. Should the sort order be increasing or decreasing? Must be of length 1 or with
the same length as |
get_best returns a data.frame where the rows are teams in sorted order of performance. The best submission per team is retained. The sort is based on possibly several metrics, in the order given by the metrics argument. In case of ties on the first metric, the second metric is used to break the ties, and so on. Lastly, the date is used in case of ties. The columns are:
team |
name of the team |
n_submissions |
total number of submissions |
date |
the date of the best submission |
file |
the file name of the best submission |
<metric name>.quiz |
the score obtained on the quiz subset |
<metric name>.test |
the score obtained on the test set |
rank |
the rank of the team |
rank_diff |
the rank difference is set to 0 temporarily. |
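The tie-breaking rule can be sketched with base R's order(), which sorts on several keys in sequence. The toy data, the metric names and the assumption that an earlier date wins the final tie are all illustrative and not taken from the package.

```r
# Toy leaderboard: one row per team, already reduced to its best submission.
# Lower scores are better here, matching decreasing = FALSE.
best <- data.frame(
  team         = c("team_foo", "team_bar", "team_baz"),
  date         = as.POSIXct(c("2025-01-02", "2025-01-03", "2025-01-01"), tz = "UTC"),
  error.quiz   = c(0.20, 0.20, 0.25),
  logloss.quiz = c(0.50, 0.45, 0.40)
)

# Sort on the first metric, break ties with the second, then with the date
# (earlier date first is an assumption).
ord  <- order(best$error.quiz, best$logloss.quiz, best$date)
best <- best[ord, ]
best$rank <- seq_len(nrow(best))
best[, c("team", "error.quiz", "logloss.quiz", "rank")]
```

Here team_foo and team_bar tie on the first metric, so the second metric decides their order.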
Get dataset value.
get_data(name = "german", package = "rchallenge", envir = environment(), ...)
name |
string. name of the dataset. |
package |
string. name of the package to look in for dataset. |
envir |
the environment where the data should be loaded. |
... |
additional arguments to be passed to |
The value of the dataset
HTML code for an image.
html_img(file, width = "10px")
file |
string. image file. |
width |
string. width of display. |
Currently only supports Font Awesome icons.
icon(name)
name |
string. name of the icon. You can see a full list of options at https://fontawesome.com/icons/. |
string containing the HTML code.
Requires the Font Awesome HTML code:
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css">
rmd <- '
```{r}
library(rchallenge)
```
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css">
`r icon("fa-user")`
`r icon("fa-user fa-lg")`
`r icon("fa-user fa-2x")`
`r icon("fa-user fa-3x")`
`r icon("fa-user fa-3x fa-border")`
'
file <- tempfile()
cat(rmd, file=file)
writeLines(readLines(file))
if (rmarkdown::pandoc_available('1.12.3')) {
  rmarkdown::render(file)
}
Formatted last update date before deadline.
last_update(deadline, format = "%d %b %Y %H:%M")
deadline |
POSIXct. deadline |
format |
string. see |
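One plausible reading of "last update date before deadline" is the current time capped at the deadline. The sketch below implements that reading in base R; the helper name last_update_sketch and its now argument are hypothetical, and the package may behave differently.

```r
# Hypothetical helper: report the current time, capped at the deadline.
last_update_sketch <- function(deadline, now = Sys.time(),
                               format = "%d %b %Y %H:%M") {
  format(min(now, deadline), format = format, tz = "UTC")
}

deadline <- as.POSIXct("2025-01-01 12:00:00", tz = "UTC")

# Before the deadline: the current time is shown
last_update_sketch(deadline,
                   now = as.POSIXct("2024-12-24 08:15:00", tz = "UTC"),
                   format = "%Y-%m-%d %H:%M")

# After the deadline: the displayed date stops at the deadline
last_update_sketch(deadline,
                   now = as.POSIXct("2025-06-01 00:00:00", tz = "UTC"),
                   format = "%Y-%m-%d %H:%M")
```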
Install a new challenge.
new_challenge( path = ".", out_rmdfile = "challenge.rmd", recursive = FALSE, overwrite = recursive, quiet = FALSE, showWarnings = FALSE, template = c("en", "fr"), data_dir = "data", submissions_dir = "submissions", hist_dir = "history", install_data = TRUE, baseline = "baseline", add_baseline = install_data, clear_history = overwrite, title = "Challenge", author = "", date = "", email = "[email protected]", date_start = format(Sys.Date(), "%d %b %Y"), deadline = paste(Sys.Date() + 90, "23:59:59"), data_list = data_split(get_data("german")) )
path |
string. install path of the challenge (should be somewhere in your Dropbox). |
out_rmdfile |
string. name of the output R Markdown file. |
recursive |
logical. should elements of the path other than the last be created? see |
overwrite |
logical. should existing destination files be overwritten? see |
quiet |
logical. deactivate text output. |
showWarnings |
logical. should the warnings on failure be shown? see |
template |
string. name of the template R Markdown script to be installed.
Two choices are available: |
data_dir |
string. subdirectory of the data. |
submissions_dir |
string. subdirectory of the submissions. see |
hist_dir |
string. subdirectory of the history. see |
install_data |
logical. activate installation of the data files of the template challenge. |
baseline |
string. name of the team considered as the baseline. |
add_baseline |
logical. activate installation of baseline submission files of the template challenge. |
clear_history |
logical. activate deletion of the existing history folder. |
title |
string. title displayed on the webpage. |
author |
string. author displayed on the webpage. |
date |
string. date displayed on the webpage. |
email |
string. email of the challenge administrator. |
date_start |
string. start date of the challenge. |
deadline |
string. deadline of the challenge. |
data_list |
list with members |
The path of the created challenge is returned.
path <- tempdir()
wd <- setwd(path)
# English version
new_challenge()
# French version
new_challenge(template = "fr")
setwd(wd)
unlink(path)
Create new teams submission folders in your challenge.
new_team( ..., path = ".", submissions_dir = "submissions", quiet = FALSE, showWarnings = FALSE )
... |
strings. names of the team subdirectories. |
path |
string. root path of the challenge. see |
submissions_dir |
string. subdirectory of the submissions. see |
quiet |
logical. deactivate text output. |
showWarnings |
logical. should the warnings on failure be shown? see |
The paths of the created teams are returned.
path <- tempdir()
wd <- setwd(path)
new_challenge()
new_team("team_foo", "team_bar")
setwd(wd)
unlink(path)
Plot the density of submissions over time.
plot_activity( history, baseline = "baseline", col = 1:length(history), alpha.f = 0.7, bw = 3600 * 24, by = 4, xlab = "Date", ylab = "Submissions intensity", bty = "l", fg = "darkslategray", col.axis = fg, col.lab = fg, text.col = fg, ... )
history |
list of the submissions history per team as returned by |
baseline |
string. name of the team considered as the baseline that will not be plotted. |
col |
colors of the teams. |
alpha.f |
factor modifying the opacity alpha of colors; typically in [0,1]. |
bw |
real. the smoothing bandwidth to be used by |
by |
real. height of the interval between two teams in number of submissions. |
xlab, ylab |
axis labels. see |
bty, fg, col.axis, col.lab |
graphical parameters. see |
text.col |
the color used for the legend text. see |
... |
further parameters passed to |
NULL
The best score of each team has a bold symbol.
plot_history( history, metric, test_name = "quiz", baseline = "baseline", col = 1:length(history), pch = rep(21:25, 100), by = 0.05, xlab = "Date", ylab = "Score", bty = "l", fg = "darkslategray", col.axis = fg, col.lab = fg, text.col = fg, ... )
history |
list of the submissions history per team as returned by |
metric |
string. name of the metric considered |
test_name |
string. name of the test set used: |
baseline |
string. name of the team considered as the baseline. Its best score will be plotted as a constant and will not appear in the legend. |
col |
colors of the teams |
pch |
symbols of the teams |
by |
real. interval width of grid lines |
xlab, ylab |
axis labels. see |
bty, fg, col.axis, col.lab |
graphical parameters. see |
text.col |
the color used for the legend text. see |
... |
further parameters passed to |
NULL
Format the leaderboard in Markdown.
print_leaderboard( best, metrics = names(metrics), test_name = "quiz", digits = 3, ... )
best |
list of the best submissions per team and per metric as returned
by |
metrics |
character vector. names of the metrics to be displayed |
test_name |
string. name of the test set used: |
digits |
integer. how many significant digits are to be used for metrics. |
... |
further parameters to pass to |
print_leaderboard returns a character vector of the table source code to be used in a Markdown document.
The chunk option results='asis' must be set for the table to render.
Format read errors in Markdown.
print_readerr(read_err = list(), ...)
read_err |
list of read errors returned by |
... |
further parameters to pass to |
print_readerr returns a character vector of the table source code to be used in a Markdown document.
Render your challenge R Markdown script to a HTML page.
publish( input = "challenge.rmd", output_file = "index.html", output_dir = dirname(input), quiet = FALSE, ... )
input |
string. name of the R Markdown input file |
output_file |
string. output file. If |
output_dir |
string. output directory. Defaults to the directory of the input file. Make sure that the output HTML file will be published online. |
quiet |
logical. deactivate text output. |
... |
further arguments to pass to |
The compiled document is written into the output file, and the path of the output file is returned.
Dropbox discontinued the rendering of HTML content on 3 October 2016 for Basic users and on 1 September 2017 for Pro and Business users. See https://help.dropbox.com/fr-fr/files-folders/share/public-folder. Alternatively, GitHub Pages provides an easy HTML web publishing solution via a simple GitHub repository.
path <- tempdir()
wd <- setwd(path)
new_challenge()
outdir <- tempdir()
if (rmarkdown::pandoc_available('1.12.3')) {
  publish(output_dir = outdir, output_options = list(self_contained = FALSE))
}
unlink(outdir)
setwd(wd)
unlink(path)
These functions are defunct and no longer available.
glyphicon(...)
... |
parameters |
Defunct functions are: glyphicon
store_new_submissions copies new files from the subdirectories of submissions_dir to the respective subdirectories of hist_dir. Each team has a subdirectory. The copied files in hist_dir are prefixed with the last modification date for uniqueness. A file is considered new if the combination of its name and last modification time is not already present in hist_dir. The files must match the pattern regular expression and must not throw errors or warnings when passed to the valid_fun function.
store_new_submissions( submissions_dir = "submissions", hist_dir = "history", deadline, pattern = ".*\\.csv$", valid_fun )
submissions_dir |
string. directory of the submissions. contains one subdirectory per team |
hist_dir |
string. directory where to store the history of the submissions. contains one subdirectory per team |
deadline |
POSIXct. deadline time for submissions. The files with last modification date after the deadline are skipped. |
pattern |
string. regular expression that new submission files must match (with |
valid_fun |
function that reads a submission file and throws errors or warnings if it is not valid. |
store_new_submissions returns a named list of errors or warnings caught during the process. Each member is named after a team and is itself a list whose members are named after the files that raised an error or warning and contain the corresponding error object.
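The copy-with-prefix idea can be sketched in base R for a single team directory. The helper name store_sketch and the exact prefix format are assumptions made for this example, and the deadline and valid_fun checks are left out to keep it short.

```r
# Hypothetical sketch of the copy step for one team; not the package's code.
store_sketch <- function(team_sub_dir, team_hist_dir, pattern = ".*\\.csv$") {
  dir.create(team_hist_dir, showWarnings = FALSE, recursive = TRUE)
  files  <- list.files(team_sub_dir, pattern = pattern)
  copied <- character(0)
  for (f in files) {
    src   <- file.path(team_sub_dir, f)
    mtime <- file.info(src)$mtime
    # Prefix with the modification time so each version gets a unique name
    # (this prefix format is an assumption).
    target <- file.path(team_hist_dir,
                        paste0(format(mtime, "%Y-%m-%d_%H-%M-%S_"), f))
    if (!file.exists(target)) {  # same name and mtime already stored: not new
      file.copy(src, target)
      copied <- c(copied, basename(target))
    }
  }
  copied
}

root     <- tempfile("challenge_")
sub_dir  <- file.path(root, "submissions", "team_foo")
hist_dir <- file.path(root, "history", "team_foo")
dir.create(sub_dir, recursive = TRUE)
writeLines(c("good", "bad"), file.path(sub_dir, "pred.csv"))

first  <- store_sketch(sub_dir, hist_dir)  # pred.csv is new and gets copied
second <- store_sketch(sub_dir, hist_dir)  # unchanged file, nothing is copied
```

Because the modification time is part of the stored name, re-uploading an edited file under the same name yields a new history entry, while an untouched file is skipped.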
Update the rank differences of the teams.
update_rank_diff(best_new, best_old)
best_new |
|
best_old |
old |
update_rank_diff returns the input data.frame best_new with an updated column rank_diff.
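A rank difference can be computed by matching teams between the two leaderboards, as in the base R sketch below. The helper name rank_diff_sketch, the sign convention (positive means the team moved up) and the treatment of newly appearing teams are assumptions, not documented behaviour.

```r
# Hypothetical sketch; the sign convention (positive = moved up) is an assumption.
rank_diff_sketch <- function(best_new, best_old) {
  # Look up each team's previous rank by name
  old_rank <- best_old$rank[match(best_new$team, best_old$team)]
  diff     <- old_rank - best_new$rank
  diff[is.na(diff)] <- 0  # teams absent from the old leaderboard
  best_new$rank_diff <- diff
  best_new
}

best_old <- data.frame(team = c("team_foo", "team_bar"), rank = c(1, 2))
best_new <- data.frame(team = c("team_bar", "team_foo", "team_baz"),
                       rank = c(1, 2, 3))
rank_diff_sketch(best_new, best_old)
```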