This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

The repository is organized as follows

  • Each folder contains the presentation, code, and instructions for the tutorials and assignments:
    • 1_sql: SQL and semi-structured data using PostgreSQL 10 (09/02/2018).
    • 2_search: Information retrieval and search with ElasticSearch (05/03/2018).
    • 3_bigdata: Big Data processing with Apache Spark (19/03/2018).
  • Dataset:
    • For the three tutorials and programming assignments we will be working with the MovieLens 1M Dataset provided by GroupLens Research.
    • Check the README file of the dataset for additional information.
    • A pre-processed version of the dataset is available in data; that is the file we will use.
    • The pre-processed version includes a knowledge base of DBpedia. The mapping between movies and DBpedia Resources was taken from here.

Special instructions for Windows 10 users

In this course we provide a VM to the students. To access it, follow these steps:

1) Install the Linux subsystem:
  • Go to Settings -> For Developers and enable Developer Mode. Restart the machine after the installation.
  • Open Turn Windows Features on or off, check the box for Windows Subsystem for Linux, click OK, and restart.
  • Open the Windows menu, type Command Prompt, right-click it, and select Run as administrator.
  • In the terminal, run the command lxrun /install /y, then type a username and password for the root user.
2) Install an X environment:
  • Download vcxsrv from and install it
  • To start the X environment run the application VcXsrv
3) Starting the Linux subsystem:
  • Press Windows + R and run cmd
  • Run the command bash; this gives you access to the Linux terminal on Windows.
  • From here, Windows users can follow the same commands as Linux/macOS users. Just be sure to start VcXsrv and run the commands in the bash terminal.
  • If copy/paste does not work in Chromium or pgAdmin, right-click the VcXsrv icon in the task bar and un-check Clipboard may use PRIMARY selection
  • Note: If you do not want to use the Linux subsystem, use any other approach that provides ssh and an X environment on Windows.

Using the Virtual Machines

Students can access a Linux Virtual Machine that contains all the necessary software for the assignments and tutorials.

1) SSH/Credential Setup

  • Accessing the VM requires password-less ssh login. To do this, you are required to configure an SSH credential in service (see:
    • At a glance, the configuration steps for password-less ssh login are the following:
      • Start a terminal (Windows 10 users should be in bash)
      • cd
      • mkdir ~/.ssh/ #If it does not exist
      • cd ~/.ssh/
      • ssh-keygen -t rsa -C "[email protected]" -b 4096
        • Replace the placeholder [email protected] with your Aalto e-mail.
        • By default, the credentials will be named id_rsa; you can use another name if needed.
      • Copy the content of ~/.ssh/ (or the name you used) and add it as a new key to
      • Paste the following lines in ~/.ssh/config. Create the file if it does not exist.
      • Replace the placeholder your-aalto-username with your Aalto user name. For IdentityFile, use the private key of the corresponding public key that you added to
      Host mds
          IdentityFile ~/.ssh/id_rsa
          IdentitiesOnly yes
          Compression yes
          ForwardX11 yes
          Port 22
          User your-aalto-username
      • Add XAuthLocation /opt/X11/bin/xauth to your config file if you are using a Mac and facing issues with launching jupyter notebook.
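
The directory and key steps above can be sketched as a small shell helper. This is only an illustration: the function names prepare_ssh_dir and generate_key are hypothetical, and the chmod 700/600 permissions are a common OpenSSH convention rather than something the course instructions state.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and permission values are assumptions,
# not part of the course material.

# Create the SSH directory and an empty config file if they are missing.
prepare_ssh_dir() {
    local ssh_dir="$1"
    mkdir -p "$ssh_dir"            # no error if the directory already exists
    chmod 700 "$ssh_dir"           # keep the directory private to the user
    touch "$ssh_dir/config"        # create the config file if it does not exist
    chmod 600 "$ssh_dir/config"
}

# Generate a 4096-bit RSA key pair unless a key with that name already exists.
generate_key() {
    local ssh_dir="$1" key_name="$2" comment="$3"
    if [ -e "$ssh_dir/$key_name" ]; then
        echo "Key $ssh_dir/$key_name already exists, skipping"
        return 0
    fi
    # Same options as in the steps above; -f sets the output file name.
    ssh-keygen -t rsa -b 4096 -C "$comment" -f "$ssh_dir/$key_name"
}
```

A typical call would be prepare_ssh_dir "$HOME/.ssh" followed by generate_key "$HOME/.ssh" id_rsa with your Aalto e-mail as the third argument. The existence check is there so that re-running the helper does not overwrite a key that is already registered.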

2) Accessing the VM

  • Note: You must be using a machine inside Aalto's domain. At home and on personal computers, use the VPN (
  • export DISPLAY=; ssh -X mds # Fix the DISPLAY variable as required
    • The value of the DISPLAY environment variable varies across operating systems, so set it as required. On Linux machines set DISPLAY to :0 (export DISPLAY=:0) and on macOS set DISPLAY to (export DISPLAY=
    • The first time you connect, you will see a message "The authenticity of host ... ... Are you sure you want to continue connecting (yes/no)?"
    • Type yes
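
As a rough sketch of the DISPLAY step, the helper below picks a value from the kernel name reported by uname -s. The function name choose_display is hypothetical. Only the Linux value (:0) comes from the instructions above; for any other system the sketch keeps whatever value is already set, so macOS users still need to set DISPLAY themselves.

```shell
#!/usr/bin/env bash
# Sketch only: choose_display is an illustrative helper, not part of the course material.

# $1: kernel name as printed by `uname -s`
# $2: the current value of DISPLAY (may be empty)
choose_display() {
    case "$1" in
        Linux) echo ":0" ;;   # value given in the instructions for Linux machines
        *)     echo "$2" ;;   # other systems: leave the existing value untouched
    esac
}
```

It could be used as export DISPLAY="$(choose_display "$(uname -s)" "$DISPLAY")" before running ssh -X mds.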

3) Starting Jupyter Notebook

  • In the VM, run the command source ~/env35/bin/activate; jupyter notebook
  • This should open a Chromium web browser:
    • Check your X11 configuration if the browser does not start. Try pgAdmin4; if its user interface appears, then the problem is with the browser.
    • If you get the error Trace/breakpoint trap (core dumped), install Firefox with sudo apt install firefox
  • The password is db2018
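
If the browser does not start, one quick check of the X11 configuration is to confirm that DISPLAY is set inside the ssh session, since ssh normally sets it on the remote side when X11 forwarding is active. The sketch below is a minimal example to run in the VM; the function name check_x11_forwarding is hypothetical, and an empty DISPLAY is only one of several possible causes.

```shell
#!/usr/bin/env bash
# Sketch only: check_x11_forwarding is an illustrative helper, not part of the course material.

# Return 0 if DISPLAY is set in the current session, 1 otherwise.
check_x11_forwarding() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is empty: X11 forwarding does not seem to be active" >&2
        return 1
    fi
    echo "DISPLAY is set to $DISPLAY"
}
```

Running check_x11_forwarding right after ssh -X mds shows whether the session has a display at all. If it fails, revisit the ForwardX11 setting in ~/.ssh/config and make sure the local X environment is running.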

Further instructions are provided in the corresponding session folders.

You can also work on your personal computer: clone the repo to your machine and install the required software.