I followed the Dataquest demo to understand how to create a simple data pipeline:

➜ ~ python3 -m venv ~/demopipeline
➜ ~ ls -ltr

drwxr-xr-x 6 alinafe …… 204 Mar 29 01:31 demopipeline

  • Clone the repo

➜ ~ git clone https://github.com/dataquestio/analytics_pipeline.git
Cloning into 'analytics_pipeline'...
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 1), reused 9 (delta 1), pack-reused 0
Unpacking objects: 100% (9/9), done.

  • Get into the folder with cd analytics_pipeline
  • Install the requirements with pip install -r requirements.txt

➜  analytics_pipeline git:(master) sudo -H pip install -r requirements.txt

Requirement already satisfied: Faker==0.7.9 in /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages (from -r requirements.txt (line 1))

Requirement already satisfied: python-dateutil>=2.4 in /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages (from Faker==0.7.9->-r requirements.txt (line 1))

Requirement already satisfied: six in /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages (from Faker==0.7.9->-r requirements.txt (line 1))

  • Run python log_generator.py.

➜  analytics_pipeline git:(master) python3 log_generator.py

  • Run store_logs.py, which parses the logs and stores them in a SQLite database.
  • The code reads the two log files created by the log generator, using the file object's tell method to record the current position in each file as it reads.
  • If no new data has been written to the files, it sleeps and tries again; log_generator.py keeps writing to the log files in a continuous loop.
  • SQLite is used for this demo, and the data is stored in a single file. The author recommends PostgreSQL if speed is an issue.

➜  analytics_pipeline git:(master) python3 store_logs.py
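
To make the tell/sleep/retry idea concrete for myself, here is a minimal sketch of how such a follower could look. This is not the author's store_logs.py: the table name logs, the column raw_log, and the helper names create_table, read_new_lines, store_lines and follow are all my own assumptions, and the real script parses each line into fields rather than storing it raw.

```python
import sqlite3
import time


def create_table(conn):
    # Hypothetical one-column schema, chosen only for this sketch.
    conn.execute("CREATE TABLE IF NOT EXISTS logs (raw_log TEXT NOT NULL)")
    conn.commit()


def read_new_lines(handle):
    """Return the complete lines appended to the file since the last call."""
    lines = []
    while True:
        where = handle.tell()      # remember the position before reading
        line = handle.readline()
        if not line:               # nothing new yet
            break
        if not line.endswith("\n"):
            # The writer is still in the middle of this line: rewind so it
            # is read again, in full, on a later pass.
            handle.seek(where)
            break
        lines.append(line.rstrip("\n"))
    return lines


def store_lines(conn, lines):
    conn.executemany(
        "INSERT INTO logs (raw_log) VALUES (?)",
        [(line,) for line in lines],
    )
    conn.commit()


def follow(paths, conn, poll_seconds=1.0, max_polls=None):
    """Poll every file in paths; sleep whenever none of them has new data.

    max_polls=None loops forever, like a long-running pipeline step.
    """
    handles = [open(path, "r") for path in paths]
    polls = 0
    try:
        while max_polls is None or polls < max_polls:
            new_lines = []
            for handle in handles:
                new_lines.extend(read_new_lines(handle))
            if new_lines:
                store_lines(conn, new_lines)
            else:
                time.sleep(poll_seconds)   # no data written, try again later
            polls += 1
    finally:
        for handle in handles:
            handle.close()
```

Assuming the generator writes to two files (the names are up to you), something like follow(["log_a.txt", "log_b.txt"], sqlite3.connect("db.sqlite")) after calling create_table on the same connection would keep the table growing for as long as the generator runs.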

  • Run count_visitors.py, which pulls from the database to count visitors to the site per day.
  • The script connects to the database and fetches all rows.

➜  analytics_pipeline git:(master) python3 count_visitors.py

2017-03-29 01:59:58.160522

29-03-2017: 129

2017-03-29 02:00:03.167369

29-03-2017: 129

2017-03-29 02:00:08.173586

29-03-2017: 129
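
The per-day count shown above can be sketched roughly as follows. This is my own illustration, not the author's code: I assume a logs table with ip and created columns, timestamps stored as "YYYY-MM-DD HH:MM:SS" text, and that a "visitor" means a distinct IP address. The function name is hypothetical.

```python
import sqlite3
from collections import defaultdict
from datetime import datetime


def count_unique_visitors_per_day(conn):
    """Return a dict mapping "DD-MM-YYYY" to the number of distinct IPs."""
    ips_by_day = defaultdict(set)
    # Fetch all rows, then group them by calendar day in Python.
    rows = conn.execute("SELECT ip, created FROM logs").fetchall()
    for ip, created in rows:
        parsed = datetime.strptime(created, "%Y-%m-%d %H:%M:%S")
        ips_by_day[parsed.strftime("%d-%m-%Y")].add(ip)
    return {day: len(ips) for day, ips in ips_by_day.items()}
```

Calling this in a loop with a short sleep between calls, and printing the current time before each result, would give output shaped like the transcript above.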

  • count_visitors.py prints an updated count every 5 seconds; one can let it run for a couple of days to see counts for multiple days.
  • The author proceeds to count the browsers in the short demo.
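
I have not reproduced the author's browser-counting script here, but the idea can be sketched with a simple substring check on user-agent strings. The marker list and the function names classify_browser and count_browsers are my own assumptions, and real user-agent parsing has many more special cases than this.

```python
from collections import Counter

# Order matters: Chrome user agents also contain "Safari", so Chrome is
# checked first. This list is illustrative, not exhaustive.
BROWSER_MARKERS = [
    ("Firefox", "Firefox"),
    ("Chrome", "Chrome"),
    ("Safari", "Safari"),
    ("MSIE", "Internet Explorer"),
    ("Trident", "Internet Explorer"),
]


def classify_browser(user_agent):
    """Map a raw user-agent string to a coarse browser name."""
    for marker, name in BROWSER_MARKERS:
        if marker in user_agent:
            return name
    return "Other"


def count_browsers(user_agents):
    """Count how many user-agent strings fall under each browser name."""
    return Counter(classify_browser(ua) for ua in user_agents)
```

The user-agent strings would come from the same database that count_visitors.py reads, assuming the stored log rows keep that field.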