I followed the demo from Dataquest to understand how to create a simple data pipeline:
- Install Python 3. You can find instructions here.
- Create a virtual environment.
➜ ~ python3 -m venv ~/demopipeline
➜ ~ ls -ltr
drwxr-xr-x 6 alinafe …… 204 Mar 29 01:31 demopipeline
- Clone the repo
➜ ~ git clone https://github.com/dataquestio/analytics_pipeline.git
Cloning into 'analytics_pipeline'...
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 1), reused 9 (delta 1), pack-reused 0
Unpacking objects: 100% (9/9), done.
- Get into the folder with
cd analytics_pipeline
- Install the requirements with
pip install -r requirements.txt
➜ analytics_pipeline git:(master) sudo -H pip install -r requirements.txt
Requirement already satisfied: Faker==0.7.9 in /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages (from -r requirements.txt (line 1))
Requirement already satisfied: python-dateutil>=2.4 in /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages (from Faker==0.7.9->-r requirements.txt (line 1))
Requirement already satisfied: six in /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages (from Faker==0.7.9->-r requirements.txt (line 1))
- Run the log generator with
python log_generator.py
➜ analytics_pipeline git:(master) python3 log_generator.py
- Run store_logs.py, which parses the logs and stores them in a SQLite database.
- The code reads the two logs created by the log generator, using the tell method to record the current position in each file being read.
- If no new data has been written to a file, the script sleeps and tries again; log_generator.py keeps writing to the log files in a continuous loop.
- SQLite is used for this demo, and it stores the data in a single file. The author recommends PostgreSQL if speed is an issue.
➜ analytics_pipeline git:(master) python3 store_logs.py
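The read, sleep, and retry behaviour described above can be sketched roughly as follows. This is my own minimal illustration, not the code from the repo: the table name `logs`, the column `raw_log`, and the `max_idle_polls` parameter are assumptions made for the example. The real script keeps polling forever, while this sketch stops after a few empty polls so that it terminates.

```python
import sqlite3
import time


def read_new_line(f):
    """Return the next complete line from f, or None if nothing new was written."""
    where = f.tell()              # remember the current position in the file
    line = f.readline()
    if not line or not line.endswith("\n"):
        f.seek(where)             # rewind so a partial line is read again later
        return None
    return line.rstrip("\n")


def follow(path, conn, max_idle_polls=3, delay=0.1):
    """Store each new line of one log file in SQLite until the file goes quiet."""
    # Hypothetical schema for this sketch: one text column holding the raw line.
    conn.execute("CREATE TABLE IF NOT EXISTS logs (raw_log TEXT NOT NULL UNIQUE)")
    idle = 0
    with open(path) as f:
        while idle < max_idle_polls:
            line = read_new_line(f)
            if line is None:
                idle += 1
                time.sleep(delay)     # no new data yet: sleep and try again
                continue
            idle = 0
            conn.execute("INSERT OR IGNORE INTO logs (raw_log) VALUES (?)", (line,))
    conn.commit()
```

Calling `follow` once per log file with a connection from `sqlite3.connect(...)` would cover both files; the real script also splits each line into fields before storing it.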
- Run count_visitors.py, which pulls from the database to count visitors to the site per day.
- The script connects to the database and fetches all rows.
➜ analytics_pipeline git:(master) python3 count_visitors.py
2017-03-29 01:59:58.160522
29-03-2017: 129
2017-03-29 02:00:03.167369
29-03-2017: 129
2017-03-29 02:00:08.173586
29-03-2017: 129
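To make the per-day count concrete, here is a small sketch of how such a count could be computed from the stored rows. It is an illustration under assumptions, not the repo's code: I assume a table `logs` with a `remote_addr` column (the visitor's IP) and a `created` column holding a `YYYY-MM-DD HH:MM:SS` timestamp, and I treat each distinct IP as one visitor.

```python
import sqlite3
from collections import defaultdict
from datetime import datetime


def count_visitors_per_day(conn):
    """Return {day: number of distinct IP addresses seen on that day}."""
    # Assumed schema: logs(remote_addr TEXT, created TEXT).
    rows = conn.execute("SELECT remote_addr, created FROM logs").fetchall()
    ips_by_day = defaultdict(set)
    for ip, created in rows:
        parsed = datetime.strptime(created, "%Y-%m-%d %H:%M:%S")
        day = parsed.strftime("%d-%m-%Y")   # same day format as the output above
        ips_by_day[day].add(ip)             # a set keeps each IP once per day
    return {day: len(ips) for day, ips in ips_by_day.items()}
```

Printing each entry as `f"{day}: {count}"` inside a loop that sleeps between runs would give output in the same shape as the transcript.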
- count_visitors.py prints an updated count every 5 seconds; one can let it run for a couple of days to see counts for multiple days.
- The author proceeds to count the browsers in the short demo.
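Counting browsers could look something like the sketch below. This is a guess at the general approach rather than the author's code: the browser list and the simple substring match on the user agent string are my assumptions, and real user agent parsing is considerably messier.

```python
from collections import Counter

# Checked in order: Chrome user agents also contain "Safari", so Chrome comes first.
BROWSERS = ["Firefox", "Chrome", "Opera", "Safari", "MSIE"]


def browser_of(user_agent):
    """Return the first known browser name found in the user agent string."""
    for name in BROWSERS:
        if name in user_agent:
            return name
    return "Other"


def count_browsers(user_agents):
    """Tally how many user agent strings map to each browser name."""
    return Counter(browser_of(ua) for ua in user_agents)
```

The user agent strings would come from the same database that the visitor count reads.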