TDM 40100: Project 3 — 2022
Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3
database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, sqlite3
, with the opportunity to use other skills you’ve built throughout the previous years.
Context: In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3
— a powerful database engine. In this project (and following projects), we will branch into SQL and sqlite3
-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.
Scope: sqlite3
, lmod, SQL
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json
-
/anvil/projects/tdm/data/goodreads/goodreads_book_series.json
-
/anvil/projects/tdm/data/goodreads/goodreads_books.json
-
/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json
Questions
In case you skipped the previous project, let’s all get on the same page. You run the following code in a Jupyter notebook to create a sqlite3
database called my.db
in your $HOME
directory.
%%bash
rm $HOME/my.db
sqlite3 $HOME/my.db "CREATE TABLE reviews (
user_id TEXT,
book_id INTEGER,
review_id TEXT,
rating INTEGER,
review_text TEXT,
date_added TEXT,
date_updated TEXT,
read_at TEXT,
started_at TEXT,
n_votes INTEGER,
n_comments INTEGER
);"
sqlite3 $HOME/my.db "CREATE TABLE books (
isbn TEXT,
text_reviews_count INTEGER,
country_code TEXT,
language_code TEXT,
asin TEXT,
is_ebook INTEGER,
average_rating REAL,
kindle_asin TEXT,
description TEXT,
format TEXT,
link TEXT,
publisher TEXT,
num_pages INTEGER,
publication_day INTEGER,
isbn13 TEXT,
publication_month INTEGER,
edition_information TEXT,
publication_year INTEGER,
url TEXT,
image_url TEXT,
book_id TEXT,
ratings_count INTEGER,
work_id TEXT,
title TEXT,
title_without_series TEXT
);"
sqlite3 $HOME/my.db "CREATE TABLE authors (
average_rating REAL,
author_id INTEGER,
text_reviews_count INTEGER,
name TEXT,
ratings_count INTEGER
);"
sqlite3 $HOME/my.db "CREATE TABLE series (
numbered INTEGER,
note TEXT,
description TEXT,
title TEXT,
series_works_count INTEGER,
series_id INTEGER,
primary_work_count INTEGER
);"
%sql sqlite:////home/x-myalias/my.db
%%sql
SELECT * FROM reviews limit 5;
%%sql
SELECT * FROM books limit 5;
%%sql
SELECT * FROM authors limit 5;
%%sql
SELECT * FROM series limit 5;
%%bash
rm -rf $HOME/goodreads_samples
mkdir $HOME/goodreads_samples
cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/
shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json
Question 1
Update your original CREATE TABLE
statement for the books
table to include a field that will be used to store the actual book cover images from the image_url
field in the books
table. Call this new field book_cover
. Which one of the sqlite
types did you use?
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Check out a line of the goodreads_books.json
data:
%%bash
head -n 1 $HOME/goodreads_samples/goodreads_books.json
Don’t have a
|
Recall that in the previous project, we just ignored the following fields from the books
table: series
, similar_books
, popular_shelves
, and authors
. We did this because those fields are more complicated to deal with.
Read this article on database normalization from Microsoft. We are going to do our best to normalize our tables with these previously ignored fields taken into consideration.
Let’s start by setting some practical naming conventions. Note that these are not critical by any stretch, but can help remove some guess work when navigating a database with many tables and ids.
-
Every table’s primary key should be named
id
, unless it is a composite key. For example, instead ofbook_id
in thebooks
table, it would make sense to call that columnid
— "book" is implied from the table name. -
Every table’s foreign key should reference the
id
column of the foreign table and be named "foreign_table_name_id". For example, if we had a foreign key in thebooks
table that referenced an author in theauthors
table, we should name that columnauthor_id
. -
Keep table names plural, when possible — for example, not the
book
table, but thebooks
table. -
Link tables or junction tables should be named by the two tables which you are trying to represent the many-to-many relationship for. (We will go over this one specifically when needed, no worries)
Make the appropriate changes to the following CREATE TABLE
statements that reflect these conventions as much as possible (for now).
%%bash
rm $HOME/my.db
sqlite3 $HOME/my.db "CREATE TABLE reviews (
user_id TEXT,
book_id INTEGER,
review_id TEXT,
rating INTEGER,
review_text TEXT,
date_added TEXT,
date_updated TEXT,
read_at TEXT,
started_at TEXT,
n_votes INTEGER,
n_comments INTEGER
);"
sqlite3 $HOME/my.db "CREATE TABLE books (
isbn TEXT,
text_reviews_count INTEGER,
country_code TEXT,
language_code TEXT,
asin TEXT,
is_ebook INTEGER,
average_rating REAL,
kindle_asin TEXT,
description TEXT,
format TEXT,
link TEXT,
publisher TEXT,
num_pages INTEGER,
publication_day INTEGER,
isbn13 TEXT,
publication_month INTEGER,
edition_information TEXT,
publication_year INTEGER,
url TEXT,
image_url TEXT,
book_id TEXT,
ratings_count INTEGER,
work_id TEXT,
title TEXT,
title_without_series TEXT
);"
sqlite3 $HOME/my.db "CREATE TABLE authors (
average_rating REAL,
author_id INTEGER,
text_reviews_count INTEGER,
name TEXT,
ratings_count INTEGER
);"
sqlite3 $HOME/my.db "CREATE TABLE series (
numbered INTEGER,
note TEXT,
description TEXT,
title TEXT,
series_works_count INTEGER,
series_id INTEGER,
primary_work_count INTEGER
);"
-
Code used to solve this problem.
-
Output from running the code.
Question 3
A book can have many authors, and an author can have many books. This is an example of a many-to-many relationship.
We already have a books
table and an authors
table. Create a junction or link table that effectively normalizes the authors
field in the books
table. Call this new table books_authors
(see point 4 above — this is the naming convention we want).
Make sure to include your CREATE TABLE
statement in your notebook.
There should be 4 columns in the |
Make sure to properly apply the primary key and foreign key keywords. |
Write a SQL query to find every book by author with id 12345. It doesn’t have to be perfect syntax, as long as the logic is correct. In addition, it won’t be runnable, that is okay.
You will need to use joins and our junction table to perform this query. |
Copy, paste, and update your bash
cell with the CREATE TABLE
statements to implement these changes. In a markdown cell, write out your SQL query.
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Assume that a series can have many books and a book can be a part of many series. Perform the same operations as the previous problem (except for the query).
What columns does the books_series
table have?
-
Code used to solve this problem.
-
Output from running the code.
Question 5
The remaining two fields that need to be dealt with are similar_books
and popular_shelves
. Choose at least one of the two and do your best to come up with a good solution for the way we store the data. We will give hints for both below.
For this question, please copy, paste, and update the bash
cell with the CREATE TABLE
statements. In addition, please include a markdown cell with a detailed explanation of why you chose your solution, and provide at least 1 example of a query that should work for your solution (like before, we are looking for logic, not syntax).
similar_books:
It is okay to have a link table that links rows from the same table! |
There are always many ways to do the same thing. In our examples, we used link tables with their own There is, however, a technically better solution for a table that is simply a link table and nothing more. This would be where you have 2 columns, both foreign keys, and you create a composite primary key, or a primary key that is represented by the unique combination of both foreign keys. This ensures that links are only ever represented once. Feel free to experiment with this if you want! |
popular_shelves:
You can create as many tables as you need. |
After a bit of thinking, this one may not be too different than what you’ve already accomplished. |
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |