TDM 20100: Project 5 — 2023
Motivation: awk is a utility designed for text processing. While Python and R definitely have their place in the data science world, awk is a handy way to process data with just one line of analysis.
Context: awk is a powerful tool that can be used to perform a variety of the tasks for which we previously used other UNIX utilities. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner, in pipelines of tools.
Scope: awk, UNIX utilities
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv
- /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
While the UNIX tools we’ve used up to this point are very useful, awk can perform many of the same tasks on its own.
Here is an example of how to use awk:
Instead of outputting the entire file, we send it to the awk command. In the awk command, we use the semicolon as the separator, and we print the 16th field, which contains the salary information. Then we sort this data, so that all entries that are the same are next to each other. Then we count how many times each value occurs. Finally, we sort the responses according to how many times they occur. To make this example more interesting, we can simply add the 14th field as well, so that we classify responses according to the salary range and according to the person’s favorite operating system.
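The pipeline described above can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the exact command used in the project: the function names salary_counts and os_salary_counts are hypothetical, and the code relies on the field positions given above (16th field for salary, 14th field for operating system).

```shell
# Hypothetical helpers; each takes the path to a semicolon-separated file.

# Count how often each value of the 16th field (salary) occurs,
# listing the least common values first.
salary_counts() {
  cat "$1" | awk -F';' '{print $16}' | sort | uniq -c | sort -n
}

# Same idea, but classify by the 14th field (operating system)
# and the 16th field (salary) together.
os_salary_counts() {
  cat "$1" | awk -F';' '{print $14, $16}' | sort | uniq -c | sort -n
}
```

For example, `salary_counts /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv` would run the first pipeline on the survey file.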
Here is another example:
The prices of the purchases for this file are in the 19th field:
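A minimal sketch for inspecting that field, assuming the file is semicolon-separated (the delimiter is an assumption; it is not stated here). The function name show_prices is hypothetical.

```shell
# Hypothetical helper: print the 19th field (price) of the first few lines.
# Assumes fields are separated by semicolons.
show_prices() {
  head -n 5 "$1" | awk -F';' '{print $19}'
}
```

Calling `show_prices /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` would display the prices from the first five lines.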
We can add all of the prices as follows:
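A minimal sketch of such a sum, again assuming semicolon-separated fields; the function name total_sales is hypothetical. It uses printf because awk's default output format (%.6g) would show a large non-integer total in scientific notation.

```shell
# Hypothetical helper: add up the 19th field (price) over every line of a file.
# Assumes fields are separated by semicolons.
total_sales() {
  awk -F';' '{sum += $19} END {printf "%.2f\n", sum}' "$1"
}
```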
There are 283 million dollars of sales altogether! We can find the amount of sales of BOURBON like this:
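A minimal sketch of one such approach, which filters the lines with grep before summing in awk. The function name bourbon_sales_grep is hypothetical, and semicolon-separated fields are assumed.

```shell
# Hypothetical helper: keep only the lines containing BOURBON,
# then total the 19th field (price) of those lines.
bourbon_sales_grep() {
  grep 'BOURBON' "$1" | awk -F';' '{sum += $19} END {printf "%.2f\n", sum}'
}
```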
or like this:
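A minimal sketch of an alternative that does the matching inside awk itself, using a regular-expression pattern in front of the action. The function name bourbon_sales_awk is hypothetical, and semicolon-separated fields are assumed.

```shell
# Hypothetical helper: the /BOURBON/ pattern restricts the action
# to lines that contain BOURBON, so no separate grep is needed.
bourbon_sales_awk() {
  awk -F';' '/BOURBON/ {sum += $19} END {printf "%.2f\n", sum}' "$1"
}
```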
Either way, bourbon accounts for 24 million dollars of the sales. Champagne sales, on the other hand, are only 10206 dollars altogether:
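A minimal sketch that generalizes the grep approach by taking the search word as an argument. The function name sales_grep is hypothetical, and semicolon-separated fields are assumed.

```shell
# Hypothetical helper: total the prices of lines containing a fixed string.
# Usage: sales_grep WORD FILE
sales_grep() {
  grep -F -- "$1" "$2" | awk -F';' '{sum += $19} END {printf "%.2f\n", sum}'
}
```

For example, `sales_grep CHAMPAGNE /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` would total the champagne purchases.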
or equivalently:
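A minimal sketch of an awk-only version, in which the search word is passed in with -v and matched against the whole line with the ~ operator. The function name sales_awk and the variable name pat are hypothetical, and semicolon-separated fields are assumed.

```shell
# Hypothetical helper: total the prices of lines matching a pattern, using awk alone.
# Usage: sales_awk PATTERN FILE
sales_awk() {
  awk -F';' -v pat="$1" '$0 ~ pat {sum += $19} END {printf "%.2f\n", sum}' "$2"
}
```

Here the pattern is treated as a regular expression, so a plain word such as CHAMPAGNE matches any line that contains it.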
Questions
Question 1 (1 pt)
- What is the total cost of purchases with WHISKIES in the title?
- What is the total cost of all purchases from CEDAR RAPIDS (not just WHISKIES; consider all purchases)?
Question 2 (2 pts)
- What Store Name had the largest number of purchases (not the largest total cost, but the largest number of purchases; please consider each line to be 1 purchase)?
- Using the Store Name identified in Question 2A, what was the total cost of all purchases from this Store Name?
Question 3 (2 pts)
- Please compute the total volume (in liters) of all purchases sold in the file iowa_liquor_sales_cleaner.txt
- Please compute the total volume (in liters) of VODKA 80 PROOF sold in the file iowa_liquor_sales_cleaner.txt
Question 4 (2 pts)
- When looking at which location has the largest number of purchases, if we use the address (instead of the store name), we should include the Address, City, and Zip Code. Using these three variables (together), what location has the largest number of purchases?
- Does your answer to Question 4A agree with your answer to Question 2A? How do you know? (Please explain why, and/or use some analysis to justify your answer.)
Question 5 (1 pt)
- awk is powerful, and this liquor dataset is pretty interesting! We haven’t covered everything awk can do (and we won’t). Look at the dataset and ask yourself an interesting question about the data. Use awk to solve your problem (or, at least, get you closer to answering the question). Optionally: You can explore various stackoverflow questions about awk and awk guides online. Try to incorporate an awk function you haven’t used, or an awk trick you haven’t seen. While this last part is not required, it is highly encouraged and can be a fun way to learn something new.
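As a starting point for exploring, here is a minimal sketch of a few standard awk built-ins (toupper, substr, length, and printf). The function name describe_lines and the sample input are made up for illustration and are not taken from the liquor file.

```shell
# Hypothetical helper: reads "name;price" lines on standard input.
describe_lines() {
  awk -F';' '{
    city = toupper($1)           # toupper: convert a string to upper case
    short = substr(city, 1, 5)   # substr: keep the first five characters
    printf "%s (%d chars) %.2f\n", short, length(city), $2
  }'
}

# Made-up sample input, not real data.
printf '%s\n' 'cedar rapids;12.50' 'ames;7.25' | describe_lines
```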
Please be sure to include a brief explanation of your work in Question 5, describing how you used awk to study something interesting that YOU FOUND in the data.
You do not need to limit yourself to just use
Project 05 Assignment Checklist
- Jupyter Lab notebook with your code and comments for the assignment
  - firstname-lastname-project05.ipynb
- A .sh text file with all of your bash code and comments written inside of it
  - bash code and comments used to solve questions 1 through 5
- Submit files through Gradescope
Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.