Homework 03: Data Wrangling
Run `git pull` to grab the latest homework.
- Take this short interactive regex tutorial. There is nothing to be turned in for this problem.
- There are two parts to this question. Save the answer to the first part in line 1 of `q2.txt`, and the answer to the second part in lines 2 through 4 of `q2.txt`. This problem is worth 2 points, one per subpart.
Find the number of words (in `words.txt`) that contain at least three `a`s and don’t have a `'s` ending. Write the number in line 1 of `q2.txt`. (1 point)
What are the three most common last two letters of those words (the filtered list)? Write these in lines 2, 3, and 4 of `q2.txt`, respectively. (1 point)
Note: For both questions, we’d like to find case-insensitive matches. `sed`’s `y` command, or the `tr` program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
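A minimal sketch of how the case folding, filtering, and counting steps might fit together, run on a tiny made-up word list rather than `words.txt`. The sample words and the variable names (`words`, `filtered`, `count`, `suffixes`) are assumptions for illustration only.

```shell
# Hypothetical sample standing in for words.txt.
words=$(printf '%s\n' "Alabama" "Panama" "banana's" "Bahamas" "apple" "Canada")

# Lowercase everything with tr, keep words containing at least three a's,
# then drop words that end in 's.
filtered=$(printf '%s\n' "$words" \
  | tr '[:upper:]' '[:lower:]' \
  | grep 'a.*a.*a' \
  | grep -v "'s\$")

# Number of words that survived the filters.
count=$(printf '%s\n' "$filtered" | wc -l | tr -d ' ')

# Reduce each word to its last two letters, then count and rank them.
suffixes=$(printf '%s\n' "$filtered" \
  | sed -E 's/.*(..)$/\1/' \
  | sort | uniq -c | sort -rn)

echo "$count"
echo "$suffixes"
```

Lowercasing first means the later patterns only need to handle one case. `uniq -c` only merges adjacent duplicates, which is why the suffixes are sorted before counting.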
- To do in-place substitution it is quite tempting to do something like `sed s/REGEX/SUBSTITUTION/ input.txt > input.txt`. However, this is a bad idea. Why? Is this particular to `sed`? Use `man sed` to find out how to accomplish this. Write your reasoning in `sed.txt`. You will be given points for effort.
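The effect of redirecting onto the input file can be observed safely in a scratch directory. The sketch below assumes a throwaway file created with `mktemp`; the file contents and variable names are made up for the demonstration.

```shell
# Work on a throwaway file so nothing real is overwritten.
tmpdir=$(mktemp -d)
f="$tmpdir/input.txt"
printf 'hello world\n' > "$f"

# The shell opens (and truncates) the output file before sed starts,
# so sed ends up reading an already-empty file.
sed 's/hello/goodbye/' "$f" > "$f"
clobbered_size=$(wc -c < "$f" | tr -d ' ')
echo "size after redirecting onto itself: $clobbered_size"

# One safe pattern: write to a temporary file, then replace the original.
printf 'hello world\n' > "$f"
sed 's/hello/goodbye/' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
result=$(cat "$f")
echo "$result"

# Another option: sed's own in-place flag. The attached-suffix form
# (-i.bak) is accepted by both GNU and BSD sed and keeps a backup copy.
sed -i.bak 's/goodbye/hello/' "$f"
restored=$(cat "$f")
echo "$restored"
```

Because the truncation is done by the shell's `>` redirection, the same thing happens with any command that reads the file it is being redirected into.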
- Find your average, median, and max system boot time over the last ten boots. Use `journalctl` on Linux and `log show` on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:
Logs begin at ...
systemd: Startup finished in ...
On macOS, look for:
=== system boot:
Previous shutdown cause: 5
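Once the boot durations have been extracted as one number per line, the statistics themselves can be computed with `sort` and `awk`. This is a sketch under that assumption; the four durations below are invented values, not real measurements, and extracting them from the logs is left out.

```shell
# Hypothetical boot times in seconds, one per line (invented values).
times=$(printf '%s\n' 12.5 9.5 15.0 11.0)

# Sort numerically so the median and max can be read off by position.
stats=$(printf '%s\n' "$times" | LC_ALL=C sort -n | awk '
  { v[NR] = $1; sum += $1 }
  END {
    avg = sum / NR
    # Median: the middle value, or the mean of the two middle values.
    median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
    printf "avg=%.2f median=%.2f max=%.2f\n", avg, median, v[NR]
  }')
echo "$stats"
```

Sorting first keeps the `awk` program short: the maximum is simply the last value and the median is found by index.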
- Look for boot messages that are not shared between your past three reboots (see the `-b` flag of `journalctl`). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use `sed '0,/STRING/d'` to remove all lines up to the first one that matches `STRING`. Next, remove any part of each line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (`uniq` is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
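The last three steps (strip the varying part, count duplicates, drop lines seen three times) might look like the following sketch. The log lines, the hostname, and the timestamp format are all made up, so the `sed` pattern would need adjusting to match real output.

```shell
# Hypothetical log lines from three boots (timestamps and messages are made up).
logs=$(printf '%s\n' \
  "Jan 01 10:00:01 host kernel: Linux version loaded" \
  "Jan 01 10:00:02 host systemd: Started Network Manager" \
  "Jan 02 09:30:01 host kernel: Linux version loaded" \
  "Jan 02 09:30:03 host systemd: Started Network Manager" \
  "Jan 02 09:30:04 host kernel: usb 1-1: new device found" \
  "Jan 03 08:15:01 host kernel: Linux version loaded" \
  "Jan 03 08:15:02 host systemd: Started Network Manager")

# Strip the leading timestamp, count identical messages,
# and keep only those that did not appear exactly three times.
unshared=$(printf '%s\n' "$logs" \
  | sed -E 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} //' \
  | sort | uniq -c \
  | awk '$1 != 3')
echo "$unshared"
```

One limitation of counting this way: a message that happens to repeat three times within a single boot would also be dropped, so the result is an approximation.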
- Find an online data set like this one, this one, or maybe one from here. Fetch it using `curl` and extract just two columns of numerical data. If you’re fetching HTML data, `pup` might be helpful. For JSON data, try `jq`. Find the min and max of one column in a single command, and the difference of the sum of each column in another.
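Assuming the data has already been fetched and reduced to two whitespace-separated numeric columns, the two computations could be sketched with `awk` as below. The sample values are invented and stand in for whatever the chosen data set contains.

```shell
# Hypothetical two-column numeric data (invented values).
data=$(printf '%s\n' "3 10" "7 4" "1 6" "9 2")

# Min and max of the first column in a single awk command.
minmax=$(printf '%s\n' "$data" | awk '
  NR == 1  { min = max = $1 }
  $1 < min { min = $1 }
  $1 > max { max = $1 }
  END      { print min, max }')

# Sum of column 1 minus sum of column 2, in another command.
diff_of_sums=$(printf '%s\n' "$data" \
  | awk '{ a += $1; b += $2 } END { print a - b }')

echo "$minmax"
echo "$diff_of_sums"
```

Seeding `min` and `max` from the first row avoids having to pick an arbitrary starting value that might fall outside the range of the data.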