SV@2020 - Weekwise Anomaly Visualization

Update May 28, 2020 - code using this is added to my covid 19 tracker repo with actual output

I find looking at daily numbers folded by week (ie. Mon to Sun in one row) for last 8 weeks a good way to look at the data. Especially if multiple tables are put one below the other for the same period, it is easy to identify patterns. Sample tables in a typical e-Commerce system could be orders, shipments, payment_failures etc.

Instead of looking at this manually, why not use ML to automatically highlight anomalies? I tried various things with hand coded models first, then ARIMA and finally settled on FBProphet library.

Read below to see how a simple system can be built with publicly available data. Bonus - if you have not used jq and xsv, you can see how cool those are too.

result

Sample Data

covid19india is a crowd sourced website that shows a lot of visualizations about the state of COVID-19 in India. They've nicely available JSON APIs that gives the data by day.

curl with -z will download the data if the server has newer data.

curl -o data.json -z data.json https://api.covid19india.org/data.json

The data has multiple lists. The one we need is right at the top.

  "cases_time_series": [
    {
      "dailyconfirmed": "1",
      "dailydeceased": "0",
      "dailyrecovered": "0",
      "date": "30 January ",
      "totalconfirmed": "1",
      "totaldeceased": "0",
      "totalrecovered": "0"
    },...

JSON to CSV

Pretty much all data science stuff starts with CSV. Using jq, we can just extract the date and the 3 daily metrics to a csv.

  echo "dt,confirmed,recovered,deceased" > daily.csv
  jq -r ".cases_time_series[]|[.date, .dailyconfirmed, .dailyrecovered, .dailydeceased]|@csv" data.json \
       |sed -e 's/"//g' -e  's/^\(..\) \(.*\) \(.*\)/\1-\2-2020\3/g' >> daily.csv

First echo is a cheap way to add header to a new daily.csv file.
jq is a bit more complex. -r is to get raw output and the filter following it is to get the case_time_series child as an array or arrays with only four keys we need; and then the whole thing is given to @csv which simply converts an array to csv.
sed removes double quotes first and then makes the input date format from *01 January * to 01-January-2020. Not exactly safe for turn of the year, but that is how current json data is given.

Extract Specific Timeseries

The data I am interested in processing has only two columns, viz., date and value. So, if I need to get daily numbers of confirmed cases from this csv, multiple options are available:

use cut -f1-2 -d, to get first two columns
load data into a sqlite3 database and extract what I need
write another python+pandas script to load to a df, make another df with what I need and write to csv.

All will work. But an amazing tool exists for this xsv.

xsv select dt,confirmed daily.csv

That's it. Very easy. Admittedly, this data is very small - but xsv is extremely fast when your data file is very large with lot of columns and lot of rows. It is worth investing to know what all options this brilliant utility has.

`Makefile` - integration!

GNU Make is brilliant for such things. Here is my Makefile.

all:  data.json dashboard.html

dashboard.html : daily.csv
	./csv2dashboard.sh

daily.csv : data.json
	echo "dt,confirmed,recovered,deceased" > daily.csv
	jq -r ".cases_time_series[]|[.date, .dailyconfirmed, .dailyrecovered, .dailydeceased]|@csv" data.json \
      |sed -e 's/"//g' -e  's/^\(..\) \(.*\) \(.*\)/\1-\2-2020\3/g' >> daily.csv

data.json :
	curl -o data.json -z data.json https://api.covid19india.org/data.json

clean :
	rm -f *.html *.csv

.PHONY : data.json

Just run make all and open dashboard.html in a browser. That's it.

Oh wait, what is there in this csv2dashboard.sh?

`csv2dashboard.sh` - combined html

This is just a wrapper script to call the main python one multiple times.

#!/bin/bash
NUMWEEKS=8
startdt=`date -d"monday-${NUMWEEKS} weeks" +%Y%m%d`
OUTPUT_FILE="dashboard.html"
rm -f o.html
for CATEGORY in confirmed recovered deceased; do
  #https://github.com/BurntSushi/xsv
  if [ "$CATEGORY" == "recovered" ]; then
    #we should highlight recoveries over anomaly as green
    COLOR_ARGS="-p lightgreen -n orange"
  else
    COLOR_ARGS="-n lightgreen -p orange"
  fi
  #make the csv with two columns and don't print the header
  xsv select dt,$CATEGORY daily.csv | tail -n +2  \
     | python3 weeklies.py -s $startdt -k $NUMWEEKS -o ${CATEGORY}.html -t "%d-%B-%Y" $COLOR_ARGS
  #cleanup the html generated to remove time component, nan etc; and add a header to the table
  cat ${CATEGORY}.html|sed  \
     -e 's/ 00:00:00//g' \
     -e 's/<table /<table border="0" cellspacing="2" cellpadding="2" /g' \
     -e "s/<thead>/<caption style='font-weight:bold;background-color:#99ccff;'>${CATEGORY}<\/caption><thead>/g" \
     -e 's/<td /<td width="60" align="right" /g' \
     -e 's/<tr> *<th class="index_name level0" >.*<\/tr>//g' \
     -e 's/nan//g' \
     -e 's/<thead>.*<\/thead>//g' >> o.html
  rm -f ${CATEGORY}.html
done
mv o.html $OUTPUT_FILE
# this is a bad html; so force an exit code of 0
tidy -m $OUTPUT_FILE || exit 0

Before you ask why do I need startwk and number_of_weeks, I plan to use the python script for not just last 8 weeks, but out of turn dates too.

`weeklies.py` - pandas & fbprophet

The real meat is in this file, weeklies.py.

Instead of keeping it here, I've added to a public gist.

Some examples that you might find useful in that code outside of fbprophet are below.

argparser module to easily handle command line arguments
Find nth Monday before this week today - datetime.timedelta(days=today.weekday(), weeks = n)
Make a date range in pandas dates = pd.date_range(start_date, periods=((days))
Adding a mask to a dataframe for some columns, based on values in other columns
Apply html style and hide columns to render a dataframe.

Final Setup

Get a modern bash shell with GNU utilities
Install python3
Install pandas and numpy
Install fbprophet
Install jq, xsv and tidy
Add a Makefile as per the content above
Add a csv2dashboard.sh from above. chmod +x csv2dashboard.sh to make it executable.
Get weeklies.py from the gist above.
Run make all

Gotchas

Without changing source data, I get different anomaly detections at different runs some times. Could be some seed issue that I've missed. I have not looked into how fbprophet actually implements their algorithms.
Weekly folding has a bug in that it doesn't consider current day's data, if it is run on a Monday.

Contents