In the first post of the series, we learnt how to gather the data. Now we are going to do some cleanup.
This is #2 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.
#1 - Let's Play: Scrape [ DONE ]
#2 - Let's Play: Cleanup <<
#3 - Let's Play: Optimize [ DONE ]
#4 - Let's Play: Analyze [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...
Where Were We?
Using the methods in the Let’s Play #1, I gathered global, day-by-day, August 2017 data for the App Store presence of our latest game, Twiniwt.
Below is the size of the dataset:
[kenanb@6x13 twiniwt-1708]$ du -hsc *
Damn! Surely, it is time for cleanup and restructuring.
Scabbling and Cleanup
Launch python3 and import the required packages.
import json, pprint
For this post, all we need is Python 3, which I presume you already have. Even though it is completely optional for the cleanup part of the series, I strongly suggest installing and using Jupyter Notebook for data exploration. It is especially useful while trying to restructure the data for your needs.
We saved our daily feature data to a subdirectory in our working directory.
A dataset corresponding to 2017-08-15 is saved as:
We want to loop over a period, say, each day in August 2017.
We need to generate the filepaths for the corresponding dates in the loop.
y = 17
m = 8
Loop over each day, generate the basename for the file, and build the full pathname.
for d in range(31):
    date = "%02d%02d%02d" % (y, m, d+1)
    file_name = "json/" + date + '.json'
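The hard-coded range(31) is fine for August. As a minimal sketch, assuming the same json/YYMMDD.json naming as above, the standard calendar module can supply the month length so the same loop also works for shorter months. The helper name daily_paths is hypothetical and not part of the original code.

```python
import calendar

def daily_paths(year, month, directory="json"):
    """Return one 'directory/YYMMDD.json' path per day of the given month.

    `year` is the full four-digit year; only its last two digits are
    used in the file name, matching the naming used in this post.
    """
    # monthrange returns (weekday of the 1st, number of days in the month)
    days_in_month = calendar.monthrange(year, month)[1]
    return [
        "%s/%02d%02d%02d.json" % (directory, year % 100, month, day)
        for day in range(1, days_in_month + 1)
    ]

paths = daily_paths(2017, 8)
print(paths[0], paths[-1], len(paths))  # json/170801.json json/170831.json 31
```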
Here, we read the dataset using the pathname we just created.
    with open(file_name) as data_file:
        data = json.load(data_file)
Then, we immediately bypass all the garbage branches in the serialized dataset, and assign the value stored under the ‘rows’ key to our data variable.
First, we need to have a look at the loaded data and find the path to ‘rows’.
As you can see, this is what takes us to the ‘rows’:
    data = data['data']['data'].get('rows') or []
You probably noticed that we didn’t simply use data['data']['data']['rows']. That is because the ‘rows’ key might not even exist if, for some reason, your app is not in the store that day.
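To make the difference concrete, here is a small sketch with two made-up responses shaped like the path above. Their contents are invented for illustration and are not real App Annie output, and falling back to an empty list is an assumption made so that a later loop over the rows simply does nothing. The helper name extract_rows is hypothetical.

```python
def extract_rows(response):
    """Return the list under data -> data -> rows, or [] if it is missing or empty."""
    # .get() returns None when 'rows' is absent; `or []` turns that into an empty list
    return response['data']['data'].get('rows') or []

# Made-up responses: one day with a feature, one day without
featured = {'data': {'data': {'rows': [['US', 'Featured Home']]}}}
not_featured = {'data': {'data': {}}}

print(extract_rows(featured))      # [['US', 'Featured Home']]
print(extract_rows(not_featured))  # []
```

Note that only the last key is guarded this way. If the outer 'data' keys could be missing too, they would need the same treatment.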
Cool, we got rid of the immediate garbage. It’s time to clean up and restructure the actual row data.
Let’s see, this is a sample row in an App Annie Daily Featured response.
"label": "Featured Home"
Above, I marked the data we want to keep in bold. The full details about the contents of the row are provided in the Store Presence on App Store #1 – Let’s Play: Scrape.
We traverse the data, removing the garbage values from each row array (in reverse order, of course, so the indices for the garbage entities do not change during deletion.)
for d in data:
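The reverse-order trick is easiest to see on a toy list. The sketch below uses a made-up row and made-up garbage positions; the real indices depend on the row layout described in the first post. The helper name drop_indices is hypothetical.

```python
def drop_indices(row, garbage_indices):
    """Delete the given positions from row in place, highest index first."""
    # Deleting from the end keeps the remaining (lower) indices valid
    for i in sorted(garbage_indices, reverse=True):
        del row[i]
    return row

row = ['a', 'b', 'c', 'd', 'e']
drop_indices(row, [1, 3])
print(row)  # ['a', 'c', 'e']
```

Deleting index 1 first would shift 'd' down to index 2, so a later del row[3] would remove 'e' instead.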
The two-letter country code is enough; we don’t need the full country name in each element of our dataset.
d = d
Shorten the ‘Featured Home’ category page name to simply ‘Home’.
if ( d == 'Featured Home' ): d = 'Home'
The way ‘Featured Path’ is structured is pretty complex for our needs. Let’s restructure it.
n = 
for r in d:
n[-1] = d[-1]['parent']
if ( n == 'Featured Home' ):
n = 'Home'
if ( n[-1].endswith('see more') ):
n[-1] = n[-1][:-9]
d = d[-1]['detail']['row']
d = n
I am skipping the details on this one, as you might want to keep it, or arrange it differently.
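For readers who do want to flatten the path, here is a rough sketch of the renaming idea only. It assumes each path node is a dict with a 'label' key, which is a simplification made for this example; the real structure is more involved. The function name simplify_path and the sample labels are made up.

```python
SEE_MORE = ' see more'

def simplify_path(path):
    """Collapse a list of path nodes into a list of short page names."""
    # Assumption: every node carries its display name under 'label'
    names = [node['label'] for node in path]
    if names and names[0] == 'Featured Home':
        names[0] = 'Home'
    if names and names[-1].endswith(SEE_MORE):
        # Trim the trailing ' see more' (9 characters, including the space)
        names[-1] = names[-1][:-len(SEE_MORE)]
    return names

sample_path = [{'label': 'Featured Home'}, {'label': 'New Games We Love see more'}]
print(simplify_path(sample_path))  # ['Home', 'New Games We Love']
```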
Below is the cleaned-up sample data we get after the process.
Now that we are finished with the dataset cleanup, let’s write it back to the file.
    with open(file_name, 'w') as out_file:
        json.dump(data, out_file)
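Overwriting the source files is destructive, so it can be worth trying the read, clean, write round trip on a throwaway copy first. The following is a minimal sketch on a temporary file with made-up contents; the helper name rewrite_json and the transform argument are hypothetical.

```python
import json
import os
import tempfile

def rewrite_json(path, transform):
    """Load JSON from path, apply transform, and overwrite the file with the result."""
    with open(path) as in_file:
        data = json.load(in_file)
    data = transform(data)
    with open(path, 'w') as out_file:
        json.dump(data, out_file)
    return data

# Demo on a throwaway file holding a made-up response
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '170815.json')
    with open(path, 'w') as f:
        json.dump({'data': {'data': {'rows': [['US', 'Home']]}}}, f)
    before = os.path.getsize(path)
    rows = rewrite_json(path, lambda d: d['data']['data'].get('rows') or [])
    after = os.path.getsize(path)
    with open(path) as f:
        reloaded = json.load(f)

print(rows, after < before)  # [['US', 'Home']] True
```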
You can view and download the complete code gist below.
We scraped and cleaned up the data. It is now down to 3.4 MB from 23 MB, meaning we just got rid of garbage that amounts to ~85% of the dataset. Congratulations!
Yet, the data is still unsuitable for real-time processing: some entities are long strings where we could get away with enumerations, things like that. We will fix that in the next part of the series. And while we are at it, let’s get rid of this JSON nonsense, shall we?
Oh, I almost forgot the coffee beans!
Check back for Part 3 of the series! If you have any questions or advice, please comment below. Thank you!