Store Presence on App Store #2 – Let’s Play: Cleanup

In the first post of the series, we learnt how to gather the data. Now we are going to do some cleanup.

This is #2 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    [ DONE ]
#2 - Let's Play: Cleanup   <<
#3 - Let's Play: Optimize  [ DONE ]
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...


Where Were We?

Using the methods from Let’s Play #1, I gathered global, day-by-day data for August 2017 on the App Store presence of our latest game, Twiniwt.

Below is the size of the dataset:

[kenanb@6x13 twiniwt-1708]$ du -hsc *
4.0K 170801.json
4.0K 170802.json
1.5M 170803.json
1.5M 170804.json
1.5M 170805.json
728K 170806.json
728K 170807.json
...
684K 170830.json
732K 170831.json
23M total

Damn! Surely, it is time for cleanup and restructuring.

Scabbling and Cleanup

Let’s python3 and import the required packages.

import json, pprint

For this post, all we need is Python 3, which I presume you already have. Even though it is completely optional for the cleanup part of the series, I strongly suggest installing and using Jupyter Notebook for data exploration. It is especially useful while trying to restructure the data for your needs.

We saved our daily feature data to a subdirectory in our working directory.

A dataset corresponding to 2017-08-15 is saved as:

./json/170815.json

We want to loop over a period, say, each day in August 2017.

We need to generate the filepaths for the corresponding dates in the loop.

The year:

y = 17

The month:

m = 8

Loop over each day, generate the basename for the file and concatenate the whole pathname.

for d in range(31):
    date = "%02d%02d%02d" % (y, m, d+1)
    file_name = "json/" + date + '.json'
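The loop above hard-codes 31 days, which happens to be right for August. As a minimal sketch for other months, the standard library's calendar.monthrange can supply the day count. The helper name daily_paths and its directory parameter are my own, not part of the original script.

```python
import calendar


def daily_paths(year, month, directory="json"):
    """Return one 'YYMMDD.json' path per day of the given month.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # monthrange returns (weekday of the 1st, number of days in the month).
    _, days_in_month = calendar.monthrange(year, month)
    return [
        "%s/%02d%02d%02d.json" % (directory, year % 100, month, day)
        for day in range(1, days_in_month + 1)
    ]


paths = daily_paths(2017, 8)
```

Here paths starts with json/170801.json and ends with json/170831.json, the same names the loop above builds one at a time.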

Here, we read the dataset using the pathname we just created.

    with open(file_name, encoding='utf-8') as data_file:
        data = json.load(data_file)

Then, we immediately bypass all the garbage branches in the serialized dataset, and assign the value stored under the key ‘rows’ to our data variable.
First, we need to have a look at the loaded data and find the path to ‘rows’.

{
  "data": {
    "data": {
      "pagination": {
        "current": 0,
        "page_interval": 1000,
        "sum": 1
      },
      "rows": [

	  ...

      ],
      "csvPermissionCode": "PERMISSION_NOT_PASS",
      "columns": [

	  ...

      ],
      "fixedColumns": {
        "tableWidth": 150,
        "fixed": 1
      }
    },
    "permission": true
  },
  "success": true
}

As you can see, this is what takes us to the ‘rows’:

    data = data['data']['data'].get('rows') or []

You probably noticed that we didn’t simply do data['data']['data']['rows']. The ‘rows’ key might not even exist (or might hold nothing) if, for some reason, your app is not in the store that day. In that case, we fall back to an empty list.
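If you are worried that the outer ‘data’ branches could be missing as well (I have not seen that happen, so treat it as an assumption), the same .get(...) or fallback idea can be applied at every level. The function name extract_rows is hypothetical.

```python
def extract_rows(payload):
    """Return payload['data']['data']['rows'], or [] if any level is missing.

    Hypothetical helper; it also treats a null/empty value as missing.
    """
    outer = payload.get("data") or {}
    inner = outer.get("data") or {}
    return inner.get("rows") or []


# A response shaped like the one above, with two placeholder rows.
full = {"data": {"data": {"rows": [["row-a"], ["row-b"]]}, "permission": True},
        "success": True}
# Responses where the path is cut short at different depths.
no_rows = {"data": {"data": {"columns": []}}, "success": True}
no_inner = {"data": {"permission": True}, "success": True}
```

All three calls return a list, so the cleanup loop that follows can run without a special case for empty days.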

Restructure

Cool, we got rid of the immediate garbage. Now it’s time to clean up and restructure the actual row data.

Let’s see, this is a sample row in an App Annie Daily Featured response.

[
  [
    {
      "image": "https://static-s.aa-cdn.net/img/ios/...",
      "type": "icon",
      "thumb": "https://static-s.aa-cdn.net/img/ios/..."
    }
  ],
  [
    "China",
    "CN"
  ],
  "iPhone",
  "Board",
  "Collection List",
  "N/A",
  2,
  4,
  6,
  [
    {
      "existence": false,
      "detail": null,
      "label": "Featured Home"
    },
    {
      "existence": false,
      "detail": null,
      "label": "Board"
    },
    {
      "existence": true,
      "detail": {
        "position": [
          6
        ],
        "row": [
          4,
          4
        ]
      },
      "parent": "免费",
      "label": "Twiniwt"
    }
  ],
  [
    "N/A",
    0,
    100,
    ""
  ]
]

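The data we want to keep is everything except the entries at indices 0, 5 and 10 of the row; within the kept entries, we will also trim the country and the ‘Featured Path’. The full details about the contents of the row are provided in the Store Presence on App Store #1 – Let’s Play: Scrape.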

We traverse the data, removing the garbage values from each row array (in reverse order, of course, so the indices for the garbage entities do not change during deletion.)

    for d in data:
        d.pop(10)
        d.pop(5)
        d.pop(0)
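To see why the order matters, here is a small sketch on a stand-in row (the string values are placeholders I made up; only the positions matter). It also shows an alternative that filters by index and so does not depend on order at all.

```python
# Stand-in for one raw row: 11 entries, indices 0, 5 and 10 are unwanted.
row = ["icon", "country", "device", "page", "type",
       "unused-5", 2, 4, 6, "path", "unused-10"]

# Popping from the highest index down leaves the lower indices untouched.
by_pop = list(row)
for index in (10, 5, 0):
    by_pop.pop(index)

# Popping in ascending order shifts everything left after the first pop,
# so index 5 no longer points at "unused-5".
wrong = list(row)
wrong.pop(0)
wrong.pop(5)  # removes the value 2 instead of "unused-5"

# Order-independent alternative: keep every entry whose index is not dropped.
drop = {0, 5, 10}
by_filter = [value for index, value in enumerate(row) if index not in drop]
```

Both by_pop and by_filter end up with the same eight entries, while wrong still contains "unused-5" and has lost a value we wanted.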

The two-letter country code is enough; we don’t need the full name of the countries in each element of our dataset.

        d[0] = d[0][1]

Shorten the ‘Featured Home’ category page name to simply ‘Home’.

        if ( d[2] == 'Featured Home' ): d[2] = 'Home'

The way ‘Featured Path’ is structured is pretty complex for our needs. Let’s restructure it.

        n = []
        for r in d[7]:
            n.append(r['label'])
        n[-1] = d[7][-1]['parent']
        if ( n[0] == 'Featured Home' ):
            n[0] = 'Home'
        if ( n[-1].endswith('see more') ):
            n[-1] = n[-1][:-9]
            n.append('>>')
        d[5] = d[7][-1]['detail']['row']
        d[7] = n
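Read as a standalone function, the label part of the block above might look like the sketch below. The name simplify_path is hypothetical. Falling back to the label when ‘parent’ is absent is my assumption, and slicing off 9 characters assumes a single separator character sits before ‘see more’.

```python
def simplify_path(path):
    """Turn the list of path dicts into a short list of labels.

    Hypothetical helper mirroring the steps above.
    """
    labels = [step["label"] for step in path]
    # The last label is the app itself; replace it with the section it sits in.
    # Assumption: fall back to the label if the step has no 'parent' key.
    labels[-1] = path[-1].get("parent", labels[-1])
    if labels[0] == "Featured Home":
        labels[0] = "Home"
    if labels[-1].endswith("see more"):
        # Drop 'see more' plus the one character before it, then mark the jump.
        labels[-1] = labels[-1][:-9]
        labels.append(">>")
    return labels


sample_path = [
    {"existence": False, "detail": None, "label": "Featured Home"},
    {"existence": False, "detail": None, "label": "Board"},
    {"existence": True, "detail": {"position": [6], "row": [4, 4]},
     "parent": "免费", "label": "Twiniwt"},
]
simplified = simplify_path(sample_path)
```

The row position (d[5] in the loop above) is deliberately left out here, so the function only deals with the labels.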

I am skipping the details on this one, as you might want to keep it, or arrange it differently.
Below is the cleaned-up sample data we get after the process.

[
  "CN",
  "iPhone",
  "Board",
  "Collection List",
  2,
  [
    4,
    4
  ],
  6,
  [
    "Home",
    "Board",
    "免费"
  ]
]
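As a sanity check, the steps can be bundled into one function and run against the sample row. This is only a sketch: clean_row is a hypothetical name, it returns a new list instead of editing in place, and the icon entry is abbreviated because its URLs are truncated in the sample.

```python
def clean_row(row):
    """Return a cleaned copy of one raw row (hypothetical helper)."""
    # Drop the entries at indices 0, 5 and 10.
    kept = [value for index, value in enumerate(row) if index not in (0, 5, 10)]
    kept[0] = kept[0][1]  # ["China", "CN"] -> "CN"
    if kept[2] == "Featured Home":
        kept[2] = "Home"
    path = kept[7]
    labels = [step["label"] for step in path]
    labels[-1] = path[-1]["parent"]
    if labels[0] == "Featured Home":
        labels[0] = "Home"
    if labels[-1].endswith("see more"):
        labels[-1] = labels[-1][:-9]
        labels.append(">>")
    kept[5] = path[-1]["detail"]["row"]
    kept[7] = labels
    return kept


raw = [
    [{"type": "icon"}],  # abbreviated icon entry
    ["China", "CN"], "iPhone", "Board", "Collection List", "N/A", 2, 4, 6,
    [
        {"existence": False, "detail": None, "label": "Featured Home"},
        {"existence": False, "detail": None, "label": "Board"},
        {"existence": True, "detail": {"position": [6], "row": [4, 4]},
         "parent": "免费", "label": "Twiniwt"},
    ],
    ["N/A", 0, 100, ""],
]
cleaned = clean_row(raw)
```

With the sample row as input, cleaned matches the cleaned-up data shown above.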

Now that we are finished with the dataset cleanup, let’s write it back to the file.

    with open(file_name, 'w', encoding='utf-8') as out_file:
        json.dump(data,
                  out_file,
                  indent=2,
                  ensure_ascii=False,
                  sort_keys=True)
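A quick note on ensure_ascii=False: by default, the json module escapes every non-ASCII character, which makes labels like 免费 unreadable in the file and a bit larger. The small sketch below compares the two settings on the labels from our sample row.

```python
import json

labels = ["Home", "Board", "免费"]

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(labels)

# With ensure_ascii=False the characters are written as they are.
readable = json.dumps(labels, ensure_ascii=False)

# Both forms decode back to the same Python list.
same_data = json.loads(escaped) == json.loads(readable) == labels
```

Since the readable form contains raw non-ASCII characters, the file it is written to should be opened with an encoding that can represent them, such as UTF-8.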

You can view and download the complete code gist below.

We scraped and cleaned up the data. It is now down to 3.4 MB from 23 MB, meaning we just got rid of garbage that amounts to ~85% of the dataset. Congratulations!

Yet, the data is still unsuitable for real-time processing. We will fix that in the next part of the series. Some entities are long strings while we could get away with enumerations, things like that. And while at it, let’s get rid of this JSON nonsense, shall we?

Oh, I almost forgot the coffee beans!

Check back for Part 3 of the series! If you have any questions or advice, please comment below. Thank you!

Published by

Kenan Bölükbaşı

Founder, Project Leader and Developer at 6x13 Games. Game Developer & Designer, CG Generalist, Architect. Theoretical and applied knowledge in programming, design and media. Broad experience in project management. Experience in 3D (mesh, solid & CAD), 2D (raster, vector), and parametric graphics as well as asset pipelines and tools development. Blender 3D specialist (Blender Foundation Certified Trainer).
