Store Presence on App Store #3 – Let’s Play: Optimize

In the last post, we cleaned up the data, cutting its size by ~86%. Yet we still need to optimize the data for real-time processing.


This is #3 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    [ DONE ]
#2 - Let's Play: Cleanup   [ DONE ]
#3 - Let's Play: Optimize  <<
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...

Prelude

We follow two rules in the matter of optimization: Rule 1: Don’t do it. Rule 2 (for experts only): Don’t do it yet.

Michael A. Jackson, Principles of Program Design, 1975

Luckily, in any sane language, a list starts with Element #0.

Rule 0: Just this once.

Optimize

These are our coffee beans now:

[ "CN", "iPhone", "Board", "Collection List", 2, [ 4, 4 ], 6, [ "Home", "Board", "免费" ] ]

It is time we grind them. We do this in two ways:

  • Storing the data in a more suitable format,
  • Optimizing the layout.

Choice of Data Format

You can grind coffee beans differently for several types of coffee. Likewise, you can choose one among many storage formats according to your needs. I chose Google’s Protocol Buffers (Protobuf) because:

  • Binary format
  • Enum data type support
  • Great Python support
  • Actively maintained

You might as well choose Thrift or Avro. You may even choose to write your own binary format, which would be overkill for our purposes. A full evaluation of format options is out of scope, but I will say this much: JSON is particularly bad for this dataset, since, mathematically, much of the data has the characteristics of a finite set. That is why support for enums or actual sets, which JSON lacks, is a huge plus for this data. If you stick to JSON, you will probably still want to replace those strings with numbers and simulate enums manually, as sketched below. Finally, in case you choose Avro for some reason, I am sharing the initial Avro Schema I wrote before I settled on Protobuf.
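
For illustration, here is a minimal sketch of that manual enum simulation, assuming the cleaned-up row layout from the previous post. The names below are hypothetical, not part of our pipeline:

DEVICES = {'iPhone': 0, 'iPad': 1}
FTYPES  = {'App Top Banner'        : 0,
           'Collection Top Banner' : 1,
           'App Banner'            : 2,
           'Collection Banner'     : 3,
           'Collection List'       : 4,
           'Collection Video'      : 5}

def fake_enums(row):
    # Hypothetical helper: replace recurring strings with small ints,
    # e.g. 'iPhone' -> 0 and 'Collection List' -> 4.
    row[1] = DEVICES[row[1]]
    row[3] = FTYPES[row[3]]
    return row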

Format Conversion

This is pretty much what we are going to do now:

First of all, we need to define our schema. We start with a direct translation from the JSON version and go from there.

message feature_m
{
           date_m   date        = 1;
           string   country     = 2;
           device_e device      = 3;
           string   category    = 4;
           ftype_e  ftype       = 5;
           uint32   depth       = 6;
  repeated uint32   rows        = 7;
           uint32   position    = 8;
  repeated string   path        = 9;
}

This is a good start. Now let’s enumerate the ‘device’ member.

enum device_e
{
  IPHONE                        = 0;
  IPAD                          = 1;
}

… and the ‘feature type’ member. The possible options are listed on App Annie’s Feature History and Daily Features Report Guide & FAQ page.

enum ftype_e
{
  APP_TOP                       = 0;
  COL_TOP                       = 1;
  APP_BAN                       = 2;
  COL_BAN                       = 3;
  COL_LST                       = 4;
  COL_VID                       = 5;
}

Since we are going to combine datasets from multiple dates into a single database, we also need to store the ‘date’ in each feature.

message date_m
{
  int32 year                    = 1;
  int32 month                   = 2;
  int32 day                     = 3;
}

Here is the complete schema.

syntax = "proto3";
package x13;

message date_m {
  int32 year                    = 1;
  int32 month                   = 2;
  int32 day                     = 3;
}

enum device_e {
  IPHONE                        = 0;
  IPAD                          = 1;
}

enum ftype_e {
  APP_TOP                       = 0;
  COL_TOP                       = 1;
  APP_BAN                       = 2;
  COL_BAN                       = 3;
  COL_LST                       = 4;
  COL_VID                       = 5;
}

message feature_m {
           date_m   date        = 1;
           string   country     = 2;
           device_e device      = 3;
           string   category    = 4;
           ftype_e  ftype       = 5;
           uint32   depth       = 6;
  repeated uint32   rows        = 7;
           uint32   position    = 8;
  repeated string   path        = 9;
}

message store_m {
  repeated feature_m features   = 1;
}

We named it “x13_store.proto”.
Make sure protobuf is installed on your computer.
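If you need to install it, the Python runtime comes as a pip package, while the protoc compiler usually ships separately; these are typical commands, and exact package names vary by platform:

pip3 install protobuf                 # Python runtime bindings
sudo apt install protobuf-compiler    # protoc itself, e.g. on Debian/Ubuntu
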
Compile the schema for Python:

protoc --python_out=./ x13_store.proto

The compiler generates “x13_store_pb2.py”, which is now ready to import!
Create a “convert.py” and start importing.

import json, gzip
import x13_store_pb2 as x13s

Define a mapping between the JSON feature type strings and the ftype_e values.

ftypes = {'App Top Banner'        : x13s.APP_TOP,
          'Collection Top Banner' : x13s.COL_TOP,
          'App Banner'            : x13s.APP_BAN,
          'Collection Banner'     : x13s.COL_BAN,
          'Collection List'       : x13s.COL_LST,
          'Collection Video'      : x13s.COL_VID}

Next, a function to fill a feature_m instance from a given cleaned-up feature row.

def import_feature(data, y, m, d, feature):
    feature.date.year  = y
    feature.date.month = m
    feature.date.day   = d
    feature.country    = data[0]
    # data[1] is 'iPhone' or 'iPad'; the comparison yields False/True,
    # which indexes the list as 0/1.
    feature.device     = [ x13s.IPHONE, x13s.IPAD ][ data[1] == 'iPad' ]
    feature.category   = data[2]
    feature.ftype      = ftypes[data[3]]
    feature.depth      = data[4]
    feature.rows.extend(data[5])
    feature.position   = data[6]
    feature.path.extend(data[7])
    return feature

Create a store_m instance.

store = x13s.store_m()

Loop over all days and generate features.

y = 17
m = 8
for d in range(31):
    date      = "%02d%02d%02d" % (y, m, d+1)
    json_path = "json/" + date + '.json'

    with open(json_path) as data_file:
        data  = json.load(data_file)

    for f in data:
        import_feature(f, 2000+y, m, d+1, store.features.add())

Write the data to a protobuf file, gzipping as we go.

with gzip.open("store_sample.pbz", 'wb') as out_file:
    out_file.write(store.SerializeToString())
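
As an optional sanity check, you can read the file right back and make sure nothing was lost; protobuf messages support equality comparison, so a few lines suffice:

with gzip.open("store_sample.pbz", 'rb') as in_file:
    check = x13s.store_m()
    check.ParseFromString(in_file.read())

assert check == store  # the round trip should be lossless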

Save the file. Convert the data.

python3 convert.py

That’s it! Now all our data is in “store_sample.pbz”.
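
Out of curiosity, you can compare its size against the cleaned-up JSON with du. I will not quote numbers here; they depend entirely on your dataset:

du -hsc store_sample.pbz json/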

CONFIRM THE DATA

Start a Python notebook in Jupyter.
Do the usual imports.

import gzip
import pandas as pd
import x13_store_pb2 as x13
import matplotlib.pyplot as plt
import numpy as np
from functools import reduce
import seaborn as sns
%matplotlib inline
sns.set()

Read the newly created protobuf file.

with gzip.open('store_sample.pbz', 'rb') as f:
    data = x13.store_m()
    data.ParseFromString(f.read())

Create a Pandas DataFrame from the filtered data.

x = pd.DataFrame([[f.date.day, f.position, f.path] for f in data.features 
                  if f.depth == 2])

x.columns = ['Day', 'Position', 'Path']

x.head()

Voila!

	Day	Position	Path
0	3	3	[Home, Board, 無料]
1	3	4	[Home, Board, 免费]
2	3	15	[Home, Puzzle, 免费]
3	3	3	[Home, Board, 免费]
4	3	3	[Home, Board, Free]
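
If you want one more quick check before we move on to analysis, counting features per day over the full, unfiltered data is a pandas one-liner; the exact counts will depend on your dataset:

pd.Series([f.date.day for f in data.features]).value_counts().sort_index()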

Check back for Part 4 of the series! If you have any questions or advice, please comment below. Also, please tell me what the hell I am supposed to do with this data. Thank you!

Store Presence on App Store #2 – Let’s Play: Cleanup

In the first post of the series, we learnt how to gather the data; now we are going to do some cleanup.

This is #2 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    [ DONE ]
#2 - Let's Play: Cleanup   <<
#3 - Let's Play: Optimize  [ DONE ]
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...


Where Were We?

Using the methods in the Let’s Play #1, I gathered global, day-by-day, August 2017 data for the App Store presence of our latest game, Twiniwt.

Below is the size of the dataset:

[kenanb@6x13 twiniwt-1708]$ du -hsc *
4.0K 170801.json
4.0K 170802.json
1.5M 170803.json
1.5M 170804.json
1.5M 170805.json
728K 170806.json
728K 170807.json
...
684K 170830.json
732K 170831.json
23M total

Damn! Surely, it is time for cleanup and restructuring.

Scabbling and Cleanup

Let’s python3 and import the required packages.

import json, pprint

For this post, all we need is Python 3, which I presume you already have. Even though it is completely optional for the cleanup part of the series, I strongly suggest installing and using Jupyter Notebook for data exploration. It is especially useful when trying to restructure the data for your needs.
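
If you decide to install it, a single pip command is enough in a typical setup; your environment may differ:

pip3 install jupyter
jupyter notebook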

We saved our daily feature data to a subdirectory in our working directory.

A dataset corresponding to 2017-08-15 is saved as:

./json/170815.json

We want to loop over a period, say, each day in August 2017.

We need to generate the filepaths for the corresponding dates in the loop.

The year:

y = 17

The month:

m = 8

Loop over each day, generate the basename for the file, and concatenate the whole pathname.

for d in range(31):
    date = "%02d%02d%02d" % (y, m, d+1)
    file_name = "json/" + date + '.json'

Here, we read the dataset using the pathname we just created.

    with open(file_name) as data_file:
        data = json.load(data_file)

Then, we immediately bypass all the garbage branches in the serialized dataset and assign the contents of the key ‘rows’ to our data variable.
First, we need to have a look at the loaded data and find the path to ‘rows’.

{
  "data": {
    "data": {
      "pagination": {
        "current": 0,
        "page_interval": 1000,
        "sum": 1
      },
      "rows": [

	  ...

      ],
      "csvPermissionCode": "PERMISSION_NOT_PASS",
      "columns": [

	  ...

      ],
      "fixedColumns": {
        "tableWidth": 150,
        "fixed": 1
      }
    },
    "permission": true
  },
  "success": true
}

As you can see, this is what takes us to the ‘rows’:

    data = data['data']['data'].get('rows') or []

You probably noticed we didn’t simply do data['data']['data']['rows'], because that path might not even exist if, for some reason, your app is not in the store that day.

Restructure

Cool, we got rid of the immediate garbage; it’s time to clean up and restructure the actual row data.

Let’s see, this is a sample row in an App Annie Daily Featured response.

[
  [
    {
      "image": "https://static-s.aa-cdn.net/img/ios/...",
      "type": "icon",
      "thumb": "https://static-s.aa-cdn.net/img/ios/..."
    }
  ],
  [
    "China",
    "CN"
  ],
  "iPhone",
  "Board",
  "Collection List",
  "N/A",
  2,
  4,
  6,
  [
    {
      "existence": false,
      "detail": null,
      "label": "Featured Home"
    },
    {
      "existence": false,
      "detail": null,
      "label": "Board"
    },
    {
      "existence": true,
      "detail": {
        "position": [
          6
        ],
        "row": [
          4,
          4
        ]
      },
      "parent": "免费",
      "label": "Twiniwt"
    }
  ],
  [
    "N/A",
    0,
    100,
    ""
  ]
]

Most of the above is garbage for our purposes; we only keep the country code, device, category page, feature type, depth, row, position and feature path. The full details about the contents of the row are provided in the Store Presence on App Store #1 – Let’s Play: Scrape.

We traverse the data, removing the garbage values from each row array (in reverse order, of course, so the indices of the garbage entries do not change during deletion).

    for d in data:
        d.pop(10)  # Premium Content [GARBAGE]
        d.pop(5)   # Subtitle        [GARBAGE]
        d.pop(0)   # Creative        [GARBAGE]

The two-letter country code is enough; we don’t need the full country name in each element of our dataset.

        d[0] = d[0][1]

Shorten the ‘Featured Home’ category page name to simply ‘Home’.

        if ( d[2] == 'Featured Home' ): d[2] = 'Home'

The way ‘Featured Path’ is structured is pretty complex for our needs. Let’s restructure it.

        # Flatten the path into a plain list of labels.
        n = []
        for r in d[7]:
            n.append(r['label'])
        # The last entry is the app itself; its parent collection
        # is the name we actually want in the path.
        n[-1] = d[7][-1]['parent']
        if ( n[0] == 'Featured Home' ):
            n[0] = 'Home'
        # Mark '... see more' collections with a '>>' suffix instead.
        if ( n[-1].endswith('see more') ):
            n[-1] = n[-1][:-9]
            n.append('>>')
        # Replace the scalar row value with the row pair from the
        # path detail, then install the flattened path.
        d[5] = d[7][-1]['detail']['row']
        d[7] = n

I am skipping the details on this one, as you might want to keep it, or arrange it differently.
Below is the cleaned-up sample data we get after the process.

[
  "CN",
  "iPhone",
  "Board",
  "Collection List",
  2,
  [
    4,
    4
  ],
  6,
  [
    "Home",
    "Board",
    "免费"
  ]
]

Now that we are finished with the dataset cleanup, let’s write it back to the file. Note that ensure_ascii=False keeps the non-ASCII collection names (like ‘免费’) human-readable instead of escaping them.

    with open(file_name, 'w') as out_file:
        json.dump(data, 
                  out_file, 
                  indent=2, 
                  ensure_ascii=False,
                  sort_keys=True)

You can view and download the complete code gist below.

We scraped and cleaned up the data. It is now down to 3.4 MB from 23 MB, meaning we just got rid of garbage that amounted to ~86% of the dataset. Congratulations!

Yet the data is still unsuitable for real-time processing. We will fix that in the next part of the series. Some entities are long strings where we could get away with enumerations, things like that. And while we are at it, let’s get rid of this JSON nonsense, shall we?

Oh, I almost forgot the coffee beans!

Check back for Part 3 of the series! If you have any questions or advice, please comment below. Thank you!

Store Presence on App Store #1 – Let’s Play: Scrape

This is #1 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    <<
#2 - Let's Play: Cleanup   [ DONE ]
#3 - Let's Play: Optimize  [ DONE ]
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...

Preparing a business plan requires systematic procrastination: Store presence analytics!

All work and no play makes Jack a dull boy.

Preparing a business plan requires fourteen things: a business, a dozen cups of coffee and systematic procrastination.

I am sure you can handle the coffee and the business. So for now, I will only try to help with the procrastination:

I need some insight into the App Store presence of our latest game, Twiniwt. I visit App Annie’s Featured page, as usual.

If you don’t know about App Annie, this is a good review of the service.

The analytics you find on such dedicated services are good as general performance metrics. However, you might want the data to support a particular claim in your business plan. The freely available content is hardly useful for that. The signal-to-noise ratio makes it hard to read. Besides, you can only filter the data.

We have to do some ad-hoc data science and visualization in order to get better results. Now, what was data science, again?

I live in Istanbul and I prefer a budget GNU/Linux laptop at work. Obviously, I am no data scientist. Still, some half-assed data science is better than none.

Let’s play: STORE PRESENCE ANALYTICS!

This will require at least five steps:

  • Scrape: Gather the data and understand its current structure.
  • Cleanup: Restructure and delete irrelevant fields of the data.
  • Optimize: Optimize the restructured data to a format that is faster to batch process.
  • Analyze: Do filtering, mapping, grouping and analysis over the data.
  • Visualize: Plot and visualize the data frames.

We will dedicate a separate post to each of those titles. The data analysis and visualization steps can actually grow into multiple posts as we go.
We won’t delve into the subject of integration, yet we will rely on cross-platform, portable data formats. Besides, we will mainly use Python, which is a very sound choice for integration.

We won’t be concerned about storage either, as the data will already shrink significantly while we are optimizing for performance.

Before we begin, let’s look at some coffee beans and get hyped!

SCRAPE

First things first. If I want to analyze store presence data, I need the store presence data. I need it in a way I can process it. I am already familiar with App Annie’s daily feature tables. Here we go:

Go to your app’s page.

Go to User Acquisition >> Featured.

This is good stuff, even though the data itself is not. 😒

App Store Presence Feature Table
WTF are those app positions! 1714? Seriously?

Anyway.

When I click the Export button, I get this.

Connect this app? Hmm, why not!

Now, I am sure App Annie’s premium solution provides various great viewpoints and customizations of the data. However, being an indie studio and all, we cannot afford it.

Hey, there is another way! I can connect my app. But what do I see when I click Connect? It is asking for my App Store developer account password. I don’t care what encryption they use; there is no way I am filling in those Connect forms asking for my developer account password.

So, I guess I am on my own, but that is OK.

Be warned that I do not know if App Annie ever intended or actually permitted the use of the data they transfer to client computers in such a way. So I will not tell you what to do; I will just describe how one would access such freely available data in a structured data format.

Why not have a closer look at the transferred data in Firefox? Right-click anywhere on the page and click Inspect Element.

Go to the Network tab.

App Annie Featured Page

Now this is where you can track requests and responses. Whenever you change the table options, a new set of data will be sent to you, the client, in JSON format.

The entry that contains the data you asked for is something like this:

/ajax/ios/app/your-app-name/daily-feature...

That is the app’s daily store presence aggregated according to your preferences.

Saving Your Data

When you need to save a piece of data a web server sent you, simply right-click the corresponding entry in the Network tab, then do Copy >> Copy Response.

Open your favourite text editor and paste the clipboard contents into a new JSON file.
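
To make sure what you pasted is actually valid JSON, Python’s built-in json.tool is enough. The filename below follows the naming convention we will use in the rest of the series:

python3 -m json.tool json/170815.json > /dev/null && echo OK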

If you want to quickly do that for multiple data sets, you can instead Copy >> Copy as cURL and modify the date etc. in the copied curl command, as sketched below.
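
If you would rather script it, the same idea translates to a few lines of Python. Everything below is a placeholder sketch, not App Annie’s actual API: the URL, the date parameter and the headers are hypothetical, and you would paste the real values from your own Copy as cURL output.

import requests

# HYPOTHETICAL: replace URL and HEADERS with values from 'Copy as cURL'.
URL     = "https://www.appannie.com/ajax/ios/app/your-app-name/daily-feature/"
HEADERS = {"Cookie": "paste-your-session-cookie-here"}

for d in range(31):
    # The 'date' parameter name is a guess; check the real request.
    resp = requests.get(URL, params={"date": "2017-08-%02d" % (d + 1)},
                        headers=HEADERS)
    with open("json/1708%02d.json" % (d + 1), "w") as out_file:
        out_file.write(resp.text)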

DATA STRUCTURE

Let’s check out a sample store presence in an App Annie Daily Featured response.

[
  [
    {
      "image": "https://static-s.aa-cdn.net/img/ios/...",
      "type": "icon",
      "thumb": "https://static-s.aa-cdn.net/img/ios/..."
    }
  ],
  [
    "China",
    "CN"
  ],
  "iPhone",
  "Board",
  "Collection List",
  "N/A",
  2,
  4,
  6,
  [
    {
      "existence": false,
      "detail": null,
      "label": "Featured Home"
    },
    {
      "existence": false,
      "detail": null,
      "label": "Board"
    },
    {
      "existence": true,
      "detail": {
        "position": [
          6
        ],
        "row": [
          4,
          4
        ]
      },
      "parent": "免费",
      "label": "Twiniwt"
    }
  ],
  [
    "N/A",
    0,
    100,
    ""
  ]
]

Now, compare it to the table; not too carefully, though. (I couldn’t be bothered to find the exact same record. 😎)

App Store Presence Feature Table
1714? Why do you even put that into the table? That is just wasted network bandwidth.

Below is a breakdown of the row contents:

Creative [GARBAGE]
URLs to store creative for the app.
Country
The App Store country name as well as the two-letter ISO 3166-1 alpha-2 code for it.
Device
iPhone or iPad
Category Page
The featured category page where the placement is displayed.
Type
The feature type of the final placement displayed in the app store.
Subtitle [GARBAGE]
The text shown alongside feature banners and collection titles.
Depth
The number of steps necessary to see the final feature placement.
Row [DUPLICATE]
The final row number along the path leading to the feature placement.
Position
The final position number along the path leading to the feature placement.
Feature Path [DIRTY]
A detailed path of where the feature placement was shown in the app store.
Premium Content [GARBAGE]
Locked additional content, only activated for premium account queries.

Here is the cleaned-up version of the store presence data we want to work with.

[
  "CN",
  "iPhone",
  "Board",
  "Collection List",
  2,
  [
    4,
    4
  ],
  6,
  [
    "Home",
    "Board",
    "免费"
  ]
]

Awesome, but enough with the data scraping for today. I will show you how to clean up and restructure the JSON data to fit our needs in the next part.

Check back for Part 2 of the series! If you have any questions or advice, please comment below. Thank you!

Related Links

Check out the following official App Store guide to building your store presence:

Make the Most of the App Store