Store Presence on App Store #3 – Let’s Play: Optimize

In the last post, we cleaned up the data, achieving a size reduction of roughly 86%. We still need to optimize it for real-time processing.


This is #3 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    [ DONE ]
#2 - Let's Play: Cleanup   [ DONE ]
#3 - Let's Play: Optimize  <<
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...

Prelude

> We follow two rules in the matter of optimization:
> Rule 1: Don’t do it.
> Rule 2 (for experts only): Don’t do it yet.

(Michael A. Jackson, Principles of Program Design, 1975)

Luckily, in any sane language, a list starts with Element #0.

Rule 0: Just this once.

Optimize

These are our coffee beans now:

[ "CN", "iPhone", "Board", "Collection List", 2, [ 4, 4 ], 6, [ "Home", "Board", "免费" ] ]

It is time we grind them. We do this in two ways:

  • Storing the data in a more suitable format,
  • Optimizing the layout.

Choice of Data Format

You can grind coffee beans differently for several types of coffee. Likewise, you can choose one among many storage formats according to your needs. I chose Google’s Protocol Buffers (Protobuf) because:

  • Binary format
  • Enum data type support
  • Great Python support
  • Actively maintained

You might as well choose Thrift or Avro. You could even write your own binary format, though that would be overkill for our purposes. A full evaluation of format options is out of scope, but I will say this much: JSON is particularly bad for this dataset because, mathematically, much of the data has the characteristics of a finite set. That is why support for enums or actual sets, which JSON lacks, is a huge plus for this data. If you stick to JSON, you will probably still want to replace those strings with numbers and simulate enums manually. Finally, in case you choose Avro for some reason, I am sharing the initial Avro schema I wrote before I settled on Protobuf.
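If you do stay with JSON, the manual enum simulation mentioned above can be as simple as a pair of lookup tables. A minimal sketch (the string labels come from the dataset; the numeric codes here are arbitrary stand-ins):

```python
import json

# Forward table: feature-type label -> small integer code.
FTYPE_CODES = {'App Top Banner'        : 0,
               'Collection Top Banner' : 1,
               'App Banner'            : 2,
               'Collection Banner'     : 3,
               'Collection List'       : 4,
               'Collection Video'      : 5}

# Reverse table for decoding.
FTYPE_LABELS = {v: k for k, v in FTYPE_CODES.items()}

row = ["CN", "iPhone", "Board", "Collection List", 2, [4, 4], 6,
       ["Home", "Board", "免费"]]

# Encode: replace the repeated string with its code before dumping.
row[3] = FTYPE_CODES[row[3]]
encoded = json.dumps(row, ensure_ascii=False)

# Decode: restore the label after loading.
decoded = json.loads(encoded)
decoded[3] = FTYPE_LABELS[decoded[3]]
```

This buys back some of the enum savings, but you still lose the schema enforcement and binary packing that Protobuf gives you for free.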

Format Conversion

This is pretty much what we are going to do, now:

First of all, we need to define our schema. We start with a direct translation from the JSON version, and then we can go from there.

message feature_m
{
           date_m   date        = 1;
           string   country     = 2;
           device_e device      = 3;
           string   category    = 4;
           ftype_e  ftype       = 5;
           uint32   depth       = 6;
  repeated uint32   rows        = 7;
           uint32   position    = 8;
  repeated string   path        = 9;
}

This is a good start. Now let’s enumerate the ‘device’ member.

enum device_e
{
  IPHONE                        = 0;
  IPAD                          = 1;
}

… and the ‘feature type’ member. The possible options are listed in App Annie’s Feature History and Daily Features Report Guide & FAQ page.

enum ftype_e
{
  APP_TOP                       = 0;
  COL_TOP                       = 1;
  APP_BAN                       = 2;
  COL_BAN                       = 3;
  COL_LST                       = 4;
  COL_VID                       = 5;
}

Since we are going to combine datasets from multiple dates into a single database, we also need to store the date in each feature.

message date_m
{
  int32 year                    = 1;
  int32 month                   = 2;
  int32 day                     = 3;
}

Here is the complete schema.

syntax = "proto3";
package x13;

message date_m {
  int32 year                    = 1;
  int32 month                   = 2;
  int32 day                     = 3;
}

enum device_e {
  IPHONE                        = 0;
  IPAD                          = 1;
}

enum ftype_e {
  APP_TOP                       = 0;
  COL_TOP                       = 1;
  APP_BAN                       = 2;
  COL_BAN                       = 3;
  COL_LST                       = 4;
  COL_VID                       = 5;
}

message feature_m {
           date_m   date        = 1;
           string   country     = 2;
           device_e device      = 3;
           string   category    = 4;
           ftype_e  ftype       = 5;
           uint32   depth       = 6;
  repeated uint32   rows        = 7;
           uint32   position    = 8;
  repeated string   path        = 9;
}

message store_m {
  repeated feature_m features   = 1;
}

We named it “x13_store.proto”.
Make sure the Protobuf compiler (protoc) and its Python runtime are installed on your computer.
Compile the schema for Python:

protoc --python_out=./ x13_store.proto

The compiler generates “x13_store_pb2.py”, which is now ready to import!
Create a “convert.py” and start with the imports.

import json, gzip
import x13_store_pb2 as x13s

Define a mapping from the JSON feature-type strings to the ftype_e values.

ftypes = {'App Top Banner'        : x13s.APP_TOP,
          'Collection Top Banner' : x13s.COL_TOP,
          'App Banner'            : x13s.APP_BAN,
          'Collection Banner'     : x13s.COL_BAN,
          'Collection List'       : x13s.COL_LST,
          'Collection Video'      : x13s.COL_VID}

Define a function that fills a feature_m instance from a given feature record.

def import_feature(data, y, m, d, feature):
    feature.date.year  = y
    feature.date.month = m
    feature.date.day   = d
    feature.country    = data[0]
    feature.device     = x13s.IPAD if data[1] == 'iPad' else x13s.IPHONE
    feature.category   = data[2]
    feature.ftype      = ftypes[data[3]]
    feature.depth      = data[4]
    feature.rows.extend(data[5])
    feature.position   = data[6]
    feature.path.extend(data[7])
    return feature
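Note that import_feature fills the message in place and also returns it, which is what lets it compose with store.features.add() below. Its mapping logic can be exercised without the generated module by feeding it a stand-in object of the same shape; the SimpleNamespace stub and the integer enum values here are assumptions for illustration only:

```python
from types import SimpleNamespace

# Stand-ins for the generated enum values (illustrative only).
IPHONE, IPAD = 0, 1
ftypes = {'Collection List': 4}          # abridged ftype table

def import_feature(data, y, m, d, feature):
    feature.date.year  = y
    feature.date.month = m
    feature.date.day   = d
    feature.country    = data[0]
    feature.device     = IPAD if data[1] == 'iPad' else IPHONE
    feature.category   = data[2]
    feature.ftype      = ftypes[data[3]]
    feature.depth      = data[4]
    feature.rows.extend(data[5])
    feature.position   = data[6]
    feature.path.extend(data[7])
    return feature

# Stub object mirroring the attribute shape of the generated message.
stub = SimpleNamespace(date=SimpleNamespace(year=0, month=0, day=0),
                       country='', device=0, category='', ftype=0,
                       depth=0, rows=[], position=0, path=[])

row = ["CN", "iPhone", "Board", "Collection List", 2, [4, 4], 6,
       ["Home", "Board", "免费"]]
f = import_feature(row, 2017, 8, 3, stub)
```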

Create a store_m instance.

store = x13s.store_m()

Loop over all days and generate features.

y = 17
m = 8
for d in range(31):
    date      = "%02d%02d%02d" % (y, m, d+1)
    json_path = "json/" + date + '.json'

    with open(json_path) as data_file:
        data  = json.load(data_file)

    for f in data:
        import_feature(f, 2000+y, m, d+1, store.features.add())

Write the data to a protobuf file, gzipping as we go.

with gzip.open("store_sample.pbz", 'wb') as out_file:
    out_file.write(store.SerializeToString())
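Gzipping on top of Protobuf is worthwhile here because store-presence records are highly repetitive across days. A self-contained demonstration of that effect on a repeated sample row (not the real dataset, and JSON rather than Protobuf, but the repetition is what matters):

```python
import gzip, json

# One sample row repeated 1000 times stands in for a month of features.
sample = [["CN", "iPhone", "Board", "Collection List", 2, [4, 4], 6,
           ["Home", "Board", "免费"]]] * 1000

raw    = json.dumps(sample, ensure_ascii=False).encode('utf-8')
packed = gzip.compress(raw)

# gzip collapses the repetition dramatically.
print(len(packed), '/', len(raw))
```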

Save the file. Convert the data.

python3 convert.py

That’s it! Now all our data is in “store_sample.pbz”.

Confirm the Data

Start a Python notebook in Jupyter.
Do the usual imports.

import gzip
import pandas as pd
import x13_store_pb2 as x13
import matplotlib.pyplot as plt
import numpy as np
from functools import reduce
import seaborn as sns
%matplotlib inline
sns.set()

Read the newly created protobuf file.

with gzip.open('store_sample.pbz', 'rb') as f:
    data = x13.store_m()
    data.ParseFromString(f.read())

Create a Pandas DataFrame from filtered data.

x = pd.DataFrame([[f.date.day, f.position, f.path] for f in data.features 
                  if f.depth == 2])

x.columns = ['Day', 'Position', 'Path']

x.head()

Voilà!

	Day	Position	Path
0	3	3	[Home, Board, 無料]
1	3	4	[Home, Board, 免费]
2	3	15	[Home, Puzzle, 免费]
3	3	3	[Home, Board, 免费]
4	3	3	[Home, Board, Free]
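With the data in a DataFrame, sanity checks become one-liners. A sketch on a stand-in frame shaped like the head() output above (the real frame comes straight from data.features):

```python
import pandas as pd

# Stand-in frame mirroring the head() output.
x = pd.DataFrame({'Day'      : [3, 3, 3, 3, 3],
                  'Position' : [3, 4, 15, 3, 3],
                  'Path'     : [['Home', 'Board', '無料'],
                                ['Home', 'Board', '免费'],
                                ['Home', 'Puzzle', '免费'],
                                ['Home', 'Board', '免费'],
                                ['Home', 'Board', 'Free']]})

# Best (lowest) position reached per day.
best = x.groupby('Day')['Position'].min()
```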

Check back for Part 4 of the series! If you have any questions or advice, please comment below. Also, please tell me what the hell to do with this data. Thank you!

Published by

Kenan Bölükbaşı

Founder, Project Leader and Developer at 6x13 Games. Game Developer & Designer, CG Generalist, Architect. Theoretical and applied knowledge in programming, design and media. Broad experience in project management. Experience in 3D (mesh, solid & CAD), 2D (raster, vector), and parametric graphics as well as asset pipelines and tools development. Blender 3D specialist (Blender Foundation Certified Trainer).
