Multi-Resolution Asset Workflow Automation

Now that we know how to do multi-resolution and asset scaling, I will show you how to extend your software stack to export multi-resolution assets.

The raster part is easy. You work with high resolution and scale down.

I simply use ImageMagick for this, because I am not at all happy with the quality of the downscaling implementations provided by other tools. Of course, maybe they have gotten better over time, but I do not find it worth testing every so often.

This task usually requires too much manual involvement, so I will not provide a batch method either.

convert -resize 50% /large/asset.png /medium/asset.png
convert -resize 25% /large/asset.png /small/asset.png

Easy.

For the vector assets, we can go for the comfort of a custom Inkscape extension.

Our extension will be placed in Extensions > Export > Sprite, and the UI will look like this:

Directory is the root visual asset directory in your project. In Twiniwt, that is ~/dev/twiniwt/Resources/res.

If you activate Context, you can select whether the element is a UI or a Game element.

When it is disabled, the extension will export the following assets:

~/dev/twiniwt/Resources/res/large/sprite.png
~/dev/twiniwt/Resources/res/medium/sprite.png
~/dev/twiniwt/Resources/res/small/sprite.png

When Context is enabled and UI Element is selected, the extension will export the following assets:

~/dev/twiniwt/Resources/res/large/ui/sprite.png
~/dev/twiniwt/Resources/res/medium/ui/sprite.png
~/dev/twiniwt/Resources/res/small/ui/sprite.png

That’s all. Almost all our production assets in Twiniwt are exported using this simple tool.

You can download the extension as a zip file. I am placing it in the Public Domain. Place the contents in the extensions subdirectory of your Inkscape home directory. On GNU/Linux, this is ~/.config/inkscape/extensions/. It is probably the same on macOS. Unfortunately, I haven’t even tested this extension on Windows, but I am sure it will work out-of-the-box once you locate the directory.

However, learning how to modify it, and writing your own production tools is more important.

Implementation

An Inkscape extension with a GUI requires two files. An XML formatted “.inx” file, and the actual implementation module.

Let’s first write the interface and specify the metadata.
These go into our sprite.inx file.

<?xml version="1.0" encoding="UTF-8"?>
<inkscape-extension xmlns="http://www.inkscape.org/namespace/inkscape/extension">
 <_name>Sprite</_name>
 <id>org.inkscape.sprite</id>
 <dependency type="extension">org.inkscape.output.svg.inkscape</dependency>
 <dependency type="executable" location="extensions">sprite.py</dependency>
 <dependency type="executable" location="extensions">inkex.py</dependency>
 <param name="directory" type="string" _gui-text="Directory to save images to:">~/</param>
 <param name="image" type="string" _gui-text="Image name (without extension):">sprite</param>
 <param name="has_context" type="boolean" _gui-text="Context:">false</param>
 <param name="context" type="optiongroup" _gui-text="Select context:" appearance="minimal">
 <_option value="ui">UI Element</_option>
 <_option value="game">Game Element</_option>
 </param>
 <effect needs-live-preview="false">
 <object-type>all</object-type>
 <effects-menu>
 <submenu _name="Export"/>
 </effects-menu>
 </effect>
 <script>
 <command reldir="extensions" interpreter="python">sprite.py</command>
 </script>
</inkscape-extension>

Now, the actual implementation, named sprite.py. The boilerplate:

#!/usr/bin/env python

import os
import sys
sys.path.append('/usr/share/inkscape/extensions')
try:
    from subprocess import Popen, PIPE
    bsubprocess = True
except ImportError:
    bsubprocess = False
import inkex

The class and the option parsers.

class Sprite(inkex.Effect):
    def __init__(self):
        inkex.Effect.__init__(self)
        self.OptionParser.add_option("--directory", action="store",
                                        type="string", dest="directory",
                                        default=None, help="")

        self.OptionParser.add_option("--image", action="store",
                                        type="string", dest="image",
                                        default=None, help="")

        self.OptionParser.add_option("--has_context", action="store",
                                        type="string", dest="has_context",
                                        default=None, help="")

        self.OptionParser.add_option("--context", action="store",
                                        type="string", dest="context",
                                        default=None, help="")

The utility methods.

    def get_filename_parts(self):
        if self.options.image == "" or self.options.image is None:
            inkex.errormsg("Please enter an image name")
            sys.exit(0)
        return (self.options.directory, self.options.image)

    def check_dir_exists(self, dir):
        if not os.path.isdir(dir):
            os.makedirs(dir)

Exporter for one asset:

    def export_sprite(self, filename, dpi):
        svg_file = self.args[-1]
        command = "inkscape -e \"%s\" -d \"%s\" \"%s\" " % (filename, dpi, svg_file)
        if bsubprocess:
            p = Popen(command, shell=True, stdout=PIPE, stderr=PIPE)
            return_code = p.wait()
            f = p.stdout
            err = p.stderr
        else:
        _, f, err = os.popen3(command)
        f.close()

Exporter for all requested resolutions:

    def export_sprites(self, assets):
        dirname, filename = self.get_filename_parts()
        output_files = list()
        if dirname == '' or dirname is None:
            dirname = './'
        dirname = os.path.expanduser(dirname)
        dirname = os.path.expandvars(dirname)
        dirname = os.path.abspath(dirname)
        if dirname[-1] != os.path.sep:
            dirname += os.path.sep
        for directory, scale in assets.items():
            dpi = 96 * scale
            asset_dirname = dirname + directory + os.path.sep
            if self.options.has_context == 'true':
                asset_dirname = asset_dirname + self.options.context + os.path.sep
            self.check_dir_exists(asset_dirname)
            f = asset_dirname + filename + ".png"
            output_files.append(f)
            self.export_sprite(f, dpi)
        inkex.errormsg("The sprites have been saved as:" + "\n\n" + "\n".join(output_files))

This is where we define the resolutions we ask for. Change assets to suit your own needs if you like.

    def effect(self):
        assets = {"small": 1, "medium": 2, "large": 4}
        self.export_sprites(assets)
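To see what those scale values mean in pixels: export_sprites passes dpi = 96 × scale to Inkscape. Assuming a 570×360 px SVG document at Inkscape’s default 96 dpi (the Safe Area reference size), a quick sketch of the mapping:

```python
# Illustration only: how the dpi handed to Inkscape maps to output pixel
# sizes, assuming a 570x360 px document at the default 96 dpi.
BASE_DPI = 96
DOC_SIZE = (570, 360)

assets = {"small": 1, "medium": 2, "large": 4}

for directory, scale in sorted(assets.items(), key=lambda kv: kv[1]):
    dpi = BASE_DPI * scale
    w, h = DOC_SIZE[0] * scale, DOC_SIZE[1] * scale
    print("%-6s dpi=%3d -> %dx%d px" % (directory, dpi, w, h))
```

So “large” exports at 384 dpi, four times the document’s nominal definition.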

The entry point to the extension.

if __name__ == "__main__":
    e = Sprite()
    e.affect()

If anything in this post confuses you, please check out the how to do multi-resolution and asset scaling post. Also, please reply in comments below if you have any questions, ideas or fixes to the extension.

Good luck!

How to Set Asset Scaling and Resolution for 2D Games

When I shared the post about Optimized Parallax Backgrounds, I got asked how our asset resolution and scaling system works.

Dealing with asset scaling in 2D games can be confusing, to say the least. There are many ways to handle it. Which one to choose depends on what you expect. The problem is, you usually do not know what to expect from a responsive interface.

So beware: most scaling strategies don’t respond well to display ratio changes. They will pretend to work on your test device, then they will fail you so badly you will wish you had bookmarked this blog post earlier, which is now. Seriously.

I will now explain one asset scaling strategy that has worked great for us in all 4 games we have released so far. It is not my invention. I found the original method here, and it is also influenced by this Cocos2d-x forum post, which uses a different approach to achieve similar behaviour.

Note: In order to make things easier to grasp, I will pretend your game always works full screen regardless of device type. For windowed-mode on a PC, interpret the term “display resolution” as “window resolution”.

Now, there are four things you expect your engine to handle for you.

  • Choosing the most suitable set of graphics according to display resolution.
  • Globally scaling the selected graphics to fit the screen.
  • Letting you ignore all these and totally forget about the display while coding the game itself.
  • Earning you money, success, and preferably a slice of New York Cheesecake.

Sadly, at 6×13 Games, we couldn’t come up with a reliable way for the engine to handle the last one, either. So I will only explain the asset scaling part.

We will use Cocos2d-x engine for the examples and rely on its terminology. But Cocos only provides a basic set of transform policies, so the same idea applies to any 2D engine. The engine source is available, so you can implement the missing stuff in your own engine as well.

Resolution Policy

This actually has nothing to do with resolution. It is about framing.

Cocos2d-x will autoscale the whole scene to somehow fit the frame. Resolution policy is where you tell the engine what your understanding of “fit” is, what kind of behaviour you expect from it. The options are:

  • Exact Fit
  • No Border
  • Show All
  • Fixed Height
  • Fixed Width

I know, you need No Border!

Maybe you do. But let me help you, you really don’t. What you really want is a little more complicated than that.

You do not want a weird display aspect ratio messing with your precious interaction area, making your UI buttons too small to press. That is what you get with No Border. And that is why we are rolling a better solution.

So, keep reading.

Safe Area

You need a safe area! An area that is not only guaranteed to be shown to the user, but also guaranteed to cover as much screen space as possible. So I present to you, the safe area:

Anything outside the yellow area is just decoration that prevents the user from seeing black borders. You never, ever put something that the user really needs to see outside the Safe Area.

This is the response we expect from our framework.

The safe area is the center 480×320 units portion of our 570×360 units game area -not pixels, units. It has an aspect ratio of 1.5f.

How do we guarantee that?

We calculate the reference axis first. Then we either choose the Fixed Height, or the Fixed Width policy, according to the reference axis. If the Reference Axis is y-axis, we choose Fixed Height, otherwise Fixed Width.

Whatever our Reference Axis is, it better be Fixed.

I believe the above animation clearly shows what a Reference Axis is. Below is what it mathematically means:

float aspect_ratio = display_res.width / display_res.height;

if ( 1.5f > aspect_ratio )
{
    // Reference Axis = X_AXIS
    // ...
}
else
{
    // Reference Axis = Y_AXIS
    // ...
}
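The same decision can be sketched in Python; the 1.5 constant is the safe area’s 480/320 ratio, and the return values just mirror the names of Cocos2d-x’s ResolutionPolicy enum:

```python
# Reference-axis decision, mirroring the C++ snippet above. A display
# narrower than the 1.5 safe-area ratio makes x the reference axis, so we
# fix the width; otherwise y is the reference axis and we fix the height.
SAFE_ASPECT = 480.0 / 320.0  # = 1.5

def resolution_policy(width, height):
    if width / height < SAFE_ASPECT:
        return "FIXED_WIDTH"   # reference axis: x
    return "FIXED_HEIGHT"      # reference axis: y

print(resolution_policy(1024, 768))   # 4:3 is narrower than 1.5
print(resolution_policy(1920, 1080))  # 16:9 is wider than 1.5
```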

You can download our Safe Area reference SVG as well as three sample sizes exported, as a single zip file.

Now that we have decided how the engine should behave, it needs to know what portion of the screen to use for that behaviour. In Cocos2d-x terms, this is the Design Size.

Design size

Scenes are sizeless. They are just a Cartesian space with an origin, and possibly some stupid protagonist running around, trying to rescue the damsel who can actually take care of herself. Go away, creepy protagonist!

Anyway. In order for Cocos to fit the scene contents into the frame, it needs to know which portion of the scene we actually consider to be “the scene”. Hence, the Design Size.

In our case, Design Size is the dimensions of our Safe Area.

The moment we set the Design Size and Resolution Policy, the engine will start acting like an adult. Below is how it reacts to both ratio and physical size changes.

Cool animation, right? I made it in Blender, yay!

To sum up:

  • Cocos will scale to fit.
  • Resolution Policy tells it “how” to fit.
  • Design size tells it “what” to fit.

Great, it works! That’s it!

Or not.

Multi-Resolution Support

This would be the happy ending of our asset scaling adventure, if the only crappy, non-standardized piece of hardware in our way were the display. Far from it.

We also need our runtime assets to be fast to process, fit the video memory nicely, and be affected by the least amount of aliasing possible during scaling.

All of them require one thing: having multiple versions of our assets and picking the set of assets to load according to the display resolution, at runtime.

More precisely, we want the density -the definition- of our assets to be close to the expectations of the particular hardware. Because if your mobile device has a Standard Definition display, chances are it also has the video processing power and memory that can only handle Standard Definition, or less. Mobile manufacturers rarely skip the leg day.

Also, there are many algorithms for image scaling, with varying quality and performance characteristics. You want your realtime scaling to be of the fastest kind, which also means lower quality. Therefore, it’s best to have your images prescaled to -or reproduced at- the closest size.

So, we need to support multiple resolutions as part of our asset scaling strategy.

In order to do that, we put each set of assets in a different resource directory.

We use three size variants, and a directory for each: “small”, “medium”, and “large”.

We will pick the best possible size according to the display dimension of our Reference Axis, in pixels. After that, there is no special procedure to “pick” the directory, you simply add that particular directory to the Resource Search Path of your engine, omitting the others so they will not even be visible to the engine.

Content Scale Factor

We also associate a scale factor with each of those directories, so that the engine knows how the assets map to the design size. It is simple. For example, if the assets in “medium” are twice the definition of your Design Size, the scale is 2. I do that in a resource JSON file that holds other information as well. The relevant parts of the file look like this:

{
    "version": [
        1,
        0,
        0
    ],
    "graphics": {
        "design": {
            "size": {
                "width": 480,
                "height": 320
            },
            "full": {
                "width": 570,
                "height": 360
            }
        },
        "assets": [
            {
                "scale": 1,
                "directory": "small"
            },
            {
                "scale": 2,
                "directory": "medium"
            },
            {
                "scale": 4,
                "directory": "large"
            }
        ]
    }
}

You are free to hardcode those.
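Restated outside C++, the selection rule is: pick the largest asset set whose scale times the safe size still fits under the display’s reference-axis dimension. A Python sketch of that rule, using the values from the JSON above:

```python
# Asset-set selection sketch, assuming the JSON layout shown above.
# Pick the largest set whose scaled safe size still fits the reference axis.
GFX = {
    "design": {"size": {"width": 480, "height": 320}},
    "assets": [
        {"scale": 1, "directory": "small"},
        {"scale": 2, "directory": "medium"},
        {"scale": 4, "directory": "large"},
    ],
}

def pick_assets(width, height):
    if width / height < 1.5:                 # reference axis: x
        ref, safe = width, GFX["design"]["size"]["width"]
    else:                                    # reference axis: y
        ref, safe = height, GFX["design"]["size"]["height"]
    if ref > GFX["assets"][1]["scale"] * safe:
        return GFX["assets"][2]
    if ref > GFX["assets"][0]["scale"] * safe:
        return GFX["assets"][1]
    return GFX["assets"][0]

print(pick_assets(480, 320))    # exactly 1x the safe size -> "small"
print(pick_assets(2436, 1125))  # tall phone, reference axis y -> "large"
```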

There is an x13_gfx_s structure instance in our game data that holds graphics properties:

typedef struct
{
    // Scale
    int scale;

    // Safe size and full size
    vec2_s safe, full;

    // Directory
    std::string dir;

} x13_gfx_s;

We have an init function that fills it.

void
 x13_init_gfx( float width, float height )
{
    float aspect_ratio    = width / height;
    rapidjson::Value &gfx = x13_data.res.dom[ "graphics" ];

    x13_data.gfx.safe.x = gfx[ "design" ][ "size" ][ "width" ].GetUint( );
    x13_data.gfx.safe.y = gfx[ "design" ][ "size" ][ "height" ].GetUint( );
    x13_data.gfx.full.x = gfx[ "design" ][ "full" ][ "width" ].GetUint( );
    x13_data.gfx.full.y = gfx[ "design" ][ "full" ][ "height" ].GetUint( );
    int unsigned resource_index;

    if ( 1.5f > aspect_ratio )
    {
        if ( width
             > gfx[ "assets" ][ 1u ][ "scale" ].GetUint( ) * x13_data.gfx.safe.x )
        {
            resource_index = 2u;
        }
        else if ( width > gfx[ "assets" ][ 0u ][ "scale" ].GetUint( )
                           * x13_data.gfx.safe.x )
        {
            resource_index = 1u;
        }
        else
        {
            resource_index = 0u;
        }
    }
    else
    {
        if ( height
             > gfx[ "assets" ][ 1u ][ "scale" ].GetUint( ) * x13_data.gfx.safe.y )
        {
            resource_index = 2u;
        }
        else if ( height > gfx[ "assets" ][ 0u ][ "scale" ].GetUint( )
                            * x13_data.gfx.safe.y )
        {
            resource_index = 1u;
        }
        else
        {
            resource_index = 0u;
        }
    }

    x13_data.gfx.dir.assign(
     gfx[ "assets" ][ resource_index ][ "directory" ].GetString( ) );
    x13_data.gfx.scale = gfx[ "assets" ][ resource_index ][ "scale" ].GetInt( );
}

Don’t worry about the JSON parsing details. Concentrate on the if statement. We call it in the Cocos2d-x specific AppDelegate::applicationDidFinishLaunching method, and then we:

  • Set design size
  • Set resolution policy
  • Set content scale factor
  • Add selected asset directory to Search Path

bool
 AppDelegate::applicationDidFinishLaunching( )
{
    // ...

    // Set OpenGLView

    // ...

    auto frame_size = glview->getFrameSize( );
    x13_init_gfx( frame_size.width, frame_size.height );

    // Set design resolution and determine fix policy.
    glview->setDesignResolutionSize(
     x13_data.gfx.safe.x,
     x13_data.gfx.safe.y,
     ( 1.5f > ( frame_size.width / frame_size.height ) )
      ? ResolutionPolicy::FIXED_WIDTH
      : ResolutionPolicy::FIXED_HEIGHT );
    // Set content scaling factor.
    director->setContentScaleFactor( x13_data.gfx.scale );
	
    std::string res_path = "res";
    file_utils->addSearchPath( res_path );   // For scene description files.
    file_utils->addSearchPath( res_path + "/audio" );
    file_utils->addSearchPath( res_path + "/fonts" );
    file_utils->addSearchPath( res_path + "/" + x13_data.gfx.dir + "/ui" );
    file_utils->addSearchPath( res_path + "/" + x13_data.gfx.dir + "/game" );
    file_utils->addSearchPath( res_path + "/generic" );

    // ...
    
    return true;
}

From this point on, we can totally forget about asset selection, resolution and PPI variations, retina displays, weird aspect ratios, etc.

We just pretend the screen size is 480×320 when coding, except for the few times when we might want to put some fancy, decorative animations outside the Safe Area.

Below is how the engine handles everything.

If you need more information about the above subjects, the following outdated docs from Cocos2d-x wiki are still relevant.

Detailed explanation of Cocos2d-x Multi-resolution adaptation

Multi resolution support


Ok, you are all set! Now you need to fill your scenes with beautiful assets.

~Pixel-Perfect Assets

Applying Safe Area strategy to your asset production is pretty straightforward. Just grab the reference assets I shared above, and comply with the boundaries.

But if you really want to preserve every bit of quality you can in a generic way, I have two more recommendations for you.

First, not everything has to be pixel perfect. Rather, try to keep your content scaling uniform among all the assets, because that will, in turn, keep the distortion and aliasing characteristics uniform. No one ever died from a little less definition. We lived just fine watching Video CDs for years, after all.

Remember?

However, watching a VCD side-by-side with a 4K movie now? That would have lasting effects. The point is, don’t let the player’s eyes compare asset densities. Keep it uniform.

Second, raster and vector assets require opposite treatment. You use raster graphics for more organic asset types, which turn out better if you work with high resolution sources and scale down. Work at exactly four times the size of your biggest production asset version, and scale by 50%, 25%, and 12.5%. So if you used our asset scheme, your background textures would have the following attributes:

FINAL ASSETS:
small/bg.png  PNG image data, 570 x 360, 8-bit/color RGBA
medium/bg.png PNG image data, 1140 x 720, 8-bit/color RGBA
large/bg.png  PNG image data, 2280 x 1440, 8-bit/color RGBA

SOURCE:
source/bg.png PNG image data, 4560 x 2880, 8-bit/color RGBA
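The chain is easy to sanity-check with a few lines of arithmetic, using the sizes from the listing above:

```python
# Double-checking the halving chain: work at 4x the largest production
# size, then scale by 50%, 25%, and 12.5%.
source = (4560, 2880)
chain = {"large": 0.5, "medium": 0.25, "small": 0.125}

final = {name: (int(source[0] * f), int(source[1] * f))
         for name, f in chain.items()}

for name in ("small", "medium", "large"):
    print("%s/bg.png -> %dx%d" % (name, final[name][0], final[name][1]))
```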

Easy! Everyone already knows that.

Now, with vector graphics, you want almost the exact opposite. You want your source asset dimensions to be 570x360px, precisely the size of your smallest production asset variant. And you want to export for each resolution one by one -not export once and scale up. Because if a line is on a pixel border in your smallest resolution, it will always be on a pixel border at every resolution, as long as you keep doubling the resolution. This guarantees pixel-perfect output.

Of course, you can use the same method for raster graphics as well, but with raster graphics, the priority is seldom the quality at the pixel level.

Lastly, exporting for multiple resolutions is boring labor. If you want to automate your multi-resolution asset workflow with custom GUI tools, please check out my new post: Multi-Resolution Asset Workflow Automation.

Alright! That was long. It took me two full days to prepare this post. So please, do not hesitate to share and comment. Especially if you tried the above method in your games and had problems, or success, drop me a line. Good luck!

How to Estimate Task Durations

You are always expected to estimate task durations. It is especially hard to do for development tasks. You get better over time, but game projects frequently put you in unfamiliar territory.

Add realistic error

Unfortunately, how much error is realistic depends on many factors. If you actually need to retrieve that information from a blog post, we can safely say it is more than 2x in your case.

Say, your initial estimation is one month. If you have done it before, stick to your estimation. Otherwise, 2 months.

Say your estimation was 3 months. Your performance will consistently decrease. If you are experienced, expect two weeks of delay. Otherwise, 8-9 months; there is no way an inexperienced developer can plan 3 months of work accurately.

Now, those were only for single person tasks.

Let’s say it is the first task of your newly formed team and you are the fresh, inexperienced lead. Your initial estimation was 4 months, and you completely overlooked the upcoming storming stage. Everyone seems bright and gets along really well, after all.

Now, when it begins, you don’t want to be out in the open. So what you do is get behind the junior artist; since Wacom displays are expensive, nobody will dare throw things at you.

The project will likely end up in development hell and get cancelled, but not before everybody hates you. The bottom line is, always take the stages of team development and decreases in performance into account.

Don’t stare into the abyss

You realized you don’t know how to do X. It is OK. Don’t stress over it. The unknown seems so hard that it makes everything else look simpler. They are not simpler. Don’t let the fuzzy parts of the task completely drive your estimation.

Visualize the steps

Don’t just outline the task and focus on the duration of each bullet. Visualize the complete pipeline instead. For a second, be overwhelmed by the amount of minor labor and gluing needed to complete it.

Reorganize the task

Don’t worry. Your estimations will get better. But that is mainly because, with experience, you learn how to reorganize and regroup tasks in ways that make them more meaningful from a task management point of view.

Optimized Parallax Backgrounds

While I was designing the visuals for Twiniwt, I wanted various parallax animations for the background, but without blowing up the game size.

We value keeping the game size as small as possible, because not all parts of the world share the same network bandwidth privileges, yet everyone deserves the privilege of having a little fun. Also, there is something inherently uncomfortable about the idea of a 100MB puzzle game. But they are not only games, are they? It is interesting to what lengths the freemium model has to go to become profitable. Anyway.

Here we go.

We need a simple background first.

It is very heavily blurred, which also helps with quantizing and dithering the image, and storing it as a colormap.

Now we need maps to use as parallax layers. They all have to be seamless along the x-axis. Three layers for the silhouette, one for modifying the silhouette with fog.

We compose all this information to a single image. The image looks like this when channels are composed.

I used DXT-5 compressed DDS files. If you use PNG as your final asset format, or export to PNG at some stage, be aware that your graphics suite might try to ignore color information of fully transparent pixels, which effectively destroys the asset.

We will also need a vignette map to divide the final color by, in order to nicely frame the composition.

All layers in place, it looks like this:

Here is the GLSL fragment shader I wrote for Cocos2d-x. Cocos2d-x automatically prefixes the shader code with some convenience definitions, but the idea is there if you want to use it in another engine.

// Copyright (C) 2017 Kenan Bölükbaşı - 6x13 Games

#ifdef GL_ES
precision lowp float;
#endif

varying vec2 v_texCoord;
uniform mat4 u_parallax;

vec4
 lookup( vec4 pd )
{
    return texture2D( CC_Texture1,
                      vec2( v_texCoord.s + mod( pd.w * CC_Time[ 0 ], 1. ),
                            v_texCoord.t * 2. - 1. ) );
}

void
 main( void )
{
    vec3 bg = texture2D( CC_Texture0, v_texCoord ).rgb;

    if ( v_texCoord.t > .5 )
    {
        vec4 fog_pd = u_parallax[ 3 ];
        float fog_f = lookup( fog_pd ).w * .05;

        for ( int i = 0; i < 3; i++ )
        {
            vec4 mnt_pd = u_parallax[ i ];

            bg = mix(
             bg,
             mix( mnt_pd.rgb, fog_pd.rgb, fog_f * inversesqrt( mnt_pd.w ) ),
             lookup( mnt_pd )[ i ] );
        }
    }

    gl_FragColor.rgb = bg / texture2D( CC_Texture2, v_texCoord ).r;
    gl_FragColor.a   = 1.;
}

I am pretty sure this shader can be much better. Please, do not hesitate to share modifications, and I will edit the post.

We also store all color palette and movement speed information separately and load them as a uniform mat4.

{
    .76f, .67f, .49f, .01f,
    .80f, .57f, .27f, .07f,
    .78f, .43f, .25f, .25f,
    .80f, .80f, .50f, .17f
}

We do it this way because we want the palette to be specific to the mood of each background, and it is also much easier to experiment with colors this way.

In short:

PARALLAX.DDS: 342K
Microsoft DirectDraw Surface (DDS), 1024 x 256, DXT5

BG.PNG      : 91K
PNG image data, 1621 x 1024, 4-bit colormap, non-interlaced

VIGNETTE.PNG: 216K (shared among all backgrounds)
PNG image data, 811 x 512, 8-bit grayscale, non-interlaced

All the high definition assets cost us ~400K per background. This way, we were able to fit 3 completely different background styles in less than 1.5MB in Twiniwt.
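The arithmetic behind that claim, with the vignette counted once since it is shared:

```python
# Rough size budget in KB; the vignette is shared across all three
# backgrounds, so it is counted once.
parallax_kb = 342
bg_kb = 91
vignette_kb = 216

per_background = parallax_kb + bg_kb        # 433 KB, the "~400K" figure
total_kb = 3 * per_background + vignette_kb  # three backgrounds + vignette
print(per_background, total_kb)  # 433 1515, under 1.5 MB (1536 KB)
```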

Rules of Thumb

Finally, achieving the best packing for games requires a holistic approach to development. It reflects on decisions made by artists as well as developers.

Some rules of thumb for anyone who wants to do production assets:

Know your file formats.

PNG, for example, is NOT “the format that stores an alpha channel and compresses losslessly.” The details matter. The PNG specification defines multiple ways to store both color and transparency information. (Fortunately, PNG is also not your best option for final assets.)

In his seminal CppCon 2014 talk, Mike Acton describes the developer’s job as: “to solve data transformation problems.” As such, people creating production assets should be aware of what they are really feeding into that transformation. This is not the job of a technical artist, this is the responsibility of a digital artist.

Know your tools.

Not everything in file format specifications is well or strictly defined. And implementations are far from perfect. Different tools may vary in the way they interpret files. So know how your tools handle the import/export of your assets.

Bake.

This doesn’t seem like it needs reminding. But nowadays, we tend to embrace “best practices” that favor flexibility, which sometimes can carry unnecessary calculations into runtime. If the distance from the camera is always the same, maybe the amount of blur is the same. It doesn’t matter if you have that amazing focus blur shader, you can just bake the blur.

FOLLOW-UP

My good friend, Marcel Smit, reviewed the post and made some great comments regarding compression and PNG format problems. I believe they should be part of the post. Here we go:

I was thinking for the parallax scrolling you could compress it even further by using only black and white and storing the images with one bit per pixel and RLE-compression. You could blur the images after loading them.

RLE, short for run-length encoding, is a very simple method that has been around for a long time. No need for a technical description. This is run-length encoding:
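For the curious, here is a minimal Python sketch of the one-bit-per-pixel idea Marcel describes; the function names and the run format are mine, not from any particular library:

```python
# Minimal run-length encoding of a 1-bit-per-pixel row: store (value, count)
# pairs instead of individual pixels. Illustrative names, not a real library.
def rle_encode(bits):
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_decode(runs):
    return [b for b, count in runs for _ in range(count)]

row = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
encoded = rle_encode(row)
print(encoded)                     # [[0, 3], [1, 2], [0, 1], [1, 4]]
assert rle_decode(encoded) == row  # round-trips losslessly
```

Long runs of identical mask pixels, which these silhouette layers are full of, are exactly the case where RLE pays off.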

I found a very nice and super fast blurring algorithm:
https://github.com/memononen/fontstash/blob/master/src/fontstash.h#L987
It’s a clever trick to quickly blur an image. You’d need to do both a horizontal and vertical pass in constant time to do a 2D gaussian blur.

I have yet to try this method and see how it fares, but the idea makes so much sense that I can’t see any reason it wouldn’t work better.

There are other reasons to avoid PNG. Like you said, it leaves out colors for translucent pixels. This is BAD when you’re doing bilinear filtering, as with bilinear filtering the GPU also samples adjacent pixels, but these may be white/black/whatever color your export tool used to replace the colors it optimized away!

I can’t stress enough the importance of tools/libraries like ImageMagick, GraphicsMagick and GEGL when you need to find out what is really going on with your production assets. You can batch revise invisible properties of your assets in a matter of seconds. For example, you can back up and batch remove all alpha channels to reveal how BAD it really is.

I had to write fix-up code for this for Riposte. I used the average color of the non-translucent neighboring pixels. Another reason to avoid PNG is that it is horribly slow to decompress. Reading raw data or TGAs, or decompressing your own format, is likely much faster.

Thanks again, Marcel!

About Knowledge

There is a particular kind of person.

The kind that once heard the aphorism “knowledge is power” as a kid, and took it all too seriously. The kind that could probably spell the Latin version of the phrase. The kind that poured most of their stat points into “wisdom”, hoping to be the wizard of the story.

Of course, one rarely questions what “power” actually is. Let’s define it as the ability to influence the state of the environment as well as the behaviour of its agents.

Knowledge definitely was power, once.

The new world weakened it. The characteristics of what we call knowledge are vastly different and more fragmented now. It is still important, maybe even more so, but not nearly as powerful.

I like to retrofit one of the notes from Newton’s alchemy texts to depict the new knowledge.

The vital agent diffused through everything in the earth is one and the same. And it is a mercurial spirit, extremely subtle and supremely volatile, which is dispersed through every place.

The new knowledge will change whenever you are sleeping, whenever you look the other way. It will change whenever you blink.

It will regress, and it will get revised and deprecated. It will be staged, and it will be branched.

The new knowledge is a repository in version control. As such, it needs an active maintainer.

In a world of assets and liabilities, knowledge is only potentially an asset, but always a liability.

The good thing is, even though it doesn’t make you more powerful, it definitely makes you better. It gives you perspective. It gives you the ability to fill the new, interdisciplinary roles that are emerging, as long as you actively maintain at least one of those repositories.

I am a programmer, with knowledge and experience in computer graphics. I studied architecture, and I did organization and event management for some years. I settled on the game industry, not only because I love games, but also because I can apply all of this knowledge to games.

Whenever people ask me why I include my work as an architect as vocational experience in my game developer CV, I remind them of a particular architect:

Christopher Alexander, whose research in patterns of architectural design and urban planning in the 60s helped shape how we design large-scale software projects today. His work was required reading in CS circles. He heavily influenced the research on Object Oriented Programming, as well as the design of C++. The whole Design Patterns movement was directly based on Alexander’s work.

Job titles are products of a well-defined, well-tested distribution of work, not a definitive categorization of knowledge and expertise. Multiple areas of knowledge may be hard to actively maintain, let alone apply. But they are meaningful, as long as you are able to specialize in one. The others, even when deprecated, will keep making you better.

Text Alignment: Full Justification #1 – Word Wrap

I’m obsessed with typography. Not great with it, just obsessed. The company gets its name from the legendary Misc Fixed 6×13 Regular.

Typography is people finding ways to communicate knowledge and ideas in style.

Typography is Richard Feynman.

Typography is David Bowie.

I like full justification. It naturally makes paragraph bounds recognizable, letting us use the block in more complex page layouts. While it is easy to implement in its most basic form, it is really, really hard to get right. It usually renders plain bad. Which is probably why not everyone shares my enthusiasm for full justification. I haven’t implemented it myself, like, ever, though. I am just ranting here. Let’s change that.

Here is our introductory information on justification:
https://en.wikipedia.org/wiki/Typographic_alignment

We are going to justify text for rendering with monospaced fonts, which nicely narrows the problem domain for the purposes of this article. Convenient, as fixed spacing means literally one less variable to worry about.

Prior Art

Better start by checking out what we are getting into. Since this is mostly a solved problem, there are examples to learn from.

The Knuth-Plass line-wrapping algorithm is still the go-to solution for the text alignment problem. However, it is designed for typesetting text rendered with variable-width fonts. Still, it doesn’t hurt checking it out. Terje D. provides a simplified version at SO:

Add start of paragraph to list of active breakpoints
For each possible breakpoint (space) B_n, starting from the beginning:
   For each breakpoint in active list as B_a:
      If B_a is too far away from B_n:
          Delete B_a from active list
      else
          Calculate badness of line from B_a to B_n
          Add B_n to active list
          If using B_a minimizes cumulative badness from start to B_n:
             Record B_a and cumulative badness as best path to B_n

The result is a linked list of breakpoints to use.
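
To make the outline concrete, here is a hypothetical Python sketch of the same idea. It simplifies things: instead of maintaining an explicit active list, it scans every feasible previous breakpoint for each position, and scores a line by its cubed leftover space (the last line is free):

```python
def break_lines(words, width):
    """Wrap words, minimizing the summed (leftover space)^3 of the lines.

    Simplified Knuth-Plass-style dynamic programming; the last line
    costs nothing, mirroring the usual typesetting convention.
    """
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[j]: min cost to wrap words[:j]
    best[0] = 0.0
    prev = [0] * (n + 1)             # prev[j]: breakpoint used to reach j
    for j in range(1, n + 1):
        length = -1                  # running width of words[i:j]
        for i in range(j - 1, -1, -1):
            length += len(words[i]) + 1  # word plus one separating space
            if length > width:
                break                # line cannot fit any more words
            cost = 0.0 if j == n else float(width - length) ** 3
            if best[i] + cost < best[j]:
                best[j], prev[j] = best[i] + cost, i
    lines, j = [], n                 # walk the recorded breakpoints back
    while j > 0:
        lines.append(" ".join(words[prev[j]:j]))
        j = prev[j]
    return lines[::-1]

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Etiam semper risus mauris, et dignissim lorem lacinia non.")
for line in break_lines(text.split(), 40):
    print(line)
```

The cubic cost is what makes the algorithm prefer several slightly short lines over one very short one.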

The badness of lines under consideration can be calculated like this:

Each  space  is assigned  a  nominal  width,  a strechability,  and  a
shrinkability.   The  badness  is  then calculated  as  the  ratio  of
stretching  or shrinking  used, relative  to what  is allowed,  raised
e.g. to the third power (in  order to ensure that several slightly bad
lines are prefered over one really bad one)
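
Written out as code, the metric might look like this hypothetical sketch, where stretch and shrink are the total amounts of widening and narrowing the spaces of the line are allowed:

```python
def badness(natural, target, stretch, shrink):
    """Ratio of stretching/shrinking used vs. allowed, raised to the 3rd power."""
    delta = target - natural          # positive: stretch, negative: shrink
    if delta == 0:
        return 0.0                    # the line fits naturally
    allowed = stretch if delta > 0 else shrink
    if allowed == 0:
        return float("inf")           # no adjustment is permitted at all
    return abs(delta / allowed) ** 3

# Using half the allowed stretch costs 1/8; using all of it costs 1.
print(badness(natural=54, target=60, stretch=12, shrink=6))  # 0.125
print(badness(natural=48, target=60, stretch=12, shrink=6))  # 1.0
```

The cubing is the part that makes several slightly bad lines cheaper than one really bad one: two half-stretched lines cost 2/8 in total, far less than a single fully stretched line.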

Turns out I was wrong, it actually hurts checking it out. Anyway.

Emacs does fixed-space justification with its “fill” commands, so I skimmed the Emacs sources. The package lisp/textmodes/fill.el is big, but the relevant part of the implementation is ~100 LOC. Alright.

It, not unlike most Emacs text-editing code, directly modifies the text buffer:

- While cursor is still in paragraph:
  - Word wrap.
    - Go to fill column.
    - Go back and find a place to cut.
    - Make sure we're good.
    - Insert newline.
  - Go back one line and justify.
    - Merge multiple consecutive spaces along the line.
    - Count remaining spaces in the line.
    - Calculate the number of additional spaces needed.
    - Insert the damn spaces.

The (inattentively trimmed) “insert the damn spaces” code from the Emacs sources is below, with comments added for clarity:

(let (ncols          ; number of additional space chars needed
      nspaces        ; number of spaces between words
      curr-fracspace ; current fractional space amount
      count)
  ;; I don't like WordPress one bit.
  (when (and (> ncols 0) (> nspaces 0))
    (setq curr-fracspace (+ ncols (/ nspaces 2))
          count nspaces)
    ;; I don't like WordPress one bit.
    (while (> count 0)
      (skip-chars-forward " ")
      (insert-char ?\s (/ curr-fracspace nspaces) t)
      (search-forward " " nil t)
      (setq count (1- count)
            curr-fracspace
            (+ (% curr-fracspace nspaces) ncols)))))
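
The fractional-space trick is worth restating outside Emacs Lisp. This hypothetical Python mirror of the same arithmetic (not a translation of the full fill.el) spreads the extra columns over the gaps, carrying the remainder along so the wide gaps come out evenly distributed:

```python
def justify(words, width):
    """Pad the gaps between words with extra spaces until the line is `width` wide."""
    if len(words) < 2:
        return " ".join(words)                        # nothing to distribute
    nspaces = len(words) - 1                          # gaps between words
    ncols = width - (sum(map(len, words)) + nspaces)  # extra spaces needed
    frac = ncols + nspaces // 2                       # running fractional amount
    out = [words[0]]
    for word in words[1:]:
        out.append(" " * (1 + frac // nspaces))       # one normal + extra spaces
        out.append(word)
        frac = frac % nspaces + ncols                 # carry the remainder over
    return "".join(out)

print(justify("Sed vitae lacus nisi.".split(), 30))
```

Like the Emacs version, this is pure integer arithmetic; the remainder carrying plays the same role as in Bresenham-style line drawing.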

Word Wrap

We are not tied to a text buffer, so we can freely experiment on whatever data structure we feel like.

We will handle only the word-wrapping part in this post; we will try various versions and see how they perform.

The preparations and the test code:

(ql:quickload :cl-ppcre)

(defparameter *fill-column* 60)

(defparameter *text* "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam semper risus mauris, et dignissim lorem lacinia non. Sed vitae lacus nisi. Fusce vitae lectus non quam dictum luctus et at mauris.")

(defun word-wrap-test (fn)
  (print (funcall fn *text*))
  (gc :full t)
  (time (dotimes (x 10000) (funcall fn *text*))))

First, the most logical implementation. Reading char-by-char. Performs really well, of course, as it doesn’t do any list manipulation.

(defun word-wrap-0 (txt)
  (with-input-from-string (in txt)
    (loop for  chr of-type (or null base-char)  = (read-char in nil)
          with bol of-type integer              = 0 ; beginning of line
          and  cut of-type integer              = 0 ; potential breakpoint
          and  buf of-type (vector base-char)       ; output buffer
                 = (make-array 4 :element-type 'base-char
                                 :adjustable t
                                 :fill-pointer 0)
          while chr
          do (case chr
               (#\Newline
                (setf bol (fill-pointer buf)))
               (#\Space
                (setf cut (fill-pointer buf))))
          unless (vector-push chr buf)
            do (adjust-array buf (* 16 (array-dimension buf 0)))
               (vector-push chr buf)
          when (and (> (fill-pointer buf) (+ bol *fill-column*))
                    (< bol cut)) ; for newlines in source
            do (setf (aref buf cut) #\Newline
                     bol cut)
          finally (return buf))))

0.166 seconds of real time
398,942,396 processor cycles
13,895,328 bytes consed

The result is this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Etiam semper risus mauris, et dignissim lorem lacinia non.
Sed vitae lacus nisi. Fusce vitae lectus non quam dictum
luctus et at mauris.
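
The char-by-char strategy is not tied to Lisp. For comparison, here is a hypothetical Python rendition of the same idea: a flat output buffer, an index for the beginning of line, an index for the last seen space, and a destructive overwrite of that space once the column limit is crossed:

```python
FILL_COLUMN = 60

def word_wrap(txt):
    """Char-by-char wrap: overwrite the last breakable space with a newline."""
    buf = []
    bol = 0   # index where the current line begins
    cut = 0   # index of the last space we could break at
    for ch in txt:
        if ch == "\n":
            bol = len(buf)
        elif ch == " ":
            cut = len(buf)
        buf.append(ch)
        if len(buf) > bol + FILL_COLUMN and bol < cut:
            buf[cut] = "\n"          # break the line at the last space
            bol = cut
    return "".join(buf)

TEXT = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Etiam semper risus mauris, et dignissim lorem lacinia non. "
        "Sed vitae lacus nisi. Fusce vitae lectus non quam dictum "
        "luctus et at mauris.")
print(word_wrap(TEXT))
```

A Python list already grows geometrically under the hood, so there is no equivalent of the manual adjust-array dance here.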

The other examples will rely on cl-ppcre:split for splitting the text.
That call singlehandedly takes:
0.518 seconds of real time
1,241,109,342 processor cycles
55,040,080 bytes consed
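
The lesson is not specific to cl-ppcre. As a hypothetical Python analogue, a regex split also costs noticeably more than the built-in whitespace split, even though both produce the same tokens here:

```python
import re
import timeit

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Etiam semper risus mauris, et dignissim lorem lacinia non.")

# For text without leading/trailing whitespace, the results match.
assert re.split(r"\s+", text) == text.split()

t_regex = timeit.timeit(lambda: re.split(r"\s+", text), number=10_000)
t_plain = timeit.timeit(lambda: text.split(), number=10_000)
print(f"re.split: {t_regex:.3f}s  str.split: {t_plain:.3f}s")
```

So whenever the split pattern is just “runs of whitespace”, it pays to check whether the regex machinery is needed at all.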

Now we try with nested lists all the way.

(defun word-wrap-1 (txt)
  (loop for  text of-type list    on (ppcre:split "\\s+" txt)
        as   word of-type string  = (car text)
        as   next of-type string  = (car (cdr text))
        with line of-type list    = '()
        and  crsr of-type integer = *fill-column*
        ;; Insert the word in line, moving the cursor.
        do (setf line (nconc line (list word))
                 crsr (- crsr (length word)))
        if (or (null next) (< crsr (1+ (length next))))
          ;; If EOL or EOF, line feed and carriage return.
          collect line into buffer
          and do (setf crsr *fill-column*
                       line nil)
        else do (decf crsr)
        finally (return (format nil
                                "~{~&~{~a~^ ~}~}"
                                buffer))))

0.738 seconds of real time
1,767,125,344 processor cycles
92,799,088 bytes consed

Another approach would be to format lines immediately and concatenate later.

(defun word-wrap-2 (txt)
  (loop for  text of-type list     on (ppcre:split "\\s+" txt)
        as   word of-type string   = (car text)
        as   next of-type string   = (car (cdr text))
        with line of-type string   = ""
        and  fmt  of-type function = (formatter "~a~:[ ~;~]~a")
        and  crsr of-type integer  = *fill-column*
        do (setf line (format nil fmt
                              line (string= line "") word)
                 crsr (- crsr (length word)))
        if (or (null next) (< crsr (1+ (length next))))
          collect line into buffer
          and do (setf crsr *fill-column*
                       line "")
        else do (decf crsr)
        finally (return (format nil "~{~&~A~}" buffer))))

1.087 seconds of real time
2,603,569,620 processor cycles
397,437,888 bytes consed

No line. Adjustable array buffer. Start with zero size. Allocate for every word. I know, it’s stupid.

(defun word-wrap-3 (txt)
  (loop for  text   of-type list    on (ppcre:split "\\s+" txt)
        as   word   of-type string  = (car text)
        as   next   of-type string  = (car (cdr text))
        with crsr   of-type integer = *fill-column*
        and  buffer = (make-array 0 :element-type 'character
                                    :adjustable t
                                    :fill-pointer 0)
        do (adjust-array buffer (+ (array-dimension buffer 0) (length word) 1))
           (loop for c across word do (vector-push c buffer))
           (setf crsr (- crsr (length word)))
        if (< crsr (1+ (length next)))
          do (vector-push #\Newline buffer)
             (setf crsr *fill-column*)
        else
          do (vector-push #\Space buffer)
             (decf crsr)
        finally (vector-pop buffer)
                (return buffer)))

2.019 seconds of real time
4,836,461,608 processor cycles
504,132,464 bytes consed

Let vector-push-extend handle allocation, at least doubling the buffer size.

(defun word-wrap-4 (txt)
  (loop for  text   of-type list    on (ppcre:split "\\s+" txt)
        as   word   of-type string  = (car text)
        as   next   of-type string  = (car (cdr text))
        with crsr   of-type integer = *fill-column*
        and  buffer = (make-array 1 :element-type 'character
                                    :adjustable t
                                    :fill-pointer 0)
        do (loop for c across word do
          (vector-push-extend c buffer
                              (array-dimension buffer 0)))
           (setf crsr (- crsr (length word)))
        if (< crsr (1+ (length next)))
          do (vector-push-extend #\Newline buffer)
             (setf crsr *fill-column*)
        else
          do (vector-push-extend #\Space buffer)
             (decf crsr)
        finally (vector-pop buffer)
                (return buffer)))

0.643 seconds of real time
1,538,669,380 processor cycles
99,810,880 bytes consed

Try vector-pushing; if it fails, adjust the array manually.

(defun word-wrap-5 (txt)
  (loop for  text   of-type list    on (ppcre:split "\\s+" txt)
        as   word   of-type string  = (car text)
        as   next   of-type string  = (car (cdr text))
        with crsr   of-type integer = *fill-column*
        and  buffer of-type (vector character)
               = (make-array 4 :element-type 'character
                               :adjustable t
                               :fill-pointer 0)
        do (loop for c across word
                 unless (vector-push c buffer)
                   do (adjust-array
                       buffer (* 16 (array-dimension buffer 0)))
                      (vector-push c buffer))
           (setf crsr (- crsr (length word)))
        if (< crsr (1+ (length next)))
          do (vector-push-extend #\Newline buffer)
             (setf crsr *fill-column*)
        else
          do (vector-push-extend #\Space buffer)
             (decf crsr)
        finally (vector-pop buffer)
                (return buffer)))

0.632 seconds of real time
1,512,965,832 processor cycles
100,453,840 bytes consed

Use fill-pointer as cursor, adjusting EOL index.

(defun word-wrap-6 (txt)
  (loop for  text   of-type list    on (ppcre:split "\\s+" txt)
        as   word   of-type string  = (car text)
        as   next   of-type string  = (car (cdr text))
        with eol    of-type integer = *fill-column*
        and  buffer of-type (vector character)
               = (make-array 4 :element-type 'character
                               :adjustable t
                               :fill-pointer 0)
        do (loop for c across word
                 unless (vector-push c buffer)
                   do (adjust-array
                       buffer (* 16 (array-dimension buffer 0)))
                      (vector-push c buffer))
        if (< eol (+ (fill-pointer buffer) 1 (length next)))
          do (vector-push-extend #\Newline buffer)
             (setf eol (+ (fill-pointer buffer) *fill-column*))
        else
          do (vector-push-extend #\Space buffer)
        finally (vector-pop buffer)
                (return buffer)))

0.632 seconds of real time
1,513,173,132 processor cycles
100,451,712 bytes consed

Get rid of redundant push-pop.

(defun word-wrap-7 (txt)
  (loop for  text   of-type list    on (ppcre:split "\\s+" txt)
        as   word   of-type string  = (car text)
        as   next   of-type string  = (car (cdr text))
        with eol    of-type integer = *fill-column*
        and  buffer of-type (vector character)
               = (make-array 4 :element-type 'character
                               :adjustable t
                               :fill-pointer 0)
        do (loop for c across word
                 unless
                 (vector-push c buffer)
                 do (adjust-array buffer
                                  (* 16 (array-dimension buffer 0)))
                    (vector-push c buffer))
           (cond ((null next))
                 ((< eol (+ (fill-pointer buffer) 1 (length next)))
                  (vector-push-extend #\Newline buffer)
                  (setf eol (+ (fill-pointer buffer) *fill-column*)))
                 (t (vector-push-extend #\Space buffer)))
        finally (return buffer)))

0.628 seconds of real time
1,503,526,500 processor cycles
100,451,712 bytes consed

More concise conditionals. Conses way better. Best performer among the versions that use ppcre:split.

(defun word-wrap-8 (txt)
  (loop for  text   of-type list    on (ppcre:split "\\s+" txt)
        as   word   of-type string  = (car text)
        as   next   of-type string  = (car (cdr text))
        with eol    of-type integer = *fill-column*
        and  buffer of-type (vector base-char)
               = (make-array 4 :element-type 'base-char
                               :adjustable t
                               :fill-pointer 0)
        do (loop for c across word
                 unless (vector-push c buffer)
                   do (adjust-array buffer
                                    (* 16 (array-dimension buffer 0)))
                      (vector-push c buffer))
        when next
          if (< eol (+ (fill-pointer buffer) 1 (length next)))
            do (vector-push-extend #\Newline buffer)
               (setf eol (+ (fill-pointer buffer) *fill-column*))
          else
            do (vector-push-extend #\Space buffer)
        finally (return buffer)))

0.631 seconds of real time
1,510,134,528 processor cycles
67,648,608 bytes consed

In the next post, we will see which of these is the most suitable approach to build on for text alignment along the line.

Store Presence on App Store #3 – Let’s Play: Optimize

In the last post, we cleaned up the data, resulting in ~86% size reduction. Yet, we still need to optimize the data for real-time processing.


This is #3 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    [ DONE ]
#2 - Let's Play: Cleanup   [ DONE ]
#3 - Let's Play: Optimize  <<
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...

Prelude

We follow two rules in the matter of optimization: Rule 1: Don’t do it. Rule 2 (for experts only): Don’t do it yet.

— Michael A. Jackson, Principles of Program Design, 1975

Luckily, in any sane language, a list starts with Element #0.

Rule 0: Just this once.

Optimize

These are our coffee beans now:

[ "CN", "iPhone", "Board", "Collection List", 2, [ 4, 4 ], 6, [ "Home", "Board", "免费" ] ]

It is time we grind them. We do this in two ways:

  • Storing the data in a more suitable format,
  • Optimizing the layout.

Choice of Data Format

You can grind coffee beans differently for several types of coffee. Likewise, you can choose one among many storage formats according to your needs. I chose Google’s Protocol Buffers (Protobuf) because:

  • Binary format
  • Enum data type support
  • Great Python support
  • Actively maintained

You might as well choose Thrift or Avro. You may even write your own binary format, which would be overkill for our purposes. A full evaluation of format options is out of scope, but I will say this much: JSON is particularly bad for this dataset since, mathematically, much of the data has the characteristics of a finite set. That is why support for enums or actual sets, which JSON lacks, is a huge plus for this data. If you stick to JSON, you will probably still want to replace those strings with numbers and simulate enums manually. Finally, in case you choose Avro for some reason, I am sharing the initial Avro Schema I wrote before I settled on Protobuf.
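
To make the JSON fallback concrete, here is a hypothetical sketch of that manual enum simulation using Python’s IntEnum; the feature-type strings from our dataset shrink to small integers in the serialized output:

```python
import json
from enum import IntEnum

class FType(IntEnum):
    APP_TOP = 0
    COL_TOP = 1
    APP_BAN = 2
    COL_BAN = 3
    COL_LST = 4
    COL_VID = 5

FTYPE_NAMES = {"App Top Banner":        FType.APP_TOP,
               "Collection Top Banner": FType.COL_TOP,
               "App Banner":            FType.APP_BAN,
               "Collection Banner":     FType.COL_BAN,
               "Collection List":       FType.COL_LST,
               "Collection Video":      FType.COL_VID}

row = ["CN", "iPhone", "Board", "Collection List", 2, [4, 4], 6,
       ["Home", "Board", "免费"]]
row[3] = FTYPE_NAMES[row[3]]   # an IntEnum serializes as its integer value
print(json.dumps(row, ensure_ascii=False))
```

An IntEnum is a subclass of int, so the standard json module writes it out as a plain number with no extra work.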

Format Conversion

This is pretty much what we are going to do, now:

First of all, we need to define our schema. We start with a direct translation from the JSON version, and then we can go from there.

message feature_m
{
           date_m   date        = 1;
           string   country     = 2;
           device_e device      = 3;
           string   category    = 4;
           ftype_e  ftype       = 5;
           uint32   depth       = 6;
  repeated uint32   rows        = 7;
           uint32   position    = 8;
  repeated string   path        = 9;
}

This is a good start. Now let’s enumerate the ‘device’ member.

enum device_e
{
  IPHONE                        = 0;
  IPAD                          = 1;
}

… and the ‘feature type’ member. The possible options are listed in App Annie’s Feature History and Daily Features Report Guide & FAQ page.

enum ftype_e
{
  APP_TOP                       = 0;
  COL_TOP                       = 1;
  APP_BAN                       = 2;
  COL_BAN                       = 3;
  COL_LST                       = 4;
  COL_VID                       = 5;
}

Since we are going to combine datasets of multiple dates into a single database, we also need to store the ‘date’ in each feature.

message date_m
{
  int32 year                    = 1;
  int32 month                   = 2;
  int32 day                     = 3;
}

Here is the complete schema.

syntax = "proto3";
package x13;

message date_m {
  int32 year                    = 1;
  int32 month                   = 2;
  int32 day                     = 3;
}

enum device_e {
  IPHONE                        = 0;
  IPAD                          = 1;
}

enum ftype_e {
  APP_TOP                       = 0;
  COL_TOP                       = 1;
  APP_BAN                       = 2;
  COL_BAN                       = 3;
  COL_LST                       = 4;
  COL_VID                       = 5;
}

message feature_m {
           date_m   date        = 1;
           string   country     = 2;
           device_e device      = 3;
           string   category    = 4;
           ftype_e  ftype       = 5;
           uint32   depth       = 6;
  repeated uint32   rows        = 7;
           uint32   position    = 8;
  repeated string   path        = 9;
}

message store_m {
  repeated feature_m features   = 1;
}

We named it “x13_store.proto”.
Make sure protobuf is installed on your computer.
Compile the schema for Python:

protoc --python_out=./ x13_store.proto

Now it is ready to import!
Create a “convert.py” and start importing.

import json, gzip
import x13_store_pb2 as x13s

Define a mapping between the JSON fields and ftype_e.

ftypes = {'App Top Banner'        : x13s.APP_TOP,
          'Collection Top Banner' : x13s.COL_TOP,
          'App Banner'            : x13s.APP_BAN,
          'Collection Banner'     : x13s.COL_BAN,
          'Collection List'       : x13s.COL_LST,
          'Collection Video'      : x13s.COL_VID}

Function to fill feature_m instance using given feature data.

def import_feature(data, y, m, d, feature):
    feature.date.year  = y
    feature.date.month = m
    feature.date.day   = d
    feature.country    = data[0]
    feature.device     = [ x13s.IPHONE, x13s.IPAD ][ data[1] == 'iPad' ]
    feature.category   = data[2]
    feature.ftype      = ftypes[data[3]]
    feature.depth      = data[4]
    feature.rows.extend(data[5])
    feature.position   = data[6]
    feature.path.extend(data[7])
    return feature

Create a store_m instance.

store = x13s.store_m()

Loop over all days and generate features.

y = 17
m = 8
for d in range(31):
    date      = "%02d%02d%02d" % (y, m, d+1)
    json_path = "json/" + date + '.json'

    with open(json_path) as data_file:
        data  = json.load(data_file)

    for f in data:
        import_feature(f, 2000+y, m, d+1, store.features.add())

Write the data to a protobuf file, gzipping as we go.

with gzip.open("store_sample.pbz", 'wb') as out_file:
    out_file.write(store.SerializeToString())

Save the file. Convert the data.

python3 convert.py

That’s it! Now all our data is in “store_sample.pbz”.

Confirm the Data

Start a Python notebook in Jupyter.
Do the usual imports.

import gzip
import pandas as pd
import x13_store_pb2 as x13
import matplotlib.pyplot as plt
import numpy as np
from functools import reduce
import seaborn as sns
%matplotlib inline
sns.set()

Read the newly created protobuf file.

with gzip.open('store_sample.pbz', 'rb') as f:
    data = x13.store_m()
    data.ParseFromString(f.read())

Create a Pandas DataFrame from filtered data.

x = pd.DataFrame([[f.date.day, f.position, f.path] for f in data.features 
                  if f.depth == 2])

x.columns = ['Day', 'Position', 'Path']

x.head()

Voila!

   Day  Position                 Path
0    3         3   [Home, Board, 無料]
1    3         4   [Home, Board, 免费]
2    3        15  [Home, Puzzle, 免费]
3    3         3   [Home, Board, 免费]
4    3         3   [Home, Board, Free]

Check back for Part 4 of the series! If you have any questions or advice, please comment below. Also, please tell me what the hell to do with this data. Thank you!

Store Presence on App Store #2 – Let’s Play: Cleanup

In the first post of the series, we learnt how to gather the data. Now we are going to do some cleanup.

This is #2 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    [ DONE ]
#2 - Let's Play: Cleanup   <<
#3 - Let's Play: Optimize  [ DONE ]
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...


Where Were We?

Using the methods in Let’s Play #1, I gathered global, day-by-day, August 2017 data for the App Store presence of our latest game, Twiniwt.

Below is the size of the dataset:

[kenanb@6x13 twiniwt-1708]$ du -hsc *
4.0K 170801.json
4.0K 170802.json
1.5M 170803.json
1.5M 170804.json
1.5M 170805.json
728K 170806.json
728K 170807.json
...
684K 170830.json
732K 170831.json
23M total

Damn! Surely, it is time for cleanup and restructuring.

Scabbling and Cleanup

Let’s python3 and import the required packages.

import json, pprint

For this post, all we need is Python 3, which I presume you already have. Even though it is completely optional for the cleanup part of the series, I strongly suggest installing and using Jupyter Notebook for data exploration. It is especially useful while trying to restructure the data for your needs.

We saved our daily feature data to a subdirectory in our working directory.

A dataset corresponding to 2017-08-15 is saved as:

./json/170815.json

We want to loop over a period, say, each day in August 2017.

We need to generate the filepaths for the corresponding dates in the loop.

The year:

y = 17

The month:

m = 8

Loop over each day, generate the basename for the file and concatenate the whole pathname.

for d in range(31):
    date = "%02d%02d%02d" % (y, m, d+1)
    file_name = "json/" + date + '.json'

Here, we read the dataset using the pathname we just created.

    with open(file_name) as data_file:
        data = json.load(data_file)

Then, we immediately bypass all the garbage branches in the serialized dataset, and assign the key ‘rows’ to our data variable.
First, though, we need to have a look at the loaded data and find the path to ‘rows’.

{
  "data": {
    "data": {
      "pagination": {
        "current": 0,
        "page_interval": 1000,
        "sum": 1
      },
      "rows": [

	  ...

      ],
      "csvPermissionCode": "PERMISSION_NOT_PASS",
      "columns": [

	  ...

      ],
      "fixedColumns": {
        "tableWidth": 150,
        "fixed": 1
      }
    },
    "permission": true
  },
  "success": true
}

As you can see, this is what takes us to the ‘rows’:

    data = data['data']['data'].get('rows') or []

You probably noticed we didn’t simply do data['data']['data']['rows'], because that path might not even exist if, for some reason, your app is not in the store that day.
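
If you need this defensive lookup in more than one place, it generalizes into a tiny helper. A hypothetical sketch (the name dig is made up):

```python
def dig(mapping, *keys, default=None):
    """Walk nested dicts along `keys`; return `default` wherever the path breaks."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

payload = {"data": {"data": {"pagination": {"sum": 1}}}}  # no 'rows' today
print(dig(payload, "data", "data", "rows", default=[]))   # []
print(dig(payload, "data", "data", "pagination", "sum"))  # 1
```

This keeps the call sites short while behaving exactly like the chained .get() pattern above.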

Restructure

Cool, we got rid of the immediate garbage; it’s time to clean up and restructure the actual row data.

Let’s see, this is a sample row in an App Annie Daily Featured response.

[
  [
    {
      "image": "https://static-s.aa-cdn.net/img/ios/...",
      "type": "icon",
      "thumb": "https://static-s.aa-cdn.net/img/ios/..."
    }
  ],
  [
    "China",
    "CN"
  ],
  "iPhone",
  "Board",
  "Collection List",
  "N/A",
  2,
  4,
  6,
  [
    {
      "existence": false,
      "detail": null,
      "label": "Featured Home"
    },
    {
      "existence": false,
      "detail": null,
      "label": "Board"
    },
    {
      "existence": true,
      "detail": {
        "position": [
          6
        ],
        "row": [
          4,
          4
        ]
      },
      "parent": "免费",
      "label": "Twiniwt"
    }
  ],
  [
    "N/A",
    0,
    100,
    ""
  ]
]

Of the data above, we only want to keep the country code, the device, the category, the feature type, the depth, the rows, the position and the path. The full details about the contents of the row are provided in the Store Presence on App Store #1 – Let’s Play: Scrape.

We traverse the data, removing the garbage values from each row array (in reverse order, of course, so the indices of the garbage entities do not change during deletion).

    for d in data:
        d.pop(10)
        d.pop(5)
        d.pop(0)

The two-letter country code is enough; we don’t need the full country name in each element of our dataset.

        d[0] = d[0][1]

Shorten the ‘Featured Home’ category page name to simply ‘Home’.

        if ( d[2] == 'Featured Home' ): d[2] = 'Home'

The way ‘Featured Path’ is structured is pretty complex for our needs. Let’s restructure it.

        n = []
        for r in d[7]:
            n.append(r['label'])
        n[-1] = d[7][-1]['parent']
        if ( n[0] == 'Featured Home' ):
            n[0] = 'Home'
        if ( n[-1].endswith('see more') ):
            n[-1] = n[-1][:-9]
            n.append('>>')
        d[5] = d[7][-1]['detail']['row']
        d[7] = n

I am skipping the details on this one, as you might want to keep it, or arrange it differently.
Below is the cleaned-up sample data we get after the process.

[
  "CN",
  "iPhone",
  "Board",
  "Collection List",
  2,
  [
    4,
    4
  ],
  6,
  [
    "Home",
    "Board",
    "免费"
  ]
]

Now that we are finished with the dataset cleanup, let’s write it back to the file.

    with open(file_name, 'w') as out_file:
        json.dump(data, 
                  out_file, 
                  indent=2, 
                  ensure_ascii=False,
                  sort_keys=True)

You can view and download the complete code gist below.

We scraped and cleaned up the data. It is now down to 3.4 MB from 23 MB, meaning we just got rid of garbage amounting to ~86% of the dataset. Congratulations!

Yet, the data is still unsuitable for real-time processing. We will fix that in the next part of the series. Some entities are long strings where we could get away with enumerations, things like that. And while at it, let’s get rid of this JSON nonsense, shall we?

Oh, I almost forgot the coffee beans!

Check back for Part 3 of the series! If you have any questions or advice, please comment below. Thank you!

Store Presence on App Store #1 – Let’s Play: Scrape

This is #1 of the how-to series on custom store presence analysis and plotting using Python, Jupyter and lots of sciency-graphy libraries.

#1 - Let's Play: Scrape    <<
#2 - Let's Play: Cleanup   [ DONE ]
#3 - Let's Play: Optimize  [ DONE ]
#4 - Let's Play: Analyze   [ TODO ]
#5 - Let's Play: Visualize [ TODO ]
#6 - ...

Preparing a business plan requires systematic procrastination: Store presence analytics!

All work and no play makes Jack a dull boy.

Preparing a business plan requires fourteen things: a business, a dozen cups of coffee and systematic procrastination.

I am sure you can handle the coffee and the business. So for now, I will only try to help with the procrastination:

I need some insight into the App Store presence of our latest game, Twiniwt. I visit App Annie’s Featured page, as usual.

If you don’t know about App Annie, this is a good review of the service.

Analytics you find on such dedicated services are good as general performance metrics. However, you might want the data to support a particular claim in your business plan. The freely available content is hardly useful for that: the signal-to-noise ratio makes it hard to read. Besides, you can only filter the data.

We have to do some ad-hoc data science and visualization in order to get better results. Now, what was data science, again?

I live in Istanbul and I prefer a budget GNU/Linux laptop at work. Obviously, I am no data scientist. Still, some half-assed data science is better than none.

Let’s play: STORE PRESENCE ANALYTICS!

This will require at least five steps:

  • Scrape: Gather the data and understand its current structure
  • Cleanup: Restructure and delete irrelevant fields of the data
  • Optimize: Convert the restructured data to a format that is faster to batch process
  • Analyze: Do filtering, mapping, grouping and analysis over the data
  • Visualize: Plot and visualize the data frames

We will dedicate a separate post to each of those titles. The data analysis and visualization steps can actually grow into multiple posts as we go.

We won’t delve into the subject of integration, yet we will rely on cross-platform, portable data formats. Besides, we will mainly use Python, which is a very sound choice for integration.

We won’t be concerned about storage either, as the data will already shrink significantly while we are optimizing for performance.

Before we begin, let’s look at some coffee beans and get hyped!

Scrape

First things first. If I want to analyze store presence data, I need the store presence data. I need it in a way I can process it. I am already familiar with App Annie’s daily feature tables. Here we go:

Go to your app’s page.

Go to User Acquisition >> Featured.

This is good stuff, even though the data itself is not. 😒

App Store Presence Feature Table
WTF are those app positions! 1714? Seriously?

Anyway.

When I click the Export button, I get this.

Connect this app? Hmm, why not!

Now, I am sure App Annie’s premium solution provides various great viewpoints and customizations to the data. However, being an indie studio and all, we cannot afford it.

Hey, there is another way! I can connect my app. But what do I see when I click connect? It is asking for my App Store developer account password. I don’t care what encryption they use, there is no way I am filling in a Connect form that asks for my developer account password.

So, I guess I am on my own, but that is OK.

Be warned that I do not know if App Annie ever intended or actually permitted the use of data they transfer to client computers in such a way. So I will not tell you what to do; I will just describe how one would access such freely available data in a structured format.

Why not have a closer look at the transferred data in Firefox? Right-click anywhere on the page and click Inspect Element.

Go to the Network tab.

App Annie Featured Page

Now this is where you can track requests and responses. Whenever you change the table options, a new set of data will be sent to you, the client, in JSON format.

The entry that contains the data you asked for is something like this:

/ajax/ios/app/your-app-name/daily-feature...

That is the app’s daily store presence aggregated according to your preferences.

Saving Your Data

When you need to save a piece of data a web server sent you, simply right-click the corresponding entry in the Network tab, then do Copy >> Copy Response.

Open your favourite text editor and paste the clipboard contents into a new JSON file.

If you want to quickly do that for multiple data sets, you can instead Copy >> Copy as cURL, and modify the date etc. in the copied curl command.
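To sanity-check what you saved, a couple of lines of Python will confirm each file still parses as valid JSON; the file name in the example comment is just an illustration:

```python
import json

def count_records(path):
    # Parse one saved response and report how many rows it contains.
    # json.load raises an error if the paste was truncated or mangled.
    with open(path) as f:
        rows = json.load(f)
    return len(rows)

# Example (hypothetical file name):
# print(count_records("daily-feature-2017-03-01.json"))
```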

DATA STRUCTURE

Let’s check out a sample store presence in an App Annie Daily Featured response.

[
  [
    {
      "image": "https://static-s.aa-cdn.net/img/ios/...",
      "type": "icon",
      "thumb": "https://static-s.aa-cdn.net/img/ios/..."
    }
  ],
  [
    "China",
    "CN"
  ],
  "iPhone",
  "Board",
  "Collection List",
  "N/A",
  2,
  4,
  6,
  [
    {
      "existence": false,
      "detail": null,
      "label": "Featured Home"
    },
    {
      "existence": false,
      "detail": null,
      "label": "Board"
    },
    {
      "existence": true,
      "detail": {
        "position": [
          6
        ],
        "row": [
          4,
          4
        ]
      },
      "parent": "免费",
      "label": "Twiniwt"
    }
  ],
  [
    "N/A",
    0,
    100,
    ""
  ]
]
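Reading the record positionally in Python makes the layout explicit. The variable names below are my own guesses, matched against the field breakdown that follows:

```python
# The sample row from above, as a Python literal.
raw = [
    [{"image": "https://static-s.aa-cdn.net/img/ios/...",
      "type": "icon",
      "thumb": "https://static-s.aa-cdn.net/img/ios/..."}],
    ["China", "CN"],
    "iPhone",
    "Board",
    "Collection List",
    "N/A",
    2, 4, 6,
    [{"existence": False, "detail": None, "label": "Featured Home"},
     {"existence": False, "detail": None, "label": "Board"},
     {"existence": True,
      "detail": {"position": [6], "row": [4, 4]},
      "parent": "免费", "label": "Twiniwt"}],
    ["N/A", 0, 100, ""],
]

# Positional unpacking; the names are assumptions on my part.
(creative, country, device, category_page, feature_type,
 subtitle, depth, row, position, feature_path, premium) = raw

print(country[1], device, feature_type, depth, row, position)
# CN iPhone Collection List 2 4 6
```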

Now, compare it to the table, though not too carefully. (I couldn’t be bothered to find the exact same record. 😎)

App Store Presence Feature Table
1714? Why do you even put that into the table? That is just wasted network bandwidth.

Below is a breakdown of the row contents:

Creative [GARBAGE]: URLs to the store creative for the app.
Country: The App Store country name as well as its two-letter ISO 3166-1 alpha-2 code.
Device: iPhone or iPad.
Category Page: The featured category page where the placement is displayed.
Type: The feature type of the final placement displayed in the app store.
Subtitle [GARBAGE]: The text shown alongside feature banners and collection titles.
Depth: The number of steps necessary to see the final feature placement.
Row [DUPLICATE]: The final row number along the path leading to the feature placement.
Position: The final position number along the path leading to the feature placement.
Feature Path [DIRTY]: A detailed path of where the feature placement was shown in the app store.
Premium Content [GARBAGE]: Locked additional content, only activated for premium account queries.

Here is the cleaned-up version of the store presence data we want to work with.

[
  "CN",
  "iPhone",
  "Board",
  "Collection List",
  2,
  [
    4,
    4
  ],
  6,
  [
    "Home",
    "Board",
    "免费"
  ]
]
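One way to get from a raw row to this cleaned record is a small transform like the one below. Note that the path normalization, stripping the "Featured " prefix and preferring the "parent" field over "label", is my guess from the two sample records shown here; verify it against your own data:

```python
def clean_row(raw):
    # Positional unpacking of a raw daily-feature row; underscored
    # fields are the ones flagged as garbage or duplicates above.
    (_creative, country, device, category_page, feature_type,
     _subtitle, depth, _row, position, feature_path, _premium) = raw

    # Rebuild the feature path from the step labels. ASSUMPTION:
    # use "parent" when present, and strip the "Featured " prefix.
    steps = [(s.get("parent") or s["label"]).replace("Featured ", "")
             for s in feature_path]

    # The [row, row] pair lives in the last step's detail.
    detail = feature_path[-1].get("detail") or {}

    return [country[1], device, category_page, feature_type,
            depth, detail.get("row"), position, steps]
```

Applied to the sample raw row above, this yields the cleaned record shown.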

Awesome, but enough with the data scraping for today. I will show you how to clean up and restructure the JSON data to fit our needs in the next part.

Check back for Part 2 of the series! If you have any questions or advice, please comment below. Thank you!

Related Links

Check out the following official App Store guide to building your store presence:

Make the Most of the App Store