Thurloat.com Adam Thurlow

Advanced App Engine Bulk Downloading

Feb. 25, 2011 - Posted by Thurloat
http://commondatastorage.googleapis.com/thurloat/appenginedump.png

The new Bulkloader data exporter is much easier and more automated than the nasty old way of exporting. As good as it is, one export script I was tasked with writing had a lot of additional requirements that looked beyond the scope of the Bulkloader YAML configuration file. By reading the documentation and diving into the source, I was able to find some creative solutions and, for the most part, keep using the YAML file.

The long and winding road

I'm going to go through the problems I faced, and how I was able to solve them using primarily the YAML config file with the new Bulkloader.

CSV is out, Pipe separated is in

One of the requirements was that the exported data must not be comma-separated. It needed to be delimited by a pipe character, "|". The options for exporting data via the Bulkloader are CSV, simpletext and xml. I don't see "PSV" as an option.

Here's what I did to fix this:

- kind: model_name
  connector: csv
  connector_options:
    export_options:
        delimiter: "|"

I was able to re-use the Bulkloader CSV connector by passing additional arguments through it to the Python csv module. The outcome was a nice, clean pipe-separated data format. This was much, much better than my original idea: letting the CSV connector generate the file and then using the Python csv module to re-write it as pipe-separated.
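To see why the delimiter option is enough, here is a minimal sketch of the same idea in plain Python 3. It is not the Bulkloader's code: export_rows, the field names and the sample row are made up for illustration. It only shows keyword options being forwarded to csv.DictWriter, which is roughly what export_options does for the connector.

```python
import csv
import io

def export_rows(rows, fieldnames, **export_options):
    """Hypothetical helper: write dict rows to a string, forwarding
    any extra keyword options straight to csv.DictWriter."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, **export_options)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Sample data, invented for this example.
rows = [{"first_name": "Adam", "last_name": "Thurlow"}]

# Same option as in the YAML: delimiter: "|"
psv = export_rows(rows, ["first_name", "last_name"], delimiter="|")
print(psv)
```

Swapping delimiter="|" for a dialect keyword works the same way, since both are ordinary csv writer options.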

The two parameters that you can pass to the CSV module are: delimiter and dialect. These options can be found in the ConnectorSubOptions class in

Massaging Data

There were additional data formatting changes that needed to occur to meet the requirements of the "PSV" formatted file. By default, the Bulkloader auto-generated configuration writes out some default export_transform properties for objects that don't convert nicely to strings. They cover some use cases, but not enough to meet these requirements.

Here are a few simple examples of how I leveraged the power of lambda expressions to massage the data into place.

Yes / No instead of True / False

This wasn't a big deal, but it opened my brain to using lambdas to tweak the exported data. It simply reads in the True / False value and outputs a 'yes' / 'no' string as the formatted value.

- property: boolean_field
  external_name: Boolean Field
  import_transform: transform.regexp_bool('true', re.IGNORECASE)
  export_transform: "lambda x: 'yes' if x else 'no'"
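The lambda can be tried on its own outside the Bulkloader. The sketch below copies the export transform from the config as-is; regexp_bool_sketch is a hypothetical stand-in that assumes transform.regexp_bool behaves like a plain re.match test, and is only there to show what the import side would do with the exported value.

```python
import re

# The export transform, exactly as written in the YAML above.
to_yes_no = lambda x: 'yes' if x else 'no'

def regexp_bool_sketch(pattern, flags=0):
    """Hypothetical stand-in for transform.regexp_bool (an assumption,
    not the SDK code): True when the pattern matches the text."""
    return lambda value: re.match(pattern, value, flags) is not None

from_text = regexp_bool_sketch('true', re.IGNORECASE)

print(to_yes_no(True))    # yes
print(to_yes_no(False))   # no
print(to_yes_no(None))    # no, an unset property also exports as 'no'

# Under the assumption above, an exported 'yes' would not import as True.
print(from_text(to_yes_no(True)))
```

If the exported file ever has to be imported again, the import pattern would probably need to match 'yes' rather than 'true'.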

Convert list to Comma-Separated String

I used this one often in my exporter. It's common to store data in a list, but Python's default string formatting renders a list as "[u'item1', u'item2']". That is clearly unacceptable for a pretty data export; this transformer outputs the list as "item1,item2".

- property: list_field
  external_name: List Field
  export_transform: "lambda x: ','.join(x) if type(x) is list else x if x else ''"
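Here is the same lambda run against a few inputs, as a quick sanity check of each branch. The sample values are made up, and the snippet is plain Python 3, so the u'' prefixes from the Python 2 runtime don't appear.

```python
# The export transform, exactly as written in the YAML above.
join_list = lambda x: ','.join(x) if type(x) is list else x if x else ''

print(join_list(['item1', 'item2']))  # item1,item2
print(repr(join_list([])))            # '' (an empty list joins to nothing)
print(repr(join_list(None)))          # '' (a missing property)
print(join_list('already a string'))  # passed through untouched
```

One thing to watch for: ','.join only accepts strings, so a list of integers or keys would need each item converted with str first.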

Convert a list of values into single values split in Multiple Columns

This example is a little more specific to my use case, but it still serves as a good example of having a single property in the datastore act as multiple columns in the exported file. Here, I have a de-normalized list of person details so I can leverage App Engine's equality matching on list properties. For example, I can look for an object whose person details contain "Adam", "Thurlow" or "thurloat@gmail.com". In the export, each of those list items is separated out into its own column; going by the indexes in the config, the list holds the email first, then the first name, then the last name.

- property: person_details
  export:
      - external_name: First Name
        export_transform: "lambda x: x[1] if type(x) is list and len(x) > 1 else ''"
      - external_name: Last Name
        export_transform: "lambda x: x[2] if type(x) is list and len(x) > 2 else ''"
      - external_name: Email
        export_transform: "lambda x: x[0] if type(x) is list and len(x) > 0 else ''"
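The three lambdas only differ by index, so the pattern can be sketched with a small factory. The column helper below is hypothetical and not part of the Bulkloader; the YAML itself still needs one lambda per column. The sample list follows the order implied by the indexes above.

```python
def column(index):
    """Hypothetical helper: build a transform that picks one list item,
    falling back to '' when the value is not a long enough list."""
    return lambda x: x[index] if type(x) is list and len(x) > index else ''

first_name = column(1)
last_name = column(2)
email = column(0)

# Order implied by the config: email, first name, last name.
person_details = ['thurloat@gmail.com', 'Adam', 'Thurlow']

row = [first_name(person_details), last_name(person_details), email(person_details)]
print('|'.join(row))

# A short or missing list yields empty columns instead of an IndexError.
print(repr(last_name(['thurloat@gmail.com'])))
print(repr(email(None)))
```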

Quoted Printables

One of the problems discovered early on was that db.Text fields longer than 80 characters came out with lines ending in '=\n' or '=20\n'. The cause is that when you POST form data to the Blobstore, the Blobstore encodes all large text as MIME quoted-printable. The simplest way I found to get around this was to take advantage of the Python quopri module.

python_preamble:
...
- import: quopri
...

transformers:

- kind: model_name
  property_map:
    - property: message_body
      import_transform: db.Text
      export_transform: quopri.decodestring
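To see what the decoding step does, here is a small Python 3 sketch using the standard library quopri module. The sample text is made up. Note that in Python 3 these functions work on bytes, whereas the App Engine runtime at the time was Python 2, where plain strings were passed.

```python
import quopri

# A soft line break ('=\n') and an encoded trailing space ('=20').
wrapped = b"hello =\nworld=20\n"
print(quopri.decodestring(wrapped))

# Round trip on a made-up message longer than one encoded line.
message = b"word " * 30 + b"end"
encoded = quopri.encodestring(message)
restored = quopri.decodestring(encoded)
print(restored == message)
```

The soft line breaks disappear and '=20' turns back into a space, which is the clean-up the exported db.Text values needed.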

Cheers!