Chapter 5. Cloud Patterns
Note that when processing semi-structured data in dataflows, our data often balloons to a large intermediate state as we compute cross products of different data sets to count, sum, average, or otherwise manipulate them. Our final data, suitable for serving to our application from a NoSQL store, tends to shrink back down until it is much smaller than the original input. In other words, the descriptions of our data that we actually publish are usually much smaller than the intermediate state of our data during processing.
When our projections exceed the RAM of one computer, we must employ tools for parallel computation, such as Hadoop and Pig via Elastic MapReduce.
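The balloon-then-shrink shape can be seen even at toy scale. The following is a minimal sketch in plain Python, not the Pig dataflow this chapter has in mind. The (message_id, recipient) records and the function names cross_by_message and count_pairs are assumptions made up for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical input: one (message_id, recipient) record per delivery.
records = [(m, r) for m in ("m1", "m2", "m3") for r in ("ann", "bob", "cat", "dan")]

def cross_by_message(records):
    """Pair every recipient of a message with every recipient of that message."""
    by_message = {}
    for message_id, recipient in records:
        by_message.setdefault(message_id, []).append(recipient)
    return [
        (message_id, a, b)
        for message_id, recipients in by_message.items()
        for a, b in product(recipients, repeat=2)
    ]

def count_pairs(crossed):
    """Collapse the intermediate rows into one count per unordered pair."""
    return Counter((a, b) for _, a, b in crossed if a < b)

crossed = cross_by_message(records)   # large intermediate state
summary = count_pairs(crossed)        # small, publishable result

print(len(records), len(crossed), len(summary))  # prints: 12 48 6
```

Twelve input records grow to 48 intermediate rows and then shrink to 6 published counts. At real scale, it is the middle number that decides whether the work still fits on one machine.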
When developing analytics applications, operations tend to shift across the stack, from the front end to the back end.
Multiple consumers of data in an analytics team should meet and agree to share intermediate, derived data. Sharing this data lets the team pool resources for documentation, QA, and operations.
Creating Keysets: Curating Ontologies
When deriving new entities and relationships, things can get complex fast, because each branch of a split may filter records differently and end up with a different set of keys. We can solve this problem by filtering for both sides of a split in a precursor step, so that both kinds of records share the same indexes. This is not always possible, but it is recommended whenever the filters for both sets can be combined. Diverging keysets can be a major headache, so it is important to centralize and highlight filtering whenever possible.
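One way to picture the precursor filter is the sketch below, written in plain Python rather than Pig. The email records, their field names, and the helpers keep and derive are hypothetical, chosen only to show both derived relations being built from a single filtered set.

```python
# Hypothetical email records; the field names are assumptions for illustration.
emails = [
    {"id": 1, "from": "ann@example.com", "to": "bob@example.com", "spam": False},
    {"id": 2, "from": "bob@example.com", "to": "", "spam": False},
    {"id": 3, "from": "junk@example.com", "to": "ann@example.com", "spam": True},
    {"id": 4, "from": "cat@example.com", "to": "ann@example.com", "spam": False},
]

def keep(record):
    """The single, centralized filter that both sides of the split rely on."""
    return not record["spam"] and bool(record["to"])

def derive(records):
    """Filter once in a precursor step, then split into two keyed relations."""
    clean = [r for r in records if keep(r)]            # precursor step
    senders = {r["id"]: r["from"] for r in clean}      # one side of the split
    recipients = {r["id"]: r["to"] for r in clean}     # the other side
    return senders, recipients

senders, recipients = derive(emails)

# Both relations were built from the same filtered records, so their keysets match.
print(sorted(senders) == sorted(recipients))  # prints: True
```

If each side applied its own filter instead, for example one dropping spam and the other dropping messages without a recipient, the two keysets would diverge and later joins between them would silently lose records.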