Tue, 17 Oct 2017

Interview Study List

For a talk I'm planning to give on how to do better in coding interviews, I compiled a list of "study topics" for those who have never learned, or may have gotten rusty on, various topics in computer science and programming.

I am not asserting that all candidates for all jobs must know all of these topics; rather, if there is an area a given job is likely to focus on, this structured approach might help candidates study that area more deeply and expand their current context.

Interview Preparation Study List

  • Programming Abstractions
    • Problem Solving Skills
      • Defining and understanding the problem, including constraints and acceptance criteria
      • Validating your assumptions
      • Proposing possible solutions (along with criteria to evaluate them)
      • Capturing data/testing your assumptions
      • Selecting an approach, eliminating approaches that won’t work, are too risky, or take too long
      • Committing to an approach, validating that it is the right path along the way
    • Software Design
      • Clean Code
        • https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882
    • Data Structures & Abstractions
      • Arrays/vectors/strings, stacks, queues, linked lists, hash tables, trees
    • Recursion
    • Searching and sorting
    • Algorithmic analysis (e.g. “Big-O” notation)
  • Computer Organization & Computer Systems
    • Machine architecture (registers, I/O, basic assembly language)
    • Memory models (pointers, memory allocation, data representation)
    • Compilation (stack frames, semantic analysis, code generation)
    • Basic concurrency (threading, synchronization)
    • System processes (context switching, inter-process communication)
    • Storage and file management (file systems, virtual memory)
    • Caching (including memoization, memory/filesystem caches)
    • Networking (sockets, TCP/IP, routing)
      • Fallacies of Distributed Computing
        • https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
    • Distributed systems
    • Database systems
      • Graph vs. Relational Strategies (SQL, NoSQL)
      • ACID (and how a database achieves it; e.g. row/table-based locking)
      • Schemas/Schema Design
      • How databases can achieve high performance (indexes, normalization/denormalization)
      • Data Modeling
  • Computing Theory
    • Logic
    • Functions
    • State Machines
    • Regular Expressions
    • Complexity Analysis
    • Little’s Law
    • Serial/Parallel computing
    • Hashing
    • Data analysis
    • Machine learning principles
  • Probability
    • Conditional probability
    • Probability distributions
    • Bayes' Theorem
    • Law of Large Numbers/Central Limit Theorem
  • Algorithms
    • Algorithmic complexity & efficiency
    • Time/space efficiency
    • Computational strategy (divide/conquer, greedy, hashing, randomized algorithms, graph, search algorithms)
  • Domain Knowledge (broad spectrum)
    • Project/Program Management (Agile, Lean, Scrum, Kanban, Conway’s Law, Parkinson’s Law, Mythical Man Month)
    • Software Deployment & Delivery (SDLC, Code branching/flow, Source code management, Deployment Strategies)
    • Operational resiliency/scalability (Datacenter, Cloud, Loadbalancers, Read replicas, CQRS)
    • Software Architecture (Domain Driven Design, UML Context & Sequence Diagrams, Class Diagrams)
    • Full stack development (Databases, Web Servers, Caches, Edge-Caching, Browser Caching, client code/performance, JS/CSS frameworks)
    • Computer Images/Graphics (bit depth, compression strategies, PNG vs. JPG, etc., Vector vs. Bitmap)
    • Semantic Content (SGML->HTML, Metadata, Taxonomies, Ontologies)
    • Big Data, Machine Learning, Recommendation Systems, Rules Engines, Heuristics
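Several items in the list above (recursion, caching/memoization, and algorithmic analysis) can be tied together in one small example. A purely illustrative Python sketch:

```python
from functools import lru_cache

# Naive recursion: O(2^n) calls, since each call branches into two more.
def fib_naive(n):
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# Memoized recursion: each distinct n is computed once, so O(n) calls.
# This is the same caching idea that memory/filesystem caches apply at scale.
@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)
```

Being able to explain why the second version is asymptotically faster is exactly the kind of "Big-O" reasoning interviewers probe for.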

Thu, 26 Jan 2017

Picking The Right Tool For the Job: Data Persistence

In a talk I presented several years ago, I suggested that, too often, software engineers revert to form and reach for a relational database without considering the requirements of their project. That decision is often not painful-- at least to start-- for many reasons.

Principally, relational database systems have been optimized over several decades and represent a long heritage of performance, stability, and reliability benefits. However, the decision of which persistence engine to use should always be deliberate, focused on what the core requirements for the datastore actually are. A few key ones include the application's read/write characteristics, the complexity of queries and retrieval, and, not least, the production access patterns (the volume of reads and writes you're likely to see), which help inform decisions driven by the CAP Theorem.

In this post, my goal is to show what you can do with a "graph database" as a contrast to a traditional "relational" database. Relational databases split the application domain into tables, each containing content of a certain type-- for example, a table of Movies-- with pointers (called foreign keys) that reference entities in other tables. You might link a "Movies" table with an "Actors" table by building an "Actor_Movie" join table, each row of which references an actor in the Actors table who acted in a movie referenced in the Movies table.

But querying such relational tables efficiently, for some kinds of queries, puts a lot of complexity into the query/schema, or moves a lot of the heavy lifting (read: time, effort, bugs) into the programmer's application code-- patently the opposite of the intent of selecting a data persistence engine, which is supposed to elegantly solve some of your problems without reinventing the wheel.

A graph database takes a different approach: you place all of your "nouns" (actors, movies, etc.) as "nodes" in a graph, then draw typed "edges" between the nodes to represent relationships. For example, in a movie dataset, nodes could be movies, actors, and directors; edges could be "acted_in" and "directed". Further, an "acted_in" edge can carry an attribute for the role: actor(Tom Hanks) acted_in(as "Forrest Gump") movie(Forrest Gump).
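To make the model concrete, here is a minimal sketch of the nodes-plus-typed-edges idea in plain Python. The dictionaries and names are illustrative only-- this is not Neo4j's storage format or API, just the shape of the data:

```python
# Nodes are "nouns"; each carries a label and attributes.
nodes = {
    "tom_hanks": {"label": "Person", "name": "Tom Hanks"},
    "forrest_gump": {"label": "Movie", "title": "Forrest Gump"},
}

# Edges are (source, edge_type, target, attributes); the attributes
# let an ACTS_IN edge record the role the actor played.
edges = [
    ("tom_hanks", "ACTS_IN", "forrest_gump", {"role": "Forrest Gump"}),
]

# "Who acted in Forrest Gump, and as whom?" is just an edge traversal.
cast = [(nodes[s]["name"], attrs["role"])
        for (s, etype, t, attrs) in edges
        if etype == "ACTS_IN" and t == "forrest_gump"]
```

The point is that the relationship itself is a first-class thing you can attach data to, rather than a row in a join table you must reconstruct with SQL.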

If you want to get a sample dataset running to follow along, continue; otherwise, scroll down to the "Prepare Data" section to fast forward to a demo.

On my Mac, preparing the prerequisites, installing the software, preparing the data, and playing with it was super easy (<10 minutes) using Homebrew. If you've already got Homebrew, Java, or your own dataset, or your platform has different prerequisites, skip the appropriate steps here. Your mileage may vary, but ultimately, use your package manager (or a manual install) to get Neo4j.

-2. Install the Homebrew package manager.
$ ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
$ brew doctor

-1. Install the Java 7 JDK
Download from Oracle if you don't already have it.

0. Download the Cineasts dataset
$ cd /tmp
$ curl -O http://example-data.neo4j.org/files/cineasts_12k_movies_50k_actors.zip

1. Install Neo4j
$ brew install neo4j
2. Configure Neo4j to upgrade the dataset to the 2.0.0 store format automatically on first start
$ vi /usr/local/Cellar/neo4j/2.0.0/libexec/conf/neo4j.properties
Substitute your favorite text editor if you don't know vi.
(if you didn't use brew, you'll need to find your properties file)

Uncomment the line that says:

allow_store_upgrade=true

3. Place the dataset where Neo4J looks for it.
$ cd /usr/local/Cellar/neo4j/2.0.0/libexec/data/
$ mkdir graph.db
$ cd graph.db
$ mv /tmp/cineasts_12k_movies_50k_actors.zip .
$ unzip cineasts_12k_movies_50k_actors.zip

(optional, you might keep it around to reset your data)
$ rm cineasts_12k_movies_50k_actors.zip

4. Start neo4j
$ neo4j start

Prepare Data
The dataset lacks labels, which I find makes querying a bit harder, so let's fix it.

1. Open the Neo4j browser UI (by default at http://localhost:7474):
2. Run the following queries, which add labels to the two distinct node types we care about in the dataset... they take a few seconds, as they perform about 62,000 writes on the dataset to add the labels.

MATCH (n) WHERE n.__type__ = "org.neo4j.cineasts.domain.Movie" SET n:Movie
MATCH (n) WHERE n.__type__ = "org.neo4j.cineasts.domain.Person" SET n:Person
This gives you a sense of Cypher, the SQL-like query language for Neo4J.

Consider the following query:
MATCH (n:Person {name:"Jack Nicholson"}) RETURN n
The way to read this is:

"Match a node that has a label of Person, whose attribute 'name' is 'Jack Nicholson'; call it 'n', and return that node 'n'." This is the equivalent of a basic single-row select in a relational database.

Now let's stretch the bounds of what you can easily do with a relational database. What if we wanted to answer the "Kevin Bacon" question between Kevin Spacey and Meg Ryan? Or between him and Jack Nicholson? Easy!

MATCH p=shortestPath( (s:Person {name:"Kevin Spacey"})-[*]-(r:Person {name:"Meg Ryan"}) ) RETURN p
MATCH p=shortestPath( (s:Person {name:"Kevin Spacey"})-[*]-(n:Person {name:"Jack Nicholson"}) ) RETURN p

However, this returns just one answer-- what if you want all of them?
MATCH p=allShortestPaths( (s:Person {name:"Kevin Spacey"})-[*]-(n:Person {name:"Jack Nicholson"}) ) RETURN p
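Conceptually, shortestPath is doing a breadth-first search over the graph. A small Python sketch of that idea, over a made-up toy graph rather than the real Cineasts data:

```python
from collections import deque

# Toy undirected graph: each node lists its neighbors (people connect
# to movies they appeared in, and vice versa). Data is illustrative.
graph = {
    "Kevin Spacey": ["American Beauty"],
    "American Beauty": ["Kevin Spacey", "Annette Bening"],
    "Annette Bening": ["American Beauty", "Bugsy"],
    "Bugsy": ["Annette Bening"],
}

def shortest_path(start, goal):
    # Standard BFS: the first time we reach `goal`, the path is shortest.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection exists
```

The graph database wins here because the data is already stored as adjacency: each hop is a pointer traversal, not a join.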

On your Neo4J UI, double click nodes to expand, click nodes to see properties on that node or on that edge.

As I started to think about what kind of queries to demonstrate, I thought of some Hollywood trivia. For example, I know that Francis Ford Coppola and Nicolas Cage are related (uncle/nephew). What movies did FFC direct that NC acted in? Sounds difficult? With Cypher it's easy:
MATCH p=allShortestPaths( (s:Person {name:"Francis Ford Coppola"})-[*]-(n:Person {name:"Nicolas Cage"}) ) RETURN p

Astute readers might point out that this is a less-than-elegant query because it shows all relationships-- that is, it would also return a movie that both NC and FFC acted in. Fixing that is easy:
MATCH (nc:Person {name:"Nicolas Cage"})-[:ACTS_IN]->(m)<-[:DIRECTED]-(ffc:Person {name:"Francis Ford Coppola"}) RETURN m

Expressing that same logic in a relational database is not quite as easy, and for equivalent performance it is likely to tax the hardware it's run on much more than a graph database would.
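To illustrate the contrast, here is roughly what that query looks like relationally, sketched with Python's built-in sqlite3 module. The three-table schema and the sample rows are made up for illustration (Peggy Sue Got Married really was a Coppola-directed film starring Cage):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies  (id INTEGER PRIMARY KEY, title TEXT);
-- One join table per relationship type.
CREATE TABLE acts_in  (person_id INTEGER, movie_id INTEGER);
CREATE TABLE directed (person_id INTEGER, movie_id INTEGER);
INSERT INTO persons VALUES (1, 'Nicolas Cage'), (2, 'Francis Ford Coppola');
INSERT INTO movies  VALUES (10, 'Peggy Sue Got Married');
INSERT INTO acts_in  VALUES (1, 10);
INSERT INTO directed VALUES (2, 10);
""")

# Movies Cage acted in that Coppola directed: a multi-way join,
# where the Cypher version was a single two-hop pattern match.
rows = con.execute("""
    SELECT m.title
    FROM movies m
    JOIN acts_in  a   ON a.movie_id = m.id
    JOIN persons  nc  ON nc.id = a.person_id
                     AND nc.name = 'Nicolas Cage'
    JOIN directed d   ON d.movie_id = m.id
    JOIN persons  ffc ON ffc.id = d.person_id
                     AND ffc.name = 'Francis Ford Coppola'
""").fetchall()
```

And this is the simple case-- the shortestPath queries above have no fixed number of hops, so they can't be expressed as a join of a fixed number of tables at all without recursive SQL.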

And that's the key takeaway-- to do the job most efficiently, you need to pick the right tool for the job, and that means having both relational and "post-relational" data storage solutions in your toolchest.

Using an Airport Express with Yosemite

One of the perils of upgrading software is that your older hardware may be left out in the cold if your software configuration is out of date. For example, with the weather turning nice this weekend, I had a configuration change I needed to make to my old, trusty Airport Express that I use to AirPlay music to my outdoor speakers.

Except, the Airport Utility 6.3.4 no longer supports the old Airport Express devices. And yes, you can go and download Airport Utility 5.6.1 from Apple, which does support the older devices, except Yosemite changed the location of a few files that breaks the 5.6.1 version, so we're at an impasse.

Corey Mahler comes to the rescue, however, with a launcher that will enable you to launch 5.6.1 on Yosemite.

Unless, of course, you've reset your Airport Express to default/factory settings, in which case Yosemite jumps in the way yet again. When you reset the device, Apple "helpfully" attempts to make configuration of new Airport devices easier by obscuring the device's SSID, which is normally something like "Airport Express d8fc3ecb". When you try to join it, Yosemite launches the 6.3.4 Utility to help with configuration, but with an older, incompatible device, it just ends up saying "This device is incompatible with this version of Airport Utility".

So we're stuck again.

The good news is, I found a workaround.

  1. Launch Terminal.app
  2. Run the following command: /System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport scan
  3. Find the name of the Airport Express base station (e.g. "Airport Express d8fc3ecb")
  4. Copy the name of the SSID from Terminal.app
  5. From the wifi menu dropdown, choose "Join Other Network"
  6. Enter that as the SSID, and leave security off

Now you'll be on the proper network, and Airport Utility 5.6.1 will begin to work. Have fun enjoying your music!

Wed, 28 Dec 2016

Hackintosh Details

I recently completed a hackintosh install. For those wishing to create a similar beast, the Sierra-compatible parts list is below. I chose components with the best price/performance to keep the price under $1000. Scroll to the end to look at other options.

Type | Item | Price
CPU | Intel Core i5-6600K 3.5GHz Quad-Core Processor | ~$238.89
CPU Cooler | Corsair H55 57.0 CFM Liquid CPU Cooler | ~$59.99
Motherboard | Gigabyte GA-Z170MX-Gaming 5 Micro ATX LGA1151 Motherboard | ~$146.73
Memory | Crucial Ballistix Sport LT 16GB (4 x 4GB) DDR4-2400 Memory | ~$113.26
Storage | Samsung 250GB 2.5" Solid State Drive | ~$89.99
Video Card | EVGA GeForce GT 740 4GB Superclocked Video Card | ~$89.99
Case | BitFenix Phenom M Arctic White MicroATX Mini Tower Case | ~$81.29
Power Supply | Corsair CXM 450W 80+ Bronze Certified Semi-Modular ATX Power Supply | ~$57.34
Wireless Network Adapter | TP-Link Archer T8E PCI-Express x1 802.11a/b/g/n/ac Wi-Fi Adapter | ~$58.58

To go faster, you can try the following swaps... I can't vouch for these components, as they are not in my build, but they should work.

If you prefer a more budget-conscious approach, you can go for a lower-spec CPU and leave out the EVGA graphics card; the CPU's onboard graphics (HD 530 for this i5) will work fine for casual gaming and browsing.

Once you get the basic computer components put together, you'll need to follow an installation guide; that's what ultimately worked seamlessly for me.

FTC disclosure: I have not been compensated by any company for the products mentioned here. Links go to my Amazon affiliates account.

Tue, 11 Nov 2014

Driving to Root Cause

This post is largely about software engineering, but it really applies to any situation where a preventable problem is encountered.

Too often, individuals, and worse, teams, assume that failures are inevitable, and there's nothing you can do to prevent them.

High-performing teams, on the other hand, almost relish failures, because they have gotten very good at addressing the reasons those failures occurred. They eagerly look for the source of variability, or lack of control, in a particular variable, then lock it down and make controlling it part of their process.

Once enough of these variables are locked down, solving problems becomes very simple. Using a strategy called the "five whys", you too can operate like a seasoned problem-solving engineer: for each failure in a system, ask "Why?", up to five times, until you reach a cause that represents a broken process.

If you're doing this on a team, remind the team members that they need to bring good ego strength, and seek to understand the broken process, as this is not a finger pointing exercise!

For example, if you reach a conclusion that says "John is just a poor QA tester; he needs to do a better job" or "Jane is just a poor developer; she needs better coaching on C++", you've pointed fingers, not found the root cause.

A better way:

1. Joe didn’t catch an error in the Messages section of the site.
2. The test cases didn’t capture a feature unique to the Messages section.
3. The Messages test cases were built from a spreadsheet kept for the previous version.
4. We assumed the old test cases would cover testing of the new feature.
5. The requirements for the new feature should have had better acceptance criteria, and our definition of done should have included writing new test cases for any new functionality.

Root Cause:
- Stories must have acceptance criteria for new features.
- All new functionality should have test cases written against those acceptance criteria

Root Resolution:
- Team has added “All User Stories Must Have Acceptance Criteria” to their Sprint-Ready definition of a Story.
- Team has added “No Stories that are not considered Sprint Ready allowed in the Sprint Planning Meeting” to their Team Agreement

The goal here is for Joe to realize that his job is to follow a documented process, not to beat himself up over missing a test case.

Root cause analysis always strives to find the process that is broken, and then expects a team to uphold the process. Leave things to an individual, and they are likely to break. Leave it up to a team, and have the team enforce the process and you’re likely to not repeat the same mistakes.

For super-complex root cause analysis, another technique is the Ishikawa, or fishbone, diagram. It allows more than the simple linear progression of the five whys, enabling analysis of a web of contributing causes. Stay tuned for a post on that in the future.


Written using MacVim
Published by Blosxom
Layout: Blueprint CSS