Sat, 28 Oct 2017

It's Time for a Trusted Social Network

One of the reasons that Google has become the predominant search engine is its discovery and application of "PageRank" to search results. Google determines which search results are most relevant, in part, by examining hyperlinks in web pages.

The idea goes: a link implies trust. The more trusted pages that link to you, the more trust you accumulate, and so the links you make to others carry more weight than links from pages without many trusted inbound links.

If that hurts your head a little, it should, because it's a recursive algorithm that constantly needs to be recomputed.
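
To make that concrete, here is a minimal sketch of the textbook power-iteration form of PageRank over a tiny, made-up link graph. This is illustrative only; it is not Google's production algorithm.

# PageRank sketch: textbook power iteration over a tiny link graph.
# Illustrative only; not Google's production algorithm.
links = {
    "a": ["b", "c"],   # page "a" links to pages "b" and "c"
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)   # a page's trust is split among its outbound links
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank                          # recompute until the scores settle
    return rank

print(pagerank(links))

Each page's score depends on the scores of the pages linking to it, which is why the whole thing has to be iterated until it converges.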

The problem with Facebook and most other social networks is that they haven't applied the lessons of PageRank to their feeds. The news and information that gets surfaced shouldn't be based on what you like most, on ad revenue, or on other social signals that are easy to fake.

eBay has long had a similar system for its sellers: to become trusted, you have to complete transactions that people say were legitimate. Uber and Lyft have one for their drivers (and riders). Facebook merely needs to provide a feedback loop (like "Likes"; perhaps call them "Kudos" or "Rep") based on accuracy or other measures of veracity, so that when articles with sufficient Kudos are shared, articles from those authors and sources get boosted. Conversely, when authors or news sources have to publish corrections, are refuted by more trusted sources, or are found to have misled or lied, their ratings should drop, and they should be unceremoniously dropped from the "shareable" web-sphere.
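
A minimal sketch of what such a feedback loop might look like; the function names, weights, and threshold here are entirely hypothetical, just to illustrate the idea:

# Hypothetical reputation scoring; the weights and threshold are made up for illustration.
def update_rep(rep, kudos=0, corrections=0, refutations=0):
    rep += 1.0 * kudos          # accuracy endorsements nudge the score up
    rep -= 5.0 * corrections    # published corrections pull it down
    rep -= 10.0 * refutations   # refutations by more-trusted sources pull it down harder
    return rep

def shareable(rep, threshold=0.0):
    # Sources below the threshold get dropped from the "shareable" sphere.
    return rep >= threshold

source_rep = update_rep(10.0, kudos=12, corrections=1, refutations=2)
print(source_rep, shareable(source_rep))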





Tue, 17 Oct 2017

Interview Study List

For a talk I'm planning to give on how to do better in coding interviews, I compiled a list of "study topics" for those who have either never learned, or have gotten rusty on, various topics in computer science and programming.

I am not asserting that every candidate must know all of these topics for every job; rather, if there is an area that a given job is likely to focus on, this structured approach might help candidates delve deeper into that area of study and expand their current context.

Interview Preparation Study List



  • Programming Abstractions
    • Problem Solving Skills
      • Defining and understanding the problem, including constraints and acceptance criteria
      • Validating your assumptions
      • Proposing possible solutions (along with criteria to evaluate them)
      • Capturing data/testing your assumptions
      • Selecting an approach, eliminating approaches that won’t work, are too risky, or take too long
      • Committing to an approach, validating that it is the right path along the way
    • Software Design
      • Clean Code
        • https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882
    • Data Structures & Abstractions
      • Arrays/vectors/strings, stacks, queues, linked lists, hash tables, trees
    • Recursion
    • Searching and sorting
    • Algorithmic analysis (e.g. “Big-O” notation)
  • Computer Organization & Computer Systems
    • Machine architecture (registers, I/O, basic assembly language)
    • Memory models (pointers, memory allocation, data representation)
    • Compilation (stack frames, semantic analysis, code generation)
    • Basic concurrency (threading, synchronization)
    • System processes (context switching, inter-process communication)
    • Storage and file management (file systems, virtual memory)
    • Caching (including memoization, memory/filesystem caches; see the memoization sketch after this list)
    • Networking (sockets, TCP/IP, routing)
      • Fallacies of Network Computing
        • https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
    • Distributed systems
    • Database systems
      • Graph vs. Relational Strategies (SQL, NoSQL)
      • ACID (and how a database achieves this; i.e. row/table based locking)
      • Schemas/Schema Design
      • How databases can achieve high performance (indexes, normalization/denormalization)
      • Data Modeling
  • Computing Theory
    • Logic
    • Functions
    • State Machines
    • Regular Expressions
    • Complexity Analysis
    • Little’s Law
    • Serial/Parallel computing
    • Hashing
    • Data analysis
    • Machine learning principles
  • Probability
    • Conditional probability
    • Probability distributions
    • Bayes' Theorem
    • Law of Large Numbers/Central Limit Theorem
  • Algorithms
    • Algorithmic complexity & efficiency
    • Time/space efficiency
    • Computational strategy (divide/conquer, greedy, hashing, randomized algorithms, graph, search algorithms)
  • Domain Knowledge (broad spectrum)
    • Project/Program Management (Agile, Lean, Scrum, Kanban, Conway’s Law, Parkinson’s Law, Mythical Man Month)
    • Software Deployment & Delivery (SDLC, Code branching/flow, Source code management, Deployment Strategies)
    • Operational resiliency/scalability (Datacenter, Cloud, Loadbalancers, Read replicas, CQRS)
    • Software Architecture (Domain Driven Design, UML Context & Sequence Diagrams, Class Diagrams)
    • Full stack development (Databases, Web Servers, Caches, Edge-Caching, Browser Caching, client code/performance, JS/CSS frameworks)
    • Computer Images/Graphics (bit depth, compression strategies, PNG vs. JPG, etc., Vector vs. Bitmap)
    • Semantic Content (SGML->HTML, Metadata, Taxonomies, Ontologies)
    • Big Data, Machine Learning, Recommendation Systems, Rules Engines, Heuristics
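
As a small worked example that touches several of the topics above (recursion, hashing, memoization, and Big-O analysis), here is a sketch contrasting naive and memoized Fibonacci. It is illustrative only, not something any particular interviewer expects verbatim:

from functools import lru_cache

def fib_naive(n):
    # Exponential time, roughly O(2^n) calls, because subproblems are recomputed.
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Linear time: each subproblem is computed once and cached in a hash table.
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_naive(20), fib_memo(90))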





Thu, 26 Jan 2017

Picking The Right Tool For the Job: Data Persistence

In a talk I presented several years ago, I suggested that, too often, software engineers default to using a relational database without considering the requirements of their project. These decisions are often not painful, at least to start, for many reasons.

Principally, relational database systems have been optimized over several decades and carry a long heritage of performance, stability, and reliability benefits. However, the choice of persistence engine should always be a deliberate one that focuses on what the core requirements for the datastore actually are. A few key requirements include the read/write characteristics of the application, the complexity of queries and retrieval, and, not least, the production access patterns (what volume of reads and writes you're likely to see), which help inform trade-offs driven by the CAP Theorem.

In this post, my goal is to show what you can do with a "graph database" as a contrast to a traditional "relational" database. Relational databases split the application domain into tables, each containing rows of a specific type, with pointers (called foreign keys) that refer to entities in other tables. For example, you might link a "Movies" table with an "Actors" table by building an "Actor_Movie" join table, in which each row references an actor in the Actors table and the movie in the Movies table that the actor appeared in.

But querying such relational tables efficiently, for some kinds of questions, either adds a lot of complexity to the query and schema or moves a lot of the heavy lifting (read: time, effort, bugs) into the application code, which is patently the opposite of the intent of selecting a data persistence engine, which is supposed to elegantly solve some of your problems without reinventing the wheel. A graph database takes a different approach: you place all of your "nouns" (actors, movies, etc.) as "nodes" in a graph, and then draw "edges" of various types between those nodes, each representing a relationship. For example, in a movie dataset, nodes could be movies, actors, and directors; edges could be "acted_in" and "directed". Further, an "acted_in" edge can carry an attribute for the role: actor (Tom Hanks) acted_in (as "Forrest Gump") movie (Forrest Gump).
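
To make the node/edge model concrete, here is a tiny, purely illustrative Python sketch of the bookkeeping you would otherwise hand-roll in application code: nodes, plus typed edges that carry their own attributes (all names here are made up for illustration):

# Illustrative only: hand-rolled nodes and typed, attributed edges,
# i.e. the structure a graph database manages for you.
nodes = {
    "tom_hanks": {"label": "Person", "name": "Tom Hanks"},
    "robert_zemeckis": {"label": "Person", "name": "Robert Zemeckis"},
    "forrest_gump": {"label": "Movie", "title": "Forrest Gump"},
}

edges = [
    # (from, edge type, to, edge attributes)
    ("tom_hanks", "ACTS_IN", "forrest_gump", {"role": "Forrest Gump"}),
    ("robert_zemeckis", "DIRECTED", "forrest_gump", {}),
]

# "Who acted in Forrest Gump, and in what role?"
for src, rel, dst, attrs in edges:
    if rel == "ACTS_IN" and dst == "forrest_gump":
        print(nodes[src]["name"], "as", attrs["role"])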

If you want to get a sample dataset running to follow along, continue; otherwise, scroll down to the "Prepare Data" section to fast forward to a demo.

Prerequisites
On my Mac, preparing the prerequisites, installing the software, preparing the data, and playing with it was super easy (<10 minutes) using Brew. If you've already got Brew, Java, or your own dataset, or your platform has different prerequisites, skip the appropriate steps here. Your mileage may vary, but ultimately, use your package manager (or a manual download) to install Neo4J.

-2. Install Brew package manager.
$ ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
$ brew doctor

-1. Install the Java 7 SDK
Download from Oracle if you don't already have it.

0. Download the Cineasts dataset
$ cd /tmp
$ curl -O http://example-data.neo4j.org/files/cineasts_12k_movies_50k_actors.zip


Install
1. Install Neo4j
$ brew install neo4j
2. Allow Neo4j 2.0.0 to upgrade the dataset automatically on first start
$ vi /usr/local/Cellar/neo4j/2.0.0/libexec/conf/neo4j.properties
Substitute your favorite text editor if you don't know vi.
(if you didn't use brew, you'll need to find your properties file)

Uncomment the line that says:
#allow_store_upgrade=true

3. Place the dataset where Neo4J looks for it.
$ cd /usr/local/Cellar/neo4j/2.0.0/libexec/data/
$ mkdir graph.db
$ cd graph.db
$ mv /tmp/cineasts_12k_movies_50k_actors.zip .
$ unzip cineasts_12k_movies_50k_actors.zip


(optional; you might instead keep the zip around so you can reset your data later)
$ rm cineasts_12k_movies_50k_actors.zip

4. Start neo4j
$ neo4j start

Prepare Data
The dataset lacks labels, which I find makes querying a bit harder, so let's fix it.

1. Open the neo4j UI:

http://localhost:7474/browser/

2. Run the following queries, which add labels to the two distinct types we care about in the dataset... they take a few seconds, as they perform about 62,000 writes on the dataset to add the labels.

MATCH (n) WHERE n.__type__ = "org.neo4j.cineasts.domain.Movie" SET n:Movie
MATCH (n) WHERE n.__type__ = "org.neo4j.cineasts.domain.Person" SET n:Person
This gives you a sense of Cypher, the SQL-like query language for Neo4J.

Queries
Consider the following query:
MATCH (n:Person {name:"Jack Nicholson"}) RETURN n
The way to read this is:

"Match a node, call it 'n', that has a label of Person and whose 'name' attribute is 'Jack Nicholson', and return that node 'n'." This is the equivalent of a basic single-row select in a relational database.

Now let's stretch the bounds of what you can easily do with a relational database. What if we wanted to answer the "Kevin Bacon" question between Kevin Spacey and Meg Ryan? Or between him and Jack Nicholson? Easy!

MATCH p=shortestPath( (s:Person {name:"Kevin Spacey"})-[*]-(r:Person {name:"Meg Ryan"}) ) RETURN p
MATCH p=shortestPath( (s:Person {name:"Kevin Spacey"})-[*]-(n:Person {name:"Jack Nicholson"}) ) RETURN p

However, this returns just one answer-- what if you want all of them?
MATCH p=allShortestPaths( (s:Person {name:"Kevin Spacey"})-[*]-(n:Person {name:"Jack Nicholson"}) ) RETURN p

On your Neo4J UI, double click nodes to expand, click nodes to see properties on that node or on that edge.

As I started to think about what kind of queries to demonstrate, I thought of some Hollywood trivia. For example, I know that Francis Ford Coppola and Nicolas Cage are related (uncle/nephew). What movies did FFC direct that NC acted in? Sounds difficult? With Cypher it's easy:
MATCH p=allShortestPaths( (s:Person {name:"Francis Ford Coppola"})-[*]-(n:Person {name:"Nicolas Cage"}) ) RETURN p

Astute readers might point out that this is a less than elegant query because it shows all relationships-- that is, it would also return a movie that NC and FFC both acted in. Fixing that is easy:
MATCH (nc:Person {name:"Nicolas Cage"})-[:ACTS_IN]->(m)<-[:DIRECTED]-(ffc:Person {name:"Francis Ford Coppola"}) RETURN m

Expressing that same logic in a relational database is not quite as easy, and it is likely to tax the hardware it's run on much more than a graph database would for equivalent performance.

And that's the key takeaway-- to do the job most efficiently, you need to pick the right tool for the job, and that means having both relational and "post-relational" data storage solutions in your toolchest.




Using an Airport Express with Yosemite

One of the perils of upgrading software is that your older hardware may be left out in the cold if your software configuration is out of date. For example, with the weather turning nice this weekend, I needed to make a configuration change to my old, trusty Airport Express that I use to AirPlay music to my outdoor speakers.

Except that Airport Utility 6.3.4 no longer supports the old Airport Express devices. And yes, you can go and download Airport Utility 5.6.1 from Apple, which does support the older devices, except that Yosemite changed the location of a few files in a way that breaks the 5.6.1 version, so we're at an impasse.

Corey Mahler comes to the rescue, however, with a launcher that will enable you to launch 5.6.1 on Yosemite.

Unless, of course, you've reset your Airport Express to default/factory settings, in which case Yosemite jumps in the way yet again. When you reset the device, Apple "helpfully" attempts to make configuration of new Airport devices easier by obscuring the name of the device's SSID. Normally, it's something like "Airport Express d8fc3ecb". Yosemite launches the 6.3.4 Utility when you try to join that network, to help with configuration, but with an older, incompatible device, it just ends up saying "This device is incompatible with this version of Airport Utility".

So we're stuck again.

The good news is, I found a workaround.

  1. Launch Terminal.app
  2. Run the following command: /System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport scan
  3. Find the name of the Airport Express base station (e.g. "Airport Express d8fc3ecb")
  4. Copy the name of the SSID from Terminal.app
  5. From the wifi menu dropdown, choose "Join Other Network"
  6. Enter that as the SSID, and leave security off


Now you'll be on the proper network, and Airport Utility 5.6.1 will begin to work. Have fun enjoying your music!





Wed, 28 Dec 2016

Hackintosh Details

I recently completed a hackintosh install. For those wishing to create a similar beast, the Sierra-compatible parts list is below. I chose components with the best price/performance to keep the price under $1000. Scroll to the end to look at other options.

Type | Item | Price
CPU | Intel Core i5-6600K 3.5GHz Quad-Core Processor | ~$238.89
CPU Cooler | Corsair H55 57.0 CFM Liquid CPU Cooler | ~$59.99
Motherboard | Gigabyte GA-Z170MX-Gaming 5 Micro ATX LGA1151 Motherboard | ~$146.73
Memory | Crucial Ballistix Sport LT 16GB (4 x 4GB) DDR4-2400 Memory | ~$113.26
Storage | Samsung 250GB 2.5" Solid State Drive | ~$89.99
Video Card | EVGA GeForce GT 740 4GB Superclocked Video Card | ~$89.99
Case | BitFenix Phenom M Arctic White MicroATX Mini Tower Case | ~$81.29
Power Supply | Corsair CXM 450W 80+ Bronze Certified Semi-Modular ATX Power Supply | ~$57.34
Wireless Network Adapter | TP-Link Archer T8E PCI-Express x1 802.11a/b/g/n/ac Wi-Fi Adapter | ~$58.58


To go faster, you can try the following swaps... I can't vouch for these components, as they are not in my build, but they should work.


If you prefer a more budget-conscious approach, you can go for a lower-spec CPU and leave out the Gigabyte/EVGA graphics card; the CPU's onboard graphics (HD 530 for this i5) will work fine for casual gaming and browsing.


Once you get the basic computer components put together, you'll need to follow a guide, which is what ultimately worked seamlessly for me.


FTC disclosure: I have not been compensated by any company for the products mentioned here. Links go to my Amazon affiliates account.






Colophon

Written using MacVim
Published by Blosxom
Layout: Blueprint CSS