Yuval Adam

Migration to Git - a Review

It’s been almost 4 months since we migrated our SVN repository over to Git. TL;DR;? We haven’t looked back since.

I can definitely say our workflow has improved considerably since the migration. We spend less time merging, we spend less time fussing with SCM quirks, and we are generally much more productive.

Our repository landscape is pretty simple - we have one authoritative repository to and from which we push and pull the master codeline. Production servers pull from that repository whenever we decide to trigger an update. Local development branches are managed by the developers on their own machines. Sharing development branches is something we don’t do often, so we don’t push and pull branches directly between devs; rather, we share them via the central repo. This also allows us to pull those branches onto staging servers when necessary (since the dev machines are not generally accessible from the WAN).

So we’re pretty much happy with git. That said, we have suffered some pain - pretty much all of it coming from TortoiseGit. Half of our devs run on Linux/Mac while the others use Windows. I won’t lie, setting up and running git on Windows is (still) not an optimal experience. Windows devs are generally not keen on running command-line git, so they resort to TortoiseGit. And as mentioned, TortoiseGit has lots of drawbacks. The main problem is that the Tortoise workflow tries to emulate the classic SVN experience, which maps very badly to the git way of things.

The worst symptom we’ve experienced was the infamous partial merge. It boils down to a TortoiseGit user committing a merge, but unknowingly deselecting files that are thought not to be relevant (“I didn’t touch those files, so the safest thing is to not commit them in the merge”). Bad, bad, BAD. In git terms this translates to “these files should not be merged, so just dump MY version on top of the existing one”. Not only does this mean that you lose the existing changes, it also means that git will have no notion of these files ever changing! It always surfaces later as a question from another dev asking “hey, where did my changes go?” followed by “the logs don’t show any changes, but my code isn’t here!”. This is by far the worst problem we’ve had, and it has no solution other than educating each dev that “when doing merge commits, always commit all files, even if they’re not yours”.

All in all, we are very happy with git - and 4 months into the migration we can definitely say it has improved our workflow and increased our productivity.

Data Mining the Israeli Population Census

Fact - the Israeli population census database is freely available for download on the internet. Allow me to reiterate - the personal details of every Israeli citizen are up for grabs to anyone with an internet connection.

This database contains all the personal data on every citizen in Israel, and by law it should not be available to anyone outside of official government offices. This, however, is not the case. Since 2001, the census database has been repeatedly - and illegally - leaked, at least 5 different times. The first version was distributed under the name “Dvash”, while other versions were named “Rishumon” and “Agron”, after the software that allowed for categorized searches on the database. Searches could be run using various parameters, including ID number, full or partial name, date of birth, parents, partners, location, phone number and marital status - and any combination of those.

The Israeli government claims that the leaked data has been partial and incomplete. The evidence shows otherwise. All Israeli citizens can be found in the leaked databases, and this has been the case in every single version that has leaked over the years. The problem is clearly a systemic one.

The Israeli Police has investigated the issue several times, and although it is clear that the database keeps leaking, the investigations turned up no suspects. Furthermore, Israel has recently passed an extremely controversial bill allowing the government to create a national biometric database. The bill has been met with fierce public criticism, both about the bill itself and about the way it was passed - hidden from the public eye, with no public debate. If the state has failed until now to secure its citizens’ most intimate data, who’s to guarantee this fiasco won’t happen again?

Nevertheless, the currently available data has many interesting uses besides locating your high school crush. I’d like to outline some potential applications that can make use of this data - some of which I may, or may not, have implemented myself.

Statistics - the Israeli Central Bureau of Statistics publishes yearly data about the Israeli population. But certain esoteric queries that are not made public might be of interest:

  • name popularity by date of birth
  • ID number distribution (ID numbers are not distributed uniformly, and are usually segmented into ID blocks that are sequential inside a specific timeframe)
  • population density (mapping household addresses onto a geographic map)
  • verifying whether Benford’s Law applies to the database (see the sketch right below)
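
That last check is easy to run once the records are in a structured format. A minimal sketch, assuming the data has already been extracted into a CSV with a hypothetical id_number column:

import csv
import math
from collections import Counter

# Compare the leading-digit distribution of ID numbers against Benford's Law.
# The file name and the "id_number" column are hypothetical placeholders.
def benford_check(path):
    counts = Counter()
    with open(path) as f:
        for row in csv.DictReader(f):
            digit = row["id_number"].lstrip("0")[:1]   # first significant digit
            if digit.isdigit():
                counts[digit] += 1
    total = sum(counts.values())
    for d in "123456789":
        expected = math.log10(1 + 1 / int(d))          # Benford's predicted frequency
        actual = counts[d] / total if total else 0.0
        print(f"{d}: expected {expected:.3f}, actual {actual:.3f}")

benford_check("census_extract.csv")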

Diffing - As previously mentioned, the database has been leaked at least 5 times over the past decade. This opens up an interesting angle of investigation. Citizens are never removed from the database. Deceased persons are simply marked as such, but still exist in the database. Yet by diffing (comparing) different versions of the database, interesting anomalies do show up - entire families have been erased from the database. The reasons for these discrepancies are unknown and can only be speculated on. A likely explanation is that the deleted persons are families of covert operatives, possibly Mossad agents. This phenomenon has been dubbed The Redactor’s Dilemma, and in the context of Israel it can also be found in satellite imagery censorship (expect a separate post on that issue).
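
The diff itself is trivial once two versions have been extracted into a common format. A minimal sketch, assuming each version has been reduced to a plain text file with one ID number per line (the file names are hypothetical):

# List IDs that exist in an older leaked version but are missing from a newer
# one - something that should never happen if records are only ever added.
def load_ids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

older = load_ids("census_dump_old.txt")    # hypothetical file names
newer = load_ids("census_dump_new.txt")

disappeared = sorted(older - newer)
print(f"{len(disappeared)} records disappeared between versions")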

Social Graph - individual data is, generally, not as interesting as the big picture. Treating each citizen as a node, and given that parental ties are part of the database, edges can be drawn between a person and their parents - rendering the entire parenthood graph of the Israeli population. Very simple algorithms can find the shortest path between two nodes in a graph, so questions such as “How are X and Y most closely related?” can be answered within seconds (i.e. Six Degrees of Separation). Unfortunately, the Israeli population database is lacking in this aspect, for two reasons. First, Israel has existed for only 63 years, so anyone who was not a citizen of Israel during that period is simply not recorded in the database. Furthermore, the recording of parents’ ID numbers has not been consistent over the years. Current data shows that parental links can be traced back, at most, to grandparents. Thus, family links can only be traced as far as cousins (X -> parent -> grandparent -> uncle (parent’s sibling) -> cousin Y). Nevertheless, interesting data can still be extracted from the partial social graph.
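
The “how are X and Y related?” query is a plain breadth-first search. A minimal sketch, assuming the parental links have been extracted into a dict mapping each ID number to the parent IDs on record (a hypothetical structure):

from collections import defaultdict, deque

def shortest_relation(parents, start, goal):
    # parents: hypothetical dict like {"012345678": ["023456789", "034567890"], ...}
    # Build an undirected graph with child <-> parent edges.
    graph = defaultdict(set)
    for child, ids in parents.items():
        for p in ids:
            graph[child].add(p)
            graph[p].add(child)
    # Standard BFS, keeping the path taken to each node.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path                    # chain of IDs linking X to Y
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None                            # no known link in the partial graph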

It is important to note that the data, in its distributed form, is not readily available for these uses, since it is in a proprietary format. However, extracting this data for advanced uses is not difficult for a person with moderate technical knowledge.

Once the raw data is extracted, the options are endless. Anyone, with good or bad intentions, can use this data as they see fit. This is clearly an issue for the Israeli people. The leaked data cannot be recovered. We must make sure our databases are properly secured, and that only the minimal amount of citizen data is collected in the first place.

Names and addresses can always change. Biometric data is with us from the moment we are born to the moment we die. There is no room for mistakes.

Support the fight against the biometric database.

Migrating DNS From GoDaddy to Route 53

I just finished migrating all our DNS records from our current registrar, GoDaddy, to Amazon’s Route53 service. The immediate trigger was Amazon’s newly-added support for elastic load balancing with zone apex support - which effectively allows CNAME-like (alias) records on root domains (example.com). It also gives us better, more robust DNS serving, and the ability to programmatically update DNS records.

The process is relatively simple. The first thing to do is sign up for the Route53 service. It costs $1/month, plus $0.50 per 1M queries. Now, Route53 does not have an official GUI. Programmatically, I use the excellent boto library (Python). For a nice web-based GUI there are many options, but my favorite is Interstate53.

Whichever method you choose, the first step is to create a hosted zone for your domain with Route53. After that, simply start creating all the records for the domain.
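
With boto, both steps take just a few lines. A rough sketch (the domain and record values are placeholders, and the response parsing is worth double-checking against the boto docs):

import boto
from boto.route53.record import ResourceRecordSets

conn = boto.connect_route53()   # AWS credentials come from the environment/config

# Create the hosted zone for the domain (note the trailing dot).
response = conn.create_hosted_zone("example.com.")
zone_id = response["CreateHostedZoneResponse"]["HostedZone"]["Id"].replace("/hostedzone/", "")

# Add a simple A record pointing at a placeholder address.
changes = ResourceRecordSets(conn, zone_id)
record = changes.add_change("CREATE", "www.example.com.", "A", ttl=300)
record.add_value("203.0.113.10")
changes.commit()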

Once you have that down, it’s time to migrate the DNS nameservers. GoDaddy, as always, has a horrible way of doing things, but this article should help you through it. Make sure to remove all GoDaddy nameservers (*.domaincontrol.com), and add the 4 nameservers you got from Amazon.

That’s it. DNS propagation should take up to 48 hours. Make sure you keep all existing records until you’ve verified the new Route53 records are working properly.

Poor Man's VPN

Today I decided to check out Avery Pennarun’s sshuttle, or as he calls it - Poor Man’s VPN. (Thanks to Fedot for the re-discovery!)

It’s an amazing tool that runs fast and is completely transparent. Basically, what it does:

  • runs on the client (Mac or Linux supported)
  • connects via SSH to the target proxy server (an active SSH shell account and Python are the only prerequisites; you don’t even need root or admin access on the server)
  • uploads and installs the necessary dependencies on the server on demand (completely transparent to the user)
  • reconfigures the client’s network settings to route all (or some, if you like) traffic through the proxy
  • that’s it, cheap VPN!

All the heavy lifting of changing iptables, DNS and network settings is done for you, on the fly. It just works.

I highly recommend sshuttle for when you need occasional tunneling. Once you discover how simple it is, you’ll want to use it all the time.

GCJ 2011 - Round 1A Review

Round 1A of Google Code Jam went horribly. It was scheduled for 4:00-6:30 AM local time, and I was busy that weekend night. I barely got an hour of sleep before the round, and when I woke up I just couldn’t think straight.

Never mind that the problem set was very difficult - and that for the first two hours one of the problems had a confusing description - I couldn’t even work out the solution to the easiest problem.

I knew I couldn’t finish the round like this, and decided to retire to sleep just a little after 5AM. Fortunately, Round 1 still has two more sub-rounds I can qualify in, 1B and 1C. I’ll mentally prepare to kick ass in Round 1B, which should be much more comfortable in terms of timing.

Startup Workaway

TL;DR;

I probably should explain - at the end of this month I’ll be leaving for 10 days in beautiful Costa Rica as part of Startup Workaway along with a dozen (and a half) awesome hackers and founders. Can’t wait. Props to Nick Tommarello et al. for arranging this whole thing, it’s going to be awesome :)

GCJ 2011 - Qualification Round Review

The Google Code Jam 2011 qualification round went pretty well.

TL;DR;

But back to the beginning. I settled on my rooftop with a beer, and listened to some mood-setting music (LCD Soundsystem and !!!). The round started at 2AM here in Israel, and I didn’t plan on sleeping beforehand, so I assumed I would probably not get all the problems done in one sitting and would need a good night’s sleep in between the problem sets. And I was right.

So obviously I started off with the two easier problems: Bot Trust and Candy Splitting. Bot Trust was an easy simulation exercise. Somewhere midway I thought I had figured out a way to calculate the value directly, by choosing the minimum time from both bots’ routes without running the entire simulation, but it didn’t take the mutual exclusion rules into account. That wasted some precious time, so I went back to the simple simulation. The input set was small enough to run a naive algorithm on, so that was finished around the 40-minute mark.
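
In hindsight, the direct calculation does work if you track, per bot, when it finished its last press, plus when the previous button in the sequence was pressed. A rough sketch (not my contest code):

def bot_trust(instructions):
    pos = {"O": 1, "B": 1}     # current position of each bot
    free = {"O": 0, "B": 0}    # time each bot finished its last press
    last_press = 0             # time the previous button in the sequence was pressed
    for robot, target in instructions:
        arrive = free[robot] + abs(target - pos[robot])
        press = max(arrive, last_press) + 1    # can't press before the previous button
        pos[robot], free[robot], last_press = target, press, press
    return last_press

print(bot_trust([("O", 2), ("B", 1), ("B", 2), ("O", 4)]))   # sample case, prints 6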

Next up was the Candy Splitting problem. Recognizing that it is nothing more than an XOR-summing problem is all it takes. Running a greedy algorithm on the candy set (linear!) gives you the result. Pretty easy - I finished this problem in about 20 minutes.
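
The whole idea fits in a few lines - a sketch (not my contest code): if the XOR of all the candies isn’t zero there’s no fair split; otherwise put the single smallest candy in one pile, keep the rest, and the answer is the total minus the minimum.

from functools import reduce
from operator import xor

# Sketch of the Candy Splitting idea: a "fair" split in XOR terms exists only
# when the XOR of all values is 0, and then the best answer is the total sum
# minus the single smallest candy.
def candy_splitting(values):
    if reduce(xor, values) != 0:
        return "NO"
    return str(sum(values) - min(values))

print(candy_splitting([3, 5, 6]))   # sample case, prints 11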

After that, I couldn’t quite wrap my head around the Gorosort problem, so I left it for the next day and took a shot at the Magicka problem. It took me a while to figure out, but then I realized it’s simply a matter of iterating over a string, replacing characters and clearing the element list built so far when necessary. I got the algorithm down and implemented it, but failed the first small test. I inspected the algorithm and found nothing wrong. That got me frustrated, so I went to sleep, figuring I’d start fresh after several hours of rest.

The next day I went out to a cafe near my place and started working the problems again. I started off with the Magicka problem, figuring the late hour had caused me to miss something simple. And indeed, my only issue was that I was printing the output wrong - FACEPALM. I quickly fixed the output printing statements and, sure enough, problem solved.
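
For reference, the core loop is roughly this (a sketch, not my contest code; it assumes combine is a dict keyed on both orderings of each pair, e.g. {"QR": "I", "RQ": "I"}, and oppose is a set of frozensets like {frozenset("QF")}):

def invoke(spell, combine, oppose):
    elements = []
    for ch in spell:
        if elements and elements[-1] + ch in combine:
            # The last element and the new one fuse into a single element.
            elements[-1] = combine[elements[-1] + ch]
        elif any(frozenset((ch, e)) in oppose for e in elements):
            # The new element opposes something in the list: everything is lost.
            elements = []
        else:
            elements.append(ch)
    return "[" + ", ".join(elements) + "]"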

Last up: figuring out the Gorosort problem before the round ended. This one really took me a while. I started out with some observations, such as the fact that binary flipping is always the best option (flipping more than 2 indices in the array isn’t efficient in terms of probability), and that flipping mutual pairs is obviously best. But I didn’t quite nail it down. I knew how to find the bad indices, but couldn’t quite get the calculation right. Eventually it hit me: I was calculating the expected value completely wrong, taking each index flip and multiplying by two (for the 50-50 probability) instead of looking at the amortized cost!
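
With the amortized view, the code is almost embarrassingly small - it boils down to counting the misplaced elements (a sketch, not my contest code):

# Gorosort sketch: under the optimal strategy each out-of-place element costs
# one hit in expectation, so the answer is simply the misplaced-element count.
def gorosort(arr):
    return sum(1 for i, v in enumerate(arr, start=1) if v != i)

print("%.6f" % gorosort([2, 1, 4, 3]))   # prints 4.000000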

Bottom line - the qualification round was lots of fun! I have some observations - like the fact that using source control (git FTW!) is really helpful for working with diffs when changing strategies, and the fact that it’s important to have a handy utility that does all the file handling and miscellanea for you - but I’ll leave those for another post.

See you all in Round 1!

Feature-squash-rebase Flow

I have a new favorite flow in git. I call it feature-squash-rebase. It’s used for taking a feature branch and applying all its changes in one commit onto the master HEAD.

It goes something like this. First, branch out for the new feature:

git checkout -b feature

Hack away. Commit frequently. Once you’re done, return to master:

git checkout master

Now the magic begins. First, merge the feature branch onto master using the squash flag; this will meld all the commits into one giant commit:

git merge --squash feature

Now you can review the huge diff, and commit when ready:

git commit -m "new feature"

And once you’ve got that down, reapply your commit on top of the upstream master by pulling with the rebase flag:

git pull --rebase

That’s it. Push your changes to the rest of the team.

git push

This flow is not strictly necessary, but it keeps the history super-clean. I don’t use it for heavy features whose individual commits I want to keep track of. But for short stuff, this flow is a keeper.