Using Data Science To Name Your Baby

Kushal Chakrabarti

5 years ago

So, we just had a baby girl a few days ago! (Yay! Mom and baby are both healthy and happy.) I’m obviously excited to be a father for the first time, but… I figure at least 70% of that excitement is about cool projects I can do with her.

Project #1: What’s Going To Be Her Name?!

Turns out it’s harder than you’d think to name a (mixed-race) baby. Do you do a Jewish first name and Indian middle name? What about spellability and pronounceability? Go 100% American? Try finding something that’s profound in both Sanskrit and Hebrew? Oof.

One day, as my wife and I were sitting in Los Angeles trying to brainstorm names for what felt like the millionth time, I had an idea — couldn’t you use data science to do this? (!)

Well, obvs.

After a bit of negotiating, my wife and I agreed on a few criteria:

a name that sounded defensibly both Indian and Jewish
something that couldn’t be mispronounced at school
something short (11-letter last name and all)
…plus a few other things that were personal for us

That shouldn’t be too hard to prototype, right? I spent a day or two wrangling datasets, pandas and soundex. It ended up being at least a little helpful, so I spent another day making it publicly available.

NameBlender!

Without further ado, here’s NameBlender! [*] Wondering about that melanin-enhanced cutie across the room? Or that impossibly adorable redhead sitting next to you? Give it a whirl — see what your hypothetical future genetic collisions could be named!

[*] Given the prototype nature, there might be a few false positives. Specifically, it looks like there’s noise from mixed-race marriages that I haven’t had the chance to filter out. Be gentle.

Here are 100+ names for girls with Indian and Jewish roots that are 7 letters or fewer, sorted by American popularity, ethnic uniqueness and phonetic similarity:

Baby Lia!

And the ultimate demo? Here’s Lia, our new baby girl:

Methodology

(Skip this unless you’re a data nerd, or you’re looking for a late-night cure for your insomnia.)

First, I downloaded 90.2 million political contributions made from 2008 to 2014 from Stanford’s DIME dataset and scraped Family Education’s list of surnames by ethnicity. After removing corporations and deduplicating donors with multiple contributions, I classified 13.5 million unique donors by ethnicity using uniquely ethnic surnames. Finally, after extracting first names that appeared at least three times (N≥3) in the dataset, I used fuzzy soundex and metaphone to cluster together names that sounded similar in different languages. All of this was written with Python, pandas and pyphonetics, and served using Flask.
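
For the curious, here is a rough sketch of what that pipeline might look like in miniature. It is not the actual NameBlender code: the surname lookup, the donor records and the function names are all made up for illustration, and a hand-rolled classic Soundex stands in for the fuzzy soundex and metaphone from pyphonetics, so it runs on plain Python with no dependencies.

```python
from collections import Counter, defaultdict

# Hypothetical surname lookup; the real one was scraped and far larger.
SURNAME_ETHNICITY = {"patel": "Indian", "sharma": "Indian",
                     "cohen": "Jewish", "goldberg": "Jewish"}

# Made-up (first name, surname) records, one per unique donor.
DONORS = (
    [("Lia", "Patel")] * 2 + [("Lia", "Sharma")]
    + [("Leah", "Cohen")] * 3
    + [("Priya", "Sharma")] * 3
    + [("Tamar", "Goldberg")]   # appears once, so it falls below N >= 3
    + [("Leah", "Smith")]       # surname not in the lookup, so it is skipped
)

# Classic Soundex digit groups (a simple stand-in for fuzzy soundex).
_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(name):
    """Return the 4-character classic Soundex key for a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def blended_names(donors, surname_ethnicity, a, b, min_count=3):
    """Group frequent first names by sound; keep groups seen in both a and b."""
    counts = Counter()
    for first, last in donors:
        ethnicity = surname_ethnicity.get(last.lower())
        if ethnicity:
            counts[(first, ethnicity)] += 1

    clusters = defaultdict(lambda: defaultdict(set))
    for (first, ethnicity), n in counts.items():
        if n >= min_count:
            clusters[soundex(first)][ethnicity].add(first)

    return {key: sorted(groups[a] | groups[b])
            for key, groups in clusters.items()
            if groups.get(a) and groups.get(b)}


print(blended_names(DONORS, SURNAME_ETHNICITY, "Indian", "Jewish"))
```

On this toy data, Lia (seen among the Indian surnames) and Leah (seen among the Jewish ones) land in the same sound group, while Priya shows up on only one side and is dropped. Classic Soundex is much cruder than the fuzzy variants, so expect it to over-merge short names on real data.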