Posts Tagged ‘Doing the Math’

May 5, 2013

Possibly Profitable

Now that I’ve had ads up on my webapps page for a few days, I thought it was time to analyze the numbers. I assumed the amount of revenue earned each day follows a normal distribution. (Actually, this is an oversimplification that makes the math easier. I’m sure there are weekly patterns at a minimum, and possibly seasonal factors as well. And really, it should probably be a mixture of Gaussians, one for each component – revenue from views and revenue from clicks, with views as a hyperparameter, but I digress…)

My model shows I have a 55% probability of earning at least $100 in a year, with $118 being the expected value.
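For readers who want to play with the idea, here is a minimal sketch of the "normal daily revenue" assumption in Python. The daily mean and standard deviation are made-up inputs, and the sketch assumes days are independent, so it will not reproduce the 55% and $118 figures above; it only shows how a per-day normal model turns into a per-year probability.

```python
from statistics import NormalDist

def yearly_revenue(daily_mean, daily_sd, days=365):
    """Distribution of total revenue over `days` independent, identically
    distributed normal days: the means add, and so do the variances."""
    return NormalDist(mu=days * daily_mean, sigma=daily_sd * days ** 0.5)

# Hypothetical daily figures, for illustration only (not the real ad data).
year = yearly_revenue(daily_mean=0.32, daily_sd=0.50)
prob_at_least_100 = 1 - year.cdf(100)

print(f"expected yearly revenue: ${year.mean:.2f}")
print(f"P(yearly revenue >= $100): {prob_at_least_100:.1%}")
```

Uncertainty about the daily mean itself (only a few days of data) is ignored here, which is one reason a real estimate would be much less confident.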

Not bad for two scripts I basically wrote in my spare time. But not great either. If it was an order of magnitude higher, I’d have great confidence that I could start my own business. With time and energy I could probably turn my webapps into something substantial. On the other hand, if it was an order of magnitude smaller this path wouldn’t seem viable at all.

Of course, it would be possible for me to earn more if I cut out the middleman. Google takes a cut of 32%. Thus if I’m earning $118 a year, Google’s cut is $56, and the total revenue generated from ads on my webapps is $174. This tells me I could charge $14/mo for ad space. Of course, then I’d have to find advertisers and convince them that my webapps are worth it. No, the cut Google takes is worth it.
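The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch; the function name is mine, and the 32% cut and $118 are the numbers quoted above.

```python
def gross_from_net(net, platform_cut):
    """Recover total ad spend from the publisher's share of it."""
    return net / (1 - platform_cut)

net_yearly = 118.0   # expected yearly earnings quoted in the post
google_cut = 0.32    # Google's share quoted in the post

gross_yearly = gross_from_net(net_yearly, google_cut)
google_share = gross_yearly - net_yearly
monthly_rate = gross_yearly / 12

print(round(gross_yearly), round(google_share), round(monthly_rate))
# prints: 174 56 14
```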

On the job front, I did find a local data scientist job to apply for. It’s not the same data I’m used to, but I’d love to branch out. Domingo and I have often joked how at home I’d be in a “Black Friday war room”, analyzing real time shopping data. So why not consider something new?

April 20, 2013

Analyzing Baby Sleep

A little over 2 months ago I decided to start tracking Nicki’s sleep. At the time she wasn’t sleeping very well and I wanted to have a dataset I could analyze.

Histogram of the number of hours Nicki spends sleeping.
It may appear to be a left-skewed distribution, but that’s because I was using a suboptimal bedtime during the initial few weeks of my study. Without those weeks her histogram shows a normal distribution with a mean between 11:30 and 12:00 (hours:minutes of sleep).

For this analysis I mostly looked at correlation. Correlation shows the statistical relationship between two sets of numbers. It ranges from -1 to 1. Negative correlation [-1,0) shows two variables are inversely related. As one increases, the other decreases. Positive correlation shows two variables tend to increase or decrease together. The closer to 0, the weaker the correlation.
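As a rough illustration of how such a number is computed, here is a small Pearson correlation function in Python. The bedtimes and sleep durations are invented for the example (they are not Nicki’s data), and Pearson’s r is an assumption on my part about which correlation measure is meant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return cov / sqrt(var_x * var_y)

# Made-up nights: bedtime in hours after noon, and hours slept.
bedtimes = [7.0, 7.25, 7.5, 7.0, 7.75, 7.25]
hours_asleep = [12.0, 11.5, 11.0, 11.75, 10.5, 11.75]

r = pearson(bedtimes, hours_asleep)
print(f"correlation = {r:.2f}")  # negative: later bedtime, less sleep
```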

Correlation(Time put down, Time spent asleep) = -.72
When I put her to bed earlier, she tends to sleep longer.

The time I put Nicki down for bed is correlated with how long she sleeps – earlier bed times mean more sleeping! That makes intuitive sense. My circadian rhythm wakes me up at certain points, provided I’ve slept a decent amount. I’m now in the habit of waking up at 7:00 am, regardless of what time Nicki wakes up. (Mommy misses sleeping in until noon on the weekend.) Nicki could be the same way. Earlier bedtimes mean there are more hours between when she goes down and when she typically gets up, which could correspond to longer sleep intervals.

Every baby book I own says “Early to Bed, Late to Rise”. In other words, put the baby to sleep early and she will sleep in longer. What do my numbers show?

Correlation(Time put down, Time Woken Up) = -.21
When I put her to bed earlier, she tends to wake up later.

So yes, she does tend to sleep in longer on days she goes down earlier, but it’s a weak correlation. It could be that the relationship is weak, or that there are other factors at play. One possible factor is daylight saving time. Specifically, the position of the sun. We’re in the middle of spring, sunrise is getting earlier and Nicki tends to wake up around sunrise. If I take a weekly average of her wake up time, I see it inching forward for the first three weeks.

Another aspect of sleep I care about is how long it takes her to fall asleep. The books all say overtired babies have a harder time falling asleep. Was it true for Nicki?

Correlation(Time put down, Number of Minutes needed to fall asleep) = 0.4
When I put her to bed earlier, she takes less time to fall asleep.

My analysis shows that, at least for Nicki, earlier bed times lead to better sleeping.

Still asleep after sunrise. Love the bear on her butt!

Of course, correlation does not imply causation. There could be other factors at play. Our bed time is between 7:00 and 7:30. On days she’s extra tired, she goes down a little earlier. On days she’s less sleepy, bed time is closer to 7:30. A tired baby is more likely to fall asleep quickly and sleep longer.

If I were to do a true study I’d have to randomize her bedtime. That means some nights putting a wide awake baby down, and some nights trying to keep a tired baby awake. I may love data, but even I’m not that crazy. Still, it’s neat to see Nicki’s sleep numbers.

February 21, 2013

Writing Sample Analyzer FAQ

In preparation for my re-entering the work force (shameless self plug – I’m job seeking!) I closed down some of my old legacy sites that I’ve been keeping around for posterity. One of which was my consulting site, Even though I haven’t updated it in years, the tools were still in use, especially the Writing Sample Readability Analyzer. So I moved it to my resume website,

Since I’m getting a ton of emails about it lately, I thought I’d answer some of the frequently asked questions.

How does your analyzer work?

The analyzer uses the Flesch, Fog and Flesch-Kincaid metrics to predict reading ease. Each approximation makes the basic assumption that longer sentences are harder to read than shorter sentences, and words with more syllables are harder to read than words with fewer syllables. Although the underlying principle is the same, each metric is calculated slightly differently.

FleschScore = 206.835 − (1.015 × AverageSentenceLength) − (84.6 × AverageNumSyllablesPerWord)

FogScale = 0.4 × (AverageSentenceLength + PercentageOfWordsWithThreeOrMoreSyllables)

FleschKincaidScore = (0.39 × AverageSentenceLength) + (11.8 × AverageNumSyllablesPerWord) − 15.59
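The three formulas translate directly into code. The sketch below is not the analyzer’s actual source; the function and parameter names are mine, and the Flesch-Kincaid sentence-length coefficient is taken as 0.39, the commonly published value.

```python
def flesch_reading_ease(avg_sentence_length, avg_syllables_per_word):
    """Flesch score: higher means easier to read."""
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

def fog_scale(avg_sentence_length, pct_words_three_plus_syllables):
    """Gunning Fog: roughly the years of schooling needed."""
    return 0.4 * (avg_sentence_length + pct_words_three_plus_syllables)

def flesch_kincaid_grade(avg_sentence_length, avg_syllables_per_word):
    """Flesch-Kincaid: approximate US grade level (0.39 is the commonly
    published coefficient for sentence length)."""
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59

# A made-up sample: 15 words per sentence, 1.5 syllables per word,
# 10% of words with three or more syllables.
print(round(flesch_reading_ease(15, 1.5), 2))   # 64.71
print(round(fog_scale(15, 10), 2))              # 10.0
print(round(flesch_kincaid_grade(15, 1.5), 2))  # 7.96
```

All three take the same kind of inputs, which is why two analyzers that count sentences or syllables differently will disagree on every score at once.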

I’ve used your analyzer and another analyzer and gotten different results, why is that?

The Flesch, Fog, and Flesch-Kincaid metrics are well defined. If another system is reporting a different score for the same metric, then the input variables (either the number of sentences, or the number of syllables per word) must be calculated differently.

Calculating sentence boundaries is surprisingly not as straightforward as it seems. As humans, we can identify when a sentence ends pretty easily. Since the computer can’t really parse or understand the sentence**, it can only make an educated guess based on clues like punctuation and capitalization. But not all punctuation (think abbreviations) ends a sentence, and not all sentences end with punctuation. This is especially true online, where sentences are often not well formed.

The same goes for computing the number of syllables in a word. It may seem simple to just create a list of syllables per word, but language is infinite and constantly evolving. Such a list is not possible. Pronunciation (and the number of syllables) can differ in different parts of the world. Additionally, heteronyms, words that have the same spelling but different pronunciations, can have different numbers of syllables. The word ‘learned’ as the past tense of the verb ‘to learn’ is one syllable, but ‘learned’ the adjective to describe someone with scholastic achievement is two. Any method to calculate the number of syllables per word will involve some heuristics.
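To make the "educated guess" concrete, here is a deliberately naive pair of heuristics in Python. This is an illustrative sketch, not the analyzer’s code: the abbreviation list, the capitalization rule and the silent-e rule are all assumptions chosen for brevity, and each one is wrong for plenty of real text.

```python
import re

# A tiny, hypothetical abbreviation list; a real one would be far longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Guess sentence boundaries from punctuation and capitalization."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        ends_with_stop = token[-1] in ".!?"
        is_abbreviation = token.lower() in ABBREVIATIONS
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        next_is_capitalized = next_token is None or next_token[0].isupper()
        if ends_with_stop and not is_abbreviation and next_is_capitalized:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text with no closing punctuation
        sentences.append(" ".join(current))
    return sentences

def count_syllables(word):
    """Estimate syllables by counting vowel groups, dropping a silent final e."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

print(split_sentences("Dr. Smith went home. She slept."))
print(count_syllables("kiln"), count_syllables("together"), count_syllables("learned"))
```

Note that `count_syllables("learned")` can only return one answer, so it is necessarily wrong for one of the two pronunciations.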

The differences in calculating sentence length and the number of syllables will tend to be more noticeable on shorter samples, rather than longer. Even so, while there may be differences between different analyzers, the differences should be relatively small.

** There is an active area of research in natural language processing which tries to automatically parse and understand sentences.

Which analyzer is the most ‘accurate’?

There are two types of ‘accurate’ we can consider: which analyzer comes closer to the true Flesch, Fog and Flesch-Kincaid metrics, and which one better predicts reading ease. Keep in mind that each metric is just a heuristic based on an assumption that is often true, but not always. For example ‘kiln’ is one syllable, but harder than the simple, three syllable word ‘together’. Depending on the kind of text you are analyzing, you may find one method or score works better for your application than another.

Let’s consider two analyzers, one with very good sentence boundary detection, Analyzer A, and one with a very good syllable per word calculator, Analyzer B. If you were analyzing writing samples from elementary school children, you may prefer A. That’s because young children may not write grammatically correct sentences and typically don’t have a rich vocabulary, so a more complex syllable per word calculator wouldn’t buy you much whereas a better sentence boundary detector may be necessary. On the other hand, if you were analyzing scientific journal articles, you may prefer B.

My suggestion is to use both analyzers to get a feel of which one is better for you and your task.

Will you share the code?

I have in the past, but only for extra special cases.

February 8, 2013

Mom-tographer by the Numbers

I used to hate it when someone told me how much Nicki’s changed or how big she’s gotten. The words feel like a dagger to my mom-tographer heart, which never feels like I’ve taken enough photos of Nicki. Yes, I take a ton of photos, but they’re typically the same photo with slightly different angles and I rely on my iPhone way too much. So a week or so ago I decided to sit down and finally organize the photos I have, to see if my fears were justified. (Aside: nothing makes a mom-tographer panic like missing baby photos. I couldn’t find our Thanksgiving day photos for about 2 hours. Not fun.)

Formal Photos:

I had been thinking I only use my DSLR for formal photos. The good news is that’s not really true. I loosely define formal photos as ones that I do against a backdrop or that involve some amount of setup other than moving clutter out of view (e.g. the newborn photos, the Christmas lights photo and the Halloween photo). I took 3532 non-formal and 5219 formal photos.

What about Duplicates?

Well, 2084 of those photos were newborn photos. About 3/5ths of those are sleeping baby photos and there’s only so many different sleeping baby photos one can take. There were 392 Nicki and Phia photos. In my defense, I wanted a canvas print, and I wanted to be sure I had a photo that would work. Canvas prints require a lot of white space and I wanted a large high-res print that was pushing the boundaries of my camera’s image size, so I needed to get the framing right. Cropping wouldn’t work.

I’ve also been working on Nicki’s baby book. Of the 1191 photos I’ve taken just for the book, I plan to use 6. That means I take an average of 199 photos in order to get ONE for the book. Again, I’m obsessed with perfection. In my defense, some of those photos were a lot harder than others, and some required many iterations to get something passable.

Informal Photos:

I take monthly photos of her to track her growth, and I’m pretty good about taking holiday photos. But how do I do on an everyday basis? For this analysis I’m only including non-holiday and non-event photos (an event being something like the day she was born).

The number of informal photos I’ve taken of Nicki each month.

My initial thinking was that I wasn’t taking enough photos of Nicki when she was 2 and 3 months old, but as we can see, that’s not the case. I did pretty well for months 1 & 2 when I was on “Maternity Leave” from grad school over the summer. Month 3 was also pretty decent at 195 photos. For months 4 and 5, though, I took just 45 and 25 photos of Nicki the entire month. Most of those photos are also duplicates, so in month 5 I effectively only have 2 photos! On the other hand, Month 5 was our trip back east, and I took 406 informal photos at Christmas time. The ironic thing: I found two different sets of Monthlies for five months. Maybe subconsciously I knew I wasn’t taking enough photos and forgot which ones I had already done?

The bump at month 6 is from the 52 week project. Looks like I just needed some motivation.


The numbers work out to 17 non-formal and 25 formal photos a day from my DSLR. Whenever I feel like I’m not taking enough, I should remind myself of those numbers. Yes, they’re not all perfect, and yes there are duplicates, but that is still quite a bit. Do I wish there were more? Sure. But I don’t think having more would actually alleviate that desire. You can never have too many baby photos in my book!

Now that I’ve organized the photos a bit better, I can more easily go back and see what photos I do have. In the process of organizing my photos I also discovered some I had forgotten about.


I always get so close to focus on the face, but sometimes it’s nice to step back and show scale. I can’t believe how tiny she was! She’s filling the rock n’ play these days!

My biggest wish, however, is that I could go back to the hospital. I took 54 photos of Nicki in the hospital. This is one time when I wish I had more duplicates!

In my seemingly never ending quest to learn more about the business of blogging, I’m constantly diving back into the numbers. Of late I’ve been curious about the stickiness factor of blogs. What makes one blog memorable while another one with similar content is just so-so? And (since my blog is the only blog I have data for) how ‘sticky’ is my blog?

Bounce Rate

I started with the most basic of statistics, the bounce rate for last year. Overall I had a 78% bounce rate, meaning 78% of my visitors do not read a second page (at least during that visit) within my blog. Breaking it down further I see that Craft Projects and Photography had the lowest bounce rates (68% and 69% respectively), whereas Shopping and Family Life had the highest. This isn’t really surprising. Most visitors to my blog appear to be looking for answers to questions or general information, and not specifically for me. A random surfer to my blog will likely care more about my general info posts than the personal life.
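For anyone who wants to compute the same breakdown from raw visit logs, a minimal Python sketch follows. The `(category, pages_viewed)` session format is hypothetical; an analytics package reports these numbers directly, so this only illustrates the definition.

```python
from collections import defaultdict

def bounce_rates(sessions):
    """Return {category: share of visits that viewed exactly one page}.

    `sessions` is an iterable of (landing_category, pages_viewed) pairs.
    """
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for category, pages_viewed in sessions:
        totals[category] += 1
        if pages_viewed == 1:
            bounces[category] += 1
    return {category: bounces[category] / totals[category] for category in totals}

# Made-up visits, for illustration only.
visits = [("Craft Projects", 1), ("Craft Projects", 3),
          ("Shopping", 1), ("Shopping", 1)]
print(bounce_rates(visits))
# {'Craft Projects': 0.5, 'Shopping': 1.0}
```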

It looks like I also have a bit of a keyword/query mismatch problem. Looking at my traffic data, I see the two most popular ‘Family Life’ posts are ‘List Overload’ and ‘Still a Girl’. Most visitors to ‘List Overload’ are looking for the city mini 2013, which the post mentions I was interested in buying, but isn’t what the post is about. For ‘Still a Girl’, visitors are interested in 4d ultrasound pictures. Since that’s not what either post is about, I certainly can’t fault visitors for leaving!

In terms of tags, Maternity Photography had the lowest bounce rate at 40%! Alas, Newborn Photography’s bounce rate was pathetic at 96%. So I guess maybe you, anonymous reader, don’t like all of my photography? I was surprised to see that Consumer Research and Baby Gear also do well (bounce rates of 57% and 65% respectively) despite Shopping being one of the worst categories.

The biggest killer of my bounce rate in the shopping category is Hallmark on a budget, a post on my Hallmark shopping strategies. It’s one of my most popular shopping posts. I got a huge bump in traffic for it on Christmas day with people looking for after Christmas sales. The problem? The post was written in August 2011 with no details on where to find this year’s sales. The keyword mismatch problem strikes again.

New vs Repeat Visitors

Bounce rate can be a little misleading. When I visit a blog I follow I will read the top one or two posts on the main page that I haven’t seen before. Once I see a post I’ve read before, I stop and go elsewhere. If the blog displays the full content for those few posts on the index page, I’ve effectively bounced. But the bounce doesn’t mean I’m not a loyal reader. I keep coming back, I keep reading the latest posts and bouncing. Thus, to explore my blog’s stickiness, I also looked at new vs repeat visitors, not just the bounce rates.

Overall, only 11% of my hits are from repeat visitors. Bummer. But here’s where I start to get good news: for my blog’s main page 37% of all visitations are from repeat visitors! Maybe I do have some readers after all?

But wait, there’s more interesting news! Posts with the Doing the Math tag (which I’ve always thought of as my favorite to write, and my least popular) had the best percentage of repeat visitors for any tag, at 10%! And, ironically, when I noticed that I immediately thought ‘must be a math mistake’. More likely, though, it has to do with how I handle those posts. Since those are the posts I love, they’re also the ones I tend to tweet to my friends on Twitter.

Becoming Sticky

As a scientist it’s really tempting to tweak the variables and see what happens. If I post more maternity photos (Hah, I wish!) how will that affect my traffic?

I’ve always been a firm believer that the only real Search Engine Optimization strategy is to have good content. I have a similar philosophy with blogging. I should stick to the content I enjoy writing about, and not worry about tailoring it to the queries that bring visitors to my blog. When a visitor stumbles on to my blog by means of a query that doesn’t really match the blog content, he or she tends to bounce. There’s nothing I can do about the query mismatch, but I can strive for better content.

November 29, 2012

Uniqueness of Baby Names

BabyCenter released its list of the top 100 baby boy and girl names for 2012. What’s not on the list? Nicole. For most of the eighties Nicole was one of the top ten, but lately it’s been on the decline.

Of my friends who gave birth in the past year, three strove for unique names. I thought they succeeded. If I had to guess, all three picked names less common than ‘Nicole’. In reality all three of those names made the top 100 list. In fact, two of those friends, who don’t even know each other, ended up picking the exact same name (though one used it as a middle name).

This got me thinking. How unique is unique these days?

BabyCenter’s list is generated from their members, which isn’t necessarily a representative sample of all babies. To get a broader perspective I turned to US census data. Here’s what I found.

What does the distribution of names look like?

We all know the popularity of given names changes all the time. In 1995 the most popular girl’s name was ‘Jessica’, but in 2011 ‘Jessica’ was ranked 120th. I wondered if parents looking for rare names may end up like my two friends, and settle on the same “rare” name. Maybe ‘Kendall’ (currently just three ranks below ‘Jessica’, and six below ‘Nicole’) will become the new ‘Emma’.

I looked at a variety of years, but ultimately decided to compare 2011 to 1995 as they have a similar birth rate and 2012 data isn’t available yet.

There are more names to choose from

The US census data only lists names that were given to at least 5 babies. Using the birth rate data I can extrapolate that approximately 7% of babies in 1995 and 8% of babies in 2011 had a name given to fewer than 5 babies. So in addition to the two years having approximately the same birth rate, the US census lists for both years represent approximately the same number of babies.

Here’s the interesting thing: In 2011 there were approximately 34 thousand unique names on the US census list whereas there were only 26 thousand unique names on the list for 1995 – for approximately the same number of babies!

The ultra-common names are declining the most

In 1995, 44% of girls and 58% of boys (or roughly 1 in 2 babies) were given a name that made the top 100 list for that year. For 2011, only 31% of girls and 44% of boys (a little less than 2 in 5 babies) had a name that made the top 100. Thus the use of a name in the top 100 is declining.

Yet, most parents still pick names that are in the top 1000. Despite there being over 19 thousand different girl names and 14 thousand different boy names given to babies in 2011, 67% of girls and 79% of boys (or roughly 7 in 10 babies) had a name in the top 1000. If we subtract out the top 100 baby names, we find 3 in 10 babies had a name ranked 101-1000. That’s the same rate as in 1995!
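These shares are simple to recompute once you have a table of name counts and the total number of births. The Python sketch below assumes a hypothetical `{name: count}` dictionary for one sex and one year, plus a separately sourced birth total; the tiny example data is invented.

```python
def top_n_share(name_counts, total_births, n):
    """Share of all babies whose name is among the n most common names."""
    top_counts = sorted(name_counts.values(), reverse=True)[:n]
    return sum(top_counts) / total_births

def unlisted_share(name_counts, total_births):
    """Share of babies whose name is missing from the list
    (for the real data: names given to fewer than 5 babies)."""
    return 1 - sum(name_counts.values()) / total_births

# Invented counts: 100 births, 90 of them with a listed name.
counts = {"Emma": 50, "Ava": 30, "Zed": 10}
print(top_n_share(counts, 100, 2))   # 0.8
print(unlisted_share(counts, 100))   # about 0.1
```

Subtracting `top_n_share(..., 100)` from `top_n_share(..., 1000)` gives the share of babies with names ranked 101-1000.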

What about Name Collisions?

Parents often state they are striving for unique names out of a desire for their child to be the only child with a given name in school. In other words, they want to avoid a name collision.

How common are name collisions?

Name collisions have been on the decline. Using a Monte Carlo simulation I was able to compute the probability of a name collision for a group of babies. Thirty years ago a group of 40 babies would be 97 to 99% likely to have a name collision. In 1995, there was a 79.9% probability of a name collision in a group of 40 babies. In 2011, there was only a 56% probability that at least two babies in a group of 40 would have the same name.

The probability of name collisions has actually been on the decline since 1990. (Coincidentally, the World Wide Web was created in the ’90s.)
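A Monte Carlo estimate of this kind can be sketched in a few lines of Python. This is my reconstruction of the general technique, not the original simulation: it assumes a `{name: count}` dictionary for one birth year, draws each baby’s name independently in proportion to those counts, and ignores names too rare to be listed.

```python
import random

def simulate_collision_probability(name_counts, group_size, trials=20000, seed=0):
    """Estimate P(at least two babies in a random group share a name)."""
    rng = random.Random(seed)
    names = list(name_counts)
    weights = list(name_counts.values())
    collisions = 0
    for _ in range(trials):
        group = rng.choices(names, weights=weights, k=group_size)
        if len(set(group)) < group_size:  # some name appeared twice
            collisions += 1
    return collisions / trials

# Sanity check with 365 equally likely "names": the classic birthday
# problem, where a group of 23 collides a little over half the time.
birthdays = {str(day): 1 for day in range(365)}
print(simulate_collision_probability(birthdays, 23))
```

Swapping `birthdays` for real name counts and `23` for `40` gives the kind of figure quoted above, subject to the assumptions just listed.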

Picking a name with low probability of collision

This is actually fairly easy to calculate. Let’s use the name Kaitlyn, the 100th most popular girl’s name for 2011, as an example.

In 2011, there were 2893 Kaitlyns born (roughly 0.15% of all baby girls born that year). Let’s say Kaitlyn is going to go to school with just one other girl also born in 2011. If we pick that girl at random, she has a 0.15% chance of also being named Kaitlyn and a 99.85% probability of having a different name. Now let’s say Kaitlyn is going to go to school with two other girls. Each of those girls has a probability of 99.85% of not being named Kaitlyn. The probability of a name collision is one minus the probability of neither girl having the name Kaitlyn. Mathematically that’s expressed as 1 − (1 − 0.0015) × (1 − 0.0015), or 0.3%. The general formula is:

p(name_collision) = 1 − (1 − popularity_of_name)^number_of_students

According to the National Center for Education Statistics the average elementary school has 482 students. According to the US census data, 48.8% of babies born in 2011 are girls. That would mean there are 234 other girls in addition to our Kaitlyn. We compute the probability of a name collision as follows.

p(name_collision) = 1 − (1 − 0.0015)^234
p(name_collision) ≈ 29.7%

Thus there is only a 29.7% probability that our Kaitlyn will go to a school with another Kaitlyn.

You would have to pick the 38th most popular girl’s name (Anna) before there’s a 50/50 chance of another child at the same elementary school having the same name. For boys you’d need to pick the 72nd most popular name (Ian) to have a 50/50 chance of a name collision. What if you pick a name below the top 1000? Then there’s only a 3% chance of a name collision for girls and a 2.3% chance for boys!
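The collision formula, and its inverse for finding how popular a name must be to hit a given collision chance, can be written as a short Python sketch. The function names are mine; popularity is expressed as a fraction (0.15% is 0.0015), and the rounded 0.15% gives a result a hair under the 29.7% quoted above.

```python
def name_collision_probability(popularity, classmates):
    """Chance that at least one of `classmates` children shares the name."""
    return 1 - (1 - popularity) ** classmates

def popularity_for_probability(target, classmates):
    """Name popularity at which the collision chance equals `target`."""
    return 1 - (1 - target) ** (1 / classmates)

kaitlyn = 0.0015  # roughly 0.15% of girls born in 2011

print(f"{name_collision_probability(kaitlyn, 2):.2%}")    # two classmates
print(f"{name_collision_probability(kaitlyn, 234):.1%}")  # a whole school
print(f"{popularity_for_probability(0.5, 234):.3%}")      # 50/50 threshold
```

The last line says a name needs to be given to roughly 0.3% of girls before a 234-girl school is more likely than not to contain a second one.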

Of course, this analysis is only considering a name collision with another child having the same spelling of the name. There could also be a Caitlyn, Katelynn, etc. According to NameNerds, there were an additional 6938 girls named with a spelling variant of Kaitlyn in 2011. Including these spelling variants, the chance of a name collision increases to 70%.

So what about spelling variants?

Using the new counts from NameNerds I found the following names are likely to have a name collision at an average-sized elementary school. The probability of the name collision is in parentheses.

For Girls: Sophia (97.2%), Isabella (94.1%), Olivia (90.9%), Emma (90.1%), Chloe (88.0%), Emily (87.8%), Ava (86.0%), Abigail (84.6%), Madison (84.5%), Kaylee (81.1%), Zoey (81.0%), Mia (78.9%), Madelyn (78.5%), Addison (78.2%), Hailey (78.1%), Lily (77.3%), Aubrey (76.0%), Riley (75.6%), Aaliyah (74.9%), Layla (74.7%), Natalie (74.3%), Arianna (73.6%), Elizabeth (72.6%), Brooklyn (71.0%), Kaitlyn (69.9%), Ella (69.4%), Makayla (68.6%), Allison (68.1%), Mackenzie (67.4%), Peyton (67.2%), Kylie (67.2%), Brianna (66.3%), Lillian (65.4%), Avery (65.1%), Leah (64.4%), Maya (63.2%), Alyssa (62.8%), Amelia (62.8%), Gabriella (62.4%), Sarah (62.3%), Katherine (62.0%), Evelyn (61.8%), Jocelyn (61.7%), Grace (60.6%), Hannah (60.0%), Jasmine (59.8%), Samantha (59.4%), Alaina (59.3%), Anna (57.8%), Nevaeh (57.6%), Victoria (57.5%), Alexis (57.0%), Camila (56.3%), Savannah (56.1%), Charlotte (54.7%), Liliana (52.9%), Ashley (52.6%), Isabelle (52.0%), Kaelyn (51.4%), Lyla (51.3%), and Kayla (50.4%)

For Boys: Aiden (97.5%), Jayden (95.6%), Jacob (92.8%), Jackson (92.3%), Mason (91.9%), Kayden (89.3%), Michael (88.5%), William (87.7%), Ethan (87.5%), Noah (87.2%), Alexander (86.5%), Daniel (84.6%), Elijah (83.6%), Matthew (83.5%), Anthony (83.0%), Christopher (82.5%), Caleb (81.4%), Joshua (81.2%), Liam (80.9%), Brayden (80.2%), James (80.1%), Andrew (80.1%), David (79.9%), Benjamin (79.8%), Joseph (79.7%), Logan (79.7%), Christian (79.7%), Jonathan (78.4%), Gabriel (78.1%), Landon (77.7%), Nicholas (77.0%), Lucas (76.4%), Ryan (76.3%), John (74.9%), Samuel (74.8%), Dylan (74.7%), Isaac (74.1%), Cameron (74.0%), Nathan (73.0%), Connor (72.5%), Isaiah (71.1%), Gavin (68.5%), Carter (67.8%), Jordan (67.1%), Tyler (66.1%), Evan (65.6%), Luke (65.5%), Owen (63.9%), Aaron (63.8%), Julian (63.6%), Jeremiah (63.5%), Brandon (63.4%), Zachary (63.4%), Jack (63.0%), Colton (61.5%), Adrian (61.5%), Wyatt (61.0%), Dominic (60.3%), Angel (60.1%), Eli (59.6%), Austin (59.2%), Hunter (58.9%), Justin (58.5%), Henry (58.4%), Jason (58.2%), Robert (56.9%), Charles (56.9%), Sebastian (56.6%), Thomas (56.6%), Brian (56.4%), Eric (56.3%), Tristan (56.1%), Jose (56.0%), Kevin (55.8%), Chase (55.7%), Levi (55.6%), Josiah (54.2%), Bentley (54.1%), Grayson (54.0%), Giovanni (53.8%), Carson (53.5%), Xavier (52.8%), Ian (51.7%), Jace (51.5%) and Brody (50.0%)

Some obvious ones on the list, but there are definitely some surprises, including two of my three friends’ picks! In 2011, 33% of girls were named a variant of these 61 girls’ names and 46% of boys were named a variant of these boys’ names.


My intuition of what constitutes a common baby name was clearly off. I fell into the trap of thinking names that were common when I was young are still common, and names that were rare are still rare. Even if you made the same mistake I did, the good news is it’s less likely to have a name collision today than 17 years ago.

One last thought. Throughout this analysis there was an implicit assumption that names were rarer because parents were deliberately choosing rarer names. But there are other possible explanations. With an increase in globalization prospective parents get exposed to new names. In the past two years I’ve worked with more Nikhils than Johns, and more Yis than Matts. I know people who picked exotic names for their children purely because they loved the sound of the name, and not because of any ethnicity or ancestry reasons. Perhaps this trend toward uniqueness was inevitable, whether intentional or not.

Update 7/17/15: I’m pleased to announce an interactive web app based on this post. Now you can look up the uniqueness of any name for any year after 1950, see how a name is trending, and what are the odds of meeting another person with the same name!

Side Note: I know talking about Money is typically taboo, but I hope you’ll forgive me anyway. Since we’re talking about really small sums of money (under $10 total) I figure it’s probably not too offensive a topic. I also personally find the topic of blogging income fascinating, and I’m sure someone out there does as well. I pledge to be as open and honest on this topic as I can, as allowed by the terms of service that I’ve agreed to.

Six months ago, on my one year blogiversary, I mentioned my pipe dream of one day supplementing my grad school income with revenue from my blog. To be honest, I never did and still don’t expect it to happen. At that time, however, I had already earned a dollar. Now that I have a few more data points, the math geek in me couldn’t help but crunch some numbers.

The first thing I did was plot my total revenue and fit a trend line.

That’s a quadratic trend line with an R² of .947 (which is statistical speak for “the trend line matches the data fairly well”).

Using the trend line I see it will take 24 years before I have a profitable year, meaning in 24 years I can expect the ad revenue to equal the cost of the webhosting and domain purchase for just one year. It’ll take an additional 25 years until I’m out of the red completely. Yes, it will take a predicted 50 years from the day I first installed WordPress before I recoup the costs of blogging. Guess I won’t break out the champagne anytime soon.

On the other hand, in just 615 years I’ll be a millionaire, and in 861 years I’ll be a multimillionaire.
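For the curious, here is roughly how a quadratic trend line, its R², and a "when do I hit $X" projection can be computed without a spreadsheet. This is a plain-Python sketch under my own assumptions: the monthly revenue figures are invented, and the fit uses ordinary least squares via the normal equations, which may differ from whatever tool produced the trend line above.

```python
from math import sqrt

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]              # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented normal equations, solved by Gauss-Jordan elimination.
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def r_squared(xs, ys, coeffs):
    """Share of the variance in ys explained by the fitted curve."""
    a, b, c = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def time_to_reach(target, coeffs):
    """Later root of a*x^2 + b*x + c = target (assumes an upward curve)."""
    a, b, c = coeffs
    disc = b * b - 4 * a * (c - target)
    if a <= 0 or disc < 0:
        raise ValueError("the trend line never reaches the target")
    return (-b + sqrt(disc)) / (2 * a)

# Invented cumulative revenue (dollars) by month, for illustration only.
months = [1, 2, 3, 4, 5, 6]
revenue = [0.10, 0.35, 0.80, 1.40, 2.30, 3.30]
trend = fit_quadratic(months, revenue)
print("R^2:", round(r_squared(months, revenue, trend), 3))
print("months until $1,000,000:", round(time_to_reach(1_000_000, trend)))
```

Extrapolating a curve fitted to a handful of points centuries into the future is, of course, the joke.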

Since the amount of revenue is so small, I really can’t do much analysis of what types of content generate the most revenue for this blog. I can see that my visitors appear to be mostly internet searchers looking for answers to questions or general information, and not specifically for me. The most common search key words and phrases that lead to my blog include ‘do it yourself’ and ‘diy’.

Term cloud of keyword searches for my blog last month

In terms of content, the do it yourself maternity photography posts seem to be the most popular, although they will soon be eclipsed by the newborn photography posts. According to Google Analytics, the maternity pages have the lowest bounce and exit rate of any of the popular entrance points. Visitors who view those posts are more likely to also read other posts on my blog. Dare I hope that this means some of you out there like my photography? Or at least prefer it to my other content?

While I don’t have much in the way of repeat visitors, the traffic to my blog is increasing. The monthly number of visitors has increased 226% over this time last year. (Of course there were no maternity nor newborn DIY photography posts at this time last year!)

I will have to find more adorable subjects to photograph so that I can retire before I’m 640ish. Either that, or convince Domingo we want a really large family.

One question I have on my mind a lot lately, as I’m sure every pregnant woman starts asking, is “what are the odds of my baby coming today?” or “in the next couple of days?”. The trouble is, it’s really hard to find any kind of answer to that question online. Some babies come early, some come late. Any that come between 37 weeks and 42 are considered ‘right on time’. Well, the math nerd in me wasn’t satisfied with that answer.

I previously found this chart online, which uses a normal distribution with a mean of 40 weeks (or 280 days) and a standard deviation of 10 days to estimate the probability of going into labor. Or N(280, 10²) for you statisticians out there. The normal distribution is symmetric, which would mean one’s odds of going into labor one day before one’s due date are the same as going into labor one day after. I suspect the actual probability distribution of spontaneous labor is closer to a left-skewed, or negatively skewed, normal than a symmetric normal distribution. A negative skew would mean one’s odds of a very premature labor are greater than one’s odds of a very postmature labor. After all, you could go into labor at 34 weeks (x = -42 days). According to the normal distribution N(280, 10²), 6 in every 1 million babies would be born at 34 weeks and 6 in every 1 million babies would be born at 46 weeks (x = 42 days). Given that 4 million babies are born each year in the US, that would mean 24 babies born at 46 weeks gestation per year in the US alone! Of course these days doctors tend not to let women go more than 42 weeks due to health risks, so it’s impossible to say how far those women would have gone in their pregnancies. Still, I doubt 24 of them would have made it to 46 weeks. Baby’s got to run out of room eventually, and at some point the female body just can’t handle it anymore!
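The "6 in a million" and "24 babies" figures can be checked with Python’s standard library. This sketch treats the density at a single day as the chance of delivering on that day, which appears to be what the numbers above assume; the 4 million births figure is the one quoted in the text.

```python
from statistics import NormalDist

gestation = NormalDist(mu=280, sigma=10)  # days; the chart's model

density_34_weeks = gestation.pdf(280 - 42)  # 42 days early
density_46_weeks = gestation.pdf(280 + 42)  # 42 days late
births_per_year = 4_000_000

print(f"per million at 34 weeks: {density_34_weeks * 1e6:.1f}")
print(f"per million at 46 weeks: {density_46_weeks * 1e6:.1f}")
print(f"US babies per year at 46 weeks: {density_46_weeks * births_per_year:.0f}")
```

Because the normal distribution is symmetric, the first two lines always print the same number, which is exactly the property being questioned here.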

A skewed normal and a normal distribution are very similar when you’re close to the middle (i.e. close to the due date). The two distributions are less similar when you get further from the middle (i.e. further from the due date). I was really interested in knowing how likely labor was TODAY, approximately 6 weeks before my due date, so the normal distribution wasn’t going to cut it.

I wanted to estimate a skewed distribution, but how to do that without any data? Fortunately, several cited studies indicate the true likelihood is approximately normal, so I needed a skewed normal distribution that is close to N(280, 10²): characteristic one. Our doctor also told us 10% of babies are delivered prematurely (before 37 weeks): characteristic two. (The normal distribution N(280, 10²) predicts only 3% of babies will be delivered prematurely.) We also know that roughly half of pregnant women go into labor before their due date, and half afterwards: characteristic three. Skewed distributions have three parameters (location, scale and shape), so all I had to do was tweak these parameters until I had a distribution with all three characteristics. Should be easy, right?

Five hours later…

I wanted to create my model using Excel, rather than Matlab or R, two programs especially designed for statistics. I haven’t touched either in a while, and didn’t want to re-learn them. Excel has support for normal distributions, but nothing for skewed normal. That meant I had to implement the functions on my own, and my calculus skills are only slightly less rusty than my Matlab or R skills. At some point I probably should have given up and switched over to Matlab, but I was stubborn and determined to get it! It was a matter of pride.

In the end I came up with the following distribution. This distribution shows approximately 10% of babies will be premature, half of all pregnancies will be early while half will be late, and the squared error between the two distributions is less than 2 × 10⁻³. For another sanity check, it shows a mean delivery date of 279 days, or 39 weeks 6 days.
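For the curious, the tweak-and-check loop can be sketched in a few lines of Python instead of Excel. This is only a rough sketch: it assumes the standard skew-normal density, (2/scale) × φ(z) × Φ(shape × z), and the parameter values in the second call are placeholders to start tweaking from, not my fitted values.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_pdf(x, location, scale, shape):
    """Skew-normal density: (2/scale) * phi(z) * Phi(shape * z)."""
    z = (x - location) / scale
    return 2.0 / scale * normal_pdf(z) * normal_cdf(shape * z)

def skew_cdf(x, location, scale, shape, steps=4000):
    """P(X <= x) by trapezoid-rule integration of the density."""
    lo = location - 12.0 * scale  # density is negligible below this point
    if x <= lo:
        return 0.0
    h = (x - lo) / steps
    total = 0.5 * (skew_pdf(lo, location, scale, shape)
                   + skew_pdf(x, location, scale, shape))
    for i in range(1, steps):
        total += skew_pdf(lo + i * h, location, scale, shape)
    return total * h

def skew_mean(location, scale, shape):
    """Closed-form mean of the skew-normal distribution."""
    delta = shape / math.sqrt(1.0 + shape * shape)
    return location + scale * delta * math.sqrt(2.0 / math.pi)

def report(location, scale, shape):
    """Print the characteristics to check for one parameter guess."""
    print(f"P(before 37 weeks) = {skew_cdf(259, location, scale, shape):.3f}")
    print(f"P(before due date) = {skew_cdf(280, location, scale, shape):.3f}")
    print(f"mean delivery day  = {skew_mean(location, scale, shape):.1f}")

# shape = 0 recovers the plain normal N(280, 10^2) as a baseline.
report(location=280.0, scale=10.0, shape=0.0)
# Hypothetical negative-skew guess, to be tweaked by hand.
report(location=290.0, scale=16.0, shape=-2.0)
```

The idea is to keep adjusting location, scale and shape until the first number is near 10% and the second is near 50%, while keeping the curve close to the normal one.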

My model (blue) as compared to the normal distribution (red). I plotted them both assuming ‘0 days’ as the due date instead of 280 to make it easier to read.

Interesting side note: while the model shows half of women go into labor before their due date, the day with the highest probability of spontaneous labor is 7 days after her due date, which matches conventional wisdom!

So what does this mean for me? Given that zippy isn’t here yet, I have a 0.1% chance of going into labor today and a 1.36% chance of going into labor in the next seven days! That’s 30 times higher than the prediction I was getting with the normal distribution!
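The “given that the baby isn’t here yet” part is a conditional probability: the chance of labor in a window, divided by the chance of still being pregnant today. Here is a minimal sketch in Python. It uses the symmetric normal N(280, 10²) as a stand-in because it only needs the built-in math module, so it will not reproduce the skewed model’s numbers above; passing in a different cdf function would swap the model.

```python
import math

def normal_cdf(x, mean=280.0, sd=10.0):
    """P(labor on or before day x) under N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def chance_within(days_ahead, today, cdf=normal_cdf):
    """P(labor within `days_ahead` days | no labor before `today`)."""
    still_pregnant = 1.0 - cdf(today)
    return (cdf(today + days_ahead) - cdf(today)) / still_pregnant

today = 280 - 42  # about six weeks before the due date
print(f"next 1 day:  {chance_within(1, today):.4%}")
print(f"next 7 days: {chance_within(7, today):.4%}")
```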

Of course this is just an estimate, and all meant to be in good fun. Without data, my model is only a guesstimate. Nevertheless, my math nerd itch has been scratched.

You can try the tool out for yourself here.

September 1, 2011

Mathematics of Insurance


A recently engaged friend and I were discussing engagement rings, and she was shocked when I told her we did not insure our ring. “But doesn’t it have sentimental value to you?” she asked. Why yes, yes it does. But you can’t insure sentimental value, you can only insure monetary value. The insurance company won’t send out a search party should you lose your ring, or help in the police investigation if your ring gets stolen. They will write you a check to buy a new one.

Mathematically speaking, insurance doesn’t always make sense.

Insurance for possessions is similar to extended warranties. Let’s consider an example from 2007. I purchased a Wii for $250. The store offered me a $10 extended warranty. Let’s say the store estimates the failure rate is 1 in 100. Then for every 100 warranties sold, the store expects one customer’s Wii to break and to have to pay that customer the full price of a new Wii, or $250. The store’s expected profit from selling 100 warranties is then $10×100 (the revenue from the warranties) – $250×1 (the payout to one customer), or $750. This expectation is called the expected value. When the store sells tens of thousands of warranties, the statistical property called the Law of Large Numbers shows it is unlikely for a much larger percentage of Wiis to fail. Thus the store is not likely to lose money by paying out on the warranties. The store is a business, after all, not a charity, and the goal of any business is profit.

The expected value (EV) can be calculated for an individual warranty. This value represents the monetary worth of the warranty. The equation is: (probability of failure × monetary value of failure) + (probability of no failure × monetary value of no failure)
For the store:

EV(Store’s Value of the Warranty) = 1%×(-$250+$10) + 99%×($10) = -$2.40 + $9.90 = $7.50

In the first expression, the term -$250+$10 is the payout minus the revenue gained from the sale of the warranty. The monetary value of no failure, in the second expression, is simply the revenue from the sale of the warranty. Thus the store expects to earn $7.50 per warranty sold. Not coincidentally, it’s 1/100 of the expected value of selling 100 warranties.

We can also compute an individual consumer’s expected value of purchasing the warranty:

EV(Consumer’s Value of the Warranty) = 1%×($250-$10) + 99%×(-$10) = $2.40 + -$9.90 = -$7.50

The store’s expected gain is the customer’s expected loss. Each dollar the customer loses is a dollar the store gains.

We can also compare the expected value of the consumer not purchasing the warranty. In this case, the consumer does not pay the $10 fee, but is out of luck and must pay an additional $250 to replace the console, should it break. The consumer’s expected value of no warranty is:

EV(Consumer’s Value of No Warranty) = 1%×(-$250) + 99%×($0) = -$2.50

The expected value is still negative, but the consumer’s expected value of not purchasing the warranty is less negative than the expected value of purchasing the warranty. Mathematically speaking, this means the consumer is expected to lose less money by not purchasing the warranty.
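The three expected values can be double-checked with a short Python sketch. It assumes the example’s numbers (a 1 in 100 failure rate, a $250 console and a $10 warranty), and the function and variable names are just for illustration.

```python
def expected_value(p_failure, value_if_failure, value_if_ok):
    """p * value_if_failure + (1 - p) * value_if_ok."""
    return p_failure * value_if_failure + (1.0 - p_failure) * value_if_ok

P_FAIL = 0.01     # assumed failure rate from the example
CONSOLE = 250.0   # replacement cost of the Wii
WARRANTY = 10.0   # price of the extended warranty

store = expected_value(P_FAIL, -CONSOLE + WARRANTY, WARRANTY)
consumer_with = expected_value(P_FAIL, CONSOLE - WARRANTY, -WARRANTY)
consumer_without = expected_value(P_FAIL, -CONSOLE, 0.0)

print(f"store:                 {store:+.2f}")
print(f"consumer, warranty:    {consumer_with:+.2f}")
print(f"consumer, no warranty: {consumer_without:+.2f}")
```

Changing P_FAIL shows how high the failure rate would need to be before the warranty starts to look like a good deal for the consumer.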

Engagement ring insurance works mostly the same way. A jewelry insurance company computes the probability that my ring will get lost, damaged or stolen. Many factors go into the calculation, including facts like the crime rate where I live, or whether or not I’ve ever reported a claim. The insurance company sets its rate accordingly, so its expected value is positive. In fact, it sets its rate high enough that it can expect to be profitable and still pay its staff’s wages, the fixed costs of operating its business, and the women who file a claim on their rings. Again, it’s a business, not a charity. As before, the money the company is expected to make from me as a customer is equal to the money I am expected to lose.

This isn’t to say that insurance is never a good idea. In fact, it’s often a very good idea! Insurance and extended warranties are designed to provide protection for the worst case scenario, not the average case. Home, auto and health insurance all have expensive worst case scenarios, beyond the financial capabilities of most people, which is why they are considered necessary. The worst case scenario when not purchasing the extended warranty is that the Wii breaks, which would cost me $250 to replace. If you can afford the worst case (i.e. purchase a new Wii), you are usually better off not insuring. This is sometimes referred to as self-insuring, meaning you are setting money aside and relying on yourself to cover the financial burden should the worst case happen.

Should the worst case occur, and I lose my engagement ring, I will be okay financially speaking. Therefore, I expect to come out financially ahead by not insuring.

There was a really good article about the profitability of Etsy. Simply put, Etsy’s profitability is dependent on the individual stores’ profitability. Creating a profitable Etsy store, however, is no easy feat. The biggest hurdle is determining how much to charge for a given item: too much and the shopkeeper loses customers, yet too little and the shopkeeper’s profits may not outweigh costs.
So how much should one charge? It’s easier to think in terms of profits and work backwards. A simplified formula for profits is

YearlyIncome = NumberOfSales * Price – Expenses

Solving for price, you get

Price = (YearlyIncome + Expenses)/NumberOfSales

You can substitute DesiredYearlyIncome to get a price target. To give a concrete example, let’s assume that I’m striving to be one of those $30,000 a year Etsy merchants as a jewelry maker, and I think I can reasonably sell 1,000 pieces a year. Then I will need to charge $30 above the materials cost for each piece I sell. This $30 is basically the cost of labor. But as Max Chafkin points out, “The vast majority of Etsy sellers are hobbyists who aren’t in it for the money and, consequently, end up charging rates for their labor that would make even a Walmart buyer blush.”

With similar products being offered at barely above the materials costs, it’s difficult to raise the price too much without impacting sales. Why buy a necklace from me if someone is selling a comparable one for $30 less? As a result, the price I can charge is basically fixed. This means the only two variables in the equation left that can change are expenses, and number of sales.

Expenses can be difficult to change. If you’re already buying wholesale, or in bulk, you’re not going to find much wiggle room. You can also substitute cheaper supplies, but you run the risk of losing customers who want jewelry made of higher quality materials.

Increasing the number of sales is also not easy. In my example, I assumed 1,000 sales and needed to charge $30 per piece in labor costs. If I only want to charge $5 in labor costs, I would need to sell 6,000 pieces. That’s roughly 16.4 sales a day. Some visitors to my store front will not purchase anything. Some visitors may be other shop owners or crafters looking for inspiration. In order to get enough potential customers to my store I will have to devote time to advertising, time that won’t be spent crafting.
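If you want to plug in your own numbers, the price formula and its rearrangement for number of sales can be sketched in Python like this. Expenses are left at 0 here so that “price” means the labor charge above materials, which is an assumption made to match my example; the function names are just for illustration.

```python
def price_target(desired_yearly_income, expenses, number_of_sales):
    """Price = (YearlyIncome + Expenses) / NumberOfSales."""
    return (desired_yearly_income + expenses) / number_of_sales

def sales_needed(desired_yearly_income, expenses, price):
    """The same formula solved for NumberOfSales."""
    return (desired_yearly_income + expenses) / price

# Expenses are left at 0 so that "price" means the labor charge per piece.
labor_charge = price_target(30_000, 0, 1_000)
pieces = sales_needed(30_000, 0, 5)

print(f"labor charge at 1,000 sales: ${labor_charge:.2f}")
print(f"sales needed at $5 each: {pieces:.0f} a year, {pieces / 365:.1f} a day")
```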

The way I see it, the best bet to be profitable is to reduce competition, primarily by changing the product you offer. One way to do this is by offering a product that few others can. Fill a niche. There may be many wire jewelry makers on Etsy, but there are far fewer casters. There will always be crafters who will mimic cool designs they see, so you can’t just differentiate yourself by style alone. Another approach I’ve seen recently is to take advantage of the fact that many of your shop visitors will themselves be crafters. A few Etsy shop owners offer instructions for how to create their crafts for very small sums of money. There’s no materials cost once the instructions have been created, so the $1-$2 is pure profit. It’s also a small enough sum of money that you’re unlikely to be greatly undercut.

Despite the difficulties, there are those who do manage a thriving business out of Etsy. You can look at the numbers Max Chafkin points out and either lament that only 1,000 Etsy shops make 30k a year, or rejoice because 1,000 shops make 30k a year. If you’re of the latter camp, and are eager to give your business a try, there’s lots of advice about selling on Etsy, including custom work, to help maximize your chances of being successful.

If you’re curious, there’s also a consulting rule of thumb for determining your hourly wage that also applies to crafters doing custom work. The formula is:

(DesiredYearlyIncome + YearlyExpenses)/2000.

The 2000 comes from a 40-hour work week, 50 weeks a year. Why not 52? Well, you’ll need some (paid) time off, for sick days or even just to recharge. We all need a break sometimes.
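As a quick Python sketch of the rule of thumb: the income and expense figures below are made up purely for illustration, and the names are my own.

```python
HOURS_PER_WEEK = 40
WORKING_WEEKS = 50  # 52 weeks minus two weeks of time off

def hourly_rate(desired_yearly_income, yearly_expenses):
    """(DesiredYearlyIncome + YearlyExpenses) / 2000."""
    billable_hours = HOURS_PER_WEEK * WORKING_WEEKS  # 2000
    return (desired_yearly_income + yearly_expenses) / billable_hours

# Hypothetical figures for illustration only.
print(f"${hourly_rate(30_000, 4_000):.2f} per hour")
```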

Edited to add: I recently developed a tool called Scale-Up which can help you find out what it will take to grow your Etsy business to your desired income level.
