Archive for the ‘Internet’ Category

A while back I discovered that someone was trying to pass off my photos of Nicki as photos of her own child. I immediately turned to Google reverse image search to see if anyone else was using my photos without permission. Entering the URL of each image from my blog was so slow and tedious that I gave up after checking just a few. There had to be a better way.

I know I’m not the first blogger who has had images of her child appear elsewhere. I’m not even the first blogger in my Twitter stream this year that this has happened to. Internet strangers doing inappropriate things with baby photos is something that keeps some bloggers up at night, and makes some of us hang up our blogging hats altogether. To make it easier to detect this sort of thing, I wrote a Duplicate Image Search utility script.

Simply give the script the URL of a webpage you want to check. The script finds all images on the page and displays them. The images are hotlinked, meaning I am not caching them or saving them to my server. My script is basically just a proxy service. Clicking on an image will open Google’s reverse image search and you can verify that only authorized sites are the ones displaying your images.

I had bloggers in particular in mind when creating this utility script. To make it more useful, I attempt to automatically parse the page looking for a “next” button. You could start with your main index page, and slowly comb through your entire blog.
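For the curious, here is a rough sketch of how a script like this could work. It is not the actual utility, just an illustration using Python's standard library. The class and function names are made up for the example, the "next"/"older" link-text check is only one possible heuristic, and the reverse image search address is an assumption that may not match what Google currently accepts.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin


class PageScanner(HTMLParser):
    """Collects image URLs and a best guess at the 'next page' link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.next_url = None
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page address
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self._href = urljoin(self.base_url, attrs["href"])
            # rel="next" is the clearest hint, when a theme provides it
            if "next" in (attrs.get("rel") or "").lower().split():
                self.next_url = self.next_url or self._href

    def handle_data(self, data):
        # Otherwise fall back to link text such as "Next" or "Older posts"
        if self._href and self.next_url is None:
            text = data.strip().lower()
            if text.startswith("next") or text.startswith("older"):
                self.next_url = self._href

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def reverse_search_link(image_url,
                        endpoint="https://www.google.com/searchbyimage?image_url="):
    # ASSUMPTION: the endpoint is illustrative; swap in whatever service you use.
    return endpoint + quote(image_url, safe="")


def scan(html, base_url):
    """Return (list of image URLs, next-page URL or None) for one page."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    return scanner.images, scanner.next_url
```

Calling `scan` on each page and then following the returned next-page address is what lets you walk from the index page through the rest of a blog. Nothing is downloaded or stored here; the sketch only builds links.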

The script isn’t perfect. I wanted to parse out the search results and just return the web address of any unusual domains that might be using your photos without permission. Alas, the only APIs I could find that would let me do this are prohibitively expensive. If I get enough interest in this script that I can amortize the cost, I’ll consider making the improvements in the future. In the meantime, I hope it helps you stop anyone from using your photos without your permission.

copyright

I feel a bit like a grumpy ole curmudgeon for doing this, but I put up an official copyright notice. I really hate to do that, but after finding a second person using my baby photos of Nicki in as many days, I felt I had to do something. At least this time I believe it was an honest mistake, and not another person pretending photos of Nicki are photos of her child.

As I was writing the copyright notice, I kept thinking about when I first got online, and my very first websites. Like most of my peers, I would occasionally use an image I didn’t own the rights to. I wouldn’t use any photos with an obvious copyright, but the non-obvious ones? Sure, I’ve done that. I had an inkling it was wrong, but I figured I wasn’t harming anyone. I wasn’t profiting off it, and everyone does it. I figured if I was caught I could just say “I didn’t know it was copyrighted” (half true) or “my friend sent it to me to use, I thought it was hers” (a total copout). The ultimate irony: I’ve heard both excuses from people using my images without permission.

It’s easy to make mistakes. There’s a misconception that a link back to the original source of a photo counts as fair use, or that just because a website grants Google permission to store and show its images the same privilege extends to anyone using Google. That’s like saying if you credit George R. R. Martin you can post the full contents of A Song of Ice and Fire on your website, or because George R. R. Martin granted HBO the rights to produce Game of Thrones, and you subscribe to HBO, you can make a Game of Thrones as well. Copyright law does not work that way.

But as difficult as it was to write my copyright notice, there is some good that comes from it.

Good for you: Less ambiguity. Some people have pointed out possible copyright issues with pinning on Pinterest. In the past I’ve implied I’m ok with Pinterest, now I’ve explicitly stated it. I’ve also decided to reserve only some rights, not all! That means I’ve given permission to post my images/photos on your website/blog under certain circumstances. When in doubt, you can always ask. I’ll probably be so tickled pink that you want to use my work that I’ll say yes.

Good for me: Consistency. I agonized over what to do with this latest copyright violation. I started filling out a DMCA take down request when I started having flashbacks to my younger self. How would I have felt had my webhost informed me that I was in violation of someone else’s copyright and they had temporarily suspended my website as a result? Make no mistake, the copyright owner and my webhost would have been well within their collective rights to do so (as would I in this case). I truly feel most people are honest, just unaware. My policy gives everyone 3 days to respond when I email them about a copyright violation. I understand that someone might be on vacation, or traveling, or otherwise occupied, so I’m just looking for a response in that time that implies they’re taking my request seriously. Absent that, or a way to contact them, I’m afraid I will have no choice but to follow through with the DMCA take down request. Seems fair, right? I feel less bad about filing the complaint if it’s a uniform policy I apply to everyone.

Will the new policy be effective? Probably not. I suspect in both cases the copyright violator found the photos using Google image search, and never visited my blog. If they don’t visit my blog, they won’t see my copyright notice. Regardless, I will feel better about taking action.

Someone asked me why not block Google from archiving my images. For the most part Google is my friend. I actually get a fair amount of traffic to my blog through Google image search. I want potential readers to be able to find me. Still, it’s a valid point. Copyright laws only work in countries that agree to enforce them, and I can’t stop someone if I don’t know they’re using my images. The only way to truly prevent copyright violations is to not let anyone have access to the photos in the first place.

Since I still intend to blog, the most effective strategy I can think of is to post small images – big enough to display in a blog post, but too small to be useful elsewhere. Actually, that’s not true. The best strategy would be to post terrible photos that no one wants to steal, but that’s not a path I want to go down. At least not intentionally.

Updated 4/13: The second person violating my copyright has voluntarily removed my content when it was pointed out to her. I did not have to file a DMCA take down request.

April 10, 2013

Stolen Baby Photos

Yesterday evening I noticed a large bump in traffic from a forum on topix.com. Someone had posted a link to my blog on a public forum there, which was generating a ton of traffic. This happens from time to time (normally it’s a link to my labor predictor) but this time it was to my Newborn Photography page. I was so excited to see what strangers thought of my photography. External validation, how I love you.

Alas, it was not what I first thought.

The forum thread was dedicated to ‘outing’ one of their members. Apparently she had lifted some photos of babies online, one (some?) of which was mine. One of the forum members posted a link to my newborn photography page, as well as four other websites, as proof.

I was confused, to say the least. From what I could tell this person, “crazy mom”, had collected a number of photos and posted them to her Facebook page. My guess is she did a Google image search and downloaded a few photos she liked (so I guess I got that external validation after all?). Crazy mom probably never visited my blog. I don’t even think all the baby pictures were the same gender. The only thing they had in common was each baby appeared to have been born with a full head of hair. I think she was claiming she had a boy. About thirty minutes into my investigation, the forum post was deleted. I never did see which photo she was pretending was hers.

Now, I’ve been online a long time – back before the days of Google, back when being a webmaster meant slinging static HTML code onto GeoCities. Back then I built a fantasy website under the pen name ‘Aella Lei’. At the time I was into Greek mythology, and picked the name after one of the Amazons Hercules fought during his 12 labors. It turns out that ‘Aella’ is also a French name and one day someone of that name happened upon my website. This Aella took a liking to my website, and wanted credit for it. Her theory was that since I was using the name as a pen name, and it was her given name, I owed her at least partial credit. I disagreed and ignored her.

Back then I was using an internet messaging client called ICQ. ICQ is similar to Yahoo Messenger and AOL Instant Messenger except everyone had a unique numeric ID that identified them, rather than a unique screen name. That way there could be many ‘Aella’s. Aella set her screen name to ‘Aella Lei’, her profile website to my website, and for her bio she lifted sentences straight from the “about me” page on my website. I know this because one day she forgot she wasn’t me, and messaged me thinking I was the imposter.

Compared to Aella, this crazy mom’s brand of crazy is pretty tame. I found no evidence to suggest crazy mom was obsessed with Nicki. Still, this experience does serve to remind me that there are people out there who will fixate on individuals. Nicki is too young to have an opinion on how much information about her is online. It’s my job to make sure she’s protected. Perhaps it’s time to rethink how much I share online.

If by chance you happened to stumble onto my blog from topix.com, and you know who this person is who stole my photos, I’d really appreciate it if you could fill me in.

February 28, 2013

2nd Year Blogiversary

I’m not sure how I missed it, but my 2nd year Blogiversary was 3 days ago! You’d think with all my posts on meta-blogging I would have noticed it coming. I blame my distracted state on prepping for upcoming interviews.

So how’s my fledgling little blog doing?

* In terms of traffic sources, I had five times as many visitors from Pinterest last year as the year before!
* I have ten new pins, bringing my total to fourteen pins!
* Overall, page views are up 139% from the same time last year!

I noticed the newborn photography was the third most profitable page. I admit when I saw that I envisioned a random surfer thinking “I’ll try this do-it-yourself stuff”, stumbling onto my blog, and thinking “no way! I’m hiring a professional.”

The most popular page is my Labor Predictor. On any given day about 40-50% of the page views to my blog are on the labor predictor. I’m glad it’s so popular, it was fun writing it! I love my math-y posts.

My goal for the coming year: increase the number of non-mommy related posts. I was talking to someone the other day and it was clear that he thought of my blog as a “mommy blog”. Obviously being a new mom is a major part of my identity right now, but it’s not the only piece. While I’ll never shed the mom title, Nicki will grow older and more independent. I plan on keeping my blog for a while, not just while I have young kids!

With that said, I’m off to go take more photos of Nicki. Because I’m an obsessive momtographer like that.

February 21, 2013

Writing Sample Analyzer FAQ

In preparation for my re-entering the work force (shameless self plug – I’m job seeking!) I closed down some of my old legacy sites that I’ve been keeping around for posterity. One of these was my consulting site, bluecentauri.com. Even though I haven’t updated it in years, the tools were still in use, especially the Writing Sample Readability Analyzer. So I moved it to my resume website, sarahktyler.com.

Since I’m getting a ton of emails about it lately, I thought I’d answer some of the frequently asked questions.

How does your analyzer work?

The analyzer uses the Flesch, Fog and Flesch-Kincaid metrics to predict reading ease. Each approximation makes the basic assumption that longer sentences are harder to read than shorter sentences, and words with more syllables are harder to read than words with fewer syllables. Although the underlying principle is the same, each metric is calculated slightly differently.

FleschScore = 206.835 − (1.015 × AverageSentenceLength) − (84.6 × AverageNumSyllablesPerWord)

FogScale = 0.4 × (AverageSentenceLength + PercentageOfWordsWithThreeOrMoreSyllables)

FleschKincaidScore = (0.39 × AverageSentenceLength) + (11.8 × AverageNumSyllablesPerWord) − 15.59
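As a quick illustration, the three formulas translate almost directly into code. This is a minimal sketch rather than the analyzer's actual source. The function names are invented for the example, the inputs are assumed to be computed already, and the Flesch-Kincaid function uses the commonly published 0.39 coefficient for sentence length.

```python
def flesch_score(avg_sentence_length, avg_syllables_per_word):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


def fog_scale(avg_sentence_length, pct_words_three_plus_syllables):
    """Gunning Fog index; the percentage is on a 0-100 scale."""
    return 0.4 * (avg_sentence_length + pct_words_three_plus_syllables)


def flesch_kincaid_score(avg_sentence_length, avg_syllables_per_word):
    """Flesch-Kincaid grade level, using the standard 0.39 coefficient."""
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59


# Hypothetical sample: 15 words per sentence, 1.5 syllables per word,
# and 10% of words with three or more syllables.
print(round(flesch_score(15, 1.5), 2))          # 64.71
print(round(fog_scale(15, 10), 2))              # 10.0
print(round(flesch_kincaid_score(15, 1.5), 2))  # 7.96
```

Note that all three take the same two or three inputs, which is why any disagreement between analyzers comes down to how those inputs are counted.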

I’ve used your analyzer and another analyzer and gotten different results, why is that?

The Flesch, Fog, and Flesch-Kincaid are well defined metrics. If another system is reporting a different score for the same metric, then the input variables (either number of sentences, or number of syllables per word) must be calculated differently.

Calculating sentence boundaries is surprisingly not as straightforward as it seems. As humans, we can identify when a sentence ends pretty easily. Since the computer can’t really parse or understand the sentence**, it can only make an educated guess based on clues like punctuation and capitalization. But not all punctuation (think abbreviations) ends a sentence, and not all sentences end with punctuation. This is especially true online, where sentences are often not well formed.

The same goes for computing the number of syllables in a word. It may seem simple to just create a list of syllables per word, but language is infinite and constantly evolving. Such a list is not possible. Pronunciation (and the number of syllables) can differ in different parts of the world. Additionally, heteronyms, words that have the same spelling but different pronunciations, can have different numbers of syllables. The word ‘learned’ as the past tense of the verb ‘to learn’ is one syllable, but ‘learned’ the adjective to describe someone with scholastic achievement is two. Any method to calculate the number of syllables per word will involve some heuristics.
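To make the guesswork concrete, here is a deliberately simple sketch of both heuristics. It is not how my analyzer (or any particular analyzer) does it; the abbreviation list, the vowel-group rule and the function names are all assumptions chosen to keep the example short.

```python
import re

# A tiny, hypothetical abbreviation list; a real one would be far longer.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "etc", "vs", "e.g", "i.e"}


def split_sentences(text):
    """Guess sentence boundaries from end punctuation, skipping abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?":
            word = token.rstrip(".!?").lower()
            if word not in ABBREVIATIONS:
                sentences.append(" ".join(current))
                current = []
    if current:  # text that trails off without punctuation still counts
        sentences.append(" ".join(current))
    return sentences


def count_syllables(word):
    """Estimate syllables by counting groups of consecutive vowels."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing 'e' is usually silent ("make"), except in "-le" ("table")
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


print(split_sentences("Dr. Smith went home. He slept!"))
print(count_syllables("together"), count_syllables("kiln"))
```

Even this toy version shows where analyzers drift apart: `count_syllables("learned")` gives the same answer for the verb and the adjective, because spelling alone cannot tell them apart, and a different abbreviation list would split the same text into a different number of sentences.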

The differences in calculating sentence length and the number of syllables will tend to be more noticeable on shorter samples, rather than longer. Even so, while there may be differences between different analyzers, the differences should be relatively small.

** There is an active area of research in natural language processing which tries to automatically parse and understand sentences.

Which analyzer is the most ‘accurate’?

There are two types of ‘accurate’ we can consider: which analyzer comes closer to the true Flesch, Fog and Flesch-Kincaid metrics, and which one better predicts reading ease. Keep in mind that each metric is just a heuristic based on an assumption that is often true, but not always. For example ‘kiln‘ is one syllable, but harder than the simple, three syllable word ‘together’. Depending on the kind of text you are analyzing, you may find one method or score works better for your application than another.

Let’s consider two Analyzers, one with a very good sentence boundary detection, Analyzer A, and one with a very good syllable per word calculator, Analyzer B. If you were analyzing writing samples from elementary school children, you may prefer A. That’s because young children may not write grammatically correct sentences and typically don’t have a rich vocabulary, so a more complex syllable per word calculator wouldn’t buy you much whereas a better sentence boundary detector may be necessary. On the other hand, if you were analyzing scientific journal articles, you may prefer B.

My suggestion is to use both analyzers to get a feel for which one is better for you and your task.

Will you share the code?

I have in the past, but only for extra special cases.

In my seemingly never ending quest to learn more about the business of blogging, I’m constantly diving back into the numbers. Of late I’ve been curious about the stickiness factor of blogs. What makes one blog memorable while another one with similar content is just so-so? And (since my blog is the only blog I have data for) how ‘sticky’ is my blog?

Bounce Rate

I started with the most basic of statistics, the bounce rate for last year. Overall I had a 78% bounce rate, meaning 78% of my visitors do not read a second page (at least during that visit) within my blog. Breaking it down further I see that Craft Projects and Photography had the lowest bounce rates (68% and 69% respectively), while Shopping and Family Life had the highest. This isn’t really surprising. Most visitors to my blog appear to be looking for answers to questions or general information, and not specifically for me. A random surfer to my blog will likely care more about my general info posts than the personal life.
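If you want to compute the same statistic from your own logs, the arithmetic is simple. The sketch below assumes you already have the number of pages viewed in each visit; the function name and the sample numbers are made up for illustration and are not my traffic data.

```python
def bounce_rate(pages_per_visit):
    """Percentage of visits that viewed only a single page."""
    if not pages_per_visit:
        return 0.0
    bounces = sum(1 for pages in pages_per_visit if pages <= 1)
    return 100.0 * bounces / len(pages_per_visit)


# Hypothetical sample: ten visits, seven of which stopped after one page.
visits = [1, 1, 1, 3, 1, 2, 1, 1, 1, 5]
print(bounce_rate(visits))  # 70.0
```

Running the same function over only the visits that landed on a given category or tag gives the per-category breakdown.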

I also have a bit of a keyword/query mismatch problem, it looks like. Looking at my traffic data, I see the two most popular ‘Family Life’ posts are List Overload and Still a Girl. Most visitors to ‘List Overload’ are looking for the city mini 2013, which the post mentions I was interested in buying, but isn’t what the post is about. For ‘Still a Girl’, visitors are interested in 4d ultrasound pictures. Since that’s not what either post is about, I certainly can’t fault visitors for leaving!

In terms of tags, Maternity Photography had the lowest bounce rate at 40%! Alas, Newborn Photography’s bounce rate was pathetic at 96%. So I guess maybe you, anonymous reader, don’t like all of my photography? I was surprised to see that Consumer Research and Baby Gear also do well (bounce rates of 57% and 65% respectively) despite shopping being one of the worst categories.

The biggest killer of my bounce rate in the shopping category is Hallmark on a budget, a post on my Hallmark shopping strategies. It’s one of my most popular shopping posts. I got a huge bump in traffic for it on Christmas day with people looking for after-Christmas sales. The problem? The post was written in August 2011 with no details on where to find this year’s sales. The keyword mismatch problem strikes again.

New vs Repeat Visitors

Bounce rate can be a little misleading. When I visit a blog I follow I will read the top one or two posts on the main page that I haven’t seen before. Once I see a post I’ve read before, I stop and go elsewhere. If the blog displays the full content for those few posts on the index page, I’ve effectively bounced. But the bounce doesn’t mean I’m not a loyal reader. I keep coming back, I keep reading the latest posts and bouncing. Thus, to explore my blog’s stickiness, I also looked at new vs repeat visitors, not just the bounce rates.

Overall, only 11% of my hits are from repeat visitors. Bummer. But here’s where I start to get good news: for my blog’s main page 37% of all visits are from repeat visitors! Maybe I do have some readers after all?

But wait, there’s more interesting news! Posts with the Doing the Math tag (which I’ve always thought of as my most favorite to write for, and least popular) had the best percentage of repeat visitors for any tag, at 10%! And, ironically, when I noticed that I immediately thought ‘must be a math mistake’. More likely, though, it has to do with how I handle those posts. Since those are the posts I love, they’re also the ones I tend to tweet to my friends on Twitter.

Becoming Sticky

As a scientist it’s really tempting to tweak the variables and see what happens. If I post more maternity photos (Hah, I wish!) how will that affect my traffic?

I’ve always been a firm believer that the only real Search Engine Optimization strategy is to have good content. I have a similar philosophy with blogging. I should stick to the content I enjoy writing about, and not worry about tailoring it to the queries that bring visitors to my blog. When a visitor stumbles on to my blog by means of a query that doesn’t really match the blog content, he or she tends to bounce. There’s nothing I can do about the query mismatch, but I can strive for better content.

October was a better than expected month for me. Remember how it took a full year for me to make my first dollar? And then six months to make my next? In October I earned a full dollar. Updating my blogging revenue probability model with October data also shows me it will only take me 573 years to reach $1,000,000 in revenue instead of 615! Wahoo! Kidding aside, the increase in traffic has me wondering if maybe my blog could become something. In the last four months the number of visitors to my blog has doubled. (It’s the newborn photography. Too bad I have to wait for the next baby to expand on those. Maybe you’ll find non-newborn baby photography interesting in the meantime?)

I’ve been thinking a lot about what direction to take my blog, what type of blogger I am and what type of blogger I’d like to be.

Professionally, I’m a data scientist. My blog gives me an opportunity to make an internet name for myself. On the other hand, I’m fairly new to the domestic thing, and I’m still figuring things out. Blogging helps me put my skills to the test. Sometimes people like me, and I get pinned on Pinterest. Other times they offer handy suggestions and ideas that help me grow.

I love to shop (especially bargain hunting around Black Friday), and I love to give my opinions on purchases. I don’t think I’m cut out to be a blogger who does those reviews and giveaways. For one, at less than a dollar revenue per month on average, I won’t be funding any giveaways any time soon. While I would love to get products for free to review, I’m way too small a blog for that. Besides, I’m still a novice blogger finding my voice. I wouldn’t want to come across as pushing junk. For now my plan is to stick to discussing only products I bought myself.

I have gotten a few requests for guest posts, but I’m even more reluctant to go this route. I blog for me first and foremost. I’m so completely forgetful and writing things down is an easy way to remember how I feel, like how much I loved being pregnant. I also blog because I’m dyslexic, and blogging is an easy way to practice writing.

Still, I do like the idea of small businessafying (yes, I’m pretending that’s a word) my blog some day. I’ve spent the last few months reading about what that will entail, and how to grow my blog/brand. It would be nice to defray some of the costs of blogging, small as they might be. Yet, I’m not sure I’m ready to make the kind of time commitment that will require. Aside from actually blogging, I’d have to work on building an audience. Most of my visitors are still one-time visits searching via keywords.

At the current rate of growth, my model predicts it will be another 4 and a half years before I get my first check (and owe taxes) from blogging. Here’s hoping I figure it all out by then.

A few months ago one of my mom’s groups discussed technology. It started as a conversation of elementary kids and cell phones, which everyone seemed to be against, but quickly grew to other forms of technology like personal computers and tablets. I was shocked to learn not a single one would allow their child to have a personal computer in their room prior to high school. Not a one. Many wouldn’t even allow it then. My fellow moms wanted to protect their kids from the Big Bad out there on the internet. At a time when even some elementary schools are considering giving their students pads as a learning device, this kind of concern seems short sighted.

“[Someone who] out-educates us today is going to out-compete us tomorrow” – President Obama

The Internet has become the great equalizer. Don’t have access to the best schools? You can take classes online like Khadijah Niazi, the eleven-year-old Pakistani girl who was the youngest ever to pass an online college-level physics class. She learned the material by watching YouTube videos at the same time the Pakistani government decided to block YouTube.

The bar for achievement keeps climbing higher each year. The kids at the Intel science competition are writing computer simulators to solve complex mathematical problems at 16 and 17. Like it or not, this is the level of competition. You can’t write a simulator at 16 if you’re just learning to use the computer at 15.

A few days ago I took a photo of Nicki looking at my dad’s tablet. On the surface this seems in direct conflict with the AAP recommendation of no screen time for children under 2. If you read the press release they’re largely talking about TV watching. Studies have failed to prove “educational” toddler and infant programs have any real benefit. Even ambient TV can distract a parent from engaging with a child. Of course a TV as a baby sitter isn’t good long term, but that’s not what we’re doing. Nicki is learning. Am I naive enough to think Nicki is learning shapes from the iPad app she’s looking at? Of course not. She is learning that she can interact with the iPad, that it reacts to her touches. We don’t do it every day, or even every week, but over time she’ll learn how to control the tablet. When she’s old enough to actually benefit from the educational programs, she will have already mastered the tool. That’s not a bad thing in my book.

This isn’t to say you can’t be successful without growing up with a computer. Of course you can. And it’s not to say there aren’t scary things and scary people on the Internet. Of course there are. Technology is a tool that can be used for good and bad. I personally feel the benefits outweigh the risks.

Side Note: I know talking about Money is typically taboo, but I hope you’ll forgive me anyway. Since we’re talking about really small sums of money (under $10 total) I figure it’s probably not too offensive a topic. I also personally find the topic of blogging income fascinating, and I’m sure someone out there does as well. I pledge to be as open and honest on this topic as I can, as allowed by the terms of service that I’ve agreed to.

Six months ago, on my one year blogiversary, I mentioned my pipe dream of one day supplementing my grad school income with revenue from my blog. To be honest, I never did and still don’t expect it to happen. At that time, however, I had already earned a dollar. Now that I have a few more data points, the math geek in me couldn’t help but crunch some numbers.

The first thing I did was plot my total revenue and fit a trend line.


That’s a quadratic trend line with an R² of .947 (which is statistical speak for the trend line matches the data fairly well).

Using the trend line I see it will take 24 years before I have a profitable year, meaning in 24 years I can expect the ad revenue to equal the cost of the webhosting and domain purchase for just one year. It’ll take an additional 25 years until I’m out of the red completely. Yes, it will take a predicted 50 years from the day I first installed WordPress before I recoup the costs of blogging. Guess I won’t break out the champagne anytime soon.

On the other hand, in just 615 years I’ll be a millionaire, and in 861 years I’ll be a multimillionaire.
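For anyone who wants to try this kind of projection, here is a minimal sketch of a quadratic least-squares fit, its R², and the "how long until I reach X" calculation. It uses only the Python standard library. The function names are invented for the example, and the monthly figures are made-up placeholder values, not my actual revenue.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y*x^0..x^2
    # Augmented 3x3 system for the unknowns (a, b, c)
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def r_squared(xs, ys, coeffs):
    """Share of the variance in ys explained by the fitted curve."""
    a, b, c = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def months_until(target, coeffs):
    """Positive root of a*x^2 + b*x + c = target (assumes a > 0)."""
    a, b, c = coeffs
    disc = b * b - 4 * a * (c - target)
    return (-b + math.sqrt(disc)) / (2 * a)


# Placeholder data: cumulative revenue in dollars for months 1..18.
months = list(range(1, 19))
revenue = [0.005 * m * m + 0.01 * m for m in months]

coeffs = fit_quadratic(months, revenue)
print("R^2:", round(r_squared(months, revenue, coeffs), 3))
print("Years to $1,000,000:", round(months_until(1_000_000, coeffs) / 12))
```

Because the trend line is quadratic, the time to reach a target grows with the square root of the target, which is why the jump from the first million to multimillionaire is much shorter than the wait for the first million.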

Since the amount of revenue is so small, I really can’t do much analysis of what types of content generate the most revenue for this blog. I can see that my visitors appear to be mostly internet searchers looking for answers to questions or general information, and not specifically for me. The most common search key words and phrases that lead to my blog include ‘do it yourself’ and ‘diy’.

Term cloud of keyword searches for my blog last month

In terms of content, the do it yourself maternity photography posts seem to be the most popular, although they will soon be eclipsed by the newborn photography posts. According to Google Analytics, the maternity pages have the lowest bounce and exit rate of any of the popular entrance points. Visitors who view those posts are more likely to also read other posts on my blog. Dare I hope that this means some of you out there like my photography? Or at least prefer it to my other content?

While I don’t have much in the way of repeat visitors, the traffic to my blog is increasing. The monthly number of visitors has increased 226% over this time last year. (Of course there were no maternity nor newborn DIY photography posts at this time last year!)

I will have to find more adorable subjects to photograph so that I can retire before I’m 640ish. Either that, or convince Domingo we want a really large family.

April 3, 2012

Crooked Perspective

I have discovered a not cool interaction between my blogging software (WordPress), smart phone (iPhone) and camera (Nikon DSLR).

The Nikon, like most cameras, has an internal leveler, and can tell which way the camera is oriented to take a portrait rather than a landscape image. Along with other bits of metadata, the camera sets an orientation flag for each image. Think of the metadata as additional information about the file. The actual data in the image file, however, is still stored as a landscape, since that is what the image sensor ‘sees’. It’s like if you tilt your head and look at a glass of water. The glass appears like it’s on its side to you, but in your mind you know your head is tilted, not the glass, which is why the water isn’t spilling out.

My computer, as do most computers, renders the image according to the data in the image file. As a result, the image appears as a landscape, regardless of how the camera was oriented. I then use image editing software to rotate the image (effectively re-arranging the pixel data). The orientation flag in the metadata remains unchanged. WordPress also ignores the orientation flag. So the uploaded image appears on my blog the same way it appears on my computer.

The iPhone, however, tries to be smart. Since the orientation flag is still set, it assumes the image needs to be rotated again to display correctly. As a result, the image appears rotated an extra 90 degrees on my iPhone. But only on the iPhone, so I didn’t discover the problem until recently!
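The double rotation is easier to see with a toy example. The sketch below is an illustration only: it stands in a tiny grid of numbers for the pixel data and models the EXIF Orientation values I'm confident about (1 means display as stored, 3 means rotate 180 degrees, 6 means rotate 90 degrees clockwise, 8 means rotate 270 degrees clockwise). The function names and the `honor_flag` switch are made up for the example; no real image library is involved.

```python
def rotate_cw(pixels):
    """Rotate a grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


# Clockwise quarter turns needed to display common EXIF Orientation values.
QUARTER_TURNS = {1: 0, 3: 2, 6: 1, 8: 3}


def display(pixels, orientation, honor_flag):
    """Return the grid as a viewer would show it."""
    turns = QUARTER_TURNS.get(orientation, 0) if honor_flag else 0
    for _ in range(turns):
        pixels = rotate_cw(pixels)
    return pixels


# The sensor always records landscape data; the camera sets the flag to 6.
sensor = [[1, 2, 3],
          [4, 5, 6]]

# The editor rotates the pixel data itself but leaves the flag untouched.
edited = rotate_cw(sensor)

print(display(edited, 6, honor_flag=False))  # viewer that ignores the flag
print(display(edited, 6, honor_flag=True))   # viewer that rotates a second time
print(display(edited, 1, honor_flag=True))   # flag reset, so no extra rotation
```

The first and last calls show the portrait image as intended, while the middle one ends up back in landscape. That middle case is the sideways photo on the phone.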

The only way to fix it is to strip the metadata so the orientation flag isn’t set, but that means going back over all my past entries and uploading a new photo for all the crooked ones. Not Cool.

To be honest, I’m surprised there isn’t an easier way to strip the image metadata. Aside from the orientation issue, metadata can include GPS location information. It’s handy for figuring out what your photos are of years after the fact, but if you upload an image with geographical information, someone can figure out where you’ve been. You can view the geolocation data in images online for yourself.

Metadata does have its uses. Some photographers like to store copyright information in the metadata. Camera manufacturers and image processing software also like to add their mark to the metadata, as a form of free advertising for anyone looking to see how an image was created. For me, though, I wish there was an easy way to get rid of it.