Tuesday, September 18, 2012

How to: Side-loading a pdf into your Android Kindle app

This took me a while to figure out, so I thought a blog post on the topic might help other people with similar issues.

Let's say you have a pdf or other file that you want to read on your Android phone in the Kindle app. Admittedly, there are several other applications in the Android eco-system that are arguably better for reading pdfs (e.g. Aldiko, Cool Reader, etc.), but let's posit that this is what you want to do for the sake of argument. (If you already use the Kindle app on your phone to read ebooks, using it for your personal documents as well gives you uniformity of interface and convenience of use.) Currently, Amazon requires that, in order to access personal documents on the Android platform, you first email them to your <email>@free.kindle.com address, so they can be converted and made available to you from within the Android ecosystem. This is nice and somewhat convenient, but has two major (in my view) drawbacks:
1. you need to load your personal documents into the Amazon cloud, and manage them within the cloud using the Amazon interface thereafter,
2. you need a wifi or other network connection instead of simply transferring the document from your PC to your Android phone.

A rather convenient means exists to bypass the Amazon ecosystem while still using the Kindle app to read self-generated pdf documents, like office memos, that you do not necessarily want to entrust to the cloud. To do so, follow the steps below:
1. download and install an Android application like FileExpert. This lets you navigate the file system on your phone and locate files of interest.
2. connect your Android phone to your computer, and turn on access to the phone's internal storage via USB.
3. create a folder on the phone (preferably on its SD card) to store the files you want to transfer, and copy the files into that folder over USB.
4. Disconnect your Android phone from the computer. Next, run the FileExpert (or equivalent) application, navigate to the folder where you stored your files, and tap the file you want to read. The OS asks which application you would like to use to open the file; select the Kindle application, and you're set.

Wednesday, June 20, 2012

How (not) to conduct a Case Interview

There are several books that advise students and busy professionals on how to prepare for a Case Interview. An excellent one is Case In Point. However, having conducted Case Interviews and been a candidate in them many times over the years, I note that many interviewers lack skill in conducting them. This is sad, often embarrassing, and frequently a disservice to the organization the interviewer works for. A poorly conducted case interview can hurt an otherwise worthy candidate's career prospects, and cost your company an attractive hire. If a Case Interviewer does a bad job, it is unlikely she will let the candidate advance even if he performed well given the constraints of the interview; the interviewer's ego will simply not permit it. And the candidate's protests are worthless, since he will be viewed as incompetent and disgruntled because he did not advance.

Of course, the most typical Case Interview setting remains one where a prepared interviewer meets a less-well-prepared candidate; both sides arriving prepared is the exception rather than the rule.


What happens when an unprepared Case Interviewer meets an unprepared Case Interviewee? This falls into the realm of questions like, "when a tree falls in a forest with no one there to hear it, does it make a sound?" And we ignore it in this post.


So what are some key "rules of thumb" for Case Interviewers? We present a brief, non-exhaustive list in what follows:



  1. Do your homework. Know the case question you are asking. Know it well. Know the context/setting, the problem, possible solutions, and paths to get there. Most importantly, know what "a correct answer" looks like.
  2. Know what direction(s) you are willing to let the candidate take, guide him in those directions. Do not perform this function haphazardly, pushing one way and then another, because you yourself do not know how one should proceed from start to finish.
  3. Scope the problem beforehand. You can have multiple objectives, but you must decouple them and ask the candidate to solve each piece first before going on to the next. Do not cut back and forth between multiple questions. This confuses both yourself and the candidate.
  4. Know what questions you expect the candidate to ask, what data you want to present them when they do, and how you want to respond if they ask questions you did not anticipate.
  5. Be reasonable with the case problems you make up. There has to be a way to get from your question or premise to an acceptable answer. Well prepared candidates can crack reasonable questions in half the time you think needs to be allotted. By the same token, if the question is poorly defined, or you base it on some vague idea in your mind, no candidate will be able to finish it in time.
  6. Know what assumptions make sense, and which ones you are willing to accept. If a prepared candidate asks, "I am assuming there are 300M people in the US, and the average life-span in the US is around 80 years - is that reasonable?", feel free to say no and ask for clarification. But once you say yes and the candidate does all the math based on these assumptions, don't backtrack 15 minutes into the problem and tell him the base assumptions are incorrect.
  7. Last, but most important, do not penalize the candidate for your lack of preparation. If the "answer" the candidate comes up with, based on assumptions you approved earlier, does not look right, gently ask the candidate to redo the calculations with more reasonable assumptions, and acknowledge that the assumptions you approved were off.
The above are only some ideas on how prepared interviewers might conduct a Case Interview. Note however that there are exceptions to every rule. An interviewer is justified in breaking one or more of the above rules if she is conducting a "confrontational" or "hostile" interview to test the candidate's mettle in a final round. These are designed to see how the candidate handles pressure, uncertainty and difficult situations. But such interview scenarios evolving naturally (instead of deliberately) due to lack of preparation is unforgivable. Good interviewees spend many weeks of effort mastering the Case Interview technique. Hurting someone's career potential because you are unprepared is criminal. Do not conduct Case Interviews if unprepared.



Reverse Inboxes

In this post, the author examines his social network by analyzing his email. When both the Sent and Inbox folders are considered together, email headers give a strong indication of who sends you email and whom you send email to. Taken together with the number of times a particular unique identifier appears in the To:, CC: or BCC: fields of the messages - entries in these three fields could perhaps be weighted differently - this gives a pretty good indication of the strength of one's professional or personal relationships with other people.

We can fine-tune the mechanical classification described in the previous paragraph a little further by performing sentiment analysis or other text mining to determine a. whether the content is of a personal or a professional nature, and b. whether the message is positive, negative or neutral in terms of expressed sentiment. These signals together can be used to construct a network or graph of the relationships between the various people in one's communication sphere.

One can, if one performs this exercise with various people's email boxes, also determine if one person's network is more dense and more completely connected than another's and whether the span or expanse of one person's social network is greater than another's. This might also permit us to differentiate between people's personal and professional networks, and the set of people (or nodes) that straddle the two domains.

We posit that it is possible to have a nodding acquaintance with lots of people, and know some people really well - it's hard to know many people really well - and this is something we can confirm by performing this "reverse inbox analysis" for various people's inboxes.

To perform this analysis while maintaining the author's privacy, rather than presenting the email addresses of individual users from the author's mailbox, we use an MD5 hash of each email address concatenated with a fixed random string. This hash value is repeatable yet opaque, so it provides an individualized marker per email address without revealing the identity embedded in the address.
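
A minimal sketch of just this anonymization step might look like the Python snippet below; the salt string and the example addresses are hypothetical, and parsing the mailbox itself is left to the fuller implementation promised further down.

import hashlib

SALT = "some-long-fixed-random-string"  # hypothetical salt; in practice a long random value kept private

def anonymize(addr):
    # concatenate the normalized address with the fixed salt and hash the result;
    # the same address always maps to the same marker, but the address itself
    # cannot be read back out of the marker
    return hashlib.md5((addr.strip().lower() + SALT).encode("utf-8")).hexdigest()

# both calls below print the same opaque 32-character marker
print(anonymize("alice@example.com"))
print(anonymize("Alice@Example.COM "))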

If one were to compute the reverse-inbox-based social networks for different users, and then connect them together, a fairly complete view of the organization's or community's social network as a whole would emerge. We discuss this in a related post, "Mining your Social Network". We content ourselves with providing a simple reverse inbox implementation in this post.

[code to follow shortly]

Saturday, June 2, 2012

The Itsy Bitsy Spider

In nature, spiders are beautiful things. Yes, they can be classified as creepy when we look at particular specimens, with their long spindly legs and abrupt, quick movements. But viewed in the abstract, they do some remarkable things - for one, the fibers they emit to build spider-webs are considered stronger and more resilient than steel fibers of the same diameter. But in this blog post we talk about a different kind of spider altogether.

The WWW is a mesh or web of hyperlinks. What better creature to crawl this abstract space and extract information from this structure of data and knowledge than a spider - and indeed, programs that perform this function are called spiders. In this post, we present a simple Python spider program that crawls the web. (a snake and a spider in one sentence ...)

Let us look at some applications of this kind of a program.

  1. let us say you are trying to scrape all the content that hangs off a single web-page across a set of hyperlinks. One way of gathering all this content would be to write a program that iteratively processes each link on the page and collects the data it reads, organizing it into separate files. Freeware programs like HTTrack, the website copier, do just this. But this is a severely limited spider, since it is restricted to links that either a. hang off the specified web-page, or b. are hosted by a particular domain or sub-domain of the internet, e.g. www.my-company.com
  2. the more exotic application - document retrieval for web search. For Google to be able to respond to search queries, it must first build an indexed data set of various documents from the web. To do this, Google unleashes multiple multi-threaded spiders onto the web, and they collect all documents it makes sense to collect (using some intelligent criteria), following links from one document to the next, until a large part of the web is gathered. Google then indexes these documents to construct a data set that can be rapidly queried to respond to user queries.
  3. Avinash Kaushik, in his excellent books, distinguishes between internal search (as in, within a company's intranet, for employees and other internal users) and external search applications, with a focus on web analytics. The two perhaps merit separate discussion even in this simple classification, which is why we include this bullet here. However, the mechanics of (3) are very similar to those of (2) from a web-crawler perspective, with some of the limitations from (1) imposed on the spiders in question.
  4. desktop search - except here the spider crawls through the directory tree structure of one's file system instead of the Internet, but the operational principle is the same.
We focus our implementation on a very simple instance of (2) above. Also, there is etiquette one needs to follow when building spiders (e.g. the robots exclusion protocol); since we are not building an industrial-strength implementation but one just for fun, we gloss over these details and "hobble" our implementation, constraining it to explore no more than a pre-set maximum number of links before it halts.

Also, for the layperson reading this: while it is somewhat glamorous to visualize this process as the spider crawling the WWW, what really happens, of course, is that the spider program stays resident on the host computer running it; it simply downloads html documents from the specified starting site, follows hyperlinks wherever they may lead, downloads those other documents, ... and so on.

A comparison of how many web-pages were indexed by the various web search engines shows that, up to a point, increasing the index size improves search quality. We say "up to a point" because a web search engine called Cuil (2008-2010) indexed slightly over 120B web-pages at one time, but an inability to produce search results of comparable quality to the leading search engines led to its early demise. To be a popular web search destination, it is not enough to index as many meaningful, non-repeating, non-machine-generated, non-interstitial pages from the web as possible; it is also important to return results that provide the kinds of information the user is looking for.

Code is below. Sample output follows the code.


import os, sys, urllib2, re; 


u=sys.argv[1]; # the url to start with
lynx=[u]; # store the start url into the download set
for u in lynx: #process each link in turn
 r=urllib2.urlopen(u); # get the requested object
 x=r.read(); # read the document from the website
 xlines=x.split(" "); # split the document on spaces into crude tokens
 c=[i for i in xlines if i.lower().find("http://")>-1]; # collect all hyperlinks
 for i in c: # process each hyperlink, ignoring some words
  if i.find("CALLOUT|")>-1: continue;
  stpos=i.index("http://");
  if i.find(".html")==-1: continue;
  if i.find("\n")>-1: continue;
  enpos=[]; # positions where ".html" appears in this token, i.e. candidate url end-points
  enpos+=[k.start() for k in re.finditer(".html",i)];
  #print stpos,enpos;
  enpos=[j for j in enpos if j>stpos+13];
  if len(enpos)==0: continue;
  enpos=min(enpos);
  candidate=i[stpos:enpos]+".html"; # store retrieved urls
  if candidate not in lynx: lynx+=[candidate];
 if len(lynx)>100: break; # hobble the spider so it doesn't go haywire


g=open("t.txt","w"); # write crawled urls to file
for i in lynx: g.write(i+"\n");
g.close();




Sunday, May 20, 2012

On Robust Hyperlinks

As the Internet grows and the WWW expands with it, users are becoming more savvy - young people today do not even have a memory of a time when it wasn't as prevalent. Savvy users tend to be more demanding of services they use. Also, as web servers become fronts for more critical business applications (e.g. store-fronts), there is the potential for revenue leakage if errors like a 404 Not Found message are returned in responses to user requests. In addition, many a time one comes across dangling hyperlinks (that point to void) which make the browsing experience less pleasant than it could otherwise have been. These kinds of scenarios make the case for Robust Hyperlinks [1].

What are they?
Robust hyperlinks are hyperlinks or URLs that carry a few extra words relating to the underlying web-page's content, say separated by dashes. (There are lots of FAT URLs floating around encoding all kinds of parameters in all kinds of applications, so why frown on the use of a few words to make hyperlinks more robust?)

Why use them?
Sunny Day Scenarios
Now, when the client (a user sitting at a browser) requests a page by clicking on the text label of a robust hyperlink, things work as usual: the html document associated with the hyperlink is downloaded and rendered in the user's browser. So far so good, and it appears all that extra work was for naught.

Rainy Day Scenarios
What happens if Bob, the web-master for the site, carelessly updated the web-page and left that hyperlink dangling (e.g. by moving the underlying document to a different directory, or renaming it)? The user at the browser receives an ugly "404 Not Found" or similar message from the server, and potentially has to start looking for other sources of information, for example by Googling for it.

Robust hyperlinks raise the question: is this second user step really necessary? If the client - say Google Chrome, Internet Explorer, Mozilla Firefox, Opera, or whatever other browser the user uses - were able to:
1. read the server-returned error message,
2. parse it,
3. extract the five "keywords" from the request tied to that interaction,
4. query a Web Search engine for similar documents, and
5. finally present the results of such search queries back to the user seamlessly, with a note indicating the original request was not fulfilled,
would that not make for a much nicer, more seamless user experience?

Alice, who made the original html document request, now has multiple links to choose from, all similar in content to the original hyperlink she clicked on.
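
As an illustration only, here is a minimal sketch of the keyword-extraction and query-formulation steps of that fallback. It assumes the robust hyperlink carries its keywords at the end of the path, separated by dashes, and the search engine address is a placeholder rather than a real endpoint.

try:
    from urllib.parse import urlsplit, quote_plus   # Python 3
except ImportError:
    from urlparse import urlsplit                    # Python 2
    from urllib import quote_plus

def keywords_from_robust_link(url, n=5):
    # assume the link ends with its descriptive words joined by dashes, e.g.
    # http://example.com/docs/report-robust-hyperlink-lexical-signature-keywords.html
    slug = urlsplit(url).path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return slug.split("-")[-n:]

def fallback_query_url(url, n=5):
    # after a 404, the browser could fetch this query instead of giving up;
    # the search engine address below is purely a stand-in
    terms = keywords_from_robust_link(url, n)
    return "http://search.example.com/?q=" + quote_plus(" ".join(terms))

dead_link = "http://example.com/docs/report-robust-hyperlink-lexical-signature-keywords.html"
print(fallback_query_url(dead_link))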

How to get them to work?
In previous posts, we explored the notion of TF-IDF (term frequency / inverse document frequency), a mechanism that lets us determine statistically improbable keywords or phrases in a particular document given a corpus or body of documents. Suppose the keywords for a robust hyperlink are the document's top five statistically improbable phrases, as ranked by TF-IDF against a corpus of general web-pages (yes, this takes some work on the part of the organization hosting the web-pages, but the cost is amortized across the large number of hyperlinks they host, so the incremental cost per link is likely minimal). This set of keywords is then a statistically relevant expression of what makes the web-page more unique, or less general, than most of the other web-pages out there.
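
A minimal sketch of that keyword selection follows, using a toy in-memory corpus and a plain TF-IDF score; the corpus, the tokenizer, and the way the words are appended to the URL are all assumptions made for illustration.

import math, re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(doc, corpus, n=5):
    # score each term by term frequency times inverse document frequency
    # over the corpus, and keep the n highest-scoring terms
    tf = Counter(tokenize(doc))
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in set(tokenize(d)))
        scores[term] = count * math.log(float(len(corpus)) / (1 + df))
    return sorted(scores, key=scores.get, reverse=True)[:n]

def robust_url(base_url, doc, corpus):
    # append the five-word lexical signature to the URL, separated by dashes
    return base_url + "-" + "-".join(top_keywords(doc, corpus))

corpus = ["a general page about shoes", "another page about the weather",
          "yet another page about shoes and the weather"]
page = "robust hyperlinks add a short lexical signature of improbable words to a url"
print(robust_url("http://example.com/notes/robust", page, corpus))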

What this also means is that if the client were to reformulate a query using these five keywords as search terms in a Web Search engine, the documents returned are likely to be good matches for the content Alice wants to consume.

Why five keywords? The authors who proposed the scheme in the paper listed in the references performed empirical tests, and five turned out to be a sweet spot between URLs that are too long or cumbersome and URLs too sparse to anchor the search operation that follows.

What's in it for me?
The organization hosting the original URL might argue against doing all this extra work, at a cost of an additional fraction of a cent per hyperlink, if it only serves to drive web-traffic away, potentially to competitor sites. But a small incremental investment on the server side could have the server conduct an internal search of the hosting organization's own web-space to locate documents that might substitute for the now-broken link. This way, the robust hyperlink implementation also bolsters the business case for the hosting organization's added spending.

Issues with Trusting Trust (with apologies to Ken Thompson)
For this scheme to work, one must necessarily trust the people who build the hyperlink to encode it only with words that are truly reflective of the content. This is also true today, but since links today are not automatically parsed by browser add-ons to query for additional content, there is a lower likelihood that an innocent click on a dead link leads the user to a page of links that are irrelevant, offensive or plain undesirable. This happens mostly because users do not typically read the structure of the link they click on. Perhaps over time, if robust hyperlinks really take off in a big way, intermediation services that verify the keywords and their relevance to the content of the underlying documents will become more prevalent. Such services could help safeguard users' browsing experiences.

A second issue might be one of a hijacked search. If Alice starts with a certain set of search keywords that led her to the original hyperlink, she may be led off track by the keyword searches that result from the set of hyperlinks associated with the first failed document retrieval. To ensure usability considerations like this are addressed, it might make sense for the browser based implementation to give Alice a clearer sense of when a search query was done on her behalf using robust hyperlink keywords, so she can manage her Internet searches more effectively.

Examples:
The authors of the paper in the references provide several complete examples with nice, readable descriptions here.

References:
[1] Thomas Phelps and Robert Wilensky, UC Berkeley, http://www.eecs.berkeley.edu/Pubs/TechRpts/2000/CSD-00-1091.pdf

Friday, May 11, 2012

Mining your Social Network

Stream of Consciousness Ramblings on Social Media...

On Being Human
No sentient being exists in isolation. Beings coexist together with others of their kind drawing on a sense of community, with the potential to work together to build things bigger than any one among them could even conceive of, alone. And perhaps this characteristic isn't so much restricted to just sentient beings. Are ants sentient? Bees? Both have fairly well-developed colonies.

Some leading technology companies [citation needed] work on the notion of Hive Intelligence - where reconfigurable sub-components of a system can combine in different ways to best adapt themselves to solving problems in fairly different settings. Artificial intelligence has developed to a point where small adaptive robots can collaborate and configure themselves into a larger automaton to solve new and difficult problems. Adaptive machines are fascinating - the Borg were arguably the most interesting race in Star Trek: The Next Generation. But we digress... What makes societies and civilizations interesting is mostly the collected knowledge of the species - knowledge that is accumulated and grown over millennia - an idea nicely discussed in the H.G. Wells novel "Christina Alberta's Father".

Humans are among the most complex of beings to ever exist. Environmental adaptation is almost second nature to us. Unless we are completely asocial, as some of our species sometimes are, we build networks of relationships wherever we go. And these networks evolve over time. In this post, we examine one aspect of the power of social networks, from the standpoint of network establishment, evolution, degradation, and regeneration, with a view to studying the power of this medium, and why it is perhaps unlikely that a single social networking platform will continue to dominate the global conversation.

The Times, ... They Are A Changin'....
"Know yourself and your enemy. You need not then fear the result of a thousand battles."
                                                                 -- Sun Tzu, "The Art of War"

People born into the digital age grow up in a world where email, mobile phones, and texting are commonplace. This world also serves as an interesting laboratory to study (social) network evolution. How do relationships change over time? Mining historical email data can provide valuable clues. What bonds were stronger at what age? How does the relative strength of these bonds change? How do the age of the individual, time (which generation the individual belongs to), and place (geography, and the associated cultural context) define which bonds are "important", which media are used for which aspects of social interaction, and what, and how much, data is shared?

The Lonely Network
In social network terms, every human being is a DG (a Directed Graph from Graph Theory, not necessarily Acyclic in this case), with a node representing themselves at the center and other nodes representing their "friends". Social media connect these DGs together, "decentralizing" or democratizing them. But we can of course learn a lot from even a single individual's graph, and studying that is perhaps much easier anyway.

In related posts, we start our exploration of this idea. In the post titled "The Reverse Inbox", we present a simple prototype that de-constructs a sample gmail mailbox, uniquely anonymizes every contact to protect the innocent, and then presents a view of the mailbox owner's social network. An advantage we have with experiments of this nature is that, given the large amounts of storage now available on free Internet email services, no one ever has to delete their email, so histories going back several years can easily be extracted even from such a simple exercise.

We can see what links are important, in which directions, mine the network to infer the relative strengths of people's relationships, and construct a snail trail graph showing their evolution - the strengthening and weakening of different links, over time. We hypothesize that links tied to family, close friends, and co-workers grow stronger, with the latter fading as one moves from one job or career to the next.

When these lonely networks are joined however, the decentralizing process adds a new capability. Nodes that were hitherto unconnected in each individual's network now have paths between them. This is a concrete example of Metcalfe's Law in action: "The value of a network is proportional to the square of the number of nodes within it". You are node A, you want to get to node C, but don't know the way. The network tells you how to reach C via B, D, ...

How Social Are Social Media?
This also brings us to another idea. In Facebook and LinkedIn, all links between nodes (i.e. people) are bi-directional by default. In other words, friends are only friends if both people acknowledge the relationship, even if one does this somewhat reluctantly in some cases. Social networking seems to have misplaced the idea of "polite blocking" that was widely supported and prevalent in the Instant Messaging space. Twitter is somewhat different though, with people collecting followers like a non-rolling stone gathers moss, without necessarily pro-actively accepting each new follower into the fold. Arguably, these lead to different social dynamics.

Even more interesting is the fact that in Facebook it is relatively easy to "unfriend" someone - break the link. Can't do this so easily in LinkedIn - why, you might even need to call the company to get someone off your network. So connections on LinkedIn are perhaps worth more than those on Facebook - harder to break must mean a higher standard to forge - or people will eventually migrate to this behavior. For Twitter, we have the notion of "Social Capital": the number of one's followers. Links are cheap for individuals, but have greater value to the people they follow - since they contribute to social capital. Perhaps there is a business model here - celebrities, products, or groups being followed can set a limit to the number of links they will accept, based on certain criteria, and monetize the power of social networking per link accepted. If nothing else, perceived scarcity increases perceived notional value. (see other blog post on "Monetizing Internet Services")

A Heaven for Dead Networks?
What happens when networks die? What is death exactly? Without waxing philosophical, we can learn some lessons from examples of "death" in the social space. Two examples that leap to mind here are Friendster (now seeing a revival in Asia?) and MySpace. But to die, one has to be alive first.

Networks gain a measure of life through their expanse (number of nodes) and connected-ness (number of links). Smaller, strongly linked networks are more cohesive and perhaps provide better "value" to their constituent nodes than larger, more sparsely connected ones. Some combined measure of these two metrics would define the value that individuals connected on the network can derive from their association with it. It also stands to reason that perhaps each link in the network DG should also be weighted according to its importance given some of the discussion in the previous section.
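
As a rough illustration of those two metrics, the sketch below computes expanse, link count, density, and a naive combined "value" score for a toy weighted directed graph; the scoring formula is an assumption made for illustration, not a standard measure.

def network_metrics(edges, weights=None):
    # edges: dict mapping each node to the list of nodes it links to
    # weights: optional dict mapping (source, target) pairs to link strengths
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    links = [(s, t) for s in edges for t in edges[s]]
    possible = len(nodes) * (len(nodes) - 1)       # directed graph, no self-loops
    density = float(len(links)) / possible if possible else 0.0
    weighted = sum((weights or {}).get(l, 1.0) for l in links)
    # naive combined score: expanse times weighted connectedness
    return {"expanse": len(nodes), "links": len(links),
            "density": density, "value": len(nodes) * weighted}

toy = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": []}
print(network_metrics(toy, weights={("alice", "bob"): 2.0}))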

To complete the anthropomorphic comparison: As the total metric for a network grows, it springs to life. As the rate of change of these metrics increases, the network transitions into adolescence and adulthood. As growth slows, it transitions to maturity and old age, and as nodes leave and links dangle or get deleted, the total metric decreases, and the network slowly atrophies and then dies.

Dr. Jure Leskovec at Stanford University does some interesting work with social networks including the study of the diffusion of information e.g. viral marketing and inferring social connectedness from the spread of disease within a population. His very interesting talk is here.

The Fountain of Eternal Youth
So how to stay younger longer? Since value builds from network effects and positive externalities, the best way to retain users is to a. grow value through "sticky" network services, b. evolve individual users' view of their immediate "vicinity" in their social graph to meet their social needs, and c. make it harder for them to leave (i.e. increase switching costs).

As an example, Facebook does (or tries to do) all three: (a) by building its network as a social platform on which other companies, like gaming firm Zynga, can build and host interactive multi-player games, along with Facebook credits - its own currency, etc.; (b) by enabling support for various kinds of interactions (professional vs. college vs. ...); and (c) by holding on to users' data and encouraging them to share more through features like "timeline".

As more users sign on, unless they can move taking their networks with them, it gets harder and harder for them to go elsewhere, especially since they have invested a lot of time building up their "presence" in the social graph, and all the connections they have established. This ensures survival of the network for a while - at least until the majority tire of it.

Forward the Federation
By no means is the battle for supremacy in this arena over. There are features Facebook lacks. Google+, Path and others are building these into their core DNA rather than as adaptations or mutations into an existing animal, and it will be interesting to see what the future brings.

At 900M users and counting, Facebook is unlikely to gain too many new users, given that almost 1 in every 6 people on the planet is already on it, and, to a large extent, geography dictates social network choice (there are other social networks growing faster in Asia). Perhaps the next step in the evolution of social network technology is "federated social networking", but we leave that for discussion on another day.

If federated social networking becomes a reality, however, the impact of "stickiness" is reduced, the walls between networks become more porous, and a new dynamic will come into play.

Monday, April 16, 2012

Monetizing Internet Services

In this post we start out with a brief discussion of a simple taxonomy of service delivery models for Internet-based applications and then go on to discuss different business models that can be used to support these services. Finally, we put these together to discuss how the different kinds of Internet services may be monetized. The goal is to present several examples to drive the points home and make the discussion useful in a practical, real-world, setting.


Classes of Services
Services may be classified in different ways using different criteria. Here, we explore a couple of simple ones, and use these as a basis for building ideas that follow.


By Underlying Protocol or "Connectedness"
Different services have different characteristics and rely on different protocols. Looking at things from a protocol perspective: some use best-effort routing underneath, via UDP, where the loss of a small number of packets does not significantly degrade the user experience. Others use a protocol (TCP) that layers timers, packet sequencing, and a re-transmit mechanism on top of the underlying pipes to ensure a higher-quality user experience, albeit with greater overhead. And new protocols are constantly in the works, crafted over time by hard-working engineers at the IETF and the IRTF.


"Sessions"?
From a more pedestrian standpoint, we may classify services in simpler ways - those that require a "session" vs. those that do not. For instance, voice and video calls require a session over which to transmit and receive data, while a simple web download does not (though strictly speaking, this latter service is provided over a TCP "connection" which we distinguish from the notion of session in this case). However, websites may utilize other mechanisms like cookies, web bugs etc to track different interactions between a user and a website, and construct a session context from these.


Modes of Interaction
Some services follow the "client server" paradigm. A user, sitting at a client machine, makes a request that his browser connects to a server on the Internet to fulfil, downloading data onto the user's screen in the process. 


Sometimes, the user (and his client) may be in a sheltered network like his company's intranet at work, and the server may be in a different company's network, so there are intermediate "intelligent" nodes that the user's request, and the server's response, must transit between the two endpoints.


Other services, such as the file-sharing ones hosted via Torrents, are "peer-to-peer" services where computers are treated somewhat more democratically as equals, and these "equal" nodes are able to connect and communicate once they find each other using a network-provided "directory".


Yet other services follow the "interception" or "intermediation" paradigm. In these cases, the server providing the service interjects service related information into the communication stream. This can be done effectively at the boundary between two domains at a server that all packets must transit, assuming the communication between end-points is not encrypted (if it is, there is no obvious way for interception-style services to work in those settings).


This taxonomy will be useful as we examine how we might offer, and bill for, Internet services.


How to generate revenue?
Lord Kelvin is reported to have said, "[When] you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind [...]".


To bill someone for something, unless you intend to collect a flat fee for unlimited access on a per-time-period basis, you need to be able to measure use. But how does one go about doing this? This is where event logging, accounting, and security controls come into play. Let us now briefly look at these three ideas.


Event logging refers to the process by which we gather statistics for various pre-defined network occurrences or events. You may want to collect information for a wider variety of events than the set you need to bill for the service because billing models may change over time, and you never know what information you might need to validate your current billing model, or perhaps even design a new one. Having more data at hand is usually (though not always) better than having less. Of course, collecting more data puts more of a burden on the service and might impact performance adversely. So there is a trade-off here (so typical in engineering).


Accounting or billing (not strictly synonymous, though we use them inter-changeably here) is the process that tells you what to do with the data you have so painstakingly collected. Collating the data together, and generating bills falls into this bucket.


Security controls are needed from a billing perspective: authentication, authorization and accounting functions (sometimes referred to as AAA) ensure that services are provided to parties who are who they say they are and who have authorized the use of the service, and that they can then be billed with non-repudiation guarantees, i.e. they cannot later claim the service was delivered to someone else or that it was not delivered to them at all. These functions are supported in different systems in various ways.
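
To make the logging and accounting pieces a little more concrete, here is a minimal sketch that records usage events and rolls them up into per-user charges; the event fields, event types, and per-unit rates are assumptions for illustration, not a description of any real billing system.

import time
from collections import defaultdict

EVENT_LOG = []  # in a real service this would be durable, audited storage

def log_event(user, event_type, units=1):
    # event logging: record more than you strictly need today,
    # since billing models tend to change over time
    EVENT_LOG.append({"ts": time.time(), "user": user,
                      "type": event_type, "units": units})

def generate_bills(rates):
    # accounting: collate the logged events and price them per event type
    bills = defaultdict(float)
    for e in EVENT_LOG:
        bills[e["user"]] += e["units"] * rates.get(e["type"], 0.0)
    return dict(bills)

log_event("alice", "download")
log_event("alice", "storage_gb", units=5)
log_event("bob", "download", units=3)
print(generate_bills({"download": 0.99, "storage_gb": 0.10}))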

... and the $$$ comes rolling in... (with examples)
The key to being able to generate revenue is to provide some service that adds value, get as large a user-base to sign up as possible, and then either bill the users themselves, or bill some third party that benefits from the user population that logs into your portal (think advertising revenue), or some combination thereof. Start-ups these days tend to build a multi-platform model (sometimes also called a multi-sided market): first give away the service for free to attract users, then attract advertisers to generate revenue, then change the provided service to a tiered structure where higher levels of service require payment from the user base as well. With the advent of cloud technologies, as user data gets stored more with the service provider (who can rent large amounts of storage space in the cloud from companies like Amazon), switching costs rise, making it more likely that users will stick with you, leaving you with a growing and captive user base. This last factor is a major contributor to the success of services like Facebook, Twitter, and LinkedIn.

In Social Media based services, users value the networks they build, but these are hosted on the Facebook servers and are not things the users can pick up and leave with. The same goes for media about the users that the users themselves, or others, inject into the system. Twitter users cannot take their followers to a competitor's system unless a big mass of other users moves with them as well. All these services are successful primarily because of network effects; Metcalfe's law ("the value of a network is proportional to the square of the number of users connected to the system") holds strong sway here. LinkedIn is a third example of a successful service - this one focused on somewhat more serious professional networking applications. One difference between LinkedIn and Facebook is that the former is somewhat more rigid: you can "unfriend" people on Facebook, but if you allow someone into your network on LinkedIn, you cannot take them out. This may have consequences for how these two services grow over time.


Does anyone remember Second Life from Linden Labs? They even used to have their own currency (called Linden Dollars) in the Virtual World. Somehow, Second Life went from all the rage a few years ago to a somewhat more quiet existence now. Guess there is a cycle to everything - turn, turn, turn.


We note some different ways in which mobile applications monetize value today. The first two models are obvious ones: 1. through advertising, leveraging platforms like AdMob, and 2. by selling downloads at, say, $0.99 each. Lately though, more and more apps are selling "usage rights": you can play the game so many times, or read so many documents with the software, for each $0.99 fee you pay. And periodically, free usage "turns" can be made available through special promotions or at first download, to gain popularity and drum up interest.


This is similar to the "revenue sharing" model telecom equipment vendors sometimes use: give away the platform (hardware) for free, then have the service provider pay you a certain percentage of each transaction they bill the subscriber for. It is also similar to how Apple works with application developers: you pay a fee to sell your app via the iTunes platform, and Apple keeps a percentage of the revenue from your downloads.

[This is a place-holder for a work in progress.]

Tuesday, February 28, 2012

Synchronicity

In mathematics, functions are beautiful things. They represent lines, curves, surfaces, and things in higher dimensional spaces that defy verbal description. With applications in areas that span from practical statistics to abstruse string theory, no one can argue their importance to daily life.

In programming, functions are just as important (I would have said "if not more important", but that seemed needlessly tautological). Interesting, and sometimes overlooked though, are synchronicity aspects of function invocation in programming. Now that we already have a post on location based services, we can leverage that to make our points here more clearly. So let's embark on this journey.

Most functions in computer science are synchronous. By this we mean: we build a function, we call it with zero, one or more arguments, and it returns a value. The code that invokes the function populates the arguments, makes the function invocation, waits patiently till the call completes, assigns the returned value to a program variable, and continues execution. Of course, depending on whether the arguments were passed by value or by reference, and whether any global variables were also used, etc., there may be a number of other "side-effects" of the function invocation on other aspects of program execution, but we blithely ignore these here for the sake of simplicity.

Synchronous function execution is thus, the simplest, and most intuitive way that functions, and yes sub-routines, and the like are used. But is that all functions can be used for? Let's study a couple of other models.

Enter the Asynchronous function call
When we look at complex APIs - such as, for example the OSA/Parlay APIs for mobile network application development, these contain function calls where what the function is asked to do cannot be done immediately. It is unreasonable to expect a program to wait for as long as a function might take to return in such cases. A way out of this dilemma might be to embed this function call in a separate thread from the main thread of program execution, and have that thread block on the function while the rest of the program goes on its merry way. An arguably more elegant approach is through the use of asynchronous methods.

So how do these work? Software platforms like CORBA and Java RMI (does anyone still use this anymore?) permit one to invoke a function and register a "call-back". The call-back is a reference to a function in the main program that the called function (typically run across the network on another machine somewhere), can invoke on it. So the logic flow proceeds something like this:

  1. main program A invokes function x on program B, passing along a call-back reference c.
  2. B returns an immediate return value, typically indicating whether the function call was well formed and whether it will be executed.
  3. the function x returns synchronously to A the immediate return value from (2) above.
  4. B continues executing the function, and when it has an update for A, passes this value to A by invoking call-back c. c returns immediately (i.e. c is synchronous to B)
  5. at this point, c is either still open, or is closed based on pre-determined call-back protocol design between A and B specified in the API.
All this might seem needlessly complicated. But it can be extremely useful. So we look at an example below.

Using Asynchronous Functions - An Example
Let's say you (an application) want periodic location updates for a mobile handset. You could poll the network for location every 30 seconds or so synchronously, and speed up the poll interval if the handset is traveling faster, and slow down the interval if it is slower-moving. But this puts the burden on the application to keep timers etc., A more elegant way to implement this would be to have the application register a call-back with the underlying network node saying: "here's the terminal whose location I need: 555-555-1111. send it to me every 30 sec at this address http://updatesRUs/5555551111x will you? till I tell you to stop, thanks. and if you think it's not really moving that much, feel free to slow down updates a bit"

And the app can continue about its business and process updates as it receives them. When the user tied to the app decides he has for example, reached his destination and does not need location information anymore, the app can send another message to the network saying "remember my request for 555-555-1111? cancel that now please". And the call-back address specified above disappears, so the network cannot post any more updates. Useful, isn't it?
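
A minimal sketch of this interaction is below, using a background thread to play the part of the network node; the function names, the shortened update interval, and the fake location values are assumptions made purely for illustration.

import threading, time, random

def register_location_updates(terminal, callback, interval=3.0):
    # the "network" side: accept the request and the call-back, return immediately,
    # then keep invoking the call-back until the request is cancelled
    stop = threading.Event()
    def worker():
        while not stop.is_set():
            fix = (40.0 + random.random(), -74.0 + random.random())  # fake lat/lon
            callback(terminal, fix)          # asynchronous update back to the app
            stop.wait(interval)
    threading.Thread(target=worker).start()
    return stop                              # the immediate, synchronous return value

def on_update(terminal, fix):
    # the application's call-back: return quickly so the "network" is not held up
    print("update for %s: %s" % (terminal, str(fix)))

cancel = register_location_updates("555-555-1111", on_update)
time.sleep(10)   # the app goes about its business while updates keep arriving
cancel.set()     # "remember my request? cancel that now please"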

From callbacks to "recruiting"
An erstwhile colleague of mine (the author thanks A. Sahuguet) once showed me another model called "recruiting" that can be even more useful. The way this works is, A invokes a method on B, B on C, and C gets back directly to A. In other words, B "recruits" C to respond to A's request. Typically, networked software implementations require the request/response flow to be A->B->C->B->A, but this is shortened to A->B->C->A in this scenario. Of course there are issues with security etc, and yes, there are ways to address them, but we do not cover those here. How one might set up asynchronous function calls to support a recruiting model is left as an exercise to the reader.

Hope this made fun reading! 

Fifty Nifty Pytools

In this post, we look at several simple but useful python code snippets. While we say "fifty" in the post title, it is hoped that over time the count will grow much larger.

[Snippet 0001] Particularly while working in Finance, one needs very strong date libraries and tools. Python gives us datetime which is excellent, and additional functionality can be built atop that. When one has to tie together data from excel spreadsheets, databases e.g. MySQL, and other sources and maintain a single data set that can be used for data-mining or back-testing, having good date functions at hand becomes critical. An example of such is our first snippet below. More date-related snippets will likely follow.

import os, sys;
from datetime import *;

def dtConv(x,s="OBJ"): # converts dates from one form to another
 def s2(n): # nested helper: returns str(n) for n >= 10, else the zero-padded form "0"+str(n)
  if n<10: return "0"+str(n);
  return str(n);

 # first, parse the input depending on type, collecting year, month, day as int
 # styles (s) supported: 
 # s="OBJ" return type is a date object. the default return type
 # s="TXT" return type is of the form "yyyymmdd" e.g. "20120131"
 # s="XL"  return type is of the form "m/d/yyyy" e.g. "2/3/2012"
 # s="XL0" return type is of the form "mm/dd/yyyy" e.g. "02/03/2012"
 # s="DB"  return type is of the form "yyyy-m-d" e.g. "2012-2-3"
 # s="DB0" return type is of the form "yyyy-mm-dd" e.g. "2012-02-03"
 if type(x)==date: y,m,d=x.year,x.month,x.day;
 else: 
  if x.count("/")==2: y,m,d=int(x.split("/")[2]),int(x.split("/")[0]),int(x.split("/")[1]);
  if x.count("-")==2: y,m,d=int(x.split("-")[0]),int(x.split("-")[1]),int(x.split("-")[2]); # split on "-" for DB forms
  if x.count("/")==0 and x.count("-")==0 and len(x)==8: y,m,d=int(x[:4]),int(x[4:6]),int(x[6:]);
  
 # next, we generate output in the form requested
 if s=="OBJ": return date(y,m,d);
 if s=="XL": return "/".join([str(m),str(d),str(y)]);
 if s=="DB": return "-".join([str(y),str(m),str(d)]);
 if s=="XL0": return "/".join([s2(m),s2(d),s2(y)]);
 if s=="DB0": return "-".join([s2(y),s2(m),s2(d)]);
 if s=="TXT": return s2(y)+s2(m)+s2(d);
 return -1;



Examples of use:
dtConv("1/2/2012") gives datetime.date(2012,1,2)
dtConv("1/2/2012","DB0") gives "2012-01-02"
dtConv("1/2/2012","TXT") gives "20120102"
dtConv("20120102") gives datetime.date(2012,1,2)

[Snippet 0002] A simple object example with output. We define a circular list: for ordinary list objects, accessing beyond the end of the list raises an error; clists overcome this problem through modular arithmetic on the index location.

class clist(object):
# creates a circular list object
 num_inst=0; # class-level instance counter, used to enforce the singleton pattern

 def __init__(self,arg1=[]): # constructor
  if clist.num_inst==0: 
   clist.num_inst+=1;
   self.L=arg1;
  else: 
   print "cannot have more than one instance of class clist";
   self.__del__();

 def __del__(self): # object destructor
  pass;

 def __len__(self): # get length of clist
  return len(self.L);

 def __getitem__(self,key): # get an item of clist
  pos=key%len(self.L);
  return self.L[pos];

 def __contains__(self,key):
  if key in self.L: return True;
  return False;

 def __reversed__(self): # reverse clist contents in place (returns None)
  self.L.reverse();
  
 def content(self): # accessor for clist contents
  return self.L;

 def redef(self,L): # reset clist contents
  self.L=L;

# sample use:
>>> execfile("clist.py");
>>> b=clist([1,2,3,4,'a','b','c']);
>>> len(b)
7
>>> reversed(b)
>>> b.content();
['c', 'b', 'a', 4, 3, 2, 1]
>>> b.redef([1,2,3,4,5,6,7]);
>>> b.content();
[1, 2, 3, 4, 5, 6, 7]
>>> len(b);
7
>>> b[1]
2
>>> b.content()
[1, 2, 3, 4, 5, 6, 7]
>>> 'a' in b
False
>>> 1 in b
True
>>> c=clist([1,2,3]);

cannot have more than one instance of class clist
>>> b[-13]
2
>>> b[13]
7
>>>

[Snippet 0003] Here, we combine the previous two snippets to design a date class dt that behaves like the examples from snippet 0001, i.e. we can store the date any way we like and seamlessly convert from one form to another as desired. Notice how clean the Object-Oriented approach looks in comparison to the purely procedural technique, though of course, for more complex examples, there is probably a greater up-front price to pay in designing things the OO way as opposed to coding procedurally. The former is arguably easier to maintain, however. The code (dt.py) and output follow:


import os, sys;
from datetime import *;

def s2(x): # returns "03" if x==3, or "10" if x==10
 if x<10: return "0"+str(x); # i.e. adds leading zeroes as needed
 else: return str(x);

class dt(object):
 # creates a dt object and provides means to view it in different ways

 def __init__(self,x): # constructor
  if type(x)==str and x.count('/')==2: # covers XL and XL0 types
   m,d,y=x.split("/");
   self.m,self.d,self.y=int(m),int(d),int(y);
  if type(x)==str and x.count('-')==2: # covers DB and DB0 forms
   y,m,d=x.split("-"); # DB form is yyyy-m-d, so the year comes first
   self.m,self.d,self.y=int(m),int(d),int(y);
  if type(x)==date: self.y,self.m,self.d=x.year,x.month,x.day;
   # covers the date object format
  
 def __del__(self): # destructor
  pass;

 def OBJ(self): # returns the date object
  return date(self.y,self.m,self.d);

 def TXT(self): # returns the text representation
  m,d=s2(self.m),s2(self.d);  
  return str(self.y)+m+d;

 def XL(self): # returns the Excel date type
  return "/".join([str(self.m),str(self.d),str(self.y)]);

 def XL0(self): # returns Excel date type with leading 0s
  return "/".join([s2(self.m),s2(self.d),str(self.y)]);

 def DB(self): # returns the MySQL DB date type
  return "-".join([str(self.y),str(self.m),str(self.d)]);

 def DB0(self): # returns the MySQL DB date type with LZs
  return "-".join([str(self.y),s2(self.m),s2(self.d)]);

# sample output generated as below
>>> execfile("dt.py");
>>> a=dt("4/10/2012");
>>> a.OBJ();
datetime.date(2012, 4, 10)
>>> a.TXT();
'20120410'
>>> a.DB();
'2012-4-10'
>>> a.DB0();
'2012-04-10'
>>> a.XL0();
'04/10/2012'
>>> a=dt(date(2012,4,10));
>>> a.OBJ()
datetime.date(2012, 4, 10)
>>> a.TXT()
'20120410'
>>> a.XL0();
'04/10/2012'
>>> a.XL();
'4/10/2012'
>>> a.DB();
'2012-4-10'
>>> a.DB0();
'2012-04-10'
>>>

[Snippet 0004] Python provides native support for a number of different and useful data-types. However, sometimes we may need to solve problems for which the required data types are not readily available. For instance, sometimes we may want to store data in a tree or a trie. How would we go about doing this? Luckily, Python is very intuitive in its support for these kinds of derived data structures. We present a simple implementation below that can be used to build graphs, DAGs, tries etc along with some sample examples of its use.


class node(object): # the class that creates every node object
 allNodes=[]; # static class variable to store all nodes created

 def __init__(self,val=-1): # constructor that creates node with specified tag
  self.val=val;
  self.parent=[];
  self.child=[];
  node.allNodes+=[self];

 def getVal(self): # accessor function that returns the tag value of a node
  return self.val;

 def addChild(self,n): # mutator function that connects a node to a child
  if self.getVal()!=n.getVal():
   self.child+=[n];

 def addParent(self,n): # mutator function that connects a node to a parent
  self.parent+=[n];

 def getChildren(self): # returns a list of child nodes for a node
  return self.child;

 def getChildVals(self): # returns a list of child node values for a node
  t=self.getChildren();
  r=[i.getVal() for i in t];
  return r;

 def getChildByVal(self,val): # returns a particular child node of a node by value
  p=self.getChildren();
  q=self.getChildVals();
  if val not in q: return None;
  else: return p[q.index(val)];

# Example usage

a=node(2);
b=node(3);
c=node(4);
d=node(5);
e=node(6);
f=node(7);
g=node(8);

a.addChild(b);
a.addChild(c);
b.addChild(g);
c.addChild(d);
c.addChild(e);
e.addChild(f);

b.addParent(a);
c.addParent(a);
d.addParent(c);
e.addParent(c);
f.addParent(e);
g.addParent(b);


def getBFSChain(n): # get the breadth-first chain of nodes and values
 if type(n)!=node: return -1;
 r=[n];
 for i in r: r+=i.getChildren(); # extending the list while iterating over it yields breadth-first order
 for i in r: print i.getVal(),;

getBFSChain(a);

# output follows:
2 3 4 8 5 6 7

[Snippet 0005] This is a very interesting code snippet I remember reading on the web somewhere. It comes from a Google software engineer with the screen name "Tryptych". I don't remember which website I read it on, but the idea was so clean and elegant that I remember it clearly. It is an algorithm to factorize numbers. The code follows, along with example output:

import os, sys;

def F(x): # this function factors a number x into its prime factors
 r=[];
 i=2;
 while x>1:
  while (x % i)==0: 
   r+=[i];
   x/=i;
  i+=1;
 return r;

print "prime factors of 100 are: ",F(100);
print "prime factors of 1024 are: ",F(1024);
print "prime factors of 1789 are: ",F(1789);
print "prime factors of 2013 are: ",F(2013);
print "prime factors of 11204243 are: ",F(11204243);
print "prime factors of 112042431 are: ",F(112042431);
print "prime factors of 1120424311 are: ",F(1120424311);

# output follows:
prime factors of 100 are:  [2, 2, 5, 5]
prime factors of 1024 are:  [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
prime factors of 1789 are:  [1789]
prime factors of 2013 are:  [3, 11, 61]
prime factors of 11204243 are:  [19, 23, 25639]
prime factors of 112042431 are:  [3, 3, 101, 123259]
prime factors of 1120424311 are:  [1120424311]

[Snippet 0006] 

Sunday, February 26, 2012

On The Design of Location Based Services

In this post, we present a broad-brush, 10,000 foot level survey of developments in the Location Based Services domain. This post is a work in progress. There are several other ideas that merit coverage here including privacy and security concerns, other means of tracking user location, and leading edge location services application scenarios.

As mobile Internet devices become more prevalent, there is a larger pull for mobility characteristics to be blended into the end-user experience. Earlier, people would log into the nearest computer and find directions to the nearest ATM, restaurant, or POI (Point Of Interest) by providing their current geographical location. Later, they graduated to providing their location (i.e. the location of their computer) once, which the computer would "register" (either using cookies or by saving with the user profile on the application website) and send out seamlessly to applications that needed this information to provide value-added services.

This in turn gradually led to more "sophisticated" usage scenarios where applications would recognize the geographic location of the RAS (remote access server) or other IP address pool from which the user's computer was assigned an address, and divine the user's location from that (though due to idiosyncrasies of how some ISP networks are set up and the details of the underlying infrastructure used - DSL vs. cable, head-end locations, etc. - this can sometimes lead to embarrassing errors).

As Location Based Services (or LBS for short henceforth) evolved further, cellular networks employed technologies that factored in the time difference of arrival of signals from the handset, the angle of arrival of those signals, or combinations thereof, to determine the location of a handset in 3-dimensional space relative to a network of stationary cell-phone towers. During this time, acronyms like PDEs (Position Determining Equipment), MPCs (Mobile Positioning Centers), GMLCs, TDOA (Time Difference Of Arrival), EFLT, AFLT, .... (and others too numerous to mention, and perhaps not all worth even googling for anymore) were widely prevalent. But once GPS-capable hardware became available in handsets, LBS really took off in a big way.

Aside: Note how the above describes scenarios in which sets of mobile handsets communicate with each other via a network of stationary nodes (cell towers). Of course, more ad-hoc arrangements also exist, where a set of mobile handsets talk directly to each other in true peer-to-peer fashion (like CB radios or walkie-talkies), as do the kinds of networks covered under the IETF MANET standards; for simplicity we ignore them here, since determining location in such systems can get much more complicated. <end of Aside>

Some issues persisted even with GPS. A common problem in large metropolitan areas is that of an "urban canyon". This happens between tall buildings, say in NYC, or inside buildings surrounded by sky-scrapers, where there is limited access to satellites, so one has to revert to the older ways of determining location (see para 1 above for example) to be able to provide value-added services to consumers.

As Wifi proliferated, methods akin to those from the second paragraph above were employed at Wifi hotspots: the relative strengths of signals from different hotspots (if these were part of a single larger infrastructure network) were measured and used as a basis to compute the likely location of the mobile client. This alleviated, to some extent, the problems with urban canyons, since hybrid technologies could now be employed to location-enable applications in areas where such coverage would previously not have been possible. Hybrid as in: use GPS where GPS is available; else, if a Wifi capability existed on the handset and Wifi networks were also located in the vicinity, utilize those to either get a fix or improve the accuracy of a fix obtained by other methods. Once this base location capability was available, service logic could be overlaid atop it.
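To make the hot-spot idea a bit more concrete, here is a minimal, purely illustrative sketch of a weighted-centroid position estimate: each visible access point with a known location pulls the estimate towards itself in proportion to its received signal strength. The access-point coordinates and RSSI readings below are made up, and the weighting scheme is just one simple choice among many:

# each tuple is (ap_latitude, ap_longitude, rssi_in_dBm) for one visible access point
readings=[(40.7484,-73.9857,-45),(40.7470,-73.9865,-60),(40.7491,-73.9840,-72)];

def weightedCentroid(readings):
 # stronger (less negative) RSSI => larger weight => bigger pull on the estimate
 weights=[1.0/abs(rssi) for (lat,lon,rssi) in readings];
 wsum=sum(weights);
 lat=sum([w*r[0] for w,r in zip(weights,readings)])/wsum;
 lon=sum([w*r[1] for w,r in zip(weights,readings)])/wsum;
 return (lat,lon);

print "estimated position: ",weightedCentroid(readings);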

This is where we are today. I guess what matters more than the technologies used to support location data is the applications that reside atop the stack. One can deploy some very useful or even some very "cool" (even if not especially useful) applications today, all thanks to location. Examples include geocaching, location-based games, automated mobile tourist guides, of course GPS-enabled driving directions, and even research projects like "street stories" (yes, this was very cool in 2002).


Tuesday, February 7, 2012

Facebook posts Sentiment Analysis

In a previous post, we examined the utility of mining social media such as the micro-blogging site Twitter. Unrestricted, light, friendly, uncensored, and sometimes trivial and uninteresting information is shared by people with one another on such media. At the same time, there are some insightful posts or tweets, reporting from the site of event occurrences, from the people closest to the action, which gives this information a certain relevance, an immediacy, a currency, an accuracy, and even a certain un-pretentiousness that comes from its being delivered by ordinary people who want to get their voice out there, get heard, and share a view. This is what makes social media so fascinating.

People don't usually care that Alice had chicken for dinner. Perhaps even Alice's closest friends do not. But people do care what someone on the ground in a war-zone - say a civilian - has to say about what she has seen and is experiencing, along with video footage where available, as she breaks a news-story to the broader world. In this latter sense, social media's contribution to the world at large is immense.

In this post, we look at legally mining data from that other large social media engine, Facebook. As of this writing, Facebook has ~800M subscribers, which would make it a medium-to-large country if it were a geographic agglomeration of people. And these people tend to interact with each other, some quite frequently, by posting messages, sharing photographs, using applications, "liking", "poking" and doing other Facebook-specific actions, all with a view to having fun, "keeping in touch", and generally having a good time being social.

Mining Facebook data can be done in two ways:
1. Searching through Facebook public domain posts and messages. This does not require one to log into Facebook; it can be achieved merely by using the published, public Facebook API to access data that many Facebook users do not necessarily even know they have placed into the public domain, though lately at least part of the user population seems to be wising up to data ownership and privacy issues.
2. Searching Facebook posts that are not publicly accessible. This requires that one log in, but provides deeper access to the Facebook "graph" that connects various Facebook objects, including messages, groups, pictures, links etc., together into a structure that can be queried via a REST-ful API (a hedged sketch of this approach appears right after this list).
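As a rough, hedged sketch of what approach (2) might look like: this assumes one has separately obtained an OAuth access token (for example via Facebook's developer tools), and that the /me/feed Graph API endpoint returns JSON with a top-level "data" list whose entries may carry a "message" field; the endpoint and response shape here reflect the Graph API as of this writing and are not guaranteed:

import urllib2, json;

token="YOUR_ACCESS_TOKEN_HERE"; # placeholder, not a real token
req=urllib2.Request("https://graph.facebook.com/me/feed?access_token="+token);
feed=json.loads(urllib2.urlopen(req).read());
for post in feed.get("data",[]):
 if "message" in post: print post["message"];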

We implement a very simple version of (1) in the code below. Again, we note that mining data here can have various practical applications, such as performing sentiment analysis of the user population towards world or local events, e.g. the crisis in Europe or Greece, the Arab Spring, elections in the US, the price of oil, Madonna's performance during the Super-Bowl, etc. Sentiment analysis can even be used to help with the marketing of products and services, and even for applications like investment management; there have been news-stories of how sentiment analysis was used to predict the direction of Netflix stock.

Our implementation stops short of performing the actual sentiment analysis because we have already implemented simple sentiment analysis in our earlier Twitter post; the same approach can be replicated here by the interested reader with minimal effort. Several additional enhancements can also be made if (2) above is used to determine how "connected" a post-writer is to the rest of the Facebook graph (we can do this in the Twitter context using the number of followers one has). Facebook also offers additional media, like pictures, that can be used to add further context to the story.

One issue we face while performing sentiment analysis in the Facebook context that we do not see with Twitter is that Facebook posts tend to be longer on average, since they are not limited to the 140 characters of a typical Twitter micro-blog tweet. This means that even our simple sentiment analysis algorithm needs to be tweaked: compute the overall sentiment of a post from the relative percentages of positive, negative, and neutral sentiment key-words, and then interpret these in the larger context of the post (a small sketch of this percentage-based scoring follows below). An additional hurdle is that longer posts offer greater room for creativity, which likely means more posts that are sarcastic or satirical in nature; text-based analysis, unless very sophisticated, is likely to miss these nuances in meaning, making things more difficult from a classification stand-point.
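Here is a toy sketch of that percentage-based scoring; the key-word lists are tiny stand-ins rather than a real sentiment lexicon, and a serious implementation would of course use a much richer one:

POS=["good","great","gain","recovery","up"]; # toy positive key-words
NEG=["bad","fraud","crisis","loss","down"];  # toy negative key-words

def postSentiment(text):
 words=text.lower().split();
 if len(words)==0: return "neutral";
 p=100.0*len([w for w in words if w in POS])/len(words); # % positive key-words
 n=100.0*len([w for w in words if w in NEG])/len(words); # % negative key-words
 if p>n: return "positive";
 elif n>p: return "negative";
 else: return "neutral";

print postSentiment("quantitative easing is a fraud and a crisis in the making");
# expected output: negative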

Code for simple Facebook data-mining is posted below, along with a sample query using the key-words "quantitative easing" and the generated output file. Enjoy!

Some other issues with the code (an optimist might have titled this section "Future Work" or "Next Steps"):

  1. This is a very simple, unpretentious implementation focused on the core issue of mining Facebook posts for data and parsing the results into a human-readable, usable input form for sentiment analysis. It can easily be made much more sophisticated, but we just hit the highlights here and move on to other things.
  2. I did not build in a processor for special characters in the larger unicode data set. So these appear as noise in the output.
  3. I do not check for messages that repeat and filter them out, with good reason. Sometimes messages with minimal changes are re-posted by other people, sometimes with, and sometimes without attribution to the original post. I guess the general rules about plagiarism vs. original thought do not apply as much to social media. 


Sample Source Code:

import os, sys, urllib2; # include standard libraries of interest


# helper function used to parse the output of the query to Facebook.
# takes two lists of indices as input and returns a list of tuples (x,y),
# where x is an index from the first list offset by 2 (this skips past the
# "message" key and the ":" that follows it in the quote-split response,
# landing on the message text itself), and y is the smallest index in the
# second list that is larger than x.
def L(a,b):
 r=[];
 for i in a:
  t=[j for j in b if j>i+2];
  if t: r+=[(i+2,t[0])]; # skip messages with no trailing separator in the split
 return r;


# program sample usage is: python fbmine.py "quantitative easing" qe2.txt
# here fbmine.py is this source code file
# "quantitative easing" is the string of space separated keywords we mine for
# qe2.txt is the output file generated by this data-mining exercise.
wrds=sys.argv[1]; # wrds is the string of words we want to filter for
wrds=wrds.split(" ");
s="";
for i in wrds[:-1]: s+=i+"%20"; # populate the query 


s+=wrds[-1];
#print "query: http://graph.facebook.com/search?q="+s+"&type=post&limit=1000\n\n"; # create the query string and launch it
req=urllib2.Request("http://graph.facebook.com/search?q="+s+"&type=post&limit=1000");
response=urllib2.urlopen(req); # collect the query results
txt=response.read();
txt=txt.replace("\\n","").replace("\\",""); # some simple cleanup of read data


p=txt.split("\"");
m1=[i for i in range(len(p)) if p[i]=="message"]; # parsing the messages
m2=[i for i in range(len(p)) if p[i]==","];
R=L(m1,m2); # using the helper function
g=open(sys.argv[2],"w"); # generating and writing out the output
for i in R: 
 s="";
 for j in range(i[0],i[1]): s+=p[j]+" \n";
 g.write("-----------------------------------------------------------------\n");
 g.write(s+"\n");


g.write("------------------------------------------------------------------\n");


Sample Output File: (file was generated around 1815 hrs Friday Feb 10 2012)

--------------------------------------------------------------------------------
The Bank of England has announced another round of 'quantitative easing', this time printing u00a350 billion of money.Keep it up lads; at this rate soon we'll all be billionaires, just like everyone in Zimbabwe.Turns out that smashing a stake through a vampire's heart works, even if your neighbours cat's not a vampire. 


--------------------------------------------------------------------------------
The Bank of England has announced another round of 'quantitative easing', this time printing u00a350 billion of money.Keep it up lads; at this rate soon we'll all be billionaires, just like everyone in Zimbabwe. 


--------------------------------------------------------------------------------
Neat, expression, So why the blithering flip.....Very interesting article on printing money..  It's English, but applies equally here, I think:  (Be sure to read the last link in article)http://blogs.telegraph.co.uk/news/danielhannan/100136397/quantitative-easing-has-failed-and-failed-again-what-madness-has-seized-our-leaders/ 


--------------------------------------------------------------------------------
The Bank of England has announced another round of 'quantitative easing', this time printing u00a350 billion of money.Keep it up lads; at this rate soon we'll all be billionaires, just like everyone in Zimbabwe. 


--------------------------------------------------------------------------------
just saw ths funny joke....... The Bank of England has announced another round of 'quantitative easing', this time printing u00a350 billion of money.Keep it up lads; at this rate soon we'll all be billionaires, just like everyone in Zimbabwe. 


--------------------------------------------------------------------------------
http://www.zerohedge.com/news/obama-revises-cbo-deficit-forecast-predicts-110-debt-gdp-end-2013quantitative easing is not the panacea that Obama is hoping for 


--------------------------------------------------------------------------------
The system is FRAUD! 


--------------------------------------------------------------------------------
The Bank of England has announced another round of 'quantitative easing', this time printing u00a350 billion of money. Keep it up lads; at this rate soon we'll all be billionaires, just like everyone in Zimbabwe. 


--------------------------------------------------------------------------------
The bank of england has just announced another round of 'quantitative easing', this time printing u00a350 billion in notes.Keep it up lads, at this rate we'll soon all be billionaires. Just like everyone in zimbabwe. 


--------------------------------------------------------------------------------
u2018u201cQuantitative Easing is a transfer of wealth from the poor to the rich,u201d he says, u201cIt floods banks with money, which they use to pay themselves bonuses. The banks have money, and assets, so they can borrow easily. The poor guy, who is unemployed and can't borrow, is not going to benefit from it.u201d The QE process pushes asset prices up, he says, which is great for those who own stocks, shares and expensive houses. u201cBut the state is subsidising the rich. It is the top 1 per cent who benefit from Quantitative Easing, not the 99 per cent.u201du2019 -- Nassim Taleb 


--------------------------------------------------------------------------------
Quantitative easing is now more vile on the lips than any four letter word http://tgr.ph/zNpxTg 


--------------------------------------------------------------------------------
http://blogs.telegraph.co.uk/news/danielhannan/100136397/quantitative-easing-has-failed-and-failed-again-what-madness-has-seized-our-leaders/ 


--------------------------------------------------------------------------------

[...] I've truncated this to save space.