Saturday, July 27, 2013

Notes Android Apps

There are several other blog posts here at xtechnotes.blogspot.com about apps for Android, but they are scattered. This post focuses specifically on Android apps, presented in their relevant categories. 

One obvious point is that there are thousands of apps out there. The list below deliberately contains only a few apps per category. The goal is to list the top apps only, so readers don't need to wade through hundreds of apps offering the same functionality, especially since many of the available apps are of questionable quality.

For reference, the other older sites are:
http://xtechnotes.blogspot.com.au/2010/12/notes-zenithink-zt-180-epad-android.html
http://xtechnotes.blogspot.com.au/2010/06/mobile-applications-for-apple-android.html

All apps listed below are for Android (they may be available for other operating systems as well) and are totally FREE.

NEWS
Feedly - aggregates websites and blogs.
Pulse - aggregates websites and blogs with offline cache support?
Flipboard - creates personalized magazine from specified topics.
Zite - learns your reading habits and compiles into a magazine format.


PHOTOS
Instagram - take photos and edit them with many filters.
Snapseed - photo editor with gesture control.


UTILITIES
Onavo Extend - tracks data usage per app and compresses data downloads.
AutomateIt - set up triggers and actions to automate tasks on the device.

Astro File Manager
ES File Explorer
File Manager Explorer
Inkredible - note-taking app with palm rejection (not sensitive to a palm accidentally resting on the tablet).
WiFi Analyser and Surveyor
DiskUsage - shows how your Android device's storage space is used.

VOICE / VIDEO / TEXTS
There are numerous apps that allow for voice calls, video calls and text messages. The list is available in:
http://ausfinance.blogspot.com.au/2013/06/mobile-apps-free-apps-for-call-text.html

SOCIAL
WhatsApp Messenger - send free text messages from your phone.
1stavailable - medical appointment booking app.

BLOGS
Blogger
Wordpress

PRODUCTIVITY
Evernote - note taking, image capture, OCR on server side.
Receipts (Proximiant) - scan receipts

LIFESTYLE
Runkeeper - measures pace, distance and calories.
Kindle - ebook reader.
Zeebox AU - electronic program guide for free-to-air TV and Foxtel.
Jamie's Recipes - instructions for a variety of dishes


MEDIA PLAYERS
MX player
RealPlayer
RockPlayer

MUSIC - Radio
Pandora Radio - specify an artist, song or composer and it will create a radio station with relevant music for free.
TuneIn Radio - listen to radio from local and worldwide stations.

Stitcher Radio
Live365

TV
ABC iview - free catch-up TV from the ABC.


GAMES
Clash of Clans Free
Real Racing 3 
Dead Trigger


TRAVEL
XE Currency - converts world currencies with historical rates; works offline.
TripAdvisor Hotels Flights - research and review hotels, restaurants, flights.
Kayak - comparison for flights, hotels and car rentals.


GOOGLE
Hangouts - live chat and video calls.
Translate - translate between different languages.
Maps - street and satellite maps.


SECURITY
Avast Mobile Security
Avira Antivirus Security
BitDefender Mobile Security
Lookout
Trend Micro Mobile Security
Eset Mobile Security & Antivirus
Qihoo 360 Mobile Security
Norton Mobile Security
Zoner Antivirus Free
Wiper - messaging and music sharing with an emphasis on privacy and security.

Thursday, July 11, 2013

Notes PostgreSQL


Installation Notes
--prefix=/usr/local/pgsql # default


1)  ./configure --prefix=/home/chee/usr/pgsql --with-perl --with-python --enable-odbc --enable-syslog

2) gmake

3) gmake check

4) gmake install

5) add path to LD_LIBRARY_PATH, PATH, MANPATH

6) create user "postgres" -> adduser postgres

7) Install database directory:
   cd /home/chee/usr/pgsql/
   mkdir data
   chown postgres /home/chee/usr/pgsql/data    # change owner of "data" to "postgres"
   su - postgres                               # install db as user "postgres"
   ~/usr/pgsql/bin/initdb -D /home/chee/usr/pgsql/data   # installing db
   
8) Starting the database:

/home/chee/usr/pgsql/bin/postmaster -D /home/chee/usr/pgsql/data
or
    /home/chee/usr/pgsql/bin/pg_ctl -D /home/chee/usr/pgsql/data -l logfile start

     This will start the server in the foreground. To put the server in the
     background use something like

#### does not seem to work
     nohup /home/chee/usr/pgsql/bin/postmaster -D /home/chee/usr/pgsql/data \
         </dev/null >>server.log 2>&1 &

     To stop a server running in the background you can type

     kill `cat /home/chee/usr/pgsql/data/postmaster.pid`

     In order to allow TCP/IP connections (rather than only Unix domain
     socket ones) you need to pass the "-i" option to "postmaster".

9) Create a database:

     createdb testdb

     Then enter

     psql testdb

     to connect to that database. At the prompt you can enter SQL commands
     and start experimenting.






     --with-includes=DIRECTORIES

          "DIRECTORIES" is a colon-separated list of directories that will
          be added to the list the compiler searches for header files. If
          you have optional packages (such as GNU Readline) installed in a
          non-standard location, you have to use this option and probably
          also the corresponding "--with-libraries" option.

          Example: --with-includes=/opt/gnu/include:/usr/sup/include.

     --with-libraries=DIRECTORIES

          "DIRECTORIES" is a colon-separated list of directories to search
          for libraries. You will probably have to use this option (and the
          corresponding "--with-includes" option) if you have packages
          installed in non-standard locations.

          Example: --with-libraries=/opt/gnu/lib:/usr/sup/lib.

     --with-pgport=NUMBER

          Set "NUMBER" as the default port number for server and clients.
          The default is 5432. The port can always be changed later on, but
          if you specify it here then both server and clients will have the
          same default compiled in, which can be very convenient. Usually
          the only good reason to select a non-default value is if you
          intend to run multiple PostgreSQL servers on the same machine.

     --with-CXX

          Build the C++ interface library.

     --with-perl

          Build the Perl interface module. The Perl interface will be
          installed at the usual place for Perl modules (typically under
          "/usr/lib/perl"), so you must have root access to perform the
          installation step (see step 4). You need to have Perl 5 installed
          to use this option.

     --with-python

          Build the Python interface module. You need to have root access to
          be able to install the Python module at its default place
          ("/usr/lib/pythonx.y"). To be able to use this option, you must
          have Python installed and your system needs to support shared
          libraries. If you instead want to build a new complete interpreter
          binary, you will have to do it manually.

     --with-tcl

          Builds components that require Tcl/Tk, which are libpgtcl,
          pgtclsh, pgtksh, PgAccess, and PL/Tcl. But see below about
          "--without-tk".

     --without-tk

          If you specify "--with-tcl" and this option, then programs that
          require Tk (pgtksh and PgAccess) will be excluded.

     --with-tclconfig=DIRECTORY, --with-tkconfig=DIRECTORY

          Tcl/Tk installs the files "tclConfig.sh" and "tkConfig.sh", which
          contain configuration information needed to build modules
          interfacing to Tcl or Tk. These files are normally found
          automatically at their well-known locations, but if you want to
          use a different version of Tcl or Tk you can specify the directory
          in which to find them.

     --enable-odbc

          Build the ODBC driver. By default, the driver will be independent
          of a driver manager. To work better with a driver manager already
          installed on your system, use one of the following options in
          addition to this one. More information can be found in the
          Programmer's Guide.

     --with-iodbc

          Build the ODBC driver for use with iODBC.

     --with-unixodbc

          Build the ODBC driver for use with unixODBC.

     --with-odbcinst=DIRECTORY

          Specifies the directory where the ODBC driver will expect its
          "odbcinst.ini" configuration file. The default is
          "/usr/local/pgsql/etc" or whatever you specified as
          "--sysconfdir". It should be arranged that the driver reads the
          same file as the driver manager.

          If either the option "--with-iodbc" or the option
          "--with-unixodbc" is used, this option will be ignored because in
          that case the driver manager handles the location of the
          configuration file.

     --with-java

          Build the JDBC driver and associated Java packages. This option
          requires Ant to be installed (as well as a JDK, of course). Refer
          to the JDBC driver documentation in the Programmer's Guide for
          more information.

     --with-krb4[=DIRECTORY], --with-krb5[=DIRECTORY]

          Build with support for Kerberos authentication. You can use either
          Kerberos version 4 or 5, but not both. The "DIRECTORY" argument
          specifies the root directory of the Kerberos installation;
          "/usr/athena" is assumed as default. If the relevant header files
          and libraries are not under a common parent directory, then you
          must use the "--with-includes" and "--with-libraries" options in
          addition to this option. If, on the other hand, the required files
          are in a location that is searched by default (e.g., "/usr/lib"),
          then you can leave off the argument.

          "configure" will check for the required header files and libraries
          to make sure that your Kerberos installation is sufficient before
          proceeding.

     --with-krb-srvnam=NAME

          The name of the Kerberos service principal. postgres is the
          default. There's probably no reason to change this.

     --with-openssl[=DIRECTORY]

          Build with support for SSL (encrypted) connections. This requires
          the OpenSSL package to be installed. The "DIRECTORY" argument
          specifies the root directory of the OpenSSL installation; the
          default is "/usr/local/ssl".

          "configure" will check for the required header files and libraries
          to make sure that your OpenSSL installation is sufficient before
          proceeding.

     --with-pam

          Build with PAM (Pluggable Authentication Modules) support.

     --enable-syslog

          Enables the PostgreSQL server to use the syslog logging facility.
          (Using this option does not mean that you must log with syslog or
          even that it will be done by default, it simply makes it possible
          to turn that option on at run time.)

     --enable-debug

          Compiles all programs and libraries with debugging symbols. This
          means that you can run the programs through a debugger to analyze
          problems. This enlarges the size of the installed executables
          considerably, and on non-GCC compilers it usually also disables
          compiler optimization, causing slowdowns. However, having the
          symbols available is extremely helpful for dealing with any
          problems that may arise. Currently, this option is recommended for
          production installations only if you use GCC. But you should
          always have it on if you are doing development work or running a
          beta version.

     --enable-cassert

          Enables assertion checks in the server, which test for many "can't
          happen" conditions. This is invaluable for code development
          purposes, but the tests slow things down a little. Also, having
          the tests turned on won't necessarily enhance the stability of
          your server! The assertion checks are not categorized for
          severity, and so what might be a relatively harmless bug will
          still lead to server restarts if it triggers an assertion failure.
          Currently, this option is not recommended for production use, but
          you should have it on for development work or when running a beta
          version.

     --enable-depend

          Enables automatic dependency tracking. With this option, the
          makefiles are set up so that all affected object files will be
          rebuilt when any header file is changed. This is useful if you are
          doing development work, but is just wasted overhead if you intend
          only to compile once and install. At present, this option will
          work only if you use GCC.

     If you prefer a C or C++ compiler different from the one "configure"
     picks then you can set the environment variables CC or CXX,
     respectively, to the program of your choice. Similarly, you can
     override the default compiler flags with the CFLAGS and CXXFLAGS
     variables. For example:

     env CC=/opt/bin/gcc CFLAGS='-O2 -pipe' ./configure


###############################################
Starting a session (assume database "testdb" already exists)
   $ psql testdb

Information:
   =>select current_user;
   =>select current_timestamp;

System:
- SQL keywords and unquoted identifiers are case-INSENSITIVE
- use ";" to end a statement
- "=>" is the first (primary) prompt, "->" is the continuation prompt until the statement is ended with ";"


PHP connection:
1. In data/postgresql.conf
      tcpip_socket=true

2. In data/pg_hba.conf:
      # TYPE   DATABASE   IP_ADDRESS       MASK              AUTH_TYPE  AUTH_ARGUMENT
      local    all                                           trust
      host     all        127.0.0.1        255.255.255.255   trust
      host     all        129.94.176.241   255.255.255.255   trust

3. Run "postmaster -i ........"




User:
1. Creating user from Unix shell:   createuser demouser1
2. Creating user from within psql:
   a) start the psql client:  psql
   b) creating new user: CREATE USER demouser2;
3. Changing user permissions:
        test=> ALTER USER demouser2 CREATEDB;
        test=> CREATE GROUP demogroup WITH USER demouser1, demouser2;
        test=> CREATE TABLE grouptest (col INTEGER);
        test=> GRANT ALL on grouptest TO GROUP demogroup;
        test=> \connect test demouser2
        You are now connected to database test as user demouser2.
        test=> \q

Commands:
;                        to end a statement
\g                       (go) to end a statement
\p                       (print) to display buffer contents
\r                       (reset) to erase or reset buffer
\q                       (quit) to exit psql
\l                       (list) to list databases in the system
\d                       to list all tables in the database
                         ORACLE: select * from cat
\d TABLE                 to list all attributes or columns of the TABLE
\connect <DB> <USER>     connect to database DB as user USER

\i FILE                  to read/run SQL script
\o FILE                  to print results to file called FILE
\o                       to switch output back to STDOUT
\t                       to switch off column titles when displaying query results
\z                       to see ownership of db objects


Data Types   PostgreSQL    Oracle      Description
----------------------------------------------------------------------
char string  CHAR(n)       CHAR(n)     blank-padded string, fixed storage
             VARCHAR(n)    VARCHAR2(n) variable storage length
----------------------------------------------------------------------
number       INTEGER                   integer, +/-2billion range
             FLOAT                     float pt, 15-digit precision
             NUMERIC(p,d)  NUMBER(p,d) user-defined precision and decimal
----------------------------------------------------------------------
date/time    DATE          DATE        date
             TIME                      time
             TIMESTAMP                 date and time

DATE - use
"show datestyle" or
"SET DATESTYLE TO 'ISO'|'POSTGRES'|'SQL'|'US'|'NONEUROPEAN'|'EUROPEAN'|'GERMAN'"

***** PostgreSQL has even more data types than listed here.




***************
Creating table
CREATE TABLE friend (
                     firstname CHAR(15),
                      lastname  CHAR(20)   );

Inserting Values
INSERT INTO friend VALUES (
        'Cindy',
        'Anderson'    );

Selecting Records
SELECT <attribute/column> FROM <relation/table>;
SELECT * FROM friend;
SELECT <attribute/column> FROM <relation/table> WHERE <attribute> <OP> <value>;
SELECT <attrib> FROM <relation> ORDER BY <attrib1, attrib2> DESC;
<OP> = {=, <, >, ........}
use "\t" in psql to omit the column titles.


Delete records
DELETE FROM <relation>;       !!! delete all rows
DELETE FROM <relation> WHERE <attribute> <OP> <value>;


Update
UPDATE <relation>  SET <attribute> = <value> WHERE <attribute> = <value>;


Destroying table
DROP TABLE <relation>;


Input / Output data using COPY
NULL is displayed as  \N
COPY table TO 'file' USING DELIMITERS '|';
COPY table FROM 'file';
COPY table FROM stdin;
COPY table TO stdout;
COPY table TO 'file' WITH NULL AS '';           NULL as blanks
COPY table FROM 'file' WITH NULL AS '';         NULL as blanks
COPY table FROM 'file' WITH NULL AS '?';        NULL as '?'

Copying across network
- use stdin, stdout or psql's \copy command

Wednesday, July 03, 2013

Notes BigData

Notes BigData
===============

Definition
Web Intelligence and Big Data course
Why Big Data
Hadoop Ecosystem
MapReduce
Miscellaneous
Analysis


Definition
===========
Ref:  http://www.intel.com.au/content/www/au/en/big-data/unstructured-data-analytics-paper.html

- All history until 2003 - 5 exabytes
- 2003 to 2012 - 2.7 zettabytes
- data generated by more sources and devices, including video
- data are UNSTRUCTURED: texts, dates, facts. Traditional analytics works on STRUCTURED data (RDBMS).
- Analytics = Profit. Gartner survey - those who use Big Data outperform competitors by 20%.



Web Intelligence and Big Data (WIBD) course
======================================
50 billion pages indexed by Google.



More surprising events make better news.
- if an event has probability p, then it carries
    information = log_2 (1/p) = -log_2 (p)  bits of information.

Mutual Information (MI) - between the transmitted and received signals of a channel
- need to maximise MI
- eg mutual information between Ad$ and Sales
- eg AdSense - given a web page, guess keywords.

IDF = inverse document frequency
- rare words make better keywords.
- IDF of word w = log_2 (N / N_w)
  where N = total number of docs, N_w = number of docs containing word w.

TF = Term Frequency
- number of times the term appears in that specific document.
- more frequent words (in that doc) make better keywords.
- TF = freq of w in doc d = n_w^d

TF-IDF = term freq x IDF = n_w^d x log_2 (N/N_w)
- words with high TF-IDF are good keywords.

Mutual Information between all pages and all words is proportional to
   SUM_d  SUM_w  {  n_w^d x log_2 (N/N_w)  }
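
A minimal Python sketch of TF-IDF scoring over a toy document collection (the documents and tokenisation are made up purely for illustration):

    import math
    from collections import Counter

    # toy corpus: document id -> list of words (made-up data)
    docs = {
        "d1": "big data needs new tools for big analysis".split(),
        "d2": "hadoop is a tool for big data".split(),
        "d3": "recipes for dinner tonight".split(),
    }

    N = len(docs)                          # total number of documents
    doc_freq = Counter()                   # N_w: number of documents containing word w
    for words in docs.values():
        doc_freq.update(set(words))

    def tf_idf(word, doc_id):
        tf = docs[doc_id].count(word)          # n_w^d: frequency of w in document d
        idf = math.log2(N / doc_freq[word])    # log_2(N / N_w)
        return tf * idf

    # words with high TF-IDF are good keywords for that document
    for w in sorted(set(docs["d1"])):
        print(w, round(tf_idf(w, "d1"), 3))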

Mutual Information:  Input F -- Machine Learning -- Output B
Feature F and Behaviour B: if they are independent, the mutual information is zero.
Entropy H(F), H(B)
Mutual Information I(F,B) = SUM_f SUM_b p(f,b) log_2 { p(f,b) / (p(f) p(b)) }
Shannon identity: I(F,B) = H(F) + H(B) - H(F,B)
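
A small Python sketch of the mutual information formula and the Shannon identity above (the joint distribution is made up for illustration):

    import numpy as np

    # joint distribution p(f,b) over a binary feature F and behaviour B (made-up numbers)
    p_fb = np.array([[0.30, 0.10],
                     [0.15, 0.45]])
    p_f = p_fb.sum(axis=1)                 # marginal p(f)
    p_b = p_fb.sum(axis=0)                 # marginal p(b)

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # I(F,B) = SUM_f SUM_b p(f,b) log_2( p(f,b) / (p(f) p(b)) )
    mi = sum(p_fb[i, j] * np.log2(p_fb[i, j] / (p_f[i] * p_b[j]))
             for i in range(2) for j in range(2) if p_fb[i, j] > 0)

    # Shannon identity: I(F,B) = H(F) + H(B) - H(F,B)
    assert abs(mi - (entropy(p_f) + entropy(p_b) - entropy(p_fb.ravel()))) < 1e-12
    print(round(mi, 4))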

WIBD - Naive Bayes
===================
Consider the problem: P(BUY / r,f,g,c) where r,f,g,c are features or keywords in web shopping.
Bayes Rule: P(B,R) = P(B/R).P(R) = P(R/B).P(B)
Naive Bayes assumes r,f,g,c are INDEPENDENT (given BUY / not BUY)
   - can derive the likelihood ratio L
      p(r/B)*p(c/B)*p(other features /B) ..... p(B)
      ---------------------------------------------  = L
      p(r/notB)*p(c/notB)*p(other features /notB) ..... p(notB)

      so if L > 1 we have a BUY, L < 1 then no Buy.
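
A minimal Python sketch of the likelihood ratio above; the prior and the per-feature probabilities are made-up numbers, just to show the mechanics:

    p_buy = 0.2                                                       # prior P(BUY)
    p_feat_given_buy    = {"r": 0.6, "f": 0.7, "g": 0.4, "c": 0.5}    # P(feature / BUY)
    p_feat_given_notbuy = {"r": 0.2, "f": 0.3, "g": 0.5, "c": 0.4}    # P(feature / notBUY)

    def likelihood_ratio(features):
        num = p_buy                    # numerator:   P(B)    * product of P(x/B)
        den = 1.0 - p_buy              # denominator: P(notB) * product of P(x/notB)
        for x in features:
            num *= p_feat_given_buy[x]
            den *= p_feat_given_notbuy[x]
        return num / den

    L = likelihood_ratio(["r", "f", "c"])
    print("BUY" if L > 1 else "no BUY", round(L, 3))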



WIBD - Learn
==============
input X = x1,x2,...xn  (n-dimensional)
output y = y1,y2, ... ym
function f(X) = E[Y/X] expectation
              = y1*P(y1/X) + y2*P(y2/X) + ... + ym*P(ym/X)
Classification(video 5.2)
- eg X=size, head, noise, legs, Y={animal names}
- eg X= {like, lot}, {hate, waste}, {not enjoy}, Y = {positive, negative}
Clustering - unsupervised
- allows us to get classes from the data. Need to choose the right features.
- used when we DON'T know the output classes to start with.
- by definition, clusters are regions MORE populated than random data would be
- add uniform random data with density P0(X); where the ratio r = P(X)/P0(X) is large there is clustering;
  then f(X) = E[Y/X] = r/(1+r), with y=1 for the real data and y=0 for the added random uniform data.
- find things that go together to form a cluster, e.g. negative sentiment: hate, bad - but no one needs to tell us they are negative to start with.
- other means of clustering: k-means, LSH
Rules
- finding which features are related (correlated) to each other, i.e. trying to cluster the features instead of clustering the data.
- compare against data whose features are independent: Po(X) =  P(x1) * P(x2) * ... * P(xn)
  where x1 = chirping, x2 = 4 legged, etc xi={animal features}
  eg P(Chirping) = number of chirping / number of total Data
- Associative Rule Mining
  if there are features, A,B,C,D, want to infer some rule, eg A,B,C => (infer) D
  high support P(A,B,C,D) > s;   technique is to find P(A)>s, P(B)>s etc first
  high confidence P(D/A,B,C) > c
  high interestingness P(D/A,B,C) / P(D)  > i   (see the sketch after this list)
- Recommendation of books - customers are features of books and vice versa.
  Use latent Models: matrix m x n = m x k TIMES k x n
                  eg people x books = people x genre TIMES genre x books
  NNMF - non-negative matrix factorization (a sketch follows below)
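
A small Python sketch of support, confidence and interestingness over a toy transaction list (the transactions and the rule A,B,C => D are made up for illustration):

    # toy transactions: each is the set of features observed together (made-up data)
    transactions = [
        {"A", "B", "C", "D"},
        {"A", "B", "C"},
        {"A", "B", "C", "D"},
        {"B", "C"},
        {"A", "B", "C", "D"},
    ]
    N = len(transactions)

    def prob(items):
        # empirical P(items): fraction of transactions containing all the items
        return sum(items <= t for t in transactions) / N

    support         = prob({"A", "B", "C", "D"})          # P(A,B,C,D)
    confidence      = support / prob({"A", "B", "C"})     # P(D / A,B,C)
    interestingness = confidence / prob({"D"})            # P(D / A,B,C) / P(D)
    print(round(support, 2), round(confidence, 2), round(interestingness, 2))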

(other example features for such rules: unemployment direction, interest rate direction, fraud)
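
The latent-model factorization mentioned above (people x books = (people x genre) TIMES (genre x books)) can be sketched with multiplicative-update NMF; the ratings matrix and the rank k below are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((6, 5))            # m x n matrix, e.g. people x books (made-up data)
    k = 2                             # number of latent "genres"
    W = rng.random((6, k))            # m x k: people x genre
    H = rng.random((k, 5))            # k x n: genre x books

    # Lee-Seung multiplicative updates keep W and H non-negative
    for _ in range(200):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

    print(np.abs(V - W @ H).mean())   # reconstruction error after the updates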

WIBD - Connect
===============
Logic Inference
if A then B    is SAME as ~A OR B  
Obama is president of USA:  isPresidentOf(Obama, USA) - predicates, variables
IF X is president of C  THEN X is leader of C:    IF  isPresidentOf(X,C) THEN isLeaderOf(X,C)
Query: If K then Q, consider that the query means ~K OR Q is TRUE,
       also same as K AND ~Q is FALSE.
       So proving K AND ~Q is FALSE proves If K Then Q (proof by refutation).
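
A tiny Python check of the propositional equivalences used above, over all truth assignments:

    from itertools import product

    # (K -> Q)  is the same as  (~K OR Q),  which is the same as  NOT(K AND ~Q)
    for K, Q in product([True, False], repeat=2):
        implies = (not K) or Q
        assert implies == (not (K and (not Q)))
    print("equivalences hold for all truth assignments")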

WIBD - Prediction
==============
Linear Least Squares Regression:
  data matrix x(i,j): j features, i-th data point with result yi for the i-th point
  f(X) = E[y/X] is found by minimising the squared error E[(y - f(x))^2],
  ... so let f(x) = x^T.w   where w is the vector of unknown coefficients (one per feature).
  Taking the vector derivative and equating it to zero ->  x^T.x.w - x^T.y = 0  (the normal equations)
  R^2 is used to measure the quality of the linear regression fit
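
A short Python sketch of the normal equations and R^2 (the data are randomly generated, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                 # 50 data points, 3 features
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=50)   # noisy targets (made-up model)

    # normal equations: (X^T X) w = X^T y
    w = np.linalg.solve(X.T @ X, X.T @ y)

    # R^2 to measure the fit
    residual = y - X @ w
    r2 = 1 - (residual @ residual) / ((y - y.mean()) @ (y - y.mean()))
    print(w.round(3), round(r2, 3))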
Non - Linear correlation
  - Logistic Regression - f(x) = 1 / (1 + exp(-w^T.x))
  - Support Vector Machines - Data may be high order correlated, eg parabolic correlation etc.
 Neural Networks
   - linear least squares
   - non-linear like logistic
   - feed-forward, multilayer
   - feed-back, like belief network
Which Prediction technique
FEATURES   TARGET  CORRELATION         TECHNIQUE
num        num     stable/linear       Linear Regression
cat        num                         Linear Regression, Neural Networks
num        num     unstable/nonlinear  Neural Networks
num        cat     stable/linear       Logistic Regression
num        cat     unstable/nonlinear  Support Vector Machines
cat        cat                         Support Vector Machines, Naive Bayes, Other Probabilistic Graphical Models



Why Big Data
==============
e.g. why Google (MapReduce), Yahoo (Pig), Facebook (Hive) had to invent a new stack
Challenges the new stack had to address
1. Fault tolerance
2. Variety of data types, e.g. images, videos, music
3. Managing data volumes without archiving (traditional systems need archives)
4. Parallelism was only an add-on in traditional systems
Disadvantages of the traditional stack
1. Could not scale
2. Not suited for compute-intensive deep analytics, e.g. in the web world
3. Price-performance challenge - the new stack instead uses commodity hardware and open source


Hadoop Ecosystem (See NotesHadoop)
===================================

MapReduce (See NotesHadoop)
=============


Miscellaneous
==============
About the speaker: Ross is Chief Data Scientist at Teradata and currently works with major clients throughout Australia and New Zealand to help them exploit the value of ‘big data’. He specialises in deployments involving non-relational, semi-structured data and analyses such as path analysis, text analysis and social network analysis. Previously, Ross was deputy headmaster of John Colet School for 18 years before working as a SAS analyst, a business development manager at Minitab Statistical Software, and founder and lead analyst at datamilk.com.

Ross Farrelly has a BSc (hons 1st class) in pure mathematics from Auckland University, a Masters in Applied Statistics from Macquarie University and a Masters of Applied Ethics from the Australian Catholic University.

Analysis
=========
path analysis
text analysis
social network analysis
natural language processing

Notes Hadoop

Notes Hadoop
=============

References
Famous Websites and their Big Data
Hadoop Ecosystem
MapReduce
Pig Latin
Hive
Twitter Case Study




References
===========
http://searchcloudcomputing.techtarget.com/definition/Hadoop
http://radar.oreilly.com/2011/01/what-is-hadoop.html
http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_?taxonomyId=9&pageNumber=3


Famous Websites and their Big Data
===================================
Facebook - data analytics built around Hive
LinkedIn - infrastructure built around Hadoop


Hadoop Ecosystem (See NotesHadoop)
===================================
Big Data - 3Vs (Volume, Velocity, Variety)

Hadoop

Hadoop Streaming - enables users to write the Map function and Reduce function in any language they want. This middleware component makes these functions work under the Hadoop ecosystem.

Sqoop - JDBC-based Hadoop to DB data movement facility. Can transfer from RDBMS to HDFS. Can transfer from HDFS to RDBMS.
- Use Case - Archiving old data. Using Sqoop, data from an RDBMS can be easily pushed to Hadoop clusters. Storing data in Hadoop instead of on tape archives is more cost-effective, provides fast access when needed, and uses one single technology for old and new data (hence a single set of know-how).

Hive - Enables users to use SQL to operate on Hadoop data. Hive contains only a subset of standard SQL. It may also be used to perform SQL joins with tables from different DB systems, e.g. a table from MySQL with another table from DB2 or even a spreadsheet.

Pig - "Apache Pig is a high-level procedural language for querying large semi-structured data sets using Hadoop and the MapReduce Platform.
Pig simplifies the use of Hadoop by allowing SQL-like queries to a distributed dataset." "instead of writing a separate MapReduce application, you can write a single script in Pig Latin that is automatically parallelized and distributed across a cluster. "

Fuse - middleware that allows users to access HDFS using standard file system commands (in Linux).

Flume-ng (next generation) - enables a load-ready file to be prepared and then transferred to an RDBMS using the RDBMS's high-speed loaders. This functionality overlaps with Sqoop.

Oozie - chains together multiple Hadoop jobs.

HBase - high-performance key-value store.

All are open source. Most of the components are Java-based, but that does not mean users need to program in Java.


MapReduce (See NotesHadoop)
=============
- Message passing, data parallel, pipelined work. Higher level compared to the traditional Shared Memory or Distributed Message Passing paradigms.
- the programmer needs to specify only the Mapper and the Reducer; message passing is handled by the implementation itself (a minimal word-count sketch is below).
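
A minimal word-count sketch in the Hadoop Streaming style, where the Mapper and Reducer are plain Python scripts reading stdin and writing tab-separated key/value pairs; the file names mapper.py / reducer.py and the exact jar invocation are assumptions:

    # mapper.py - emits "word<TAB>1" for every word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - Hadoop delivers the mapper output sorted by key; sum counts per word
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))

    # typically run with the Hadoop Streaming jar, along the lines of:
    #   hadoop jar hadoop-streaming-*.jar -input in/ -output out/ \
    #       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
    # (exact jar name and options depend on the Hadoop installation)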


Pig Latin
==========
http://www.ibm.com/developerworks/library/l-apachepigdataquery/#list1
http://pig.apache.org/docs/r0.7.0/tutorial.html#Pig+Tutorial+File


Hive
=====
Ref: [1] http://hive.apache.org/docs/r0.9.0/

What is Hive?
- A data warehouse infrastructure built on top of Hadoop.
- Provides tools to enable easy data ETL (Extract, Transform, Load).
- Puts structure on the data and provides the capability to query and analyse large data sets stored in Hadoop files.
- HiveQL is easy for people familiar with SQL.
- Enables MapReduce programmers to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
- Hive does not mandate that data be read or written in a "Hive format" - there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats.

What Hive is NOT?
- Because it is based on Hadoop, which is a batch processing system, Hive does not and cannot promise low latency on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data but proceeds much more iteratively (with response times between iterations of less than a few minutes), Hive query response times are of the order of several minutes even for the smallest jobs, and larger jobs (e.g., jobs processing terabytes of data) may in general run for hours.

In summary, low-latency performance is not the top priority of Hive's design principles. What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with the MapReduce framework and UDF/UDAF/UDTF), fault tolerance, and loose coupling with its input formats.


Twitter Case Study
===================
Ref: "Large-Scale Machine Learning at Twitter"; Jimmy Lin and Alek Kolcz

Hadoop - at the core of the infrastructure.
Hadoop Distributed File System (HDFS) - data from other DBs, application logs, etc. are written in real time or batch-processed into HDFS.
Pig - analytics are done using Pig, a high-level dataflow language. Pig scripts are compiled into physical plans that are executed on Hadoop.