Saturday, July 27, 2013

Notes Android Apps

There are several other blog posts here at xtechnotes.blogspot.com about apps for Android, but they are scattered. This post focuses specifically on Android apps, presented in their relevant categories. 

One obvious point is that there are thousands of apps out there. The list below deliberately contains only a few apps per category. The goal is to list the top apps only, so readers don't need to wade through hundreds of apps offering the same functionality, especially since many of the available apps are of questionable quality.

For reference, the other older sites are:
http://xtechnotes.blogspot.com.au/2010/12/notes-zenithink-zt-180-epad-android.html
http://xtechnotes.blogspot.com.au/2010/06/mobile-applications-for-apple-android.html

All apps listed below are for Android (they may be available for other operating systems as well) and are totally FREE.

NEWS
Feedly - aggregates websites and blogs.
Pulse - aggregates websites and blogs with offline cache support?
Flipboard - creates personalized magazine from specified topics.
Zite - learns your reading habits and compiles into a magazine format.


PHOTOS
Instagram - take photos and edit them with many filters.
Snapseed - photo editor with gesture control.


UTILITIES
Onavo Extend - tracks data usage per app and compresses data downloads.
AutomateIt - set up triggers and actions to automate tasks on the device.

Astro File Manager
ES File Explorer
File Manager Explorer
Inkredible - note-taking app with palm rejection (not sensitive to a palm accidentally resting on the tablet).
WiFi Analyser and Surveyor
DiskUsage - shows how your Android device's storage space is used.

VOICE / VIDEO / TEXTS
There are numerous apps that allow for voice calls, video calls and text messages. The list is available in:
http://ausfinance.blogspot.com.au/2013/06/mobile-apps-free-apps-for-call-text.html

SOCIAL
WhatsApp Messenger - send free text messages from your phone.
1stavailable - medical appointment booking app.

BLOGS
Blogger
Wordpress

PRODUCTIVITY
Evernote - note taking, image capture, OCR on server side.
Receipts (Proximiant) - scan receipts

LIFESTYLE
Runkeeper - measures pace, distance and calories.
Kindle - ebook reader.
Zeebox AU - electronic program guide for free-to-air TV and Foxtel.
Jamie's Recipes - instructions for a variety of dishes


MEDIA PLAYERS
MX player
RealPlayer
RockPlayer

MUSIC - Radio
Pandora Radio - specify an artist, song or composer and it will create a radio station with relevant music for free.
TuneIn Radio - listen to radio from local and worldwide stations.

Stitcher Radio
Live365

TV
ABC iview - free catch-up TV from the ABC.


GAMES
Clash of Clans Free
Real Racing 3 
Dead Trigger


TRAVEL
XE Currency - converts world currencies with historical rates; works offline.
TripAdvisor Hotels Flights - research and review hotels, restaurants, flights.
Kayak - comparison for flights, hotels and car rentals.


GOOGLE
Hangouts - live chat and video calls.
Translate - translate between different languages.
Maps - street and satellite maps.


SECURITY
Avast Mobile Security
Avira Antivirus Security
BitDefender Mobile Security
Lookout
Trend Micro Mobile Security
Eset Mobile Security & Antivirus
Qihoo 360 Mobile Security
Norton Mobile Security
Zoner Antivirus Free
Wiper - messaging and music sharing with an emphasis on privacy and security.

Thursday, July 11, 2013

Notes PostgreSQL


Installation Notes
--prefix=/usr/local/pgsql # default


1)  ./configure --prefix=/home/chee/usr/pgsql --with-perl --with-python --enable-odbc --enable-syslog

2) gmake

3) gmake check

4) gmake install

5) add path to LD_LIBRARY_PATH, PATH, MANPATH

6) create user "postgres" -> adduser postgres

7) Install database directory:
   cd /home/chee/usr/pgsql/
   mkdir data
   chown postgres /home/chee/usr/pgsql/data    # change owner of "data" to "postgres"
   su - postgres                               # install db as user "postgres"
   ~/usr/pgsql/bin/initdb -D /home/chee/usr/pgsql/data   # installing db
   
8) Starting the database:

/home/chee/usr/pgsql/bin/postmaster -D /home/chee/usr/pgsql/data
or
    /home/chee/usr/pgsql/bin/pg_ctl -D /home/chee/usr/pgsql/data -l logfile start

     This will start the server in the foreground. To put the server in the
     background use something like

#### does not seem to work
     nohup /home/chee/usr/pgsql/bin/postmaster -D /home/chee/usr/pgsql/data \
         </dev/null >>server.log 2>&1 &

     To stop a server running in the background you can type

     kill `cat /home/chee/usr/pgsql/data/postmaster.pid`

     In order to allow TCP/IP connections (rather than only Unix domain
     socket ones) you need to pass the "-i" option to "postmaster".

9) Create a database:

     createdb testdb

     Then enter

     psql testdb

     to connect to that database. At the prompt you can enter SQL commands
     and start experimenting.






     --with-includes=DIRECTORIES

          "DIRECTORIES" is a colon-separated list of directories that will
          be added to the list the compiler searches for header files. If
          you have optional packages (such as GNU Readline) installed in a
          non-standard location, you have to use this option and probably
          also the corresponding "--with-libraries" option.

          Example: --with-includes=/opt/gnu/include:/usr/sup/include.

     --with-libraries=DIRECTORIES

          "DIRECTORIES" is a colon-separated list of directories to search
          for libraries. You will probably have to use this option (and the
          corresponding "--with-includes" option) if you have packages
          installed in non-standard locations.

          Example: --with-libraries=/opt/gnu/lib:/usr/sup/lib.

     --with-pgport=NUMBER

          Set "NUMBER" as the default port number for server and clients.
          The default is 5432. The port can always be changed later on, but
          if you specify it here then both server and clients will have the
          same default compiled in, which can be very convenient. Usually
          the only good reason to select a non-default value is if you
          intend to run multiple PostgreSQL servers on the same machine.

     --with-CXX

          Build the C++ interface library.

     --with-perl

          Build the Perl interface module. The Perl interface will be
          installed at the usual place for Perl modules (typically under
          "/usr/lib/perl"), so you must have root access to perform the
          installation step (see step 4). You need to have Perl 5 installed
          to use this option.

     --with-python

          Build the Python interface module. You need to have root access to
          be able to install the Python module at its default place
          ("/usr/lib/pythonx.y"). To be able to use this option, you must
          have Python installed and your system needs to support shared
          libraries. If you instead want to build a new complete interpreter
          binary, you will have to do it manually.

     --with-tcl

          Builds components that require Tcl/Tk, which are libpgtcl,
          pgtclsh, pgtksh, PgAccess, and PL/Tcl. But see below about
          "--without-tk".

     --without-tk

          If you specify "--with-tcl" and this option, then programs that
          require Tk (pgtksh and PgAccess) will be excluded.

     --with-tclconfig=DIRECTORY, --with-tkconfig=DIRECTORY

          Tcl/Tk installs the files "tclConfig.sh" and "tkConfig.sh", which
          contain configuration information needed to build modules
          interfacing to Tcl or Tk. These files are normally found
          automatically at their well-known locations, but if you want to
          use a different version of Tcl or Tk you can specify the directory
          in which to find them.

     --enable-odbc

          Build the ODBC driver. By default, the driver will be independent
          of a driver manager. To work better with a driver manager already
          installed on your system, use one of the following options in
          addition to this one. More information can be found in the
          Programmer's Guide.

     --with-iodbc

          Build the ODBC driver for use with iODBC.

     --with-unixodbc

          Build the ODBC driver for use with unixODBC.

     --with-odbcinst=DIRECTORY

          Specifies the directory where the ODBC driver will expect its
          "odbcinst.ini" configuration file. The default is
          "/usr/local/pgsql/etc" or whatever you specified as
          "--sysconfdir". It should be arranged that the driver reads the
          same file as the driver manager.

          If either the option "--with-iodbc" or the option
          "--with-unixodbc" is used, this option will be ignored because in
          that case the driver manager handles the location of the
          configuration file.

     --with-java

          Build the JDBC driver and associated Java packages. This option
          requires Ant to be installed (as well as a JDK, of course). Refer
          to the JDBC driver documentation in the Programmer's Guide for
          more information.

     --with-krb4[=DIRECTORY], --with-krb5[=DIRECTORY]

          Build with support for Kerberos authentication. You can use either
          Kerberos version 4 or 5, but not both. The "DIRECTORY" argument
          specifies the root directory of the Kerberos installation;
          "/usr/athena" is assumed as default. If the relevant header files
          and libraries are not under a common parent directory, then you
          must use the "--with-includes" and "--with-libraries" options in
          addition to this option. If, on the other hand, the required files
          are in a location that is searched by default (e.g., "/usr/lib"),
          then you can leave off the argument.

          "configure" will check for the required header files and libraries
          to make sure that your Kerberos installation is sufficient before
          proceeding.

     --with-krb-srvnam=NAME

          The name of the Kerberos service principal. postgres is the
          default. There's probably no reason to change this.

     --with-openssl[=DIRECTORY]

          Build with support for SSL (encrypted) connections. This requires
          the OpenSSL package to be installed. The "DIRECTORY" argument
          specifies the root directory of the OpenSSL installation; the
          default is "/usr/local/ssl".

          "configure" will check for the required header files and libraries
          to make sure that your OpenSSL installation is sufficient before
          proceeding.

     --with-pam

          Build with PAM (Pluggable Authentication Modules) support.

     --enable-syslog

          Enables the PostgreSQL server to use the syslog logging facility.
          (Using this option does not mean that you must log with syslog or
          even that it will be done by default, it simply makes it possible
          to turn that option on at run time.)

     --enable-debug

          Compiles all programs and libraries with debugging symbols. This
          means that you can run the programs through a debugger to analyze
          problems. This enlarges the size of the installed executables
          considerably, and on non-GCC compilers it usually also disables
          compiler optimization, causing slowdowns. However, having the
          symbols available is extremely helpful for dealing with any
          problems that may arise. Currently, this option is recommended for
          production installations only if you use GCC. But you should
          always have it on if you are doing development work or running a
          beta version.

     --enable-cassert

          Enables assertion checks in the server, which test for many "can't
          happen" conditions. This is invaluable for code development
          purposes, but the tests slow things down a little. Also, having
          the tests turned on won't necessarily enhance the stability of
          your server! The assertion checks are not categorized for
          severity, and so what might be a relatively harmless bug will
          still lead to server restarts if it triggers an assertion failure.
          Currently, this option is not recommended for production use, but
          you should have it on for development work or when running a beta
          version.

     --enable-depend

          Enables automatic dependency tracking. With this option, the
          makefiles are set up so that all affected object files will be
          rebuilt when any header file is changed. This is useful if you are
          doing development work, but is just wasted overhead if you intend
          only to compile once and install. At present, this option will
          work only if you use GCC.

     If you prefer a C or C++ compiler different from the one "configure"
     picks then you can set the environment variables CC or CXX,
     respectively, to the program of your choice. Similarly, you can
     override the default compiler flags with the CFLAGS and CXXFLAGS
     variables. For example:

     env CC=/opt/bin/gcc CFLAGS='-O2 -pipe' ./configure


###############################################
Starting a session (assume database "testdb" already exists)
   $ psql testdb

Information:
   =>select current_user;
   =>select current_timestamp;

System:
- SQL keywords and unquoted identifiers are case-INSENSITIVE
- use ";" to end a statement
- "=>" is the first (primary) prompt, "->" is the continuation prompt until the statement is ended with ";"


PHP connection:
1. In data/postgresql.conf
      tcpip_socket=true

2. In data/pg_hba.conf:
      # TYPE   DATABASE   IP_ADDRESS       MASK              AUTH_TYPE  AUTH_ARGUMENT
      local    all                                           trust
      host     all        127.0.0.1        255.255.255.255   trust
      host     all        129.94.176.241   255.255.255.255   trust

3. Run "postmaster -i ........"




User:
1. Creating user from Unix shell:   createuser demouser1
2. Creating user from within psql:
   a) start the psql client:  psql
   b) creating new user: CREATE USER demouser2;
3. Changing user permissions:
        test=> ALTER USER demouser2 CREATEDB;
        test=> CREATE GROUP demogroup WITH USER demouser1, demouser2;
        test=> CREATE TABLE grouptest (col INTEGER);
        test=> GRANT ALL on grouptest TO GROUP demogroup;
        test=> \connect test demouser2
        You are now connected to database test as user demouser2.
        test=> \q

Commands:
;                        to end a statement
\g                       (go) to end a statement
\p                       (print) to display buffer contents
\r                       (reset) to erase or reset buffer
\q                       (quit) to exit psql
\l                       (list) to list databases in the system
\d                       to list all tables in the database
                         ORACLE: select * from cat
\d TABLE                 to list all attributes or columns of the TABLE
\connect <DB> <USER>     connect to database DB as user USER

\i FILE                  to read/run SQL script
\o FILE                  to print results to file called FILE
\o                       to switch output back to STDOUT
\t                       to switch off column titles when displaying query results
\z                       to see ownership of db objects


Data Types   PostgreSQL    Oracle      Description
----------------------------------------------------------------------
char string  CHAR(n)       CHAR(n)     blank-padded string, fixed storage
             VARCHAR(n)    VARCHAR2(n) variable storage length
----------------------------------------------------------------------
number       INTEGER                   integer, +/-2billion range
             FLOAT                     float pt, 15-digit precision
             NUMERIC(p,d)  NUMBER(p,d) user-defined precision and decimal
----------------------------------------------------------------------
date/time    DATE          DATE        date
             TIME                      time
             TIMESTAMP                 date and time

DATE - use
"show datestyle" or
"SET DATESTYLE TO 'ISO'|'POSTGRES'|'SQL'|'US'|'NONEUROPEAN'|'EUROPEAN'|'GERMAN'"

***** PostgreSQL has even more data types than listed here.




***************
Creating table
CREATE TABLE friend (
                     firstname CHAR(15),
                      lastname  CHAR(20)   );

Inserting Values
INSERT INTO friend VALUES (
        'Cindy',
        'Anderson'    );

Selecting Records
SELECT <attribute/column> FROM <relation/table>;
SELECT * FROM friend;
SELECT <attribute/column> FROM <relation/table> WHERE <attribute> <OP> <value>;
SELECT <attrib> FROM <relation> ORDER BY <attrib1, attrib2> DESC;
<OP> = {=, <, >, ........}
use "\t" in psql to omit the column titles.


Delete records
DELETE FROM <relation>;       !!! delete all rows
DELETE FROM <relation> WHERE <attribute> <OP> <value>;


Update
UPDATE <relation>  SET <attribute> = <value> WHERE <attribute> = <value>;


Destroying table
DROP TABLE <relation>;


Input / Output data using COPY
NULL is displayed as  \N
COPY table TO 'file' USING DELIMITERS '|';
COPY table FROM 'file';
COPY table FROM stdin;
COPY table TO stdout;
COPY table TO 'file' WITH NULL AS '';           NULL as blanks
COPY table FROM 'file' WITH NULL AS '';         NULL as blanks
COPY table FROM 'file' WITH NULL AS '?';        NULL as '?'

Copying across network
- use stdin, stdout or psql's \copy command

Wednesday, July 03, 2013

Notes BigData

Notes BigData
===============

Definition
Web Intelligence and Big Data course
Why Big Data
Hadoop Ecosystem
MapReduce
Miscellaneous
Analysis


Definition
===========
Ref:  http://www.intel.com.au/content/www/au/en/big-data/unstructured-data-analytics-paper.html

- All history until 2003 - 5 exabytes
- 2003 to 2012 - 2.7 zettabytes
- data generated by more sources and devices, including video
- data are UNSTRUCTURED: texts, dates, facts. Traditional analytics works on STRUCTURED data (RDBMS).
- Analytics = Profit. Gartner survey - those who use Big Data outperform competitors by 20%.



Web Intelligence and Big Data (WIBD) course
======================================
50 billion pages indexed by Google.



More surprising events make better news.
- if an event has probability p, then it carries
    information = log_2 (1/p) = -log_2 (p)  bits of information.

Mutual Information (MI) - between the transmitted and received signals of a channel
- need to maximise MI
- eg mutual information between Ad$ and Sales
- eg AdSense - given a web page, guess keywords.

IDF = inverse document frequency
- rare words make better keywords.
- IDF of word w = log_2 (N / N_w)
  where N = total number of docs, N_w = number of docs containing word w.

TF = Term Frequency
- number of times the term appears in that specific document.
- more frequent words (in that doc) make better keywords.
- TF = freq of w in doc d = n_w^d

TF-IDF = term freq x IDF = n_w^d x log_2 (N/N_w)
- words with high TF-IDF are good keywords.

Mutual Information between all pages and all words is proportional to
   SUM_d  SUM_w  {  n_w^d x log_2 (N/N_w)  }
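
A minimal Python sketch of TF-IDF scoring over a toy document collection (the documents and tokenisation are made up purely for illustration):

    import math
    from collections import Counter

    # toy corpus: document id -> list of words (made-up data)
    docs = {
        "d1": "big data needs new tools for big analysis".split(),
        "d2": "hadoop is a tool for big data".split(),
        "d3": "recipes for dinner tonight".split(),
    }

    N = len(docs)                          # total number of documents
    doc_freq = Counter()                   # N_w: number of documents containing word w
    for words in docs.values():
        doc_freq.update(set(words))

    def tf_idf(word, doc_id):
        tf = docs[doc_id].count(word)          # n_w^d: frequency of w in document d
        idf = math.log2(N / doc_freq[word])    # log_2(N / N_w)
        return tf * idf

    # words with high TF-IDF are good keywords for that document
    for w in sorted(set(docs["d1"])):
        print(w, round(tf_idf(w, "d1"), 3))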

Mutual Information:  Input F -- Machine Learning -- Output B
Feature F and Behaviour B: if they are independent, the mutual information is zero.
Entropy H(F), H(B)
Mutual Information I(F,B) = SUM_f SUM_b p(f,b) log_2 { p(f,b) / (p(f) p(b)) }
Shannon identity: I(F,B) = H(F) + H(B) - H(F,B)
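
A small Python sketch of the mutual information formula and the Shannon identity above (the joint distribution is made up for illustration):

    import numpy as np

    # joint distribution p(f,b) over a binary feature F and behaviour B (made-up numbers)
    p_fb = np.array([[0.30, 0.10],
                     [0.15, 0.45]])
    p_f = p_fb.sum(axis=1)                 # marginal p(f)
    p_b = p_fb.sum(axis=0)                 # marginal p(b)

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # I(F,B) = SUM_f SUM_b p(f,b) log_2( p(f,b) / (p(f) p(b)) )
    mi = sum(p_fb[i, j] * np.log2(p_fb[i, j] / (p_f[i] * p_b[j]))
             for i in range(2) for j in range(2) if p_fb[i, j] > 0)

    # Shannon identity: I(F,B) = H(F) + H(B) - H(F,B)
    assert abs(mi - (entropy(p_f) + entropy(p_b) - entropy(p_fb.ravel()))) < 1e-12
    print(round(mi, 4))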

WIBD - Naive Bayes
===================
Consider the problem: P(BUY / r,f,g,c) where r,f,g,c are features or keywords in web shopping.
Bayes Rule: P(B,R) = P(B/R).P(R) = P(R/B).P(B)
Naive Bayes assumes r,f,g,c are INDEPENDENT (given BUY / not BUY)
   - can derive the likelihood ratio L
      p(r/B)*p(c/B)*p(other features /B) ..... p(B)
      ---------------------------------------------  = L
      p(r/notB)*p(c/notB)*p(other features /notB) ..... p(notB)

      so if L > 1 we have a BUY, L < 1 then no Buy.
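
A minimal Python sketch of the likelihood ratio above; the prior and the per-feature probabilities are made-up numbers, just to show the mechanics:

    p_buy = 0.2                                                       # prior P(BUY)
    p_feat_given_buy    = {"r": 0.6, "f": 0.7, "g": 0.4, "c": 0.5}    # P(feature / BUY)
    p_feat_given_notbuy = {"r": 0.2, "f": 0.3, "g": 0.5, "c": 0.4}    # P(feature / notBUY)

    def likelihood_ratio(features):
        num = p_buy                    # numerator:   P(B)    * product of P(x/B)
        den = 1.0 - p_buy              # denominator: P(notB) * product of P(x/notB)
        for x in features:
            num *= p_feat_given_buy[x]
            den *= p_feat_given_notbuy[x]
        return num / den

    L = likelihood_ratio(["r", "f", "c"])
    print("BUY" if L > 1 else "no BUY", round(L, 3))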



WIBD - Learn
==============
input X = x1,x2,...xn  (n-dimensional)
output y = y1,y2, ... ym
function f(X) = E[Y/X] expectation
              = y1*P(y1/X) + y2*P(y2/X) + ... + ym*P(ym/X)
Classification(video 5.2)
- eg X=size, head, noise, legs, Y={animal names}
- eg X= {like, lot}, {hate, waste}, {not enjoy}, Y = {positive, negative}
Clustering - unsupervised
- allows us to get classes from the data. Need to choose the right features.
- used when we DON'T know the output classes to start with.
- by definition, clusters are regions MORE populated than random data would be
- add uniform random data with density P0(X); where the ratio r = P(X)/P0(X) is large there is clustering;
  then f(X) = E[Y/X] = r/(1+r), with y=1 for the real data and y=0 for the added random uniform data.
- find things that go together to form a cluster, e.g. negative sentiment: hate, bad - but no one needs to tell us they are negative to start with.
- other means of clustering: k-means, LSH
Rules
- finding which features are related (correlated) to each other, i.e. trying to cluster the features instead of clustering the data.
- compare against data whose features are independent: Po(X) =  P(x1) * P(x2) * ... * P(xn)
  where x1 = chirping, x2 = 4 legged, etc xi={animal features}
  eg P(Chirping) = number of chirping / number of total Data
- Associative Rule Mining
  if there are features, A,B,C,D, want to infer some rule, eg A,B,C => (infer) D
  high support P(A,B,C,D) > s;   technique is to find P(A)>s, P(B)>s etc first
  high confidence P(D/A,B,C) > c
  high interestingness P(D/A,B,C) / P(D)  > i   (see the sketch after this list)
- Recommendation of books - customers are features of books and vice versa.
  Use latent Models: matrix m x n = m x k TIMES k x n
                  eg people x books = people x genre TIMES genre x books
  NNMF - non-negative matrix factorization (a sketch follows below)
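
A small Python sketch of support, confidence and interestingness over a toy transaction list (the transactions and the rule A,B,C => D are made up for illustration):

    # toy transactions: each is the set of features observed together (made-up data)
    transactions = [
        {"A", "B", "C", "D"},
        {"A", "B", "C"},
        {"A", "B", "C", "D"},
        {"B", "C"},
        {"A", "B", "C", "D"},
    ]
    N = len(transactions)

    def prob(items):
        # empirical P(items): fraction of transactions containing all the items
        return sum(items <= t for t in transactions) / N

    support         = prob({"A", "B", "C", "D"})          # P(A,B,C,D)
    confidence      = support / prob({"A", "B", "C"})     # P(D / A,B,C)
    interestingness = confidence / prob({"D"})            # P(D / A,B,C) / P(D)
    print(round(support, 2), round(confidence, 2), round(interestingness, 2))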

(other example features for such rules: unemployment direction, interest rate direction, fraud)
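
The latent-model factorization mentioned above (people x books = (people x genre) TIMES (genre x books)) can be sketched with multiplicative-update NMF; the ratings matrix and the rank k below are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((6, 5))            # m x n matrix, e.g. people x books (made-up data)
    k = 2                             # number of latent "genres"
    W = rng.random((6, k))            # m x k: people x genre
    H = rng.random((k, 5))            # k x n: genre x books

    # Lee-Seung multiplicative updates keep W and H non-negative
    for _ in range(200):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

    print(np.abs(V - W @ H).mean())   # reconstruction error after the updates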

WIBD - Connect
===============
Logic Inference
if A then B    is SAME as ~A OR B  
Obama is president of USA:  isPresidentOf(Obama, USA) - predicates, variables
IF X is president of C  THEN X is leader of C:    IF  isPresidentOf(X,C) THEN isLeaderOf(X,C)
Query: If K then Q, consider that the query means ~K OR Q is TRUE,
       also same as K AND ~Q is FALSE.
       So proving K AND ~Q is FALSE proves If K Then Q (proof by refutation).
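
A tiny Python check of the propositional equivalences used above, over all truth assignments:

    from itertools import product

    # (K -> Q)  is the same as  (~K OR Q),  which is the same as  NOT(K AND ~Q)
    for K, Q in product([True, False], repeat=2):
        implies = (not K) or Q
        assert implies == (not (K and (not Q)))
    print("equivalences hold for all truth assignments")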

WIBD - Prediction
==============
Linear Least Squares Regression:
  data matrix x(i,j): j features, i-th data point with result yi for the i-th point
  f(X) = E[y/X] is found by minimising the squared error E[(y - f(x))^2],
  ... so let f(x) = x^T.w   where w is the vector of unknown coefficients (one per feature).
  Taking the vector derivative and equating it to zero ->  x^T.x.w - x^T.y = 0  (the normal equations)
  R^2 is used to measure the quality of the linear regression fit
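
A short Python sketch of the normal equations and R^2 (the data are randomly generated, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                 # 50 data points, 3 features
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=50)   # noisy targets (made-up model)

    # normal equations: (X^T X) w = X^T y
    w = np.linalg.solve(X.T @ X, X.T @ y)

    # R^2 to measure the fit
    residual = y - X @ w
    r2 = 1 - (residual @ residual) / ((y - y.mean()) @ (y - y.mean()))
    print(w.round(3), round(r2, 3))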
Non - Linear correlation
  - Logistic Regression - f(x) = 1 / (1 + exp(-w^T.x))
  - Support Vector Machines - Data may be high order correlated, eg parabolic correlation etc.
 Neural Networks
   - linear least squares
   - non-linear like logistic
   - feed-forward, multilayer
   - feed-back, like belief network
Which Prediction technique
FEATURES   TARGET  CORRELATION         TECHNIQUE
num        num     stable/linear       Linear Regression
cat        num                         Linear Regression, Neural Networks
num        num     unstable/nonlinear  Neural Networks
num        cat     stable/linear       Logistic Regression
num        cat     unstable/nonlinear  Support Vector Machines
cat        cat                         Support Vector Machines, Naive Bayes, Other Probabilistic Graphical Models



Why Big Data
==============
e.g. why Google (MapReduce), Yahoo (Pig), Facebook (Hive) had to invent a new stack
Challenges the new stack had to address
1. Fault tolerance
2. Variety of data types, e.g. images, videos, music
3. Managing data volumes without archiving (traditional systems need archives)
4. Parallelism was only an add-on in traditional systems
Disadvantages of the traditional stack
1. Could not scale
2. Not suited for compute-intensive deep analytics, e.g. in the web world
3. Price-performance challenge - the new stack instead uses commodity hardware and open source


Hadoop Ecosystem (See NotesHadoop)
===================================

MapReduce (See NotesHadoop)
=============


Miscellaneous
==============
About the speaker: Ross is Chief Data Scientist at Teradata and currently works with major clients throughout Australia and New Zealand to help them exploit the value of ‘big data’. He specialises in deployments involving non-relational, semi-structured data and analyses such as path analysis, text analysis and social network analysis. Previously, Ross was deputy headmaster of John Colet School for 18 years before working as a SAS analyst, a business development manager at Minitab Statistical Software, and founder and lead analyst at datamilk.com.

Ross Farrelly has a BSc (hons 1st class) in pure mathematics from Auckland University, a Masters in Applied Statistics from Macquarie University and a Masters of Applied Ethics from the Australian Catholic University.

Analysis
=========
path analysis
text analysis
social network analysis
natural language processing

Notes Hadoop

Notes Hadoop
=============

References
Famous Websites and their Big Data
Hadoop Ecosystem
MapReduce
Pig Latin
Hive
Twitter Case Study




References
===========
http://searchcloudcomputing.techtarget.com/definition/Hadoop
http://radar.oreilly.com/2011/01/what-is-hadoop.html
http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_?taxonomyId=9&pageNumber=3


Famous Websites and their Big Data
===================================
Facebook - data analytics built around Hive
LinkedIn - infrastructure built around Hadoop


Hadoop Ecosystem (See NotesHadoop)
===================================
Big Data - 3Vs (Volume, Velocity, Variety)

Hadoop

Hadoop Streaming - enables users to write the Map function and Reduce function in any language they want. This middleware component makes these functions work under the Hadoop ecosystem.

Sqoop - JDBC-based Hadoop to DB data movement facility. Can transfer from RDBMS to HDFS. Can transfer from HDFS to RDBMS.
- Use Case - Archiving old data. Using Sqoop, data from an RDBMS can be easily pushed to Hadoop clusters. Storing data in Hadoop instead of on tape archives is more cost-effective, provides fast access when needed, and uses one single technology for old and new data (hence a single set of know-how).

Hive - Enables users to use SQL to operate on Hadoop data. Hive contains only a subset of standard SQL. It may also be used to perform SQL joins with tables from different DB systems, e.g. a table from MySQL with another table from DB2 or even a spreadsheet.

Pig - "Apache Pig is a high-level procedural language for querying large semi-structured data sets using Hadoop and the MapReduce Platform.
Pig simplifies the use of Hadoop by allowing SQL-like queries to a distributed dataset." "instead of writing a separate MapReduce application, you can write a single script in Pig Latin that is automatically parallelized and distributed across a cluster. "

Fuse - middleware that allows users to access HDFS using standard file system commands (in Linux).

Flume-ng (next generation) - enables a load-ready file to be prepared and then transferred to an RDBMS using the RDBMS's high-speed loaders. This functionality overlaps with Sqoop.

Oozie - chains together multiple Hadoop jobs.

HBase - high-performance key-value store.

All are open source. Most of the components are Java-based, but that does not mean users need to program in Java.


MapReduce (See NotesHadoop)
=============
- Message passing, data parallel, pipelined work. Higher level compared to the traditional Shared Memory or Distributed Message Passing paradigms.
- the programmer needs to specify only the Mapper and the Reducer; message passing is handled by the implementation itself (a minimal word-count sketch is below).
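
A minimal word-count sketch in the Hadoop Streaming style, where the Mapper and Reducer are plain Python scripts reading stdin and writing tab-separated key/value pairs; the file names mapper.py / reducer.py and the exact jar invocation are assumptions:

    # mapper.py - emits "word<TAB>1" for every word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - Hadoop delivers the mapper output sorted by key; sum counts per word
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))

    # typically run with the Hadoop Streaming jar, along the lines of:
    #   hadoop jar hadoop-streaming-*.jar -input in/ -output out/ \
    #       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
    # (exact jar name and options depend on the Hadoop installation)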


Pig Latin
==========
http://www.ibm.com/developerworks/library/l-apachepigdataquery/#list1
http://pig.apache.org/docs/r0.7.0/tutorial.html#Pig+Tutorial+File


Hive
=====
Ref: [1] http://hive.apache.org/docs/r0.9.0/

What is Hive?
- A data warehouse infrastructure built on top of Hadoop.
- Provides tools to enable easy data ETL (Extract, Transform, Load).
- Puts structure on the data and provides the capability to query and analyse large data sets stored in Hadoop files.
- HiveQL is easy for people familiar with SQL.
- Enables MapReduce programmers to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
- Hive does not mandate that data be read or written in a "Hive format" - there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats.

What Hive is NOT?
- Because it is based on Hadoop, which is a batch processing system, Hive does not and cannot promise low latency on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In contrast to systems such as Oracle, where analysis is run on a significantly smaller amount of data but proceeds much more iteratively (with response times between iterations of less than a few minutes), Hive query response times are of the order of several minutes even for the smallest jobs, and larger jobs (e.g., jobs processing terabytes of data) may in general run for hours.

In summary, low-latency performance is not the top priority of Hive's design principles. What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with the MapReduce framework and UDF/UDAF/UDTF), fault tolerance, and loose coupling with its input formats.


Twitter Case Study
===================
Ref: "Large-Scale Machine Learning at Twitter"; Jimmy Lin and Alek Kolcz

Hadoop - at the core of the infrastructure.
Hadoop Distributed File System (HDFS) - data from other DBs, application logs, etc. are written in real time or batch-processed into HDFS.
Pig - analytics are done using Pig, a high-level dataflow language. Pig scripts are compiled into physical plans that are executed on Hadoop.