Rough IT notes, of no use to others


---
title: "Focused Learning compilation"
author: "Gopal Kumar"
date: "18 August 2015"
output: html_document
---

<!-- YAML comments -->

# Learning Markdown, pandoc, knitr, XeLaTeX, R, PostgreSQL, QGIS, Python, Android apps, CyanogenMod
1. Emacs 24, Markdown, pandoc, git, Hg, XeLaTeX, beamer, Impress.js
2. R, knitr, statistics, machine learning, big data, Hadoop, SQL/Hive
3. SQL, PostgreSQL, remote sensing, QGIS, R GIS
4. Python, C++, Java, JavaScript, Joomla, LAMP
5. Android apps, CyanogenMod, GSM/4G
6. Linux, networking, Kali Linux, TCP/IP, Linux From Scratch, Cisco router and switch programming, DNS BIND 8
7. Industrial IP, Openbus, Modbus, Raspberry Pi, Arduino, automation, artificial intelligence
8. Industrial drives, rotating machines, sensors and actuators, linear motors, IGBT, IGCT
9. Misc: HAM code, make, script programming
10. LibreOffice, Apache Server, ERP, webERP, webcollab
11. AutoCAD, FEA analysis, simulations
12. Nature, adventure, travel, sustainable development, law, history, geography, space, physics

## Good sites with my login http://vk.com/datascience
## Basic UNIX commands http://www.datasciencecentral.com/group/resources/forum/topics/data-science-cheat-sheet
## Install Cygwin and then
You don't need to spend hours learning UNIX; these Cygwin console basics suffice:
cd, pwd, ls, tail -100, head -150, cp, mv, mkdir, rmdir, wc, cat
grep: search files for lines matching a pattern
sort, uniq: sort alphabetically or numerically (options), drop duplicate lines
gzip: compress/uncompress files
wc: count lines, words, and characters
chmod: change file permissions
history: list previously run commands
cron, crontab: schedule tasks (e.g., running an executable once a day)

> (redirect), >> (append), | (the pipe), & (see section 2, used for background or batch mode), * (see examples) and ! (see example)

## Book I have read, and it has some data: http://www.win-vector.com/blog/introduction-to-data-science/
## GitHub: ggmtechn63

## Some sites
- datacentral <http://www.datasciencecentral.com/profiles/blogs/20-data-science-r-python-excel-and-machine-learning-cheat-sheets>
- Large data sets for use <https://www.quandl.com/search>
- Historical data visualisation <http://101.datascience.community/>
- Some massive data mining software, with more links <http://www.kdnuggets.com/>
**Data Science**
- Data Science Cheat Sheet – Basic
- Data Science Cheat Sheet – Advanced

**Hadoop**
- Hadoop for Dummies cheatsheet
- Getting Started Apache Hadoop Reference Card
- Hadoop Command Line cheatsheet
- Working with HDFS from the command line – Hadoop Cheat sheet

**R**
- R cheat sheet (Google Drive)
- R functions for Regression Analysis
- R functions for Time Series Analysis
- R Cheat Sheet
- Data Visualization with R
- Data Analysis the data.table way
- Data Visualisation with ggplot2 cheatsheet by RStudio

**Python**
- Python 2.7 Quick Reference Sheet
- Python Cheat Sheet by DaveChild
- Python Basics Reference sheet
- NumPy / SciPy / Pandas Cheat Sheet

**Machine Learning**
- Choosing the right estimator Machine Learning cheatsheet
- Patterns for Predictive learning cheatsheet
- Machine learning algorithm cheat sheet for Microsoft Azure
- Machine Learning cheatsheet GitHub 1
- Machine Learning cheatsheet GitHub 2
- Machine Learning: which algorithm performs best?
- Cheat sheet: 10 machine learning algorithms R commands

## Things a Linux user must learn
Learn bash: Just read the complete man page of bash (man bash).
Learn vim: You might be using Emacs or Eclipse for your work all the time, but nothing can compete with vim.
Learn ssh: Learn the basics of passwordless authentication.
Learn basics of bash job management: Using &, Ctrl-C, fg, bg, Ctrl-Z, jobs, kill.
Learn basic commands for file management: ls and ls -l, less, head, tail and tail -f, ln and ln -s (hard links and soft links), chown, mount, chmod, df, du (du -sk *).
Learn basic commands for network management: dig, ifconfig.
Learn how to use grep, find and sed.
Learn how to use aptitude or yum (depends on the distro) to find and install packages.

## For daily use
In bash, you may use Ctrl+R to search in command history.
In bash, you may use Ctrl+W to delete the last word, and Ctrl+U to delete the complete line.
Use cd – command to go back to the previous working directory.
Learn how to use xargs.

$ find . -name \*.py | xargs grep some_function

$ cat hosts | xargs -I{} ssh root@{} hostname

Use the pstree -p command to see the process tree. Learn the various signals, e.g. to suspend a process, use kill -STOP [pid]. Type man 7 signal in a terminal for the complete guide.
If you want to keep running a background process forever then you can use nohup or disown.
Use the netstat -lntp command to see which processes are listening on which ports.
You should also check out lsof.
In your bash script you can use subshells to group commands.

# Do something in current dir

(cd /some/other/dir; other-command)

# Continue in original dir

Trimming of strings: ${var%suffix} and ${var#prefix}. For example, if var=foo.pdf, then echo ${var%.pdf}.txt prints "foo.txt".
The output of a command can be treated like a file via <(some command). For example, compare local /etc/hosts with a remote one: diff /etc/hosts <(ssh somehost cat /etc/hosts)
Know about “here documents” in bash.
Learn how to redirect both standard output and standard error via: some-command >logfile 2>&1.
You should know about ASCII table (with hex and decimal values). Type man ascii in terminal.
While working remotely via ssh, you should use screen or dtach to save your session.
For web developers, curl, curl -I, wget, etc. are useful.
To convert an HTML page to a text file: lynx -dump -stdin
If you must handle XML, xmlstarlet is good.
In ssh, learn how to port tunnel with -L or -D (and occasionally -R). Also learn how to access web sites from a remote server.
If you were typing a command but then changed your mind, press Alt+Shift+3: it will add # at the beginning and enter the line as a comment.

## Data processing

Learn about sort and uniq.
Learn about cut, paste, and join.
Learn how to get union, intersection and difference of text files.

cat a b | sort | uniq > c # c is a union b
cat a b | sort | uniq -d > c # c is a intersect b
cat a b b | sort | uniq -u > c # c is set difference a - b
Summing all numbers in the second column of a text file: the awk one-liner below is probably 3x faster and 3x shorter than the equivalent Python.

awk '{ x += $2 } END { print x }'
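For comparison, a rough R equivalent (assuming a hypothetical whitespace-delimited file data.txt):

```r
# Read the file and sum its second column
sum(read.table("data.txt")[[2]])
```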

Learn about the strings and grep commands.
To split files into different parts learn about split (to split by size) and csplit (to split by a pattern).

## System debugging

To know the status of your disk, CPU or network, use iostat, netstat, top (or the better htop), and (especially) dstat.
To know your system's memory status, use the free and vmstat commands.
Use mtr, a network diagnostic tool.
To find out which process or socket is using bandwidth, try iftop or nethogs.
You may use the ab tool, which is helpful for quickly checking web server performance.
For more serious network debugging, use wireshark or tshark.
Learn how to use strace, and that you can strace a running process (with -p). This is helpful if your program is failing, hanging, or crashing, and you don’t know why.
Use the ldd command to check shared libraries.
Learn how to connect to a running process with gdb and get its stack traces.
Knowledge of /proc is very helpful. Examples: /proc/cpuinfo, /proc/xxx/smaps, /proc/xxx/exe, /proc/xxx/cwd, /proc/xxx/fd/.
Debugging why something went wrong in the past? Use the sar command: it collects, reports and saves system activity information.
## Emacs 24

Open directory C-x d, new file C-x C-f, save C-x C-s, save as C-x C-w, quit C-x C-c, new window below C-x 2, side by side C-x 3, insert file C-x i
Scroll screen C-v / M-v
Move by character/line C-b C-f C-n C-p, by word M-b M-f
Select (set mark) C-SPC, cut C-w, paste (yank) C-y
# Short summary of important commands
***This is a short summary of markdown commands***
------------
*******
<gopalkumar@email.com>
[this is web address](www.yahoo.com)
# R Markdown short notes

----- or ******** (horizontal rules)
~~strikeout~~
**bold** and *italics*
- itemised list
1. enumerate
<email>
[weblinktext](weblinkaddress.com)
# H1
## H2
###### H6

Alternatively,
Alt-H1
======

Alt-H2
——

[I'm an inline-style link with title](https://www.google.com "Google's Homepage")

[I’m a reference-style link][Arbitrary case-insensitive reference text]

[I’m a relative reference to a repository file](../blob/master/LICENSE)

[You can use numbers for reference-style link definitions][1]

Or leave it empty and use the [link text itself]

Some text to show that the reference links can follow later.

[arbitrary case-insensitive reference text]: https://www.mozilla.org
[1]: http://slashdot.org
[link text itself]: http://www.reddit.com
Here’s our logo (hover to see the title text):

Inline-style:
![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 1")

Reference-style:
![alt text][logo]

[logo]: https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 2"
Markdown supports highlighting for many languages (and not-really-languages, like diffs and HTTP headers); see the highlight.js demo page.

Inline `code` has `back-ticks around` it.

Inline code has back-ticks around it.

Blocks of code are enclosed by three back-ticks ```

```javascript
var s = "JavaScript syntax highlighting";
alert(s);
```

```python
# If no language is indicated there is no syntax highlighting, but you can use HTML tags like <b>tag</b>
s = "Python syntax highlighting"
print s
```

```
No language indicated, so no syntax highlighting.
But let's throw in a <b>tag</b>.
```

var s = "JavaScript syntax highlighting";
alert(s);

s = "Python syntax highlighting"
print s
**Tables**
The outer pipes (|) are optional, and you don't need to make the raw Markdown line up. You can also use inline Markdown.

Markdown | Less | Pretty
--- | --- | ---
*Still* | `renders` | **nicely**
1 | 2 | 3

**Blockquotes**
> Blockquotes are very handy in email to emulate reply text.
> This line is part of the same quote. Quotes can also cascade.

Quote break.

**You can also use raw HTML in your Markdown, and it'll mostly work pretty well.**

Horizontal Rule

Three or more...

---

Hyphens

***

Asterisks

___

Underscores

Insert image links in pure Markdown, but you lose the image sizing and border:

<!-- sample to be linked
[![IMAGE ALT TEXT HERE](http://youtube.com/YOUTUBE_VIDEO_ID_HERE/0.jpg)](http://www.youtube.com/watch?v=YOUTUBE_VIDEO_ID_HERE)

-->

Lists, unordered: you may use any of the following symbols to denote bullets for each list item:

* valid bullet
- valid bullet (can be nested inside)
+ valid bullet
Inline code

Wrap inline snippets of code with `.

For example, <section></section> should be wrapped as "inline".

For example, `<section></section>` should be wrapped as "inline".

Indented code

Or indent several lines of code by at least four spaces, as in:

    // Some comments
    line 1 of code
    line 2 of code
    line 3 of code

Block code "fences"

Use "fences" ``` to block in multiple lines of code.

``` html
Sample text here...
```

Sample text here…

HTML:

<pre>
<p>Sample text here…</p>
</pre>

Syntax highlighting

GFM, or "GitHub Flavored Markdown", also supports syntax highlighting. To activate it, simply add the name of the language you want to use directly after the first code "fence" (e.g. ```js), and syntax highlighting will automatically be applied in the rendered HTML. For example, to apply syntax highlighting to JavaScript code:

``` javascript
grunt.initConfig({
  assemble: {
    options: {
      assets: 'docs/assets',
      data: 'src/data/*.{json,yml}',
      helpers: 'src/custom-helpers.js',
      partials: ['src/partials/**/*.{hbs,md}']
    },
    pages: {
      options: {
        layout: 'default.hbs'
      },
      files: {
        './': ['src/templates/pages/index.hbs']
      }
    }
  }
});
```

Renders to this complicated HTML:

Tables

Create tables by adding pipes as dividers, and a line of dashes (separated by bars) beneath the header row.
Pipes need not be vertically aligned.

| Option | Description |
| ------ | ----------- |
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |

Align: a colon on one side of the dashes below any heading will align the text for that column.

| Option | Description |
| ------:| -----------:|
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |

Option Description
data path to data files to supply the data that will be passed into templates.
engine engine to be used for processing templates. Handlebars is the default.
ext extension to be used for dest files.
Links with title

[Upstage](https://github.com/upstage/ "Visit Upstage!")

Renders to (hover over the link, there should be a tooltip):
Upstage. HTML: <a href="https://github.com/upstage/" title="Visit Upstage!">Upstage</a>

Named Anchors enable you to jump to a specified anchor point on the same page, e.g.

# Table of Contents
* [Chapter 1](#chapter-1)
* [Chapter 2](#chapter-2)
* [Chapter 3](#chapter-3)

will jump to these sections:

## Chapter 1 <a id="chapter-1"></a>
Content for chapter one.

## Chapter 2 <a id="chapter-2"></a>
Content for chapter two.

## Chapter 3 <a id="chapter-3"></a>
Content for chapter three.

NOTE that the specific placement of the anchor tag seems to be arbitrary.
They are placed inline here since that seems unobtrusive, and it works.

## Footnotes
Type the marker [^1] in the text.
Type the footnote key at the end of a long document:
[^1]: Cupcake Ipsum is fun text.

[^2]: [Cupcake Ipsum](http://www.cupcakeipsum.com/#)
## Images

Images have a similar syntax to links but include a preceding exclamation point.

![Minion](http://octodex.github.com/images/minion.png)

or

![Alt text](http://octodex.github.com/images/stormtroopocat.jpg "The Stormtroopocat")

Like links, images also have a footnote-style syntax:

![Alt text][id]

With a reference later in the document defining the URL location:

[id]: http://octodex.github.com/images/dojocat.jpg "The Dojocat"

The above cheatsheet was noted from http://assemble.io/docs/Cheatsheet-Markdown.html
(the site is about static blog generation?)
**Markup: http://johnmacfarlane.net/pandoc/README.html**
**Write and publish a book**
A very good detailed write-up at http://www.aristeia.com/authorAdvice.html
****
**Resources for Android**
Good resources and directions at http://wiki.cyanogenmod.org/w/Doc:_Development_Resources
**Ditch MS Word**
http://inundata.org/2012/12/04/how-to-ditch-word/
Software:
Pandoc – format converter
Mendeley – reference manager, export to bib
Markdown editor
knitr – to insert data tables

Citations
To cite a reference, add it in like so:
some statement [@Costello2009].
statement with multiple citations [@Costello2009; @Costello2010].
Compile with citations:
pandoc document.md -o document.pdf --bibliography citations.bib
With formatting specific to a journal?
Download the citation style from here and drop it into your folder, then specify that style during document generation:
pandoc document.md -o document.pdf --bibliography cite.bib --csl style.csl
You can do a lot more, like adding results, tables, figures, and equations using MathJax, but I'll save the more advanced stuff for a future post.
-----------
Usage: make [options] [target] ...
Options:
--always-make               Unconditionally make all targets.
--directory=DIRECTORY       Change to DIRECTORY before doing anything.
--file=FILE, --makefile=FILE  Read FILE as a makefile.
--include-dir=DIRECTORY     Search DIRECTORY for included makefiles.
--keep-going                Keep going when some targets can't be made.
--print-data-base           Print make's internal database.
-----------------------
****
Very low-level parts, like the kernel, libc (aka bionic), and many Linux-ish parts, are in C.
Low-level and 3rd-party components are in C or C++: ART (the Android Runtime for end-user programs), net tools, sound, shell, graphics drivers, etc.
The interacting, user-facing Android "framework", like UI elements and most apps, is in Java.

.mk files, Makefiles, and the /build directory create a flashable .zip from source; the build machinery is primarily located in the /build directory.
The various components/programs which together make up Android are each built independently through Android-specific Android.mk files.
An Android.mk generally exists for each sub-project (or "module") in its source-code directory.
This file directs the build system on exactly how to build that module, and where to put it in Android's directory structure.
The files, once built, go into the /out/target/project/CODENAME directory (CODENAME is the code name of the device).
From there, they are zipped up, and the final flashable (by recovery) .zip and flashable (by fastboot) .img files are produced.

You can peek at what's been built in /out, as the directories that are turned into the .img and .zip files are still around.
In addition to the /build directory, the Android source code is organized into a hierarchy of folders.
Take a look here at a brief description of what’s where in the source code.
The $OUT directory

Helpful Tip: After you build, you can type cd $OUT to automatically go to the /out/target/project/CODENAME directory.

kernel - This is the kernel, obviously.
/system - all the stuff that will become the /system folder on Android.
/root - files that are turned into the ram disk loaded and run by the kernel. The first program run by the kernel is called init, and it uses the init.rc and init.CODENAME.rc files to determine what happens next. See a discussion of that here.

/recovery/root - The ramdisk that contains the recovery mode is here.

Shortcut commands every CM dev should know (on your computer, not the device).
$ . build/envsetup.sh - note the "." at the beginning. This loads environment variables and shortcut aliases into your shell.

To know more about "$ . build/envsetup.sh" and about the breakfast, brunch and lunch commands, see the Envsetup_help page.

croot - this command will take you to the root of the source code.
mm and mm -B - this is the "make module" command,
very useful if you are working on a particular module and don't need to rebuild everything. cd into the directory that you want to test, then just type mm to build only the module in the working directory. To build from scratch, add the -B.
This is a good companion to adb sync system below, which you can use to push the newly built files directly to your device for testing without having to reflash everything.

Make
make modules - this command will show all available targets. You can build a single one with make my_target.
make showcommands - this command will enable the verbose build mode.
ADB. See also: adb
adb shell - a command-line shell inside your device.
adb reboot - quickly make your device reboot.
adb remount - if pushing files to /system fails because it is mounted read-only, adb remount will remount /system read-write (you need root permissions), in lieu of mount -o rw,remount /system (as root) or something similar.
adb sync system - a powerful command that will sync the contents of $OUT/system with the /system directory on your device. Needs write access to /system.
Useful for making and testing quick changes on your device.

**Android: learn http://wiki.cyanogenmod.org/w/Doc:_Development_Resources**

****

## Pandoc usage. Read the guide: [pandoc README](http://johnmacfarlane.net/pandoc/README.html)

> pandoc test1.md -f markdown -t html -s -o test1.html # -f input format, -t output format, -o output file (the extension picks the format automatically)
> On the command line, pandoc by default accepts markdown text (close input with Ctrl-D) and outputs HTML; use the -f flag for other input formats.
## Compile with: pandoc inputfile.oldtype -t translateto -o outputfile.newtype
# These are the good packages to remember

install.packages("<the package's name>") # R will download the package from CRAN

Once installed, it is available for use by running: library("<the package's name>")

1. To load data: RODBC, RMySQL, RPostgresSQL, RSQLite
XLConnect, xlsx – read/write Microsoft Excel files from R. Or export spreadsheets from Excel as .csv's.
foreign – SAS / SPSS data sets
No package required for read.csv, read.table, and read.fwf.

2. To manipulate data:
plyr – shortcuts for subsetting, summarizing, rearranging, and joining together data sets; "groupwise" ops
reshape2 – changing the layout of data sets; its melt function converts your data to the preferred long format.
stringr – easy-to-learn tools for regular expressions and character strings.
lubridate – tools that make working with dates and times easier.

3. To visualize data: ggplot2; rgl – interactive 3D visualizations; googleVis – lets you use Google Charts

4. To model data
car – car’s Anova function is popular for making type II and type III Anova tables.
mgcv – Generalized Additive Models
lme4/nlme – Linear and Non-linear mixed effects models
randomForest – Random forest methods from machine learning
multcomp – Tools for multiple comparison testing
vcd – Visualization tools and tests for categorical data
glmnet – Lasso and elastic-net regression methods with cross validation
survival – Tools for survival analysis
caret – Tools for training regression and classification models

5. To report results
shiny – Easily make interactive, web apps with R.
knitr – Write R code in your LaTeX or markdown (e.g. HTML) documents. Automated reporting. knitr is in RStudio.
xtable – Takes an R object (e.g. a data frame) and returns LaTeX or HTML code to copy and paste, or pair it up with knitr.
slidify – Slidify lets you build HTML5 slide shows straight from R. Write your slides with R and markdown.

6. For Spatial data
sp, maptools – Tools for loading and using spatial data including shapefiles.
maps – Easy to use map polygons for plots.
ggmap – Download street maps straight from Google maps and use them as a background in your ggplots.

7. For Time Series and Financial data
zoo – Provides the most popular format for saving time series objects in R.
xts – Very flexible tools for manipulating time series data sets.
quantmod – Tools for downloading financial data, plotting common charts, and doing technical analysis.
8. To write high performance R code
Rcpp – Write R functions that call C++ code for lightning fast speed.
data.table – An alternative way to organize data sets for very, very fast operations. Useful for big data.
parallel – Use parallel processing in R to speed up your code or to crunch large data sets.
9. To work with the web
XML – Read and create XML documents with R
jsonlite – Read and create JSON data tables with R
httr – A set of useful tools for working with http connections
10. To write your own R packages
devtools – An essential suite of tools for turning your code into an R package.
testthat – An easy way to write unit tests for your code projects.
roxygen2 – Quickly document your R packages. roxygen2 turns inline code comments into documentation pages.
Read about the entire package development process online in Hadley Wickham's Advanced R Programming book.

**Noted from https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages**
## Learning the dplyr package of Hadley:
install.packages("dplyr")
You'll probably also want to install the data packages used in most examples: install.packages(c("nycflights13", "Lahman")).

Read the intro vignette: vignette("introduction", package = "dplyr").
To make the most of dplyr, I also recommend that you familiarise yourself with the principles of tidy data:
Tidy data vs messy data
Every value belongs to a variable and an observation.
A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Fixed variables describe the experimental design and are known in advance. Computer scientists often call fixed variables dimensions, and statisticians usually denote them with subscripts on random variables.
Measured variables are what we actually measure in the study.
Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous.
Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables.

Five most common problems with messy datasets, along with their remedies:
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
• Multiple types of observational units are stored in the same table.
• A single observational unit is stored in multiple tables.

These are tidied with a small set of tools: melting, string splitting, and casting; a sketch of melting is below.
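A minimal sketch of the melting step with reshape2 (the wide table here is hypothetical):

```r
library(reshape2)

# Messy: column headers (years) are values, not variable names
wide <- data.frame(country = c("A", "B"),
                   "1999" = c(745, 37737),
                   "2000" = c(2666, 80488),
                   check.names = FALSE)

# Tidy: melt the year columns into a single year/cases pair of columns
long <- melt(wide, id.vars = "country",
             variable.name = "year", value.name = "cases")
```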

Data manipulation: four fundamental verbs:
• Filter: subsetting or removing observations based on some condition. R function: subset().
• Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume). R function: transform().
• Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means). R function: aggregate().
• Sort: changing the order of observations.
The plyr package provides tidy summarise() and arrange().

The join operator works by matching common variables and adding new columns: merge() in base R, or the join() function in plyr. A base-R sketch of these verbs is below.
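A quick base-R sketch of the four verbs plus a join, using the built-in airquality data and a made-up lookup table:

```r
# Filter: keep observations meeting a condition
hot <- subset(airquality, Temp > 80)

# Transform: add a derived variable
hot <- transform(hot, TempC = (Temp - 32) * 5 / 9)

# Aggregate: collapse to one mean Temp per Month
monthly <- aggregate(Temp ~ Month, data = airquality, FUN = mean)

# Sort: reorder observations
sorted <- airquality[order(airquality$Temp), ]

# Join: match on the common Month variable (the lookup table is made up)
months <- data.frame(Month = 5:9, Name = c("May", "Jun", "Jul", "Aug", "Sep"))
merge(monthly, months, by = "Month")
```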

# The key operator of dplyr is tbl.
library(dplyr) # for functions
library(nycflights13) # for the flights data

# Caches data in a local SQLite db
flights_db1 <- tbl(nycflights13_sqlite(), "flights")

# Caches data in a local postgres db
flights_db2 <- tbl(nycflights13_postgres(), "flights")

carriers_df <- group_by(flights, carrier)
carriers_db1 <- group_by(flights_db1, carrier)
carriers_db2 <- group_by(flights_db2, carrier)

dplyr implements the following verbs useful for data manipulation:

select(): focus on a subset of variables
filter(): focus on a subset of rows
mutate(): add new columns
summarise(): reduce each group to a smaller number of summary statistics
arrange(): re-order the rows
dplyr implements the four most useful joins from SQL:

inner_join(x, y): matching x + y
left_join(x, y): all x + matching y
semi_join(x, y): all x with match in y
anti_join(x, y): all x without match in y

And provides methods for:

intersect(x, y): all rows in both x and y
union(x, y): rows in either x or y
setdiff(x, y): rows in x, but not y
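A minimal sketch of the joins and set operations, assuming two small hypothetical data frames:

```r
library(dplyr)

x <- data.frame(carrier = c("AA", "UA", "XX"), flights = c(10, 20, 5),
                stringsAsFactors = FALSE)
y <- data.frame(carrier = c("AA", "UA"), name = c("American", "United"),
                stringsAsFactors = FALSE)

inner_join(x, y, by = "carrier") # only carriers present in both
left_join(x, y, by = "carrier")  # all of x; NA name where y has no match
semi_join(x, y, by = "carrier")  # rows of x that have a match in y
anti_join(x, y, by = "carrier")  # rows of x without a match in y

# The set operations expect identical columns
a <- data.frame(n = 1:3)
b <- data.frame(n = 2:4)
intersect(a, b) # rows in both
union(a, b)     # rows in either
setdiff(a, b)   # rows in a, but not b
```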

I recommend loading plyr first, then dplyr, so that the faster dplyr functions come first in the search path, as shown below.
Functions provided by both dplyr and plyr work in a similar way, but the dplyr versions tend to be faster.
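For example (assuming both packages are installed):

```r
library(plyr)  # load plyr first
library(dplyr) # load dplyr second so its functions mask plyr's in the search path
```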
Data Manipulation with dplyr (from DataScience+): dplyr provides some great, easy-to-use functions for performing exploratory data analysis and manipulation.
The airquality dataset covers New York from May 1973 to September 1973.

head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2

load up the two packages:

library(datasets)
library(dplyr)

Functions.

## Filter: returns the rows that satisfy a given condition, e.g. all the rows where Temp is larger than 80 and Month is after May.

filter(airquality, Temp > 80 & Month > 5)
Ozone Solar.R Wind Temp Month Day
1 NA 286 8.6 78 6 1
2 NA 287 9.7 74 6 2
3 NA 186 9.2 84 6 4

## Mutate: adds new variables, e.g. temperature in Celsius.

mutate(airquality, TempInC = (Temp - 32) * 5 / 9)
Ozone Solar.R Wind Temp Month Day TempInC
1 41 190 7.4 67 5 1 19.44444
2 36 118 8.0 72 5 2 22.22222
3 12 149 12.6 74 5 3 23.33333

## Summarise: collapses multiple values into a single value. It is powerful when used in conjunction with the other functions. na.rm = TRUE removes all NA values before calculating the mean.

summarise(airquality, mean(Temp, na.rm = TRUE))
mean(Temp)
1 77.88235

## Group By: groups by one or more variables. Below, the data is grouped by Month, and then the summarise function is used to calculate the mean temperature in each month.

summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))
Month mean(Temp)
1 5 65.54839
2 6 79.10000
3 7 83.90323

## Sample: selects random rows from a table, by count or by fraction.

sample_n(airquality, size = 10)
sample_frac(airquality, size = 0.1)

## Count: tallies observations based on a group. It is similar to the table function in the base package. For example:

count(airquality, Month)
Month n
1 5 31
2 6 30
3 7 31
4 8 31
5 9 30
## Arrange: arranges the rows, here in descending order of Month and then in ascending order of Day.

arrange(airquality, desc(Month), Day)
Ozone Solar.R Wind Temp Month Day
1 96 167 6.9 91 9 1
2 78 197 5.1 92 9 2
3 73 183 2.8 93 9 3
4 91 189 4.6 93 9 4
5 47 95 7.4 87 9 5
6 32 92 15.5 84 9 6

## Pipe: represented by %>%, it can be used to chain code together. It is very useful when you are performing several operations on data and don't want to save the output at each intermediate step.

For example, let’s say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month. The conventional way to write the code for this would be:

filteredData <- filter(airquality, Month != 5)
groupedData <- group_by(filteredData, Month)
summarise(groupedData, mean(Temp, na.rm = TRUE))

With piping, the above code can be rewritten as:

airquality %>%
filter(Month != 5) %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm = TRUE))

This is a very basic example; as the number of operations/functions performed on the data increases, the pipe operator becomes more and more useful!
## Pig Latin

Pig Latin statements work with relations. A relation can be defined as follows:
A relation is a bag (more specifically, an outer bag).
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.

PIG LATIN:
A Pig relation is a bag of tuples.
A Pig relation is like a table in a relational database, where the tuples in the bag correspond to the rows in a table.
Unlike a relational table, however, Pig relations don’t require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Also note that relations are unordered which means there is no guarantee that tuples are processed in any particular order.
Furthermore, processing may be parallelized in which case tuples are not processed according to any total ordering.
Check out the float package. It adds the ability to use [H] for forced "here" float placement. You can also select this as the automatic default with \floatplacement{figure}{H}. float introduces a placement option H enforcing the placement exactly at that point.

\begin{figure}[!htbp] allows more options.
The placeins package provides the command \FloatBarrier to limit the floating of figures or tables, placing a barrier before and after a listing.

\begin{tabular}[pos]{table spec}
l c r p{width} paragraph column with text vertically aligned at the top
m{width} paragraph column with text vertically aligned in the middle (requires array package)
b{width} paragraph column with text vertically aligned at the bottom (requires array package)
width: any unit or command length, such as \textwidth
pos: b bottom; c center (default); t top
To specify a font format (such as bold, italic, etc.) for an entire column, add >{\format} before you declare the alignment:
\begin{tabular}{ >{\bfseries}l c >{\itshape}r }

& column separator; \\ starts a new row (additional space with \\[6pt])
\hline horizontal line; \newline starts a new line within a cell (in a paragraph column)
\cline{i-j} partial horizontal line beginning in column i and ending in column j

Manually broken paragraphs in table cells: specify the line breaks by hand using \parbox:

\begin{tabular}{cc}
boring cell content & \parbox[t]{5cm}{rather long par\\new par}
\end{tabular}

Alter the column separation: \setlength{\tabcolsep}{5pt}. The default value is 6pt.
Space between rows: re-define the \arraystretch command to set the space between rows:
\renewcommand{\arraystretch}{1.5} % default value is 1.0

Define many identical columns at once using the *{num}{str} syntax:
\begin{tabular}{l*{6}{c}r} has 1+6+1 = 8 columns

@-expressions: the column separator can be specified with the @{...} construct, e.g. \begin{tabular}{r@{.}l}
The above is used for aligning numbers on the decimal point, using the point in place of the cell wall {r@{.}l}.

Rows spanning multiple columns: \multicolumn{num_cols}{alignment}{contents}.
num_cols is the number of columns to merge; alignment is either l, c, r, or, to have text wrapping, specify a width p{5.0cm}.

Columns spanning multiple rows: add \usepackage{multirow} to the preamble,
then \multirow{num_rows}{width}{contents}
Resize tables: the graphicx package features the command \resizebox{width}{height}{object}; \scalebox{ratio}{object} is similar but needs ratios.
\usepackage{graphicx}
% ...
\resizebox{8cm}{!}{ \begin{tabular}...whatever you want in the table...
\end{tabular} }
\usepackage[table]{xcolor}: use \rowcolors{<starting row>}{<odd color>}{<even color>}

In the table: \begin{center}
\rowcolors{1}{green}{pink}
\usepackage{tabularx}, \begin{tabularx}{\textwidth}{ |X|X|X|X| } is stretched to make the table as wide as specified by \textwidth

\usepackage{tabulary} ... \begin{tabulary}{0.7\textwidth}{LCL} to balance the column widths
The tabu environment: like tabularx. 'to \linewidth' specifies the target width. The X parameter takes an optional span factor.
\begin{tabu} to \linewidth {llX[2]lllXl} ... \end{tabu}

The caption package: two forms can be used
\caption{A normal caption}
\caption*{A legend; even a table can be used: \begin{tabular}{l l} item 1 & explanation 1 \\ \end{tabular} }

tabular is for the content itself (columns, lines, etc.).
table is for the location of the table on the document, plus caption and label support.
\begin{table}[position specifier] \centering \begin{tabular}{|l|}

To keep a specific subsection out of the index use: \subsection*{…}
To remove all subsections from the TOC only, use: \tableofcontents[hideallsubsections]

you can hide all subsections inside the toc if you set the counter tocdepth to 1:
\setcounter{tocdepth}{1}
\setcounter{tocdepth}{2} Increase 2 if you want to list subsubsection, etc.

\tableofcontents

% To hide subsections in the toc worked good :
\addtocontents{toc}{\protect\setcounter{tocdepth}{1}} % to hide subsections in the toc:
\subsection{This subsection is numbered but not shown in the toc}
\addtocontents{toc}{\protect\setcounter{tocdepth}{2}} %from now on subsections are shown like usual

\includegraphics[scale=0.1,angle=90]{Appendix2.pdf}

%rotate
\pagebreak[4]
\global\pdfpageattr\expandafter{\the\pdfpageattr/Rotate 90}

Newcommand with arguments in LaTeX:
\newcommand{\ntag}[args]{whatevercode}: replaces \ntag with whatevercode
\newcommand{\vfrac}[2]{\ensuremath{\frac{#1}{#2}}} % use #1, #2, etc.
A set of {} for each argument. Use \ensuremath{} to put 'math' into regular text!
# Mercurial DVCS; learn site: http://hginit.com/01.html

hg init % -> there will be a new directory named .hg
hg clone recipes recipes-experiment

hg add
hg commit % pops up an editor to let you type a commit message (or pass -m 'commit message')

hg log % see a history of changes
hg status % files that have changed
hg cat -r 0 a.txt % print revision -r of a file
hg diff % what's changed in a file since the last commit
hg diff -r 0:1 a.txt % print the difference between any two revisions

hg remove filename % whenever you remove (or add) a file, tell Mercurial
hg up -r 1

hg revert --all % immediately reverts your working directory to the last commit
hg rollback

-------------
# Read XLS
install.packages("readxl") # there's not really much to say about how to use it:
library(readxl) # use an Excel file included in the package
sample <- system.file("extdata", "datasets.xlsx", package = "readxl")

# Read by position
head(read_excel(sample, 2))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4

# Or by name:
excel_sheets(sample)
#> [1] "iris" "mtcars" "chickwts" "quakes"
head(read_excel(sample, "mtcars"))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# R for Excel users

install.packages("ggplot2")
install.packages("dplyr")

library(ggplot2)
library(dplyr)
diamonds <- data.frame(diamonds)

## Renaming some column labels
names(diamonds)[8] <- "length"
names(diamonds)[9] <- "width"
names(diamonds)[10] <- "depth"

## Calculated columns
diamonds <- mutate(diamonds, cubic = length * width * depth)
# (or: diamonds$cubic <- diamonds$length * diamonds$width * diamonds$depth)
## Column summaries: in R, you would use the colMeans function.
colMeans(diamonds[, c(1, 5:11)])

## Let's round the carat values to the nearest 0.25 carat.
diamonds$carat2 <- round(diamonds$carat / .25) * .25

## Pivot table: in R, use the reshape2 package and its dcast function to get our data into the same pivot-table format.
install.packages("reshape2")
library(reshape2)

pivot_table <- dcast(diamonds[, c('color', 'clarity', 'price')], color ~ clarity, mean)
## VLOOKUP(A2,K2:K50,2,0): Excel looks up the value in A2 in column K and returns the value in the column next to the matching value.
In R, use the merge function,
e.g. to calculate how far above or below a diamond's price is compared to the average for its cut, color, clarity, and carat.
In this case, our data set A will be the diamonds data frame, and data set B will be the Summary data frame.
First, let's change the name of the price column in the Summary data frame to avgprice. This way, we won't have two price fields when we bring it over.

names(Summary)[7] <- "avgprice"

Next, let's merge the data sets and bring over the average price.

diamonds <- merge(diamonds, Summary[, c(1:4, 7)], by.x = c("cut", "color", "clarity", "carat2"), by.y = c("cut", "color", "clarity", "carat2"))

We merged the diamonds data frame with just the columns that we needed from the Summary data frame, and the result was that it added the avgprice field to our diamonds data frame.
## Conditional Statements

Excel users also periodically use conditional (IF) statements for filling in values according to whether certain conditions are met. R is also very good for doing this.

Let’s say we wanted to categorize diamonds into size categories such as small, medium, and large based on their carat weight.

diamonds$size[diamonds$carat < 0.5] <- "Small"
diamonds$size[diamonds$carat >= 0.5 & diamonds$carat < 1] <- "Medium"
diamonds$size[diamonds$carat >= 1] <- "Large"
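The same categorization can be written as one vectorized expression with nested ifelse() calls:

```r
# Equivalent nested-ifelse version of the size categories above
diamonds$size <- ifelse(diamonds$carat < 0.5, "Small",
                 ifelse(diamonds$carat < 1, "Medium", "Large"))
```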

## Graphs
barplot(table(diamonds$size), main="Diamond Size Distribution", xlab="Size Category", ylab="Number of Diamonds", col="blue")

Line charts:
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = color, colour = color)) +
labs(x="Clarity", y="Number of Diamonds", title="Clarity by Color")
Scatter plot:
ggplot(diamonds, aes(carat, price, color=clarity)) + geom_point() +
labs(x="Carat Weight", y="Price", title="Price by Carat Weight")
# scan reads a vector
data <- matrix(scan("birth.txt"), nrow=2, byrow=TRUE)

read.table(file="clipboard")

## RODBC
library(RODBC)
connection <- odbcConnect("<DSN>")
Once you have set up your connection, you can also use the sqlQuery() function to get data from .xls spreadsheets:
query <- "<SQL Query>"
data <- sqlQuery(connection, query)
str(data)
At the end of an R session, don't forget to close the connections:
odbcCloseAll()

# DIF
Use the read.DIF() function to get your DIF files into R:

data <- read.DIF("<your spreadsheet>", header=FALSE, as.is = !stringsAsFactors)
# Read xls
library(readxl)
read_excel("<path to file>")

library(readODS)
read.ods("<path to your file>", sheet = 1, formulaAsFormula = FALSE)

library(jsonlite)
data <- fromJSON("<path to your JSON file>")

For a well-explained quickstart with the jsonlite package, go

library(RJSONIO)
data <- fromJSON("<path to your JSON file>")
## For large data: fread. This is different from read.table(), which creates a data frame of your data.
library(data.table)
data <- fread("http://assets.datacamp.com/blog_assets/chol.txt")
data <- fread("http://assets.datacamp.com/blog_assets/chol.txt", sep="auto", nrows=-1, na.strings=c("NA","N/A",""), stringsAsFactors=FALSE)

# sqldf
library(sqldf)
bigdata <- read.csv.sql(file="<path to your file>",
sql="select * from file where ...",
colClasses=c("character", rep("numeric", 10)))

Certified Big Data Analyst Hadoop Certification Courses
This big data analytics & Hadoop training program extensively covers big data and predictive analytics techniques using R and Hadoop. Candidates will get practical hands-on training on cutting-edge tools and big data platforms, like R and Hadoop (MapReduce, HBase, Hive, Pig, Oozie, Sqoop and Flume).
This big data online training is crafted by experts using real-life business datasets. As part of this program candidates get access to the virtual lab and several case studies on big datasets for extensive hands-on practice. At the end of the program each candidate needs to operationalize and complete a live project for assimilated learning.
Who should attend this big data analytics Hadoop course & training program? MBA students / IT professionals / recent graduates who want a job in a big data analytics / data scientist role.
Certified Big Data Analytics Course Content (72 hours + practice sessions)
Business Analytics using R & Tableau
Introduction to R- environment
1. The Workspace
2. Input/ Output
3. Useful Packages (Base & other packages) in R
4. Graphic User Interfaces (R studio)
5. Customizing Startup
6. Batch Processing
7. Reusing Results
Data Input & Output (Importing & Exporting)
1. Data Structure & Data Types (Vectors, Matrices, factors, Data frames,  and Lists)
2. Importing Data (Importing data from csv, txt, Excel and other files)
3. Keyboard Input (Creating input by entering data)
4. Database Input (Connecting to database and use the data)
5. Exporting Data (Exporting files into different formats)
6. Viewing Data (Viewing partial data and full data)
7. Variable & Value Labels –  Date Values
8. Missing Data
Data Management
1. Creating New Variables (calculations & Binning)
2. Operators (Using multiple operators)
3. Built-in Functions & User Defined Function
4. Control Structures(conditional statements, Loops)
5. Sorting Data
6. Merging and Appending Data
7. Aggregating Data
8. Reshaping Data
9. Sub setting Data
10. Data Type Conversions
Visualization
1. Creating Graphs
2. Histograms & Density Plot
3. Dot Plots –  Bar Plots – Line Charts – Pie Charts – Boxplots – Scatterplots
Basic Statistics (Exploratory Analysis)
1. Descriptive Statistics(central tendency/variance)
2. Frequency Tables /Summarization
3. Hypothesis Testing
4. t-tests/z-test (1-sample, independent sample, paired sample)
5. Analysis of Variance(ANOVA)
6. Correlations/chi-square test
Advanced Analytics (Advanced Statistics)
1. Introduction to predictive modeling & applications
2. Linear(Simple & Multiple) Regression
3. Logistic Regression
4. Introduction to segmentation
5. Segmentation using cluster analysis
Data Visualization using Tableau
1. Introduction to Tableau & Environment
2. Building basic views & sharing your work- overview
3. Data importing & manipulation
4. Maps/Tables/Calculated fields
5. Parameters
6. Data visualization with Charts maps
7. Building & customizing Reports
8. Building & customizing Dashboards
Machine Learning using R
1. What is Machine Learning?
2. Applications of Machine Learning Algorithms
3. Classification & Regression Problems
4. Training & Testing concepts – Cost & optimization functions
5. Artificial Neural Networks(ANN)
6. Support Vector Machines(SVM)
7. Decision Trees & Random Forest
8. Bayesian Network case
Social Media Analytics using R
1. Social Media – Characteristics of Social Media
2. Applications of Social Media Analytics
3. Metrics(Measures Actions) in social media analytics
4. Examples & Actionable Insights using Social Media Analytics
5. Text Analytics – Sentiment Analysis using R
6. Text Analytics – Word cloud analysis using R
Projects (Applying Overall Learning)
1. Solve Business problems using R/Tableau
HADOOPIntroduction to Big Data & Hadoop
1. What is Big Data?
2. Types of Data
3. Characteristics of Big Data
4. Need for understanding Big Data (Application of Big Data)
5. Traditional Approaches and its limitations
6. Introduction to Hadoop and eco-system
7. Getting Started with Hadoop (software installation etc.)
Hadoop Architecture
1. Hadoop Commercial version vs Apache Hadoop
2. Hadoop Cluster in commodity hardware
3. Hadoop core components
4. HDFS layer
5. HDFS operation principle
6. Basic Hadoop commands
MapReduce
1. Introduction to MapReduce
2. Hadoop MapReduce example
3. Hadoop MapReduce Characteristics
4. Setting up your MapReduce Environment
5. Building a MapReduce Program
6. Input Formats in MapReduce
7. OutputFormats in MapReduce
8. Basic MapReduce Programming using R
R-Hadoop
1. Introduction to rhdfs, rmr and rhbase
2. Develop Map reduce code using R for Local & Hadoop env
3. Exploratory analysis using R-Hadoop
4. Predictive analytics using R- Hadoop
5. Overview of Parallelization using R without Hadoop
Introduction to Flume & Sqoop
1. Introduction to Sqoop (Why, what, processing, under the hood)
2. Exporting data from Hadoop using Sqoop
3. Introduction to Flume
4. Flume Use Cases
5. Hands on Exercise using Flume and Sqoop
PIG
1. Introduction to PIG
2. Components of PIG
3. PIG Data Model
4. Creating Mapreduce programs using PIG
5. Hands on Exercise using PIG
HIVE
1. Introduction to HIVE and its characteristics
2. Components of HIVE
3. HIVE Data Models
4. Serialization/De-serialization
5. HIVE file formats
6. HIVE Query Language
7. HIVE Functions
8. Difference between HIVE and PIG
9. Hands on Exercise using HIVE
H-Base
1. HBase introduction and its Characteristics
2. HBase Architecture
3. Storage Model of HBase
4. When to use HBase
5. HBase Data Model
6. HBase Families
7. HBase Components
8. Data Storage
9. Hands on Exercise using Hbase
Mahout
1. Mahout introduction and its Characteristics
2. Mahout Architecture
3. When to use Mahout
4. Which machine learning topics are covered in Mahout
5.  Hands on Exercise using Mahout
ZooKeeper
1. Introduction to ZooKeeper & its Features
2. Features of ZooKeeper
3. Challenges faced in distributed applications
4. Coordination
5. ZooKeeper: Goals and Uses
6. ZooKeeper: Entities, Data Model, Services
Misc Components
1. Overview of Apache Oozie
2. Overview of Storm
3. Overview of Apache Cassandra
4. Overview of Apache Spark
5. Overview of H2O
6. Social Media Analytics(Text Analysis, Word cloud)
Projects (Applying Overall Learning)
1. Solve Business problems using all the components of Hadoop

AnalytixLabs now also offers online Big data analytics & hadoop courses training in Delhi, India.
