Softpanorama


R Data Frames


Introduction

You can think of a data frame as a structure similar to an Excel spreadsheet.

It is essentially a list of vectors of equal length. A data frame is more general than a matrix because different columns can contain different modes of data (numeric, character, etc.). Data frames are the most common data structure in R.

Like an Excel spreadsheet, an R data frame is a two-dimensional object made up of rows and columns, and you can address columns by name. Rows are referred to by the first (left-hand) subscript, columns by the second (right-hand) subscript or by name. Each element can be addressed by two indexes in square brackets, for example

gld[2,3]

You can use a range instead of a single index, for example

gld[1,4:5]

You can drop a row or a column from the data set with a negative index, for example

gld[,-2] # drops the second column
gld[-(1:n),] # drops rows 1 to n of gld

You can select multiple columns by passing a vector of column names built with the c function

gld[, c("date","open","close","volume")]

To select all the entries in a column, leave the row index empty and give the column number after the comma, for example

gld[,2]
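The indexing forms above can be tried on a small data frame. This is only a sketch: the gld object below is a toy stand-in built by hand from a few rows of the GLD sample, with lowercase column names assumed for illustration.

```r
# Toy stand-in for gld (assumed column names: date, open, close, volume)
gld <- data.frame(
  date   = c("2015-07-29", "2015-07-28", "2015-07-27", "2015-07-24"),
  open   = c(104.93, 105.089996, 104.940002, 103.610001),
  close  = c(105.169998, 105.019997, 104.860001, 105.349998),
  volume = c(5613700, 5522200, 9330000, 11442700),
  stringsAsFactors = FALSE
)

one_cell  <- gld[2, 3]                  # single element: row 2, column 3
one_row   <- gld[1, 3:4]                # row 1, columns 3 through 4
no_open   <- gld[, -2]                  # every column except the second
tail_rows <- gld[-(1:2), ]              # every row except the first two
by_name   <- gld[, c("date", "close")]  # columns picked by name
open_col  <- gld[, 2]                   # whole second column, as a vector
```

Note that a negative index removes nothing from gld itself; it only returns a copy without the listed rows or columns.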

Construction of a data frame from several vectors of the same length

A data frame is created with the data.frame() function:

mydata <- data.frame(vector1, vector2, ...)

Combine three vectors into a data frame and assign the result to the metals variable:

metals <- data.frame(date, open, close)

Now print metals to see its contents with the statement print(metals)

> print(metals)
  date open close
  ...  ...  ...

There's your new data frame, neatly organized into rows, with column names (derived from the variable names) across the top.

You can get individual columns by providing their index number in double brackets. Try getting the second column (open) of metals:

metals[[2]] 

You could instead provide a column name as a string in double-brackets. (This is often more readable.) Retrieve the "close" column:

metals[["close"]] 
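A minimal sketch of the whole sequence, assuming three short vectors named date, open and close (the values are taken from the GLD sample; any vectors of equal length would do):

```r
# Three vectors of the same length (illustrative values)
date  <- c("2015-07-29", "2015-07-28", "2015-07-27")
open  <- c(104.93, 105.089996, 104.940002)
close <- c(105.169998, 105.019997, 104.860001)

# Column names are derived from the variable names
metals <- data.frame(date, open, close, stringsAsFactors = FALSE)
print(metals)

second  <- metals[[2]]        # second column by position (open)
closing <- metals[["close"]]  # same idea, by name
```

Double brackets return the column itself as a vector, not a one-column data frame.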

Reading CSVs

There are numerous ways to load data into a data.frame. The simplest and most common is reading a comma separated values (CSV) file. You can use read.csv to do that. It actually calls read.table with some arguments preset. The result of read.table is a data.frame.

The first argument to read.table is the full path of the file to be loaded, or a URL. If you specify just the name of the file, it is looked up in the current working directory (in RStudio this is normally your project folder).

yahooUrl <- "http://real-chart.finance.yahoo.com/table.csv?s=GLD&d=6&e=30&f=2015&g=d&a=10&b=18&c=2004&ignore=.csv"
theGold <- read.table(file = yahooUrl, header = TRUE, sep = ",")

The result can now be seen using head.

> head(theGold)

The first argument is the file name in quotes (or as a character variable). Notice how we explicitly used the argument names file, header and sep. The second argument, header, indicates that the first row of data holds the column names. The third argument, sep, gives the delimiter separating data cells. Changing this to other values such as "\t" (tab delimited) or ";" (semicolon delimited) allows read.table to read other types of files.

Another helpful argument is stringsAsFactors. Setting it to FALSE (the default was TRUE before R 4.0.0 and has been FALSE since) prevents character columns from being converted to factor columns. This saves computation time, which can be substantial for a large dataset with many rows and several character columns with many unique values. Keeping the columns as character data also makes them easier to work with in many cases. Conversion to factors is often overkill unless the values are members of a fixed set of categories and you can benefit from treating them that way.

The stringsAsFactors argument can also be used in the data.frame function to block conversion of strings into factors:

theGold <- data.frame(date=d, close=cc, volume=v, stringsAsFactors=FALSE)

There are several other arguments to read.table function. Among them the most useful are quote and colClasses. The former specifies the character used for enclosing cells and the latter the data type for each column, respectively.

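When a comma-delimited file is poorly built, for example the cell separator also appears inside cells, the cells containing it must be quoted, and the quote argument tells read.table which quoting character to expect. The functions read.csv2 and read.delim2 address a different case: files that use a comma as the decimal mark, with semicolons (read.csv2) or tabs (read.delim2) separating the cells.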
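The arguments discussed in this section can be tried without any file or network access by feeding read.table a string through its text argument. This is only a sketch: csv_text is a made-up two-row extract of the GLD sample, and the colClasses vector assumes exactly these four columns.

```r
# Inline CSV data standing in for a downloaded file (illustrative extract)
csv_text <- "Date,Open,Close,Volume
2015-07-29,104.93,105.169998,5613700
2015-07-28,105.089996,105.019997,5522200"

theGold <- read.table(
  text = csv_text,
  header = TRUE,               # first row holds the column names
  sep = ",",                   # cells are separated by commas
  stringsAsFactors = FALSE,    # keep strings as character data
  colClasses = c("character", "numeric", "numeric", "numeric")
)

str(theGold)

# Dates arrive as text; convert them explicitly when needed
theGold$Date <- as.Date(theGold$Date)
```

Specifying colClasses up front spares R from guessing the column types, which matters mostly for large files.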

Loading Data Frames from a file

Typing in all your data by hand only works up to a point, which is why R can easily load data from external files.

You can create a couple of data files to experiment with in your project directory. To check what files your project directory contains, use the function list.files():

> list.files()

Let's assume that there is a CSV (Comma Separated Values) file "gld150730.csv" in your project directory. You can export such a file from any spreadsheet program or download it from a web site that provides stock quotes, such as http://finance.yahoo.com . For example:

Date,Open,High,Low,Close,Volume,Adj Close
2015-07-29,104.93,105.629997,104.489998,105.169998,5613700,105.169998
2015-07-28,105.089996,105.330002,104.830002,105.019997,5522200,105.019997
2015-07-27,104.940002,105.68,104.660004,104.860001,9330000,104.860001
2015-07-24,103.610001,105.589996,103.43,105.349998,11442700,105.349998
2015-07-23,104.980003,105.300003,104.199997,104.330002,5691800,104.330002
2015-07-22,104.389999,105.089996,104.18,104.800003,8288700,104.800003
2015-07-21,105.809998,106.32,105.25,105.370003,9391100,105.370003
2015-07-20,106.599998,106.650002,105.620003,105.699997,15437900,105.699997
2015-07-17,109.110001,109.160004,108.400002,108.650002,13954500,108.650002
2015-07-16,109.669998,110.010002,109.599998,109.760002,4221900,109.760002
2015-07-15,110.00,110.190002,109.580002,110.160004,8157600,110.160004
2015-07-14,111.00,111.080002,110.629997,110.739998,2575900,110.739998
2015-07-13,110.43,111.139999,110.360001,110.989998,4268600,110.989998
2015-07-10,111.18,111.709999,111.029999,111.489998,3585200,111.489998
2015-07-09,111.800003,111.93,111.150002,111.360001,3793800,111.360001
2015-07-08,111.379997,111.650002,111.080002,111.089996,5655100,111.089996
2015-07-07,111.080002,111.139999,110.050003,110.760002,9062300,110.760002
2015-07-06,111.709999,112.580002,111.629997,112.059998,4228800,112.059998
2015-07-02,111.660004,111.839996,111.410004,111.760002,3828800,111.760002
2015-07-01,112.120003,112.510002,111.940002,111.980003,4368000,111.980003

You can load a CSV file's content into a data frame by passing the file name to the read.csv  function. Try it with the "gld150730.csv"  file:

theGold <- read.csv("gld150730.csv")

Fields in a file can also be separated by tab characters rather than commas.

For files that use a separator other than a comma, you can use the read.table function. The sep argument defines the separator character, and you can specify a tab character with "\t". For a tab-separated version of the same data (the file name gld150730.tsv is only an illustration):

theGold <- read.table("gld150730.tsv", sep="\t", header=TRUE)
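To try this without preparing a file by hand, the sketch below writes a small tab-separated file to a temporary location and reads it back. The temporary file and its three-column content are assumptions made for the example; read.delim is the base R shortcut that presets sep="\t" and header=TRUE.

```r
# Write a tiny tab-separated file to a temporary path (illustrative data)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("Date\tOpen\tClose",
             "2015-07-29\t104.93\t105.169998",
             "2015-07-28\t105.089996\t105.019997"),
           tsv_file)

# Explicit form: name the separator and say there is a header row
tabGold <- read.table(tsv_file, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

# Shortcut form: read.delim presets sep = "\t" and header = TRUE
sameGold <- read.delim(tsv_file, stringsAsFactors = FALSE)

unlink(tsv_file)  # remove the temporary file
```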

Selecting columns

To get a single column of data from a data frame, specify the column and leave the row index empty. For example, to access the first column of the data frame theGold you can use the index of this column:

theDate <- theGold[,1]

In general each index can be a vector. That means you can use a range to select a set of consecutive columns, or a vector built with c or seq to select columns that are not consecutive:

thePrices <- theGold[,3:5]

In R you can also use column names, which is more convenient than using numeric indexes. Column names are stored as a character vector, and each name corresponds to a column position. You can get the names of the columns of a particular data frame with the function names, for example

names(theGold)

To access a single column by name, put the name of the column in quotes in place of the numeric index. Column names are case-sensitive, and a file read with read.csv keeps the capitalization of its header row:

theDate <- theGold[,"Date"]

To access multiple columns by name, make the column argument a character vector of the names of the columns you want in the output.

goldSelectedCol <- theGold[, c("Date", "Open", "Close", "Volume")]

When you select a single column, R converts it into a vector and displays the values horizontally. If you want the values displayed vertically, as you are used to when viewing data frames, you need to ensure that the result is still a data frame despite having just a single column. That can be achieved with the argument drop=FALSE (note the capitals: R has no lowercase false)

theDate <- theGold[,"Date",drop=FALSE]

You can check the class of the result: it will be a data frame, not a vector. For example:

class(theGold[,"Date",drop=FALSE])
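The effect of the drop argument is easy to see side by side. A minimal sketch, assuming a two-column stand-in for theGold built by hand:

```r
# Small stand-in for theGold (illustrative values)
theGold <- data.frame(
  Date  = c("2015-07-29", "2015-07-28", "2015-07-27"),
  Close = c(105.169998, 105.019997, 104.860001),
  stringsAsFactors = FALSE
)

asVector <- theGold[, "Close"]                # simplified to a plain vector
asFrame  <- theGold[, "Close", drop = FALSE]  # stays a one-column data frame

class(asVector)  # "numeric"
class(asFrame)   # "data.frame"
dim(asFrame)     # 3 rows, 1 column
```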

The $ notation for selection of columns

Typing all those brackets can get tedious and error-prone, so R has a shorthand notation: the data frame name, a dollar sign, and the column name (without quotes, spelled exactly as names() reports it). Try using it to get the "Close" column:

theGold$Close

The $ notation selects a particular column (vector) from a given data frame.

Selecting rows

To get a single row you can use notation similar to getting a single column: specify the row and leave the column index empty:

theRow2 <- theGold[2, ]

To specify multiple rows, use a vector, for example

theRows <- theGold[2:10, ]

If the rows are not adjacent, use the c function to construct the vector, for example

theRows <- theGold[c(2,5,10), ]

As the nrow function returns the number of rows, you can calculate indexes and use them instead of constants. For example, to select the last 200 rows you can use the following:

rMax <- nrow(theGold)
rMin <- rMax-199
theRows <- theGold[rMin:rMax, ]
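The index arithmetic for "the last n rows" is a common place for off-by-one mistakes, so here is a small sketch that checks it against tail(). The 500-row data frame is synthetic and exists only for the demonstration.

```r
# Synthetic data frame with 500 rows (made-up values)
theGold <- data.frame(Day = 1:500, Close = 100 + (1:500) / 10)

n    <- 200
rMax <- nrow(theGold)   # index of the last row
rMin <- rMax - n + 1    # index of the first of the last n rows
last200 <- theGold[rMin:rMax, ]

nrow(last200)           # 200

# tail() selects the same rows without the arithmetic
viaTail <- tail(theGold, n)
```

The range rMin:rMax is inclusive at both ends, which is why one is added back after subtracting n.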

Functions head and tail

R contains two very useful functions for working with rows, head and tail, which are similar to the Unix utilities of the same names.

Usually a data.frame has far too many rows to print them all to the screen, so the head function prints only the first few rows (six by default), and tail prints the last few.

Try the following command on our example data frame

head(theGold)
head(theGold, n = 7)
tail(theGold)
tail(theGold, n = 10)

Getting attributes of a data frame

There are various ways to inspect a data frame, such as:
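A few base R functions commonly used for this purpose are sketched below. The selection is my own rather than a list from the original page, and the small theGold data frame is built by hand for illustration.

```r
# Small stand-in for theGold (illustrative values)
theGold <- data.frame(
  Date   = c("2015-07-29", "2015-07-28", "2015-07-27", "2015-07-24"),
  Open   = c(104.93, 105.089996, 104.940002, 103.610001),
  Close  = c(105.169998, 105.019997, 104.860001, 105.349998),
  Volume = c(5613700, 5522200, 9330000, 11442700),
  stringsAsFactors = FALSE
)

str(theGold)                       # compact overview: type and first values per column
dims      <- dim(theGold)          # number of rows, number of columns
rowCount  <- nrow(theGold)         # number of rows only
colCount  <- ncol(theGold)         # number of columns only
colNames  <- names(theGold)        # column names
colTypes  <- sapply(theGold, class)  # class of every column
summary(theGold$Close)             # min, quartiles, mean, max of one column
```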

Browsing data

RStudio has a nice data browser, opened with View(mydata). The data frame is displayed in a spreadsheet-like format in the upper left pane.

You can also use functions head()  and  tail()  to display rows that are interesting for you in command window.

Binding a new row or column to existing data frame

Most of the time when you work with data frames you are changing the data, and one of the changes you can make is adding a column or a row, which increases the dimensions of the data frame.

There are a few different ways to do it, but the easiest are cbind() and rbind(), which are part of the base package:

mydata <- cbind(mydata, newVector)
mydata <- rbind(mydata, newVector)

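Remember that the length of newVector should match the length of the side of the data frame that you are attaching it to. A new column needs one value per row, so for the cbind() command the following statement should be TRUE:

nrow(mydata)==length(newVector)

A new row needs one value per column, so for rbind() the corresponding check is ncol(mydata)==length(newVector).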

To see more samples, you can always do ?base::cbind  and ?base::rbind.
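A short sketch of both operations on a toy data frame. The column name High and all values are assumptions chosen for the example. The new row is passed as a one-row data frame so that each value keeps the type of its column, which is a common precaution rather than a requirement.

```r
# Toy data frame with two rows and two columns (illustrative values)
mydata <- data.frame(Open  = c(104.93, 105.089996),
                     Close = c(105.169998, 105.019997))

# New column: one value per existing row
High <- c(105.629997, 105.330002)
stopifnot(length(High) == nrow(mydata))
mydata <- cbind(mydata, High)   # the column takes the name of the variable

# New row: one value per existing column, with matching names
newRow <- data.frame(Open = 104.940002, Close = 104.860001, High = 105.68)
stopifnot(ncol(newRow) == ncol(mydata))
mydata <- rbind(mydata, newRow)

dim(mydata)   # 3 rows, 3 columns
```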



Old News ;-)

[Jul 30, 2015] attach() and detach() or with() functions

It can get tiresome typing patientdata$ at the beginning of every variable name, so shortcuts are available. You can use either the attach() and detach() or with() functions to simplify your code.

Attach, Detach, and With

The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked in order to locate the variable. Using the mtcars data frame from chapter 1 as an example, you could use the following code to obtain summary statistics for automobile mileage (mpg), and plot this variable against engine displacement (disp), and weight (wt):

summary(mtcars$mpg)
plot(mtcars$mpg, mtcars$disp)
plot(mtcars$mpg, mtcars$wt)

This could also be written as

attach(mtcars)
  summary(mpg)
  plot(mpg, disp)
  plot(mpg, wt)
detach(mtcars)

The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. The statement is optional but is good programming practice and should be included routinely. (I'll sometimes ignore this sage advice in later chapters in order to keep code fragments simple and short.)

The limitations with this approach are evident when more than one object can have the same name. Consider the following code:

> mpg <- c(25, 36, 47)
> attach(mtcars)


The following object(s) are masked _by_ '.GlobalEnv':    mpg
> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ
> mpg
[1] 25 36 47

Here we already have an object named mpg in our environment when the mtcars data frame is attached. In such cases, the original object takes precedence, which isn't what you want. The plot statement fails because mpg has 3 elements and wt has 32 elements. The attach() and detach() functions are best used when you're analyzing a single data frame and you're unlikely to have multiple objects with the same name. In any case, be vigilant for warnings that say that objects are being masked.

An alternative approach is to use the with() function. You could write the previous example as

with(mtcars, {
  print(summary(mpg))
  plot(mpg, disp)
  plot(mpg, wt)
})

In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don't have to worry about name conflicts here. If there's only one statement (for example, summary(mpg)), the {} brackets are optional.

The limitation of the with() function is that assignments will only exist within the function brackets. Consider the following:

> with(mtcars, {
   stats <- summary(mpg)
   stats
  })
   Min. 1st Qu.  Median    Mean 3rd Qu.     Max.
  10.40   15.43   19.20   20.09   22.80    33.90
> stats
Error: object 'stats' not found

If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<- instead of the standard one (<-). It will save the object to the global environment outside of the with() call. This can be demonstrated with the following code:

> with(mtcars, {
   nokeepstats <- summary(mpg)
   keepstats <<- summary(mpg)
})
> nokeepstats
Error: object 'nokeepstats' not found
> keepstats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    10.40   15.43   19.20   20.09   22.80   33.90

Most books on R recommend using with() over attach(). I think that ultimately the choice is a matter of preference and should be based on what you're trying to achieve and your understanding of the implications. We'll use both in this book.
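Since with() returns the value of the expression it evaluates, an ordinary assignment outside the call is another way to keep a result, and it avoids <<- altogether. The sketch below is an alternative I am suggesting, not something taken from the quoted text; it uses the built-in mtcars data frame.

```r
# with() returns the value of its last expression, so assign it outside
stats <- with(mtcars, summary(mpg))
stats

# The same pattern works for any computation on the columns
mpgPerTon <- with(mtcars, mean(mpg / wt))

# Nothing was attached, so mpg is still not visible at top level
exists("mpg", inherits = FALSE)
```

This keeps the name lookup local to the with() call while the result lands in the calling environment in the usual way.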



Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...


Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: October, 16, 2019