In data science, a dataset rarely arrives in exactly the format you need, so you have to refine it into something more useful. This process is often called “data munging”.

In this lab, you will read in a dataset and clean it (in a dataframe) so that it is useful. Then, you will explore the distribution of the values within the dataset.

Step 1: Create a function (named readStates) to read a CSV file into R

# 1. Note that you are to read a URL, not a file local to your computer.

# 2. Import the data. The file is a dataset of state populations (within the United States). Pass the URL as the argument to your readStates function. The URL: https://www2.census.gov/programs-surveys/popest/tables/2010-2011/state/totals/nst-est2011-02.csv

Hint: search for “read.csv” and “url” in the context of R commands.
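
As a starting point, here is a minimal sketch of reading a CSV from a URL, assuming base R's `read.csv` and `url`. The parameter name `csvUrl` and the choice of `stringsAsFactors = FALSE` are assumptions of this sketch, not requirements of the lab, and no cleaning is done here.

```r
# Sketch: read a CSV from a URL into a dataframe (no cleaning yet).
# `csvUrl` is a hypothetical parameter name chosen for illustration.
readStates <- function(csvUrl) {
  # url() creates a connection to the remote file; read.csv() opens it,
  # parses the contents, and closes it again.
  # stringsAsFactors = FALSE keeps text columns as character vectors.
  read.csv(url(csvUrl), stringsAsFactors = FALSE)
}
```

Calling `readStates` with the census URL given above should return the raw table, before any of the fixes that the cleaning step asks for.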

Step 2: Clean the dataframe

# 3. Note the issues that need to be fixed (removing columns, removing rows, changing column names).

# 4. Within your function, make sure there are 51 rows (one per state + the District of Columbia). Make sure there are exactly 5 columns, with the following names ("stateName","census2010", "base2010","populationchange","percentchange").

# 5. Make sure columns 2-4 are numbers (i.e. not strings).
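
The kinds of fixes listed above can be sketched on a small made-up dataframe. The data, the column names `V1` to `V4`, and the helper name `toNumber` are hypothetical and are not taken from the census file. Which rows and columns to drop in the real data is something to determine by inspecting it.

```r
# Toy data only: one junk header row, one junk footer row, an empty column,
# and numbers stored as strings with thousands separators.
raw <- data.frame(
  V1 = c("Table title", ".Alpha", ".Beta", "Footnote"),
  V2 = c("", "1,234", "567", ""),
  V3 = c(NA, NA, NA, NA),
  V4 = c("", "1,300", "600", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip commas, then convert the string to a number.
toNumber <- function(x) as.numeric(gsub(",", "", x))

clean <- raw[2:3, ]                             # keep only the data rows
clean <- clean[, c("V1", "V2", "V4")]           # drop the empty column
names(clean) <- c("name", "count", "estimate")  # rename the columns
clean$name <- gsub("^\\.", "", clean$name)      # remove the leading dot
clean$count <- toNumber(clean$count)
clean$estimate <- toNumber(clean$estimate)
rownames(clean) <- NULL                         # renumber rows from 1

str(clean)  # shows the column types after cleaning
```

Checking the result with `str`, `nrow`, and `ncol` is a quick way to confirm the row count, column count, and column types before moving on.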

Step 3: Store and explore the dataset

# 6. Store the dataset into a dataframe, called dfStates.

# 7. Test your dataframe by calculating the mean for the census2010 column in dfStates. 

Step 4: Find the state with the highest population

# 8. Based on the census2010 data, what is the population of the state with the highest population? What is the name of that state?

# 9. Sort the data, in increasing order, based on the census2010 data.  
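
For finding the largest value and sorting, here is a minimal sketch using base R's `which.max` and `order` on made-up data. The dataframe `toy` and its values are hypothetical. The same pattern would apply to `dfStates` and its `census2010` column.

```r
# Made-up data for illustration only.
toy <- data.frame(
  stateName = c("A", "B", "C"),
  census2010 = c(300, 900, 100),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the (first) largest value.
top <- which.max(toy$census2010)
toy$census2010[top]  # largest population: 900
toy$stateName[top]   # name in the same row: "B"

# order() returns the row indices that sort the column in increasing order.
sorted <- toy[order(toy$census2010), ]
sorted$stateName     # "C" "A" "B"
```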

Step 5: Explore the distribution of the states

# 10. Write a function that takes two parameters. The first is a vector and the second is a number.

# 11. The function will return the fraction of the elements within the vector that are less than the number (i.e. the cumulative distribution below the value provided), expressed as a proportion between 0 and 1.

# 12. Test the function; the result should be 0.2.
Distribution(c(1,2,3,4,5), 2)

# 13. Test the function with the vector ‘dfStates$census2010’ and the mean of ‘dfStates$census2010’.
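
One possible sketch of such a function follows. It assumes that "less than" means strictly less than and that the result is a fraction between 0 and 1, which matches the expected 0.2 in test 12. The parameter names `values` and `threshold` are hypothetical.

```r
# Sketch: fraction of elements in `values` strictly below `threshold`.
Distribution <- function(values, threshold) {
  # values < threshold is a logical vector; its mean is the share of TRUEs.
  mean(values < threshold)
}

Distribution(c(1, 2, 3, 4, 5), 2)  # only 1 is below 2, so 1/5 = 0.2
```

For test 13, the call would be `Distribution(dfStates$census2010, mean(dfStates$census2010))`, once `dfStates` exists.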