Instructions:
• Please complete this exam in a Word document, and save it as DATA460_Exam1_FirstnameLastname, where Firstname is your first name and Lastname is your last name. Make sure you put down the problem # clearly.
• This exam is Open Book, Open Notes. You are required to use R to solve all questions. Make sure you include the code/command, as well as the relevant output.
• You have 48 hours to complete this exam. The exam must be submitted through the Canvas Assignment Tool by 11:59 pm on Monday, 7/22/24 (Pacific Time).
• Round to the THIRD decimal place unless the instructions state otherwise.
• PLEASE SHOW ALL YOUR WORK COMPLETELY AND CLEARLY!!!
Part 1:
Use R to answer the following questions. Make sure you include clear headings (e.g., Problem 1 – 1). For each part of a question, include the command line/code, then paste the relevant output/results, and comment on the output/results as needed to answer the question.
Problem 1:
Researchers investigated smoking in Great Britain and collected the sample data set smoking.csv (or smoking.txt). Read the data set and answer the following questions.
1. Download smoking.csv (or smoking.txt) and read the data into R. Example command in R: MyData <- read.csv(file="path/TheDataIWantToReadIn.csv", header=TRUE, sep=",") Note: use a forward slash “/” instead of a backslash “\” in the path. Make sure to include the code/command.
2. How many observations are there in this data set? How many variables, and what are they? What is the 300th observation of nationality? Include both the code/command and the output/graph.
3. Create a numerical summary for age and compute the interquartile range. Compute the relative frequency distribution for gender. How many males are in the sample? Include both the code/command and the output/graph.
4. Using numerical summaries and a side-by-side box plot, compare the male smokers and female smokers and interpret the boxplot. Include both the code/command and the output/graph.
5. Create a bar chart or frequency table for maritalStatus. What are the proportions of Divorced, Single, Married, and Widowed, respectively? What can you interpret from these numbers? Include both the code/command and the output/graph.
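As a minimal sketch of the R functions these questions call for, the following uses a small made-up data frame in place of smoking.csv. The column names age, gender, nationality, and maritalStatus come from the questions above; the name toy and every value in it are invented for illustration and say nothing about the real data.

```r
# Toy stand-in for the smoking data; every value here is made up.
toy <- data.frame(
  age           = c(23, 35, 41, 58, 62, 29, 47, 70),
  gender        = c("Male", "Female", "Female", "Male",
                    "Female", "Male", "Female", "Male"),
  nationality   = c("English", "Scottish", "English", "Welsh",
                    "English", "Irish", "English", "Scottish"),
  maritalStatus = c("Single", "Married", "Divorced", "Married",
                    "Widowed", "Single", "Married", "Widowed")
)

dim(toy)                 # number of observations (rows) and variables (columns)
names(toy)               # variable names
toy$nationality[3]       # value of nationality for the 3rd observation

summary(toy$age)         # five-number summary plus the mean
age_iqr <- IQR(toy$age)  # interquartile range: Q3 - Q1

gender_counts <- table(toy$gender)          # frequency of each gender
gender_rel    <- prop.table(gender_counts)  # relative frequencies (sum to 1)
round(gender_rel, 3)

# Numerical summary of age within each gender, then a side-by-side box plot
tapply(toy$age, toy$gender, summary)
boxplot(age ~ gender, data = toy, main = "Age by gender (toy data)")

# Proportions and bar chart for marital status
marital_rel <- prop.table(table(toy$maritalStatus))
round(marital_rel, 3)
barplot(marital_rel, main = "Marital status (toy data)")
```

With the real file, the same calls would be applied to the data frame returned by read.csv; the index 3 above is only there because the toy data has eight rows.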
Problem 2:
Use simulation in R to answer the following questions:
1. Suppose we’re flipping an unfair coin that we know lands heads only 30% of the time. Simulate this flip 10 times. What is the proportion of heads? If you simulate this flip 100 times, what is the proportion of heads now? Include both the code/command and the output/graph.
2. Suppose we’re rolling an unfair die, and the probabilities of landing on 1, 2, 3, 4, 5, and 6 are 0.05, 0.1, 0.15, 0.2, 0.3, and 0.2, respectively. If you simulate this roll 10 times, what is the proportion of rolls that land on side 5? Simulate this roll 100 times. What is the proportion for side 5 now? Include both the code/command and the output/graph.
3. Compare the proportions in each question above. What conclusion can you draw? Does the number of simulations affect the proportions? If so, how? Please explain in detail.
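One possible way to set up these simulations is with sample(), which draws from a set of outcomes with given probabilities. The sketch below is an illustration, not a required approach: the seed value and the helper names coin_prop and die_prop5 are arbitrary choices, and the proportions it prints will change with a different seed.

```r
set.seed(460)  # arbitrary seed, only so that reruns give the same draws

# Unfair coin: P(heads) = 0.3
coin_prop <- function(n) {
  flips <- sample(c("H", "T"), size = n, replace = TRUE, prob = c(0.3, 0.7))
  mean(flips == "H")   # proportion of heads in n flips
}
coin_10  <- coin_prop(10)
coin_100 <- coin_prop(100)

# Unfair die with the probabilities stated in question 2
die_prob <- c(0.05, 0.1, 0.15, 0.2, 0.3, 0.2)
die_prop5 <- function(n) {
  rolls <- sample(1:6, size = n, replace = TRUE, prob = die_prob)
  mean(rolls == 5)     # proportion of rolls showing side 5
}
die_10  <- die_prop5(10)
die_100 <- die_prop5(100)

round(c(coin_10 = coin_10, coin_100 = coin_100,
        die_10 = die_10, die_100 = die_100), 3)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why mean(flips == "H") returns the proportion of heads.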
DA 460 Midterm Exam W. Li
Problem 3:
The data set countyComplete.csv (or countyComplete.txt) contains population information for all counties in the US. Use this data set to answer the following questions:
1. Download countyComplete.csv (or countyComplete.txt) and read the data into R. Example command in R: MyData <- read.csv(file="path/TheDataIWantToReadIn.csv", header=TRUE, sep=",") Note: use a forward slash “/” instead of a backslash “\” in the path. Make sure to include the code/command.
2. Make a histogram of pop2010, and describe its distribution. Include both the code/command and the output/graph.
3. Create a new subset named Washington that contains only the observations for Washington, then make a histogram of pop2010 and describe its distribution. Compare this histogram with the one from the previous question. Include both the code/command and the output/graph.
4. Based on the subset Washington, make a normal probability plot of pop2010. Do all of the points fall on the line? How does this plot compare to the probability plot of the original data? Include both the code/command and the output/graph.
5. Suppose the variable pop2010 follows a normal distribution. What is the probability that pop2010 is greater than 102,410? What is the probability that pop2010 is between 190,000 and 1,000,000? Include both the code/command and the output/graph.
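The steps in this problem can be sketched with subset(), hist(), qqnorm(), qqline(), and pnorm(). The example below runs on made-up data rather than countyComplete.csv. Two things in it are assumptions: the name of the column holding the state (state), which the questions do not specify, and the use of the sample mean and standard deviation of pop2010 as the parameters of the normal model.

```r
set.seed(460)  # arbitrary seed for reproducible toy data

# Toy stand-in for countyComplete; the `state` column name is an assumption
# and the pop2010 values are randomly generated, not real counts.
toy <- data.frame(
  state   = rep(c("Washington", "Oregon", "Idaho"), each = 40),
  pop2010 = round(rlnorm(120, meanlog = 10, sdlog = 1.2))  # right-skewed
)

hist(toy$pop2010, main = "pop2010, all rows (toy data)", xlab = "pop2010")

# Keep only the Washington rows
Washington <- subset(toy, state == "Washington")
hist(Washington$pop2010, main = "pop2010, Washington (toy data)",
     xlab = "pop2010")

# Normal probability (Q-Q) plot with a reference line
qqnorm(Washington$pop2010)
qqline(Washington$pop2010)

# Normal-model probabilities, with the sample mean and sd as parameters
m <- mean(toy$pop2010)
s <- sd(toy$pop2010)
p_above   <- pnorm(102410, mean = m, sd = s, lower.tail = FALSE)  # P(X > 102,410)
p_between <- pnorm(1e6, mean = m, sd = s) -
             pnorm(190000, mean = m, sd = s)  # P(190,000 < X < 1,000,000)
round(c(p_above = p_above, p_between = p_between), 3)
```

Setting lower.tail = FALSE in pnorm() returns the upper-tail probability directly, which is equivalent to 1 - pnorm(...).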
Problem 4:
1. Collect a simple random sample of size 50 from pop2010 in the data set countyComplete.csv (or countyComplete.txt). Describe the distribution of this sample. How does it compare to the distribution of the population? Using this sample, what is your best point estimate of the population mean? Include both the code/command and the output/graph.
2. Now, collect a simple random sample of size 300 from pop2010 in the data set countyComplete.csv (or countyComplete.txt). Describe the distribution of this sample. How does it compare to the distribution of the population? Using this sample, what is your best point estimate of the population mean? Compare this with your point estimate from question 1. What conclusion can you draw? Include both the code/command and the output/graph.
3. Create and compare histograms of the samples of size 50 and 300. Which one is closer to symmetric, and why? Include both the code/command and the output/graph.
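A simple random sample can be drawn with sample(), which samples without replacement by default. The sketch below is only an illustration under stated assumptions: pop is a randomly generated stand-in for the pop2010 column, the seed is arbitrary, and the numbers it prints say nothing about the real data.

```r
set.seed(460)  # arbitrary seed for reproducible draws

# Made-up right-skewed "population" standing in for the pop2010 column
pop <- round(rlnorm(3000, meanlog = 10, sdlog = 1.2))

samp50  <- sample(pop, size = 50)    # simple random sample, without replacement
samp300 <- sample(pop, size = 300)

# The sample mean is the point estimate of the population mean
round(c(pop_mean = mean(pop), est_50 = mean(samp50), est_300 = mean(samp300)), 3)

# Histograms side by side for comparison
par(mfrow = c(1, 2))
hist(samp50,  main = "n = 50 (toy data)",  xlab = "pop2010")
hist(samp300, main = "n = 300 (toy data)", xlab = "pop2010")
par(mfrow = c(1, 1))  # restore the single-panel layout
```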
Part 2:
Save your file as DATA460_Exam1_FirstnameLastname.docx (or .pdf) where Firstname is your first name and Lastname is your last name, and submit it through the Assignment Tool.

