1.) Use the dataset e2q1.csv, a comma-separated (CSV) file that can be opened in Excel. Suppose we are interested in the response under two possible treatments, placebo and active, coded as 0 and 1 respectively. The data contain simulated responses for 160 units at 2, 4, and 6 weeks after baseline (t=0). Variables in the dataset are trt (0 or 1), id (numbered 1 to 160), weeks (0, 2, 4, and 6) and y (the response variable).
In R, these commands will read the data and make the variable names recognized by commands/functions that you use.
e2q1 = read.table("e2q1.csv", header=TRUE, sep=",")
attach(e2q1)
(a) Create an interaction plot that gives mean response by treatment and time. Give the plot and briefly describe the treatment, time, and interaction effects visible in the plot.
(b) Using gls in R, fit repeated measures models for each of the following assumptions about correlations among measurements at different times:
• Compound symmetry
• Completely unstructured with possibly unequal variances
• AR(1)
• AR(1) with unequal variances
Compare the AIC and BIC values for the four different models using the anova command. Give the results and explain which model looks to be the best or most preferable.
(c) For your model choice in part (b), use the anova command to determine F-test results for treatment, time, and the trt*time interaction. Give the results and indicate which effects are statistically significant. Is there a difference in responses between the placebo group and the active treatment group? Does either treatment affect their score over time?
(d) Now use a cubic orthogonal polynomial to describe the time trend along with your chosen assumption about correlations between measurements at different times. Use the summary command to summarize the coefficient estimates. Briefly describe what is shown about the overall time trend and how the treatments differ with regard to the time trend.
2.) Use the dataset e2q2.txt. The data for this problem are a simulated time series of n = 500 equally spaced observations. We will aim to fit a threshold model to the data.
Answer the following questions for threshold c = 30.
(a) Plot the data, interpret the plot, and comment on the threshold value. Would you suggest another value?
(b) Estimate an AR(1) model for the original data in each of the two regions. Provide the model output and discuss the significance of the terms. Use the following code:
model = ts.intersect(y, lag1y=lag(y, -1))
P = model[,2]
Because P now contains just one column, you may replace P[,1] with P in the rest of your code.
(c) Comment on the suitability of this model. Compare the actual and predicted values and comment.
(d) Estimate an AR(4) model for the original data in each of the two regions. Compare this model to the AR(1) model estimated above. Which of the two models would you recommend?
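The regime split behind part (b) can be illustrated outside R. The following is a minimal Python sketch, not the assigned R workflow: it pairs each observation with its lag-1 value, splits the pairs at the threshold, and fits an intercept and slope by ordinary least squares in each region. The function names `fit_line` and `threshold_ar1`, the choice to split on the lagged value being at or below the threshold, and the toy series are all assumptions made for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two points per region")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("lagged values in a region must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def threshold_ar1(series, c):
    """Fit separate AR(1) lines where the lag-1 value is <= c and > c."""
    pairs = list(zip(series[:-1], series[1:]))  # (y[t-1], y[t])
    regions = {
        "lower": [p for p in pairs if p[0] <= c],
        "upper": [p for p in pairs if p[0] > c],
    }
    fits = {}
    for name, region in regions.items():
        lags = [lag for lag, _ in region]
        currents = [cur for _, cur in region]
        fits[name] = fit_line(lags, currents)  # (intercept, slope)
    return fits


# Made-up, noise-free series for illustration only (not e2q2.txt).
toy = [10.0]
for _ in range(12):
    prev = toy[-1]
    toy.append(10 + 1.5 * prev if prev <= 30 else 60 - 0.8 * prev)

fits = threshold_ar1(toy, c=30)
# Because the toy series has no noise, each fit recovers its generating
# line up to floating-point rounding.
print(fits)
```

With real data the two fitted lines will not be exact, and the significance of each term still has to be read from the model output requested in part (b).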
eBay Auctions: Boosting and Bagging. Using the eBay auction data (file eBayAuctions.csv) with the variable Competitive as the outcome variable, partition the data into training (60%) and validation (40%) sets.
a. Run a classification tree, using the default settings of DecisionTreeClassifier. Looking at the validation set, what is the overall accuracy? What is the lift on the first decile?
b. Run a boosted tree with the same predictors (use AdaBoostClassifier with DecisionTreeClassifier as the base estimator). For the validation set, what is the overall accuracy? What is the lift on the first decile?
c. Run a bagged tree with the same predictors (use BaggingClassifier). For the validation set, what is the overall accuracy? What is the lift on the first decile?
d. Run a random forest (use RandomForestClassifier). Compare the bagged tree to the random forest in terms of validation accuracy and lift on first decile. How are the two methods conceptually different?
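Parts a through d all ask for the lift on the first decile. Here is a minimal sketch of that calculation, assuming lift is defined as the rate of competitive auctions among the 10% of validation records with the highest predicted probability, divided by the rate in the whole validation set. The function name `first_decile_lift` and the toy inputs are hypothetical; in practice the scores would come from each fitted classifier's `predict_proba` on the validation set.

```python
def first_decile_lift(actual, scores):
    """Lift of the top 10% of records when ranked by predicted score.

    actual: 0/1 outcomes (1 = competitive auction)
    scores: predicted probability of class 1, in the same order
    """
    if len(actual) != len(scores) or not actual:
        raise ValueError("actual and scores must be non-empty and equal length")
    overall_rate = sum(actual) / len(actual)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive outcomes")
    # Rank records from highest to lowest score.
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)  # size of the first decile
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    return top_rate / overall_rate


# Made-up validation outcomes and scores, for illustration only.
toy_actual = [1, 1, 1, 1] + [0] * 16
toy_scores = [0.95, 0.90, 0.40, 0.30] + [0.50] * 16
toy_lift = first_decile_lift(toy_actual, toy_scores)
print(toy_lift)
```

Applying the same function to the scores from the single tree, the boosted tree, the bagged tree, and the random forest keeps the four lift figures comparable.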
Gateway International Airport (GIA) has experienced substantial growth in both commercial and general aviation operations during the past several years. (An operation is a landing or takeoff.) Because of the initiation of new commercial service at the airport, which is scheduled for several months in the future, the Federal Aviation Administration (FAA) has concluded that the increased operations and associated change in the hourly distribution of takeoffs and landings will require an entirely new work schedule for the current air traffic control (ATC) staff. The FAA feels that GIA might need to hire additional ATC personnel, because the present staff of five probably will not be enough to handle the expected demand.
After examining the various service plans that each commercial airline submitted for the next 6-month period, the FAA developed an average hourly demand forecast of total operations (Figure 1) and a weekly forecast of variation from the average daily demand (Figure 2). An assistant to the manager for operations has been delegated the task of developing workforce requirements and schedules for the ATC staff to maintain an adequate level of operational safety with a minimum of excess ATC “capacity.”
The various constraints are:
1. Each controller will work a continuous, 8-hour shift (ignoring any lunch break), which always will begin at the start of an hour at any time during the day (i.e., any and all shifts begin at X:00), and the controller must have at least 16 hours off before resuming duty.
2. Each controller will work exactly 5 days per week.
3. Each controller is entitled to 2 consecutive days off, with any consecutive pair of days being eligible.
4. FAA guidelines will govern GIA’s workforce requirements so that the ratio of total operations to the number of available controllers in any hourly period cannot exceed 16.
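Constraint 4 converts an hourly operations forecast into a minimum head count: the smallest whole number of controllers that keeps operations per controller at or below 16. A minimal sketch follows; the function name `controllers_needed` is hypothetical, and the hourly figures are made up because Figure 1 is not reproduced here.

```python
import math

MAX_OPS_PER_CONTROLLER = 16  # ratio limit stated in constraint 4


def controllers_needed(operations):
    """Smallest whole number of controllers with operations / controllers <= 16."""
    if operations < 0:
        raise ValueError("operations cannot be negative")
    return math.ceil(operations / MAX_OPS_PER_CONTROLLER)


# Hypothetical hourly operations, NOT the values from Figure 1.
hypothetical_ops = [8, 30, 48, 75]
needs = [controllers_needed(ops) for ops in hypothetical_ops]
print(needs)
```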
Questions
1. Assume that you are the assistant to the manager for operations at the FAA. Use the techniques of workshift scheduling to analyze the total workforce requirements and days-off schedule. For the primary analysis, assume that
a. Operator requirements will be based on a shift profile of demand (i.e., 8 hours).
b. There will be exactly three separate shifts each day, with no overlapping of shifts.
c. The distribution of hourly demand in Figure 1 is constant for each day of the week, but the levels of hourly demand vary during the week as shown in Figure 2.
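Assumptions a through c can be sketched in code as follows. This is an illustration under stated assumptions, not the expected solution: each shift is sized to its busiest hour so that the ratio limit of 16 holds in every hour, the three shifts are assumed to start at 00:00, 08:00, and 16:00, and the Figure 2 variation is assumed to act as a multiplier on the Figure 1 hourly demand. The function names and all numbers are hypothetical. The second function checks a days-off plan by counting how many controllers remain on duty each day when everyone takes 2 consecutive days off.

```python
import math


def shift_requirements(hourly_ops, day_factor=1.0, starts=(0, 8, 16), max_ratio=16):
    """Controllers needed per 8-hour shift, sized to the shift's busiest hour."""
    if len(hourly_ops) != 24:
        raise ValueError("expected 24 hourly values")
    needs = []
    for start in starts:
        peak = max(hourly_ops[start:start + 8])
        needs.append(math.ceil(peak * day_factor / max_ratio))
    return needs


def on_duty_by_day(workers_by_first_day_off):
    """Controllers on duty on each of 7 days.

    workers_by_first_day_off[d] is the number of controllers whose two
    consecutive days off are day d and day (d + 1) % 7.
    """
    if len(workers_by_first_day_off) != 7:
        raise ValueError("expected 7 values, one per day")
    total = sum(workers_by_first_day_off)
    on_duty = []
    for day in range(7):
        # Off today: those whose first day off is today or was yesterday.
        off = workers_by_first_day_off[day] + workers_by_first_day_off[(day - 1) % 7]
        on_duty.append(total - off)
    return on_duty


# Hypothetical demand profile, NOT the values from Figure 1 or Figure 2.
hypothetical_hourly = [4] * 8 + [40] * 8 + [20] * 8
base_needs = shift_requirements(hypothetical_hourly)
busy_day_needs = shift_requirements(hypothetical_hourly, day_factor=1.5)
coverage = on_duty_by_day([1, 1, 1, 1, 1, 1, 1])
print(base_needs, busy_day_needs, coverage)
```

A days-off plan for a shift is feasible when every entry returned by `on_duty_by_day` is at least that day's requirement for the shift.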
This assignment is designed to give you hands-on experience in performing both regression and time series forecasting. You will be given a particular real-life time series and asked to perform a regression for prediction and a time series forecast. In addition, you are asked to perform a sensitivity analysis by using different parameter values and calculating measures of error for each of those values.
Course Outcomes
This assignment is directly linked to the following key learning outcomes from the course syllabus:
CO1: Use descriptive, heuristic, and prescriptive analysis to drive business strategies and actions
CO3: Analyze the role of analytics in supporting decision making for various other stakeholder groups within and outside of your organization
CO5: Utilize applied analytics and definitions of measures of success to provide a strategic analytic roadmap for an organization
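The sensitivity analysis described in this assignment can be sketched as a loop over parameter values that records the error measures for each one. The sketch below assumes simple exponential smoothing as the forecasting method and its smoothing constant alpha as the parameter; the assignment does not name either, so both are illustrative choices. The function names and the demand series are made up, and MAPE assumes no actual value is zero.

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead forecasts, seeded with the first observation."""
    forecasts = [series[0]]
    for actual in series[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


def error_measures(actuals, forecasts):
    """Mean absolute deviation, mean squared error, and MAPE (percent)."""
    errors = [a - f for a, f in zip(actuals, forecasts)]
    n = len(errors)
    return {
        "MAD": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MAPE": 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / n,
    }


def sensitivity(series, alphas):
    """Error measures for each alpha, skipping the seeded first period."""
    results = {}
    for alpha in alphas:
        forecasts = exponential_smoothing(series, alpha)
        results[alpha] = error_measures(series[1:], forecasts[1:])
    return results


# Made-up series for illustration only.
demand = [100, 110, 120, 130]
table = sensitivity(demand, alphas=[0.5, 1.0])
for alpha, measures in table.items():
    print(alpha, measures)
```

The parameter value with the smallest error on the chosen measure is the natural candidate, although different measures can rank the values differently.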
Question 12
Use the Excel data file Electrical Power Usage to:
1. Create a scatterplot of the size of the house (x-axis) and the electrical power usage (y-axis). Display the least squares regression line and r² value on the scatterplot. Based on the scatterplot, does it appear that the size of the home and the electrical power usage are strongly related? Is the relationship a linear one?
2. Conduct a simple linear regression analysis to determine if there is a significant relationship between the size of the house (predictor variable) and electrical power usage (dependent variable).
a. State the null and alternative hypotheses.
b. Did you reject the null hypothesis at the alpha = .05 level (report the F-value and p-value)? Explain why or why not.
c. Report and interpret the correlation coefficient (r), r², and the standard error of the estimate. Also, state the regression model and interpret the regression coefficients.
Complete the problem and answer the questions in Excel.
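The quantities requested in Question 12 follow from a few sums of squares, so Excel's regression output can be cross-checked by hand. The sketch below is one way to do that in Python under the usual simple linear regression formulas; it is not a replacement for the Excel analysis the question asks for. The function name `simple_regression` and the data are made up, and the p-value is omitted because it requires the F distribution.

```python
import math


def simple_regression(x, y):
    """Least squares line plus r, r squared, standard error, and F statistic."""
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ssr = slope * sxy   # regression (explained) sum of squares
    sse = syy - ssr     # residual sum of squares
    mse = sse / (n - 2)  # residual mean square, n - 2 degrees of freedom
    return {
        "slope": slope,
        "intercept": intercept,
        "r": sxy / math.sqrt(sxx * syy),
        "r_squared": ssr / syy,
        "std_error": math.sqrt(mse),
        "F": ssr / mse,  # undefined for a perfect fit (mse == 0)
    }


# Made-up data for illustration only (not the Electrical Power Usage file).
fit = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit)
```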
Question 13
Use the Excel data set "Stock_Market" to conduct a multiple linear regression analysis to determine if there is a significant relationship between the independent variables, return on average equity (X1) and annual dividend rate (X2), and the dependent variable, stock price (Y).
a. State the null and alternative hypotheses.
b. Did you reject the null hypothesis at the alpha = .05 level? Explain why or why not (report the F-value and p-value).
c. Report and interpret the correlation coefficient (R), R², Adjusted-R², the standard error of the estimate, and the regression coefficients.
d. Check for multicollinearity by calculating the correlation coefficient between the predictor variables. Does it appear that X1 and X2 are correlated? Explain why or why not.
Complete the problem and answer the questions in Excel.
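The multicollinearity check in part d comes down to the Pearson correlation between the two predictors. A minimal sketch for cross-checking the Excel result follows. It also reports the variance inflation factor, which for exactly two predictors equals 1 / (1 - r²); the question does not ask for VIF, so treat it as an optional extra. The function names and the data are made up.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    if n < 2 or n != len(b):
        raise ValueError("need at least two paired observations")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    if abs(r) >= 1:
        raise ValueError("VIF is undefined for perfectly correlated predictors")
    return 1 / (1 - r * r)


# Made-up predictor values for illustration only (not the Stock_Market data).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
r12 = pearson_r(x1, x2)
vif = vif_two_predictors(r12)
print(r12, vif)
```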