R is a very good tool for exploring unfamiliar data and checking different hypotheses about it. I would like to analyze data collected from weather stations (temperature, humidity and solar radiation) and understand its seasonal variation. This will help me understand how to construct synthetic data with properties similar to the real data.
Real data can be obtained from the National Centers for Environmental Information. I am using 5-minute data, which gives me enough information about intra-day variation. The data is stored as a table with space-separated columns. I use data from only one weather station (NY_Ithaca_13_E.txt) for 3 consecutive years. You can download the data with “wget” and concatenate the yearly files with the “cat” command:
cat 2014-NY_Ithaca_13_E.txt 2015-NY_Ithaca_13_E.txt 2016-NY_Ithaca_13_E.txt > NY_Ithaca_13_E.txt
The first thing to do after starting R is to change the working directory to where your data is located and check that you are in the right directory:
setwd("/home/user/tmp")
getwd()
> /home/user/tmp
Next I load the data into a data frame. The columns are separated by spaces. Since the number of spaces varies, I use the sep="" option, which splits on any run of whitespace.
dft <- read.csv("NY_Ithaca_13_E.txt", sep="", header = FALSE)
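To see what sep="" does, here is a minimal sketch that parses two made-up lines with uneven spacing. The sample text and the column names are assumptions for illustration only; they are not taken from the station file.

```r
# Two made-up records with an uneven number of spaces between fields
# (illustrative values only, not real station data).
txt <- "64758 20140627    0   20.7  0.0
64758 20140627    5   20.6  0.0"

# sep = "" splits on any run of whitespace, so the extra spaces are harmless.
toy <- read.csv(text = txt, sep = "", header = FALSE)

# Optional: give the columns readable names (hypothetical names).
names(toy) <- c("station", "utc_date", "utc_time", "temp_c", "precip_mm")

str(toy)
```

Without names, R labels the columns V1, V2, and so on, which is why the rest of this post refers to columns by number.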
Now the data is stored in dft. When the weather station sensors do not work for some reason, some of the entries are missing. These are marked with -9999 values, which I filter out.
df = subset(dft, dft[,9] > -9999)
dim(dft)
> 305376 23
dim(df)
> 305112 23
We can see that 305376 - 305112 = 264 entries are filtered out.
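An alternative to dropping rows is to mark the sentinel values as missing. The sketch below compares both options on a tiny made-up column; the values and the column name temp_c are assumptions for illustration.

```r
# Toy temperature column with two sensor dropouts coded as -9999
# (made-up values, for illustration only).
toy <- data.frame(temp_c = c(20.7, -9999, 20.5, 20.4, -9999, 20.1))

# Option 1: drop the rows, as in the text.
kept <- subset(toy, temp_c > -9999)
nrow(toy) - nrow(kept)    # number of rows filtered out

# Option 2: keep the rows but mark the values as missing (NA).
toy$temp_c[toy$temp_c == -9999] <- NA
mean(toy$temp_c, na.rm = TRUE)
```

Option 2 keeps the row count unchanged, so, assuming the file has one row per time step, row positions still correspond to equally spaced 5-minute steps. Most summary functions then need na.rm = TRUE.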
Now we can do a lot of fun stuff with this data. Let's start with the temperature. According to the documentation on the NOAA site, it is the 9th column and is in degrees Celsius. Let's start the exploration by plotting all the temperature data.
plot(df[,9], type="l")
type="l" plots lines instead of points. We can clearly see the seasons in each of the 3 years of data. Now let's look at the empirical distribution function of this data:
This seems to be a nontrivial, asymmetric distribution with multiple modes. Let's zoom into the data and pick a day in the middle of the summer, around the 50000th observation.
df[50945:50945,]
>          V1       V2 V3       V4   V5 V6     V7    V8   V9 V10 V11 V12
> 50976 64758 20140627  0 20140626 1900  2 -76.25 42.44 20.7   0  46   0
>        V13 V14 V15 V16 V17   V18  V19  V20 V21  V22 V23
> 50976 19.3   C   0  78   0 0.328 21.7 1204   0 1.13   0
The V2 column is the date (2014-06-27 in UTC) and V3 is the time (0:00h). I prefer to work in UTC since each weather station can be in a different time zone.
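The date and time columns can be combined into a proper timestamp. The sketch below uses the values from the row shown above. The helper to_time is a hypothetical name, and treating V4 and V5 as the station's local date and time is an assumption based on how the values look.

```r
# Values copied from the row shown above: UTC date/time and local date/time.
utc_date <- 20140627; utc_time <- 0
lst_date <- 20140626; lst_time <- 1900

# Times are stored as HHMM integers, so pad them to 4 digits before parsing.
to_time <- function(date, time) {
  as.POSIXct(sprintf("%08d%04d", as.integer(date), as.integer(time)),
             format = "%Y%m%d%H%M", tz = "UTC")
}

utc <- to_time(utc_date, utc_time)
lst <- to_time(lst_date, lst_time)   # parsed as if UTC, only to compare clocks

format(utc, "%Y-%m-%d %H:%M")
difftime(utc, lst, units = "hours")  # offset between the two clocks
```

Applied to whole columns, for example to_time(df[,2], df[,3]), this would give a time axis that can be used for plotting instead of the row position.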
To get the data for 3 days we can use the “number:number” notation to slice the data into small chunks.
Data is sampled every 5 min: 12x24x3 = 864 data points for 3 days.
We can clearly see the day-night temperature variation, as expected.
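The row range for such a slice can be computed rather than counted by hand. This is a minimal sketch; the variable names are mine, and the starting row is the one used in the text.

```r
per_day <- 12 * 24          # 5-minute samples per day
n_days  <- 3
start   <- 50945            # starting row used in the text
end     <- start + per_day * n_days - 1

rows <- start:end
length(rows)                # number of points in the 3-day slice

# With the data frame from the text, the slice would be plotted as:
# plot(df[rows, 9], type = "l")
```

Note that a range such as 50945:51809 contains one sample more than three full days. Also, because rows with missing values were dropped from df, a fixed number of rows covers exactly 3 days only if there are no gaps in that period.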
Now let's look at the empirical distribution function of this data:
This is a bimodal distribution, with a “day” mode and a “night” mode.
Now let's look at the precipitation measurements. They are in the 10th column, in mm.
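One way to see why a day-night cycle produces two modes: a quantity that follows a smooth daily wave spends more time near its extremes than near its mean. The sketch below uses an idealized sine-shaped temperature, which is an assumption for illustration and not the station data.

```r
# Idealized 3-day temperature: a pure sine wave sampled every 5 minutes.
# The mean (15 C) and amplitude (5 C) are arbitrary illustrative choices.
per_day <- 12 * 24
i <- 0:(3 * per_day - 1)
temp <- 15 + 5 * sin(2 * pi * i / per_day)

# Count samples in 5 equal-width bins between the minimum and maximum.
bins   <- cut(temp, breaks = seq(10, 20, by = 2), include.lowest = TRUE)
counts <- table(bins)
counts

# Kernel density estimate of the same values; plot(d) would draw it.
d <- density(temp)
```

The outer bins hold more samples than the middle one, which is the same two-peaked shape seen in the real 3-day slice. Real temperatures are of course not a pure sine wave, so this only explains the general shape.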
df = subset(dft, dft[,10] > -999)
plot(df[,10], type="l")
Again we can see the seasonality in each of the 3 years of data. Next, let's look at the empirical distribution function. Kernel density estimation can be tricky here, so instead I plot the histogram on a logarithmic scale.
hd = hist(df[,10])
plot(hd$counts, type='h', log="y", lwd=30, lend=2)
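A logarithmic axis cannot show empty bins, since their count is 0. The sketch below shows one way to handle this on a made-up, precipitation-like sample; the numbers are invented for illustration.

```r
# Made-up precipitation-like sample: mostly zeros, a few larger values.
precip <- c(rep(0, 900), rep(0.3, 80), rep(1.2, 15), rep(4.5, 5))

# Compute the histogram without drawing it.
h <- hist(precip, plot = FALSE)

# Empty bins have a count of 0, which cannot be shown on a log axis,
# so keep only the non-empty bins before plotting.
keep <- h$counts > 0
data.frame(mid = h$mids[keep], count = h$counts[keep])

# plot(h$mids[keep], h$counts[keep], type = "h", log = "y", lwd = 10, lend = 2)
```

Plotting against h$mids rather than the bin index also puts the actual precipitation values on the horizontal axis.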
Solar radiation is measured in watts per square meter (W/m^2) and is stored in the 11th column. Once again I filter out the missing observations and plot the data for 3 years.
df = subset(dft, dft[,11] > -999)
plot(df[,11], type="l")
To plot the distribution I also filter out the “0.0” values, which are mostly measured during the night.
df2 = subset(dft, dft[,11] > 0)
hd = hist(df2[,11], nclass=80)
Here are the daily values for roughly the same period as the 3-day temperature plot,
and their distribution:
hd = hist(df2[50945:51809,11], nclass=80)
Note that I used “df” for the first plot and the filtered “df2” for the second one.
The shape is almost the same for each day, but it is disrupted by inverted spikes (sharp dips). These are presumably caused by clouds blocking the sunlight.
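Because df2 has the zero rows removed, the same row range does not cover the same period in df and df2. The sketch below illustrates the effect on a toy radiation series and shows the alternative of slicing first and filtering afterwards. The series shape, its scale and all names are assumptions for illustration.

```r
# Toy radiation series for 2 days at 5-minute steps: zero at night,
# a half-sine bump during the day (illustrative shape and scale).
per_day <- 12 * 24
i     <- 0:(2 * per_day - 1)
phase <- (i %% per_day) / per_day
rad   <- ifelse(phase > 0 & phase < 0.5, 800 * sin(2 * pi * phase), 0)
toy   <- data.frame(step = i, rad = rad)

# Filtering first changes row positions: row 200 of the filtered frame
# is no longer time step 199.
day_only <- subset(toy, rad > 0)
day_only$step[200]

# Slicing first and filtering afterwards keeps the time window fixed.
window    <- toy[1:per_day, ]              # first day only
window_ok <- subset(window, rad > 0)
range(window_ok$step)
```

In the toy series, row 200 of the filtered frame already belongs to the second day, while the slice-then-filter version stays within the first day.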
The shape for each day is a function of the distance from the Sun, the solar cycle, and cross-cycle changes. I could not find in the provided documentation whether the radiation is measured as power received per unit area on a horizontal surface (insolation), in which case it also depends on the height of the Sun above the horizon.
The Wikipedia article has some information on how to construct a theoretical model.
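As a starting point for such a model, the sketch below computes radiation on a horizontal surface from the Sun's elevation, using the standard relation sin(elevation) = sin(lat) sin(decl) + cos(lat) cos(decl) cos(hour angle). It is a rough sketch under strong assumptions: no atmosphere or clouds, a simple cosine approximation for the solar declination, and an approximate solar constant of 1361 W/m^2. All variable names are mine.

```r
# Rough clear-sky model for radiation on a horizontal surface.
# Assumptions: no atmosphere or clouds, a simple cosine formula for the
# solar declination, and hour angle measured from local solar noon.
lat_deg     <- 42.44         # station latitude, from column V8 above
day_of_year <- 178           # 27 June in a non-leap year
s0          <- 1361          # approximate solar constant, W/m^2

deg2rad <- pi / 180
decl <- -23.44 * deg2rad * cos(2 * pi * (day_of_year + 10) / 365)

# Sine of the solar elevation for a given hour angle (radians).
sin_elev <- function(hour_angle) {
  sin(lat_deg * deg2rad) * sin(decl) +
    cos(lat_deg * deg2rad) * cos(decl) * cos(hour_angle)
}

# One value every 5 minutes over 24 hours, centred on solar noon.
hours <- seq(-12, 12, by = 5 / 60)
rad   <- s0 * pmax(0, sin_elev(hours * 15 * deg2rad))

noon_elev <- asin(sin_elev(0)) / deg2rad   # elevation at solar noon, degrees
noon_elev
max(rad)
```

The result is a smooth daily bump that is zero at night, similar in shape to the measured curves without the cloud dips. The measured peak values should be lower, because the atmosphere absorbs and scatters part of the radiation.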
Humidity is the 16th column of the data set.
The plot clearly shows the seasonal variation. The distribution is asymmetric and unimodal.
Daily plots are similar to the temperature plots,
and the daily distribution seems to have two (or three?) modes.