Research findings

HTML presentation of my research findings

Data origin and access time

S&P500 stock price data was downloaded from Yahoo! Finance. The list of S&P500 components is based on the Wikipedia table of ticker symbols of all components. For future reference, the original Wikipedia table is stored here. Ticker symbols and data were accessed on September 29th, 2015.

In addition, the Wikipedia table contains information about industry affiliation, which was also accessed and downloaded on September 29th, 2015.

Raw data: number of observations

Data covering the period from 1962-01-02 to 2015-09-24 was downloaded for the 505 different ticker symbols listed on Wikipedia.

The unusual number of 505 stocks arises because 5 of the 500 companies have two different stocks listed (in fact, some have even more, but only two of them were included in the list of S&P500 constituents).

Company Ticker
Comcast A Corp CMCSA
Comcast Special Corp Class A CMCSK
Discovery Communications-A DISCA
Discovery Communications-C DISCK
Google Inc Class A GOOGL
Google Inc Class C GOOG
News Corp. Class A NWSA
News Corp. Class B NWS
Twenty-First Century Fox Class A FOXA
Twenty-First Century Fox Class B FOX

Observations per date

Most companies have no prices at the beginning of the sample period. Hence, the number of observations gradually increases over time as ever more companies become part of the dataset:

[Figure: number of observations per date]

The maximum number of observations is reached for nine different companies:

Ticker # observations
AA 13526
BA 13526
CAT 13526
DD 13526
DIS 13526
GE 13526
HPQ 13526
IBM 13526
KO 13526

Individual price series also tend to have no data gaps, so that the sample size increases quite regularly with the time elapsed since the first observation:

[Figure: number of observations against date of first observation]

Overall, the differences in the dates of first observations lead to quite distinct numbers of available observations per stock:

[Figure: number of available observations per stock]

A more detailed look at the tail region of critically low numbers of observations:

[Figure: stocks with the lowest numbers of observations]

21 stocks have fewer than 1000 observations, 14 fewer than 750, and 9 fewer than 500:

Ticker First observation # observations
ALLE 2013-11-18 466
BXLT 2015-06-15 72
CPGX 2015-06-17 70
GOOG 2014-03-27 378
KHC 2015-07-06 58
NAVI 2014-04-17 363
PYPL 2015-07-06 58
QRVO 2015-01-02 184
WRK 2015-06-24 65

Raw data: returns of zero

The dataset contains many returns that are equal to zero, that is, days without price movements. Of the 6,831,135 possible observations (number of dates × number of assets), only 3,215,398 are not missing, which amounts to 52.9% missing values. Of the available observations, in turn, 5.7% are zero.
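The share of missing values follows directly from the two counts quoted above. A minimal sketch of the arithmetic, using only numbers from the text (the variable names are illustrative):

```python
# Counts as reported in the text.
possible = 6_831_135   # number of dates x number of assets
available = 3_215_398  # observations that are not missing
n_assets = 505         # ticker symbols in the dataset

# Number of dates implied by the size of the full grid.
n_dates = possible // n_assets

# Fraction of the grid that holds no observation.
missing_share = 1 - available / possible

print(n_dates)                 # 13527
print(f"{missing_share:.1%}")  # 52.9%
```

The implied grid has 13527 dates, one more than the 13526 observations of the longest series listed earlier.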

The number of zero returns differs across assets:

[Figure: number of zero returns per asset]

A more meaningful comparison, however, requires adjusting for the different number of observations per stock. Even in terms of relative frequencies of zero returns, individual companies still differ:

[Figure: relative frequency of zero returns per asset]

There is a pattern, however, as stocks whose first available observation lies further in the past tend to have a higher frequency of zero returns:

[Figure: relative frequency of zero returns against date of first observation]

The reason for the differences between assets is that the minimum tick size changed over time, and larger minimum tick sizes imply more returns of zero. The following tick size changes occurred during the sample period:

The consequences of tick size changes for zero return frequencies can best be seen by looking at zero return frequencies for each day. Relative frequency is measured relative to the number of available observations on that day, hence excluding any missing observations.

[Figure: relative frequency of zero returns per day]
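The daily measure described above can be sketched as follows. This is an illustration only: the function name is hypothetical, and missing observations are assumed to be represented as `None` (a NaN-based representation would need a different filter).

```python
from typing import Optional, Sequence

def zero_return_frequency(returns_on_day: Sequence[Optional[float]]) -> Optional[float]:
    """Share of zero returns among the non-missing returns of one day."""
    # Drop missing observations so they do not enter the denominator.
    available = [r for r in returns_on_day if r is not None]
    if not available:
        return None  # no asset has data on this day
    zeros = sum(1 for r in available if r == 0.0)
    return zeros / len(available)

# Toy cross-section: five assets, two of them without data on this day.
day = [0.0, 0.012, None, -0.004, None]
freq = zero_return_frequency(day)  # one zero among three available returns
```

Dividing by the number of available observations rather than by the number of assets keeps the early part of the sample, where most assets are missing, comparable with later dates.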

In numbers, the following zero return frequencies exist per time period:

period zero return frequency
1962-01-03 - 1997-06-23 0.132529
1997-06-24 - 2001-01-28 0.0454137
2001-01-28 - 2015-09-24 0.0097912

The tick size changes hence make the time series non-stationary, as the statistical properties of the data clearly change over time. We will nevertheless neglect this non-stationarity, as small positive and small negative returns should be shrunk to zero in a similar manner, and hence roughly cancel out. This simplification still entails some danger, as some assets have more than 10% zero returns, with a maximum frequency of zero returns of almost 25% for one asset.

Volatility vs zero return frequency

As higher volatility reduces the probability of returns being close to zero, we should expect a negative relation between volatility and zero return frequencies. In order to test this hypothesis, the S&P500 index was downloaded from Yahoo! Finance, and its underlying volatility series was estimated with a GARCH(1,1) model with t-distributed innovations.
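For reference, a GARCH(1,1) model describes the conditional variance through the recursion sigma2[t] = omega + alpha * r[t-1]^2 + beta * sigma2[t-1]. The sketch below only applies this recursion for given parameters and does not estimate them; the t-distribution enters through the likelihood used in estimation, which is not shown. The function name, the parameter values, the toy returns, and the initialization at the sample variance are all assumptions made for illustration, not the settings behind the results reported here.

```python
import math
from typing import List, Sequence

def garch11_volatility(returns: Sequence[float], omega: float,
                       alpha: float, beta: float) -> List[float]:
    """Conditional volatility path implied by a GARCH(1,1) recursion."""
    n = len(returns)
    mean = sum(returns) / n
    # Assumption: start the recursion at the sample variance.
    sigma2 = sum((r - mean) ** 2 for r in returns) / n
    vols = [math.sqrt(sigma2)]
    # Each new variance uses the previous return (treated as zero-mean)
    # and the previous variance.
    for r in returns[:-1]:
        sigma2 = omega + alpha * r ** 2 + beta * sigma2
        vols.append(math.sqrt(sigma2))
    return vols

# Hypothetical daily returns and parameters, for illustration only.
returns = [0.001, -0.002, 0.030, -0.025, 0.004, 0.001]
vols = garch11_volatility(returns, omega=1e-6, alpha=0.1, beta=0.85)
```

In this toy series the large return at the third position raises the volatility estimate for the following day, which is the clustering behaviour the model is meant to capture.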

As the zero return frequency series contains structural breaks, the relation between volatility and zero return frequencies must be examined for each tick size regime separately. Calculating the correlation between both time series gives the following values:

period correlation
1962-01-03 - 1997-06-23 -0.00663718
1997-06-24 - 2001-01-28 -0.223812
2001-01-28 - 2015-09-24 -0.187395
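The per-regime calculation behind a table like the one above can be sketched as follows. The function names, the regime labels and boundaries, and the toy data are hypothetical; only the idea of computing one Pearson correlation per regime is taken from the text.

```python
import math
from datetime import date
from typing import Dict, Sequence, Tuple

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_per_regime(
    dates: Sequence[date],
    vol: Sequence[float],
    zero_freq: Sequence[float],
    regimes: Dict[str, Tuple[date, date]],
) -> Dict[str, float]:
    """Correlate both series separately within each (start, end) regime."""
    out = {}
    for label, (start, end) in regimes.items():
        # Select the observations that fall into this regime (bounds inclusive).
        idx = [i for i, d in enumerate(dates) if start <= d <= end]
        out[label] = pearson([vol[i] for i in idx],
                             [zero_freq[i] for i in idx])
    return out

# Toy data with two hypothetical regimes.
dates = [date(2000, 1, 3), date(2000, 1, 4), date(2000, 1, 5),
         date(2002, 1, 2), date(2002, 1, 3), date(2002, 1, 4)]
vol = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
zero_freq = [0.06, 0.04, 0.02, 0.010, 0.012, 0.008]
regimes = {"early": (date(2000, 1, 1), date(2000, 12, 31)),
           "late": (date(2002, 1, 1), date(2002, 12, 31))}
result = correlation_per_regime(dates, vol, zero_freq, regimes)
```

Computing the correlation within each regime keeps the level shifts in zero return frequency at the regime boundaries from dominating the estimate.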

One would expect correlations to decrease in absolute size over time, since volatility should have less effect the smaller the tick size gets. The first regime, however, shows almost no correlation. One explanation could be that the first regime (especially in the beginning) consists of only very few assets, for which the S&P500 volatility might be a bad approximation if they do not follow general market patterns closely.

In order to get more detailed insights, the following graphics show the estimated S&P500 volatility series, standardized to mean zero and standard deviation one, together with a smoothed estimator of the standardized zero return frequencies. The smoothed estimator is plotted with a negative sign, in order to turn the negative relation between both lines into a positive one.

[Figures: standardized S&P500 volatility and negated, smoothed standardized zero return frequency]
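The transformation applied to the zero return frequencies before plotting can be sketched as follows. The smoother used for the graphics is not specified, so the trailing moving average here is an assumed stand-in, and the function names, window length, and toy values are hypothetical.

```python
import math
from typing import List, Sequence

def standardize(x: Sequence[float]) -> List[float]:
    """Rescale a series to mean zero and standard deviation one."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]

def moving_average(x: Sequence[float], window: int) -> List[float]:
    """Trailing moving average (assumed smoother, shorter at the start)."""
    out = []
    for i in range(len(x)):
        chunk = x[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Toy zero return frequencies, for illustration only.
zero_freq = [0.05, 0.07, 0.04, 0.02, 0.03, 0.01]
z = standardize(zero_freq)
# Flip the sign so that a negative relation with volatility shows up
# as two lines moving in the same direction.
negated_smooth = [-v for v in moving_average(z, window=3)]
```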

All graphics show some indication of co-movement between both lines, and the relation gets stronger from approximately 1975 onward.

Raw data: descriptive analysis

Processed data: descriptive analysis

References