People have tried to predict crops and crop yields for centuries. A hundred years ago, regression models were used. Today, satellite data combined with machine learning is used to predict crops, and the predictions have only very recently become accurate.


Importance of Predicting Crops

Humanity depends on farming, not just for food but also for jobs and the economy. Over 1 billion people, nearly 30% of the entire global workforce, are involved in farming.

While people have been eating grains and plants for over 100,000 years, crop (or arable) farming is relatively “new”. The process of actively managing the land and plants started around 11,000 years ago in the Neolithic period, also known as the New Stone Age.

Wheat, corn, and rice were among the first crops to be farmed, with wheat one of the earliest known. Rice farming is more recent, starting around 5,000 years ago in India.

The production of food and crops directly limits the size of the population. If there is not enough food, the population can’t grow. As farming improves, the population grows. In the 1700s the population of England was around 5.7 million, and it had been around this number for nearly 2,000 years. Following the invention of new farming techniques in the 1750s, the English population nearly tripled to 16.6 million in just 100 years.

The invention of the Haber-Bosch process for producing fertilizer in the early 1900s enabled the global population to explode from 1.6 billion to 7.7 billion in just over one hundred years.

The population explosions in the UK and around the world are due to improved crop production.

Given the importance of arable farming, trying to predict the yield of crops has been around as long as farming itself.

Harvesting Wheat – The Importance of Crop Predictions

Predicting Crops – Regression Models

Over the millennia, as farming evolved, people stopped using the stars and animal sacrifices to manage their crops and moved to more and more scientific methods.

As farming accelerated during the Industrial Revolution, mechanization, mathematics, and scale became important, and increasingly modern solutions were deployed.

Regression analysis, which was only developed in the 19th century, was being used in farming by the 1900s. In the 1920s R. A. Fisher was promoting the use of regression for agriculture, and in 1925 the Cobb-Douglas function was being used to measure agricultural productivity.

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.

Regression models for crop predictions tend to depend on parameters such as weather, soil variables, and historic crop yield to make predictions. 
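As a minimal sketch of the idea, the snippet below fits a one-variable least-squares line relating seasonal rainfall to yield and uses it to forecast a new season. The function names and all of the numbers are hypothetical, chosen only for illustration (the data is deliberately exactly linear so the result is easy to check by hand). Real crop models would typically use several predictors, such as weather, soil variables, and historic yields.

```python
# Minimal sketch: ordinary least squares with a single predictor.
# All numbers below are made up for illustration; they are not real yield data.

def fit_simple_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict the dependent variable for a new value of the independent variable."""
    return slope * x + intercept


# Hypothetical seasonal rainfall (mm) and yield (t/ha) for five past seasons.
rainfall_mm = [300, 350, 400, 450, 500]
yield_t_ha = [2.0, 2.5, 3.0, 3.5, 4.0]

slope, intercept = fit_simple_regression(rainfall_mm, yield_t_ha)
forecast = predict(slope, intercept, 420)
print(f"slope={slope:.4f}, intercept={intercept:.2f}, forecast={forecast:.2f} t/ha")
```

Here yield is the dependent variable and rainfall is the independent variable. The fitted slope says how much extra yield the model associates with each additional millimetre of rain.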

These models have now been used in farming for nearly a hundred years. The use of regression techniques accelerated as the relevant data and computing power became readily available.


While some analyses have shown good levels of accuracy for linear regression methods, others have argued that linear regression models are limited:

The application of correlation and regression analysis has provided some qualitative understanding of the variables and their interactions that were involved in cropping systems and has contributed to the progress of agricultural science. However, the quantitative information obtained from this type of analysis is very site specific. The information obtained can only be reliably applied to other sites where climate, important soil parameters and crop management are similar to those used in developing the original functions. Thus, the quantitative applicability of regression based crop yield models for decision making is severely limited.

Predicting Crops – Machine Learning

Machine learning is widely used in many areas of life, from advertising to fraud prevention. It is now also being used to predict crop yields.

Studies have shown that methods such as Random Forests outperform linear regression. Another study showed that M5-Prime has the highest level of accuracy (on a variety of metrics) when predicting crop yields.

In this second study, using M5-Prime, the input parameters were:


  • Planting area (PA)
  • Irrigation water depth (IWD)
  • Solar radiation (SR)
  • Rainfall (RF)
  • Maximum temperature (MaxT)
  • Average temperature (AvgT)
  • Minimum temperature (MinT)
  • Season-duration cultivar (SDC).

The results show that M5-Prime and k-nearest neighbor techniques obtain the lowest average RMSE errors (5.14 and 4.91), the lowest RRSE errors (79.46% and 79.78%), the lowest average MAE errors (18.12% and 19.42%), and the highest average correlation factors (0.41 and 0.42). Since M5-Prime achieved the largest number of crop yield models with the lowest errors, it was deemed “very suitable” for crop yield prediction, at scale, in agricultural planning.
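To make the metrics quoted above more concrete, here is a rough sketch of how root mean square error (RMSE), mean absolute error (MAE), a correlation factor, and a relative error expressed as a percentage can be computed. The percentage error is assumed here to be a root relative squared error (RRSE), which compares the model against always predicting the mean. The study's exact definitions (for example, how MAE was normalized to a percentage) are not given here, so textbook formulas are assumed, and the yields are invented.

```python
import math

# Textbook definitions, assumed for illustration; not taken from the study itself.

def rmse(actual, predicted):
    """Root mean square error, in the same units as the yields."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error, in the same units as the yields."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rrse(actual, predicted):
    """Root relative squared error: model error relative to predicting the mean."""
    mean_a = sum(actual) / len(actual)
    model_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_a) ** 2 for a in actual)
    return math.sqrt(model_error / baseline_error)


def correlation(actual, predicted):
    """Pearson correlation between observed and predicted values."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical observed and predicted yields (t/ha) for five fields.
observed = [8.0, 9.0, 10.0, 11.0, 12.0]
forecast = [9.0, 9.0, 10.0, 11.0, 11.0]

print(f"RMSE = {rmse(observed, forecast):.3f} t/ha")
print(f"MAE  = {mae(observed, forecast):.3f} t/ha")
print(f"RRSE = {rrse(observed, forecast):.1%}")
print(f"R    = {correlation(observed, forecast):.3f}")
```

Lower RMSE, MAE, and RRSE values and a higher correlation indicate a better model. An RRSE below 100% means the model beats the naive baseline of predicting the average yield every time.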

What is Accurate?

When predicting crops there are multiple challenges, including establishing whether the prediction is accurate. For example, if a prediction model states that there will be 1 million hectares of wheat, what is the actual figure? What is used to verify the prediction?

Knowing the “ground truth” is a challenge because human-reported results are unreliable. Studies have shown that 73% of farmers in the region studied underreported their yields when self-reporting. The underreporting was significant, with an error rate of over 6%: farmers underreported by an average of 568 kg/ha, compared to an expected yield of around 9,000 kg/ha.

With nearly three quarters of the farmers underreporting by over 5% on average, finding the ground truth is not straightforward.
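The size of the underreporting can be checked with simple arithmetic. The short sketch below uses only the two figures quoted above (568 kg/ha and 9,000 kg/ha); the variable names are illustrative.

```python
# Figures quoted in the text above.
average_shortfall_kg_ha = 568    # average amount by which farmers underreported
expected_yield_kg_ha = 9000      # approximate expected yield

# Relative size of the underreporting.
relative_error = average_shortfall_kg_ha / expected_yield_kg_ha

# Yield a typical underreporting farmer would have stated.
implied_reported_kg_ha = expected_yield_kg_ha - average_shortfall_kg_ha

print(f"Underreporting: {relative_error:.1%}")                 # prints "Underreporting: 6.3%"
print(f"Implied reported yield: {implied_reported_kg_ha} kg/ha")  # prints 8432 kg/ha
```

An error of this size in the reference data puts a floor under how precisely any prediction model can be validated against self-reported yields.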

Predicting Crops – Satellite Data

Predicting crops, regardless of the algorithm used, depends on high-quality data. Satellites are providing new data and driving new levels of accuracy.

The latest levels of accuracy for crop prediction are now 96% to 98%, using a combination of satellite data sources and decision tree models. It has not always been this way, and there have been many challenges around predicting crops from satellite data.