Sunday, June 27, 2021

How can Big Data Principles Resolve the Limitations of Machine Learning?

Consider a math student who is taught a subject and then given a set of practice problems along with an answer key. As the student works through the problem set, they check the accuracy of their answers and progressively become better at solving such problems. Machine learning (ML) functions much like that student, though the number of "practice problems" is substantially greater and in some cases there is no initial teaching at all. ML is a subfield of computer science that focuses on the use of data and algorithms to imitate human learning [11]. Unlike its human counterpart, however, machine learning lends itself to a relatively objective assessment of its faults and bottlenecks.
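This analogy maps directly onto supervised learning: the labels are the answer key, and each correction is the student's reflection. The following minimal sketch (a hypothetical illustration, not drawn from any cited source) fits the line y = 2x + 1 by repeatedly checking predictions against labels and adjusting:

    # Practice problems (inputs) paired with an answer key (labels) for y = 2x + 1.
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

    w, b = 0.0, 0.0          # the model's current "understanding"
    learning_rate = 0.05

    for epoch in range(200):                # each pass is another round of practice
        for x, y_true in data:
            y_pred = w * x + b              # attempt the problem
            error = y_pred - y_true         # check against the answer key
            w -= learning_rate * error * x  # adjust based on the mistake
            b -= learning_rate * error

    print(f"learned w={w:.2f}, b={b:.2f}")  # converges toward w=2, b=1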

ML algorithms process and analyze large volumes of data in order to gradually improve their performance. However, they suffer from substantial training overhead, heavy computational load, and long run times [1]. New and emerging research on machine learning [3, 4, 5, 6, 9] draws attention to the application of big data principles, such as distributed systems (DS) and massively parallel processing (MPP), as a means of resolving these limitations. Machine learning stands at the forefront of modern technology, and it is no secret that demand for machine learning professionals far outstrips supply. This shortage, coupled with the complex interdisciplinary nature of incorporating big data methodologies into machine learning, is a large part of why so few have sought to apply these ideas. Despite the challenges, the apparent benefits make the subject a worthwhile pursuit.


Reduced Training Overhead and Run Time


Just as corporations must train their employees before they are ready for work, a machine learning algorithm must undergo a training process of its own. That process is often hefty and leaves sizable room for optimization: depending on the purpose of the algorithm, training can take anywhere from several hours to several days or even weeks [12]. Distributed systems can significantly reduce the time spent in this phase. While there is an inherent system overhead that results from using a DS [14], an adapted approach to ML that handles data in bulk rather than individually scales far better to large data sets. Depending on the efficiency of the distributed system, this strategy could allow training to proceed tens or even hundreds of times faster. When an abundance of data is available, the process can be accelerated further with MPP, which forgoes the sequential limitation of a distributed system and instead processes many data items simultaneously. Although the total amount of data processing needed to reach convergence would increase, the increased rate of processing would more than make up the difference, yielding an even greater reduction in overhead.
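To make the bulk-handling idea concrete, the hypothetical sketch below simulates data-parallel training on a single machine using Python's multiprocessing module: the data set is split into shards, each worker computes a gradient over its shard, and the averaged result updates the shared model. A real distributed system would place the shards on separate nodes, but the structure of the computation is the same.

    from multiprocessing import Pool

    def shard_gradient(args):
        """Mean-squared-error gradient for the model y = w*x over one shard."""
        w, shard = args
        return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

    if __name__ == "__main__":
        data = [(float(x), 2.0 * x) for x in range(1, 101)]  # synthetic labeled data
        shards = [data[i::4] for i in range(4)]              # one shard per worker
        w = 0.0
        with Pool(processes=4) as pool:
            for step in range(50):
                grads = pool.map(shard_gradient, [(w, s) for s in shards])
                w -= 1e-4 * sum(grads) / len(grads)          # averaged update
        print(f"w after data-parallel training: {w:.3f}")    # approaches 2.0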


Similar to the reduction in training overhead, ML algorithms could see significant reductions in operation time through the use of distributed systems and MPP. These optimizations, however, depend on the purpose and function of the algorithm and do not hold for all scenarios. An ML algorithm that regularly processes large bulks of data would benefit heavily from the drastically increased scalability provided by a DS and MPP. On the other hand, an ML algorithm whose regular function is to handle a few small inputs or single data points would suffer from the inherent overhead involved in using a distributed or parallel system.
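This trade-off is visible even on a single machine. In the hypothetical sketch below, each dispatch to a process pool carries a fixed overhead, so bulk (chunked) dispatch amortizes the cost while item-by-item dispatch pays it in full:

    import time
    from multiprocessing import Pool

    def score(x):
        return x * x  # stand-in for a cheap per-item model evaluation

    if __name__ == "__main__":
        items = list(range(100_000))
        with Pool(processes=4) as pool:
            t0 = time.perf_counter()
            pool.map(score, items, chunksize=1)        # one dispatch per item
            per_item = time.perf_counter() - t0

            t0 = time.perf_counter()
            pool.map(score, items, chunksize=10_000)   # bulk dispatch
            bulk = time.perf_counter() - t0

        print(f"per-item: {per_item:.2f}s, bulk: {bulk:.2f}s")  # bulk is far faster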


Figure 1: A sample graph highlighting the differences in operation time for a standard system compared to a distributed system 


Improved Handling of Computational Load


The amount of data created, copied, and consumed worldwide is growing at an exceptional rate. In 2020 it totalled 64.2 zettabytes and is expected to grow by almost 20% annually [15]. With this rapid growth in data usage comes the need for improved infrastructure to support it. As it stands, commercial-scale machine learning applications already have high resource requirements, and scaling single systems to keep up with the demands of advancing technology would prove increasingly difficult.
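To put that growth rate in perspective, compounding the 2020 figure at roughly 20% per year gives the following back-of-the-envelope projection (a sketch based only on the rate quoted above, not figures taken from [15]):

    volume_zb = 64.2  # reported global data volume for 2020 [15]
    for year in range(2020, 2026):
        print(f"{year}: ~{volume_zb:.0f} ZB")
        volume_zb *= 1.20  # compound at roughly 20% per year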


One of the most modern solutions to data storage and processing in big data is Hadoop, a framework that pairs a distributed file system with the MapReduce programming model. The Hadoop Distributed File System (HDFS) uses a network of low-cost machines to distribute data and computational load quickly while providing fault tolerance and high availability [16]. HDFS offers a solution to data scalability not only because of its cost efficiency, but also because it can expand its resource pool simply by adding new machines to the network. Adapting machine learning algorithms to run over a system such as HDFS would ease the resource demands of commercial machine learning applications and enable them to meet the scalability demands of the future.
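To make the MapReduce model concrete, the sketch below simulates its three stages (map, shuffle, reduce) in plain Python using the classic word-count example. This is a toy stand-in rather than the actual Hadoop API; in a real cluster, the map and reduce stages run on the nodes that hold the data blocks.

    from collections import defaultdict

    def map_phase(block):
        """Each node emits (word, 1) pairs for its local block of text."""
        return [(word, 1) for word in block.split()]

    def shuffle(pairs):
        """Group intermediate values by key, as the framework does between stages."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        """Aggregate each key's values into a final count."""
        return {key: sum(values) for key, values in groups.items()}

    # Blocks as HDFS might store them, spread across low-cost machines.
    blocks = ["big data meets machine learning",
              "machine learning meets big data"]

    intermediate = [pair for block in blocks for pair in map_phase(block)]
    print(reduce_phase(shuffle(intermediate)))
    # {'big': 2, 'data': 2, 'meets': 2, 'machine': 2, 'learning': 2}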


Figure 2: A simple overview of the HDFS architecture


Conclusion


Machine learning is on the cutting edge of modern technology, yet its possibilities remain vastly unexplored. While countless feats have already been achieved through machine learning, the field's insufficient supply of professionals and its sophisticated nature leave many shortcomings. Among the largest of these are the optimization of machine learning algorithms and their scalability toward the future. Many commercial-scale applications of machine learning suffer from hefty training overhead and lengthy run times, and they also face inadequate infrastructure for the expanding consumption of data that comes with advancing technology. Both issues find a solution in the adoption of big data principles such as distributed systems and massively parallel processing.


There is immense value in applying big data principles to machine learning. As we are hardly scratching the surface of what is possible with machine learning, it is imperative that we improve it and prepare it for the future wherever possible. With demand for ML professionals so high and education in the field growing, it is only a matter of time before these principles find their way into machine learning practice.



References


[1] L. Zhou, S. Pan, J. Wang, and A. V. Vasilakos, “Machine learning on big data: opportunities and challenges,” Neurocomputing, vol. 237, pp. 350–361, May 2017. https://doi.org/10.1016/j.neucom.2017.01.026


[2] Q. Bi, K. E. Goodman, J. Kaminsky, and J. Lessler, “What is machine learning? A primer for the epidemiologist,” American Journal of Epidemiology, October 2019. https://doi.org/10.1093/aje/kwz189


[3] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning,” ACM Computing Surveys, vol. 52, no. 4, pp. 1–43, September 2019. https://doi.org/10.1145/3320060


[4] H. Zhang, “Introduction to distributed deep learning systems,” Petuum Inc., February 2018. https://petuum.medium.com/intro-to-distributed-deep-learning-systems-a2e45c6b8e7


[5] P. Yue, F. Gao, B. Shangguan, and Z. Yan, “A machine learning approach for predicting computational intensity and domain decomposition in parallel geoprocessing,” International Journal of Geographical Information Science, vol. 34, no. 11, pp. 2243–2274, February 2020. https://doi.org/10.1080/13658816.2020.1730850


[6] Y. N. Khalid, M. Aleem, U. Ahmed, M. A. Islam, and M. A. Iqbal, “Troodon: A machine-learning based load-balancing application scheduler for CPU–GPU system,” Journal of Parallel and Distributed Computing, vol. 132, pp. 79–94, October 2019. https://doi.org/10.1016/j.jpdc.2019.05.015


[7] C.-W. Tsai, C.-F. Lai, H.-C. Chao, et al., “Big data analytics: a survey,” Journal of Big Data, May 2015. https://doi.org/10.1186/s40537-015-0030-3


[8] H. Kuwajima, H. Yasuoka, and T. Nakae, “Engineering problems in machine learning systems,” Machine Learning, vol. 109, no. 5, pp. 1103–1126, April 2020. https://doi.org/10.1007/s10994-020-05872-w


[9] J. Zhang, J. Zhan, J. Li, J. Jin, and L. Qian, “Optimizing execution for pipelined-based distributed deep learning in a heterogeneously networked GPU cluster,” Concurrency and Computation: Practice and Experience, vol. 32, no. 23, July 2020. https://doi.org/10.1002/cpe.5923


[10] M. Langer, A. Hall, Z. He, and W. Rahayu, “MPCA SGD—A method for distributed training of deep learning models on Spark,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 11, pp. 2540–2556, May 2018. https://doi.org/10.1109/TPDS.2018.2833074


[11] IBM Cloud Education, “What is machine learning?,” IBM, 15-Jun-2020. [Online]. Available: https://www.ibm.com/cloud/learn/machine-learning 


[12] T. Liu, S. Alibhai, J. Wang, Q. Liu, X. He, and C. Wu, “Exploring transfer learning to reduce training overhead of HPC data in machine learning,” 2019 IEEE International Conference on Networking, Architecture and Storage (NAS), 2019, pp. 1–7. https://doi.org/10.1109/NAS.2019.8834723


[13] P. Sun, Y. Wen, T. N. Binh Duong, and S. Yan, “Timed dataflow: reducing communication overhead for distributed machine learning systems,” 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), 2016, pp. 1110–1117. https://doi.org/10.1109/ICPADS.2016.0146



[14] “Efficiency,” Distributed Computing project, Stanford University. [Online]. Available: https://cs.stanford.edu/people/eroberts/courses/soco/projects/distributed-computing/html/body_efficiency.html


[15] A. Holst, “Total data volume worldwide 2010-2025,” Statista, 07-Jun-2021. [Online]. Available: https://www.statista.com/statistics/871513/worldwide-data-created/ 


[16] “Hadoop - Introduction,” Tutorialspoint. [Online]. Available: https://www.tutorialspoint.com/hadoop/hadoop_introduction.htm


[17] R. Gibb, “What is a Distributed System?,” StackPath, 26-Jun-2019. [Online]. Available: https://blog.stackpath.com/distributed-system/


[18] “What is MPP (massively parallel processing)?,” WhatIs.com, TechTarget, 25-Jan-2011. [Online]. Available: https://whatis.techtarget.com/definition/MPP-massively-parallel-processing
