Exploring Trends of Data Science
Data science is the practice of uncovering patterns, trends, and insights from large sets of data. Unlike AI, which aims to create intelligent behavior, data science emphasizes data analysis and employs a range of techniques—from traditional statistics to more advanced machine learning models. This thread will delve into how data scientists approach problems, the methods they use (both machine learning and beyond), and the diverse applications where data science drives decisions across industries. I’ll be posting regular updates about the evolving landscape of data science, sharing the latest methods, trends, and tools that are shaping this field.
11/1/2024, 5 min read
Python is Number 1 in TIOBE Index
Jan. 10, 2025
Python has been named "TIOBE's Programming Language of the Year 2024" in the TIOBE Index for achieving the largest increase in ratings over the year. Nonetheless, while Python offers numerous advantages, it also faces challenges such as performance limitations and runtime errors. The TIOBE Index measures programming language popularity based on the number of skilled engineers worldwide, courses, and third-party vendors, with ratings calculated from popular websites such as Google and Amazon. Positions two through five are occupied by C++, Java, C, and C#. Notably, SQL ranks eighth, R is positioned at number 18, and SAS is at number 22.
That’s my take on it:
Python's widespread popularity is largely driven by the growing demand for data science and machine learning (DSML). Its rich ecosystem of libraries and frameworks, including TensorFlow, PyTorch, and scikit-learn, makes it an ideal choice for DSML tasks. Interestingly, certain programming languages exhibit remarkable longevity. For example, JavaScript was ranked seventh in 2000 and currently holds sixth place. Similarly, Fortran, which was ranked 11th in 1995, now occupies the tenth position. The resurgence of Fortran is notable; according to TIOBE, it excels in numerical analysis and computational mathematics, both of which are increasingly relevant in artificial intelligence. Fortran is also gaining traction in image processing applications, including gaming and medical imaging.
While some languages maintain stable rankings over time, others have shown dramatic improvements. For instance, SQL was ranked 100th in 2005 but has since risen to ninth place. Predicting the future trajectory of programming languages is challenging, underscoring the dynamic nature of the field. As the saying goes, "Never say never!"
Links:
11/20/2024
A recent article titled "AI-Assisted Genome Studies Are Riddled with Errors" by Dr. Sitaraman highlights the challenges and errors associated with using artificial intelligence (AI) in large genomics studies. Researchers have employed AI to fill in gaps in patient information and improve predictions in genome-wide association studies (GWAS). However, new research from the University of Wisconsin-Madison reveals that these AI-assisted approaches can lead to false positives and misleading correlations.
For 15 years, GWAS has been used to identify genetic variants associated with traits or diseases. Despite its success, GWAS has limitations, which scientists have attempted to overcome using AI. However, AI can introduce biases, especially when working with incomplete datasets. The research highlights that AI-assisted GWAS can create false associations between gene variants and diseases. For instance, AI models showed a high correlation between certain gene variants and type 2 diabetes, which was not supported by conventional GWAS. Further, the use of proxy data, such as family history, in GWAS-by-proxy (GWAX) can also lead to incorrect conclusions. For example, AI approaches showed a positive correlation between educational attainment and Alzheimer's risk, contrary to established GWAS findings. The research team suggests new statistical methods to correct these biases and emphasizes the need for transparency and rigor in reporting findings from AI-assisted studies.
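To make the idea of a false association concrete, here is a toy simulation of one way it could arise. It is only a sketch under my own assumptions, not the method or data of the Wisconsin study: the variables (genotype, environment, proxy), the effect sizes, and the use of a simple least-squares line as a stand-in for the AI model are all hypothetical. The true phenotype depends on the environment only, while the proxy feature used for imputation is influenced by both the environment and the gene variant.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical variant: 0, 1, or 2 copies of an allele with frequency 0.3.
genotype = [int(rng.random() < 0.3) + int(rng.random() < 0.3) for _ in range(n)]
environment = [rng.gauss(0, 1) for _ in range(n)]

# Assumption: the true phenotype depends on environment only, not on the variant.
phenotype = [e + rng.gauss(0, 1) for e in environment]

# Assumption: the proxy feature is shaped by BOTH environment and the variant.
proxy = [e + g + rng.gauss(0, 0.5) for e, g in zip(environment, genotype)]

# Stand-in for the AI model: learn phenotype from the proxy on the first
# 1000 "labelled" individuals, then impute the phenotype for everyone.
labelled = 1000
a, b = fit_line(proxy[:labelled], phenotype[:labelled])
imputed = [a + b * x for x in proxy]

r_true = pearson(genotype, phenotype)
r_imputed = pearson(genotype, imputed)
print(f"variant vs. true phenotype:    r = {r_true:.3f}")
print(f"variant vs. imputed phenotype: r = {r_imputed:.3f}")
```

In this setup the variant is essentially uncorrelated with the true phenotype, yet clearly correlated with the imputed one, because the imputed values inherit the genetics of the proxy feature rather than of the trait itself.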
That’s my take on it:
No doubt machine learning methods have overshadowed conventional statistics in big data analytics. However, no solution is 100% foolproof. Like conventional statistics, machine learning methods can be misapplied and misused. We should avoid the mentality that “I use a hammer and so every problem is a nail”: don’t apply ML just because it is popular or powerful, and then mindlessly assume that the conclusion must be right. Rather, we must consider the nature of the data, the question being asked, and the desired outcome. When a method is experimental and the data pattern is strange, we must evaluate it with a skeptical eye. After all, skepticism is a guiding principle of Tukey’s exploratory data analysis.
Link: https://www.the-scientist.com/ai-assisted-genome-studies-are-riddled-with-errors-72339


Machine learning methods fail in genome studies
Is Rguroo Version 2.0 a journey to data science?
11/12/2024
Today, Rguroo announced the release of Rguroo Version 2.0, titled “Journey to Data Science.” Among its major updates is an enhanced logistic regression feature. The new diagnostic tools include interactive logistic curve plotting, prediction assessments, external data prediction, model validation, and k-fold cross-validation. Rguroo is a web-based statistical platform with a graphical user interface (GUI) that provides access to R’s capabilities without requiring users to know R programming.
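For readers unfamiliar with k-fold cross-validation of a logistic regression model, the idea can be sketched in a few lines of plain Python. This is a minimal illustration under my own assumptions, not Rguroo's implementation: the helper names (fit_logistic, k_fold_accuracy), the gradient-descent settings, and the synthetic data are all hypothetical. The data are split into k folds; each fold is held out once while the model is fitted on the remaining folds, and accuracy is measured on the held-out fold.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit intercept b0 and slope b1 by batch gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def k_fold_accuracy(xs, ys, k=5, seed=0):
    """Return the held-out classification accuracy for each of the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_logistic([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(
            (sigmoid(b0 + b1 * xs[i]) >= 0.5) == (ys[i] == 1) for i in held_out
        )
        scores.append(hits / len(held_out))
    return scores


# Synthetic data (an assumption for illustration): one predictor, with the
# probability of y = 1 rising along a logistic curve.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(300)]
ys = [1 if rng.random() < sigmoid(2.0 * x) else 0 for x in xs]

scores = k_fold_accuracy(xs, ys, k=5)
mean_accuracy = sum(scores) / len(scores)
print("fold accuracies:", [round(s, 3) for s in scores])
print("mean accuracy:", round(mean_accuracy, 3))
```

Averaging the fold accuracies gives a more honest estimate of out-of-sample performance than scoring the model on the same data used to fit it.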
That’s my take on it:
While Rguroo benefits analysts and students by making R’s statistical and graphical tools more accessible, I am uncertain whether this release truly represents a “journey to data science.” After exploring the software, I noticed it lacks core data science methods, such as decision trees, random forests, XGBoost, gradient boosting, and neural networks. The focus on logistic regression, a valuable classical statistical method, reflects a model-driven, inference-centered approach rather than the data exploration and pattern recognition that define data science and machine learning. In fact, this highlights a broader issue: many so-called “data science” tools and programs don’t fully reflect the paradigm shift toward data-centric methodologies.
Link: https://rguroo.com/


DSML trend: Best visualization tools for business intelligence
11/8/2024
According to Forbes Advisor, the top data visualization tools for business in 2024 are as follows:
·Microsoft Power BI: Leader in business intelligence (BI) with robust integration capabilities
·Tableau: Known for sophisticated interactive visualizations
·Qlik Sense: Stands out for AI integration and machine learning features
·Klipfolio: Excels in custom dashboard creation
·Looker: Provides comprehensive visualization options and data modeling
·Zoho Analytics: Seamlessly integrates with other Zoho products
·Domo: Distinguished by its custom app development capabilities
The evaluation criteria included user-friendliness, cost-effectiveness, support quality, and key features such as real-time analytics, customization options, and collaborative data sharing.
That’s my take on it:
Data visualization tools are essential for both business and academic purposes, offering powerful ways to analyze and present complex data. While the tools mentioned by Forbes are indeed popular for business intelligence, there are several excellent options for academic and other specialized purposes. For example:
·SAS Visual Analytics on SAS Viya: General purpose
·JMP Pro: General purpose
·IBM Watson Studio: General purpose
·MATLAB: Popular in engineering and scientific computing, providing robust visualization tools alongside computational capabilities.
·Wolfram Mathematica: A powerful and comprehensive computational software system that offers extensive capabilities for data visualization, scientific computing, and statistical analysis.
·Origin: Specifically designed for scientific graphing and data analysis, popular in physical sciences and engineering.
·Gephi: An open-source tool particularly useful for network analysis and visualization, popular in social sciences and complex systems research.
·Python with libraries like Matplotlib, Seaborn, and Plotly: Widely used in data science and research for its flexibility and powerful visualization options.
Factors for selecting visualization tools:
·Data complexity and size: Tools like SAS Viya and IBM Watson Studio are better suited for very large datasets.
·Statistical analysis needs: JMP and Python offer more advanced statistical capabilities.
·Collaboration requirements: Cloud-based solutions like IBM Watson Studio may offer better collaboration features.
·Domain-specific needs: Some fields may have preferred tools (e.g., Gephi for network analysis; MATLAB and Mathematica for mathematics and engineering).
Link: https://www.forbes.com/advisor/business/software/best-data-visualization-tools/
