
7 Python Libraries For Parallel Processing

Python is long on convenience and programmer-friendliness, but it isn't the fastest programming language around. Some of Python's speed limitations are due to its default implementation, CPython, being single-threaded; that is, CPython doesn't use more than one hardware thread at a time.

And while you can use Python's built-in threading module to speed things up, threading only gives you concurrency, not parallelism. It's good for running multiple tasks that aren't CPU-dependent, but does nothing to speed up multiple tasks that each require a full CPU. This may change in the future, but for now, it's best to assume threading in Python won't give you parallelism.

Python does include a native way to run a workload across multiple CPUs. The multiprocessing module spins up multiple copies of the Python interpreter, each on a separate core, and provides primitives for splitting tasks across cores. But sometimes even multiprocessing isn't enough.
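Before reaching for third-party tools, it's worth seeing what that built-in route looks like. Here is a minimal sketch that fans a CPU-bound function out across cores with a multiprocessing Pool; the worker function is just a stand-in:

```python
from multiprocessing import Pool

def cpu_heavy(n):
    # Stand-in for a CPU-bound task
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Pool() defaults to one worker process per available CPU core
    with Pool() as pool:
        results = pool.map(cpu_heavy, [10**6] * 8)
    print(results)
```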

In some cases, the job calls for distributing work not only across multiple cores, but also across multiple machines. That's where the Python libraries and frameworks introduced in this article come in. Here are seven frameworks you can use to spread an existing Python application and its workload across multiple cores, multiple machines, or both.

Ray

Developed by a team of researchers at the University of California, Berkeley, Ray underpins a number of distributed machine learning libraries. But Ray isn't limited to machine learning tasks alone, even if that was its original use case. You can break up and distribute any type of Python task across multiple systems with Ray.

Ray's syntax is minimal, so you don't need to rework existing applications extensively to parallelize them. Applying the @ray.remote decorator to a function distributes it across any available nodes in a Ray cluster, with optionally specified parameters for how many CPUs or GPUs to use. The results of each distributed function are returned as Python objects, so they're easy to manage and store, and the amount of copying across or within nodes is minimal. This last feature comes in handy when dealing with NumPy arrays, for instance.
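A rough sketch of that pattern follows; the square function is purely illustrative, while @ray.remote, the .remote() call, and ray.get() are Ray's core primitives:

```python
import ray

ray.init()  # connect to a running cluster, or start a local one

@ray.remote(num_cpus=1)  # optionally reserve CPUs/GPUs per task
def square(x):
    return x * x

# Each call returns an object reference (a future) immediately
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # gather results as ordinary Python objects
```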

Ray even includes its own built-in cluster manager, which can automatically spin up nodes as needed on local hardware or popular cloud computing platforms. Other Ray libraries let you scale common machine learning and data science workloads, so you don't have to manually scaffold them. For instance, Ray Tune lets you perform hyperparameter tuning at scale for most common machine learning systems (PyTorch and TensorFlow, among others).

Dask

From the outside, Dask looks a lot like Ray. It, too, is a library for distributed parallel computing in Python, with its own task scheduling system, awareness of Python data frameworks like NumPy, and the ability to scale from one machine to many.

One key difference between Dask and Ray is the scheduling mechanism. Dask uses a centralized scheduler that handles all tasks for a cluster. Ray is decentralized, meaning each machine runs its own scheduler, so any issues with a scheduled task are handled at the level of the individual machine, not the whole cluster. Dask's task framework works hand-in-hand with Python's native concurrent.futures interfaces, so for those who've used that library, most of the metaphors for how jobs work should be familiar.

Dask works in two basic ways. The first is by way of parallelized data structures—essentially, Dask's own versions of NumPy arrays, lists, or Pandas DataFrames. Swap in the Dask versions of those constructions for their defaults, and Dask will automatically spread their execution across your cluster. This typically involves little more than changing the name of an import, but may sometimes require rewriting to work completely.
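For example, a Dask array stands in for a NumPy array; the shape and chunk size below are arbitrary:

```python
import dask.array as da

# Looks like NumPy, but is split into chunks that can be processed in parallel
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x + x.T).mean(axis=0)  # builds a task graph lazily
print(result.compute())          # executes on the cluster (or local threads)
```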

The second way is through Dask's low-level parallelization mechanisms, including function decorators, that parcel out jobs across nodes and return results synchronously (in "immediate" mode) or asynchronously ("lazy" mode). Both modes can be mixed as needed.
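Here is a brief sketch of the lazy, decorator-based style using dask.delayed; the load and summarize functions are placeholders:

```python
from dask import delayed

@delayed
def load(i):
    # Placeholder for reading or generating a chunk of data
    return list(range(i))

@delayed
def summarize(chunks):
    return sum(len(c) for c in chunks)

total = summarize([load(i) for i in range(10)])  # nothing runs yet
print(total.compute())                           # execute the whole graph
```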

Dask also offers a feature called actors. An actor is an object that points to a job on another Dask node. This way, a job that requires a lot of local state can run in place and be called remotely by other nodes, so the state for the job doesn't have to be replicated. (Ray offers a comparable actor abstraction of its own, via remote classes.) However, Dask's scheduler isn't aware of what actors do, so if an actor runs wild or hangs, the scheduler can't intercede. "High-performing but not resilient" is how the documentation puts it, so actors should be used with care.
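A minimal sketch of the actor pattern with dask.distributed; the Counter class is made up for illustration:

```python
from dask.distributed import Client

class Counter:
    # Stateful object that will live on a single worker
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

if __name__ == "__main__":
    client = Client()  # local cluster, for demonstration
    # The actor is created on one worker; method calls are proxied to it
    counter = client.submit(Counter, actor=True).result()
    print(counter.increment().result())  # 1
    print(counter.increment().result())  # 2
```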

Dispy

Dispy lets you distribute whole Python programs or just individual functions across a cluster of machines for parallel execution. It uses platform-native mechanisms for network communication to keep things fast and efficient, so Linux, macOS, and Windows machines work equally well. That makes it a more generic solution than others discussed here, so it's worth a look if you need something that isn't specifically about accelerating machine-learning tasks or a particular data-processing framework.

Dispy syntax somewhat resembles multiprocessing in that you explicitly create a cluster (where multiprocessing would have you create a process pool), submit work to the cluster, then retrieve the results. A little more work may be required to modify jobs to work with Dispy, but you also gain precise control over how those jobs are dispatched and returned. For instance, you can return provisional or partially completed results, transfer files as part of the job distribution process, and use SSL encryption when transferring data.
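A sketch of that flow, in the spirit of dispy's documentation examples; compute() is a placeholder workload:

```python
import dispy

def compute(n):
    # Runs on a remote node, so keep imports inside the function
    import time
    time.sleep(1)
    return n * n

if __name__ == "__main__":
    cluster = dispy.JobCluster(compute)           # discovers available nodes
    jobs = [cluster.submit(i) for i in range(8)]  # dispatch one job per input
    for job in jobs:
        result = job()                            # wait for and return the result
        print(job.id, result)
    cluster.close()
```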

Pandaral·lel

Pandaral·lel, as the name implies, is a way to parallelize Pandas jobs across multiple cores. The downside is that Pandaral·lel works only with Pandas. But if Pandas is what you're using, and all you need is a way to accelerate Pandas jobs across multiple cores on a single computer, Pandaral·lel is laser-focused on the task.
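Usage stays deliberately close to plain Pandas; the DataFrame and function below are illustrative:

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()  # spawns one worker per core by default

df = pd.DataFrame({"x": range(1_000_000)})

def transform(value):
    return value ** 0.5

# parallel_apply mirrors Series.apply, but spreads the work across cores
df["y"] = df["x"].parallel_apply(transform)
```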

Note that while Pandaral·lel does run on Windows, it will run only from Python sessions launched in the Windows Subsystem for Linux. Linux and macOS users can run Pandaral·lel as-is. 

Ipyparallel

Ipyparallel is another tightly focused multiprocessing and task-distribution system, specifically for parallelizing the execution of Jupyter notebook code across a cluster. Projects and teams already working in Jupyter can start using Ipyparallel immediately.

Ipyparallel supports many approaches to parallelizing code. On the simple end, there's map, which applies any function to a sequence and splits the work evenly across available nodes. For more complex work, you can decorate specific functions to always run remotely or in parallel.
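A sketch of the map approach, assuming a cluster of engines is already running (for example, started with `ipcluster start -n 4`):

```python
import ipyparallel as ipp

rc = ipp.Client()  # connect to the running cluster
view = rc[:]       # a view over all engines

# Split the work evenly across engines and gather the results
results = view.map_sync(lambda x: x ** 2, range(16))
print(results)
```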

Jupyter notebooks support "magic commands" for actions that are only possible in a notebook environment. Ipyparallel adds a few magic commands of its own. For example, you can prefix any Python statement with %px to automatically parallelize it.
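In a notebook, a minimal sketch might look like this (the magics operate on whichever view has been activated):

```python
import ipyparallel as ipp

rc = ipp.Client()
rc[:].activate()        # register %px / %%px for this view

%px import os           # runs on every engine
%px print(os.getpid())  # each engine reports its own process ID
```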

Joblib

Joblib has two major goals: run jobs in parallel and don't recompute results if nothing has changed. These efficiencies make Joblib well-suited for scientific computing, where reproducible results are sacrosanct. Joblib's documentation provides plenty of examples for how to use all its features.

Joblib syntax for parallelizing work is simple enough—it amounts to a decorator that can be used to split jobs across processors, or to cache results. Parallel jobs can use threads or processes.
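A sketch of the parallel-execution side; simulate() is an illustrative stand-in for real work:

```python
from joblib import Parallel, delayed

def simulate(seed):
    # Stand-in for an expensive, independent computation
    return sum((seed * i) % 7 for i in range(100_000))

# n_jobs=-1 uses all cores; prefer="processes" sidesteps the GIL for CPU-bound work
results = Parallel(n_jobs=-1, prefer="processes")(
    delayed(simulate)(s) for s in range(8)
)
print(results)
```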

Joblib includes a transparent disk cache for Python objects created by compute jobs. This cache not only helps Joblib avoid repeating work, as noted above, but can also be used to suspend and resume long-running jobs, or pick up where a job left off after a crash. The cache is also intelligently optimized for large objects like NumPy arrays. Regions of data can be shared in-memory between processes on the same system by using numpy.memmap. This all makes Joblib highly useful for work that may take a long time to complete, since you can avoid redoing existing work and pause/resume as needed.
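And a sketch of the caching side using joblib.Memory; the cache directory and function are placeholders:

```python
import numpy as np
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # transparent on-disk cache

@memory.cache
def expensive_transform(n):
    # Recomputed only when the arguments (or the function's code) change
    return np.sqrt(np.arange(n))

a = expensive_transform(1_000_000)  # computed and written to disk
b = expensive_transform(1_000_000)  # loaded straight from the cache
```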

One thing Joblib does not offer is a way to distribute jobs across multiple separate computers. In theory it's possible to use Joblib's pipeline to do this, but it's probably easier to use another framework that supports it natively. 

Parsl

Short for "Parallel Scripting Library," Parsl lets you take computing jobs and split them across multiple systems using roughly the same syntax as Python's existing Pool objects. It also lets you stitch together different computing tasks into multi-step workflows, which can run in parallel, in sequence, or via map/reduce operations.

Parsl lets you execute native Python applications, but also run any other external application by way of commands to the shell. Your Python code is just written like normal Python code, save for a special function decorator that marks the entry point to your work. The job-submission system also gives you fine-grained control over how things run on the targets—for example, the number of cores per worker, how much memory per worker, CPU affinity controls, how often to poll for timeouts, and so on.
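A sketch of that decorator-based entry point, using Parsl's bundled local-threads configuration; the double() app is illustrative:

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)  # local execution; swap the config to target a cluster

@python_app
def double(x):
    return 2 * x

futures = [double(i) for i in range(10)]  # each call returns a future
print([f.result() for f in futures])      # block until all results arrive
```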

One excellent feature Parsl offers is a set of prebuilt templates to dispatch work to a variety of high-end computing resources. This not only includes staples like AWS or Kubernetes clusters, but supercomputing resources (assuming you have access) like Blue Waters, ASPIRE 1, Frontera, and so on. (Parsl was co-developed with the aid of many of the institutions that built such hardware.)

Conclusion

Python's limitations with threads won't be around forever; major changes are slated to allow threads to run side by side for CPU-bound work. But those updates are years away from being usable. Libraries designed for parallelism can help fill the gap while we wait.

Copyright © 2023 IDG Communications, Inc.




How Will The Big Data Market Evolve In The Future?

Big data has been around for some time now, becoming a more or less common concept in business. However, recent developments in AI technology have shaken up an already volatile field, inviting us to reconsider our projections of how the big data market will look in the future.

We can already see the signs that these developments have game-changing effects on the labor market, business data management, and entire organizational structures. Tracking these signs allows for a better understanding of this fast-paced evolution that we are witnessing.

Rapid developments in big data

Mostly driven by evolving web data gathering technologies, the recent breakthrough years in the big data sector have brought many positive changes. Complex machine-learning models have become more accessible; hardware and software for training ML algorithms are now cheaper and more specialized; and tools for creating and optimizing models are more readily available thanks to cloud technology. Apart from the advancements in ML, two other important trends that significantly influence big data processing capabilities stand out:

  • More powerful graphics processing units (GPUs) and improvements in the precision with which AI performs tasks allow businesses to make the most of parallel processing. Two or more processing units solving different aspects of the same problem now produce better solutions faster, enlarging the scope of use cases for this method.
  • The rapid rise of MLops (machine learning operations) allows more effective ML model deployment, observability, and experimentation in production environments.

    Companies of all sizes have come to realize that big data, and the ML algorithms built on it, will be among the fastest-growing and most growth-inducing factors in business. This year's remarkably high-valued acquisitions of very young tech companies show it: for example, Databricks paid $1.3 billion for MosaicML, a 60-employee company only a few years old, because it offered a novel and convenient method for training AI-based tools.

    There is room for more innovation as current big data-based solutions are certainly not perfect. In the near future, we can expect developments in models for generating text and visuals, as well as improved tools for tasks related to communication.

    On the other hand, there are legitimate concerns about biased and unethical decisions that AI can come up with when there is no human oversight. These concerns will continue to foster regulatory initiatives such as the European Union's Artificial Intelligence Act (AIA).

    Growing regulation will, most probably, force firms to look for new ways to collect or generate the necessary data. Furthermore, companies will also need to figure out how to properly utilize all of the data they are now collecting instead of letting it collect dust.

    Big data across industries–the impact on apps

    One of the greatest benefits big data brings is its immense flexibility and customizability. Undoubtedly, big data will affect almost every industry, all the way from e-commerce to SaaS. But the greatest potential lies in the latter, particularly apps.

    Most other industries rely on a mix of brick-and-mortar and digital operations while applications are nearly always exclusively digital. Since big data mostly arises out of the digital space, the business of apps is better positioned to take advantage of the large volume of information.

    Many of the required data collection pipelines are already in place for applications. Some may already be collecting volumes of data large enough to count as big data. Yet how to utilize it remains a relative mystery to many companies.

    Key big data touchpoints for mobile app companies all revolve around improving the application itself. Every single interaction point in an application can be logged, analyzed, and acted upon. In an ideal sense, big data provides companies with the opportunity to build the "perfect" application.

    Yet, such implementations of big data only touch upon the first layer of the industry. First and second-party data will only get you so far. With third-party data, either acquired from Data-as-a-Service or through implementations of web scraping, companies can begin adapting to the market as a whole.

    In other words, it is one thing to inch closer to perfection when developing an app for your existing users. It's an entirely different thing to inch closer to perfection for the market as a whole. And that's where third-party-generated big data can come into play.

    Big data, which includes third-party sources, can be used to predict market trends, consumer needs, and many other aspects of the entire industry. An entirely new world of business insights can become available through big data.

    Finally, app companies are uniquely positioned for testing data-driven hypotheses. They can easily modify features of their services or design, and run A/B tests with a quick turnaround time. E-commerce businesses, for example, have a harder time testing hypotheses as these will often involve physical inventory.

    Therefore, while big data is positioned to change all industries, the SaaS and application sectors have the most potential to extract value out of big data. But that takes a large number of specialists with rare skills.

    Employers' perspective–the growing need for data specialists

    As an increasing number of firms get interested in applying big data solutions, the demand for various kinds of data specialists is bound to keep growing. Along with compliance officers, big data experts capable of creating tools based on that data are at the top of the "most wanted" list.

    Data engineering is at the center of professions in the big data sector. Data engineers are the ones responsible for obtaining data and its initial processing, enabling the creation of new models. Meanwhile, among emerging professions, the demand for MLops (machine learning operations) engineers is also growing fast. Without MLops engineers, companies usually cannot deploy or supervise machine learning models created by data scientists.

    The demand for data specialists is being boosted even more by new AI-based tools, like ChatGPT, that attract huge public interest and media coverage. Up to a point, such tools might save a company's time and increase productivity. Additionally, these tools foster interest in big data and the emergence of new professions. For example, the position of prompt engineer, which now boasts potential six-figure salaries, was unheard of just a few years ago.

    Data democratization is another trend affecting the labor market. Companies aim to remove data silos, enabling more business users to work with data directly in the course of carrying out their main tasks. This goes along with shifting some data analysis responsibilities from data teams to product, marketing, or other departments. Thus, it can be expected that the need for specialists who are skilled in both data analytics and one of the domains of business will grow in the future.

    Employees' perspective–getting the skills in demand

    From the perspective of job seekers, the aforementioned developments mean two things.

  • The big data sector provides an increasing number of lucrative career opportunities.
  • Having skills in data analytics is a major advantage for specialists across the departments.

    Naturally, this raises interest in acquiring data-related skills among those thinking about their future career path. In terms of higher education, aside from study programs that explicitly have "data" or "AI" in their titles, future students can choose general subjects like mathematics and statistics to acquire a robust analytical background. Knowing specific programming languages, such as Python, is beneficial too, as data scientists and engineers today often need to automate processes (for example, data collection at scale).

    Interestingly, getting the skills relevant to the big data market does not necessarily require going for the hard sciences. Social sciences, like psychology, are filled with courses on higher mathematics and statistical modeling while also training experts to interpret real-life social events and human actions.

    Even the humanities are well placed to shine in today's big data labor market. A background in linguistics or philology can be an advantage for prompt engineers and other specialists working with natural language processing tools. Meanwhile, philosophy counts AI ethics among its subdisciplines and provides the fundamentals of the interdisciplinary theory of decision-making.

    There are also plenty of opportunities outside formal education to learn data-related skills, suitable both for newcomers to the labor market and for seasoned workers looking to gain additional qualifications. Various online courses allow learning on your own schedule, making them easy to fit around a day job or other responsibilities. Accredited private institutions such as Turing College remotely prepare data specialists with exactly the skills and practical knowledge currently in demand.

    Willingness to learn constantly is perhaps the most important attribute when aiming for a career in the big data sector. It all begins with learning the fundamentals of statistics, databases, and programming languages like SQL and Python for data processing. When core knowledge is in place, it is important to keep track of technical innovations, new tools, models, and firms in the big data industry. Platforms like Substack provide access to numerous blogs and newsletters that allow one to conveniently stay on top of such news.

    Finally, one should have an active interest in the principles of business and how it functions in order to solve its problems with the help of big data. After all, the main goal of data processing and analysis is finding new and better ways to do business.

    In conclusion

    Being able to work with big data and AI provides a continually growing advantage for both companies and employees. However, the big data market is so dynamic and fast-evolving that all future predictions should come with a disclaimer. Unforeseen innovations and developments can quickly give birth to new professions while making others obsolete. The key to feeling secure in such volatile conditions is effective learning: both companies and employees should be prepared to absorb and pass on new knowledge as it is created, nearly in real time.







