Apache programming resources
Speaker: Juan Luis Cano Title: Beyond pandas: a comparison of dataframes in Python Room: Theory 8 (Sunday) ----------------------------------------- Abstract: The pandas library has been one of the decisive factors behind Python's growth in the data analysis industry over the past decade, and it continues to help data scientists solve problems 15 years after its creation. Thanks to its success, there are now several open-source projects that claim to improve on pandas in various ways: in this talk we will review those alternatives. During the talk we will give a brief introduction to pandas, discuss its importance, and point out some of its limitations, as its author himself did five years ago (https://wesmckinney.com/blog/apache-arrow-pandas-internals/). We will list some of its alternatives and classify them (pandas-like or different, single-node vs distributed). We will briefly cover RAPIDS, Dask, Modin, and Spark. We will show code snippets from Arrow, Vaex, and Polars through Jupyter notebooks hosted on Orchest Cloud, and discuss the strengths of these libraries. We will conclude with a set of guidelines for choosing one project or another depending on the use case and requirements. By the end of the talk the audience will know more about how some of the modern alternatives to pandas fit into the ecosystem, will understand which ones offer an easier migration path, and will be better prepared to judge which one to use for upcoming projects. Basic knowledge of pandas will help in following the rest of the presentation. The talk materials are available on GitHub (https://github.com/astrojuanlu/talk-dataframes), and a series of blog posts develops the concepts covered in the talk: Arrow Vaex Polars
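One idea that distinguishes several of these alternatives from classic pandas is lazy evaluation. The toy sketch below (a hypothetical API, not the actual Polars or Vaex interface) illustrates the model: operations are recorded as a plan and only run when results are requested, which lets an engine reorder or fuse steps before touching the data.

```python
# Toy sketch of the lazy-evaluation model used by some pandas
# alternatives (hypothetical API, not a real library): each call only
# records an operation; nothing executes until .collect().

class LazyFrame:
    def __init__(self, rows):
        self._rows = rows      # list of dicts, standing in for columnar data
        self._plan = []        # recorded operations, not yet executed

    def filter(self, predicate):
        self._plan.append(("filter", predicate))
        return self

    def select(self, *columns):
        self._plan.append(("select", columns))
        return self

    def collect(self):
        # Execute the recorded plan; a real engine would optimize it first.
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

rows = [{"name": "a", "x": 1}, {"name": "b", "x": 5}]
result = (LazyFrame(rows)
          .filter(lambda r: r["x"] > 2)
          .select("name")
          .collect())
print(result)  # [{'name': 'b'}]
```

Because the whole query is known before execution, a real engine can push the filter down to the storage layer or skip reading unused columns entirely.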
In this series of episodes dedicated to Confluent, today we talk about Schema Registry. We will start by comparing how a synchronous application works with how an asynchronous one does, and then analyze why Schema Registry plays a fundamental role in the latter. You can find the rest of the episodes in this series at the following links: https://www.ivoox.com/stream-processing-kafka-es-que-audios-mp3_rf_86044083_1.html https://www.ivoox.com/que-es-como-funciona-apache-kafka-audios-mp3_rf_81153210_1.html https://www.ivoox.com/descubriendo-kafka-confluent-primeros-pasos-audios-mp3_rf_75587433_1.html Speakers: Víctor Rodríguez, Solutions Architect at Confluent. Alberto Grande, Head of the Innovation team at Paradigma Digital. Want to listen to our podcasts? https://www.ivoox.com/podcast-apasionados-tecnologia_sq_f11031082_1.html Want to know which events we are organizing next? https://www.paradigmadigital.com/eventos/
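The core service a schema registry provides can be sketched in a few lines. This is a minimal in-memory illustration of the idea (not Confluent's actual API): producers register a schema under a subject and receive an id; consumers later resolve that id to decode messages, even long after the producer has evolved its schema.

```python
# Minimal in-memory sketch of what a schema registry does
# (illustrative only, not Confluent Schema Registry's API).

class SchemaRegistry:
    def __init__(self):
        self._schemas = {}       # schema id -> schema definition
        self._subjects = {}      # subject -> list of schema ids (versions)
        self._next_id = 1

    def register(self, subject, schema):
        # Assign a new id and record it as the next version of the subject.
        schema_id = self._next_id
        self._next_id += 1
        self._schemas[schema_id] = schema
        self._subjects.setdefault(subject, []).append(schema_id)
        return schema_id

    def get(self, schema_id):
        # Consumers resolve the id embedded in a message to a schema.
        return self._schemas[schema_id]

registry = SchemaRegistry()
v1 = registry.register("orders-value", {"fields": ["id", "amount"]})
assert registry.get(v1) == {"fields": ["id", "amount"]}
```

In the real system the id travels with each Kafka message, so producers and consumers never need to agree on a schema out of band; that decoupling is exactly what asynchronous applications require.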
Developer Advocate at Apache APISIX A Developer Advocate with 15+ years of experience consulting for many different customers in a wide range of contexts, such as telecoms, banking, insurance, large retail, and the public sector. Usually working with Java/Java EE and Spring technologies, with focused interests in Rich Internet Applications, testing, CI/CD, and DevOps, Nicolas also doubles as a trainer and triples as a book author.
APIs are the glue that holds our information systems together. If you run more than a couple of apps, having each of them implement authentication and similar concerns is going to be an Ops nightmare. You definitely need a central point of management: an API Gateway. As developers, we live more and more in an interconnected world. Perhaps you’re developing microservices? Maybe you’re exposing your APIs on the web? In all cases, web APIs are the glue that binds our architecture together. In the Java world, we are very fortunate to have a lot of libraries to help us manage related concerns: rate limiting, authentication, service discovery; you name it. Yet, these concerns are cross-cutting. They impact all our applications in the same way. Perhaps libraries are not the optimal way to handle them. API Gateways are a popular and nowadays quite widespread way to move these concerns out of the applications to a central place. In this talk, I’ll describe some of these concerns in more detail and how you can benefit from an API Gateway. Then, I’ll list some of the solutions available on the market. Finally, I’ll demo APISIX, an Apache-managed project built on top of NGINX that offers quite a few features to ease your development.
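To make the "cross-cutting concern" argument concrete, here is a hedged sketch of one such concern a gateway typically centralizes: token-bucket rate limiting. This is a generic illustration in Python, not APISIX's implementation (APISIX plugins are written in Lua on top of NGINX).

```python
# Generic token-bucket rate limiter: the kind of cross-cutting logic an
# API Gateway centralizes so every backend app doesn't reimplement it.
import time

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity          # burst size
        self.tokens = capacity            # start full
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Allow bursts of 2, then refill slowly (one token every 2 seconds).
bucket = TokenBucket(capacity=2, refill_per_sec=0.5)
results = [bucket.allow() for _ in range(3)]
print(results)  # [True, True, False]
```

Implemented once at the gateway, the same policy applies uniformly to every upstream service, regardless of the language it is written in.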
In this third episode in collaboration with Confluent, we will talk about what Stream Processing is and discover that it is much more than processing millions of events per second. You can find the rest of the episodes in this series at the following links: https://www.ivoox.com/que-es-como-funciona-apache-kafka-audios-mp3_rf_81153210_1.html https://www.ivoox.com/descubriendo-kafka-confluent-primeros-pasos-audios-mp3_rf_75587433_1.html Speakers: Sergio Durán Vegas, Head of Solutions Engineering for Spain and Portugal at Confluent. Jesús Pau de la Cruz, Software Architect at Paradigma Digital. Want to listen to our podcasts? https://www.ivoox.com/podcast-apasionados-tecnologia_sq_f11031082_1.html Want to see other tutorials? https://www.youtube.com/c/ParadigmaDigital/playlists Want to know which events we are organizing next? https://www.paradigmadigital.com/eventos/
In today's episode we will talk about Apache Kafka. We will see what it is, how it works, and the different distributions available. We will also look at use cases where Kafka can help us, and at the benefits Confluent brings compared with other distributions. If you missed our first episode about the Confluent ecosystem, or want to listen to it again, here is the link: https://www.ivoox.com/descubriendo-kafka-confluent-primeros-pasos-audios-mp3_rf_75587433_1.html Speakers: Víctor Rodríguez, Solutions Architect at Confluent. Jesús Pau, Software Architect at Paradigma Digital. To keep up with every video tutorial, subscribe to our channel and you will get all the news about technology, digital transformation, events, and much more. https://www.youtube.com/user/ParadigmaTe?sub_confirmation=1 Want to see other tutorials? https://www.youtube.com/c/ParadigmaDigital/playlists Want to listen to our podcasts? https://www.ivoox.com/podcast-apasionados-tecnologia_sq_f11031082_1.html Want to know which events we are organizing next? https://www.paradigmadigital.com/eventos/
“Data comes at us fast” is what they say. In fact, the last couple of years taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data in a batch or streaming way. Despite these advances, data sharing has been severely limited, because sharing solutions were tied to a single vendor, did not work for live data, came with severe security issues, and did not scale to the bandwidth of modern cloud object stores. Conferences have been filled for many years with sessions about how to architect applications and master the APIs of your services, but recent events have shown a huge business demand for sharing massive amounts of live data in the most direct, scalable way possible. One example is open data sets of genomic data shared publicly for the development of vaccines. Still, many commercial use cases share news, financial, or geological data with a restricted audience, where the data has to be secured. In this session, dive deep into an open source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time. It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: Data Providers and Recipients. The data provider decides what data to share and runs a sharing server. An open source reference sharing server is available to get started sharing Apache Parquet or Delta.io tables. Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can provide filters on the data (e.g., “country=ES”) to read a subset of the data.
Since the data is presented as pandas or Spark dataframes, integration with ML frameworks such as MLflow or SageMaker is seamless.
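The "country=ES" filter mentioned above points at server-side partition pruning: the sharing server can skip files whose partition values cannot match the client's predicate. The sketch below illustrates that idea; the table layout and hint format are simplified assumptions, not the exact Delta Sharing wire format.

```python
# Sketch of partition pruning driven by a predicate hint like
# "country=ES" (simplified; not the actual Delta Sharing protocol).

files = [
    {"url": "part-0.parquet", "partitions": {"country": "ES"}},
    {"url": "part-1.parquet", "partitions": {"country": "FR"}},
    {"url": "part-2.parquet", "partitions": {"country": "ES"}},
]

def prune(files, hint):
    # Parse a hint of the form "column=value" and keep only the files
    # whose partition value matches; the rest are never transferred.
    column, value = hint.split("=")
    return [f for f in files if f["partitions"].get(column) == value]

matching = prune(files, "country=ES")
print([f["url"] for f in matching])  # ['part-0.parquet', 'part-2.parquet']
```

Because pruning happens before any data leaves the object store, the recipient pays only for the bandwidth of the subset it actually asked for.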
As individuals, we use time series data in everyday life all the time. If you’re trying to improve your health, you may track how many steps you take daily, and relate that to your body weight or size over time to understand how well you’re doing. This is clearly a small-scale example, but on the other end of the spectrum, large-scale time series use cases abound in our current technological landscape: tracking the price of a stock or cryptocurrency that changes every millisecond, performance and health metrics of a video streaming application, sensors reading temperature, pressure, and humidity, or the information generated by millions of IoT devices. Modern digital applications require collecting, storing, and analyzing time series data at extreme scale, and with performance that a relational database simply cannot provide. We have all seen very creative solutions built to work around this problem, but as throughput needs increase, scaling them becomes a major challenge. To get the job done, developers end up landing, transforming, and moving data around repeatedly, using multiple components pipelined together. Looking at these solutions really feels like looking at Rube Goldberg machines. It’s staggering to see how complex architectures become in order to satisfy the needs of these workloads. Most importantly, all of this is something that needs to be built, managed, and maintained, and it still doesn’t meet very high scale and performance needs. Many time series applications can generate enormous volumes of data. One common example here is video streaming. The act of delivering high-quality video content is a very complex process. Understanding load latency, video frame drops, and user activity is something that needs to happen at massive scale and in real time. This process alone can generate several GBs of data every second, while easily running hundreds of thousands, sometimes over a million, queries per hour.
A relational database certainly isn’t the right choice here. Which is exactly why we built Timestream at AWS. Timestream started out by decoupling data ingestion, storage, and query such that each can scale independently. The design keeps each sub-system simple, making it easier to achieve unwavering reliability, while also eliminating scaling bottlenecks and reducing the chances of correlated system failures, which becomes more important as the system grows. At the same time, in order to manage overall growth, the system is cell-based: rather than scale the system as a whole, we segment it into multiple smaller copies of itself, so that these cells can be tested at full scale and a system problem in one cell can’t affect activity in any of the other cells. In this session, I will describe the time-series problem, take a look at some architectures that have been used in the past to work around it, and introduce Amazon Timestream, a purpose-built database to process and analyze time-series data at scale. I will demo how it can be used to ingest and process time-series data at scale as a fully managed service, and how it can be easily integrated with open source tools like Apache Flink or Grafana.
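The cell-based design described above can be sketched in a few lines: tenants are mapped deterministically to one of N independent cells, so each cell stays small enough to test at full scale and a failure in one cannot spread to the others. The hash choice and cell count below are illustrative assumptions, not Timestream's internals.

```python
# Toy sketch of cell-based partitioning: deterministic tenant-to-cell
# routing, so each cell is an independent, fully testable copy of the
# system (illustrative; not Amazon Timestream's actual implementation).
import hashlib

CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]

def cell_for(tenant_id):
    # A stable hash ensures a tenant always lands in the same cell,
    # with no shared routing state to become a bottleneck.
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

assert cell_for("customer-42") == cell_for("customer-42")  # deterministic
assert cell_for("customer-42") in CELLS
```

The payoff is blast-radius containment: an incident, a bad deployment, or a noisy tenant affects one cell's worth of customers instead of the whole fleet.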
CDC is a set of patterns that lets us detect changes in a data source and act on them. In this webinar we will look at one of the reactive implementations of CDC, based on Debezium, which will let us replicate the changes made to a legacy system based on DB2 and Oracle into an Apache Kafka event bus in real time, with the goal of enabling a digital transformation of the current system. Repository: https://github.com/paradigmadigital/debezium Who are the speakers? Jesús Pau de la Cruz. I am a Computer Engineer from Universidad Rey Juan Carlos and I love technology and the possibilities it offers the world. Interested in designing real-time solutions, distributed and scalable architectures, and cloud environments. I currently work at Paradigma as a Software Architect. José Alberto Ruiz Casarrubios. A computer engineer by vocation, a technology all-rounder, and a tireless learner. I am always looking for new challenges to which I can try to bring the best solution. Fully immersed in the world of software development and systems modernization. A believer that applying common sense is the best of methodologies and decisions.
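On the consuming side, replication boils down to applying change events to a downstream copy. The sketch below uses events shaped like Debezium's published envelope ("before"/"after"/"op"); the replica store is a plain dict standing in for whatever downstream system receives the changes.

```python
# Applying CDC change events (Debezium-style "before"/"after"/"op"
# envelope) to a replica. The dict replica is a stand-in for a real
# downstream store.

replica = {}  # primary key -> current row

def apply_change(event):
    op = event["op"]            # "c"=create, "u"=update, "d"=delete,
    if op in ("c", "u", "r"):   # "r"=snapshot read
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        replica.pop(event["before"]["id"], None)

apply_change({"op": "c", "before": None, "after": {"id": 1, "name": "ana"}})
apply_change({"op": "u", "before": {"id": 1, "name": "ana"},
              "after": {"id": 1, "name": "ana maria"}})
apply_change({"op": "d", "before": {"id": 1, "name": "ana maria"}, "after": None})
print(replica)  # {} -- the row was created, updated, then deleted
```

In the real pipeline these events arrive on Kafka topics populated by the Debezium connectors reading the DB2 and Oracle transaction logs, so the replica converges on the legacy system's state without touching the legacy application.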
One of the architectures growing in adoption due to the popularity of microservices is Event-Driven Architecture (EDA). Using patterns such as Event Sourcing and Event Collaboration, it decouples microservices and makes them easier to operate. However, just as with synchronous communication, there must be agreements between consumers and producers to guarantee that compatibility is not broken. In this talk, Antón will share his experience building this kind of architecture and, in particular, the problems he has faced when governing those agreements in architectures that span several datacenters and different clouds. He will walk through the path taken to integrate Kafka, Azure EventHub, or Google PubSub using technologies such as Kafka Connect and Google Dataflow. #About the speaker (Antón R. Yuste) I’m a Principal Software Engineer focused on Event Streaming and Real-Time Processing. I have experience working with different message brokers and event streaming platforms (Apache Kafka, Apache Pulsar, Google Pub/Sub and Azure EventHub) and real-time processing frameworks (Flink, Kafka Streams, Spark Structured Streaming, Google Dataflow, Azure Stream Analytics, etc.). During my career, I have specialized in building internal SaaS platforms in big corporations to make complex technologies easy for teams to use and adopt, so they can build solutions to real business use cases. From the very beginning, I can help with governance, operation, performance, adoption, training and any task related to system administration or backend development.
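The producer/consumer agreements mentioned above are usually enforced as schema compatibility rules. The toy check below captures one common rule in spirit, backward compatibility: consumers on the new schema must still be able to read old events, so any field added in a new version needs a default. The schema shape is a simplified assumption, not Avro or any registry's actual format.

```python
# Toy backward-compatibility check for event schemas (simplified;
# real registries such as Confluent's implement richer rules).

def backward_compatible(old_fields, new_fields):
    # Consumers using the new schema must be able to read data written
    # with the old one, so every newly added field needs a default.
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True

v1 = {"id": {}}
v2 = {"id": {}, "note": {"default": None}}   # added field with default: OK
v3 = {"id": {}, "amount": {}}                # added required field: breaks
assert backward_compatible(v1, v2) is True
assert backward_compatible(v1, v3) is False
```

Running a check like this in CI, or delegating it to a schema registry, is what turns the informal "agreement between producers and consumers" into something that cannot be broken by accident across teams, datacenters, or clouds.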