Funny&Amazing Pics&Facts

Sunday, June 24, 2018

Five Myths About Artificial Intelligence


Presenting at the 2018 PMSA Annual Conference provided us with the opportunity to think through some common myths about AI. Before our presentation, we spoke with colleagues and clients, and settled on the following five persistent myths:
  1. AI is only relevant for Facebook and Amazon. Stakeholders are familiar with consumer apps that enable image recognition on Facebook or natural language processing with Alexa. When it comes to enterprise business uses, we don’t necessarily have millions of images or much use for them. Instead, we have a lot of structured and unstructured data, and AI can be leveraged in this environment as well.
  2. AI is continuously learning. We frequently hear the assumption that AI is always learning, that the algorithms modify themselves continuously, and if it’s not doing this, it isn’t AI. In reality, very few algorithms function this way. Yes, there’s a component of learning in most AI, but such learning happens on an infrequent basis.
  3. AI is for every problem. A natural tendency after discovering a new tool such as AI is to look in the immediate vicinity for problems that are currently being tackled and ask if it can be solved with the new tool: “How can we leverage AI to improve what we’re already doing?” But that may not always be relevant or the most impactful. Carefully consider the problem and determine if, in fact, AI will enhance the solution.
  4. AI means data scientists. While it’s true that data scientists are essential for creating an AI capability and the algorithms that drive it, we need team members with different skill sets, too. For example, before data scientists create the algorithms for a solution, we need data engineers to gather and wrangle data, and once the algorithms have been developed, we need software engineers to develop and maintain the operating system. If the solution is being presented to the sales force, we may want learning and development professionals to play a critical role in driving adoption.
  5. Data has to be available in plain sight before you can use AI. Netflix didn’t own a massive database full of people’s movie preferences. The company created an infrastructure that gathered the data as users watched videos, and then used that data to predict preferences with AI. And on other occasions, data exists but is hard to find within the organization, and the process to discover it is essential.
Beyond stakeholder management, successfully leveraging AI involves creative problem-solving, disruptive thinking, patience and the right team. It’s not a question of if but when much will be done with AI in commercial life sciences. We’re happy to see so many of our clients embracing artificial intelligence.

D. Sahay co-wrote this article with Arun Shastri

Posted by Vlad Daniloff at 08:37
Labels: Artificial Intelligence (AI), big data, English, Facebook, learning, myths

Five Critical Steps for Successfully Leveraging AI



In a few days, we’ll be presenting our thoughts on how to create impact with artificial intelligence at the 2018 PMSA Annual Conference. We’ll talk about how AI is being used in life sciences and how AI could be used, and we’ll bust some myths. We’ll also share detailed advice on how to start or expand an AI capability at your organization, which includes these five critical steps:
  1. Educate your stakeholders. Don’t assume that they understand AI just because they’re enamored by it. Executives need to understand that AI isn’t just about cool technology. It opens new possibilities for automation that can transform your company’s business model. But AI applications in an enterprise may be very different from what people hear about in the media, such as image recognition on Facebook or voice assistant interactions through Alexa. Make sure that your executives understand what AI is (and is not) and what it takes to realize its potential.
  2. Think outside of your immediate scope. Don’t solve problems that are already solved well with traditional methods. You have to think about answering new questions, and leveraging new data sources that may present opportunities to solve new problems or tackle the same problems but in entirely different ways. Leveraging AI to squeeze marginal gains in effectiveness in solving current problems will not inspire confidence in the power of AI.
  3. There is no prescribed sequence in tackling problems. Create a balanced portfolio of adjacent, leapfrog and disruptive ideas. There is no rule that says that you need to start your AI journey with a full-scale, organization-wide diagnostic. You need your stakeholders to be excited, so ensure that high-impact projects are in your mix. Push your team’s boundaries and inspire them to pursue projects that grab attention around the organization early, so people say: “Wow, that’s clever! How can I get in on it?” That infectious enthusiasm can become the driver for a more methodical transformation.
  4. Don’t obsess over quick wins. Visible change will take time, as it requires a change in organizational mindset. For example, if you’re planning to leverage AI to drive suggestions on the “next best action” to sales representatives, finding the right algorithm and tool may be just one part of it. Driving the right experience for sales reps and developing early champions before scaling are equally important. All of this takes time and patience.
  5. Recognize that it takes more than data scientists to build an AI capability. Recruit people with the right talent and experience. If you’re going to leverage existing talent, make sure that the bar is set high for them to master new skills. To succeed, you need a team of more than a few advanced data science professionals. You’ll need data engineers, AI/ML software engineers, liaisons to the business, data scientists and more.
We’ll be sharing some more content from our presentation with you in an upcoming post. We hope to see you at PMSA and look forward to sharing more about creating impact with AI.
Dharmendra Sahay co-wrote this blog post with Arun Shastri
Posted by Vlad Daniloff at 08:31
Labels: advises, algorithm, Artificial Intelligence (AI), big data, English

Tuesday, June 19, 2018

The intensely curious British artist David Hockney has unraveled the secret behind the incredible realism of Renaissance paintings

Art students and anyone interested in art history know that at the turn of the fourteenth and fifteenth centuries painting underwent a sharp break: the Renaissance. Around the 1420s, everyone suddenly became far better at drawing. Why did images suddenly become so realistic and detailed, and why did light and volume appear in paintings? For a long time nobody gave it much thought, until David Hockney picked up a magnifying glass.


One day he was examining drawings by Jean Auguste Dominique Ingres, the leader of the French academic school of the nineteenth century. Hockney wanted to see these small drawings at a larger scale, so he enlarged them on a photocopier. That is how he stumbled upon a hidden side of the history of painting from the Renaissance onward. Having made photocopies of Ingres's small drawings (about 30 centimeters across), Hockney was struck by how realistic they were. He also had the feeling that Ingres's lines reminded him of something. It turned out they reminded him of Warhol's work. And Warhol worked like this: he projected a photograph onto the canvas and traced it.


Left: detail of a drawing by Ingres. Right: Warhol's drawing of Mao Zedong

Curious, thought Hockney. Judging by everything, Ingres used a camera lucida, a device built around a prism that is mounted, for example, on a stand above a drawing board. Looking with one eye, the artist sees the real scene; with the other, the drawing itself and his own hand. The result is an optical illusion that lets real proportions be transferred accurately onto paper, and that is precisely the "key" to a realistic image.


Drawing a portrait with a camera lucida, 1807

Hockney then became seriously interested in this "optical" kind of drawing and painting. In his studio, he and his team hung hundreds of reproductions of paintings made over the centuries on the walls: works that looked "real" and works that did not. Arranging them by date and by region (north at the top, south at the bottom), Hockney and his team saw a sharp break in painting at the turn of the fourteenth and fifteenth centuries. Anyone who knows even a little art history knows what it is: the Renaissance.


Could they have been using that same camera lucida? It was patented in 1807 by William Hyde Wollaston, although such a device had in fact been described by Johannes Kepler as early as 1611 in his treatise Dioptrice. Then perhaps they used another optical device, the camera obscura? It had been known since the time of Aristotle: a dark room into which light enters through a small hole, producing inside a projection, upside down, of whatever is in front of the hole. All well and good, but the image a lensless camera obscura projects is, to put it mildly, poor: it is blurry, it needs a great deal of bright light, and the projection is small. Yet good lenses were practically impossible to make until the sixteenth century, because there was no way at the time to produce glass of sufficient quality. Quite a puzzle, thought Hockney, who by then was wrestling with the problem together with the physicist Charles Falco. But there is a painting by Jan van Eyck, the master from Bruges and Flemish painter of the early Renaissance, that hides a clue. The painting is called "The Arnolfini Portrait."


Jan van Eyck, "The Arnolfini Portrait," 1434

The painting simply glitters with an enormous amount of detail, which is remarkable given that it was painted as early as 1434. The clue to how the artist managed such a great leap forward in realism is the mirror. And also the chandelier, incredibly complex and realistic.


Detail of Jan van Eyck's "The Arnolfini Portrait," 1434

Hockney was bursting with curiosity. He got hold of a copy of such a chandelier and tried to draw it. He found that something so complex is hard to draw in perspective. Another important point was rendering the material of this metal object. When depicting a steel object, it is very important to place the highlights as realistically as possible, since that is what gives the image so much of its realism. But the trouble with highlights is that they move whenever the viewer's or the artist's gaze moves, so capturing them is not easy at all. And the realistic depiction of metal and its highlights is another distinctive feature of Renaissance painting; before then, artists did not even attempt it.

After building an exact three-dimensional model of the chandelier, Hockney's team confirmed that the chandelier in "The Arnolfini Portrait" is drawn precisely in perspective, with a single vanishing point. The problem, however, was that optical instruments precise enough for this, such as a camera obscura with a lens, did not appear until roughly a century after the painting was made.


Detail of Jan van Eyck's "The Arnolfini Portrait," 1434

The enlarged detail shows that the mirror in "The Arnolfini Portrait" is convex. That means mirrors of the opposite kind, concave ones, existed as well. More than that, in those days such mirrors were made like this: a glass sphere was taken, its bottom was coated with silver, and then everything except the bottom was cut away. The back of the mirror was not blackened. So Jan van Eyck's concave mirror could have been the very mirror shown in the painting, simply seen from the other side. And any physicist knows that such a mirror projects an image of whatever it reflects. This is where the physicist Charles Falco helped David Hockney with the calculations and research.


A concave mirror projects an image of the tower outside the window onto the canvas.

The sharp, focused part of the projection is about 30 square centimeters, which happens to be exactly the size of the heads in a great many Renaissance portraits.


Hockney traces a projection of a person onto the canvas

That is the size, for example, of Giovanni Bellini's portrait of Doge Leonardo Loredan (1501), of Robert Campin's portrait of a man (1430), of Jan van Eyck's own "Portrait of a Man in a Red Turban," and of many other early Netherlandish portraits.

Renaissance portraits




Painting was highly paid work, and naturally all the secrets of the trade were kept in the strictest confidence. It was in an artist's interest for outsiders to believe that the secrets lay in the master's hands and could not be stolen. The business was closed to outsiders: artists belonged to a guild that also included the most varied craftsmen, from those who made saddles to those who made mirrors. And the Guild of Saint Luke, founded in Antwerp and first mentioned in 1382 (similar guilds later opened in many northern cities, one of the largest being the guild in Bruges, the city where van Eyck lived), also had members who made mirrors.
This is how Hockney reconstructed the way the complex chandelier in van Eyck's painting could have been drawn. It is hardly surprising that the chandelier Hockney projected matches exactly the size of the chandelier in "The Arnolfini Portrait." And of course there are the highlights on the metal: in a projection they stay put and do not change when the artist changes position.


But the problem was still not fully solved, because the high-quality optics needed for a camera obscura were still a hundred years away, and the projection produced by a mirror is very small. How, then, were paintings larger than 30 square centimeters made? They were assembled like a collage, from many points of view, producing a kind of spherical vision with multiple vanishing points. Hockney recognized this because he had made such pictures himself: his many photo collages achieve exactly the same effect. Almost a century later, in the 1500s, it finally became possible to produce and work glass well, and large lenses appeared. They could at last be fitted into a camera obscura, whose principle had been known since ancient times. The camera obscura with a lens was an incredible revolution in visual art, because now the projection could be of any size. And one more thing: the image was no longer "wide-angle" but of roughly normal aspect, roughly what you get today when photographing with a lens of 35-50 mm focal length.

There is, however, a problem with using a camera obscura with a lens: the direct projection through a lens is mirror-reversed. This led to a large number of left-handers in painting in the early days of optics, as in this painting from the 1600s in the Frans Hals Museum, where a left-handed couple dances, a left-handed old man wags his finger at them, and a left-handed monkey peeks under the woman's dress.


In this painting, everyone is left-handed

The problem is solved by placing a mirror for the lens to point at, which yields a correctly oriented projection. But a good, flat, large mirror evidently cost a great deal of money, so not everyone had one.

Focusing was another problem. With the canvas in a given position under the projected rays, some parts of the picture were out of focus, not sharp. In the works of Jan Vermeer, where the use of optics is quite obvious (his paintings practically look like photographs), you can also spot places that are out of "focus."
You can even see the pattern the lens produces, the notorious "bokeh." In "The Milkmaid" (1658), for instance, the basket, the bread in it, and the blue vase are out of focus. Yet the human eye cannot see anything "out of focus."


Some details of the painting are out of focus

In light of all this, it is hardly surprising that a good friend of Jan Vermeer was Antonie Philips van Leeuwenhoek, scientist and microbiologist, and also a unique craftsman who made his own microscopes and lenses. The scientist became the executor of the artist's estate after his death. This suggests that Vermeer depicted his friend in two canvases, "The Geographer" and "The Astronomer."

To bring any particular part into focus, you have to change the position of the canvas under the projected rays. But that introduces errors in proportion, as can be seen here: the huge shoulder of Parmigianino's "Antea" (c. 1537), the small head of Anthony van Dyck's "Genoese Lady" (1626), and the enormous feet of the peasant in a painting by Georges de La Tour.


Errors in proportion

Naturally, every artist used lenses in his own way. Some used them for sketches; some assembled a picture from different parts, since it was now possible to paint the portrait from one sitter and finish the rest with another model, or even a mannequin. Almost no drawings by Velázquez survive either. But his masterpiece remains: the portrait of Pope Innocent X (1650). On the pope's mantle, obviously silk, there is a magnificent play of light, of highlights. Painting all of that from a single point of view would have taken enormous effort. But if you make a projection, all that beauty stays put: the highlights no longer move, and you can paint with exactly those broad, quick strokes that Velázquez used.


Hockney reproduces Velázquez's painting

Later on, many artists could afford a camera obscura, and it stopped being a great secret. Canaletto actively used a camera to create his views of Venice and made no secret of it. Thanks to their precision, these paintings allow us to speak of Canaletto as a documentarian: through Canaletto we see not just a pretty picture but history itself. We can see what the first Westminster Bridge in London looked like in 1746.


Canaletto, "Westminster Bridge," 1746

The British artist Sir Joshua Reynolds owned a camera obscura and evidently told no one about it, since his camera folds up and looks like a book. Today it is held in the Science Museum in London.


A camera obscura disguised as a book

Finally, at the beginning of the nineteenth century, William Henry Fox Talbot, having used a camera lucida (the one you look into with one eye while drawing with your hand), swore in frustration, decided that this inconvenience had to end once and for all, and became one of the inventors of chemical photography and, later, the popularizer who made it a mass medium.
With the invention of photography, painting lost its monopoly on the realistic image; photography became the monopolist. And at that point painting finally freed itself from the lens, resuming the path it had turned off in the 1400s, and Van Gogh became the forerunner of all twentieth-century art.


Left: Byzantine mosaic, twelfth century. Right: Vincent van Gogh, "Portrait of Trabuc," 1889

The invention of photography is the best thing that happened to painting in its entire history. It no longer had to produce exclusively realistic images; the artist was free. Of course, it took the public a whole century to catch up with the artists' understanding of visual music and to stop regarding people like Van Gogh as "madmen." At the same time, artists began to use photographs actively as "reference material." Then came people like Wassily Kandinsky, the Russian avant-garde, Mark Rothko, Jackson Pollock. Following painting, architecture, sculpture, and music freed themselves as well. Admittedly, the Russian academic school of painting got stuck in time: to this day, in its academies and colleges, using photography as an aid is considered shameful, while the highest achievement is held to be the purely technical ability to draw as realistically as possible with the bare hands.
Thanks to an article by the journalist Lawrence Weschler, who was present during David Hockney and Falco's research, another interesting fact emerges: van Eyck's portrait of the Arnolfini couple is a portrait of an Italian merchant in Bruges. Mr. Arnolfini was a Florentine and, moreover, a representative of the Medici bank (the Medici were practically the rulers of Florence during the Renaissance and are regarded as the patrons of Italian art of that era). What does that suggest? That he could easily have taken the secret of the Guild of Saint Luke, the mirror, with him to Florence, where, according to the traditional account, the Renaissance began, while the artists of Bruges (and the other masters with them) are regarded as "primitives."
There is a great deal of controversy around the Hockney-Falco thesis, but there is undoubtedly a grain of truth in it. As for the art historians, critics, and scholars, it is hard even to imagine how many learned works on history and art would turn out to be complete nonsense if it is right; it would change the whole history of art, all their theories and texts.
The use of optics in no way diminishes the artists' talents, for technique is a means of conveying what the artist wants. And conversely, the fact that these paintings contain genuine reality only adds to their weight: this is exactly how the people of that time, their things, rooms, and cities actually looked. They are true documents.
The Hockney-Falco thesis is laid out in detail by its author, David Hockney, in the BBC documentary "David Hockney's Secret Knowledge," which can be watched on YouTube (part 1 and part 2, in English).

Source: AdMe.ru


Posted by Vlad Daniloff at 10:06
Labels: painting, research, history, secrets, technology, photo, artist, Russian

Saturday, June 16, 2018

Developing a file system structure to solve healthcare big data storage and archiving problems using a distributed file system



Full article title: Developing a file system structure to solve healthcare big data storage and archiving problems using a distributed file system
Journal: Applied Sciences
Author(s): Ergüzen, Atilla; Ünver, Mahmut
Author affiliation(s): Kırıkkale University
Primary contact: Email: munver at kku dot edu dot tr
Year published: 2018
Volume and issue: 8(6)
Page(s): 913
DOI: 10.3390/app8060913
ISSN: 2076-3417
Distribution license: Creative Commons Attribution 4.0 International
Website: http://www.mdpi.com/2076-3417/8/6/913/htm
Download: http://www.mdpi.com/2076-3417/8/6/913/pdf (PDF)

Abstract

Recently, the use of the internet has become widespread, increasing the use of mobile phones, tablets, computers, internet of things (IoT) devices, and other digital sources. In the healthcare sector, with the help of next generation digital medical equipment, this digital world has also tended to grow in an unpredictable way, such that nearly 10 percent of global data is healthcare-related and continues to grow faster than in other sectors. This progress has greatly enlarged the amount of produced data, which cannot be handled with conventional methods. In this work, an efficient model for the storage of medical images using a distributed file system structure has been developed. With this work, a robust, available, scalable, and serverless solution structure has been produced, especially for storing large amounts of data in the medical field. Furthermore, the system achieves a high level of security through the use of static Internet Protocol (IP) addresses, user credentials, and synchronously encrypted file contents. Among the most important features of the system are its high performance and easy scalability. In this way, the system can work with fewer hardware elements and be more robust than others that use a name node architecture. According to the test results, the designed system performs 97% better than a Not Only Structured Query Language (NoSQL) system, 80% better than a relational database management system (RDBMS), and 74% better than an operating system (OS).
Keywords: big data, distributed file system, health data, medical imaging

Introduction

In recent years, advances in information technology have spread worldwide, and internet usage has exponentially accelerated the amount of data generated in all fields. The number of internet users was 16 million in 1995. This number reached 304 million in 2000, 888 million in 2005, 1.996 billion in 2010, 3.270 billion in 2015, and 3.885 billion in 2017.[1][2][3] Every day, 2.5 exabytes (EB) of data are produced worldwide. Also, 90% of globally generated data has been produced since 2015. The data generated span many different fields such as aviation, meteorology, IoT applications, health, and the energy sector. Likewise, the data produced through social media have reached enormous volumes. Not only did Facebook.com store 600 terabytes (TB) of data a day in 2014, but Google also processed hundreds of petabytes (PB) of data per day in the same year.[4][5] Data production has also increased at a remarkable rate in the healthcare sector; the widespread use of digital medical imaging peripherals has triggered this data production. The data generated in the healthcare sector have reached such a point that they cannot be managed easily with traditional data management tools and hardware. Healthcare has accumulated a big data volume by keeping patients’ records, creating medical imaging that helps doctors with diagnoses, outputting digital files from various devices, and creating and storing the results of different surveys. Different types of data sources produce data in various structured and unstructured formats; examples include patient information, laboratory results, X-ray devices, computed tomography (CT) devices, and magnetic resonance imaging (MRI). The world population and average human lifespan are increasing continuously, which means an exponential increase in the number of patients to be served. As the number of patients increases, the amount of collected data also increases dramatically. Additionally, widely used digital healthcare devices produce higher-density graphical outputs that add to the growing body of data. In 2011, the amount of data in the healthcare sector in the U.S. reached 150 EB; in 2013, it had reached 153 EB; and by 2020, it is estimated that this number will reach 2.3 ZB. For example, electronic medical record (EMR) use increased 31% from 2001 to 2005 and more than 50% from 2005 to 2008.[6][7] While neuroimaging operation data sizes had reached approximately 200 GB per year between 1985 and 1989, they rose to 5 PB annually between 2010 and 2014, yet another indicator of the increase in data in the healthcare sector.[8]
In this way, new problems have emerged due to the increasing volume of data generated in all fields at the global level. Now there are substantial challenges to store and to analyze the data. The storage of data has become costlier than gathering it.[9] Thus, the amount of data that is produced, stored, and manipulated has increased dramatically, and because of this increase, big data and data science/knowledge have begun to develop.[10] Big data is a reference to the variety, velocity, and volume of data; concerning healthcare records, finding an acceptable approach to cover these issues is particularly difficult to accomplish.
Big data problems in healthcare and the objectives of the study according to the previous arguments are listed as follows:
1. Increasing number of patients: The global population and average human lifespan are apparently increasing. For example, in Turkey, the number of visits to a physician has increased by about 4% per year since 2012.[11] Moreover, the total number of per capita visits to a physician in healthcare facilities in 2016 was 8.6, while this value was 8.2 in 2012. As the number of patients increases, the amount of collected data also increases dramatically, which creates much more data to be managed.
2. Technological devices: Extensively used digital healthcare devices create high-resolution graphical outputs, which means huge amounts of data to be stored.
3. Expert personnel needs: To manage big data in institutions using software platforms such as Hadoop, Spark, Kubernetes, Elasticsearch, etc., qualified information technology specialists must be brought in to deploy, manage, and store big data solutions.[12]
4. Small file size problem: Current solutions for healthcare, including Hadoop-based solutions, have a block size of 64 MB (detailed in the next section). This leads to vulnerabilities in performance and unnecessary storage usage, called "internal fragmentation," that is difficult to resolve.
5. Hospital information systems (HIS): These systems represent comprehensive software and related tools that help healthcare providers produce, store, fetch, and exchange patient information more efficiently and enable better patient tracking and care. The HIS must have essential non-functional properties like (a) robustness, (b) performance, (c) scalability, and (d) availability. These properties basically depend on the constructed data management architecture, which includes the configured hardware devices and installed software tools. The HIS is left to solve big data problems on its own, even though doing so is much more than an IT project or a traditional application. As such, third-party software tools are needed to achieve the objectives of the healthcare providers.
This study seeks to obtain a mid-layer software platform which will help to address these healthcare gaps. In other words, we have implemented a server-cluster platform to store and to return health digital image data. It acts as a bridge between the HIS and various hardware resources located on the network. There are five primary aims of this study:
1. to overcome growing data problems by implementing a distributed data layer between the HIS and server-cluster platform;
2. to reduce operational costs, with no need to employ IT specialists to install and to deploy popular big data solutions;
3. to implement a new distributed file system architecture to achieve non-functional properties like performance, security, and scalability, which are of crucial importance for a HIS;
4. to show and prove that there can be different successful big data solutions; and, especially,
5. to solve these gaps efficiently for our university HIS.
In this study, the first section describes general data processing methods. The second part discusses the work and related literature on the subject, while the third part is the materials and methods section that describes the implemented approach. The last section is the conclusion of the evaluation that emerges as the result of our work.

Big data architecture in medicine

Big data solutions in healthcare worldwide primarily consist of three different approaches.
The first is a database system, of which there are two popular application architectures: relational database management systems (RDBMS) and NoSQL database systems. RDBMSs, the most widely known and used systems for this purpose, store data in a structured format. The data to be processed must be of the appropriate type and format. In these systems, a single database can serve multiple users and applications. Since these systems are built on vertical growth functionality, the data structure must be defined in advance. As a result, they impose many constraints, such as atomicity, consistency, isolation, and durability. The strict rules that once made these systems indispensable are beginning to be questioned today. Moreover, due to the hardware and software required, initial installation costs are high. Especially when the volume of data increases, horizontal scalability becomes quite unsatisfactory and difficult to manage, which is a major reason these systems are not part of an overall big data solution. Also, these systems are more complex than file systems, which, most importantly, makes them unsuitable for big data. Because of RDBMSs' deficiencies in managing big data, NoSQL database systems have emerged as an alternative. The main purpose of these systems is to store the increasing volume of unstructured data associated with the internet and to respond to the needs of high-traffic systems via unstructured or semi-structured formats. NoSQL databases provide high accessibility compared with RDBMSs, and their data are easily scaled horizontally.[13] Reading and writing performance may be more acceptable than with an RDBMS. One of their most important features is that they are horizontally expandable: thousands of servers can work together as a cluster and operate on big data. They are easy to program and manage due to their flexible structures. Another feature of these systems is that they perform grid computing in clusters consisting of many machines connected to a network; in this way, data processing speeds have increased. However, NoSQL does not yet have data security features as advanced as those of RDBMSs. Some NoSQL projects are also lacking in documentation and professional technical support. Finally, the concept of "transactions" is not available in NoSQL database systems, meaning loss of data may occur, so they are not suitable for use in banking and financial systems.[14]
Basic file management functions of operating systems (OS) are used for the second solution, which is called a "file server" in the literature. In this system, medical image data are stored in files and folders in the underlying operating system's file structure. Operating systems use a hierarchical file system in which files are kept in a tree structure of directories. File servers store the files in the way determined by the HIS, according to image type, file creation date, polyclinic name, and patient information. The HIS executes read and write operations by invoking system calls, which act as a low-level interface to storage devices with the help of the operating system. The advantages of using file servers are that they are simple to deliver, easy to implement, and offer acceptable file operation performance. Writing, deleting, reading, and searching files on the operating system is a fairly fast process because the operating system is specialized for file management. In particular, operating systems offer more flexibility and better performance than RDBMS and NoSQL systems. However, the OS cannot turn these advantages into a satisfactory solution model for big data because it lacks horizontal scalability. The main task of OS file management is to serve system calls to other applications, so the OS is a part of the solution rather than the solution alone. Besides the storage of data not being as secure as with other methods, storage cannot be scaled according to data size, and backup and file transfer cannot be done safely. It seems that the operating system alone is not suitable for solving big data problems.
The third method involves distributed file systems (DFS). These systems represent the most up-to-date way to support machines in various locations as a single framework and provide the most appropriate solution to the big data problem. Hadoop DFS and MapReduce are primarily used for big data storage and analytics. Hybrid solutions that include Hadoop and NoSQL are also used and criticized in the literature. However, there are some drawbacks to using the Hadoop ecosystem in the healthcare setting. The first one is the small files problem; Hadoop cannot store and manage these types of files efficiently.[15] Hadoop is primarily designed to manage large files greater than or equal to 64 MB, and this size also acts as the default block size of Hadoop clusters.[15] For example, a one gigabyte file, consisting of 16 Hadoop blocks of 64 MB, occupies 2.4 KB of space in a name node. However, 10,000 files of 100 KB occupy one gigabyte of space in data nodes and 1.5 MB in a name node. This means more MapReduce operations are required when processing small files. The healthcare sector's average medical image file size is 10 MB, and when this situation is taken into consideration, a new DFS is needed to embrace systems that have large numbers of small files.
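To put these figures in context, a rough back-of-the-envelope calculation is sketched below; the estimate of roughly 150 bytes of name-node memory per metadata object is a widely quoted Hadoop rule of thumb and is an assumption here, not a number taken from the article:

```latex
\begin{align*}
\text{1 GB file as } 16 \times 64\ \text{MB blocks:}\quad & 16 \times 150\ \text{B} \approx 2.4\ \text{KB of name-node memory}\\
\text{10{,}000 files of 100 KB} \;(\approx 1\ \text{GB total}):\quad & 10{,}000 \times 150\ \text{B} \approx 1.5\ \text{MB of name-node memory}
\end{align*}
```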
This study proposes a new methodology for this issue via a small block size and a "no name node" structure. Wang et al. have identified five different strategies for how big data analytics can be effectively used to create business value for healthcare providers. This work was carried out for a total of 26 health institutions, from Europe and the U.S., that use Hadoop and MapReduce.[16] However, that study also states that with a growing amount of unstructured data, more comprehensive analytic techniques like deep learning algorithms are required. A significant analysis and discussion of the Hadoop, MapReduce, and STORM frameworks for big data in health care was presented by Liu and Park.[17] It stated that Hadoop and MapReduce cannot be used in real-time systems due to a performance gap. Therefore, they proposed a novel health service model called BDeHS (Big Data e-Health Service), which has three key benefits. Additionally, Spark and STORM can be used more effectively than MapReduce for real-time data analytics of large databases.[18] One study provided detailed information about the architectural design of a personal health record system called “MedCloud” constructed on Hadoop’s ecosystem.[19] In another study on big data in healthcare, Erguzen and Erdal developed a new file structure and archiving system to store regions of interest (ROIs) from MRIs. In other words, they extracted the ROI portions, which contained vital information about the patient, from the image, discarded the rest (the non-ROIs), and stored the ROIs in the newly designed file structure with a success ratio of approximately 30%. However, this work was done only to decrease image sizes, not to store big data effectively on a DFS.[7] Another study, conducted by Raghupathi and Raghupathi, showed that the Hadoop ecosystem has significant drawbacks for medium- or large-size healthcare providers: (a) it requires a great deal of programming skill for data analytics tasks using MapReduce; and (b) it is typically difficult to install, configure, and manage the Hadoop ecosystem completely. As such, it does not seem to be a feasible solution for medium- or large-scale healthcare providers.[12]
Today, Hadoop is one of the enterprise-scale open-source solutions that makes it possible to store big data across thousands of data nodes, as well as to analyze the data with MapReduce. However, there are three disadvantages to Hadoop. First, Hadoop’s default block size is 64 MB, which presents an obstacle in managing numerous small files.[15] When a file smaller than 64 MB is embedded in a 64 MB Hadoop block, it causes a gap, which is called internal fragmentation. On our system, the block size is 10 MB, chosen according to the average MRI file size, meaning less internal fragmentation. Second, performance issues arise when the system needs to run in a real-time environment. Third, Hadoop requires professional support to construct, operate, and maintain the system properly. These drawbacks are the key reasons we developed this solution. As such, an original distributed file system has been developed for storing and managing big data in healthcare. The developed system has been shown to be quite successful for applications that run in a write once read many (WORM) fashion, a model that has many uses, such as in electronic record management systems and the healthcare sector.

Related studies

Big data refers to data that cannot be stored and administered on only one computer. Today, to administer it, computers connected to a distributed file system and working together over a network are used. DFSs are separated into clusters consisting of nodes. Performance, data security, scalability, availability, easy accessibility, robustness, and reliability are the most important features of big data. Big data management problems can be solved by using a DFS and network infrastructure. DFS-related work began in the 1970s[20], with one of the first advances being the Roe File System, developed for replica consistency, easy setup, secure file authorization, and network transparency.[21]
LOCUS, developed in 1981, is another DFS that features network transparency, high performance, and high reliability.[22] The network file system (NFS), which Sun Microsystems started to develop in 1984, is the most widely used DFS on UNIX. Remote procedure call (RPC) is used for communication.[23] It is designed to enable the Unix file system to function as a "distributed" system, with the virtual file system acting as a layer. Therefore, clients can run different file systems easily, and fault tolerance in NFS is high. File status information is kept, and when an error occurs, the client reports this error status to the server immediately. File replication is not done in NFS; instead, the whole system is replicated.[24] Only the file system is shared in NFS; no printer or modem can be shared. The shared object can be a directory as well as a file. It is not necessary to install every application on a local disk with NFS, as applications can be shared from the server. Moreover, the same computer can be both a server and a client. As a result, NFS reduces data storage costs.
The Andrew File System (AFS, 1983) and its successors CODA (1992) and OpenAFS[25] are open-source distributed file systems. These systems are scalable and support larger cluster sizes. Also, they can reduce server load and cache whole files. CODA replicates data on multiple servers to increase accessibility. Whereas AFS only supports Unix, OpenAFS and CODA support MacOS and Microsoft Windows. In these systems, the same namespace is created for all clients. However, replication is limited, and a read-one/write-all (ROWA) schema is used for it.[26][27]
Frangipani was developed in 1997 as a new distributed file system with two layers. The bottom layer consists of virtual disks, providing storage services, and can be scaled and managed automatically. On the top layer, there are several machines that use the Frangipani file system. These machines run distributed on the shared virtual disk. The Frangipani file system provides consistent and shared access to the same set of files. As the data used in the system grows, more storage space and higher performance hardware elements are needed. If one of the system components does not work, it continues to serve due to its availability. As the system grows, the added components do not make management complicated, and thus there is less need for human management.[28]
FARSITE (2002) is a serverless file system that runs distributed across a network, even one of physically unreliable computers. The system is a serverless, distributed file system that does not require centralized management, so there are no staff costs as with a server-based system. FARSITE is designed to support the file I/O workload of a desktop computer in a university or a large company. It provides reasonable performance using client caching, availability and accessibility using replication, authentication using encryption, and scalability using namespace delegation. One of the most important design goals of FARSITE is to take advantage of Byzantine fault tolerance.[29]
Described in 2006, the CEPH file system sits as a top layer over similar systems that do object storage. This layer separates data and metadata management. This is accomplished by the random data distribution function (CRUSH), which is designed for unreliable object storage devices (OSDs) and replaces the file allocation table. With CEPH, distributed data replication, error detection, and recovery operations are transferred to the object storage devices running on the local file system; thus, system performance is enhanced. A distributed set of metadata makes its management extremely efficient. The Reliable Autonomic Distributed Object Store (RADOS) layer manages all filing processes. Measurements were taken under various workloads to test the performance of CEPH, which can also work with different disk sizes. The results show that I/O performance is extremely high and that metadata management is scalable. According to the measurements, it supports 250,000 metadata transactions per second, making CEPH a high-performance, reliable, and scalable distributed file system.[30]
In 2007, Hadoop was developed, consisting of the Hadoop distributed file system (HDFS) and the MapReduce parallel computing tool. Hadoop is a framework that provides analysis and transformation of very large datasets. HDFS distributes big data by dividing it into clusters on standard servers. To ensure data security, it backs the blocks up on the servers by copying them.[31] Hadoop/MapReduce is used to process and manage big data. The "map" function distributes the data across the cluster and makes it available for processing; the "reduce" function ensures the data are combined. Hadoop is scalable and can easily handle petabytes of data.[32] Today, Hadoop is used by many major companies and is preferred in industrial and academic fields; companies like LinkedIn, eBay, AOL, Yahoo, Facebook, and IBM commonly use Hadoop.[33]
Announced in 2015, CalvinFS is a scalable file system with replication that uses a highly efficient database designed for metadata management. For this, it divides the metadata horizontally across multiple nodes. File operations that need to edit metadata items work in a distributed fashion. The system also supports standard file systems. This file system approach has shown that scaling can reach billions of files. While reducing read delays, it can carry out hundreds of thousands of updates and millions of reads per second at the same time.[34]
In 2016, Al-Kahtani and Karim presented a scalable distributed system framework.[35] The system performs scaling on the central server: the proposed framework has the server transfer data processing work to other computers as the amount of collected data increases. In other words, the system works in a distributed fashion when data flow increases. Other approaches like this include the IdeaGraph algorithm[36], probabilistic latent semantic analysis (PLSA)[37], a locality-aware scheduling algorithm[38], the nearest neighbor algorithm[39], an item-based collaborative filtering algorithm[40], a recommendation algorithm[41], convex optimization[42], and parallel two-pass MDL (PTP-MDL).[41]
Jin et al. designed an efficient storage table for . The system also analyzes data and produces detailed statistical values by using MapReduce. It is designed with HBase, a distributed column-based database that runs on the Hadoop distributed file system and the MapReduce framework. The model is low-cost and has two name nodes. In addition, HMaster and HRegionServer allow load balancing for better performance. However, it has been noted that the system should still develop data block replication strategies for HDFS.[42]
Today, distributed file systems can be grouped into two main categories:
  • Big data storage: Focuses on implementing the necessary file system and cluster schema to save big data (data save)
  • Big data analytics: Focuses on the short and consistent analysis of the data collected from the nodes by grid computing tools (data mining)

Materials and method

This section discusses the state-of-the-art technologies and the related algorithms used in this study.

TCP/IP protocol

Protocols specify strict rules for exchanging messages between communicating entities across a local area network or wide area network. Each computer or mobile device on the internet has a unique internet protocol (IP) address that cannot overlap with other devices on the internet. The protocol uses IP addresses to connect endpoints (an endpoint being a combination of an IP address and a port number) by using packets or datagrams, which contain the source and destination device IP addresses and related information. IP works together with transport layer protocols such as the User Datagram Protocol (UDP), Lightweight User Datagram Protocol (UDP-Lite), Transmission Control Protocol (TCP), Datagram Congestion Control Protocol (DCCP), and Stream Control Transmission Protocol (SCTP). TCP and UDP are the most used in internet connections. TCP has many advantages over UDP, including (a) reliability, (b) ordered transmission of packets, and (c) streaming. Therefore, TCP/IP is used in this project.

Sockets

A TCP socket is an endpoint of a process, defined by an IP address and a port number for client/server interaction. Therefore, a TCP socket is not a connection but rather just one endpoint of a two-way communication link between two processes. There are two types of socket—client and server—which both can send and receive. The distinction between them is in how the connection is created. Client sockets initialize the connection, while the server socket continuously listens to the specified port for client requests. In this project, a .NET framework socket library was used.
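As a rough illustration of the client/server socket pattern described above, the following minimal C# sketch shows a server socket listening on a port and a client socket connecting to it with the .NET `System.Net.Sockets` classes; the class name, port, and message are illustrative and are not taken from the KUSDFS code:

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

class SocketSketch
{
    // Server endpoint: listens on a port and prints what it receives.
    static void RunServer(int port)
    {
        var listener = new TcpListener(IPAddress.Any, port);
        listener.Start();
        using (TcpClient client = listener.AcceptTcpClient())   // blocks until a client connects
        using (NetworkStream stream = client.GetStream())
        {
            var buffer = new byte[4096];
            int read = stream.Read(buffer, 0, buffer.Length);
            Console.WriteLine("Server received: " + Encoding.UTF8.GetString(buffer, 0, read));
        }
        listener.Stop();
    }

    // Client endpoint: connects to the server and sends one message.
    static void RunClient(string serverIp, int port)
    {
        using (var client = new TcpClient(serverIp, port))
        using (NetworkStream stream = client.GetStream())
        {
            byte[] payload = Encoding.UTF8.GetBytes("hello data node");
            stream.Write(payload, 0, payload.Length);
        }
    }
}
```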

Windows services

Windows services are a special type of process in the Microsoft Windows operating system. The differences between services and ordinary applications are that services (a) run in the background, (b) usually do not have a user interface or interact with the user, (c) are long-running processes (typically until the computer shuts down), and (d) start automatically when the computer is restarted. In this project, Windows service routines were implemented on both the client and server side (DFS). These services also have client and server socket structures to transmit data mutually.
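A minimal sketch of such a service skeleton, assuming the standard `System.ServiceProcess.ServiceBase` base class, is shown below; the class and service names are placeholders, not the actual CSR/DNSR implementation:

```csharp
using System.ServiceProcess;

// Hypothetical skeleton of a long-running background service such as a CSR or DNSR.
public class FileTransferService : ServiceBase
{
    public FileTransferService()
    {
        ServiceName = "KUSDFSFileTransfer";   // placeholder service name
    }

    protected override void OnStart(string[] args)
    {
        // Typically: open the listening socket and start a worker thread here.
    }

    protected override void OnStop()
    {
        // Typically: close sockets and flush any pending writes here.
    }

    public static void Main()
    {
        // Hands control of the process to the Windows Service Control Manager.
        ServiceBase.Run(new FileTransferService());
    }
}
```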

Encryption

Cryptography is a science that uses advanced mathematical techniques to transmit data securely so that information can be kept secret. The data is modified using the encryption key before transmission starts, and the receiver converts this modified data back to the original data with the appropriate decryption key. Two types of encryption methods, symmetric and asymmetric, have been used successfully for years. Asymmetric encryption, also known as public key cryptography, is a relatively new method which uses two keys (public and private) to encrypt plain text. The public key is used for encrypting, while the private key is used for decrypting. These two keys are created with special mathematical techniques. The public key can be known by everyone, but the secret key must stay on the server side and no one else should know it. Although the keys are different, they are mathematically related to each other. The private key is kept secret, and the public key can easily be shared with the target devices to which data will be transferred, because knowing this key does not help in decrypting the data. RSA (Rivest–Shamir–Adleman), DSA (Digital Signature Algorithm), Diffie–Hellman, and PGP (Pretty Good Privacy) are widely used asymmetric key algorithms.
Symmetric cryptography—also called "single key cryptography"—is the fastest, simplest to implement, and best-known technique that involves only one secret key to cipher and decipher the related data. Symmetric key algorithms are primarily used for bulk encryption of blocks or data streams. There are many symmetric key algorithms, including Data Encryption Standard (DES), Advanced Encryption Standard (AES), Stream Cipher Algorithm, and Blowfish. Blowfish is one of the fastest block ciphers in use and has a key length of 32 bits to 448 bits. In this study, the 128-bit key length Blowfish algorithm was preferred.
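The sketch below only illustrates the symmetric pattern (one shared key both encrypts and decrypts). The article uses a 128-bit Blowfish key; since Blowfish is not part of the .NET base class library, AES with a 128-bit key stands in here purely as an illustrative assumption, and the key/IV handling is simplified:

```csharp
using System.IO;
using System.Security.Cryptography;

// Symmetric encryption sketch: the same shared key encrypts and decrypts.
// AES-128 is used as a stand-in for the Blowfish-128 cipher named in the article.
static class SymmetricSketch
{
    public static byte[] Encrypt(byte[] plain, byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;   // 16-byte shared secret known to client and data node
            aes.IV = iv;     // initialization vector, transmitted alongside the ciphertext
            using (var ms = new MemoryStream())
            using (var cs = new CryptoStream(ms, aes.CreateEncryptor(), CryptoStreamMode.Write))
            {
                cs.Write(plain, 0, plain.Length);
                cs.FlushFinalBlock();
                return ms.ToArray();
            }
        }
    }

    public static byte[] Decrypt(byte[] cipher, byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.IV = iv;
            using (var ms = new MemoryStream())
            using (var cs = new CryptoStream(ms, aes.CreateDecryptor(), CryptoStreamMode.Write))
            {
                cs.Write(cipher, 0, cipher.Length);
                cs.FlushFinalBlock();
                return ms.ToArray();
            }
        }
    }
}
```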

File handling routines

One of the main concerns of an operating system is to have a robust, fast, and efficient file system. Operating systems divide physical storage into equal 4 KB blocks called clusters and try to manage them as efficiently as possible. There are basically two different disk space management concepts in use. The first is the Unix-based bitmap method. In this method, a block storage device is divided into clusters, and each cluster has a corresponding bit, where a one represents a free disk block and a zero represents an allocated block. There are as many bits as there are blocks. This map structure is used to find available, empty disk blocks for files to be added to the file system. Another successful method for disk space allocation can be found in Windows operating systems, called the File Allocation Table (FAT32), which uses a linked list data structure to point out used or empty blocks. These two methods have different advantages and disadvantages. We use both methods to make a better hybrid solution: a bitmap structure in the header section shows whether clusters are empty or not, and a linked list-based FAT structure is used to track the file chains.
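A minimal in-memory sketch of this hybrid idea follows: a bitmap marks free versus used clusters, and a FAT-style table chains the clusters of a file. The structures, sizes, and flag conventions here are illustrative assumptions, not the actual KUSDFS on-disk layout:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Illustrative model: a bitmap tracks free clusters, a FAT-style table chains a file's clusters.
class AllocationSketch
{
    const int ClusterCount = 1024;                                   // illustrative volume size
    readonly BitArray bitmap = new BitArray(ClusterCount, false);    // false = free, true = allocated
    readonly int[] fat = new int[ClusterCount];                      // next-cluster index; -1 marks end of chain

    // Allocate a chain of clusters for a file and return the index of its first cluster.
    public int AllocateChain(int clustersNeeded)
    {
        int first = -1, previous = -1;
        for (int i = 0; i < ClusterCount && clustersNeeded > 0; i++)
        {
            if (bitmap[i]) continue;          // cluster already in use
            bitmap[i] = true;
            if (first == -1) first = i; else fat[previous] = i;
            fat[i] = -1;                      // current end of the chain
            previous = i;
            clustersNeeded--;
        }
        if (clustersNeeded > 0) throw new InvalidOperationException("Disk full");
        return first;
    }

    // Walk the FAT chain starting from a file's first cluster.
    public IEnumerable<int> ReadChain(int firstCluster)
    {
        for (int c = firstCluster; c != -1; c = fat[c])
            yield return c;
    }
}
```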

Programming language

The Microsoft Visual Studio 2017 framework was used in this project. Also, the C# programming language was preferred to implement the entire system.

Brief system overview and integration

Medical imaging has made great progress in the last few decades. Over the years, different types of medical imaging have been developed, such as X-ray, computed tomography (CT), molecular imaging, and magnetic resonance imaging (MRI). A HIS uses these files for reporting, data transfer, diagnostic, and treatment purposes. These images, which doctors can use to better understand a patient's condition, are the most important part of diagnosis and treatment. In this study, the client application refers to a health information management system running on a web server with a static IP.
There are major theoretical and conceptual frameworks used in this study, which can best be summarized under three headings: service routines (client and server side), security issues (for secure connection), and the distributed file system architecture (the main part of our system). The service routines, the client-service routine (CSR) and the data-node service routine (DNSR), have been implemented using client-server socket tools to accomplish secure communication between the client side and the server side. These services communicate with each other using transmission blocks (shown in Figure 1) in a JSON data structure over TCP/IP. Library files (DLL files) must be installed on the client side for the client application to integrate into the system. The CSR is responsible for sending medical image files, regardless of their size and type, to the DNSR and reading them on demand. The CSR sends the client application's (HIS) requests to the DNSR over a secure connection. Also, one smaller Windows service has been implemented for replica nodes to search for files. Moreover, this service provides encrypted data transfer with the other nodes.
Figure 1. Windows service routines and JSON packages
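The real field set of a transmission block is defined in Figure 1; the sketch below only illustrates, with assumed field names, how such a JSON request block might be serialized on the client side before being written to the TCP socket. Newtonsoft.Json is used as the serializer here because it is a common choice in .NET Framework projects; the article does not state which serializer is actually used:

```csharp
using System.Text;
using Newtonsoft.Json;

// Hypothetical transmission block; the actual fields are those shown in Figure 1 of the article.
class TransmissionBlock
{
    public string Operation { get; set; }     // e.g. "SaveFile", "ReadFile", "DeleteFile"
    public string ClientIp { get; set; }      // static IP of the HIS server
    public string FileName { get; set; }
    public string PayloadBase64 { get; set; } // optionally encrypted file content
}

static class BlockSerializer
{
    // Serialize the block to JSON, then to UTF-8 bytes ready for the socket stream.
    public static byte[] ToWireFormat(TransmissionBlock block)
    {
        string json = JsonConvert.SerializeObject(block);
        return Encoding.UTF8.GetBytes(json);
    }
}
```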

The server side is the most important part of this work, as it is a comprehensive kernel service responsible for (a) listening on the specified port number for any client application request; (b) authentication, which includes checking the CSR IP value; and (c) processing reading, writing, and deleting operations according to the requests.
Another key part of the study is achieving a strong security level, which includes (a) the static IP value of the CSR, (b) JSON and BSON data structures for data transfer, and (c) a symmetric encryption algorithm.
Thanks to these security measures, the security of the developed middle layer platform has been increased dramatically.

System overview

In this study, a fast, secure, serverless, robust, survivable, easy to manage, and scalable distributed file system has been developed for medical big data sets. This system is a smart distributed file system developed primarily for Kırıkkale University and called the Kırıkkale University Smart Distributed File System (KUSDFS). KUSDFS is structurally designed to manage large amounts of data, as Hadoop, NFS, and AFS do. The contributions of this study are explained in detail below. Today, the amount of data in the health sector is near 10% of the data produced globally.[43] Healthcare software systems do not tolerate waiting on a server to which they have dispatched an operation, so all transactions should be run as soon as possible or within an acceptable timeout period. Hadoop and the other solutions struggle to accomplish these tasks conveniently, as they were mainly created for big data storage needs and for grid computing operations through the MapReduce framework. As a result, there was a crucial need for the Kırıkkale University healthcare software to implement its own distributed file system, which is the key factor in our study.
Comparisons with other data processing methods have been made to evaluate the performance of the system. The characteristics of the comparison systems are:
Hadoop: The Hadoop configuration used for comparison includes one name node and three data nodes. Red Hat Enterprise Linux Server 6.0 runs on each node. Also, Java-1.6.0, MongoDB-3.4.2 and Hadoop-2.7.2 is installed on each node.
CouchBase: One of the comparison systems is Couchbase, a popular NoSQL database system. In her study, Elizabeth Gallagher claims that Couchbase is clearly as powerful as the other popular NoSQL databases.[44] In this system, each bucket is made up of 20 MB clusters. The test machine has 6 GB of RAM and 200 GB of storage space. Microsoft SQL Server 2014 was also installed on the same machine to provide the RDBMS comparison.
System components include the client applications, data nodes, virtual file system, and replica nodes.

Client applications

Client applications are remote applications that receive filing service from the KUSDFS system. The applications are platform independent because they make all requests over TCP/IP; an application that needs the filing service only has to have a static IP. Client applications typically use this system as a "write once, read many" (WORM) store for filing and archiving purposes. An application that wants to receive filing service from KUSDFS should install the dynamic link library on its system. The library provides the GetAvailableDatanodeIp, ChangePassword, SaveFile, ReadFile, DeleteFile, and GetFileFromReplica methods, which are detailed in the following sections. The client application communicates with the KUSDFS data node that is reserved for it and manages all of its operations via this node. Authentication for secure logon, which requires a username and password, is performed with symmetric cryptography; in addition, the client's static IP is registered on the server to strengthen connection security. As there is no limit to the number of client applications, these applications can use more than one data node defined on KUSDFS when needed. This is one of the key differences between our system and others. The general structure of our system is shown in Figure 2.
Figure 2. System overall structure
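The following Python sketch mirrors the client-library interface listed above. The method names come from the text; the transport details, request fields, and return values are assumptions made for illustration (ChangePassword would follow the same request pattern).

import json
import socket

class KusdfsClient:
    """Illustrative client wrapper; not the actual DLL implementation."""

    def __init__(self, proxy_node_ip: str, port: int = 9443):
        self.proxy_node_ip = proxy_node_ip   # the data node reserved for this client
        self.port = port

    def _call(self, node_ip: str, command: str, **fields) -> dict:
        request = {"command": command, **fields}
        with socket.create_connection((node_ip, self.port)) as conn:
            conn.sendall(json.dumps(request).encode("utf-8"))
            return json.loads(conn.recv(65536).decode("utf-8"))

    def get_available_datanode_ip(self) -> str:
        # Ask the proxy (corresponding) data node which node should store the next file.
        return self._call(self.proxy_node_ip, "GetAvailableDatanodeIp")["datanode_ip"]

    def save_file(self, data: bytes) -> str:
        target = self.get_available_datanode_ip()
        return self._call(target, "SaveFile", payload=data.hex())["file_name"]

    def read_file(self, node_ip: str, file_name: str) -> bytes:
        return bytes.fromhex(self._call(node_ip, "ReadFile", file_name=file_name)["payload"])

    def delete_file(self, node_ip: str, file_name: str) -> dict:
        return self._call(node_ip, "DeleteFile", file_name=file_name)

    def get_file_from_replica(self, replica_ip: str, file_name: str) -> bytes:
        return bytes.fromhex(self._call(replica_ip, "GetFileFromReplica", file_name=file_name)["payload"])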

Nodes

One of the essential elements of the system is its nodes, of which there are two types: data nodes and replica nodes. Distributed systems usually use a head (name) node and data nodes. In this work, all the functions of the head node are inherited by the data nodes, so only data nodes are used, because in other systems, when the head node crashes, the entire system often becomes unavailable, perhaps causing data loss.[20][21][22][23][24][25][26][27][28][29][30][31][32] These nodes are highly functional in that they combine the data-node and server-node roles. In this way, the survivability and availability of the system are very good: even if only one node remains, the system can continue to operate. This is one of the distinguishing features of our system.

Data node

Data nodes are responsible for storing and managing the files that come from the clients. The client applications communicate with data nodes via the TCP/IP network service. In this work, each data node has the same priority level when processing filing services for the clients it is responsible for. The dynamic link library we developed, running on the client, provides a secure connection to the data nodes via symmetric-key algorithms. Each application server has a static IP address and is served by a corresponding data node; it connects to that data node through this IP, and data nodes check whether the IP address of each request is on the registration list. In addition, the client can optionally encrypt the data it wants to send with the symmetric-key algorithm and securely transmit the encrypted file to the data node (see the sketch below). Together, static-IP authentication and optional file encryption achieve a strong security level. The data nodes store client files on the virtual file system, a large file on the data storage device; this developed virtual file structure is one of the most prominent features of the system. Each data node can also serve more than one client application. To achieve optimal load balancing, the nodes are used for file storage in turn. Because a client application can be connected to multiple data nodes and each node can serve multiple client applications at the same time, the system has both good scalability and availability.
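As a rough sketch of the optional client-side encryption mentioned above: the paper specifies 128-bit symmetric cryptography but not a particular algorithm, so this example stands in Fernet (AES-128 with HMAC) from Python's cryptography package purely as an illustrative substitute.

from cryptography.fernet import Fernet

# Illustrative stand-in for the system's 128-bit symmetric encryption.
shared_key = Fernet.generate_key()     # in practice, a pre-shared secret key
cipher = Fernet(shared_key)

def encrypt_for_upload(file_bytes: bytes) -> bytes:
    # Encrypt the file before handing it to SaveFile.
    return cipher.encrypt(file_bytes)

def decrypt_after_download(token: bytes) -> bytes:
    # Decrypt the file after ReadFile / GetFileFromReplica returns it.
    return cipher.decrypt(token)

ciphertext = encrypt_for_upload(b"DICOM image bytes ...")
assert decrypt_after_download(ciphertext) == b"DICOM image bytes ..."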

System service architecture

There is a service running on each data node to accomplish all of the system's functionality. This program listens for client applications using a server socket on the port allocated to it (port 443, normally reserved for HTTPS, was used here). When the service routine is enabled:
  1. It identifies all data storage devices in the computer and prepares the device list.
  2. It creates a large empty file that covers 80% (the optimum size according to our experiments) of each element in the device list; this file is created only once per drive. The structure is shown in Figure 3. The file corresponding to a disc drive has a disc header and an array-based bunch list in which each bunch is 10 MB in size. The disc header contains a bitmap of the bunch list, and each bunch is represented in the header by one bit that indicates whether or not it belongs to a file.
  3. The structure of this file is described as a virtual file system.
  4. The SystemDataFile located on the first node of the system is read by each data node when activated. This file contains comprehensive information about all data nodes, replica nodes, and client applications in the system, and it is rather small. As shown in Table 1, all data nodes get this list from the first node, which has a constant IP value, when the computer restarts. Then, when a new node is added to the system or the list is updated, the system administrator invokes the CheckServerList function of the service routine on all data nodes (a minimal sketch of this registry handling follows Table 1). Each client application is connected to the data node that is assigned to it (at least one data node). The functions used are GetAvailableDatanodeIp, SaveFile, ReadFile, DeleteFile, CheckServerList, and GetFileFromReplica.
Figure 3. System data storage device architecture

Table 1. System data file containing registered IP list
IP                 Description
xxx.xxx.xxx.xxx    Data
xxx.xxx.xxx.xxx    Data
xxx.xxx.xxx.xxx    Replica
xxx.xxx.xxx.xxx    Replica
xxx.xxx.xxx.xxx    Client Application
xxx.xxx.xxx.xxx    Client Application
xxx.xxx.xxx.xxx    Client Application
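The sketch below illustrates how a data node might load a registry like the one in Table 1 and push the updated list to the other data nodes via CheckServerList. The one-entry-per-line "ip,role" file format and the send callback are assumptions made for illustration.

from collections import defaultdict

def load_system_data_file(path: str) -> dict:
    # Parse the SystemDataFile into per-role IP lists (roles as in Table 1).
    nodes = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ip, role = line.strip().split(",", 1)
            nodes[role].append(ip)          # "Data", "Replica", or "Client Application"
    return nodes

def check_server_list(nodes: dict, send) -> None:
    # Propagate the updated registry to every data node (the CheckServerList step).
    # "send" is an assumed callback that delivers a command block to a node.
    for ip in nodes["Data"]:
        send(ip, {"command": "CheckServerList", "registry": dict(nodes)})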

SaveFile

The data node responsible for a client application also acts as a load balancer: it sends the IP address of the next data node that will handle the file storage operation. Therefore, the file saving operation is done in two steps. When the client application wants to store a file, it first obtains the IP of the appropriate data node from its corresponding data node. As mentioned before, the corresponding data node acts as a proxy node responsible for the routing operation.
As shown in Figure 4, the client gets the IP address of the data node that will store the file from the proxy node before storing the file. The load-balance functionality provides a different data node IP for each SaveFile operation; this is done with the GetAvailableDatanodeIP function. The system then saves the client file to the data node whose address was just obtained from the proxy node, using the SaveFile function. The SaveFile function saves the file in its virtual file system with a linked-list structure and marks the corresponding bunches in the bitmap (a revised hybrid of the Unix bitmap and Windows FAT32 structures). As a result of this operation, the client receives a unique file name that contains the starting bunch number and the data node ID of the file. In summary, when the client application wants to store a file, it gets the IP address of the data node that will store it and then communicates directly with that target data node. In systems that use a head-node structure, the data is first transferred to the head node, and the head node writes it to a data node; the amount of data transferred is therefore doubled, and the number of concurrent connections the head node can serve is reduced. In our system, the client communicates directly with the data node that will store the file, so the data is transferred once and the number of concurrent connections is maximized.
Figure 4. File storage processing
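The following sketch illustrates the two-step flow from the node side: a round-robin GetAvailableDatanodeIP on the proxy node, and a unique file name built from the node ID and starting bunch number on the chosen node. The node addresses and the name format are assumptions for illustration.

import itertools

DATA_NODES = ["193.140.1.21", "193.140.1.22", "193.140.1.23", "193.140.1.24"]
_round_robin = itertools.cycle(DATA_NODES)

def get_available_datanode_ip() -> str:
    # Simple in-order load balancing: hand out the next data node in turn.
    return next(_round_robin)

def save_file_on_node(node_id: int, start_bunch: int) -> str:
    # The client stores this name and later sends it with ReadFile/DeleteFile,
    # so the file chain can be reached with a single disk access.
    return f"{node_id}-{start_bunch}"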

The client receiving these values deals only with this data node in subsequent operations (the ReadFile and DeleteFile commands) and runs its requests there. Only when a ReadFile operation fails does the client application retrieve one of the other copies of the file with the GetFileFromReplica command.
When a new client is added to the system or a change is made to the data nodes, the system administrator sends the CheckServerList command to all nodes. A sketch of the read path with replica fallback is shown below.
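A minimal sketch of that read path, reusing the illustrative KusdfsClient class from the earlier sketch: the client tries the data node named in the file identifier first and falls back to GetFileFromReplica only if ReadFile fails.

def read_with_fallback(client, data_node_ip: str, replica_ips, file_name: str) -> bytes:
    # Normal path: read from the data node that stored the file.
    try:
        return client.read_file(data_node_ip, file_name)
    except (OSError, KeyError):
        # Fallback path: ask the replica nodes for one of the other copies.
        for replica_ip in replica_ips:
            try:
                return client.get_file_from_replica(replica_ip, file_name)
            except (OSError, KeyError):
                continue
        raise FileNotFoundError(file_name)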

Virtual file system

The system creates a file that occupies 80% of the total free space on the physical device. At this point, as shown in Figure 3, the file forms a continuous area on the physical drive, which reduces the required disk I/O. This file is divided into 10 MB blocks called bunches; Figure 5 shows the bunch structure.
Figure 5. Bunch structure of file system

A bunch consists of three bytes of next-bunch information, one byte indicating which replica server holds a copy, four bytes for the client application IP, 50 bytes for the file name, and the data block. The bunches behave exactly like an array, and the index numbering starts from zero. As shown in Figure 6, the disc header consists of three bytes holding the number of bunches and 200 KB of data in bitmap format. Whether each bunch is empty is tracked by this bitmap, in which each bunch is represented by one bit: "1" denotes a used or reserved bunch, and "0" denotes an empty bunch that is ready to be used in a file chain. Because the 200 KB bitmap addresses roughly 1.6 million bunches of 10 MB each, a storage area of about 16 TB is reached.

Figure 6. Structure of disc header

Each bunch holds a three-byte pointer to the next bunch. Files larger than one bunch are kept as a file chain. As shown in Figure 7, each bunch points to the next data block with this three-byte pointer, which holds the index of the next bunch; if the pointer is "0," the bunch is the last one in the file chain (EOF). Data nodes are responsible for storing, reading, deleting, and backing up the files sent to them. One of the most basic features of this structure is that the starting bunch number of a stored file is sent along with the file name, so reading can begin with a single disk access.

Figure 7. Structure of file data bunch
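The sketch below illustrates the bunch layout and file-chain traversal described above. The 58-byte header offset is derived from the field sizes given in the text (3 + 1 + 4 + 50 bytes); the function names, byte order, and trimming behavior are assumptions.

BUNCH_SIZE = 10 * 1024 * 1024          # 10 MB per bunch
BITMAP_BYTES = 200 * 1024              # 200 KB bitmap -> ~1.6M bunches -> ~16 TB

def find_free_bunch(bitmap: bytearray) -> int:
    # Scan the header bitmap for the first 0 bit (an unused bunch) and mark it used.
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if not (byte >> bit) & 1:
                bitmap[byte_index] |= (1 << bit)
                return byte_index * 8 + bit
    raise RuntimeError("virtual file system full")

def read_file_chain(read_bunch, start_bunch: int) -> bytes:
    # read_bunch(index) is assumed to return one raw 10 MB bunch; its first
    # 3 bytes hold the index of the next bunch (0 = end of the file chain).
    data, index = b"", start_bunch
    while True:
        bunch = read_bunch(index)
        assert len(bunch) == BUNCH_SIZE
        next_index = int.from_bytes(bunch[0:3], "big")
        data += bunch[58:]             # skip the 58-byte bunch header (3+1+4+50)
        if next_index == 0:            # real code would also trim trailing padding
            break
        index = next_index
    return data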

With this feature, system performance is quite satisfactory. This situation is shown in Table 2. The result of comparing the performance values of KUSDFS with other systems is shown in the diagram in Figure 8.
Table 2. The response times given by the systems for different file size (ms.)
File Size (KB)    KUSDFS (ms.)    OS (ms.)    NoSQL (ms.)    RDBMS (ms.)    Hadoop (ms.)
30                0.01            0.04        0.60           0.75           0.80
1000              0.92            1.43        2.74           3.15           4.01
10000             4.40            4.48        8.44           9.97           11.15
20000             4.48            11.19       18.01          18.16          20.45
30000             11.36           14.80       27.22          27.80          30.15
50000             22.60           24.54       43.95          44.08          47.88
Figure 8. The response times given by the systems for different file sizes

First, the developed system can be used by more than one client application server. Each client application server is matched to one of the data nodes, and all of its requests are forwarded to that data node. A data node can serve multiple client servers, which increases the survivability of the system. Application servers may communicate with different data nodes without affecting the operation of their own systems, and each data node can serve multiple client servers at the same time. In short, any node of the system can act as a server when requested. This feature yields very good survivability, availability, and reliability.
The client application works with the data node to which it is connected; when the client server wants to store a file, it requests the address of the next data node from that node. The client server that receives this address writes the file to the data node at that IP address. The application server thus writes its files to the data nodes in the system in turn. The only task of the data node that the application is associated with is to hand out this IP address; in this respect, the structure resembles the name node in the Hadoop architecture.

Replica node

Replica nodes, an indispensable backup method for increasing fault tolerance in a DFS, are also used here and are much easier to manage in this system. These nodes store copies of the files held by the data nodes: the data nodes asynchronously send uploaded files to the replica nodes, and the replica nodes store them in the underlying operating system's file structure. No special file structure or other file-processing strategy is used; the operating system's filing service is used directly. In other words, replica nodes act as ordinary file servers that are only responsible for storing files in the operating system's volume-directory structure. The replica count was set to three, the usual default for a DFS, and for all nodes it can be raised to eight.
All replica nodes run a service responsible for storing, deleting, updating, forwarding, and searching the files in the system. When a file search is needed by a data node or a client, the SearchFile command is sent to all replica nodes, which then search for the target file in their volume structures concurrently, in a grid-computing fashion. At the end of these concurrent searches, each node returns a result indicating whether its search was successful, and the result is transmitted to the data node requesting the file. Replica nodes not only perform file searches but also process the ReplicaWrite, ReplicaRead, and ReplicaDelete commands, which write, read, and delete files on the replica node, respectively.
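A minimal sketch of the parallel SearchFile broadcast: the command is dispatched to all replica nodes at once, and the replicas that report the file are collected. The search_on_replica helper is a placeholder for the actual call to a replica's service.

from concurrent.futures import ThreadPoolExecutor

def search_file(replica_ips, file_name, search_on_replica) -> list:
    # Send SearchFile to every replica node at the same time.
    with ThreadPoolExecutor(max_workers=len(replica_ips)) as pool:
        results = list(pool.map(lambda ip: (ip, search_on_replica(ip, file_name)), replica_ips))
    # Return the replica nodes that actually hold the file.
    return [ip for ip, found in results if found]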

Functional features

1. A serverless architecture is preferred instead of a name node. The term "serverless" does not mean that servers are not used; it only means that we no longer have to think too much about them.[45] Because all data nodes sit at the same level, the hierarchy itself is serverless. Each of the four nodes used in this study serves file-storing operations for different types of client applications. When any of these nodes is disabled, the system automatically routes the client to the other data nodes. In this way:
  • computing resources get used as services without having to operate around physical capacities or limits, and
  • it is possible to use different sources (more than one hard disk) on the same machine.
2. Corporations need small and medium block sizes. Hadoop and the other systems are generally configured for a block size of at least 64 MB. This makes them a poor fit for small and medium-sized big data, for which few options are available.
3. The name node crash recovery problem is solved. The fault tolerance of the system is very high because there is no name node.
4. An easy to manage virtual file system has been produced on the underlying operating system. A mixture of both bitmap and linked list structure was used for managing the bunches, which have a fixed size of 10 MB.
5. The start address of each file is saved in the client application, which gives a sufficient performance gain.
6. One of the most important features of our system is that any data node may act as a server for any application. In other words, any machine in the system can be a file server.
7. The IP lists containing the data node, replica, and secure-connection IPs are available to all nodes in the cluster.
KUSDFS has the following non-functional specifications:
1. Performance (reading): When data processing speed is compared with the other systems, very good performance is achieved.
2. Scalability: To add a new node, it is sufficient to install only the service program. The system is designed to serve thousands of nodes, while it can also work with just a few nodes according to institutional needs.
3. Survivability: The system survives failures and maintains its minimum functions; even if only one node remains active, the system is still able to work.
4. Availability: The system delivers its services successfully every time; in particular, because there is no name node, availability remains high.
5. Security: Static IP checks and symmetric encryption methods were used to strengthen the security level of the system.
6. Minimum cost: The data nodes used in the system are ordinary machines without any special hardware.

Discussion

Big data has long been an issue of great interest in a wide range of fields, especially in healthcare. Recently, researchers have shown increased interest in managing this problem with the Hadoop ecosystem. However, much of the research up to now has been descriptive, focusing on how data analytics methods or MapReduce tasks are applied to the data, and it has not addressed what can be done for efficient data storage with minimal internal fragmentation for the needs of small, medium, and large institutions. This study seeks to provide a solution that helps address these gaps.
It is hoped that this research will contribute to a deeper understanding of problem-oriented stand-alone solutions. The distinctive strengths of this study are:
  • improved read–write performance due to the hybrid architecture
  • robustness and availability thanks to the use of no name node and treating each node like a server
  • optimal load balancing
  • successful integration with cheap ordinary hardware elements
  • secure connections via 128-bit symmetric cryptography
  • easy scalability by simply installing the library files to the node
  • suitability for healthcare institutions thanks to the 10 MB block size
Figure 9 shows the strengths, weaknesses, opportunities, and threats (SWOT) associated with the system.
Figure 9. SWOT analysis of the proposed system

The limitations of the present study naturally include:
  • no grid computing facilities
  • a fixed bunch size of 10 MB
  • static IP addresses: a change to a registered static IP causes a temporary system stop; however, this applies only to the client application, so it does not affect the availability of the system
  • limited test environment with four nodes
  • symmetric cryptography operates with a single key, and it is important to keep the key secret for the security of this cryptography
This study clearly has advantages and disadvantages, as discussed here. It will be important that future research be conducted to improve upon this system's drawbacks.

Conclusions

In this study, we developed a fast, secure, serverless, robust, survivable, easy-to-manage, scalable distributed file system, especially for small and medium-sized big data sets. This smart distributed file system (KUSDFS) was designed and developed for Kırıkkale University. In contrast to most distributed systems, it is platform independent and uses the TCP/IP protocol. Server nodes, head nodes, or name nodes are not used; the system is serverless, which ensures its survivability. When a data node does not work properly, other nodes can execute the requests. An unlimited number of data nodes can be added to the system when needed, simply by installing the Windows service routine on the node. In that regard, the system proves superior when compared with other systems.
The system has an acceptable security level compared to other distributed file systems. This is ensured in two ways: the first is checking the IPs of the client machines served by each data node, and the second is encrypting the data that the application software sends to the data nodes.
Because a data node can serve more than one client in our system, it has a better load balance performance than that of other distributed file systems. In the same way, a client can upload data to more than one data node.
In the designed system, a disc consists of data sets bound into "bunches" plus a "disc header." The disc header holds the number of bunches found on that disc and, via the bitmap structure, whether each bunch is empty. A bunch holds a pointer to the next bunch, the replica information for the data, the IP of the client application that loaded the data, and the data itself. In the designed file system, every bunch can point to the next bunch, which is a member of the same file chain.
The replication operation uses the operating system's file-operation system calls (in Windows, the API). Data nodes asynchronously send the files uploaded to them to the replica nodes.
The designed distributed file system is compared with other filing systems. According to our analysis, our system performed 97% better than the NoSQL system, 80% better than the RDBMS, and 74% better than the operating system.
In the future, healthcare institutions in Kırıkkale province can be integrated into KUSDFS. At that point, the cost, workload, and technology needs of those institutions will be minimized: instead of establishing their own archiving systems, they can access and integrate with the system simply by installing the library files. The developed system has a fixed bunch size of 10 MB, chosen especially for healthcare. To embrace dissimilar applications with different file sizes and types, a dynamic bunch size might be developed in the future. Grid computing functionality could also be implemented to make the system more powerful and more efficient in terms of computational speed; this is an important issue for future research. In addition, GPUs, which have become more widely used in recent years, can be included in the system to achieve a more specific distributed computing platform and to accelerate grid performance. However, further work with more focus on this topic is required to establish this property.

Acknowledgements

Author contributions

A.E. conceived and structured the system. M.Ü. performed the literature search, helped with the experiments, and analyzed the test data. Both authors implemented the system, wrote the paper, and read and approved the submitted version of the manuscript.

Funding

This work has been partly supported by the Kırıkkale University Department of Scientific Research Projects (2016/107–2017/084).

Conflicts of interest

The authors declare that there is no conflict of interest.

References

  1. ↑ "Internet Live Stats". InternetLiveStats.com. Retrieved 16 July 2016.
  2. ↑ Kemp, S. (27 January 2016). "Digital in 2016". We Are Social. We Are Social Ltd. Retrieved 27 June 2016.
  3. ↑ "Internet Growth Statistics". Internet World Stats. Miniwatts Marketing Group. Retrieved 21 May 2018.
  4. ↑ Vagata, P.; Wilfong, K. (10 April 2014). "Scaling the Facebook data warehouse to 300 PB". Facebook Code. Facebook. Retrieved 27 June 2016.
  5. ↑ Dhavalchandra, P.; Jignasu, M.; Amit, R. (2016). "Big data—A survey of big data technologies". International Journal Of Science Research and Technology 2 (1): 45–50.
  6. ↑ Dean, B.B.; Lam, J.; Natoli, J.L. et al. (2009). "Review: Use of electronic medical records for health outcomes research: A literature review". Medical Care Research and Review 66 (6): 611–38. doi:10.1177/1077558709332440. PMID 19279318.
  7. ↑ 7.0 7.1 Ergüzen, A.; Erdal, E. (2017). "Medical Image Archiving System Implementation with Lossless Region of Interest and Optical Character Recognition". Journal of Medical Imaging and Health Informatics 7 (6): 1246–52. doi:10.1166/jmihi.2017.2156.
  8. ↑ Dinov, I.D. (2016). "Volume and Value of Big Healthcare Data". Journal of Medical Statistics and Informatics 4: 3. doi:10.7243/2053-7662-4-3. PMC PMC4795481. PMID 26998309.
  9. ↑ Elgendy N.; Elragal A. (2014). "Big Data Analytics: A Literature Review Paper". In Perner, P.. Advances in Data Mining. Applications and Theoretical Aspects. Lecture Notes in Computer Science. 8557. Springer. doi:10.1007/978-3-319-08976-8_16. ISBN 9783319089768.
  10. ↑ Gürsakal, N. (2014). Büyük Veri. Dora Yayıncılık. p. 2.
  11. ↑ Başara, B.B.; Güler, C. (2017). "Sağlık İstatistikleri Yıllığı 2016 Haber Bülteni". Republic of Turkey Ministry of Health General Directorate for Health Research.
  12. ↑ 12.0 12.1 Raghupathi, W.; Raghupathi, V. (2014). "Big data analytics in healthcare: Promise and potential". Health Information Science and Systems 2: 3. doi:10.1186/2047-2501-2-3. PMC PMC4341817. PMID 25825667.
  13. ↑ Klein, J.; Gorton, I.; Ernst, N. et al. (2015). "Application-Specific Evaluation of No SQL Databases". Proceedings of the 2015 IEEE International Congress on Big Data: 83. doi:10.1109/BigDataCongress.2015.83.
  14. ↑ Davaz, S. (28 March 2014). "NoSQL Nedir Avantajları ve Dezavantajları Hakkında Bilgi". Kodcu Blog. Kodcu. Retrieved 13 June 2017.
  15. ↑ 15.0 15.1 15.2 He, H.; Du, Z.; Zhang, W.; Chen, A. (2016). "Optimization strategy of Hadoop small file storage for big data in healthcare". The Journal of Supercomputing 72 (10): 3696–3707. doi:10.1007/s11227-015-1462-4.
  16. ↑ Wang, Y.; Kung, L.; Byrd, T.A. (2018). "Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations". Technological Forecasting and Social Change 126: 3–13. doi:10.1016/j.techfore.2015.12.019.
  17. ↑ Liu, W.; Park, E.K. (2014). "Big Data as an e-Health Service". Proceedings from the 2014 International Conference on Computing, Networking and Communications. doi:10.1109/ICCNC.2014.6785471.
  18. ↑ Mishra, S. (2018). "A Review on Big Data Analytics in Medical Imaging". International Journal of Computer Engineering and Applications 12 (1).
  19. ↑ Sobhy, D.; El-Sonbaty, Y.; Elnasr, M.A. (2012). "MedCloud: Healthcare cloud computing system". Proceedings from the 2012 International Conference for Internet Technology and Secured Transactions.
  20. ↑ 20.0 20.1 Alsberg, P.A.; Day, J.D. (1976). "A principle for resilient sharing of distributed resources". Proceedings of the 2nd International Conference on Software Engineering: 562–570.
  21. ↑ 21.0 21.1 Ellis, C.S.; Floyd, R.A. (March 1983). "The ROE File System". Office of Naval Research.
  22. ↑ 22.0 22.1 Popek, G.; Walker, B.; Chow, J. et al. (1981). "LOCUS: A network transparent, high reliability distributed system". Proceedings of the Eighth ACM Symposium on Operating Systems Principles: 169–77.
  23. ↑ 23.0 23.1 Sandberg, R.; Goldberg, D.; Kleiman, S. et al. (1985). "Design and implementation of the Sun Network Filesystem". Proceedings of the USENIX Conference & Exhibition: 119–30.
  24. ↑ 24.0 24.1 Coulouris, G.; Dollimore, J.; Kindberg, T.; Blair, G. (2011). Distributed Systems: Concepts and Design (5th ed.). Pearson. pp. 1008. ISBN 9780132143011.
  25. ↑ 25.0 25.1 Heidl, S. (July 2001). "Evaluierung von AFS/OpenAFS als Clusterdateisystem". Zuse-Institut Berlin.
  26. ↑ 26.0 26.1 Bžoch, P.; Šafařík, J. (2012). "Algorithms for increasing performance in distributed file systems". Acta Electrotechnica Et Informatica 12 (2): 24–30. doi:10.2478/v10198-012-0005-7.
  27. ↑ 27.0 27.1 Karasula, B.; Korukoğlu, S. (2008). "Modern Dağıtık Dosya Sistemlerinin Yapısal Karşılaştırılması". Proceedings of the Akademik Bilişim’2008: 601–610.
  28. ↑ 28.0 28.1 Thekkath, C.A.; Mann, T.; Lee, E.K. (1997). "Frangipani: A scalable distributed file system".Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles: 224–37.
  29. ↑ 29.0 29.1 Adya, A.; Bolosky, W.J.; Castro, M. et al. (2002). "Farsite: Federated, available, and reliable storage for an incompletely trusted environment". ACM SIGOPS Operating Systems Review 36 (SI): 1–14. doi:10.1145/844128.844130.
  30. ↑ 30.0 30.1 Weil, S.A.; Brandt, S.A.; Miller, E.L. et al. (2006). "Ceph: A scalable, high-performance distributed file system". Proceedings of the 7th Symposium on Operating Systems Design and Implementation: 307–20.
  31. ↑ 31.0 31.1 Shvachko, K.; Kuang, H.; Radia, S. (2010). "The Hadoop Distributed File System". Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies: 1–10.
  32. ↑ 32.0 32.1 Yavuz, G.; Aytekin, S.; Akçay, M. (2012). "Apache Hadoop Ve Dağıtık Sistemler Üzerindeki Rolü". Dumlupinar Üniversitesi Fen Bilimleri Enstitüsü Dergisi (27): 43–54.
  33. ↑ Khidairi, S. (4 January 2012). "The Apache Software Foundation Announces Apache Hadoop v1.0". The Apache Software Foundation Blog. Apache Software Foundation.
  34. ↑ Thomson, A.; Abadi, D.J. (2015). "CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems". Proceedings of the 13th USENIX Conference on File and Storage Technologies: 1–14.
  35. ↑ Al-kahtani, M.S.; Karim, L. (2017). "An Efficient Distributed Algorithm for Big Data Processing". Arabian Journal for Science and Engineering 42 (8): 3149–3157. doi:10.1007/s13369-016-2405-y.
  36. ↑ Wang, Q.; Wang, H.; Zhang, C. et al. (2014). "A Parallel Implementation of Idea Graph to Extract Rare Chances from Big Data". Proceedings from the 2014 IEEE International Conference on Data Mining Workshop: 503–10. doi:10.1109/ICDMW.2014.91.
  37. ↑ Liang, Z.; Li, W.; Li, Y. (2013). "A parallel Probabilistic Latent Semantic Analysis method on MapReduce platform". Proceedings from the 2013 IEEE International Conference on Information and Automation: 1017–22. doi:10.1109/ICInfA.2013.6720444.
  38. ↑ Chen, T.; Wei, H.; Wei, M. et al. (2013). "LaSA: A locality-aware scheduling algorithm for Hadoop-MapReduce resource assignment". Proceedings from the 2013 IEEE International Conference on Collaboration Technologies and Systems: 342–46. doi:10.1109/CTS.2013.6567252.
  39. ↑ Muja, M.; Lowe, D.G. (2014). "Scalable Nearest Neighbor Algorithms for High Dimensional Data". IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11): 2227–40. doi:10.1109/TPAMI.2014.2321376. PMID 26353063.
  40. ↑ Lu, F.; Hong, L.; Changefeng, L. (2015). "The improvement and implementation of distributed item-based collaborative filtering algorithm on Hadoop". Proceedings from the 34th Chinese Control Conference: 9078–83. doi:10.1109/ChiCC.2015.7261076.
  41. ↑ 41.0 41.1 Cevher, V.; Becker, S.; Schmidt, M. (2014). "Convex Optimization for Big Data: Scalable, randomized, and parallel algorithms for big data analytics". IEEE Signal Processing Magazine 31 (5): 32–43. doi:10.1109/MSP.2014.2329397.
  42. ↑ 42.0 42.1 Jin, Y.; Deyu, T.; Yi, Z. (2011). "A Distributed Storage Model for EHR Based on HBase". Proceedings from the 2011 International Conference on Information Management, Innovation Management and Industrial Engineering: 369–72. doi:10.1109/ICIII.2011.234.
  43. ↑ Hossain, M.S.; Muhammad, G. (2016). "Cloud-assisted Industrial Internet of Things (IIoT) – Enabled framework for health monitoring". Computer Networks 101: 192–202. doi:10.1016/j.comnet.2016.01.009.
  44. ↑ Chandra, D.G. (2015). "BASE analysis of NoSQL database". Future Generation Computer Systems 52: 13–21. doi:10.1016/j.future.2015.05.003.
  45. ↑ Fromm, K. (15 October 2012). "Why The Future Of Software And Apps Is Serverless". readwrite. ReadWrite, Inc. Retrieved 15 June 2017.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article mentions "Liu and Park," yet they did not include a citation for those authors; this article adds the presumed citation associated with those names. The original URL to the Heidl citation led to a security warning from Google about the site; a substitute URL to DocPlayer has been added in its place. The original used Wikipedia as a citation about companies using Hadoop, which is frowned upon; updated with an improved source.
source:https://www.limswiki.org/index.php/Journal:Developing_a_file_system_structure_to_solve_healthcare_big_data_storage_and_archiving_problems_using_a_distributed_file_system
by Shawn Douglas (Admin)







Author: Vlad Daniloff at 07:33 No comments:
Labels: big data, distributed file system, English, healthcare management