How to build your own autoscaler for a cluster

Hello! We train people to work with big data. It is impossible to imagine an educational program on big data without its own cluster, on which all participants work together. For this reason, our program always has one 🙂 We take care of its configuration, tuning and administration, and the participants launch MapReduce jobs directly on it and use Spark.

In this post we will tell you how we solved the problem of uneven cluster load by writing our own autoscaler on top of the Mail.ru Cloud Solutions cloud.

Problem

Our cluster is not used in a typical way. Utilization is highly uneven. For example, there are practical classes, when all 30 people and the instructor log in to the cluster and start using it. Or there are the days before a deadline, when the load rises sharply. The rest of the time the cluster is underloaded.

Solution #1 is to keep a cluster that can withstand peak loads but sits idle the rest of the time.

Solution #2 is to keep a small cluster, to which you add nodes by hand before classes and during peak loads.

Solution #3 is to keep a small cluster and write an autoscaler that monitors the current cluster load and, using various APIs, adds nodes to and removes nodes from the cluster.

In this post we will talk about solution #3. Such an autoscaler depends heavily on external factors rather than internal ones, and providers often do not offer one. We use the Mail.ru Cloud Solutions cloud infrastructure and wrote the autoscaler using the MCS API. And since we teach how to work with data, we decided to show how you can write a similar autoscaler for your own purposes and use it with your cloud.

Prerequisites

First, you must have a Hadoop cluster. We, for example, use the HDP distribution.

For your nodes to be added and removed quickly, you need a certain distribution of roles among the nodes.

  1. Master node. Nothing much needs explaining here: this is the main node of the cluster, on which, for example, the Spark driver is launched if you use interactive mode.
  2. Data node. This is a node on which you store data in HDFS and on which computations take place.
  3. Compute node. This is a node on which you store nothing in HDFS, but on which computations take place.

An important point. Autoscaling is done with nodes of the third type. If you start removing and adding nodes of the second type, the response time will be very slow: decommissioning and recommissioning will take hours on your cluster. That, of course, is not what you expect from autoscaling. In other words, we do not touch nodes of the first and second types. They form a minimal viable cluster that exists for the whole duration of the program.

So, our autoscaler is written in Python 3. It uses the Ambari API to manage cluster services and the Mail.ru Cloud Solutions (MCS) API to start and stop machines.

Solution architecture

  1. The autoscaler.py module. It contains three classes: 1) functions for working with Ambari, 2) functions for working with MCS, 3) functions directly related to the autoscaler logic.
  2. The observer.py script. Essentially it consists of various rules: when and at what moments to call the autoscaler functions.
  3. The config.py configuration file. It contains, for example, the list of nodes allowed for autoscaling and other parameters that affect, for example, how long to wait after a new node has been added. It also holds the timestamps of class start times, so that the maximum permitted cluster configuration is launched before a class.
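To make the role of config.py more concrete, here is a minimal sketch of what such a file might look like. The names scaling_hosts, scale_up_thresholds and cooldown_period appear in the observer.py excerpt later in this post; scale_down_thresholds, class_starts, the hostnames and every value are assumptions made up for illustration, not our real configuration.

```python
# config.py: illustrative sketch, every value below is a placeholder.
from datetime import datetime

# Compute nodes the autoscaler may start and stop (hypothetical hostnames).
scaling_hosts = [
    "compute-1.example.internal",
    "compute-2.example.internal",
    "compute-3.example.internal",
]

# Fractions of cluster RAM/CPU in use that trigger scaling (assumed structure).
scale_up_thresholds = {"ram": 0.8, "cpu": 0.8}
scale_down_thresholds = {"ram": 0.3, "cpu": 0.3}

# Minutes to wait after a node is added or removed before acting again.
cooldown_period = 10

# Class start times (hypothetical dates); all permitted nodes are started
# ahead of each one.
class_starts = [
    datetime(2024, 1, 15, 19, 0),
    datetime(2024, 1, 17, 19, 0),
]
```

Keeping these values in a plain Python module means observer.py can simply import config and read them as attributes.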

Let's now look at pieces of code from the first two files.

1. The autoscaler.py module

The Ambari class

This is what a piece of code containing the Ambari class looks like:

import json

import requests


class Ambari:
    def __init__(self, ambari_url, cluster_name, headers, auth):
        self.ambari_url = ambari_url
        self.cluster_name = cluster_name
        self.headers = headers
        self.auth = auth

    def stop_all_services(self, hostname):
        host_url = self.ambari_url + self.cluster_name + '/hosts/' + hostname
        components_url = host_url + '/host_components/'
        # First call: get the list of components present on the host.
        req0 = requests.get(host_url, headers=self.headers, auth=self.auth)
        services = req0.json()['host_components']
        services_list = [x['HostRoles']['component_name'] for x in services]
        # Second call: ask Ambari to move those components to INSTALLED (stopped).
        data = {
            "RequestInfo": {
                "context": "Stop All Host Components",
                "operation_level": {
                    "level": "HOST",
                    "cluster_name": self.cluster_name,
                    "host_names": hostname
                },
                "query": "HostRoles/component_name.in({0})".format(",".join(services_list))
            },
            "Body": {
                "HostRoles": {
                    "state": "INSTALLED"
                }
            }
        }
        req = requests.put(components_url, data=json.dumps(data), headers=self.headers, auth=self.auth)
        # Return a fixed string on success, otherwise the HTTP status code.
        if req.status_code in [200, 201, 202]:
            message = 'Request accepted'
        else:
            message = req.status_code
        return message

Above, as an example, you can see the implementation of the stop_all_services function, which stops all services on the desired cluster node.

The Ambari class takes the following as input:

  • ambari_url, for example 'http://localhost:8080/api/v1/clusters/',
  • cluster_name, the name of your cluster in Ambari,
  • headers = {'X-Requested-By': 'ambari'},
  • and auth, your login and password for Ambari: auth = ('login', 'password').

The function itself is nothing more than two calls to Ambari through the REST API. Logically, we first get the list of services running on the node, and then ask Ambari to move those services, on the given cluster and the given node, to the INSTALLED state. The functions for starting all services, for putting nodes into Maintenance mode and so on look similar: they are just a few requests through the API.
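As an illustration of how similar those other functions are, here is a sketch of how the request for switching a host into Maintenance mode could be assembled. The function name maintenance_request is hypothetical and not part of our module, and the body shape (Hosts/maintenance_state) reflects our understanding of the Ambari REST API, so check it against the documentation for your Ambari version before relying on it.

```python
import json


def maintenance_request(ambari_url, cluster_name, hostname, state="ON"):
    """Build the URL and JSON body for toggling a host's maintenance mode.

    Hypothetical helper: it only assembles the request and sends nothing.
    """
    if state not in ("ON", "OFF"):
        raise ValueError("state must be 'ON' or 'OFF'")
    url = ambari_url + cluster_name + "/hosts/" + hostname
    data = {
        # Free-text label that Ambari shows in its operations list.
        "RequestInfo": {"context": "Turn {0} Maintenance Mode".format(state)},
        # Assumed payload shape for the host resource.
        "Body": {"Hosts": {"maintenance_state": state}},
    }
    return url, json.dumps(data)


url, body = maintenance_request(
    "http://localhost:8080/api/v1/clusters/", "mycluster", "compute-1"
)
print(url)
```

The result would then be sent the same way as in stop_all_services, with requests.put(url, data=body, headers=..., auth=...).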

The Mcs class

This is what a piece of code containing the Mcs class looks like:

class Mcs:
    def __init__(self, id1, id2, password):
        self.id1 = id1
        self.id2 = id2
        self.password = password
        self.mcs_host = 'https://infra.mail.ru:8774/v2.1'

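    def vm_turn_on(self, hostname):
        # 1) get a token, 2) map the hostname to the VM name, 3) look up the VM id.
        self.token = self.get_mcs_token()
        host = self.hostname_to_vmname(hostname)
        vm_id = self.get_vm_id(host)
        # vm_id is a local variable, not an attribute of the class.
        mcs_url1 = self.mcs_host + '/servers/' + vm_id + '/action'
        headers = {
            'X-Auth-Token': '{0}'.format(self.token),
            'Content-Type': 'application/json'
        }
        # None is serialized as JSON null; the string 'null' would be sent as "null".
        data = {'os-start': None}
        mcs = requests.post(mcs_url1, data=json.dumps(data), headers=headers)
        return mcs.status_code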

The Mcs class takes as input the project id in the cloud and the user id, as well as the password (in get_mcs_token, id1 is used as the user id and id2 as the project id). In the vm_turn_on function we want to power on one of the machines. The logic here is a little more complex. At the start of the code, three other functions are called: 1) we need to get a token, 2) we need to convert the hostname into the machine name in MCS, 3) we get the id of that machine. Next, we simply make a POST request and start the machine.

This is what the function for obtaining a token looks like:

    def get_mcs_token(self):
        # The nocatalog flag is passed through params, so it is not repeated in the URL.
        url = 'https://infra.mail.ru:35357/v3/auth/tokens'
        headers = {'Content-Type': 'application/json'}
        data = {
            'auth': {
                'identity': {
                    'methods': ['password'],
                    'password': {
                        'user': {
                            'id': self.id1,
                            'password': self.password
                        }
                    }
                },
                'scope': {
                    'project': {
                        'id': self.id2
                    }
                }
            }
        }
        params = (('nocatalog', ''),)
        req = requests.post(url, data=json.dumps(data), headers=headers, params=params)
        self.token = req.headers['X-Subject-Token']
        return self.token

The Autoscaler class

This class contains the functions related to the operating logic itself.

This is what a piece of code from this class looks like:

import logging
import time
from collections import deque


class Autoscaler:
    def __init__(self, ambari, mcs, scaling_hosts, yarn_ram_per_node, yarn_cpu_per_node):
        self.scaling_hosts = scaling_hosts
        self.ambari = ambari
        self.mcs = mcs
        self.q_ram = deque()
        self.q_cpu = deque()
        self.num = 0
        self.yarn_ram_per_node = yarn_ram_per_node
        self.yarn_cpu_per_node = yarn_cpu_per_node

    def scale_down(self, hostname):
        flag1 = flag2 = flag3 = flag4 = flag5 = False
        # Default result, so that the final return never hits an unbound variable.
        message = 'Failed'
        if hostname in self.scaling_hosts:
            while True:
                time.sleep(5)
                status1 = self.ambari.decommission_nodemanager(hostname)
                if status1 == 'Request accepted' or status1 == 500:
                    flag1 = True
                    logging.info('Decommission request accepted: {0}'.format(flag1))
                    break
            while True:
                time.sleep(5)
                status3 = self.ambari.check_service(hostname, 'NODEMANAGER')
                if status3 == 'INSTALLED':
                    flag3 = True
                    logging.info('Nodemanager decommissioned: {0}'.format(flag3))
                    break
            while True:
                time.sleep(5)
                status2 = self.ambari.maintenance_on(hostname)
                if status2 == 'Request accepted' or status2 == 500:
                    flag2 = True
                    logging.info('Maintenance request accepted: {0}'.format(flag2))
                    break
            while True:
                time.sleep(5)
                status4 = self.ambari.check_maintenance(hostname, 'NODEMANAGER')
                if status4 == 'ON' or status4 == 'IMPLIED_FROM_HOST':
                    flag4 = True
                    self.ambari.stop_all_services(hostname)
                    logging.info('Maintenance is on: {0}'.format(flag4))
                    logging.info('Stopping services')
                    break
            time.sleep(90)
            status5 = self.mcs.vm_turn_off(hostname)
            while True:
                time.sleep(5)
                status5 = self.mcs.get_vm_info(hostname)['server']['status']
                if status5 == 'SHUTOFF':
                    flag5 = True
                    logging.info('VM is turned off: {0}'.format(flag5))
                    break
            if flag1 and flag2 and flag3 and flag4 and flag5:
                message = 'Success'
                logging.info('Scale-down finished')
                logging.info('Cooldown period has started. Wait for several minutes')
        return message

The class takes as input the Ambari and Mcs classes, the list of nodes that are allowed to be scaled, and the node configuration parameters: the memory and CPU allocated to a node in YARN. There are also two internal parameters, q_ram and q_cpu, which are queues. We use them to store the values of the current cluster load. If we see that the load has been consistently high over the last 5 minutes, we decide that we need to add +1 node to the cluster. The same applies to the state in which the cluster is underutilized.
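The queue-based decision can be sketched as follows. This is a simplified illustration, not the code of our Autoscaler class: the helper names make_window and decide, the one-sample-per-minute assumption and the threshold values are all hypothetical.

```python
from collections import deque


def make_window(size=5):
    """Fixed-length queue: appending a sixth sample drops the oldest one."""
    return deque(maxlen=size)


def decide(q_ram, q_cpu, up_threshold, down_threshold):
    """Return 'up', 'down' or 'hold' from the recent RAM and CPU load samples."""
    windows = (q_ram, q_cpu)
    # Do nothing until both windows hold a full set of samples.
    if any(len(q) < q.maxlen for q in windows):
        return "hold"
    # Scale up when either resource stayed above the threshold the whole time.
    if any(all(v > up_threshold for v in q) for q in windows):
        return "up"
    # Scale down only when both resources stayed below the lower threshold.
    if all(all(v < down_threshold for v in q) for q in windows):
        return "down"
    return "hold"


q_ram, q_cpu = make_window(), make_window()
for ram, cpu in [(0.90, 0.40), (0.95, 0.50), (0.92, 0.45), (0.91, 0.50), (0.97, 0.60)]:
    q_ram.append(ram)  # assumed: one sample per minute
    q_cpu.append(cpu)
print(decide(q_ram, q_cpu, up_threshold=0.8, down_threshold=0.3))
```

Requiring every sample in the window to cross the threshold keeps a single short spike from triggering a scale-up.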

The code above is an example of a function that removes a machine from the cluster and stops it in the cloud. First the YARN NodeManager is decommissioned, then Maintenance mode is turned on, then we stop all services on the machine and turn off the virtual machine in the cloud.
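Each step in scale_down repeats the same pattern: sleep, check, break on success. One possible refinement, shown here only as a sketch and not as part of our code, is a small polling helper with a timeout, so that a step that never succeeds cannot block the autoscaler forever. The name wait_until and its default values are assumptions; unlike the loops above, it checks once before the first sleep.

```python
import time


def wait_until(check, timeout=600, interval=5, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True.

    Returns True on success and False once `timeout` seconds have passed.
    sleep and clock are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example with a stand-in check that succeeds on the third attempt.
attempts = []


def fake_check():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(fake_check, timeout=60, interval=0))
```

With such a helper, a step from scale_down could be written as wait_until(lambda: self.ambari.check_service(hostname, 'NODEMANAGER') == 'INSTALLED').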

2. The observer.py script

Example code from it:

if scaler.assert_up(config.scale_up_thresholds):
    hostname = cloud.get_vm_to_up(config.scaling_hosts)
    if hostname is not None:
        status1 = scaler.scale_up(hostname)
        if status1 == 'Success':
            # Notify the team in Slack through the webhook.
            post = {"text": "{0} has been successfully scaled-up".format(hostname)}
            json_data = json.dumps(post)
            req = requests.post(webhook, data=json_data.encode('ascii'), headers={'Content-Type': 'application/json'})
            # Pause scaling decisions for the cooldown period (given in minutes).
            time.sleep(config.cooldown_period * 60)

In it, we check whether the conditions for increasing the cluster's capacity have been met and whether there are any machines in reserve, get the hostname of one of them, add it to the cluster and post a message about it in our team's Slack. After that the cooldown_period starts, during which we do not add or remove anything from the cluster, but only monitor the load. If it has stabilized and is within the corridor of optimal load values, we simply continue monitoring. If one node was not enough, we add another.

For cases when we have a class coming up, we already know for certain that one node will not be enough, so we immediately start all the free nodes and keep them running until the end of the class. This is done using the list of class timestamps.
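The schedule check can be sketched like this. The function name class_window_active, the 30-minute lead time and the 3-hour class duration are assumptions chosen for illustration; only the idea of a list of class start timestamps comes from our configuration.

```python
from datetime import datetime, timedelta


def class_window_active(now, class_starts,
                        lead=timedelta(minutes=30),
                        duration=timedelta(hours=3)):
    """True if `now` falls between (start - lead) and (start + duration)
    for any class start time in the list."""
    return any(start - lead <= now <= start + duration for start in class_starts)


# Hypothetical schedule with a single class.
starts = [datetime(2024, 1, 15, 19, 0)]
print(class_window_active(datetime(2024, 1, 15, 18, 45), starts))
```

While this returns True, the observer would start every node from the permitted list and skip scale-down decisions; once it returns False, the usual load-based rules apply again.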

Conclusion

An autoscaler is a good and convenient solution for cases where you experience uneven cluster load. You get the cluster configuration you want for peak loads and at the same time do not keep that cluster running during periods of low load, which saves money. On top of that, it all happens automatically, without your involvement. The autoscaler itself is nothing more than a set of requests to the cluster manager API and the cloud provider API, written according to a certain logic. What you really need to remember is the division of nodes into 3 types, as we wrote earlier. And you will be happy.

source: www.habr.com
